railiance-platform/docs/credential-lane-lifecycle-runbook.md
2026-07-02 14:52:08 +02:00

Credential Lane Lifecycle Runbook

Status: active (RAILIANCE-WP-0009-T07 / RAILIANCE-WP-0010-T07)
Date: 2026-07-02

Covers deactivation, rotation, and compromise response for the workload KV lanes established by CCR-2026-0002 (issue-core) and CCR-2026-0003 (llm-connect). The canonical, always-current procedure is generated from the CCR itself — this runbook adds only the lane-specific consumer facts the generator cannot know.

scripts/credential-change.py lifecycle-plan <CCR-ID> --action {deactivate|rotate|compromise}
# then execute the rendered steps and record:
scripts/credential-change.py lifecycle-event <CCR-ID> --action <action> \
  --actor <operator> --reason "<non-secret>" --detail "<non-secret>" --record-state-hub

All three actions share the same invariants: the front door goes non-resolvable first, OpenBao metadata changes use approved operator or delegated-applier authority (never platform-admin handoffs), audit evidence is preserved (never delete the audit device or its entries), and no secret value ever appears in Git, State Hub, chat, prompts, or shell history.
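The plan-then-record flow can be wrapped so that only the three supported actions are accepted. This is a minimal sketch, not part of the repository tooling: the helper names (`run`, `valid_action`, `lifecycle_plan`, `lifecycle_event`) and the `DRY_RUN` switch are hypothetical, and only the subcommands and flags quoted in this runbook are used. By default it prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical wrapper around scripts/credential-change.py.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.
# Note: the dry-run output joins arguments with spaces and drops quoting.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Accept only the three lifecycle actions covered by this runbook.
valid_action() {
  case "$1" in
    deactivate|rotate|compromise) return 0 ;;
    *) echo "unknown action: $1" >&2; return 1 ;;
  esac
}

# Usage: lifecycle_plan CCR_ID ACTION
lifecycle_plan() {
  valid_action "$2" || return 1
  run scripts/credential-change.py lifecycle-plan "$1" --action "$2"
}

# Usage: lifecycle_event CCR_ID ACTION ACTOR REASON DETAIL
# REASON and DETAIL must be non-secret text.
lifecycle_event() {
  valid_action "$2" || return 1
  run scripts/credential-change.py lifecycle-event "$1" --action "$2" \
    --actor "$3" --reason "$4" --detail "$5" --record-state-hub
}
```

For example, `lifecycle_plan CCR-2026-0003 rotate` prints the plan command for the llm-connect lane; an unsupported action such as `delete` is rejected before anything is run.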

Lane: issue-core runtime ingestion (CCR-2026-0002)

| Item | Value |
| --- | --- |
| KV path | platform/workloads/issue-core/issue-core/issue-core-runtime |
| Fields | ISSUE_CORE_API_KEY, GITEA_BACKEND_TOKEN |
| Policy / auth role | workload-kv-read-issue-core-runtime / auth/kubernetes/role/external-secrets-issue-core |
| Primary consumer | ExternalSecret issue-core/issue-core-runtime (CoulombCore cluster, 1h refresh) |
| ops-warden catalog | issue-core-ingestion-api-key |

Consumer facts the generated plan does not cover:

  • Deactivating the policy/role stops the ExternalSecret from refreshing, but the materialized Kubernetes Secret persists with the last value — a real deactivation or compromise response must also delete secret/issue-core-runtime in the issue-core namespace (ESO will not recreate it while the lane is down) and restart the issue-core Deployment.
  • ISSUE_CORE_API_KEY has a second consumer: railiance01's activity-core/actcore-runtime-secret holds an operator-injected copy (2026-07-02, ISSUE-WP-0003-T06). Rotation and compromise response MUST re-inject the new value there (stdin-only pipe from OpenBao) and restart deploy/actcore-worker, or activity-core emission silently starts failing with 401s on the next run.
  • GITEA_BACKEND_TOKEN is a scoped Gitea token for service user issue-core-svc; rotating it means minting a new token in Gitea first, then updating OpenBao — order matters, or ingestion breaks between steps.
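The consumer-side steps from the first two points could look like the sketch below. Several things here are assumptions rather than facts from this runbook: the kubectl context names passed as arguments, the Deployment being addressable as `deploy/issue-core` in the `issue-core` namespace, and the helper names. The re-injection of the new key into `actcore-runtime-secret` is deliberately not shown, because it has to go through a stdin-only pipe from OpenBao and depends on how that Secret is laid out.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each command instead of running it.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Deactivation / compromise: remove the materialized Secret so the last value
# does not linger, then restart the consumer.
# Usage: issue_core_cleanup COULOMBCORE_CONTEXT   (context name is an assumption)
issue_core_cleanup() {
  ctx=$1
  run kubectl --context "$ctx" -n issue-core delete secret issue-core-runtime
  run kubectl --context "$ctx" -n issue-core rollout restart deploy/issue-core
}

# Rotation / compromise: run this only after the new ISSUE_CORE_API_KEY has
# been re-injected into activity-core/actcore-runtime-secret on railiance01.
# Usage: actcore_worker_restart RAILIANCE01_CONTEXT   (context name is an assumption)
actcore_worker_restart() {
  ctx=$1
  run kubectl --context "$ctx" -n activity-core rollout restart deploy/actcore-worker
  run kubectl --context "$ctx" -n activity-core rollout status deploy/actcore-worker
}
```

Keeping the two clusters in separate functions with an explicit context argument makes it harder to run the railiance01 restart against CoulombCore by accident.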

Lane: llm-connect OpenRouter provider key (CCR-2026-0003)

| Item | Value |
| --- | --- |
| KV path | platform/workloads/activity-core/llm-connect/llm-connect-provider-secrets |
| Field | OPENROUTER_API_KEY |
| Policy / auth role | workload-kv-read-llm-connect-provider-secrets / auth/kubernetes/role/external-secrets-activity-core |
| Primary consumer | ExternalSecret activity-core/llm-connect-provider-secrets (CoulombCore cluster, 1h refresh) |
| ops-warden catalog | openrouter-llm-connect |

Consumer facts the generated plan does not cover:

  • llm-connect consumes the Secret via envFrom, so a rotated value reaches the runtime only after a restart. First wait for the ExternalSecret refresh (or force a sync with the force-sync annotation), then run kubectl -n activity-core rollout restart deploy/llm-connect (CoulombCore).
  • The railiance01 llm-connect instance is out of scope of this lane: it uses a bootstrap-provisioned Secret from activity-core/k8s/railiance/bootstrap-secrets.sh. Rotating the OpenRouter key upstream (at OpenRouter) invalidates both copies — a provider-side rotation therefore always requires the railiance01 manual update too, or the daily triage runs start failing with provider auth errors.
  • Compromise response for a provider key has an extra step the plan cannot render: revoke the key at OpenRouter itself (provider console) before or immediately after disabling the front door; OpenBao custody actions alone do not stop a leaked provider key from working.
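For the CoulombCore side of a rotation, the sync-then-restart order from the first point could be scripted roughly as follows. This is a sketch under assumptions: kubectl's current context is taken to point at CoulombCore, the 120s timeout is arbitrary, the function name is hypothetical, and the annotation is the `force-sync` one used by External Secrets Operator. It does not touch railiance01 or the OpenRouter console.

```shell
#!/bin/sh
# Sketch only. DRY_RUN=1 (the default) prints each command instead of running it.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: llm_connect_rollout [STAMP]
# STAMP defaults to the current epoch seconds; any changed value triggers a sync.
llm_connect_rollout() {
  stamp=${1:-$(date +%s)}
  ns=activity-core
  es=llm-connect-provider-secrets

  # Request an immediate re-sync instead of waiting for the 1h refresh.
  run kubectl -n "$ns" annotate externalsecret "$es" "force-sync=$stamp" --overwrite

  # Wait for the Ready condition. This can return at once if the previous sync
  # was already Ready, so compare status.refreshTime when proof of a fresh
  # sync is needed.
  run kubectl -n "$ns" wait --for=condition=Ready "externalsecret/$es" --timeout=120s

  # envFrom values are read at container start, so restart to load the new key.
  run kubectl -n "$ns" rollout restart deploy/llm-connect
  run kubectl -n "$ns" rollout status deploy/llm-connect
}
```

Running `llm_connect_rollout` with the default dry-run setting prints the four commands in order, which is a convenient way to review them before executing with `DRY_RUN=0`.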

Verification after rotate

Return the lane to active only with fresh positive and negative evidence, in the same shape as at activation (2026-07-02 precedent):

  • positive: ExternalSecret SecretSynced=True with a new refresh timestamp, consumer pod healthy after restart;
  • negative: a default-policy token denied on the KV data path, matched in the file audit device by path and timestamp;
  • record via lifecycle-event ... --record-state-hub and notify ops-warden to flip the catalog entry back to active.
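The evidence rule above can be expressed as a small gate that refuses to proceed unless both kinds of evidence are present. This is a hypothetical helper, not part of the repository tooling. It assumes that the SecretSynced=True state corresponds to the ExternalSecret's Ready condition being "True" and that the refresh timestamp is read from `status.refreshTime`; consumer pod health and the audit-device match are still checked by the operator and passed in as inputs.

```shell
#!/bin/sh
# Hypothetical evidence gate for returning a lane to active.
#
# Usage: lane_may_reactivate READY OLD_REFRESH NEW_REFRESH DENIED
#   READY        status of the ExternalSecret Ready condition ("True" expected)
#   OLD_REFRESH  refresh timestamp noted before the rotation
#   NEW_REFRESH  refresh timestamp read after the rotation
#   DENIED       "yes" once the default-policy denial has been matched in the
#                file audit device by path and timestamp
lane_may_reactivate() {
  [ "$1" = "True" ] || { echo "positive: ExternalSecret not synced" >&2; return 1; }
  [ -n "$3" ] && [ "$3" != "$2" ] || { echo "positive: refresh timestamp unchanged" >&2; return 1; }
  [ "$4" = "yes" ] || { echo "negative: denial evidence missing" >&2; return 1; }
  echo "evidence complete"
}

# One way to read the positive values (field names are an assumption about the
# External Secrets Operator status layout):
#   kubectl -n activity-core get externalsecret llm-connect-provider-secrets \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status} {.status.refreshTime}'
```

Only when the gate succeeds would the operator go on to record the lifecycle-event and ask ops-warden to flip the catalog entry.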