RAILIANCE-WP-0009/0010 T07: credential lane lifecycle runbook

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:52:08 +02:00
parent f803bf167b
commit 38c6b11103
3 changed files with 117 additions and 2 deletions


@@ -0,0 +1,89 @@
# Credential Lane Lifecycle Runbook
Status: active (RAILIANCE-WP-0009-T07 / RAILIANCE-WP-0010-T07)
Date: 2026-07-02
Covers deactivation, rotation, and compromise response for the workload KV
lanes established by `CCR-2026-0002` (issue-core) and `CCR-2026-0003`
(llm-connect). The **canonical, always-current procedure** is generated from
the CCR itself — this runbook adds only the lane-specific consumer facts the
generator cannot know.
```bash
scripts/credential-change.py lifecycle-plan <CCR-ID> --action {deactivate|rotate|compromise}
# then execute the rendered steps and record:
scripts/credential-change.py lifecycle-event <CCR-ID> --action <action> \
--actor <operator> --reason "<non-secret>" --detail "<non-secret>" --record-state-hub
```
All three actions share the same invariants: the front door goes
non-resolvable *first*, OpenBao metadata changes use approved operator or
delegated-applier authority (never `platform-admin` handoffs), audit
evidence is preserved (never delete the audit device or its entries), and no
secret value ever appears in Git, State Hub, chat, prompts, or shell history.
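
The plan-then-record flow above can be wrapped so that an unknown action is rejected before anything runs. This is a minimal sketch that assumes only the subcommands and flags shown above. The helper names (`run`, `valid_action`, `lifecycle_plan`, `lifecycle_event`) and the `DRY_RUN` switch are hypothetical and not part of `credential-change.py`.

```bash
# Hypothetical helpers; DRY_RUN=1 (the default) only prints the command.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Accept only the three actions the runbook names.
valid_action() {
  case "$1" in
    deactivate|rotate|compromise) return 0 ;;
    *) echo "unknown action: $1" >&2; return 2 ;;
  esac
}

# Render the canonical plan for a CCR.
lifecycle_plan() {
  ccr_id="$1"; action="$2"
  valid_action "$action" || return 2
  run scripts/credential-change.py lifecycle-plan "$ccr_id" --action "$action"
}

# Record the executed action; reason and detail must be non-secret text.
lifecycle_event() {
  ccr_id="$1"; action="$2"; actor="$3"; reason="$4"; detail="$5"
  valid_action "$action" || return 2
  run scripts/credential-change.py lifecycle-event "$ccr_id" --action "$action" \
    --actor "$actor" --reason "$reason" --detail "$detail" --record-state-hub
}
```

With `DRY_RUN=1`, `lifecycle_plan CCR-2026-0002 rotate` prints the command it would run, which allows a review before switching to `DRY_RUN=0`.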
## Lane: issue-core runtime ingestion (`CCR-2026-0002`)
| Item | Value |
| --- | --- |
| KV path | `platform/workloads/issue-core/issue-core/issue-core-runtime` |
| Fields | `ISSUE_CORE_API_KEY`, `GITEA_BACKEND_TOKEN` |
| Policy / auth role | `workload-kv-read-issue-core-runtime` / `auth/kubernetes/role/external-secrets-issue-core` |
| Primary consumer | ExternalSecret `issue-core/issue-core-runtime` (CoulombCore cluster, 1h refresh) |
| ops-warden catalog | `issue-core-ingestion-api-key` |
**Consumer facts the generated plan does not cover:**
- Deactivating the policy/role stops the ExternalSecret from *refreshing*,
but the materialized Kubernetes Secret **persists** with the last value. A
complete deactivation or compromise response must therefore also delete
`secret/issue-core-runtime` in the `issue-core` namespace (ESO will not
recreate it while the lane is down) and restart the issue-core Deployment.
- **`ISSUE_CORE_API_KEY` has a second consumer**: railiance01's
`activity-core/actcore-runtime-secret` holds an operator-injected copy
(2026-07-02, ISSUE-WP-0003-T06). Rotation and compromise response MUST
re-inject the new value there (stdin-only pipe from OpenBao) and restart
`deploy/actcore-worker`, or activity-core emission silently starts failing
with 401s on the next run.
- `GITEA_BACKEND_TOKEN` is a scoped Gitea token for service user
`issue-core-svc`; rotating it means minting a new token in Gitea first,
then updating OpenBao — order matters, or ingestion breaks between steps.
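
The cleanup described in the first bullet might look like the sketch below. It assumes the current kubectl context points at the CoulombCore cluster and that the Deployment is named `issue-core`; the runbook does not state the Deployment name, so it is overridable through the hypothetical `ISSUE_CORE_DEPLOYMENT` variable. The `run` helper and `DRY_RUN` switch are likewise illustrative.

```bash
# Hypothetical helper; DRY_RUN=1 (the default) only prints the command.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Remove the last materialized value and bounce the consumer.
# Run only after the lane is down, otherwise ESO recreates the Secret.
issue_core_purge_materialized_secret() {
  deploy="${ISSUE_CORE_DEPLOYMENT:-issue-core}"   # assumed name
  run kubectl -n issue-core delete secret issue-core-runtime
  run kubectl -n issue-core rollout restart "deployment/$deploy"
}
```

Depending on how the Deployment references the Secret, the restarted pods may not become ready while the Secret is absent. For a deactivation that is the expected outcome.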
## Lane: llm-connect OpenRouter provider key (`CCR-2026-0003`)
| Item | Value |
| --- | --- |
| KV path | `platform/workloads/activity-core/llm-connect/llm-connect-provider-secrets` |
| Field | `OPENROUTER_API_KEY` |
| Policy / auth role | `workload-kv-read-llm-connect-provider-secrets` / `auth/kubernetes/role/external-secrets-activity-core` |
| Primary consumer | ExternalSecret `activity-core/llm-connect-provider-secrets` (CoulombCore cluster, 1h refresh) |
| ops-warden catalog | `openrouter-llm-connect` |
**Consumer facts the generated plan does not cover:**
- llm-connect consumes the Secret via `envFrom`, so a rotated value reaches
the runtime only after
`kubectl -n activity-core rollout restart deploy/llm-connect` (CoulombCore).
Wait for the ExternalSecret refresh (or trigger one with a `force-sync`
annotation) *before* restarting.
- **The railiance01 llm-connect instance is outside the scope of this lane**:
it uses a bootstrap-provisioned Secret from
`activity-core/k8s/railiance/bootstrap-secrets.sh`. Rotating the OpenRouter
key upstream (at OpenRouter) invalidates *both* copies, so a provider-side
rotation always requires the manual railiance01 update too, or the daily
triage runs start failing with provider auth errors.
- Compromise response for a provider key has an extra step the plan cannot
render: **revoke the key at OpenRouter itself** (provider console) before
or immediately after disabling the front door; OpenBao custody actions
alone do not stop a leaked provider key from working.
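
The refresh-then-restart ordering from the first bullet can be sketched as follows. It assumes the current kubectl context points at the CoulombCore cluster and that the ExternalSecret exposes a `Ready` condition, as External Secrets Operator resources normally do. The function name, the timeouts, and the `DRY_RUN` switch are illustrative choices, not part of the generated plan.

```bash
# Hypothetical helper; DRY_RUN=1 (the default) only prints the command.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Force an ExternalSecret sync first, then restart the envFrom consumer.
llm_connect_refresh_and_restart() {
  ns="activity-core"
  es="llm-connect-provider-secrets"
  # A changed annotation value makes ESO reconcile without waiting for the 1h refresh.
  run kubectl -n "$ns" annotate externalsecret "$es" "force-sync=$(date +%s)" --overwrite
  run kubectl -n "$ns" wait "externalsecret/$es" --for=condition=Ready --timeout=120s
  # envFrom is read only at container start, so the pods must be replaced.
  run kubectl -n "$ns" rollout restart deploy/llm-connect
  run kubectl -n "$ns" rollout status deploy/llm-connect --timeout=180s
}
```

Because the resource may already have been `Ready` before the annotation, the `wait` step alone does not prove a new sync happened; comparing the refresh timestamp before and after is the stronger check.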
## Verification after rotate
Return the lane to `active` only with fresh positive and negative evidence,
in the same shape as at activation (2026-07-02 precedent):
- positive: ExternalSecret `SecretSynced=True` with a new refresh timestamp,
consumer pod healthy after restart;
- negative: a `default`-policy token denied on the KV data path, matched in
the file audit device by path and timestamp;
- record via `lifecycle-event ... --record-state-hub` and notify ops-warden
to flip the catalog entry back to active.
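
The "new refresh timestamp" part of the positive evidence can be checked mechanically. The sketch below is one possible approach; it assumes both timestamps are UTC RFC 3339 strings in the same format, so byte order matches time order. The `refreshed_since` helper is hypothetical, and the `.status.refreshTime` field in the usage comment is an assumption about the External Secrets Operator status.

```bash
# Succeeds when $2 is strictly later than $1.
# Assumes same-format UTC RFC 3339 strings, e.g. 2026-07-02T12:00:00Z.
refreshed_since() {
  before="$1"; after="$2"
  [ "$before" != "$after" ] &&
    [ "$(printf '%s\n%s\n' "$before" "$after" | LC_ALL=C sort | head -n 1)" = "$before" ]
}

# Usage against the cluster (not executed here):
#   before="$(kubectl -n activity-core get externalsecret llm-connect-provider-secrets \
#             -o jsonpath='{.status.refreshTime}')"
#   ... rotate the value and wait for or force a sync ...
#   after="$(kubectl -n activity-core get externalsecret llm-connect-provider-secrets \
#            -o jsonpath='{.status.refreshTime}')"
#   refreshed_since "$before" "$after" && echo "refresh timestamp advanced"
```

The negative evidence (a denied read matched in the file audit device) still has to be collected separately; this helper covers only the timestamp comparison.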


@@ -249,7 +249,7 @@ Acceptance:
```task
id: RAILIANCE-WP-0009-T07
status: wait
status: done
priority: medium
state_hub_task_id: "c85d1139-1f7d-4ed4-a2fc-5ea4ecbdf0c6"
```
@@ -293,3 +293,16 @@ the field-set decision to keep `ISSUE_CORE_API_KEY` and `GITEA_BACKEND_TOKEN`.
`/openbao/audit/openbao-audit.log`.
- T06 progress: front-door handoff sent to ops-warden (State Hub message
`5d47caaa-dd3f-496f-94ba-a488722f8d82`); waiting on catalog confirmation.
## T07 completed 2026-07-02
Lifecycle operations documented in
`docs/credential-lane-lifecycle-runbook.md`: the canonical per-action
procedure is generated by `scripts/credential-change.py lifecycle-plan
<CCR> --action {deactivate|rotate|compromise}`, and the runbook adds the
lane-specific consumer facts (materialized-Secret persistence, second
consumers, restart requirements, provider-side revocation for the OpenRouter
key) plus the post-rotate verification contract. Front-door disable comes
first in every action; audit evidence is never deleted; values stay in
OpenBao/operator custody.


@@ -263,7 +263,7 @@ Acceptance:
```task
id: RAILIANCE-WP-0010-T07
status: wait
status: done
priority: medium
state_hub_task_id: "130155a5-e0f9-49f8-ba27-b48098746f02"
```
@@ -326,3 +326,16 @@ activity-core-owner); T01 closes on that approval with the
llm-connect instance on the railiance01 k3s cluster still consumes its
bootstrap-provisioned Secret; migrating it is railiance01-cluster work, not
part of CCR-2026-0003.
## T07 completed 2026-07-02
Lifecycle operations documented in
`docs/credential-lane-lifecycle-runbook.md`: the canonical per-action
procedure is generated by `scripts/credential-change.py lifecycle-plan
<CCR> --action {deactivate|rotate|compromise}`, and the runbook adds the
lane-specific consumer facts (materialized-Secret persistence, second
consumers, restart requirements, provider-side revocation for the OpenRouter
key) plus the post-rotate verification contract. Front-door disable comes
first in every action; audit evidence is never deleted; values stay in
OpenBao/operator custody.