railiance-cluster/docs/operator-runbook.md
2026-07-02 10:44:06 +02:00


# Operator runbook — production-touching commands
All targets below change state on the production k3s cluster (railiance01 /
COULOMBCORE, 92.205.130.254) or its backups. Agent sessions running in auto
mode are denied these by the permission classifier — that is intentional.
## How to run a production-touching target
- **Interactively in a Claude Code session:** type `! <command>` so the
command runs under the operator's authority and the output lands in the
conversation for the agent to act on.
- **Directly:** run from this repo root on the workstation; cluster access is
`ssh railiance01` (key-based, configured in `~/.ssh/config`).
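
Before running a target directly, it can help to confirm that the key-based login works without prompting. The following is a minimal sketch, assuming the `railiance01` host alias from `~/.ssh/config`; the function name `check_cluster_access` is hypothetical and not part of this repo.

```shell
# Hypothetical helper: succeeds only if key-based SSH to railiance01 works
# without any interactive prompt.
check_cluster_access() {
  # BatchMode=yes disables password/passphrase prompts;
  # ConnectTimeout=5 gives up after five seconds.
  ssh -o BatchMode=yes -o ConnectTimeout=5 railiance01 true
}
```

Usage: `check_cluster_access && echo "cluster reachable"`.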
## Production-touching targets
| Target | Effect |
|---|---|
| `sudo make backup` | writes age-encrypted backup to `/opt/backup/railiance/cluster/` |
| `make k3s-install` | (re)installs k3s baseline; destructive, so run `make preflight` first |
| `make test-ha-failover` | kills the primary PG pod to assert recovery |
| `make verify-activity-core` | reconciles activity-core runtime on railiance01 |
| `make reconcile-activity-core-llm-connect` | patches ConfigMap, applies llm-connect overlay, runs smoke pod |
| `make deploy-activity-core-triage-robustness` | deploys ACTIVITY-WP-0016 code/schema/runtime as a coupled bundle and triggers daily triage |
| `make admin-sync-smoke` | calls activity-core `POST /admin/sync` and proves worker pod identity/restart count did not change |
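
Because `make k3s-install` is destructive and should follow a preflight, one option is to chain the two so the target never runs after a failed preflight. This is a sketch only; `run_guarded` is a hypothetical wrapper, not a target or script in this repo.

```shell
# Hypothetical wrapper: run the read-only preflight, then the requested
# production-touching target only if preflight succeeded.
run_guarded() {
  target="$1"
  if ! make preflight; then
    echo "preflight failed; not running $target" >&2
    return 1
  fi
  make "$target"
}
```

Usage: `run_guarded k3s-install`.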
## Read-only / safe targets
`make help`, `make preflight`, `make smoke`, `make restore` (prints guide
only). These are safe to allowlist for agent sessions.
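
If an allowlist is maintained in a script rather than in the agent's own permission settings, the check could look like the sketch below. The function name `is_safe_target` is hypothetical; the list of targets is the one given above.

```shell
# Hypothetical check: returns 0 for the read-only targets listed in this
# section, 1 for anything else.
is_safe_target() {
  case "$1" in
    help|preflight|smoke|restore) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `is_safe_target "$target" && make "$target"`.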
## Evidence convention
Reconcile/verify targets post non-secret evidence notes to the State Hub.
Set the `STATE_HUB_EVIDENCE_WORKSTREAM_ID` / `STATE_HUB_EVIDENCE_TASK_ID` env
vars to attach the notes to a workstream/task. Never record Secret values;
record key counts and readiness states only.
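
As an illustration of attaching evidence to a workstream and task, the env vars can be set for a single invocation. This is a sketch; the helper name `verify_with_evidence` and the example IDs are placeholders, and the real identifier format is not specified here.

```shell
# Hypothetical helper: run a verify target with the evidence env vars set
# for that one invocation only.
#   $1 = workstream ID (placeholder), $2 = task ID (placeholder)
verify_with_evidence() {
  STATE_HUB_EVIDENCE_WORKSTREAM_ID="$1" \
  STATE_HUB_EVIDENCE_TASK_ID="$2" \
  make verify-activity-core
}
```

Usage: `verify_with_evidence <workstream-id> <task-id>`.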
For `make admin-sync-smoke`, set `ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND`
when you need a specific enabled-flip/rename fixture before the sync call. The
command records whether a fixture ran; leaving it unset proves endpoint and
no-restart behavior only.
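
The two modes can be sketched as follows. The helper name `admin_sync_smoke` and the fixture script path in the usage line are hypothetical; only the env var and the make target come from this runbook.

```shell
# Hypothetical helper: with an argument, run the smoke target with that
# fixture command; without one, run it with the variable unset so only
# endpoint and no-restart behavior are checked.
admin_sync_smoke() {
  if [ -n "${1:-}" ]; then
    ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND="$1" make admin-sync-smoke
  else
    # Subshell so the unset does not leak into the caller's environment.
    ( unset ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND; make admin-sync-smoke )
  fi
}
```

Usage: `admin_sync_smoke ./scripts/example-fixture.sh` for a fixture run, or `admin_sync_smoke` for the endpoint-only check.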