railiance-cluster/docs/operator-runbook.md
2026-07-02 10:44:06 +02:00


Operator runbook — production-touching commands

All targets in the table below change state on the production k3s cluster (railiance01 / COULOMBCORE, 92.205.130.254) or on its backups. Agent sessions running in auto mode are denied these targets by the permission classifier; this is intentional.

How to run a production-touching target

  • Interactively in a Claude Code session: type ! <command> so the command runs under the operator's authority and the output lands in the conversation for the agent to act on.
  • Directly: run the target from the root of this repo on the workstation; cluster access goes through ssh railiance01 (key-based, configured in ~/.ssh/config).
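A minimal sketch of the direct path, assuming a POSIX shell and the railiance01 alias from ~/.ssh/config. The helper run_prod_target and the MAKE_CMD override are hypothetical and not part of this repo's Makefile. The helper refuses to run outside a directory that contains a Makefile and checks key-based SSH non-interactively before it calls make.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): run a production-touching
# make target from the repo root after checking key-based SSH access.
set -eu

CLUSTER_HOST="${CLUSTER_HOST:-railiance01}"  # alias from ~/.ssh/config
MAKE_CMD="${MAKE_CMD:-make}"                 # hypothetical override for dry runs

run_prod_target() {
  local target="${1:?usage: run_prod_target <make-target>}"

  # Targets are run from the repo root, where the Makefile lives.
  if [ ! -f Makefile ]; then
    echo "no Makefile in $PWD; run from the repo root" >&2
    return 1
  fi

  # BatchMode=yes makes ssh fail instead of prompting for a password.
  if ! ssh -o BatchMode=yes -o ConnectTimeout=5 "$CLUSTER_HOST" true; then
    echo "cannot reach $CLUSTER_HOST over key-based SSH" >&2
    return 1
  fi

  "$MAKE_CMD" "$target"
}
```

Usage: run_prod_target verify-activity-core. Because ssh runs in batch mode, a missing or unloaded key shows up as an immediate failure before make starts. sudo make backup still has to be invoked with sudo as listed in the table.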

Production-touching targets

| Target | Effect |
|---|---|
| sudo make backup | writes an age-encrypted backup to /opt/backup/railiance/cluster/ |
| make k3s-install | (re)installs the k3s baseline; destructive, run make preflight first |
| make test-ha-failover | kills the primary PG pod to assert recovery |
| make verify-activity-core | reconciles the activity-core runtime on railiance01 |
| make reconcile-activity-core-llm-connect | patches the ConfigMap, applies the llm-connect overlay, and runs a smoke pod |
| make deploy-activity-core-triage-robustness | deploys ACTIVITY-WP-0016 code, schema, and runtime as one coupled bundle, then triggers daily triage |
| make admin-sync-smoke | calls activity-core POST /admin/sync and proves that the worker pod identity and restart count did not change |

Read-only / safe targets

make help, make preflight, make smoke, and make restore (prints the guide only) do not change cluster state. These are safe to allowlist for agent sessions.
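A minimal sketch of the allowlist decision, assuming a POSIX shell. classify_target is hypothetical and is not how the permission classifier works; it only mirrors the two target lists in this runbook, which have to be kept in sync by hand.

```shell
#!/bin/sh
# Sketch: classify a make target by whether an agent session may run it.
# The lists mirror this runbook's tables and must be kept in sync by hand.
set -eu

# Prints "safe", "production", or "unknown" for a make target name.
classify_target() {
  case "$1" in
    # Read-only targets: safe to allowlist for agent sessions.
    help|preflight|smoke|restore)
      echo safe ;;
    # Production-touching targets (backup is run as sudo make backup).
    backup|k3s-install|test-ha-failover|verify-activity-core|\
    reconcile-activity-core-llm-connect|\
    deploy-activity-core-triage-robustness|admin-sync-smoke)
      echo production ;;
    # Anything not in either list is unknown and should not be allowlisted.
    *)
      echo unknown ;;
  esac
}
```

Defaulting to unknown keeps a newly added target operator-only until someone decides which list it belongs in.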

Evidence convention

Reconcile and verify targets post non-secret evidence notes to the State Hub. To attach those notes to a specific workstream and task, set the STATE_HUB_EVIDENCE_WORKSTREAM_ID and STATE_HUB_EVIDENCE_TASK_ID environment variables. Never record Secret values; record only key counts and readiness states.
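A minimal sketch of passing the evidence IDs for a single run, assuming a POSIX shell. run_with_evidence and MAKE_CMD are hypothetical; only the two environment variable names come from this runbook, and any ID values you pass are your own.

```shell
#!/bin/sh
# Sketch: run a reconcile/verify target with its State Hub evidence
# notes attached to a workstream and task.
set -eu

MAKE_CMD="${MAKE_CMD:-make}"  # hypothetical override for dry runs

# Hypothetical helper: run_with_evidence <target> <workstream-id> <task-id>
run_with_evidence() {
  local target="${1:?target required}"
  local workstream_id="${2:?workstream id required}"
  local task_id="${3:?task id required}"

  # The IDs are set only in the environment of this one make invocation.
  STATE_HUB_EVIDENCE_WORKSTREAM_ID="$workstream_id" \
  STATE_HUB_EVIDENCE_TASK_ID="$task_id" \
    "$MAKE_CMD" "$target"
}
```

Usage: run_with_evidence verify-activity-core <workstream-id> <task-id>. Scoping the variables to one command, rather than exporting them in the shell, avoids accidentally attaching evidence from a later, unrelated target to the same task.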

For make admin-sync-smoke, set ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND when you need a specific enabled-flip or rename fixture to run before the sync call. The target records whether a fixture ran; leaving the variable unset proves only the endpoint and no-restart behavior.
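A minimal sketch of the two modes, assuming a POSIX shell. admin_sync_smoke and MAKE_CMD are hypothetical, and this runbook does not specify what a fixture command looks like, so the argument is whatever command sets up the enabled-flip or rename fixture you need.

```shell
#!/bin/sh
# Sketch: run admin-sync-smoke with or without a fixture step.
set -eu

MAKE_CMD="${MAKE_CMD:-make}"  # hypothetical override for dry runs

# Hypothetical helper: admin_sync_smoke [fixture-command]
admin_sync_smoke() {
  local fixture_cmd="${1:-}"
  if [ -n "$fixture_cmd" ]; then
    # Fixture run: the target is expected to run this before POST /admin/sync.
    ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND="$fixture_cmd" "$MAKE_CMD" admin-sync-smoke
  else
    # No fixture: unset any stale value so the run proves only the
    # endpoint and the no-restart behavior.
    ( unset ACTIVITY_CORE_ADMIN_SYNC_FIXTURE_COMMAND; "$MAKE_CMD" admin-sync-smoke )
  fi
}
```

Unsetting the variable in the no-fixture branch means a value left over in the operator's shell cannot silently turn a plain endpoint check into a fixture run.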