# State Hub Consistency Sweep Runbook

## Purpose

This runbook answers whether the 15-minute State Hub consistency sync ran without relying on the local `custodian-sync.timer` (retired 2026-06-21).

**Steady state** (`STATE-WP-0064` cutover complete):

- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock semantics, reconciliation, and the `consistency_sweep_remote_all` progress event.
- The local systemd timer is **disabled**; cluster is the sole scheduler.

## API Surface

Manual or cluster-triggered invocation:

```bash
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool
```

From Railiance01 through the bridge tunnel, use the `STATE_HUB_URL` configured for activity-core (for example the `actcore-state-hub-bridge` service target).

## Schedule Check

From the activity-core host, confirm the definition is synced and the Temporal schedule exists:

Run on **Railiance01** (the laptop `.env` points at docker-compose hostnames like `app-db` and will time out from WSL):

```bash
export KUBECONFIG=~/.kube/config-hosteurope

# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml

# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s

# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules
```

After changing application code, rebuild and import `activity-core:railiance01-prod` per `activity-core/k8s/railiance/README.md`, then restart `actcore-worker`, `actcore-api`, and `actcore-event-router`.

Ensure `state-hub-railiance01` ops-bridge tunnel is `connected` before cluster-triggered sweeps; the in-cluster bridge proxy allows up to 360s for POST requests.

Expected definition:

- name: `State Hub Consistency Sweep`
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `true`

## Progress Event Check

Query State Hub for the latest sweep event:

```bash
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool
```

Healthy evidence includes:

- `detail.source: activity-core` on scheduled runs
- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when applicable
- `exit_code: 0` when automation completed (assessment failures are OK)
- `automation_error: true` only for infrastructure faults (API down, C-00, etc.)
- `assessment_failures` counts repos with hygiene gaps (C-01..C-23) for follow-up

A `lock_skipped: true` response is normal when a sweep is already in flight. Assessment failures do not fail the scheduler; automation errors do.

## ActivityRun Check

Query the activity-core database for the most recent run of the sweep definition:

```sql
select run_id,
       fired_at,
       scheduled_for,
       context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
```

## Manual Canary

Before enabling or after changing the cluster schedule:

1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.

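When verifying a progress event, the healthy-evidence rules from the Progress Event Check can be applied mechanically. The following is a minimal sketch, not part of the State Hub codebase: `classify_sweep_event` is a hypothetical helper, and it assumes that `lock_skipped`, `exit_code`, `automation_error`, and `assessment_failures` all sit under the event's `detail` object (the runbook only confirms that nesting for `source`).

```python
from typing import Any


def classify_sweep_event(event: dict[str, Any]) -> str:
    """Classify one consistency_sweep_remote_all progress event.

    Assumption: all checked fields live under event["detail"].
    Returns "unknown", "lock_skipped", "automation_error",
    "ok_with_assessment_failures", or "ok".
    """
    detail = event.get("detail") or {}
    if not detail:
        # Nothing to judge; treat as unknown rather than healthy.
        return "unknown"
    if detail.get("lock_skipped"):
        # Normal when another sweep is already in flight.
        return "lock_skipped"
    if detail.get("automation_error") or detail.get("exit_code", 0) != 0:
        # Infrastructure fault: this is what should fail the scheduler.
        return "automation_error"
    if detail.get("assessment_failures"):
        # Hygiene gaps need follow-up but do not fail the run.
        return "ok_with_assessment_failures"
    return "ok"
```

Feeding it each item from the `/progress/` response gives a one-word verdict per event; only `automation_error` should be treated as a scheduler failure.
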
## Observability

Summarise recent sweep events by source:

```bash
cd ~/state-hub
uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
```

After cutover, expect only `activity-core` (and manual) sources, with no new `local-timer` events.

## Local fallback (emergency only)

If cluster scheduling is broken, temporarily re-enable the archived systemd units per [`infra/systemd/archived/README.md`](../infra/systemd/archived/README.md). Disable again once cluster scheduling is restored.
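
One way to judge whether cluster scheduling has stalled is to look for missed `*/15` slots in the `fired_at` values returned by the ActivityRun query. The sketch below is illustrative only: `floor_to_slot` and `missing_slots` are hypothetical helpers, and it assumes every timestamp is timezone-aware. A gap is a prompt to investigate, not proof of a fault, since the `skip` misfire policy can legitimately leave one.

```python
from datetime import datetime, timedelta, timezone

SLOT = timedelta(minutes=15)


def floor_to_slot(ts: datetime) -> datetime:
    """Round a timezone-aware timestamp down to the previous */15 boundary in UTC."""
    ts = ts.astimezone(timezone.utc)
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def missing_slots(
    fired_at: list[datetime], start: datetime, end: datetime
) -> list[datetime]:
    """Return the */15 UTC slots in [start, end) that have no recorded run."""
    seen = {floor_to_slot(ts) for ts in fired_at}
    slot = floor_to_slot(start)
    if slot < start:
        # start fell between boundaries; begin at the next full slot.
        slot += SLOT
    missing = []
    while slot < end:
        if slot not in seen:
            missing.append(slot)
        slot += SLOT
    return missing
```

An empty result over the window of interest suggests the cluster schedule is firing on cadence and the fallback is not needed.
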