state-hub/docs/consistency-sweep-runbook.md
tegwick 821b5d6c89 fix(STATE-WP-0064): parse consistency sweep stdout with skip prefixes
Extract the JSON payload from mixed script output and document Railiance01
kubectl sync steps. Mark T02 done after cluster bridge and resolver canaries.
2026-06-21 20:56:35 +02:00

State Hub Consistency Sweep Runbook

Purpose

This runbook describes how to verify that the 15-minute State Hub consistency sync ran, without relying on the local custodian-sync.timer.

The intended steady state after STATE-WP-0064 cutover is:

  • activity-core on Railiance01 owns the */15 * * * * UTC schedule and ActivityRun audit trail.
  • State Hub on the workstation owns scripts/consistency_check.py, lock semantics, reconciliation, and the consistency_sweep_remote_all progress event.
  • The local systemd timer is disabled after the parallel week passes.

API Surface

Manual or cluster-triggered invocation:

curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool

From Railiance01 through the bridge tunnel, use the STATE_HUB_URL configured for activity-core (for example the actcore-state-hub-bridge service target).
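
For scripted invocation, the curl call above can be sketched in Python with the standard library only. The helper names build_sweep_request and run_sweep, and the 30-second client-side margin, are illustrative assumptions and not part of State Hub; base_url stands for either the local address or the bridge URL held in STATE_HUB_URL.

```python
import json
import urllib.request


def build_sweep_request(base_url: str, max_seconds: int = 300) -> urllib.request.Request:
    """Build the POST request for the remote-all sweep endpoint."""
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/consistency/sweep/remote-all",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sweep(base_url: str, max_seconds: int = 300, margin: int = 30) -> dict:
    """Send the request and decode the JSON response.

    The client timeout is max_seconds plus a margin (an assumption) so the
    client does not give up before the sweep's own time budget is spent.
    """
    request = build_sweep_request(base_url, max_seconds)
    with urllib.request.urlopen(request, timeout=max_seconds + margin) as response:
        return json.load(response)
```

Calling run_sweep("http://127.0.0.1:8000") would mirror the manual curl invocation; the shape of the returned dictionary is whatever the endpoint emits and is not assumed here.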

Schedule Check

Confirm on Railiance01, the activity-core host, that the definition is synced and the Temporal schedule exists. Run the commands there rather than on the laptop, whose .env points at docker-compose hostnames such as app-db and will time out from WSL:

export KUBECONFIG=~/.kube/config-hosteurope

# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml

# 2. Sync definitions into Postgres (delete the old Job so the re-apply recreates it)
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s

# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules

After changing application code, rebuild and import the activity-core:railiance01-prod image per activity-core/k8s/railiance/README.md, then restart actcore-worker, actcore-api, and actcore-event-router.

Ensure the state-hub-railiance01 ops-bridge tunnel is connected before cluster-triggered sweeps; the in-cluster bridge proxy allows up to 360s for POST requests.
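
The 360s proxy limit bounds the max_seconds value that a cluster-triggered sweep can safely request. A minimal sketch of that arithmetic follows; the 30-second allowance for connection setup and response transfer is an assumption, as is the helper name fits_proxy_budget.

```python
# Limit stated in this runbook for POST requests through the bridge proxy.
PROXY_POST_LIMIT_SECONDS = 360


def fits_proxy_budget(max_seconds: int, overhead_seconds: int = 30) -> bool:
    """Return True if the sweep budget plus overhead fits inside the proxy limit.

    overhead_seconds is an assumed allowance, not a measured value.
    """
    return max_seconds + overhead_seconds <= PROXY_POST_LIMIT_SECONDS
```

With the max_seconds of 300 used in the manual invocation, this leaves 60 seconds of headroom before the proxy would cut the request off.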

Expected definition:

  • name: State Hub Consistency Sweep
  • trigger: */15 * * * *
  • timezone: UTC
  • misfire policy: skip
  • enabled: false until manual canary passes, then true after cutover
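
The expected values above can be checked mechanically once a definition has been fetched. The sketch below assumes the definition is available as a flat dictionary with the keys name, trigger, timezone, misfire_policy, and enabled; those key names and the helper definition_mismatches are hypothetical and may not match the activity-core schema.

```python
# Expected values taken from this runbook; the key names are assumptions.
EXPECTED_DEFINITION = {
    "name": "State Hub Consistency Sweep",
    "trigger": "*/15 * * * *",
    "timezone": "UTC",
    "misfire_policy": "skip",
}


def definition_mismatches(definition: dict, expect_enabled: bool) -> list:
    """Return one human-readable message per field that differs from the runbook."""
    expected = dict(EXPECTED_DEFINITION, enabled=expect_enabled)
    return [
        f"{key}: expected {want!r}, got {definition.get(key)!r}"
        for key, want in expected.items()
        if definition.get(key) != want
    ]
```

Pass expect_enabled=False before the manual canary and expect_enabled=True after cutover; an empty list means the definition matches.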

Progress Event Check

Query State Hub for the latest sweep event:

curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool

Healthy evidence includes:

  • lock_skipped: false on normal runs
  • repos_processed entries only for repos that needed action
  • skipped_clean, skipped_missing, and skipped_budget metadata when applicable
  • exit_code: 0 for warn-only remote-all sweeps

A lock_skipped: true response is normal when the local timer and the cluster schedule overlap during the parallel week.
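
The checks above can be folded into a small triage helper. This sketch assumes each progress event has already been reduced to a flat dictionary exposing lock_skipped and exit_code; the real event layout may nest these under a metadata field, and the function name classify_sweep_event and the three labels are illustrative only.

```python
def classify_sweep_event(event: dict) -> str:
    """Classify one consistency_sweep_remote_all event for triage.

    Returns "lock_skipped" when another runner held the lock, "healthy"
    when the sweep ran and exited with code 0, and "investigate" otherwise.
    """
    if event.get("lock_skipped") is True:
        return "lock_skipped"
    if event.get("lock_skipped") is False and event.get("exit_code") == 0:
        return "healthy"
    return "investigate"
```

During the parallel week a mix of "healthy" and "lock_skipped" results is expected; a run of "investigate" results, or only "lock_skipped" results, would warrant a closer look.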

ActivityRun Check

Query the activity-core database for the most recent run of the sweep definition:

select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
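
Once the rows are fetched, two quick sanity checks are whether scheduled_for lands on a 15-minute UTC boundary and whether the newest fired_at is recent enough. The sketch below assumes both columns arrive as timezone-aware datetime values; the tolerance of two missed intervals and the helper names are assumptions, not activity-core behavior.

```python
from datetime import datetime, timedelta, timezone

SWEEP_INTERVAL = timedelta(minutes=15)  # from the */15 * * * * schedule


def is_slot_aligned(scheduled_for: datetime) -> bool:
    """Return True if scheduled_for falls exactly on a */15 minute boundary in UTC."""
    utc = scheduled_for.astimezone(timezone.utc)
    return utc.minute % 15 == 0 and utc.second == 0 and utc.microsecond == 0


def is_stale(latest_fired_at: datetime, now: datetime, max_missed: int = 2) -> bool:
    """Return True if more than max_missed intervals have passed since the last run."""
    return now - latest_fired_at > SWEEP_INTERVAL * max_missed
```

A stale result points at the schedule or the worker rather than at State Hub, since the ActivityRun row is written by activity-core.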

Manual Canary

Before enabling the cluster schedule:

  1. Confirm state-hub-railiance01 tunnel health from ops-bridge.
  2. Trigger one manual ActivityRun or POST the API through the bridge URL.
  3. Verify the progress event and ActivityRun context snapshot.
  4. Confirm idempotence when the local timer also fires (a lock_skipped: true result is expected).

Cutover

After one parallel week (STATE-WP-0064-T03):

systemctl --user disable --now custodian-sync.timer

Then enable the activity-core definition and treat the cluster schedule as the sole runner.