Files
state-hub/docs/consistency-sweep-runbook.md
tegwick ab14e77e77 feat(STATE-WP-0064): start parallel week with source-tagged sweep runners
Tag consistency_sweep_remote_all progress events by source, route the local
timer through the API, add a parallel-week comparison script, and document
the 2026-06-21 to 2026-06-28 observation window for T03.
2026-06-21 21:46:43 +02:00

4.3 KiB

State Hub Consistency Sweep Runbook

Purpose

This runbook answers whether the 15-minute State Hub consistency sync ran without relying on the local custodian-sync.timer.

The intended steady state after STATE-WP-0064 cutover is:

  • activity-core on Railiance01 owns the */15 * * * * UTC schedule and ActivityRun audit trail.
  • State Hub on the workstation owns scripts/consistency_check.py, lock semantics, reconciliation, and the consistency_sweep_remote_all progress event.
  • The local systemd timer is disabled after the parallel week passes.

API Surface

Manual or cluster-triggered invocation:

curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool

From Railiance01 through the bridge tunnel, use the STATE_HUB_URL configured for activity-core (for example the actcore-state-hub-bridge service target).

Schedule Check

From the activity-core host, confirm the definition is synced and the Temporal schedule exists:

Run on Railiance01 (the laptop .env points at docker-compose hostnames like app-db and will time out from WSL):

export KUBECONFIG=~/.kube/config-hosteurope

# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml

# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s

# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules

After changing application code, rebuild and import activity-core:railiance01-prod per activity-core/k8s/railiance/README.md, then restart actcore-worker, actcore-api, and actcore-event-router.

Ensure state-hub-railiance01 ops-bridge tunnel is connected before cluster-triggered sweeps; the in-cluster bridge proxy allows up to 360s for POST requests.

Expected definition:

  • name: State Hub Consistency Sweep
  • trigger: */15 * * * *
  • timezone: UTC
  • misfire policy: skip
  • enabled: true during parallel week (T03); local timer retired after T04

Progress Event Check

Query State Hub for the latest sweep event:

curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool

Healthy evidence includes:

  • lock_skipped: false on normal runs
  • repos_processed entries only for repos that needed action
  • skipped_clean, skipped_missing, and skipped_budget metadata when applicable
  • exit_code: 0 for warn-only remote-all sweeps

A lock_skipped: true response is normal when the local timer and the cluster schedule overlap during the parallel week.

ActivityRun Check

Query the activity-core database for the most recent run of the sweep definition:

select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;

Manual Canary

Before enabling the cluster schedule:

  1. Confirm state-hub-railiance01 tunnel health from ops-bridge.
  2. Trigger one manual ActivityRun or POST the API through the bridge URL.
  3. Verify the progress event and ActivityRun context snapshot.
  4. Confirm idempotence when the local timer also fires (lock skip is OK).

Parallel week observability (T03)

Both runners call the same API and tag progress events with detail.source:

Source Runner
local-timer custodian-sync.timer on the workstation
activity-core Railiance01 Temporal schedule

Summarise evidence:

cd ~/state-hub
uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24

Expect some lock_skipped: true events when both schedules overlap — that is healthy idempotence, not duplicate work.

Parallel window: 2026-06-21 → 2026-06-28 (review before T04 cutover).

Cutover

After one parallel week (STATE-WP-0064-T03):

systemctl --user disable --now custodian-sync.timer

The cluster definition stays enabled; disable only the local timer.