Extract the JSON payload from mixed script output and document Railiance01 kubectl sync steps. Mark T02 done after cluster bridge and resolver canaries.
3.7 KiB
State Hub Consistency Sweep Runbook
Purpose
This runbook answers whether the 15-minute State Hub consistency sync ran
without relying on the local custodian-sync.timer.
The intended steady state after STATE-WP-0064 cutover is:
- activity-core on Railiance01 owns the
*/15 * * * *UTC schedule and ActivityRun audit trail. - State Hub on the workstation owns
scripts/consistency_check.py, lock semantics, reconciliation, and theconsistency_sweep_remote_allprogress event. - The local systemd timer is disabled after the parallel week passes.
API Surface
Manual or cluster-triggered invocation:
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
-H "Content-Type: application/json" \
-d '{"max_seconds": 300}' | python3 -m json.tool
From Railiance01 through the bridge tunnel, use the STATE_HUB_URL
configured for activity-core (for example the actcore-state-hub-bridge
service target).
Schedule Check
From the activity-core host, confirm the definition is synced and the Temporal schedule exists:
Run on Railiance01 (the laptop .env points at docker-compose hostnames
like app-db and will time out from WSL):
export KUBECONFIG=~/.kube/config-hosteurope
# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s
# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules
After changing application code, rebuild and import activity-core:railiance01-prod
per activity-core/k8s/railiance/README.md, then restart
actcore-worker, actcore-api, and actcore-event-router.
Ensure state-hub-railiance01 ops-bridge tunnel is connected before
cluster-triggered sweeps; the in-cluster bridge proxy allows up to 360s for
POST requests.
Expected definition:
- name:
State Hub Consistency Sweep - trigger:
*/15 * * * * - timezone:
UTC - misfire policy:
skip - enabled:
falseuntil manual canary passes, thentrueafter cutover
Progress Event Check
Query State Hub for the latest sweep event:
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
| python3 -m json.tool
Healthy evidence includes:
lock_skipped: falseon normal runsrepos_processedentries only for repos that needed actionskipped_clean,skipped_missing, andskipped_budgetmetadata when applicableexit_code: 0for warn-only remote-all sweeps
A lock_skipped: true response is normal when the local timer and the
cluster schedule overlap during the parallel week.
ActivityRun Check
Query the activity-core database for the most recent run of the sweep definition:
select
run_id,
fired_at,
scheduled_for,
context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
Manual Canary
Before enabling the cluster schedule:
- Confirm
state-hub-railiance01tunnel health from ops-bridge. - Trigger one manual ActivityRun or POST the API through the bridge URL.
- Verify the progress event and ActivityRun context snapshot.
- Confirm idempotence when the local timer also fires (lock skip is OK).
Cutover
After one parallel week (STATE-WP-0064-T03):
systemctl --user disable --now custodian-sync.timer
Then enable the activity-core definition and treat the cluster schedule as the sole primary runner.