Replace nonexistent make sync-schedules with admin/sync and the activity_core.sync_schedules CLI documented in activity-core runbook.
3.2 KiB
State Hub Consistency Sweep Runbook
Purpose
This runbook answers whether the 15-minute State Hub consistency sync ran
without relying on the local custodian-sync.timer.
The intended steady state after STATE-WP-0064 cutover is:
- activity-core on Railiance01 owns the
*/15 * * * *UTC schedule and ActivityRun audit trail. - State Hub on the workstation owns
scripts/consistency_check.py, lock semantics, reconciliation, and theconsistency_sweep_remote_allprogress event. - The local systemd timer is disabled after the parallel week passes.
API Surface
Manual or cluster-triggered invocation:
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
-H "Content-Type: application/json" \
-d '{"max_seconds": 300}' | python3 -m json.tool
From Railiance01 through the bridge tunnel, use the STATE_HUB_URL
configured for activity-core (for example the actcore-state-hub-bridge
service target).
Schedule Check
From the activity-core host, confirm the definition is synced and the Temporal schedule exists:
cd ~/activity-core
ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian make sync-activity-definitions
Reconcile Temporal schedules (pick one):
# Preferred when activity-core API is up (no worker restart)
curl -s -X POST 'http://localhost:8010/admin/sync?definitions=true&schedules=true'
# CLI fallback
ACTCORE_DB_URL=... TEMPORAL_HOST=... uv run python -m activity_core.sync_schedules
On Railiance01, use the in-cluster activity-core API URL and env from the
deployment instead of localhost:8010.
Expected definition:
- name:
State Hub Consistency Sweep - trigger:
*/15 * * * * - timezone:
UTC - misfire policy:
skip - enabled:
falseuntil manual canary passes, thentrueafter cutover
Progress Event Check
Query State Hub for the latest sweep event:
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
| python3 -m json.tool
Healthy evidence includes:
lock_skipped: falseon normal runsrepos_processedentries only for repos that needed actionskipped_clean,skipped_missing, andskipped_budgetmetadata when applicableexit_code: 0for warn-only remote-all sweeps
A lock_skipped: true response is normal when the local timer and the
cluster schedule overlap during the parallel week.
ActivityRun Check
Query the activity-core database for the most recent run of the sweep definition:
select
run_id,
fired_at,
scheduled_for,
context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
Manual Canary
Before enabling the cluster schedule:
- Confirm
state-hub-railiance01tunnel health from ops-bridge. - Trigger one manual ActivityRun or POST the API through the bridge URL.
- Verify the progress event and ActivityRun context snapshot.
- Confirm idempotence when the local timer also fires (lock skip is OK).
Cutover
After one parallel week (STATE-WP-0064-T03):
systemctl --user disable --now custodian-sync.timer
Then enable the activity-core definition and treat the cluster schedule as the sole primary runner.