Files
state-hub/docs/consistency-sweep-runbook.md
tegwick acc5bea15b docs: fix schedule sync commands in consistency sweep runbook
Replace nonexistent make sync-schedules with admin/sync and the
activity_core.sync_schedules CLI documented in activity-core runbook.
2026-06-21 20:28:04 +02:00

3.2 KiB

State Hub Consistency Sweep Runbook

Purpose

This runbook answers whether the 15-minute State Hub consistency sync ran without relying on the local custodian-sync.timer.

The intended steady state after STATE-WP-0064 cutover is:

  • activity-core on Railiance01 owns the */15 * * * * UTC schedule and ActivityRun audit trail.
  • State Hub on the workstation owns scripts/consistency_check.py, lock semantics, reconciliation, and the consistency_sweep_remote_all progress event.
  • The local systemd timer is disabled after the parallel week passes.

API Surface

Manual or cluster-triggered invocation:

curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool

From Railiance01 through the bridge tunnel, use the STATE_HUB_URL configured for activity-core (for example the actcore-state-hub-bridge service target).

Schedule Check

From the activity-core host, confirm the definition is synced and the Temporal schedule exists:

cd ~/activity-core
ACTIVITY_DEFINITION_DIRS=/home/worsch/the-custodian make sync-activity-definitions

Reconcile Temporal schedules (pick one):

# Preferred when activity-core API is up (no worker restart)
curl -s -X POST 'http://localhost:8010/admin/sync?definitions=true&schedules=true'

# CLI fallback
ACTCORE_DB_URL=... TEMPORAL_HOST=... uv run python -m activity_core.sync_schedules

On Railiance01, use the in-cluster activity-core API URL and env from the deployment instead of localhost:8010.

Expected definition:

  • name: State Hub Consistency Sweep
  • trigger: */15 * * * *
  • timezone: UTC
  • misfire policy: skip
  • enabled: false until manual canary passes, then true after cutover

Progress Event Check

Query State Hub for the latest sweep event:

curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool

Healthy evidence includes:

  • lock_skipped: false on normal runs
  • repos_processed entries only for repos that needed action
  • skipped_clean, skipped_missing, and skipped_budget metadata when applicable
  • exit_code: 0 for warn-only remote-all sweeps

A lock_skipped: true response is normal when the local timer and the cluster schedule overlap during the parallel week.

ActivityRun Check

Query the activity-core database for the most recent run of the sweep definition:

select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;

Manual Canary

Before enabling the cluster schedule:

  1. Confirm state-hub-railiance01 tunnel health from ops-bridge.
  2. Trigger one manual ActivityRun or POST the API through the bridge URL.
  3. Verify the progress event and ActivityRun context snapshot.
  4. Confirm idempotence when the local timer also fires (lock skip is OK).

Cutover

After one parallel week (STATE-WP-0064-T03):

systemctl --user disable --now custodian-sync.timer

Then enable the activity-core definition and treat the cluster schedule as the sole primary runner.