STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook
Consistency engine (automation error vs assessment failure):
- C-00 is an automation error; C-01..C-23 assessment failures are recorded for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and assessment_failures separately from exit_code
State Hub Consistency Sweep Runbook
Purpose
This runbook explains how to verify that the 15-minute State Hub consistency sync ran,
without relying on the local custodian-sync.timer (retired 2026-06-21).
Steady state (STATE-WP-0064 cutover complete):
- activity-core on Railiance01 owns the */15 * * * * UTC schedule and ActivityRun audit trail.
- State Hub on the workstation owns scripts/consistency_check.py, lock semantics, reconciliation, and the consistency_sweep_remote_all progress event.
- The local systemd timer is disabled; cluster is the sole scheduler.
API Surface
Manual or cluster-triggered invocation:
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
-H "Content-Type: application/json" \
-d '{"max_seconds": 300}' | python3 -m json.tool
From Railiance01 through the bridge tunnel, use the STATE_HUB_URL
configured for activity-core (for example the actcore-state-hub-bridge
service target).
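For scripted checks, the same POST can be issued from Python. The following is a minimal sketch using only the standard library, not code from the State Hub repository. The function names are hypothetical, and the 360-second client timeout is an assumption chosen to match the bridge proxy limit mentioned in this runbook.

```python
import json
import urllib.request


def build_sweep_request(base_url: str, max_seconds: int = 300) -> urllib.request.Request:
    """Build the POST request for the remote-all sweep endpoint."""
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/consistency/sweep/remote-all",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url: str, max_seconds: int = 300, timeout: float = 360.0) -> dict:
    """Send the sweep request and return the decoded JSON response.

    The timeout default is an assumption, not a documented client setting.
    """
    request = build_sweep_request(base_url, max_seconds)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Locally, `trigger_sweep("http://127.0.0.1:8000")` mirrors the curl call above; from Railiance01, pass the STATE_HUB_URL value configured for activity-core instead.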
Schedule Check
From the activity-core host, confirm the definition is synced and the Temporal schedule exists:
Run on Railiance01 (the laptop .env points at docker-compose hostnames
like app-db and will time out from WSL):
export KUBECONFIG=~/.kube/config-hosteurope
# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s
# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules
After changing application code, rebuild and import activity-core:railiance01-prod
per activity-core/k8s/railiance/README.md, then restart
actcore-worker, actcore-api, and actcore-event-router.
Ensure the state-hub-railiance01 ops-bridge tunnel is connected before
cluster-triggered sweeps; the in-cluster bridge proxy allows up to 360s for
POST requests.
Expected definition:
- name: State Hub Consistency Sweep
- trigger: */15 * * * *
- timezone: UTC
- misfire policy: skip
- enabled: true
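A `*/15 * * * *` trigger in UTC fires at minutes 0, 15, 30, and 45 of every hour. When checking whether a run is missing, it can help to list the fire times expected in a window. The sketch below is illustrative only; the helper names are hypothetical and it does not query Temporal or activity-core.

```python
from datetime import datetime, timedelta, timezone

SLOT = timedelta(minutes=15)


def latest_slot(now: datetime) -> datetime:
    """Return the most recent quarter-hour boundary at or before `now`, in UTC.

    Expects a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    return now.replace(minute=now.minute - now.minute % 15, second=0, microsecond=0)


def expected_slots(start: datetime, end: datetime) -> list[datetime]:
    """List every quarter-hour fire time t with start <= t <= end."""
    slot = latest_slot(start)
    if slot < start.astimezone(timezone.utc):
        slot += SLOT  # first boundary strictly inside the window
    slots = []
    while slot <= end:
        slots.append(slot)
        slot += SLOT
    return slots
```

Because the misfire policy is skip, a slot that was missed is not replayed later, so a slot with no matching run stays visible as a gap.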
Progress Event Check
Query State Hub for the latest sweep event:
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
| python3 -m json.tool
Healthy evidence includes:
- detail.source: activity-core on scheduled runs
- lock_skipped: false on normal runs
- repos_processed entries only for repos that needed action
- skipped_clean, skipped_missing, and skipped_budget metadata when applicable
- exit_code: 0 when automation completed (assessment failures are OK)
- automation_error: true only for infrastructure faults (API down, C-00, etc.)
- assessment_failures counts repos with hygiene gaps (C-01..C-23) for follow-up
A lock_skipped: true response is normal when a sweep is already in flight.
Assessment failures do not fail the scheduler; automation errors do.
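The triage rules above can be expressed as a small classifier. This is a sketch under assumptions: it supposes that lock_skipped, exit_code, automation_error, and assessment_failures all sit inside the event's detail object and that assessment_failures is an integer count. The real payload layout may differ, and the function name is hypothetical.

```python
def classify_sweep_event(event: dict) -> str:
    """Classify one consistency_sweep_remote_all progress event.

    Returns one of: "lock_skipped", "automation_error",
    "assessment_failures", "clean".
    """
    detail = event.get("detail", {})  # assumed location of the fields

    # A sweep already in flight is normal, not a failure.
    if detail.get("lock_skipped"):
        return "lock_skipped"

    # Infrastructure faults (API down, C-00) are what should alert.
    if detail.get("automation_error") or detail.get("exit_code", 0) != 0:
        return "automation_error"

    # Hygiene gaps (C-01..C-23) are recorded for follow-up only.
    if detail.get("assessment_failures", 0) > 0:
        return "assessment_failures"

    return "clean"
```

Only the "automation_error" result corresponds to a run that should fail the scheduler; the other three are informational.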
ActivityRun Check
Query the activity-core database for the most recent run of the sweep definition:
select
run_id,
fired_at,
scheduled_for,
context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
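Once the fired_at values are exported from this query, gaps in the 15-minute cadence can be spotted mechanically. The sketch below assumes the timestamps have already been loaded as Python datetime objects; the helper name and the 20-minute tolerance (one interval plus five minutes of slack) are assumptions, not values taken from activity-core.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(fired_at: list[datetime], max_gap: timedelta = timedelta(minutes=20)):
    """Return (earlier, later) pairs of consecutive runs that are too far apart.

    Input order does not matter; the query above returns newest first.
    """
    ordered = sorted(fired_at)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps
```

An empty result means every consecutive pair of runs in the sample was within tolerance; raise the query's limit to inspect a longer window.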
Manual Canary
Before enabling or after changing the cluster schedule:
- Confirm state-hub-railiance01 tunnel health from ops-bridge.
- Trigger one manual ActivityRun or POST the API through the bridge URL.
- Verify the progress event and ActivityRun context snapshot.
Observability
Summarise recent sweep events by source:
cd ~/state-hub
uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
After cutover, expect only activity-core (and manual) sources — no new
local-timer events.
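The same expectation can be checked against raw progress events. This is a rough illustration of the idea, not the implementation of compare_consistency_sweep_parallel.py; it assumes each event carries its origin in detail.source and that the allowed source strings are exactly "activity-core" and "manual".

```python
from collections import Counter


def count_by_source(events: list[dict]) -> Counter:
    """Count sweep events per detail.source (missing values become "unknown")."""
    return Counter(e.get("detail", {}).get("source", "unknown") for e in events)


def unexpected_sources(counts: Counter, allowed=("activity-core", "manual")) -> list[str]:
    """Return sources that should not appear after cutover, sorted by name."""
    return sorted(source for source in counts if source not in allowed)
```

A non-empty list from `unexpected_sources` after cutover would suggest that something other than the cluster schedule or a manual call is still triggering sweeps.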
Local fallback (emergency only)
If cluster scheduling is broken, temporarily re-enable the archived systemd
units per infra/systemd/archived/README.md.
Disable again once cluster scheduling is restored.