Files
state-hub/docs/consistency-sweep-runbook.md
tegwick 39ed5459b9 finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures
STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
2026-06-22 01:20:59 +02:00

4.2 KiB

State Hub Consistency Sweep Runbook

Purpose

This runbook answers whether the 15-minute State Hub consistency sync ran without relying on the local custodian-sync.timer (retired 2026-06-21).

Steady state (STATE-WP-0064 cutover complete):

  • activity-core on Railiance01 owns the */15 * * * * UTC schedule and ActivityRun audit trail.
  • State Hub on the workstation owns scripts/consistency_check.py, lock semantics, reconciliation, and the consistency_sweep_remote_all progress event.
  • The local systemd timer is disabled; cluster is the sole scheduler.

API Surface

Manual or cluster-triggered invocation:

curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool

From Railiance01 through the bridge tunnel, use the STATE_HUB_URL configured for activity-core (for example the actcore-state-hub-bridge service target).

Schedule Check

From the activity-core host, confirm the definition is synced and the Temporal schedule exists:

Run on Railiance01 (the laptop .env points at docker-compose hostnames like app-db and will time out from WSL):

export KUBECONFIG=~/.kube/config-hosteurope

# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml

# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s

# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules

After changing application code, rebuild and import activity-core:railiance01-prod per activity-core/k8s/railiance/README.md, then restart actcore-worker, actcore-api, and actcore-event-router.

Ensure state-hub-railiance01 ops-bridge tunnel is connected before cluster-triggered sweeps; the in-cluster bridge proxy allows up to 360s for POST requests.

Expected definition:

  • name: State Hub Consistency Sweep
  • trigger: */15 * * * *
  • timezone: UTC
  • misfire policy: skip
  • enabled: true

Progress Event Check

Query State Hub for the latest sweep event:

curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool

Healthy evidence includes:

  • detail.source: activity-core on scheduled runs
  • lock_skipped: false on normal runs
  • repos_processed entries only for repos that needed action
  • skipped_clean, skipped_missing, and skipped_budget metadata when applicable
  • exit_code: 0 when automation completed (assessment failures are OK)
  • automation_error: true only for infrastructure faults (API down, C-00, etc.)
  • assessment_failures counts repos with hygiene gaps (C-01..C-23) for follow-up

A lock_skipped: true response is normal when a sweep is already in flight. Assessment failures do not fail the scheduler; automation errors do.

ActivityRun Check

Query the activity-core database for the most recent run of the sweep definition:

select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;

Manual Canary

Before enabling or after changing the cluster schedule:

  1. Confirm state-hub-railiance01 tunnel health from ops-bridge.
  2. Trigger one manual ActivityRun or POST the API through the bridge URL.
  3. Verify the progress event and ActivityRun context snapshot.

Observability

Summarise recent sweep events by source:

cd ~/state-hub
uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24

After cutover, expect only activity-core (and manual) sources — no new local-timer events.

Local fallback (emergency only)

If cluster scheduling is broken, temporarily re-enable the archived systemd units per infra/systemd/archived/README.md. Disable again once cluster scheduling is restored.