Files
state-hub/docs/consistency-sweep-runbook.md
tegwick ab14e77e77 feat(STATE-WP-0064): start parallel week with source-tagged sweep runners
Tag consistency_sweep_remote_all progress events by source, route the local
timer through the API, add a parallel-week comparison script, and document
the 2026-06-21 to 2026-06-28 observation window for T03.
2026-06-21 21:46:43 +02:00

145 lines
4.3 KiB
Markdown

# State Hub Consistency Sweep Runbook
## Purpose
This runbook answers whether the 15-minute State Hub consistency sync ran
without relying on the local `custodian-sync.timer`.
The intended steady state after `STATE-WP-0064` cutover is:
- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock
semantics, reconciliation, and the `consistency_sweep_remote_all`
progress event.
- The local systemd timer is disabled after the parallel week passes.
## API Surface
Manual or cluster-triggered invocation:
```bash
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
-H "Content-Type: application/json" \
-d '{"max_seconds": 300}' | python3 -m json.tool
```
From Railiance01 through the bridge tunnel, use the `STATE_HUB_URL`
configured for activity-core (for example the `actcore-state-hub-bridge`
service target).
## Schedule Check
From the activity-core host, confirm the definition is synced and the
Temporal schedule exists:
Run on **Railiance01** (the laptop `.env` points at docker-compose hostnames
like `app-db` and will time out from WSL):
```bash
export KUBECONFIG=~/.kube/config-hosteurope
# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s
# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules
```
After changing application code, rebuild and import `activity-core:railiance01-prod`
per `activity-core/k8s/railiance/README.md`, then restart
`actcore-worker`, `actcore-api`, and `actcore-event-router`.
Ensure `state-hub-railiance01` ops-bridge tunnel is `connected` before
cluster-triggered sweeps; the in-cluster bridge proxy allows up to 360s for
POST requests.
Expected definition:
- name: `State Hub Consistency Sweep`
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `true` during parallel week (T03); local timer retired after T04
## Progress Event Check
Query State Hub for the latest sweep event:
```bash
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
| python3 -m json.tool
```
Healthy evidence includes:
- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
applicable
- `exit_code: 0` for warn-only remote-all sweeps
A `lock_skipped: true` response is normal when the local timer and the
cluster schedule overlap during the parallel week.
## ActivityRun Check
Query the activity-core database for the most recent run of the sweep
definition:
```sql
select
run_id,
fired_at,
scheduled_for,
context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
```
## Manual Canary
Before enabling the cluster schedule:
1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.
4. Confirm idempotence when the local timer also fires (lock skip is OK).
## Parallel week observability (T03)
Both runners call the same API and tag progress events with `detail.source`:
| Source | Runner |
|--------|--------|
| `local-timer` | `custodian-sync.timer` on the workstation |
| `activity-core` | Railiance01 Temporal schedule |
Summarise evidence:
```bash
cd ~/state-hub
uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
```
Expect some `lock_skipped: true` events when both schedules overlap — that is
healthy idempotence, not duplicate work.
Parallel window: **2026-06-21 → 2026-06-28** (review before T04 cutover).
## Cutover
After one parallel week (`STATE-WP-0064-T03`):
```bash
systemctl --user disable --now custodian-sync.timer
```
The cluster definition stays enabled; disable only the local timer.