# State Hub Consistency Sweep Runbook

## Purpose

This runbook describes how to verify that the 15-minute State Hub consistency
sweep ran, now that it no longer relies on the local `custodian-sync.timer`
(retired 2026-06-21).

**Steady state** (`STATE-WP-0064` cutover complete):

- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and the
  ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock
  semantics, reconciliation, and the `consistency_sweep_remote_all`
  progress event.
- The local systemd timer is **disabled**; the cluster is the sole scheduler.

## API Surface

Manual or cluster-triggered invocation:

```bash
curl -s -X POST http://127.0.0.1:8000/consistency/sweep/remote-all \
  -H "Content-Type: application/json" \
  -d '{"max_seconds": 300}' | python3 -m json.tool
```

From Railiance01 through the bridge tunnel, use the `STATE_HUB_URL`
configured for activity-core (for example the `actcore-state-hub-bridge`
service target).

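
The same call can be scripted when `curl` is not convenient. The following is a minimal sketch using only the Python standard library. It assumes `STATE_HUB_URL` is exported in the environment (falling back to the local address used above); the function names and the client timeout are illustrative, not part of State Hub.

```python
import json
import os
import urllib.request

# Assumption: STATE_HUB_URL is set in the environment; otherwise fall back to
# the local address used in the curl example.
BASE_URL = os.environ.get("STATE_HUB_URL", "http://127.0.0.1:8000")


def build_sweep_request(base_url: str, max_seconds: int = 300) -> urllib.request.Request:
    """Build the POST request for the remote-all sweep without sending it."""
    body = json.dumps({"max_seconds": max_seconds}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/consistency/sweep/remote-all",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url: str = BASE_URL, max_seconds: int = 300) -> dict:
    """Send the sweep request and return the decoded JSON response."""
    request = build_sweep_request(base_url, max_seconds)
    # Hypothetical client timeout: the sweep budget plus a one-minute margin.
    with urllib.request.urlopen(request, timeout=max_seconds + 60) as response:
        return json.load(response)
```

Calling `trigger_sweep()` and printing the result with `json.dumps(result, indent=2)` gives output comparable to the `python3 -m json.tool` pipeline.
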
## Schedule Check

From the activity-core host, confirm the definition is synced and the
Temporal schedule exists.

Run on **Railiance01** (the laptop `.env` points at docker-compose hostnames
like `app-db` and will time out from WSL):

```bash
export KUBECONFIG=~/.kube/config-hosteurope

# 1. Apply runtime manifest when definitions change
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml

# 2. Sync definitions into Postgres
kubectl -n activity-core delete job actcore-sync --ignore-not-found
kubectl apply -f ~/activity-core/k8s/railiance/20-runtime.yaml
kubectl -n activity-core wait --for=condition=complete job/actcore-sync --timeout=180s

# 3. Reconcile Temporal schedules
kubectl -n activity-core exec deploy/actcore-worker -- python -m activity_core.sync_schedules
```

After changing application code, rebuild and import `activity-core:railiance01-prod`
per `activity-core/k8s/railiance/README.md`, then restart
`actcore-worker`, `actcore-api`, and `actcore-event-router`.

Ensure the `state-hub-railiance01` ops-bridge tunnel is `connected` before
running cluster-triggered sweeps; the in-cluster bridge proxy allows up to
360s for POST requests.

Expected definition:

- name: `State Hub Consistency Sweep`
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `true`

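
The expected definition above can be checked mechanically once the stored definition is available as a dictionary. The sketch below is illustrative only: the expected values come from this runbook, but the key names (`name`, `trigger`, `timezone`, `misfire_policy`, `enabled`) are assumptions about how activity-core serialises a definition and may need adjusting.

```python
# Expected values are taken from the runbook; the key names are hypothetical.
EXPECTED_DEFINITION = {
    "name": "State Hub Consistency Sweep",
    "trigger": "*/15 * * * *",
    "timezone": "UTC",
    "misfire_policy": "skip",
    "enabled": True,
}


def definition_mismatches(definition: dict) -> list[str]:
    """Return one message per field that differs from the expected value."""
    mismatches = []
    for key, expected in EXPECTED_DEFINITION.items():
        actual = definition.get(key)
        if actual != expected:
            mismatches.append(f"{key}: expected {expected!r}, got {actual!r}")
    return mismatches
```

An empty list means the definition matches; any entries name the fields that drifted.
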
## Progress Event Check

Query State Hub for the latest sweep event:

```bash
curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all&limit=5" \
  | python3 -m json.tool
```

Healthy evidence includes:

- `detail.source: activity-core` on scheduled runs
- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
  applicable
- `exit_code: 0` when automation completed (assessment failures are OK)
- `automation_error: true` only for infrastructure faults (API down, C-00, etc.)
- `assessment_failures` counts repos with hygiene gaps (C-01..C-23) for follow-up

A `lock_skipped: true` response is normal when a sweep is already in flight.
Assessment failures do not fail the scheduler; automation errors do.

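
The rules above can be applied to a single event with a small classifier. This is a sketch under assumptions: the exact JSON shape of a progress event is not shown in this runbook, so the helper looks for each field under `detail` first and then at the top level. The returned labels are illustrative names, not values State Hub emits.

```python
def classify_sweep_event(event: dict) -> str:
    """Classify one consistency_sweep_remote_all event per the runbook rules."""
    detail = event.get("detail") or {}

    def field(name, default=None):
        # Assumption: a field may live under "detail" or at the top level.
        if name in detail:
            return detail[name]
        return event.get(name, default)

    if field("lock_skipped", False):
        # Normal when another sweep is already in flight.
        return "lock_skipped"
    if field("automation_error", False) or field("exit_code", 0) != 0:
        # Infrastructure fault (API down, C-00, ...); this fails the scheduler.
        return "automation_error"
    if field("assessment_failures", 0):
        # Hygiene gaps (C-01..C-23) recorded for follow-up; not a failure.
        return "assessment_follow_up"
    return "healthy"
```

Only `automation_error` should be treated as an incident; `assessment_follow_up` indicates repos to revisit, and `lock_skipped` needs no action.
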
## ActivityRun Check

Query the activity-core database for the most recent run of the sweep
definition:

```sql
select
  run_id,
  fired_at,
  scheduled_for,
  context_snapshot->'consistency_sweep_remote_all' as sweep_result
from activity_runs
where definition_id = '7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b'
order by fired_at desc
limit 5;
```

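
Once the latest `fired_at` value is known, a freshness check answers whether the schedule is still firing. The sketch below assumes timezone-aware UTC timestamps; the tolerance of one missed interval is an illustrative threshold, not a documented policy.

```python
from datetime import datetime, timedelta, timezone

# The schedule fires every 15 minutes (*/15 * * * * UTC).
SWEEP_INTERVAL = timedelta(minutes=15)


def elapsed_intervals(latest_fired_at: datetime, now: datetime) -> int:
    """Return how many full 15-minute intervals have passed since the last run."""
    elapsed = now - latest_fired_at
    # timedelta // timedelta yields an int; clamp in case of clock skew.
    return max(0, elapsed // SWEEP_INTERVAL)


def is_stale(latest_fired_at: datetime, now: datetime, tolerated_misses: int = 1) -> bool:
    """Flag the schedule as stale when more intervals elapsed than tolerated."""
    return elapsed_intervals(latest_fired_at, now) > tolerated_misses
```

For example, `is_stale(latest, datetime.now(timezone.utc))` stays `False` while the last run is under 30 minutes old, which leaves room for one run that is late or still in flight.
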
## Manual Canary

Before enabling the cluster schedule, or after changing it:

1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.

## Observability

Summarise recent sweep events by source:

```bash
cd ~/state-hub
uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
```

After cutover, expect only `activity-core` (and manual) sources; there should
be no new `local-timer` events.

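
The post-cutover expectation can also be checked directly against events fetched from the progress endpoint. This sketch is not the implementation of `compare_consistency_sweep_parallel.py`; it assumes each event carries its source under `detail.source`, and that manual runs are labelled `manual`, which is a guess at the label.

```python
from collections import Counter


def count_sources(events: list[dict]) -> Counter:
    """Count sweep events by detail.source (missing sources become 'unknown')."""
    counts = Counter()
    for event in events:
        detail = event.get("detail") or {}
        counts[detail.get("source", "unknown")] += 1
    return counts


def unexpected_sources(counts: Counter, allowed=("activity-core", "manual")) -> list[str]:
    """Return sources that should no longer appear after cutover, sorted."""
    return sorted(source for source in counts if source not in allowed)
```

A non-empty result from `unexpected_sources`, for instance `["local-timer"]`, suggests the retired timer is still producing events.
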
## Local fallback (emergency only)

If cluster scheduling is broken, temporarily re-enable the archived systemd
units per [`infra/systemd/archived/README.md`](../infra/systemd/archived/README.md).
Disable them again once cluster scheduling is restored.