finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures

STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
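
The split described above can be sketched in Python. This is a minimal illustration, not the code in `scripts/consistency_check.py`: the `CheckResult` type, the `summarise_sweep` helper, and the non-zero exit value of `1` are assumptions made for the example. Only the C-00 versus C-01..C-23 distinction and the three field names come from this commit message.

```python
from dataclasses import dataclass

# Per the commit message, C-00 is the automation-error check.
AUTOMATION_CHECK = "C-00"


@dataclass
class CheckResult:
    """Hypothetical record of one check against one repo."""
    repo: str
    check_id: str  # e.g. "C-00" or "C-07"
    passed: bool


def summarise_sweep(results):
    """Split failed checks into automation errors and assessment failures."""
    failed = [r for r in results if not r.passed]
    automation_error = any(r.check_id == AUTOMATION_CHECK for r in failed)
    # Count repos (not individual checks) with at least one C-01..C-23 failure.
    repos_with_gaps = {r.repo for r in failed if r.check_id != AUTOMATION_CHECK}
    return {
        "automation_error": automation_error,
        "assessment_failures": len(repos_with_gaps),
        # Assumption: 1 stands in for whatever non-zero code the script uses.
        "exit_code": 1 if automation_error else 0,
    }
```

Under this sketch a sweep with only hygiene gaps still exits 0, so the scheduler sees success while the gaps stay visible in `assessment_failures`.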
2026-06-22 01:20:59 +02:00
parent 270033a50d
commit 39ed5459b9
14 changed files with 221 additions and 180 deletions

@@ -3,16 +3,16 @@
 ## Purpose
 This runbook answers whether the 15-minute State Hub consistency sync ran
-without relying on the local `custodian-sync.timer`.
+without relying on the local `custodian-sync.timer` (retired 2026-06-21).
-The intended steady state after `STATE-WP-0064` cutover is:
+**Steady state** (`STATE-WP-0064` cutover complete):
 - activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
 ActivityRun audit trail.
 - State Hub on the workstation owns `scripts/consistency_check.py`, lock
 semantics, reconciliation, and the `consistency_sweep_remote_all`
 progress event.
-- The local systemd timer is disabled after the parallel week passes.
+- The local systemd timer is **disabled**; cluster is the sole scheduler.
 ## API Surface
@@ -65,7 +65,7 @@ Expected definition:
 - trigger: `*/15 * * * *`
 - timezone: `UTC`
 - misfire policy: `skip`
-- enabled: `true` during parallel week (T03); local timer retired after T04
+- enabled: `true`
 ## Progress Event Check
@@ -78,14 +78,17 @@ curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all
 Healthy evidence includes:
 - `detail.source: activity-core` on scheduled runs
 - `lock_skipped: false` on normal runs
 - `repos_processed` entries only for repos that needed action
 - `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
 applicable
-- `exit_code: 0` for warn-only remote-all sweeps
+- `exit_code: 0` when automation completed (assessment failures are OK)
+- `automation_error: true` only for infrastructure faults (API down, C-00, etc.)
+- `assessment_failures` counts repos with hygiene gaps (C-01..C-23) for follow-up
-A `lock_skipped: true` response is normal when the local timer and the
-cluster schedule overlap during the parallel week.
+A `lock_skipped: true` response is normal when a sweep is already in flight.
+Assessment failures do not fail the scheduler; automation errors do.
 ## ActivityRun Check
@@ -106,40 +109,26 @@ limit 5;
 ## Manual Canary
-Before enabling the cluster schedule:
+Before enabling or after changing the cluster schedule:
 1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
 2. Trigger one manual ActivityRun or POST the API through the bridge URL.
 3. Verify the progress event and ActivityRun context snapshot.
-4. Confirm idempotence when the local timer also fires (lock skip is OK).
-## Parallel week observability (T03)
+## Observability
-Both runners call the same API and tag progress events with `detail.source`:
-| Source | Runner |
-|--------|--------|
-| `local-timer` | `custodian-sync.timer` on the workstation |
-| `activity-core` | Railiance01 Temporal schedule |
-Summarise evidence:
+Summarise recent sweep events by source:
 ```bash
 cd ~/state-hub
 uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
 ```
-Expect some `lock_skipped: true` events when both schedules overlap — that is
-healthy idempotence, not duplicate work.
+After cutover, expect only `activity-core` (and manual) sources — no new
+`local-timer` events.
-Parallel window: **2026-06-21 → 2026-06-28** (review before T04 cutover).
+## Local fallback (emergency only)
-## Cutover
-After one parallel week (`STATE-WP-0064-T03`):
-```bash
-systemctl --user disable --now custodian-sync.timer
-```
-The cluster definition stays enabled; disable only the local timer.
+If cluster scheduling is broken, temporarily re-enable the archived systemd
+units per [`infra/systemd/archived/README.md`](../infra/systemd/archived/README.md).
+Disable again once cluster scheduling is restored.
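
The post-cutover health criteria in the updated runbook (only `activity-core` and manual sources, lock skips are normal, assessment failures are tolerated, automation errors are not) can be sketched as a small review helper. This is an illustration under stated assumptions, not the logic of `scripts/compare_consistency_sweep_parallel.py`: the `review_events` function is hypothetical, and it assumes each progress event is a dict whose `detail` object carries `source`, `lock_skipped`, `exit_code`, `automation_error`, and `assessment_failures`. The runbook confirms only `detail.source` as a nested path.

```python
from collections import Counter


def review_events(events):
    """Tally events by detail.source and collect findings that need attention."""
    by_source = Counter()
    findings = []
    for event in events:
        detail = event.get("detail", {})
        source = detail.get("source", "unknown")
        by_source[source] += 1
        # After cutover the runbook expects no new local-timer events.
        if source == "local-timer":
            findings.append("unexpected local-timer event after cutover")
        # A lock skip means a sweep was already in flight; treated as normal.
        if detail.get("lock_skipped"):
            continue
        # Assessment failures are ignored here; only automation errors are flagged.
        if detail.get("automation_error") or detail.get("exit_code", 0) != 0:
            findings.append(f"automation error reported by {source}")
    return dict(by_source), findings
```

An empty findings list alongside a non-zero `activity-core` count would match the steady state the runbook describes. Repos counted in `assessment_failures` still need follow-up, but they do not appear as findings.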