finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures

STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine (automation error vs assessment failure):
- C-00 is an automation error; C-01..C-23 assessment failures are recorded for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and assessment_failures separately from exit_code
2026-06-22 01:20:59 +02:00
parent 270033a50d
commit 39ed5459b9
14 changed files with 221 additions and 180 deletions
--- a/docs/consistency-sweep-runbook.md
+++ b/docs/consistency-sweep-runbook.md
@@ -3,16 +3,16 @@
 ## Purpose

 This runbook answers whether the 15-minute State Hub consistency sync ran
-without relying on the local `custodian-sync.timer`.
+without relying on the local `custodian-sync.timer` (retired 2026-06-21).

-The intended steady state after `STATE-WP-0064` cutover is:
+**Steady state** (`STATE-WP-0064` cutover complete):

 - activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
  ActivityRun audit trail.
 - State Hub on the workstation owns `scripts/consistency_check.py`, lock
  semantics, reconciliation, and the `consistency_sweep_remote_all`
  progress event.
- The local systemd timer is disabled after the parallel week passes.
+- The local systemd timer is **disabled**; the cluster is the sole scheduler.

 ## API Surface

@@ -65,7 +65,7 @@ Expected definition:
 - trigger: `*/15 * * * *`
 - timezone: `UTC`
 - misfire policy: `skip`
- enabled: `true` during parallel week (T03); local timer retired after T04
+- enabled: `true`

 ## Progress Event Check

@@ -78,14 +78,17 @@ curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all

 Healthy evidence includes:

+- `detail.source: activity-core` on scheduled runs
 - `lock_skipped: false` on normal runs
 - `repos_processed` entries only for repos that needed action
 - `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
  applicable
- `exit_code: 0` for warn-only remote-all sweeps
+- `exit_code: 0` when automation completed (assessment failures are OK)
+- `automation_error: true` only for infrastructure faults (API down, C-00, etc.)
+- `assessment_failures` counts repos with hygiene gaps (C-01..C-23) for follow-up

-A `lock_skipped: true` response is normal when the local timer and the
-cluster schedule overlap during the parallel week.
+A `lock_skipped: true` response is normal when a sweep is already in flight.
+Assessment failures do not fail the scheduler; automation errors do.

 ## ActivityRun Check

@@ -106,40 +109,26 @@ limit 5;

 ## Manual Canary

-Before enabling the cluster schedule:
+Before enabling or after changing the cluster schedule:

 1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
 2. Trigger one manual ActivityRun or POST the API through the bridge URL.
 3. Verify the progress event and ActivityRun context snapshot.
-4. Confirm idempotence when the local timer also fires (lock skip is OK).

-## Parallel week observability (T03)
+## Observability

-Both runners call the same API and tag progress events with `detail.source`:
-
-| Source | Runner |
-|--------|--------|
-| `local-timer` | `custodian-sync.timer` on the workstation |
-| `activity-core` | Railiance01 Temporal schedule |
-
-Summarise evidence:
+Summarise recent sweep events by source:

 ```bash
 cd ~/state-hub
 uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
 ```

-Expect some `lock_skipped: true` events when both schedules overlap — that is
-healthy idempotence, not duplicate work.
+After cutover, expect only `activity-core` (and manual) sources — no new
+`local-timer` events.

-Parallel window: **2026-06-21 → 2026-06-28** (review before T04 cutover).
+## Local fallback (emergency only)

-## Cutover
-
-After one parallel week (`STATE-WP-0064-T03`):
-
-```bash
-systemctl --user disable --now custodian-sync.timer
-```
-
-The cluster definition stays enabled; disable only the local timer.
+If cluster scheduling is broken, temporarily re-enable the archived systemd
+units per [`infra/systemd/archived/README.md`](../infra/systemd/archived/README.md).
+Disable again once cluster scheduling is restored.
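
The split between automation errors and assessment failures described in this commit can be illustrated with a minimal sketch. It is not the code in `scripts/consistency_check.py`: the `SweepSummary` type, the `summarise` function, the input shape (failed check IDs per repo), and the non-zero exit value of `1` are all assumptions made for illustration. Only the rule itself comes from the commit message and runbook: C-00 marks an automation error, C-01..C-23 are counted per repo for follow-up, and only an automation error fails the scheduled sweep.

```python
from dataclasses import dataclass

# Per the commit message, C-00 is the automation error; C-01..C-23 are
# assessment failures that are recorded but do not fail the sweep.
AUTOMATION_CHECK = "C-00"


@dataclass
class SweepSummary:
    """Hypothetical summary mirroring the fields named in the runbook."""
    automation_error: bool
    assessment_failures: int
    exit_code: int


def summarise(failed_checks_by_repo: dict[str, list[str]]) -> SweepSummary:
    """Derive the three fields from the failed check IDs of each repo.

    The input shape is an assumption for this sketch, not the real
    data model of the consistency engine.
    """
    all_checks = list(failed_checks_by_repo.values())
    automation_error = any(AUTOMATION_CHECK in checks for checks in all_checks)
    # Count repos that have at least one hygiene gap (any check other than C-00).
    assessment_failures = sum(
        1 for checks in all_checks
        if any(check != AUTOMATION_CHECK for check in checks)
    )
    # Assumed exit value: 1 on automation error, 0 otherwise.
    exit_code = 1 if automation_error else 0
    return SweepSummary(automation_error, assessment_failures, exit_code)


if __name__ == "__main__":
    clean_run = summarise({"repo-a": ["C-03"], "repo-b": [], "repo-c": ["C-07", "C-12"]})
    print(clean_run)
    broken_run = summarise({"repo-a": ["C-00", "C-03"]})
    print(broken_run)
```

Keeping `exit_code` separate from the two diagnostic fields lets the scheduler treat hygiene gaps as follow-up work while still surfacing infrastructure faults as a failed run.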