generated from coulomb/repo-seed
finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures
STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine (automation error vs assessment failure):
- C-00 is an automation error; C-01..C-23 assessment failures are recorded for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and assessment_failures separately from exit_code
@@ -3,16 +3,16 @@
 ## Purpose

 This runbook answers whether the 15-minute State Hub consistency sync ran
-without relying on the local `custodian-sync.timer`.
+without relying on the local `custodian-sync.timer` (retired 2026-06-21).

-The intended steady state after `STATE-WP-0064` cutover is:
+**Steady state** (`STATE-WP-0064` cutover complete):

 - activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
   ActivityRun audit trail.
 - State Hub on the workstation owns `scripts/consistency_check.py`, lock
   semantics, reconciliation, and the `consistency_sweep_remote_all`
   progress event.
-- The local systemd timer is disabled after the parallel week passes.
+- The local systemd timer is **disabled**; cluster is the sole scheduler.

 ## API Surface

@@ -65,7 +65,7 @@ Expected definition:
 - trigger: `*/15 * * * *`
 - timezone: `UTC`
 - misfire policy: `skip`
-- enabled: `true` during parallel week (T03); local timer retired after T04
+- enabled: `true`

 ## Progress Event Check

@@ -78,14 +78,17 @@ curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all

 Healthy evidence includes:

+- `detail.source: activity-core` on scheduled runs
 - `lock_skipped: false` on normal runs
 - `repos_processed` entries only for repos that needed action
 - `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
   applicable
-- `exit_code: 0` for warn-only remote-all sweeps
+- `exit_code: 0` when automation completed (assessment failures are OK)
+- `automation_error: true` only for infrastructure faults (API down, C-00, etc.)
+- `assessment_failures` counts repos with hygiene gaps (C-01..C-23) for follow-up

-A `lock_skipped: true` response is normal when the local timer and the
-cluster schedule overlap during the parallel week.
+A `lock_skipped: true` response is normal when a sweep is already in flight.
+Assessment failures do not fail the scheduler; automation errors do.

 ## ActivityRun Check

@@ -106,40 +109,26 @@ limit 5;

 ## Manual Canary

-Before enabling the cluster schedule:
+Before enabling or after changing the cluster schedule:

 1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
 2. Trigger one manual ActivityRun or POST the API through the bridge URL.
 3. Verify the progress event and ActivityRun context snapshot.
-4. Confirm idempotence when the local timer also fires (lock skip is OK).

-## Parallel week observability (T03)
+## Observability

-Both runners call the same API and tag progress events with `detail.source`:
-
-| Source | Runner |
-|--------|--------|
-| `local-timer` | `custodian-sync.timer` on the workstation |
-| `activity-core` | Railiance01 Temporal schedule |
-
-Summarise evidence:
+Summarise recent sweep events by source:

 ```bash
 cd ~/state-hub
 uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
 ```

-Expect some `lock_skipped: true` events when both schedules overlap — that is
-healthy idempotence, not duplicate work.
+After cutover, expect only `activity-core` (and manual) sources — no new
+`local-timer` events.

-Parallel window: **2026-06-21 → 2026-06-28** (review before T04 cutover).
+## Local fallback (emergency only)

-## Cutover
-
-After one parallel week (`STATE-WP-0064-T03`):
-
-```bash
-systemctl --user disable --now custodian-sync.timer
-```
-
-The cluster definition stays enabled; disable only the local timer.
+If cluster scheduling is broken, temporarily re-enable the archived systemd
+units per [`infra/systemd/archived/README.md`](../infra/systemd/archived/README.md).
+Disable again once cluster scheduling is restored.
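
The split this commit introduces between automation errors and assessment failures can be read off a single progress event. The following is a minimal Python sketch of that reading, not code from the repository. It assumes the event is a flat mapping carrying the `lock_skipped`, `automation_error`, `assessment_failures`, and `exit_code` fields named in the runbook; the function name `classify_sweep_event` and the returned labels are hypothetical.

```python
from typing import Any, Mapping


def classify_sweep_event(event: Mapping[str, Any]) -> str:
    """Label one sweep progress event (field layout is an assumption)."""
    # A skipped run did no work: another sweep already held the lock.
    if event.get("lock_skipped"):
        return "lock_skipped"
    # Infrastructure faults (API down, C-00) are what should fail the scheduler.
    if event.get("automation_error") or event.get("exit_code", 0) != 0:
        return "automation_error"
    # C-01..C-23 findings are kept for follow-up while exit_code stays 0.
    if (event.get("assessment_failures") or 0) > 0:
        return "assessment_failures"
    return "clean"


if __name__ == "__main__":
    sample = {"exit_code": 0, "automation_error": False, "assessment_failures": 3}
    print(classify_sweep_event(sample))
```

Checking `lock_skipped` first keeps an in-flight sweep from being mistaken for a fault; only the `automation_error` label corresponds to a run that should be treated as failed by the scheduler.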