finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures

STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
2026-06-22 01:20:59 +02:00
parent 270033a50d
commit 39ed5459b9
14 changed files with 221 additions and 180 deletions

@@ -84,7 +84,9 @@ unset.
the rule lives in activity-core.
See [`docs/cron-migration.md`](cron-migration.md) for the
ActivityDefinition drafts and cutover plan.
ActivityDefinition drafts and cutover plan. The consistency sweep schedule
is live on Railiance01 — operator runbook:
[`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).
## What must never happen

@@ -3,16 +3,16 @@
## Purpose
This runbook answers whether the 15-minute State Hub consistency sync ran
without relying on the local `custodian-sync.timer`.
without relying on the local `custodian-sync.timer` (retired 2026-06-21).
The intended steady state after `STATE-WP-0064` cutover is:
**Steady state** (`STATE-WP-0064` cutover complete):
- activity-core on Railiance01 owns the `*/15 * * * *` UTC schedule and
ActivityRun audit trail.
- State Hub on the workstation owns `scripts/consistency_check.py`, lock
semantics, reconciliation, and the `consistency_sweep_remote_all`
progress event.
- The local systemd timer is disabled after the parallel week passes.
- The local systemd timer is **disabled**; cluster is the sole scheduler.
## API Surface
@@ -65,7 +65,7 @@ Expected definition:
- trigger: `*/15 * * * *`
- timezone: `UTC`
- misfire policy: `skip`
- enabled: `true` during parallel week (T03); local timer retired after T04
- enabled: `true`
## Progress Event Check
@@ -78,14 +78,17 @@ curl -s "http://127.0.0.1:8000/progress/?event_type=consistency_sweep_remote_all
Healthy evidence includes:
- `detail.source: activity-core` on scheduled runs
- `lock_skipped: false` on normal runs
- `repos_processed` entries only for repos that needed action
- `skipped_clean`, `skipped_missing`, and `skipped_budget` metadata when
applicable
- `exit_code: 0` for warn-only remote-all sweeps
- `exit_code: 0` when automation completed (assessment failures are OK)
- `automation_error: true` only for infrastructure faults (API down, C-00, etc.)
- `assessment_failures` counts repos with hygiene gaps (C-01..C-23) for follow-up
A `lock_skipped: true` response is normal when the local timer and the
cluster schedule overlap during the parallel week.
A `lock_skipped: true` response is normal when a sweep is already in flight.
Assessment failures do not fail the scheduler; automation errors do.
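
The split between automation errors and assessment failures can be illustrated with a small triage helper. This is a minimal sketch, not the code in `scripts/consistency_check.py`: the function name `classify_sweep_event` and the returned labels are hypothetical, and it assumes the fields listed above arrive as keys of one `detail` mapping, with `assessment_failures` as an integer count.

```python
from typing import Any, Mapping


def classify_sweep_event(detail: Mapping[str, Any]) -> str:
    """Return a coarse health label for one sweep progress event.

    Hypothetical helper: the field names come from the runbook, but the
    flat layout of `detail` and the labels are assumptions.
    """
    # A lock skip means another sweep was already in flight; nothing ran.
    if detail.get("lock_skipped"):
        return "skipped-lock"
    # Infrastructure faults (API down, C-00) are what fail the scheduler.
    if detail.get("automation_error") or detail.get("exit_code", 0) != 0:
        return "automation-error"
    # Hygiene gaps (C-01..C-23) are recorded for follow-up only.
    if detail.get("assessment_failures", 0) > 0:
        return "ok-with-follow-up"
    return "ok"
```

Under these assumptions, an event with `exit_code: 0` and a non-zero `assessment_failures` count is still a successful scheduled run; only the `automation-error` label should page an operator.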
## ActivityRun Check
@@ -106,40 +109,26 @@ limit 5;
## Manual Canary
Before enabling the cluster schedule:
Before enabling or after changing the cluster schedule:
1. Confirm `state-hub-railiance01` tunnel health from ops-bridge.
2. Trigger one manual ActivityRun or POST the API through the bridge URL.
3. Verify the progress event and ActivityRun context snapshot.
4. Confirm idempotence when the local timer also fires (lock skip is OK).
## Parallel week observability (T03)
## Observability
Both runners call the same API and tag progress events with `detail.source`:
| Source | Runner |
|--------|--------|
| `local-timer` | `custodian-sync.timer` on the workstation |
| `activity-core` | Railiance01 Temporal schedule |
Summarise evidence:
Summarise recent sweep events by source:
```bash
cd ~/state-hub
uv run python scripts/compare_consistency_sweep_parallel.py --since-hours 24
```
Expect some `lock_skipped: true` events when both schedules overlap — that is
healthy idempotence, not duplicate work.
After cutover, expect only `activity-core` (and manual) sources — no new
`local-timer` events.
Parallel window: **2026-06-21 → 2026-06-28** (review before T04 cutover).
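
The per-source check described above can be approximated in a few lines. This sketch is not the implementation of `scripts/compare_consistency_sweep_parallel.py`; the helper names and the event shape (a list of dicts, each with an optional `detail` mapping carrying `source`) are assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def count_by_source(events: Iterable[Mapping[str, Any]]) -> Counter:
    """Count progress events per `detail.source` (assumed event shape)."""
    counts: Counter = Counter()
    for event in events:
        # Tolerate events with a missing or null `detail` block.
        detail = event.get("detail") or {}
        counts[detail.get("source", "unknown")] += 1
    return counts


def has_local_timer_events(events: Iterable[Mapping[str, Any]]) -> bool:
    """True if any event is tagged `local-timer`, which should not
    happen for events recorded after the cutover."""
    return count_by_source(events)["local-timer"] > 0
```

Feeding it the events from the last 24 hours should, after cutover, yield only `activity-core` and manual sources.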
## Local fallback (emergency only)
## Cutover
After one parallel week (`STATE-WP-0064-T03`):
```bash
systemctl --user disable --now custodian-sync.timer
```
The cluster definition stays enabled; disable only the local timer.
If cluster scheduling is broken, temporarily re-enable the archived systemd
units per [`infra/systemd/archived/README.md`](../infra/systemd/archived/README.md).
Disable again once cluster scheduling is restored.

@@ -1,9 +1,8 @@
# State Hub Cron → activity-core ActivityDefinition Migration
> CUST-WP-0040 T04. **Partially implemented** as of `STATE-WP-0064`.
> The consistency sweep API surface and ActivityDefinition are landed;
> cluster cutover still requires manual canary, parallel week, and local
> timer retirement.
> CUST-WP-0040 T04. **Consistency sweep cut over** as of `STATE-WP-0064`
> (2026-06-21). Scheduling is on activity-core (Railiance01); the local
> `custodian-sync.timer` is retired. Stale-task cleanup (B) is still pending.
The state hub currently runs two recurring maintenance jobs and one
per-repo event hook. Once activity-core is ready, each becomes an
@@ -16,7 +15,7 @@ keeps the underlying scripts; only the *scheduling* moves.
| # | Source | Trigger today | Script invoked | What it does |
| - | ------------------- | -------------------------------------------------------- | -------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
| 1 | systemd user timer | every 15 min | `scripts/consistency_check.py --remote --all` | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
| 1 | activity-core cron | every 15 min (Railiance01) | `POST /consistency/sweep/remote-all` → `consistency_check.py --remote --all` | Pull every registered repo, reconcile workplan files ↔ DB, run C-15 writeback + C-16 pull gate |
| 2 | manual / daily cron | `make cleanup-stale` (suggested `0 3 * * *`) | `scripts/cleanup_stale_tasks.py` | Cancel tasks still open in finished/archived workstreams; emits `org.statehub.task.stale` |
| 3 | git post-commit | every commit in a registered repo | `make fix-consistency REPO=<slug>` | Per-repo workplan ↔ DB sync immediately after a commit |
@@ -40,7 +39,7 @@ run them on a schedule.
### A. `state-hub-consistency-sweep` (implemented)
Landed in `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
with `enabled: false` until canary and cutover.
with `enabled: true` on Railiance01 since 2026-06-21 cutover.
Invocation path (matches the hourly RecentlyOnScope pattern):
@@ -56,11 +55,10 @@ checkout from the cluster.
Operator runbook: [`docs/consistency-sweep-runbook.md`](consistency-sweep-runbook.md).
Notes:
- Replaces the `custodian-sync.service` + `custodian-sync.timer` pair
after parallel week and cutover.
- Replaced the `custodian-sync.service` + `custodian-sync.timer` pair
(local timer disabled 2026-06-21; units archived under `infra/systemd/archived/`).
- Lock semantics (`/tmp/custodian-consistency-remote-all.lock`) stay in
the script — activity-core just sets the cadence.
- Local timer retirement is tracked in `STATE-WP-0064-T04`.
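
One way a script can keep lock semantics of this kind is a non-blocking advisory lock on the lock file. The sketch below assumes `fcntl.flock` on a POSIX host; the source does not say which locking primitive `consistency_check.py` uses, and `try_lock` is a hypothetical name.

```python
import fcntl
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def try_lock(path: str) -> Iterator[bool]:
    """Yield True if the exclusive lock was acquired, False if another
    holder already has it. Never blocks. (Assumed technique: flock.)"""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        # Closing the descriptor also releases any lock held through it.
        os.close(fd)
```

A caller would wrap the sweep in `with try_lock("/tmp/custodian-consistency-remote-all.lock") as got_lock:` and report `lock_skipped: true` when `got_lock` is false, so an overlapping trigger becomes a no-op rather than duplicate work.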
### B. `state-hub-stale-task-cleanup`
@@ -130,8 +128,8 @@ Still optional for B and future splits:
| activity-core shell instruction kind with on_failure semantics | activity-core | activity-core/`src/...` |
| state-hub adapter exposing `state-hub.health` as a context source | activity-core | activity-core/adapters/ |
Until B lands and A is cut over, the state hub continues to schedule the
consistency sweep via the local systemd timer.
A is cut over. Until B lands, stale-task cleanup remains on-demand via
`make cleanup-stale` (or a manual daily cron).
---
@@ -142,11 +140,9 @@ consistency sweep via the local systemd timer.
same DB / NATS effects as the current cron entries.
3. Run both in parallel for one week (cron + ActivityDefinition). The
scripts are idempotent — duplicate runs are no-ops on a clean state.
4. Disable the systemd timer:
`systemctl --user disable --now custodian-sync.timer`
5. Remove the cleanup-stale cron entry from `crontab -e`.
6. Update `infra/README.md` to point at the ActivityDefinitions and
archive the systemd unit files.
4. ~~Disable the systemd timer~~ **done** 2026-06-21 (`STATE-WP-0064`).
5. Remove the cleanup-stale cron entry from `crontab -e` (when B is enabled).
6. ~~Update `infra/README.md` and archive systemd unit files~~ **done**.
7. Per-commit hook stays until a `repo.commit.pushed` event exists.
---