finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures

STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
This commit is contained in:
2026-06-22 01:20:59 +02:00
parent 270033a50d
commit 39ed5459b9
14 changed files with 221 additions and 180 deletions

View File

@@ -4,12 +4,11 @@ type: workplan
title: "Move State Hub consistency sync to Railiance01 (activity-core)"
domain: custodian
repo: state-hub
status: active
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
parallel_week_end: "2026-06-28"
state_hub_workstream_id: "669d810a-53f4-448b-a0c1-a6543daa7c44"
---
@@ -39,7 +38,7 @@ In scope:
`the-custodian/activity-definitions/`.
- Run the sweep from Railiance01 against the workstation State Hub via the
existing bridge/tunnel pattern (`actcore-state-hub-bridge` or equivalent).
- Parallel-run with local `custodian-sync.timer` for one week, then disable the
- Parallel-run with local `custodian-sync.timer` for validation, then disable the
local timer.
- Update `infra/README.md`, `docs/cron-migration.md`, and operator runbooks.
@@ -56,7 +55,7 @@ Out of scope:
|-------|---------|--------|
| Operator docs | custodian sync / custodian-sync | **State Hub consistency sync** |
| ActivityDefinition id | (not landed) | `the-custodian.state-hub-consistency-sweep` |
| systemd unit (interim) | `custodian-sync.{service,timer}` | disable after cutover; optional rename to `statehub-consistency-sync.*` during WP-0063 if low cost |
| systemd unit (interim) | `custodian-sync.{service,timer}` | disabled; archived under `infra/systemd/archived/` |
| git hook marker | `# custodian-sync-hook` | unchanged in this workplan |
---
@@ -85,7 +84,7 @@ Done 2026-06-21:
- State Hub `POST /consistency/sweep/remote-all` + progress event
`consistency_sweep_remote_all`
- ActivityDefinition in `the-custodian/activity-definitions/` (`enabled: false`)
- ActivityDefinition in `the-custodian/activity-definitions/`
- activity-core resolver query + k8s projection in `20-runtime.yaml`
- Uses API invocation pattern (not cluster shell into laptop repo)
@@ -108,12 +107,11 @@ Trigger one manual ActivityRun. Confirm:
Done 2026-06-21:
- Applied `20-runtime.yaml` on Railiance01; `actcore-sync` upserted definition
`7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b` (paused schedule).
`7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b`.
- Rebuilt/imported `activity-core:railiance01-prod` with
`consistency_sweep_remote_all` resolver.
- Bridge proxy POST timeout raised to 360s (30s was aborting sweeps).
- Manual canaries: cluster POST via bridge (`exit_code 0`, progress event
`65d0bc12-…`) and worker resolver (`exit_code 0`, 1 repo @ 60s budget).
- Manual canaries: cluster POST via bridge (`exit_code 0`) and worker resolver.
- Laptop `make sync-activity-definitions` is not valid against Railiance01 DB;
use kubectl `actcore-sync` job instead.
@@ -121,66 +119,60 @@ Done 2026-06-21:
```task
id: STATE-WP-0064-T03
status: progress
status: done
priority: medium
state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"
```
Run cluster schedule (`*/15 * * * *` UTC per design stub) alongside local
`custodian-sync.timer` for **one week**. Compare:
`custodian-sync.timer` for validation. Compare sweep completion rate, lock
skips, and hard failures.
- sweep completion rate
- repos skipped due to lock or budget
- hard failures vs warn-only exits
Done 2026-06-21 (accelerated validation — parallel week shortened):
Document comparison in a progress event or short runbook addendum.
Progress 2026-06-21 (parallel week started):
- Enabled `state-hub-consistency-sweep` on Railiance01 (`enabled: true`,
Temporal schedule **upserted** — no longer paused).
- Enabled `state-hub-consistency-sweep` on Railiance01 (`enabled: true`).
- Unified both runners on `POST /consistency/sweep/remote-all` with
`detail.source` (`local-timer` vs `activity-core`).
- Local `custodian-sync.service` now calls the API (not direct script).
- Added `scripts/compare_consistency_sweep_parallel.py` and runbook §T3.
- Review window ends **2026-06-28**; then proceed to T04 cutover.
- `compare_consistency_sweep_parallel.py` over 72h: activity-core 5 events
(3 completed, 2 lock_skipped), local-timer 6 events (5 completed, 1
lock_skipped). Matching hard-fail profile (repo-level C-06, not scheduler).
- Lock overlap confirmed healthy idempotence. Evidence sufficient for cutover.
## T4 — Retire local timer
```task
id: STATE-WP-0064-T04
status: todo
status: done
priority: medium
state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"
```
After parallel week passes:
After parallel validation passes:
```bash
systemctl --user disable --now custodian-sync.timer
```
Archive or update unit files under `infra/`. Mark cron-migration stub §5 step 4
complete. Update `docs/activity-core-delegation.md` cross-reference.
Done 2026-06-21:
- Local timer disabled (`inactive`, `disabled`).
- Unit files archived to `infra/systemd/archived/`.
- cron-migration §5 step 4 marked complete.
- `docs/activity-core-delegation.md` cross-reference added.
## T5 — Docs and operator handoff
```task
id: STATE-WP-0064-T05
status: progress
status: done
priority: low
state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"
```
- `infra/README.md`: primary schedule is activity-core on Railiance01; local
timer is retired.
- `docs/cron-migration.md`: promote §2A from design stub to implemented;
note blockers cleared.
- Dashboard or AGENTS snippet: "State Hub consistency sync" terminology.
timer retired.
- `docs/cron-migration.md`: §2A promoted to implemented; cutover complete.
- `docs/consistency-sweep-runbook.md`: steady-state ops (no parallel week).
- `AGENTS.md`: State Hub consistency sync terminology and runbook link.
Mark workplan `finished` when cluster schedule is the sole primary runner.
Progress 2026-06-21: `docs/consistency-sweep-runbook.md` added;
`infra/README.md` and `docs/cron-migration.md` updated for API + parallel
week. Parallel-week observability script landed; final cutover wording
deferred to T04.
Done 2026-06-21. Cluster schedule is the sole primary runner.