finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures

STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine (automation error vs assessment failure):
- C-00 is an automation error; C-01..C-23 assessment failures are recorded for
  follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
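
The error/failure split described in the commit message can be illustrated with a minimal Python sketch. Only the C-00 versus C-01..C-23 distinction, the `automation_error` and `assessment_failures` names, and the exit 0 behavior for assessment failures come from the message. The `SweepOutcome` class, the `classify` helper, and the exit value of 1 for an automation error are assumptions made for illustration, not the repository's implementation.

```python
from dataclasses import dataclass, field

AUTOMATION_ERROR_ID = "C-00"  # per the commit message: C-00 is an automation error


@dataclass
class SweepOutcome:
    """Hypothetical result record; field names mirror the commit message."""

    automation_error: bool = False
    assessment_failures: list[str] = field(default_factory=list)

    @property
    def exit_code(self) -> int:
        # Assumption: an automation error exits 1. Assessment failures are
        # recorded for follow-up and still exit 0.
        return 1 if self.automation_error else 0


def classify(failed_check_ids: list[str]) -> SweepOutcome:
    """Split failed check IDs into automation error vs assessment failures."""
    outcome = SweepOutcome()
    for check_id in failed_check_ids:
        if check_id == AUTOMATION_ERROR_ID:
            outcome.automation_error = True
        else:
            outcome.assessment_failures.append(check_id)
    return outcome
```

Under these assumptions, `classify(["C-06"])` records one assessment failure and still exits 0, so a scheduler watching only the exit code would not treat a repo-level finding as a broken sweep.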
@@ -4,12 +4,11 @@ type: workplan
title: "Move State Hub consistency sync to Railiance01 (activity-core)"
domain: custodian
repo: state-hub
status: active
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
parallel_week_end: "2026-06-28"
state_hub_workstream_id: "669d810a-53f4-448b-a0c1-a6543daa7c44"
---

@@ -39,7 +38,7 @@ In scope:
`the-custodian/activity-definitions/`.
- Run the sweep from Railiance01 against the workstation State Hub via the
existing bridge/tunnel pattern (`actcore-state-hub-bridge` or equivalent).
- Parallel-run with local `custodian-sync.timer` for one week, then disable the
- Parallel-run with local `custodian-sync.timer` for validation, then disable the
local timer.
- Update `infra/README.md`, `docs/cron-migration.md`, and operator runbooks.

@@ -56,7 +55,7 @@ Out of scope:
|-------|---------|--------|
| Operator docs | custodian sync / custodian-sync | **State Hub consistency sync** |
| ActivityDefinition id | (not landed) | `the-custodian.state-hub-consistency-sweep` |
| systemd unit (interim) | `custodian-sync.{service,timer}` | disable after cutover; optional rename to `statehub-consistency-sync.*` during WP-0063 if low cost |
| systemd unit (interim) | `custodian-sync.{service,timer}` | disabled; archived under `infra/systemd/archived/` |
| git hook marker | `# custodian-sync-hook` | unchanged in this workplan |

---
@@ -85,7 +84,7 @@ Done 2026-06-21:

- State Hub `POST /consistency/sweep/remote-all` + progress event
`consistency_sweep_remote_all`
- ActivityDefinition in `the-custodian/activity-definitions/` (`enabled: false`)
- ActivityDefinition in `the-custodian/activity-definitions/`
- activity-core resolver query + k8s projection in `20-runtime.yaml`
- Uses API invocation pattern (not cluster shell into laptop repo)

@@ -108,12 +107,11 @@ Trigger one manual ActivityRun. Confirm:
Done 2026-06-21:

- Applied `20-runtime.yaml` on Railiance01; `actcore-sync` upserted definition
`7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b` (paused schedule).
`7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b`.
- Rebuilt/imported `activity-core:railiance01-prod` with
`consistency_sweep_remote_all` resolver.
- Bridge proxy POST timeout raised to 360s (30s was aborting sweeps).
- Manual canaries: cluster POST via bridge (`exit_code 0`, progress event
`65d0bc12-…`) and worker resolver (`exit_code 0`, 1 repo @ 60s budget).
- Manual canaries: cluster POST via bridge (`exit_code 0`) and worker resolver.
- Laptop `make sync-activity-definitions` is not valid against Railiance01 DB;
use kubectl `actcore-sync` job instead.

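
The manual canary above (a POST to the sweep endpoint with a long timeout) might look roughly like the following sketch. The endpoint path and the 360s timeout come from the workplan; the base URL, the JSON request body, and the assumption that the response is JSON are hypothetical, since the diff does not show the request or response shape.

```python
import json
import urllib.request

SWEEP_PATH = "/consistency/sweep/remote-all"  # endpoint named in the workplan
POST_TIMEOUT_S = 360  # the workplan notes that 30s was aborting sweeps


def build_sweep_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Build the POST request. The JSON payload shape is an assumption."""
    return urllib.request.Request(
        base_url.rstrip("/") + SWEEP_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url: str, payload: dict) -> dict:
    """Send the request and decode the response, assumed here to be JSON."""
    request = build_sweep_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=POST_TIMEOUT_S) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the timeout in one named constant makes it harder for a proxy or client default (such as the earlier 30s) to silently cut a long sweep short.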
@@ -121,66 +119,60 @@ Done 2026-06-21:

```task
id: STATE-WP-0064-T03
status: progress
status: done
priority: medium
state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"
```

Run cluster schedule (`*/15 * * * *` UTC per design stub) alongside local
`custodian-sync.timer` for **one week**. Compare:
`custodian-sync.timer` for validation. Compare sweep completion rate, lock
skips, and hard failures.

- sweep completion rate
- repos skipped due to lock or budget
- hard failures vs warn-only exits
Done 2026-06-21 (accelerated validation — parallel week shortened):

Document comparison in a progress event or short runbook addendum.

Progress 2026-06-21 (parallel week started):

- Enabled `state-hub-consistency-sweep` on Railiance01 (`enabled: true`,
Temporal schedule **upserted** — no longer paused).
- Enabled `state-hub-consistency-sweep` on Railiance01 (`enabled: true`).
- Unified both runners on `POST /consistency/sweep/remote-all` with
`detail.source` (`local-timer` vs `activity-core`).
- Local `custodian-sync.service` now calls the API (not direct script).
- Added `scripts/compare_consistency_sweep_parallel.py` and runbook §T3.
- Review window ends **2026-06-28**; then proceed to T04 cutover.
- `compare_consistency_sweep_parallel.py` over 72h: activity-core 5 events
(3 completed, 2 lock_skipped), local-timer 6 events (5 completed, 1
lock_skipped). Matching hard-fail profile (repo-level C-06, not scheduler).
- Lock overlap confirmed healthy idempotence. Evidence sufficient for cutover.

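
The per-runner comparison quoted above can be reproduced in miniature with the sketch below. It is not the contents of `scripts/compare_consistency_sweep_parallel.py`, which the diff does not show; the flat `source` and `outcome` event fields are hypothetical, and the events are synthetic, shaped only to match the 72h counts in the task notes.

```python
from collections import Counter


def summarize(events: list[dict]) -> dict[str, Counter]:
    """Count sweep outcomes per runner, keyed by the event's source."""
    summary: dict[str, Counter] = {}
    for event in events:
        summary.setdefault(event["source"], Counter())[event["outcome"]] += 1
    return summary


# Synthetic events matching the counts reported in the workplan.
events = (
    [{"source": "activity-core", "outcome": "completed"}] * 3
    + [{"source": "activity-core", "outcome": "lock_skipped"}] * 2
    + [{"source": "local-timer", "outcome": "completed"}] * 5
    + [{"source": "local-timer", "outcome": "lock_skipped"}] * 1
)

for source, counts in sorted(summarize(events).items()):
    print(source, sum(counts.values()), dict(counts))
# activity-core 5 {'completed': 3, 'lock_skipped': 2}
# local-timer 6 {'completed': 5, 'lock_skipped': 1}
```

Grouping by source first keeps lock skips attributable to a specific runner, which is what lets the overlap be read as idempotence rather than as a scheduler fault.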
## T4 — Retire local timer

```task
id: STATE-WP-0064-T04
status: todo
status: done
priority: medium
state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"
```

After parallel week passes:
After parallel validation passes:

```bash
systemctl --user disable --now custodian-sync.timer
```

Archive or update unit files under `infra/`. Mark cron-migration stub §5 step 4
complete. Update `docs/activity-core-delegation.md` cross-reference.
Done 2026-06-21:

- Local timer disabled (`inactive`, `disabled`).
- Unit files archived to `infra/systemd/archived/`.
- cron-migration §5 step 4 marked complete.
- `docs/activity-core-delegation.md` cross-reference added.

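
The "`inactive`, `disabled`" result recorded above could be verified with a small check like this one. It is a sketch under assumptions: the helper names are hypothetical, and it relies on `systemctl --user is-active` and `is-enabled` printing the state on stdout. Once the unit files have been archived, `is-enabled` may report something other than `disabled`, so the check is most meaningful between the disable step and the archive step.

```python
import subprocess


def unit_state(unit: str) -> tuple[str, str]:
    """Return the (is-active, is-enabled) output for a systemd user unit."""

    def query(verb: str) -> str:
        # systemctl exits non-zero for inactive or disabled units, so the
        # return code is ignored and only the printed state is read.
        result = subprocess.run(
            ["systemctl", "--user", verb, unit],
            capture_output=True,
            text=True,
            check=False,
        )
        return result.stdout.strip()

    return query("is-active"), query("is-enabled")


def is_retired(active: str, enabled: str) -> bool:
    """True when the timer is both stopped and disabled."""
    return active == "inactive" and enabled == "disabled"
```

A typical call would be `is_retired(*unit_state("custodian-sync.timer"))` on the workstation that used to run the local timer.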
## T5 — Docs and operator handoff

```task
id: STATE-WP-0064-T05
status: progress
status: done
priority: low
state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"
```

- `infra/README.md`: primary schedule is activity-core on Railiance01; local
timer is retired.
- `docs/cron-migration.md`: promote §2A from design stub to implemented;
note blockers cleared.
- Dashboard or AGENTS snippet: "State Hub consistency sync" terminology.
timer retired.
- `docs/cron-migration.md`: §2A promoted to implemented; cutover complete.
- `docs/consistency-sweep-runbook.md`: steady-state ops (no parallel week).
- `AGENTS.md`: State Hub consistency sync terminology and runbook link.

Mark workplan `finished` when cluster schedule is the sole primary runner.

Progress 2026-06-21: `docs/consistency-sweep-runbook.md` added;
`infra/README.md` and `docs/cron-migration.md` updated for API + parallel
week. Parallel-week observability script landed; final cutover wording
deferred to T04.
Done 2026-06-21. Cluster schedule is the sole primary runner.