Files
state-hub/workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md
tegwick 39ed5459b9 finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures
STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
2026-06-22 01:20:59 +02:00

178 lines
6.1 KiB
Markdown

---
id: STATE-WP-0064
type: workplan
title: "Move State Hub consistency sync to Railiance01 (activity-core)"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "669d810a-53f4-448b-a0c1-a6543daa7c44"
---
# STATE-WP-0064 — Move State Hub consistency sync to Railiance01
**Origin:** `history/20260621-weekend-automation-assessment.md` and
`docs/cron-migration.md` design stub (CUST-WP-0040 T04).
The 15-minute workplan↔DB reconciliation is a **State Hub read-model
maintenance** job across all registered repos. The legacy name **custodian-sync**
reflects the owning domain, not the job's scope. Operator-facing names should
use **State Hub consistency sync**; the ActivityDefinition id
`the-custodian.state-hub-consistency-sweep` already matches this in
`docs/cron-migration.md`.
This workplan moves **scheduling** to activity-core on Railiance01 while
`scripts/consistency_check.py` remains in the `state-hub` repo.
Depends on `STATE-WP-0063` repairing the current broken local path so there is
a known-good baseline before cutover.
## Scope
In scope:
- Land `state-hub-consistency-sweep` ActivityDefinition in
`the-custodian/activity-definitions/`.
- Run the sweep from Railiance01 against the workstation State Hub via the
existing bridge/tunnel pattern (`actcore-state-hub-bridge` or equivalent).
- Parallel-run with local `custodian-sync.timer` for validation, then disable the
local timer.
- Update `infra/README.md`, `docs/cron-migration.md`, and operator runbooks.
Out of scope:
- Changing consistency_check.py reconciliation rules (ADR-001 logic stays).
- Renaming `# custodian-sync-hook` in every registered repo's git hook (separate
hygiene pass; hooks may keep the marker until all repos are updated).
- Per-commit hook migration to event-driven activity-core (see cron-migration §C).
## Naming decision (decided)
| Layer | Current | Target |
|-------|---------|--------|
| Operator docs | custodian sync / custodian-sync | **State Hub consistency sync** |
| ActivityDefinition id | (not landed) | `the-custodian.state-hub-consistency-sweep` |
| systemd unit (interim) | `custodian-sync.{service,timer}` | disabled; archived under `infra/systemd/archived/` |
| git hook marker | `# custodian-sync-hook` | unchanged in this workplan |
---
## T1 — ActivityDefinition and cluster wiring
```task
id: STATE-WP-0064-T01
status: done
priority: high
state_hub_task_id: "ecc0f846-e00f-4063-8ec1-f6ad630e9265"
```
Create `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
from the draft in `docs/cron-migration.md` §2A, adjusting:
- shell command to reach the workstation repo path or a cluster-side checkout
- `STATE_HUB_URL` via bridge service (not hard-coded `127.0.0.1` on cluster)
- `misfire_policy: skip` and `--max-seconds 300` budget
- `on_failure: log_and_continue` for warn-only sweeps
Sync definition to Railiance01 activity-core (projection manifest per
`hourly-recently-on-scope` precedent). Enable after manual canary.
Done 2026-06-21:
- State Hub `POST /consistency/sweep/remote-all` + progress event
`consistency_sweep_remote_all`
- ActivityDefinition in `the-custodian/activity-definitions/`
- activity-core resolver query + k8s projection in `20-runtime.yaml`
- Uses API invocation pattern (not cluster shell into laptop repo)
## T2 — Manual canary on Railiance01
```task
id: STATE-WP-0064-T02
status: done
priority: high
state_hub_task_id: "2e9b5b66-a7b1-46a5-8e1f-22e6b5caeff6"
```
Trigger one manual ActivityRun. Confirm:
- `consistency_check.py --remote --all` completes within budget
- C-15 writeback and C-16 pull gate behave as today
- progress or activity-core run history shows success
- no duplicate side-effects when local timer also fires (idempotent)
Done 2026-06-21:
- Applied `20-runtime.yaml` on Railiance01; `actcore-sync` upserted definition
`7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b`.
- Rebuilt/imported `activity-core:railiance01-prod` with
`consistency_sweep_remote_all` resolver.
- Bridge proxy POST timeout raised to 360s (30s was aborting sweeps).
- Manual canaries: cluster POST via bridge (`exit_code 0`) and worker resolver.
- Laptop `make sync-activity-definitions` is not valid against Railiance01 DB;
use kubectl `actcore-sync` job instead.
## T3 — Parallel run and observability
```task
id: STATE-WP-0064-T03
status: done
priority: medium
state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"
```
Run cluster schedule (`*/15 * * * *` UTC per design stub) alongside local
`custodian-sync.timer` for validation. Compare sweep completion rate, lock
skips, and hard failures.
Done 2026-06-21 (accelerated validation — parallel week shortened):
- Enabled `state-hub-consistency-sweep` on Railiance01 (`enabled: true`).
- Unified both runners on `POST /consistency/sweep/remote-all` with
`detail.source` (`local-timer` vs `activity-core`).
- `compare_consistency_sweep_parallel.py` over 72h: activity-core 5 events
(3 completed, 2 lock_skipped), local-timer 6 events (5 completed, 1
lock_skipped). Matching hard-fail profile (repo-level C-06, not scheduler).
- Lock overlap confirmed healthy idempotence. Evidence sufficient for cutover.
## T4 — Retire local timer
```task
id: STATE-WP-0064-T04
status: done
priority: medium
state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"
```
After parallel validation passes:
```bash
systemctl --user disable --now custodian-sync.timer
```
Done 2026-06-21:
- Local timer disabled (`inactive`, `disabled`).
- Unit files archived to `infra/systemd/archived/`.
- cron-migration §5 step 4 marked complete.
- `docs/activity-core-delegation.md` cross-reference added.
## T5 — Docs and operator handoff
```task
id: STATE-WP-0064-T05
status: done
priority: low
state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"
```
- `infra/README.md`: primary schedule is activity-core on Railiance01; local
timer retired.
- `docs/cron-migration.md`: §2A promoted to implemented; cutover complete.
- `docs/consistency-sweep-runbook.md`: steady-state ops (no parallel week).
- `AGENTS.md`: State Hub consistency sync terminology and runbook link.
Done 2026-06-21. Cluster schedule is the sole primary runner.