---
id: STATE-WP-0064
type: workplan
title: "Move State Hub consistency sync to Railiance01 (activity-core)"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "669d810a-53f4-448b-a0c1-a6543daa7c44"
---

# STATE-WP-0064 — Move State Hub consistency sync to Railiance01

**Origin:** `history/20260621-weekend-automation-assessment.md` and
`docs/cron-migration.md` design stub (CUST-WP-0040 T04).

The 15-minute workplan↔DB reconciliation is a **State Hub read-model
maintenance** job across all registered repos. The legacy name **custodian-sync**
reflects the owning domain, not the job's scope. Operator-facing names should
use **State Hub consistency sync**; the ActivityDefinition id
`the-custodian.state-hub-consistency-sweep` already matches this in
`docs/cron-migration.md`.

This workplan moves **scheduling** to activity-core on Railiance01 while
`scripts/consistency_check.py` remains in the `state-hub` repo.

Depends on `STATE-WP-0063` repairing the current broken local path so there is
a known-good baseline before cutover.

## Scope

In scope:

- Land `state-hub-consistency-sweep` ActivityDefinition in
  `the-custodian/activity-definitions/`.
- Run the sweep from Railiance01 against the workstation State Hub via the
  existing bridge/tunnel pattern (`actcore-state-hub-bridge` or equivalent).
- Parallel-run with local `custodian-sync.timer` for validation, then disable the
  local timer.
- Update `infra/README.md`, `docs/cron-migration.md`, and operator runbooks.

Out of scope:

- Changing `consistency_check.py` reconciliation rules (ADR-001 logic stays).
- Renaming `# custodian-sync-hook` in every registered repo's git hook (separate
  hygiene pass; hooks may keep the marker until all repos are updated).
- Per-commit hook migration to event-driven activity-core (see cron-migration §C).

## Naming decision (decided)

| Layer | Current | Target |
|-------|---------|--------|
| Operator docs | custodian sync / custodian-sync | **State Hub consistency sync** |
| ActivityDefinition id | (not landed) | `the-custodian.state-hub-consistency-sweep` |
| systemd unit (interim) | `custodian-sync.{service,timer}` | disabled; archived under `infra/systemd/archived/` |
| git hook marker | `# custodian-sync-hook` | unchanged in this workplan |

---

## T1 — ActivityDefinition and cluster wiring

```task
id: STATE-WP-0064-T01
status: done
priority: high
state_hub_task_id: "ecc0f846-e00f-4063-8ec1-f6ad630e9265"
```

Create `the-custodian/activity-definitions/state-hub-consistency-sweep.md`
from the draft in `docs/cron-migration.md` §2A, adjusting:

- shell command to reach the workstation repo path or a cluster-side checkout
- `STATE_HUB_URL` via bridge service (not hard-coded `127.0.0.1` on cluster)
- `misfire_policy: skip` and `--max-seconds 300` budget
- `on_failure: log_and_continue` for warn-only sweeps

Sync definition to Railiance01 activity-core (projection manifest per
`hourly-recently-on-scope` precedent). Enable after manual canary.

Done 2026-06-21:

- State Hub `POST /consistency/sweep/remote-all` + progress event
  `consistency_sweep_remote_all`
- ActivityDefinition in `the-custodian/activity-definitions/`
- activity-core resolver query + k8s projection in `20-runtime.yaml`
- Uses API invocation pattern (not cluster shell into laptop repo)

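
The API invocation pattern noted above can be sketched roughly as follows. This is a minimal illustration, not the resolver that shipped: the fallback host name, the JSON `source` field in the request body, and both function names are assumptions made for the example. Only the endpoint path, the `STATE_HUB_URL` variable, and the 360 s bridge timeout come from this workplan.

```python
import json
import os
import urllib.request

# Assumption: placeholder fallback host, not a documented service name.
DEFAULT_BASE_URL = "http://state-hub-bridge.example:8080"
SWEEP_PATH = "/consistency/sweep/remote-all"  # endpoint named in this workplan


def build_sweep_request(base_url: str, source: str) -> urllib.request.Request:
    """Build (but do not send) the POST that asks State Hub to run the sweep."""
    url = base_url.rstrip("/") + SWEEP_PATH
    # Assumption: the caller identifies itself through a JSON "source" field.
    body = json.dumps({"source": source}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def trigger_sweep(source: str = "activity-core", timeout_s: float = 360.0) -> int:
    """Send the sweep request and return the HTTP status code.

    The base URL is read from STATE_HUB_URL so the cluster never relies on a
    hard-coded 127.0.0.1. The client timeout is kept above the 300 s sweep
    budget so a slow sweep is not cut off by the caller.
    """
    base_url = os.environ.get("STATE_HUB_URL", DEFAULT_BASE_URL)
    request = build_sweep_request(base_url, source)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.status
```

Splitting request construction from sending keeps the URL and payload logic checkable without a reachable State Hub.
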
## T2 — Manual canary on Railiance01

```task
id: STATE-WP-0064-T02
status: done
priority: high
state_hub_task_id: "2e9b5b66-a7b1-46a5-8e1f-22e6b5caeff6"
```

Trigger one manual ActivityRun. Confirm:

- `consistency_check.py --remote --all` completes within budget
- C-15 writeback and C-16 pull gate behave as today
- progress or activity-core run history shows success
- no duplicate side-effects when local timer also fires (idempotent)

Done 2026-06-21:

- Applied `20-runtime.yaml` on Railiance01; `actcore-sync` upserted definition
  `7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b`.
- Rebuilt/imported `activity-core:railiance01-prod` with
  `consistency_sweep_remote_all` resolver.
- Bridge proxy POST timeout raised to 360s (30s was aborting sweeps).
- Manual canaries: cluster POST via bridge (`exit_code 0`) and worker resolver.
- Laptop `make sync-activity-definitions` is not valid against Railiance01 DB;
  use kubectl `actcore-sync` job instead.

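
The "no duplicate side-effects" criterion is commonly met with a non-blocking exclusive lock: whichever runner arrives second skips instead of waiting. The sketch below shows that general technique with `fcntl.flock` on Unix. It is not taken from `consistency_check.py`; the lock path, the function names, and the returned status strings are assumptions chosen to mirror the `completed` / `lock_skipped` vocabulary used in this workplan.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def sweep_lock(lock_path: str):
    """Yield True if this runner holds the lock, False if another one does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    acquired = False
    try:
        try:
            # LOCK_NB makes the call fail immediately instead of blocking.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        if acquired:
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def run_sweep_once(lock_path: str, sweep) -> str:
    """Run `sweep` only if no other runner currently holds the lock."""
    with sweep_lock(lock_path) as acquired:
        if not acquired:
            return "lock_skipped"  # another runner is mid-sweep; do nothing
        sweep()
        return "completed"


if __name__ == "__main__":
    # Hypothetical lock location, used only for this demonstration.
    demo_path = os.path.join(tempfile.mkdtemp(), "sweep.lock")
    print(run_sweep_once(demo_path, lambda: None))
```

Skipping rather than queueing means an overlapping trigger from the local timer and the cluster schedule produces one sweep, not two back-to-back ones.
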
## T3 — Parallel run and observability

```task
id: STATE-WP-0064-T03
status: done
priority: medium
state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"
```

Run cluster schedule (`*/15 * * * *` UTC per design stub) alongside local
`custodian-sync.timer` for validation. Compare sweep completion rate, lock
skips, and hard failures.

Done 2026-06-21 (accelerated validation — parallel week shortened):

- Enabled `state-hub-consistency-sweep` on Railiance01 (`enabled: true`).
- Unified both runners on `POST /consistency/sweep/remote-all` with
  `detail.source` (`local-timer` vs `activity-core`).
- `compare_consistency_sweep_parallel.py` over 72h: activity-core 5 events
  (3 completed, 2 lock_skipped), local-timer 6 events (5 completed, 1
  lock_skipped). Matching hard-fail profile (repo-level C-06, not scheduler).
- Lock overlap confirmed healthy idempotence. Evidence sufficient for cutover.

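
A per-runner comparison of this kind amounts to grouping progress events by `detail.source` and counting outcomes. The sketch below is not the actual `compare_consistency_sweep_parallel.py`: the event shape (a `detail` dict carrying `source` and `status`) and the helper names are assumptions, and the sample events are synthetic, built only to reproduce the counts reported above.

```python
from collections import Counter, defaultdict


def summarize_by_source(events):
    """Count sweep outcomes per runner, keyed by detail.source."""
    summary = defaultdict(Counter)
    for event in events:
        detail = event.get("detail", {})
        # Assumption: each event carries its runner and outcome in `detail`.
        source = detail.get("source", "unknown")
        status = detail.get("status", "unknown")
        summary[source][status] += 1
    return {source: dict(counts) for source, counts in summary.items()}


def make_events(source, status, count):
    """Build `count` synthetic events for one runner and outcome."""
    return [{"detail": {"source": source, "status": status}} for _ in range(count)]


# Synthetic sample mirroring the 72h figures in this section.
sample_events = (
    make_events("activity-core", "completed", 3)
    + make_events("activity-core", "lock_skipped", 2)
    + make_events("local-timer", "completed", 5)
    + make_events("local-timer", "lock_skipped", 1)
)

if __name__ == "__main__":
    for source, counts in summarize_by_source(sample_events).items():
        print(source, counts)
```
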
## T4 — Retire local timer

```task
id: STATE-WP-0064-T04
status: done
priority: medium
state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"
```

After parallel validation passes:

```bash
systemctl --user disable --now custodian-sync.timer
```

Done 2026-06-21:

- Local timer disabled (`inactive`, `disabled`).
- Unit files archived to `infra/systemd/archived/`.
- cron-migration §5 step 4 marked complete.
- `docs/activity-core-delegation.md` cross-reference added.

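
The `inactive` / `disabled` state recorded above can be checked with a short script. This is a sketch under assumptions: the function names are invented for the example, and once the unit files are archived `systemctl` may report a different state (for instance a not-found error with empty output), which this check treats as "not confirmed" rather than as success.

```python
import subprocess


def timer_is_retired(active_state: str, enabled_state: str) -> bool:
    """True only for the exact state pair recorded in this workplan."""
    return active_state.strip() == "inactive" and enabled_state.strip() == "disabled"


def query_timer_state(unit: str = "custodian-sync.timer"):
    """Return (is-active, is-enabled) output for a user-level systemd unit."""

    def ask(verb: str) -> str:
        # Both verbs exit non-zero for inactive/disabled units, so a non-zero
        # return code is expected here and must not raise.
        result = subprocess.run(
            ["systemctl", "--user", verb, unit],
            capture_output=True,
            text=True,
            check=False,
        )
        return result.stdout.strip()

    return ask("is-active"), ask("is-enabled")


if __name__ == "__main__":
    active, enabled = query_timer_state()
    print("retired" if timer_is_retired(active, enabled) else "not confirmed")
```
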
## T5 — Docs and operator handoff

```task
id: STATE-WP-0064-T05
status: done
priority: low
state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"
```

- `infra/README.md`: primary schedule is activity-core on Railiance01; local
  timer retired.
- `docs/cron-migration.md`: §2A promoted to implemented; cutover complete.
- `docs/consistency-sweep-runbook.md`: steady-state ops (no parallel week).
- `AGENTS.md`: State Hub consistency sync terminology and runbook link.

Done 2026-06-21. Cluster schedule is the sole primary runner.