state-hub/workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md
tegwick ab14e77e77 feat(STATE-WP-0064): start parallel week with source-tagged sweep runners
Tag consistency_sweep_remote_all progress events by source, route the local
timer through the API, add a parallel-week comparison script, and document
the 2026-06-21 to 2026-06-28 observation window for T03.
2026-06-21 21:46:43 +02:00


id: STATE-WP-0064
type: workplan
title: Move State Hub consistency sync to Railiance01 (activity-core)
domain: custodian
repo: state-hub
status: active
owner: codex
topic_slug: custodian
created: 2026-06-21
updated: 2026-06-21
parallel_week_end: 2026-06-28
state_hub_workstream_id: 669d810a-53f4-448b-a0c1-a6543daa7c44

STATE-WP-0064 — Move State Hub consistency sync to Railiance01

Origin: history/20260621-weekend-automation-assessment.md and docs/cron-migration.md design stub (CUST-WP-0040 T04).

The 15-minute workplan↔DB reconciliation is a State Hub read-model maintenance job across all registered repos. The legacy name custodian-sync reflects the owning domain, not the job's scope. Operator-facing names should use State Hub consistency sync; the ActivityDefinition id the-custodian.state-hub-consistency-sweep already matches this in docs/cron-migration.md.

This workplan moves scheduling to activity-core on Railiance01 while scripts/consistency_check.py remains in the state-hub repo.

Depends on STATE-WP-0063 repairing the current broken local path so there is a known-good baseline before cutover.

Scope

In scope:

  • Land state-hub-consistency-sweep ActivityDefinition in the-custodian/activity-definitions/.
  • Run the sweep from Railiance01 against the workstation State Hub via the existing bridge/tunnel pattern (actcore-state-hub-bridge or equivalent).
  • Parallel-run with local custodian-sync.timer for one week, then disable the local timer.
  • Update infra/README.md, docs/cron-migration.md, and operator runbooks.

Out of scope:

  • Changing consistency_check.py reconciliation rules (ADR-001 logic stays).
  • Renaming # custodian-sync-hook in every registered repo's git hook (separate hygiene pass; hooks may keep the marker until all repos are updated).
  • Per-commit hook migration to event-driven activity-core (see cron-migration §C).

Naming decision (decided)

| Layer | Current | Target |
| --- | --- | --- |
| Operator docs | custodian sync / custodian-sync | State Hub consistency sync |
| ActivityDefinition id | (not landed) | the-custodian.state-hub-consistency-sweep |
| systemd unit (interim) | custodian-sync.{service,timer} | disable after cutover; optional rename to statehub-consistency-sync.* during WP-0063 if low cost |
| git hook marker | # custodian-sync-hook | unchanged in this workplan |

T1 — ActivityDefinition and cluster wiring

id: STATE-WP-0064-T01
status: done
priority: high
state_hub_task_id: "ecc0f846-e00f-4063-8ec1-f6ad630e9265"

Create the-custodian/activity-definitions/state-hub-consistency-sweep.md from the draft in docs/cron-migration.md §2A, adjusting:

  • shell command to reach the workstation repo path or a cluster-side checkout
  • STATE_HUB_URL via bridge service (not hard-coded 127.0.0.1 on cluster)
  • misfire_policy: skip and --max-seconds 300 budget
  • on_failure: log_and_continue for warn-only sweeps

Sync the definition to activity-core on Railiance01 (projection manifest, following the hourly-recently-on-scope precedent). Enable it only after the manual canary succeeds.
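The adjustments listed above can be checked mechanically before the definition is synced. The following is a minimal sketch, assuming the definition has already been parsed into a dict. The key names `max_seconds` and `env` and the function `check_sweep_definition` are hypothetical and are not taken from the activity-core schema; only `misfire_policy: skip`, `on_failure: log_and_continue`, the 300-second budget, and `STATE_HUB_URL` come from this workplan.

```python
from urllib.parse import urlparse

# Hosts that would point the cluster job at itself instead of the bridge.
LOOPBACK_HOSTS = {"127.0.0.1", "localhost", "::1"}


def check_sweep_definition(definition: dict) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    if definition.get("misfire_policy") != "skip":
        problems.append("misfire_policy should be 'skip'")
    if definition.get("on_failure") != "log_and_continue":
        problems.append("on_failure should be 'log_and_continue'")

    # Hypothetical key: the --max-seconds budget stored as a number.
    if definition.get("max_seconds") != 300:
        problems.append("max_seconds should be 300")

    # Hypothetical key: environment variables passed to the run.
    url = definition.get("env", {}).get("STATE_HUB_URL", "")
    host = urlparse(url).hostname
    if host is None:
        problems.append("STATE_HUB_URL is missing or not a URL")
    elif host in LOOPBACK_HOSTS:
        problems.append("STATE_HUB_URL must not be a loopback address")

    return problems
```

A check like this only guards the four settings named in this task; it says nothing about whether the bridge is actually reachable from the cluster.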

Done 2026-06-21:

  • State Hub POST /consistency/sweep/remote-all + progress event consistency_sweep_remote_all
  • ActivityDefinition in the-custodian/activity-definitions/ (enabled: false)
  • activity-core resolver query + k8s projection in 20-runtime.yaml
  • Uses API invocation pattern (not cluster shell into laptop repo)

T2 — Manual canary on Railiance01

id: STATE-WP-0064-T02
status: done
priority: high
state_hub_task_id: "2e9b5b66-a7b1-46a5-8e1f-22e6b5caeff6"

Trigger one manual ActivityRun. Confirm:

  • consistency_check.py --remote --all completes within budget
  • C-15 writeback and C-16 pull gate behave as today
  • progress or activity-core run history shows success
  • no duplicate side-effects when local timer also fires (idempotent)

Done 2026-06-21:

  • Applied 20-runtime.yaml on Railiance01; actcore-sync upserted definition 7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b (paused schedule).
  • Rebuilt/imported activity-core:railiance01-prod with consistency_sweep_remote_all resolver.
  • Bridge proxy POST timeout raised to 360s (30s was aborting sweeps).
  • Manual canaries: cluster POST via bridge (exit_code 0, progress event 65d0bc12-…) and worker resolver (exit_code 0, 1 repo @ 60s budget).
  • Running make sync-activity-definitions from the laptop is not valid against the Railiance01 DB; use the actcore-sync job via kubectl instead.
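The timeout fix above follows from the budget: a client or proxy that gives up after 30 s will abort a sweep that is allowed to run for up to 300 s. The sketch below shows one way a caller could build the POST with a timeout above the budget, using only the standard library. The empty JSON body and the helper names are assumptions; the workplan does not document the request schema for `/consistency/sweep/remote-all`.

```python
import json
import urllib.request

SWEEP_PATH = "/consistency/sweep/remote-all"
SWEEP_BUDGET_SECONDS = 300
# 360 s mirrors the bridge proxy timeout from T2 and leaves headroom
# over the sweep budget.
REQUEST_TIMEOUT_SECONDS = 360


def build_sweep_request(base_url: str) -> urllib.request.Request:
    """Build (but do not send) the POST that triggers a remote-all sweep."""
    url = base_url.rstrip("/") + SWEEP_PATH
    # Assumption: the endpoint accepts an empty JSON object as its body.
    body = json.dumps({}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def trigger_sweep(base_url: str) -> int:
    """Send the request and return the HTTP status code."""
    request = build_sweep_request(base_url)
    with urllib.request.urlopen(request, timeout=REQUEST_TIMEOUT_SECONDS) as response:
        return response.status
```

Calling `trigger_sweep(os.environ["STATE_HUB_URL"])` would send the request to whatever base URL the runner is configured with. Every hop in between (bridge, proxy, client) needs a timeout above the sweep budget, otherwise the shortest one wins.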

T3 — Parallel run and observability

id: STATE-WP-0064-T03
status: progress
priority: medium
state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"

Run cluster schedule (*/15 * * * * UTC per design stub) alongside local custodian-sync.timer for one week. Compare:

  • sweep completion rate
  • repos skipped due to lock or budget
  • hard failures vs warn-only exits

Document comparison in a progress event or short runbook addendum.
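The three comparison points can be computed from the source-tagged progress events. The sketch below is illustrative only and is not the contents of scripts/compare_consistency_sweep_parallel.py. It assumes each event is a dict with a `detail` object; `detail.source` and `exit_code` appear in this workplan, while `warn_only` and `skipped_repos` are hypothetical field names.

```python
from collections import Counter, defaultdict


def summarize_by_source(events: list[dict]) -> dict[str, dict]:
    """Aggregate sweep events per detail.source."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for event in events:
        detail = event.get("detail", {})
        bucket = counts[detail.get("source", "unknown")]
        bucket["runs"] += 1
        if detail.get("exit_code", 0) == 0:
            bucket["completed"] += 1
        elif detail.get("warn_only", False):  # hypothetical field
            bucket["warn_only"] += 1
        else:
            bucket["hard_failures"] += 1
        # Hypothetical field: repos skipped due to lock or budget.
        bucket["repos_skipped"] += len(detail.get("skipped_repos", []))

    summary = {}
    for source, bucket in counts.items():
        runs = bucket["runs"]
        summary[source] = {
            "runs": runs,
            "completion_rate": bucket["completed"] / runs,
            "repos_skipped": bucket["repos_skipped"],
            "hard_failures": bucket["hard_failures"],
            "warn_only": bucket["warn_only"],
        }
    return summary
```

Placing the summaries for `local-timer` and `activity-core` side by side gives the comparison this task asks for. Events without a source tag land in an `unknown` bucket rather than being dropped, so untagged runs remain visible.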

Progress 2026-06-21 (parallel week started):

  • Enabled state-hub-consistency-sweep on Railiance01 (enabled: true, Temporal schedule upserted — no longer paused).
  • Unified both runners on POST /consistency/sweep/remote-all with detail.source (local-timer vs activity-core).
  • Local custodian-sync.service now calls the API (not direct script).
  • Added scripts/compare_consistency_sweep_parallel.py and runbook §T3.
  • Review window ends 2026-06-28; then proceed to T04 cutover.
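To judge completion rates over the review window, it helps to know how many runs a */15 schedule should have produced. The sketch below counts quarter-hour boundaries in a UTC interval. The function name is hypothetical, and the workplan gives the window as dates only, so the exact start and end times are an assumption the caller has to supply.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(minutes=15)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expected_slots(start: datetime, end: datetime) -> int:
    """Count quarter-hour boundaries t with start <= t < end, in UTC."""
    start = start.astimezone(timezone.utc)
    end = end.astimezone(timezone.utc)
    # Round start up to the next quarter-hour boundary.
    remainder = (start - EPOCH) % INTERVAL
    first = start if not remainder else start + (INTERVAL - remainder)
    if first >= end:
        return 0
    # Subtract one microsecond so a boundary exactly at `end` is excluded.
    return (end - first - timedelta(microseconds=1)) // INTERVAL + 1
```

Assuming the window runs from midnight UTC on 2026-06-21 to midnight UTC on 2026-06-28, each runner has 672 scheduled slots (7 days × 96 per day). The local timer may not fire on the same boundaries as the cluster schedule, so this number is a rough denominator rather than an exact target.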

T4 — Retire local timer

id: STATE-WP-0064-T04
status: todo
priority: medium
state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"

After parallel week passes:

systemctl --user disable --now custodian-sync.timer

Archive or update unit files under infra/. Mark cron-migration stub §5 step 4 complete. Update docs/activity-core-delegation.md cross-reference.

T5 — Docs and operator handoff

id: STATE-WP-0064-T05
status: progress
priority: low
state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"

  • infra/README.md: primary schedule is activity-core on Railiance01; local timer is retired.
  • docs/cron-migration.md: promote §2A from design stub to implemented; note blockers cleared.
  • Dashboard or AGENTS snippet: "State Hub consistency sync" terminology.

Mark workplan finished when cluster schedule is the sole primary runner.

Progress 2026-06-21: docs/consistency-sweep-runbook.md added; infra/README.md and docs/cron-migration.md updated for API + parallel week. Parallel-week observability script landed; final cutover wording deferred to T04.