state-hub/workplans/STATE-WP-0064-statehub-consistency-sync-railiance01.md
tegwick 39ed5459b9 finish(STATE-WP-0064): cut over scheduler and split sweep errors from failures
STATE-WP-0064 cutover (state-hub only):
- Retire local custodian-sync.timer; archive units under infra/systemd/archived/
- Mark workplan finished; update infra/README, cron-migration, runbook, AGENTS.md
- Point activity-core-delegation at the consistency-sweep runbook

Consistency engine — automation error vs assessment failure:
- C-00 is an automation error; C-01..C-23 assessment failures are recorded
  for follow-up but no longer fail --remote --all scheduled sweeps (exit 0)
- Skip workplans/README.md in the workplan glob (human index, not a workplan)
- Progress events and compare script expose automation_error and
  assessment_failures separately from exit_code
2026-06-22 01:20:59 +02:00
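
The split between an automation error and assessment failures described in the commit message can be sketched as follows. This is a minimal illustration, not the code in scripts/consistency_check.py: the names `SweepOutcome` and `classify_scheduled_sweep` are hypothetical, and the exit code of 1 for an automation error is an assumption (the commit message only states that assessment failures alone now exit 0).

```python
from dataclasses import dataclass, field

AUTOMATION_ERROR = "C-00"  # the only check treated as an automation error


@dataclass
class SweepOutcome:
    automation_error: bool
    assessment_failures: list = field(default_factory=list)
    exit_code: int = 0


def classify_scheduled_sweep(failed_checks):
    """Split failed check ids into an automation error and assessment failures."""
    automation_error = AUTOMATION_ERROR in failed_checks
    # C-01..C-23 are recorded for follow-up but do not fail the scheduled sweep.
    assessment_failures = sorted(set(failed_checks) - {AUTOMATION_ERROR})
    # Assumption: 1 is used as the non-zero code for an automation error.
    exit_code = 1 if automation_error else 0
    return SweepOutcome(automation_error, assessment_failures, exit_code)


if __name__ == "__main__":
    print(classify_scheduled_sweep(["C-06", "C-15"]))
    print(classify_scheduled_sweep(["C-00", "C-06"]))
```

Reporting the two fields next to `exit_code`, rather than deriving one from the other, lets a scheduler treat the run as successful while the repo-level findings stay visible.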


id: STATE-WP-0064
type: workplan
title: Move State Hub consistency sync to Railiance01 (activity-core)
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: 2026-06-21
updated: 2026-06-21
state_hub_workstream_id: 669d810a-53f4-448b-a0c1-a6543daa7c44

STATE-WP-0064 — Move State Hub consistency sync to Railiance01

Origin: history/20260621-weekend-automation-assessment.md and docs/cron-migration.md design stub (CUST-WP-0040 T04).

The 15-minute workplan↔DB reconciliation is a State Hub read-model maintenance job across all registered repos. The legacy name custodian-sync reflects the owning domain, not the job's scope. Operator-facing names should use State Hub consistency sync; the ActivityDefinition id the-custodian.state-hub-consistency-sweep already matches this in docs/cron-migration.md.

This workplan moves scheduling to activity-core on Railiance01 while scripts/consistency_check.py remains in the state-hub repo.

This workplan depends on STATE-WP-0063, which repairs the currently broken local path, so that a known-good baseline exists before cutover.

Scope

In scope:

  • Land state-hub-consistency-sweep ActivityDefinition in the-custodian/activity-definitions/.
  • Run the sweep from Railiance01 against the workstation State Hub via the existing bridge/tunnel pattern (actcore-state-hub-bridge or equivalent).
  • Parallel-run with local custodian-sync.timer for validation, then disable the local timer.
  • Update infra/README.md, docs/cron-migration.md, and operator runbooks.

Out of scope:

  • Changing consistency_check.py reconciliation rules (ADR-001 logic stays).
  • Renaming # custodian-sync-hook in every registered repo's git hook (separate hygiene pass; hooks may keep the marker until all repos are updated).
  • Per-commit hook migration to event-driven activity-core (see cron-migration §C).

Naming decision (decided)

| Layer | Current | Target |
| --- | --- | --- |
| Operator docs | custodian sync / custodian-sync | State Hub consistency sync |
| ActivityDefinition id | (not landed) | the-custodian.state-hub-consistency-sweep |
| systemd unit (interim) | custodian-sync.{service,timer} | disabled; archived under infra/systemd/archived/ |
| git hook marker | # custodian-sync-hook | unchanged in this workplan |

T1 — ActivityDefinition and cluster wiring

id: STATE-WP-0064-T01
status: done
priority: high
state_hub_task_id: "ecc0f846-e00f-4063-8ec1-f6ad630e9265"

Create the-custodian/activity-definitions/state-hub-consistency-sweep.md from the draft in docs/cron-migration.md §2A, adjusting:

  • shell command to reach the workstation repo path or a cluster-side checkout
  • STATE_HUB_URL via bridge service (not hard-coded 127.0.0.1 on cluster)
  • misfire_policy: skip and --max-seconds 300 budget
  • on_failure: log_and_continue for warn-only sweeps

Sync the definition to activity-core on Railiance01 (projection manifest, following the hourly-recently-on-scope precedent). Enable it after the manual canary.

Done 2026-06-21:

  • State Hub POST /consistency/sweep/remote-all + progress event consistency_sweep_remote_all
  • ActivityDefinition in the-custodian/activity-definitions/
  • activity-core resolver query + k8s projection in 20-runtime.yaml
  • Uses API invocation pattern (not cluster shell into laptop repo)

T2 — Manual canary on Railiance01

id: STATE-WP-0064-T02
status: done
priority: high
state_hub_task_id: "2e9b5b66-a7b1-46a5-8e1f-22e6b5caeff6"

Trigger one manual ActivityRun. Confirm:

  • consistency_check.py --remote --all completes within budget
  • C-15 writeback and C-16 pull gate behave as today
  • progress or activity-core run history shows success
  • no duplicate side-effects when local timer also fires (idempotent)
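
The "no duplicate side-effects" check relies on a second runner skipping while a sweep is already in progress. The sketch below shows one way such a guard can work, using an exclusive lock file. It is an assumption for illustration only: the source does not say how the real lock is implemented, and `sweep_lock` and `run_guarded` are hypothetical names. The two status strings mirror the `completed` and `lock_skipped` outcomes reported in this workplan.

```python
import os
from contextlib import contextmanager


@contextmanager
def sweep_lock(path):
    """Yield True if the lock was acquired, False if another runner holds it."""
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        yield False
        return
    try:
        yield True
    finally:
        os.close(fd)
        os.unlink(path)


def run_guarded(lock_path, sweep):
    """Run sweep() under the lock, or skip if a sweep is already running."""
    with sweep_lock(lock_path) as acquired:
        if not acquired:
            return "lock_skipped"
        sweep()
        return "completed"
```

A lock file like this stays behind if the process is killed mid-sweep, so a real implementation would likely need stale-lock handling or an OS-level lock instead.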

Done 2026-06-21:

  • Applied 20-runtime.yaml on Railiance01; actcore-sync upserted definition 7c4e9a12-8f3b-4d5e-9c6a-1b2d3e4f5a6b.
  • Rebuilt/imported activity-core:railiance01-prod with consistency_sweep_remote_all resolver.
  • Bridge proxy POST timeout raised to 360s (30s was aborting sweeps).
  • Manual canaries: cluster POST via bridge (exit_code 0) and worker resolver.
  • Running make sync-activity-definitions from the laptop is not valid against the Railiance01 DB; use the kubectl actcore-sync job instead.
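
The bridge call used in the canary can be sketched as below. The endpoint path, the STATE_HUB_URL variable, the 300 s budget, and the 360 s timeout come from this workplan. The rest is assumed: the empty request body, the JSON response, the function names, and the guard that rejects a timeout shorter than the sweep budget are illustrative rather than the actual activity-core resolver.

```python
import json
import os
import urllib.request

SWEEP_PATH = "/consistency/sweep/remote-all"  # endpoint added in T1
SWEEP_BUDGET_SECONDS = 300                    # --max-seconds budget from T1
PROXY_TIMEOUT_SECONDS = 360                   # bridge timeout after the T2 fix


def build_sweep_request(base_url=None):
    """Build the POST request, taking the base URL from STATE_HUB_URL if not given."""
    base = base_url or os.environ.get("STATE_HUB_URL")
    if not base:
        raise RuntimeError("STATE_HUB_URL is not set")
    # Assumption: the endpoint accepts an empty POST body.
    return urllib.request.Request(base.rstrip("/") + SWEEP_PATH, data=b"", method="POST")


def run_sweep(base_url=None, timeout=PROXY_TIMEOUT_SECONDS):
    """POST the sweep and return the decoded response (assumed to be JSON)."""
    if timeout <= SWEEP_BUDGET_SECONDS:
        # A 30 s timeout aborted sweeps that were still within their budget.
        raise ValueError("timeout must exceed the sweep budget")
    request = build_sweep_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping the timeout above the sweep budget means the sweep's own `--max-seconds` limit ends a long run before the proxy cuts the connection.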

T3 — Parallel run and observability

id: STATE-WP-0064-T03
status: done
priority: medium
state_hub_task_id: "8abb31ad-2f03-4aa7-889e-e60c3c39f1f8"

Run cluster schedule (*/15 * * * * UTC per design stub) alongside local custodian-sync.timer for validation. Compare sweep completion rate, lock skips, and hard failures.

Done 2026-06-21 (accelerated validation — parallel week shortened):

  • Enabled state-hub-consistency-sweep on Railiance01 (enabled: true).
  • Unified both runners on POST /consistency/sweep/remote-all with detail.source (local-timer vs activity-core).
  • compare_consistency_sweep_parallel.py over 72h: activity-core 5 events (3 completed, 2 lock_skipped), local-timer 6 events (5 completed, 1 lock_skipped). Matching hard-fail profile (repo-level C-06, not scheduler).
  • Overlapping runs were skipped by the lock, confirming idempotence. Evidence is sufficient for cutover.
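
The per-runner comparison can be illustrated with a small tally over progress events. This is not compare_consistency_sweep_parallel.py. The `detail.source` field is named in this workplan, while the `status` field name and the event shape are assumptions. The sample data reproduces the counts reported above.

```python
from collections import Counter


def tally_by_source(events):
    """Count event statuses per runner, keyed by detail.source."""
    tally = {}
    for event in events:
        source = event["detail"]["source"]
        tally.setdefault(source, Counter())[event["status"]] += 1
    return tally


def completion_rate(counts):
    """Fraction of a runner's events that completed (0.0 if it has none)."""
    total = sum(counts.values())
    return counts["completed"] / total if total else 0.0


def make_events(source, completed, lock_skipped):
    """Build sample events; the shape is hypothetical."""
    statuses = ["completed"] * completed + ["lock_skipped"] * lock_skipped
    return [{"detail": {"source": source}, "status": s} for s in statuses]


# Counts taken from the 72h comparison reported in T3.
EVENTS = make_events("activity-core", 3, 2) + make_events("local-timer", 5, 1)

if __name__ == "__main__":
    for source, counts in sorted(tally_by_source(EVENTS).items()):
        print(source, sum(counts.values()), dict(counts), round(completion_rate(counts), 2))
```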

T4 — Retire local timer

id: STATE-WP-0064-T04
status: done
priority: medium
state_hub_task_id: "c8275471-5ec0-4dfb-8fec-2b3ec3894036"

After parallel validation passes:

systemctl --user disable --now custodian-sync.timer

Done 2026-06-21:

  • Local timer disabled (inactive, disabled).
  • Unit files archived to infra/systemd/archived/.
  • cron-migration §5 step 4 marked complete.
  • docs/activity-core-delegation.md cross-reference added.

T5 — Docs and operator handoff

id: STATE-WP-0064-T05
status: done
priority: low
state_hub_task_id: "270ed7dd-aa79-469d-a817-e3fa1e71be41"

  • infra/README.md: primary schedule is activity-core on Railiance01; local timer retired.
  • docs/cron-migration.md: §2A promoted to implemented; cutover complete.
  • docs/consistency-sweep-runbook.md: steady-state ops (no parallel week).
  • AGENTS.md: State Hub consistency sync terminology and runbook link.

Done 2026-06-21. Cluster schedule is the sole primary runner.