state-hub/workplans/STATE-WP-0063-weekend-automation-repair.md
tegwick 3d5e354ff8 docs(state-hub): weekend automation assessment and repair workplans
Persist the Fri-evening→Sun-afternoon automation gap assessment in
history/, and add STATE-WP-0063 (repair broken paths and cluster
reachability) plus STATE-WP-0064 (move State Hub consistency sync to
Railiance01 via activity-core). Workplans registered in State Hub via
fix-consistency.
2026-06-21 17:32:44 +02:00


id: STATE-WP-0063
type: workplan
title: Repair broken weekend automation (paths, cluster reachability, triage)
domain: custodian
repo: state-hub
status: ready
owner: codex
topic_slug: custodian
created: 2026-06-21
updated: 2026-06-21
state_hub_workstream_id: 41194e3b-7d45-4cf1-9715-56a843230ad7

STATE-WP-0063 — Repair broken weekend automation

Origin: history/20260621-weekend-automation-assessment.md (2026-06-21).

Over the Fri-evening → Sun-afternoon window, three automation substrates were offline or broken: local custodian-sync (wrong repo path), activity-core hourly/daily jobs on Railiance01 (~44 h gap), and local railiance backup cron (missing binary). This workplan restores reliable operation before the Railiance01 migration in STATE-WP-0064.

Scope

In scope:

  • Fix the local custodian-sync systemd unit so consistency_check.py --remote --all runs successfully from /home/worsch/state-hub.
  • Diagnose and fix the Railiance01 activity-core gap affecting hourly RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
  • Repair or document the broken railiance backup crontab entry.
  • Confirm bridge maintenance cleanup cron runs or fix its path/logging.
  • Post a State Hub progress milestone when repairs are verified.

Out of scope:

  • Moving the consistency sweep to Railiance01 (see STATE-WP-0064).
  • Renaming custodian-sync units cluster-wide (deferred to STATE-WP-0064).

Evidence baseline

See history/20260621-weekend-automation-assessment.md for the full timeline. Key failures:

  • custodian-sync.service: WorkingDirectory=/home/worsch/the-custodian/state-hub (pre-extraction path).
  • Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
  • Daily triage: no daily_triage events on 2026-06-20 or on the morning of 2026-06-21.

T1 — Fix local State Hub consistency sync unit

id: STATE-WP-0063-T01
status: todo
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"

Update ~/.config/systemd/user/custodian-sync.service (and any checked-in template under infra/) to use /home/worsch/state-hub as its working directory and to invoke the consistency check via uv run python scripts/consistency_check.py --remote --all (or an equivalent $(UV) path) instead of a stale .venv/bin/python.

Acceptance:

  • systemctl --user start custodian-sync.service exits 0.
  • journalctl --user -u custodian-sync.service -n 5 shows a completed sweep, not status=127.
  • The interim fix is documented in infra/README.md until the STATE-WP-0064 cutover.
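
The unit requirements in T1 can be checked mechanically before starting the service. The following is a minimal sketch, not the repository's tooling: check_unit and both sample units are hypothetical, and the uv location (/home/worsch/.local/bin/uv) is an assumption that must be replaced with the real install path. It only inspects the unit text for the expected working directory, the consistency command, and the two stale references named in this workplan.

```python
import configparser

EXPECTED_WORKDIR = "/home/worsch/state-hub"  # path named in this workplan
# Stale references described in the evidence baseline and in T1.
STALE_MARKERS = ("/the-custodian/state-hub", ".venv/bin/python")


def check_unit(unit_text: str) -> list[str]:
    """Return a list of problems found in a custodian-sync unit file."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # systemd keys are case-sensitive
    parser.read_string(unit_text)

    service = parser["Service"] if parser.has_section("Service") else {}
    workdir = service.get("WorkingDirectory", "")
    exec_start = service.get("ExecStart", "")

    problems = []
    if workdir != EXPECTED_WORKDIR:
        problems.append(f"WorkingDirectory is {workdir!r}, expected {EXPECTED_WORKDIR!r}")
    if "consistency_check.py --remote --all" not in exec_start:
        problems.append("ExecStart does not run consistency_check.py --remote --all")
    for marker in STALE_MARKERS:
        if marker in workdir or marker in exec_start:
            problems.append(f"stale reference: {marker}")
    return problems


# Hypothetical unit contents, for illustration only.
OLD_UNIT = """\
[Service]
Type=oneshot
WorkingDirectory=/home/worsch/the-custodian/state-hub
ExecStart=/home/worsch/the-custodian/state-hub/.venv/bin/python scripts/consistency_check.py --remote --all
"""

FIXED_UNIT = """\
[Service]
Type=oneshot
WorkingDirectory=/home/worsch/state-hub
ExecStart=/home/worsch/.local/bin/uv run python scripts/consistency_check.py --remote --all
"""

for label, text in (("old", OLD_UNIT), ("fixed", FIXED_UNIT)):
    print(label, check_unit(text))
```

A static check like this complements the acceptance criteria but does not replace them: the service still has to be started and its journal inspected.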

T2 — Diagnose activity-core weekend gap on Railiance01

id: STATE-WP-0063-T02
status: todo
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"

Investigate why Temporal schedules for hourly-recently-on-scope and daily-statehub-wsjf-triage produced no State Hub progress between Fri 20:00 CEST and Sun 16:00 CEST. Check:

  • activity-core worker/API pod health on Railiance01
  • actcore-state-hub-bridge reachability to the workstation State Hub
  • state-hub-railiance01 ops-bridge tunnel state when the laptop is awake
  • Temporal schedule pause/misfire history

Record findings in a short note (progress event or workplan update). Fix any configuration regression found (do not re-enable duplicate Codex app fallbacks).
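
When reviewing schedule history, the gap itself can be located from a list of run timestamps. This is a small sketch under stated assumptions: find_gaps is a hypothetical helper, the timestamps are typed in by hand rather than fetched from Temporal or State Hub, and the Friday date (2026-06-19) is inferred from the 2026-06-21 assessment date.

```python
from datetime import datetime, timedelta, timezone


def find_gaps(run_times, expected_interval, tolerance=timedelta(minutes=10)):
    """Return (previous, next, gap) for consecutive runs spaced wider than expected."""
    ordered = sorted(run_times)
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        gap = current - previous
        if gap > expected_interval + tolerance:
            gaps.append((previous, current, gap))
    return gaps


# Last run Fri 18:00 UTC and next run Sun 14:00 UTC come from the evidence baseline.
# The 17:00 entry is a hypothetical earlier run, included to show a normal interval.
runs = [
    datetime(2026, 6, 19, 17, 0, tzinfo=timezone.utc),
    datetime(2026, 6, 19, 18, 0, tzinfo=timezone.utc),
    datetime(2026, 6, 21, 14, 0, tzinfo=timezone.utc),
]

gaps = find_gaps(runs, expected_interval=timedelta(hours=1))
for previous, current, gap in gaps:
    hours = gap.total_seconds() / 3600
    print(f"{previous.isoformat()} -> {current.isoformat()}: {hours:.0f} h without a run")
```

The same helper would work for the daily triage schedule with expected_interval=timedelta(days=1).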

T3 — Restore daily WSJF triage evidence

id: STATE-WP-0063-T03
status: todo
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"

After T2, trigger one manual canary of daily-statehub-wsjf-triage on Railiance01 and confirm:

  • event_type: daily_triage progress event lands in State Hub
  • working-memory report file appears under the-custodian/memory/working/

If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin fire is armed.

T4 — Repair ancillary local crons

id: STATE-WP-0063-T04
status: todo
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"

  • Fix railiance backup crontab to use the current railiance-bootstrap install path, or document the correct operator install step.
  • Verify bridge maintenance cleanup at 03:00 writes to ~/.local/state/bridge/cleanup.log on the next run.
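
The cleanup log check can be done by comparing the log's modification time with the most recent 03:00 run. The sketch below assumes the cron fires at 03:00 local time and that a non-empty, recently modified file is sufficient evidence; both helper functions are hypothetical and do not inspect the log contents.

```python
from datetime import datetime, timedelta
from pathlib import Path


def last_scheduled_run(now: datetime, hour: int = 3) -> datetime:
    """Most recent occurrence of hour:00 (local time) at or before `now`."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    return candidate if candidate <= now else candidate - timedelta(days=1)


def log_written_since(log_path: Path, since: datetime) -> bool:
    """True if the log exists, is non-empty, and was modified at or after `since`."""
    if not log_path.is_file():
        return False
    stat = log_path.stat()
    modified = datetime.fromtimestamp(stat.st_mtime)  # naive local time
    return stat.st_size > 0 and modified >= since


# Path taken from T4; run this some time after 03:00 on the day being verified.
cleanup_log = Path.home() / ".local/state/bridge/cleanup.log"
print(log_written_since(cleanup_log, last_scheduled_run(datetime.now())))
```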

T5 — Verification sweep and close-out

id: STATE-WP-0063-T05
status: todo
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"

Run a 24-hour observation window:

  • at least one successful custodian-sync sweep per hour while the machine is awake
  • at least two hourly RecentlyOnScope progress events (generated or skipped)
  • no duplicate Codex + activity-core primary runners for the same job

Mark the workplan finished and log a milestone progress event summarising the repairs.
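
The three observation criteria in T5 could be evaluated with a short script once the window has elapsed. This is only a sketch: the event shape (dicts with "job" and "runner" keys), the function names, and the sample data are all hypothetical and are not the State Hub schema.

```python
from collections import defaultdict
from datetime import datetime


def hour_bucket(moment: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return moment.replace(minute=0, second=0, microsecond=0)


def hours_missing_sweep(sweep_times, awake_hours):
    """Awake hours that contain no successful custodian-sync sweep."""
    covered = {hour_bucket(t) for t in sweep_times}
    return sorted({hour_bucket(h) for h in awake_hours} - covered)


def jobs_with_duplicate_runners(events):
    """Jobs that were run by more than one distinct runner."""
    runners = defaultdict(set)
    for event in events:
        runners[event["job"]].add(event["runner"])
    return {job: sorted(names) for job, names in runners.items() if len(names) > 1}


def window_passes(sweep_times, awake_hours, scope_event_count, events):
    """Apply the three T5 criteria to one observation window."""
    return (
        not hours_missing_sweep(sweep_times, awake_hours)
        and scope_event_count >= 2
        and not jobs_with_duplicate_runners(events)
    )


# Hypothetical sample data covering two awake hours.
awake_hours = [datetime(2026, 6, 22, 9), datetime(2026, 6, 22, 10)]
sweeps = [datetime(2026, 6, 22, 9, 5), datetime(2026, 6, 22, 10, 5)]
events = [
    {"job": "hourly-recently-on-scope", "runner": "activity-core"},
    {"job": "daily-statehub-wsjf-triage", "runner": "activity-core"},
]
duplicated = events + [{"job": "hourly-recently-on-scope", "runner": "codex"}]

print(window_passes(sweeps, awake_hours, 2, events))
print(jobs_with_duplicate_runners(duplicated))
```

Keeping the awake hours as an explicit input reflects the "while the machine is awake" qualifier: hours when the workstation was asleep are simply left out of the list.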