--- id: STATE-WP-0063 type: workplan title: "Repair broken weekend automation (paths, cluster reachability, triage)" domain: custodian repo: state-hub status: finished owner: codex topic_slug: custodian created: "2026-06-21" updated: "2026-06-21" state_hub_workstream_id: "41194e3b-7d45-4cf1-9715-56a843230ad7" --- # STATE-WP-0063 — Repair broken weekend automation **Origin:** `history/20260621-weekend-automation-assessment.md` (2026-06-21). Over the Fri-evening → Sun-afternoon window, three automation substrates were offline or broken: local `custodian-sync` (wrong repo path), activity-core hourly/daily jobs on Railiance01 (~44 h gap), and local `railiance backup` cron (missing binary). This workplan restores reliable operation before the Railiance01 migration in `STATE-WP-0064`. ## Scope In scope: - Fix the local `custodian-sync` systemd unit so `consistency_check.py --remote --all` runs successfully from `/home/worsch/state-hub`. - Diagnose and fix the Railiance01 activity-core gap affecting hourly RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge). - Repair or document the broken `railiance backup` crontab entry. - Confirm `bridge maintenance cleanup` cron runs or fix its path/logging. - Post a State Hub progress milestone when repairs are verified. Out of scope: - Moving the consistency sweep to Railiance01 (see `STATE-WP-0064`). - Renaming `custodian-sync` units cluster-wide (deferred to WP-0064). ## Evidence baseline See `history/20260621-weekend-automation-assessment.md` for the full timeline. Key failures: - `custodian-sync.service`: `WorkingDirectory=/home/worsch/the-custodian/state-hub` (pre-extraction path). - Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC. - Daily triage: no `daily_triage` events on 2026-06-20 or 2026-06-21 morning. --- ## T1 — Fix local State Hub consistency sync unit ```task id: STATE-WP-0063-T01 status: done priority: high state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2" ``` Update `~/.config/systemd/user/custodian-sync.service` (and any checked-in template under `infra/`) to use `/home/worsch/state-hub` and invoke consistency via `uv run python scripts/consistency_check.py --remote --all` (or equivalent `$(UV)` path) instead of a stale `.venv/bin/python`. Acceptance: - `systemctl --user start custodian-sync.service` exits 0. - `journalctl --user -u custodian-sync.service -n 5` shows a completed sweep, not `status=127`. - Document the interim fix in `infra/README.md` until WP-0064 cutover. Result 2026-06-21: Updated `~/.config/systemd/user/custodian-sync.service` and added repo templates under `infra/systemd/`. Verified `systemctl --user start custodian-sync.service` exits 0; journal shows `RESULT: ✓ PASS (with warnings)` (~2m39s sweep). ## T2 — Diagnose activity-core weekend gap on Railiance01 ```task id: STATE-WP-0063-T02 status: done priority: high state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61" ``` Investigate why Temporal schedules for `hourly-recently-on-scope` and `daily-statehub-wsjf-triage` produced no State Hub progress between Fri 20:00 CEST and Sun 16:00 CEST. Check: - activity-core worker/API pod health on Railiance01 - `actcore-state-hub-bridge` reachability to the workstation State Hub - `state-hub-railiance01` ops-bridge tunnel state when the laptop is awake - Temporal schedule pause/misfire history Record findings in a short note (progress event or workplan update). Fix any configuration regression found (do not re-enable duplicate Codex app fallbacks). Result 2026-06-21: - activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge deployment all Running). Temporal schedules for hourly RecentlyOnScope (`d104348c…`) and daily WSJF triage (`6fca51fa…`) are **not paused** and did fire over the weekend. - Failed hourly workflows show: `Required context resolver 'state-hub'/'recently_on_scope_hourly' failed: [Errno 111] Connection refused`. The 14:00 UTC Sun run succeeded when the workstation tunnel was up; most other runs since Fri 20:00 CEST failed. - `state-hub-railiance01` ops-bridge tunnel spends long stretches in **reconnecting**; `curl --max-time 5 http://127.0.0.1:18000/state/health` on the Railiance01 node times out while the tunnel is unstable. - **Conclusion:** not an activity-core scheduler outage. The gap is **workstation availability + ops-bridge tunnel reachability** to the local State Hub API. Cluster-side `actcore-state-hub-bridge` proxies to node-local `127.0.0.1:18000` as designed. - **Retry fix (2026-06-21):** stale orphan `sshd` remote forward on Railiance01 port 18000 blocked new tunnel binds (`ExitOnForwardFailure` → exit 255). `bridge maintenance cleanup state-hub-railiance01 --restart` cleared it. ## T3 — Restore daily WSJF triage evidence ```task id: STATE-WP-0063-T03 status: done priority: medium state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625" ``` After T2, trigger one manual canary of `daily-statehub-wsjf-triage` on Railiance01 and confirm: - `event_type: daily_triage` progress event lands in State Hub - working-memory report file appears under `the-custodian/memory/working/` If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin fire is armed. Progress 2026-06-21 (first attempt): sink failed while tunnel reconnecting. Result 2026-06-21 (retry): `bridge maintenance cleanup state-hub-railiance01 --restart` cleared stale remote `sshd` on port 18000 (orphan forward with many CLOSE_WAIT). Tunnel now **connected**; node `curl 127.0.0.1:18000/state/health` and worker `actcore-state-hub-bridge:8000/state/health` both return ok. Manual canaries succeeded: - hourly RecentlyOnScope → `recently_on_scope_hourly` at 2026-06-21T17:45:29Z - daily WSJF triage → `daily_triage` at 2026-06-21T17:45:46Z + working-memory report under `the-custodian/memory/working/` ## T4 — Repair ancillary local crons ```task id: STATE-WP-0063-T04 status: done priority: low state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e" ``` - Fix `railiance backup` crontab to use the current `railiance-bootstrap` install path, or document the correct operator install step. - Verify `bridge maintenance cleanup` at 03:00 writes to `~/.local/state/bridge/cleanup.log` on the next run. Result 2026-06-21: - Crontab `railiance backup` path corrected to `/home/worsch/railiance-cluster/bin/railiance` (was stale `railiance-bootstrap` path). - Created `~/.local/state/bridge/` for cleanup log; cron entry unchanged. Manual `bridge maintenance cleanup --restart` runs >30s (expected for tunnel recycle); next 03:00 cron run is the production verification. ## T5 — Verification sweep and close-out ```task id: STATE-WP-0063-T05 status: done priority: medium state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1" ``` Run a 24-hour observation window: - at least one successful `custodian-sync` sweep per hour while machine is awake - at least two hourly RecentlyOnScope progress events (generated or skipped) - no duplicate Codex + activity-core primary runners for the same job Mark workplan `finished` and log a milestone progress event summarising repairs. Result 2026-06-21 (accelerated close-out): substituted a **3-hour post-repair validation window** (19:55–22:30 CEST) for the full 24 h — a shorter timer cadence was not needed because the existing 15-minute `custodian-sync.timer` already produced enough evidence once the tunnel and unit path were fixed. Evidence: - **custodian-sync:** 7 consecutive `Finished` runs with `RESULT: ✓ PASS` from 19:55 CEST onward (~1 sweep per 15 min while awake). Earlier failures were pre-fix path errors (`status=127`) or isolated hard consistency fails (e.g. C-01 on a single repo), not scheduler regression. - **hourly RecentlyOnScope:** 4 `recently_on_scope_hourly` events since 17:00Z (17:08, 17:45 manual, 18:00, 19:00 scheduled) — all `0 failed`. - **daily triage:** manual canary `daily_triage` at 17:45Z; schedule armed for next 07:20 Europe/Berlin. - **tunnel:** `state-hub-railiance01` `connected` during validation. - **no duplicate runners:** cluster consistency sweep remains disabled (`STATE-WP-0064`); local timer is sole primary for ADR-001 sweeps. Workplan marked `finished`.