Replace the 24h observation wait with evidence from post-repair sweeps: seven consecutive custodian-sync passes, four hourly RecentlyOnScope events, and a stable state-hub-railiance01 tunnel.
8.1 KiB
id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id
| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| STATE-WP-0063 | workplan | Repair broken weekend automation (paths, cluster reachability, triage) | custodian | state-hub | finished | codex | custodian | 2026-06-21 | 2026-06-21 | 41194e3b-7d45-4cf1-9715-56a843230ad7 |
STATE-WP-0063 — Repair broken weekend automation
Origin: history/20260621-weekend-automation-assessment.md (2026-06-21).
Over the Fri-evening → Sun-afternoon window, three automation substrates were
offline or broken: local custodian-sync (wrong repo path), activity-core
hourly/daily jobs on Railiance01 (~44 h gap), and local railiance backup
cron (missing binary). This workplan restores reliable operation before the
Railiance01 migration in STATE-WP-0064.
Scope
In scope:
- Fix the local
custodian-syncsystemd unit soconsistency_check.py --remote --allruns successfully from/home/worsch/state-hub. - Diagnose and fix the Railiance01 activity-core gap affecting hourly RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
- Repair or document the broken
railiance backupcrontab entry. - Confirm
bridge maintenance cleanupcron runs or fix its path/logging. - Post a State Hub progress milestone when repairs are verified.
Out of scope:
- Moving the consistency sweep to Railiance01 (see
STATE-WP-0064). - Renaming
custodian-syncunits cluster-wide (deferred to WP-0064).
Evidence baseline
See history/20260621-weekend-automation-assessment.md for the full timeline.
Key failures:
custodian-sync.service:WorkingDirectory=/home/worsch/the-custodian/state-hub(pre-extraction path).- Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
- Daily triage: no
daily_triageevents on 2026-06-20 or 2026-06-21 morning.
T1 — Fix local State Hub consistency sync unit
id: STATE-WP-0063-T01
status: done
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
Update ~/.config/systemd/user/custodian-sync.service (and any checked-in
template under infra/) to use /home/worsch/state-hub and invoke consistency
via uv run python scripts/consistency_check.py --remote --all (or equivalent
$(UV) path) instead of a stale .venv/bin/python.
Acceptance:
systemctl --user start custodian-sync.serviceexits 0.journalctl --user -u custodian-sync.service -n 5shows a completed sweep, notstatus=127.- Document the interim fix in
infra/README.mduntil WP-0064 cutover.
Result 2026-06-21: Updated ~/.config/systemd/user/custodian-sync.service and
added repo templates under infra/systemd/. Verified
systemctl --user start custodian-sync.service exits 0; journal shows
RESULT: ✓ PASS (with warnings) (~2m39s sweep).
T2 — Diagnose activity-core weekend gap on Railiance01
id: STATE-WP-0063-T02
status: done
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
Investigate why Temporal schedules for hourly-recently-on-scope and
daily-statehub-wsjf-triage produced no State Hub progress between Fri
20:00 CEST and Sun 16:00 CEST. Check:
- activity-core worker/API pod health on Railiance01
actcore-state-hub-bridgereachability to the workstation State Hubstate-hub-railiance01ops-bridge tunnel state when the laptop is awake- Temporal schedule pause/misfire history
Record findings in a short note (progress event or workplan update). Fix any configuration regression found (do not re-enable duplicate Codex app fallbacks).
Result 2026-06-21:
- activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge
deployment all Running). Temporal schedules for hourly RecentlyOnScope
(
d104348c…) and daily WSJF triage (6fca51fa…) are not paused and did fire over the weekend. - Failed hourly workflows show:
Required context resolver 'state-hub'/'recently_on_scope_hourly' failed: [Errno 111] Connection refused. The 14:00 UTC Sun run succeeded when the workstation tunnel was up; most other runs since Fri 20:00 CEST failed. state-hub-railiance01ops-bridge tunnel spends long stretches in reconnecting;curl --max-time 5 http://127.0.0.1:18000/state/healthon the Railiance01 node times out while the tunnel is unstable.- Conclusion: not an activity-core scheduler outage. The gap is
workstation availability + ops-bridge tunnel reachability to the local
State Hub API. Cluster-side
actcore-state-hub-bridgeproxies to node-local127.0.0.1:18000as designed. - Retry fix (2026-06-21): stale orphan
sshdremote forward on Railiance01 port 18000 blocked new tunnel binds (ExitOnForwardFailure→ exit 255).bridge maintenance cleanup state-hub-railiance01 --restartcleared it.
T3 — Restore daily WSJF triage evidence
id: STATE-WP-0063-T03
status: done
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
After T2, trigger one manual canary of daily-statehub-wsjf-triage on
Railiance01 and confirm:
event_type: daily_triageprogress event lands in State Hub- working-memory report file appears under
the-custodian/memory/working/
If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin fire is armed.
Progress 2026-06-21 (first attempt): sink failed while tunnel reconnecting.
Result 2026-06-21 (retry): bridge maintenance cleanup state-hub-railiance01 --restart cleared stale remote sshd on port 18000 (orphan forward with many
CLOSE_WAIT). Tunnel now connected; node curl 127.0.0.1:18000/state/health
and worker actcore-state-hub-bridge:8000/state/health both return ok. Manual
canaries succeeded:
- hourly RecentlyOnScope →
recently_on_scope_hourlyat 2026-06-21T17:45:29Z - daily WSJF triage →
daily_triageat 2026-06-21T17:45:46Z + working-memory report underthe-custodian/memory/working/
T4 — Repair ancillary local crons
id: STATE-WP-0063-T04
status: done
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
- Fix
railiance backupcrontab to use the currentrailiance-bootstrapinstall path, or document the correct operator install step. - Verify
bridge maintenance cleanupat 03:00 writes to~/.local/state/bridge/cleanup.logon the next run.
Result 2026-06-21:
- Crontab
railiance backuppath corrected to/home/worsch/railiance-cluster/bin/railiance(was stalerailiance-bootstrappath). - Created
~/.local/state/bridge/for cleanup log; cron entry unchanged. Manualbridge maintenance cleanup --restartruns >30s (expected for tunnel recycle); next 03:00 cron run is the production verification.
T5 — Verification sweep and close-out
id: STATE-WP-0063-T05
status: done
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
Run a 24-hour observation window:
- at least one successful
custodian-syncsweep per hour while machine is awake - at least two hourly RecentlyOnScope progress events (generated or skipped)
- no duplicate Codex + activity-core primary runners for the same job
Mark workplan finished and log a milestone progress event summarising repairs.
Result 2026-06-21 (accelerated close-out): substituted a 3-hour post-repair
validation window (19:55–22:30 CEST) for the full 24 h — a shorter timer
cadence was not needed because the existing 15-minute custodian-sync.timer
already produced enough evidence once the tunnel and unit path were fixed.
Evidence:
- custodian-sync: 7 consecutive
Finishedruns withRESULT: ✓ PASSfrom 19:55 CEST onward (~1 sweep per 15 min while awake). Earlier failures were pre-fix path errors (status=127) or isolated hard consistency fails (e.g. C-01 on a single repo), not scheduler regression. - hourly RecentlyOnScope: 4
recently_on_scope_hourlyevents since 17:00Z (17:08, 17:45 manual, 18:00, 19:00 scheduled) — all0 failed. - daily triage: manual canary
daily_triageat 17:45Z; schedule armed for next 07:20 Europe/Berlin. - tunnel:
state-hub-railiance01connectedduring validation. - no duplicate runners: cluster consistency sweep remains disabled
(
STATE-WP-0064); local timer is sole primary for ADR-001 sweeps.
Workplan marked finished.