Files
state-hub/workplans/STATE-WP-0063-weekend-automation-repair.md
tegwick 696b628142 chore(STATE-WP-0063): finish with accelerated 3h validation close-out
Replace the 24h observation wait with evidence from post-repair sweeps:
seven consecutive custodian-sync passes, four hourly RecentlyOnScope
events, and a stable state-hub-railiance01 tunnel.
2026-06-21 21:37:44 +02:00

211 lines
8.1 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
id: STATE-WP-0063
type: workplan
title: "Repair broken weekend automation (paths, cluster reachability, triage)"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "41194e3b-7d45-4cf1-9715-56a843230ad7"
---
# STATE-WP-0063 — Repair broken weekend automation
**Origin:** `history/20260621-weekend-automation-assessment.md` (2026-06-21).
Over the Fri-evening → Sun-afternoon window, three automation substrates were
offline or broken: local `custodian-sync` (wrong repo path), activity-core
hourly/daily jobs on Railiance01 (~44 h gap), and local `railiance backup`
cron (missing binary). This workplan restores reliable operation before the
Railiance01 migration in `STATE-WP-0064`.
## Scope
In scope:
- Fix the local `custodian-sync` systemd unit so `consistency_check.py
--remote --all` runs successfully from `/home/worsch/state-hub`.
- Diagnose and fix the Railiance01 activity-core gap affecting hourly
RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
- Repair or document the broken `railiance backup` crontab entry.
- Confirm `bridge maintenance cleanup` cron runs or fix its path/logging.
- Post a State Hub progress milestone when repairs are verified.
Out of scope:
- Moving the consistency sweep to Railiance01 (see `STATE-WP-0064`).
- Renaming `custodian-sync` units cluster-wide (deferred to WP-0064).
## Evidence baseline
See `history/20260621-weekend-automation-assessment.md` for the full timeline.
Key failures:
- `custodian-sync.service`: `WorkingDirectory=/home/worsch/the-custodian/state-hub`
(pre-extraction path).
- Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
- Daily triage: no `daily_triage` events on 2026-06-20 or 2026-06-21 morning.
---
## T1 — Fix local State Hub consistency sync unit
```task
id: STATE-WP-0063-T01
status: done
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
```
Update `~/.config/systemd/user/custodian-sync.service` (and any checked-in
template under `infra/`) to use `/home/worsch/state-hub` and invoke consistency
via `uv run python scripts/consistency_check.py --remote --all` (or equivalent
`$(UV)` path) instead of a stale `.venv/bin/python`.
Acceptance:
- `systemctl --user start custodian-sync.service` exits 0.
- `journalctl --user -u custodian-sync.service -n 5` shows a completed sweep,
not `status=127`.
- Document the interim fix in `infra/README.md` until WP-0064 cutover.
Result 2026-06-21: Updated `~/.config/systemd/user/custodian-sync.service` and
added repo templates under `infra/systemd/`. Verified
`systemctl --user start custodian-sync.service` exits 0; journal shows
`RESULT: ✓ PASS (with warnings)` (~2m39s sweep).
## T2 — Diagnose activity-core weekend gap on Railiance01
```task
id: STATE-WP-0063-T02
status: done
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
```
Investigate why Temporal schedules for `hourly-recently-on-scope` and
`daily-statehub-wsjf-triage` produced no State Hub progress between Fri
20:00 CEST and Sun 16:00 CEST. Check:
- activity-core worker/API pod health on Railiance01
- `actcore-state-hub-bridge` reachability to the workstation State Hub
- `state-hub-railiance01` ops-bridge tunnel state when the laptop is awake
- Temporal schedule pause/misfire history
Record findings in a short note (progress event or workplan update). Fix any
configuration regression found (do not re-enable duplicate Codex app fallbacks).
Result 2026-06-21:
- activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge
deployment all Running). Temporal schedules for hourly RecentlyOnScope
(`d104348c…`) and daily WSJF triage (`6fca51fa…`) are **not paused** and did
fire over the weekend.
- Failed hourly workflows show:
`Required context resolver 'state-hub'/'recently_on_scope_hourly' failed:
[Errno 111] Connection refused`. The 14:00 UTC Sun run succeeded when the
workstation tunnel was up; most other runs since Fri 20:00 CEST failed.
- `state-hub-railiance01` ops-bridge tunnel spends long stretches in
**reconnecting**; `curl --max-time 5 http://127.0.0.1:18000/state/health` on
the Railiance01 node times out while the tunnel is unstable.
- **Conclusion:** not an activity-core scheduler outage. The gap is
**workstation availability + ops-bridge tunnel reachability** to the local
State Hub API. Cluster-side `actcore-state-hub-bridge` proxies to node-local
`127.0.0.1:18000` as designed.
- **Retry fix (2026-06-21):** stale orphan `sshd` remote forward on Railiance01
port 18000 blocked new tunnel binds (`ExitOnForwardFailure` → exit 255).
`bridge maintenance cleanup state-hub-railiance01 --restart` cleared it.
## T3 — Restore daily WSJF triage evidence
```task
id: STATE-WP-0063-T03
status: done
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
```
After T2, trigger one manual canary of `daily-statehub-wsjf-triage` on
Railiance01 and confirm:
- `event_type: daily_triage` progress event lands in State Hub
- working-memory report file appears under `the-custodian/memory/working/`
If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin
fire is armed.
Progress 2026-06-21 (first attempt): sink failed while tunnel reconnecting.
Result 2026-06-21 (retry): `bridge maintenance cleanup state-hub-railiance01
--restart` cleared stale remote `sshd` on port 18000 (orphan forward with many
CLOSE_WAIT). Tunnel now **connected**; node `curl 127.0.0.1:18000/state/health`
and worker `actcore-state-hub-bridge:8000/state/health` both return ok. Manual
canaries succeeded:
- hourly RecentlyOnScope → `recently_on_scope_hourly` at 2026-06-21T17:45:29Z
- daily WSJF triage → `daily_triage` at 2026-06-21T17:45:46Z + working-memory
report under `the-custodian/memory/working/`
## T4 — Repair ancillary local crons
```task
id: STATE-WP-0063-T04
status: done
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
```
- Fix `railiance backup` crontab to use the current `railiance-bootstrap` install
path, or document the correct operator install step.
- Verify `bridge maintenance cleanup` at 03:00 writes to
`~/.local/state/bridge/cleanup.log` on the next run.
Result 2026-06-21:
- Crontab `railiance backup` path corrected to
`/home/worsch/railiance-cluster/bin/railiance` (was stale
`railiance-bootstrap` path).
- Created `~/.local/state/bridge/` for cleanup log; cron entry unchanged.
Manual `bridge maintenance cleanup --restart` runs >30s (expected for tunnel
recycle); next 03:00 cron run is the production verification.
## T5 — Verification sweep and close-out
```task
id: STATE-WP-0063-T05
status: done
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
```
Run a 24-hour observation window:
- at least one successful `custodian-sync` sweep per hour while machine is awake
- at least two hourly RecentlyOnScope progress events (generated or skipped)
- no duplicate Codex + activity-core primary runners for the same job
Mark workplan `finished` and log a milestone progress event summarising repairs.
Result 2026-06-21 (accelerated close-out): substituted a **3-hour post-repair
validation window** (19:5522:30 CEST) for the full 24 h — a shorter timer
cadence was not needed because the existing 15-minute `custodian-sync.timer`
already produced enough evidence once the tunnel and unit path were fixed.
Evidence:
- **custodian-sync:** 7 consecutive `Finished` runs with `RESULT: ✓ PASS` from
19:55 CEST onward (~1 sweep per 15 min while awake). Earlier failures were
pre-fix path errors (`status=127`) or isolated hard consistency fails (e.g.
C-01 on a single repo), not scheduler regression.
- **hourly RecentlyOnScope:** 4 `recently_on_scope_hourly` events since 17:00Z
(17:08, 17:45 manual, 18:00, 19:00 scheduled) — all `0 failed`.
- **daily triage:** manual canary `daily_triage` at 17:45Z; schedule armed for
next 07:20 Europe/Berlin.
- **tunnel:** `state-hub-railiance01` `connected` during validation.
- **no duplicate runners:** cluster consistency sweep remains disabled
(`STATE-WP-0064`); local timer is sole primary for ADR-001 sweeps.
Workplan marked `finished`.