Files
state-hub/workplans/STATE-WP-0063-weekend-automation-repair.md
tegwick 3d5e354ff8 docs(state-hub): weekend automation assessment and repair workplans
Persist the Fri-evening→Sun-afternoon automation gap assessment in
history/, and add STATE-WP-0063 (repair broken paths and cluster
reachability) plus STATE-WP-0064 (move State Hub consistency sync to
Railiance01 via activity-core). Workplans registered in State Hub via
fix-consistency.
2026-06-21 17:32:44 +02:00

143 lines
4.6 KiB
Markdown

---
id: STATE-WP-0063
type: workplan
title: "Repair broken weekend automation (paths, cluster reachability, triage)"
domain: custodian
repo: state-hub
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "41194e3b-7d45-4cf1-9715-56a843230ad7"
---
# STATE-WP-0063 — Repair broken weekend automation
**Origin:** `history/20260621-weekend-automation-assessment.md` (2026-06-21).
Over the Fri-evening → Sun-afternoon window, three automation substrates were
offline or broken: local `custodian-sync` (wrong repo path), activity-core
hourly/daily jobs on Railiance01 (~44 h gap), and local `railiance backup`
cron (missing binary). This workplan restores reliable operation before the
Railiance01 migration in `STATE-WP-0064`.
## Scope
In scope:
- Fix the local `custodian-sync` systemd unit so `consistency_check.py
--remote --all` runs successfully from `/home/worsch/state-hub`.
- Diagnose and fix the Railiance01 activity-core gap affecting hourly
RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
- Repair or document the broken `railiance backup` crontab entry.
- Confirm `bridge maintenance cleanup` cron runs or fix its path/logging.
- Post a State Hub progress milestone when repairs are verified.
Out of scope:
- Moving the consistency sweep to Railiance01 (see `STATE-WP-0064`).
- Renaming `custodian-sync` units cluster-wide (deferred to WP-0064).
## Evidence baseline
See `history/20260621-weekend-automation-assessment.md` for the full timeline.
Key failures:
- `custodian-sync.service`: `WorkingDirectory=/home/worsch/the-custodian/state-hub`
(pre-extraction path).
- Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
- Daily triage: no `daily_triage` events on 2026-06-20 or 2026-06-21 morning.
---
## T1 — Fix local State Hub consistency sync unit
```task
id: STATE-WP-0063-T01
status: todo
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
```
Update `~/.config/systemd/user/custodian-sync.service` (and any checked-in
template under `infra/`) to use `/home/worsch/state-hub` and invoke consistency
via `uv run python scripts/consistency_check.py --remote --all` (or equivalent
`$(UV)` path) instead of a stale `.venv/bin/python`.
Acceptance:
- `systemctl --user start custodian-sync.service` exits 0.
- `journalctl --user -u custodian-sync.service -n 5` shows a completed sweep,
not `status=127`.
- Document the interim fix in `infra/README.md` until WP-0064 cutover.
## T2 — Diagnose activity-core weekend gap on Railiance01
```task
id: STATE-WP-0063-T02
status: todo
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
```
Investigate why Temporal schedules for `hourly-recently-on-scope` and
`daily-statehub-wsjf-triage` produced no State Hub progress between Fri
20:00 CEST and Sun 16:00 CEST. Check:
- activity-core worker/API pod health on Railiance01
- `actcore-state-hub-bridge` reachability to the workstation State Hub
- `state-hub-railiance01` ops-bridge tunnel state when the laptop is awake
- Temporal schedule pause/misfire history
Record findings in a short note (progress event or workplan update). Fix any
configuration regression found (do not re-enable duplicate Codex app fallbacks).
## T3 — Restore daily WSJF triage evidence
```task
id: STATE-WP-0063-T03
status: todo
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
```
After T2, trigger one manual canary of `daily-statehub-wsjf-triage` on
Railiance01 and confirm:
- `event_type: daily_triage` progress event lands in State Hub
- working-memory report file appears under `the-custodian/memory/working/`
If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin
fire is armed.
## T4 — Repair ancillary local crons
```task
id: STATE-WP-0063-T04
status: todo
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
```
- Fix `railiance backup` crontab to use the current `railiance-bootstrap` install
path, or document the correct operator install step.
- Verify `bridge maintenance cleanup` at 03:00 writes to
`~/.local/state/bridge/cleanup.log` on the next run.
## T5 — Verification sweep and close-out
```task
id: STATE-WP-0063-T05
status: todo
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
```
Run a 24-hour observation window:
- at least one successful `custodian-sync` sweep per hour while machine is awake
- at least two hourly RecentlyOnScope progress events (generated or skipped)
- no duplicate Codex + activity-core primary runners for the same job
Mark workplan `finished` and log a milestone progress event summarising repairs.