---
id: STATE-WP-0063
type: workplan
title: "Repair broken weekend automation (paths, cluster reachability, triage)"
domain: custodian
repo: state-hub
status: active
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "41194e3b-7d45-4cf1-9715-56a843230ad7"
---

# STATE-WP-0063 — Repair broken weekend automation

**Origin:** `history/20260621-weekend-automation-assessment.md` (2026-06-21).

Over the Fri-evening → Sun-afternoon window, three automation substrates were
offline or broken: local `custodian-sync` (wrong repo path), activity-core
hourly/daily jobs on Railiance01 (~44 h gap), and local `railiance backup`
cron (missing binary). This workplan restores reliable operation before the
Railiance01 migration in `STATE-WP-0064`.

## Scope

In scope:

- Fix the local `custodian-sync` systemd unit so
  `consistency_check.py --remote --all` runs successfully from
  `/home/worsch/state-hub`.
- Diagnose and fix the Railiance01 activity-core gap affecting hourly
  RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
- Repair or document the broken `railiance backup` crontab entry.
- Confirm `bridge maintenance cleanup` cron runs or fix its path/logging.
- Post a State Hub progress milestone when repairs are verified.

Out of scope:

- Moving the consistency sweep to Railiance01 (see `STATE-WP-0064`).
- Renaming `custodian-sync` units cluster-wide (deferred to WP-0064).

## Evidence baseline

See `history/20260621-weekend-automation-assessment.md` for the full timeline.
Key failures:

- `custodian-sync.service`: `WorkingDirectory=/home/worsch/the-custodian/state-hub`
  (pre-extraction path).
- Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
- Daily triage: no `daily_triage` events on 2026-06-20 or 2026-06-21 morning.

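The ~44 h figure and the UTC/CEST times quoted in this workplan can be
cross-checked with a few lines of Python. This is a minimal sketch: the
calendar dates (Fri 2026-06-19, Sun 2026-06-21) are inferred from the weekdays
above, and CEST is modelled as a fixed UTC+2 offset rather than a full
time-zone database lookup.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CEST modelled as a fixed UTC+2 offset (valid for June dates).
CEST = timezone(timedelta(hours=2), name="CEST")

# Weekend boundaries from the evidence baseline, in UTC.
last_run = datetime(2026, 6, 19, 18, 0, tzinfo=timezone.utc)  # Fri 18:00 UTC
next_run = datetime(2026, 6, 21, 14, 0, tzinfo=timezone.utc)  # Sun 14:00 UTC

gap_hours = (next_run - last_run).total_seconds() / 3600

print(last_run.astimezone(CEST).strftime("%a %H:%M %Z"))  # Fri 20:00 CEST
print(next_run.astimezone(CEST).strftime("%a %H:%M %Z"))  # Sun 16:00 CEST
print(f"gap: {gap_hours:.0f} h")                           # gap: 44 h
```
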
---

## T1 — Fix local State Hub consistency sync unit

```task
id: STATE-WP-0063-T01
status: done
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
```

Update `~/.config/systemd/user/custodian-sync.service` (and any checked-in
template under `infra/`) to use `/home/worsch/state-hub` and invoke the
consistency check via `uv run python scripts/consistency_check.py --remote --all`
(or equivalent `$(UV)` path) instead of a stale `.venv/bin/python`.

Acceptance:

- `systemctl --user start custodian-sync.service` exits 0.
- `journalctl --user -u custodian-sync.service -n 5` shows a completed sweep,
  not `status=127`.
- Document the interim fix in `infra/README.md` until WP-0064 cutover.

Result 2026-06-21: Updated `~/.config/systemd/user/custodian-sync.service` and
added repo templates under `infra/systemd/`. Verified that
`systemctl --user start custodian-sync.service` exits 0; the journal shows
`RESULT: ✓ PASS (with warnings)` (~2m39s sweep).

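A small check along these lines could catch the same regression before the
next weekend. It is only a sketch: the unit texts below are illustrative, not
copies of the real `custodian-sync.service`, and the `ExecStart` lines
(including the `/usr/bin/env uv` wrapper) are assumptions. `audit_unit` is a
hypothetical helper, not part of the repo.

```python
import configparser

def audit_unit(unit_text: str, expected_dir: str) -> list:
    """Return a list of problems found in a unit's [Service] section."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    service = parser["Service"] if parser.has_section("Service") else {}
    problems = []
    if service.get("WorkingDirectory") != expected_dir:
        problems.append("WorkingDirectory is not " + expected_dir)
    if ".venv/bin/python" in service.get("ExecStart", ""):
        problems.append("ExecStart still points at a .venv interpreter")
    return problems

# Illustrative unit bodies (assumed, not taken from the repo).
STALE = """\
[Service]
Type=oneshot
WorkingDirectory=/home/worsch/the-custodian/state-hub
ExecStart=/home/worsch/the-custodian/state-hub/.venv/bin/python scripts/consistency_check.py --remote --all
"""
FIXED = """\
[Service]
Type=oneshot
WorkingDirectory=/home/worsch/state-hub
ExecStart=/usr/bin/env uv run python scripts/consistency_check.py --remote --all
"""

print(audit_unit(STALE, "/home/worsch/state-hub"))  # two problems
print(audit_unit(FIXED, "/home/worsch/state-hub"))  # []
```
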
## T2 — Diagnose activity-core weekend gap on Railiance01

```task
id: STATE-WP-0063-T02
status: done
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
```

Investigate why Temporal schedules for `hourly-recently-on-scope` and
`daily-statehub-wsjf-triage` produced no State Hub progress between Fri
20:00 CEST and Sun 16:00 CEST. Check:

- activity-core worker/API pod health on Railiance01
- `actcore-state-hub-bridge` reachability to the workstation State Hub
- `state-hub-railiance01` ops-bridge tunnel state when the laptop is awake
- Temporal schedule pause/misfire history

Record findings in a short note (progress event or workplan update). Fix any
configuration regression found (do not re-enable duplicate Codex app fallbacks).

Result 2026-06-21:

- activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge
  deployment all Running). Temporal schedules for hourly RecentlyOnScope
  (`d104348c…`) and daily WSJF triage (`6fca51fa…`) are **not paused** and did
  fire over the weekend.
- Failed hourly workflows show:
  `Required context resolver 'state-hub'/'recently_on_scope_hourly' failed: [Errno 111] Connection refused`.
  The 14:00 UTC Sun run succeeded when the workstation tunnel was up; most
  other runs since Fri 20:00 CEST failed.
- `state-hub-railiance01` ops-bridge tunnel spends long stretches in
  **reconnecting**; `curl --max-time 5 http://127.0.0.1:18000/state/health` on
  the Railiance01 node times out while the tunnel is unstable.
- **Conclusion:** not an activity-core scheduler outage. The gap is
  **workstation availability + ops-bridge tunnel reachability** to the local
  State Hub API. Cluster-side `actcore-state-hub-bridge` proxies to node-local
  `127.0.0.1:18000` as designed.
- **Retry fix (2026-06-21):** stale orphan `sshd` remote forward on Railiance01
  port 18000 blocked new tunnel binds (`ExitOnForwardFailure` → exit 255).
  `bridge maintenance cleanup state-hub-railiance01 --restart` cleared it.

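The findings above involve two distinct symptoms: workers saw `Connection
refused`, while the node-local `curl` timed out. A probe that tells them apart
makes future triage quicker. The following is a minimal standard-library
sketch; `probe` is a hypothetical helper, and only the URL and the 5 s timeout
come from the notes above.

```python
import socket
import urllib.error
import urllib.request

def probe(url: str, timeout: float = 5.0) -> str:
    """Classify a health endpoint as 'ok', 'refused', 'timeout' or 'error'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return "ok" if resp.status == 200 else "error"
    except urllib.error.HTTPError:
        return "error"  # server answered, but with a 4xx/5xx status
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused"  # nothing is listening on the port
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"  # listener exists but never completes
        return "error"
    except (socket.timeout, TimeoutError):
        return "timeout"  # timed out while reading the response
    except OSError:
        return "error"

if __name__ == "__main__":
    print(probe("http://127.0.0.1:18000/state/health"))
```

Under these assumptions, `refused` suggests no forward is bound at all, while
`timeout` is consistent with a listener that is bound but no longer backed by
a live tunnel, as with the stale `sshd` forward.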
## T3 — Restore daily WSJF triage evidence

```task
id: STATE-WP-0063-T03
status: done
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
```

After T2, trigger one manual canary of `daily-statehub-wsjf-triage` on
Railiance01 and confirm:

- an `event_type: daily_triage` progress event lands in State Hub
- a working-memory report file appears under `the-custodian/memory/working/`

If the schedule was paused, unpause it and verify the next 07:20 Europe/Berlin
fire is armed.

Progress 2026-06-21 (first attempt): sink failed while the tunnel was
reconnecting.

Result 2026-06-21 (retry):
`bridge maintenance cleanup state-hub-railiance01 --restart` cleared the stale
remote `sshd` on port 18000 (orphan forward with many `CLOSE_WAIT`
connections). Tunnel now **connected**; node
`curl 127.0.0.1:18000/state/health` and worker
`actcore-state-hub-bridge:8000/state/health` both return ok. Manual canaries
succeeded:

- hourly RecentlyOnScope → `recently_on_scope_hourly` at 2026-06-21T17:45:29Z
- daily WSJF triage → `daily_triage` at 2026-06-21T17:45:46Z + working-memory
  report under `the-custodian/memory/working/`

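The orphan-forward signature (a listener on port 18000 plus a pile of
`CLOSE_WAIT` connections) can be spotted from socket listings. The sketch
below counts TCP states per port from text shaped like `ss -tan` output. The
column layout is an assumption about that output, the sample rows are
invented, and `socket_states` is a hypothetical helper.

```python
from collections import Counter

def socket_states(ss_output: str, port: int) -> Counter:
    """Count TCP states for sockets whose local address ends in :<port>."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) < 5 or fields[0] == "State":
            continue
        if fields[3].rsplit(":", 1)[-1] == str(port):
            # Normalise netstat-style CLOSE_WAIT to ss-style CLOSE-WAIT.
            counts[fields[0].replace("_", "-")] += 1
    return counts

# Invented sample rows, for illustration only.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
LISTEN     0      128    127.0.0.1:18000     0.0.0.0:*
CLOSE-WAIT 1      0      127.0.0.1:18000     127.0.0.1:51234
CLOSE-WAIT 1      0      127.0.0.1:18000     127.0.0.1:51236
ESTAB      0      0      10.0.0.5:22         10.0.0.9:40022
"""

counts = socket_states(SAMPLE, 18000)
print(dict(counts))  # {'LISTEN': 1, 'CLOSE-WAIT': 2}
```
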
## T4 — Repair ancillary local crons

```task
id: STATE-WP-0063-T04
status: done
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
```

- Fix `railiance backup` crontab to use the current `railiance-bootstrap` install
  path, or document the correct operator install step.
- Verify `bridge maintenance cleanup` at 03:00 writes to
  `~/.local/state/bridge/cleanup.log` on the next run.

Result 2026-06-21:

- Crontab `railiance backup` path corrected to
  `/home/worsch/railiance-cluster/bin/railiance` (was a stale
  `railiance-bootstrap` path).
- Created `~/.local/state/bridge/` for the cleanup log; cron entry unchanged.
  A manual `bridge maintenance cleanup --restart` takes more than 30 s
  (expected for a tunnel recycle); the next 03:00 cron run is the production
  verification.

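A missing binary in a crontab entry only surfaces when the job fires, so a
quick static check is cheap insurance. This is a sketch under assumptions:
the schedule fields in the example line are invented (only the binary path
comes from the result above), `%` escapes and environment lines are ignored,
and `cron_command_problem` is a hypothetical helper.

```python
import os
import shlex
from typing import Optional

def cron_command_problem(cron_line: str) -> Optional[str]:
    """Describe why a crontab entry may not run, or return None if it looks fine."""
    fields = cron_line.split(None, 5)  # five schedule fields, then the command
    if len(fields) < 6:
        return "not a standard five-field crontab entry"
    binary = os.path.expanduser(shlex.split(fields[5])[0])
    if not os.path.isabs(binary):
        return f"{binary}: relative path, depends on cron's PATH"
    if not (os.path.isfile(binary) and os.access(binary, os.X_OK)):
        return f"{binary}: missing or not executable"
    return None

if __name__ == "__main__":
    # Invented schedule; the path is the corrected one from the result above.
    line = "15 2 * * * /home/worsch/railiance-cluster/bin/railiance backup"
    print(cron_command_problem(line) or "ok")
```
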
## T5 — Verification sweep and close-out

```task
id: STATE-WP-0063-T05
status: wait
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
```

Run a 24-hour observation window:

- at least one successful `custodian-sync` sweep per hour while the machine is
  awake
- at least two hourly RecentlyOnScope progress events (generated or skipped)
- no duplicate Codex + activity-core primary runners for the same job

Mark workplan `finished` and log a milestone progress event summarising repairs.

Blocked on T3 tunnel stabilisation and a 24-hour observation window while the
workstation remains awake. T01 and the local timer are verified; cluster
schedules resume once `state-hub-railiance01` stays `connected`.

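The two countable criteria of the observation window can be evaluated
mechanically once sweep and event timestamps are collected. The sketch below
assumes the machine is awake for the whole window (sleep periods are not
modelled) and uses invented timestamps; `uncovered_hours` and `window_ok` are
hypothetical helpers, and how timestamps are pulled from the journal or State
Hub is left open.

```python
from datetime import datetime, timedelta, timezone

HOUR = timedelta(hours=1)

def uncovered_hours(sweeps, start, hours=24):
    """Return the start of every hour in the window with no successful sweep."""
    end = start + hours * HOUR
    covered = {(t - start) // HOUR for t in sweeps if start <= t < end}
    return [start + h * HOUR for h in range(hours) if h not in covered]

def window_ok(sweeps, hourly_events, start, hours=24):
    """Apply the sweep-per-hour and two-hourly-events criteria to one window."""
    end = start + hours * HOUR
    in_window = [t for t in hourly_events if start <= t < end]
    return not uncovered_hours(sweeps, start, hours) and len(in_window) >= 2

# Invented data: one sweep per hour except hour 3, plus two hourly events.
start = datetime(2026, 6, 21, 18, 0, tzinfo=timezone.utc)
sweeps = [start + h * HOUR + timedelta(minutes=5) for h in range(24) if h != 3]
events = [start + HOUR, start + 2 * HOUR]

print(uncovered_hours(sweeps, start))    # one entry: 2026-06-21 21:00 UTC
print(window_ok(sweeps, events, start))  # False
```

The duplicate-runner criterion is not covered here; it needs runner identity
per event, which this sketch does not assume.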