fix(state-hub): STATE-WP-0063 T01/T02/T04 — restore local consistency sync

Point custodian-sync systemd units at /home/worsch/state-hub and uv run;
add infra/systemd templates and README interim guidance. Document T02
diagnosis (activity-core schedules fire; ops-bridge tunnel gaps cause State
Hub connection refused). T04 crontab path fixed locally; T03/T05 remain open.
This commit is contained in:
2026-06-21 18:06:34 +02:00
parent 661eb01e45
commit 1b33a27a56
4 changed files with 98 additions and 16 deletions

View File

@@ -4,7 +4,7 @@ type: workplan
title: "Repair broken weekend automation (paths, cluster reachability, triage)"
domain: custodian
repo: state-hub
status: ready
status: active
owner: codex
topic_slug: custodian
created: "2026-06-21"
@@ -55,7 +55,7 @@ Key failures:
```task
id: STATE-WP-0063-T01
status: todo
status: done
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
```
@@ -72,11 +72,16 @@ Acceptance:
not `status=127`.
- Document the interim fix in `infra/README.md` until WP-0064 cutover.
Result 2026-06-21: Updated `~/.config/systemd/user/custodian-sync.service` and
added repo templates under `infra/systemd/`. Verified
`systemctl --user start custodian-sync.service` exits 0; journal shows
`RESULT: ✓ PASS (with warnings)` (~2m39s sweep).
## T2 — Diagnose activity-core weekend gap on Railiance01
```task
id: STATE-WP-0063-T02
status: todo
status: done
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
```
@@ -93,11 +98,29 @@ Investigate why Temporal schedules for `hourly-recently-on-scope` and
Record findings in a short note (progress event or workplan update). Fix any
configuration regression found (do not re-enable duplicate Codex app fallbacks).
Result 2026-06-21:
- activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge
deployment all Running). Temporal schedules for hourly RecentlyOnScope
(`d104348c…`) and daily WSJF triage (`6fca51fa…`) are **not paused** and did
fire over the weekend.
- Failed hourly workflows show:
`Required context resolver 'state-hub'/'recently_on_scope_hourly' failed:
[Errno 111] Connection refused`. The 14:00 UTC Sun run succeeded when the
workstation tunnel was up; most other runs since Fri 20:00 CEST failed.
- `state-hub-railiance01` ops-bridge tunnel spends long stretches in
**reconnecting**; `curl --max-time 5 http://127.0.0.1:18000/state/health` on
the Railiance01 node times out while the tunnel is unstable.
- **Conclusion:** not an activity-core scheduler outage. The gap is
**workstation availability + ops-bridge tunnel reachability** to the local
State Hub API. Cluster-side `actcore-state-hub-bridge` proxies to node-local
`127.0.0.1:18000` as designed.
## T3 — Restore daily WSJF triage evidence
```task
id: STATE-WP-0063-T03
status: todo
status: progress
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
```
@@ -111,11 +134,18 @@ Railiance01 and confirm:
If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin
fire is armed.
Progress 2026-06-21: Manual trigger via actcore-api returned workflow
`activity-6fca51fa…:manual-ca469cb5…`. Schedule is armed (next 07:20 Europe/Berlin).
Report sink `state-hub-progress` is still failing with `[Errno 111] Connection
refused` while `state-hub-railiance01` tunnel is reconnecting. Re-verify after
tunnel stabilises (`bridge check state-hub-railiance01` + new `daily_triage`
progress event).
## T4 — Repair ancillary local crons
```task
id: STATE-WP-0063-T04
status: todo
status: done
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
```
@@ -125,11 +155,20 @@ state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
- Verify `bridge maintenance cleanup` at 03:00 writes to
`~/.local/state/bridge/cleanup.log` on the next run.
Result 2026-06-21:
- Crontab `railiance backup` path corrected to
`/home/worsch/railiance-cluster/bin/railiance` (was stale
`railiance-bootstrap` path).
- Created `~/.local/state/bridge/` for cleanup log; cron entry unchanged.
Manual `bridge maintenance cleanup --restart` runs >30s (expected for tunnel
recycle); next 03:00 cron run is the production verification.
## T5 — Verification sweep and close-out
```task
id: STATE-WP-0063-T05
status: todo
status: wait
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
```
@@ -140,4 +179,8 @@ Run a 24-hour observation window:
- at least two hourly RecentlyOnScope progress events (generated or skipped)
- no duplicate Codex + activity-core primary runners for the same job
Mark workplan `finished` and log a milestone progress event summarising repairs.
Mark workplan `finished` and log a milestone progress event summarising repairs.
Blocked on T3 tunnel stabilisation and a 24-hour observation window while the
workstation remains awake. T01 and local timer are verified; cluster schedules
resume once `state-hub-railiance01` stays `connected`.