fix(state-hub): STATE-WP-0063 T01/T02/T04 — restore local consistency sync

Point custodian-sync systemd units at /home/worsch/state-hub and uv run; add infra/systemd templates and README interim guidance. Document T02 diagnosis (activity-core schedules fire; ops-bridge tunnel gaps cause State Hub connection refused). T04 crontab path fixed locally; T03/T05 remain open.
2026-06-21 18:06:34 +02:00
parent 661eb01e45
commit 1b33a27a56
4 changed files with 98 additions and 16 deletions
--- a/infra/README.md
+++ b/infra/README.md
@@ -17,9 +17,16 @@ The compose file is `infra/docker-compose.yml`. Copy `.env.example` to `.env` an

 ## Periodic Repo Sync — systemd user timer

-The custodian sync timer runs `consistency_check.py --remote --all` every 15
-minutes, keeping workplan file state in sync with the state-hub DB automatically
-(belt-and-suspenders alongside the per-repo git post-commit hooks).
+The **State Hub consistency sync** timer (legacy unit name `custodian-sync`)
+runs `consistency_check.py --remote --all` every 15 minutes, keeping workplan
+file state in sync with the state-hub DB automatically (belt-and-suspenders
+alongside the per-repo git post-commit hooks).
+
+> **Interim local runner (STATE-WP-0063):** units must target the standalone
+> repo at `/home/worsch/state-hub` and invoke the consistency check via
+> `/home/worsch/.local/bin/uv run python …`. The pre-extraction path
+> `/home/worsch/the-custodian/state-hub` is obsolete. Scheduling moves to
+> Railiance01 activity-core in `STATE-WP-0064`.

 The all-repo remote sweep has two built-in load guards:

@@ -31,12 +38,22 @@ The all-repo remote sweep has two built-in load guards:
 - Warn-only sweeps exit 0 in `--remote --all` mode so the systemd unit only
  goes failed for hard consistency failures.

-### Installed unit files
+### Unit files

-| File | Location |
-|------|----------|
-| `custodian-sync.service` | `~/.config/systemd/user/custodian-sync.service` |
-| `custodian-sync.timer`   | `~/.config/systemd/user/custodian-sync.timer`   |
+| File | Repo template | Installed copy |
+|------|---------------|----------------|
+| `custodian-sync.service` | `infra/systemd/custodian-sync.service` | `~/.config/systemd/user/custodian-sync.service` |
+| `custodian-sync.timer`   | `infra/systemd/custodian-sync.timer`   | `~/.config/systemd/user/custodian-sync.timer`   |
+
+Install or refresh from the repo templates:
+
+```bash
+mkdir -p ~/.config/systemd/user
+cp ~/state-hub/infra/systemd/custodian-sync.service ~/.config/systemd/user/
+cp ~/state-hub/infra/systemd/custodian-sync.timer ~/.config/systemd/user/
+systemctl --user daemon-reload
+systemctl --user enable --now custodian-sync.timer
+```

 ### Management commands

@@ -74,7 +91,7 @@ If systemd is not available, fall back to crontab:

 ```bash
 # Crontab fallback (run crontab -e and add):
-*/15 * * * * curl -sf http://127.0.0.1:8000/state/health && cd ~/state-hub && .venv/bin/python scripts/consistency_check.py --remote --all >> /tmp/custodian-sync.log 2>&1
+*/15 * * * * curl -sf http://127.0.0.1:8000/state/health && cd ~/state-hub && /home/worsch/.local/bin/uv run python scripts/consistency_check.py --remote --all >> /tmp/custodian-sync.log 2>&1
 ```

 ---
--- a/infra/systemd/custodian-sync.service
+++ b/infra/systemd/custodian-sync.service
@@ -0,0 +1,11 @@
+[Unit]
+Description=State Hub consistency sync — fix-consistency-remote sweep
+After=network.target
+
+[Service]
+Type=oneshot
+WorkingDirectory=/home/worsch/state-hub
+ExecStartPre=/usr/bin/curl -sf http://127.0.0.1:8000/state/health
+ExecStart=/home/worsch/.local/bin/uv run python scripts/consistency_check.py --remote --all
+StandardOutput=journal
+StandardError=journal
--- a/infra/systemd/custodian-sync.timer
+++ b/infra/systemd/custodian-sync.timer
@@ -0,0 +1,11 @@
+[Unit]
+Description=State Hub consistency sync — periodic repo sync (every 15 min)
+Requires=custodian-sync.service
+
+[Timer]
+OnBootSec=5min
+OnUnitActiveSec=15min
+Unit=custodian-sync.service
+
+[Install]
+WantedBy=timers.target
--- a/workplans/STATE-WP-0063-weekend-automation-repair.md
+++ b/workplans/STATE-WP-0063-weekend-automation-repair.md
@@ -4,7 +4,7 @@ type: workplan
 title: "Repair broken weekend automation (paths, cluster reachability, triage)"
 domain: custodian
 repo: state-hub
-status: ready
+status: active
 owner: codex
 topic_slug: custodian
 created: "2026-06-21"
@@ -55,7 +55,7 @@ Key failures:

 ```task
 id: STATE-WP-0063-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
 ```
@@ -72,11 +72,16 @@ Acceptance:
  not `status=127`.
 - Document the interim fix in `infra/README.md` until WP-0064 cutover.

+Result 2026-06-21: Updated `~/.config/systemd/user/custodian-sync.service` and
+added repo templates under `infra/systemd/`. Verified
+`systemctl --user start custodian-sync.service` exits 0; journal shows
+`RESULT: ✓ PASS (with warnings)` (~2m39s sweep).
+
 ## T2 — Diagnose activity-core weekend gap on Railiance01

 ```task
 id: STATE-WP-0063-T02
-status: todo
+status: done
 priority: high
 state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
 ```
@@ -93,11 +98,29 @@ Investigate why Temporal schedules for `hourly-recently-on-scope` and
 Record findings in a short note (progress event or workplan update). Fix any
 configuration regression found (do not re-enable duplicate Codex app fallbacks).

+Result 2026-06-21:
+
+- activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge
+  deployment all Running). Temporal schedules for hourly RecentlyOnScope
+  (`d104348c…`) and daily WSJF triage (`6fca51fa…`) are **not paused** and did
+  fire over the weekend.
+- Failed hourly workflows show:
+  `Required context resolver 'state-hub'/'recently_on_scope_hourly' failed:
+  [Errno 111] Connection refused`. The 14:00 UTC Sunday run succeeded while the
+  workstation tunnel was up; most other runs since Friday 20:00 CEST failed.
+- `state-hub-railiance01` ops-bridge tunnel spends long stretches in
+  **reconnecting**; `curl --max-time 5 http://127.0.0.1:18000/state/health` on
+  the Railiance01 node times out while the tunnel is unstable.
+- **Conclusion:** not an activity-core scheduler outage. The gap is
+  **workstation availability + ops-bridge tunnel reachability** to the local
+  State Hub API. Cluster-side `actcore-state-hub-bridge` proxies to node-local
+  `127.0.0.1:18000` as designed.
+
 ## T3 — Restore daily WSJF triage evidence

 ```task
 id: STATE-WP-0063-T03
-status: todo
+status: progress
 priority: medium
 state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
 ```
@@ -111,11 +134,18 @@ Railiance01 and confirm:
 If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin
 fire is armed.

+Progress 2026-06-21: Manual trigger via actcore-api returned workflow
+`activity-6fca51fa…:manual-ca469cb5…`. Schedule is armed (next 07:20 Europe/Berlin).
+Report sink `state-hub-progress` is still failing with `[Errno 111] Connection
+refused` while the `state-hub-railiance01` tunnel is reconnecting. Re-verify
+once the tunnel stabilises (`bridge check state-hub-railiance01` plus a new
+`daily_triage` progress event).
+
 ## T4 — Repair ancillary local crons

 ```task
 id: STATE-WP-0063-T04
-status: todo
+status: done
 priority: low
 state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
 ```
@@ -125,11 +155,20 @@ state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
 - Verify `bridge maintenance cleanup` at 03:00 writes to
  `~/.local/state/bridge/cleanup.log` on the next run.

+Result 2026-06-21:
+
+- Crontab `railiance backup` path corrected to
+  `/home/worsch/railiance-cluster/bin/railiance` (was stale
+  `railiance-bootstrap` path).
+- Created `~/.local/state/bridge/` for cleanup log; cron entry unchanged.
+  Manual `bridge maintenance cleanup --restart` runs >30s (expected for tunnel
+  recycle); next 03:00 cron run is the production verification.
+
 ## T5 — Verification sweep and close-out

 ```task
 id: STATE-WP-0063-T05
-status: todo
+status: wait
 priority: medium
 state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
 ```
@@ -140,4 +179,8 @@ Run a 24-hour observation window:
 - at least two hourly RecentlyOnScope progress events (generated or skipped)
 - no duplicate Codex + activity-core primary runners for the same job

-Mark workplan `finished` and log a milestone progress event summarising repairs.
+Mark workplan `finished` and log a milestone progress event summarising repairs.
+
+Blocked on T3 tunnel stabilisation and a 24-hour observation window during which
+the workstation stays awake. T01 and the local timer are verified; cluster
+schedules should resume once `state-hub-railiance01` stays `connected`.
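
Reviewer note on the T02 diagnosis: the workplan records two distinct failure signatures, `[Errno 111] Connection refused` in the failed workflows and a timeout from `curl --max-time 5` on the node. During the T05 observation window it may help to tell these apart mechanically. The following is a minimal sketch, not part of this commit; the function name `classify_endpoint` is hypothetical, and it only tests TCP reachability, not the `/state/health` response itself. Ports 8000 and 18000 are taken from the diff above; 8000 is meaningful on the workstation and 18000 on the Railiance01 node.

```python
import errno
import socket


def classify_endpoint(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt one TCP connect and report how it ended.

    Returns "open", "refused", "timeout", or "error:<ERRNO_NAME>".
    (Hypothetical helper for illustration; not part of state-hub.)
    """
    try:
        # create_connection performs the TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Matches the "[Errno 111] Connection refused" seen in failed workflows:
        # nothing is listening on the port.
        return "refused"
    except socket.timeout:
        # Matches the curl --max-time behaviour: no answer within the deadline.
        return "timeout"
    except OSError as exc:
        # Any other socket-level failure, reported by its symbolic errno name.
        return f"error:{errno.errorcode.get(exc.errno, 'unknown')}"


if __name__ == "__main__":
    # 8000: local State Hub API; 18000: node-local tunnel endpoint.
    for label, port in (("state-hub api", 8000), ("tunnel endpoint", 18000)):
        print(f"{label} 127.0.0.1:{port}: {classify_endpoint('127.0.0.1', port)}")
```

A `refused` result suggests nothing is listening on the port, while `timeout` suggests a listener or forwarder that is not answering. Mapping either result to a specific ops-bridge tunnel state is an assumption that would need checking against `bridge check state-hub-railiance01`.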