diff --git a/infra/README.md b/infra/README.md
index fe01143..8308c14 100644
--- a/infra/README.md
+++ b/infra/README.md
@@ -17,9 +17,16 @@ The compose file is `infra/docker-compose.yml`. Copy `.env.example` to `.env` an
 
 ## Periodic Repo Sync — systemd user timer
 
-The custodian sync timer runs `consistency_check.py --remote --all` every 15
-minutes, keeping workplan file state in sync with the state-hub DB automatically
-(belt-and-suspenders alongside the per-repo git post-commit hooks).
+The **State Hub consistency sync** timer (legacy unit name `custodian-sync`)
+runs `consistency_check.py --remote --all` every 15 minutes, keeping workplan
+file state in sync with the state-hub DB automatically (belt-and-suspenders
+alongside the per-repo git post-commit hooks).
+
+> **Interim local runner (STATE-WP-0063):** units must target the standalone
+> repo at `/home/worsch/state-hub` and invoke consistency via
+> `/home/worsch/.local/bin/uv run python …`. The pre-extraction path
+> `/home/worsch/the-custodian/state-hub` is obsolete. Scheduling moves to
+> Railiance01 activity-core in `STATE-WP-0064`.
 
 The all-repo remote sweep has two built-in load guards:
 
@@ -31,12 +38,22 @@ The all-repo remote sweep has two built-in load guards:
 - Warn-only sweeps exit 0 in `--remote --all` mode so the systemd unit only goes
   failed for hard consistency failures.
 
-### Installed unit files
+### Unit files
 
-| File | Location |
-|------|----------|
-| `custodian-sync.service` | `~/.config/systemd/user/custodian-sync.service` |
-| `custodian-sync.timer` | `~/.config/systemd/user/custodian-sync.timer` |
+| File | Repo template | Installed copy |
+|------|---------------|----------------|
+| `custodian-sync.service` | `infra/systemd/custodian-sync.service` | `~/.config/systemd/user/custodian-sync.service` |
+| `custodian-sync.timer` | `infra/systemd/custodian-sync.timer` | `~/.config/systemd/user/custodian-sync.timer` |
+
+Install or refresh from the repo templates:
+
+```bash
+mkdir -p ~/.config/systemd/user
+cp ~/state-hub/infra/systemd/custodian-sync.service ~/.config/systemd/user/
+cp ~/state-hub/infra/systemd/custodian-sync.timer ~/.config/systemd/user/
+systemctl --user daemon-reload
+systemctl --user enable --now custodian-sync.timer
+```
 
 ### Management commands
 
@@ -74,7 +91,7 @@ If systemd is not available, fall back to crontab:
 
 ```bash
 # Crontab fallback (run crontab -e and add):
-*/15 * * * * curl -sf http://127.0.0.1:8000/state/health && cd ~/state-hub && .venv/bin/python scripts/consistency_check.py --remote --all >> /tmp/custodian-sync.log 2>&1
+*/15 * * * * curl -sf http://127.0.0.1:8000/state/health && cd ~/state-hub && /home/worsch/.local/bin/uv run python scripts/consistency_check.py --remote --all >> /tmp/custodian-sync.log 2>&1
 ```
 
 ---
diff --git a/infra/systemd/custodian-sync.service b/infra/systemd/custodian-sync.service
new file mode 100644
index 0000000..9a7d71a
--- /dev/null
+++ b/infra/systemd/custodian-sync.service
@@ -0,0 +1,11 @@
+[Unit]
+Description=State Hub consistency sync — fix-consistency-remote sweep
+After=network.target
+
+[Service]
+Type=oneshot
+WorkingDirectory=/home/worsch/state-hub
+ExecStartPre=/usr/bin/curl -sf http://127.0.0.1:8000/state/health
+ExecStart=/home/worsch/.local/bin/uv run python scripts/consistency_check.py --remote --all
+StandardOutput=journal
+StandardError=journal
\ No newline at end of file
diff --git a/infra/systemd/custodian-sync.timer b/infra/systemd/custodian-sync.timer
new file mode 100644
index 0000000..e7a5357
--- /dev/null
+++ b/infra/systemd/custodian-sync.timer
@@ -0,0 +1,11 @@
+[Unit]
+Description=State Hub consistency sync — periodic repo sync (every 15 min)
+Requires=custodian-sync.service
+
+[Timer]
+OnBootSec=5min
+OnUnitActiveSec=15min
+Unit=custodian-sync.service
+
+[Install]
+WantedBy=timers.target
\ No newline at end of file
diff --git a/workplans/STATE-WP-0063-weekend-automation-repair.md b/workplans/STATE-WP-0063-weekend-automation-repair.md
index 4848b26..3aae56f 100644
--- a/workplans/STATE-WP-0063-weekend-automation-repair.md
+++ b/workplans/STATE-WP-0063-weekend-automation-repair.md
@@ -4,7 +4,7 @@ type: workplan
 title: "Repair broken weekend automation (paths, cluster reachability, triage)"
 domain: custodian
 repo: state-hub
-status: ready
+status: active
 owner: codex
 topic_slug: custodian
 created: "2026-06-21"
@@ -55,7 +55,7 @@ Key failures:
 
 ```task
 id: STATE-WP-0063-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"
 ```
@@ -72,11 +72,16 @@ Acceptance:
   not `status=127`.
 - Document the interim fix in `infra/README.md` until WP-0064 cutover.
 
+Result 2026-06-21: Updated `~/.config/systemd/user/custodian-sync.service` and
+added repo templates under `infra/systemd/`. Verified
+`systemctl --user start custodian-sync.service` exits 0; journal shows
+`RESULT: ✓ PASS (with warnings)` (~2m39s sweep).
+
 ## T2 — Diagnose activity-core weekend gap on Railiance01
 
 ```task
 id: STATE-WP-0063-T02
-status: todo
+status: done
 priority: high
 state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"
 ```
@@ -93,11 +98,29 @@ Investigate why Temporal schedules for `hourly-recently-on-scope` and
 Record findings in a short note (progress event or workplan update). Fix any
 configuration regression found (do not re-enable duplicate Codex app fallbacks).
 
+Result 2026-06-21:
+
+- activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge
+  deployment all Running). Temporal schedules for hourly RecentlyOnScope
+  (`d104348c…`) and daily WSJF triage (`6fca51fa…`) are **not paused** and did
+  fire over the weekend.
+- Failed hourly workflows show:
+  `Required context resolver 'state-hub'/'recently_on_scope_hourly' failed:
+  [Errno 111] Connection refused`. The 14:00 UTC Sun run succeeded when the
+  workstation tunnel was up; most other runs since Fri 20:00 CEST failed.
+- `state-hub-railiance01` ops-bridge tunnel spends long stretches in
+  **reconnecting**; `curl --max-time 5 http://127.0.0.1:18000/state/health` on
+  the Railiance01 node times out while the tunnel is unstable.
+- **Conclusion:** not an activity-core scheduler outage. The gap is
+  **workstation availability + ops-bridge tunnel reachability** to the local
+  State Hub API. Cluster-side `actcore-state-hub-bridge` proxies to node-local
+  `127.0.0.1:18000` as designed.
+
 ## T3 — Restore daily WSJF triage evidence
 
 ```task
 id: STATE-WP-0063-T03
-status: todo
+status: progress
 priority: medium
 state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"
 ```
@@ -111,11 +134,18 @@ Railiance01 and confirm:
 If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin fire
 is armed.
 
+Progress 2026-06-21: Manual trigger via actcore-api returned workflow
+`activity-6fca51fa…:manual-ca469cb5…`. Schedule is armed (next 07:20 Europe/Berlin).
+Report sink `state-hub-progress` is still failing with `[Errno 111] Connection
+refused` while `state-hub-railiance01` tunnel is reconnecting. Re-verify after
+tunnel stabilises (`bridge check state-hub-railiance01` + new `daily_triage`
+progress event).
+
 ## T4 — Repair ancillary local crons
 
 ```task
 id: STATE-WP-0063-T04
-status: todo
+status: done
 priority: low
 state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
 ```
@@ -125,11 +155,20 @@ state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"
 - Verify `bridge maintenance cleanup` at 03:00 writes to
   `~/.local/state/bridge/cleanup.log` on the next run.
 
+Result 2026-06-21:
+
+- Crontab `railiance backup` path corrected to
+  `/home/worsch/railiance-cluster/bin/railiance` (was stale
+  `railiance-bootstrap` path).
+- Created `~/.local/state/bridge/` for cleanup log; cron entry unchanged.
+  Manual `bridge maintenance cleanup --restart` runs >30s (expected for tunnel
+  recycle); next 03:00 cron run is the production verification.
+
 ## T5 — Verification sweep and close-out
 
 ```task
 id: STATE-WP-0063-T05
-status: todo
+status: wait
 priority: medium
 state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"
 ```
@@ -140,4 +179,8 @@ Run a 24-hour observation window:
 - at least two hourly RecentlyOnScope progress events (generated or skipped)
 - no duplicate Codex + activity-core primary runners for the same job
 
-Mark workplan `finished` and log a milestone progress event summarising repairs.
\ No newline at end of file
+Mark workplan `finished` and log a milestone progress event summarising repairs.
+
+Blocked on T3 tunnel stabilisation and a 24-hour observation window while the
+workstation remains awake. T01 and local timer are verified; cluster schedules
+resume once `state-hub-railiance01` stays `connected`.
\ No newline at end of file