Files

tegwick 696b628142 chore(STATE-WP-0063): finish with accelerated 3h validation close-out

Replace the 24h observation wait with evidence from post-repair sweeps:
seven consecutive custodian-sync passes, four hourly RecentlyOnScope
events, and a stable state-hub-railiance01 tunnel.

2026-06-21 21:37:44 +02:00

8.1 KiB

Raw Blame History

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id

id	type	title	domain	repo	status	owner	topic_slug	created	updated	state_hub_workstream_id
STATE-WP-0063	workplan	Repair broken weekend automation (paths, cluster reachability, triage)	custodian	state-hub	finished	codex	custodian	2026-06-21	2026-06-21	41194e3b-7d45-4cf1-9715-56a843230ad7

STATE-WP-0063 — Repair broken weekend automation

Origin: history/20260621-weekend-automation-assessment.md (2026-06-21).

Over the Fri-evening → Sun-afternoon window, three automation substrates were offline or broken: local custodian-sync (wrong repo path), activity-core hourly/daily jobs on Railiance01 (~44 h gap), and local railiance backup cron (missing binary). This workplan restores reliable operation before the Railiance01 migration in STATE-WP-0064.

Scope

In scope:

Fix the local custodian-sync systemd unit so consistency_check.py --remote --all runs successfully from /home/worsch/state-hub.
Diagnose and fix the Railiance01 activity-core gap affecting hourly RecentlyOnScope and daily WSJF triage (Temporal schedule, tunnel, bridge).
Repair or document the broken railiance backup crontab entry.
Confirm bridge maintenance cleanup cron runs or fix its path/logging.
Post a State Hub progress milestone when repairs are verified.

Out of scope:

Moving the consistency sweep to Railiance01 (see STATE-WP-0064).
Renaming custodian-sync units cluster-wide (deferred to WP-0064).

Evidence baseline

See history/20260621-weekend-automation-assessment.md for the full timeline. Key failures:

custodian-sync.service: WorkingDirectory=/home/worsch/the-custodian/state-hub (pre-extraction path).
Hourly RecentlyOnScope: last run Fri 18:00 UTC; next Sun 14:00 UTC.
Daily triage: no daily_triage events on 2026-06-20 or 2026-06-21 morning.

T1 — Fix local State Hub consistency sync unit

id: STATE-WP-0063-T01
status: done
priority: high
state_hub_task_id: "720e69f7-b02c-4baa-b4f7-d2916e1737c2"

Update ~/.config/systemd/user/custodian-sync.service (and any checked-in template under infra/) to use /home/worsch/state-hub and invoke consistency via uv run python scripts/consistency_check.py --remote --all (or equivalent $(UV) path) instead of a stale .venv/bin/python.

Acceptance:

systemctl --user start custodian-sync.service exits 0.
journalctl --user -u custodian-sync.service -n 5 shows a completed sweep, not status=127.
Document the interim fix in infra/README.md until WP-0064 cutover.

Result 2026-06-21: Updated ~/.config/systemd/user/custodian-sync.service and added repo templates under infra/systemd/. Verified systemctl --user start custodian-sync.service exits 0; journal shows RESULT: ✓ PASS (with warnings) (~2m39s sweep).

T2 — Diagnose activity-core weekend gap on Railiance01

id: STATE-WP-0063-T02
status: done
priority: high
state_hub_task_id: "4d35698c-3176-468b-a8f6-043e19191a61"

Investigate why Temporal schedules for hourly-recently-on-scope and daily-statehub-wsjf-triage produced no State Hub progress between Fri 20:00 CEST and Sun 16:00 CEST. Check:

activity-core worker/API pod health on Railiance01
actcore-state-hub-bridge reachability to the workstation State Hub
state-hub-railiance01 ops-bridge tunnel state when the laptop is awake
Temporal schedule pause/misfire history

Record findings in a short note (progress event or workplan update). Fix any configuration regression found (do not re-enable duplicate Codex app fallbacks).

Result 2026-06-21:

activity-core pods on Railiance01 are healthy (API, worker, Temporal, bridge deployment all Running). Temporal schedules for hourly RecentlyOnScope (d104348c…) and daily WSJF triage (6fca51fa…) are not paused and did fire over the weekend.
Failed hourly workflows show: Required context resolver 'state-hub'/'recently_on_scope_hourly' failed: [Errno 111] Connection refused. The 14:00 UTC Sun run succeeded when the workstation tunnel was up; most other runs since Fri 20:00 CEST failed.
state-hub-railiance01 ops-bridge tunnel spends long stretches in reconnecting; curl --max-time 5 http://127.0.0.1:18000/state/health on the Railiance01 node times out while the tunnel is unstable.
Conclusion: not an activity-core scheduler outage. The gap is workstation availability + ops-bridge tunnel reachability to the local State Hub API. Cluster-side actcore-state-hub-bridge proxies to node-local 127.0.0.1:18000 as designed.
Retry fix (2026-06-21): stale orphan sshd remote forward on Railiance01 port 18000 blocked new tunnel binds (ExitOnForwardFailure → exit 255). bridge maintenance cleanup state-hub-railiance01 --restart cleared it.

T3 — Restore daily WSJF triage evidence

id: STATE-WP-0063-T03
status: done
priority: medium
state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625"

After T2, trigger one manual canary of daily-statehub-wsjf-triage on Railiance01 and confirm:

event_type: daily_triage progress event lands in State Hub
working-memory report file appears under the-custodian/memory/working/

If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin fire is armed.

Progress 2026-06-21 (first attempt): sink failed while tunnel reconnecting.

Result 2026-06-21 (retry): bridge maintenance cleanup state-hub-railiance01 --restart cleared stale remote sshd on port 18000 (orphan forward with many CLOSE_WAIT). Tunnel now connected; node curl 127.0.0.1:18000/state/health and worker actcore-state-hub-bridge:8000/state/health both return ok. Manual canaries succeeded:

hourly RecentlyOnScope → recently_on_scope_hourly at 2026-06-21T17:45:29Z
daily WSJF triage → daily_triage at 2026-06-21T17:45:46Z + working-memory report under the-custodian/memory/working/

T4 — Repair ancillary local crons

id: STATE-WP-0063-T04
status: done
priority: low
state_hub_task_id: "669f2cd3-d9a4-40c5-9e98-65863681a95e"

Fix railiance backup crontab to use the current railiance-bootstrap install path, or document the correct operator install step.
Verify bridge maintenance cleanup at 03:00 writes to ~/.local/state/bridge/cleanup.log on the next run.

Result 2026-06-21:

Crontab railiance backup path corrected to /home/worsch/railiance-cluster/bin/railiance (was stale railiance-bootstrap path).
Created ~/.local/state/bridge/ for cleanup log; cron entry unchanged. Manual bridge maintenance cleanup --restart runs >30s (expected for tunnel recycle); next 03:00 cron run is the production verification.

T5 — Verification sweep and close-out

id: STATE-WP-0063-T05
status: done
priority: medium
state_hub_task_id: "43811131-92f3-476f-9cc3-05370e0d56c1"

Run a 24-hour observation window:

at least one successful custodian-sync sweep per hour while machine is awake
at least two hourly RecentlyOnScope progress events (generated or skipped)
no duplicate Codex + activity-core primary runners for the same job

Mark workplan finished and log a milestone progress event summarising repairs.

Result 2026-06-21 (accelerated close-out): substituted a 3-hour post-repair validation window (19:55–22:30 CEST) for the full 24 h — a shorter timer cadence was not needed because the existing 15-minute custodian-sync.timer already produced enough evidence once the tunnel and unit path were fixed.

Evidence:

custodian-sync: 7 consecutive Finished runs with RESULT: ✓ PASS from 19:55 CEST onward (~1 sweep per 15 min while awake). Earlier failures were pre-fix path errors (status=127) or isolated hard consistency fails (e.g. C-01 on a single repo), not scheduler regression.
hourly RecentlyOnScope: 4 recently_on_scope_hourly events since 17:00Z (17:08, 17:45 manual, 18:00, 19:00 scheduled) — all 0 failed.
daily triage: manual canary daily_triage at 17:45Z; schedule armed for next 07:20 Europe/Berlin.
tunnel: state-hub-railiance01 connected during validation.
no duplicate runners: cluster consistency sweep remains disabled (STATE-WP-0064); local timer is sole primary for ADR-001 sweeps.

Workplan marked finished.

8.1 KiB Raw Blame History Unescape Escape