From 323599f2fc226cda84d1c6a8b401084cba317490 Mon Sep 17 00:00:00 2001 From: tegwick Date: Sun, 21 Jun 2026 19:47:56 +0200 Subject: [PATCH] =?UTF-8?q?docs(state-hub):=20STATE-WP-0063=20T03=20done?= =?UTF-8?q?=20=E2=80=94=20tunnel=20cleanup=20restored=20activity-core?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Document stale remote sshd forward on Railiance01 :18000 as root cause of reconnect loop; T03 verified after bridge maintenance cleanup and manual canaries for hourly RecentlyOnScope and daily WSJF triage. --- ...STATE-WP-0063-weekend-automation-repair.md | 22 +++++++++++++------ 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/workplans/STATE-WP-0063-weekend-automation-repair.md b/workplans/STATE-WP-0063-weekend-automation-repair.md index 3aae56f..43e4ed6 100644 --- a/workplans/STATE-WP-0063-weekend-automation-repair.md +++ b/workplans/STATE-WP-0063-weekend-automation-repair.md @@ -115,12 +115,15 @@ Result 2026-06-21: **workstation availability + ops-bridge tunnel reachability** to the local State Hub API. Cluster-side `actcore-state-hub-bridge` proxies to node-local `127.0.0.1:18000` as designed. +- **Retry fix (2026-06-21):** stale orphan `sshd` remote forward on Railiance01 + port 18000 blocked new tunnel binds (`ExitOnForwardFailure` → exit 255). + `bridge maintenance cleanup state-hub-railiance01 --restart` cleared it. ## T3 — Restore daily WSJF triage evidence ```task id: STATE-WP-0063-T03 -status: progress +status: done priority: medium state_hub_task_id: "4b68b207-a80e-4c71-a246-10035ef69625" ``` @@ -134,12 +137,17 @@ Railiance01 and confirm: If the schedule was paused, unpause and verify the next 07:20 Europe/Berlin fire is armed. -Progress 2026-06-21: Manual trigger via actcore-api returned workflow -`activity-6fca51fa…:manual-ca469cb5…`. Schedule is armed (next 07:20 Europe/Berlin). -Report sink `state-hub-progress` is still failing with `[Errno 111] Connection -refused` while `state-hub-railiance01` tunnel is reconnecting. Re-verify after -tunnel stabilises (`bridge check state-hub-railiance01` + new `daily_triage` -progress event). +Progress 2026-06-21 (first attempt): sink failed while tunnel reconnecting. + +Result 2026-06-21 (retry): `bridge maintenance cleanup state-hub-railiance01 +--restart` cleared stale remote `sshd` on port 18000 (orphan forward with many +CLOSE_WAIT). Tunnel now **connected**; node `curl 127.0.0.1:18000/state/health` +and worker `actcore-state-hub-bridge:8000/state/health` both return ok. Manual +canaries succeeded: + +- hourly RecentlyOnScope → `recently_on_scope_hourly` at 2026-06-21T17:45:29Z +- daily WSJF triage → `daily_triage` at 2026-06-21T17:45:46Z + working-memory + report under `the-custodian/memory/working/` ## T4 — Repair ancillary local crons