bridge restart now means blank-slate recovery: reverse tunnels run should_cleanup_tunnel and clear orphan remote listeners before reconnecting; healthy forwards are left running. Local-direction tunnels keep stop/start only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted, restarted, error) and exit non-zero on cleanup failure. Closes BRIDGE-WP-0005.
7.6 KiB
id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id
| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| BRIDGE-WP-0005 | workplan | Restart includes remote cleanup (blank-slate recovery) | custodian | ops-bridge | finished | codex | custodian | 2026-06-21 | 2026-06-21 | 9565491f-e664-4add-bea4-27c4fb015ee0 |
BRIDGE-WP-0005 — Restart includes remote cleanup
Origin: STATE-WP-0063 weekend automation repair (2026-06-21). A stale orphan
sshd remote forward on Railiance01 port 18000 blocked
bridge restart state-hub-railiance01 from producing a working tunnel. Operators
had to discover bridge maintenance cleanup <tunnel> --restart separately.
Operator expectation: bridge restart should mean operational again — a
blank-slate recovery — not merely "cycle the local manager PID while a broken
remote listener still holds the port."
Topology and failure modes (refined)
Tunnels in ~/.config/bridge/tunnels.yaml serve three distinct host roles.
Cleanup policy must respect all of them.
A. Workstation (laptop WSL) — tunnel origin
The State Hub API runs locally (127.0.0.1:8000). Reverse tunnels expose it on
remote hosts:
| Remote host | Tunnels (reverse) | Role |
|---|---|---|
coulombcore (92.205.130.254) |
state-hub-coulombcore, state-hub-mcp-coulombcore |
VPS — stable, occasional maintenance reboot |
railiance01 (92.205.62.239) |
state-hub-railiance01, state-hub-mcp-railiance01 |
VPS — stable, occasional maintenance reboot |
haskelseed (192.168.178.135) |
state-hub-haskelseed, state-hub-mcp-haskelseed |
LAN builder — may sleep/reboot when moved |
Laptop behaviour: shutdown, sleep, and location changes (home ↔ office) kill
local bridge processes without graceful remote SSH teardown. Orphan sshd
listeners on all three remotes are common after wake — especially
18000/18001 on VPS hosts that activity-core and remote agents depend on.
B. Haskelseed — also intermittently offline
Haskelseed is not a datacenter VPS; it may be powered down or unreachable on different networks. The same orphan-forward pattern applies to its reverse ports when the workstation-side tunnel dies uncleanly.
C. VPS remotes (coulombcore, railiance01)
Normally always-on. Maintenance reboots clear remote kernel state, but:
- a VPS reboot does not fix a workstation that is still in
reconnectingwith a dead local SSH child; - when the laptop returns, orphan forwards from the previous session may
still block new
-Rbinds if the VPS did not reboot.
Conclusion: conditional remote cleanup before restart benefits all reverse
tunnels, not only laptop-adjacent hosts. should_cleanup_tunnel() already
skips healthy forwards — VPS tunnels with live working forwards are untouched.
D. Local-direction tunnels — no remote cleanup
direction: local tunnels (k3s-api-coulombcore, nix-daemon-haskelseed) use
forward mode from workstation to remote services. They do not bind remote reverse
ports for State Hub. restart stays local stop/start only for these.
Design (decided)
| Command | Behaviour after this workplan |
|---|---|
bridge restart [tunnel] |
For each reverse tunnel: cleanup_tunnel(..., restart=True) — run should_cleanup_tunnel; clear stale remote listener if needed; then start. For local tunnels: existing stop() + start(). |
bridge maintenance cleanup |
Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". |
bridge up |
Out of scope here (see T4 optional follow-up). |
Implementation sketch: replace the body of cli.restart() with a call to
cleanup_all_tunnels(..., restart=True, tunnel_name=...) for reverse tunnels,
or per-tunnel cleanup_tunnel when a single tunnel is named.
Emit the same action summary strings cleanup already uses (healthy,
cleaned_and_restarted, error) so operators see whether remote hygiene ran.
Out of scope
- Changing
should_cleanup_tunnelheuristics (unless tests expose a VPS false positive during T2). - Auto-cleanup inside the reconnect backoff loop (stretch — T4).
- Renaming tunnels or changing
tunnels.yamlhost entries.
T1 — Wire restart through cleanup path
id: BRIDGE-WP-0005-T01
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
Refactor bridge/cli.py restart() so reverse tunnels call
cleanup_tunnel(cfg, state_mgr, restart=True) instead of bare
TunnelManager.stop() + start().
Requirements:
- Single-tunnel and all-tunnel restart both work.
- Local-direction tunnels keep stop/start only.
- Exit codes: preserve today’s semantics where practical; exit non-zero if any
named tunnel ends in
CleanupAction.action == "error". - Stdout tells the operator what happened (
healthy,cleaned_and_restarted, etc.), not only "Restarted tunnel".
T2 — Tests and regression coverage
id: BRIDGE-WP-0005-T02
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
Update tests/test_cli.py:
test_restart_calls_stop_then_start→ assert restart delegates to cleanup for reverse tunnels.- Add cases: healthy forward (no remote kill), stale forward (remote cleanup invoked), local-direction tunnel (no cleanup call).
- Reuse mocks from
tests/test_cleanup.pypatterns.
make test and make lint pass.
T3 — Operator docs and CLI help
id: BRIDGE-WP-0005-T03
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
Document the blank-slate restart contract:
wiki/OpsBridge.md— restart vs maintenance cleanup vs up/down.bridge restart --help— mention conditional remote stale-forward cleanup.- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS maintenance — matching this workplan's topology section.
- Cross-link from
state-hubSTATE-WP-0063/history/20260621-weekend-automation-assessment.mdincident note (one line each way).
T4 — Optional: reconnect-loop hygiene (stretch)
id: BRIDGE-WP-0005-T04
status: cancelled
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
Evaluate whether TunnelManager reconnect backoff should invoke remote cleanup
once after repeated exit-255 bind failures (laptop wake without operator running
bridge restart). Defer unless T1–T3 are done; mark cancel if heuristic risk
outweighs benefit.
Decision (2026-06-21): cancelled for now. Auto-cleanup inside the reconnect
loop risks killing a legitimately healthy orphan forward owned by another session
or operator. bridge restart now covers the operator-facing blank-slate path;
nightly maintenance cleanup --restart covers unattended hygiene. Revisit only if
wake-from-sleep reconnect failures remain frequent after a month of observation.
T5 — Live verification on workstation + VPS
id: BRIDGE-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
After T1–T2 ship, verify on real config:
- railiance01 —
state-hub-mcp-railiance01wasreconnectingwith stale forward;bridge restartreportedcleaned_and_restartedand tunnel reachedconnected. - haskelseed — not exercised (all tunnels already healthy); Alpine netstat path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
- coulombcore —
bridge restart state-hub-coulombcorereportedhealthy, PID unchanged (4116), forward undisturbed.
State Hub progress logged (2026-06-21). Workplan marked finished.