Files
ops-bridge/workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md
tegwick 10c6fdaec9 feat(restart): route reverse tunnels through stale-forward cleanup
bridge restart now means blank-slate recovery: reverse tunnels run
should_cleanup_tunnel and clear orphan remote listeners before reconnecting;
healthy forwards are left running. Local-direction tunnels keep stop/start
only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted,
restarted, error) and exit non-zero on cleanup failure.

Closes BRIDGE-WP-0005.
2026-06-21 20:12:13 +02:00

7.6 KiB
Raw Blame History

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id
id type title domain repo status owner topic_slug created updated state_hub_workstream_id
BRIDGE-WP-0005 workplan Restart includes remote cleanup (blank-slate recovery) custodian ops-bridge finished codex custodian 2026-06-21 2026-06-21 9565491f-e664-4add-bea4-27c4fb015ee0

BRIDGE-WP-0005 — Restart includes remote cleanup

Origin: STATE-WP-0063 weekend automation repair (2026-06-21). A stale orphan sshd remote forward on Railiance01 port 18000 blocked bridge restart state-hub-railiance01 from producing a working tunnel. Operators had to discover bridge maintenance cleanup <tunnel> --restart separately.

Operator expectation: bridge restart should mean operational again — a blank-slate recovery — not merely "cycle the local manager PID while a broken remote listener still holds the port."

Topology and failure modes (refined)

Tunnels in ~/.config/bridge/tunnels.yaml serve three distinct host roles. Cleanup policy must respect all of them.

A. Workstation (laptop WSL) — tunnel origin

The State Hub API runs locally (127.0.0.1:8000). Reverse tunnels expose it on remote hosts:

Remote host Tunnels (reverse) Role
coulombcore (92.205.130.254) state-hub-coulombcore, state-hub-mcp-coulombcore VPS — stable, occasional maintenance reboot
railiance01 (92.205.62.239) state-hub-railiance01, state-hub-mcp-railiance01 VPS — stable, occasional maintenance reboot
haskelseed (192.168.178.135) state-hub-haskelseed, state-hub-mcp-haskelseed LAN builder — may sleep/reboot when moved

Laptop behaviour: shutdown, sleep, and location changes (home ↔ office) kill local bridge processes without graceful remote SSH teardown. Orphan sshd listeners on all three remotes are common after wake — especially 18000/18001 on VPS hosts that activity-core and remote agents depend on.

B. Haskelseed — also intermittently offline

Haskelseed is not a datacenter VPS; it may be powered down or unreachable on different networks. The same orphan-forward pattern applies to its reverse ports when the workstation-side tunnel dies uncleanly.

C. VPS remotes (coulombcore, railiance01)

Normally always-on. Maintenance reboots clear remote kernel state, but:

  • a VPS reboot does not fix a workstation that is still in reconnecting with a dead local SSH child;
  • when the laptop returns, orphan forwards from the previous session may still block new -R binds if the VPS did not reboot.

Conclusion: conditional remote cleanup before restart benefits all reverse tunnels, not only laptop-adjacent hosts. should_cleanup_tunnel() already skips healthy forwards — VPS tunnels with live working forwards are untouched.

D. Local-direction tunnels — no remote cleanup

direction: local tunnels (k3s-api-coulombcore, nix-daemon-haskelseed) use forward mode from workstation to remote services. They do not bind remote reverse ports for State Hub. restart stays local stop/start only for these.

Design (decided)

Command Behaviour after this workplan
bridge restart [tunnel] For each reverse tunnel: cleanup_tunnel(..., restart=True) — run should_cleanup_tunnel; clear stale remote listener if needed; then start. For local tunnels: existing stop() + start().
bridge maintenance cleanup Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart".
bridge up Out of scope here (see T4 optional follow-up).

Implementation sketch: replace the body of cli.restart() with a call to cleanup_all_tunnels(..., restart=True, tunnel_name=...) for reverse tunnels, or per-tunnel cleanup_tunnel when a single tunnel is named.

Emit the same action summary strings cleanup already uses (healthy, cleaned_and_restarted, error) so operators see whether remote hygiene ran.

Out of scope

  • Changing should_cleanup_tunnel heuristics (unless tests expose a VPS false positive during T2).
  • Auto-cleanup inside the reconnect backoff loop (stretch — T4).
  • Renaming tunnels or changing tunnels.yaml host entries.

T1 — Wire restart through cleanup path

id: BRIDGE-WP-0005-T01
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"

Refactor bridge/cli.py restart() so reverse tunnels call cleanup_tunnel(cfg, state_mgr, restart=True) instead of bare TunnelManager.stop() + start().

Requirements:

  • Single-tunnel and all-tunnel restart both work.
  • Local-direction tunnels keep stop/start only.
  • Exit codes: preserve todays semantics where practical; exit non-zero if any named tunnel ends in CleanupAction.action == "error".
  • Stdout tells the operator what happened (healthy, cleaned_and_restarted, etc.), not only "Restarted tunnel".

T2 — Tests and regression coverage

id: BRIDGE-WP-0005-T02
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"

Update tests/test_cli.py:

  • test_restart_calls_stop_then_start → assert restart delegates to cleanup for reverse tunnels.
  • Add cases: healthy forward (no remote kill), stale forward (remote cleanup invoked), local-direction tunnel (no cleanup call).
  • Reuse mocks from tests/test_cleanup.py patterns.

make test and make lint pass.

T3 — Operator docs and CLI help

id: BRIDGE-WP-0005-T03
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"

Document the blank-slate restart contract:

  • wiki/OpsBridge.md — restart vs maintenance cleanup vs up/down.
  • bridge restart --help — mention conditional remote stale-forward cleanup.
  • Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS maintenance — matching this workplan's topology section.
  • Cross-link from state-hub STATE-WP-0063 / history/20260621-weekend-automation-assessment.md incident note (one line each way).

T4 — Optional: reconnect-loop hygiene (stretch)

id: BRIDGE-WP-0005-T04
status: cancelled
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"

Evaluate whether TunnelManager reconnect backoff should invoke remote cleanup once after repeated exit-255 bind failures (laptop wake without operator running bridge restart). Defer unless T1T3 are done; mark cancel if heuristic risk outweighs benefit.

Decision (2026-06-21): cancelled for now. Auto-cleanup inside the reconnect loop risks killing a legitimately healthy orphan forward owned by another session or operator. bridge restart now covers the operator-facing blank-slate path; nightly maintenance cleanup --restart covers unattended hygiene. Revisit only if wake-from-sleep reconnect failures remain frequent after a month of observation.

T5 — Live verification on workstation + VPS

id: BRIDGE-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"

After T1T2 ship, verify on real config:

  1. railiance01state-hub-mcp-railiance01 was reconnecting with stale forward; bridge restart reported cleaned_and_restarted and tunnel reached connected.
  2. haskelseed — not exercised (all tunnels already healthy); Alpine netstat path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
  3. coulombcorebridge restart state-hub-coulombcore reported healthy, PID unchanged (4116), forward undisturbed.

State Hub progress logged (2026-06-21). Workplan marked finished.