Files

tegwick 10c6fdaec9 feat(restart): route reverse tunnels through stale-forward cleanup

bridge restart now means blank-slate recovery: reverse tunnels run
should_cleanup_tunnel and clear orphan remote listeners before reconnecting;
healthy forwards are left running. Local-direction tunnels keep stop/start
only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted,
restarted, error) and exit non-zero on cleanup failure.

Closes BRIDGE-WP-0005.

2026-06-21 20:12:13 +02:00

7.6 KiB

Raw Blame History

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id

id	type	title	domain	repo	status	owner	topic_slug	created	updated	state_hub_workstream_id
BRIDGE-WP-0005	workplan	Restart includes remote cleanup (blank-slate recovery)	custodian	ops-bridge	finished	codex	custodian	2026-06-21	2026-06-21	9565491f-e664-4add-bea4-27c4fb015ee0

BRIDGE-WP-0005 — Restart includes remote cleanup

Origin: STATE-WP-0063 weekend automation repair (2026-06-21). A stale orphan sshd remote forward on Railiance01 port 18000 blocked bridge restart state-hub-railiance01 from producing a working tunnel. Operators had to discover bridge maintenance cleanup <tunnel> --restart separately.

Operator expectation: bridge restart should mean operational again — a blank-slate recovery — not merely "cycle the local manager PID while a broken remote listener still holds the port."

Topology and failure modes (refined)

Tunnels in ~/.config/bridge/tunnels.yaml serve three distinct host roles. Cleanup policy must respect all of them.

A. Workstation (laptop WSL) — tunnel origin

The State Hub API runs locally (127.0.0.1:8000). Reverse tunnels expose it on remote hosts:

Remote host	Tunnels (reverse)	Role
coulombcore (`92.205.130.254`)	`state-hub-coulombcore`, `state-hub-mcp-coulombcore`	VPS — stable, occasional maintenance reboot
railiance01 (`92.205.62.239`)	`state-hub-railiance01`, `state-hub-mcp-railiance01`	VPS — stable, occasional maintenance reboot
haskelseed (`192.168.178.135`)	`state-hub-haskelseed`, `state-hub-mcp-haskelseed`	LAN builder — may sleep/reboot when moved

Laptop behaviour: shutdown, sleep, and location changes (home ↔ office) kill local bridge processes without graceful remote SSH teardown. Orphan sshd listeners on all three remotes are common after wake — especially 18000/18001 on VPS hosts that activity-core and remote agents depend on.

B. Haskelseed — also intermittently offline

Haskelseed is not a datacenter VPS; it may be powered down or unreachable on different networks. The same orphan-forward pattern applies to its reverse ports when the workstation-side tunnel dies uncleanly.

C. VPS remotes (coulombcore, railiance01)

Normally always-on. Maintenance reboots clear remote kernel state, but:

a VPS reboot does not fix a workstation that is still in reconnecting with a dead local SSH child;
when the laptop returns, orphan forwards from the previous session may still block new -R binds if the VPS did not reboot.

Conclusion: conditional remote cleanup before restart benefits all reverse tunnels, not only laptop-adjacent hosts. should_cleanup_tunnel() already skips healthy forwards — VPS tunnels with live working forwards are untouched.

D. Local-direction tunnels — no remote cleanup

direction: local tunnels (k3s-api-coulombcore, nix-daemon-haskelseed) use forward mode from workstation to remote services. They do not bind remote reverse ports for State Hub. restart stays local stop/start only for these.

Design (decided)

Command	Behaviour after this workplan
`bridge restart [tunnel]`	For each reverse tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For local tunnels: existing `stop()` + `start()`.
`bridge maintenance cleanup`	Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart".
`bridge up`	Out of scope here (see T4 optional follow-up).

Implementation sketch: replace the body of cli.restart() with a call to cleanup_all_tunnels(..., restart=True, tunnel_name=...) for reverse tunnels, or per-tunnel cleanup_tunnel when a single tunnel is named.

Emit the same action summary strings cleanup already uses (healthy, cleaned_and_restarted, error) so operators see whether remote hygiene ran.

Out of scope

Changing should_cleanup_tunnel heuristics (unless tests expose a VPS false positive during T2).
Auto-cleanup inside the reconnect backoff loop (stretch — T4).
Renaming tunnels or changing tunnels.yaml host entries.

T1 — Wire restart through cleanup path

id: BRIDGE-WP-0005-T01
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"

Refactor bridge/cli.py restart() so reverse tunnels call cleanup_tunnel(cfg, state_mgr, restart=True) instead of bare TunnelManager.stop() + start().

Requirements:

Single-tunnel and all-tunnel restart both work.
Local-direction tunnels keep stop/start only.
Exit codes: preserve today’s semantics where practical; exit non-zero if any named tunnel ends in CleanupAction.action == "error".
Stdout tells the operator what happened (healthy, cleaned_and_restarted, etc.), not only "Restarted tunnel".

T2 — Tests and regression coverage

id: BRIDGE-WP-0005-T02
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"

Update tests/test_cli.py:

test_restart_calls_stop_then_start → assert restart delegates to cleanup for reverse tunnels.
Add cases: healthy forward (no remote kill), stale forward (remote cleanup invoked), local-direction tunnel (no cleanup call).
Reuse mocks from tests/test_cleanup.py patterns.

make test and make lint pass.

T3 — Operator docs and CLI help

id: BRIDGE-WP-0005-T03
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"

Document the blank-slate restart contract:

wiki/OpsBridge.md — restart vs maintenance cleanup vs up/down.
bridge restart --help — mention conditional remote stale-forward cleanup.
Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS maintenance — matching this workplan's topology section.
Cross-link from state-hub STATE-WP-0063 / history/20260621-weekend-automation-assessment.md incident note (one line each way).

T4 — Optional: reconnect-loop hygiene (stretch)

id: BRIDGE-WP-0005-T04
status: cancelled
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"

Evaluate whether TunnelManager reconnect backoff should invoke remote cleanup once after repeated exit-255 bind failures (laptop wake without operator running bridge restart). Defer unless T1–T3 are done; mark cancel if heuristic risk outweighs benefit.

Decision (2026-06-21): cancelled for now. Auto-cleanup inside the reconnect loop risks killing a legitimately healthy orphan forward owned by another session or operator. bridge restart now covers the operator-facing blank-slate path; nightly maintenance cleanup --restart covers unattended hygiene. Revisit only if wake-from-sleep reconnect failures remain frequent after a month of observation.

T5 — Live verification on workstation + VPS

id: BRIDGE-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"

After T1–T2 ship, verify on real config:

railiance01 — state-hub-mcp-railiance01 was reconnecting with stale forward; bridge restart reported cleaned_and_restarted and tunnel reached connected.
haskelseed — not exercised (all tunnels already healthy); Alpine netstat path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
coulombcore — bridge restart state-hub-coulombcore reported healthy, PID unchanged (4116), forward undisturbed.

State Hub progress logged (2026-06-21). Workplan marked finished.

7.6 KiB Raw Blame History Unescape Escape