--- id: BRIDGE-WP-0005 type: workplan title: "Restart includes remote cleanup (blank-slate recovery)" domain: infotech repo: ops-bridge status: finished owner: codex topic_slug: custodian created: "2026-06-21" updated: "2026-06-21" state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0" --- # BRIDGE-WP-0005 — Restart includes remote cleanup **Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan `sshd` remote forward on Railiance01 port `18000` blocked `bridge restart state-hub-railiance01` from producing a working tunnel. Operators had to discover `bridge maintenance cleanup --restart` separately. **Operator expectation:** `bridge restart` should mean *operational again* — a blank-slate recovery — not merely "cycle the local manager PID while a broken remote listener still holds the port." ## Topology and failure modes (refined) Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles. Cleanup policy must respect all of them. ### A. Workstation (laptop WSL) — tunnel **origin** The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on remote hosts: | Remote host | Tunnels (reverse) | Role | |-------------|-------------------|------| | **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot | | **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot | | **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved | **Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill local bridge processes without graceful remote SSH teardown. Orphan `sshd` listeners on **all three remotes** are common after wake — especially `18000`/`18001` on VPS hosts that activity-core and remote agents depend on. ### B. Haskelseed — also intermittently offline Haskelseed is not a datacenter VPS; it may be powered down or unreachable on different networks. The same orphan-forward pattern applies to its reverse ports when the workstation-side tunnel dies uncleanly. ### C. VPS remotes (coulombcore, railiance01) Normally always-on. Maintenance reboots clear remote kernel state, but: - a VPS reboot does **not** fix a workstation that is still in `reconnecting` with a dead local SSH child; - when the laptop returns, orphan forwards from the **previous** session may still block new `-R` binds if the VPS did not reboot. **Conclusion:** conditional remote cleanup before restart benefits **all reverse tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already skips healthy forwards — VPS tunnels with live working forwards are untouched. ### D. Local-direction tunnels — no remote cleanup `direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use forward mode from workstation to remote services. They do not bind remote reverse ports for State Hub. **`restart` stays local stop/start only** for these. ## Design (decided) | Command | Behaviour after this workplan | |---------|-------------------------------| | `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. | | `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". | | `bridge up` | Out of scope here (see T4 optional follow-up). | Implementation sketch: replace the body of `cli.restart()` with a call to `cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels, or per-tunnel `cleanup_tunnel` when a single tunnel is named. Emit the same action summary strings cleanup already uses (`healthy`, `cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran. ## Out of scope - Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false positive during T2). - Auto-cleanup inside the reconnect backoff loop (stretch — T4). - Renaming tunnels or changing `tunnels.yaml` host entries. --- ## T1 — Wire restart through cleanup path ```task id: BRIDGE-WP-0005-T01 status: done priority: high state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14" ``` Refactor `bridge/cli.py` `restart()` so reverse tunnels call `cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare `TunnelManager.stop()` + `start()`. Requirements: - Single-tunnel and all-tunnel restart both work. - Local-direction tunnels keep stop/start only. - Exit codes: preserve today’s semantics where practical; exit non-zero if any named tunnel ends in `CleanupAction.action == "error"`. - Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`, etc.), not only "Restarted tunnel". ## T2 — Tests and regression coverage ```task id: BRIDGE-WP-0005-T02 status: done priority: high state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af" ``` Update `tests/test_cli.py`: - `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for reverse tunnels. - Add cases: healthy forward (no remote kill), stale forward (remote cleanup invoked), local-direction tunnel (no cleanup call). - Reuse mocks from `tests/test_cleanup.py` patterns. `make test` and `make lint` pass. ## T3 — Operator docs and CLI help ```task id: BRIDGE-WP-0005-T03 status: done priority: medium state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c" ``` Document the blank-slate restart contract: - `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down. - `bridge restart --help` — mention conditional remote stale-forward cleanup. - Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS maintenance — matching this workplan's topology section. - Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md` incident note (one line each way). ## T4 — Optional: reconnect-loop hygiene (stretch) ```task id: BRIDGE-WP-0005-T04 status: cancel priority: low state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b" ``` Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup once after repeated exit-255 bind failures (laptop wake without operator running `bridge restart`). Defer unless T1–T3 are done; mark `cancel` if heuristic risk outweighs benefit. **Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect loop risks killing a legitimately healthy orphan forward owned by another session or operator. `bridge restart` now covers the operator-facing blank-slate path; nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if wake-from-sleep reconnect failures remain frequent after a month of observation. ## T5 — Live verification on workstation + VPS ```task id: BRIDGE-WP-0005-T05 status: done priority: medium state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79" ``` After T1–T2 ship, verify on real config: 1. **railiance01** — `state-hub-mcp-railiance01` was `reconnecting` with stale forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached `connected`. 2. **haskelseed** — not exercised (all tunnels already healthy); Alpine netstat path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests. 3. **coulombcore** — `bridge restart state-hub-coulombcore` reported `healthy`, PID unchanged (4116), forward undisturbed. State Hub progress logged (2026-06-21). Workplan marked `finished`.