From 8c11acc00c9b67b4c91a5d6f6658c2ae5cfeda55 Mon Sep 17 00:00:00 2001 From: tegwick Date: Sun, 21 Jun 2026 20:02:18 +0200 Subject: [PATCH] docs(ops-bridge): BRIDGE-WP-0005 restart includes remote cleanup Add workplan to make bridge restart perform conditional stale-forward cleanup before start (blank-slate recovery). Refines topology for laptop workstation origin, intermittently offline haskelseed, and stable VPS remotes (coulombcore, railiance01). Origin: STATE-WP-0063 tunnel incident. Registered in State Hub via fix-consistency. --- ...WP-0005-restart-includes-remote-cleanup.md | 191 ++++++++++++++++++ 1 file changed, 191 insertions(+) create mode 100644 workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md diff --git a/workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md b/workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md new file mode 100644 index 0000000..1d422fb --- /dev/null +++ b/workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md @@ -0,0 +1,191 @@ +--- +id: BRIDGE-WP-0005 +type: workplan +title: "Restart includes remote cleanup (blank-slate recovery)" +domain: custodian +repo: ops-bridge +status: ready +owner: codex +topic_slug: custodian +created: "2026-06-21" +updated: "2026-06-21" +state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0" +--- + +# BRIDGE-WP-0005 — Restart includes remote cleanup + +**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan +`sshd` remote forward on Railiance01 port `18000` blocked +`bridge restart state-hub-railiance01` from producing a working tunnel. Operators +had to discover `bridge maintenance cleanup --restart` separately. + +**Operator expectation:** `bridge restart` should mean *operational again* — a +blank-slate recovery — not merely "cycle the local manager PID while a broken +remote listener still holds the port." + +## Topology and failure modes (refined) + +Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles. +Cleanup policy must respect all of them. + +### A. Workstation (laptop WSL) — tunnel **origin** + +The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on +remote hosts: + +| Remote host | Tunnels (reverse) | Role | +|-------------|-------------------|------| +| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot | +| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot | +| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved | + +**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill +local bridge processes without graceful remote SSH teardown. Orphan `sshd` +listeners on **all three remotes** are common after wake — especially +`18000`/`18001` on VPS hosts that activity-core and remote agents depend on. + +### B. Haskelseed — also intermittently offline + +Haskelseed is not a datacenter VPS; it may be powered down or unreachable on +different networks. The same orphan-forward pattern applies to its reverse ports +when the workstation-side tunnel dies uncleanly. + +### C. VPS remotes (coulombcore, railiance01) + +Normally always-on. Maintenance reboots clear remote kernel state, but: + +- a VPS reboot does **not** fix a workstation that is still in `reconnecting` + with a dead local SSH child; +- when the laptop returns, orphan forwards from the **previous** session may + still block new `-R` binds if the VPS did not reboot. + +**Conclusion:** conditional remote cleanup before restart benefits **all reverse +tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already +skips healthy forwards — VPS tunnels with live working forwards are untouched. + +### D. Local-direction tunnels — no remote cleanup + +`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use +forward mode from workstation to remote services. They do not bind remote reverse +ports for State Hub. **`restart` stays local stop/start only** for these. + +## Design (decided) + +| Command | Behaviour after this workplan | +|---------|-------------------------------| +| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. | +| `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". | +| `bridge up` | Out of scope here (see T4 optional follow-up). | + +Implementation sketch: replace the body of `cli.restart()` with a call to +`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels, +or per-tunnel `cleanup_tunnel` when a single tunnel is named. + +Emit the same action summary strings cleanup already uses (`healthy`, +`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran. + +## Out of scope + +- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false + positive during T2). +- Auto-cleanup inside the reconnect backoff loop (stretch — T4). +- Renaming tunnels or changing `tunnels.yaml` host entries. + +--- + +## T1 — Wire restart through cleanup path + +```task +id: BRIDGE-WP-0005-T01 +status: todo +priority: high +state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14" +``` + +Refactor `bridge/cli.py` `restart()` so reverse tunnels call +`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare +`TunnelManager.stop()` + `start()`. + +Requirements: + +- Single-tunnel and all-tunnel restart both work. +- Local-direction tunnels keep stop/start only. +- Exit codes: preserve today’s semantics where practical; exit non-zero if any + named tunnel ends in `CleanupAction.action == "error"`. +- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`, + etc.), not only "Restarted tunnel". + +## T2 — Tests and regression coverage + +```task +id: BRIDGE-WP-0005-T02 +status: todo +priority: high +state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af" +``` + +Update `tests/test_cli.py`: + +- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for + reverse tunnels. +- Add cases: healthy forward (no remote kill), stale forward (remote cleanup + invoked), local-direction tunnel (no cleanup call). +- Reuse mocks from `tests/test_cleanup.py` patterns. + +`make test` and `make lint` pass. + +## T3 — Operator docs and CLI help + +```task +id: BRIDGE-WP-0005-T03 +status: todo +priority: medium +state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c" +``` + +Document the blank-slate restart contract: + +- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down. +- `bridge restart --help` — mention conditional remote stale-forward cleanup. +- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS + maintenance — matching this workplan's topology section. +- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md` + incident note (one line each way). + +## T4 — Optional: reconnect-loop hygiene (stretch) + +```task +id: BRIDGE-WP-0005-T04 +status: todo +priority: low +state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b" +``` + +Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup +once after repeated exit-255 bind failures (laptop wake without operator running +`bridge restart`). Defer unless T1–T3 are done; mark `cancel` if heuristic risk +outweighs benefit. + +Done when documented decision: implement, defer, or cancel with reason. + +## T5 — Live verification on workstation + VPS + +```task +id: BRIDGE-WP-0005-T05 +status: todo +priority: medium +state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79" +``` + +After T1–T2 ship, verify on real config: + +1. **railiance01** — reproduce stale-forward scenario (or simulate); confirm + `bridge restart state-hub-railiance01` clears and connects without needing + the maintenance subcommand. +2. **haskelseed** — `bridge restart state-hub-haskelseed` after a manual + `bridge down` while remote port still listens (Alpine `netstat` path from + ADHOC-2026-06-14). +3. **coulombcore** — confirm healthy tunnel restart is a no-op remote cleanup + (`healthy` action) and does not disrupt a working forward. + +Log a State Hub progress note on workstream close. Mark workplan `finished`. \ No newline at end of file