docs(ops-bridge): BRIDGE-WP-0005 restart includes remote cleanup
Add workplan to make bridge restart perform conditional stale-forward cleanup before start (blank-slate recovery). Refines topology for laptop workstation origin, intermittently offline haskelseed, and stable VPS remotes (coulombcore, railiance01). Origin: STATE-WP-0063 tunnel incident. Registered in State Hub via fix-consistency.
workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md
@@ -0,0 +1,191 @@
---
id: BRIDGE-WP-0005
type: workplan
title: "Restart includes remote cleanup (blank-slate recovery)"
domain: custodian
repo: ops-bridge
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0"
---

# BRIDGE-WP-0005 — Restart includes remote cleanup

**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan
`sshd` remote forward on Railiance01 port `18000` blocked
`bridge restart state-hub-railiance01` from producing a working tunnel. Operators
had to discover `bridge maintenance cleanup <tunnel> --restart` separately.

**Operator expectation:** `bridge restart` should mean *operational again* — a
blank-slate recovery — not merely "cycle the local manager PID while a broken
remote listener still holds the port."

## Topology and failure modes (refined)

Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles.
Cleanup policy must respect all of them.

### A. Workstation (laptop WSL) — tunnel **origin**

The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on
remote hosts:

| Remote host | Tunnels (reverse) | Role |
|-------------|-------------------|------|
| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot |
| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot |
| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved |

**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill
local bridge processes without graceful remote SSH teardown. Orphan `sshd`
listeners on **all three remotes** are common after wake — especially
`18000`/`18001` on VPS hosts that activity-core and remote agents depend on.

### B. Haskelseed — also intermittently offline

Haskelseed is not a datacenter VPS; it may be powered down or unreachable on
different networks. The same orphan-forward pattern applies to its reverse ports
when the workstation-side tunnel dies uncleanly.

### C. VPS remotes (coulombcore, railiance01)

Normally always-on. Maintenance reboots clear remote kernel state, but:

- a VPS reboot does **not** fix a workstation that is still in `reconnecting`
  with a dead local SSH child;
- when the laptop returns, orphan forwards from the **previous** session may
  still block new `-R` binds if the VPS did not reboot.

**Conclusion:** conditional remote cleanup before restart benefits **all reverse
tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already
skips healthy forwards — VPS tunnels with live working forwards are untouched.

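The blocking behaviour described above is ordinary port-conflict semantics: while one process holds a listening socket on a port, a second bind to the same address fails until the first listener is gone. The sketch below reproduces that locally with plain Python sockets. It is an illustration only, not the bridge code or `sshd`; `try_bind` is a hypothetical helper and the port is picked by the OS.

```python
import errno
import socket
from typing import Optional


def try_bind(host: str, port: int) -> Optional[socket.socket]:
    """Try to listen on (host, port); return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen(1)
        return sock
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise


# Stand-in for the orphan listener left behind by the previous session.
orphan = try_bind("127.0.0.1", 0)  # port 0 lets the OS pick a free port
assert orphan is not None
port = orphan.getsockname()[1]

# A new bind on the same port fails while the orphan is still listening.
blocked = try_bind("127.0.0.1", port)

# Once the orphan is closed (the "cleanup" step), the same bind succeeds.
orphan.close()
fresh = try_bind("127.0.0.1", port)
if fresh is not None:
    fresh.close()
```

In the real incident the listener belongs to a remote `sshd` process, so clearing it requires action on the remote host rather than a local `close()`.
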
### D. Local-direction tunnels — no remote cleanup

`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use
forward mode from workstation to remote services. They do not bind remote reverse
ports for State Hub. **`restart` stays local stop/start only** for these.

## Design (decided)

| Command | Behaviour after this workplan |
|---------|-------------------------------|
| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. |
| `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". |
| `bridge up` | Out of scope here (see T4 optional follow-up). |

Implementation sketch: replace the body of `cli.restart()` with a call to
`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels,
or per-tunnel `cleanup_tunnel` when a single tunnel is named.

Emit the same action summary strings cleanup already uses (`healthy`,
`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran.

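As a rough sketch of the dispatch in the table above, the restart path could branch on tunnel direction as shown here. `TunnelConfig`, `restart_tunnels`, the injected `cleanup`/`stop`/`start` callables, and the `"restarted"` label for local tunnels are all hypothetical stand-ins; the real signatures in `bridge/cli.py` and the cleanup module may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TunnelConfig:
    """Hypothetical stand-in for one tunnels.yaml entry."""
    name: str
    direction: str  # "reverse" or "local"


def restart_tunnels(
    tunnels: Iterable[TunnelConfig],
    cleanup: Callable[[TunnelConfig], str],
    stop: Callable[[TunnelConfig], None],
    start: Callable[[TunnelConfig], None],
) -> Dict[str, str]:
    """Restart each tunnel and return an action string per tunnel name."""
    actions: Dict[str, str] = {}
    for cfg in tunnels:
        if cfg.direction == "reverse":
            # Assumed to wrap cleanup_tunnel(..., restart=True): conditional
            # remote cleanup followed by start.
            actions[cfg.name] = cleanup(cfg)
        else:
            # Local-direction tunnels keep the plain stop/start cycle.
            stop(cfg)
            start(cfg)
            actions[cfg.name] = "restarted"  # hypothetical label
    return actions


# Usage with fakes in place of the real cleanup and TunnelManager calls.
calls: List[Tuple[str, str]] = []
result = restart_tunnels(
    [
        TunnelConfig("state-hub-railiance01", "reverse"),
        TunnelConfig("k3s-api-coulombcore", "local"),
    ],
    cleanup=lambda cfg: "cleaned_and_restarted",
    stop=lambda cfg: calls.append(("stop", cfg.name)),
    start=lambda cfg: calls.append(("start", cfg.name)),
)
```

Injecting the callables keeps the branch logic testable without SSH access; whether the real code uses injection or direct calls is a separate design choice.
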
## Out of scope

- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false
  positive during T2).
- Auto-cleanup inside the reconnect backoff loop (stretch — T4).
- Renaming tunnels or changing `tunnels.yaml` host entries.

---

## T1 — Wire restart through cleanup path

```task
id: BRIDGE-WP-0005-T01
status: todo
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
```

Refactor `bridge/cli.py` `restart()` so reverse tunnels call
`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare
`TunnelManager.stop()` + `start()`.

Requirements:

- Single-tunnel and all-tunnel restart both work.
- Local-direction tunnels keep stop/start only.
- Exit codes: preserve today’s semantics where practical; exit non-zero if any
  named tunnel ends in `CleanupAction.action == "error"`.
- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`,
  etc.), not only "Restarted tunnel".

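The exit-code and stdout requirements above could be met by a small summary step like the one sketched here. The `CleanupAction` dataclass is a hypothetical shape (only its `action` field is named in this workplan), and `summarize` and the line format are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CleanupAction:
    """Hypothetical shape; only the `action` field is named in the workplan."""
    tunnel: str
    action: str  # e.g. "healthy", "cleaned_and_restarted", "error"


def summarize(results: List[CleanupAction]) -> Tuple[List[str], int]:
    """Build one stdout line per tunnel and pick the process exit code."""
    lines = [f"{r.tunnel}: {r.action}" for r in results]
    # Non-zero exit if any tunnel ended in the "error" action.
    exit_code = 1 if any(r.action == "error" for r in results) else 0
    return lines, exit_code


ok_lines, ok_code = summarize([
    CleanupAction("state-hub-coulombcore", "healthy"),
    CleanupAction("state-hub-railiance01", "cleaned_and_restarted"),
])
_, bad_code = summarize([CleanupAction("state-hub-haskelseed", "error")])
```
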
## T2 — Tests and regression coverage

```task
id: BRIDGE-WP-0005-T02
status: todo
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
```

Update `tests/test_cli.py`:

- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for
  reverse tunnels.
- Add cases: healthy forward (no remote kill), stale forward (remote cleanup
  invoked), local-direction tunnel (no cleanup call).
- Reuse mocks from `tests/test_cleanup.py` patterns.

`make test` and `make lint` pass.

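A possible shape for the new test cases, using `unittest.mock` from the standard library. `restart_one` is a minimal hypothetical stand-in for the restart path under test, not the real `cli.restart()`, and the existing mocks in `tests/test_cleanup.py` may look different.

```python
from unittest.mock import Mock, call


def restart_one(direction, cleanup, manager):
    """Minimal stand-in for the restart path under test (hypothetical)."""
    if direction == "reverse":
        return cleanup(restart=True)
    manager.stop()
    manager.start()
    return "restarted"


def test_reverse_restart_delegates_to_cleanup():
    cleanup, manager = Mock(return_value="healthy"), Mock()
    assert restart_one("reverse", cleanup, manager) == "healthy"
    cleanup.assert_called_once_with(restart=True)
    # The bare stop/start cycle must not run for reverse tunnels.
    manager.stop.assert_not_called()
    manager.start.assert_not_called()


def test_local_restart_skips_cleanup():
    cleanup, manager = Mock(), Mock()
    assert restart_one("local", cleanup, manager) == "restarted"
    cleanup.assert_not_called()
    # Local tunnels keep stop followed by start, in that order.
    assert manager.method_calls == [call.stop(), call.start()]
```

The healthy and stale cases would differ only in what the mocked cleanup returns (`healthy` versus `cleaned_and_restarted`) and in whether the mocked remote-kill step is asserted as called.
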
## T3 — Operator docs and CLI help

```task
id: BRIDGE-WP-0005-T03
status: todo
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
```

Document the blank-slate restart contract:

- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down.
- `bridge restart --help` — mention conditional remote stale-forward cleanup.
- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS
  maintenance — matching this workplan's topology section.
- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md`
  incident note (one line each way).

## T4 — Optional: reconnect-loop hygiene (stretch)

```task
id: BRIDGE-WP-0005-T04
status: todo
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
```

Evaluate whether the `TunnelManager` reconnect backoff should invoke remote cleanup
once after repeated exit-255 bind failures (the laptop wakes and no operator runs
`bridge restart`). Defer until T1–T3 are done; mark `cancel` if the heuristic risk
outweighs the benefit.

Done when the decision is documented: implement, defer, or cancel with a reason.

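To make the evaluation concrete, one candidate policy is sketched below: count consecutive exit-255 results and request remote cleanup at most once per incident. `BindFailureTracker`, its method names, and the threshold of 3 are hypothetical and not part of the current `TunnelManager`. Note that `ssh` exits with 255 for errors in general, not only bind failures, which is part of the heuristic risk this task is meant to weigh.

```python
from dataclasses import dataclass


@dataclass
class BindFailureTracker:
    """Hypothetical helper: decide when the reconnect loop should try remote cleanup."""
    threshold: int = 3  # assumed value, not taken from the workplan
    consecutive_255: int = 0
    cleanup_attempted: bool = False

    def record_exit(self, exit_code: int) -> bool:
        """Record an SSH child exit; return True when cleanup should run now."""
        if exit_code != 255:
            self.consecutive_255 = 0
            return False
        self.consecutive_255 += 1
        if self.consecutive_255 >= self.threshold and not self.cleanup_attempted:
            # Fire only once so a persistent failure cannot trigger a cleanup storm.
            self.cleanup_attempted = True
            return True
        return False

    def record_connected(self) -> None:
        """Reset after a successful connection so a later incident can trigger again."""
        self.consecutive_255 = 0
        self.cleanup_attempted = False


tracker = BindFailureTracker(threshold=3)
decisions = [tracker.record_exit(255) for _ in range(5)]
```
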
## T5 — Live verification on workstation + VPS

```task
id: BRIDGE-WP-0005-T05
status: todo
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
```

After T1–T2 ship, verify on real config:

1. **railiance01** — reproduce stale-forward scenario (or simulate); confirm
   `bridge restart state-hub-railiance01` clears and connects without needing
   the maintenance subcommand.
2. **haskelseed** — `bridge restart state-hub-haskelseed` after a manual
   `bridge down` while remote port still listens (Alpine `netstat` path from
   ADHOC-2026-06-14).
3. **coulombcore** — confirm that restarting a healthy tunnel performs no remote
   cleanup (`healthy` action) and does not disrupt a working forward.

Log a State Hub progress note on workstream close, then mark the workplan `finished`.