feat(restart): route reverse tunnels through stale-forward cleanup

bridge restart now means blank-slate recovery: reverse tunnels run
should_cleanup_tunnel and clear orphan remote listeners before reconnecting;
healthy forwards are left running. Local-direction tunnels keep stop/start
only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted,
restarted, error) and exit non-zero on cleanup failure.

Closes BRIDGE-WP-0005.
This commit is contained in:
2026-06-21 20:12:13 +02:00
parent 8c11acc00c
commit 10c6fdaec9
8 changed files with 220 additions and 60 deletions

View File

@@ -4,7 +4,7 @@ type: workplan
title: "Restart includes remote cleanup (blank-slate recovery)"
domain: custodian
repo: ops-bridge
status: ready
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
@@ -97,7 +97,7 @@ Emit the same action summary strings cleanup already uses (`healthy`,
```task
id: BRIDGE-WP-0005-T01
status: todo
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
```
@@ -119,7 +119,7 @@ Requirements:
```task
id: BRIDGE-WP-0005-T02
status: todo
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
```
@@ -138,7 +138,7 @@ Update `tests/test_cli.py`:
```task
id: BRIDGE-WP-0005-T03
status: todo
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
```
@@ -156,7 +156,7 @@ Document the blank-slate restart contract:
```task
id: BRIDGE-WP-0005-T04
status: todo
status: cancelled
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
```
@@ -166,26 +166,29 @@ once after repeated exit-255 bind failures (laptop wake without operator running
`bridge restart`). Defer unless T1T3 are done; mark `cancel` if heuristic risk
outweighs benefit.
Done when documented decision: implement, defer, or cancel with reason.
**Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect
loop risks killing a legitimately healthy orphan forward owned by another session
or operator. `bridge restart` now covers the operator-facing blank-slate path;
nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if
wake-from-sleep reconnect failures remain frequent after a month of observation.
## T5 — Live verification on workstation + VPS
```task
id: BRIDGE-WP-0005-T05
status: todo
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
```
After T1T2 ship, verify on real config:
1. **railiance01**reproduce stale-forward scenario (or simulate); confirm
`bridge restart state-hub-railiance01` clears and connects without needing
the maintenance subcommand.
2. **haskelseed**`bridge restart state-hub-haskelseed` after a manual
`bridge down` while remote port still listens (Alpine `netstat` path from
ADHOC-2026-06-14).
3. **coulombcore** — confirm healthy tunnel restart is a no-op remote cleanup
(`healthy` action) and does not disrupt a working forward.
1. **railiance01**`state-hub-mcp-railiance01` was `reconnecting` with stale
forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached
`connected`.
2. **haskelseed**not exercised (all tunnels already healthy); Alpine netstat
path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
3. **coulombcore**`bridge restart state-hub-coulombcore` reported `healthy`,
PID unchanged (4116), forward undisturbed.
Log a State Hub progress note on workstream close. Mark workplan `finished`.
State Hub progress logged (2026-06-21). Workplan marked `finished`.