Files
ops-bridge/workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md
tegwick 00671f5133 Normalize agent instructions and workplan frontmatter (STATE-WP-0067)
- Align agent files with on-disk workplan prefixes (infer from workplan ids)
- Set workplan domain to registered domain_slug; add topic_slug where applicable
- Repair frontmatter delimiter formatting; migrate legacy task status literals
- Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates
2026-06-22 23:16:27 +02:00

194 lines
7.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
id: BRIDGE-WP-0005
type: workplan
title: "Restart includes remote cleanup (blank-slate recovery)"
domain: infotech
repo: ops-bridge
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0"
---
# BRIDGE-WP-0005 — Restart includes remote cleanup
**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan
`sshd` remote forward on Railiance01 port `18000` blocked
`bridge restart state-hub-railiance01` from producing a working tunnel. Operators
had to discover `bridge maintenance cleanup <tunnel> --restart` separately.
**Operator expectation:** `bridge restart` should mean *operational again* — a
blank-slate recovery — not merely "cycle the local manager PID while a broken
remote listener still holds the port."
## Topology and failure modes (refined)
Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles.
Cleanup policy must respect all of them.
### A. Workstation (laptop WSL) — tunnel **origin**
The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on
remote hosts:
| Remote host | Tunnels (reverse) | Role |
|-------------|-------------------|------|
| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot |
| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot |
| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved |
**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill
local bridge processes without graceful remote SSH teardown. Orphan `sshd`
listeners on **all three remotes** are common after wake — especially
`18000`/`18001` on VPS hosts that activity-core and remote agents depend on.
### B. Haskelseed — also intermittently offline
Haskelseed is not a datacenter VPS; it may be powered down or unreachable on
different networks. The same orphan-forward pattern applies to its reverse ports
when the workstation-side tunnel dies uncleanly.
### C. VPS remotes (coulombcore, railiance01)
Normally always-on. Maintenance reboots clear remote kernel state, but:
- a VPS reboot does **not** fix a workstation that is still in `reconnecting`
with a dead local SSH child;
- when the laptop returns, orphan forwards from the **previous** session may
still block new `-R` binds if the VPS did not reboot.
**Conclusion:** conditional remote cleanup before restart benefits **all reverse
tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already
skips healthy forwards — VPS tunnels with live working forwards are untouched.
### D. Local-direction tunnels — no remote cleanup
`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use
forward mode from workstation to remote services. They do not bind remote reverse
ports for State Hub. **`restart` stays local stop/start only** for these.
## Design (decided)
| Command | Behaviour after this workplan |
|---------|-------------------------------|
| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. |
| `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". |
| `bridge up` | Out of scope here (see T4 optional follow-up). |
Implementation sketch: replace the body of `cli.restart()` with a call to
`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels,
or per-tunnel `cleanup_tunnel` when a single tunnel is named.
Emit the same action summary strings cleanup already uses (`healthy`,
`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran.
## Out of scope
- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false
positive during T2).
- Auto-cleanup inside the reconnect backoff loop (stretch — T4).
- Renaming tunnels or changing `tunnels.yaml` host entries.
---
## T1 — Wire restart through cleanup path
```task
id: BRIDGE-WP-0005-T01
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
```
Refactor `bridge/cli.py` `restart()` so reverse tunnels call
`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare
`TunnelManager.stop()` + `start()`.
Requirements:
- Single-tunnel and all-tunnel restart both work.
- Local-direction tunnels keep stop/start only.
- Exit codes: preserve todays semantics where practical; exit non-zero if any
named tunnel ends in `CleanupAction.action == "error"`.
- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`,
etc.), not only "Restarted tunnel".
## T2 — Tests and regression coverage
```task
id: BRIDGE-WP-0005-T02
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
```
Update `tests/test_cli.py`:
- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for
reverse tunnels.
- Add cases: healthy forward (no remote kill), stale forward (remote cleanup
invoked), local-direction tunnel (no cleanup call).
- Reuse mocks from `tests/test_cleanup.py` patterns.
`make test` and `make lint` pass.
## T3 — Operator docs and CLI help
```task
id: BRIDGE-WP-0005-T03
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
```
Document the blank-slate restart contract:
- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down.
- `bridge restart --help` — mention conditional remote stale-forward cleanup.
- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS
maintenance — matching this workplan's topology section.
- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md`
incident note (one line each way).
## T4 — Optional: reconnect-loop hygiene (stretch)
```task
id: BRIDGE-WP-0005-T04
status: cancel
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
```
Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup
once after repeated exit-255 bind failures (laptop wake without operator running
`bridge restart`). Defer unless T1T3 are done; mark `cancel` if heuristic risk
outweighs benefit.
**Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect
loop risks killing a legitimately healthy orphan forward owned by another session
or operator. `bridge restart` now covers the operator-facing blank-slate path;
nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if
wake-from-sleep reconnect failures remain frequent after a month of observation.
## T5 — Live verification on workstation + VPS
```task
id: BRIDGE-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
```
After T1T2 ship, verify on real config:
1. **railiance01**`state-hub-mcp-railiance01` was `reconnecting` with stale
forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached
`connected`.
2. **haskelseed** — not exercised (all tunnels already healthy); Alpine netstat
path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
3. **coulombcore**`bridge restart state-hub-coulombcore` reported `healthy`,
PID unchanged (4116), forward undisturbed.
State Hub progress logged (2026-06-21). Workplan marked `finished`.