--- id: BRIDGE-WP-0001 type: workplan title: "OpsBridge Initial Implementation" domain: custodian repo: ops-bridge status: completed owner: Bernd topic_slug: custodian state_hub_workstream_id: 79112cff-9c0a-42ad-aa3d-916013001aee created: "2026-03-11" updated: "2026-03-12" --- # BRIDGE-WP-0001 — OpsBridge Initial Implementation **Scope:** Full implementation of the `bridge` CLI tool as specified in the PRD and FRS. **Out of scope:** OpsCatalog integration (deferred to a future workplan). --- ## Goal Deliver a working `bridge` CLI installable via `uv tool install` that manages named SSH reverse tunnels with auto-reconnect, optional HTTP health checks, actor attribution, and an operational audit log. --- ## Reference Documents | Document | Location | |---|---| | PRD | `wiki/OpsBridgePrd.md` | | FRS | `wiki/OpsBridgeFrs.md` | | CLAUDE.md | `CLAUDE.md` | --- ## Architecture Summary ``` ~/.config/bridge/tunnels.yaml # static config: tunnels + actors ~/.local/state/bridge/ # runtime state .pid # PID of tunnel subprocess manager .log # reconnect + health event log .state # current state string (for status cmd) src/bridge/ __init__.py cli.py # Typer app, all commands config.py # load + validate tunnels.yaml models.py # dataclasses: TunnelConfig, BridgeState, ActorInfo manager.py # TunnelManager: start/stop subprocess, reconnect loop health.py # HTTP health check via httpx state.py # read/write PID + state files audit.py # structured event log writer ``` **Bridge state machine:** `stopped → starting → connected → degraded → failed` - `degraded` = SSH process alive but HTTP health check failing - `failed` = reconnect attempts exhausted (configurable max) --- ## Config Schema (`~/.config/bridge/tunnels.yaml`) ```yaml tunnels: state-hub-coulombcore: host: coulombcore.local remote_port: 18000 local_port: 8000 ssh_user: ubuntu ssh_key: ~/.ssh/id_ops actor: agent.claude-coulombcore health_check: url: http://127.0.0.1:18000/health # checked from remote side interval_seconds: 30 timeout_seconds: 5 reconnect: max_attempts: 0 # 0 = infinite backoff_initial: 5 backoff_max: 60 actors: agent.claude-coulombcore: class: automation description: Claude Code agent on CoulombCore operator.bernd: class: human description: Bernd Worsch ``` --- ## Phase 1 — Project Scaffolding **Acceptance:** `bridge --help` lists all commands. ### T01 — Create pyproject.toml ```task id: BRIDGE-WP-0001-T01 state_hub_task_id: 76c9ee58-10bf-4060-87bb-b73fa8cf25ea status: done priority: high ``` Set up `[project]`, `[project.scripts]` (entry point `bridge = bridge.cli:app`), and dependencies: `typer`, `pyyaml`, `httpx`. Run `uv lock`. ### T02 — Create package skeleton ```task id: BRIDGE-WP-0001-T02 state_hub_task_id: b2be974c-6173-457d-9276-080ac551c105 status: done priority: high ``` Create `src/bridge/__init__.py` and empty module stubs: `cli.py`, `config.py`, `models.py`, `manager.py`, `health.py`, `state.py`, `audit.py`. ### T03 — Verify uv tool install ```task id: BRIDGE-WP-0001-T03 state_hub_task_id: 82f70483-91ae-4545-88af-44fe693ecb79 status: done priority: medium ``` Verify `uv tool install -e .` produces a working `bridge --help`. --- ## Phase 2 — Config Loading (FR-2, FC-1) **Acceptance:** `config.load()` returns typed config objects; clear error message on bad YAML. ### T04 — Define config dataclasses in models.py ```task id: BRIDGE-WP-0001-T04 state_hub_task_id: 495e4257-40ad-4a1b-8a71-3a311476d41e status: done priority: high ``` Define `TunnelConfig`, `ReconnectPolicy`, `HealthCheckConfig`, `ActorInfo` as dataclasses. ### T05 — Implement config.py ```task id: BRIDGE-WP-0001-T05 state_hub_task_id: b6782df4-e692-49e1-b3a3-d65d07826907 status: done priority: high ``` Load `~/.config/bridge/tunnels.yaml`, validate required fields, raise clear errors. Support `BRIDGE_CONFIG` env var override for testing. ### T06 — Unit tests for config loading ```task id: BRIDGE-WP-0001-T06 state_hub_task_id: 341c866f-8f4b-4165-9fa5-f10fe37c9252 status: done priority: medium ``` Test: valid config, missing required field, unknown tunnel name. --- ## Phase 3 — State Management (FR-4, FR-7, FR-14) **Acceptance:** State round-trips correctly; stale PIDs detected without error. ### T07 — Implement state.py ```task id: BRIDGE-WP-0001-T07 state_hub_task_id: ae5e2566-a4b1-426f-9c32-4a2c025f2927 status: done priority: high ``` Read/write PID file and state file under `~/.local/state/bridge/`. Check if PID is alive. Create state dir on first write. ### T08 — Define BridgeState enum ```task id: BRIDGE-WP-0001-T08 state_hub_task_id: 456a3cb5-50fa-4fed-9283-57e2d1c6fbb9 status: done priority: medium ``` States: `STOPPED`, `STARTING`, `CONNECTED`, `DEGRADED`, `RECONNECTING`, `FAILED`. ### T09 — Unit tests for state management ```task id: BRIDGE-WP-0001-T09 state_hub_task_id: 0accc0b7-d013-43ad-a810-3269e64fb096 status: done priority: medium ``` Test: write/read state round-trip, stale PID detection without error. --- ## Phase 4 — Tunnel Process Manager (FR-1, FR-3, FR-12, FR-13) **Acceptance:** `bridge up ` starts tunnel; killing SSH process triggers reconnect; `bridge down ` stops cleanly. ### T10 — Implement TunnelManager — SSH subprocess wrapper ```task id: BRIDGE-WP-0001-T10 state_hub_task_id: d0341e90-b48d-48ab-9e6d-82f4c365afec status: done priority: high ``` SSH command: `ssh -N -R {remote_port}:127.0.0.1:{local_port} -i {key} -o ServerAliveInterval=10 -o ExitOnForwardFailure=yes {user}@{host}`. Manager runs as a daemonised child process; parent writes PID and exits. ### T11 — Implement reconnect backoff loop ```task id: BRIDGE-WP-0001-T11 state_hub_task_id: f5c91eff-fca3-4f66-b073-276a733b5a27 status: done priority: high ``` Exponential backoff between `backoff_initial` and `backoff_max`. Respect `max_attempts` (0 = infinite). On disconnect: state → `RECONNECTING`, log event, restart SSH. ### T12 — Implement graceful shutdown ```task id: BRIDGE-WP-0001-T12 state_hub_task_id: 3f4df535-0d6a-49e8-9d3a-c3926d7f230c status: done priority: medium ``` Catch SIGTERM/SIGINT, kill SSH subprocess, write `STOPPED` state. --- ## Phase 5 — Health Monitoring (FR-15, FR-16, FR-17) **Acceptance:** With a non-responsive health URL, `bridge status` shows `degraded`. ### T13 — Implement health.py ```task id: BRIDGE-WP-0001-T13 state_hub_task_id: 5aaa0e35-f32a-4c68-8707-1a1e037b76f4 status: done priority: medium ``` Async HTTP GET via `httpx` to configured health URL. Run health check loop inside manager process. On failure: state → `DEGRADED`; on recovery: state → `CONNECTED`. ### T14 — Write health check result to state dir ```task id: BRIDGE-WP-0001-T14 state_hub_task_id: 599d4e28-88c8-4c2a-80ac-ca57824af467 status: done priority: low ``` Persist timestamp, status, HTTP code or error for display in `bridge status`. --- ## Phase 6 — Audit Logging (FR-24, FR-25, FR-26) **Acceptance:** All lifecycle events appear in the log with actor attribution. ### T15 — Implement audit.py ```task id: BRIDGE-WP-0001-T15 state_hub_task_id: 2f124b16-f1e7-4e9f-ad23-9f08543db3b7 status: done priority: medium ``` Append JSON-lines to `~/.local/state/bridge/.log`. Events: `bridge_started`, `bridge_connected`, `bridge_disconnected`, `bridge_reconnecting`, `health_check_failed`, `health_check_recovered`, `bridge_stopped`. Each entry: `timestamp` (ISO-8601), `tunnel`, `actor`, `actor_class`, `event`, `detail`. --- ## Phase 7 — CLI Commands (FR-1, FR-5, FR-8, FR-10, FR-11) **Acceptance:** All commands work end-to-end; `--help` on each command shows correct usage. Status table columns: `TUNNEL`, `STATE`, `ACTOR`, `HOST`, `UPTIME`, `HEALTH`. Exit codes: 0 = success, 1 = tunnel not found / config error, 2 = tunnel already in requested state. `--json` flag on `status` for automation. ### T16 — CLI: bridge up ```task id: BRIDGE-WP-0001-T16 state_hub_task_id: 2c22b8fe-8a35-4887-89b2-f8fb7f43e0b6 status: done priority: high ``` Start named tunnel or all tunnels if name omitted. ### T17 — CLI: bridge down ```task id: BRIDGE-WP-0001-T17 state_hub_task_id: 768e1a8b-fdf7-4718-b00e-bc2401f57657 status: done priority: high ``` Stop named tunnel or all tunnels if name omitted. ### T18 — CLI: bridge restart ```task id: BRIDGE-WP-0001-T18 state_hub_task_id: 8fd6486d-af4f-4295-a57a-a5fabbf25681 status: done priority: medium ``` Down then up for named tunnel or all. ### T19 — CLI: bridge status ```task id: BRIDGE-WP-0001-T19 state_hub_task_id: 28f3f392-9e94-43e7-811a-fa036f588e10 status: done priority: high ``` Table output with `--json` flag for automation. ### T20 — CLI: bridge logs ```task id: BRIDGE-WP-0001-T20 state_hub_task_id: 43582657-b1b9-4113-88e1-2109b30f3732 status: done priority: medium ``` Tail log file. Defaults to last 50 lines. `--follow` for live tail. `--lines N` to override. --- ## Phase 8 — Integration Tests **Acceptance:** `uv run pytest` passes cleanly. ### T21 — Integration test: up/status/down cycle ```task id: BRIDGE-WP-0001-T21 state_hub_task_id: 5e3c7ac6-03fd-45e9-af64-11bde1d03ab8 status: done priority: medium ``` Test fixture with minimal `tunnels.yaml` pointing to localhost. Test full `up → status → down` cycle against loopback SSH target or mocked subprocess. ### T22 — Integration test: reconnect behaviour ```task id: BRIDGE-WP-0001-T22 state_hub_task_id: 8b6ac68e-d0ab-4826-8df5-ebdf30a1e23e status: done priority: medium ``` Test reconnect loop with a subprocess that exits immediately. ### T23 — Integration test: health check degraded path ```task id: BRIDGE-WP-0001-T23 state_hub_task_id: c472bb1a-2fe2-4a88-aa6b-e18f732a3fde status: done priority: medium ``` Test degraded state with a mock HTTP server that returns failures. --- ## FRS Traceability | FRS Requirement Group | Phase | |---|---| | FR-1 to FR-4 — Bridge creation | 4 | | FR-5 to FR-7 — Bridge termination | 4 | | FR-8 to FR-9 — Bridge restart | 7 | | FR-10 to FR-11 — Status inspection | 7 | | FR-12 to FR-14 — Lifecycle monitoring | 4 | | FR-15 to FR-17 — Health monitoring | 5 | | FR-18 to FR-20 — Actor attribution | 2, 6 | | FR-24 to FR-26 — Audit logging | 6 | | FC-1 — Config dependency | 2 | | FC-2 — External connectivity | 4 | *FR-21 to FR-23 (target discovery) and FR-27 to FR-29 (identity integration) are deferred — they depend on OpsCatalog and an identity provider respectively.* --- ## Deferred - **FR-21–FR-23** — Infrastructure target discovery (`bridge targets`) — requires OpsCatalog - **FR-27–FR-29** — Identity provider integration (privacyIDEA / SSH CA) — requires external identity infrastructure - **OpsCatalog** — Separate workplan (`BRIDGE-WP-0002`)