ops-bridge/workplans/BRIDGE-WP-0001-initial-implementation.md
tegwick af2d419bf6 chore: mark BRIDGE-WP-0001 and BRIDGE-WP-0002 workplans as completed
All 39 tasks marked done; both workstreams updated to completed status
in the State Hub and workplan files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 03:37:32 +01:00

---
id: BRIDGE-WP-0001
type: workplan
title: "OpsBridge Initial Implementation"
domain: custodian
repo: ops-bridge
status: completed
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: 79112cff-9c0a-42ad-aa3d-916013001aee
created: "2026-03-11"
updated: "2026-03-12"
---
# BRIDGE-WP-0001 — OpsBridge Initial Implementation
**Scope:** Full implementation of the `bridge` CLI tool as specified in the PRD and FRS.
**Out of scope:** OpsCatalog integration (deferred to a future workplan).
---
## Goal
Deliver a working `bridge` CLI installable via `uv tool install` that manages named SSH reverse tunnels with auto-reconnect, optional HTTP health checks, actor attribution, and an operational audit log.
---
## Reference Documents
| Document | Location |
|---|---|
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
| CLAUDE.md | `CLAUDE.md` |
---
## Architecture Summary
```
~/.config/bridge/tunnels.yaml   # static config: tunnels + actors
~/.local/state/bridge/          # runtime state
  <name>.pid                    # PID of tunnel subprocess manager
  <name>.log                    # reconnect + health event log
  <name>.state                  # current state string (for status cmd)

src/bridge/
  __init__.py
  cli.py        # Typer app, all commands
  config.py     # load + validate tunnels.yaml
  models.py     # dataclasses: TunnelConfig, BridgeState, ActorInfo
  manager.py    # TunnelManager: start/stop subprocess, reconnect loop
  health.py     # HTTP health check via httpx
  state.py      # read/write PID + state files
  audit.py      # structured event log writer
```
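**Bridge state machine:** `stopped → starting → connected → degraded → failed`
- `degraded` = SSH process alive but HTTP health check failing
- `reconnecting` = SSH process has exited and the backoff loop is restarting it (state defined in T08, entered on disconnect per T11; not shown in the chain above)
- `failed` = reconnect attempts exhausted (configurable max)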
---
## Config Schema (`~/.config/bridge/tunnels.yaml`)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore.local
    remote_port: 18000
    local_port: 8000
    ssh_user: ubuntu
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-coulombcore
    health_check:
      url: http://127.0.0.1:18000/health  # checked from remote side
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0  # 0 = infinite
      backoff_initial: 5
      backoff_max: 60
actors:
  agent.claude-coulombcore:
    class: automation
    description: Claude Code agent on CoulombCore
  operator.bernd:
    class: human
    description: Bernd Worsch
```
---
## Phase 1 — Project Scaffolding
**Acceptance:** `bridge --help` lists all commands.
### T01 — Create pyproject.toml
```task
id: BRIDGE-WP-0001-T01
state_hub_task_id: 76c9ee58-10bf-4060-87bb-b73fa8cf25ea
status: done
priority: high
```
Set up `[project]`, `[project.scripts]` (entry point `bridge = bridge.cli:app`), and dependencies: `typer`, `pyyaml`, `httpx`. Run `uv lock`.
### T02 — Create package skeleton
```task
id: BRIDGE-WP-0001-T02
state_hub_task_id: b2be974c-6173-457d-9276-080ac551c105
status: done
priority: high
```
Create `src/bridge/__init__.py` and empty module stubs: `cli.py`, `config.py`, `models.py`, `manager.py`, `health.py`, `state.py`, `audit.py`.
### T03 — Verify uv tool install
```task
id: BRIDGE-WP-0001-T03
state_hub_task_id: 82f70483-91ae-4545-88af-44fe693ecb79
status: done
priority: medium
```
Verify `uv tool install -e .` produces a working `bridge --help`.
---
## Phase 2 — Config Loading (FR-2, FC-1)
**Acceptance:** `config.load()` returns typed config objects; clear error message on bad YAML.
### T04 — Define config dataclasses in models.py
```task
id: BRIDGE-WP-0001-T04
state_hub_task_id: 495e4257-40ad-4a1b-8a71-3a311476d41e
status: done
priority: high
```
Define `TunnelConfig`, `ReconnectPolicy`, `HealthCheckConfig`, `ActorInfo` as dataclasses.
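
As an illustration of T04, the four dataclasses might look roughly like the sketch below. The field names follow the config schema above; the default values and the `actor_class` attribute name (the YAML key `class` is a reserved word in Python) are assumptions, not taken from the actual `models.py`.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ReconnectPolicy:
    max_attempts: int = 0      # 0 = retry forever, as in the config schema
    backoff_initial: int = 5   # seconds (default is an assumption)
    backoff_max: int = 60      # seconds (default is an assumption)


@dataclass(frozen=True)
class HealthCheckConfig:
    url: str
    interval_seconds: int = 30
    timeout_seconds: int = 5


@dataclass(frozen=True)
class ActorInfo:
    name: str
    actor_class: str           # maps to the YAML key "class"
    description: str = ""


@dataclass(frozen=True)
class TunnelConfig:
    name: str
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str
    actor: str
    health_check: Optional[HealthCheckConfig] = None  # health checks are optional
    reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
```

Frozen dataclasses keep loaded config immutable, so the manager process cannot accidentally mutate what `config.load()` returned.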
### T05 — Implement config.py
```task
id: BRIDGE-WP-0001-T05
state_hub_task_id: b6782df4-e692-49e1-b3a3-d65d07826907
status: done
priority: high
```
Load `~/.config/bridge/tunnels.yaml`, validate required fields, raise clear errors. Support `BRIDGE_CONFIG` env var override for testing.
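
A minimal sketch of the path resolution and required-field validation described in T05. YAML parsing itself is left out so the example stays stdlib-only; `validate()` takes the already-parsed mapping. The names `ConfigError`, `config_path`, `validate`, and the exact set of required fields are hypothetical.

```python
import os
from pathlib import Path

DEFAULT_CONFIG = Path("~/.config/bridge/tunnels.yaml")
# Assumption: every top-level tunnel key from the schema except the
# optional health_check and reconnect sections is required.
REQUIRED_TUNNEL_FIELDS = (
    "host", "remote_port", "local_port", "ssh_user", "ssh_key", "actor",
)


class ConfigError(Exception):
    """Raised when tunnels.yaml is missing required data."""


def config_path(env=None) -> Path:
    """Return the config path, honouring the BRIDGE_CONFIG override."""
    env = os.environ if env is None else env
    override = env.get("BRIDGE_CONFIG")
    return Path(override).expanduser() if override else DEFAULT_CONFIG.expanduser()


def validate(raw: dict) -> None:
    """Raise ConfigError with a readable message if a tunnel is incomplete."""
    tunnels = raw.get("tunnels")
    if not isinstance(tunnels, dict) or not tunnels:
        raise ConfigError("config must define at least one entry under 'tunnels'")
    for name, body in tunnels.items():
        missing = [f for f in REQUIRED_TUNNEL_FIELDS if f not in (body or {})]
        if missing:
            raise ConfigError(
                f"tunnel '{name}': missing required field(s): {', '.join(missing)}"
            )
```

Passing `env` explicitly makes the override testable without touching the real process environment.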
### T06 — Unit tests for config loading
```task
id: BRIDGE-WP-0001-T06
state_hub_task_id: 341c866f-8f4b-4165-9fa5-f10fe37c9252
status: done
priority: medium
```
Test: valid config, missing required field, unknown tunnel name.
---
## Phase 3 — State Management (FR-4, FR-7, FR-14)
**Acceptance:** State round-trips correctly; stale PIDs detected without error.
### T07 — Implement state.py
```task
id: BRIDGE-WP-0001-T07
state_hub_task_id: ae5e2566-a4b1-426f-9c32-4a2c025f2927
status: done
priority: high
```
Read/write PID file and state file under `~/.local/state/bridge/`. Check if PID is alive. Create state dir on first write.
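
The PID and state file handling in T07 could be sketched as follows. This is an assumption-laden outline, not the actual `state.py`: the function names are hypothetical, a missing state file is treated as `stopped`, and the liveness probe relies on POSIX `os.kill(pid, 0)` semantics (do not use it on Windows, where `os.kill` terminates the process).

```python
import os
from pathlib import Path

STATE_DIR = Path("~/.local/state/bridge").expanduser()


def write_pid(name, pid, state_dir=STATE_DIR):
    state_dir.mkdir(parents=True, exist_ok=True)  # create dir on first write
    (state_dir / f"{name}.pid").write_text(f"{pid}\n")


def read_pid(name, state_dir=STATE_DIR):
    """Return the recorded PID, or None if the file is missing or corrupt."""
    try:
        return int((state_dir / f"{name}.pid").read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def write_state(name, state, state_dir=STATE_DIR):
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / f"{name}.state").write_text(f"{state}\n")


def read_state(name, state_dir=STATE_DIR):
    try:
        return (state_dir / f"{name}.state").read_text().strip()
    except FileNotFoundError:
        return "stopped"  # assumption: no state file means never started


def pid_alive(pid):
    """POSIX-only liveness probe: signal 0 checks existence without signalling."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # stale PID: process is gone
    except PermissionError:
        return True    # process exists but belongs to another user
    return True
```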
### T08 — Define BridgeState enum
```task
id: BRIDGE-WP-0001-T08
state_hub_task_id: 456a3cb5-50fa-4fed-9283-57e2d1c6fbb9
status: done
priority: medium
```
States: `STOPPED`, `STARTING`, `CONNECTED`, `DEGRADED`, `RECONNECTING`, `FAILED`.
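
One possible shape for the enum in T08. The lowercase string values mirror the state names used in the architecture summary; that mapping and the `parse_state` helper are assumptions.

```python
from enum import Enum


class BridgeState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


def parse_state(text: str) -> BridgeState:
    """Parse the contents of a <name>.state file; raises ValueError if unknown."""
    return BridgeState(text.strip().lower())
```

Subclassing `str` keeps the members directly JSON-serialisable, which may be convenient for `bridge status --json`.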
### T09 — Unit tests for state management
```task
id: BRIDGE-WP-0001-T09
state_hub_task_id: 0accc0b7-d013-43ad-a810-3269e64fb096
status: done
priority: medium
```
Test: write/read state round-trip, stale PID detection without error.
---
## Phase 4 — Tunnel Process Manager (FR-1, FR-3, FR-12, FR-13)
**Acceptance:** `bridge up <name>` starts tunnel; killing SSH process triggers reconnect; `bridge down <name>` stops cleanly.
### T10 — Implement TunnelManager — SSH subprocess wrapper
```task
id: BRIDGE-WP-0001-T10
state_hub_task_id: d0341e90-b48d-48ab-9e6d-82f4c365afec
status: done
priority: high
```
SSH command: `ssh -N -R {remote_port}:127.0.0.1:{local_port} -i {key} -o ServerAliveInterval=10 -o ExitOnForwardFailure=yes {user}@{host}`. Manager runs as a daemonised child process; parent writes PID and exits.
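
The SSH command above maps naturally onto an argument list for `subprocess.Popen`. The sketch below only builds the argv; the function name `build_ssh_argv` is hypothetical, and daemonisation is not shown.

```python
import os


def build_ssh_argv(host, ssh_user, ssh_key, remote_port, local_port):
    """Build the reverse-tunnel command from T10 as a list (no shell involved)."""
    return [
        "ssh", "-N",
        "-R", f"{remote_port}:127.0.0.1:{local_port}",
        "-i", os.path.expanduser(ssh_key),       # expand ~ since no shell does it
        "-o", "ServerAliveInterval=10",
        "-o", "ExitOnForwardFailure=yes",
        f"{ssh_user}@{host}",
    ]
```

Passing a list rather than a shell string avoids quoting problems with key paths and host names.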
### T11 — Implement reconnect backoff loop
```task
id: BRIDGE-WP-0001-T11
state_hub_task_id: f5c91eff-fca3-4f66-b073-276a733b5a27
status: done
priority: high
```
Exponential backoff between `backoff_initial` and `backoff_max`. Respect `max_attempts` (0 = infinite). On disconnect: state → `RECONNECTING`, log event, restart SSH.
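
The backoff schedule can be expressed as a generator. This sketch assumes a doubling factor, which the workplan does not specify; `backoff_delays` is a hypothetical name.

```python
from itertools import islice


def backoff_delays(initial, maximum, max_attempts=0):
    """Yield reconnect delays in seconds, doubling up to `maximum`.

    max_attempts == 0 yields forever, matching "0 = infinite" in the schema.
    """
    delay = min(initial, maximum)
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        yield delay
        delay = min(delay * 2, maximum)
        attempt += 1


# With the sample config (backoff_initial: 5, backoff_max: 60):
print(list(backoff_delays(5, 60, max_attempts=6)))  # [5, 10, 20, 40, 60, 60]
print(list(islice(backoff_delays(5, 60), 3)))       # [5, 10, 20]
```

When the generator is exhausted, the manager would move the tunnel to `FAILED`.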
### T12 — Implement graceful shutdown
```task
id: BRIDGE-WP-0001-T12
state_hub_task_id: 3f4df535-0d6a-49e8-9d3a-c3926d7f230c
status: done
priority: medium
```
Catch SIGTERM/SIGINT, kill SSH subprocess, write `STOPPED` state.
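
A sketch of how the shutdown path in T12 might be wired. The helper names, the five-second grace period, and the `write_stopped` callback are assumptions; `signal.signal` must be called from the main thread.

```python
import signal
import subprocess


def stop_child(proc, grace_seconds=5.0):
    """Terminate the SSH child, escalating to kill if it ignores SIGTERM."""
    if proc.poll() is not None:
        return  # already exited
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()


def install_shutdown_handlers(proc, write_stopped):
    """On SIGTERM/SIGINT: stop the child, record STOPPED, exit the manager."""
    def _handler(signum, frame):
        stop_child(proc)
        write_stopped()       # e.g. a callable that writes the STOPPED state
        raise SystemExit(0)

    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, _handler)
```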
---
## Phase 5 — Health Monitoring (FR-15, FR-16, FR-17)
**Acceptance:** With a non-responsive health URL, `bridge status` shows `degraded`.
### T13 — Implement health.py
```task
id: BRIDGE-WP-0001-T13
state_hub_task_id: 5aaa0e35-f32a-4c68-8707-1a1e037b76f4
status: done
priority: medium
```
Async HTTP GET via `httpx` to configured health URL. Run health check loop inside manager process. On failure: state → `DEGRADED`; on recovery: state → `CONNECTED`.
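
The state transition in T13 can be isolated from the HTTP client. In the sketch below the probe is injected as an async callable returning an HTTP status code, so the example runs without `httpx`; in the real module the probe would presumably wrap an `httpx` GET. Treating any 2xx as healthy and any exception as a failure are assumptions.

```python
import asyncio
from typing import Awaitable, Callable


async def check_once(probe: Callable[[], Awaitable[int]], current: str) -> str:
    """Run one health probe and return the resulting tunnel state."""
    try:
        status = await probe()
        healthy = 200 <= status < 300   # assumption: 2xx means healthy
    except Exception:
        healthy = False                 # timeouts and refusals count as failures
    if current == "connected" and not healthy:
        return "degraded"
    if current == "degraded" and healthy:
        return "connected"
    return current                      # other states are left untouched


async def _demo():
    async def failing():
        raise OSError("connection refused")
    return await check_once(failing, "connected")


print(asyncio.run(_demo()))  # degraded
```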
### T14 — Write health check result to state dir
```task
id: BRIDGE-WP-0001-T14
state_hub_task_id: 599d4e28-88c8-4c2a-80ac-ca57824af467
status: done
priority: low
```
Persist timestamp, status, HTTP code or error for display in `bridge status`.
---
## Phase 6 — Audit Logging (FR-24, FR-25, FR-26)
**Acceptance:** All lifecycle events appear in the log with actor attribution.
### T15 — Implement audit.py
```task
id: BRIDGE-WP-0001-T15
state_hub_task_id: 2f124b16-f1e7-4e9f-ad23-9f08543db3b7
status: done
priority: medium
```
Append JSON-lines to `~/.local/state/bridge/<name>.log`. Events: `bridge_started`, `bridge_connected`, `bridge_disconnected`, `bridge_reconnecting`, `health_check_failed`, `health_check_recovered`, `bridge_stopped`. Each entry: `timestamp` (ISO-8601), `tunnel`, `actor`, `actor_class`, `event`, `detail`.
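
A minimal JSON-lines writer along the lines of T15. The event names and entry fields come from the task description; the function names and the choice of UTC timestamps are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

EVENTS = frozenset({
    "bridge_started", "bridge_connected", "bridge_disconnected",
    "bridge_reconnecting", "health_check_failed", "health_check_recovered",
    "bridge_stopped",
})


def audit_entry(tunnel, actor, actor_class, event, detail=""):
    """Build one audit record; rejects event names outside the known set."""
    if event not in EVENTS:
        raise ValueError(f"unknown audit event: {event}")
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC
        "tunnel": tunnel,
        "actor": actor,
        "actor_class": actor_class,
        "event": event,
        "detail": detail,
    }


def append_event(log_path: Path, entry: dict) -> None:
    """Append the entry as a single JSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```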
---
## Phase 7 — CLI Commands (FR-1, FR-5, FR-8, FR-10, FR-11)
**Acceptance:** All commands work end-to-end; `--help` on each command shows correct usage.
Status table columns: `TUNNEL`, `STATE`, `ACTOR`, `HOST`, `UPTIME`, `HEALTH`. Exit codes: 0 = success, 1 = tunnel not found / config error, 2 = tunnel already in requested state. `--json` flag on `status` for automation.
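
To make the output contract concrete, here is a stdlib-only sketch of the exit codes and the status rendering. It does not use Typer; `ExitCode`, `render_status`, and the lowercase row keys are hypothetical names chosen for illustration.

```python
import json
from enum import IntEnum

COLUMNS = ("TUNNEL", "STATE", "ACTOR", "HOST", "UPTIME", "HEALTH")


class ExitCode(IntEnum):
    OK = 0
    ERROR = 1             # tunnel not found / config error
    ALREADY_IN_STATE = 2  # tunnel already in requested state


def render_status(rows, as_json=False):
    """Render rows (dicts keyed by lowercase column name) as a table or JSON."""
    if as_json:
        return json.dumps(rows, indent=2)
    table = [COLUMNS] + [
        tuple(str(row.get(col.lower(), "-")) for col in COLUMNS) for row in rows
    ]
    widths = [max(len(line[i]) for line in table) for i in range(len(COLUMNS))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(line, widths)).rstrip()
        for line in table
    )
```

Missing values render as `-` in the table, so a tunnel without a health check still produces an aligned row.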
### T16 — CLI: bridge up
```task
id: BRIDGE-WP-0001-T16
state_hub_task_id: 2c22b8fe-8a35-4887-89b2-f8fb7f43e0b6
status: done
priority: high
```
Start named tunnel or all tunnels if name omitted.
### T17 — CLI: bridge down
```task
id: BRIDGE-WP-0001-T17
state_hub_task_id: 768e1a8b-fdf7-4718-b00e-bc2401f57657
status: done
priority: high
```
Stop named tunnel or all tunnels if name omitted.
### T18 — CLI: bridge restart
```task
id: BRIDGE-WP-0001-T18
state_hub_task_id: 8fd6486d-af4f-4295-a57a-a5fabbf25681
status: done
priority: medium
```
Down then up for named tunnel or all.
### T19 — CLI: bridge status
```task
id: BRIDGE-WP-0001-T19
state_hub_task_id: 28f3f392-9e94-43e7-811a-fa036f588e10
status: done
priority: high
```
Table output with `--json` flag for automation.
### T20 — CLI: bridge logs
```task
id: BRIDGE-WP-0001-T20
state_hub_task_id: 43582657-b1b9-4113-88e1-2109b30f3732
status: done
priority: medium
```
Tail log file. Defaults to last 50 lines. `--follow` for live tail. `--lines N` to override.
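
The default "last 50 lines" behaviour can be implemented with a bounded `deque`, as sketched below. The function name `tail` is hypothetical, and `--follow` is not covered here.

```python
from collections import deque


def tail(path, lines=50):
    """Return the last `lines` lines of a log file without loading it all."""
    with open(path, encoding="utf-8") as fh:
        # deque with maxlen keeps only the most recent N lines while iterating
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]
```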
---
## Phase 8 — Integration Tests
**Acceptance:** `uv run pytest` passes cleanly.
### T21 — Integration test: up/status/down cycle
```task
id: BRIDGE-WP-0001-T21
state_hub_task_id: 5e3c7ac6-03fd-45e9-af64-11bde1d03ab8
status: done
priority: medium
```
Test fixture with minimal `tunnels.yaml` pointing to localhost. Test full `up → status → down` cycle against loopback SSH target or mocked subprocess.
### T22 — Integration test: reconnect behaviour
```task
id: BRIDGE-WP-0001-T22
state_hub_task_id: 8b6ac68e-d0ab-4826-8df5-ebdf30a1e23e
status: done
priority: medium
```
Test reconnect loop with a subprocess that exits immediately.
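
One way such a test could drive the loop: use the current Python interpreter as a stand-in for SSH, exit immediately, and record the state sequence. This is a simplified sketch, not the actual `TunnelManager`; `supervise` and `max_restarts` are hypothetical, backoff sleeps are omitted, and the "0 = infinite" case is deliberately not modelled.

```python
import subprocess
import sys


def supervise(argv, max_restarts):
    """Run argv, restarting it up to max_restarts times; return states seen."""
    states = []
    restarts = 0
    while True:
        states.append("starting")
        subprocess.run(argv, check=False)  # returns as soon as the child exits
        if restarts >= max_restarts:
            states.append("failed")        # reconnect attempts exhausted
            return states
        restarts += 1
        states.append("reconnecting")


# A child that exits immediately with a non-zero status
crashing = [sys.executable, "-c", "raise SystemExit(1)"]
states = supervise(crashing, max_restarts=2)
assert states.count("reconnecting") == 2
assert states[-1] == "failed"
```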
### T23 — Integration test: health check degraded path
```task
id: BRIDGE-WP-0001-T23
state_hub_task_id: c472bb1a-2fe2-4a88-aa6b-e18f732a3fde
status: done
priority: medium
```
Test degraded state with a mock HTTP server that returns failures.
---
## FRS Traceability
| FRS Requirement Group | Phase |
|---|---|
| FR-1 to FR-4 — Bridge creation | 2, 3, 4, 7 |
| FR-5 to FR-7 — Bridge termination | 3, 4, 7 |
| FR-8 to FR-9 — Bridge restart | 7 |
| FR-10 to FR-11 — Status inspection | 7 |
| FR-12 to FR-14 — Lifecycle monitoring | 3, 4 |
| FR-15 to FR-17 — Health monitoring | 5 |
| FR-18 to FR-20 — Actor attribution | 2, 6 |
| FR-24 to FR-26 — Audit logging | 6 |
| FC-1 — Config dependency | 2 |
| FC-2 — External connectivity | 4 |
*FR-21 to FR-23 (target discovery) and FR-27 to FR-29 (identity integration) are deferred — they depend on OpsCatalog and an identity provider respectively.*
---
## Deferred
- **FR-21 to FR-23** — Infrastructure target discovery (`bridge targets`) — requires OpsCatalog
- **FR-27 to FR-29** — Identity provider integration (privacyIDEA / SSH CA) — requires external identity infrastructure
- **OpsCatalog** — Separate workplan (`BRIDGE-WP-0002`)