From 1364cbcece54884ee73c6c154c122c5b5d8bc226 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Wed, 11 Mar 2026 21:53:29 +0100
Subject: [PATCH] docs: add CLAUDE.md improvements and BRIDGE-WP-0001 workplan

- Expand CLAUDE.md with dev commands, architecture overview, and required prefix
- Add workplans/BRIDGE-WP-0001-initial-implementation.md: 8-phase implementation
  plan covering FRS FR-1 to FR-26 (23 tasks registered in Custodian State Hub,
  workstream bridge-wp-0001)

Co-Authored-By: Claude Sonnet 4.6
---
 CLAUDE.md                                     |  48 ++
 .../BRIDGE-WP-0001-initial-implementation.md  | 420 ++++++++++++++++++
 2 files changed, 468 insertions(+)
 create mode 100644 workplans/BRIDGE-WP-0001-initial-implementation.md

diff --git a/CLAUDE.md b/CLAUDE.md
index 2177798..fdb1405 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,3 +1,7 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
 # ops-bridge — Claude Code Instructions
 
 **Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution
@@ -69,6 +73,50 @@ PRD: `workplans/BRIDGE-WP-0001-initial-implementation.md`
 - **No system daemons** — process management is internal, PID tracked in
   `~/.local/state/bridge/`
 
+## Dev Commands
+
+```bash
+# Install locally for development
+uv tool install -e .
+
+# Run tests
+uv run pytest
+
+# Run a single test
+uv run pytest tests/test_tunnel.py::test_name -v
+
+# Lint
+uv run ruff check .
+```
+
+## Architecture
+
+OpsBridge has two logical components:
+
+**1. OpsBridge — tunnel lifecycle manager** (this repo)
+Manages named SSH reverse tunnels defined in `~/.config/bridge/tunnels.yaml`.
+Each tunnel runs in a subprocess with a reconnect backoff loop; PIDs are tracked
+in `~/.local/state/bridge/`. Bridge states: `stopped → starting → connected →
+degraded → failed`. The `degraded` state means SSH is up but the optional HTTP
+health check is failing.
+
+**2. OpsCatalog — operations knowledge repository** (planned extension)
+A Git-backed YAML catalog of operations domains, targets, bridges, and actor
+classes. OpsBridge consumes this catalog to resolve bridge identifiers and
+orient operators. Schema examples are in `wiki/OpsCatalogSpecification.md`.
+The catalog layout follows: `opscatalog/domains//{domain.yaml,
+targets/, bridges/, docs/}`.
+
+Key design constraints:
+- OpsBridge owns lifecycle management only; it does not own identity/credentials
+- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
+  in config, CLI args, and log filenames must stay consistent
+- Actor attribution (human operator vs. automation agent) is tracked per bridge
+  for audit log traceability (FRS §5.7)
+
+Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
+(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
+
 ## Repo boundary
 
 This repo owns **tunnel lifecycle management only**. It does not own:
diff --git a/workplans/BRIDGE-WP-0001-initial-implementation.md b/workplans/BRIDGE-WP-0001-initial-implementation.md
new file mode 100644
index 0000000..e03de7a
--- /dev/null
+++ b/workplans/BRIDGE-WP-0001-initial-implementation.md
@@ -0,0 +1,420 @@
+---
+id: BRIDGE-WP-0001
+type: workplan
+title: "OpsBridge Initial Implementation"
+domain: custodian
+repo: ops-bridge
+status: active
+owner: Bernd
+topic_slug: custodian
+state_hub_workstream_id: 79112cff-9c0a-42ad-aa3d-916013001aee
+created: "2026-03-11"
+updated: "2026-03-11"
+---
+
+# BRIDGE-WP-0001 — OpsBridge Initial Implementation
+
+**Scope:** Full implementation of the `bridge` CLI tool as specified in the PRD and FRS.
+**Out of scope:** OpsCatalog integration (deferred to a future workplan).
+
+---
+
+## Goal
+
+Deliver a working `bridge` CLI installable via `uv tool install` that manages named SSH reverse tunnels with auto-reconnect, optional HTTP health checks, actor attribution, and an operational audit log.
+
+---
+
+## Reference Documents
+
+| Document | Location |
+|---|---|
+| PRD | `wiki/OpsBridgePrd.md` |
+| FRS | `wiki/OpsBridgeFrs.md` |
+| CLAUDE.md | `CLAUDE.md` |
+
+---
+
+## Architecture Summary
+
+```
+~/.config/bridge/tunnels.yaml   # static config: tunnels + actors
+~/.local/state/bridge/          # runtime state
+  .pid                          # PID of tunnel subprocess manager
+  .log                          # reconnect + health event log
+  .state                        # current state string (for status cmd)
+
+src/bridge/
+  __init__.py
+  cli.py          # Typer app, all commands
+  config.py       # load + validate tunnels.yaml
+  models.py       # dataclasses: TunnelConfig, BridgeState, ActorInfo
+  manager.py      # TunnelManager: start/stop subprocess, reconnect loop
+  health.py       # HTTP health check via httpx
+  state.py        # read/write PID + state files
+  audit.py        # structured event log writer
+```
+
+**Bridge state machine:** `stopped → starting → connected → degraded → failed`
+- `degraded` = SSH process alive but HTTP health check failing
+- `failed` = reconnect attempts exhausted (configurable max)
+
+---
+
+## Config Schema (`~/.config/bridge/tunnels.yaml`)
+
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore.local
+    remote_port: 18000
+    local_port: 8000
+    ssh_user: ubuntu
+    ssh_key: ~/.ssh/id_ops
+    actor: agent.claude-coulombcore
+    health_check:
+      url: http://127.0.0.1:18000/health  # checked from remote side
+      interval_seconds: 30
+      timeout_seconds: 5
+    reconnect:
+      max_attempts: 0  # 0 = infinite
+      backoff_initial: 5
+      backoff_max: 60
+
+actors:
+  agent.claude-coulombcore:
+    class: automation
+    description: Claude Code agent on CoulombCore
+  operator.bernd:
+    class: human
+    description: Bernd Worsch
+```
+
+---
+
+## Phase 1 — Project Scaffolding
+
+**Acceptance:** `bridge --help` lists all commands.
+
+### T01 — Create pyproject.toml
+
+```task
+id: BRIDGE-WP-0001-T01
+state_hub_task_id: 76c9ee58-10bf-4060-87bb-b73fa8cf25ea
+status: todo
+priority: high
+```
+
+Set up `[project]`, `[project.scripts]` (entry point `bridge = bridge.cli:app`), and dependencies: `typer`, `pyyaml`, `httpx`. Run `uv lock`.
+
+### T02 — Create package skeleton
+
+```task
+id: BRIDGE-WP-0001-T02
+state_hub_task_id: b2be974c-6173-457d-9276-080ac551c105
+status: todo
+priority: high
+```
+
+Create `src/bridge/__init__.py` and empty module stubs: `cli.py`, `config.py`, `models.py`, `manager.py`, `health.py`, `state.py`, `audit.py`.
+
+### T03 — Verify uv tool install
+
+```task
+id: BRIDGE-WP-0001-T03
+state_hub_task_id: 82f70483-91ae-4545-88af-44fe693ecb79
+status: todo
+priority: medium
+```
+
+Verify `uv tool install -e .` produces a working `bridge --help`.
+
+---
+
+## Phase 2 — Config Loading (FR-2, FC-1)
+
+**Acceptance:** `config.load()` returns typed config objects; clear error message on bad YAML.
+
+### T04 — Define config dataclasses in models.py
+
+```task
+id: BRIDGE-WP-0001-T04
+state_hub_task_id: 495e4257-40ad-4a1b-8a71-3a311476d41e
+status: todo
+priority: high
+```
+
+Define `TunnelConfig`, `ReconnectPolicy`, `HealthCheckConfig`, `ActorInfo` as dataclasses.
+
+### T05 — Implement config.py
+
+```task
+id: BRIDGE-WP-0001-T05
+state_hub_task_id: b6782df4-e692-49e1-b3a3-d65d07826907
+status: todo
+priority: high
+```
+
+Load `~/.config/bridge/tunnels.yaml`, validate required fields, raise clear errors. Support `BRIDGE_CONFIG` env var override for testing.
+
+### T06 — Unit tests for config loading
+
+```task
+id: BRIDGE-WP-0001-T06
+state_hub_task_id: 341c866f-8f4b-4165-9fa5-f10fe37c9252
+status: todo
+priority: medium
+```
+
+Test: valid config, missing required field, unknown tunnel name.
+
+---
+
+## Phase 3 — State Management (FR-4, FR-7, FR-14)
+
+**Acceptance:** State round-trips correctly; stale PIDs detected without error.
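A minimal sketch of the Phase 3 acceptance criterion (PID round-trip, stale PID detected without raising), assuming POSIX signal semantics. The helper names `pid_alive`, `write_pid`, and `read_live_pid` and the `{tunnel}.pid` file naming are hypothetical, not taken from the FRS.

```python
import os
from pathlib import Path
from typing import Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks existence, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def write_pid(state_dir: Path, tunnel: str, pid: int) -> None:
    """Record the manager PID, creating the state dir on first write."""
    state_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: one PID file per tunnel, named after the tunnel.
    (state_dir / f"{tunnel}.pid").write_text(f"{pid}\n")


def read_live_pid(state_dir: Path, tunnel: str) -> Optional[int]:
    """Return the recorded PID if it is alive; None if missing, corrupt, or stale."""
    try:
        pid = int((state_dir / f"{tunnel}.pid").read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    return pid if pid_alive(pid) else None
```

Returning `None` for all three failure modes lets `bridge status` report a tunnel as stopped without special-casing stale files.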
+
+### T07 — Implement state.py
+
+```task
+id: BRIDGE-WP-0001-T07
+state_hub_task_id: ae5e2566-a4b1-426f-9c32-4a2c025f2927
+status: todo
+priority: high
+```
+
+Read/write PID file and state file under `~/.local/state/bridge/`. Check if PID is alive. Create state dir on first write.
+
+### T08 — Define BridgeState enum
+
+```task
+id: BRIDGE-WP-0001-T08
+state_hub_task_id: 456a3cb5-50fa-4fed-9283-57e2d1c6fbb9
+status: todo
+priority: medium
+```
+
+States: `STOPPED`, `STARTING`, `CONNECTED`, `DEGRADED`, `RECONNECTING`, `FAILED`.
+
+### T09 — Unit tests for state management
+
+```task
+id: BRIDGE-WP-0001-T09
+state_hub_task_id: 0accc0b7-d013-43ad-a810-3269e64fb096
+status: todo
+priority: medium
+```
+
+Test: write/read state round-trip, stale PID detection without error.
+
+---
+
+## Phase 4 — Tunnel Process Manager (FR-1, FR-3, FR-12, FR-13)
+
+**Acceptance:** `bridge up ` starts tunnel; killing SSH process triggers reconnect; `bridge down ` stops cleanly.
+
+### T10 — Implement TunnelManager — SSH subprocess wrapper
+
+```task
+id: BRIDGE-WP-0001-T10
+state_hub_task_id: d0341e90-b48d-48ab-9e6d-82f4c365afec
+status: todo
+priority: high
+```
+
+SSH command: `ssh -N -R {remote_port}:127.0.0.1:{local_port} -i {key} -o ServerAliveInterval=10 -o ExitOnForwardFailure=yes {user}@{host}`. Manager runs as a daemonised child process; parent writes PID and exits.
+
+### T11 — Implement reconnect backoff loop
+
+```task
+id: BRIDGE-WP-0001-T11
+state_hub_task_id: f5c91eff-fca3-4f66-b073-276a733b5a27
+status: todo
+priority: high
+```
+
+Exponential backoff between `backoff_initial` and `backoff_max`. Respect `max_attempts` (0 = infinite). On disconnect: state → `RECONNECTING`, log event, restart SSH.
+
+### T12 — Implement graceful shutdown
+
+```task
+id: BRIDGE-WP-0001-T12
+state_hub_task_id: 3f4df535-0d6a-49e8-9d3a-c3926d7f230c
+status: todo
+priority: medium
+```
+
+Catch SIGTERM/SIGINT, kill SSH subprocess, write `STOPPED` state.
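The backoff schedule from T11 could look roughly like the sketch below. It assumes a doubling factor and no jitter, neither of which the workplan specifies; the generator name `backoff_delays` is hypothetical.

```python
from typing import Iterator


def backoff_delays(initial: float, maximum: float, max_attempts: int = 0) -> Iterator[float]:
    """Yield the wait in seconds before each reconnect attempt.

    The delay starts at `initial`, doubles after every attempt (assumed factor),
    and is capped at `maximum`. `max_attempts == 0` means retry forever.
    """
    delay = min(initial, maximum)
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        yield delay
        # Multiply step by step instead of computing 2 ** attempt, so an
        # unbounded retry loop never overflows a float.
        delay = min(maximum, delay * 2)
        attempt += 1
```

With the example config (`backoff_initial: 5`, `backoff_max: 60`) this yields 5, 10, 20, 40, 60, 60, and so on. When the generator is exhausted, the manager would move the tunnel to `FAILED`.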
+
+---
+
+## Phase 5 — Health Monitoring (FR-15, FR-16, FR-17)
+
+**Acceptance:** With a non-responsive health URL, `bridge status` shows `degraded`.
+
+### T13 — Implement health.py
+
+```task
+id: BRIDGE-WP-0001-T13
+state_hub_task_id: 5aaa0e35-f32a-4c68-8707-1a1e037b76f4
+status: todo
+priority: medium
+```
+
+Async HTTP GET via `httpx` to configured health URL. Run health check loop inside manager process. On failure: state → `DEGRADED`; on recovery: state → `CONNECTED`.
+
+### T14 — Write health check result to state dir
+
+```task
+id: BRIDGE-WP-0001-T14
+state_hub_task_id: 599d4e28-88c8-4c2a-80ac-ca57824af467
+status: todo
+priority: low
+```
+
+Persist timestamp, status, HTTP code or error for display in `bridge status`.
+
+---
+
+## Phase 6 — Audit Logging (FR-24, FR-25, FR-26)
+
+**Acceptance:** All lifecycle events appear in the log with actor attribution.
+
+### T15 — Implement audit.py
+
+```task
+id: BRIDGE-WP-0001-T15
+state_hub_task_id: 2f124b16-f1e7-4e9f-ad23-9f08543db3b7
+status: todo
+priority: medium
+```
+
+Append JSON-lines to `~/.local/state/bridge/.log`. Events: `bridge_started`, `bridge_connected`, `bridge_disconnected`, `bridge_reconnecting`, `health_check_failed`, `health_check_recovered`, `bridge_stopped`. Each entry: `timestamp` (ISO-8601), `tunnel`, `actor`, `actor_class`, `event`, `detail`.
+
+---
+
+## Phase 7 — CLI Commands (FR-1, FR-5, FR-8, FR-10, FR-11)
+
+**Acceptance:** All commands work end-to-end; `--help` on each command shows correct usage.
+
+Status table columns: `TUNNEL`, `STATE`, `ACTOR`, `HOST`, `UPTIME`, `HEALTH`. Exit codes: 0 = success, 1 = tunnel not found / config error, 2 = tunnel already in requested state. `--json` flag on `status` for automation.
+
+### T16 — CLI: bridge up
+
+```task
+id: BRIDGE-WP-0001-T16
+state_hub_task_id: 2c22b8fe-8a35-4887-89b2-f8fb7f43e0b6
+status: todo
+priority: high
+```
+
+Start named tunnel or all tunnels if name omitted.
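A sketch of how the Phase 7 exit codes might apply to `bridge up` for a single named tunnel. The function name `exit_code_for_up` is hypothetical, and treating `starting`, `connected`, `degraded`, and `reconnecting` as "already up" is an assumption, not something the FRS excerpt states.

```python
from enum import IntEnum
from typing import AbstractSet, Mapping


class ExitCode(IntEnum):
    SUCCESS = 0
    NOT_FOUND_OR_CONFIG_ERROR = 1
    ALREADY_IN_REQUESTED_STATE = 2


# Assumption: any of these states means a manager process is already running.
RUNNING_STATES = frozenset({"starting", "connected", "degraded", "reconnecting"})


def exit_code_for_up(
    name: str,
    configured: AbstractSet[str],
    states: Mapping[str, str],
) -> ExitCode:
    """Decide the exit code for `bridge up` on one tunnel, before any side effects."""
    if name not in configured:
        return ExitCode.NOT_FOUND_OR_CONFIG_ERROR
    # A tunnel with no state file yet is treated as stopped.
    if states.get(name, "stopped") in RUNNING_STATES:
        return ExitCode.ALREADY_IN_REQUESTED_STATE
    return ExitCode.SUCCESS
```

Deciding the code before starting any subprocess keeps the command easy to unit-test; the CLI layer would pass the result to its exit call.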
+
+### T17 — CLI: bridge down
+
+```task
+id: BRIDGE-WP-0001-T17
+state_hub_task_id: 768e1a8b-fdf7-4718-b00e-bc2401f57657
+status: todo
+priority: high
+```
+
+Stop named tunnel or all tunnels if name omitted.
+
+### T18 — CLI: bridge restart
+
+```task
+id: BRIDGE-WP-0001-T18
+state_hub_task_id: 8fd6486d-af4f-4295-a57a-a5fabbf25681
+status: todo
+priority: medium
+```
+
+Down then up for named tunnel or all.
+
+### T19 — CLI: bridge status
+
+```task
+id: BRIDGE-WP-0001-T19
+state_hub_task_id: 28f3f392-9e94-43e7-811a-fa036f588e10
+status: todo
+priority: high
+```
+
+Table output with `--json` flag for automation.
+
+### T20 — CLI: bridge logs
+
+```task
+id: BRIDGE-WP-0001-T20
+state_hub_task_id: 43582657-b1b9-4113-88e1-2109b30f3732
+status: todo
+priority: medium
+```
+
+Tail log file. Defaults to last 50 lines. `--follow` for live tail. `--lines N` to override.
+
+---
+
+## Phase 8 — Integration Tests
+
+**Acceptance:** `uv run pytest` passes cleanly.
+
+### T21 — Integration test: up/status/down cycle
+
+```task
+id: BRIDGE-WP-0001-T21
+state_hub_task_id: 5e3c7ac6-03fd-45e9-af64-11bde1d03ab8
+status: todo
+priority: medium
+```
+
+Test fixture with minimal `tunnels.yaml` pointing to localhost. Test full `up → status → down` cycle against loopback SSH target or mocked subprocess.
+
+### T22 — Integration test: reconnect behaviour
+
+```task
+id: BRIDGE-WP-0001-T22
+state_hub_task_id: 8b6ac68e-d0ab-4826-8df5-ebdf30a1e23e
+status: todo
+priority: medium
+```
+
+Test reconnect loop with a subprocess that exits immediately.
+
+### T23 — Integration test: health check degraded path
+
+```task
+id: BRIDGE-WP-0001-T23
+state_hub_task_id: c472bb1a-2fe2-4a88-aa6b-e18f732a3fde
+status: todo
+priority: medium
+```
+
+Test degraded state with a mock HTTP server that returns failures.
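The degraded path in T23 may also be testable without a mock server if the T13 transition is kept as a pure function, separate from the `httpx` call. The sketch below assumes lowercase string values for the T08 enum members; the function name `next_state_after_health_check` is hypothetical.

```python
from enum import Enum


class BridgeState(Enum):
    # Member names follow T08; the lowercase values are an assumption.
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


def next_state_after_health_check(current: BridgeState, healthy: bool) -> BridgeState:
    """Apply one health check result to the current bridge state.

    Only CONNECTED and DEGRADED react to health results; every other state
    is controlled by the SSH process lifecycle and is returned unchanged.
    """
    if current is BridgeState.CONNECTED and not healthy:
        return BridgeState.DEGRADED
    if current is BridgeState.DEGRADED and healthy:
        return BridgeState.CONNECTED
    return current
```

The mock-server test would then only need to confirm that a failing HTTP response maps to `healthy=False`.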
+
+---
+
+## FRS Traceability
+
+| FRS Requirement Group | Phase |
+|---|---|
+| FR-1 to FR-4 — Bridge creation | 4 |
+| FR-5 to FR-7 — Bridge termination | 4 |
+| FR-8 to FR-9 — Bridge restart | 7 |
+| FR-10 to FR-11 — Status inspection | 7 |
+| FR-12 to FR-14 — Lifecycle monitoring | 4 |
+| FR-15 to FR-17 — Health monitoring | 5 |
+| FR-18 to FR-20 — Actor attribution | 2, 6 |
+| FR-24 to FR-26 — Audit logging | 6 |
+| FC-1 — Config dependency | 2 |
+| FC-2 — External connectivity | 4 |
+
+*FR-21 to FR-23 (target discovery) and FR-27 to FR-29 (identity integration) are deferred — they depend on OpsCatalog and an identity provider respectively.*
+
+---
+
+## Deferred
+
+- **FR-21–FR-23** — Infrastructure target discovery (`bridge targets`) — requires OpsCatalog
+- **FR-27–FR-29** — Identity provider integration (privacyIDEA / SSH CA) — requires external identity infrastructure
+- **OpsCatalog** — Separate workplan (`BRIDGE-WP-0002`)