docs: add CLAUDE.md improvements and BRIDGE-WP-0001 workplan

- Expand CLAUDE.md with dev commands, architecture overview, and required prefix
- Add workplans/BRIDGE-WP-0001-initial-implementation.md: 8-phase implementation
  plan covering FRS FR-1 to FR-26 (23 tasks registered in Custodian State Hub,
  workstream bridge-wp-0001)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-11 21:53:29 +01:00
parent 482edcd7eb
commit 1364cbcece
2 changed files with 468 additions and 0 deletions


@@ -1,3 +1,7 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
# ops-bridge — Claude Code Instructions
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution
@@ -69,6 +73,50 @@ PRD: `workplans/BRIDGE-WP-0001-initial-implementation.md`
- **No system daemons** — process management is internal, PID tracked in
`~/.local/state/bridge/`
## Dev Commands
```bash
# Install locally for development
uv tool install -e .
# Run tests
uv run pytest
# Run a single test
uv run pytest tests/test_tunnel.py::test_name -v
# Lint
uv run ruff check .
```
## Architecture
OpsBridge has two logical components:
**1. OpsBridge — tunnel lifecycle manager** (this repo)
Manages named SSH reverse tunnels defined in `~/.config/bridge/tunnels.yaml`.
Each tunnel runs in a subprocess with a reconnect backoff loop; PIDs are tracked
in `~/.local/state/bridge/`. Bridge states: `stopped → starting → connected →
degraded → failed`. The `degraded` state means SSH is up but the optional HTTP
health check is failing.
**2. OpsCatalog — operations knowledge repository** (planned extension)
A Git-backed YAML catalog of operations domains, targets, bridges, and actor
classes. OpsBridge consumes this catalog to resolve bridge identifiers and
orient operators. Schema examples are in `wiki/OpsCatalogSpecification.md`.
The catalog layout follows: `opscatalog/domains/<domain>/{domain.yaml,
targets/, bridges/, docs/}`.
Key design constraints:
- OpsBridge owns lifecycle management only; it does not own identity/credentials
- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
in config, CLI args, and log filenames must stay consistent
- Actor attribution (human operator vs. automation agent) is tracked per bridge
for audit log traceability (FRS §5.7)
Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
## Repo boundary
This repo owns **tunnel lifecycle management only**. It does not own:


@@ -0,0 +1,420 @@
---
id: BRIDGE-WP-0001
type: workplan
title: "OpsBridge Initial Implementation"
domain: custodian
repo: ops-bridge
status: active
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: 79112cff-9c0a-42ad-aa3d-916013001aee
created: "2026-03-11"
updated: "2026-03-11"
---
# BRIDGE-WP-0001 — OpsBridge Initial Implementation
**Scope:** Full implementation of the `bridge` CLI tool as specified in the PRD and FRS.
**Out of scope:** OpsCatalog integration (deferred to a future workplan).
---
## Goal
Deliver a working `bridge` CLI installable via `uv tool install` that manages named SSH reverse tunnels with auto-reconnect, optional HTTP health checks, actor attribution, and an operational audit log.
---
## Reference Documents
| Document | Location |
|---|---|
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
| CLAUDE.md | `CLAUDE.md` |
---
## Architecture Summary
```
~/.config/bridge/tunnels.yaml   # static config: tunnels + actors
~/.local/state/bridge/          # runtime state
  <name>.pid                    # PID of tunnel subprocess manager
  <name>.log                    # reconnect + health event log
  <name>.state                  # current state string (for status cmd)

src/bridge/
  __init__.py
  cli.py                        # Typer app, all commands
  config.py                     # load + validate tunnels.yaml
  models.py                     # dataclasses: TunnelConfig, BridgeState, ActorInfo
  manager.py                    # TunnelManager: start/stop subprocess, reconnect loop
  health.py                     # HTTP health check via httpx
  state.py                      # read/write PID + state files
  audit.py                      # structured event log writer
```
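**Bridge state machine:** `stopped → starting → connected → degraded → failed`
- `degraded` = SSH process alive but HTTP health check failing; returns to `connected` when the check recovers (T13)
- `reconnecting` = SSH process dropped and the backoff loop is restarting it (T08, T11)
- `failed` = reconnect attempts exhausted (configurable max)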
---
## Config Schema (`~/.config/bridge/tunnels.yaml`)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore.local
    remote_port: 18000
    local_port: 8000
    ssh_user: ubuntu
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-coulombcore
    health_check:
      url: http://127.0.0.1:18000/health   # checked from remote side
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0                      # 0 = infinite
      backoff_initial: 5
      backoff_max: 60

actors:
  agent.claude-coulombcore:
    class: automation
    description: Claude Code agent on CoulombCore
  operator.bernd:
    class: human
    description: Bernd Worsch
```
---
## Phase 1 — Project Scaffolding
**Acceptance:** `bridge --help` lists all commands.
### T01 — Create pyproject.toml
```task
id: BRIDGE-WP-0001-T01
state_hub_task_id: 76c9ee58-10bf-4060-87bb-b73fa8cf25ea
status: todo
priority: high
```
Set up `[project]`, `[project.scripts]` (entry point `bridge = bridge.cli:app`), and dependencies: `typer`, `pyyaml`, `httpx`. Run `uv lock`.
### T02 — Create package skeleton
```task
id: BRIDGE-WP-0001-T02
state_hub_task_id: b2be974c-6173-457d-9276-080ac551c105
status: todo
priority: high
```
Create `src/bridge/__init__.py` and empty module stubs: `cli.py`, `config.py`, `models.py`, `manager.py`, `health.py`, `state.py`, `audit.py`.
### T03 — Verify uv tool install
```task
id: BRIDGE-WP-0001-T03
state_hub_task_id: 82f70483-91ae-4545-88af-44fe693ecb79
status: todo
priority: medium
```
Verify `uv tool install -e .` produces a working `bridge --help`.
---
## Phase 2 — Config Loading (FR-2, FC-1)
**Acceptance:** `config.load()` returns typed config objects; clear error message on bad YAML.
### T04 — Define config dataclasses in models.py
```task
id: BRIDGE-WP-0001-T04
state_hub_task_id: 495e4257-40ad-4a1b-8a71-3a311476d41e
status: todo
priority: high
```
Define `TunnelConfig`, `ReconnectPolicy`, `HealthCheckConfig`, `ActorInfo` as dataclasses.
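
As a rough illustration of T04, the four dataclasses might look like the sketch below. The field names follow the config schema above; the default values, the `frozen=True` choice, and the `actor_class` attribute name (used because `class` is a Python keyword) are assumptions, not decisions taken in this workplan.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ReconnectPolicy:
    max_attempts: int = 0        # 0 = infinite, as in the schema example
    backoff_initial: float = 5   # assumed default, taken from the example
    backoff_max: float = 60      # assumed default, taken from the example


@dataclass(frozen=True)
class HealthCheckConfig:
    url: str
    interval_seconds: float = 30  # assumed default
    timeout_seconds: float = 5    # assumed default


@dataclass(frozen=True)
class ActorInfo:
    name: str
    actor_class: str              # "human" or "automation" in the example
    description: str = ""


@dataclass(frozen=True)
class TunnelConfig:
    name: str
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str
    actor: str
    health_check: Optional[HealthCheckConfig] = None  # health check is optional
    reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
```

Keeping `health_check` as `Optional` mirrors the "optional HTTP health checks" wording in the Goal section.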
### T05 — Implement config.py
```task
id: BRIDGE-WP-0001-T05
state_hub_task_id: b6782df4-e692-49e1-b3a3-d65d07826907
status: todo
priority: high
```
Load `~/.config/bridge/tunnels.yaml`, validate required fields, raise clear errors. Support `BRIDGE_CONFIG` env var override for testing.
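
The path override and field validation in T05 could be sketched as follows. YAML parsing itself is left out, so `validate` takes an already-parsed mapping. `ConfigError`, `resolve_config_path`, `validate`, the list of required fields, and the rule that a tunnel's actor must be declared under `actors` are all hypothetical choices made for illustration.

```python
import os
from pathlib import Path

DEFAULT_CONFIG = Path("~/.config/bridge/tunnels.yaml")
# Assumption: every non-nested key in the schema example is required.
REQUIRED_TUNNEL_FIELDS = (
    "host", "remote_port", "local_port", "ssh_user", "ssh_key", "actor",
)


class ConfigError(Exception):
    """Raised when the config is missing or malformed (hypothetical name)."""


def resolve_config_path(env=None) -> Path:
    """Return the config path, honouring the BRIDGE_CONFIG override."""
    env = os.environ if env is None else env
    override = env.get("BRIDGE_CONFIG")
    return Path(override if override else DEFAULT_CONFIG).expanduser()


def validate(raw) -> dict:
    """Check a parsed YAML mapping and return its 'tunnels' section."""
    tunnels = raw.get("tunnels") if isinstance(raw, dict) else None
    if not isinstance(tunnels, dict) or not tunnels:
        raise ConfigError("config must define a non-empty 'tunnels' mapping")
    actors = raw.get("actors") or {}
    for name, spec in tunnels.items():
        if not isinstance(spec, dict):
            raise ConfigError(f"tunnel '{name}': expected a mapping")
        missing = [f for f in REQUIRED_TUNNEL_FIELDS if f not in spec]
        if missing:
            raise ConfigError(
                f"tunnel '{name}': missing required field(s): {', '.join(missing)}"
            )
        # Assumption: an undeclared actor is treated as a config error.
        if spec["actor"] not in actors:
            raise ConfigError(f"tunnel '{name}': unknown actor '{spec['actor']}'")
    return tunnels
```

Passing the environment in as a parameter keeps the override testable without mutating `os.environ`.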
### T06 — Unit tests for config loading
```task
id: BRIDGE-WP-0001-T06
state_hub_task_id: 341c866f-8f4b-4165-9fa5-f10fe37c9252
status: todo
priority: medium
```
Test: valid config, missing required field, unknown tunnel name.
---
## Phase 3 — State Management (FR-4, FR-7, FR-14)
**Acceptance:** State round-trips correctly; stale PIDs detected without error.
### T07 — Implement state.py
```task
id: BRIDGE-WP-0001-T07
state_hub_task_id: ae5e2566-a4b1-426f-9c32-4a2c025f2927
status: todo
priority: high
```
Read/write PID file and state file under `~/.local/state/bridge/`. Check if PID is alive. Create state dir on first write.
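
A minimal sketch of the PID handling described in T07, assuming a POSIX host where `os.kill(pid, 0)` probes for a process without signalling it. The function names (`write_pid`, `read_pid`, `pid_alive`, `live_pid`) are illustrative only.

```python
import os
from pathlib import Path
from typing import Optional

STATE_DIR = Path("~/.local/state/bridge").expanduser()


def write_pid(name: str, pid: int, state_dir: Path = STATE_DIR) -> None:
    state_dir.mkdir(parents=True, exist_ok=True)  # create state dir on first write
    (state_dir / f"{name}.pid").write_text(f"{pid}\n")


def read_pid(name: str, state_dir: Path = STATE_DIR) -> Optional[int]:
    try:
        return int((state_dir / f"{name}.pid").read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # no PID file, or unreadable contents


def pid_alive(pid: int) -> bool:
    if pid <= 0:
        return False  # 0 and negatives address process groups, not one process
    try:
        os.kill(pid, 0)  # signal 0: existence check only, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but is owned by another user
    return True


def live_pid(name: str, state_dir: Path = STATE_DIR) -> Optional[int]:
    """Return the recorded PID only if that process is still running."""
    pid = read_pid(name, state_dir)
    return pid if pid is not None and pid_alive(pid) else None
```

A stale PID file then yields `None` rather than an exception, which is one way to meet the "stale PIDs detected without error" acceptance criterion.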
### T08 — Define BridgeState enum
```task
id: BRIDGE-WP-0001-T08
state_hub_task_id: 456a3cb5-50fa-4fed-9283-57e2d1c6fbb9
status: todo
priority: medium
```
States: `STOPPED`, `STARTING`, `CONNECTED`, `DEGRADED`, `RECONNECTING`, `FAILED`.
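
One possible shape for the enum, assuming the lowercase strings from the state machine are what gets written to `<name>.state`. Mixing in `str` is a design choice made here so the members serialise directly; it is not specified by the workplan.

```python
from enum import Enum


class BridgeState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


# Round-trip between the on-disk string and the enum member.
on_disk = BridgeState.CONNECTED.value
restored = BridgeState(on_disk)
```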
### T09 — Unit tests for state management
```task
id: BRIDGE-WP-0001-T09
state_hub_task_id: 0accc0b7-d013-43ad-a810-3269e64fb096
status: todo
priority: medium
```
Test: write/read state round-trip, stale PID detection without error.
---
## Phase 4 — Tunnel Process Manager (FR-1, FR-3, FR-12, FR-13)
**Acceptance:** `bridge up <name>` starts tunnel; killing SSH process triggers reconnect; `bridge down <name>` stops cleanly.
### T10 — Implement TunnelManager — SSH subprocess wrapper
```task
id: BRIDGE-WP-0001-T10
state_hub_task_id: d0341e90-b48d-48ab-9e6d-82f4c365afec
status: todo
priority: high
```
SSH command: `ssh -N -R {remote_port}:127.0.0.1:{local_port} -i {key} -o ServerAliveInterval=10 -o ExitOnForwardFailure=yes {user}@{host}`. Manager runs as a daemonised child process; parent writes PID and exits.
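
The command line above can be assembled as an argument list, which avoids shell quoting when it is handed to `subprocess.Popen`. The helper name `build_ssh_argv` is hypothetical; the flags are exactly those listed in T10.

```python
import os
from typing import List


def build_ssh_argv(host: str, remote_port: int, local_port: int,
                   ssh_user: str, ssh_key: str) -> List[str]:
    """Build the reverse-tunnel ssh invocation as an argv list."""
    return [
        "ssh",
        "-N",                                           # no remote command
        "-R", f"{remote_port}:127.0.0.1:{local_port}",  # reverse forward
        "-i", os.path.expanduser(ssh_key),              # expand ~ in key path
        "-o", "ServerAliveInterval=10",
        "-o", "ExitOnForwardFailure=yes",
        f"{ssh_user}@{host}",
    ]
```

For the schema example this yields `-R 18000:127.0.0.1:8000` and the destination `ubuntu@coulombcore.local`.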
### T11 — Implement reconnect backoff loop
```task
id: BRIDGE-WP-0001-T11
state_hub_task_id: f5c91eff-fca3-4f66-b073-276a733b5a27
status: todo
priority: high
```
Exponential backoff between `backoff_initial` and `backoff_max`. Respect `max_attempts` (0 = infinite). On disconnect: state → `RECONNECTING`, log event, restart SSH.
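
The delay schedule can be sketched as a generator. The doubling factor is an assumption (the workplan only says "exponential"), and `backoff_delays` is an illustrative name. Resetting the schedule after a successful connection is left to the caller.

```python
from typing import Iterator


def backoff_delays(initial: float, maximum: float,
                   max_attempts: int = 0, factor: float = 2.0) -> Iterator[float]:
    """Yield one delay per reconnect attempt, capped at `maximum`.

    max_attempts == 0 means retry forever, matching the config schema.
    """
    attempt = 0
    delay = initial
    while max_attempts == 0 or attempt < max_attempts:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)
        attempt += 1
```

With `backoff_initial: 5` and `backoff_max: 60` the delays run 5, 10, 20, 40, 60, 60, and so on. When the generator is exhausted, the manager would move the bridge to `FAILED`.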
### T12 — Implement graceful shutdown
```task
id: BRIDGE-WP-0001-T12
state_hub_task_id: 3f4df535-0d6a-49e8-9d3a-c3926d7f230c
status: todo
priority: medium
```
Catch SIGTERM/SIGINT, kill SSH subprocess, write `STOPPED` state.
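
A sketch of the shutdown path, assuming the manager holds the SSH child as a `subprocess.Popen`. The escalation to `kill()` after a grace period is an addition not stated in T12, and both function names are hypothetical.

```python
import signal
import subprocess
from typing import Callable, Optional


def stop_child(proc: subprocess.Popen, grace_seconds: float = 5.0) -> Optional[int]:
    """Terminate the child; escalate to kill() if it outlives the grace period."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    return proc.returncode


def install_handlers(on_stop: Callable[[], None]) -> None:
    """Route SIGTERM and SIGINT to one callback (main thread only)."""
    def _handler(signum, frame):
        on_stop()

    for sig in (signal.SIGTERM, signal.SIGINT):
        signal.signal(sig, _handler)
```

The callback passed to `install_handlers` would call `stop_child` and then write the `STOPPED` state before the manager exits.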
---
## Phase 5 — Health Monitoring (FR-15, FR-16, FR-17)
**Acceptance:** With a non-responsive health URL, `bridge status` shows `degraded`.
### T13 — Implement health.py
```task
id: BRIDGE-WP-0001-T13
state_hub_task_id: 5aaa0e35-f32a-4c68-8707-1a1e037b76f4
status: todo
priority: medium
```
Async HTTP GET via `httpx` to configured health URL. Run health check loop inside manager process. On failure: state → `DEGRADED`; on recovery: state → `CONNECTED`.
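
The state transition in T13 can be isolated from the HTTP call itself, which keeps it testable without a server. In the sketch below the `httpx` request is omitted and only its boolean outcome is passed in. Leaving every state other than `connected` and `degraded` untouched is an assumption; the workplan does not say how health results interact with `reconnecting` or `failed`.

```python
def next_state(current: str, healthy: bool) -> str:
    """Apply one health check result to the current bridge state."""
    if current == "connected" and not healthy:
        return "degraded"   # SSH up, health check failing
    if current == "degraded" and healthy:
        return "connected"  # health check recovered
    return current          # assumption: other states are not affected
```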
### T14 — Write health check result to state dir
```task
id: BRIDGE-WP-0001-T14
state_hub_task_id: 599d4e28-88c8-4c2a-80ac-ca57824af467
status: todo
priority: low
```
Persist timestamp, status, HTTP code or error for display in `bridge status`.
---
## Phase 6 — Audit Logging (FR-24, FR-25, FR-26)
**Acceptance:** All lifecycle events appear in the log with actor attribution.
### T15 — Implement audit.py
```task
id: BRIDGE-WP-0001-T15
state_hub_task_id: 2f124b16-f1e7-4e9f-ad23-9f08543db3b7
status: todo
priority: medium
```
Append JSON-lines to `~/.local/state/bridge/<name>.log`. Events: `bridge_started`, `bridge_connected`, `bridge_disconnected`, `bridge_reconnecting`, `health_check_failed`, `health_check_recovered`, `bridge_stopped`. Each entry: `timestamp` (ISO-8601), `tunnel`, `actor`, `actor_class`, `event`, `detail`.
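
A minimal JSON-lines writer for the entry format above might look like this. Rejecting unknown event names and using UTC timestamps are assumptions; `append_event` is an illustrative name.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

EVENTS = frozenset({
    "bridge_started", "bridge_connected", "bridge_disconnected",
    "bridge_reconnecting", "health_check_failed",
    "health_check_recovered", "bridge_stopped",
})


def append_event(log_path: Path, tunnel: str, actor: str, actor_class: str,
                 event: str, detail: str = "") -> dict:
    """Append one audit entry as a single JSON line and return it."""
    if event not in EVENTS:
        raise ValueError(f"unknown audit event: {event}")
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC assumed
        "tunnel": tunnel,
        "actor": actor,
        "actor_class": actor_class,
        "event": event,
        "detail": detail,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

One entry per line keeps the file usable with `bridge logs` and with line-oriented tools.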
---
## Phase 7 — CLI Commands (FR-1, FR-5, FR-8, FR-10, FR-11)
**Acceptance:** All commands work end-to-end; `--help` on each command shows correct usage.
Status table columns: `TUNNEL`, `STATE`, `ACTOR`, `HOST`, `UPTIME`, `HEALTH`. Exit codes: 0 = success, 1 = tunnel not found / config error, 2 = tunnel already in requested state. `--json` flag on `status` for automation.
### T16 — CLI: bridge up
```task
id: BRIDGE-WP-0001-T16
state_hub_task_id: 2c22b8fe-8a35-4887-89b2-f8fb7f43e0b6
status: todo
priority: high
```
Start named tunnel or all tunnels if name omitted.
### T17 — CLI: bridge down
```task
id: BRIDGE-WP-0001-T17
state_hub_task_id: 768e1a8b-fdf7-4718-b00e-bc2401f57657
status: todo
priority: high
```
Stop named tunnel or all tunnels if name omitted.
### T18 — CLI: bridge restart
```task
id: BRIDGE-WP-0001-T18
state_hub_task_id: 8fd6486d-af4f-4295-a57a-a5fabbf25681
status: todo
priority: medium
```
Down then up for named tunnel or all.
### T19 — CLI: bridge status
```task
id: BRIDGE-WP-0001-T19
state_hub_task_id: 28f3f392-9e94-43e7-811a-fa036f588e10
status: todo
priority: high
```
Table output with `--json` flag for automation.
### T20 — CLI: bridge logs
```task
id: BRIDGE-WP-0001-T20
state_hub_task_id: 43582657-b1b9-4113-88e1-2109b30f3732
status: todo
priority: medium
```
Tail log file. Defaults to last 50 lines. `--follow` for live tail. `--lines N` to override.
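
The default "last 50 lines" behaviour could be implemented with a bounded `deque`, which holds only the requested lines in memory. This sketch covers the non-follow case only, and `tail_lines` is a hypothetical helper name.

```python
from collections import deque
from pathlib import Path
from typing import List


def tail_lines(path: Path, lines: int = 50) -> List[str]:
    """Return the last `lines` lines of a log file, oldest first."""
    with path.open(encoding="utf-8", errors="replace") as fh:
        # deque(maxlen=N) discards older lines as the file is read.
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]
```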
---
## Phase 8 — Integration Tests
**Acceptance:** `uv run pytest` passes cleanly.
### T21 — Integration test: up/status/down cycle
```task
id: BRIDGE-WP-0001-T21
state_hub_task_id: 5e3c7ac6-03fd-45e9-af64-11bde1d03ab8
status: todo
priority: medium
```
Test fixture with minimal `tunnels.yaml` pointing to localhost. Test full `up → status → down` cycle against loopback SSH target or mocked subprocess.
### T22 — Integration test: reconnect behaviour
```task
id: BRIDGE-WP-0001-T22
state_hub_task_id: 8b6ac68e-d0ab-4826-8df5-ebdf30a1e23e
status: todo
priority: medium
```
Test reconnect loop with a subprocess that exits immediately.
### T23 — Integration test: health check degraded path
```task
id: BRIDGE-WP-0001-T23
state_hub_task_id: c472bb1a-2fe2-4a88-aa6b-e18f732a3fde
status: todo
priority: medium
```
Test degraded state with a mock HTTP server that returns failures.
---
## FRS Traceability
| FRS Requirement Group | Phase |
|---|---|
| FR-1 to FR-4 — Bridge creation | 4 |
| FR-5 to FR-7 — Bridge termination | 4 |
| FR-8 to FR-9 — Bridge restart | 7 |
| FR-10 to FR-11 — Status inspection | 7 |
| FR-12 to FR-14 — Lifecycle monitoring | 4 |
| FR-15 to FR-17 — Health monitoring | 5 |
| FR-18 to FR-20 — Actor attribution | 2, 6 |
| FR-24 to FR-26 — Audit logging | 6 |
| FC-1 — Config dependency | 2 |
| FC-2 — External connectivity | 4 |
*FR-21 to FR-23 (target discovery) and FR-27 to FR-29 (identity integration) are deferred — they depend on OpsCatalog and an identity provider respectively.*
---
## Deferred
- **FR-21 to FR-23**: Infrastructure target discovery (`bridge targets`); requires OpsCatalog
- **FR-27 to FR-29**: Identity provider integration (privacyIDEA / SSH CA); requires external identity infrastructure
- **OpsCatalog**: Separate workplan (`BRIDGE-WP-0002`)