ops-bridge/workplans/BRIDGE-WP-0001-initial-implementation.md
tegwick af2d419bf6 chore: mark BRIDGE-WP-0001 and BRIDGE-WP-0002 workplans as completed
All 39 tasks marked done; both workstreams updated to completed status
in the State Hub and workplan files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 03:37:32 +01:00

id: BRIDGE-WP-0001
type: workplan
title: OpsBridge Initial Implementation
domain: custodian
repo: ops-bridge
status: completed
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: 79112cff-9c0a-42ad-aa3d-916013001aee
created: 2026-03-11
updated: 2026-03-12

BRIDGE-WP-0001 — OpsBridge Initial Implementation

Scope: Full implementation of the bridge CLI tool as specified in the PRD and FRS. Out of scope: OpsCatalog integration (deferred to a separate workplan, BRIDGE-WP-0002).


Goal

Deliver a working bridge CLI installable via uv tool install that manages named SSH reverse tunnels with auto-reconnect, optional HTTP health checks, actor attribution, and an operational audit log.


Reference Documents

| Document  | Location            |
|-----------|---------------------|
| PRD       | wiki/OpsBridgePrd.md |
| FRS       | wiki/OpsBridgeFrs.md |
| CLAUDE.md | CLAUDE.md           |

Architecture Summary

~/.config/bridge/tunnels.yaml        # static config: tunnels + actors
~/.local/state/bridge/               # runtime state
    <name>.pid                       # PID of tunnel subprocess manager
    <name>.log                       # reconnect + health event log
    <name>.state                     # current state string (for status cmd)

src/bridge/
    __init__.py
    cli.py              # Typer app, all commands
    config.py           # load + validate tunnels.yaml
    models.py           # dataclasses: TunnelConfig, BridgeState, ActorInfo
    manager.py          # TunnelManager: start/stop subprocess, reconnect loop
    health.py           # HTTP health check via httpx
    state.py            # read/write PID + state files
    audit.py            # structured event log writer

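Bridge state machine: stopped → starting → connected → degraded → failed, with reconnecting entered whenever the SSH process disconnects

  • degraded = SSH process alive but HTTP health check failing
  • reconnecting = SSH process has exited and the backoff loop is restarting it
  • failed = reconnect attempts exhausted (configurable max)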

Config Schema (~/.config/bridge/tunnels.yaml)

tunnels:
  state-hub-coulombcore:
    host: coulombcore.local
    remote_port: 18000
    local_port: 8000
    ssh_user: ubuntu
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-coulombcore
    health_check:
      url: http://127.0.0.1:18000/health   # checked from remote side
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0    # 0 = infinite
      backoff_initial: 5
      backoff_max: 60

actors:
  agent.claude-coulombcore:
    class: automation
    description: Claude Code agent on CoulombCore
  operator.bernd:
    class: human
    description: Bernd Worsch

Phase 1 — Project Scaffolding

Acceptance: bridge --help lists all commands.

T01 — Create pyproject.toml

id: BRIDGE-WP-0001-T01
state_hub_task_id: 76c9ee58-10bf-4060-87bb-b73fa8cf25ea
status: done
priority: high

Set up [project], [project.scripts] (entry point bridge = bridge.cli:app), and dependencies: typer, pyyaml, httpx. Run uv lock.

T02 — Create package skeleton

id: BRIDGE-WP-0001-T02
state_hub_task_id: b2be974c-6173-457d-9276-080ac551c105
status: done
priority: high

Create src/bridge/__init__.py and empty module stubs: cli.py, config.py, models.py, manager.py, health.py, state.py, audit.py.

T03 — Verify uv tool install

id: BRIDGE-WP-0001-T03
state_hub_task_id: 82f70483-91ae-4545-88af-44fe693ecb79
status: done
priority: medium

Verify uv tool install -e . produces a working bridge --help.


Phase 2 — Config Loading (FR-2, FC-1)

Acceptance: config.load() returns typed config objects; clear error message on bad YAML.

T04 — Define config dataclasses in models.py

id: BRIDGE-WP-0001-T04
state_hub_task_id: 495e4257-40ad-4a1b-8a71-3a311476d41e
status: done
priority: high

Define TunnelConfig, ReconnectPolicy, HealthCheckConfig, ActorInfo as dataclasses.
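
A minimal sketch of how these dataclasses might look, derived from the fields in the config schema above. The default values simply mirror the sample tunnels.yaml, and the `name` and `actor_class` field names are assumptions (`class` is a reserved word in Python, so it cannot be used directly).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class ReconnectPolicy:
    max_attempts: int = 0      # 0 = infinite, as in the config schema
    backoff_initial: int = 5   # seconds (default assumed from the sample config)
    backoff_max: int = 60      # seconds (default assumed from the sample config)


@dataclass(frozen=True)
class HealthCheckConfig:
    url: str
    interval_seconds: int = 30
    timeout_seconds: int = 5


@dataclass(frozen=True)
class ActorInfo:
    name: str
    actor_class: str           # maps to the YAML key "class" (assumed rename)
    description: str = ""


@dataclass(frozen=True)
class TunnelConfig:
    name: str                  # the YAML key under "tunnels"
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str
    actor: str
    health_check: Optional[HealthCheckConfig] = None   # health checks are optional
    reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
```

Freezing the dataclasses keeps loaded config immutable, which is a design choice of this sketch rather than something the workplan requires.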

T05 — Implement config.py

id: BRIDGE-WP-0001-T05
state_hub_task_id: b6782df4-e692-49e1-b3a3-d65d07826907
status: done
priority: high

Load ~/.config/bridge/tunnels.yaml, validate required fields, raise clear errors. Support BRIDGE_CONFIG env var override for testing.
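
The path override and field validation could be sketched roughly as follows. This operates on an already-parsed YAML document so it stays free of third-party imports. `ConfigError`, the exact set of required fields, and the check that each tunnel's actor is defined under `actors` are all assumptions, not taken from the FRS.

```python
import os
from pathlib import Path

DEFAULT_CONFIG = Path("~/.config/bridge/tunnels.yaml")
# Assumed required set; the workplan does not list which fields are mandatory.
REQUIRED_TUNNEL_FIELDS = ("host", "remote_port", "local_port", "ssh_user", "ssh_key", "actor")


class ConfigError(Exception):
    """Raised with a human-readable message when tunnels.yaml is invalid."""


def config_path() -> Path:
    """Return the config file path, honouring the BRIDGE_CONFIG override."""
    override = os.environ.get("BRIDGE_CONFIG")
    return Path(override).expanduser() if override else DEFAULT_CONFIG.expanduser()


def validate(raw: dict) -> None:
    """Check a parsed config document and raise ConfigError on the first problem."""
    tunnels = raw.get("tunnels")
    if not isinstance(tunnels, dict) or not tunnels:
        raise ConfigError("config must define at least one entry under 'tunnels'")
    actors = raw.get("actors") or {}
    for name, body in tunnels.items():
        missing = [f for f in REQUIRED_TUNNEL_FIELDS if f not in (body or {})]
        if missing:
            raise ConfigError(f"tunnel '{name}': missing required field(s): {', '.join(missing)}")
        if body["actor"] not in actors:   # assumed cross-check against the actors section
            raise ConfigError(f"tunnel '{name}': actor '{body['actor']}' is not defined under 'actors'")
```

In tests, pointing BRIDGE_CONFIG at a fixture file avoids touching the real config under the home directory.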

T06 — Unit tests for config loading

id: BRIDGE-WP-0001-T06
state_hub_task_id: 341c866f-8f4b-4165-9fa5-f10fe37c9252
status: done
priority: medium

Test: valid config, missing required field, unknown tunnel name.


Phase 3 — State Management (FR-4, FR-7, FR-14)

Acceptance: State round-trips correctly; stale PIDs detected without error.

T07 — Implement state.py

id: BRIDGE-WP-0001-T07
state_hub_task_id: ae5e2566-a4b1-426f-9c32-4a2c025f2927
status: done
priority: high

Read/write PID file and state file under ~/.local/state/bridge/. Check if PID is alive. Create state dir on first write.
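
One possible shape for the PID handling, assuming a POSIX system where signal 0 probes a process without delivering anything. The `BRIDGE_STATE_DIR` override is hypothetical and exists here only so the sketch can be exercised against a temporary directory.

```python
import os
from pathlib import Path
from typing import Optional


def state_dir() -> Path:
    # BRIDGE_STATE_DIR is a hypothetical test override, not part of the workplan.
    return Path(os.environ.get("BRIDGE_STATE_DIR", "~/.local/state/bridge")).expanduser()


def write_pid(name: str, pid: int) -> None:
    d = state_dir()
    d.mkdir(parents=True, exist_ok=True)   # create the state dir on first write
    (d / f"{name}.pid").write_text(f"{pid}\n")


def read_pid(name: str) -> Optional[int]:
    """Return the recorded PID, or None if the file is missing or unreadable."""
    try:
        return int((state_dir() / f"{name}.pid").read_text().strip())
    except (FileNotFoundError, ValueError):
        return None


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (existence check only, POSIX)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def live_pid(name: str) -> Optional[int]:
    """Return the PID only if it is recorded and still alive; stale PIDs yield None."""
    pid = read_pid(name)
    return pid if pid is not None and pid_alive(pid) else None
```

Returning None for a stale PID, instead of raising, matches the acceptance criterion that stale PIDs are detected without error.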

T08 — Define BridgeState enum

id: BRIDGE-WP-0001-T08
state_hub_task_id: 456a3cb5-50fa-4fed-9283-57e2d1c6fbb9
status: done
priority: medium

States: STOPPED, STARTING, CONNECTED, DEGRADED, RECONNECTING, FAILED.
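
A sketch of the enum with lowercase string values, so that a member can be written to and read back from the `<name>.state` file directly. The transition table is an assumption: the workplan names the states but does not enumerate every allowed edge.

```python
from enum import Enum


class BridgeState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


# Assumed transition table, for illustration only.
ALLOWED = {
    BridgeState.STOPPED: {BridgeState.STARTING},
    BridgeState.STARTING: {BridgeState.CONNECTED, BridgeState.RECONNECTING,
                           BridgeState.FAILED, BridgeState.STOPPED},
    BridgeState.CONNECTED: {BridgeState.DEGRADED, BridgeState.RECONNECTING,
                            BridgeState.STOPPED},
    BridgeState.DEGRADED: {BridgeState.CONNECTED, BridgeState.RECONNECTING,
                           BridgeState.STOPPED},
    BridgeState.RECONNECTING: {BridgeState.CONNECTED, BridgeState.FAILED,
                               BridgeState.STOPPED},
    BridgeState.FAILED: {BridgeState.STARTING, BridgeState.STOPPED},
}


def can_transition(src: BridgeState, dst: BridgeState) -> bool:
    """Return True if moving from src to dst is in the assumed table."""
    return dst in ALLOWED[src]
```

Parsing a state file then reduces to `BridgeState(text.strip())`, which raises `ValueError` on an unknown string.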

T09 — Unit tests for state management

id: BRIDGE-WP-0001-T09
state_hub_task_id: 0accc0b7-d013-43ad-a810-3269e64fb096
status: done
priority: medium

Test: write/read state round-trip, stale PID detection without error.


Phase 4 — Tunnel Process Manager (FR-1, FR-3, FR-12, FR-13)

Acceptance: bridge up <name> starts tunnel; killing SSH process triggers reconnect; bridge down <name> stops cleanly.

T10 — Implement TunnelManager — SSH subprocess wrapper

id: BRIDGE-WP-0001-T10
state_hub_task_id: d0341e90-b48d-48ab-9e6d-82f4c365afec
status: done
priority: high

SSH command: ssh -N -R {remote_port}:127.0.0.1:{local_port} -i {key} -o ServerAliveInterval=10 -o ExitOnForwardFailure=yes {user}@{host}. Manager runs as a daemonised child process; parent writes PID and exits.
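
The command line above might be assembled as an argument list, which avoids shell quoting issues when it is handed to `subprocess.Popen`. The function name `build_ssh_argv` is hypothetical.

```python
import os
from typing import List


def build_ssh_argv(host: str, user: str, key: str,
                   remote_port: int, local_port: int) -> List[str]:
    """Build the ssh argument vector for one reverse tunnel."""
    return [
        "ssh",
        "-N",                                            # no remote command, forwarding only
        "-R", f"{remote_port}:127.0.0.1:{local_port}",   # remote port forwards to the local service
        "-i", os.path.expanduser(key),                   # expand ~ here because no shell is involved
        "-o", "ServerAliveInterval=10",                  # probe the server after 10 s of silence
        "-o", "ExitOnForwardFailure=yes",                # exit if the remote port cannot be bound
        f"{user}@{host}",
    ]
```

With the sample config, the forwarding argument comes out as `18000:127.0.0.1:8000`.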

T11 — Implement reconnect backoff loop

id: BRIDGE-WP-0001-T11
state_hub_task_id: f5c91eff-fca3-4f66-b073-276a733b5a27
status: done
priority: high

Exponential backoff between backoff_initial and backoff_max. Respect max_attempts (0 = infinite). On disconnect: state → RECONNECTING, log event, restart SSH.
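
The delay schedule can be sketched as a generator. Doubling the delay on each attempt is an assumption; the workplan says only that the backoff is exponential and bounded by backoff_initial and backoff_max.

```python
import itertools
from typing import Iterator


def backoff_delays(initial: float, maximum: float, max_attempts: int = 0) -> Iterator[float]:
    """Yield the delay before each reconnect attempt, doubling up to `maximum`.

    max_attempts == 0 means the generator never ends (infinite retries).
    """
    delay = initial
    attempts = itertools.count(1) if max_attempts == 0 else range(1, max_attempts + 1)
    for _ in attempts:
        yield min(delay, maximum)
        delay = min(delay * 2, maximum)   # assumed growth factor of 2
```

With the sample values (initial 5, max 60) and six attempts, the delays are 5, 10, 20, 40, 60, 60 seconds. When the generator is exhausted, the manager would move the tunnel to FAILED.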

T12 — Implement graceful shutdown

id: BRIDGE-WP-0001-T12
state_hub_task_id: 3f4df535-0d6a-49e8-9d3a-c3926d7f230c
status: done
priority: medium

Catch SIGTERM/SIGINT, kill SSH subprocess, write STOPPED state.
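
A rough sketch of the shutdown path, assuming the manager holds the SSH child as a `subprocess.Popen` and has some callable for persisting state. The function name, the `write_state` callback, and the 5-second grace period are all hypothetical.

```python
import signal
import subprocess
from typing import Callable


def install_shutdown_handlers(proc: subprocess.Popen,
                              write_state: Callable[[str], None]) -> Callable:
    """Register SIGTERM/SIGINT handlers that stop `proc` and record the stopped state."""

    def handler(signum, frame):
        if proc.poll() is None:          # SSH child still running
            proc.terminate()             # polite SIGTERM first
            try:
                proc.wait(timeout=5)     # assumed grace period
            except subprocess.TimeoutExpired:
                proc.kill()              # escalate if it ignores SIGTERM
                proc.wait()
        write_state("stopped")
        raise SystemExit(0)              # unwind the manager's main loop

    # signal.signal must be called from the main thread.
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)
    return handler
```

Writing the state after the child has been reaped means a concurrent `bridge status` should not report STOPPED while SSH is still alive.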


Phase 5 — Health Monitoring (FR-15, FR-16, FR-17)

Acceptance: With a non-responsive health URL, bridge status shows degraded.

T13 — Implement health.py

id: BRIDGE-WP-0001-T13
state_hub_task_id: 5aaa0e35-f32a-4c68-8707-1a1e037b76f4
status: done
priority: medium

Async HTTP GET via httpx to configured health URL. Run health check loop inside manager process. On failure: state → DEGRADED; on recovery: state → CONNECTED.
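
The state handling around the health check might look like the sketch below. The HTTP call itself is injected as an async `probe` callable standing in for the httpx GET, so the example needs no third-party imports. The `rounds` parameter exists only so the sketch terminates, and treating every exception as a failed check is an assumption.

```python
import asyncio
from typing import Awaitable, Callable, List


def next_state(current: str, healthy: bool) -> str:
    """Apply one health result; only connected and degraded react to it."""
    if current == "connected" and not healthy:
        return "degraded"
    if current == "degraded" and healthy:
        return "connected"
    return current


async def health_loop(probe: Callable[[], Awaitable[bool]], interval: float,
                      state: str, rounds: int) -> List[str]:
    """Run `rounds` probes and return the state observed after each one."""
    history = []
    for _ in range(rounds):
        try:
            healthy = await probe()
        except Exception:
            healthy = False              # timeouts and connection errors count as failures
        state = next_state(state, healthy)
        history.append(state)
        await asyncio.sleep(interval)    # interval_seconds from the config
    return history
```

Keeping `next_state` as a pure function makes the degraded and recovery paths testable without a mock HTTP server.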

T14 — Write health check result to state dir

id: BRIDGE-WP-0001-T14
state_hub_task_id: 599d4e28-88c8-4c2a-80ac-ca57824af467
status: done
priority: low

Persist timestamp, status, HTTP code or error for display in bridge status.


Phase 6 — Audit Logging (FR-24, FR-25, FR-26)

Acceptance: All lifecycle events appear in the log with actor attribution.

T15 — Implement audit.py

id: BRIDGE-WP-0001-T15
state_hub_task_id: 2f124b16-f1e7-4e9f-ad23-9f08543db3b7
status: done
priority: medium

Append JSON-lines to ~/.local/state/bridge/<name>.log. Events: bridge_started, bridge_connected, bridge_disconnected, bridge_reconnecting, health_check_failed, health_check_recovered, bridge_stopped. Each entry: timestamp (ISO-8601), tunnel, actor, actor_class, event, detail.
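
A minimal sketch of the writer, using the entry fields listed above. The function name `append_event` and the choice of UTC timestamps are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, tunnel: str, actor: str, actor_class: str,
                 event: str, detail: str = "") -> dict:
    """Append one audit entry as a single JSON line and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC assumed
        "tunnel": tunnel,
        "actor": actor,
        "actor_class": actor_class,
        "event": event,
        "detail": detail,
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")   # one JSON object per line
    return entry
```

Because each entry is a self-contained line, the log can be tailed and filtered with ordinary line-oriented tools.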


Phase 7 — CLI Commands (FR-1, FR-5, FR-8, FR-10, FR-11)

Acceptance: All commands work end-to-end; --help on each command shows correct usage.

Status table columns: TUNNEL, STATE, ACTOR, HOST, UPTIME, HEALTH. Exit codes: 0 = success, 1 = tunnel not found / config error, 2 = tunnel already in requested state. --json flag on status for automation.
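
The exit-code rules can be captured in a small enum. The member names and the helper `exit_code_for_up` are hypothetical; only the numeric values and their meanings come from the text above.

```python
from enum import IntEnum
from typing import Set


class ExitCode(IntEnum):
    OK = 0
    NOT_FOUND_OR_CONFIG_ERROR = 1
    ALREADY_IN_STATE = 2


def exit_code_for_up(name: str, known_tunnels: Set[str], running: Set[str]) -> ExitCode:
    """Pick the exit code for `bridge up <name>` under the rules above."""
    if name not in known_tunnels:
        return ExitCode.NOT_FOUND_OR_CONFIG_ERROR
    if name in running:
        return ExitCode.ALREADY_IN_STATE   # tunnel is already up
    return ExitCode.OK
```

Distinguishing exit code 2 from 1 lets automation treat "already up" as a non-fatal outcome.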

T16 — CLI: bridge up

id: BRIDGE-WP-0001-T16
state_hub_task_id: 2c22b8fe-8a35-4887-89b2-f8fb7f43e0b6
status: done
priority: high

Start named tunnel or all tunnels if name omitted.

T17 — CLI: bridge down

id: BRIDGE-WP-0001-T17
state_hub_task_id: 768e1a8b-fdf7-4718-b00e-bc2401f57657
status: done
priority: high

Stop named tunnel or all tunnels if name omitted.

T18 — CLI: bridge restart

id: BRIDGE-WP-0001-T18
state_hub_task_id: 8fd6486d-af4f-4295-a57a-a5fabbf25681
status: done
priority: medium

Down then up for named tunnel or all.

T19 — CLI: bridge status

id: BRIDGE-WP-0001-T19
state_hub_task_id: 28f3f392-9e94-43e7-811a-fa036f588e10
status: done
priority: high

Table output with --json flag for automation.
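
One way to render the two output modes from the same row data. The lowercase key names in each row (and therefore in the JSON output) are an assumption, as is the plain space-padded table layout.

```python
import json
from typing import Dict, List

COLUMNS = ["TUNNEL", "STATE", "ACTOR", "HOST", "UPTIME", "HEALTH"]


def render_status(rows: List[Dict[str, str]], as_json: bool = False) -> str:
    """Render status rows as JSON or as a padded text table."""
    if as_json:
        return json.dumps(rows, indent=2)
    # Column width = widest of the header and every cell in that column.
    widths = {c: max([len(c)] + [len(r[c.lower()]) for r in rows]) for c in COLUMNS}
    lines = ["  ".join(c.ljust(widths[c]) for c in COLUMNS)]
    for r in rows:
        lines.append("  ".join(r[c.lower()].ljust(widths[c]) for c in COLUMNS))
    return "\n".join(line.rstrip() for line in lines)
```

Building both outputs from one list of dictionaries keeps the table and the `--json` view from drifting apart.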

T20 — CLI: bridge logs

id: BRIDGE-WP-0001-T20
state_hub_task_id: 43582657-b1b9-4113-88e1-2109b30f3732
status: done
priority: medium

Tail log file. Defaults to last 50 lines. --follow for live tail. --lines N to override.
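
The non-following case can be sketched with a bounded deque, which keeps only the last N lines in memory while streaming through the file. The function name `tail_lines` is hypothetical, and `--follow` is not covered here.

```python
from collections import deque
from pathlib import Path
from typing import List


def tail_lines(path: Path, lines: int = 50) -> List[str]:
    """Return the last `lines` lines of a log file (default 50)."""
    with path.open("r", encoding="utf-8") as fh:
        # deque with maxlen discards older lines as newer ones arrive.
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]
```

For a 120-line log, `tail_lines(path)` returns lines 71 through 120, and `tail_lines(path, 3)` returns the final three.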


Phase 8 — Integration Tests

Acceptance: uv run pytest passes cleanly.

T21 — Integration test: up/status/down cycle

id: BRIDGE-WP-0001-T21
state_hub_task_id: 5e3c7ac6-03fd-45e9-af64-11bde1d03ab8
status: done
priority: medium

Test fixture with minimal tunnels.yaml pointing to localhost. Test full up → status → down cycle against loopback SSH target or mocked subprocess.

T22 — Integration test: reconnect behaviour

id: BRIDGE-WP-0001-T22
state_hub_task_id: 8b6ac68e-d0ab-4826-8df5-ebdf30a1e23e
status: done
priority: medium

Test reconnect loop with a subprocess that exits immediately.

T23 — Integration test: health check degraded path

id: BRIDGE-WP-0001-T23
state_hub_task_id: c472bb1a-2fe2-4a88-aa6b-e18f732a3fde
status: done
priority: medium

Test degraded state with a mock HTTP server that returns failures.


FRS Traceability

| FRS Requirement Group                | Phase |
|--------------------------------------|-------|
| FR-1 to FR-4: Bridge creation        | 4     |
| FR-5 to FR-7: Bridge termination     | 4     |
| FR-8 to FR-9: Bridge restart         | 7     |
| FR-10 to FR-11: Status inspection    | 7     |
| FR-12 to FR-14: Lifecycle monitoring | 4     |
| FR-15 to FR-17: Health monitoring    | 5     |
| FR-18 to FR-20: Actor attribution    | 2, 6  |
| FR-24 to FR-26: Audit logging        | 6     |
| FC-1: Config dependency              | 2     |
| FC-2: External connectivity          | 4     |

FR-21 to FR-23 (target discovery) and FR-27 to FR-29 (identity integration) are deferred — they depend on OpsCatalog and an identity provider respectively.


Deferred

  • FR-21 to FR-23: Infrastructure target discovery (bridge targets); requires OpsCatalog
  • FR-27 to FR-29: Identity provider integration (privacyIDEA / SSH CA); requires external identity infrastructure
  • OpsCatalog — Separate workplan (BRIDGE-WP-0002)