diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..5445989
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,224 @@
+ops-bridge
+==========
+
+SSH reverse tunnel lifecycle manager. Keeps remote execution environments
+(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub
+so Claude Code sessions on those machines have full MCP connectivity.
+
+
+WHAT IT DOES
+------------
+
+`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:
+
+  - Is identified by a human-readable name (e.g. state-hub-coulombcore)
+  - Runs as an SSH reverse port-forward: ssh -R remote:127.0.0.1:local host
+  - Auto-reconnects on drop using exponential backoff
+  - Optionally runs an HTTP health check to confirm the forwarded service
+    is actually reachable (not just the SSH process alive)
+  - Records structured audit events (bridge_started, bridge_connected,
+    health_check_failed, etc.) to a JSON log per tunnel
+
+Bridge states: stopped -> starting -> connected <-> degraded -> reconnecting
+
+
+INSTALL
+-------
+
+Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).
+
+    uv tool install /path/to/ops-bridge
+
+This registers the `bridge` command globally. For development:
+
+    cd /path/to/ops-bridge
+    uv tool install -e .
+
+Verify:
+
+    bridge --help
+
+
+CONFIGURATION
+-------------
+
+Config file: ~/.config/bridge/tunnels.yaml
+Override with: BRIDGE_CONFIG=/path/to/config.yaml
+
+Minimal example:
+
+    tunnels:
+      state-hub-coulombcore:
+        host: coulombcore.local
+        remote_port: 18000
+        local_port: 8000
+        ssh_user: ubuntu
+        ssh_key: ~/.ssh/id_ops
+        actor: agent.claude-coulombcore
+
+    actors:
+      agent.claude-coulombcore:
+        class: automation
+        description: Claude Code agent on CoulombCore
+
+With health check and reconnect policy:
+
+    tunnels:
+      state-hub-coulombcore:
+        host: coulombcore.local
+        remote_port: 18000
+        local_port: 8000
+        ssh_user: ubuntu
+        ssh_key: ~/.ssh/id_ops
+        actor: agent.claude-coulombcore
+
+        health_check:
+          url: http://127.0.0.1:18000/health   # checked from the REMOTE host
+          interval_seconds: 30
+          timeout_seconds: 5
+
+        reconnect:
+          max_attempts: 0        # 0 = retry forever
+          backoff_initial: 5
+          backoff_max: 60
+
+    actors:
+      agent.claude-coulombcore:
+        class: automation        # "human" or "automation"
+        description: Claude Code agent on CoulombCore
+      operator.bernd:
+        class: human
+        description: Bernd Worsch
+
+Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
+Required actor fields: class (must be "human" or "automation")
+
+
+CLI COMMANDS
+------------
+
+Lifecycle:
+
+    bridge up [TUNNEL]        Start one tunnel, or all if no name given
+    bridge down [TUNNEL]      Stop one tunnel, or all
+    bridge restart [TUNNEL]   Restart one tunnel, or all
+
+Observation:
+
+    bridge status             Show all tunnels: state, uptime, last event
+    bridge status --json      Machine-readable JSON output
+    bridge logs TUNNEL        Tail the audit log for a tunnel
+    bridge logs TUNNEL --lines 100 --follow
+
+Examples:
+
+    bridge up state-hub-coulombcore
+    bridge status
+    bridge logs state-hub-coulombcore --follow
+    bridge down state-hub-coulombcore
+
+
+OPSCATALOG EXTENSION (optional)
+--------------------------------
+
+If you maintain a Git-backed YAML catalog of your infrastructure, point
+bridge at it in your config:
+
+    catalog_path: ~/ops-infra/opscatalog/
+
+Catalog layout:
+
+    opscatalog/
+      domains/
+        /
+          domain.yaml
+          targets/
+            .yaml
+          bridges/
+            .yaml
+
+Then you can use:
+
+    bridge targets [--domain DOMAIN]   List all targets (optionally filtered)
+    bridge targets show TARGET_ID      Show full target metadata
+    bridge catalog list                List domains with counts
+    bridge catalog validate            Check catalog for consistency errors
+    bridge catalog show BRIDGE_ID      Show a catalog bridge's full metadata
+
+Bridges defined in the catalog are resolved the same way as inline tunnels.
+Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when
+both define the same name.
+
+
+STATE FILES
+-----------
+
+Runtime state is stored in ~/.local/state/bridge/:
+
+    {name}.pid     Manager process ID
+    {name}.state   Current bridge state (e.g. "connected")
+    {name}.log     Audit log, one JSON object per line
+
+Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir
+
+
+AUDIT LOG FORMAT
+----------------
+
+Each event is one JSON object per line:
+
+    {
+      "ts": "2026-03-12T14:23:01.456789",
+      "tunnel": "state-hub-coulombcore",
+      "event": "bridge_connected",
+      "actor": "agent.claude-coulombcore",
+      "actor_class": "automation",
+      "detail": ""
+    }
+
+Event types: bridge_started, bridge_connected, bridge_disconnected,
+bridge_reconnecting, health_check_failed, health_check_recovered,
+bridge_stopped
+
+
+DEVELOPMENT
+-----------
+
+    uv run pytest                        Run all tests
+    uv run pytest tests/test_cli.py -v   Run a specific test file
+    uv run ruff check .                  Lint
+
+Source layout:
+
+    src/bridge/
+      cli.py       Typer CLI (entry point)
+      models.py    Core dataclasses and enums
+      config.py    Config loading from tunnels.yaml
+      manager.py   Tunnel lifecycle (subprocess, reconnect loop)
+      state.py     PID and state file management
+      audit.py     Audit event logging
+      health.py    HTTP health checker (async, httpx)
+      catalog/     OpsCatalog extension
+
+
+DESIGN NOTES
+------------
+
+- No system daemons. Tunnel processes are managed as subprocesses; PIDs
+  are tracked in ~/.local/state/bridge/.
+- Graceful shutdown: SIGTERM to the daemon allows a clean exit; SIGKILL
+  follows after 5 seconds if unresponsive.
+- Actor attribution on every log event (human vs. automation) supports
+  audit traceability (FRS §5.7).
+- SSH command invoked: ssh -N -R remote_port:127.0.0.1:local_port
+    -i ssh_key ssh_user@host
+
+
+REPO STRUCTURE
+--------------
+
+    src/bridge/      Main source
+    tests/           Test suite
+    wiki/            PRD, FRS, OpsCatalog specification
+    workplans/       Custodian State Hub workplan files (BRIDGE-WP-*)
+    pyproject.toml   Build config and dependencies
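
The AUDIT LOG FORMAT section of the README says each event is written as one JSON object per line. The sketch below shows how such a log could be consumed. It assumes that one-event-per-line layout and the field names from the README example. The helper name summarize_audit_log and the sample events are hypothetical, made up for illustration; they are not part of ops-bridge.

```python
import json
from collections import Counter
from typing import Iterable


def summarize_audit_log(lines: Iterable[str]) -> Counter:
    """Count audit events by their "event" field.

    Hypothetical helper, not part of ops-bridge. Blank lines and lines
    that are not valid JSON objects are skipped rather than raising.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Tolerate e.g. a partially written final line.
            continue
        if isinstance(record, dict) and "event" in record:
            counts[record["event"]] += 1
    return counts


def make_line(ts: str, event: str, detail: str = "") -> str:
    """Build one illustrative log line using the README's field names."""
    return json.dumps({
        "ts": ts,
        "tunnel": "state-hub-coulombcore",
        "event": event,
        "actor": "agent.claude-coulombcore",
        "actor_class": "automation",
        "detail": detail,
    })


# Made-up sample data; the last entry simulates a truncated line.
sample = [
    make_line("2026-03-12T14:23:00.000000", "bridge_started"),
    make_line("2026-03-12T14:23:01.456789", "bridge_connected"),
    make_line("2026-03-12T14:30:00.000000", "health_check_failed"),
    "",
    '{"ts": "2026-03-12T14:31:',
]

summary = summarize_audit_log(sample)
for event_name, count in sorted(summary.items()):
    print(f"{event_name}: {count}")
```

To summarize a real log, pass an open file handle for the tunnel's {name}.log file in the state directory instead of the sample list.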