ops-bridge
==========

SSH reverse tunnel lifecycle manager. Keeps remote execution environments
(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub so
Claude Code sessions on those machines have full MCP connectivity.

WHAT IT DOES
------------

`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:

  - Is identified by a human-readable name (e.g. state-hub-coulombcore)
  - Runs as an SSH reverse port-forward:
      ssh -R remote:127.0.0.1:local host
  - Auto-reconnects on drop using exponential backoff
  - Optionally runs an HTTP health check to confirm the forwarded service is
    actually reachable (not just the SSH process alive)
  - Records structured audit events (bridge_started, bridge_connected,
    health_check_failed, etc.) to a JSON log per tunnel

Bridge states:

  stopped -> starting -> connected <-> degraded -> reconnecting

INSTALL
-------

Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).

  uv tool install /path/to/ops-bridge

This registers the `bridge` command globally. For development:

  cd /path/to/ops-bridge
  uv tool install -e .

Verify:

  bridge --help

CONFIGURATION
-------------

Config file: ~/.config/bridge/tunnels.yaml
Override with: BRIDGE_CONFIG=/path/to/config.yaml

Minimal example:

  tunnels:
    state-hub-coulombcore:
      host: coulombcore.local
      remote_port: 18000
      local_port: 8000
      ssh_user: ubuntu
      ssh_key: ~/.ssh/id_ops
      actor: agent.claude-coulombcore

  actors:
    agent.claude-coulombcore:
      class: automation
      description: Claude Code agent on CoulombCore

With health check and reconnect policy:

  tunnels:
    state-hub-coulombcore:
      host: coulombcore.local
      remote_port: 18000
      local_port: 8000
      ssh_user: ubuntu
      ssh_key: ~/.ssh/id_ops
      actor: agent.claude-coulombcore
      health_check:
        url: http://127.0.0.1:8000/health   # LOCAL side of the tunnel (local_port)
        interval_seconds: 30
        timeout_seconds: 5
      reconnect:
        max_attempts: 0      # 0 = retry forever
        backoff_initial: 5
        backoff_max: 60

  actors:
    agent.claude-coulombcore:
      class: automation      # "human" or "automation"
      description: Claude Code agent on CoulombCore
    operator.bernd:
      class: human
      description: Bernd Worsch

Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
Required actor fields: class (must be "human" or "automation")

CLI COMMANDS
------------

Lifecycle:

  bridge up [TUNNEL]        Start one tunnel, or all if no name given
  bridge down [TUNNEL]      Stop one tunnel, or all
  bridge restart [TUNNEL]   Restart one tunnel, or all

Observation:

  bridge status             Show all tunnels: state, uptime, last event
  bridge status --json      Machine-readable JSON output
  bridge logs TUNNEL        Tail the audit log for a tunnel
  bridge logs TUNNEL --lines 100 --follow

Examples:

  bridge up state-hub-coulombcore
  bridge status
  bridge logs state-hub-coulombcore --follow
  bridge down state-hub-coulombcore

OPSCATALOG EXTENSION (optional)
--------------------------------

If you maintain a Git-backed YAML catalog of your infrastructure, point bridge
at it in your config:

  catalog_path: ~/ops-infra/opscatalog/

Catalog layout:

  opscatalog/
    domains/
      /
        domain.yaml
    targets/
      .yaml
    bridges/
      .yaml

Then you can use:

  bridge targets [--domain DOMAIN]   List all targets (optionally filtered)
  bridge targets show TARGET_ID      Show full target metadata
  bridge catalog list                List domains with counts
  bridge catalog validate            Check catalog for consistency errors
  bridge catalog show BRIDGE_ID      Show a catalog bridge's full metadata

Bridges defined in the catalog are resolved the same way as inline tunnels.
Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when both
define the same name.

STATE FILES
-----------

Runtime state is stored in ~/.local/state/bridge/:

  {name}.pid     Manager process ID
  {name}.state   Current bridge state (e.g. "connected")
  {name}.log     Audit log, one JSON object per line

Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir

AUDIT LOG FORMAT
----------------

Each event is one JSON object per line (wrapped here for readability):

  {
    "ts": "2026-03-12T14:23:01.456789",
    "tunnel": "state-hub-coulombcore",
    "event": "bridge_connected",
    "actor": "agent.claude-coulombcore",
    "actor_class": "automation",
    "detail": ""
  }

Event types:

  bridge_started, bridge_connected, bridge_disconnected, bridge_reconnecting,
  health_check_failed, health_check_recovered, bridge_stopped

MCP INTEGRATION
---------------

OpsBridge exposes its capabilities as a FastMCP server so Claude Code agents
can call bridge_up(), bridge_status(), catalog_list_targets(), etc. as
first-class MCP tools: no Bash required, structured JSON in/out.

Available tools:

  bridge_up, bridge_down, bridge_restart, bridge_status, bridge_logs,
  catalog_list_targets, catalog_show_target, catalog_list_domains,
  catalog_validate, catalog_show_bridge

Available resources:

  bridge://status, catalog://domains, catalog://targets

Project-scope (auto, inside ops-bridge/):

  Already configured in .mcp.json. Claude Code sessions inside this repo see
  the tools automatically.

User-scope (machine-global, any repo):

  python scripts/register_mcp.py

Human operator skill:

  /bridge-status: natural-language tunnel health summary
  (skill file: ~/.claude/plugins/ops-bridge/bridge-status.md)

Run the server directly (for debugging):

  uv run python src/bridge/mcp_server/server.py

DEVELOPMENT
-----------

  uv run pytest                       Run all tests
  uv run pytest tests/test_cli.py -v  Run a specific test file
  uv run ruff check .                 Lint

Source layout:

  src/bridge/
    cli.py       Typer CLI (entry point)
    models.py    Core dataclasses and enums
    config.py    Config loading from tunnels.yaml
    manager.py   Tunnel lifecycle (subprocess, reconnect loop)
    state.py     PID and state file management
    audit.py     Audit event logging
    health.py    HTTP health checker (async, httpx)
    catalog/     OpsCatalog extension

SERVER PREREQUISITES
--------------------

For reliable auto-reconnect after reboots or network drops, the remote sshd
needs two settings in /etc/ssh/sshd_config:

  ClientAliveInterval 30
  ClientAliveCountMax 3

Without these, dead SSH sessions hold their remote port forward open (the OS
has not yet cleaned up the socket), so the next reconnect attempt hits "remote
port forwarding failed" and exits with code 255. With ClientAlive enabled, sshd
evicts stale sessions within ~90 seconds and frees the port.

Apply and reload (no disconnect):

  sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
  sudo sed -i 's/#ClientAliveCountMax 3/ClientAliveCountMax 3/' /etc/ssh/sshd_config
  sudo kill -HUP $(cat /run/sshd.pid)

If fail2ban is running on the remote, whitelist the bridge host IP so rapid
reconnect storms (e.g. after a key auth failure) do not trigger a ban. Add the
client IP to ignoreip in /etc/fail2ban/jail.local:

  [DEFAULT]
  ignoreip = 127.0.0.1/8 ::1

Then reload:

  sudo systemctl reload fail2ban

Note: health_check.url must point to a LOCAL port (the local side of the
tunnel), not the remote forwarded port.
For a reverse tunnel (remote_port=18000, local_port=8000), the correct health
check URL is http://127.0.0.1:8000/..., NOT http://127.0.0.1:18000/...
For SSE endpoints (MCP), use a non-streaming endpoint from the same service
(e.g. the state-hub /state/health), since the health checker waits for the
response to complete.

DESIGN NOTES
------------

  - No system daemons. Tunnel processes are managed as subprocesses; PIDs are
    tracked in ~/.local/state/bridge/.
  - Graceful shutdown: SIGTERM to the manager process allows a clean exit;
    SIGKILL follows after 5 seconds if it is unresponsive.
  - Actor attribution on every log event (human vs. automation) supports audit
    traceability (FRS §5.7).
  - SSH command invoked:
      ssh -N -R remote_port:127.0.0.1:local_port -i ssh_key ssh_user@host
  - ExitOnForwardFailure=yes is set, so SSH exits immediately if the remote
    port is already in use. This is intentional: it forces a clean reconnect
    rather than silently running without the port forward active.

REPO STRUCTURE
--------------

  src/bridge/      Main source
  tests/           Test suite
  wiki/            PRD, FRS, OpsCatalog specification
  workplans/       Custodian State Hub workplan files (BRIDGE-WP-*)
  pyproject.toml   Build config and dependencies
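
EXAMPLE SKETCH
--------------

The SSH command from DESIGN NOTES and the reconnect settings from
CONFIGURATION (backoff_initial, backoff_max, max_attempts) can be sketched in
a few lines of Python. This is an illustration only, not the code in
manager.py: the names Tunnel, ssh_argv and backoff_delays are hypothetical,
the argument order and the use of `-o` for ExitOnForwardFailure are
assumptions, and so is the doubling factor (the README only says
"exponential backoff").

```python
import os.path
from dataclasses import dataclass
from itertools import islice
from typing import Iterator


@dataclass(frozen=True)
class Tunnel:
    """Hypothetical stand-in for one entry under `tunnels:` in tunnels.yaml."""
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str


def ssh_argv(t: Tunnel) -> list[str]:
    """Build the reverse port-forward command described in DESIGN NOTES."""
    return [
        "ssh", "-N",
        # Assumed placement; the README only states that this option is set.
        "-o", "ExitOnForwardFailure=yes",
        "-R", f"{t.remote_port}:127.0.0.1:{t.local_port}",
        "-i", os.path.expanduser(t.ssh_key),
        f"{t.ssh_user}@{t.host}",
    ]


def backoff_delays(initial: float, maximum: float,
                   max_attempts: int = 0) -> Iterator[float]:
    """Yield reconnect delays in seconds, capped at `maximum`.

    max_attempts=0 means retry forever, as in the config example.
    Doubling after each attempt is an assumption.
    """
    delay, attempt = initial, 0
    while max_attempts == 0 or attempt < max_attempts:
        yield delay
        delay = min(delay * 2, maximum)
        attempt += 1


if __name__ == "__main__":
    tunnel = Tunnel("coulombcore.local", 18000, 8000, "ubuntu", "~/.ssh/id_ops")
    print(" ".join(ssh_argv(tunnel)))
    # First six delays for backoff_initial=5, backoff_max=60.
    print(list(islice(backoff_delays(5, 60), 6)))
```

With the values from the config example, the delays grow 5, 10, 20, 40 and
then stay at 60 seconds. A generator keeps the "retry forever" case simple:
the caller takes the next delay after each failed attempt.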