ops-bridge
==========
SSH reverse tunnel lifecycle manager. Keeps remote execution environments
(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub
so Claude Code sessions on those machines have full MCP connectivity.
WHAT IT DOES
------------
`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:
- Is identified by a human-readable name (e.g. state-hub-coulombcore)
- Runs as an SSH reverse port-forward: ssh -R remote:127.0.0.1:local host
- Auto-reconnects on drop using exponential backoff
- Optionally runs an HTTP health check to confirm the forwarded service
is actually reachable (not just the SSH process alive)
- Records structured audit events (bridge_started, bridge_connected,
health_check_failed, etc.) to a JSON log per tunnel
Bridge states: stopped -> starting -> connected <-> degraded -> reconnecting
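The auto-reconnect behaviour listed above can be pictured as a capped doubling schedule. The sketch below is illustrative only: the parameter names mirror the reconnect keys in the configuration section (backoff_initial, backoff_max, max_attempts), but the doubling factor and the absence of jitter are assumptions, not confirmed details of the manager.

```python
from itertools import count, islice
from typing import Iterator


def backoff_delays(backoff_initial: float = 5, backoff_max: float = 60,
                   max_attempts: int = 0) -> Iterator[float]:
    """Yield the wait in seconds before each reconnect attempt.

    Assumptions: the delay doubles after every failed attempt and is
    capped at backoff_max; max_attempts=0 means retry forever.
    """
    attempts = count() if max_attempts == 0 else range(max_attempts)
    delay = backoff_initial
    for _ in attempts:
        yield delay
        delay = min(delay * 2, backoff_max)


# First six waits with the example values: [5, 10, 20, 40, 60, 60]
print(list(islice(backoff_delays(), 6)))
```

With max_attempts set to a positive number the generator simply stops, which a caller could treat as the signal to give up on the tunnel.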
INSTALL
-------
Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).
uv tool install /path/to/ops-bridge
This registers the `bridge` command globally. For development:
cd /path/to/ops-bridge
uv tool install -e .
Verify:
bridge --help
CONFIGURATION
-------------
Config file: ~/.config/bridge/tunnels.yaml
Override with: BRIDGE_CONFIG=/path/to/config.yaml
Minimal example:
    tunnels:
      state-hub-coulombcore:
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore
    actors:
      agent.claude-coulombcore:
        class: automation
        description: Claude Code agent on CoulombCore
With health check and reconnect policy:
    tunnels:
      state-hub-coulombcore:
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore
        health_check:
          url: http://127.0.0.1:8000/health   # LOCAL side of the tunnel (local_port)
          interval_seconds: 30
          timeout_seconds: 5
        reconnect:
          max_attempts: 0        # 0 = retry forever
          backoff_initial: 5
          backoff_max: 60
    actors:
      agent.claude-coulombcore:
        class: automation        # "human" or "automation"
        description: Claude Code agent on CoulombCore
      operator.bernd:
        class: human
        description: Bernd Worsch
Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
Required actor fields: class (must be "human" or "automation")
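The required-field rules above can be checked mechanically. The following is a minimal sketch, not the loader in config.py: validate_config is a hypothetical helper that works on an already-parsed config dictionary, and the check that a tunnel's actor must appear under actors is an assumption rather than a documented rule.

```python
REQUIRED_TUNNEL_FIELDS = (
    "host", "remote_port", "local_port", "ssh_user", "ssh_key", "actor",
)
ACTOR_CLASSES = {"human", "automation"}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []
    actors = config.get("actors") or {}
    for name, tunnel in (config.get("tunnels") or {}).items():
        for field in REQUIRED_TUNNEL_FIELDS:
            if field not in tunnel:
                errors.append(f"tunnel {name}: missing required field '{field}'")
        # Assumption: every tunnel actor should be declared under `actors`.
        actor = tunnel.get("actor")
        if actor is not None and actor not in actors:
            errors.append(f"tunnel {name}: unknown actor '{actor}'")
    for name, actor in actors.items():
        if actor.get("class") not in ACTOR_CLASSES:
            errors.append(f"actor {name}: class must be 'human' or 'automation'")
    return errors
```

Returning a list instead of raising on the first problem lets a caller report every issue in one pass.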
CLI COMMANDS
------------
Lifecycle:
    bridge up [TUNNEL]         Start one tunnel, or all if no name given
    bridge down [TUNNEL]       Stop one tunnel, or all
    bridge restart [TUNNEL]    Restart one tunnel, or all
Observation:
    bridge status              Show all tunnels: state, uptime, last event
    bridge status --json       Machine-readable JSON output
    bridge logs TUNNEL         Tail the audit log for a tunnel
    bridge logs TUNNEL --lines 100 --follow
Examples:
bridge up state-hub-coulombcore
bridge status
bridge logs state-hub-coulombcore --follow
bridge down state-hub-coulombcore
OPSCATALOG EXTENSION (optional)
--------------------------------
If you maintain a Git-backed YAML catalog of your infrastructure, point
bridge at it in your config:
catalog_path: ~/ops-infra/opscatalog/
Catalog layout:
    opscatalog/
      domains/
        <domain-id>/
          domain.yaml
          targets/
            <target-id>.yaml
          bridges/
            <bridge-id>.yaml
Then you can use:
    bridge targets [--domain DOMAIN]   List all targets (optionally filtered)
    bridge targets show TARGET_ID      Show full target metadata
    bridge catalog list                List domains with counts
    bridge catalog validate            Check catalog for consistency errors
    bridge catalog show BRIDGE_ID      Show a catalog bridge's full metadata
Bridges defined in the catalog are resolved the same way as inline tunnels.
Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when
both define the same name.
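The precedence rule amounts to a dictionary merge in which inline entries win. A small sketch, assuming both sources have already been loaded into name-to-definition dictionaries (resolve_bridges is a hypothetical name, not a function from the code base):

```python
def resolve_bridges(inline: dict, catalog: dict) -> dict:
    """Merge tunnel definitions; inline tunnels override catalog bridges."""
    resolved = dict(catalog)   # start from the catalog entries
    resolved.update(inline)    # inline definitions replace same-named ones
    return resolved


catalog = {"state-hub-coulombcore": {"host": "from-catalog"},
           "metrics": {"host": "metrics.local"}}
inline = {"state-hub-coulombcore": {"host": "coulombcore.local"}}
print(resolve_bridges(inline, catalog)["state-hub-coulombcore"]["host"])
```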
STATE FILES
-----------
Runtime state is stored in ~/.local/state/bridge/:
    {name}.pid      Manager process ID
    {name}.state    Current bridge state (e.g. "connected")
    {name}.log      Audit log, one JSON object per line
Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir
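A script that wants to inspect a tunnel without going through the CLI could read these files directly. The sketch below assumes the .pid and .state files contain plain text (an integer and a state name), and it treats a missing state file as "stopped"; both are assumptions about the on-disk format, and read_tunnel_state is a hypothetical helper.

```python
import os
from pathlib import Path


def state_dir() -> Path:
    """Return the state directory, honouring BRIDGE_STATE_DIR if set."""
    override = os.environ.get("BRIDGE_STATE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".local" / "state" / "bridge"


def read_tunnel_state(name: str) -> dict:
    """Read the recorded state and PID for one tunnel."""
    base = state_dir()
    state_file = base / f"{name}.state"
    pid_file = base / f"{name}.pid"
    # Assumption: no state file means the tunnel was never started.
    state = state_file.read_text().strip() if state_file.exists() else "stopped"
    pid = int(pid_file.read_text().strip()) if pid_file.exists() else None
    return {"name": name, "state": state, "pid": pid}
```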
AUDIT LOG FORMAT
----------------
Each event is written as a single JSON object on one line. The example below
is pretty-printed for readability:
    {
      "ts": "2026-03-12T14:23:01.456789",
      "tunnel": "state-hub-coulombcore",
      "event": "bridge_connected",
      "actor": "agent.claude-coulombcore",
      "actor_class": "automation",
      "detail": ""
    }
Event types: bridge_started, bridge_connected, bridge_disconnected,
bridge_reconnecting, health_check_failed, health_check_recovered,
bridge_stopped
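Because the log is line-delimited JSON, it can be summarised with a few lines of standard-library Python. This is a sketch for ad-hoc analysis and relies only on the "event" field shown above; count_events is a hypothetical helper.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str]) -> Counter:
    """Count audit events by type from an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        counts[json.loads(line)["event"]] += 1
    return counts


sample = [
    '{"ts": "2026-03-12T14:23:00", "tunnel": "t", "event": "bridge_started"}',
    '{"ts": "2026-03-12T14:23:01", "tunnel": "t", "event": "bridge_connected"}',
    '{"ts": "2026-03-12T14:40:00", "tunnel": "t", "event": "health_check_failed"}',
    '{"ts": "2026-03-12T14:40:30", "tunnel": "t", "event": "health_check_failed"}',
]
print(count_events(sample).most_common(1))
```

For a real log, pass an open file object instead of the sample list, since iterating a file yields its lines.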
MCP INTEGRATION
---------------
OpsBridge exposes its capabilities as a FastMCP server so Claude Code agents
can call bridge_up(), bridge_status(), catalog_list_targets(), etc. as
first-class MCP tools — no Bash required, structured JSON in/out.
Available tools: bridge_up, bridge_down, bridge_restart, bridge_status,
bridge_logs, catalog_list_targets, catalog_show_target,
catalog_list_domains, catalog_validate, catalog_show_bridge
Available resources: bridge://status, catalog://domains, catalog://targets
Project-scope (auto, inside ops-bridge/):
Already configured in .mcp.json. Claude Code sessions inside this repo
see the tools automatically.
User-scope (machine-global, any repo):
python scripts/register_mcp.py
Human operator skill:
/bridge-status — natural-language tunnel health summary
(skill file: ~/.claude/plugins/ops-bridge/bridge-status.md)
Run the server directly (for debugging):
uv run python src/bridge/mcp_server/server.py
DEVELOPMENT
-----------
    uv run pytest                       Run all tests
    uv run pytest tests/test_cli.py -v  Run a specific test file
    uv run ruff check .                 Lint
Source layout:
    src/bridge/
      cli.py        Typer CLI (entry point)
      models.py     Core dataclasses and enums
      config.py     Config loading from tunnels.yaml
      manager.py    Tunnel lifecycle (subprocess, reconnect loop)
      state.py      PID and state file management
      audit.py      Audit event logging
      health.py     HTTP health checker (async, httpx)
      catalog/      OpsCatalog extension
SERVER PREREQUISITES
--------------------
For reliable auto-reconnect after reboots or network drops, the remote sshd
needs two settings in /etc/ssh/sshd_config:
ClientAliveInterval 30
ClientAliveCountMax 3
Without these, dead SSH sessions hold their remote port forward open (the OS
has not yet cleaned up the socket), so the next reconnect attempt hits
"remote port forwarding failed" and exits with code 255. With ClientAlive
enabled, sshd evicts stale sessions within ~90 seconds and frees the port.
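The ~90 second figure is ClientAliveInterval multiplied by ClientAliveCountMax (30 x 3). As a rough sketch, the same arithmetic can be applied to the text of an sshd_config. The helper below is hypothetical and deliberately simple: it ignores Match blocks and Include directives, and falls back to the sshd defaults (interval 0, count 3) when a setting is absent or commented out.

```python
def eviction_window_seconds(sshd_config_text: str) -> int:
    """Estimate seconds until sshd drops an unresponsive client.

    Returns 0 when ClientAliveInterval is unset or 0, which means
    client-alive probing is disabled.
    """
    settings = {}
    for raw in sshd_config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd keywords are case-insensitive; the first value wins.
            settings.setdefault(parts[0].lower(), parts[1].strip())
    interval = int(settings.get("clientaliveinterval", 0))
    count_max = int(settings.get("clientalivecountmax", 3))
    return interval * count_max


print(eviction_window_seconds("ClientAliveInterval 30\nClientAliveCountMax 3\n"))
```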
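Apply and reload (existing sessions stay connected). The sed patterns below
assume the stock commented-out defaults; if your sshd_config differs, edit
the two settings by hand: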
sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
sudo sed -i 's/#ClientAliveCountMax 3/ClientAliveCountMax 3/' /etc/ssh/sshd_config
sudo kill -HUP $(cat /run/sshd.pid)
If fail2ban is running on the remote, whitelist the bridge host IP so rapid
reconnect storms (e.g. after a key auth failure) do not trigger a ban.
Add the client IP to ignoreip in /etc/fail2ban/jail.local:
[DEFAULT]
ignoreip = 127.0.0.1/8 ::1 <your-bridge-host-ip>
Then reload: sudo systemctl reload fail2ban
Note: health_check.url must point to a LOCAL port (the local side of the
tunnel), not the remote forwarded port. For a reverse tunnel
(remote_port=18000, local_port=8000), the correct health check URL is
http://127.0.0.1:8000/... — NOT http://127.0.0.1:18000/...
For SSE endpoints (MCP), use a non-streaming endpoint from the same service
(e.g. the state-hub /state/health) since the health checker waits for the
response to complete.
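To try a candidate health URL by hand before putting it in the config, a blocking check with the standard library is enough. This is a stand-in sketch using urllib, not the async httpx checker in health.py; check_health is a hypothetical helper, and treating any 2xx response as healthy is an assumption.

```python
import http.client
import urllib.error
import urllib.request


def check_health(url: str, timeout_seconds: float = 5) -> bool:
    """Return True if the URL answers with a 2xx status before the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            # Read the whole body: a streaming (SSE) endpoint would hang
            # here until the timeout, which is why it makes a poor target.
            response.read()
            return 200 <= response.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Covers refused connections, timeouts, HTTP errors, malformed URLs.
        return False
```

For the example tunnel this would be called as check_health("http://127.0.0.1:8000/health"), using the local side of the tunnel.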
DESIGN NOTES
------------
- No system daemons. Tunnel processes are managed as subprocesses; PIDs
are tracked in ~/.local/state/bridge/.
- Graceful shutdown: SIGTERM to the manager process allows a clean exit;
SIGKILL follows after 5 seconds if it is unresponsive.
- Actor attribution on every log event (human vs. automation) supports
audit traceability (FRS §5.7).
- SSH command invoked: ssh -N -R remote_port:127.0.0.1:local_port
-i ssh_key ssh_user@host
- ExitOnForwardFailure=yes is set, so SSH exits immediately if the remote
port is already in use. This is intentional — it forces a clean reconnect
rather than silently running without the port forward active.
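Putting the last two notes together, the argument list for one tunnel might be assembled as below. This is a sketch built from the flags named above; build_ssh_command is a hypothetical helper, and the exact argument order and the use of -o to pass ExitOnForwardFailure are assumptions about how the manager invokes ssh.

```python
import os


def build_ssh_command(host: str, remote_port: int, local_port: int,
                      ssh_user: str, ssh_key: str) -> list[str]:
    """Build the ssh argv for a reverse port-forward (no remote command)."""
    return [
        "ssh",
        "-N",                                   # forward only, run nothing
        "-o", "ExitOnForwardFailure=yes",       # fail fast if the port is taken
        "-R", f"{remote_port}:127.0.0.1:{local_port}",
        "-i", os.path.expanduser(ssh_key),      # expand a leading ~
        f"{ssh_user}@{host}",
    ]


print(" ".join(build_ssh_command(
    "coulombcore.local", 18000, 8000, "ubuntu", "/home/ops/.ssh/id_ops")))
```

Passing the command as a list to subprocess avoids shell quoting issues with key paths or host names.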
REPO STRUCTURE
--------------
    src/bridge/       Main source
    tests/            Test suite
    wiki/             PRD, FRS, OpsCatalog specification
    workplans/        Custodian State Hub workplan files (BRIDGE-WP-*)
    pyproject.toml    Build config and dependencies