ops-bridge
==========

SSH reverse tunnel lifecycle manager. Keeps remote execution environments
(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub
so Claude Code sessions on those machines have full MCP connectivity.


WHAT IT DOES
------------

`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:

  - Is identified by a human-readable name (e.g. state-hub-coulombcore)
  - Runs as an SSH reverse port-forward: ssh -R remote:127.0.0.1:local host
  - Auto-reconnects on drop using exponential backoff
  - Optionally runs an HTTP health check to confirm the forwarded service
    is actually reachable (not just the SSH process alive)
  - Records structured audit events (bridge_started, bridge_connected,
    health_check_failed, etc.) to a JSON log per tunnel

Bridge states: stopped -> starting -> connected <-> degraded -> reconnecting
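
The state line above can be read as a small transition table. The sketch
below encodes only the edges drawn in that line; the names BridgeState,
ALLOWED and can_transition are illustrative and not taken from models.py,
and the real manager may well allow further transitions (for example back
to stopped on `bridge down`).

```python
from enum import Enum


class BridgeState(str, Enum):
    """The five states named in the README (hypothetical enum name)."""
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"


# Only the edges drawn in the state line; this is an assumption, not the
# project's actual transition table.
ALLOWED = {
    (BridgeState.STOPPED, BridgeState.STARTING),
    (BridgeState.STARTING, BridgeState.CONNECTED),
    (BridgeState.CONNECTED, BridgeState.DEGRADED),
    (BridgeState.DEGRADED, BridgeState.CONNECTED),
    (BridgeState.DEGRADED, BridgeState.RECONNECTING),
}


def can_transition(src: BridgeState, dst: BridgeState) -> bool:
    """Return True if src -> dst is one of the edges listed above."""
    return (src, dst) in ALLOWED
```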


INSTALL
-------

Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).

    uv tool install /path/to/ops-bridge

This registers the `bridge` command globally. For development:

    cd /path/to/ops-bridge
    uv tool install -e .

Verify:

    bridge --help


CONFIGURATION
-------------

Config file: ~/.config/bridge/tunnels.yaml
Override with: BRIDGE_CONFIG=/path/to/config.yaml

Minimal example:

    tunnels:
      state-hub-coulombcore:
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore

    actors:
      agent.claude-coulombcore:
        class: automation
        description: Claude Code agent on CoulombCore

With health check and reconnect policy:

    tunnels:
      state-hub-coulombcore:
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore

        health_check:
          url: http://127.0.0.1:8000/health   # LOCAL port (local_port), not remote_port
          interval_seconds: 30
          timeout_seconds: 5

        reconnect:
          max_attempts: 0        # 0 = retry forever
          backoff_initial: 5
          backoff_max: 60

    actors:
      agent.claude-coulombcore:
        class: automation        # "human" or "automation"
        description: Claude Code agent on CoulombCore
      operator.bernd:
        class: human
        description: Bernd Worsch

Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
Required actor fields: class (must be "human" or "automation")
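
To make the reconnect settings concrete, here is a minimal sketch of the
delay schedule they could produce. It assumes the delay doubles after each
failed attempt and is capped at backoff_max; the doubling factor and the
function name backoff_delays are assumptions, since this README only says
"exponential backoff".

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial: int = 5, maximum: int = 60,
                   max_attempts: int = 0, factor: int = 2) -> Iterator[int]:
    """Yield the wait before each reconnect attempt.

    max_attempts == 0 means retry forever, as in the config above.
    The factor of 2 is an assumption, not taken from the project.
    """
    delay = initial
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)
        attempt += 1


# First six delays for backoff_initial=5, backoff_max=60
print(list(islice(backoff_delays(5, 60), 6)))  # [5, 10, 20, 40, 60, 60]
```

With these values the schedule reaches the 60-second cap on the fifth
attempt and stays there.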


CLI COMMANDS
------------

Lifecycle:

    bridge up [TUNNEL]          Start one tunnel, or all if no name given
    bridge down [TUNNEL]        Stop one tunnel, or all
    bridge restart [TUNNEL]     Restart one tunnel, or all

Observation:

    bridge status               Show all tunnels: state, uptime, last event
    bridge status --json        Machine-readable JSON output
    bridge logs TUNNEL          Tail the audit log for a tunnel
    bridge logs TUNNEL --lines 100 --follow

Examples:

    bridge up state-hub-coulombcore
    bridge status
    bridge logs state-hub-coulombcore --follow
    bridge down state-hub-coulombcore


OPSCATALOG EXTENSION (optional)
-------------------------------

If you maintain a Git-backed YAML catalog of your infrastructure, point
bridge at it in your config:

    catalog_path: ~/ops-infra/opscatalog/

Catalog layout:

    opscatalog/
      domains/
        <domain-id>/
          domain.yaml
          targets/
            <target-id>.yaml
          bridges/
            <bridge-id>.yaml

Then you can use:

    bridge targets [--domain DOMAIN]   List all targets (optionally filtered)
    bridge targets show TARGET_ID      Show full target metadata
    bridge catalog list                List domains with counts
    bridge catalog validate            Check catalog for consistency errors
    bridge catalog show BRIDGE_ID      Show a catalog bridge's full metadata

Bridges defined in the catalog are resolved the same way as inline tunnels.
Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when
both define the same name.
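
The precedence rule above amounts to a dictionary merge in which inline
entries are applied last. A minimal sketch, assuming both sources have
already been loaded into dicts keyed by tunnel name (resolve_tunnels is a
hypothetical helper, not the project's loader):

```python
def resolve_tunnels(inline: dict[str, dict], catalog: dict[str, dict]) -> dict[str, dict]:
    """Merge catalog bridges and inline tunnels; inline wins on a name clash."""
    # Later keys override earlier ones, so inline entries replace catalog ones.
    return {**catalog, **inline}


catalog = {"state-hub-coulombcore": {"remote_port": 19000}, "other": {"remote_port": 19001}}
inline = {"state-hub-coulombcore": {"remote_port": 18000}}
merged = resolve_tunnels(inline, catalog)
print(merged["state-hub-coulombcore"]["remote_port"])  # 18000
```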


STATE FILES
-----------

Runtime state is stored in ~/.local/state/bridge/:

    {name}.pid      Manager process ID
    {name}.state    Current bridge state (e.g. "connected")
    {name}.log      Audit log, one JSON object per line

Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir
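
For scripts that want to inspect a tunnel without going through the CLI,
the files above can be read directly. The sketch below assumes the .pid
and .state files hold plain text (a bare integer and a bare state name)
and treats a missing .state file as "stopped"; both are assumptions, and
read_tunnel_state is a hypothetical helper rather than part of state.py.

```python
import os
from pathlib import Path
from typing import Optional


def state_dir() -> Path:
    """Return the state directory, honouring BRIDGE_STATE_DIR if set."""
    override = os.environ.get("BRIDGE_STATE_DIR")
    return Path(override) if override else Path.home() / ".local" / "state" / "bridge"


def read_tunnel_state(name: str) -> dict:
    """Read {name}.pid and {name}.state; missing files yield None / "stopped"."""
    base = state_dir()
    pid_file = base / f"{name}.pid"
    state_file = base / f"{name}.state"
    pid: Optional[int] = int(pid_file.read_text().strip()) if pid_file.exists() else None
    state = state_file.read_text().strip() if state_file.exists() else "stopped"
    return {"name": name, "pid": pid, "state": state}
```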


AUDIT LOG FORMAT
----------------

Each event is one JSON object per line (pretty-printed here for
readability):

    {
      "ts": "2026-03-12T14:23:01.456789",
      "tunnel": "state-hub-coulombcore",
      "event": "bridge_connected",
      "actor": "agent.claude-coulombcore",
      "actor_class": "automation",
      "detail": ""
    }

Event types: bridge_started, bridge_connected, bridge_disconnected,
  bridge_reconnecting, health_check_failed, health_check_recovered,
  bridge_stopped
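
Because the log is JSON Lines, it can be processed with the standard
library alone. A minimal sketch that tallies events by type, using only
the "event" field shown above (count_events is a hypothetical helper):

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str]) -> Counter:
    """Count audit events by their "event" field, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        counts[json.loads(line)["event"]] += 1
    return counts


sample = [
    '{"ts": "2026-03-12T14:23:01.456789", "tunnel": "state-hub-coulombcore", '
    '"event": "bridge_connected", "actor": "agent.claude-coulombcore", '
    '"actor_class": "automation", "detail": ""}',
    '{"ts": "2026-03-12T14:25:10.000000", "tunnel": "state-hub-coulombcore", '
    '"event": "health_check_failed", "actor": "agent.claude-coulombcore", '
    '"actor_class": "automation", "detail": ""}',
]
print(count_events(sample)["bridge_connected"])  # 1
```

To run it against a real log, pass an open file object for
~/.local/state/bridge/{name}.log instead of the sample list.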


MCP INTEGRATION
---------------

OpsBridge exposes its capabilities as a FastMCP server so Claude Code agents
can call bridge_up(), bridge_status(), catalog_list_targets(), etc. as
first-class MCP tools: no Bash required, structured JSON in/out.

Available tools: bridge_up, bridge_down, bridge_restart, bridge_status,
  bridge_logs, catalog_list_targets, catalog_show_target,
  catalog_list_domains, catalog_validate, catalog_show_bridge

Available resources: bridge://status, catalog://domains, catalog://targets

Project-scope (auto, inside ops-bridge/):
  Already configured in .mcp.json. Claude Code sessions inside this repo
  see the tools automatically.

User-scope (machine-global, any repo):
  python scripts/register_mcp.py

Human operator skill:
  /bridge-status    natural-language tunnel health summary
  (skill file: ~/.claude/plugins/ops-bridge/bridge-status.md)

Run the server directly (for debugging):
  uv run python src/bridge/mcp_server/server.py


DEVELOPMENT
-----------

    uv run pytest                        Run all tests
    uv run pytest tests/test_cli.py -v   Run a specific test file
    uv run ruff check .                  Lint

Source layout:

    src/bridge/
      cli.py        Typer CLI (entry point)
      models.py     Core dataclasses and enums
      config.py     Config loading from tunnels.yaml
      manager.py    Tunnel lifecycle (subprocess, reconnect loop)
      state.py      PID and state file management
      audit.py      Audit event logging
      health.py     HTTP health checker (async, httpx)
      catalog/      OpsCatalog extension


SERVER PREREQUISITES
--------------------

For reliable auto-reconnect after reboots or network drops, the remote sshd
needs two settings in /etc/ssh/sshd_config:

    ClientAliveInterval 30
    ClientAliveCountMax 3

Without these, dead SSH sessions keep their remote port forward open (the OS
has not yet cleaned up the socket), so the next reconnect attempt fails with
"remote port forwarding failed" and ssh exits with code 255. With ClientAlive
enabled, sshd drops a session after 3 unanswered probes sent 30 seconds
apart, so stale sessions are evicted within ~90 seconds and the port is freed.
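
The ~90 second figure is simply the probe interval multiplied by the number
of unanswered probes sshd tolerates. As a quick check when tuning these
values (stale_session_timeout is a hypothetical helper for illustration):

```python
def stale_session_timeout(interval_s: int, count_max: int) -> int:
    """Approximate seconds before sshd drops an unresponsive client."""
    return interval_s * count_max


print(stale_session_timeout(30, 3))  # 90
```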

Apply and reload (no disconnect):

    sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
    sudo sed -i 's/#ClientAliveCountMax 3/ClientAliveCountMax 3/' /etc/ssh/sshd_config
    sudo kill -HUP $(cat /run/sshd.pid)

The sed patterns match only the stock commented-out defaults; if your
sshd_config differs, edit the two settings by hand.

If fail2ban is running on the remote, whitelist the bridge host IP so rapid
reconnect storms (e.g. after a key auth failure) do not trigger a ban.
Add the client IP to ignoreip in /etc/fail2ban/jail.local:

    [DEFAULT]
    ignoreip = 127.0.0.1/8 ::1 <your-bridge-host-ip>

Then reload: sudo systemctl reload fail2ban

Note: health_check.url must point to a LOCAL port (the local side of the
tunnel), not the remote forwarded port. For a reverse tunnel
(remote_port=18000, local_port=8000), the correct health check URL is
http://127.0.0.1:8000/..., NOT http://127.0.0.1:18000/...

For SSE endpoints (MCP), use a non-streaming endpoint from the same service
(e.g. the state-hub /state/health). The health checker waits for the
response to complete, and an SSE stream never completes, so pointing the
check at an SSE endpoint blocks it.
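
A configuration lint can catch the local-versus-remote port mistake before
a tunnel is started. The sketch below is not part of config.py; the
function name check_health_url and its messages are assumptions made for
illustration.

```python
from typing import Optional
from urllib.parse import urlsplit


def check_health_url(url: str, local_port: int, remote_port: int) -> Optional[str]:
    """Return a warning string if the URL's port looks wrong, else None."""
    port = urlsplit(url).port  # None when the URL has no explicit port
    if port == remote_port and remote_port != local_port:
        return f"health_check.url uses remote_port {remote_port}; use local_port {local_port}"
    if port != local_port:
        return f"health_check.url port {port} does not match local_port {local_port}"
    return None


print(check_health_url("http://127.0.0.1:8000/health", 8000, 18000))   # None
print(check_health_url("http://127.0.0.1:18000/health", 8000, 18000))
```

The second call returns the remote_port warning, which is the
misconfiguration described in the note above.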


DESIGN NOTES
------------

  - No system daemons. Tunnel processes are managed as subprocesses; PIDs
    are tracked in ~/.local/state/bridge/.
  - Graceful shutdown: SIGTERM to the manager process allows a clean exit;
    SIGKILL follows after 5 seconds if it is unresponsive.
  - Actor attribution on every log event (human vs. automation) supports
    audit traceability (FRS §5.7).
  - SSH command invoked: ssh -N -R remote_port:127.0.0.1:local_port
    -i ssh_key ssh_user@host
  - ExitOnForwardFailure=yes is set, so SSH exits immediately if the remote
    port is already in use. This is intentional: it forces a clean reconnect
    rather than silently running without the port forward active.
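
Putting the last two notes together, the command line for one tunnel could
be assembled as below. This is a sketch built from the flags named in this
README; the helper name build_ssh_argv, the argument order, and passing
ExitOnForwardFailure via -o are assumptions, and manager.py may construct
the command differently.

```python
import os


def build_ssh_argv(host: str, ssh_user: str, ssh_key: str,
                   remote_port: int, local_port: int) -> list[str]:
    """Build the ssh argument list for one reverse tunnel."""
    return [
        "ssh",
        "-N",                                   # no remote command, forward only
        "-o", "ExitOnForwardFailure=yes",       # exit if the remote port is taken
        "-R", f"{remote_port}:127.0.0.1:{local_port}",
        "-i", os.path.expanduser(ssh_key),      # expand a leading ~ in the key path
        f"{ssh_user}@{host}",
    ]


argv = build_ssh_argv("coulombcore.local", "ubuntu", "~/.ssh/id_ops", 18000, 8000)
print(argv[5])  # 18000:127.0.0.1:8000
```

Passing the list to subprocess.Popen (rather than a shell string) avoids
quoting problems with key paths.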


REPO STRUCTURE
--------------

    src/bridge/       Main source
    tests/            Test suite
    wiki/             PRD, FRS, OpsCatalog specification
    workplans/        Custodian State Hub workplan files (BRIDGE-WP-*)
    pyproject.toml    Build config and dependencies