ops-bridge
==========

SSH reverse tunnel lifecycle manager. Keeps remote execution environments
(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub
so Claude Code sessions on those machines have full MCP connectivity.


WHAT IT DOES
------------

`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:

- Is identified by a human-readable name (e.g. state-hub-coulombcore)
- Runs as an SSH reverse port-forward: ssh -R remote:127.0.0.1:local host
- Auto-reconnects on drop using exponential backoff
- Optionally runs an HTTP health check to confirm the forwarded service
  is actually reachable (not just the SSH process alive)
- Records structured audit events (bridge_started, bridge_connected,
  health_check_failed, etc.) to a JSON log per tunnel

Bridge states: stopped -> starting -> connected <-> degraded -> reconnecting


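To make the "exponential backoff" bullet concrete, here is a minimal sketch of a delay schedule driven by the reconnect keys documented under CONFIGURATION (backoff_initial, backoff_max, max_attempts). The function name `backoff_delays` and the doubling factor are assumptions for illustration; the README only states that the backoff is exponential.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial: float = 5, maximum: float = 60,
                   max_attempts: int = 0) -> Iterator[float]:
    """Yield the wait (in seconds) before each reconnect attempt.

    The delay doubles after every attempt and is capped at `maximum`.
    max_attempts == 0 means "retry forever", as in the config comment.
    Doubling is an assumed growth factor, not confirmed by the source.
    """
    delay = initial
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        yield min(delay, maximum)
        delay = min(delay * 2, maximum)
        attempt += 1


# With the example policy (initial 5 s, cap 60 s) the first six waits are:
first_six = list(islice(backoff_delays(5, 60), 6))
print(first_six)  # [5, 10, 20, 40, 60, 60]
```

Capping the delay keeps a long outage from pushing the retry interval beyond backoff_max, so a recovered host is picked up within about a minute.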
INSTALL
-------

Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).

  uv tool install /path/to/ops-bridge

This registers the `bridge` command globally. For development:

  cd /path/to/ops-bridge
  uv tool install -e .

Verify:

  bridge --help


CONFIGURATION
-------------

Config file: ~/.config/bridge/tunnels.yaml
Override with: BRIDGE_CONFIG=/path/to/config.yaml

Minimal example:

  tunnels:
    state-hub-coulombcore:
      host: coulombcore.local
      remote_port: 18000
      local_port: 8000
      ssh_user: ubuntu
      ssh_key: ~/.ssh/id_ops
      actor: agent.claude-coulombcore

  actors:
    agent.claude-coulombcore:
      class: automation
      description: Claude Code agent on CoulombCore

With health check and reconnect policy:

  tunnels:
    state-hub-coulombcore:
      host: coulombcore.local
      remote_port: 18000
      local_port: 8000
      ssh_user: ubuntu
      ssh_key: ~/.ssh/id_ops
      actor: agent.claude-coulombcore

      health_check:
        url: http://127.0.0.1:8000/health   # LOCAL side of the tunnel (local_port)
        interval_seconds: 30
        timeout_seconds: 5

      reconnect:
        max_attempts: 0      # 0 = retry forever
        backoff_initial: 5
        backoff_max: 60

  actors:
    agent.claude-coulombcore:
      class: automation      # "human" or "automation"
      description: Claude Code agent on CoulombCore
    operator.bernd:
      class: human
      description: Bernd Worsch

Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
Required actor fields:  class (must be "human" or "automation")


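The required-field rules above can be checked mechanically. The following is a hedged sketch, not the project's actual config loader: `validate_config` is a hypothetical helper that takes the already-parsed YAML as a plain dict. The check that a tunnel's actor is declared under `actors` is an assumption that goes beyond what the README states.

```python
REQUIRED_TUNNEL_FIELDS = ("host", "remote_port", "local_port",
                          "ssh_user", "ssh_key", "actor")
ACTOR_CLASSES = {"human", "automation"}


def validate_config(config: dict) -> list[str]:
    """Return a list of problems; an empty list means the config passes."""
    errors = []
    actors = config.get("actors") or {}
    for name, actor in actors.items():
        if (actor or {}).get("class") not in ACTOR_CLASSES:
            errors.append(f'actor {name}: class must be "human" or "automation"')
    for name, tunnel in (config.get("tunnels") or {}).items():
        tunnel = tunnel or {}
        for field in REQUIRED_TUNNEL_FIELDS:
            if field not in tunnel:
                errors.append(f"tunnel {name}: missing required field {field}")
        # Assumption: every tunnel's actor must also appear under `actors`.
        actor_id = tunnel.get("actor")
        if actor_id is not None and actor_id not in actors:
            errors.append(f"tunnel {name}: unknown actor {actor_id}")
    return errors


good = {
    "tunnels": {"state-hub-coulombcore": {
        "host": "coulombcore.local", "remote_port": 18000, "local_port": 8000,
        "ssh_user": "ubuntu", "ssh_key": "~/.ssh/id_ops",
        "actor": "agent.claude-coulombcore"}},
    "actors": {"agent.claude-coulombcore": {"class": "automation"}},
}
# Same config, but with ssh_key dropped and an invalid actor class.
broken = {
    "tunnels": {"state-hub-coulombcore": {
        k: v for k, v in good["tunnels"]["state-hub-coulombcore"].items()
        if k != "ssh_key"}},
    "actors": {"agent.claude-coulombcore": {"class": "robot"}},
}
print(validate_config(good))
print(validate_config(broken))
```

Collecting all problems in one pass, rather than raising on the first, lets an operator fix a config file in a single edit.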
CLI COMMANDS
------------

Lifecycle:

  bridge up [TUNNEL]          Start one tunnel, or all if no name given
  bridge down [TUNNEL]        Stop one tunnel, or all
  bridge restart [TUNNEL]     Restart one tunnel, or all

Observation:

  bridge status               Show all tunnels: state, uptime, last event
  bridge status --json        Machine-readable JSON output
  bridge logs TUNNEL          Tail the audit log for a tunnel
  bridge logs TUNNEL --lines 100 --follow

Examples:

  bridge up state-hub-coulombcore
  bridge status
  bridge logs state-hub-coulombcore --follow
  bridge down state-hub-coulombcore


OPSCATALOG EXTENSION (optional)
-------------------------------

If you maintain a Git-backed YAML catalog of your infrastructure, point
bridge at it in your config:

  catalog_path: ~/ops-infra/opscatalog/

Catalog layout:

  opscatalog/
    domains/
      <domain-id>/
        domain.yaml
        targets/
          <target-id>.yaml
        bridges/
          <bridge-id>.yaml

Then you can use:

  bridge targets [--domain DOMAIN]    List all targets (optionally filtered)
  bridge targets show TARGET_ID       Show full target metadata
  bridge catalog list                 List domains with counts
  bridge catalog validate             Check catalog for consistency errors
  bridge catalog show BRIDGE_ID       Show a catalog bridge's full metadata

Bridges defined in the catalog are resolved the same way as inline tunnels.
Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when
both define the same name.


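The precedence rule (inline tunnels win over catalog bridges of the same name) amounts to a dictionary merge. A minimal sketch, assuming both sources have already been loaded into dicts keyed by tunnel name; `resolve_tunnels` is a hypothetical name, not necessarily what the code base uses:

```python
def resolve_tunnels(inline: dict, catalog: dict) -> dict:
    """Merge tunnel definitions; inline entries override catalog entries."""
    resolved = dict(catalog)   # start from the catalog bridges
    resolved.update(inline)    # same-named inline tunnels replace them
    return resolved


catalog = {
    "state-hub-coulombcore": {"host": "catalog.example", "remote_port": 18000},
    "metrics-railiance01": {"host": "railiance01", "remote_port": 19000},
}
inline = {
    "state-hub-coulombcore": {"host": "coulombcore.local", "remote_port": 18000},
}
merged = resolve_tunnels(inline, catalog)
print(merged["state-hub-coulombcore"]["host"])  # coulombcore.local
```

The hostnames `catalog.example` and `railiance01` and the tunnel `metrics-railiance01` are made-up sample data.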
STATE FILES
-----------

Runtime state is stored in ~/.local/state/bridge/:

  {name}.pid      Manager process ID
  {name}.state    Current bridge state (e.g. "connected")
  {name}.log      Audit log, one JSON object per line

Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir


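As an illustration of the layout above, this sketch resolves the state directory (honouring BRIDGE_STATE_DIR) and reads a tunnel's PID and state files. The helper names and the choice to report "stopped" when no state file exists are assumptions, not documented behaviour.

```python
import os
from pathlib import Path


def state_dir() -> Path:
    """Return BRIDGE_STATE_DIR if set, else ~/.local/state/bridge."""
    override = os.environ.get("BRIDGE_STATE_DIR")
    if override:
        return Path(override)
    return Path.home() / ".local" / "state" / "bridge"


def read_tunnel_state(name: str) -> dict:
    """Read {name}.pid and {name}.state from the state directory."""
    base = state_dir()
    pid_file = base / f"{name}.pid"
    state_file = base / f"{name}.state"
    pid = int(pid_file.read_text().strip()) if pid_file.exists() else None
    # Assumption: a missing state file is reported as "stopped".
    state = state_file.read_text().strip() if state_file.exists() else "stopped"
    return {"name": name, "pid": pid, "state": state}
```

Usage: `read_tunnel_state("state-hub-coulombcore")` returns a dict with the keys `name`, `pid`, and `state`.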
AUDIT LOG FORMAT
----------------

Each event is one JSON object per line (shown expanded here for readability):

  {
    "ts": "2026-03-12T14:23:01.456789",
    "tunnel": "state-hub-coulombcore",
    "event": "bridge_connected",
    "actor": "agent.claude-coulombcore",
    "actor_class": "automation",
    "detail": ""
  }

Event types: bridge_started, bridge_connected, bridge_disconnected,
  bridge_reconnecting, health_check_failed, health_check_recovered,
  bridge_stopped


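Because the log is JSON Lines, it can be consumed with the standard library alone. The sketch below parses log lines and tallies events by type; the function names are illustrative and the sample records are invented, using only the fields and event types listed above.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def parse_audit_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one event dict per non-empty line of a {name}.log file."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_events(lines: Iterable[str]) -> Counter:
    """Count audit events by their "event" field."""
    return Counter(event["event"] for event in parse_audit_lines(lines))


sample = [
    '{"ts": "2026-03-12T14:23:01.456789", "tunnel": "state-hub-coulombcore", '
    '"event": "bridge_connected", "actor": "agent.claude-coulombcore", '
    '"actor_class": "automation", "detail": ""}',
    '{"ts": "2026-03-12T14:25:31.000000", "tunnel": "state-hub-coulombcore", '
    '"event": "health_check_failed", "actor": "agent.claude-coulombcore", '
    '"actor_class": "automation", "detail": "timeout"}',
    "",
]
counts = count_events(sample)
print(counts["health_check_failed"])  # 1
```

To run it against a real log, pass an open file object, for example `count_events(open(path))`, since file objects iterate line by line.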
MCP INTEGRATION
---------------

OpsBridge exposes its capabilities as a FastMCP server so Claude Code agents
can call bridge_up(), bridge_status(), catalog_list_targets(), etc. as
first-class MCP tools: no Bash required, structured JSON in/out.

Available tools: bridge_up, bridge_down, bridge_restart, bridge_status,
  bridge_logs, catalog_list_targets, catalog_show_target,
  catalog_list_domains, catalog_validate, catalog_show_bridge

Available resources: bridge://status, catalog://domains, catalog://targets

Project-scope (auto, inside ops-bridge/):
  Already configured in .mcp.json. Claude Code sessions inside this repo
  see the tools automatically.

User-scope (machine-global, any repo):
  python scripts/register_mcp.py

Human operator skill:
  /bridge-status    natural-language tunnel health summary
  (skill file: ~/.claude/plugins/ops-bridge/bridge-status.md)

Run the server directly (for debugging):
  uv run python src/bridge/mcp_server/server.py


DEVELOPMENT
-----------

  uv run pytest                         Run all tests
  uv run pytest tests/test_cli.py -v    Run a specific test file
  uv run ruff check .                   Lint

Source layout:

  src/bridge/
    cli.py        Typer CLI (entry point)
    models.py     Core dataclasses and enums
    config.py     Config loading from tunnels.yaml
    manager.py    Tunnel lifecycle (subprocess, reconnect loop)
    state.py      PID and state file management
    audit.py      Audit event logging
    health.py     HTTP health checker (async, httpx)
    catalog/      OpsCatalog extension


SERVER PREREQUISITES
--------------------

For reliable auto-reconnect after reboots or network drops, the remote sshd
needs two settings in /etc/ssh/sshd_config:

  ClientAliveInterval 30
  ClientAliveCountMax 3

Without these, dead SSH sessions hold their remote port forward open (the OS
has not yet cleaned up the socket), so the next reconnect attempt hits
"remote port forwarding failed" and exits with code 255. With ClientAlive
enabled, sshd evicts stale sessions within ~90 seconds (30 s x 3 missed
probes) and frees the port.

Apply and reload (no disconnect). The sed commands assume the stock
commented-out defaults are still present in sshd_config:

  sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
  sudo sed -i 's/#ClientAliveCountMax 3/ClientAliveCountMax 3/' /etc/ssh/sshd_config
  sudo kill -HUP $(cat /run/sshd.pid)

If fail2ban is running on the remote, whitelist the bridge host IP so rapid
reconnect storms (e.g. after a key auth failure) do not trigger a ban.
Add the client IP to ignoreip in /etc/fail2ban/jail.local:

  [DEFAULT]
  ignoreip = 127.0.0.1/8 ::1 <your-bridge-host-ip>

Then reload: sudo systemctl reload fail2ban

Note: health_check.url must point to a LOCAL port (the local side of the
tunnel), not the remote forwarded port. For a reverse tunnel
(remote_port=18000, local_port=8000), the correct health check URL is
http://127.0.0.1:8000/..., NOT http://127.0.0.1:18000/...
For SSE endpoints (MCP), use a non-streaming endpoint from the same service
(e.g. the state-hub /state/health) since the health checker waits for the
response to complete.


NIGHTLY STALE-FORWARD CLEANUP
-----------------------------

When a bridge client dies without tearing down its SSH session, the remote
host can keep the forwarded port (e.g. 18000) bound to a zombie sshd
listener. The port accepts connections but never forwards them, which breaks
in-cluster proxies such as actcore-state-hub-bridge on railiance01.

Install a 03:00 local-time cron job that probes each reverse tunnel's remote
forward, kills stale listeners when the local service is healthy but the
remote forward is not, and restarts the tunnel:

  bridge maintenance install-cron

Manual run:

  bridge maintenance cleanup --restart

Inspect or remove the cron entry:

  bridge maintenance show-cron
  bridge maintenance uninstall-cron

Logs append to ~/.local/state/bridge/cleanup.log


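The decision rule behind the nightly stale-forward cleanup (act only when the local service is healthy but the remote forward is not) can be written as a small pure function. This is a sketch of that rule as described in this README, not the actual implementation; the `Action` names and `plan_cleanup` are hypothetical, and the two health probes are assumed to have been run already.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"                          # forward works, nothing to do
    SKIP_LOCAL_DOWN = "skip_local_down"    # local service is the problem
    KILL_ONLY = "kill_only"                # kill stale remote listener
    KILL_AND_RESTART = "kill_and_restart"  # kill listener, restart tunnel


def plan_cleanup(local_healthy: bool, remote_forward_healthy: bool,
                 restart: bool) -> Action:
    """Decide what the cleanup should do for one tunnel."""
    if remote_forward_healthy:
        return Action.NONE
    if not local_healthy:
        # The remote probe fails because nothing is serving locally;
        # killing listeners on the remote host would not fix that.
        return Action.SKIP_LOCAL_DOWN
    # Local service is up but the remote port does not forward:
    # this is the zombie-listener case.
    return Action.KILL_AND_RESTART if restart else Action.KILL_ONLY


print(plan_cleanup(local_healthy=True, remote_forward_healthy=False,
                   restart=True))  # Action.KILL_AND_RESTART
```

The `restart` argument mirrors the `--restart` flag of `bridge maintenance cleanup`. Checking the local service first avoids killing a listener that is merely forwarding to a service that happens to be down.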
DESIGN NOTES
------------

- No system daemons. Tunnel processes are managed as subprocesses; PIDs
  are tracked in ~/.local/state/bridge/.
- Graceful shutdown: SIGTERM to the manager process allows a clean exit;
  SIGKILL follows after 5 seconds if unresponsive.
- Actor attribution on every log event (human vs. automation) supports
  audit traceability (FRS §5.7).
- SSH command invoked: ssh -N -R remote_port:127.0.0.1:local_port
  -i ssh_key ssh_user@host
- ExitOnForwardFailure=yes is set, so SSH exits immediately if the remote
  port is already in use. This is intentional: it forces a clean reconnect
  rather than silently running without the port forward active.


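The SSH invocation described in the notes above can be assembled as an argument list suitable for subprocess. This is a sketch under the assumption that ExitOnForwardFailure is passed with `-o`; the README states that the option is set but not how. The function name `build_ssh_command` is hypothetical.

```python
import os


def build_ssh_command(host: str, ssh_user: str, ssh_key: str,
                      remote_port: int, local_port: int) -> list[str]:
    """Build the argv for one reverse tunnel (no shell quoting needed)."""
    return [
        "ssh",
        "-N",                                   # no remote command, forward only
        "-o", "ExitOnForwardFailure=yes",       # fail fast if the port is taken
        "-R", f"{remote_port}:127.0.0.1:{local_port}",
        "-i", os.path.expanduser(ssh_key),      # expand a leading ~
        f"{ssh_user}@{host}",
    ]


cmd = build_ssh_command("coulombcore.local", "ubuntu", "/keys/id_ops",
                        remote_port=18000, local_port=8000)
print(" ".join(cmd))
```

Passing a list rather than a shell string avoids quoting problems with key paths that contain spaces.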
REPO STRUCTURE
--------------

  src/bridge/       Main source
  tests/            Test suite
  wiki/             PRD, FRS, OpsCatalog specification
  workplans/        Custodian State Hub workplan files (BRIDGE-WP-*)
  pyproject.toml    Build config and dependencies