ops-bridge
==========

SSH reverse tunnel lifecycle manager. Keeps remote execution environments
(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub
so Claude Code sessions on those machines have full MCP connectivity.


WHAT IT DOES
------------

`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:

  - Is identified by a human-readable name (e.g. state-hub-coulombcore)
  - Runs as an SSH reverse port-forward: ssh -R remote:127.0.0.1:local host
  - Auto-reconnects on drop using exponential backoff
  - Optionally runs an HTTP health check to confirm the forwarded service
    is actually reachable (not just the SSH process alive)
  - Records structured audit events (bridge_started, bridge_connected,
    health_check_failed, etc.) to a JSON log per tunnel

Bridge states: stopped -> starting -> connected <-> degraded -> reconnecting
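
The state line above can be read as a transition table. The sketch below is
illustrative only: `BridgeState`, `ALLOWED`, and `can_transition` are
hypothetical names, not the contents of models.py, and only the edges written
in the state line are encoded. How a tunnel leaves "reconnecting" or returns
to "stopped" is not specified there, so those edges are left out.

```python
from enum import Enum


class BridgeState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"


# Only the edges written in the state line; anything else is
# deliberately omitted rather than guessed.
ALLOWED = {
    BridgeState.STOPPED: {BridgeState.STARTING},
    BridgeState.STARTING: {BridgeState.CONNECTED},
    BridgeState.CONNECTED: {BridgeState.DEGRADED},
    BridgeState.DEGRADED: {BridgeState.CONNECTED, BridgeState.RECONNECTING},
    BridgeState.RECONNECTING: set(),
}


def can_transition(src: BridgeState, dst: BridgeState) -> bool:
    """Return True if the state line lists a direct src -> dst edge."""
    return dst in ALLOWED[src]
```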


INSTALL
-------

Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).

  uv tool install /path/to/ops-bridge

This registers the `bridge` command globally. For development:

  cd /path/to/ops-bridge
  uv tool install -e .

Verify:

  bridge --help


CONFIGURATION
-------------

Config file: ~/.config/bridge/tunnels.yaml
Override with: BRIDGE_CONFIG=/path/to/config.yaml

Minimal example:

  tunnels:
    state-hub-coulombcore:
      host: coulombcore.local
      remote_port: 18000
      local_port: 8000
      ssh_user: ubuntu
      ssh_key: ~/.ssh/id_ops
      actor: agent.claude-coulombcore

  actors:
    agent.claude-coulombcore:
      class: automation
      description: Claude Code agent on CoulombCore

With health check and reconnect policy:

  tunnels:
    state-hub-coulombcore:
      host: coulombcore.local
      remote_port: 18000
      local_port: 8000
      ssh_user: ubuntu
      ssh_key: ~/.ssh/id_ops
      actor: agent.claude-coulombcore

      health_check:
        url: http://127.0.0.1:8000/health   # LOCAL side of the tunnel (local_port)
        interval_seconds: 30
        timeout_seconds: 5

      reconnect:
        max_attempts: 0    # 0 = retry forever
        backoff_initial: 5
        backoff_max: 60

  actors:
    agent.claude-coulombcore:
      class: automation            # "human" or "automation"
      description: Claude Code agent on CoulombCore
    operator.bernd:
      class: human
      description: Bernd Worsch

Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
Required actor fields:  class (must be "human" or "automation")
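
To make the reconnect policy concrete, here is a minimal sketch of the delay
schedule those three settings could produce. It assumes the delay doubles on
each attempt; the README only says "exponential", so `factor` is an
assumption, and `backoff_delays` is a hypothetical helper rather than the
code in manager.py.

```python
from itertools import count, islice
from typing import Iterator


def backoff_delays(initial: float, maximum: float, max_attempts: int = 0,
                   factor: float = 2) -> Iterator[float]:
    """Yield the wait (in seconds) before each reconnect attempt.

    max_attempts == 0 means retry forever, matching the config comment.
    factor=2 is an assumption, not a documented setting.
    """
    attempts = count() if max_attempts == 0 else range(max_attempts)
    delay = initial
    for _ in attempts:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


# First six delays for backoff_initial: 5, backoff_max: 60
print(list(islice(backoff_delays(5, 60), 6)))  # [5, 10, 20, 40, 60, 60]
```

Under that assumption the delay reaches backoff_max on the fifth attempt and
stays there.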


CLI COMMANDS
------------

Lifecycle:

  bridge up [TUNNEL]           Start one tunnel, or all if no name given
  bridge down [TUNNEL]         Stop one tunnel, or all
  bridge restart [TUNNEL]      Restart one tunnel, or all

Observation:

  bridge status                Show all tunnels: state, uptime, last event
  bridge status --json         Machine-readable JSON output
  bridge logs TUNNEL           Tail the audit log for a tunnel
  bridge logs TUNNEL --lines 100 --follow

Examples:

  bridge up state-hub-coulombcore
  bridge status
  bridge logs state-hub-coulombcore --follow
  bridge down state-hub-coulombcore


OPSCATALOG EXTENSION (optional)
--------------------------------

If you maintain a Git-backed YAML catalog of your infrastructure, point
bridge at it in your config:

  catalog_path: ~/ops-infra/opscatalog/

Catalog layout:

  opscatalog/
    domains/
      <domain-id>/
        domain.yaml
        targets/
          <target-id>.yaml
        bridges/
          <bridge-id>.yaml

Then you can use:

  bridge targets [--domain DOMAIN]   List all targets (optionally filtered)
  bridge targets show TARGET_ID      Show full target metadata
  bridge catalog list                List domains with counts
  bridge catalog validate            Check catalog for consistency errors
  bridge catalog show BRIDGE_ID      Show a catalog bridge's full metadata

Bridges defined in the catalog are resolved the same way as inline tunnels.
Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when
both define the same name.
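
The precedence rule can be pictured as a dictionary merge in which inline
entries overwrite catalog entries of the same name. This is a sketch, not the
actual resolver: `resolve_tunnels` and the plain-dict representation are
assumptions made for illustration.

```python
def resolve_tunnels(inline: dict[str, dict],
                    catalog: dict[str, dict]) -> dict[str, dict]:
    """Merge both sources; an inline definition wins on a name clash."""
    # Later keys overwrite earlier ones, so inline goes last.
    return {**catalog, **inline}


catalog = {"state-hub-coulombcore": {"remote_port": 19000},
           "catalog-only": {"remote_port": 19001}}
inline = {"state-hub-coulombcore": {"remote_port": 18000}}

merged = resolve_tunnels(inline, catalog)
print(merged["state-hub-coulombcore"]["remote_port"])  # 18000
```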


STATE FILES
-----------

Runtime state is stored in ~/.local/state/bridge/:

  {name}.pid    Manager process ID
  {name}.state  Current bridge state (e.g. "connected")
  {name}.log    Audit log, one JSON object per line

Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir
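
For scripts that need to find these files, the lookup can be sketched as
follows. `state_dir` and `state_files` are hypothetical helpers (the real
logic lives in state.py and may differ); only the default directory, the
three suffixes, and the BRIDGE_STATE_DIR override come from this README.

```python
import os
from pathlib import Path

DEFAULT_STATE_DIR = Path("~/.local/state/bridge")


def state_dir() -> Path:
    """Return BRIDGE_STATE_DIR if set, otherwise the default directory."""
    override = os.environ.get("BRIDGE_STATE_DIR")
    return Path(override).expanduser() if override else DEFAULT_STATE_DIR.expanduser()


def state_files(name: str) -> dict[str, Path]:
    """Map each suffix (pid, state, log) to its path for one tunnel."""
    base = state_dir()
    return {ext: base / f"{name}.{ext}" for ext in ("pid", "state", "log")}
```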


AUDIT LOG FORMAT
----------------

Each event is written as one JSON object on a single line. The example below
is pretty-printed for readability:

  {
    "ts": "2026-03-12T14:23:01.456789",
    "tunnel": "state-hub-coulombcore",
    "event": "bridge_connected",
    "actor": "agent.claude-coulombcore",
    "actor_class": "automation",
    "detail": ""
  }

Event types: bridge_started, bridge_connected, bridge_disconnected,
bridge_reconnecting, health_check_failed, health_check_recovered,
bridge_stopped
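
Because the log is JSON Lines, it can be summarised with the standard library
alone. The sketch below counts events by type; `count_events` is a
hypothetical helper, and it assumes every non-blank line is a complete JSON
object with an "event" key, as in the example above.

```python
import json
from collections import Counter
from typing import Iterable


def count_events(lines: Iterable[str]) -> Counter:
    """Count audit events by type from JSON-lines input, skipping blanks."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if line:
            counts[json.loads(line)["event"]] += 1
    return counts


sample = [
    '{"ts": "2026-03-12T14:23:01.456789", "tunnel": "state-hub-coulombcore", '
    '"event": "bridge_connected", "actor": "agent.claude-coulombcore", '
    '"actor_class": "automation", "detail": ""}',
]
print(count_events(sample))  # Counter({'bridge_connected': 1})
```

To run it against a real log, pass an open file handle for the tunnel's
{name}.log file instead of `sample`.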


MCP INTEGRATION
---------------

OpsBridge exposes its capabilities as a FastMCP server so Claude Code agents
can call bridge_up(), bridge_status(), catalog_list_targets(), etc. as
first-class MCP tools — no Bash required, structured JSON in/out.

Available tools:  bridge_up, bridge_down, bridge_restart, bridge_status,
                  bridge_logs, catalog_list_targets, catalog_show_target,
                  catalog_list_domains, catalog_validate, catalog_show_bridge

Available resources:  bridge://status, catalog://domains, catalog://targets

Project-scope (auto, inside ops-bridge/):
  Already configured in .mcp.json. Claude Code sessions inside this repo
  see the tools automatically.

User-scope (machine-global, any repo):
  python scripts/register_mcp.py

Human operator skill:
  /bridge-status  —  natural-language tunnel health summary
  (skill file: ~/.claude/plugins/ops-bridge/bridge-status.md)

Run the server directly (for debugging):
  uv run python src/bridge/mcp_server/server.py


DEVELOPMENT
-----------

  uv run pytest                       Run all tests
  uv run pytest tests/test_cli.py -v  Run a specific test file
  uv run ruff check .                 Lint

Source layout:

  src/bridge/
    cli.py        Typer CLI (entry point)
    models.py     Core dataclasses and enums
    config.py     Config loading from tunnels.yaml
    manager.py    Tunnel lifecycle (subprocess, reconnect loop)
    state.py      PID and state file management
    audit.py      Audit event logging
    health.py     HTTP health checker (async, httpx)
    catalog/      OpsCatalog extension


SERVER PREREQUISITES
--------------------

For reliable auto-reconnect after reboots or network drops, the remote sshd
needs two settings in /etc/ssh/sshd_config:

  ClientAliveInterval 30
  ClientAliveCountMax 3

Without these, dead SSH sessions hold their remote port forward open (the OS
has not yet cleaned up the socket), so the next reconnect attempt hits
"remote port forwarding failed" and exits with code 255. With ClientAlive
enabled, sshd evicts stale sessions within ~90 seconds (3 unanswered probes
at 30-second intervals) and frees the port.

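Apply and reload (existing sessions stay connected). The sed patterns only
match the stock commented-out defaults, and the PID file is assumed to be at
/run/sshd.pid; adjust both if your system differs: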

  sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
  sudo sed -i 's/#ClientAliveCountMax 3/ClientAliveCountMax 3/' /etc/ssh/sshd_config
  sudo kill -HUP $(cat /run/sshd.pid)

If fail2ban is running on the remote, whitelist the bridge host IP so rapid
reconnect storms (e.g. after a key auth failure) do not trigger a ban.
Add the client IP to ignoreip in /etc/fail2ban/jail.local:

  [DEFAULT]
  ignoreip = 127.0.0.1/8 ::1 <your-bridge-host-ip>

Then reload: sudo systemctl reload fail2ban

Note: health_check.url must point to the LOCAL port (the local side of the
tunnel), not the remote forwarded port. For a reverse tunnel with
remote_port=18000 and local_port=8000, the correct health check URL is
http://127.0.0.1:8000/..., NOT http://127.0.0.1:18000/...
For SSE endpoints (MCP), use a non-streaming endpoint from the same service
(e.g. the state-hub /state/health), because the health checker waits for the
response to complete.


DESIGN NOTES
------------

- No system daemons. Tunnel processes are managed as subprocesses; PIDs
  are tracked in ~/.local/state/bridge/.
- Graceful shutdown: SIGTERM to the manager process allows a clean exit;
  SIGKILL follows after 5 seconds if it is unresponsive.
- Actor attribution on every log event (human vs. automation) supports
  audit traceability (FRS §5.7).
- SSH command invoked: ssh -N -R remote_port:127.0.0.1:local_port
                           -i ssh_key ssh_user@host
- ExitOnForwardFailure=yes is set, so SSH exits immediately if the remote
  port is already in use. This is intentional — it forces a clean reconnect
  rather than silently running without the port forward active.
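
Putting the last two notes together, the argument list for one tunnel might
be assembled as sketched here. `build_ssh_argv` is a hypothetical function
and the flag order is an assumption; only the flags themselves (-N, -R, -i,
and the ExitOnForwardFailure option) come from the notes above.

```python
import os


def build_ssh_argv(host: str, ssh_user: str, ssh_key: str,
                   remote_port: int, local_port: int) -> list[str]:
    """Build the ssh argument list for one reverse tunnel."""
    return [
        "ssh", "-N",
        # Exit instead of running without the forward if the port is taken.
        "-o", "ExitOnForwardFailure=yes",
        "-R", f"{remote_port}:127.0.0.1:{local_port}",
        "-i", os.path.expanduser(ssh_key),
        f"{ssh_user}@{host}",
    ]


argv = build_ssh_argv("coulombcore.local", "ubuntu", "/keys/id_ops", 18000, 8000)
print(" ".join(argv))
```

Passing a list like this to subprocess.Popen avoids shell quoting issues with
key paths and host names.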


REPO STRUCTURE
--------------

  src/bridge/       Main source
  tests/            Test suite
  wiki/             PRD, FRS, OpsCatalog specification
  workplans/        Custodian State Hub workplan files (BRIDGE-WP-*)
  pyproject.toml    Build config and dependencies