ops-bridge/SCOPE.md
tegwick a55c685f89 feat(diagnostics): end-to-end tunnel check, stale state detection, MCP extensions
- diagnostics.py: TunnelCheckResult with SSH process liveness, port
  probe, and optional API health check; check_tunnel / check_all_tunnels
- cli.py: bridge status shows LIVE column and [STALE] marker when state
  says connected but PID is dead; bridge check wired to diagnostics
- state.py: read_raw_pid helper; _pid_alive exported for reuse
- capabilities.py: capabilities registry stubs
- mcp_server/server.py: expose check_tunnel and tunnel capabilities
  over MCP
- SCOPE.md: rapid orientation document
- workplans/OPS-WP-0001-diagnostics.md: workplan backing this feature
- tests: 207 passing (test_cli, test_mcp, test_diagnostics)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 15:07:47 +01:00


SCOPE

This file helps you quickly understand what this repository is about, when it is relevant, and when it is not. It is intentionally lightweight and may be incomplete.


One-liner

SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards.


Core Idea

Claude Code sessions and the Custodian State Hub API both run locally, but remote machines (Railiance nodes, Temporal workers, Markitect services) need to reach the hub. Ops-bridge manages named SSH reverse tunnels with auto-reconnect, health checks, and audit logging, and ships an MCP server so Claude Code can start, stop, and inspect tunnels as tools.
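
As a rough illustration of what one managed tunnel amounts to, the sketch below builds an OpenSSH command line for a single reverse port-forward. The `ReverseTunnel` class, its field names, and `build_ssh_argv` are hypothetical and are not taken from the ops-bridge code; only the `ssh` flags themselves (`-N`, `-R`, `ExitOnForwardFailure`, `ServerAliveInterval`) are standard OpenSSH options.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReverseTunnel:
    """Hypothetical tunnel description; field names are illustrative only."""
    name: str
    host: str          # remote machine, e.g. "user@worker-1"
    remote_port: int   # port opened on the remote machine
    local_port: int    # local port the hub API listens on


def build_ssh_argv(tunnel: ReverseTunnel, keepalive_s: int = 30) -> List[str]:
    """Return an OpenSSH argv for one reverse port-forward (a sketch)."""
    return [
        "ssh",
        "-N",                                        # forward only, no remote command
        "-o", "ExitOnForwardFailure=yes",            # exit if the remote port cannot be bound
        "-o", f"ServerAliveInterval={keepalive_s}",  # notice dead links
        "-R", f"{tunnel.remote_port}:localhost:{tunnel.local_port}",
        tunnel.host,
    ]
```

A supervisor process could pass this list to `subprocess.Popen` and restart it when it exits; that is where the reconnect policy and state files come in.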


In Scope

  • Named SSH reverse tunnel lifecycle (bridge up/down/restart/status/logs)
  • Auto-reconnect with exponential backoff and configurable retry policy
  • Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
  • Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
  • Actor attribution: per-tunnel actor class (human / automation) for audit traceability
  • PID + state file management in ~/.local/state/bridge/
  • MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
  • OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
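
The auto-reconnect item above can be pictured with a small delay generator. This is a minimal sketch of capped exponential backoff, not the ops-bridge retry policy: the parameter names (`base_s`, `factor`, `max_s`, `max_retries`, `jitter`) and their defaults are assumptions chosen for illustration.

```python
import random
from typing import Callable, Iterator


def backoff_delays(
    base_s: float = 1.0,
    factor: float = 2.0,
    max_s: float = 60.0,
    max_retries: int = 8,
    jitter: float = 0.0,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield the wait (in seconds) before each reconnect attempt."""
    delay = base_s
    for _ in range(max_retries):
        capped = min(delay, max_s)  # never wait longer than max_s
        # Optional jitter spreads reconnects out when many tunnels drop at once.
        yield capped * (1.0 + jitter * rng())
        delay *= factor
```

With the defaults and no jitter, the first five waits are 1, 2, 4, 8, and 16 seconds; the cap only takes effect once the raw delay exceeds `max_s`.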

Out of Scope

  • Identity/credential management (uses existing SSH keys)
  • Long-running application hosting on remote machines (port-forward only, not deployment)
  • VPN or layer-3 connectivity
  • Monitoring/alerting beyond JSON audit logs
  • Replacing SSH for general interactive access

Relevant When

  • Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
  • Need audit trail of which actor (human vs. automation) started/stopped tunnels
  • Setting up a new machine in the Railiance ecosystem that must phone home to the hub
  • Diagnosing connectivity issues between local hub and remote services
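
For the diagnosis case, the commit notes mention an SSH process liveness check, a port probe, and a [STALE] marker for tunnels whose state says connected while the PID is dead. The sketch below shows one way such checks might look on a POSIX system. The helper names (`pid_alive`, `port_open`, `is_stale`) are hypothetical stand-ins and do not reproduce the repository's `_pid_alive` or `check_tunnel`.

```python
import os
import socket
from typing import Optional


def pid_alive(pid: int) -> bool:
    """True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 checks existence without delivering a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def port_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def is_stale(recorded_state: str, pid: Optional[int]) -> bool:
    """State file says connected, but no live process backs it."""
    return recorded_state == "connected" and (pid is None or not pid_alive(pid))
```

Checking the PID before the port keeps the common failure (a dead `ssh` process with a leftover state file) cheap to detect, since it needs no network round trip.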

Not Relevant When

  • All work is local (no remote services involved)
  • Manually running ssh -R is acceptable
  • No need for audit tracing of tunnel state changes

Current State

  • Status: experimental → active (v0.1 core complete; OpsCatalog scaffolded but not yet shipped)
  • Implementation: ~75%. CLI tunneling is fully functional, MCP integration works, and health checks and audit logging are complete; the OpsCatalog framework is present but not yet populated
  • Stability: stable tunnel lifecycle; tested under network drops and SSH failures
  • Usage: running in the lab for daily Railiance/Temporal connectivity

How It Fits

  • Upstream dependencies: SSH (system), OpenSSH server on remote hosts
  • Downstream consumers: all remote Claude Code agents depend on ops-bridge to reach the local hub MCP; the activity-core Temporal server is reachable via a bridge tunnel
  • Often used with: the-custodian (health checks point to hub API), activity-core (Temporal port-forwarding)

Terminology

  • Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check
  • Also known as: "the bridge"
  • Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
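
The bridge state machine named above can be written down as a transition table. This is a sketch under stated assumptions: the five state names and the arrows in `DOCUMENTED` come from the diagram in the text, while everything in `ASSUMED` (recovery from reconnecting, and stopping from any state) is a guess added for illustration and is not confirmed by the source.

```python
from enum import Enum


class BridgeState(Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"


S = BridgeState

# Arrows taken directly from the diagram in the text.
DOCUMENTED = frozenset({
    (S.STOPPED, S.STARTING),
    (S.STARTING, S.CONNECTED),
    (S.CONNECTED, S.DEGRADED),
    (S.DEGRADED, S.CONNECTED),
    (S.DEGRADED, S.RECONNECTING),
})

# Assumptions: a reconnect can succeed, and any running state can be stopped.
ASSUMED = frozenset(
    {(S.RECONNECTING, S.CONNECTED)}
    | {(s, S.STOPPED) for s in S if s is not S.STOPPED}
)


def can_transition(src: BridgeState, dst: BridgeState,
                   include_assumed: bool = True) -> bool:
    """True if src -> dst is allowed under the chosen transition set."""
    if (src, dst) in DOCUMENTED:
        return True
    return include_assumed and (src, dst) in ASSUMED
```

Keeping the documented and assumed arrows in separate sets makes it easy to drop the guesses once the real transition rules are checked against the code.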

Related Repositories

  • the-custodian: primary consumer; ops-bridge keeps remote agents connected to it
  • activity-core: Temporal server on a remote host, reached via an ops-bridge tunnel
  • railiance-cluster / railiance-infra: remote hosts that need to phone home

Provided Capabilities

type: infrastructure
title: SSH reverse tunnel connectivity
description: Named, auto-reconnecting SSH reverse tunnels with health checks and audit logging — keeps remote execution environments continuously connected to the local Custodian State Hub.
keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge]

Getting Oriented

  • Start with: README.txt (architecture, config format, CLI commands, MCP integration)
  • Key files / directories: ~/.config/bridge/tunnels.yaml (tunnel config), ~/.local/state/bridge/ (PID/state files)
  • Entry points: bridge --help; bridge up <tunnel-name>; MCP: bridge_status()