ops-bridge/wiki/OpsBridge.md
tegwick 10c6fdaec9 feat(restart): route reverse tunnels through stale-forward cleanup
bridge restart now means blank-slate recovery: reverse tunnels run
should_cleanup_tunnel and clear orphan remote listeners before reconnecting;
healthy forwards are left running. Local-direction tunnels keep stop/start
only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted,
restarted, error) and exit non-zero on cleanup failure.

Closes BRIDGE-WP-0005.
2026-06-21 20:12:13 +02:00

OpsBridge

Operations access for humans and agents

Operations Access Bridges for Humans and Automation Agents

Modern IT infrastructure is automated, declarative, and continuously deployed. But when something breaks, real systems rarely behave exactly as expected.

Operators need to inspect, diagnose, and repair the running system — not the theoretical one described in infrastructure code.

OpsBridge provides a lightweight way to create controlled operational access paths between systems so humans and automation agents can investigate and resolve issues in live environments.

It is designed for the moment when intent meets reality.


Why OpsBridge Exists

Infrastructure teams increasingly rely on:

  • Infrastructure as Code
  • GitOps pipelines
  • Kubernetes and cloud orchestration
  • automated remediation
  • AI-assisted diagnostics

These systems define how infrastructure should behave.

But operators deal with how it actually behaves.

The gap between these two worlds creates practical problems:

  • debugging access requires ad-hoc SSH commands
  • operators rely on shell history or tribal knowledge
  • automation agents struggle to navigate infrastructure
  • incident response becomes slow and inconsistent

OpsBridge provides a simple operational layer that makes access paths explicit, observable, and reusable.


What OpsBridge Does

OpsBridge manages Access Bridges for Operations Tasks.

An access bridge is a temporary and controlled connectivity path between systems used for operations work.

Example:

Remote diagnostic host
        │
        │ HTTP request
        ▼
reverse SSH bridge
        ▼
local control service

OpsBridge lets operators and agents:

  • create bridges
  • inspect active bridges
  • reconnect bridges automatically
  • associate bridges with actors
  • track operational access events

All without introducing a VPN, overlay network, or heavy access platform.


Built for Human Operators and AI Agents

OpsBridge treats humans and automation as first-class actors.

Modern operations increasingly involve:

  • diagnostic agents
  • automated remediation
  • AI-assisted debugging
  • ephemeral execution environments

OpsBridge makes it possible to safely give these systems the temporary access they need to understand and repair infrastructure.

Every bridge is associated with an actor, making operational activity observable and attributable.
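
As a rough illustration of what actor attribution could look like in data terms, the sketch below attaches an actor to each bridge and appends an event for every lifecycle action. The type names (`BridgeRecord`, `AccessEvent`), the field names, and the `agent:diagnostics` actor label are hypothetical; this page does not describe OpsBridge's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AccessEvent:
    """One attributable operational access event (hypothetical shape)."""
    bridge: str
    actor: str
    action: str       # e.g. "up", "down", "restart"
    at: datetime      # timezone-aware UTC timestamp


@dataclass
class BridgeRecord:
    """A bridge together with the actor it is associated with."""
    name: str
    actor: str        # human operator or automation agent
    events: list[AccessEvent] = field(default_factory=list)

    def record(self, action: str) -> AccessEvent:
        # Every action is stamped with the bridge's actor, so the
        # event log can answer "who did what, and when".
        event = AccessEvent(self.name, self.actor, action,
                            datetime.now(timezone.utc))
        self.events.append(event)
        return event


record = BridgeRecord("state-hub-railiance01", "agent:diagnostics")
record.record("up")
record.record("down")
```

Keeping the actor on the bridge rather than on each call is one possible design choice: it makes attribution automatic for every event recorded against that bridge.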


Introducing OpsCatalog

OpsBridge works even better when paired with OpsCatalog, a Git-based repository that captures the operational view of infrastructure.

Where DevOps tools describe how infrastructure should exist, OpsCatalog captures how operators actually interact with it.

OpsCatalog defines:

  • operational domains
  • infrastructure targets
  • operational bridges
  • debugging entry points
  • operational notes and procedures

Together, OpsBridge and OpsCatalog provide a shared operational map that helps teams navigate real infrastructure.


A New Layer in the Infrastructure Stack

OpsBridge fits between infrastructure automation and real-world operations.

Infrastructure as Code
        │
        │ expected state
        ▼
OpsCatalog
        │
        │ operations knowledge
        ▼
OpsBridge
        │
        │ access bridges
        ▼
Live Infrastructure

This layer allows operators and automation systems to work with the infrastructure that actually exists, not just the one defined in configuration.


Designed for Practical Operations

OpsBridge focuses on simplicity.

It is:

  • lightweight
  • CLI-driven
  • infrastructure-agnostic
  • automation-friendly
  • identity-integrated

It integrates with existing systems such as identity providers without replacing them.

No new network layer. No complex access gateway.

Just controlled operational access when you need it.


Example Workflow

Start a bridge:

bridge up state-hub-railiance01

Check active bridges:

bridge status

Investigate infrastructure targets:

bridge targets

Stop the bridge when finished:

bridge down state-hub-railiance01

OpsBridge handles the lifecycle so operators can focus on solving the problem.


Tunnel lifecycle commands

| Command | Purpose |
|---|---|
| bridge up | Start tunnel(s) that are not already running |
| bridge down | Stop tunnel(s) that are running |
| bridge restart | Blank-slate recovery: get tunnel(s) operational again |
| bridge maintenance cleanup | Proactive hygiene sweep without implying restart |

bridge restart — blank-slate recovery

bridge restart means getting the tunnel operational again, not merely cycling the local manager PID while a broken remote listener still holds the port.

For reverse tunnels (State Hub exposure on remote hosts), restart:

  1. Runs should_cleanup_tunnel to detect stale SSH remote forwards
  2. Clears orphan listeners on the remote host when needed
  3. Reconnects the tunnel (stop + start) only when cleanup was required

When the remote forward is already healthy, restart reports healthy and leaves the working tunnel running — no unnecessary disruption.

For local-direction tunnels (direction: local in tunnels.yaml, e.g. k3s-api-coulombcore), restart uses local stop/start only; no remote cleanup.
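
The restart contract described above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the project's implementation: `Tunnel`, `plan_restart`, `exit_code`, and the injected `cleanup_remote` and `reconnect` callables are hypothetical names. The real signature of should_cleanup_tunnel is not documented on this page, so it is modelled here as a plain predicate (`needs_cleanup`). The mapping of the reported actions (healthy, cleaned_and_restarted, restarted, error) onto each branch is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tunnel:
    name: str
    direction: str  # "reverse" or "local", mirroring `direction:` in tunnels.yaml


def plan_restart(
    tunnel: Tunnel,
    needs_cleanup: Callable[[Tunnel], bool],   # stand-in for should_cleanup_tunnel
    cleanup_remote: Callable[[Tunnel], None],  # hypothetical: clears orphan remote listeners
    reconnect: Callable[[Tunnel], None],       # hypothetical: local stop + start
) -> str:
    """Carry out one restart request and return the per-tunnel action."""
    try:
        if tunnel.direction == "local":
            # Local-direction tunnels: stop/start only, no remote cleanup.
            reconnect(tunnel)
            return "restarted"
        if not needs_cleanup(tunnel):
            # Healthy remote forward: leave the working tunnel running.
            return "healthy"
        # Stale forward: clear the remote side first, then reconnect.
        cleanup_remote(tunnel)
        reconnect(tunnel)
        return "cleaned_and_restarted"
    except Exception:
        # Simplification: any failure in this sketch is reported as "error".
        return "error"


def exit_code(actions: list[str]) -> int:
    """Non-zero when any tunnel reported an error."""
    return 1 if "error" in actions else 0
```

Injecting the three callables keeps the decision logic testable without SSH access; the ordering (cleanup before reconnect) is the part that matters for the blank-slate guarantee.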

Use bridge maintenance cleanup for scheduled or manual hygiene without the restart contract. The nightly cron (bridge maintenance install-cron) runs maintenance cleanup --restart at 03:00.

Incident context: stale orphan sshd remote forwards after laptop sleep blocked bridge restart until operators discovered the maintenance subcommand. See state-hub/history/20260621-weekend-automation-assessment.md and BRIDGE-WP-0005 in this repo.

Host roles

Tunnels in ~/.config/bridge/tunnels.yaml serve three host roles:

| Role | Hosts | Behaviour |
|---|---|---|
| Workstation origin | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| VPS remotes | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
| LAN builder | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |

Conditional remote cleanup before restart benefits all reverse tunnels. should_cleanup_tunnel skips healthy forwards — VPS tunnels with live working forwards are untouched.
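
One way to picture the difference between a healthy forward and an orphan one is as a classification over two observations about the remote port. The sketch below is illustrative only: `ForwardState`, `classify_forward`, `cleanup_needed`, and the two boolean inputs are assumptions, and this page does not describe how should_cleanup_tunnel actually probes the remote host.

```python
from enum import Enum


class ForwardState(Enum):
    HEALTHY = "healthy"  # listener present and traffic reaches the local service
    ORPHAN = "orphan"    # listener still holds the port, but nothing answers behind it
    ABSENT = "absent"    # no listener on the remote port, so nothing to clean up


def classify_forward(listener_bound: bool, end_to_end_ok: bool) -> ForwardState:
    """Classify a remote forward from two (assumed) observations."""
    if not listener_bound:
        return ForwardState.ABSENT
    return ForwardState.HEALTHY if end_to_end_ok else ForwardState.ORPHAN


def cleanup_needed(state: ForwardState) -> bool:
    # Only orphan listeners are cleared; healthy forwards are skipped.
    return state is ForwardState.ORPHAN
```

Under this model, an always-on VPS with a live working forward classifies as healthy and is left alone, while a listener left behind after an unclean laptop sleep classifies as orphan.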


The Philosophy Behind OpsBridge

Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:

the declared system, the experienced system, and the needed system.

DevOps describes how systems should work.

Operations deals with how systems actually behave.

OpsBridge exists to make that gap manageable.


OpsBridge in One Sentence

OpsBridge is a lightweight operations access layer that helps humans and automation agents investigate, repair and improve live infrastructure.
