bridge restart now means blank-slate recovery: reverse tunnels run should_cleanup_tunnel and clear orphan remote listeners before reconnecting; healthy forwards are left running. Local-direction tunnels keep stop/start only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted, restarted, error) and exit non-zero on cleanup failure. Closes BRIDGE-WP-0005.
OpsBridge
Operations Access Bridges for Humans and Automation Agents
Modern IT infrastructure is automated, declarative, and continuously deployed. But when something breaks, real systems rarely behave exactly as expected.
Operators need to inspect, diagnose, and repair the running system — not the theoretical one described in infrastructure code.
OpsBridge provides a lightweight way to create controlled operational access paths between systems so humans and automation agents can investigate and resolve issues in live environments.
It is designed for the moment when intent meets reality.
Why OpsBridge Exists
Infrastructure teams increasingly rely on:
- Infrastructure as Code
- GitOps pipelines
- Kubernetes and cloud orchestration
- automated remediation
- AI-assisted diagnostics
These systems define how infrastructure should behave.
But operators deal with how it actually behaves.
The gap between these two worlds creates practical problems:
- debugging access requires ad-hoc SSH commands
- operators rely on shell history or tribal knowledge
- automation agents struggle to navigate infrastructure
- incident response becomes slow and inconsistent
OpsBridge provides a simple operational layer that makes access paths explicit, observable, and reusable.
What OpsBridge Does
OpsBridge manages Access Bridges for Operations Tasks.
An access bridge is a temporary and controlled connectivity path between systems used for operations work.
Example:
Remote diagnostic host
│
│ HTTP request
▼
reverse SSH bridge
▼
local control service
OpsBridge lets operators and agents:
- create bridges
- inspect active bridges
- reconnect bridges automatically
- associate bridges with actors
- track operational access events
All without introducing a VPN, overlay network, or heavy access platform.
Built for Human Operators and AI Agents
OpsBridge treats humans and automation as first-class actors.
Modern operations increasingly involve:
- diagnostic agents
- automated remediation
- AI-assisted debugging
- ephemeral execution environments
OpsBridge makes it possible to safely give these systems the temporary access they need to understand and repair infrastructure.
Every bridge is associated with an actor, making operational activity observable and attributable.
Introducing OpsCatalog
OpsBridge works even better when paired with OpsCatalog, a Git-based repository that captures the operational view of infrastructure.
Where DevOps tools describe how infrastructure should exist, OpsCatalog captures how operators actually interact with it.
OpsCatalog defines:
- operational domains
- infrastructure targets
- operational bridges
- debugging entry points
- operational notes and procedures
Together, OpsBridge and OpsCatalog provide a shared operational map that helps teams navigate real infrastructure.
A New Layer in the Infrastructure Stack
OpsBridge fits between infrastructure automation and real-world operations.
Infrastructure as Code
│
│ expected state
▼
OpsCatalog
│
│ operations knowledge
▼
OpsBridge
│
│ access bridges
▼
Live Infrastructure
This layer allows operators and automation systems to work with the infrastructure that actually exists, not just the one defined in configuration.
Designed for Practical Operations
OpsBridge focuses on simplicity.
It is:
- lightweight
- CLI-driven
- infrastructure-agnostic
- automation-friendly
- identity-integrated
It integrates with existing systems such as identity providers without replacing them.
No new network layer. No complex access gateway.
Just controlled operational access when you need it.
Example Workflow
Start a bridge:
bridge up state-hub-railiance01
Check active bridges:
bridge status
Investigate infrastructure targets:
bridge targets
Stop the bridge when finished:
bridge down state-hub-railiance01
OpsBridge handles the lifecycle so operators can focus on solving the problem.
Tunnel lifecycle commands
| Command | Purpose |
|---|---|
| bridge up | Start tunnel(s) that are not already running |
| bridge down | Stop tunnel(s) that are running |
| bridge restart | Blank-slate recovery: get tunnel(s) operational again |
| bridge maintenance cleanup | Proactive hygiene sweep without implying restart |
bridge restart — blank-slate recovery
bridge restart means the tunnel is operational again, not merely that the local
manager PID was cycled while a broken remote listener still holds the port.
For reverse tunnels (State Hub exposure on remote hosts), restart:
- Runs should_cleanup_tunnel to detect stale SSH remote forwards
- Clears orphan listeners on the remote host when needed
- Reconnects the tunnel (stop + start) only when cleanup was required
When the remote forward is already healthy, restart reports healthy and leaves
the working tunnel running — no unnecessary disruption.
For local-direction tunnels (direction: local in tunnels.yaml, e.g.
k3s-api-coulombcore), restart uses local stop/start only; no remote cleanup.
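The restart contract described above can be sketched as a small decision function. This is a minimal illustration, not the project's implementation: the `Tunnel` type and the `needs_cleanup`, `clear_remote`, `stop`, and `start` callables are hypothetical stand-ins (`needs_cleanup` plays the role of should_cleanup_tunnel). The mapping of the `restarted` action to local-direction tunnels is an assumption based on the reported action names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tunnel:
    name: str
    direction: str  # "reverse" or "local", mirroring `direction:` in tunnels.yaml


def restart_tunnel(
    tunnel: Tunnel,
    needs_cleanup: Callable[[Tunnel], bool],  # stand-in for should_cleanup_tunnel
    clear_remote: Callable[[Tunnel], None],   # hypothetical: clears orphan remote listeners
    stop: Callable[[Tunnel], None],           # hypothetical: stops the local tunnel process
    start: Callable[[Tunnel], None],          # hypothetical: starts the tunnel
) -> str:
    """Return one of: healthy, cleaned_and_restarted, restarted, error."""
    try:
        if tunnel.direction == "local":
            # Local-direction tunnels: plain stop/start, no remote cleanup.
            stop(tunnel)
            start(tunnel)
            return "restarted"
        if not needs_cleanup(tunnel):
            # Healthy reverse forward: leave the working tunnel running.
            return "healthy"
        # Stale forward: clear the remote side first, then reconnect.
        clear_remote(tunnel)
        stop(tunnel)
        start(tunnel)
        return "cleaned_and_restarted"
    except Exception:
        # Any failure in cleanup or reconnect is reported as an error.
        return "error"
```

With a health check that reports no cleanup needed, the function returns `healthy` without calling `stop` or `start`, which is the "no unnecessary disruption" property.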
Use bridge maintenance cleanup for scheduled or manual hygiene without the
restart contract. The nightly cron (bridge maintenance install-cron) runs
maintenance cleanup --restart at 03:00.
Incident context: stale orphan sshd remote forwards after laptop sleep
blocked bridge restart until operators discovered the maintenance subcommand.
See state-hub/history/20260621-weekend-automation-assessment.md and
BRIDGE-WP-0005 in this repo.
Host roles
Tunnels in ~/.config/bridge/tunnels.yaml serve three host roles:
| Role | Hosts | Behaviour |
|---|---|---|
| Workstation origin | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| VPS remotes | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
| LAN builder | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |
Conditional remote cleanup before restart benefits all reverse tunnels.
should_cleanup_tunnel skips healthy forwards — VPS tunnels with live working
forwards are untouched.
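Because a single restart can touch tunnels on several hosts, outcomes may be mixed: some forwards healthy, some cleaned, some failed. The sketch below shows one way per-tunnel actions could be summarised into a report and a process exit status. The function names, the report layout, and the specific exit value `1` are assumptions for illustration; the source only states that the command exits non-zero on cleanup failure.

```python
from typing import Dict


def exit_code(actions: Dict[str, str]) -> int:
    """Return 0 when no tunnel reported `error`, else 1 (assumed non-zero value)."""
    return 1 if any(action == "error" for action in actions.values()) else 0


def format_report(actions: Dict[str, str]) -> str:
    """Render one `name: action` line per tunnel, sorted by tunnel name."""
    return "\n".join(f"{name}: {action}" for name, action in sorted(actions.items()))
```

Keeping the exit status separate from the report lets a cron job or agent react to failure without parsing the per-tunnel text.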
The Philosophy Behind OpsBridge
Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:
the declared system, the experienced system, and the needed system
DevOps describes how systems should work.
Operations deals with how systems actually behave.
OpsBridge exists to make that gap manageable.
OpsBridge in One Sentence
OpsBridge is a lightweight operations access layer that helps humans and automation agents investigate, repair and improve live infrastructure.