ops-bridge/wiki/OpsBridge.md
tegwick 10c6fdaec9 feat(restart): route reverse tunnels through stale-forward cleanup
bridge restart now means blank-slate recovery: reverse tunnels run
should_cleanup_tunnel and clear orphan remote listeners before reconnecting;
healthy forwards are left running. Local-direction tunnels keep stop/start
only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted,
restarted, error) and exit non-zero on cleanup failure.

Closes BRIDGE-WP-0005.
2026-06-21 20:12:13 +02:00


OpsBridge
*Operations access for humans and agents*
# OpsBridge
**Operations Access Bridges for Humans and Automation Agents**
Modern IT infrastructure is automated, declarative, and continuously deployed.
But when something breaks, real systems rarely behave exactly as expected.
Operators need to **inspect, diagnose, and repair the running system** — not the theoretical one described in infrastructure code.
**OpsBridge** provides a lightweight way to create **controlled operational access paths** between systems so humans and automation agents can investigate and resolve issues in live environments.
It is designed for the moment when **intent meets reality**.
---
# Why OpsBridge Exists
Infrastructure teams increasingly rely on:
* Infrastructure as Code
* GitOps pipelines
* Kubernetes and cloud orchestration
* automated remediation
* AI-assisted diagnostics
These systems define **how infrastructure should behave**.
But operators deal with **how it actually behaves**.
The gap between these two worlds creates practical problems:
* debugging access requires ad-hoc SSH commands
* operators rely on shell history or tribal knowledge
* automation agents struggle to navigate infrastructure
* incident response becomes slow and inconsistent
OpsBridge provides a **simple operational layer** that makes access paths explicit, observable, and reusable.
---
# What OpsBridge Does
OpsBridge manages **Access Bridges for Operations Tasks**.
An access bridge is a temporary and controlled connectivity path between systems used for operations work.
Example:
```
Remote diagnostic host
        │  HTTP request
        ▼
reverse SSH bridge
        │
        ▼
local control service
```
OpsBridge lets operators and agents:
* create bridges
* inspect active bridges
* reconnect bridges automatically
* associate bridges with actors
* track operational access events
All without introducing a VPN, overlay network, or heavy access platform.
---
# Built for Human Operators and AI Agents
OpsBridge treats **humans and automation as first-class actors**.
Modern operations increasingly involve:
* diagnostic agents
* automated remediation
* AI-assisted debugging
* ephemeral execution environments
OpsBridge makes it possible to safely give these systems the **temporary access they need to understand and repair infrastructure**.
Every bridge is associated with an actor, making operational activity observable and attributable.
---
# Introducing OpsCatalog
OpsBridge works even better when paired with **OpsCatalog**, a Git-based repository that captures the operational view of infrastructure.
Where DevOps tools describe **how infrastructure should exist**, OpsCatalog captures **how operators actually interact with it**.
OpsCatalog defines:
* operational domains
* infrastructure targets
* operational bridges
* debugging entry points
* operational notes and procedures
Together, OpsBridge and OpsCatalog provide a shared operational map that helps teams navigate real infrastructure.
---
# A New Layer in the Infrastructure Stack
OpsBridge fits between infrastructure automation and real-world operations.
```
Infrastructure as Code
        │  expected state
        ▼
OpsCatalog
        │  operations knowledge
        ▼
OpsBridge
        │  access bridges
        ▼
Live Infrastructure
```
This layer allows operators and automation systems to work with **the infrastructure that actually exists**, not just the one defined in configuration.
---
# Designed for Practical Operations
OpsBridge focuses on simplicity.
It is:
* lightweight
* CLI-driven
* infrastructure-agnostic
* automation-friendly
* identity-integrated
It integrates with existing systems such as identity providers without replacing them.
No new network layer.
No complex access gateway.
Just controlled operational access when you need it.
---
# Example Workflow
Start a bridge:
```
bridge up state-hub-railiance01
```
Check active bridges:
```
bridge status
```
Investigate infrastructure targets:
```
bridge targets
```
Stop the bridge when finished:
```
bridge down state-hub-railiance01
```
OpsBridge handles the lifecycle so operators can focus on solving the problem.
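For automation agents, the same workflow can be wrapped so that the bridge is always taken down, even when the diagnostic step fails. The following is a minimal sketch, not part of OpsBridge: `bridge_session` and `run_cli` are hypothetical helper names, and the only CLI invocations used are the `bridge up`, `bridge status`, and `bridge down` commands shown above.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

# A runner takes an argv list and executes it; injectable so the sketch can be tested.
Runner = Callable[[Sequence[str]], None]


def run_cli(argv: Sequence[str]) -> None:
    """Default runner: execute the command and raise if it exits non-zero."""
    subprocess.run(list(argv), check=True)


@contextmanager
def bridge_session(name: str, run: Runner = run_cli) -> Iterator[None]:
    """Hypothetical helper: keep a bridge up for the duration of a with-block."""
    run(["bridge", "up", name])
    try:
        yield
    finally:
        # Runs even if the work inside the block raises,
        # so the access path is not left open by accident.
        run(["bridge", "down", name])
```

An agent could then write `with bridge_session("state-hub-railiance01"): run_cli(["bridge", "status"])` and rely on the `finally` clause to close the bridge afterwards.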
---
# Tunnel lifecycle commands
| Command | Purpose |
|---------|---------|
| `bridge up` | Start tunnel(s) that are not already running |
| `bridge down` | Stop tunnel(s) that are running |
| `bridge restart` | Blank-slate recovery — get tunnel(s) operational again |
| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |
## `bridge restart` — blank-slate recovery
`bridge restart` means *operational again*, not merely cycling the local manager
PID while a broken remote listener still holds the port.
For **reverse** tunnels (State Hub exposure on remote hosts), restart:
1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
2. Clears orphan listeners on the remote host when needed
3. Reconnects the tunnel (stop + start) only when cleanup was required
When the remote forward is already healthy, restart reports `healthy` and leaves
the working tunnel running — no unnecessary disruption.
For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g.
`k3s-api-coulombcore`), restart uses local stop/start only; no remote cleanup.
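The restart contract above can be sketched as a small decision function. This is an illustration under stated assumptions, not the actual implementation: the `Tunnel` and `Ops` types, the hook names other than `should_cleanup_tunnel`, and the hook signatures are all hypothetical. The action strings are the ones named in the change note (`healthy`, `cleaned_and_restarted`, `restarted`, `error`); mapping `restarted` to local-direction tunnels and treating any `error` as a non-zero exit are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Tunnel:
    name: str
    direction: str  # "reverse" or "local", mirroring `direction:` in tunnels.yaml


@dataclass
class Ops:
    """Hypothetical hooks; the real functions and signatures are not shown here."""
    should_cleanup_tunnel: Callable[[Tunnel], bool]
    clear_remote_listeners: Callable[[Tunnel], None]
    stop: Callable[[Tunnel], None]
    start: Callable[[Tunnel], None]


def restart_tunnel(tunnel: Tunnel, ops: Ops) -> str:
    """Return the per-tunnel action for one restart attempt."""
    try:
        if tunnel.direction == "local":
            # Local-direction tunnels: plain stop/start, no remote cleanup.
            ops.stop(tunnel)
            ops.start(tunnel)
            return "restarted"
        if not ops.should_cleanup_tunnel(tunnel):
            # Healthy reverse forward: leave the working tunnel untouched.
            return "healthy"
        # Stale forward: clear orphan remote listeners, then reconnect.
        ops.clear_remote_listeners(tunnel)
        ops.stop(tunnel)
        ops.start(tunnel)
        return "cleaned_and_restarted"
    except Exception:
        return "error"


def restart_all(tunnels: Iterable[Tunnel], ops: Ops) -> Tuple[List[Tuple[str, str]], int]:
    """Return (report, exit_code); exit_code is 1 if any tunnel ended in error."""
    report = [(t.name, restart_tunnel(t, ops)) for t in tunnels]
    exit_code = 1 if any(action == "error" for _, action in report) else 0
    return report, exit_code
```

Keeping the hooks injectable makes the ordering (detect, clean, then reconnect) easy to check without touching a real remote host.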
Use `bridge maintenance cleanup` for scheduled or manual hygiene without the
restart contract. The nightly cron (`bridge maintenance install-cron`) runs
`maintenance cleanup --restart` at 03:00.
**Incident context:** stale orphan `sshd` remote forwards after laptop sleep
blocked `bridge restart` until operators discovered the maintenance subcommand.
See `state-hub/history/20260621-weekend-automation-assessment.md` and
`BRIDGE-WP-0005` in this repo.
## Host roles
Tunnels defined in `~/.config/bridge/tunnels.yaml` connect hosts in three roles:
| Role | Hosts | Behaviour |
|------|-------|-----------|
| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but when the laptop comes back online, orphan forwards from its previous session can remain if the VPS did not reboot in the meantime. |
| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |
Conditional remote cleanup before restart benefits all reverse tunnels.
`should_cleanup_tunnel` skips healthy forwards — VPS tunnels with live working
forwards are untouched.
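One way to picture the distinction between a healthy forward and an orphan is a two-signal classification. The sketch below is hypothetical and does not describe how `should_cleanup_tunnel` is actually implemented: `ForwardObservation`, `classify`, and `needs_remote_cleanup` are illustrative names, and it assumes two observations are available, namely whether something on the remote host holds the forwarded port and whether a request through the forward reaches the local service.

```python
from dataclasses import dataclass
from enum import Enum


class ForwardState(Enum):
    HEALTHY = "healthy"  # traffic through the forward reaches the local service
    ORPHAN = "orphan"    # the port is held remotely, but nothing answers behind it
    ABSENT = "absent"    # no listener on the remote port at all


@dataclass(frozen=True)
class ForwardObservation:
    listener_present: bool  # assumed signal: remote port is bound
    probe_ok: bool          # assumed signal: end-to-end probe succeeded


def classify(obs: ForwardObservation) -> ForwardState:
    if obs.probe_ok:
        return ForwardState.HEALTHY
    if obs.listener_present:
        return ForwardState.ORPHAN
    return ForwardState.ABSENT


def needs_remote_cleanup(obs: ForwardObservation) -> bool:
    """In this sketch, only an orphan listener warrants remote cleanup."""
    return classify(obs) is ForwardState.ORPHAN
```

Under this model, an always-on VPS with a live forward classifies as healthy and is skipped, while a port still held after an unclean workstation sleep classifies as an orphan.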
---
# The Philosophy Behind OpsBridge
Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:
**the declared system**
and
**the experienced system**
and
**the needed system**
DevOps describes how systems should work.
Operations deals with how systems actually behave.
OpsBridge exists to make those gaps manageable.
---
# OpsBridge in One Sentence
**OpsBridge is a lightweight operations access layer that helps humans and automation agents investigate, repair and improve live infrastructure.**