generated from coulomb/repo-seed
bridge restart now means blank-slate recovery: reverse tunnels run should_cleanup_tunnel and clear orphan remote listeners before reconnecting; healthy forwards are left running. Local-direction tunnels keep stop/start only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted, restarted, error) and exit non-zero on cleanup failure. Closes BRIDGE-WP-0005.
OpsBridge

*Operations access for humans and agents*

# OpsBridge

**Operations Access Bridges for Humans and Automation Agents**

Modern IT infrastructure is automated, declarative, and continuously deployed.
But when something breaks, real systems rarely behave exactly as expected.

Operators need to **inspect, diagnose, and repair the running system**, not the theoretical one described in infrastructure code.

**OpsBridge** provides a lightweight way to create **controlled operational access paths** between systems so humans and automation agents can investigate and resolve issues in live environments.

It is designed for the moment when **intent meets reality**.

---
# Why OpsBridge Exists

Infrastructure teams increasingly rely on:

* Infrastructure as Code
* GitOps pipelines
* Kubernetes and cloud orchestration
* automated remediation
* AI-assisted diagnostics

These systems define **how infrastructure should behave**.

But operators deal with **how it actually behaves**.

The gap between these two worlds creates practical problems:

* debugging access requires ad-hoc SSH commands
* operators rely on shell history or tribal knowledge
* automation agents struggle to navigate infrastructure
* incident response becomes slow and inconsistent

OpsBridge provides a **simple operational layer** that makes access paths explicit, observable, and reusable.

---
# What OpsBridge Does

OpsBridge manages **Access Bridges for Operations Tasks**.

An access bridge is a temporary and controlled connectivity path between systems used for operations work.

Example:

```
Remote diagnostic host
        │
        │ HTTP request
        ▼
reverse SSH bridge
        ▼
local control service
```
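
To make the reverse SSH bridge in the diagram more concrete, the sketch below builds an OpenSSH command line for a remote port forward. It illustrates the general technique and is not OpsBridge's actual invocation: the function name, host, and ports are hypothetical. The only things assumed from OpenSSH are the standard options `-N` and `-R`, plus `ExitOnForwardFailure` and `ServerAliveInterval`. `ExitOnForwardFailure=yes` is relevant here because it makes `ssh` exit instead of staying connected when the remote port is already held by another listener.

```python
from typing import List


def reverse_tunnel_argv(
    remote_host: str,
    remote_port: int,
    local_port: int,
    local_host: str = "127.0.0.1",
) -> List[str]:
    """Build an OpenSSH argv for a reverse (remote) port forward.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    for port in (remote_port, local_port):
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
    return [
        "ssh",
        "-N",  # no remote command, forward ports only
        "-o", "ExitOnForwardFailure=yes",  # exit if the remote port cannot be bound
        "-o", "ServerAliveInterval=30",  # keepalives so a dead link is noticed
        # Listen on remote_port on the remote host, deliver to local_host:local_port here.
        "-R", f"{remote_port}:{local_host}:{local_port}",
        remote_host,
    ]


# Example with placeholder values: print the command instead of running it.
print(" ".join(reverse_tunnel_argv("ops@diagnostic-host.example", 8080, 8080)))
```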

OpsBridge lets operators and agents:

* create bridges
* inspect active bridges
* reconnect bridges automatically
* associate bridges with actors
* track operational access events

All without introducing a VPN, overlay network, or heavy access platform.

---
# Built for Human Operators and AI Agents

OpsBridge treats **humans and automation as first-class actors**.

Modern operations increasingly involve:

* diagnostic agents
* automated remediation
* AI-assisted debugging
* ephemeral execution environments

OpsBridge makes it possible to safely give these systems the **temporary access they need to understand and repair infrastructure**.

Every bridge is associated with an actor, making operational activity observable and attributable.

---
# Introducing OpsCatalog

OpsBridge works even better when paired with **OpsCatalog**, a Git-based repository that captures the operational view of infrastructure.

Where DevOps tools describe **how infrastructure should exist**, OpsCatalog captures **how operators actually interact with it**.

OpsCatalog defines:

* operational domains
* infrastructure targets
* operational bridges
* debugging entry points
* operational notes and procedures

Together, OpsBridge and OpsCatalog provide a shared operational map that helps teams navigate real infrastructure.

---
# A New Layer in the Infrastructure Stack

OpsBridge fits between infrastructure automation and real-world operations.

```
Infrastructure as Code
        │
        │ expected state
        ▼
OpsCatalog
        │
        │ operations knowledge
        ▼
OpsBridge
        │
        │ access bridges
        ▼
Live Infrastructure
```

This layer allows operators and automation systems to work with **the infrastructure that actually exists**, not just the one defined in configuration.

---
# Designed for Practical Operations

OpsBridge focuses on simplicity.

It is:

* lightweight
* CLI-driven
* infrastructure-agnostic
* automation-friendly
* identity-integrated

It integrates with existing systems such as identity providers without replacing them.

No new network layer.
No complex access gateway.

Just controlled operational access when you need it.

---
# Example Workflow

Start a bridge:

```
bridge up state-hub-railiance01
```

Check active bridges:

```
bridge status
```

Investigate infrastructure targets:

```
bridge targets
```

Stop the bridge when finished:

```
bridge down state-hub-railiance01
```
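
For automation agents, the same up/work/down sequence can be wrapped so the bridge is taken down even when the work in between fails. The following is a minimal sketch, assuming the `bridge` CLI is on `PATH` and that `bridge up <name>` and `bridge down <name>` exit non-zero on failure. The names `bridge_session` and `run_checked` are hypothetical and not part of OpsBridge. A typical use would be `with bridge_session("state-hub-railiance01"): ...` around the diagnostic steps.

```python
import subprocess
from contextlib import contextmanager
from typing import Callable, Iterator, Sequence

# A runner takes an argv list and raises if the command fails.
Runner = Callable[[Sequence[str]], None]


def run_checked(argv: Sequence[str]) -> None:
    """Run a command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(list(argv), check=True)


@contextmanager
def bridge_session(name: str, run: Runner = run_checked) -> Iterator[str]:
    """Bring a bridge up for the duration of a block, then take it down.

    Hypothetical helper: only `bridge up` and `bridge down` from the
    workflow above are used.
    """
    run(["bridge", "up", name])
    try:
        yield name
    finally:
        # Runs even if the body raises, so the bridge is not left open.
        run(["bridge", "down", name])
```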

OpsBridge handles the lifecycle so operators can focus on solving the problem.

---
# Tunnel lifecycle commands

| Command | Purpose |
|---------|---------|
| `bridge up` | Start tunnel(s) that are not already running |
| `bridge down` | Stop tunnel(s) that are running |
| `bridge restart` | Blank-slate recovery: get tunnel(s) operational again |
| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |

## `bridge restart`: blank-slate recovery

`bridge restart` means *operational again*, not merely cycling the local manager
PID while a broken remote listener still holds the port.

For **reverse** tunnels (State Hub exposure on remote hosts), restart:

1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
2. Clears orphan listeners on the remote host when needed
3. Reconnects the tunnel (stop + start) only when cleanup was required

When the remote forward is already healthy, restart reports `healthy` and leaves
the working tunnel running, avoiding unnecessary disruption.

For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g.
`k3s-api-coulombcore`), restart uses local stop/start only, with no remote cleanup.
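
The restart contract described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the project's implementation. The `Tunnel` and `Ops` types and the `clear_remote_listeners`, `stop`, and `start` hooks are hypothetical, and `should_cleanup_tunnel` is treated as an opaque predicate. The action names are the per-tunnel actions the CLI is described as reporting. Two mappings are assumptions: `restarted` for local-direction tunnels, and exit code 1 for any `error`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable

# Per-tunnel action names as reported by the CLI.
HEALTHY = "healthy"
CLEANED_AND_RESTARTED = "cleaned_and_restarted"
RESTARTED = "restarted"
ERROR = "error"


@dataclass(frozen=True)
class Tunnel:
    name: str
    direction: str  # "local" as in tunnels.yaml; anything else is treated as reverse


@dataclass
class Ops:
    """Hypothetical hooks; the real implementations are not shown here."""
    should_cleanup_tunnel: Callable[[Tunnel], bool]
    clear_remote_listeners: Callable[[Tunnel], None]
    stop: Callable[[Tunnel], None]
    start: Callable[[Tunnel], None]


def restart_tunnel(tunnel: Tunnel, ops: Ops) -> str:
    """Return the action taken for one tunnel."""
    try:
        if tunnel.direction == "local":
            # Local-direction tunnels: plain stop/start, no remote cleanup.
            ops.stop(tunnel)
            ops.start(tunnel)
            return RESTARTED
        if not ops.should_cleanup_tunnel(tunnel):
            # Healthy remote forward: leave the working tunnel running.
            return HEALTHY
        # Stale forward: clear orphan listeners, then reconnect.
        ops.clear_remote_listeners(tunnel)
        ops.stop(tunnel)
        ops.start(tunnel)
        return CLEANED_AND_RESTARTED
    except Exception:
        return ERROR


def restart_all(tunnels: Iterable[Tunnel], ops: Ops) -> Dict[str, str]:
    """Map each tunnel name to the action taken for it."""
    return {t.name: restart_tunnel(t, ops) for t in tunnels}


def exit_code(actions: Dict[str, str]) -> int:
    """Non-zero when any tunnel reported an error (assumed mapping)."""
    return 1 if ERROR in actions.values() else 0
```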

Use `bridge maintenance cleanup` for scheduled or manual hygiene without the
restart contract. The nightly cron (`bridge maintenance install-cron`) runs
`maintenance cleanup --restart` at 03:00.

**Incident context:** stale orphan `sshd` remote forwards after laptop sleep
blocked `bridge restart` until operators discovered the maintenance subcommand.
See `state-hub/history/20260621-weekend-automation-assessment.md` and
`BRIDGE-WP-0005` in this repo.

## Host roles

Tunnels in `~/.config/bridge/tunnels.yaml` serve three host roles:

| Role | Hosts | Behaviour |
|------|-------|-----------|
| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |

Conditional remote cleanup before restart benefits all reverse tunnels.
`should_cleanup_tunnel` skips healthy forwards, so VPS tunnels with live working
forwards are untouched.

---
# The Philosophy Behind OpsBridge

Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:

**the declared system**
and
**the experienced system**
and
**the needed system**

DevOps describes how systems should work.

Operations deals with how systems actually behave.

OpsBridge exists to make that gap manageable.

---
# OpsBridge in One Sentence

**OpsBridge is a lightweight operations access layer that helps humans and automation agents investigate, repair, and improve live infrastructure.**