# OpsBridge

*Operations access for humans and agents*

**Operations Access Bridges for Humans and Automation Agents**

Modern IT infrastructure is automated, declarative, and continuously deployed.

But when something breaks, real systems rarely behave exactly as expected. Operators need to **inspect, diagnose, and repair the running system**, not the theoretical one described in infrastructure code.

**OpsBridge** provides a lightweight way to create **controlled operational access paths** between systems so humans and automation agents can investigate and resolve issues in live environments.

It is designed for the moment when **intent meets reality**.

---

# Why OpsBridge Exists

Infrastructure teams increasingly rely on:

* Infrastructure as Code
* GitOps pipelines
* Kubernetes and cloud orchestration
* automated remediation
* AI-assisted diagnostics

These systems define **how infrastructure should behave**.

But operators deal with **how it actually behaves**.

The gap between these two worlds creates practical problems:

* debugging access requires ad-hoc SSH commands
* operators rely on shell history or tribal knowledge
* automation agents struggle to navigate infrastructure
* incident response becomes slow and inconsistent

OpsBridge provides a **simple operational layer** that makes access paths explicit, observable, and reusable.

---

# What OpsBridge Does

OpsBridge manages **Access Bridges for Operations Tasks**.

An access bridge is a temporary and controlled connectivity path between systems used for operations work.

Example:

```
Remote diagnostic host
        │
        │  HTTP request
        ▼
reverse SSH bridge
        ▼
local control service
```

OpsBridge lets operators and agents:

* create bridges
* inspect active bridges
* reconnect bridges automatically
* associate bridges with actors
* track operational access events

All without introducing a VPN, overlay network, or heavy access platform.
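
The reverse SSH bridge in the example diagram of this section can be pictured as an ordinary OpenSSH remote forward. The sketch below is illustrative only: the `ReverseBridge` class, its field names, the host name, and the port numbers are assumptions made for this example, not OpsBridge's actual schema or implementation. The `ssh` options used (`-N`, `-R`, `ExitOnForwardFailure`, `ServerAliveInterval`) are standard OpenSSH options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReverseBridge:
    """Hypothetical description of one reverse bridge (not OpsBridge's real schema)."""

    remote_host: str  # SSH destination that will expose the service
    remote_port: int  # port opened on the remote host
    local_port: int   # port of the local control service
    actor: str        # human or agent the bridge is attributed to


def ssh_argv(bridge: ReverseBridge) -> list[str]:
    """Build an OpenSSH command line for a reverse (remote) port forward."""
    forward = f"{bridge.remote_port}:localhost:{bridge.local_port}"
    return [
        "ssh",
        "-N",                              # forwarding only, no remote command
        "-o", "ExitOnForwardFailure=yes",  # exit if the remote port cannot be bound
        "-o", "ServerAliveInterval=30",    # send keepalives to notice dead links
        "-R", forward,                     # remote_port on the host -> local service
        bridge.remote_host,
    ]


# Assumed example values; replace with a real target before running ssh.
demo = ReverseBridge("diag.example.net", 18000, 8000, actor="agent:diagnostics")
print(" ".join(ssh_argv(demo)))
```

Keeping the actor on the bridge description, even though `ssh` itself does not use it, mirrors the idea that every bridge is attributable to whoever opened it.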

---

# Built for Human Operators and AI Agents

OpsBridge treats **humans and automation as first-class actors**.

Modern operations increasingly involve:

* diagnostic agents
* automated remediation
* AI-assisted debugging
* ephemeral execution environments

OpsBridge makes it possible to safely give these systems the **temporary access they need to understand and repair infrastructure**.

Every bridge is associated with an actor, making operational activity observable and attributable.

---

# Introducing OpsCatalog

OpsBridge works even better when paired with **OpsCatalog**, a Git-based repository that captures the operational view of infrastructure.

Where DevOps tools describe **how infrastructure should exist**, OpsCatalog captures **how operators actually interact with it**.

OpsCatalog defines:

* operational domains
* infrastructure targets
* operational bridges
* debugging entry points
* operational notes and procedures

Together, OpsBridge and OpsCatalog provide a shared operational map that helps teams navigate real infrastructure.

---

# A New Layer in the Infrastructure Stack

OpsBridge fits between infrastructure automation and real-world operations.

```
Infrastructure as Code
        │
        │  expected state
        ▼
OpsCatalog
        │
        │  operations knowledge
        ▼
OpsBridge
        │
        │  access bridges
        ▼
Live Infrastructure
```

This layer allows operators and automation systems to work with **the infrastructure that actually exists**, not just the one defined in configuration.

---

# Designed for Practical Operations

OpsBridge focuses on simplicity.

It is:

* lightweight
* CLI-driven
* infrastructure-agnostic
* automation-friendly
* identity-integrated

It integrates with existing systems such as identity providers without replacing them.

No new network layer.
No complex access gateway.

Just controlled operational access when you need it.

---

# Example Workflow

Start a bridge:

```
bridge up state-hub-railiance01
```

Check active bridges:

```
bridge status
```

Investigate infrastructure targets:

```
bridge targets
```

Stop the bridge when finished:

```
bridge down state-hub-railiance01
```

OpsBridge handles the lifecycle so operators can focus on solving the problem.

---

# Tunnel lifecycle commands

| Command | Purpose |
|---------|---------|
| `bridge up` | Start tunnel(s) that are not already running |
| `bridge down` | Stop tunnel(s) that are running |
| `bridge restart` | Blank-slate recovery: get tunnel(s) operational again |
| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |

## `bridge restart`: blank-slate recovery

`bridge restart` means *operational again*, not merely cycling the local manager PID while a broken remote listener still holds the port.

For **reverse** tunnels (State Hub exposure on remote hosts), restart:

1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
2. Clears orphan listeners on the remote host when needed
3. Reconnects the tunnel (stop + start) only when cleanup was required

When the remote forward is already healthy, restart reports `healthy` and leaves the working tunnel running, with no unnecessary disruption.

For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g. `k3s-api-coulombcore`), restart uses local stop/start only; no remote cleanup.

Use `bridge maintenance cleanup` for scheduled or manual hygiene without the restart contract. The nightly cron (`bridge maintenance install-cron`) runs `maintenance cleanup --restart` at 03:00.

**Incident context:** stale orphan `sshd` remote forwards after laptop sleep blocked `bridge restart` until operators discovered the maintenance subcommand. See `state-hub/history/20260621-weekend-automation-assessment.md` and `BRIDGE-WP-0005` in this repo.
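
The restart contract described in this section can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the tool's actual code: the function name `restart_tunnel`, its parameters, and the `"restarted"` and `"recovered"` status strings are hypothetical. Only the `healthy` status, the `direction: local` setting, and the name `should_cleanup_tunnel` come from the description above, and the real signature of `should_cleanup_tunnel` is not known here, so it is passed in as a plain callable.

```python
from typing import Callable


def restart_tunnel(
    name: str,
    direction: str,
    should_cleanup: Callable[[str], bool],
    clear_remote_listeners: Callable[[str], None],
    stop: Callable[[str], None],
    start: Callable[[str], None],
) -> str:
    """Bring one tunnel back to a working state and return a status string."""
    if direction == "local":
        # Local-direction tunnels: plain stop/start, no remote cleanup.
        stop(name)
        start(name)
        return "restarted"  # hypothetical status name

    if not should_cleanup(name):
        # Remote forward is healthy: leave the working tunnel alone.
        return "healthy"

    # Stale forward detected: clear orphan listeners, then reconnect.
    clear_remote_listeners(name)
    stop(name)
    start(name)
    return "recovered"  # hypothetical status name
```

Passing the side-effecting operations in as callables keeps the decision logic testable without SSH access; a caller could supply recording stubs to confirm that a healthy reverse tunnel triggers no stop or start at all.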

## Host roles

Tunnels in `~/.config/bridge/tunnels.yaml` serve three host roles:

| Role | Hosts | Behaviour |
|------|-------|-----------|
| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |

Conditional remote cleanup before restart benefits all reverse tunnels. `should_cleanup_tunnel` skips healthy forwards, so VPS tunnels with live working forwards are untouched.

---

# The Philosophy Behind OpsBridge

Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:

**the declared system**
and
**the experienced system**
and
**the needed system**

DevOps describes how systems should work.

Operations deals with how systems actually behave.

OpsBridge exists to make that gap manageable.

---

# OpsBridge in One Sentence

**OpsBridge is a lightweight operations access layer that helps humans and automation agents investigate, repair and improve live infrastructure.**