Compare commits

...

33 Commits

Author SHA1 Message Date
6572a2ac99 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-07-03:
  - update .custodian-brief.md for ops-bridge
2026-07-03 18:52:51 +02:00
ce0aa728b1 tunnels: optional remote_host forward destination (default 127.0.0.1)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-07-02 14:18:18 +02:00
00671f5133 Normalize agent instructions and workplan frontmatter (STATE-WP-0067)
- Align agent files with on-disk workplan prefixes (infer from workplan ids)
- Set workplan domain to registered domain_slug; add topic_slug where applicable
- Repair frontmatter delimiter formatting; migrate legacy task status literals
- Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates
2026-06-22 23:16:27 +02:00
09f2cd4b7a Mark .repo-classification.yaml human-reviewed (CUST-WP-0050 T02)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:40:44 +02:00
c3b4fb9d55 Reclassify as tooling (CUST-WP-0050 T02)
Apply the new 'tooling' category (reusable internal tooling/infrastructure)
from the Repo Classification Standard. First-pass agent classification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 03:06:02 +02:00
fab7409c66 Add repo classification (CUST-WP-0050 T02)
First-pass agent classification per the Repo Classification Standard v1.0
(canon-repo-classification); pending human review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 02:44:47 +02:00
1dd664c792 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-21:
  - update .custodian-brief.md for ops-bridge
2026-06-21 20:12:38 +02:00
10c6fdaec9 feat(restart): route reverse tunnels through stale-forward cleanup
bridge restart now means blank-slate recovery: reverse tunnels run
should_cleanup_tunnel and clear orphan remote listeners before reconnecting;
healthy forwards are left running. Local-direction tunnels keep stop/start
only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted,
restarted, error) and exit non-zero on cleanup failure.

Closes BRIDGE-WP-0005.
2026-06-21 20:12:13 +02:00
8c11acc00c docs(ops-bridge): BRIDGE-WP-0005 restart includes remote cleanup
Add workplan to make bridge restart perform conditional stale-forward
cleanup before start (blank-slate recovery). Refines topology for laptop
workstation origin, intermittently offline haskelseed, and stable VPS
remotes (coulombcore, railiance01). Origin: STATE-WP-0063 tunnel incident.
Registered in State Hub via fix-consistency.
2026-06-21 20:02:18 +02:00
499b8781cc chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-21:
  - update .custodian-brief.md for ops-bridge
2026-06-21 20:02:10 +02:00
4e9882909f feat(maintenance): nightly stale SSH forward cleanup at 03:00
Add bridge maintenance cleanup to detect reverse tunnels whose remote
port is bound but no longer forwards (zombie sshd sessions), kill the
stale listeners on the remote host, and optionally restart the tunnel.

Includes install-cron/uninstall-cron/show-cron helpers and README notes
for the actcore-state-hub-bridge failure mode we hit on railiance01.
2026-06-19 15:59:27 +02:00
a6857fb8f7 Add credential routing instructions for all agent runtimes
Propagate shared credential-routing section (Codex, Claude, Grok, llm-connect)
from state-hub template via scripts/propagate_credential_routing.py.
2026-06-18 22:48:39 +02:00
675772ab3b Add capability registry scaffold (REUSE-WP-0014-T06 B04)
2026-06-16 01:55:58 +02:00
6eb0b1c52f Fixing bridge to haskelseed
2026-06-14 19:46:06 +02:00
d949f3e93e Refresh agent instruction files
2026-05-18 16:55:47 +02:00
de984736ca feat(cli): add bridge conventions and link from actor errors
Surfaces the actor naming rules (adm-/agt-/atm- prefixes, legacy class
aliases) so users hitting a ConfigError have an in-CLI way to read the
spec without grepping the wiki.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:21:37 +02:00
28ecef121e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-15:
  - update .custodian-brief.md for ops-bridge
2026-05-15 12:19:50 +02:00
860c08f1db chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-15:
  - update .custodian-brief.md for ops-bridge
2026-05-15 09:39:01 +02:00
bd169a07e2 feat(directive): implement BRIDGE-WP-0004 AccessManagementDirective alignment
- ActorType enum (adm/agt/atm) replaces actor_class string; config validates
  naming convention (adm-*/agt-*/atm-*) with hard ConfigError on mismatch;
  legacy 'human'/'automation' values accepted with DeprecationWarning
- cert_command: pluggable shell string run before each SSH launch; cert written
  to state dir; -i cert appended to SSH command alongside -i key
- TTL-aware cert refresh: parses Valid-to via ssh-keygen -L; pre-emptive restart
  5 min before expiry (no backoff, no attempt increment); CERT_EXPIRING logged
- CertAcquisitionError: cert failures trigger normal backoff/retry loop
- cert_identity: Key ID parsed from cert and recorded in BRIDGE_CONNECTED event
- bridge cert-status: new CLI command; exit 1 on expired cert; --json flag
- 233 tests passing, ruff clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:38:29 +02:00
22601ef3e6 chore(workplans): sync BRIDGE-WP-0004 and WARDEN-WP-0001 tasks to state hub
Both workplans had been registered as active workstreams but tasks were
never ingested — the markdown checkbox format was invisible to the
consistency checker, which requires task code blocks. Activated both
workplans (draft→active) and added task blocks with state_hub_task_id
for all 19 tasks (9 + 10).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:29:51 +02:00
569de1497c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-06:
  - update .custodian-brief.md for ops-bridge
2026-05-06 04:24:17 +02:00
fafd04ed2e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-06:
  - update .custodian-brief.md for ops-bridge
2026-05-06 02:41:26 +02:00
c1d87b47df Added INTENT.md file
2026-05-02 23:17:22 +02:00
204bf48bc8 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-01:
  - update .custodian-brief.md for ops-bridge
2026-05-01 23:22:08 +02:00
595c495f7c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-01:
  - update .custodian-brief.md for ops-bridge
2026-05-01 23:07:50 +02:00
90eda27a14 Scope update from repo-scoping refactor
2026-05-01 12:28:27 +02:00
1361727e15 Added untracked workplans
2026-04-25 17:06:05 +02:00
18e3c118dd chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for ops-bridge
2026-04-21 02:14:25 +02:00
621de64ee0 chore: merge origin/main — reconcile divergent branches
Integrates remote changes (session protocol, .custodian-brief.md, MCP
SSE/HTTP mode, workplan OPS-WP-0002 completion) with local changes
(AccessManagementDirective alignment, architecture docs, BRIDGE-WP-0004
and WARDEN-WP-0001 workplans).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 01:05:11 +00:00
f3a7236c5d docs: align architecture and scope with AccessManagementDirective
Expands architecture constraints and SCOPE.md to reflect the three-actor
vocabulary (adm/agt/atm), two credential modes (static key + cert_command),
and ops-warden boundary. Adds directive wiki doc and two new workplans
(BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 00:59:38 +00:00
4f3c8646b3 feat(mcp): SSE/HTTP mode, workplan OPS-WP-0002 done
- Add --http flag to MCP server for SSE transport on port 8002
- Add make mcp-http / mcp-stop targets
- Pin fastmcp<3.1.0 to stabilize dependency
- Update session-protocol: Step 0 tunnel health check before orient
- Mark OPS-WP-0002 and all its tasks done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 14:10:49 +01:00
431beef31b chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-03-26:
  - update .custodian-brief.md for ops-bridge
2026-03-26 22:46:07 +01:00
1c7c6eedf8 chore(session): read .custodian-brief.md before MCP call in session init
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:48:52 +01:00
52 changed files with 3087 additions and 288 deletions

20
.claude/rules/agents.md Normal file

@@ -0,0 +1,20 @@
## Kaizen Agents
Specialized agent personas available on demand via the state-hub MCP.
**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
Common agents:
| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |
All 17 agents: call `list_kaizen_agents()` for the full list.


@@ -1,31 +1,8 @@
## Architecture ## Architecture
OpsBridge has two logical components: <!-- TODO: Describe the key design decisions and component structure.
Key modules, data flows, external integrations, state machines, etc. -->
**1. OpsBridge — tunnel lifecycle manager** (this repo)
Manages named SSH reverse tunnels defined in `~/.config/bridge/tunnels.yaml`.
Each tunnel runs in a subprocess with a reconnect backoff loop; PIDs are tracked
in `~/.local/state/bridge/`. Bridge states: `stopped → starting → connected →
degraded → failed`. The `degraded` state means SSH is up but the optional HTTP
health check is failing.
**2. OpsCatalog — operations knowledge repository** (planned extension)
A Git-backed YAML catalog of operations domains, targets, bridges, and actor
classes. OpsBridge consumes this catalog to resolve bridge identifiers and
orient operators. Schema examples are in `wiki/OpsCatalogSpecification.md`.
The catalog layout follows: `opscatalog/domains/<domain>/{domain.yaml,
targets/, bridges/, docs/}`.
Key design constraints:
- OpsBridge owns lifecycle management only; it does not own identity/credentials
- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
in config, CLI args, and log filenames must stay consistent
- Actor attribution (human operator vs. automation agent) is tracked per bridge
for audit log traceability (FRS §5.7)
Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
## Quick Reference ## Quick Reference
`~/the-custodian/state-hub/mcp_server/TOOLS.md` `~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
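
The bridge states listed in the architecture notes of this hunk can be pictured as a small enum. This is only a sketch: `BridgeState` and `classify` are hypothetical names, not taken from the repo, and treating "SSH down" as `failed` is an assumption. Only the five state names and the meaning of `degraded` come from the text.

```python
from enum import Enum
from typing import Optional


class BridgeState(Enum):
    """The five bridge states named in the architecture notes."""
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    FAILED = "failed"


def classify(ssh_up: bool, health_ok: Optional[bool]) -> BridgeState:
    """Hypothetical helper: map probe results to a state.

    health_ok is None when no HTTP health check is configured.
    Mapping "SSH down" to FAILED is an assumption of this sketch.
    """
    if not ssh_up:
        return BridgeState.FAILED
    if health_ok is False:
        # SSH is up but the optional HTTP health check is failing.
        return BridgeState.DEGRADED
    return BridgeState.CONNECTED
```

A tunnel without a configured health check passes `health_ok=None` and is reported as connected whenever SSH is up.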


@@ -0,0 +1,50 @@
# Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
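
For agents that script the lookup rather than typing it, a minimal Python wrapper might look like the following. It assumes `warden` is on `PATH` and that `--json` prints a single JSON document. The function names are hypothetical, and the shape of the returned document is not specified in this file, so the sketch returns it as decoded.

```python
import json
import subprocess
from typing import Any, List


def route_find_argv(need: str) -> List[str]:
    """Build the argv for `warden route find "<need>" --json`."""
    if not need.strip():
        raise ValueError("describe the credential need in a few words")
    return ["warden", "route", "find", need, "--json"]


def route_find(need: str) -> Any:
    """Run the lookup and decode stdout as JSON (assumed output format)."""
    result = subprocess.run(
        route_find_argv(need),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit
    )
    return json.loads(result.stdout)
```

Passing the need as one argv element avoids shell quoting problems with free-text descriptions.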


@@ -0,0 +1,38 @@
## First Session Protocol
Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.
**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs
**Step 2 — Survey in-progress work**
Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.
**Step 3 — Propose workstreams to Bernd**
Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
roadmap phase. **Wait for approval before creating.**
**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/BRIDGE-WP-NNNN-<slug>.md ← write this first
```
Then register in the hub:
```
create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
```
**Step 5 — Record the setup**
```
add_progress_event(
summary="First session: structured infotech into N workstreams, M tasks",
event_type="milestone",
topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
detail={"workstreams": [...], "tasks_created": M}
)
```
<!-- Delete or archive this file once past first session -->


@@ -1,6 +1,8 @@
## Repo boundary ## Repo boundary
This repo owns **tunnel lifecycle management only**. It does not own: This repo owns **ops-bridge** only. It does not own:
- State hub code → `the-custodian/state-hub/`
- SSH key management → `railiance-infra/` (S1) or user dotfiles <!-- TODO: List what belongs in adjacent repos, e.g.:
- Ansible/provisioning → `railiance-infra/` - SSH key management → railiance-infra/
- State hub code → state-hub/
-->


@@ -1,7 +1,5 @@
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution **Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.
environments (COULOMBCORE, Railiance nodes) connected to the local Custodian
State Hub so Claude Code sessions on those machines have full MCP connectivity.
**Domain:** custodian **Domain:** infotech
**Repo slug:** ops-bridge **Repo slug:** ops-bridge
**Repo ID:** 1bf99f56-6e94-4379-a9ea-295a4c181889 **Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a


@@ -1,24 +1,85 @@
## Custodian State Hub Integration ## Session Protocol
State Hub: http://127.0.0.1:8000 Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`
### Session Protocol
**Step 1 — Orient** **Step 1 — Orient**
Read the offline-safe brief first — it works without a live hub connection:
```bash
cat .custodian-brief.md
``` ```
get_domain_summary("custodian") Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
```
If the hub is offline: `cd ~/state-hub && make api`
**Step 2 — Check inbox**
With MCP tools:
```
get_messages(to_agent="ops-bridge", unread_only=True)
```
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
requests before proceeding.
Without MCP tools:
```bash
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
| python3 -m json.tool
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
-H "Content-Type: application/json" -d '{}'
``` ```
**Step 2 — Scan workplans** **Step 3 — Scan workplans**
``` ```bash
ls workplans/ ls workplans/
``` ```
For each file with `status: ready`, `active`, or `blocked`, note pending
`wait`/`todo`/`progress` tasks.
**During work:** use `record_decision()`, `add_progress_event()`, `resolve_decision()`. **Step 4 — Present brief**
**Session close:** `add_progress_event()` with workstream_id. 1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:ops-bridge]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
- `alignment_warnings`: flag if active work is not aligned with current goal
4. **Suggested next action** — highest-priority open item
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
If workplan files were modified, run from `~/the-custodian/state-hub/`: If no workstreams: follow First Session Protocol (`first-session.md`).
```bash
make fix-consistency REPO=ops-bridge **During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
**Session close:**
With MCP tools:
``` ```
add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
```
Without MCP tools:
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
-H "Content-Type: application/json" \
-d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
```
If workplan files were modified, ensure the local copy is up to date first:
```bash
git -C <repo_path> pull --ff-only
cd ~/state-hub && make fix-consistency REPO=ops-bridge
```
For repos where implementation runs on a remote machine (e.g. CoulombCore),
use the combined target which pulls before fixing:
```bash
cd ~/state-hub && make fix-consistency-remote REPO=ops-bridge
```
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
until you pull — intentional to prevent clobbering remote progress.
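
The two inbox `curl` calls in this protocol can also be issued from Python's standard library. The sketch below only builds the requests. The base URL and paths are the ones shown in the protocol, while the helper names are hypothetical.

```python
import urllib.parse
import urllib.request

HUB = "http://127.0.0.1:8000"  # State Hub base URL from the protocol text


def inbox_request(agent: str = "ops-bridge") -> urllib.request.Request:
    """GET /messages/ filtered to unread messages for one agent."""
    query = urllib.parse.urlencode({"to_agent": agent, "unread_only": "true"})
    return urllib.request.Request(f"{HUB}/messages/?{query}", method="GET")


def mark_read_request(message_id: str) -> urllib.request.Request:
    """PATCH /messages/<id>/read with an empty JSON body."""
    return urllib.request.Request(
        f"{HUB}/messages/{message_id}/read",
        data=b"{}",
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )
```

Either request can be sent with `urllib.request.urlopen(req)` once the hub is reachable.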


@@ -1,46 +1,19 @@
## What this repo builds
A CLI tool (`bridge`) that manages named SSH reverse tunnels:
```
bridge up [TUNNEL] # start tunnel(s)
bridge down [TUNNEL] # stop tunnel(s)
bridge restart [TUNNEL] # restart tunnel(s)
bridge status # show all tunnels: state, uptime, last health check
bridge logs [TUNNEL] # tail reconnect log
```
Config file: `~/.config/bridge/tunnels.yaml`
Each tunnel:
- Named (e.g. `state-hub-coulombcore`)
- Reverse SSH port-forward: `ssh -R remote_port:127.0.0.1:local_port host`
- Auto-reconnects on drop (backoff loop)
- Optional HTTP health check to confirm the forwarded service is reachable
PRD: `workplans/BRIDGE-WP-0001-initial-implementation.md`
## Stack ## Stack
- **Language:** Python 3.11+ <!-- TODO: Fill in language, frameworks, and key dependencies -->
- **CLI framework:** Typer - **Language:**
- **Dependencies:** typer, pyyaml, httpx - **Key deps:**
- **Packaging:** `uv tool install` (single command install, no venv activation)
- **No system daemons** — process management is internal, PID tracked in
`~/.local/state/bridge/`
## Dev Commands ## Dev Commands
```bash ```bash
# Install locally for development # TODO: Fill in the standard commands for this repo
uv tool install -e .
# Install dependencies
# Run tests # Run tests
uv run pytest
# Run a single test # Lint / type check
uv run pytest tests/test_tunnel.py::test_name -v
# Lint # Build / package (if applicable)
uv run ruff check .
``` ```
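
To make the tunnel description in this hunk concrete, here is a hedged Python sketch of the `ssh -R remote_port:127.0.0.1:local_port host` command plus a reconnect delay. The helper names are hypothetical, and the exponential schedule (base 1 s, cap 60 s) is an assumption: the text only says "backoff loop" and does not state the actual timings.

```python
from typing import List


def reverse_tunnel_argv(host: str, remote_port: int, local_port: int,
                        local_host: str = "127.0.0.1") -> List[str]:
    """Build `ssh -R remote_port:local_host:local_port host` as an argv list."""
    for port in (remote_port, local_port):
        if not 1 <= port <= 65535:
            raise ValueError(f"port out of range: {port}")
    return ["ssh", "-R", f"{remote_port}:{local_host}:{local_port}", host]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before reconnect number `attempt` (0-based).

    Assumed schedule: doubles each attempt, capped at `cap`.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(cap, base * 2 ** attempt)
```

For example, `reverse_tunnel_argv("coulombcore", 18000, 8000)` describes a forward from remote port 18000 to a local service on port 8000. The host and port values here are illustrative only.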


@@ -1,6 +1,40 @@
### Workplan Convention (ADR-001) ## Workplan Convention (ADR-001)
File location: `workplans/BRIDGE-WP-NNNN-<slug>.md` File location: `workplans/BRIDGE-WP-NNNN-<slug>.md`
Prefix: `BRIDGE-WP` ID prefix: `BRIDGE-WP-`
<!-- Ralph Loop rules are defined globally in ~/.claude/CLAUDE.md — do not duplicate here --> Work items originate as files in this repo **before** being registered in the hub.
Canonical workplan/workstream frontmatter statuses are:
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
Use `proposed` for a newly drafted plan, `ready` after review against current
repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.
Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
directly. Promote anything requiring analysis, design, approval, dependencies, or
multiple planned phases into a normal workplan.
Ecosystem todos from other agents arrive as `[repo:ops-bridge]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.
Task blocks use this shape:
```task
id: BRIDGE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
```
Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.
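
A reader for the task block shape above could be sketched as follows. This is not the consistency checker's actual parser: `parse_task_block` is a hypothetical helper that assumes one `key: value` pair per line and treats ` #` as the start of a trailing comment.

```python
from typing import Dict

TASK_STATUSES = {"wait", "todo", "progress", "done", "cancel"}


def parse_task_block(body: str) -> Dict[str, str]:
    """Parse the `key: value` lines found inside a task fence."""
    fields: Dict[str, str] = {}
    for raw in body.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {raw!r}")
        fields[key.strip()] = value.strip().strip('"')
    if fields.get("status") not in TASK_STATUSES:
        raise ValueError(f"unknown task status: {fields.get('status')!r}")
    return fields
```

Rejecting unknown statuses early keeps legacy literals from slipping through unnoticed.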
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->

7
.codex/config.toml Normal file

@@ -0,0 +1,7 @@
[mcp_servers.ops-bridge]
command = "uv"
args = [
"run",
"python",
"src/bridge/mcp_server/server.py",
]

18
.custodian-brief.md Normal file

@@ -0,0 +1,18 @@
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
# Custodian Brief — ops-bridge
**Domain:** infotech
**Last synced:** 2026-07-03 16:52 UTC
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
## Active Workstreams
*(none — repo may need first-session setup)*
---
## MCP Orientation (when available)
If the state-hub MCP server is reachable, call:
`get_domain_summary("infotech")`
This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source.

26
.repo-classification.yaml Normal file

@@ -0,0 +1,26 @@
# Repo classification (Repo Classification Standard v1.0).
repo_classification:
standard: Repo Classification Standard
version: '1.0'
classified_at: '2026-06-22'
classified_by: human
category: tooling
domain: infotech
secondary_domains: []
capability_tags:
- operations
- access-control
- platform
- observability
- orchestration
business_stake:
- operations
- technology
- automation
business_mechanics:
- control
- operation
- adaptation
notes: SSH reverse-tunnel lifecycle manager keeping remote environments connected to the
State Hub. Operational tooling -> product.

219
AGENTS.md Normal file

@@ -0,0 +1,219 @@
# ops-bridge — Agent Instructions
## Repo Identity
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.
**Domain:** infotech
**Repo slug:** ops-bridge
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `BRIDGE-WP-`
---
## State Hub Integration
The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
there is no MCP server for Codex agents.
| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |
### Orient at session start
```bash
# Offline brief — works without hub connection
cat .custodian-brief.md
# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
| python3 -m json.tool
# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
| python3 -m json.tool
```
Mark a message read:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
-H "Content-Type: application/json" -d '{}'
```
### Log progress (required at session close)
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
-H "Content-Type: application/json" \
-d '{
"summary": "what was done",
"event_type": "note",
"author": "codex",
"workstream_id": "<uuid>",
"task_id": "<uuid>"
}'
```
Omit `workstream_id` / `task_id` when not applicable.
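
The "omit when not applicable" rule is easy to get wrong when the payload is assembled by hand. A small Python sketch, assuming the field names from the `curl` example above (the function name itself is hypothetical):

```python
import json
from typing import Dict, Optional


def progress_payload(summary: str,
                     author: str = "codex",
                     event_type: str = "note",
                     workstream_id: Optional[str] = None,
                     task_id: Optional[str] = None) -> Dict[str, str]:
    """Build the JSON body for POST /progress/, leaving out unset ids."""
    payload = {"summary": summary, "event_type": event_type, "author": author}
    if workstream_id:
        payload["workstream_id"] = workstream_id
    if task_id:
        payload["task_id"] = task_id
    return payload


# Serialised form, ready to pass as the request body.
body = json.dumps(progress_payload("what was done"))
```

Unset ids are dropped from the body rather than sent as `null`.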
### Update task status
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
-H "Content-Type: application/json" \
-d '{"status": "progress"}'
# values: wait | todo | progress | done | cancel
```
### Flag a task for human review
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
-H "Content-Type: application/json" \
-d '{"needs_human": true, "intervention_note": "reason"}'
```
---
## Session Protocol
**Start:**
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=ops-bridge&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check human-needed tasks: `GET /tasks/?needs_human=true`
**During work:**
- Update task statuses in workplan files as tasks progress
- Record significant decisions via `POST /decisions/`
**Close:**
1. Update workplan file task statuses to reflect progress
2. Log: `POST /progress/` with a summary of what changed
3. Note for the custodian operator: after workplan file changes, run from
`~/state-hub`:
```bash
make fix-consistency REPO=ops-bridge
```
This syncs task status from files into the hub DB.
---
## Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->
---
## Workplan Convention (ADR-001)
Work items originate as files in this repo — not in the hub. The hub is a
read/cache/index layer that rebuilds from files.
**File location:** `workplans/OPS-WP-NNNN-<slug>.md`
**Archived location:** finished workplans may move to
`workplans/archived/YYMMDD-OPS-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
the completion/archive date; the frontmatter `id` does not change.
**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
this only for low-risk work completed directly; create a normal workplan for
anything needing analysis, design, approval, dependencies, or multiple phases.
**Frontmatter:**
```yaml
---
id: OPS-WP-NNNN
type: workplan
title: "..."
domain: infotech
repo: ops-bridge
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>" # written by fix-consistency — do not edit
---
```
Use `proposed` for a new draft, `ready` after review against current repo
state, and `finished` after implementation. `stalled` and `needs_review` are
derived health labels, not frontmatter statuses.
**Task block format** (one per `##` section):
```
## Task Title
` ` `task
id: OPS-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
` ` `
Task description text.
```
Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
To create a new workplan:
1. Write the file following the format above
2. Notify the custodian operator to run `make fix-consistency REPO=ops-bridge`
(or send a message to the hub agent via `POST /messages/`)
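
The task block format above can be checked mechanically before asking for a consistency run. The following is a minimal sketch, not the logic used by `fix-consistency`: it assumes task blocks are fenced with three backticks followed by `task` and contain only simple `key: value` lines. `parse_task_blocks` and `validate_task` are hypothetical helper names.

```python
import re

# Allowed values copied from the task block format in this convention.
TASK_STATUSES = {"wait", "todo", "progress", "done", "cancel"}
TASK_PRIORITIES = {"high", "medium", "low"}

# Built at runtime so this example nests cleanly inside Markdown.
FENCE = "`" * 3

# Matches one fenced task block and captures its body (assumed layout).
TASK_BLOCK_RE = re.compile(
    re.escape(FENCE) + r"task\n(.*?)\n" + re.escape(FENCE), re.DOTALL
)


def parse_task_blocks(markdown: str) -> list[dict[str, str]]:
    """Return one dict of key/value pairs per task block found in the text."""
    tasks = []
    for match in TASK_BLOCK_RE.finditer(markdown):
        fields = {}
        for line in match.group(1).splitlines():
            line = line.split("#", 1)[0].strip()  # drop trailing comments
            if not line or ":" not in line:
                continue
            key, value = line.split(":", 1)
            fields[key.strip()] = value.strip().strip('"')
        tasks.append(fields)
    return tasks


def validate_task(fields: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []
    if "id" not in fields:
        problems.append("missing id")
    if fields.get("status") not in TASK_STATUSES:
        problems.append(f"invalid status: {fields.get('status')!r}")
    if "priority" in fields and fields["priority"] not in TASK_PRIORITIES:
        problems.append(f"invalid priority: {fields['priority']!r}")
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        "## Example task",
        FENCE + "task",
        "id: OPS-WP-0001-T01",
        "status: todo",
        "priority: high",
        FENCE,
        "Task description text.",
    ])
    for task in parse_task_blocks(sample):
        print(task["id"], validate_task(task) or "ok")
```

The sketch deliberately rejects `stalled` and `needs_review` as statuses, since the convention treats them as derived health labels.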


@@ -1,8 +1,12 @@
# ops-bridge — Claude Code Instructions # ops-bridge — Claude Code Instructions
@SCOPE.md
@.claude/rules/repo-identity.md @.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md @.claude/rules/session-protocol.md
@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md @.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md @.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md @.claude/rules/architecture.md
@.claude/rules/repo-boundary.md @.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md

92
INTENT.md Normal file

@@ -0,0 +1,92 @@
# INTENT
## Purpose
This repository exists to provide a **reliable, inspectable, and controllable connectivity layer**
between distributed dev, build, test, and execution environments for dev and ops personnel, both human and agentic.
Its role is to ensure that remote machines can **consistently and safely “phone home”** without requiring complex network infrastructure or manual intervention.
---
## Primary Utility
The repository provides a **managed SSH reverse tunneling system** that:
* Maintains continuous connectivity between remote systems and a central hub
* Makes connectivity **observable, auditable, and controllable**
* Exposes this capability as both a **CLI tool and an MCP-accessible service**
It transforms raw SSH port-forwarding into a **first-class operational primitive**.
---
## Intended Users
* Human operators (`adm`) managing infrastructure and connectivity
* LLM-based agents (`agt`) requiring stable access to local services
* Deterministic automations (`atm`) coordinating distributed workloads
---
## Strategic Role in the System
This repository acts as the **connectivity backbone** of the custodian ecosystem:
* It enables remote agents and services to participate in a **locally anchored control plane**
* It decouples **execution location** from **control location**
* It supports a **hub-and-spoke topology** where the Custodian State Hub remains central
---
## Strategic Boundaries
This repository is **not** intended to:
* Replace SSH as a general-purpose access mechanism
* Act as a credential authority or security policy engine
* Provide full network virtualization (e.g., VPN, mesh networking)
* Host or orchestrate application workloads
Its responsibility ends at **secure, observable, and managed connectivity via tunnels**.
---
## Design Principles
* **Continuity over convenience**
Connectivity must persist across failures without manual recovery
* **Observability as a first-class concern**
All lifecycle events must be traceable and attributable
* **Actor-aware operations**
Every action is tied to a clearly defined actor type (`adm`, `agt`, `atm`)
* **Pluggable security integration**
Works with both static keys and external certificate authorities without owning them
* **Toolability**
All capabilities should be accessible programmatically (MCP) and operationally (CLI)
---
## Maturity Target
A mature version of this repository should:
* Provide **fully autonomous tunnel lifecycle management** across heterogeneous environments
* Integrate seamlessly with **centralized access control and certificate systems**
* Serve as a **standardized connectivity primitive** across all Custodian-managed systems
* Offer **complete operational transparency** for all connectivity-related actions
* Be robust enough to act as the **default connectivity layer** for distributed agent systems
---
## Stability Note
Changes to this file represent a **deliberate shift in repository purpose or role** within the system architecture.
Such changes should be rare and made with explicit intent.


@@ -1,10 +1,31 @@
.PHONY: test lint install .DEFAULT_GOAL := help
test: .PHONY: help setup test lint install mcp-http mcp-stop cron-install-cron cron-uninstall-cron
help: ## List available make targets
@awk 'BEGIN {FS = ":.*## "}; /^[a-zA-Z0-9_.-]+:.*## / {printf " %-16s %s\n", $$1, $$2}' $(MAKEFILE_LIST)
setup: ## Sync dependencies and install the bridge CLI wrapper
uv sync --all-groups
uv tool install -e . --force
test: ## Run the test suite
uv run pytest uv run pytest
lint: lint: ## Run ruff lint checks
uv run ruff check . uv run ruff check .
install: install: ## Install the bridge CLI wrapper
uv tool install -e . uv tool install -e . --force
mcp-http: ## Start MCP server in SSE mode (default port 8002)
BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http
mcp-stop: ## Stop MCP server running on port 8002
@lsof -ti:$${BRIDGE_MCP_PORT:-8002} | xargs -r kill -TERM && echo "MCP server stopped" || echo "No MCP server running on port $${BRIDGE_MCP_PORT:-8002}"
cron-install-cron: ## Install 03:00 nightly stale-forward cleanup cron
bridge maintenance install-cron
cron-uninstall-cron: ## Remove nightly stale-forward cleanup cron
bridge maintenance uninstall-cron


@@ -243,6 +243,31 @@ has not yet cleaned up the socket), so the next reconnect attempt hits
"remote port forwarding failed" and exits with code 255. With ClientAlive "remote port forwarding failed" and exits with code 255. With ClientAlive
enabled, sshd evicts stale sessions within ~90 seconds and frees the port. enabled, sshd evicts stale sessions within ~90 seconds and frees the port.
NIGHTLY STALE-FORWARD CLEANUP
------------------------------
When a bridge client dies without tearing down its SSH session, the remote
host can keep port 18000 (etc.) bound to a zombie sshd listener. The port
accepts connections but never forwards them, which breaks in-cluster proxies
such as actcore-state-hub-bridge on railiance01.
Install a 03:00 local-time cron job that probes each reverse tunnel's remote
forward, kills stale listeners when the local service is healthy but the
remote forward is not, and restarts the tunnel:
bridge maintenance install-cron
Manual run:
bridge maintenance cleanup --restart
Inspect or remove the cron entry:
bridge maintenance show-cron
bridge maintenance uninstall-cron
Logs append to ~/.local/state/bridge/cleanup.log
Apply and reload (no disconnect): Apply and reload (no disconnect):
sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config


@@ -8,7 +8,7 @@
## One-liner ## One-liner
SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.
--- ---
@@ -20,11 +20,17 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## In Scope ## In Scope
- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs`) - Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
- Auto-reconnect with exponential backoff and configurable retry policy - Auto-reconnect with exponential backoff and configurable retry policy
- Optional HTTP health checks (confirm forwarded service is actually reachable from remote) - Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
- Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.) - Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
- Actor attribution: per-tunnel actor class (human / automation) for audit traceability - Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
works without any CA or external tooling
- **cert_command mode** (optional): pluggable shell command that issues a short-lived
CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
`cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
- PID + state file management in `~/.local/state/bridge/` - PID + state file management in `~/.local/state/bridge/`
- MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools - MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
- OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges) - OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
@@ -33,7 +39,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Out of Scope ## Out of Scope
- Identity/credential management (uses existing SSH keys) - Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
certs via the `cert_command` interface but never signs anything itself)
- SSH key generation for human admins (self-service: `ssh-keygen`)
- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
- Long-running application hosting on remote machines (port-forward only, not deployment) - Long-running application hosting on remote machines (port-forward only, not deployment)
- VPN or layer-3 connectivity - VPN or layer-3 connectivity
- Monitoring/alerting beyond JSON audit logs - Monitoring/alerting beyond JSON audit logs
@@ -44,9 +53,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Relevant When ## Relevant When
- Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP - Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
- Need audit trail of which actor (human vs. automation) started/stopped tunnels - Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels
- Setting up a new machine in the Railiance ecosystem that must phone home to the hub - Setting up a new machine in the Railiance ecosystem that must phone home to the hub
- Diagnosing connectivity issues between local hub and remote services - Diagnosing connectivity issues between local hub and remote services
- Checking certificate validity for active tunnels (`bridge cert-status`)
- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials
--- ---
@@ -60,8 +71,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Current State ## Current State
- Status: experimental → active (v0.1 core complete; OpsCatalog planned but not yet shipped) - Status: active (v0.1 core complete; AccessManagementDirective alignment done — BRIDGE-WP-0004)
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health checks and audit logging complete; OpsCatalog framework present but not populated - Implementation: ~80% — CLI tunneling fully functional, MCP integration working, health
checks and audit logging complete; ActorType enum (adm/agt/atm) enforced; cert_command
mode implemented with TTL-aware refresh and cert_identity audit logging; OpsCatalog
framework present but not yet populated
- Stability: stable tunnel lifecycle; tested under network drops and SSH failures - Stability: stable tunnel lifecycle; tested under network drops and SSH failures
- Usage: running in lab for daily Railiance/Temporal connectivity - Usage: running in lab for daily Railiance/Temporal connectivity
@@ -77,17 +91,24 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Terminology ## Terminology
- Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check - Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
cert_command, cert_identity
- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
- Also known as: "the bridge" - Also known as: "the bridge"
- Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge - Potentially confusing: "bridge state" is a tunnel-specific state machine
(stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)
--- ---
## Related / Overlapping Repositories ## Related / Overlapping
- `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it - `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
`cert_command` when short-lived certificates are required
- `activity-core` — Temporal server on remote reached via ops-bridge tunnel - `activity-core` — Temporal server on remote reached via ops-bridge tunnel
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home - `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns
host-side principal deployment (`/etc/ssh/auth_principals/`)
--- ---
@@ -105,5 +126,9 @@ keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge
## Getting Oriented ## Getting Oriented
- Start with: `README.txt` (architecture, config format, CLI commands, MCP integration) - Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), `~/.local/state/bridge/` (PID/state files) - Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; MCP: `bridge_status()` `~/.local/state/bridge/` (PID/state/cert files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
MCP: `bridge_status()`
- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)
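
The actor naming convention described in this file (`adm-*`, `agt-*`, `atm-*`, with the deprecated `human` and `automation` aliases) can be illustrated with a short check. This is a minimal sketch under assumptions, not the repository's enforcement code: `normalize_actor_type` and `check_actor_name` are hypothetical names, and requiring the new prefix even for a legacy alias is an assumption.

```python
import warnings

# Actor types and legacy aliases as listed under Terminology.
ACTOR_TYPES = ("adm", "agt", "atm")
LEGACY_ALIASES = {"human": "adm", "automation": "atm"}


def normalize_actor_type(raw: str) -> str:
    """Map a declared actor type to adm/agt/atm, warning on legacy aliases."""
    if raw in LEGACY_ALIASES:
        mapped = LEGACY_ALIASES[raw]
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {mapped!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return mapped
    if raw not in ACTOR_TYPES:
        raise ValueError(f"unknown actor type: {raw!r}")
    return raw


def check_actor_name(name: str, raw_type: str) -> str:
    """Return the normalized type if the name carries the matching prefix."""
    actor_type = normalize_actor_type(raw_type)
    prefix = f"{actor_type}-"
    # Assumption: a bare prefix such as "adm-" is not a valid actor name.
    if not name.startswith(prefix) or len(name) == len(prefix):
        raise ValueError(f"actor {name!r} must start with {prefix!r}")
    return actor_type


if __name__ == "__main__":
    print(check_actor_name("adm-bernd", "adm"))
    print(check_actor_name("atm-backup-daily", "atm"))
```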


@@ -11,7 +11,7 @@ dependencies = [
"typer>=0.12", "typer>=0.12",
"pyyaml>=6.0", "pyyaml>=6.0",
"httpx>=0.27", "httpx>=0.27",
"fastmcp>=2.0.0", "fastmcp>=2.0.0,<3.1.0",
] ]
[project.scripts] [project.scripts]

12
registry/README.md Normal file

@@ -0,0 +1,12 @@
# Capability Registry
Markdown-first capability index for federation and reuse planning.
## Authoring
1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
2. Add the row to `indexes/capabilities.yaml`.
3. Run `reuse-surface validate` from a checkout with the CLI installed.
4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.
Federation contract: reuse-surface `docs/RegistryFederation.md`.


@@ -0,0 +1,4 @@
version: 1
updated: '2026-06-16'
domain: helix_forge
capabilities: []


@@ -16,6 +16,7 @@ class AuditEvent(str, Enum):
HEALTH_CHECK_FAILED = "health_check_failed" HEALTH_CHECK_FAILED = "health_check_failed"
HEALTH_CHECK_RECOVERED = "health_check_recovered" HEALTH_CHECK_RECOVERED = "health_check_recovered"
BRIDGE_STOPPED = "bridge_stopped" BRIDGE_STOPPED = "bridge_stopped"
CERT_EXPIRING = "cert_expiring"
def _default_state_dir() -> Path: def _default_state_dir() -> Path:
@@ -34,19 +35,22 @@ class AuditLogger:
tunnel: str, tunnel: str,
event: AuditEvent, event: AuditEvent,
actor: str, actor: str,
actor_class: str, actor_type: str,
detail: str = "", detail: str = "",
cert_identity: Optional[str] = None,
) -> None: ) -> None:
self._dir.mkdir(parents=True, exist_ok=True) self._dir.mkdir(parents=True, exist_ok=True)
entry: Dict[str, Any] = { entry: Dict[str, Any] = {
"timestamp": datetime.now(timezone.utc).isoformat(), "timestamp": datetime.now(timezone.utc).isoformat(),
"tunnel": tunnel, "tunnel": tunnel,
"actor": actor, "actor": actor,
"actor_class": actor_class, "actor_type": actor_type,
"event": event.value, "event": event.value,
} }
if detail: if detail:
entry["detail"] = detail entry["detail"] = detail
if cert_identity:
entry["cert_identity"] = cert_identity
with self._log_path(tunnel).open("a") as f: with self._log_path(tunnel).open("a") as f:
f.write(json.dumps(entry) + "\n") f.write(json.dumps(entry) + "\n")
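
Because this hunk renames the `actor_class` key to `actor_type`, log files written before the change may still contain the old key. The sketch below shows one way a reader could tolerate both. It is an illustration only: `iter_audit_entries`, `entry_actor_type`, and `entries_for_actor_type` are hypothetical helpers, and the log path is passed in rather than assumed.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterator, List, Optional


def iter_audit_entries(log_path: Path) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSON line in an audit log file."""
    with log_path.open() as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def entry_actor_type(entry: Dict[str, Any]) -> Optional[str]:
    """Prefer the new key and fall back to the pre-rename one."""
    return entry.get("actor_type", entry.get("actor_class"))


def entries_for_actor_type(log_path: Path, actor_type: str) -> List[Dict[str, Any]]:
    """Collect entries whose actor type (new or legacy key) matches."""
    return [
        entry
        for entry in iter_audit_entries(log_path)
        if entry_actor_type(entry) == actor_type
    ]
```

Entries without a certificate simply lack the `cert_identity` key, so readers should use `entry.get("cert_identity")` rather than indexing.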


@@ -73,6 +73,11 @@ CAPABILITIES: list[Capability] = [
description="End-to-end tunnel diagnostics via SSH: SSH PID alive + remote port listening", description="End-to-end tunnel diagnostics via SSH: SSH PID alive + remote port listening",
required_access_modes=frozenset({"cli", "mcp"}), required_access_modes=frozenset({"cli", "mcp"}),
), ),
Capability(
name="bridge_cert_status",
description="Show certificate status for tunnels using cert_command mode",
required_access_modes=frozenset({"cli"}),
),
] ]
CAPABILITIES_BY_NAME: dict[str, Capability] = {c.name: c for c in CAPABILITIES} CAPABILITIES_BY_NAME: dict[str, Capability] = {c.name: c for c in CAPABILITIES}

328
src/bridge/cleanup.py Normal file

@@ -0,0 +1,328 @@
"""Nightly maintenance: detect and clear stale SSH remote port forwards."""
from __future__ import annotations
import subprocess
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse, urlunparse
import httpx
from bridge.diagnostics import _remote_port_probe_command, check_tunnel
from bridge.manager import TunnelManager
from bridge.models import TunnelConfig
from bridge.state import StateManager
@dataclass
class CleanupAction:
tunnel: str
action: str # skipped | healthy | cleaned | cleaned_and_restarted | restarted | error
detail: str = ""
@dataclass
class CleanupReport:
actions: list[CleanupAction]
@property
def cleaned_count(self) -> int:
return sum(1 for a in self.actions if a.action.startswith("cleaned"))
def remote_forward_health_url(cfg: TunnelConfig) -> Optional[str]:
"""Map the local health_check URL to the remote forwarded port."""
if cfg.health_check is None or cfg.direction == "local":
return None
parsed = urlparse(cfg.health_check.url)
if not parsed.hostname:
return None
netloc = f"{parsed.hostname}:{cfg.remote_port}"
return urlunparse(parsed._replace(netloc=netloc))
def _ssh_base_cmd(cfg: TunnelConfig) -> list[str]:
from pathlib import Path
return [
"ssh",
"-i",
str(Path(cfg.ssh_key).expanduser()),
"-o",
"BatchMode=yes",
"-o",
"ConnectTimeout=10",
"-o",
"StrictHostKeyChecking=accept-new",
f"{cfg.ssh_user}@{cfg.host}",
]
def _run_ssh(cfg: TunnelConfig, remote_command: str, *, timeout: float = 30) -> subprocess.CompletedProcess[str]:
return subprocess.run(
[*_ssh_base_cmd(cfg), remote_command],
capture_output=True,
text=True,
timeout=timeout,
)
def remote_port_listening(cfg: TunnelConfig) -> bool:
proc = _run_ssh(cfg, _remote_port_probe_command(cfg.remote_port), timeout=15)
return proc.stdout.strip() == "ok"
def probe_remote_forward(cfg: TunnelConfig) -> tuple[bool, str]:
"""Return (healthy, detail) for the remote forwarded service."""
url = remote_forward_health_url(cfg)
if url is None:
return True, "no remote health url configured"
timeout = cfg.health_check.timeout_seconds if cfg.health_check else 5
remote_cmd = (
f"curl -sf --max-time {timeout} {url!r} >/dev/null "
"&& echo ok || echo fail"
)
try:
proc = _run_ssh(cfg, remote_cmd, timeout=timeout + 15)
except subprocess.TimeoutExpired:
return False, "remote health probe timed out"
output = proc.stdout.strip()
if output == "ok":
return True, "remote forward healthy"
if proc.returncode != 0 and proc.stderr.strip():
return False, proc.stderr.strip()
return False, "remote forward unhealthy"
def local_service_healthy(cfg: TunnelConfig) -> Optional[bool]:
if cfg.health_check is None:
return None
try:
resp = httpx.get(
cfg.health_check.url,
timeout=cfg.health_check.timeout_seconds,
)
return resp.is_success
except Exception:
return False
def _remote_cleanup_script(port: int) -> str:
return f"""set -eu
port={port}
pids=""
if command -v lsof >/dev/null 2>&1; then
pids=$(sudo -n lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
if [ -z "$pids" ]; then
pids=$(lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
fi
fi
if [ -z "$pids" ] && command -v fuser >/dev/null 2>&1; then
pids=$(fuser -n tcp $port 2>/dev/null | tr -s ' ' '\\n' | grep -E '^[0-9]+$' || true)
fi
if [ -z "$pids" ]; then
echo "no_listeners"
exit 0
fi
echo "killing:$pids"
for pid in $pids; do
kill "$pid" 2>/dev/null || sudo -n kill "$pid" 2>/dev/null || true
done
sleep 1
if ss -tln 2>/dev/null | grep -q ":$port "; then
echo "still_listening"
else
echo "cleared"
fi
"""
def clear_stale_remote_binding(cfg: TunnelConfig) -> tuple[bool, str]:
try:
proc = _run_ssh(cfg, _remote_cleanup_script(cfg.remote_port), timeout=30)
except subprocess.TimeoutExpired:
return False, "remote cleanup timed out"
output = proc.stdout.strip()
if "cleared" in output:
return True, output
if "no_listeners" in output:
return True, "no listeners found"
if "still_listening" in output:
return False, output
detail = output or proc.stderr.strip() or f"exit {proc.returncode}"
return False, detail
def should_cleanup_tunnel(
cfg: TunnelConfig,
state_mgr: StateManager,
) -> tuple[bool, str]:
"""Decide whether a reverse tunnel's remote binding looks stale."""
if cfg.direction == "local":
return False, "local tunnel"
if not remote_port_listening(cfg):
return False, "remote port closed"
remote_ok, remote_detail = probe_remote_forward(cfg)
if remote_ok:
return False, remote_detail
check = check_tunnel(cfg, state_mgr)
local_ok = local_service_healthy(cfg)
if local_ok is True and not remote_ok:
return True, f"stale forward: {remote_detail}"
if check.ssh_process != "ok" and check.remote_port == "listening":
return True, f"orphan forward while ssh {check.ssh_process}: {remote_detail}"
if check.ssh_process == "ok" and not remote_ok:
return True, f"broken forward with live client: {remote_detail}"
return False, remote_detail
def cleanup_tunnel(
cfg: TunnelConfig,
state_mgr: StateManager,
*,
restart: bool,
) -> CleanupAction:
name = cfg.name
try:
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
if not needed:
return CleanupAction(name, "healthy", reason)
ok, detail = clear_stale_remote_binding(cfg)
if not ok:
return CleanupAction(name, "error", f"cleanup failed: {detail}")
if not restart:
return CleanupAction(name, "cleaned", f"{reason}; {detail}")
mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
was_running = mgr.is_running()
if was_running:
mgr.stop()
mgr.start()
action = "cleaned_and_restarted"
verb = "restarted" if was_running else "started"
return CleanupAction(name, action, f"{reason}; {verb} tunnel; {detail}")
except Exception as exc:
return CleanupAction(name, "error", str(exc))
def restart_tunnel(
cfg: TunnelConfig,
state_mgr: StateManager,
) -> CleanupAction:
"""Restart one tunnel with blank-slate recovery for reverse tunnels."""
if cfg.direction == "local":
mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
mgr.stop()
mgr.start()
return CleanupAction(cfg.name, "restarted", "local tunnel stop/start")
return cleanup_tunnel(cfg, state_mgr, restart=True)
def restart_all_tunnels(
cfg,
state_mgr: StateManager,
) -> list[CleanupAction]:
"""Restart every inline tunnel (reverse via cleanup path, local via stop/start)."""
return [restart_tunnel(tcfg, state_mgr) for tcfg in cfg.tunnels.values()]
def cleanup_all_tunnels(
cfg,
state_mgr: StateManager,
*,
restart: bool,
tunnel_name: Optional[str] = None,
) -> CleanupReport:
tunnels = cfg.tunnels.values()
if tunnel_name is not None:
if tunnel_name not in cfg.tunnels:
raise KeyError(tunnel_name)
tunnels = [cfg.tunnels[tunnel_name]]
actions = [
cleanup_tunnel(tcfg, state_mgr, restart=restart)
for tcfg in tunnels
if tcfg.direction != "local"
]
return CleanupReport(actions=actions)
CRON_MARKER = "# ops-bridge: maintenance cleanup"
CRON_SCHEDULE = "0 3 * * *"
CRON_LOG = "~/.local/state/bridge/cleanup.log"
def build_cron_line() -> str:
bridge_bin = "~/.local/bin/bridge"
return (
f"{CRON_SCHEDULE} BRIDGE_CONFIG=~/.config/bridge/tunnels.yaml "
f"{bridge_bin} maintenance cleanup --restart "
f">> {CRON_LOG} 2>&1 {CRON_MARKER}"
)
def read_installed_cron() -> Optional[str]:
proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
if proc.returncode != 0:
return None
for line in proc.stdout.splitlines():
if CRON_MARKER in line:
return line.strip()
return None
def install_cleanup_cron() -> tuple[bool, str]:
existing = read_installed_cron()
if existing:
return False, f"cron already installed: {existing}"
proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
current = proc.stdout if proc.returncode == 0 else ""
new_line = build_cron_line()
body = current.rstrip("\n")
if body:
body += "\n"
body += new_line + "\n"
write = subprocess.run(
["crontab", "-"],
input=body,
capture_output=True,
text=True,
)
if write.returncode != 0:
return False, write.stderr.strip() or "crontab write failed"
return True, new_line
def uninstall_cleanup_cron() -> tuple[bool, str]:
proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
if proc.returncode != 0:
return False, "no crontab installed"
kept = [
line
for line in proc.stdout.splitlines()
if CRON_MARKER not in line
]
if len(kept) == len(proc.stdout.splitlines()):
return False, "cleanup cron not found"
body = "\n".join(kept).rstrip("\n")
if body:
body += "\n"
write = subprocess.run(
["crontab", "-"],
input=body,
capture_output=True,
text=True,
)
if write.returncode != 0:
return False, write.stderr.strip() or "crontab write failed"
return True, "removed cleanup cron entry"
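
A note on `probe_remote_forward` above: it interpolates the health URL into the remote shell command with `{url!r}`. Python's `repr` is not shell quoting, so a URL containing a single quote or a backslash could produce a broken command. A minimal sketch of the same command built with the standard library's `shlex.quote` follows; `build_remote_health_command` is a hypothetical name, not a function in this module.

```python
import shlex


def build_remote_health_command(url: str, timeout: float) -> str:
    """Build a shell command that prints 'ok' or 'fail' for a curl probe."""
    return (
        # shlex.quote makes the URL a single, safely quoted shell word.
        f"curl -sf --max-time {timeout} {shlex.quote(url)} >/dev/null "
        "&& echo ok || echo fail"
    )


if __name__ == "__main__":
    print(build_remote_health_command("http://127.0.0.1:18000/health?a=1&b=2", 5))
```

The `&` in the query string is the common case where unquoted interpolation would otherwise split the command on the remote side.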


@@ -4,12 +4,24 @@ from __future__ import annotations
import dataclasses import dataclasses
import json import json
import os import os
import subprocess
from datetime import datetime
from pathlib import Path from pathlib import Path
from typing import Optional from typing import Optional
import typer import typer
from bridge.audit import AuditLogger from bridge.audit import AuditLogger
from bridge.cleanup import (
CleanupAction,
build_cron_line,
cleanup_all_tunnels,
install_cleanup_cron,
read_installed_cron,
restart_all_tunnels,
restart_tunnel,
uninstall_cleanup_cron,
)
from bridge.config import ConfigError, load_config from bridge.config import ConfigError, load_config
from bridge.diagnostics import check_all_tunnels, check_tunnel from bridge.diagnostics import check_all_tunnels, check_tunnel
from bridge.manager import TunnelManager from bridge.manager import TunnelManager
@@ -23,9 +35,11 @@ app = typer.Typer(
targets_app = typer.Typer(help="Inspect infrastructure targets from the OpsCatalog.") targets_app = typer.Typer(help="Inspect infrastructure targets from the OpsCatalog.")
catalog_app = typer.Typer(help="Inspect and validate the OpsCatalog.") catalog_app = typer.Typer(help="Inspect and validate the OpsCatalog.")
maintenance_app = typer.Typer(help="Scheduled maintenance for tunnel hygiene.")
app.add_typer(targets_app, name="targets") app.add_typer(targets_app, name="targets")
app.add_typer(catalog_app, name="catalog") app.add_typer(catalog_app, name="catalog")
app.add_typer(maintenance_app, name="maintenance")
def _state_dir() -> Path: def _state_dir() -> Path:
@@ -142,27 +156,37 @@ def down(
raise typer.Exit(2) raise typer.Exit(2)
def _emit_restart_actions(actions: list[CleanupAction]) -> None:
any_error = False
for action in actions:
typer.echo(f"{action.tunnel}: {action.action} - {action.detail}")
if action.action == "error":
any_error = True
if any_error:
raise typer.Exit(1)
@app.command() @app.command()
def restart( def restart(
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"), tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
): ):
"""Restart one or all tunnels.""" """Restart one or all tunnels.
Reverse tunnels run conditional remote stale-forward cleanup before
reconnecting; healthy forwards are left running. Local-direction tunnels
use local stop/start only.
"""
cfg = _load_or_exit() cfg = _load_or_exit()
sd = _state_dir() sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
if tunnel: if tunnel:
tcfg = _resolve_tunnel(cfg, tunnel) tcfg = _resolve_tunnel(cfg, tunnel)
mgr = TunnelManager(tcfg, state_dir=sd) actions = [restart_tunnel(tcfg, state_mgr)]
mgr.stop()
mgr.start()
typer.echo(f"Restarted tunnel '{tunnel}'.")
else: else:
for name in _all_tunnel_names(cfg): actions = restart_all_tunnels(cfg, state_mgr)
tcfg = cfg.tunnels[name]
mgr = TunnelManager(tcfg, state_dir=sd) _emit_restart_actions(actions)
mgr.stop()
mgr.start()
typer.echo(f"Restarted tunnel '{name}'.")
@app.command() @app.command()
@@ -357,6 +381,84 @@ def _print_check_table(results):
typer.echo(_fmt(row)) typer.echo(_fmt(row))
@app.command("cert-status")
def cert_status(
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
"""Show certificate status for tunnels using cert_command mode."""
cfg = _load_or_exit()
sd = _state_dir()
names = [tunnel] if tunnel else list(cfg.tunnels.keys())
rows = []
any_expired = False
for name in names:
cert_file = sd / f"{name}-cert.pub"
if not cert_file.exists():
rows.append({"tunnel": name, "mode": "static-key", "cert_file": None})
continue
try:
result = subprocess.run(
["ssh-keygen", "-L", "-f", str(cert_file)],
capture_output=True, text=True, check=False,
)
info = {"tunnel": name, "mode": "cert", "cert_file": str(cert_file)}
for line in result.stdout.splitlines():
line = line.strip()
if line.startswith("Key ID:"):
info["key_id"] = line.split(":", 1)[1].strip().strip('"')
elif line.startswith("Valid:"):
parts = line.split()
if len(parts) >= 5 and parts[1] == "from" and parts[3] == "to":
info["valid_from"] = parts[2]
info["valid_until"] = parts[4]
try:
expires = datetime.fromisoformat(parts[4])
now = datetime.now()
remaining = expires - now
if remaining.total_seconds() <= 0:
info["expired"] = True
any_expired = True
else:
info["expired"] = False
mins = int(remaining.total_seconds() // 60)
info["ttl_remaining"] = f"{mins}m"
except ValueError:
pass
rows.append(info)
except FileNotFoundError:
rows.append({"tunnel": name, "mode": "cert", "error": "ssh-keygen not found"})
if as_json:
typer.echo(json.dumps(rows, indent=2))
else:
for row in rows:
mode = row.get("mode", "unknown")
if mode == "static-key":
typer.echo(f"{row['tunnel']} static-key / no cert")
elif "error" in row:
typer.echo(f"{row['tunnel']} ERROR: {row['error']}")
else:
parts = [row["tunnel"]]
if "key_id" in row:
parts.append(f"id={row['key_id']}")
if "valid_from" in row:
parts.append(f"from={row['valid_from']}")
if "valid_until" in row:
parts.append(f"until={row['valid_until']}")
if row.get("expired"):
parts.append("EXPIRED")
elif "ttl_remaining" in row:
parts.append(f"ttl={row['ttl_remaining']}")
typer.echo(" ".join(parts))
if any_expired:
raise typer.Exit(1)
# ─── targets commands ───────────────────────────────────────────────────────── # ─── targets commands ─────────────────────────────────────────────────────────
@targets_app.callback(invoke_without_command=True) @targets_app.callback(invoke_without_command=True)
@@ -553,3 +655,119 @@ def catalog_show(
if b.target in cat.targets: if b.target in cat.targets:
t = cat.targets[b.target] t = cat.targets[b.target]
typer.echo(f"Target: {t.description or t.id} ({t.kind})") typer.echo(f"Target: {t.description or t.id} ({t.kind})")
_CONVENTIONS_TEXT = """\
Actor Naming Conventions (from AccessManagementDirective.md §2)
Every actor declared under `actors:` in ~/.config/bridge/tunnels.yaml must have
a `class` field, and the actor name must start with the class-specific prefix:
class prefix purpose
----- ------ ------------------------------------------------------------
adm adm- Human operator (interactive shell when needed)
agt agt- LLM-powered autonomous agent (Claude Code, etc.)
atm atm- Deterministic script / cron job / pipeline
Legacy class aliases (deprecated, still accepted with a warning):
human -> adm
automation -> atm
Examples:
adm-bernd: { class: adm, description: Bernd Worsch }
agt-claude-coulombcore: { class: agt, description: Claude Code on CoulombCore }
atm-backup-daily: { class: atm, description: Nightly DB backup }
Full specification:
<ops-bridge repo>/wiki/AccessManagementDirective.md
"""
@maintenance_app.command("cleanup")
def maintenance_cleanup(
tunnel: Optional[str] = typer.Argument(
None,
help="Tunnel name (omit for all reverse tunnels)",
),
restart: bool = typer.Option(
False,
"--restart",
help="Restart tunnels after clearing stale remote bindings",
),
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
"""Clear stale SSH remote port forwards that block tunnel reconnects."""
cfg = _load_or_exit()
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
try:
report = cleanup_all_tunnels(
cfg,
state_mgr,
restart=restart,
tunnel_name=tunnel,
)
except KeyError:
typer.echo(f"Error: tunnel '{tunnel}' not found in config", err=True)
raise typer.Exit(1)
if as_json:
payload = {
"cleaned_count": report.cleaned_count,
"actions": [
{"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
for a in report.actions
],
}
typer.echo(json.dumps(payload, indent=2))
return
if not report.actions:
typer.echo("No reverse tunnels configured.")
return
for action in report.actions:
typer.echo(f"{action.tunnel}: {action.action}{action.detail}")
typer.echo(f"done ({report.cleaned_count} cleaned)")
@maintenance_app.command("install-cron")
def maintenance_install_cron():
"""Install a 03:00 daily cron job for `bridge maintenance cleanup --restart`."""
installed, message = install_cleanup_cron()
if installed:
typer.echo("Installed nightly cleanup cron:")
typer.echo(f" {message}")
else:
typer.echo(message)
raise typer.Exit(2)
@maintenance_app.command("uninstall-cron")
def maintenance_uninstall_cron():
"""Remove the nightly cleanup cron job."""
removed, message = uninstall_cleanup_cron()
if removed:
typer.echo(message)
else:
typer.echo(message)
raise typer.Exit(2)
@maintenance_app.command("show-cron")
def maintenance_show_cron():
"""Show the configured nightly cleanup cron line."""
existing = read_installed_cron()
if existing:
typer.echo(existing)
else:
typer.echo("Nightly cleanup cron is not installed.")
typer.echo("Would install:")
typer.echo(f" {build_cron_line()}")
@app.command()
def conventions():
"""Show the actor naming conventions enforced by tunnels.yaml."""
typer.echo(_CONVENTIONS_TEXT)
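The naming rules printed by `bridge conventions` can be checked in a few lines. This is a minimal standalone sketch, not the project's parser: `check_actor` and `LEGACY_ALIASES` are hypothetical names, and it raises a plain `ValueError` where the real code uses its own `ConfigError`.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


# Deprecated aliases listed in the conventions text.
LEGACY_ALIASES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def check_actor(name: str, raw_class: str) -> ActorType:
    """Resolve the class and enforce the class-specific name prefix."""
    if raw_class in LEGACY_ALIASES:
        actor_type = LEGACY_ALIASES[raw_class]
        warnings.warn(
            f"class '{raw_class}' is deprecated; use '{actor_type.value}'",
            DeprecationWarning,
            stacklevel=2,
        )
    else:
        actor_type = ActorType(raw_class)  # raises ValueError if unknown
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ValueError(f"actor '{name}' must start with '{prefix}'")
    return actor_type
```

Under these rules a legacy entry such as `operator.bernd` with `class: human` resolves to `adm` and then fails the prefix check, which is why the test fixtures in this change rename the actor to `adm-bernd`.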

@@ -2,13 +2,14 @@
 from __future__ import annotations
 import os
+import warnings
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, Optional
 import yaml
-from bridge.models import ActorInfo, HealthCheckConfig, ReconnectPolicy, TunnelConfig
+from bridge.models import ActorInfo, ActorType, HealthCheckConfig, ReconnectPolicy, TunnelConfig
 class ConfigError(Exception):
@@ -91,6 +92,10 @@ def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
     if direction not in ("reverse", "local"):
         raise ConfigError(f"Tunnel '{name}' direction must be 'reverse' or 'local', got: {direction!r}")
+    cert_command = data.get("cert_command") or None
+    if cert_command is not None:
+        cert_command = str(cert_command)
     return TunnelConfig(
         name=name,
         host=str(data["host"]),
@@ -102,6 +107,39 @@ def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
         reconnect=reconnect,
         health_check=health_check,
         direction=direction,
+        remote_host=str(data.get("remote_host", "127.0.0.1")),
+        cert_command=cert_command,
+    )
+_LEGACY_CLASS_MAP = {
+    "human": ActorType.ADM,
+    "automation": ActorType.ATM,
+}
+_ACTOR_TYPE_PREFIXES = {
+    ActorType.ADM: "adm-",
+    ActorType.AGT: "agt-",
+    ActorType.ATM: "atm-",
+}
+def _parse_actor_type(name: str, raw_class: str) -> ActorType:
+    if raw_class in _LEGACY_CLASS_MAP:
+        warnings.warn(
+            f"Actor '{name}': class '{raw_class}' is deprecated; "
+            f"use '{_LEGACY_CLASS_MAP[raw_class].value}' instead.",
+            DeprecationWarning,
+            stacklevel=4,
+        )
+        return _LEGACY_CLASS_MAP[raw_class]
+    try:
+        return ActorType(raw_class)
+    except ValueError:
+        raise ConfigError(
+            f"Actor '{name}' has unknown class '{raw_class}'; "
+            f"must be one of: adm, agt, atm (or legacy: human, automation). "
+            f"Run `bridge conventions` for the full naming rules."
         )
@@ -112,9 +150,17 @@ def _parse_actors(raw: dict) -> Dict[str, ActorInfo]:
             raise ConfigError(f"Actor '{name}' must be a mapping")
         if "class" not in data:
             raise ConfigError(f"Actor '{name}' missing required field: class")
+        actor_type = _parse_actor_type(name, str(data["class"]))
+        required_prefix = _ACTOR_TYPE_PREFIXES[actor_type]
+        if not name.startswith(required_prefix):
+            raise ConfigError(
+                f"Actor '{name}' has type '{actor_type.value}' but name must start "
+                f"with '{required_prefix}' (got '{name}'). "
+                f"Run `bridge conventions` for the full naming rules."
+            )
         actors[name] = ActorInfo(
             name=name,
-            actor_class=str(data["class"]),
+            actor_type=actor_type,
             description=str(data.get("description", "")),
         )
     return actors

@@ -1,6 +1,7 @@
 """End-to-end tunnel diagnostics for OpsBridge."""
 from __future__ import annotations
+import socket
 import subprocess
 import time
 from dataclasses import dataclass
@@ -13,6 +14,38 @@ from bridge.models import BridgeState, TunnelConfig
 from bridge.state import StateManager, _pid_alive
def _remote_port_probe_command(remote_port: int) -> str:
"""Build a portable remote shell probe for a listening TCP port."""
return (
f"port={remote_port}; "
"if command -v ss >/dev/null 2>&1; then "
"ss -tnlp 2>/dev/null | grep -q \":$port \" && echo ok || echo closed; "
"elif command -v netstat >/dev/null 2>&1; then "
"netstat -tnlp 2>/dev/null | "
"grep -q \"[.:]$port[[:space:]]\" && echo ok || echo closed; "
"else "
"hex=$(printf '%04X' \"$port\"); "
"awk -v p=\":$hex\" "
"'NR > 1 && $4 == \"0A\" && index($2, p) { found = 1 } "
"END { print found ? \"ok\" : \"closed\" }' "
"/proc/net/tcp /proc/net/tcp6 2>/dev/null; "
"fi"
)
def _probe_local_port(local_port: int) -> str:
"""Check whether the local side of an SSH -L tunnel is accepting TCP."""
try:
with socket.create_connection(("127.0.0.1", local_port), timeout=5):
return "listening"
except ConnectionRefusedError:
return "closed"
except socket.timeout:
return "error:timeout"
except OSError as e:
return f"error:{e}"
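The last fallback in `_remote_port_probe_command` scans `/proc/net/tcp` for a socket in state `0A` (LISTEN) whose local address ends in the port written as four uppercase hex digits. The same check is sketched in Python below as an illustration; `port_is_listening` and `SAMPLE` are hypothetical, and the sketch uses a suffix match where the awk one-liner uses a substring match via `index`.

```python
def port_is_listening(proc_net_tcp_text: str, port: int) -> bool:
    """Return True if any row is a LISTEN socket bound to `port`."""
    suffix = f":{port:04X}"  # e.g. 18000 -> ":4650"
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == "0A" and local_address.endswith(suffix):
            return True
    return False


# Abbreviated sample in the /proc/net/tcp layout (illustrative values only).
SAMPLE = (
    "  sl  local_address rem_address   st tx_queue rx_queue\n"
    "   0: 0100007F:4650 00000000:0000 0A 00000000:00000000\n"
    "   1: 0100007F:1F40 0100007F:C350 01 00000000:00000000\n"
)
```

With this sample, port 18000 (hex `4650`) is reported as listening, while port 8000 (hex `1F40`) is not, because its only row is an established connection (state `01`).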
 @dataclass
 class TunnelCheckResult:
     tunnel: str
@@ -52,7 +85,10 @@ def check_tunnel(cfg: TunnelConfig, state_mgr: StateManager) -> TunnelCheckResul
         and ssh_process != "ok"
     )
-    # 3. SSH probe for remote port
+    # 3. Port probe: reverse tunnels listen remotely; local tunnels listen here.
+    if cfg.direction == "local":
+        remote_port = _probe_local_port(cfg.local_port)
+    else:
         key_path = str(Path(cfg.ssh_key).expanduser())
         cmd = [
             "ssh",
@@ -61,7 +97,7 @@ def check_tunnel(cfg: TunnelConfig, state_mgr: StateManager) -> TunnelCheckResul
             "-o", "ConnectTimeout=5",
             "-o", "StrictHostKeyChecking=accept-new",
             f"{cfg.ssh_user}@{cfg.host}",
-            f"ss -tnlp 2>/dev/null | grep -q ':{cfg.remote_port} ' && echo ok || echo closed",
+            _remote_port_probe_command(cfg.remote_port),
         ]
         try:
             proc = subprocess.run(

@@ -6,35 +6,102 @@ import os
 import signal
 import subprocess
 import time
+from datetime import datetime, timedelta
 from pathlib import Path
 from typing import List, Optional
 from bridge.audit import AuditEvent, AuditLogger
 from bridge.health import HealthChecker
-from bridge.models import BridgeState, TunnelConfig
+from bridge.models import BridgeState, CertAcquisitionError, TunnelConfig
 from bridge.state import StateManager
 log = logging.getLogger(__name__)
-def build_ssh_command(cfg: TunnelConfig) -> List[str]:
+def _actor_type_from_name(name: str) -> str:
+    for prefix in ("adm", "agt", "atm"):
+        if name.startswith(f"{prefix}-"):
+            return prefix
+    return "unknown"
+def build_ssh_command(cfg: TunnelConfig, cert_path: Optional[Path] = None) -> List[str]:
     """Build the SSH tunnel command (reverse -R or local -L)."""
     key = os.path.expanduser(cfg.ssh_key)
     if cfg.direction == "local":
-        forward_flag = ["-L", f"{cfg.local_port}:127.0.0.1:{cfg.remote_port}"]
+        forward_flag = ["-L", f"{cfg.local_port}:{cfg.remote_host}:{cfg.remote_port}"]
     else:
-        forward_flag = ["-R", f"{cfg.remote_port}:127.0.0.1:{cfg.local_port}"]
+        forward_flag = ["-R", f"{cfg.remote_port}:{cfg.remote_host}:{cfg.local_port}"]
-    return [
+    cmd = [
         "ssh",
         "-N",
         *forward_flag,
         "-i", key,
+    ]
+    if cert_path is not None:
+        cmd += ["-i", str(cert_path)]
+    cmd += [
         "-o", "ServerAliveInterval=10",
         "-o", "ServerAliveCountMax=3",
         "-o", "ExitOnForwardFailure=yes",
         "-o", "StrictHostKeyChecking=accept-new",
         f"{cfg.ssh_user}@{cfg.host}",
     ]
+    return cmd
def _run_cert_command(cfg: TunnelConfig, state_dir: Path) -> Optional[Path]:
"""Run cert_command and write cert to state dir. Returns cert path or None."""
if cfg.cert_command is None:
return None
result = subprocess.run(
cfg.cert_command,
shell=True,
capture_output=True,
text=True,
)
if result.returncode != 0:
raise CertAcquisitionError(result.stderr.strip())
cert_path = state_dir / f"{cfg.name}-cert.pub"
cert_path.write_text(result.stdout)
return cert_path
def _parse_cert_identity(cert_path: Path) -> Optional[str]:
"""Parse Key ID from ssh-keygen -L output."""
try:
result = subprocess.run(
["ssh-keygen", "-L", "-f", str(cert_path)],
capture_output=True,
text=True,
)
for line in result.stdout.splitlines():
line = line.strip()
if line.startswith("Key ID:"):
return line.split(":", 1)[1].strip().strip('"')
except Exception:
pass
return None
def _parse_cert_expiry(cert_path: Path) -> Optional[datetime]:
"""Parse Valid-before datetime from ssh-keygen -L output."""
try:
result = subprocess.run(
["ssh-keygen", "-L", "-f", str(cert_path)],
capture_output=True,
text=True,
)
for line in result.stdout.splitlines():
line = line.strip()
if line.startswith("Valid:"):
# "Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00"
parts = line.split()
if len(parts) >= 5 and parts[3] == "to":
return datetime.fromisoformat(parts[4])
except Exception:
pass
return None
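The expiry parsing and the five-minute refresh margin used later in `_run_loop` can be exercised without calling `ssh-keygen`. The sketch below is an illustration under stated assumptions: it takes the text of `ssh-keygen -L` as a string, and `parse_valid_before` and `needs_refresh` are hypothetical names for logic that the diff spreads across `_parse_cert_expiry` and `_check_ttl`.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_valid_before(keygen_output: str) -> Optional[datetime]:
    """Extract the end of the validity window from `ssh-keygen -L` text."""
    for line in keygen_output.splitlines():
        line = line.strip()
        if line.startswith("Valid:"):
            # Expected shape: "Valid: from <start> to <end>"
            parts = line.split()
            if len(parts) >= 5 and parts[3] == "to":
                return datetime.fromisoformat(parts[4])
    return None  # no validity line, or a shape such as "Valid: forever"


def needs_refresh(
    expires_at: Optional[datetime],
    now: datetime,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """True once `now` is within `margin` of expiry; never for unknown expiry."""
    if expires_at is None:
        return False
    return now >= expires_at - margin
```

Both sides of the comparison are naive datetimes here, matching the `datetime.now()` call in the diff; mixing naive and timezone-aware values would raise a `TypeError`.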
 class TunnelManager:
@@ -56,7 +123,8 @@ class TunnelManager:
         return self._state.is_running(self._cfg.name)
     def _actor_info(self):
-        return self._cfg.actor, "unknown"
+        actor = self._cfg.actor
+        return actor, _actor_type_from_name(actor)
     def _next_backoff(self, attempt: int) -> int:
         initial = self._cfg.reconnect.backoff_initial
@@ -71,12 +139,12 @@ class TunnelManager:
             return
         self._state.write_state(self._cfg.name, BridgeState.STARTING)
-        actor, actor_class = self._actor_info()
+        actor, actor_type = self._actor_info()
         self._audit.log(
             tunnel=self._cfg.name,
             event=AuditEvent.BRIDGE_STARTED,
             actor=actor,
-            actor_class=actor_class,
+            actor_type=actor_type,
         )
         pid = os.fork()
@@ -99,7 +167,7 @@ class TunnelManager:
                     tunnel=self._cfg.name,
                     event=AuditEvent.BRIDGE_STOPPED,
                     actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
                 )
             os._exit(0)
@@ -131,12 +199,12 @@ class TunnelManager:
         self._state.clear_pid(self._cfg.name)
         self._state.write_state(self._cfg.name, BridgeState.STOPPED)
-        actor, actor_class = self._actor_info()
+        actor, actor_type = self._actor_info()
         self._audit.log(
             tunnel=self._cfg.name,
             event=AuditEvent.BRIDGE_STOPPED,
             actor=actor,
-            actor_class=actor_class,
+            actor_type=actor_type,
         )
     def _run_loop(self) -> None:
@@ -144,11 +212,11 @@ class TunnelManager:
         import asyncio
         cfg = self._cfg
-        actor, actor_class = self._actor_info()
+        actor, actor_type = self._actor_info()
         attempt = 0
         max_attempts = cfg.reconnect.max_attempts  # 0 = infinite
+        state_dir = self._state._dir
-        # Setup signal handler for graceful shutdown
         _stop = [False]
         def _on_term(signum, frame):
@@ -162,7 +230,31 @@ class TunnelManager:
                 self._state.write_state(cfg.name, BridgeState.FAILED)
                 break
-            cmd = build_ssh_command(cfg)
+            # Acquire cert before each SSH launch (T3, T7)
+            try:
+                cert_path = _run_cert_command(cfg, state_dir)
+            except CertAcquisitionError as e:
+                self._audit.log(
+                    tunnel=cfg.name,
+                    event=AuditEvent.BRIDGE_DISCONNECTED,
+                    actor=actor,
+                    actor_type=actor_type,
+                    detail=f"cert acquisition failed: {e}",
+                )
+                attempt += 1
+                if max_attempts > 0 and attempt >= max_attempts:
+                    self._state.write_state(cfg.name, BridgeState.FAILED)
+                    break
+                backoff = self._next_backoff(attempt - 1)
+                self._state.write_state(cfg.name, BridgeState.RECONNECTING)
+                log.info("Cert acquisition failed, retrying in %ds", backoff)
+                time.sleep(backoff)
+                continue
+            cert_identity = _parse_cert_identity(cert_path) if cert_path else None
+            cert_expires_at = _parse_cert_expiry(cert_path) if cert_path else None
+            cmd = build_ssh_command(cfg, cert_path=cert_path)
             log.info("Starting SSH: %s", " ".join(cmd))
             self._state.write_state(cfg.name, BridgeState.STARTING)
@@ -174,24 +266,30 @@ class TunnelManager:
                     tunnel=cfg.name,
                     event=AuditEvent.BRIDGE_DISCONNECTED,
                     actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
                     detail="ssh binary not found",
                 )
                 break
-            # Wait briefly then assume connected if still running
             time.sleep(2)
+            _ttl_refresh = False
             if proc.poll() is None:
                 self._state.write_state(cfg.name, BridgeState.CONNECTED)
                 self._audit.log(
                     tunnel=cfg.name,
                     event=AuditEvent.BRIDGE_CONNECTED,
                     actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
+                    cert_identity=cert_identity,
                 )
                 attempt = 0
-                # Health check loop
+                def _check_ttl() -> bool:
+                    """Return True if cert is within 5 min of expiry and SSH should restart."""
+                    if cert_expires_at is None:
+                        return False
+                    return datetime.now() >= cert_expires_at - timedelta(minutes=5)
                 if cfg.health_check:
                     checker = HealthChecker(
                         url=cfg.health_check.url,
@@ -199,6 +297,18 @@ class TunnelManager:
                     )
                     health_failing = False
                     while not _stop[0] and proc.poll() is None:
+                        if _check_ttl():
+                            self._audit.log(
+                                tunnel=cfg.name,
+                                event=AuditEvent.CERT_EXPIRING,
+                                actor=actor,
+                                actor_type=actor_type,
+                                cert_identity=cert_identity,
+                                detail=str(cert_expires_at),
+                            )
+                            proc.terminate()
+                            _ttl_refresh = True
+                            break
                         result = asyncio.run(checker.check())
                         if result.ok:
                             if health_failing:
@@ -208,7 +318,7 @@ class TunnelManager:
                                     tunnel=cfg.name,
                                     event=AuditEvent.HEALTH_CHECK_RECOVERED,
                                     actor=actor,
-                                    actor_class=actor_class,
+                                    actor_type=actor_type,
                                 )
                         else:
                             if not health_failing:
@@ -218,21 +328,36 @@ class TunnelManager:
                                     tunnel=cfg.name,
                                     event=AuditEvent.HEALTH_CHECK_FAILED,
                                     actor=actor,
-                                    actor_class=actor_class,
+                                    actor_type=actor_type,
                                     detail=result.error or f"HTTP {result.status_code}",
                                 )
                         time.sleep(cfg.health_check.interval_seconds)
                 else:
                     while not _stop[0] and proc.poll() is None:
+                        if _check_ttl():
+                            self._audit.log(
+                                tunnel=cfg.name,
+                                event=AuditEvent.CERT_EXPIRING,
+                                actor=actor,
+                                actor_type=actor_type,
+                                cert_identity=cert_identity,
+                                detail=str(cert_expires_at),
+                            )
+                            proc.terminate()
+                            _ttl_refresh = True
+                            break
                         time.sleep(1)
-            # SSH exited
+            if _ttl_refresh:
+                # Planned cert refresh — don't count as failure, no backoff
+                continue
             if proc.poll() is not None:
                 self._audit.log(
                     tunnel=cfg.name,
                     event=AuditEvent.BRIDGE_DISCONNECTED,
                     actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
                     detail=f"exit code {proc.returncode}",
                 )
@@ -248,7 +373,7 @@ class TunnelManager:
                     tunnel=cfg.name,
                     event=AuditEvent.BRIDGE_RECONNECTING,
                     actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
                     detail=f"retry {attempt}, backoff {backoff}s",
                 )
                 log.info("Reconnecting in %ds (attempt %d)", backoff, attempt)

@@ -169,19 +169,22 @@ def bridge_down(tunnel: Optional[str] = None) -> dict:
 def bridge_restart(tunnel: Optional[str] = None) -> dict:
     """Restart one or all configured tunnels.
+    Reverse tunnels run conditional remote stale-forward cleanup before
+    reconnecting; healthy forwards are left running.
     Args:
         tunnel: Tunnel name to restart. If omitted, restarts all inline tunnels.
     Returns:
-        {"restarted": [...]} or {"error": "..."}
+        {"actions": [{"tunnel", "action", "detail"}, ...]} or {"error": "..."}
     """
     cfg, err = _load_cfg_or_error()
     if err:
         return err
-    from bridge.manager import TunnelManager
+    from bridge.cleanup import restart_all_tunnels, restart_tunnel
     sd = _state_dir()
-    restarted = []
+    state_mgr = StateManager(state_dir=sd)
     if tunnel:
         from bridge.catalog.loader import load_catalog
@@ -196,18 +199,19 @@ def bridge_restart(tunnel: Optional[str] = None) -> dict:
             tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
         except BridgeNotFound:
             return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
-        mgr = TunnelManager(tcfg, state_dir=sd)
-        mgr.stop()
-        mgr.start()
-        restarted.append(tunnel)
+        actions = [restart_tunnel(tcfg, state_mgr)]
     else:
-        for name, tcfg in cfg.tunnels.items():
-            mgr = TunnelManager(tcfg, state_dir=sd)
-            mgr.stop()
-            mgr.start()
-            restarted.append(name)
-    return {"restarted": restarted}
+        actions = restart_all_tunnels(cfg, state_mgr)
+    payload = {
+        "actions": [
+            {"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
+            for a in actions
+        ],
+    }
+    if any(a.action == "error" for a in actions):
+        payload["error"] = "one or more tunnels failed to restart"
+    return payload
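The new return shape of `bridge_restart` is easiest to see on a concrete list of actions. A minimal sketch follows, assuming a stand-in `CleanupAction` dataclass with the three fields the diff reads (`tunnel`, `action`, `detail`); the real class lives in `bridge.cleanup` and is not shown in this section, and `restart_payload` is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CleanupAction:
    """Stand-in for bridge.cleanup.CleanupAction (assumed field names)."""
    tunnel: str
    action: str
    detail: str


def restart_payload(actions: List[CleanupAction]) -> Dict[str, Any]:
    """Build the {"actions": [...]} payload, adding "error" on any failure."""
    payload: Dict[str, Any] = {
        "actions": [
            {"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
            for a in actions
        ],
    }
    if any(a.action == "error" for a in actions):
        payload["error"] = "one or more tunnels failed to restart"
    return payload
```

Callers that previously read the `restarted` key would need to read `actions` instead; a partial failure still returns every per-tunnel action alongside the top-level `error`.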
 @mcp.tool()
@@ -513,4 +517,13 @@ def resource_catalog_targets() -> str:
 # ---------------------------------------------------------------------------
 if __name__ == "__main__":
+    import argparse
+    parser = argparse.ArgumentParser(description="OpsBridge MCP server")
+    parser.add_argument("--http", action="store_true", help="Run in SSE/HTTP mode instead of stdio")
+    args = parser.parse_args()
+    if args.http:
+        port = int(os.environ.get("BRIDGE_MCP_PORT", "8002"))
+        mcp.run(transport="sse", host="127.0.0.1", port=port)
+    else:
         mcp.run(transport="stdio")

@@ -15,6 +15,16 @@ class BridgeState(str, Enum):
     FAILED = "failed"
+class ActorType(str, Enum):
+    ADM = "adm"  # human operator
+    AGT = "agt"  # LLM-powered autonomous agent
+    ATM = "atm"  # deterministic script / pipeline
+class CertAcquisitionError(Exception):
+    """Raised when cert_command fails to produce a certificate."""
 @dataclass
 class ReconnectPolicy:
     max_attempts: int = 0  # 0 = infinite
@@ -41,10 +51,15 @@ class TunnelConfig:
     reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
     health_check: Optional[HealthCheckConfig] = None
     direction: str = "reverse"  # "reverse" (-R) or "local" (-L)
+    # Forward-destination host as seen from the remote end (direction "local")
+    # or from this workstation (direction "reverse"). Defaults to loopback;
+    # set e.g. a k3s ClusterIP to tunnel to an in-cluster Service.
+    remote_host: str = "127.0.0.1"
+    cert_command: Optional[str] = None
 @dataclass
 class ActorInfo:
     name: str
-    actor_class: str  # "human" or "automation"
+    actor_type: ActorType
     description: str = ""
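The `remote_host` field feeds the forward specification that `build_ssh_command` passes to `ssh`. The sketch below isolates that step for illustration; `forward_flag` is a hypothetical free function, and the address `10.43.0.1` is a made-up ClusterIP, not a value from this change.

```python
from typing import List


def forward_flag(
    direction: str,
    local_port: int,
    remote_port: int,
    remote_host: str = "127.0.0.1",
) -> List[str]:
    """Return the -L or -R argument pair for an SSH port forward."""
    if direction == "local":
        # -L listens on this machine; remote_host is resolved on the far side.
        return ["-L", f"{local_port}:{remote_host}:{remote_port}"]
    # -R listens on the far side; remote_host is resolved from this machine.
    return ["-R", f"{remote_port}:{remote_host}:{local_port}"]
```

For a local-direction tunnel with ports 6443/6443 and the assumed ClusterIP, the spec becomes `6443:10.43.0.1:6443`. With the default host, a reverse tunnel keeps the previous `18000:127.0.0.1:8000` form, so existing configs are unaffected.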

@@ -23,10 +23,10 @@ VALID_CONFIG = textwrap.dedent("""\
         local_port: 8000
         ssh_user: ubuntu
         ssh_key: ~/.ssh/id_ops
-        actor: operator.bernd
+        actor: adm-bernd
     actors:
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
         description: Bernd
 """)
@@ -38,10 +38,10 @@ VALID_CONFIG_WITH_CATALOG = textwrap.dedent("""\
         local_port: 8000
         ssh_user: ubuntu
         ssh_key: ~/.ssh/id_ops
-        actor: operator.bernd
+        actor: adm-bernd
     actors:
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
         description: Bernd
     catalog_path: {catalog_path}
 """)

@@ -22,7 +22,7 @@ class TestAuditLogger:
             tunnel="my-tunnel",
             event=AuditEvent.BRIDGE_STARTED,
             actor="operator.bernd",
-            actor_class="human",
+            actor_type="adm",
         )
         log_file = log_dir / "my-tunnel.log"
         assert log_file.exists()
@@ -32,7 +32,7 @@ class TestAuditLogger:
             tunnel="my-tunnel",
             event=AuditEvent.BRIDGE_STARTED,
             actor="operator.bernd",
-            actor_class="human",
+            actor_type="adm",
         )
         lines = (log_dir / "my-tunnel.log").read_text().strip().splitlines()
         assert len(lines) == 1
@@ -40,12 +40,12 @@ class TestAuditLogger:
         assert entry["tunnel"] == "my-tunnel"
         assert entry["event"] == "bridge_started"
         assert entry["actor"] == "operator.bernd"
-        assert entry["actor_class"] == "human"
+        assert entry["actor_type"] == "adm"
         assert "timestamp" in entry
     def test_multiple_events_append(self, logger, log_dir):
         for event in [AuditEvent.BRIDGE_STARTED, AuditEvent.BRIDGE_CONNECTED, AuditEvent.BRIDGE_STOPPED]:
-            logger.log(tunnel="t", event=event, actor="a", actor_class="human")
+            logger.log(tunnel="t", event=event, actor="a", actor_type="adm")
         lines = (log_dir / "t.log").read_text().strip().splitlines()
         assert len(lines) == 3
@@ -54,7 +54,7 @@ class TestAuditLogger:
             tunnel="t",
             event=AuditEvent.HEALTH_CHECK_FAILED,
             actor="a",
-            actor_class="automation",
+            actor_type="atm",
             detail="connection refused",
         )
         entry = json.loads((log_dir / "t.log").read_text().strip())
@@ -72,15 +72,15 @@ class TestAuditLogger:
     def test_timestamp_is_iso8601(self, logger, log_dir):
         from datetime import datetime
-        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_class="human")
+        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
         entry = json.loads((log_dir / "t.log").read_text().strip())
         # Should parse without error
         dt = datetime.fromisoformat(entry["timestamp"])
         assert dt.tzinfo is not None or True  # UTC or naive both acceptable
     def test_read_events(self, logger, log_dir):
-        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_class="human")
-        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_class="human")
+        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_type="adm")
+        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
         events = logger.read_events("t")
         assert len(events) == 2
         assert events[0]["event"] == "bridge_started"

130
tests/test_cleanup.py Normal file
@@ -0,0 +1,130 @@
"""Tests for stale SSH forward cleanup."""
from __future__ import annotations
import textwrap
from unittest.mock import MagicMock, patch
from typer.testing import CliRunner
from bridge.cleanup import (
CleanupAction,
build_cron_line,
cleanup_all_tunnels,
remote_forward_health_url,
should_cleanup_tunnel,
)
from bridge.cli import app
from bridge.config import load_config
from bridge.models import HealthCheckConfig, TunnelConfig
from bridge.state import StateManager
def _tunnel(**overrides) -> TunnelConfig:
base = dict(
name="state-hub-railiance01",
host="92.205.62.239",
remote_port=18000,
local_port=8000,
ssh_user="tegwick",
ssh_key="~/.ssh/id_ops",
actor="agt-claude-railiance01",
health_check=HealthCheckConfig(
url="http://127.0.0.1:8000/state/health",
timeout_seconds=5,
),
)
base.update(overrides)
return TunnelConfig(**base)
class TestRemoteForwardHealthUrl:
def test_maps_local_port_to_remote(self):
cfg = _tunnel()
assert remote_forward_health_url(cfg) == "http://127.0.0.1:18000/state/health"
def test_returns_none_for_local_tunnel(self):
cfg = _tunnel(direction="local")
assert remote_forward_health_url(cfg) is None
class TestShouldCleanupTunnel:
def test_skips_healthy_remote_forward(self, tmp_path):
cfg = _tunnel()
state_mgr = StateManager(state_dir=tmp_path)
with (
patch("bridge.cleanup.remote_port_listening", return_value=True),
patch("bridge.cleanup.probe_remote_forward", return_value=(True, "ok")),
):
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
assert needed is False
def test_detects_stale_forward_when_local_ok_remote_fails(self, tmp_path):
cfg = _tunnel()
state_mgr = StateManager(state_dir=tmp_path)
with (
patch("bridge.cleanup.remote_port_listening", return_value=True),
patch("bridge.cleanup.probe_remote_forward", return_value=(False, "timeout")),
patch("bridge.cleanup.local_service_healthy", return_value=True),
patch(
"bridge.cleanup.check_tunnel",
return_value=MagicMock(ssh_process="ok", remote_port="listening"),
),
):
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
assert needed is True
assert "stale forward" in reason
class TestCleanupAllTunnels:
def test_reports_cleaned_tunnel(self, tmp_path, monkeypatch):
monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "tunnels.yaml"))
(tmp_path / "tunnels.yaml").write_text(
textwrap.dedent(
"""\
tunnels:
state-hub-railiance01:
host: 92.205.62.239
remote_port: 18000
local_port: 8000
ssh_user: tegwick
ssh_key: ~/.ssh/id_ops
actor: agt-claude-railiance01
health_check:
url: http://127.0.0.1:8000/state/health
actors:
agt-claude-railiance01:
class: agt
"""
)
)
cfg = load_config()
state_mgr = StateManager(state_dir=tmp_path / "state")
with patch(
"bridge.cleanup.cleanup_tunnel",
return_value=CleanupAction("state-hub-railiance01", "cleaned", "cleared"),
):
report = cleanup_all_tunnels(cfg, state_mgr, restart=False)
assert report.cleaned_count == 1
assert report.actions[0].action == "cleaned"
class TestMaintenanceCli:
def test_cleanup_help(self):
runner = CliRunner()
result = runner.invoke(app, ["maintenance", "cleanup", "--help"])
assert result.exit_code == 0
assert "restart" in result.output.lower()
def test_show_cron_prints_template_when_not_installed(self):
runner = CliRunner()
with patch("bridge.cli.read_installed_cron", return_value=None):
result = runner.invoke(app, ["maintenance", "show-cron"])
assert result.exit_code == 0
assert "0 3 * * *" in result.output
def test_build_cron_line_contains_marker():
line = build_cron_line()
assert "0 3 * * *" in line
assert "maintenance cleanup --restart" in line
assert "ops-bridge: maintenance cleanup" in line

@@ -17,10 +17,10 @@ VALID_CONFIG = textwrap.dedent("""\
         local_port: 8000
         ssh_user: ubuntu
         ssh_key: ~/.ssh/id_ops
-        actor: operator.bernd
+        actor: adm-bernd
     actors:
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
         description: Bernd
 """)
@@ -266,22 +266,146 @@ class TestCheckCommand:
         assert result.exit_code == 1
+REVERSE_CONFIG = VALID_CONFIG
+LOCAL_TUNNEL_CONFIG = textwrap.dedent("""\
+    tunnels:
+      k3s-api:
+        host: host.local
+        remote_port: 6443
+        local_port: 6443
+        ssh_user: ubuntu
+        ssh_key: ~/.ssh/id_ops
+        actor: adm-bernd
+        direction: local
+    actors:
+      adm-bernd:
+        class: adm
+        description: Bernd
+""")
 class TestRestartCommand:
     def test_restart_unknown_tunnel_exit_1(self, env):
         result = runner.invoke(app, ["restart", "nonexistent"], env=env)
         assert result.exit_code == 1
+    def test_restart_help_mentions_remote_cleanup(self):
+        result = runner.invoke(app, ["restart", "--help"])
+        assert result.exit_code == 0
+        assert "stale-forward" in result.output.lower() or "remote" in result.output.lower()
     @pytest.mark.capability("bridge_restart")
     @pytest.mark.access_mode("cli")
-    def test_restart_calls_stop_then_start(self, env):
-        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
+    def test_restart_reverse_tunnel_delegates_to_cleanup(self, env):
+        from bridge.cleanup import CleanupAction
+        with patch("bridge.cli.restart_tunnel") as mock_restart:
+            mock_restart.return_value = CleanupAction(
+                "test-tunnel", "healthy", "remote forward healthy"
+            )
+            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
+        assert result.exit_code == 0
+        mock_restart.assert_called_once()
+        assert "test-tunnel: healthy" in result.output
+    def test_restart_reverse_tunnel_reports_cleaned_and_restarted(self, env):
+        from bridge.cleanup import CleanupAction
+        with patch("bridge.cli.restart_tunnel") as mock_restart:
+            mock_restart.return_value = CleanupAction(
+                "test-tunnel",
+                "cleaned_and_restarted",
+                "stale forward; restarted tunnel; cleared",
+            )
+            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
+        assert result.exit_code == 0
+        assert "cleaned_and_restarted" in result.output
+    def test_restart_reverse_tunnel_error_exit_1(self, env):
+        from bridge.cleanup import CleanupAction
+        with patch("bridge.cli.restart_tunnel") as mock_restart:
+            mock_restart.return_value = CleanupAction(
+                "test-tunnel", "error", "cleanup failed: still_listening"
+            )
+            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
+        assert result.exit_code == 1
+        assert "error" in result.output
+    def test_restart_local_tunnel_uses_stop_start(self, tmp_path, state_dir):
+        config_file = tmp_path / "tunnels.yaml"
+        config_file.write_text(LOCAL_TUNNEL_CONFIG)
+        env = {
+            "BRIDGE_CONFIG": str(config_file),
+            "BRIDGE_STATE_DIR": str(state_dir),
}
with patch("bridge.cleanup.TunnelManager") as mock_mgr_cls:
mock_mgr = MagicMock() mock_mgr = MagicMock()
mock_mgr_cls.return_value = mock_mgr mock_mgr_cls.return_value = mock_mgr
call_order = [] call_order = []
mock_mgr.stop.side_effect = lambda: call_order.append("stop") mock_mgr.stop.side_effect = lambda: call_order.append("stop")
mock_mgr.start.side_effect = lambda: call_order.append("start") mock_mgr.start.side_effect = lambda: call_order.append("start")
result = runner.invoke(app, ["restart", "test-tunnel"], env=env) result = runner.invoke(app, ["restart", "k3s-api"], env=env)
assert result.exit_code == 0 assert result.exit_code == 0
assert call_order == ["stop", "start"] assert call_order == ["stop", "start"]
assert "k3s-api: restarted" in result.output
class TestCertStatusCommand:
@pytest.mark.capability("bridge_cert_status")
@pytest.mark.access_mode("cli")
def test_cert_status_no_cert_shows_static_key(self, env, state_dir):
result = runner.invoke(app, ["cert-status"], env=env)
assert result.exit_code == 0
assert "static-key" in result.output
def test_cert_status_json_no_cert(self, env, state_dir):
result = runner.invoke(app, ["cert-status", "--json"], env=env)
assert result.exit_code == 0
data = json.loads(result.output)
assert data[0]["mode"] == "static-key"
def test_cert_status_exit_1_on_expired(self, env, state_dir, tmp_path):
# Write a fake cert file in state dir; mock ssh-keygen to report expired
state_dir.mkdir(parents=True, exist_ok=True)
cert_file = state_dir / "test-tunnel-cert.pub"
cert_file.write_text("fake cert")
with patch("subprocess.run") as mock_run:
mock_run.return_value = MagicMock(
stdout=(
"test-tunnel-cert.pub:\n"
" Key ID: \"agt-test\"\n"
" Valid: from 2026-01-01T00:00:00 to 2026-01-02T00:00:00\n"
),
returncode=0,
)
result = runner.invoke(app, ["cert-status"], env=env)
assert result.exit_code == 1
assert "EXPIRED" in result.output
def test_cert_status_json_with_cert(self, env, state_dir):
state_dir.mkdir(parents=True, exist_ok=True)
cert_file = state_dir / "test-tunnel-cert.pub"
cert_file.write_text("fake cert")
with patch("subprocess.run") as mock_run:
mock_run.return_value = MagicMock(
stdout=(
"test-tunnel-cert.pub:\n"
" Key ID: \"agt-test\"\n"
" Valid: from 2030-01-01T00:00:00 to 2030-01-02T00:00:00\n"
),
returncode=0,
)
result = runner.invoke(app, ["cert-status", "--json"], env=env)
assert result.exit_code == 0
data = json.loads(result.output)
assert data[0]["mode"] == "cert"
assert data[0]["key_id"] == "agt-test"
assert data[0]["expired"] is False


@@ -1,9 +1,11 @@
"""Tests for config loading.""" """Tests for config loading."""
import textwrap import textwrap
import warnings
import pytest import pytest
from bridge.config import ConfigError, load_config from bridge.config import ConfigError, load_config
from bridge.models import ActorType
VALID_YAML = textwrap.dedent("""\ VALID_YAML = textwrap.dedent("""\
@@ -14,7 +16,7 @@ VALID_YAML = textwrap.dedent("""\
local_port: 8000 local_port: 8000
ssh_user: ubuntu ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops ssh_key: ~/.ssh/id_ops
actor: agent.claude-coulombcore actor: agt-claude-coulombcore
health_check: health_check:
url: http://127.0.0.1:18000/health url: http://127.0.0.1:18000/health
interval_seconds: 30 interval_seconds: 30
@@ -25,11 +27,11 @@ VALID_YAML = textwrap.dedent("""\
backoff_max: 60 backoff_max: 60
actors: actors:
agent.claude-coulombcore: agt-claude-coulombcore:
class: automation class: agt
description: Claude Code agent on CoulombCore description: Claude Code agent on CoulombCore
operator.bernd: adm-bernd:
class: human class: adm
description: Bernd Worsch description: Bernd Worsch
""") """)
@@ -50,7 +52,7 @@ def test_load_valid_config(config_file, monkeypatch):
assert t.remote_port == 18000 assert t.remote_port == 18000
assert t.local_port == 8000 assert t.local_port == 8000
assert t.ssh_user == "ubuntu" assert t.ssh_user == "ubuntu"
assert t.actor == "agent.claude-coulombcore" assert t.actor == "agt-claude-coulombcore"
def test_health_check_loaded(config_file, monkeypatch): def test_health_check_loaded(config_file, monkeypatch):
@@ -74,10 +76,10 @@ def test_reconnect_policy_loaded(config_file, monkeypatch):
def test_actors_loaded(config_file, monkeypatch): def test_actors_loaded(config_file, monkeypatch):
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file)) monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
cfg = load_config() cfg = load_config()
assert "agent.claude-coulombcore" in cfg.actors assert "agt-claude-coulombcore" in cfg.actors
a = cfg.actors["agent.claude-coulombcore"] a = cfg.actors["agt-claude-coulombcore"]
assert a.actor_class == "automation" assert a.actor_type == ActorType.AGT
assert "operator.bernd" in cfg.actors assert "adm-bernd" in cfg.actors
def test_missing_required_field_raises(tmp_path, monkeypatch): def test_missing_required_field_raises(tmp_path, monkeypatch):
@@ -118,12 +120,180 @@ def test_tunnel_without_health_check(tmp_path, monkeypatch):
local_port: 8000 local_port: 8000
ssh_user: ubuntu ssh_user: ubuntu
ssh_key: ~/.ssh/id_rsa ssh_key: ~/.ssh/id_rsa
actor: operator.bernd actor: adm-bernd
actors: actors:
operator.bernd: adm-bernd:
class: human class: adm
description: Bernd description: Bernd
""")) """))
monkeypatch.setenv("BRIDGE_CONFIG", str(f)) monkeypatch.setenv("BRIDGE_CONFIG", str(f))
cfg = load_config() cfg = load_config()
assert cfg.tunnels["simple"].health_check is None assert cfg.tunnels["simple"].health_check is None
class TestActorTypeValidation:
def test_canonical_agt_accepted(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: agt-claude
actors:
agt-claude:
class: agt
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
cfg = load_config()
assert cfg.actors["agt-claude"].actor_type == ActorType.AGT
def test_canonical_atm_accepted(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: atm-backup
actors:
atm-backup:
class: atm
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
cfg = load_config()
assert cfg.actors["atm-backup"].actor_type == ActorType.ATM
def test_wrong_prefix_raises_config_error(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: adm-bernd
actors:
adm-bernd:
class: agt
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
with pytest.raises(ConfigError, match="must start with 'agt-'"):
load_config()
def test_missing_prefix_raises_config_error(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: operator.bernd
actors:
operator.bernd:
class: adm
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
with pytest.raises(ConfigError, match="must start with 'adm-'"):
load_config()
def test_unknown_class_raises_config_error(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: adm-bernd
actors:
adm-bernd:
class: wizard
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
with pytest.raises(ConfigError, match="unknown class"):
load_config()
def test_legacy_human_maps_to_adm_with_warning(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: adm-bernd
actors:
adm-bernd:
class: human
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
cfg = load_config()
assert cfg.actors["adm-bernd"].actor_type == ActorType.ADM
assert any("deprecated" in str(x.message).lower() for x in w)
def test_legacy_automation_maps_to_atm_with_warning(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: atm-cron
actors:
atm-cron:
class: automation
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
with warnings.catch_warnings(record=True) as w:
warnings.simplefilter("always")
cfg = load_config()
assert cfg.actors["atm-cron"].actor_type == ActorType.ATM
assert any("deprecated" in str(x.message).lower() for x in w)
class TestCertCommandConfig:
def test_cert_command_parsed(self, tmp_path, monkeypatch):
f = tmp_path / "t.yaml"
f.write_text(textwrap.dedent("""\
tunnels:
t:
host: h
remote_port: 1
local_port: 2
ssh_user: u
ssh_key: ~/.ssh/k
actor: agt-bridge
cert_command: "warden sign agt-bridge --pubkey /tmp/k.pub"
actors:
agt-bridge:
class: agt
"""))
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
cfg = load_config()
assert cfg.tunnels["t"].cert_command == "warden sign agt-bridge --pubkey /tmp/k.pub"
def test_no_cert_command_is_none(self, config_file, monkeypatch):
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
cfg = load_config()
assert cfg.tunnels["state-hub-coulombcore"].cert_command is None


@@ -6,7 +6,11 @@ from unittest.mock import MagicMock, patch
import pytest import pytest
from bridge.diagnostics import TunnelCheckResult, check_all_tunnels, check_tunnel from bridge.diagnostics import (
_remote_port_probe_command,
check_all_tunnels,
check_tunnel,
)
from bridge.models import BridgeState, TunnelConfig from bridge.models import BridgeState, TunnelConfig
from bridge.state import StateManager from bridge.state import StateManager
@@ -20,7 +24,7 @@ def tcfg():
local_port=8000, local_port=8000,
ssh_user="ubuntu", ssh_user="ubuntu",
ssh_key="~/.ssh/id_ops", ssh_key="~/.ssh/id_ops",
actor="operator.bernd", actor="adm-bernd",
) )
@@ -32,6 +36,14 @@ def state_mgr(tmp_path):
class TestCheckTunnel: class TestCheckTunnel:
def test_remote_port_probe_has_minimal_host_fallback(self):
"""Remote probe supports minimal hosts without ss/netstat."""
command = _remote_port_probe_command(18000)
assert "command -v ss" in command
assert "command -v netstat" in command
assert "/proc/net/tcp" in command
assert "/proc/net/tcp6" in command
def test_no_pid(self, tcfg, state_mgr): def test_no_pid(self, tcfg, state_mgr):
"""No PID file → ssh_process='no_pid', ok=False.""" """No PID file → ssh_process='no_pid', ok=False."""
with patch("bridge.diagnostics.subprocess.run") as mock_run: with patch("bridge.diagnostics.subprocess.run") as mock_run:
@@ -83,6 +95,29 @@ class TestCheckTunnel:
assert result.remote_port == "closed" assert result.remote_port == "closed"
assert result.ok is False assert result.ok is False
def test_local_direction_checks_local_port(self, tcfg, state_mgr):
"""Local tunnels verify the local listener instead of a remote -R port."""
local_cfg = TunnelConfig(
name="local-tunnel",
host="haskelseed.local",
remote_port=1234,
local_port=11234,
ssh_user="root",
ssh_key="~/.ssh/id_ops",
actor="adm-bernd",
direction="local",
)
state_mgr.write_pid("local-tunnel", 12345)
with (
patch("bridge.diagnostics._pid_alive", return_value=True),
patch("bridge.diagnostics._probe_local_port", return_value="listening"),
patch("bridge.diagnostics.subprocess.run") as mock_run,
):
result = check_tunnel(local_cfg, state_mgr)
mock_run.assert_not_called()
assert result.remote_port == "listening"
assert result.ok is True
def test_ssh_timeout(self, tcfg, state_mgr): def test_ssh_timeout(self, tcfg, state_mgr):
"""SSH probe timeout → remote_port='error:timeout'.""" """SSH probe timeout → remote_port='error:timeout'."""
state_mgr.write_pid("test-tunnel", 12345) state_mgr.write_pid("test-tunnel", 12345)
@@ -114,7 +149,7 @@ class TestCheckTunnel:
local_port=8000, local_port=8000,
ssh_user="ubuntu", ssh_user="ubuntu",
ssh_key="~/.ssh/id_ops", ssh_key="~/.ssh/id_ops",
actor="operator.bernd", actor="adm-bernd",
health_check=HealthCheckConfig(url="http://127.0.0.1:8000/health"), health_check=HealthCheckConfig(url="http://127.0.0.1:8000/health"),
) )
state_mgr.write_pid("test-tunnel", 12345) state_mgr.write_pid("test-tunnel", 12345)
@@ -135,7 +170,8 @@ class TestCheckAllTunnels:
def test_check_all_iterates_tunnels(self, tmp_path): def test_check_all_iterates_tunnels(self, tmp_path):
"""check_all_tunnels returns one result per tunnel in cfg.""" """check_all_tunnels returns one result per tunnel in cfg."""
from bridge.config import load_config from bridge.config import load_config
import textwrap, os import textwrap
import os
cfg_file = tmp_path / "tunnels.yaml" cfg_file = tmp_path / "tunnels.yaml"
cfg_file.write_text(textwrap.dedent("""\ cfg_file.write_text(textwrap.dedent("""\
@@ -146,17 +182,17 @@ class TestCheckAllTunnels:
local_port: 8001 local_port: 8001
ssh_user: ubuntu ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops ssh_key: ~/.ssh/id_ops
actor: operator.bernd actor: adm-bernd
t2: t2:
host: h2.local host: h2.local
remote_port: 18002 remote_port: 18002
local_port: 8002 local_port: 8002
ssh_user: ubuntu ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops ssh_key: ~/.ssh/id_ops
actor: operator.bernd actor: adm-bernd
actors: actors:
operator.bernd: adm-bernd:
class: human class: adm
description: Bernd description: Bernd
""")) """))
os.environ["BRIDGE_CONFIG"] = str(cfg_file) os.environ["BRIDGE_CONFIG"] = str(cfg_file)


@@ -18,14 +18,14 @@ MINIMAL_CONFIG = textwrap.dedent("""\
local_port: 8000 local_port: 8000
ssh_user: testuser ssh_user: testuser
ssh_key: ~/.ssh/id_rsa ssh_key: ~/.ssh/id_rsa
actor: operator.bernd actor: adm-bernd
reconnect: reconnect:
max_attempts: 2 max_attempts: 2
backoff_initial: 1 backoff_initial: 1
backoff_max: 2 backoff_max: 2
actors: actors:
operator.bernd: adm-bernd:
class: human class: adm
description: Bernd description: Bernd
""") """)
@@ -51,7 +51,7 @@ def tunnel_cfg():
local_port=8000, local_port=8000,
ssh_user="testuser", ssh_user="testuser",
ssh_key="~/.ssh/id_rsa", ssh_key="~/.ssh/id_rsa",
actor="operator.bernd", actor="adm-bernd",
reconnect=ReconnectPolicy(max_attempts=2, backoff_initial=1, backoff_max=2), reconnect=ReconnectPolicy(max_attempts=2, backoff_initial=1, backoff_max=2),
) )
@@ -142,7 +142,7 @@ class TestHealthCheckDegradedPath:
local_port=8001, local_port=8001,
ssh_user="u", ssh_user="u",
ssh_key="k", ssh_key="k",
actor="operator.bernd", actor="adm-bernd",
reconnect=ReconnectPolicy(max_attempts=1, backoff_initial=1, backoff_max=1), reconnect=ReconnectPolicy(max_attempts=1, backoff_initial=1, backoff_max=1),
health_check=hc_cfg, health_check=hc_cfg,
) )


@@ -3,6 +3,8 @@ import os
import signal import signal
from unittest.mock import MagicMock, patch from unittest.mock import MagicMock, patch
from dataclasses import replace
import pytest import pytest
from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig
@@ -38,6 +40,16 @@ class TestBuildSshCommand:
assert "-i" in cmd assert "-i" in cmd
assert "ubuntu@host.local" in cmd assert "ubuntu@host.local" in cmd
def test_remote_host_override_local(self, tunnel_cfg):
cfg = replace(tunnel_cfg, direction="local", remote_host="10.43.103.154")
cmd = build_ssh_command(cfg)
assert "-L" in cmd
assert f"{cfg.local_port}:10.43.103.154:{cfg.remote_port}" in cmd
def test_remote_host_default_loopback(self, tunnel_cfg):
cmd = build_ssh_command(tunnel_cfg)
assert "18000:127.0.0.1:8000" in cmd
def test_server_alive_options(self, tunnel_cfg): def test_server_alive_options(self, tunnel_cfg):
cmd = build_ssh_command(tunnel_cfg) cmd = build_ssh_command(tunnel_cfg)
assert "-o" in cmd assert "-o" in cmd
@@ -105,3 +117,99 @@ class TestTunnelManager:
def test_is_running_false_initially(self, tunnel_cfg, state_dir): def test_is_running_false_initially(self, tunnel_cfg, state_dir):
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir) mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
assert not mgr.is_running() assert not mgr.is_running()
class TestBuildSshCommandWithCert:
def test_no_cert_path_omits_extra_i(self, tunnel_cfg):
cmd = build_ssh_command(tunnel_cfg)
assert cmd.count("-i") == 1
def test_cert_path_appends_after_key(self, tunnel_cfg, tmp_path):
cert = tmp_path / "test-cert.pub"
cert.write_text("cert")
cmd = build_ssh_command(tunnel_cfg, cert_path=cert)
i_indices = [i for i, x in enumerate(cmd) if x == "-i"]
assert len(i_indices) == 2
key_idx, cert_idx = i_indices
assert not cmd[key_idx + 1].endswith("-cert.pub") # key comes first
assert cmd[cert_idx + 1] == str(cert)
class TestRunCertCommand:
def test_returns_none_when_no_cert_command(self, tunnel_cfg, tmp_path):
from bridge.manager import _run_cert_command
assert _run_cert_command(tunnel_cfg, tmp_path) is None
def test_writes_cert_and_returns_path(self, tunnel_cfg, tmp_path):
from bridge.manager import _run_cert_command
tunnel_cfg.cert_command = "echo 'ssh-rsa-cert AAAA'"
path = _run_cert_command(tunnel_cfg, tmp_path)
assert path is not None
assert path.exists()
assert "ssh-rsa-cert" in path.read_text()
def test_raises_on_nonzero_exit(self, tunnel_cfg, tmp_path):
from bridge.manager import _run_cert_command
from bridge.models import CertAcquisitionError
tunnel_cfg.cert_command = "exit 1"
with pytest.raises(CertAcquisitionError):
_run_cert_command(tunnel_cfg, tmp_path)
class TestActorTypeFromName:
def test_adm_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("adm-bernd") == "adm"
def test_agt_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("agt-claude") == "agt"
def test_atm_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("atm-cron") == "atm"
def test_unknown_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("operator.bernd") == "unknown"
class TestTtlRefresh:
def test_parse_cert_expiry_returns_none_for_missing_file(self, tmp_path):
from bridge.manager import _parse_cert_expiry
missing = tmp_path / "no.pub"
result = _parse_cert_expiry(missing)
assert result is None
def test_parse_cert_identity_returns_none_for_missing_file(self, tmp_path):
from bridge.manager import _parse_cert_identity
missing = tmp_path / "no.pub"
result = _parse_cert_identity(missing)
assert result is None
def test_parse_cert_identity_from_keygen_output(self, tmp_path):
from unittest.mock import patch, MagicMock
from bridge.manager import _parse_cert_identity
cert = tmp_path / "test.pub"
cert.write_text("fake")
with patch("subprocess.run") as mock_run:
mock_run.return_value = MagicMock(
stdout='test.pub:\n Key ID: "agt-bridge"\n',
returncode=0,
)
result = _parse_cert_identity(cert)
assert result == "agt-bridge"
def test_parse_cert_expiry_from_keygen_output(self, tmp_path):
from unittest.mock import patch, MagicMock
from bridge.manager import _parse_cert_expiry
cert = tmp_path / "test.pub"
cert.write_text("fake")
with patch("subprocess.run") as mock_run:
mock_run.return_value = MagicMock(
stdout="test.pub:\n Valid: from 2026-05-15T10:00:00 to 2030-05-15T22:00:00\n",
returncode=0,
)
result = _parse_cert_expiry(cert)
assert result is not None
assert result.year == 2030


@@ -49,10 +49,10 @@ def _simple_config(tmp_path: Path) -> Path:
local_port: 8000 local_port: 8000
ssh_user: ubuntu ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops ssh_key: ~/.ssh/id_ops
actor: operator.bernd actor: adm-bernd
actors: actors:
operator.bernd: adm-bernd:
class: human class: adm
description: Bernd description: Bernd
""")) """))
@@ -66,10 +66,10 @@ def _catalog_config(tmp_path: Path, catalog_dir: Path) -> Path:
local_port: 8000 local_port: 8000
ssh_user: ubuntu ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops ssh_key: ~/.ssh/id_ops
actor: operator.bernd actor: adm-bernd
actors: actors:
operator.bernd: adm-bernd:
class: human class: adm
description: Bernd description: Bernd
catalog_path: {catalog_dir} catalog_path: {catalog_dir}
""")) """))
@@ -237,22 +237,22 @@ class TestMcpBridgeDown:
class TestMcpBridgeRestart: class TestMcpBridgeRestart:
@pytest.mark.capability("bridge_restart") @pytest.mark.capability("bridge_restart")
@pytest.mark.access_mode("mcp") @pytest.mark.access_mode("mcp")
async def test_bridge_restart_calls_stop_then_start(self, env_simple): async def test_bridge_restart_delegates_to_cleanup(self, env_simple):
with patch("bridge.manager.TunnelManager") as mock_cls: from bridge.cleanup import CleanupAction
mock_mgr = MagicMock()
call_order = [] with patch("bridge.cleanup.restart_tunnel") as mock_restart:
mock_mgr.stop.side_effect = lambda: call_order.append("stop") mock_restart.return_value = CleanupAction(
mock_mgr.start.side_effect = lambda: call_order.append("start") "test-tunnel", "healthy", "remote forward healthy"
mock_cls.return_value = mock_mgr )
from fastmcp import Client from fastmcp import Client
async with Client(mcp) as c: async with Client(mcp) as c:
result = await c.call_tool("bridge_restart", {"tunnel": "test-tunnel"}) result = await c.call_tool("bridge_restart", {"tunnel": "test-tunnel"})
data = _data(result) data = _data(result)
assert "restarted" in data assert data["actions"][0]["tunnel"] == "test-tunnel"
assert "test-tunnel" in data["restarted"] assert data["actions"][0]["action"] == "healthy"
assert call_order == ["stop", "start"] mock_restart.assert_called_once()
async def test_bridge_restart_unknown_tunnel(self, env_simple): async def test_bridge_restart_unknown_tunnel(self, env_simple):
from fastmcp import Client from fastmcp import Client
@@ -278,8 +278,8 @@ class TestMcpBridgeLogs:
_json.dumps({ _json.dumps({
"timestamp": "2026-01-01T00:00:00+00:00", "timestamp": "2026-01-01T00:00:00+00:00",
"tunnel": "test-tunnel", "tunnel": "test-tunnel",
"actor": "operator.bernd", "actor": "adm-bernd",
"actor_class": "human", "actor_type": "adm",
"event": "bridge_started", "event": "bridge_started",
}) + "\n" }) + "\n"
) )


@@ -69,6 +69,7 @@ class TestTunnelConfig:
class TestActorInfo: class TestActorInfo:
def test_fields(self): def test_fields(self):
a = ActorInfo(name="operator.bernd", actor_class="human", description="Bernd") from bridge.models import ActorType
assert a.name == "operator.bernd" a = ActorInfo(name="adm-bernd", actor_type=ActorType.ADM, description="Bernd")
assert a.actor_class == "human" assert a.name == "adm-bernd"
assert a.actor_type == ActorType.ADM

18
uv.lock generated

@@ -345,7 +345,7 @@ wheels = [
[[package]] [[package]]
name = "fastmcp" name = "fastmcp"
version = "3.1.0" version = "3.0.2"
source = { registry = "https://pypi.org/simple" } source = { registry = "https://pypi.org/simple" }
dependencies = [ dependencies = [
{ name = "authlib" }, { name = "authlib" },
@@ -365,14 +365,13 @@ dependencies = [
{ name = "python-dotenv" }, { name = "python-dotenv" },
{ name = "pyyaml" }, { name = "pyyaml" },
{ name = "rich" }, { name = "rich" },
{ name = "uncalled-for" },
{ name = "uvicorn" }, { name = "uvicorn" },
{ name = "watchfiles" }, { name = "watchfiles" },
{ name = "websockets" }, { name = "websockets" },
] ]
sdist = { url = "https://files.pythonhosted.org/packages/0a/70/862026c4589441f86ad3108f05bfb2f781c6b322ad60a982f40b303b47d7/fastmcp-3.1.0.tar.gz", hash = "sha256:e25264794c734b9977502a51466961eeecff92a0c2f3b49c40c070993628d6d0", size = 17347083 } sdist = { url = "https://files.pythonhosted.org/packages/11/6b/1a7ec89727797fb07ec0928e9070fa2f45e7b35718e1fe01633a34c35e45/fastmcp-3.0.2.tar.gz", hash = "sha256:6bd73b4a3bab773ee6932df5249dcbcd78ed18365ed0aeeb97bb42702a7198d7", size = 17239351 }
wheels = [ wheels = [
{ url = "https://files.pythonhosted.org/packages/17/07/516f5b20d88932e5a466c2216b628e5358a71b3a9f522215607c3281de05/fastmcp-3.1.0-py3-none-any.whl", hash = "sha256:b1f73b56fd3b0cb2bd9e2a144fc650d5cc31587ed129d996db7710e464ae8010", size = 633749 }, { url = "https://files.pythonhosted.org/packages/0a/5a/f410a9015cfde71adf646dab4ef2feae49f92f34f6050fcfb265eb126b30/fastmcp-3.0.2-py3-none-any.whl", hash = "sha256:f513d80d4b30b54749fe8950116b1aab843f3c293f5cb971fc8665cb48dbb028", size = 606268 },
] ]
[[package]] [[package]]
@@ -664,7 +663,7 @@ dev = [
[package.metadata] [package.metadata]
requires-dist = [ requires-dist = [
{ name = "fastmcp", specifier = ">=2.0.0" }, { name = "fastmcp", specifier = ">=2.0.0,<3.1.0" },
{ name = "httpx", specifier = ">=0.27" }, { name = "httpx", specifier = ">=0.27" },
{ name = "pyyaml", specifier = ">=6.0" }, { name = "pyyaml", specifier = ">=6.0" },
{ name = "typer", specifier = ">=0.12" }, { name = "typer", specifier = ">=0.12" },
@@ -1297,15 +1296,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/dc/9b/47798a6c91d8bdb567fe2698fe81e0c6b7cb7ef4d13da4114b41d239f65d/typing_inspection-0.4.2-py3-none-any.whl", hash = "sha256:4ed1cacbdc298c220f1bd249ed5287caa16f34d44ef4e9c3d0cbad5b521545e7", size = 14611 }, { url = "https://files.pythonhosted.org/packages/dc/9b/47798a6c91d8bdb567fe2698fe81e0c6b7cb7ef4d13da4114b41d239f65d/typing_inspection-0.4.2-py3-none-any.whl", hash = "sha256:4ed1cacbdc298c220f1bd249ed5287caa16f34d44ef4e9c3d0cbad5b521545e7", size = 14611 },
] ]
[[package]]
name = "uncalled-for"
version = "0.2.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/02/7c/b5b7d8136f872e3f13b0584e576886de0489d7213a12de6bebf29ff6ebfc/uncalled_for-0.2.0.tar.gz", hash = "sha256:b4f8fdbcec328c5a113807d653e041c5094473dd4afa7c34599ace69ccb7e69f", size = 49488 }
wheels = [
{ url = "https://files.pythonhosted.org/packages/ff/7f/4320d9ce3be404e6310b915c3629fe27bf1e2f438a1a7a3cb0396e32e9a9/uncalled_for-0.2.0-py3-none-any.whl", hash = "sha256:2c0bd338faff5f930918f79e7eb9ff48290df2cb05fcc0b40a7f334e55d4d85f", size = 11351 },
]
[[package]] [[package]]
name = "uvicorn" name = "uvicorn"
version = "0.41.0" version = "0.41.0"


@@ -0,0 +1,203 @@
AccessManagementDirective
*Practical host access control management*
# AccessManagementDirective
**Document Title:** SSH Access Management Directive
**Version:** 1.1 (Production-Ready Revision Post-SWOT Improvements)
**Date:** 28 March 2026
**Audience:** Operations Department
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
**Author:** Grok (on behalf of the team)
**Status:** Official Directive. All ops personnel, agents, and automation pipelines MUST follow this.
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
## 0. Prerequisites
Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
## 1. Concept Overview
This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
**Why this model?**
- A central CA signs short-lived certificates for every login.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
**Core Principles**
- **Least privilege:** Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
- **Short-lived credentials:** Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
- **One CA, many issuers:** A single offline User CA whose public key is trusted by every host.
- **Automation-first:** All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- **Separation of concerns:**
  - **Admins (adm)**: Human operators (full interactive shell when needed).
  - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
  - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
## 2. Actor Definitions & Access Model
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
**Certificate Naming Convention**
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
**LLM-Agent Risk Clarification**
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
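The naming convention and lifetime table above can be sketched as a small lookup. This is a minimal illustration only: `MAX_TTL_HOURS`, `Identity`, and `parse_identity` are hypothetical names, and the ceilings are simply the upper bounds of the lifetime ranges in the table, not values taken from any actual signing policy.

```python
from dataclasses import dataclass

# Hypothetical ceilings, taken from the upper end of each lifetime range
# in the actor table (adm 24-48 h, agt 4-24 h, atm 1-8 h).
MAX_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


@dataclass(frozen=True)
class Identity:
    actor_type: str      # "adm", "agt" or "atm"
    name: str            # everything after the first hyphen
    max_ttl_hours: int   # longest certificate lifetime for this actor type


def parse_identity(identity: str) -> Identity:
    """Split an identity string such as 'adm-bernd' into type and name."""
    prefix, sep, name = identity.partition("-")
    if not sep or not name or prefix not in MAX_TTL_HOURS:
        raise ValueError(
            f"identity {identity!r} must start with one of "
            f"{sorted(p + '-' for p in MAX_TTL_HOURS)}"
        )
    return Identity(prefix, name, MAX_TTL_HOURS[prefix])
```

A signing wrapper could call something like `parse_identity` before requesting a certificate, so that a legacy name such as `operator.bernd` is rejected early rather than at login time.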
## 3. Bootstrapping the System (One-Time Setup)
### 3.1. Create the CA (do this once, offline)
```bash
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
```
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
- Public key: `ca_user.pub`
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:
```bash
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
```
- Create principals directory and files from the central Git inventory.
- `systemctl restart sshd`
### 3.3. Initial Admin Access
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
## 4. Automatic Management of Access Rights
### 4.1. Daily / On-Demand Workflow
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
- **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
- **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
- **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
- `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
- Example inventory snippet:
```yaml
hosts:
  - name: prod-db-01
    allowed_principals:
      adm: [adm-full]
      agt: [agt-incident-resolver-v2]
      atm: [atm-backup-daily, atm-logrotate]
```
3. **Revocation & Rotation**
- Short expiry = automatic revocation.
- For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
- Agents/automations never store long-lived private keys on disk.
4. **Concrete Agent & Automation Wrapper Example** (Python snippet; place in `/usr/local/bin/ops-ssh-wrapper`)
```python
#!/usr/bin/env python3
import os
import subprocess
import sys
import tempfile

# Request a short-lived cert from Vault for the public key in $SSH_PUBKEY
cert = subprocess.check_output(
    ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
     f"public_key={os.environ['SSH_PUBKEY']}"]
).decode().strip()
# Persist the signed cert so ssh-add can read it from disk
with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
    f.write(cert.encode())
    cert_path = f.name
# Load into ssh-agent and exec the real command (argv[1:] is the wrapped command)
subprocess.run(["ssh-add", cert_path])
os.execvp(sys.argv[1], sys.argv[1:])
```
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
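The inventory snippet in step 2 maps each account class to its allowed principals. As a minimal sketch of how such an inventory entry might be rendered into per-account principals files (the function name `render_principals`, the use of `adm`/`agt`/`atm` as local account names, and the temporary output directory are assumptions for illustration, not part of the Ansible playbook):

```python
import tempfile
from pathlib import Path


def render_principals(host: dict, out_dir: Path) -> list[Path]:
    """Write one principals file per account, one principal per line.

    Hypothetical helper: mirrors the layout sshd expects for
    AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u.
    """
    written = []
    for account, principals in sorted(host["allowed_principals"].items()):
        path = out_dir / account
        path.write_text("\n".join(principals) + "\n")
        written.append(path)
    return written


# Same data as the YAML inventory snippet, expressed as a Python dict.
host = {
    "name": "prod-db-01",
    "allowed_principals": {
        "adm": ["adm-full"],
        "agt": ["agt-incident-resolver-v2"],
        "atm": ["atm-backup-daily", "atm-logrotate"],
    },
}

# Render into a scratch directory instead of /etc/ssh/auth_principals.
demo_dir = Path(tempfile.mkdtemp())
files = render_principals(host, demo_dir)
```

In a real deployment the rendering would more likely be an Ansible template task; the sketch only shows the shape of the output files.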
### 4.2. Human UX Guidance
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for seamless experience. Manual `ssh-keygen -s` is only for edge cases.
### 4.3. Emergency Break-Glass Procedure
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
4. After recovery, immediately rotate the CA and run a full scorecard.
## 5. AccessManagement Scorecard (Checklist)
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f *.pub` |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | ≥ 10/10 = **Operational** | - | - |
**Scorecard Execution Command** (run from ops laptop):
```bash
ansible all -m command -a "ssh-access-scorecard.sh" --become
```
## 6. Scope & Operational Boundaries
### 6.1. When Bootstrapping Is Officially Closed
The system is **fully operational** when **ALL** of the following are true:
- Scorecard passes 10/10 on every host.
- Central Git repo contains the authoritative principals inventory.
- First three admins have successfully used signed certificates for 7 consecutive days.
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
- CI/CD pipeline for host config updates is green and runs hourly.
- Emergency break-glass procedure has been tested once.
**Declaration:** Ops Lead signs off with date in the Git commit message.
### 6.2. Scope Boundary: When to Switch to Sophisticated Tooling
Stay with **native OpenSSH CA + Ansible + Vault** while:
- ≤ 200 hosts
- ≤ 50 distinct agent/automation identities
- No regulatory requirement for SSO or full session recording
**Switch triggers** (any one):
- > 200 hosts OR rapid daily growth
- Need for human SSO (Okta/Google) integration
- Requirement for audited web-based SSH sessions or just-in-time access approval
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
- Audit/compliance demands central policy engine or session recording
**Recommended next-level tools** (in order):
1. **Teleport**: Best for mixed human + agent workloads (SSO + Machine ID).
2. **HashiCorp Vault SSH + Boundary**: When you already use Vault heavily.
3. **step-ca + smallstep**: If you prefer a pure open-source CA with OIDC.
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
## 7. Enforcement & Review
- **Quarterly review** of this directive and scorecard results.
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
- **Questions / improvements** → create PR against this file in the ops repo.
**End of Document**
Approved for immediate use across all production and staging environments.
xxx

View File

@@ -157,31 +157,82 @@ Just controlled operational access when you need it.
Start a bridge:
```
-ob up hostA=hostB
+bridge up state-hub-railiance01
```
Check active bridges:
```
-ob status
+bridge status
```
Investigate infrastructure targets:
```
-ob targets
+bridge targets
```
Stop the bridge when finished:
```
-ob down hostA=hostB
+bridge down state-hub-railiance01
```
OpsBridge handles the lifecycle so operators can focus on solving the problem.
---
# Tunnel lifecycle commands
| Command | Purpose |
|---------|---------|
| `bridge up` | Start tunnel(s) that are not already running |
| `bridge down` | Stop tunnel(s) that are running |
| `bridge restart` | Blank-slate recovery — get tunnel(s) operational again |
| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |
## `bridge restart` — blank-slate recovery
`bridge restart` means *operational again*, not merely cycling the local manager
PID while a broken remote listener still holds the port.
For **reverse** tunnels (State Hub exposure on remote hosts), restart:
1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
2. Clears orphan listeners on the remote host when needed
3. Reconnects the tunnel (stop + start) only when cleanup was required
When the remote forward is already healthy, restart reports `healthy` and leaves
the working tunnel running — no unnecessary disruption.
For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g.
`k3s-api-coulombcore`), restart uses local stop/start only; no remote cleanup.
Use `bridge maintenance cleanup` for scheduled or manual hygiene without the
restart contract. The nightly cron (`bridge maintenance install-cron`) runs
`maintenance cleanup --restart` at 03:00.
**Incident context:** stale orphan `sshd` remote forwards after laptop sleep
blocked `bridge restart` until operators discovered the maintenance subcommand.
See `state-hub/history/20260621-weekend-automation-assessment.md` and
`BRIDGE-WP-0005` in this repo.
## Host roles
Tunnels in `~/.config/bridge/tunnels.yaml` serve three host roles:
| Role | Hosts | Behaviour |
|------|-------|-----------|
| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |
Conditional remote cleanup before restart benefits all reverse tunnels.
`should_cleanup_tunnel` skips healthy forwards — VPS tunnels with live working
forwards are untouched.
---
# The Philosophy Behind OpsBridge
Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:

View File

@@ -0,0 +1,56 @@
---
id: ADHOC-2026-06-14
type: workplan
title: "Ad hoc ops-bridge fixes for 2026-06-14"
domain: custodian
repo: ops-bridge
status: finished
owner: codex
topic_slug: ops-bridge
created: "2026-06-14"
updated: "2026-06-14"
state_hub_workstream_id: "fbc2ef7e-626f-4c6a-bdf8-c69bf29097ce"
---
## Fix haskelseed bridge diagnostics
```task
id: ADHOC-2026-06-14-T01
status: done
priority: medium
state_hub_task_id: "ffe6b8d8-889c-4ec4-8b64-00b77f86e39f"
```
`haskelseed` is an Alpine host without `ss`, so `bridge check` reported
reverse tunnel ports as closed even while SSH reverse listeners were present.
Updated diagnostics to fall back from `ss` to `netstat` and then
`/proc/net/tcp`/`tcp6`. Also fixed local-direction diagnostics so
`nix-daemon-haskelseed` checks the local `-L` listener instead of probing a
remote reverse port.
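The last fallback step, reading `/proc/net/tcp` when neither `ss` nor `netstat` is available, can be sketched as follows. This is an illustration of the technique rather than the code in the diagnostics module: the function name `listening_ports` and the sample table are made up, and the sketch relies on the Linux convention that the local address is hex-encoded `ADDR:PORT` and that state `0A` means LISTEN.

```python
def listening_ports(proc_net_tcp: str) -> set[int]:
    """Return the TCP ports in LISTEN state from /proc/net/tcp-style text."""
    ports = set()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_addr, state = fields[1], fields[3]
        if state == "0A":  # 0A is TCP_LISTEN in the kernel's state table
            port_hex = local_addr.rsplit(":", 1)[1]
            ports.add(int(port_hex, 16))
    return ports


# Illustrative sample: one listener on 127.0.0.1:18000 (0x4650) and one
# established connection on port 8000 (0x1F40), which must be ignored.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:4650 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345
   1: 0100007F:1F40 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 12346
"""

open_ports = listening_ports(SAMPLE)
```

On a live host the same function could be fed the contents of `/proc/net/tcp` and `/proc/net/tcp6`, since both files share the column layout.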
Verification:
- `state-hub-haskelseed` responded through `127.0.0.1:18000/state/health`.
- `bridge check --json` reported all configured tunnels `ok: true`.
- `python3 -m pytest tests/test_cli.py tests/test_diagnostics.py` passed.
## Make default target safe and add setup
```task
id: ADHOC-2026-06-14-T02
status: done
priority: medium
state_hub_task_id: "3b932955-0d75-4b95-9821-92bfa2dadbd0"
```
Changed `make` to default to a help listing that only shows targets with
`##` comments. Added `make setup` to run `uv sync --all-groups` and reinstall
the editable `bridge` CLI wrapper through `uv tool install -e . --force`.
Verification:
- `uv sync --all-groups` succeeded and installed the project environment.
- `make` listed targets only and did not run tests or setup.
- `make setup` succeeded and installed the `bridge` executable.
- `make test` passed all 235 tests.
- `make lint` passed.

View File

@@ -2,7 +2,7 @@
id: BRIDGE-WP-0001
type: workplan
title: "OpsBridge Initial Implementation"
-domain: custodian
+domain: infotech
repo: ops-bridge
status: completed
owner: Bernd

View File

@@ -2,7 +2,7 @@
id: BRIDGE-WP-0002
type: workplan
title: "OpsCatalog Extension"
-domain: custodian
+domain: infotech
repo: ops-bridge
status: completed
owner: Bernd

View File

@@ -2,7 +2,7 @@
id: BRIDGE-WP-0003
type: workplan
title: "OpsBridge MCP Server, Skill, and Cross-Mode Test Coverage"
-domain: custodian
+domain: infotech
repo: ops-bridge
status: done
owner: Bernd

View File

@@ -0,0 +1,340 @@
---
id: BRIDGE-WP-0004
type: workplan
title: "AccessManagementDirective Alignment"
domain: infotech
repo: ops-bridge
status: done
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
state_hub_workstream_id: "e3451b70-688e-4e19-bff5-0c82c0f009a7"
---
# BRIDGE-WP-0004 — AccessManagementDirective Alignment
**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
preserving full backward compatibility with the existing static-key mode.
**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
---
## Goal
After this workplan:
1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
`cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
handled transparently by the tunnel manager.
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
the directive, with config validation that enforces naming conventions.
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
§5 SIEM traceability requirement.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
---
## Design Decisions
### Static key mode stays first-class
If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
explicitly supported for:
- Lab/dev environments without a CA
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
- Environments below the directive's complexity threshold
### cert_command interface
```yaml
# tunnels.yaml — optional cert_command field
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
on tunnel stop.
`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
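The acquisition step described above (run the shell string, capture stdout, write it to a cert file) might look roughly like this. It is a sketch under stated assumptions, not the code in `manager.py`: the standalone function `acquire_cert`, its parameters, and the `0o600` file mode are illustrative choices, and the demo uses `echo` in place of a real signing command.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command exits non-zero (stand-in for the real class)."""


def acquire_cert(cert_command: Optional[str], tunnel: str, state_dir: Path) -> Optional[Path]:
    """Run cert_command and store its stdout as <tunnel>-cert.pub."""
    if cert_command is None:
        return None  # static key mode: nothing to acquire
    # shell=True is deliberate here: cert_command is an opaque shell string.
    result = subprocess.run(cert_command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    cert_path = state_dir / f"{tunnel}-cert.pub"
    cert_path.write_text(result.stdout)  # overwritten on every acquisition
    cert_path.chmod(0o600)               # assumption: keep the file private
    return cert_path


# Demo with a harmless command standing in for `warden sign ...`.
state_dir = Path(tempfile.mkdtemp())
demo_cert = acquire_cert("echo ssh-ed25519-cert-v01@openssh.com AAAA", "demo", state_dir)
```

Because the command runs through a shell, the string in `tunnels.yaml` should be treated as trusted configuration; anything that can edit the file can run arbitrary commands as the bridge user.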
### TTL-aware cert refresh
After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to
determine `cert_expires_at`. It schedules a pre-emptive cert refresh
(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
failure, no reconnect backoff triggered.
If `cert_command` is absent, no TTL logic runs.
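A minimal sketch of the expiry parsing and the five-minute refresh check follows. It assumes the validity line printed by `ssh-keygen -L` has the form `Valid: from <start> to <end>` with ISO-like timestamps (or `Valid: forever` for non-expiring certs); the function names and the sample text are illustrative, and timezone handling is left out, so the caller is assumed to compare against a clock in the same timezone as the printed timestamps.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Assumed line shape: "Valid: from 2026-03-28T10:00:00 to 2026-03-28T14:00:00"
VALID_RE = re.compile(r"^\s*Valid:\s+from\s+(\S+)\s+to\s+(\S+)\s*$", re.MULTILINE)
REFRESH_MARGIN = timedelta(minutes=5)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Extract the end of the validity window, or None if there is none."""
    match = VALID_RE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever", or output in an unexpected format
    return datetime.fromisoformat(match.group(2))


def refresh_due(now: datetime, expires_at: datetime) -> bool:
    """True once we are within REFRESH_MARGIN of expiry."""
    return now >= expires_at - REFRESH_MARGIN


# Illustrative excerpt, not captured from a real certificate.
SAMPLE = """\
        Key ID: "agt-state-hub-bridge"
        Serial: 0
        Valid: from 2026-03-28T10:00:00 to 2026-03-28T14:00:00
"""

expires_at = parse_cert_expiry(SAMPLE)
```

Returning `None` for certs without an end date lets the caller skip the refresh logic for them in the same way it does for static keys.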
### Actor type model
`actor_class: str # "human" | "automation"` is replaced by:
```python
from enum import Enum

class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline
```
Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
canonical values.
Config validation: if `actor` name is set, it must start with the prefix matching its type
(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
SIEM auditability.
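The legacy mapping and the prefix rule could be sketched as below. The helper names `parse_actor_type` and `validate_actor_name` and the local `ConfigError` class are assumptions for illustration; how the real loader treats a legacy-class actor whose name lacks the new prefix is not specified here, so the two checks are kept as separate functions.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


class ConfigError(ValueError):
    """Stand-in for the project's config error type."""


# One-way migration aid for the old actor_class values.
LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    """Map a config value to ActorType, warning on legacy values."""
    if raw in LEGACY_CLASSES:
        mapped = LEGACY_CLASSES[raw]
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {mapped.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return mapped
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor class: {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Hard error when the actor name lacks the prefix of its type."""
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
```

For example, `validate_actor_name("agt-state-hub-bridge", ActorType.AGT)` passes silently, while the same name checked against `ActorType.ATM` raises `ConfigError`.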
---
## Tasks
### T1 — ActorType enum
```task
id: BRIDGE-WP-0004-T1
state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
status: done
priority: high
```
- [x] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
- [x] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
  `ActorType.ATM` with a `DeprecationWarning`; reject unknown values
- [x] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
`atm-*` for ATM; raise `ConfigError` on mismatch
- [x] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
- [x] Update tests
### T2 — cert_command config field
```task
id: BRIDGE-WP-0004-T2
state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
status: done
priority: high
```
- [x] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
- [x] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
content (shell-level freedom intentional)
- [x] Document in config example / SCOPE.md
### T3 — Cert acquisition in manager
```task
id: BRIDGE-WP-0004-T3
state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
status: done
priority: high
```
- [x] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
  - If `cfg.cert_command` is None: return None (static key mode)
  - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
  - Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
  - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
- [x] `build_ssh_command`: accept optional `cert_path`; when set, insert
  `-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
- [x] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
  so every reconnect gets a fresh cert
### T4 — cert_identity in audit log
```task
id: BRIDGE-WP-0004-T4
state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
status: done
priority: high
```
- [x] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
extract `Key ID` (the `-I` value from signing time)
- [x] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
JSON entry when present
- [x] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
- [x] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
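As an illustration of the T4 items, the sketch below pulls the `Key ID` out of `ssh-keygen -L` style text and attaches it to an audit entry only when present. The function names, the dictionary shape of the entry, and the sample text are assumptions; the real `AuditLogger.log()` signature may differ.

```python
import re
from typing import Optional

# ssh-keygen -L prints the identity as: Key ID: "<value given to -I>"
KEY_ID_RE = re.compile(r'^\s*Key ID:\s+"(.*)"\s*$', re.MULTILINE)


def parse_cert_identity(keygen_output: str) -> Optional[str]:
    """Return the certificate's Key ID, or None if the line is missing."""
    match = KEY_ID_RE.search(keygen_output)
    return match.group(1) if match else None


def audit_entry(event: str, tunnel: str, cert_identity: Optional[str] = None) -> dict:
    """Build a JSON-ready entry; cert_identity is metadata, not a new event."""
    entry = {"event": event, "tunnel": tunnel}
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return entry


# Illustrative excerpt, not captured from a real certificate.
SAMPLE = """\
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Key ID: "agt-state-hub-bridge"
        Serial: 0
"""

identity = parse_cert_identity(SAMPLE)
connected = audit_entry("BRIDGE_CONNECTED", "state-hub-coulombcore", identity)
```

Omitting the key entirely in static key mode keeps existing log consumers unaffected, which matches the acceptance criterion that `cert_identity` is absent when no cert was used.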
### T5 — TTL-aware cert refresh
```task
id: BRIDGE-WP-0004-T5
state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
status: done
priority: high
```
- [x] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp
from `ssh-keygen -L` output → `cert_expires_at: datetime`
- [x] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
on each iteration
- [x] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
next iteration)
- [x] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
`AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
- [x] If `cert_command` is absent, skip all TTL logic entirely
### T6 — `bridge cert-status` command
```task
id: BRIDGE-WP-0004-T6
state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
status: done
priority: medium
```
- [x] `cli.py`: add `cert-status [TUNNEL]` subcommand
- [x] For each tunnel (or the named one): read cert file from state dir if present,
run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
time-to-expiry (or "static key / no cert" if absent)
- [x] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
- [x] `--json` flag for machine-readable output
### T7 — CertAcquisitionError handling
```task
id: BRIDGE-WP-0004-T7
state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
status: done
priority: high
```
- [x] New exception `CertAcquisitionError` in `models.py`
- [x] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
(cert failures are transient — e.g., Vault briefly unreachable)
- [x] After `max_attempts` consecutive cert failures, transition to `FAILED` state
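The retry behaviour in T7 can be sketched as a loop that counts consecutive cert failures and gives up after `max_attempts`. Everything here is hypothetical scaffolding: the function `run_with_retries`, the exponential schedule, and the `base_delay` / `max_delay` values are assumptions, since the workplan only says that "normal backoff" applies.

```python
from typing import Callable


class CertAcquisitionError(RuntimeError):
    """Stand-in for the exception added in models.py."""


def run_with_retries(
    acquire: Callable[[], None],
    max_attempts: int,
    sleep: Callable[[float], None],
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> str:
    """Return "CONNECTED" on success, "FAILED" after max_attempts failures."""
    failures = 0
    while failures < max_attempts:
        try:
            acquire()
            return "CONNECTED"
        except CertAcquisitionError:
            failures += 1
            if failures < max_attempts:
                # Assumed schedule: 1s, 2s, 4s, ... capped at max_delay.
                sleep(min(base_delay * 2 ** (failures - 1), max_delay))
    return "FAILED"
```

Passing `sleep` in as a parameter (for example `time.sleep` in production) keeps the loop testable without real waiting.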
### T8 — SCOPE.md and documentation updates
```task
id: BRIDGE-WP-0004-T8
state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
status: done
priority: medium
```
- [x] Update `SCOPE.md`: Current State updated to reflect completion; directive alignment done
- [x] `wiki/OpsBridgeFrs.md` §5.7 already covers actor attribution abstractly — no changes needed
- [x] `.claude/rules/architecture.md` already documents cert_command mode and actor vocab
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency (deferred)
### T9 — Tests
```task
id: BRIDGE-WP-0004-T9
state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
status: done
priority: high
```
- [x] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
cert_command parse
- [x] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
args; verify `CertAcquisitionError` on non-zero exit; TTL logic helpers
- [x] `test_audit.py`: `cert_identity` field; actor_type rename
- [x] `test_cli.py`: `cert-status` exit codes; JSON output shape
- [x] 233 tests, 0 failures
---
## Config Schema — Before / After
### Before
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: ops-agent
    ssh_key: ~/.ssh/id_ed25519
    actor: automation-agent
actors:
  automation-agent:
    class: automation
    description: "state hub bridge agent"
```
### After (static key mode — unchanged behavior)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```
### After (cert_command mode — ops-warden or any CA)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```
---
## Acceptance Criteria
- [x] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
warning only); tunnel behaves identically
- [x] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
- [x] Config with `cert_command` set: SSH process launched with both `-i key` and
`-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
- [x] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
no TTL logic runs
- [x] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
- [x] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
- [x] `bridge cert-status` shows valid cert info; exits 1 on expired cert
- [x] All tests pass: `uv run pytest` (233 passed)
- [x] All lints pass: `uv run ruff check .`

View File

@@ -0,0 +1,194 @@
---
id: BRIDGE-WP-0005
type: workplan
title: "Restart includes remote cleanup (blank-slate recovery)"
domain: infotech
repo: ops-bridge
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0"
---
# BRIDGE-WP-0005 — Restart includes remote cleanup
**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan
`sshd` remote forward on Railiance01 port `18000` blocked
`bridge restart state-hub-railiance01` from producing a working tunnel. Operators
had to discover `bridge maintenance cleanup <tunnel> --restart` separately.
**Operator expectation:** `bridge restart` should mean *operational again* — a
blank-slate recovery — not merely "cycle the local manager PID while a broken
remote listener still holds the port."
## Topology and failure modes (refined)
Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles.
Cleanup policy must respect all of them.
### A. Workstation (laptop WSL) — tunnel **origin**
The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on
remote hosts:
| Remote host | Tunnels (reverse) | Role |
|-------------|-------------------|------|
| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot |
| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot |
| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved |
**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill
local bridge processes without graceful remote SSH teardown. Orphan `sshd`
listeners on **all three remotes** are common after wake — especially
`18000`/`18001` on VPS hosts that activity-core and remote agents depend on.
### B. Haskelseed — also intermittently offline
Haskelseed is not a datacenter VPS; it may be powered down or unreachable on
different networks. The same orphan-forward pattern applies to its reverse ports
when the workstation-side tunnel dies uncleanly.
### C. VPS remotes (coulombcore, railiance01)
Normally always-on. Maintenance reboots clear remote kernel state, but:
- a VPS reboot does **not** fix a workstation that is still in `reconnecting`
with a dead local SSH child;
- when the laptop returns, orphan forwards from the **previous** session may
still block new `-R` binds if the VPS did not reboot.
**Conclusion:** conditional remote cleanup before restart benefits **all reverse
tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already
skips healthy forwards — VPS tunnels with live working forwards are untouched.
### D. Local-direction tunnels — no remote cleanup
`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use
forward mode from workstation to remote services. They do not bind remote reverse
ports for State Hub. **`restart` stays local stop/start only** for these.
## Design (decided)
| Command | Behaviour after this workplan |
|---------|-------------------------------|
| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. |
| `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". |
| `bridge up` | Out of scope here (see T4 optional follow-up). |
Implementation sketch: replace the body of `cli.restart()` with a call to
`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels,
or per-tunnel `cleanup_tunnel` when a single tunnel is named.
Emit the same action summary strings cleanup already uses (`healthy`,
`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran.
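The decision table above can be read as the following control flow. This is a sketch only: the `Tunnel` dataclass, the injected callables, and the `restarted` label for local tunnels are assumptions made for illustration, while `healthy`, `cleaned_and_restarted`, and `error` are the summary strings named in this workplan. The real `cleanup_tunnel` signature is not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tunnel:
    name: str
    direction: str  # "reverse" or "local"


Step = Callable[[Tunnel], None]


def restart_tunnel(
    tunnel: Tunnel,
    is_stale: Callable[[Tunnel], bool],  # plays the role of should_cleanup_tunnel
    clear_remote: Step,                  # removes the orphan remote listener
    stop: Step,
    start: Step,
) -> str:
    """Return an action summary for one tunnel."""
    try:
        if tunnel.direction == "local":
            # Local-direction tunnels: plain stop/start, no remote cleanup.
            stop(tunnel)
            start(tunnel)
            return "restarted"  # hypothetical label for this branch
        if not is_stale(tunnel):
            return "healthy"  # working forward is left untouched
        clear_remote(tunnel)
        stop(tunnel)
        start(tunnel)
        return "cleaned_and_restarted"
    except Exception:
        return "error"
```

Injecting the steps as callables mirrors how the CLI tests can assert that a healthy reverse tunnel triggers no remote kill and that a local tunnel never reaches the cleanup path.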
## Out of scope
- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false
positive during T2).
- Auto-cleanup inside the reconnect backoff loop (stretch — T4).
- Renaming tunnels or changing `tunnels.yaml` host entries.
---
## T1 — Wire restart through cleanup path
```task
id: BRIDGE-WP-0005-T01
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
```
Refactor `bridge/cli.py` `restart()` so reverse tunnels call
`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare
`TunnelManager.stop()` + `start()`.
Requirements:
- Single-tunnel and all-tunnel restart both work.
- Local-direction tunnels keep stop/start only.
- Exit codes: preserve today's semantics where practical; exit non-zero if any
named tunnel ends in `CleanupAction.action == "error"`.
- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`,
etc.), not only "Restarted tunnel".
## T2 — Tests and regression coverage
```task
id: BRIDGE-WP-0005-T02
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
```
Update `tests/test_cli.py`:
- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for
reverse tunnels.
- Add cases: healthy forward (no remote kill), stale forward (remote cleanup
invoked), local-direction tunnel (no cleanup call).
- Reuse mocks from `tests/test_cleanup.py` patterns.
`make test` and `make lint` pass.
## T3 — Operator docs and CLI help
```task
id: BRIDGE-WP-0005-T03
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
```
Document the blank-slate restart contract:
- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down.
- `bridge restart --help` — mention conditional remote stale-forward cleanup.
- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS
maintenance — matching this workplan's topology section.
- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md`
incident note (one line each way).
## T4 — Optional: reconnect-loop hygiene (stretch)
```task
id: BRIDGE-WP-0005-T04
status: cancel
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
```
Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup
once after repeated exit-255 bind failures (laptop wake without operator running
`bridge restart`). Defer unless T1–T3 are done; mark `cancel` if heuristic risk
outweighs benefit.
**Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect
loop risks killing a legitimately healthy orphan forward owned by another session
or operator. `bridge restart` now covers the operator-facing blank-slate path;
nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if
wake-from-sleep reconnect failures remain frequent after a month of observation.
## T5 — Live verification on workstation + VPS
```task
id: BRIDGE-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
```
After T1–T2 ship, verify on real config:
1. **railiance01**: `state-hub-mcp-railiance01` was `reconnecting` with stale
forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached
`connected`.
2. **haskelseed**: not exercised (all tunnels already healthy); Alpine netstat
path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
3. **coulombcore**: `bridge restart state-hub-coulombcore` reported `healthy`,
PID unchanged (4116), forward undisturbed.
State Hub progress logged (2026-06-21). Workplan marked `finished`.

@@ -2,7 +2,7 @@
 id: OPS-WP-0001
 type: workplan
 title: "ops-bridge diagnostics and flow improvements"
-domain: custodian
+domain: infotech
 repo: ops-bridge
 status: done
 owner: claude

@@ -2,13 +2,13 @@
 id: OPS-WP-0002
 type: workplan
 title: "Agent Usability — MCP Registration, Skill, and Worker Orientation"
-domain: custodian
+domain: infotech
 repo: ops-bridge
-status: active
+status: done
 owner: custodian
 topic_slug: custodian
 created: "2026-03-21"
-updated: "2026-03-21"
+updated: "2026-03-26"
 depends_on: OPS-WP-0001
 state_hub_workstream_id: "c195cc40-8be7-462e-be26-a7d6bda34cd5"
 ---
@@ -74,7 +74,7 @@ worker agents:
 ```task
 id: OPS-WP-0002-T01
-status: todo
+status: done
 priority: high
 state_hub_task_id: "27fc6fa1-6d0e-438a-b4a3-c6091931da88"
 ```
@@ -101,7 +101,7 @@ Gate: `bridge_status()` tool callable via SSE on localhost:8002 after
 ```task
 id: OPS-WP-0002-T02
-status: todo
+status: done
 priority: high
 state_hub_task_id: "2216457d-035e-4804-b685-18975f3c6d1f"
 ```
@@ -133,7 +133,7 @@ mcp-http`.
 ```task
 id: OPS-WP-0002-T03
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "4b2e39eb-4585-4e60-ab16-9e7909eced74"
 ```
@@ -178,7 +178,7 @@ identifies and recovers a manually-stopped tunnel.
 ```task
 id: OPS-WP-0002-T04
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "cc64bb07-ea5d-498a-8c14-bb653581efe7"
 ```
@@ -213,9 +213,9 @@ session protocol references bridge status check.
 ## Done Criteria
-- [ ] `make mcp-http` starts the MCP server on port 8002 (SSE)
-- [ ] `bridge_status` and `bridge_check` callable as MCP tools from Claude Code
-- [ ] `ops-bridge` registered in `~/.claude.json` at user scope
-- [ ] `/bridge` skill surfaces tunnel states and recovers a stopped tunnel
-- [ ] Global CLAUDE.md has worker agent bridge protocol
-- [ ] All existing tests pass after T01 changes (`make test`)
+- [x] `make mcp-http` starts the MCP server on port 8002 (SSE)
+- [x] `bridge_status` and `bridge_check` callable as MCP tools from Claude Code
+- [x] `ops-bridge` registered in `~/.claude.json` at user scope
+- [x] `/bridge` skill surfaces tunnel states and recovers a stopped tunnel
+- [x] Global CLAUDE.md has worker agent bridge protocol
+- [x] All existing tests pass after T01 changes (`make test`)