Compare commits

51 Commits

Author SHA1 Message Date
00671f5133 Normalize agent instructions and workplan frontmatter (STATE-WP-0067)
- Align agent files with on-disk workplan prefixes (infer from workplan ids)
- Set workplan domain to registered domain_slug; add topic_slug where applicable
- Repair frontmatter delimiter formatting; migrate legacy task status literals
- Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates
2026-06-22 23:16:27 +02:00
09f2cd4b7a Mark .repo-classification.yaml human-reviewed (CUST-WP-0050 T02)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 11:40:44 +02:00
c3b4fb9d55 Reclassify as tooling (CUST-WP-0050 T02)
Apply the new 'tooling' category (reusable internal tooling/infrastructure)
from the Repo Classification Standard. First-pass agent classification.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 03:06:02 +02:00
fab7409c66 Add repo classification (CUST-WP-0050 T02)
First-pass agent classification per the Repo Classification Standard v1.0
(canon-repo-classification); pending human review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-22 02:44:47 +02:00
1dd664c792 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-21:
  - update .custodian-brief.md for ops-bridge
2026-06-21 20:12:38 +02:00
10c6fdaec9 feat(restart): route reverse tunnels through stale-forward cleanup
bridge restart now means blank-slate recovery: reverse tunnels run
should_cleanup_tunnel and clear orphan remote listeners before reconnecting;
healthy forwards are left running. Local-direction tunnels keep stop/start
only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted,
restarted, error) and exit non-zero on cleanup failure.

Closes BRIDGE-WP-0005.
2026-06-21 20:12:13 +02:00
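The per-tunnel outcome described in this commit can be read as a small decision table. The sketch below is one possible reading, not the repo's code: `TunnelProbe`, `restart_action`, and `exit_code` are hypothetical names, and the mapping from inputs to the four reported actions is an assumption drawn from the commit message.

```python
from dataclasses import dataclass


@dataclass
class TunnelProbe:
    """Hypothetical snapshot of one tunnel, taken before a restart."""
    name: str
    direction: str            # "reverse" or "local"
    forward_healthy: bool     # the forward still carries traffic
    cleanup_ok: bool = True   # clearing orphan remote listeners succeeded


def restart_action(probe: TunnelProbe) -> str:
    """Return one of the four per-tunnel actions named in the commit message."""
    if probe.direction == "local":
        # Local-direction tunnels keep plain stop/start, no remote cleanup.
        return "restarted"
    if probe.forward_healthy:
        # Healthy reverse forwards are left running.
        return "healthy"
    if not probe.cleanup_ok:
        return "error"
    return "cleaned_and_restarted"


def exit_code(actions: list[str]) -> int:
    """Non-zero when any tunnel reported a cleanup failure."""
    return 1 if "error" in actions else 0
```

Keeping the decision separate from the SSH side effects makes the four outcomes easy to unit test.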
8c11acc00c docs(ops-bridge): BRIDGE-WP-0005 restart includes remote cleanup
Add workplan to make bridge restart perform conditional stale-forward
cleanup before start (blank-slate recovery). Refines topology for laptop
workstation origin, intermittently offline haskelseed, and stable VPS
remotes (coulombcore, railiance01). Origin: STATE-WP-0063 tunnel incident.
Registered in State Hub via fix-consistency.
2026-06-21 20:02:18 +02:00
499b8781cc chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-06-21:
  - update .custodian-brief.md for ops-bridge
2026-06-21 20:02:10 +02:00
4e9882909f feat(maintenance): nightly stale SSH forward cleanup at 03:00
Add bridge maintenance cleanup to detect reverse tunnels whose remote
port is bound but no longer forwards (zombie sshd sessions), kill the
stale listeners on the remote host, and optionally restart the tunnel.

Includes install-cron/uninstall-cron/show-cron helpers and README notes
for the actcore-state-hub-bridge failure mode we hit on railiance01.
2026-06-19 15:59:27 +02:00
a6857fb8f7 Add credential routing instructions for all agent runtimes
Propagate shared credential-routing section (Codex, Claude, Grok, llm-connect)
from state-hub template via scripts/propagate_credential_routing.py.
2026-06-18 22:48:39 +02:00
675772ab3b Add capability registry scaffold (REUSE-WP-0014-T06 B04) 2026-06-16 01:55:58 +02:00
6eb0b1c52f Fixing bridge to haskelseed 2026-06-14 19:46:06 +02:00
d949f3e93e Refresh agent instruction files 2026-05-18 16:55:47 +02:00
de984736ca feat(cli): add bridge conventions and link from actor errors
Surfaces the actor naming rules (adm-/agt-/atm- prefixes, legacy class
aliases) so users hitting a ConfigError have an in-CLI way to read the
spec without grepping the wiki.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 23:21:37 +02:00
28ecef121e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-15:
  - update .custodian-brief.md for ops-bridge
2026-05-15 12:19:50 +02:00
860c08f1db chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-15:
  - update .custodian-brief.md for ops-bridge
2026-05-15 09:39:01 +02:00
bd169a07e2 feat(directive): implement BRIDGE-WP-0004 AccessManagementDirective alignment
- ActorType enum (adm/agt/atm) replaces actor_class string; config validates
  naming convention (adm-*/agt-*/atm-*) with hard ConfigError on mismatch;
  legacy 'human'/'automation' values accepted with DeprecationWarning
- cert_command: pluggable shell string run before each SSH launch; cert written
  to state dir; -i cert appended to SSH command alongside -i key
- TTL-aware cert refresh: parses Valid-to via ssh-keygen -L; pre-emptive restart
  5 min before expiry (no backoff, no attempt increment); CERT_EXPIRING logged
- CertAcquisitionError: cert failures trigger normal backoff/retry loop
- cert_identity: Key ID parsed from cert and recorded in BRIDGE_CONNECTED event
- bridge cert-status: new CLI command; exit 1 on expired cert; --json flag
- 233 tests passing, ruff clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:38:29 +02:00
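The naming rule in the first bullet of this commit could look roughly like the sketch below. It is illustrative only: `parse_actor_type` and `validate_actor` are hypothetical helper names, `ConfigError` here is a local stand-in for the repo's class, and the mapping of the legacy `human`/`automation` values onto `adm`/`atm` is an assumption that the commit message does not state.

```python
import warnings
from enum import Enum


class ConfigError(ValueError):
    """Local stand-in for the repo's ConfigError."""


class ActorType(Enum):
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


# Assumed mapping for the legacy values; the commit does not spell it out.
LEGACY_ALIASES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(value: str) -> ActorType:
    """Accept adm/agt/atm, and legacy values with a DeprecationWarning."""
    if value in LEGACY_ALIASES:
        warnings.warn(
            f"actor type {value!r} is deprecated", DeprecationWarning, stacklevel=2
        )
        return LEGACY_ALIASES[value]
    try:
        return ActorType(value)
    except ValueError:
        raise ConfigError(f"unknown actor type: {value!r}") from None


def validate_actor(name: str, actor_type: ActorType) -> None:
    """Raise ConfigError unless the name carries the prefix of its type."""
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
```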
22601ef3e6 chore(workplans): sync BRIDGE-WP-0004 and WARDEN-WP-0001 tasks to state hub
Both workplans had been registered as active workstreams but tasks were
never ingested — the markdown checkbox format was invisible to the
consistency checker, which requires task code blocks. Activated both
workplans (draft→active) and added task blocks with state_hub_task_id
for all 19 tasks (9 + 10).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:29:51 +02:00
569de1497c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-06:
  - update .custodian-brief.md for ops-bridge
2026-05-06 04:24:17 +02:00
fafd04ed2e chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-06:
  - update .custodian-brief.md for ops-bridge
2026-05-06 02:41:26 +02:00
c1d87b47df Added INTENT.md file 2026-05-02 23:17:22 +02:00
204bf48bc8 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-01:
  - update .custodian-brief.md for ops-bridge
2026-05-01 23:22:08 +02:00
595c495f7c chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-01:
  - update .custodian-brief.md for ops-bridge
2026-05-01 23:07:50 +02:00
90eda27a14 Scope update from repo-scoping refactor 2026-05-01 12:28:27 +02:00
1361727e15 Added untracked workplans 2026-04-25 17:06:05 +02:00
18e3c118dd chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-04-21:
  - update .custodian-brief.md for ops-bridge
2026-04-21 02:14:25 +02:00
621de64ee0 chore: merge origin/main — reconcile divergent branches
Integrates remote changes (session protocol, .custodian-brief.md, MCP
SSE/HTTP mode, workplan OPS-WP-0002 completion) with local changes
(AccessManagementDirective alignment, architecture docs, BRIDGE-WP-0004
and WARDEN-WP-0001 workplans).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 01:05:11 +00:00
f3a7236c5d docs: align architecture and scope with AccessManagementDirective
Expands architecture constraints and SCOPE.md to reflect the three-actor
vocabulary (adm/agt/atm), two credential modes (static key + cert_command),
and ops-warden boundary. Adds directive wiki doc and two new workplans
(BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 00:59:38 +00:00
4f3c8646b3 feat(mcp): SSE/HTTP mode, workplan OPS-WP-0002 done
- Add --http flag to MCP server for SSE transport on port 8002
- Add make mcp-http / mcp-stop targets
- Pin fastmcp<3.1.0 to stabilize dependency
- Update session-protocol: Step 0 tunnel health check before orient
- Mark OPS-WP-0002 and all its tasks done

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 14:10:49 +01:00
431beef31b chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-03-26:
  - update .custodian-brief.md for ops-bridge
2026-03-26 22:46:07 +01:00
1c7c6eedf8 chore(session): read .custodian-brief.md before MCP call in session init
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-26 17:48:52 +01:00
75a559780e New workplan 2026-03-21 15:27:02 +01:00
d73b7be45d docs(workplan): OPS-WP-0002 — agent usability via MCP registration and /bridge skill
Plan to make ops-bridge fully usable by worker agents:
- T01: SSE transport mode + make mcp-http target
- T02: register in ~/.claude.json at user scope
- T03: /bridge global slash command skill
- T04: worker agent bridge protocol in global CLAUDE.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 15:15:42 +01:00
a55c685f89 feat(diagnostics): end-to-end tunnel check, stale state detection, MCP extensions
- diagnostics.py: TunnelCheckResult with SSH process liveness, port
  probe, and optional API health check; check_tunnel / check_all_tunnels
- cli.py: bridge status shows LIVE column and [STALE] marker when state
  says connected but PID is dead; bridge check wired to diagnostics
- state.py: read_raw_pid helper; _pid_alive exported for reuse
- capabilities.py: capabilities registry stubs
- mcp_server/server.py: expose check_tunnel and tunnel capabilities
  over MCP
- SCOPE.md: rapid orientation document
- workplans/OPS-WP-0001-diagnostics.md: workplan backing this feature
- tests: 207 passing (test_cli, test_mcp, test_diagnostics)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 15:07:47 +01:00
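The stale-state check mentioned for `bridge status` usually comes down to probing the recorded PID. A minimal POSIX-only sketch follows, assuming the standard signal-0 technique; `pid_alive` and `status_label` are illustrative names, and the repo's `_pid_alive` and its exact label format may differ.

```python
import os
from typing import Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def status_label(recorded_state: str, pid: Optional[int]) -> str:
    """Append a stale marker when state says connected but the PID is gone."""
    if recorded_state == "connected" and (pid is None or not pid_alive(pid)):
        return "connected [STALE]"
    return recorded_state
```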
bebd542a2e feat(tunnel): add direction field — support local (-L) port forwards
Previously build_ssh_command only generated -R (reverse) tunnels.
The k3s API tunnel needs -L (local forward: workstation:16443 →
CoulombCore:6443) so kubectl can reach the cluster API directly.

- TunnelConfig.direction: "reverse" (default) | "local"
- config.py: parse direction from YAML, validate allowed values
- manager.py: choose -R or -L flag based on direction

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-21 13:41:55 +01:00
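The flag selection described here can be sketched as a lookup on `direction`. This is a simplified illustration, not the repo's `build_ssh_command`: `forward_args` and its parameters are hypothetical, and the `127.0.0.1` target in the usage note is an assumption about how the k3s forward is configured.

```python
def forward_args(
    direction: str, listen_port: int, target_host: str, target_port: int
) -> list[str]:
    """Build the ssh port-forward arguments for one tunnel.

    "reverse" maps to -R (listen on the remote side) and "local" maps to -L
    (listen on the workstation). Both flags take port:host:hostport.
    """
    flags = {"reverse": "-R", "local": "-L"}
    if direction not in flags:
        raise ValueError(f"direction must be 'reverse' or 'local', got {direction!r}")
    return [flags[direction], f"{listen_port}:{target_host}:{target_port}"]
```

For the k3s case above, `forward_args("local", 16443, "127.0.0.1", 6443)` yields `["-L", "16443:127.0.0.1:6443"]`.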
30bbaf303d docs: add SCOPE.md for rapid orientation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-17 23:10:39 +01:00
101244bd1d refactor(docs): split CLAUDE.md into scoped rules files under .claude/rules/
Each concern (identity, session protocol, workplan convention, stack,
architecture, repo boundary) now lives in its own file with a single
responsibility. CLAUDE.md becomes a thin @-import integrator. Removes
Ralph Loop duplication — global ~/.claude/CLAUDE.md remains authoritative.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 18:11:52 +01:00
6673cb0e48 docs: add server prerequisites and health check gotchas
Document ClientAliveInterval/ClientAliveCountMax requirement on remote
sshd to prevent stale sessions holding ports after reconnect. Document
fail2ban ignoreip setup. Clarify that health_check.url must be a local
port (not the remote forwarded port), and that SSE endpoints block the
health checker.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 02:41:17 +01:00
60c742a456 chore: remove stale repo-seed README.md (README.txt is canonical)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 22:44:33 +01:00
3be41c315e test(BRIDGE-WP-0003): add sentinel self-validation for meta-test + MCP section in README
- Add test_meta_test_catches_missing_mode_gap() — validates Goal #4:
  injects _test_sentinel capability (cli+mcp required), provides only
  a cli mock item, asserts collect_capability_coverage reports the mcp gap.
  Proves the cross-mode gap-detection mechanism is functional.

- Add MCP INTEGRATION section to README.txt (T14 requirement): documents
  project-scope .mcp.json, user-scope registration script, skill, and
  direct server invocation.

189 tests, 0 lint errors.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 21:19:58 +01:00
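The gap detection that the sentinel test exercises amounts to a set difference between required and tested (capability, mode) pairs. The sketch below shows that idea under stated assumptions; `coverage_gaps` is a hypothetical function and does not reproduce the signature of `collect_capability_coverage`.

```python
def coverage_gaps(
    required: dict[str, set[str]], tested: set[tuple[str, str]]
) -> set[tuple[str, str]]:
    """Return every required (capability, mode) pair that has no test."""
    needed = {(cap, mode) for cap, modes in required.items() for mode in modes}
    return needed - tested


# Mirrors the sentinel scenario: cli and mcp are required, only cli is tested.
gaps = coverage_gaps(
    {"_test_sentinel": {"cli", "mcp"}},
    {("_test_sentinel", "cli")},
)
```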
d4b5854483 chore: add Makefile with test, lint, and install targets
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 11:38:23 +01:00
365c0d611a feat(BRIDGE-WP-0003): MCP server, /bridge-status skill, cross-mode coverage enforcement
Implements the full BRIDGE-WP-0003 workplan: 188 tests passing, 0 lint errors.

## What's added

**Capability registry** (`src/bridge/capabilities.py`):
- 10 capabilities with required_access_modes (cli/mcp/skill)
- Single source of truth for what OpsBridge does and where

**MCP server** (`src/bridge/mcp_server/server.py`):
- 10 FastMCP tools: bridge_up/down/restart/status/logs + 5 catalog_* tools
- 3 resources: bridge://status, catalog://domains, catalog://targets
- `.mcp.json` for project-scope auto-registration
- `scripts/register_mcp.py` for user-scope machine-global registration

**Skill** (`~/.claude/plugins/ops-bridge/bridge-status.md`):
- /bridge-status: health table with emoji indicators + remediation advice

**Cross-mode test coverage enforcement**:
- `tests/conftest.py`: capability/access_mode marks + collect_capability_coverage()
- `tests/test_mcp.py`: 31 FastMCP in-process client tests (Client(mcp) pattern)
- `tests/test_skill.py`: static skill lint against capability registry
- `tests/test_coverage_completeness.py`: meta-test that fails if any required
  (capability × mode) pair lacks a test; also validates CLI commands and MCP
  tools are registered in the capability registry

**ADR** (`architecture/adr-001-cross-mode-capability-registry.md`):
- Documents the registry pattern and FastMCP 3.x testing approach

Key implementation note: FastMCP 3.x in-process results are in
result.content[0].text (JSON string), not result.data directly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 11:33:16 +01:00
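The implementation note about `result.content[0].text` suggests a small decoding helper in tests. The sketch below uses a stand-in object with the same shape so it runs without FastMCP installed; `decode_tool_result` is a hypothetical name, and the payload is made up for illustration.

```python
import json
from types import SimpleNamespace


def decode_tool_result(result):
    """Decode the JSON string carried by the first content item of a result."""
    return json.loads(result.content[0].text)


# Stand-in for an in-process tool call result; not a real FastMCP object.
fake_result = SimpleNamespace(
    content=[SimpleNamespace(text='{"status": "connected"}')]
)
payload = decode_tool_result(fake_result)
```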
44b5a9426a docs: add BRIDGE-WP-0003 workplan — MCP server, skill, and cross-mode tests
Defines the FastMCP server, /bridge-status skill, capability registry,
and self-validating cross-access-mode test suite for ops-bridge.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 09:36:19 +01:00
af2d419bf6 chore: mark BRIDGE-WP-0001 and BRIDGE-WP-0002 workplans as completed
All 39 tasks marked done; both workstreams updated to completed status
in the State Hub and workplan files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 03:37:32 +01:00
d248f14a9f docs: add README.txt with usage guide and configuration reference
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 03:24:56 +01:00
baee28eda2 chore: add Claude Code project settings
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 02:10:14 +00:00
91d031ae20 feat: implement OpsCatalog extension (BRIDGE-WP-0002)
Adds the OpsCatalog subsystem: a Git-backed YAML catalog of operations
domains, targets, bridges, and actor classes. Includes catalog loader,
cross-reference validator, bridge resolver (inline-first, catalog
fallback), and new CLI commands: `bridge targets`, `bridge targets show`,
`bridge catalog list/validate/show`. Updates `up/down/restart` to resolve
bridge names from the catalog when not defined inline. 142 tests, all green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 02:05:06 +00:00
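The "inline-first, catalog fallback" rule for the bridge resolver can be illustrated with two dictionaries. This is a minimal sketch: `resolve_bridge` and the dictionary shapes are assumptions, not the repo's actual resolver interface.

```python
def resolve_bridge(name: str, inline: dict[str, dict], catalog: dict[str, dict]) -> dict:
    """Look a bridge up in the inline config first, then in the catalog."""
    if name in inline:
        return inline[name]
    if name in catalog:
        return catalog[name]
    raise KeyError(f"bridge {name!r} is defined neither inline nor in the catalog")
```

Checking the inline definitions first lets a local config override a catalog entry of the same name.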
a7eaf59ced feat: implement OpsBridge CLI (BRIDGE-WP-0001)
Full TDD implementation of the `bridge` CLI tool covering all phases
from BRIDGE-WP-0001: project scaffolding, config loading, state
management, audit logging, health checks, tunnel lifecycle manager, and
all CLI commands (up/down/restart/status/logs). 77 tests, all green.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-12 01:40:08 +00:00
2c7c440ea7 docs: add BRIDGE-WP-0002 OpsCatalog extension workplan
7-phase plan covering catalog data models, loader, validator, bridge
resolver (inline-first with catalog fallback), bridge targets and
bridge catalog CLI commands, and integration tests. 16 tasks registered
in Custodian State Hub (workstream bridge-wp-0002). Covers OpsCatalog
FRS FR-1–15 and OpsBridge FRS FR-21–23.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-11 22:00:09 +01:00
1364cbcece docs: add CLAUDE.md improvements and BRIDGE-WP-0001 workplan
- Expand CLAUDE.md with dev commands, architecture overview, and required prefix
- Add workplans/BRIDGE-WP-0001-initial-implementation.md: 8-phase implementation
  plan covering FRS FR-1 to FR-26 (23 tasks registered in Custodian State Hub,
  workstream bridge-wp-0001)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-11 21:53:29 +01:00
482edcd7eb chore: register with Custodian State Hub
Add CLAUDE.md (session protocol, tool boundary, workplan prefix BRIDGE-WP)
and workplans/ directory. Repo registered as ops-bridge under custodian
domain (id: 1bf99f56).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-11 21:34:37 +01:00
77 changed files with 12067 additions and 7 deletions

20
.claude/rules/agents.md Normal file

@@ -0,0 +1,20 @@
## Kaizen Agents
Specialized agent personas available on demand via the state-hub MCP.
**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
Common agents:
| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |
All 17 agents: call `list_kaizen_agents()` for the full list.


@@ -0,0 +1,8 @@
## Architecture
<!-- TODO: Describe the key design decisions and component structure.
Key modules, data flows, external integrations, state machines, etc. -->
## Quick Reference
`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference


@@ -0,0 +1,50 @@
# Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`


@@ -0,0 +1,38 @@
## First Session Protocol
Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.
**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs
**Step 2 — Survey in-progress work**
Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.
**Step 3 — Propose workstreams to Bernd**
Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
roadmap phase. **Wait for approval before creating.**
**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/BRIDGE-WP-NNNN-<slug>.md ← write this first
```
Then register in the hub:
```
create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
```
**Step 5 — Record the setup**
```
add_progress_event(
    summary="First session: structured infotech into N workstreams, M tasks",
    event_type="milestone",
    topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
    detail={"workstreams": [...], "tasks_created": M}
)
```
<!-- Delete or archive this file once past first session -->


@@ -0,0 +1,8 @@
## Repo boundary
This repo owns **ops-bridge** only. It does not own:
<!-- TODO: List what belongs in adjacent repos, e.g.:
- SSH key management → railiance-infra/
- State hub code → state-hub/
-->


@@ -0,0 +1,5 @@
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.
**Domain:** infotech
**Repo slug:** ops-bridge
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a


@@ -0,0 +1,85 @@
## Session Protocol
Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`
**Step 1 — Orient**
Read the offline-safe brief first — it works without a live hub connection:
```bash
cat .custodian-brief.md
```
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
```
If the hub is offline: `cd ~/state-hub && make api`
**Step 2 — Check inbox**
With MCP tools:
```
get_messages(to_agent="ops-bridge", unread_only=True)
```
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
requests before proceeding.
Without MCP tools:
```bash
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
  | python3 -m json.tool
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```
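For agents that prefer Python over `curl`, the same inbox check can be sketched with the standard library. This is an assumption-laden example: `inbox_url` and `fetch_unread` are hypothetical helpers, and the shape of the JSON the hub returns is not specified here, so the result is passed through as-is.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

HUB_URL = "http://127.0.0.1:8000"


def inbox_url(agent: str, unread_only: bool = True, base: str = HUB_URL) -> str:
    """Build the same /messages/ query the curl command uses."""
    query = urlencode({"to_agent": agent, "unread_only": str(unread_only).lower()})
    return f"{base}/messages/?{query}"


def fetch_unread(agent: str, base: str = HUB_URL, timeout: float = 5.0):
    """Fetch unread messages; requires a reachable State Hub."""
    with urlopen(inbox_url(agent, base=base), timeout=timeout) as response:
        return json.load(response)
```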
**Step 3 — Scan workplans**
```bash
ls workplans/
```
For each file with `status: ready`, `active`, or `blocked`, note pending
`wait`/`todo`/`progress` tasks.
**Step 4 — Present brief**
1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:ops-bridge]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
- `alignment_warnings`: flag if active work is not aligned with current goal
4. **Suggested next action** — highest-priority open item
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
If no workstreams: follow First Session Protocol (`first-session.md`).
**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
**Session close:**
With MCP tools:
```
add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
```
Without MCP tools:
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
```
If workplan files were modified, ensure the local copy is up to date first:
```bash
git -C <repo_path> pull --ff-only
cd ~/state-hub && make fix-consistency REPO=ops-bridge
```
For repos where implementation runs on a remote machine (e.g. CoulombCore),
use the combined target which pulls before fixing:
```bash
cd ~/state-hub && make fix-consistency-remote REPO=ops-bridge
```
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
until you pull — intentional to prevent clobbering remote progress.


@@ -0,0 +1,19 @@
## Stack
<!-- TODO: Fill in language, frameworks, and key dependencies -->
- **Language:**
- **Key deps:**
## Dev Commands
```bash
# TODO: Fill in the standard commands for this repo
# Install dependencies
# Run tests
# Lint / type check
# Build / package (if applicable)
```


@@ -0,0 +1,40 @@
## Workplan Convention (ADR-001)
File location: `workplans/BRIDGE-WP-NNNN-<slug>.md`
ID prefix: `BRIDGE-WP-`
Work items originate as files in this repo **before** being registered in the hub.
Canonical workplan/workstream frontmatter statuses are:
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
Use `proposed` for a newly drafted plan, `ready` after review against current
repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.
Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.
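The archive naming rule can be expressed in a few lines. The helper below is a hypothetical sketch (`archived_name` is not part of the repo), and the slug in the usage note is invented for illustration.

```python
from datetime import date


def archived_name(filename: str, completed: date) -> str:
    """Prefix a workplan filename with its completion date as YYMMDD."""
    return f"{completed:%y%m%d}-{filename}"
```

For example, `archived_name("BRIDGE-WP-0005-example.md", date(2026, 6, 21))` gives `260621-BRIDGE-WP-0005-example.md`; the frontmatter id inside the file stays untouched.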
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
directly. Promote anything requiring analysis, design, approval, dependencies, or
multiple planned phases into a normal workplan.
Ecosystem todos from other agents arrive as `[repo:ops-bridge]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.
Task blocks use this shape:
```task
id: BRIDGE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
```
Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.
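A reader for these task blocks could look like the sketch below. It is not the consistency checker's parser: `parse_task_blocks` is a hypothetical name, and the parsing rules (flat `key: value` lines, trailing `#` comments, optional double quotes) are assumptions based on the shape shown above.

```python
import re

FENCE = "`" * 3  # built at runtime so this example contains no nested fence
TASK_BLOCK = re.compile(
    re.escape(FENCE) + r"task\n(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_task_blocks(markdown: str) -> list[dict[str, str]]:
    """Extract every task block from a workplan as a dict of its fields."""
    tasks = []
    for body in TASK_BLOCK.findall(markdown):
        task = {}
        for line in body.splitlines():
            line = line.split(" #", 1)[0].strip()  # drop trailing comments
            if ":" not in line:
                continue
            key, value = line.split(":", 1)
            task[key.strip()] = value.strip().strip('"')
        tasks.append(task)
    return tasks
```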
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->

5
.claude/settings.json Normal file

@@ -0,0 +1,5 @@
{
  "enabledPlugins": {
    "commit-commands@claude-plugins-official": true
  }
}

7
.codex/config.toml Normal file

@@ -0,0 +1,7 @@
[mcp_servers.ops-bridge]
command = "uv"
args = [
    "run",
    "python",
    "src/bridge/mcp_server/server.py",
]

18
.custodian-brief.md Normal file

@@ -0,0 +1,18 @@
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
# Custodian Brief — ops-bridge
**Domain:** custodian
**Last synced:** 2026-06-21 18:12 UTC
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
## Active Workstreams
*(none — repo may need first-session setup)*
---
## MCP Orientation (when available)
If the state-hub MCP server is reachable, call:
`get_domain_summary("custodian")`
This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source.

10
.mcp.json Normal file

@@ -0,0 +1,10 @@
{
  "mcpServers": {
    "ops-bridge": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "python", "src/bridge/mcp_server/server.py"],
      "cwd": "/home/worsch/ops-bridge"
    }
  }
}

26
.repo-classification.yaml Normal file

@@ -0,0 +1,26 @@
# Repo classification (Repo Classification Standard v1.0).
repo_classification:
  standard: Repo Classification Standard
  version: '1.0'
  classified_at: '2026-06-22'
  classified_by: human
  category: tooling
  domain: infotech
  secondary_domains: []
  capability_tags:
  - operations
  - access-control
  - platform
  - observability
  - orchestration
  business_stake:
  - operations
  - technology
  - automation
  business_mechanics:
  - control
  - operation
  - adaptation
  notes: SSH reverse-tunnel lifecycle manager keeping remote environments connected to the
    State Hub. Operational tooling -> product.

219
AGENTS.md Normal file

@@ -0,0 +1,219 @@
# ops-bridge — Agent Instructions
## Repo Identity
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.
**Domain:** infotech
**Repo slug:** ops-bridge
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `BRIDGE-WP-`
---
## State Hub Integration
The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
there is no MCP server for Codex agents.
| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |
### Orient at session start
```bash
# Offline brief — works without hub connection
cat .custodian-brief.md
# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
  | python3 -m json.tool
# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
  | python3 -m json.tool
```
Mark a message read:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```
### Log progress (required at session close)
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{
    "summary": "what was done",
    "event_type": "note",
    "author": "codex",
    "workstream_id": "<uuid>",
    "task_id": "<uuid>"
  }'
```
Omit `workstream_id` / `task_id` when not applicable.
### Update task status
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"status": "progress"}'
# values: wait | todo | progress | done | cancel
```
### Flag a task for human review
```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"needs_human": true, "intervention_note": "reason"}'
```
---
## Session Protocol
**Start:**
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=ops-bridge&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check human-needed tasks: `GET /tasks/?needs_human=true`
**During work:**
- Update task statuses in workplan files as tasks progress
- Record significant decisions via `POST /decisions/`
**Close:**
1. Update workplan file task statuses to reflect progress
2. Log: `POST /progress/` with a summary of what changed
3. Note for the custodian operator: after workplan file changes, run from
`~/state-hub`:
```bash
make fix-consistency REPO=ops-bridge
```
This syncs task status from files into the hub DB.
---
## Credential and access routing
**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.
ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
### Lookup (do this first)
```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```
Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
### Quick routing table
| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
### Anti-patterns (do not do these)
- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat
### Other capabilities (reuse-surface)
Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.
**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->
---
## Workplan Convention (ADR-001)
Work items originate as files in this repo — not in the hub. The hub is a
read/cache/index layer that rebuilds from files.
**File location:** `workplans/BRIDGE-WP-NNNN-<slug>.md`
**Archived location:** finished workplans may move to
`workplans/archived/YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
the completion/archive date; the frontmatter `id` does not change.
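The archive naming rule can be sketched as a small helper. This is illustrative only: `archived_path` is a hypothetical name, not part of the repo tooling, and the slug in the usage line is made up.

```python
from datetime import date
from pathlib import Path


def archived_path(workplan: Path, archived_on: date) -> Path:
    """Return <dir>/archived/YYMMDD-<original file name> for a finished workplan."""
    prefix = archived_on.strftime("%y%m%d")  # two-digit year, month, day
    return workplan.parent / "archived" / f"{prefix}-{workplan.name}"


# Hypothetical workplan file name, used only to show the resulting path.
example = archived_path(Path("workplans/BRIDGE-WP-0005-example.md"), date(2026, 6, 21))
print(example.as_posix())
```

Only the file name gains the date prefix; the frontmatter `id` inside the file is left untouched.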
**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
this only for low-risk work completed directly; create a normal workplan for
anything needing analysis, design, approval, dependencies, or multiple phases.
**Frontmatter:**
```yaml
---
id: BRIDGE-WP-NNNN
type: workplan
title: "..."
domain: infotech
repo: ops-bridge
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>" # written by fix-consistency — do not edit
---
```
Use `proposed` for a new draft, `ready` after review against current repo
state, and `finished` after implementation. `stalled` and `needs_review` are
derived health labels, not frontmatter statuses.
**Task block format** (one per `##` section):
```
## Task Title
` ` `task
id: BRIDGE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
` ` `
Task description text.
```
Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
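A task block body can be checked before committing with a few lines of Python. This is a rough sketch, not the parser used by `fix-consistency`: `parse_task_block` and `validate_task` are hypothetical names, and the sketch only handles flat `key: value` lines.

```python
import re

TASK_STATUSES = {"wait", "todo", "progress", "done", "cancel"}
TASK_PRIORITIES = {"high", "medium", "low"}


def parse_task_block(block: str) -> dict:
    """Parse flat `key: value` lines from the body of a task block."""
    fields = {}
    for line in block.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        match = re.match(r"^([a-z_]+):\s*(.+)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip().strip('"')
    return fields


def validate_task(fields: dict) -> list:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []
    if not fields.get("id"):
        problems.append("missing id")
    if fields.get("status") not in TASK_STATUSES:
        problems.append(f"invalid status: {fields.get('status')!r}")
    if fields.get("priority") not in TASK_PRIORITIES:
        problems.append(f"invalid priority: {fields.get('priority')!r}")
    return problems
```

Legacy status literals that are no longer in the allowed set would surface here as `invalid status` problems.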
To create a new workplan:
1. Write the file following the format above
2. Notify the custodian operator to run `make fix-consistency REPO=ops-bridge`
(or send a message to the hub agent via `POST /messages/`)

12
CLAUDE.md Normal file

@@ -0,0 +1,12 @@
# ops-bridge — Claude Code Instructions
@SCOPE.md
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md

92
INTENT.md Normal file

@@ -0,0 +1,92 @@
# INTENT
## Purpose
This repository exists to provide a **reliable, inspectable, and controllable connectivity layer**
between distributed development, build, test, and execution environments, serving dev and ops personnel, both human and agentic.
Its role is to ensure that remote machines can **consistently and safely “phone home”** without requiring complex network infrastructure or manual intervention.
---
## Primary Utility
The repository provides a **managed SSH reverse tunneling system** that:
* Maintains continuous connectivity between remote systems and a central hub
* Makes connectivity **observable, auditable, and controllable**
* Exposes this capability as both a **CLI tool and an MCP-accessible service**
It transforms raw SSH port-forwarding into a **first-class operational primitive**.
---
## Intended Users
* Human operators (`adm`) managing infrastructure and connectivity
* LLM-based agents (`agt`) requiring stable access to local services
* Deterministic automations (`atm`) coordinating distributed workloads
---
## Strategic Role in the System
This repository acts as the **connectivity backbone** of the custodian ecosystem:
* It enables remote agents and services to participate in a **locally anchored control plane**
* It decouples **execution location** from **control location**
* It supports a **hub-and-spoke topology** where the Custodian State Hub remains central
---
## Strategic Boundaries
This repository is **not** intended to:
* Replace SSH as a general-purpose access mechanism
* Act as a credential authority or security policy engine
* Provide full network virtualization (e.g., VPN, mesh networking)
* Host or orchestrate application workloads
Its responsibility ends at **secure, observable, and managed connectivity via tunnels**.
---
## Design Principles
* **Continuity over convenience**
Connectivity must persist across failures without manual recovery
* **Observability as a first-class concern**
All lifecycle events must be traceable and attributable
* **Actor-aware operations**
Every action is tied to a clearly defined actor type (`adm`, `agt`, `atm`)
* **Pluggable security integration**
Works with both static keys and external certificate authorities without owning them
* **Toolability**
All capabilities should be accessible programmatically (MCP) and operationally (CLI)
---
## Maturity Target
A mature version of this repository should:
* Provide **fully autonomous tunnel lifecycle management** across heterogeneous environments
* Integrate seamlessly with **centralized access control and certificate systems**
* Serve as a **standardized connectivity primitive** across all Custodian-managed systems
* Offer **complete operational transparency** for all connectivity-related actions
* Be robust enough to act as the **default connectivity layer** for distributed agent systems
---
## Stability Note
Changes to this file represent a **deliberate shift in repository purpose or role** within the system architecture.
Such changes should be rare and made with explicit intent.

31
Makefile Normal file

@@ -0,0 +1,31 @@
.DEFAULT_GOAL := help

.PHONY: help setup test lint install mcp-http mcp-stop cron-install-cron cron-uninstall-cron

help: ## List available make targets
	@awk 'BEGIN {FS = ":.*## "}; /^[a-zA-Z0-9_.-]+:.*## / {printf " %-16s %s\n", $$1, $$2}' $(MAKEFILE_LIST)

setup: ## Sync dependencies and install the bridge CLI wrapper
	uv sync --all-groups
	uv tool install -e . --force

test: ## Run the test suite
	uv run pytest

lint: ## Run ruff lint checks
	uv run ruff check .

install: ## Install the bridge CLI wrapper
	uv tool install -e . --force

mcp-http: ## Start MCP server in SSE mode (default port 8002)
	BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http

mcp-stop: ## Stop MCP server running on port 8002
	@lsof -ti:$${BRIDGE_MCP_PORT:-8002} | xargs -r kill -TERM && echo "MCP server stopped" || echo "No MCP server running on port $${BRIDGE_MCP_PORT:-8002}"

cron-install-cron: ## Install 03:00 nightly stale-forward cleanup cron
	bridge maintenance install-cron

cron-uninstall-cron: ## Remove nightly stale-forward cleanup cron
	bridge maintenance uninstall-cron


@@ -1,3 +0,0 @@
# repo-seed
A git repository template to bootstrap coulomb projects from.

318
README.txt Normal file

@@ -0,0 +1,318 @@
ops-bridge
==========
SSH reverse tunnel lifecycle manager. Keeps remote execution environments
(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub
so Claude Code sessions on those machines have full MCP connectivity.
WHAT IT DOES
------------
`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:
- Is identified by a human-readable name (e.g. state-hub-coulombcore)
- Runs as an SSH reverse port-forward: ssh -R remote:127.0.0.1:local host
- Auto-reconnects on drop using exponential backoff
- Optionally runs an HTTP health check to confirm the forwarded service
is actually reachable (not just the SSH process alive)
- Records structured audit events (bridge_started, bridge_connected,
health_check_failed, etc.) to a JSON log per tunnel
Bridge states: stopped -> starting -> connected <-> degraded -> reconnecting
INSTALL
-------
Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).
uv tool install /path/to/ops-bridge
This registers the `bridge` command globally. For development:
cd /path/to/ops-bridge
uv tool install -e .
Verify:
bridge --help
CONFIGURATION
-------------
Config file: ~/.config/bridge/tunnels.yaml
Override with: BRIDGE_CONFIG=/path/to/config.yaml
Minimal example:
tunnels:
  state-hub-coulombcore:
    host: coulombcore.local
    remote_port: 18000
    local_port: 8000
    ssh_user: ubuntu
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-coulombcore
actors:
  agent.claude-coulombcore:
    class: automation
    description: Claude Code agent on CoulombCore
With health check and reconnect policy:
tunnels:
  state-hub-coulombcore:
    host: coulombcore.local
    remote_port: 18000
    local_port: 8000
    ssh_user: ubuntu
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-coulombcore
    health_check:
      url: http://127.0.0.1:8000/health  # LOCAL side of the tunnel, not the remote forwarded port
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0  # 0 = retry forever
      backoff_initial: 5
      backoff_max: 60
actors:
  agent.claude-coulombcore:
    class: automation  # "human" or "automation"
    description: Claude Code agent on CoulombCore
  operator.bernd:
    class: human
    description: Bernd Worsch
Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
Required actor fields: class (must be "human" or "automation")
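The reconnect policy can be pictured with a short sketch. It reuses the config key names backoff_initial, backoff_max and max_attempts, but the doubling factor is an assumption (the README only says "exponential backoff"), and backoff_delays is a hypothetical helper, not the function in manager.py.

```python
from itertools import islice


def backoff_delays(backoff_initial=5, backoff_max=60, max_attempts=0, factor=2):
    """Yield reconnect delays in seconds; max_attempts=0 means retry forever."""
    delay = backoff_initial
    attempt = 0
    while max_attempts == 0 or attempt < max_attempts:
        yield min(delay, backoff_max)
        # Assumed growth rule: multiply by `factor`, capped at backoff_max.
        delay = min(delay * factor, backoff_max)
        attempt += 1


# First six delays with the values from the example config above.
print(list(islice(backoff_delays(), 6)))  # [5, 10, 20, 40, 60, 60]
```

With max_attempts set to 0 the generator never ends, so callers have to slice it or break out once the tunnel is connected again.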
CLI COMMANDS
------------
Lifecycle:
  bridge up [TUNNEL]         Start one tunnel, or all if no name given
  bridge down [TUNNEL]       Stop one tunnel, or all
  bridge restart [TUNNEL]    Restart one tunnel, or all
Observation:
  bridge status              Show all tunnels: state, uptime, last event
  bridge status --json       Machine-readable JSON output
  bridge logs TUNNEL         Tail the audit log for a tunnel
  bridge logs TUNNEL --lines 100 --follow
Examples:
bridge up state-hub-coulombcore
bridge status
bridge logs state-hub-coulombcore --follow
bridge down state-hub-coulombcore
OPSCATALOG EXTENSION (optional)
--------------------------------
If you maintain a Git-backed YAML catalog of your infrastructure, point
bridge at it in your config:
catalog_path: ~/ops-infra/opscatalog/
Catalog layout:
opscatalog/
  domains/
    <domain-id>/
      domain.yaml
      targets/
        <target-id>.yaml
      bridges/
        <bridge-id>.yaml
Then you can use:
  bridge targets [--domain DOMAIN]    List all targets (optionally filtered)
  bridge targets show TARGET_ID       Show full target metadata
  bridge catalog list                 List domains with counts
  bridge catalog validate             Check catalog for consistency errors
  bridge catalog show BRIDGE_ID       Show a catalog bridge's full metadata
Bridges defined in the catalog are resolved the same way as inline tunnels.
Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when
both define the same name.
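The precedence rule above amounts to a dictionary merge. The sketch below is illustrative: resolve_tunnels is a hypothetical name, and both inputs are assumed to be plain dicts keyed by tunnel name.

```python
def resolve_tunnels(inline: dict, catalog: dict) -> dict:
    """Merge tunnel definitions; inline entries override catalog bridges by name."""
    resolved = dict(catalog)  # start from catalog bridges
    resolved.update(inline)   # inline tunnels win on a name clash
    return resolved


catalog = {"state-hub-coulombcore": {"remote_port": 18000, "source": "catalog"}}
inline = {"state-hub-coulombcore": {"remote_port": 18000, "source": "inline"}}
print(resolve_tunnels(inline, catalog)["state-hub-coulombcore"]["source"])  # inline
```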
STATE FILES
-----------
Runtime state is stored in ~/.local/state/bridge/:
  {name}.pid      Manager process ID
  {name}.state    Current bridge state (e.g. "connected")
  {name}.log      Audit log, one JSON object per line
Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir
AUDIT LOG FORMAT
----------------
Each event is one JSON object per line (wrapped here for readability):
{
  "timestamp": "2026-03-12T14:23:01.456789+00:00",
  "tunnel": "state-hub-coulombcore",
  "actor": "agent.claude-coulombcore",
  "actor_type": "automation",
  "event": "bridge_connected"
}
The "detail" and "cert_identity" keys are written only when they are non-empty.
Event types: bridge_started, bridge_connected, bridge_disconnected,
bridge_reconnecting, health_check_failed, health_check_recovered,
bridge_stopped, cert_expiring
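Because the log is line-delimited JSON, a quick summary needs only the standard library. The helper below is a sketch for ad hoc inspection: count_events is a hypothetical name, and it relies only on the "event" key shown above.

```python
import json
from collections import Counter


def count_events(lines) -> Counter:
    """Count audit events by type, skipping blank or malformed lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate partially written lines
        if isinstance(record, dict):
            counts[record.get("event", "unknown")] += 1
    return counts
```

Typical use would be count_events(Path(log_file).read_text().splitlines()) on a tunnel's {name}.log file in the state directory.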
MCP INTEGRATION
---------------
OpsBridge exposes its capabilities as a FastMCP server so Claude Code agents
can call bridge_up(), bridge_status(), catalog_list_targets(), etc. as
first-class MCP tools — no Bash required, structured JSON in/out.
Available tools: bridge_up, bridge_down, bridge_restart, bridge_status,
bridge_logs, catalog_list_targets, catalog_show_target,
catalog_list_domains, catalog_validate, catalog_show_bridge
Available resources: bridge://status, catalog://domains, catalog://targets
Project-scope (auto, inside ops-bridge/):
Already configured in .mcp.json. Claude Code sessions inside this repo
see the tools automatically.
User-scope (machine-global, any repo):
python scripts/register_mcp.py
Human operator skill:
/bridge-status — natural-language tunnel health summary
(skill file: ~/.claude/plugins/ops-bridge/bridge-status.md)
Run the server directly (for debugging):
uv run python src/bridge/mcp_server/server.py
DEVELOPMENT
-----------
uv run pytest Run all tests
uv run pytest tests/test_cli.py -v Run a specific test file
uv run ruff check . Lint
Source layout:
src/bridge/
cli.py Typer CLI (entry point)
models.py Core dataclasses and enums
config.py Config loading from tunnels.yaml
manager.py Tunnel lifecycle (subprocess, reconnect loop)
state.py PID and state file management
audit.py Audit event logging
health.py HTTP health checker (async, httpx)
catalog/ OpsCatalog extension
SERVER PREREQUISITES
--------------------
For reliable auto-reconnect after reboots or network drops, the remote sshd
needs two settings in /etc/ssh/sshd_config:
ClientAliveInterval 30
ClientAliveCountMax 3
Without these, dead SSH sessions hold their remote port forward open (the OS
has not yet cleaned up the socket), so the next reconnect attempt hits
"remote port forwarding failed" and exits with code 255. With ClientAlive
enabled, sshd evicts stale sessions within ~90 seconds and frees the port.
Apply and reload (no disconnect):
sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
sudo sed -i 's/#ClientAliveCountMax 3/ClientAliveCountMax 3/' /etc/ssh/sshd_config
sudo kill -HUP $(cat /run/sshd.pid)
If fail2ban is running on the remote, whitelist the bridge host IP so rapid
reconnect storms (e.g. after a key auth failure) do not trigger a ban.
Add the client IP to ignoreip in /etc/fail2ban/jail.local:
[DEFAULT]
ignoreip = 127.0.0.1/8 ::1 <your-bridge-host-ip>
Then reload: sudo systemctl reload fail2ban
Note: health_check.url must point to a LOCAL port (the local side of the
tunnel), not the remote forwarded port. For a reverse tunnel
(remote_port=18000, local_port=8000), the correct health check URL is
http://127.0.0.1:8000/... — NOT http://127.0.0.1:18000/...
For SSE endpoints (MCP), use a non-streaming endpoint from the same service
(e.g. the state-hub /state/health) since the health checker waits for the
response to complete.

NIGHTLY STALE-FORWARD CLEANUP
------------------------------
When a bridge client dies without tearing down its SSH session, the remote
host can keep port 18000 (etc.) bound to a zombie sshd listener. The port
accepts connections but never forwards them, which breaks in-cluster proxies
such as actcore-state-hub-bridge on railiance01.
Install a 03:00 local-time cron job that probes each reverse tunnel's remote
forward, kills stale listeners when the local service is healthy but the
remote forward is not, and restarts the tunnel:
bridge maintenance install-cron
Manual run:
bridge maintenance cleanup --restart
Inspect or remove the cron entry:
bridge maintenance show-cron
bridge maintenance uninstall-cron
Logs append to ~/.local/state/bridge/cleanup.log
DESIGN NOTES
------------
- No system daemons. Tunnel processes are managed as subprocesses; PIDs
are tracked in ~/.local/state/bridge/.
- Graceful shutdown: SIGTERM to the tunnel manager process allows a clean
exit; SIGKILL follows after 5 seconds if it is unresponsive.
- Actor attribution on every log event (human vs. automation) supports
audit traceability (FRS §5.7).
- SSH command invoked: ssh -N -R remote_port:127.0.0.1:local_port
-i ssh_key ssh_user@host
- ExitOnForwardFailure=yes is set, so SSH exits immediately if the remote
port is already in use. This is intentional — it forces a clean reconnect
rather than silently running without the port forward active.
REPO STRUCTURE
--------------
src/bridge/ Main source
tests/ Test suite
wiki/ PRD, FRS, OpsCatalog specification
workplans/ Custodian State Hub workplan files (BRIDGE-WP-*)
pyproject.toml Build config and dependencies

134
SCOPE.md Normal file

@@ -0,0 +1,134 @@
# SCOPE
> This file helps you quickly understand what this repository is about,
> when it is relevant, and when it is not.
> It is intentionally lightweight and may be incomplete.
---
## One-liner
SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.
---
## Core Idea
Claude Code sessions run locally; the Custodian State Hub API runs locally. Remote machines (Railiance nodes, Temporal workers, Markitect services) need to reach the hub. Ops-bridge manages named SSH reverse tunnels with auto-reconnect, health checks, audit logging, and an MCP server so Claude Code can start/stop/inspect tunnels as tools.
---
## In Scope
- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
- Auto-reconnect with exponential backoff and configurable retry policy
- Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
- Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
works without any CA or external tooling
- **cert_command mode** (optional): pluggable shell command that issues a short-lived
CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
`cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
- PID + state file management in `~/.local/state/bridge/`
- MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
- OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
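The "TTL-aware pre-emptive cert refresh" item can be illustrated with a small predicate. This is a sketch under stated assumptions: `needs_refresh` is a hypothetical name, and the five-minute safety margin is an illustrative value, not one taken from the ops-bridge implementation.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(expires_at: datetime, now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the certificate has expired or expires within `margin`."""
    return now >= expires_at - margin


now = datetime(2026, 6, 21, 12, 0, tzinfo=timezone.utc)
print(needs_refresh(now + timedelta(minutes=3), now))   # True: inside the margin
print(needs_refresh(now + timedelta(minutes=30), now))  # False: still comfortably valid
```

Refreshing before expiry rather than at expiry avoids launching SSH with a certificate that lapses mid-handshake.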
---
## Out of Scope
- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
certs via the `cert_command` interface but never signs anything itself)
- SSH key generation for human admins (self-service: `ssh-keygen`)
- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
- Long-running application hosting on remote machines (port-forward only, not deployment)
- VPN or layer-3 connectivity
- Monitoring/alerting beyond JSON audit logs
- Replacing SSH for general interactive access
---
## Relevant When
- Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
- Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels
- Setting up a new machine in the Railiance ecosystem that must phone home to the hub
- Diagnosing connectivity issues between local hub and remote services
- Checking certificate validity for active tunnels (`bridge cert-status`)
- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials
---
## Not Relevant When
- All work is local (no remote services involved)
- Manually running `ssh -R` is acceptable
- No need for audit tracing of tunnel state changes
---
## Current State
- Status: active (v0.1 core complete; AccessManagementDirective alignment done — BRIDGE-WP-0004)
- Implementation: ~80% — CLI tunneling fully functional, MCP integration working, health
checks and audit logging complete; ActorType enum (adm/agt/atm) enforced; cert_command
mode implemented with TTL-aware refresh and cert_identity audit logging; OpsCatalog
framework present but not yet populated
- Stability: stable tunnel lifecycle; tested under network drops and SSH failures
- Usage: running in lab for daily Railiance/Temporal connectivity
---
## How It Fits
- Upstream dependencies: SSH (system), OpenSSH server on remote hosts
- Downstream consumers: all remote Claude Code agents depend on ops-bridge to reach local hub MCP; activity-core Temporal server reachable via bridge tunnel
- Often used with: the-custodian (health checks point to hub API), activity-core (Temporal port-forwarding)
---
## Terminology
- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
cert_command, cert_identity
- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
- Also known as: "the bridge"
- Potentially confusing: "bridge state" is a tunnel-specific state machine
(stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)
---
## Related / Overlapping
- `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
`cert_command` when short-lived certificates are required
- `activity-core` — Temporal server on remote reached via ops-bridge tunnel
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns
host-side principal deployment (`/etc/ssh/auth_principals/`)
---
## Provided Capabilities
```capability
type: infrastructure
title: SSH reverse tunnel connectivity
description: Named, auto-reconnecting SSH reverse tunnels with health checks and audit logging — keeps remote execution environments continuously connected to the local Custodian State Hub.
keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge]
```
---
## Getting Oriented
- Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
`~/.local/state/bridge/` (PID/state/cert files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
MCP: `bridge_status()`
- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)


@@ -0,0 +1,55 @@
---
id: ADR-001
title: Cross-Mode Capability Registry and Coverage Enforcement
status: accepted
date: 2026-03-12
---
## Context
OpsBridge exposes its operations through three access modes: CLI (`bridge` CLI), MCP server
(FastMCP stdio), and Skills (Claude plugin prompts). As the capability surface grows, there is
no guarantee that a new capability will be implemented consistently across all required modes,
or that tests exist for each mode.
## Decision
Introduce a canonical **Capability Registry** (`src/bridge/capabilities.py`) that:
1. Lists every operation as a `Capability(name, description, required_access_modes)` dataclass.
2. Declares which access modes each capability must support.
3. Is imported by the cross-mode meta-test to enforce complete test coverage.
### Test coverage enforcement
Pytest marks `@pytest.mark.capability(name)` and `@pytest.mark.access_mode(mode)` are placed
on the canonical test for each (capability, mode) pair. `tests/test_coverage_completeness.py`
collects these marks at session scope and fails if any pair required by the registry has no
corresponding test.
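The core of the completeness check is a set difference. The sketch below shows the idea only, and it is not the code in `tests/test_coverage_completeness.py`: `missing_pairs` is a hypothetical helper, and collecting the tested pairs from pytest marks is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    description: str
    required_access_modes: frozenset


def missing_pairs(registry, tested_pairs):
    """Return (capability, mode) pairs the registry requires but no test covers."""
    required = {
        (cap.name, mode)
        for cap in registry
        for mode in cap.required_access_modes
    }
    return sorted(required - set(tested_pairs))


registry = [Capability("bridge_up", "Start one or all tunnels", frozenset({"cli", "mcp"}))]
print(missing_pairs(registry, {("bridge_up", "cli")}))  # [('bridge_up', 'mcp')]
```

A meta-test would then assert that this list is empty and print the missing pairs on failure.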
### FastMCP in-process testing
MCP tools are tested in `tests/test_mcp.py` using `fastmcp.Client(mcp_app)` — an in-process
client that calls tools without spawning a subprocess or opening a network socket. This is the
preferred approach because:
- Tests run in the same process as the server code, so patches/mocks work normally.
- No port allocation, no cleanup, no flakiness from network timeouts.
- FastMCP 3.x returns results via `result.content[0].text` (JSON string) for non-empty
responses, and `result.data` (empty list/dict) when the return value is empty.
### Skill static lint
`tests/test_skill.py` validates skill Markdown files in `~/.claude/plugins/ops-bridge/`:
- Required frontmatter: `name`, `description`.
- Body must reference at least one registered capability name.
- The `bridge_status` skill must reference `bridge_status` and the registry must declare
`skill` as a required mode for that capability.
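The first two lint rules can be sketched with the standard library alone. This is an illustrative sketch, not the code in `tests/test_skill.py`: `parse_frontmatter` and `lint_skill` are hypothetical names, and the parser only understands flat `key: value` frontmatter.

```python
def parse_frontmatter(text: str) -> tuple[dict, str]:
    """Split a skill file into (frontmatter fields, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return fields, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}, text  # no closing delimiter: treat as no frontmatter


def lint_skill(text: str, capability_names) -> list:
    """Return lint problems for one skill file; an empty list means it passes."""
    fields, body = parse_frontmatter(text)
    problems = [
        f"missing frontmatter field: {key}"
        for key in ("name", "description")
        if not fields.get(key)
    ]
    if not any(name in body for name in capability_names):
        problems.append("body references no registered capability")
    return problems
```

The third rule, which ties the `bridge_status` skill to the registry's required modes, needs the registry itself and is omitted here.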
## Consequences
- Every new capability must be added to the registry before or alongside its implementation.
- Every new (capability, mode) pair requires a marked test or the meta-test fails.
- The registry is the single source of truth for "what does OpsBridge do and where".
- Skills must reference capability names by their canonical registry IDs.

40
pyproject.toml Normal file

@@ -0,0 +1,40 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "ops-bridge"
version = "0.1.0"
description = "SSH reverse tunnel lifecycle manager"
requires-python = ">=3.11"
dependencies = [
"typer>=0.12",
"pyyaml>=6.0",
"httpx>=0.27",
"fastmcp>=2.0.0,<3.1.0",
]
[project.scripts]
bridge = "bridge.cli:app"
[tool.hatch.build.targets.wheel]
packages = ["src/bridge"]
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
asyncio_mode = "auto"
markers = [
"capability(name): the bridge capability under test",
"access_mode(mode): access mode being tested (cli, mcp, skill)",
]
[tool.ruff]
line-length = 88
[dependency-groups]
dev = [
"pytest>=8.0",
"pytest-asyncio>=0.23",
"ruff>=0.4",
]

12
registry/README.md Normal file

@@ -0,0 +1,12 @@
# Capability Registry
Markdown-first capability index for federation and reuse planning.
## Authoring
1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
2. Add the row to `indexes/capabilities.yaml`.
3. Run `reuse-surface validate` from a checkout with the CLI installed.
4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.
Federation contract: reuse-surface `docs/RegistryFederation.md`.


@@ -0,0 +1,4 @@
version: 1
updated: '2026-06-16'
domain: helix_forge
capabilities: []

96
scripts/register_mcp.py Normal file

@@ -0,0 +1,96 @@
#!/usr/bin/env python3
"""Register the ops-bridge MCP server at user scope in ~/.claude.json.

Usage:
    python scripts/register_mcp.py [--dry-run]

This script:
  1. Reads the MCP server config from .mcp.json in the repo root.
  2. Calls `claude mcp add-json -s user ops-bridge <config>` to register.
  3. Patches the `cwd` field in ~/.claude.json (claude mcp add-json silently drops it).

After running, all Claude Code sessions on this machine have access to the
`ops-bridge` MCP tools — even when opened outside the ops-bridge repo directory.
"""
from __future__ import annotations

import argparse
import json
import subprocess
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).parent.parent
MCP_JSON = REPO_ROOT / ".mcp.json"
CLAUDE_JSON = Path.home() / ".claude.json"
SERVER_NAME = "ops-bridge"


def load_server_config() -> dict:
    data = json.loads(MCP_JSON.read_text())
    servers = data.get("mcpServers", {})
    if SERVER_NAME not in servers:
        raise SystemExit(f"ERROR: '{SERVER_NAME}' not found in {MCP_JSON}")
    return servers[SERVER_NAME]


def register(config: dict, dry_run: bool) -> None:
    config_json = json.dumps(config)
    cmd = ["claude", "mcp", "add-json", "-s", "user", SERVER_NAME, config_json]
    print(f"→ Running: {' '.join(cmd[:6])} '<config>'")
    if not dry_run:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED:\n{result.stderr}", file=sys.stderr)
            raise SystemExit(1)
        print(f" OK: {result.stdout.strip()}")


def patch_cwd(cwd: str, dry_run: bool) -> None:
    """Patch the cwd field that claude mcp add-json silently drops."""
    if not CLAUDE_JSON.exists():
        print(f"WARNING: {CLAUDE_JSON} not found — skipping cwd patch")
        return
    data = json.loads(CLAUDE_JSON.read_text())
    servers = data.setdefault("mcpServers", {})
    if SERVER_NAME not in servers:
        print(f"WARNING: '{SERVER_NAME}' not found in {CLAUDE_JSON} after registration")
        return
    current_cwd = servers[SERVER_NAME].get("cwd")
    if current_cwd == cwd:
        print(f"→ cwd already correct: {cwd}")
        return
    servers[SERVER_NAME]["cwd"] = cwd
    print(f"→ Patching cwd: {cwd}")
    if not dry_run:
        CLAUDE_JSON.write_text(json.dumps(data, indent=2) + "\n")
        print(" OK")


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--dry-run", action="store_true", help="Show what would be done without making changes")
    args = parser.parse_args()
    if args.dry_run:
        print("[DRY RUN] No changes will be made.\n")
    config = load_server_config()
    cwd = config.get("cwd", str(REPO_ROOT))
    print(f"Registering ops-bridge MCP server from {MCP_JSON}")
    register(config, dry_run=args.dry_run)
    patch_cwd(cwd, dry_run=args.dry_run)
    if not args.dry_run:
        print("\nDone. Restart Claude Code for the changes to take effect.")
    else:
        print("\n[DRY RUN complete]")


if __name__ == "__main__":
    main()

0
src/bridge/__init__.py Normal file

69
src/bridge/audit.py Normal file

@@ -0,0 +1,69 @@
"""Audit logging for OpsBridge lifecycle events."""
from __future__ import annotations
import json
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional
class AuditEvent(str, Enum):
BRIDGE_STARTED = "bridge_started"
BRIDGE_CONNECTED = "bridge_connected"
BRIDGE_DISCONNECTED = "bridge_disconnected"
BRIDGE_RECONNECTING = "bridge_reconnecting"
HEALTH_CHECK_FAILED = "health_check_failed"
HEALTH_CHECK_RECOVERED = "health_check_recovered"
BRIDGE_STOPPED = "bridge_stopped"
CERT_EXPIRING = "cert_expiring"
def _default_state_dir() -> Path:
return Path.home() / ".local" / "state" / "bridge"
class AuditLogger:
def __init__(self, state_dir: Optional[Path] = None):
self._dir = Path(state_dir) if state_dir else _default_state_dir()
def _log_path(self, tunnel: str) -> Path:
return self._dir / f"{tunnel}.log"
def log(
self,
tunnel: str,
event: AuditEvent,
actor: str,
actor_type: str,
detail: str = "",
cert_identity: Optional[str] = None,
) -> None:
self._dir.mkdir(parents=True, exist_ok=True)
entry: Dict[str, Any] = {
"timestamp": datetime.now(timezone.utc).isoformat(),
"tunnel": tunnel,
"actor": actor,
"actor_type": actor_type,
"event": event.value,
}
if detail:
entry["detail"] = detail
if cert_identity:
entry["cert_identity"] = cert_identity
with self._log_path(tunnel).open("a") as f:
f.write(json.dumps(entry) + "\n")
def read_events(self, tunnel: str) -> List[Dict[str, Any]]:
path = self._log_path(tunnel)
if not path.exists():
return []
events = []
for line in path.read_text().splitlines():
line = line.strip()
if line:
try:
events.append(json.loads(line))
except json.JSONDecodeError:
pass
return events
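
The audit log is a JSON Lines file: one JSON object per line, appended on write and parsed tolerantly on read. The sketch below reproduces that round trip without importing the package. `append_event` and `read_events` are stand-ins written for this example, and the tunnel name, actor, and actor type are made-up values.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_path: Path, tunnel: str, event: str, actor: str,
                 actor_type: str, detail: str = "") -> None:
    """Append one audit record as a single JSON line."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tunnel": tunnel,
        "actor": actor,
        "actor_type": actor_type,
        "event": event,
    }
    if detail:  # optional fields are omitted rather than written empty
        entry["detail"] = detail
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")


def read_events(log_path: Path) -> list:
    """Parse every well-formed line and skip corrupt ones."""
    if not log_path.exists():
        return []
    events = []
    for line in log_path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            pass  # a torn or partial write must not break the reader
    return events


def demo() -> list:
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example-tunnel.log"
        append_event(path, "example-tunnel", "bridge_started", "alice", "human")
        with path.open("a") as f:
            f.write("{truncated\n")  # simulate a partially written line
        append_event(path, "example-tunnel", "bridge_stopped", "alice", "human",
                     detail="manual stop")
        return read_events(path)
```

Skipping undecodable lines keeps `bridge logs` usable after a crash mid-write, at the cost of hiding the damaged record.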


@@ -0,0 +1,83 @@
"""Canonical capability registry for OpsBridge.
Every operation that can be invoked via CLI, MCP, or Skill must be listed here.
The cross-mode test suite uses this registry to enforce test coverage parity.
"""
from __future__ import annotations
from dataclasses import dataclass
ACCESS_MODES = frozenset({"cli", "mcp", "skill"})
@dataclass(frozen=True)
class Capability:
name: str
description: str
required_access_modes: frozenset[str]
CAPABILITIES: list[Capability] = [
Capability(
name="bridge_up",
description="Start one or all tunnels",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="bridge_down",
description="Stop one or all tunnels",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="bridge_restart",
description="Restart one or all tunnels",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="bridge_status",
description="Show tunnel status",
required_access_modes=frozenset({"cli", "mcp", "skill"}),
),
Capability(
name="bridge_logs",
description="Tail tunnel audit log",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="catalog_list_targets",
description="List catalog targets",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="catalog_show_target",
description="Show target metadata",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="catalog_list_domains",
description="List catalog domains",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="catalog_validate",
description="Validate catalog consistency",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="catalog_show_bridge",
description="Show bridge metadata",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="bridge_check",
description="End-to-end tunnel diagnostics via SSH: SSH PID alive + remote port listening",
required_access_modes=frozenset({"cli", "mcp"}),
),
Capability(
name="bridge_cert_status",
description="Show certificate status for tunnels using cert_command mode",
required_access_modes=frozenset({"cli"}),
),
]
CAPABILITIES_BY_NAME: dict[str, Capability] = {c.name: c for c in CAPABILITIES}
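
The module docstring says the cross-mode test suite uses this registry to enforce coverage parity. How that suite is written is not shown here; the sketch below is one possible way to derive the gaps, assuming the suite can report which (capability, mode) pairs it exercises. `missing_coverage` and `TESTED` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    description: str
    required_access_modes: frozenset


def missing_coverage(capabilities, tested):
    """Return sorted (capability, mode) pairs that are required but untested."""
    required = {
        (cap.name, mode)
        for cap in capabilities
        for mode in cap.required_access_modes
    }
    return sorted(required - set(tested))


CAPS = [
    Capability("bridge_up", "Start one or all tunnels", frozenset({"cli", "mcp"})),
    Capability("bridge_status", "Show tunnel status", frozenset({"cli", "mcp", "skill"})),
]
# Hypothetical record of the pairs a test suite currently exercises.
TESTED = {("bridge_up", "cli"), ("bridge_up", "mcp"), ("bridge_status", "cli")}
GAPS = missing_coverage(CAPS, TESTED)
```

A parity test would then simply assert that the gap list is empty.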


@@ -0,0 +1,141 @@
"""Catalog loader — walks a catalog directory tree and parses YAML files."""
from __future__ import annotations
import logging
from pathlib import Path
from typing import Any
import yaml
from bridge.catalog.models import (
ActorClass,
Catalog,
CatalogBridge,
CatalogDomain,
CatalogTarget,
)
from bridge.models import HealthCheckConfig, ReconnectPolicy
log = logging.getLogger(__name__)
class CatalogLoadError(Exception):
"""Raised when catalog loading fails."""
def load_catalog(path: Path) -> Catalog:
"""Walk the catalog directory and return a populated Catalog."""
path = Path(path)
if not path.exists():
raise CatalogLoadError(f"Catalog path not found: {path}")
catalog = Catalog()
for yaml_file in sorted(path.rglob("*.yaml")):
_load_file(yaml_file, catalog)
return catalog
def _load_file(path: Path, catalog: Catalog) -> None:
try:
with path.open() as f:
data = yaml.safe_load(f)
except yaml.YAMLError as e:
raise CatalogLoadError(f"Invalid YAML in {path}: {e}") from e
if not isinstance(data, dict):
log.warning("Skipping %s: not a YAML mapping", path)
return
entry_type = data.get("type")
if not entry_type:
log.warning("Skipping %s: no 'type' field", path)
return
try:
if entry_type == "domain":
entry = _parse_domain(data, path)
catalog.domains[entry.id] = entry
elif entry_type == "target":
entry = _parse_target(data, path)
catalog.targets[entry.id] = entry
elif entry_type == "bridge":
entry = _parse_bridge(data, path)
catalog.bridges[entry.id] = entry
elif entry_type == "actor":
entry = _parse_actor(data, path)
catalog.actors[entry.id] = entry
else:
log.warning("Skipping %s: unknown type '%s'", path, entry_type)
except CatalogLoadError:
raise
except Exception as e:
raise CatalogLoadError(f"Error parsing {path}: {e}") from e
def _require(data: dict, field: str, path: Path) -> Any:
if field not in data:
raise CatalogLoadError(f"Missing required field '{field}' in {path}")
return data[field]
def _parse_domain(data: dict, path: Path) -> CatalogDomain:
return CatalogDomain(
id=str(_require(data, "id", path)),
name=str(_require(data, "name", path)),
description=str(data.get("description", "")),
environment=str(data.get("environment", "")),
)
def _parse_target(data: dict, path: Path) -> CatalogTarget:
return CatalogTarget(
id=str(_require(data, "id", path)),
domain=str(_require(data, "domain", path)),
kind=str(_require(data, "kind", path)),
description=str(data.get("description", "")),
reachable_via=list(data.get("reachable_via") or []),
)
def _parse_bridge(data: dict, path: Path) -> CatalogBridge:
health_check = None
if "health_check" in data and data["health_check"]:
hc = data["health_check"]
health_check = HealthCheckConfig(
url=str(_require(hc, "url", path)),
interval_seconds=int(hc.get("interval_seconds", 30)),
timeout_seconds=int(hc.get("timeout_seconds", 5)),
)
reconnect = None
if "reconnect" in data and data["reconnect"]:
r = data["reconnect"]
reconnect = ReconnectPolicy(
max_attempts=int(r.get("max_attempts", 0)),
backoff_initial=int(r.get("backoff_initial", 5)),
backoff_max=int(r.get("backoff_max", 60)),
)
return CatalogBridge(
id=str(_require(data, "id", path)),
domain=str(_require(data, "domain", path)),
target=str(_require(data, "target", path)),
host=str(_require(data, "host", path)),
remote_port=int(_require(data, "remote_port", path)),
local_port=int(_require(data, "local_port", path)),
ssh_user=str(_require(data, "ssh_user", path)),
ssh_key=str(_require(data, "ssh_key", path)),
actor=str(_require(data, "actor", path)),
description=str(data.get("description", "")),
access_method=str(data.get("access_method", "ssh-reverse")),
health_check=health_check,
reconnect=reconnect,
)
def _parse_actor(data: dict, path: Path) -> ActorClass:
return ActorClass(
id=str(_require(data, "id", path)),
actor_class=str(_require(data, "class", path)),
description=str(data.get("description", "")),
)
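
The bridge parser combines three rules: required fields raise a load error naming the file, ports are coerced to `int`, and optional fields fall back to defaults. A dict-based sketch of those rules, without PyYAML, is shown below. `normalize_bridge` is a hypothetical helper and every value in `EXAMPLE` is invented for illustration.

```python
class CatalogLoadError(Exception):
    """Raised when a catalog entry is malformed."""


REQUIRED_BRIDGE_FIELDS = (
    "id", "domain", "target", "host", "remote_port",
    "local_port", "ssh_user", "ssh_key", "actor",
)


def normalize_bridge(data: dict, source: str) -> dict:
    """Validate and coerce one 'type: bridge' mapping."""
    for field in REQUIRED_BRIDGE_FIELDS:
        if field not in data:
            raise CatalogLoadError(f"Missing required field '{field}' in {source}")
    entry = {field: str(data[field]) for field in REQUIRED_BRIDGE_FIELDS}
    # Ports are coerced so a quoted YAML value still becomes an integer.
    entry["remote_port"] = int(data["remote_port"])
    entry["local_port"] = int(data["local_port"])
    entry["description"] = str(data.get("description", ""))
    entry["access_method"] = str(data.get("access_method", "ssh-reverse"))
    return entry


# What yaml.safe_load might return for a bridge file (all values made up).
EXAMPLE = {
    "type": "bridge",
    "id": "db-tunnel",
    "domain": "lab",
    "target": "db",
    "host": "bastion.example.org",
    "remote_port": "15432",
    "local_port": 5432,
    "ssh_user": "ops",
    "ssh_key": "~/.ssh/id_ed25519",
    "actor": "ops-agent",
}
BRIDGE = normalize_bridge(EXAMPLE, "bridges/db-tunnel.yaml")
```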


@@ -0,0 +1,69 @@
"""Domain models for OpsCatalog."""
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from bridge.models import HealthCheckConfig, ReconnectPolicy, TunnelConfig
@dataclass
class CatalogDomain:
id: str
name: str
description: str = ""
environment: str = ""
@dataclass
class CatalogTarget:
id: str
domain: str
kind: str
description: str = ""
reachable_via: List[str] = field(default_factory=list)
@dataclass
class CatalogBridge:
id: str
domain: str
target: str
host: str
remote_port: int
local_port: int
ssh_user: str
ssh_key: str
actor: str
description: str = ""
access_method: str = "ssh-reverse"
health_check: Optional[HealthCheckConfig] = None
reconnect: Optional[ReconnectPolicy] = None
def to_tunnel_config(self) -> TunnelConfig:
return TunnelConfig(
name=self.id,
host=self.host,
remote_port=self.remote_port,
local_port=self.local_port,
ssh_user=self.ssh_user,
ssh_key=self.ssh_key,
actor=self.actor,
reconnect=self.reconnect if self.reconnect is not None else ReconnectPolicy(),
health_check=self.health_check,
)
@dataclass
class ActorClass:
id: str
actor_class: str
description: str = ""
@dataclass
class Catalog:
domains: Dict[str, CatalogDomain] = field(default_factory=dict)
targets: Dict[str, CatalogTarget] = field(default_factory=dict)
bridges: Dict[str, CatalogBridge] = field(default_factory=dict)
actors: Dict[str, ActorClass] = field(default_factory=dict)


@@ -0,0 +1,35 @@
"""Catalog resolver — resolves a bridge name to a TunnelConfig."""
from __future__ import annotations
from typing import Dict, Optional
from bridge.catalog.models import Catalog
from bridge.models import TunnelConfig
class BridgeNotFound(Exception):
"""Raised when a bridge name cannot be resolved from inline config or catalog."""
def resolve(
name: str,
catalog: Optional[Catalog],
inline_tunnels: Dict[str, TunnelConfig],
) -> TunnelConfig:
"""Resolve bridge name to TunnelConfig.
Lookup order:
1. inline_tunnels (from tunnels.yaml) — wins if present
2. catalog bridges — fallback
3. raises BridgeNotFound if neither has the name
"""
if name in inline_tunnels:
return inline_tunnels[name]
if catalog is not None and name in catalog.bridges:
return catalog.bridges[name].to_tunnel_config()
raise BridgeNotFound(
f"Bridge '{name}' not found in inline config"
+ (" or catalog" if catalog is not None else " (no catalog configured)")
)
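
The lookup order in the docstring (inline first, catalog second, error otherwise) can be exercised with plain dictionaries. This is a simplified sketch: the values are strings rather than `TunnelConfig` objects, and the tunnel names are invented.

```python
class BridgeNotFound(Exception):
    """Raised when a name is in neither source."""


def resolve(name, catalog_bridges, inline_tunnels):
    """Inline definitions shadow catalog entries of the same name."""
    if name in inline_tunnels:
        return inline_tunnels[name]
    if catalog_bridges is not None and name in catalog_bridges:
        return catalog_bridges[name]
    raise BridgeNotFound(
        f"Bridge '{name}' not found in inline config"
        + (" or catalog" if catalog_bridges is not None else " (no catalog configured)")
    )


INLINE = {"db-tunnel": "inline definition"}
CATALOG = {"db-tunnel": "catalog definition", "api-tunnel": "catalog definition"}

SHADOWED = resolve("db-tunnel", CATALOG, INLINE)   # present in both sources
FALLBACK = resolve("api-tunnel", CATALOG, INLINE)  # catalog only
```

Letting inline config win means a local `tunnels.yaml` can override a shared catalog entry without editing the catalog.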


@@ -0,0 +1,42 @@
"""Catalog validator — cross-reference checks for catalog consistency."""
from __future__ import annotations
from typing import List
from bridge.catalog.models import Catalog
class ValidationError(Exception):
"""Raised when catalog validation fails (used for programmatic access)."""
def validate_catalog(catalog: Catalog) -> List[str]:
"""Return a list of validation error strings (empty = valid)."""
errors: List[str] = []
for target in catalog.targets.values():
if target.domain not in catalog.domains:
errors.append(
f"Target '{target.id}': domain '{target.domain}' does not exist in catalog"
)
for bridge_id in target.reachable_via:
if bridge_id not in catalog.bridges:
errors.append(
f"Target '{target.id}': reachable_via references unknown bridge '{bridge_id}'"
)
for bridge in catalog.bridges.values():
if bridge.domain not in catalog.domains:
errors.append(
f"Bridge '{bridge.id}': domain '{bridge.domain}' does not exist in catalog"
)
if bridge.target not in catalog.targets:
errors.append(
f"Bridge '{bridge.id}': target '{bridge.target}' does not exist in catalog"
)
if bridge.actor not in catalog.actors:
errors.append(
f"Bridge '{bridge.id}': actor '{bridge.actor}' does not exist in catalog"
)
return errors
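
The validator collects every broken cross-reference instead of stopping at the first one. A reduced sketch using plain dicts in place of the catalog dataclasses is shown below; the catalog contents are invented and `validate` is a stand-in that covers only some of the checks.

```python
def validate(domains, targets, bridges, actors):
    """Return a list of error strings; an empty list means the catalog is consistent."""
    errors = []
    for tid, target in targets.items():
        if target["domain"] not in domains:
            errors.append(f"Target '{tid}': domain '{target['domain']}' does not exist in catalog")
        for bridge_id in target["reachable_via"]:
            if bridge_id not in bridges:
                errors.append(f"Target '{tid}': reachable_via references unknown bridge '{bridge_id}'")
    for bid, bridge in bridges.items():
        if bridge["target"] not in targets:
            errors.append(f"Bridge '{bid}': target '{bridge['target']}' does not exist in catalog")
        if bridge["actor"] not in actors:
            errors.append(f"Bridge '{bid}': actor '{bridge['actor']}' does not exist in catalog")
    return errors


# Made-up catalog with two deliberate mistakes: an unknown bridge and an unknown actor.
ERRORS = validate(
    domains={"lab"},
    targets={"db": {"domain": "lab", "reachable_via": ["db-tunnel", "ghost"]}},
    bridges={"db-tunnel": {"target": "db", "actor": "ops-agent"}},
    actors=set(),
)
```

Returning strings rather than raising lets a CLI print the full list in one pass and choose its own exit code.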

328
src/bridge/cleanup.py Normal file

@@ -0,0 +1,328 @@
"""Nightly maintenance: detect and clear stale SSH remote port forwards."""
from __future__ import annotations
import subprocess
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse, urlunparse
import httpx
from bridge.diagnostics import _remote_port_probe_command, check_tunnel
from bridge.manager import TunnelManager
from bridge.models import TunnelConfig
from bridge.state import StateManager
@dataclass
class CleanupAction:
tunnel: str
action: str # skipped | healthy | cleaned | cleaned_and_restarted | error
detail: str = ""
@dataclass
class CleanupReport:
actions: list[CleanupAction]
@property
def cleaned_count(self) -> int:
return sum(1 for a in self.actions if a.action.startswith("cleaned"))
def remote_forward_health_url(cfg: TunnelConfig) -> Optional[str]:
"""Map the local health_check URL to the remote forwarded port."""
if cfg.health_check is None or cfg.direction == "local":
return None
parsed = urlparse(cfg.health_check.url)
if not parsed.hostname:
return None
netloc = f"{parsed.hostname}:{cfg.remote_port}"
return urlunparse(parsed._replace(netloc=netloc))
def _ssh_base_cmd(cfg: TunnelConfig) -> list[str]:
from pathlib import Path
return [
"ssh",
"-i",
str(Path(cfg.ssh_key).expanduser()),
"-o",
"BatchMode=yes",
"-o",
"ConnectTimeout=10",
"-o",
"StrictHostKeyChecking=accept-new",
f"{cfg.ssh_user}@{cfg.host}",
]
def _run_ssh(cfg: TunnelConfig, remote_command: str, *, timeout: float = 30) -> subprocess.CompletedProcess[str]:
return subprocess.run(
[*_ssh_base_cmd(cfg), remote_command],
capture_output=True,
text=True,
timeout=timeout,
)
def remote_port_listening(cfg: TunnelConfig) -> bool:
proc = _run_ssh(cfg, _remote_port_probe_command(cfg.remote_port), timeout=15)
return proc.stdout.strip() == "ok"
def probe_remote_forward(cfg: TunnelConfig) -> tuple[bool, str]:
    """Return (healthy, detail) for the remote forwarded service."""
    import shlex

    url = remote_forward_health_url(cfg)
    if url is None:
        return True, "no remote health url configured"
    timeout = cfg.health_check.timeout_seconds if cfg.health_check else 5
    # Quote the URL for the remote shell; repr() is Python quoting, not shell quoting.
    remote_cmd = (
        f"curl -sf --max-time {timeout} {shlex.quote(url)} >/dev/null "
        "&& echo ok || echo fail"
    )
    try:
        proc = _run_ssh(cfg, remote_cmd, timeout=timeout + 15)
    except subprocess.TimeoutExpired:
        return False, "remote health probe timed out"
    output = proc.stdout.strip()
    if output == "ok":
        return True, "remote forward healthy"
    if proc.returncode != 0 and proc.stderr.strip():
        return False, proc.stderr.strip()
    return False, "remote forward unhealthy"
def local_service_healthy(cfg: TunnelConfig) -> Optional[bool]:
if cfg.health_check is None:
return None
try:
resp = httpx.get(
cfg.health_check.url,
timeout=cfg.health_check.timeout_seconds,
)
return resp.is_success
except Exception:
return False
def _remote_cleanup_script(port: int) -> str:
return f"""set -eu
port={port}
pids=""
if command -v lsof >/dev/null 2>&1; then
pids=$(sudo -n lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
if [ -z "$pids" ]; then
pids=$(lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
fi
fi
if [ -z "$pids" ] && command -v fuser >/dev/null 2>&1; then
pids=$(fuser -n tcp $port 2>/dev/null | tr -s ' ' '\\n' | grep -E '^[0-9]+$' || true)
fi
if [ -z "$pids" ]; then
echo "no_listeners"
exit 0
fi
echo "killing:$pids"
for pid in $pids; do
kill "$pid" 2>/dev/null || sudo -n kill "$pid" 2>/dev/null || true
done
sleep 1
if ss -tln 2>/dev/null | grep -q ":$port "; then
echo "still_listening"
else
echo "cleared"
fi
"""
def clear_stale_remote_binding(cfg: TunnelConfig) -> tuple[bool, str]:
try:
proc = _run_ssh(cfg, _remote_cleanup_script(cfg.remote_port), timeout=30)
except subprocess.TimeoutExpired:
return False, "remote cleanup timed out"
output = proc.stdout.strip()
if "cleared" in output:
return True, output
if "no_listeners" in output:
return True, "no listeners found"
if "still_listening" in output:
return False, output
detail = output or proc.stderr.strip() or f"exit {proc.returncode}"
return False, detail
def should_cleanup_tunnel(
    cfg: TunnelConfig,
    state_mgr: StateManager,
) -> tuple[bool, str]:
    """Decide whether a reverse tunnel's remote binding looks stale."""
    if cfg.direction == "local":
        return False, "local tunnel"
    if not remote_port_listening(cfg):
        return False, "remote port closed"
    remote_ok, remote_detail = probe_remote_forward(cfg)
    if remote_ok:
        return False, remote_detail
    # From here on the remote port is bound but the forward failed its probe.
    check = check_tunnel(cfg, state_mgr)
    local_ok = local_service_healthy(cfg)
    if local_ok is True:
        # The local service answers, so the fault lies in the forward itself.
        return True, f"stale forward: {remote_detail}"
    if check.ssh_process != "ok" and check.remote_port == "listening":
        # No live SSH client owns the listener: it is an orphan.
        return True, f"orphan forward while ssh {check.ssh_process}: {remote_detail}"
    if check.ssh_process == "ok":
        return True, f"broken forward with live client: {remote_detail}"
    return False, remote_detail
def cleanup_tunnel(
cfg: TunnelConfig,
state_mgr: StateManager,
*,
restart: bool,
) -> CleanupAction:
name = cfg.name
try:
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
if not needed:
return CleanupAction(name, "healthy", reason)
ok, detail = clear_stale_remote_binding(cfg)
if not ok:
return CleanupAction(name, "error", f"cleanup failed: {detail}")
if not restart:
return CleanupAction(name, "cleaned", f"{reason}; {detail}")
mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
was_running = mgr.is_running()
if was_running:
mgr.stop()
mgr.start()
action = "cleaned_and_restarted"
verb = "restarted" if was_running else "started"
return CleanupAction(name, action, f"{reason}; {verb} tunnel; {detail}")
except Exception as exc:
return CleanupAction(name, "error", str(exc))
def restart_tunnel(
cfg: TunnelConfig,
state_mgr: StateManager,
) -> CleanupAction:
"""Restart one tunnel with blank-slate recovery for reverse tunnels."""
if cfg.direction == "local":
mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
mgr.stop()
mgr.start()
return CleanupAction(cfg.name, "restarted", "local tunnel stop/start")
return cleanup_tunnel(cfg, state_mgr, restart=True)
def restart_all_tunnels(
cfg,
state_mgr: StateManager,
) -> list[CleanupAction]:
"""Restart every inline tunnel (reverse via cleanup path, local via stop/start)."""
return [restart_tunnel(tcfg, state_mgr) for tcfg in cfg.tunnels.values()]
def cleanup_all_tunnels(
cfg,
state_mgr: StateManager,
*,
restart: bool,
tunnel_name: Optional[str] = None,
) -> CleanupReport:
tunnels = cfg.tunnels.values()
if tunnel_name is not None:
if tunnel_name not in cfg.tunnels:
raise KeyError(tunnel_name)
tunnels = [cfg.tunnels[tunnel_name]]
actions = [
cleanup_tunnel(tcfg, state_mgr, restart=restart)
for tcfg in tunnels
if tcfg.direction != "local"
]
return CleanupReport(actions=actions)
CRON_MARKER = "# ops-bridge: maintenance cleanup"
CRON_SCHEDULE = "0 3 * * *"
CRON_LOG = "~/.local/state/bridge/cleanup.log"
def build_cron_line() -> str:
bridge_bin = "~/.local/bin/bridge"
return (
f"{CRON_SCHEDULE} BRIDGE_CONFIG=~/.config/bridge/tunnels.yaml "
f"{bridge_bin} maintenance cleanup --restart "
f">> {CRON_LOG} 2>&1 {CRON_MARKER}"
)
def read_installed_cron() -> Optional[str]:
proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
if proc.returncode != 0:
return None
for line in proc.stdout.splitlines():
if CRON_MARKER in line:
return line.strip()
return None
def install_cleanup_cron() -> tuple[bool, str]:
existing = read_installed_cron()
if existing:
return False, f"cron already installed: {existing}"
proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
current = proc.stdout if proc.returncode == 0 else ""
new_line = build_cron_line()
body = current.rstrip("\n")
if body:
body += "\n"
body += new_line + "\n"
write = subprocess.run(
["crontab", "-"],
input=body,
capture_output=True,
text=True,
)
if write.returncode != 0:
return False, write.stderr.strip() or "crontab write failed"
return True, new_line
def uninstall_cleanup_cron() -> tuple[bool, str]:
proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
if proc.returncode != 0:
return False, "no crontab installed"
kept = [
line
for line in proc.stdout.splitlines()
if CRON_MARKER not in line
]
if len(kept) == len(proc.stdout.splitlines()):
return False, "cleanup cron not found"
body = "\n".join(kept).rstrip("\n")
if body:
body += "\n"
write = subprocess.run(
["crontab", "-"],
input=body,
capture_output=True,
text=True,
)
if write.returncode != 0:
return False, write.stderr.strip() or "crontab write failed"
return True, "removed cleanup cron entry"
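
The stale-forward decision in `should_cleanup_tunnel` depends on SSH and HTTP probes, which makes it awkward to read as a truth table. The sketch below restates the same branch order as a pure function over already-collected probe results. `looks_stale` and its parameter names are hypothetical, the detail strings are shortened, and `"dead"` is an assumed example of a non-`"ok"` process status.

```python
from typing import Optional, Tuple


def looks_stale(
    direction: str,
    port_listening: bool,      # result of the first remote port probe
    probe_ok: bool,            # result of the remote health probe
    local_ok: Optional[bool],  # None when no health check is configured
    ssh_process: str,          # "ok" or some other status string
    checked_port: str,         # remote port status from the diagnostics check
) -> Tuple[bool, str]:
    if direction == "local":
        return False, "local tunnel"
    if not port_listening:
        return False, "remote port closed"  # nothing bound, nothing to clear
    if probe_ok:
        return False, "remote forward healthy"
    if local_ok is True:
        return True, "stale forward"
    if ssh_process != "ok" and checked_port == "listening":
        return True, f"orphan forward while ssh {ssh_process}"
    if ssh_process == "ok":
        return True, "broken forward with live client"
    return False, "remote forward unhealthy"


HEALTHY = looks_stale("reverse", True, True, True, "ok", "listening")
STALE = looks_stale("reverse", True, False, True, "ok", "listening")
ORPHAN = looks_stale("reverse", True, False, None, "dead", "listening")
BROKEN = looks_stale("reverse", True, False, False, "ok", "listening")
```

Cleanup is only triggered when the remote port is bound and the probe through it fails; a closed port or a healthy forward is left alone.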

773
src/bridge/cli.py Normal file

@@ -0,0 +1,773 @@
"""CLI for OpsBridge — bridge command."""
from __future__ import annotations
import dataclasses
import json
import os
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Optional
import typer
from bridge.audit import AuditLogger
from bridge.cleanup import (
CleanupAction,
build_cron_line,
cleanup_all_tunnels,
install_cleanup_cron,
read_installed_cron,
restart_all_tunnels,
restart_tunnel,
uninstall_cleanup_cron,
)
from bridge.config import ConfigError, load_config
from bridge.diagnostics import check_all_tunnels, check_tunnel
from bridge.manager import TunnelManager
from bridge.state import StateManager, _pid_alive
app = typer.Typer(
name="bridge",
help="OpsBridge — SSH reverse tunnel lifecycle manager.",
no_args_is_help=True,
)
targets_app = typer.Typer(help="Inspect infrastructure targets from the OpsCatalog.")
catalog_app = typer.Typer(help="Inspect and validate the OpsCatalog.")
maintenance_app = typer.Typer(help="Scheduled maintenance for tunnel hygiene.")
app.add_typer(targets_app, name="targets")
app.add_typer(catalog_app, name="catalog")
app.add_typer(maintenance_app, name="maintenance")
def _state_dir() -> Path:
return Path(os.environ.get("BRIDGE_STATE_DIR", str(Path.home() / ".local" / "state" / "bridge")))
def _load_or_exit():
try:
return load_config()
except ConfigError as e:
typer.echo(f"Error: {e}", err=True)
raise typer.Exit(1)
def _load_catalog_or_exit(cfg):
from bridge.catalog.loader import load_catalog
if cfg.catalog_path is None:
typer.echo("Error: catalog_path not configured in tunnels.yaml", err=True)
raise typer.Exit(1)
try:
return load_catalog(cfg.catalog_path)
except Exception as e:
typer.echo(f"Error loading catalog: {e}", err=True)
raise typer.Exit(1)
def _resolve_tunnel(cfg, name: str):
"""Resolve tunnel name: inline first, then catalog, then error."""
from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import BridgeNotFound, resolve
catalog = None
if cfg.catalog_path is not None:
try:
catalog = load_catalog(cfg.catalog_path)
except Exception:
pass
try:
return resolve(name, catalog=catalog, inline_tunnels=cfg.tunnels)
except BridgeNotFound:
typer.echo(f"Error: tunnel '{name}' not found in config or catalog", err=True)
raise typer.Exit(1)
def _all_tunnel_names(cfg):
"""Return names from inline config (all-tunnels operations use inline only)."""
return list(cfg.tunnels.keys())
# ─── Tunnel lifecycle commands ────────────────────────────────────────────────
@app.command()
def up(
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
):
"""Start one or all tunnels."""
cfg = _load_or_exit()
sd = _state_dir()
if tunnel:
tcfg = _resolve_tunnel(cfg, tunnel)
mgr = TunnelManager(tcfg, state_dir=sd)
if mgr.is_running():
typer.echo(f"Tunnel '{tunnel}' is already running.")
raise typer.Exit(2)
mgr.start()
typer.echo(f"Started tunnel '{tunnel}'.")
else:
names = _all_tunnel_names(cfg)
any_already_running = False
for name in names:
tcfg = cfg.tunnels[name]
mgr = TunnelManager(tcfg, state_dir=sd)
if mgr.is_running():
typer.echo(f"Tunnel '{name}' is already running.")
any_already_running = True
else:
mgr.start()
typer.echo(f"Started tunnel '{name}'.")
if any_already_running and len(names) == 1:
raise typer.Exit(2)
@app.command()
def down(
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
):
"""Stop one or all tunnels."""
cfg = _load_or_exit()
sd = _state_dir()
if tunnel:
tcfg = _resolve_tunnel(cfg, tunnel)
mgr = TunnelManager(tcfg, state_dir=sd)
if not mgr.is_running():
typer.echo(f"Tunnel '{tunnel}' is not running.")
raise typer.Exit(2)
mgr.stop()
typer.echo(f"Stopped tunnel '{tunnel}'.")
else:
names = _all_tunnel_names(cfg)
any_not_running = False
for name in names:
tcfg = cfg.tunnels[name]
mgr = TunnelManager(tcfg, state_dir=sd)
if not mgr.is_running():
typer.echo(f"Tunnel '{name}' is not running.")
any_not_running = True
else:
mgr.stop()
typer.echo(f"Stopped tunnel '{name}'.")
if any_not_running and len(names) == 1:
raise typer.Exit(2)
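def _emit_restart_actions(actions: list[CleanupAction]) -> None:
    any_error = False
    for action in actions:
        # Keep the action and its detail visually separate; omit the suffix when empty.
        suffix = f" ({action.detail})" if action.detail else ""
        typer.echo(f"{action.tunnel}: {action.action}{suffix}")
        if action.action == "error":
            any_error = True
    if any_error:
        raise typer.Exit(1)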
@app.command()
def restart(
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
):
"""Restart one or all tunnels.
Reverse tunnels run conditional remote stale-forward cleanup before
reconnecting; healthy forwards are left running. Local-direction tunnels
use local stop/start only.
"""
cfg = _load_or_exit()
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
if tunnel:
tcfg = _resolve_tunnel(cfg, tunnel)
actions = [restart_tunnel(tcfg, state_mgr)]
else:
actions = restart_all_tunnels(cfg, state_mgr)
_emit_restart_actions(actions)
@app.command()
def status(
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
"""Show status of all tunnels."""
cfg = _load_or_exit()
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
rows = []
for name, tcfg in cfg.tunnels.items():
state = state_mgr.read_state(name)
raw_pid = state_mgr.read_raw_pid(name)
pid_alive_val = _pid_alive(raw_pid) if raw_pid is not None else None
stale = (
state.value in ("connected", "degraded")
and pid_alive_val is not True
)
rows.append({
"tunnel": name,
"state": state.value,
"actor": tcfg.actor,
"host": tcfg.host,
"pid": raw_pid,
"pid_alive": pid_alive_val,
"stale": stale,
"uptime": None,
"health": None,
})
if as_json:
typer.echo(json.dumps(rows, indent=2))
else:
_print_status_table(rows)
def _print_status_table(rows):
if not rows:
typer.echo("No tunnels configured.")
return
def _state_display(row):
s = row["state"]
if row.get("stale"):
s += " [STALE]"
return s
def _live_display(row):
alive = row.get("pid_alive")
if alive is True:
return "yes"
elif alive is False:
return "no"
return "\u2014"
headers = ["TUNNEL", "STATE", "ACTOR", "HOST", "PID", "LIVE"]
col_widths = [
max(len("TUNNEL"), max((len(row["tunnel"]) for row in rows), default=0)),
max(len("STATE"), max((len(_state_display(row)) for row in rows), default=0)),
max(len("ACTOR"), max((len(str(row.get("actor", "") or "")) for row in rows), default=0)),
max(len("HOST"), max((len(str(row.get("host", "") or "")) for row in rows), default=0)),
max(len("PID"), max((len(str(row["pid"] or "")) for row in rows), default=0)),
max(len("LIVE"), max((len(_live_display(row)) for row in rows), default=0)),
]
def _fmt_row(vals):
return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))
typer.echo(_fmt_row(headers))
typer.echo(_fmt_row(["-" * w for w in col_widths]))
for row in rows:
typer.echo(_fmt_row([
row["tunnel"],
_state_display(row),
row["actor"],
row["host"],
str(row["pid"] or ""),
_live_display(row),
]))
@app.command()
def logs(
tunnel: str = typer.Argument(..., help="Tunnel name"),
lines: int = typer.Option(50, "--lines", "-n", help="Number of lines to show"),
follow: bool = typer.Option(False, "--follow", "-f", help="Follow the log"),
):
"""Show audit log for a tunnel."""
cfg = _load_or_exit()
_resolve_tunnel(cfg, tunnel) # validate name
sd = _state_dir()
logger = AuditLogger(state_dir=sd)
events = logger.read_events(tunnel)
if not events:
typer.echo(f"No log entries for tunnel '{tunnel}'.")
return
for entry in events[-lines:]:
ts = entry.get("timestamp", "")
event = entry.get("event", "")
actor = entry.get("actor", "")
detail = entry.get("detail", "")
parts = [ts, event, f"actor={actor}"]
if detail:
parts.append(detail)
typer.echo(" ".join(parts))
if follow:
import time
log_path = sd / f"{tunnel}.log"
try:
with log_path.open() as f:
f.seek(0, 2)
while True:
line = f.readline()
if line:
try:
entry = json.loads(line)
ts = entry.get("timestamp", "")
event = entry.get("event", "")
actor = entry.get("actor", "")
detail = entry.get("detail", "")
parts = [ts, event, f"actor={actor}"]
if detail:
parts.append(detail)
typer.echo(" ".join(parts))
except json.JSONDecodeError:
pass
else:
time.sleep(0.5)
except KeyboardInterrupt:
pass
@app.command()
def check(
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
"""End-to-end diagnostics: verify SSH PID alive and remote port listening."""
cfg = _load_or_exit()
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
if tunnel:
results = [check_tunnel(_resolve_tunnel(cfg, tunnel), state_mgr)]
else:
results = check_all_tunnels(cfg, state_mgr)
if as_json:
typer.echo(json.dumps(
[{**dataclasses.asdict(r), "ok": r.ok} for r in results],
indent=2,
))
else:
_print_check_table(results)
if any(not r.ok for r in results):
raise typer.Exit(1)
def _print_check_table(results):
if not results:
typer.echo("No tunnels configured.")
return
headers = ["TUNNEL", "SSH", "PID", "PORT", "API", "OK"]
rows_data = []
for r in results:
rows_data.append([
r.tunnel,
r.ssh_process,
str(r.pid or ""),
r.remote_port,
r.local_api or "\u2014",
"yes" if r.ok else "no",
])
col_widths = [
max(len(h), max((len(row[i]) for row in rows_data), default=0))
for i, h in enumerate(headers)
]
def _fmt(vals):
return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))
typer.echo(_fmt(headers))
typer.echo(_fmt(["-" * w for w in col_widths]))
for row in rows_data:
typer.echo(_fmt(row))
@app.command("cert-status")
def cert_status(
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
"""Show certificate status for tunnels using cert_command mode."""
cfg = _load_or_exit()
sd = _state_dir()
names = [tunnel] if tunnel else list(cfg.tunnels.keys())
rows = []
any_expired = False
for name in names:
cert_file = sd / f"{name}-cert.pub"
if not cert_file.exists():
rows.append({"tunnel": name, "mode": "static-key", "cert_file": None})
continue
try:
result = subprocess.run(
["ssh-keygen", "-L", "-f", str(cert_file)],
capture_output=True, text=True, check=False,
)
info = {"tunnel": name, "mode": "cert", "cert_file": str(cert_file)}
for line in result.stdout.splitlines():
line = line.strip()
if line.startswith("Key ID:"):
info["key_id"] = line.split(":", 1)[1].strip().strip('"')
elif line.startswith("Valid:"):
parts = line.split()
if len(parts) >= 5 and parts[1] == "from" and parts[3] == "to":
info["valid_from"] = parts[2]
info["valid_until"] = parts[4]
try:
expires = datetime.fromisoformat(parts[4])
now = datetime.now()
remaining = expires - now
if remaining.total_seconds() <= 0:
info["expired"] = True
any_expired = True
else:
info["expired"] = False
mins = int(remaining.total_seconds() // 60)
info["ttl_remaining"] = f"{mins}m"
except ValueError:
pass
rows.append(info)
except FileNotFoundError:
rows.append({"tunnel": name, "mode": "cert", "error": "ssh-keygen not found"})
if as_json:
typer.echo(json.dumps(rows, indent=2))
else:
for row in rows:
mode = row.get("mode", "unknown")
if mode == "static-key":
typer.echo(f"{row['tunnel']} static-key / no cert")
elif "error" in row:
typer.echo(f"{row['tunnel']} ERROR: {row['error']}")
else:
parts = [row["tunnel"]]
if "key_id" in row:
parts.append(f"id={row['key_id']}")
if "valid_from" in row:
parts.append(f"from={row['valid_from']}")
if "valid_until" in row:
parts.append(f"until={row['valid_until']}")
if row.get("expired"):
parts.append("EXPIRED")
elif "ttl_remaining" in row:
parts.append(f"ttl={row['ttl_remaining']}")
typer.echo(" ".join(parts))
if any_expired:
raise typer.Exit(1)
# ─── targets commands ─────────────────────────────────────────────────────────
@targets_app.callback(invoke_without_command=True)
def targets_default(
    ctx: typer.Context,
    domain: Optional[str] = typer.Option(None, "--domain", help="Filter by domain"),
    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
    """List infrastructure targets from the OpsCatalog."""
    if ctx.invoked_subcommand is not None:
        return
    cfg = _load_or_exit()
    cat = _load_catalog_or_exit(cfg)
    rows = []
    for t in cat.targets.values():
        if domain and t.domain != domain:
            continue
        rows.append({
            "domain": t.domain,
            "target": t.id,
            "kind": t.kind,
            "description": t.description,
            "bridges": t.reachable_via,
        })
    if as_json:
        typer.echo(json.dumps(rows, indent=2))
        return
    if not rows:
        typer.echo("No targets found.")
        return
    headers = ["DOMAIN", "TARGET", "KIND", "BRIDGES"]
    # Build the printed cells first so column widths match the output:
    # bridges are shown comma-joined, not as a Python list repr.
    table = [
        [row["domain"], row["target"], row["kind"], ", ".join(row["bridges"])]
        for row in rows
    ]
    col_widths = [
        max(len(h), max(len(cells[i]) for cells in table))
        for i, h in enumerate(headers)
    ]

    def _fmt(vals):
        return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))

    typer.echo(_fmt(headers))
    typer.echo(_fmt(["-" * w for w in col_widths]))
    for cells in table:
        typer.echo(_fmt(cells))
@targets_app.command("show")
def targets_show(
target: str = typer.Argument(..., help="Target ID"),
):
"""Show full metadata for a target."""
cfg = _load_or_exit()
cat = _load_catalog_or_exit(cfg)
if target not in cat.targets:
typer.echo(f"Error: target '{target}' not found in catalog", err=True)
raise typer.Exit(1)
t = cat.targets[target]
typer.echo(f"Target: {t.id}")
typer.echo(f"Domain: {t.domain}")
typer.echo(f"Kind: {t.kind}")
if t.description:
typer.echo(f"Description: {t.description}")
if t.reachable_via:
typer.echo(f"Bridges: {', '.join(t.reachable_via)}")
# Show ops notes from docs/ if available
if cfg.catalog_path:
docs_dir = cfg.catalog_path / "domains" / t.domain / "docs"
if docs_dir.exists():
for md_file in sorted(docs_dir.glob("*.md")):
typer.echo(f"\n--- {md_file.name} ---")
typer.echo(md_file.read_text())
# ─── catalog commands ─────────────────────────────────────────────────────────
@catalog_app.callback(invoke_without_command=True)
def catalog_default(ctx: typer.Context):
"""Inspect and validate the OpsCatalog."""
if ctx.invoked_subcommand is None:
typer.echo(ctx.get_help())
@catalog_app.command("list")
def catalog_list(
    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
    """List all domains with target and bridge counts."""
    cfg = _load_or_exit()
    cat = _load_catalog_or_exit(cfg)
    rows = []
    for domain in cat.domains.values():
        target_count = sum(1 for t in cat.targets.values() if t.domain == domain.id)
        bridge_count = sum(1 for b in cat.bridges.values() if b.domain == domain.id)
        rows.append({
            "domain": domain.id,
            "name": domain.name,
            "environment": domain.environment,
            "targets": target_count,
            "bridges": bridge_count,
        })
    if as_json:
        typer.echo(json.dumps(rows, indent=2))
    else:
        if not rows:
            typer.echo("Catalog is empty.")
            return
        headers = ["DOMAIN", "NAME", "ENV", "TARGETS", "BRIDGES"]
        # Column widths: the wider of each header and its longest cell.
        # (A generic computation that was immediately overwritten has been dropped.)
        col_widths = [
            max(len("DOMAIN"), max((len(r["domain"]) for r in rows), default=0)),
            max(len("NAME"), max((len(r["name"]) for r in rows), default=0)),
            max(len("ENV"), max((len(r["environment"]) for r in rows), default=0)),
            max(len("TARGETS"), max((len(str(r["targets"])) for r in rows), default=0)),
            max(len("BRIDGES"), max((len(str(r["bridges"])) for r in rows), default=0)),
        ]
        def _fmt(vals):
            return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))
        typer.echo(_fmt(headers))
        typer.echo(_fmt(["-" * w for w in col_widths]))
        for row in rows:
            typer.echo(_fmt([
                row["domain"], row["name"], row["environment"],
                str(row["targets"]), str(row["bridges"]),
            ]))
@catalog_app.command("validate")
def catalog_validate():
"""Validate catalog for consistency errors."""
from bridge.catalog.validator import validate_catalog
cfg = _load_or_exit()
cat = _load_catalog_or_exit(cfg)
errors = validate_catalog(cat)
if errors:
typer.echo(f"Catalog has {len(errors)} violation(s):")
for err in errors:
typer.echo(f" - {err}")
raise typer.Exit(1)
else:
typer.echo(f"Catalog OK — {len(cat.domains)} domain(s), {len(cat.targets)} target(s), {len(cat.bridges)} bridge(s).")
@catalog_app.command("show")
def catalog_show(
bridge_id: str = typer.Argument(..., help="Bridge ID"),
):
"""Show full metadata for a bridge."""
cfg = _load_or_exit()
cat = _load_catalog_or_exit(cfg)
if bridge_id not in cat.bridges:
typer.echo(f"Error: bridge '{bridge_id}' not found in catalog", err=True)
raise typer.Exit(1)
b = cat.bridges[bridge_id]
typer.echo(f"Bridge: {b.id}")
typer.echo(f"Domain: {b.domain}")
typer.echo(f"Target: {b.target}")
typer.echo(f"Host: {b.host}")
typer.echo(f"Ports: {b.remote_port} -> {b.local_port}")
typer.echo(f"SSH user: {b.ssh_user}")
typer.echo(f"Actor: {b.actor}")
typer.echo(f"Method: {b.access_method}")
if b.description:
typer.echo(f"Description: {b.description}")
if b.health_check:
typer.echo(f"Health: {b.health_check.url} (every {b.health_check.interval_seconds}s)")
# Domain context
if b.domain in cat.domains:
d = cat.domains[b.domain]
typer.echo(f"\nDomain context: {d.name} [{d.environment}]")
# Target context
if b.target in cat.targets:
t = cat.targets[b.target]
typer.echo(f"Target: {t.description or t.id} ({t.kind})")
_CONVENTIONS_TEXT = """\
Actor Naming Conventions (from AccessManagementDirective.md §2)
Every actor declared under `actors:` in ~/.config/bridge/tunnels.yaml must have
a `class` field, and the actor name must start with the class-specific prefix:
class prefix purpose
----- ------ ------------------------------------------------------------
adm adm- Human operator (interactive shell when needed)
agt agt- LLM-powered autonomous agent (Claude Code, etc.)
atm atm- Deterministic script / cron job / pipeline
Legacy class aliases (deprecated, still accepted with a warning):
human -> adm
automation -> atm
Examples:
adm-bernd: { class: adm, description: Bernd Worsch }
agt-claude-coulombcore: { class: agt, description: Claude Code on CoulombCore }
atm-backup-daily: { class: atm, description: Nightly DB backup }
Full specification:
<ops-bridge repo>/wiki/AccessManagementDirective.md
"""
@maintenance_app.command("cleanup")
def maintenance_cleanup(
tunnel: Optional[str] = typer.Argument(
None,
help="Tunnel name (omit for all reverse tunnels)",
),
restart: bool = typer.Option(
False,
"--restart",
help="Restart tunnels after clearing stale remote bindings",
),
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
"""Clear stale SSH remote port forwards that block tunnel reconnects."""
cfg = _load_or_exit()
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
try:
report = cleanup_all_tunnels(
cfg,
state_mgr,
restart=restart,
tunnel_name=tunnel,
)
except KeyError:
typer.echo(f"Error: tunnel '{tunnel}' not found in config", err=True)
raise typer.Exit(1)
if as_json:
payload = {
"cleaned_count": report.cleaned_count,
"actions": [
{"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
for a in report.actions
],
}
typer.echo(json.dumps(payload, indent=2))
return
if not report.actions:
typer.echo("No reverse tunnels configured.")
return
for action in report.actions:
typer.echo(f"{action.tunnel}: {action.action}{action.detail}")
typer.echo(f"done ({report.cleaned_count} cleaned)")
@maintenance_app.command("install-cron")
def maintenance_install_cron():
"""Install a 03:00 daily cron job for `bridge maintenance cleanup --restart`."""
installed, message = install_cleanup_cron()
if installed:
typer.echo("Installed nightly cleanup cron:")
typer.echo(f" {message}")
else:
typer.echo(message)
raise typer.Exit(2)
@maintenance_app.command("uninstall-cron")
def maintenance_uninstall_cron():
    """Remove the nightly cleanup cron job."""
    removed, message = uninstall_cleanup_cron()
    # The message is printed either way; only the exit code differs.
    typer.echo(message)
    if not removed:
        raise typer.Exit(2)
@maintenance_app.command("show-cron")
def maintenance_show_cron():
"""Show the configured nightly cleanup cron line."""
existing = read_installed_cron()
if existing:
typer.echo(existing)
else:
typer.echo("Nightly cleanup cron is not installed.")
typer.echo("Would install:")
typer.echo(f" {build_cron_line()}")
@app.command()
def conventions():
"""Show the actor naming conventions enforced by tunnels.yaml."""
typer.echo(_CONVENTIONS_TEXT)
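
The `targets` and `catalog list` commands above size each column to the wider of its header and its longest cell, then left-justify every value. A minimal standalone sketch of that layout follows; `format_table` is a hypothetical helper that is not part of this diff, and the two-space column gap is an assumption chosen for illustration.

```python
from typing import List, Sequence


def format_table(headers: Sequence[str], rows: Sequence[Sequence[object]]) -> List[str]:
    """Return header, rule, and data lines with left-justified columns."""
    # Each column is as wide as its header or its longest rendered cell.
    widths = [
        max(len(h), max((len(str(r[i])) for r in rows), default=0))
        for i, h in enumerate(headers)
    ]

    def fmt(vals: Sequence[object]) -> str:
        # Two-space gap is an illustrative choice, not taken from the CLI.
        return "  ".join(str(v).ljust(w) for v, w in zip(vals, widths))

    return [fmt(headers), fmt(["-" * w for w in widths]), *(fmt(r) for r in rows)]


if __name__ == "__main__":
    for line in format_table(["DOMAIN", "TARGETS"], [["prod", 3], ["staging-eu", 12]]):
        print(line)
```

Rendering the cells to strings before measuring keeps the widths consistent with what is printed, which matters for list-valued fields such as a target's bridges.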

src/bridge/config.py Normal file

@@ -0,0 +1,165 @@
"""Config loading for OpsBridge."""
from __future__ import annotations
import os
import warnings
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional
import yaml
from bridge.models import ActorInfo, ActorType, HealthCheckConfig, ReconnectPolicy, TunnelConfig
class ConfigError(Exception):
"""Raised when config is invalid or missing."""
@dataclass
class BridgeConfig:
tunnels: Dict[str, TunnelConfig]
actors: Dict[str, ActorInfo]
catalog_path: Optional[Path] = None
def _default_config_path() -> Path:
return Path.home() / ".config" / "bridge" / "tunnels.yaml"
def load_config() -> BridgeConfig:
"""Load and validate tunnels.yaml. Respects BRIDGE_CONFIG env var."""
path = Path(os.environ.get("BRIDGE_CONFIG", str(_default_config_path())))
if not path.exists():
raise ConfigError(f"Config file not found: {path}")
try:
with path.open() as f:
raw = yaml.safe_load(f)
except yaml.YAMLError as e:
raise ConfigError(f"Invalid YAML in {path}: {e}") from e
if not isinstance(raw, dict):
raise ConfigError(f"Config must be a YAML mapping, got: {type(raw)}")
tunnels = _parse_tunnels(raw.get("tunnels") or {})
actors = _parse_actors(raw.get("actors") or {})
catalog_path = None
if "catalog_path" in raw and raw["catalog_path"]:
catalog_path = Path(os.path.expanduser(str(raw["catalog_path"])))
return BridgeConfig(tunnels=tunnels, actors=actors, catalog_path=catalog_path)
def _parse_tunnels(raw: dict) -> Dict[str, TunnelConfig]:
tunnels = {}
for name, data in raw.items():
if not isinstance(data, dict):
raise ConfigError(f"Tunnel '{name}' must be a mapping")
tunnels[name] = _parse_tunnel(name, data)
return tunnels
def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
required = ["host", "remote_port", "local_port", "ssh_user", "ssh_key", "actor"]
for field in required:
if field not in data:
raise ConfigError(f"Tunnel '{name}' missing required field: {field}")
reconnect = ReconnectPolicy()
if "reconnect" in data and data["reconnect"]:
r = data["reconnect"]
reconnect = ReconnectPolicy(
max_attempts=r.get("max_attempts", 0),
backoff_initial=r.get("backoff_initial", 5),
backoff_max=r.get("backoff_max", 60),
)
health_check = None
if "health_check" in data and data["health_check"]:
hc = data["health_check"]
if "url" not in hc:
raise ConfigError(f"Tunnel '{name}' health_check missing required field: url")
health_check = HealthCheckConfig(
url=hc["url"],
interval_seconds=hc.get("interval_seconds", 30),
timeout_seconds=hc.get("timeout_seconds", 5),
)
direction = str(data.get("direction", "reverse"))
if direction not in ("reverse", "local"):
raise ConfigError(f"Tunnel '{name}' direction must be 'reverse' or 'local', got: {direction!r}")
cert_command = data.get("cert_command") or None
if cert_command is not None:
cert_command = str(cert_command)
return TunnelConfig(
name=name,
host=str(data["host"]),
remote_port=int(data["remote_port"]),
local_port=int(data["local_port"]),
ssh_user=str(data["ssh_user"]),
ssh_key=str(data["ssh_key"]),
actor=str(data["actor"]),
reconnect=reconnect,
health_check=health_check,
direction=direction,
cert_command=cert_command,
)
_LEGACY_CLASS_MAP = {
"human": ActorType.ADM,
"automation": ActorType.ATM,
}
_ACTOR_TYPE_PREFIXES = {
ActorType.ADM: "adm-",
ActorType.AGT: "agt-",
ActorType.ATM: "atm-",
}
def _parse_actor_type(name: str, raw_class: str) -> ActorType:
    # Legacy aliases are still accepted, but warn so configs get migrated.
    if raw_class in _LEGACY_CLASS_MAP:
        warnings.warn(
            f"Actor '{name}': class '{raw_class}' is deprecated; "
            f"use '{_LEGACY_CLASS_MAP[raw_class].value}' instead.",
            DeprecationWarning,
            stacklevel=4,
        )
        return _LEGACY_CLASS_MAP[raw_class]
    try:
        return ActorType(raw_class)
    except ValueError as e:
        # Chain the original error so the traceback shows the failed lookup.
        raise ConfigError(
            f"Actor '{name}' has unknown class '{raw_class}'; "
            f"must be one of: adm, agt, atm (or legacy: human, automation). "
            f"Run `bridge conventions` for the full naming rules."
        ) from e
def _parse_actors(raw: dict) -> Dict[str, ActorInfo]:
actors = {}
for name, data in raw.items():
if not isinstance(data, dict):
raise ConfigError(f"Actor '{name}' must be a mapping")
if "class" not in data:
raise ConfigError(f"Actor '{name}' missing required field: class")
actor_type = _parse_actor_type(name, str(data["class"]))
required_prefix = _ACTOR_TYPE_PREFIXES[actor_type]
if not name.startswith(required_prefix):
raise ConfigError(
f"Actor '{name}' has type '{actor_type.value}' but name must start "
f"with '{required_prefix}' (got '{name}'). "
f"Run `bridge conventions` for the full naming rules."
)
actors[name] = ActorInfo(
name=name,
actor_type=actor_type,
description=str(data.get("description", "")),
)
return actors
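
The actor rules enforced by `_parse_actors` come down to two checks: map a legacy class to its canonical form, then require the matching name prefix. A minimal sketch under stated assumptions: `resolve_actor_class` is a hypothetical helper, classes are plain strings rather than the `ActorType` enum, and `ValueError` stands in for `ConfigError` so the block runs on its own.

```python
from typing import Dict

# Mirrors the prefix and legacy tables in the config module, as plain strings.
PREFIXES: Dict[str, str] = {"adm": "adm-", "agt": "agt-", "atm": "atm-"}
LEGACY_ALIASES: Dict[str, str] = {"human": "adm", "automation": "atm"}


def resolve_actor_class(name: str, raw_class: str) -> str:
    """Return the canonical class for an actor, or raise ValueError."""
    canonical = LEGACY_ALIASES.get(raw_class, raw_class)
    if canonical not in PREFIXES:
        raise ValueError(f"unknown class {raw_class!r}")
    required = PREFIXES[canonical]
    if not name.startswith(required):
        raise ValueError(f"{name!r} must start with {required!r}")
    return canonical


if __name__ == "__main__":
    print(resolve_actor_class("adm-bernd", "adm"))
    print(resolve_actor_class("atm-backup-daily", "automation"))
```

Note that the prefix is checked against the canonical class, so an actor declared with the legacy class `human` still needs an `adm-` name.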

src/bridge/diagnostics.py Normal file

@@ -0,0 +1,146 @@
"""End-to-end tunnel diagnostics for OpsBridge."""
from __future__ import annotations
import socket
import subprocess
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Optional
import httpx
from bridge.models import BridgeState, TunnelConfig
from bridge.state import StateManager, _pid_alive
def _remote_port_probe_command(remote_port: int) -> str:
"""Build a portable remote shell probe for a listening TCP port."""
return (
f"port={remote_port}; "
"if command -v ss >/dev/null 2>&1; then "
"ss -tnlp 2>/dev/null | grep -q \":$port \" && echo ok || echo closed; "
"elif command -v netstat >/dev/null 2>&1; then "
"netstat -tnlp 2>/dev/null | "
"grep -q \"[.:]$port[[:space:]]\" && echo ok || echo closed; "
"else "
"hex=$(printf '%04X' \"$port\"); "
"awk -v p=\":$hex\" "
"'NR > 1 && $4 == \"0A\" && index($2, p) { found = 1 } "
"END { print found ? \"ok\" : \"closed\" }' "
"/proc/net/tcp /proc/net/tcp6 2>/dev/null; "
"fi"
)
def _probe_local_port(local_port: int) -> str:
"""Check whether the local side of an SSH -L tunnel is accepting TCP."""
try:
with socket.create_connection(("127.0.0.1", local_port), timeout=5):
return "listening"
except ConnectionRefusedError:
return "closed"
except socket.timeout:
return "error:timeout"
except OSError as e:
return f"error:{e}"
@dataclass
class TunnelCheckResult:
tunnel: str
ssh_process: str # "ok" | "dead" | "no_pid"
pid: Optional[int]
remote_port: str # "listening" | "closed" | "error:<msg>"
local_api: Optional[str] # "ok" | "error:<msg>" | None
latency_ms: Optional[float]
stale_state: bool # state file says connected but process is dead
@property
def ok(self) -> bool:
return self.ssh_process == "ok" and self.remote_port == "listening"
def check_tunnel(cfg: TunnelConfig, state_mgr: StateManager) -> TunnelCheckResult:
"""Run end-to-end diagnostics for a single tunnel.
Checks SSH PID liveness, remote port listening via SSH probe, and optional
local API health check. Returns a TunnelCheckResult with all findings.
"""
name = cfg.name
# 1. PID liveness
pid = state_mgr.read_raw_pid(name)
if pid is None:
ssh_process = "no_pid"
elif _pid_alive(pid):
ssh_process = "ok"
else:
ssh_process = "dead"
# 2. Stale state: state file says connected/degraded but process is dead
state = state_mgr.read_state(name)
stale_state = (
state in (BridgeState.CONNECTED, BridgeState.DEGRADED)
and ssh_process != "ok"
)
# 3. Port probe: reverse tunnels listen remotely; local tunnels listen here.
if cfg.direction == "local":
remote_port = _probe_local_port(cfg.local_port)
else:
key_path = str(Path(cfg.ssh_key).expanduser())
cmd = [
"ssh",
"-i", key_path,
"-o", "BatchMode=yes",
"-o", "ConnectTimeout=5",
"-o", "StrictHostKeyChecking=accept-new",
f"{cfg.ssh_user}@{cfg.host}",
_remote_port_probe_command(cfg.remote_port),
]
try:
proc = subprocess.run(
cmd,
capture_output=True,
text=True,
timeout=10,
)
output = proc.stdout.strip()
if output == "ok":
remote_port = "listening"
elif output == "closed":
remote_port = "closed"
else:
remote_port = f"error:{proc.stderr.strip() or 'unknown'}"
except subprocess.TimeoutExpired:
remote_port = "error:timeout"
except Exception as e:
remote_port = f"error:{e}"
# 4. Local API health check (optional)
local_api: Optional[str] = None
latency_ms: Optional[float] = None
if cfg.health_check is not None:
try:
t0 = time.monotonic()
resp = httpx.get(cfg.health_check.url, timeout=cfg.health_check.timeout_seconds)
latency_ms = (time.monotonic() - t0) * 1000
local_api = "ok" if resp.is_success else f"error:http_{resp.status_code}"
except Exception as e:
local_api = f"error:{e}"
return TunnelCheckResult(
tunnel=name,
ssh_process=ssh_process,
pid=pid,
remote_port=remote_port,
local_api=local_api,
latency_ms=latency_ms,
stale_state=stale_state,
)
def check_all_tunnels(cfg, state_mgr: StateManager) -> list[TunnelCheckResult]:
"""Run diagnostics for all configured inline tunnels."""
return [check_tunnel(tcfg, state_mgr) for tcfg in cfg.tunnels.values()]
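
When neither `ss` nor `netstat` is available, the remote probe falls back to scanning `/proc/net/tcp` for a socket in state `0A` (LISTEN) whose local address ends in the port as four uppercase hex digits. The same check can be sketched in Python for testing the logic locally; `port_is_listening` and the sample text are hypothetical, and this sketch compares the port field exactly where the awk version uses a substring match.

```python
def port_is_listening(proc_net_tcp: str, port: int) -> bool:
    """Return True if the /proc/net/tcp text shows a LISTEN socket on port."""
    wanted = f"{port:04X}"  # e.g. 8080 -> "1F90"
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        # State 0A is TCP_LISTEN; the port follows the last colon.
        if state == "0A" and local_address.rsplit(":", 1)[-1] == wanted:
            return True
    return False


SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000
   1: 0100007F:0016 0100007F:C350 01 00000000:00000000
"""

if __name__ == "__main__":
    print(port_is_listening(SAMPLE, 8080))
    print(port_is_listening(SAMPLE, 22))
```

In the sample, port 22 appears only on an established connection (state `01`), so it is not reported as listening.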

src/bridge/health.py Normal file

@@ -0,0 +1,31 @@
"""HTTP health checker for OpsBridge."""
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional
import httpx
@dataclass
class HealthResult:
ok: bool
status_code: Optional[int] = None
error: Optional[str] = None
class HealthChecker:
def __init__(self, url: str, timeout_seconds: int = 5):
self._url = url
self._timeout = timeout_seconds
async def check(self) -> HealthResult:
try:
async with httpx.AsyncClient(timeout=self._timeout) as client:
response = await client.get(self._url)
response.raise_for_status()
return HealthResult(ok=True, status_code=response.status_code)
except httpx.HTTPStatusError as e:
return HealthResult(ok=False, status_code=e.response.status_code, error=str(e))
except Exception as e:
return HealthResult(ok=False, error=str(e))
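
A caller that polls `HealthChecker.check()` repeatedly usually wants to react to transitions rather than to every result. A minimal sketch of that edge-triggered pattern, assuming a hypothetical `transitions` helper and event strings invented for illustration; `HealthResult` is redeclared here only so the block runs without `httpx`.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class HealthResult:
    ok: bool
    status_code: Optional[int] = None
    error: Optional[str] = None


def transitions(results: Iterable[HealthResult]) -> List[str]:
    """Emit one event per healthy/unhealthy flip, not one per poll."""
    events: List[str] = []
    failing = False
    for result in results:
        if result.ok and failing:
            failing = False
            events.append("recovered")
        elif not result.ok and not failing:
            failing = True
            detail = result.error or f"HTTP {result.status_code}"
            events.append(f"failed: {detail}")
    return events


if __name__ == "__main__":
    polls = [
        HealthResult(ok=True, status_code=200),
        HealthResult(ok=False, status_code=500),
        HealthResult(ok=False, status_code=500),
        HealthResult(ok=True, status_code=200),
    ]
    print(transitions(polls))
```

Tracking a single `failing` flag keeps repeated failures from producing repeated events, which is useful when each event is written to an audit log.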

src/bridge/manager.py Normal file

@@ -0,0 +1,380 @@
"""Tunnel lifecycle manager for OpsBridge."""
from __future__ import annotations
import logging
import os
import signal
import subprocess
import time
from datetime import datetime, timedelta
from pathlib import Path
from typing import List, Optional
from bridge.audit import AuditEvent, AuditLogger
from bridge.health import HealthChecker
from bridge.models import BridgeState, CertAcquisitionError, TunnelConfig
from bridge.state import StateManager
log = logging.getLogger(__name__)
def _actor_type_from_name(name: str) -> str:
for prefix in ("adm", "agt", "atm"):
if name.startswith(f"{prefix}-"):
return prefix
return "unknown"
def build_ssh_command(cfg: TunnelConfig, cert_path: Optional[Path] = None) -> List[str]:
"""Build the SSH tunnel command (reverse -R or local -L)."""
key = os.path.expanduser(cfg.ssh_key)
if cfg.direction == "local":
forward_flag = ["-L", f"{cfg.local_port}:127.0.0.1:{cfg.remote_port}"]
else:
forward_flag = ["-R", f"{cfg.remote_port}:127.0.0.1:{cfg.local_port}"]
cmd = [
"ssh",
"-N",
*forward_flag,
"-i", key,
]
if cert_path is not None:
cmd += ["-i", str(cert_path)]
cmd += [
"-o", "ServerAliveInterval=10",
"-o", "ServerAliveCountMax=3",
"-o", "ExitOnForwardFailure=yes",
"-o", "StrictHostKeyChecking=accept-new",
f"{cfg.ssh_user}@{cfg.host}",
]
return cmd
def _run_cert_command(cfg: TunnelConfig, state_dir: Path) -> Optional[Path]:
"""Run cert_command and write cert to state dir. Returns cert path or None."""
if cfg.cert_command is None:
return None
result = subprocess.run(
cfg.cert_command,
shell=True,
capture_output=True,
text=True,
)
if result.returncode != 0:
raise CertAcquisitionError(result.stderr.strip())
cert_path = state_dir / f"{cfg.name}-cert.pub"
cert_path.write_text(result.stdout)
return cert_path
def _parse_cert_identity(cert_path: Path) -> Optional[str]:
"""Parse Key ID from ssh-keygen -L output."""
try:
result = subprocess.run(
["ssh-keygen", "-L", "-f", str(cert_path)],
capture_output=True,
text=True,
)
for line in result.stdout.splitlines():
line = line.strip()
if line.startswith("Key ID:"):
return line.split(":", 1)[1].strip().strip('"')
except Exception:
pass
return None
def _parse_cert_expiry(cert_path: Path) -> Optional[datetime]:
"""Parse Valid-before datetime from ssh-keygen -L output."""
try:
result = subprocess.run(
["ssh-keygen", "-L", "-f", str(cert_path)],
capture_output=True,
text=True,
)
for line in result.stdout.splitlines():
line = line.strip()
if line.startswith("Valid:"):
# "Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00"
parts = line.split()
if len(parts) >= 5 and parts[3] == "to":
return datetime.fromisoformat(parts[4])
except Exception:
pass
return None
class TunnelManager:
    """Manages a single named SSH tunnel (reverse -R or local -L).

    start() daemonises: forks a child that runs the reconnect loop, then the
    parent returns immediately after writing the manager PID.
    """
def __init__(self, cfg: TunnelConfig, state_dir: Optional[Path] = None):
self._cfg = cfg
self._state = StateManager(state_dir=state_dir)
self._audit = AuditLogger(state_dir=state_dir)
def get_state(self) -> BridgeState:
return self._state.read_state(self._cfg.name)
def is_running(self) -> bool:
return self._state.is_running(self._cfg.name)
def _actor_info(self):
actor = self._cfg.actor
return actor, _actor_type_from_name(actor)
def _next_backoff(self, attempt: int) -> int:
initial = self._cfg.reconnect.backoff_initial
max_b = self._cfg.reconnect.backoff_max
value = initial * (2 ** attempt)
return min(value, max_b)
def start(self) -> None:
"""Start the tunnel manager as a daemonised subprocess."""
if self.is_running():
log.info("Tunnel %s already running", self._cfg.name)
return
self._state.write_state(self._cfg.name, BridgeState.STARTING)
actor, actor_type = self._actor_info()
self._audit.log(
tunnel=self._cfg.name,
event=AuditEvent.BRIDGE_STARTED,
actor=actor,
actor_type=actor_type,
)
pid = os.fork()
if pid > 0:
# Parent: record manager PID and return
self._state.write_pid(self._cfg.name, pid)
return
# Child: become a daemon
os.setsid()
try:
self._run_loop()
except Exception as e:
log.exception("Tunnel manager loop crashed: %s", e)
finally:
self._state.write_state(self._cfg.name, BridgeState.STOPPED)
self._state.clear_pid(self._cfg.name)
self._audit.log(
tunnel=self._cfg.name,
event=AuditEvent.BRIDGE_STOPPED,
actor=actor,
actor_type=actor_type,
)
os._exit(0)
def stop(self) -> None:
"""Stop the running tunnel manager."""
pid = self._state.read_pid(self._cfg.name)
if pid is None:
self._state.write_state(self._cfg.name, BridgeState.STOPPED)
return
try:
os.kill(pid, signal.SIGTERM)
# Give up to 5 seconds for graceful shutdown
for _ in range(50):
try:
os.kill(pid, 0)
time.sleep(0.1)
except ProcessLookupError:
break
else:
# Force kill if still running
try:
os.kill(pid, signal.SIGKILL)
except ProcessLookupError:
pass
except ProcessLookupError:
pass
self._state.clear_pid(self._cfg.name)
self._state.write_state(self._cfg.name, BridgeState.STOPPED)
actor, actor_type = self._actor_info()
self._audit.log(
tunnel=self._cfg.name,
event=AuditEvent.BRIDGE_STOPPED,
actor=actor,
actor_type=actor_type,
)
def _run_loop(self) -> None:
"""Reconnect loop running in daemon child."""
import asyncio
cfg = self._cfg
actor, actor_type = self._actor_info()
attempt = 0
max_attempts = cfg.reconnect.max_attempts # 0 = infinite
state_dir = self._state._dir
_stop = [False]
def _on_term(signum, frame):
_stop[0] = True
signal.signal(signal.SIGTERM, _on_term)
signal.signal(signal.SIGINT, _on_term)
while not _stop[0]:
if max_attempts > 0 and attempt >= max_attempts:
self._state.write_state(cfg.name, BridgeState.FAILED)
break
# Acquire cert before each SSH launch (T3, T7)
try:
cert_path = _run_cert_command(cfg, state_dir)
except CertAcquisitionError as e:
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.BRIDGE_DISCONNECTED,
actor=actor,
actor_type=actor_type,
detail=f"cert acquisition failed: {e}",
)
attempt += 1
if max_attempts > 0 and attempt >= max_attempts:
self._state.write_state(cfg.name, BridgeState.FAILED)
break
backoff = self._next_backoff(attempt - 1)
self._state.write_state(cfg.name, BridgeState.RECONNECTING)
log.info("Cert acquisition failed, retrying in %ds", backoff)
time.sleep(backoff)
continue
cert_identity = _parse_cert_identity(cert_path) if cert_path else None
cert_expires_at = _parse_cert_expiry(cert_path) if cert_path else None
cmd = build_ssh_command(cfg, cert_path=cert_path)
log.info("Starting SSH: %s", " ".join(cmd))
self._state.write_state(cfg.name, BridgeState.STARTING)
try:
proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
except FileNotFoundError:
self._state.write_state(cfg.name, BridgeState.FAILED)
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.BRIDGE_DISCONNECTED,
actor=actor,
actor_type=actor_type,
detail="ssh binary not found",
)
break
time.sleep(2)
_ttl_refresh = False
if proc.poll() is None:
self._state.write_state(cfg.name, BridgeState.CONNECTED)
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.BRIDGE_CONNECTED,
actor=actor,
actor_type=actor_type,
cert_identity=cert_identity,
)
attempt = 0
def _check_ttl() -> bool:
"""Return True if cert is within 5 min of expiry and SSH should restart."""
if cert_expires_at is None:
return False
return datetime.now() >= cert_expires_at - timedelta(minutes=5)
if cfg.health_check:
checker = HealthChecker(
url=cfg.health_check.url,
timeout_seconds=cfg.health_check.timeout_seconds,
)
health_failing = False
while not _stop[0] and proc.poll() is None:
if _check_ttl():
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.CERT_EXPIRING,
actor=actor,
actor_type=actor_type,
cert_identity=cert_identity,
detail=str(cert_expires_at),
)
proc.terminate()
_ttl_refresh = True
break
result = asyncio.run(checker.check())
if result.ok:
if health_failing:
health_failing = False
self._state.write_state(cfg.name, BridgeState.CONNECTED)
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.HEALTH_CHECK_RECOVERED,
actor=actor,
actor_type=actor_type,
)
else:
if not health_failing:
health_failing = True
self._state.write_state(cfg.name, BridgeState.DEGRADED)
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.HEALTH_CHECK_FAILED,
actor=actor,
actor_type=actor_type,
detail=result.error or f"HTTP {result.status_code}",
)
time.sleep(cfg.health_check.interval_seconds)
else:
while not _stop[0] and proc.poll() is None:
if _check_ttl():
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.CERT_EXPIRING,
actor=actor,
actor_type=actor_type,
cert_identity=cert_identity,
detail=str(cert_expires_at),
)
proc.terminate()
_ttl_refresh = True
break
time.sleep(1)
if _ttl_refresh:
# Planned cert refresh — don't count as failure, no backoff
continue
if proc.poll() is not None:
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.BRIDGE_DISCONNECTED,
actor=actor,
actor_type=actor_type,
detail=f"exit code {proc.returncode}",
)
if _stop[0]:
if proc.poll() is None:
proc.terminate()
break
attempt += 1
backoff = self._next_backoff(attempt - 1)
self._state.write_state(cfg.name, BridgeState.RECONNECTING)
self._audit.log(
tunnel=cfg.name,
event=AuditEvent.BRIDGE_RECONNECTING,
actor=actor,
actor_type=actor_type,
detail=f"retry {attempt}, backoff {backoff}s",
)
log.info("Reconnecting in %ds (attempt %d)", backoff, attempt)
time.sleep(backoff)
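
Two small calculations drive the reconnect loop above: the capped exponential backoff and the decision to refresh a certificate five minutes before it expires. A minimal standalone sketch of both, assuming hypothetical free functions (`next_backoff`, `parse_valid_until`, `needs_refresh`) in place of the manager's methods, and using the 5 s / 60 s fallbacks from `_parse_tunnel` as defaults.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_backoff(attempt: int, initial: int = 5, maximum: int = 60) -> int:
    """Double the delay per attempt (attempt 0 = initial), capped at maximum."""
    return min(initial * (2 ** attempt), maximum)


def parse_valid_until(line: str) -> Optional[datetime]:
    """Extract the end of validity from an ssh-keygen -L 'Valid:' line."""
    parts = line.strip().split()
    # Expected shape: "Valid: from <start> to <end>"; anything else yields None.
    if len(parts) >= 5 and parts[0] == "Valid:" and parts[3] == "to":
        return datetime.fromisoformat(parts[4])
    return None


def needs_refresh(
    expires_at: Optional[datetime],
    now: datetime,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """True once 'now' is inside the margin before expiry."""
    if expires_at is None:
        return False
    return now >= expires_at - margin


if __name__ == "__main__":
    print([next_backoff(a) for a in range(6)])
    until = parse_valid_until("Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00")
    print(until, needs_refresh(until, datetime(2026, 5, 15, 21, 56)))
```

Passing `now` explicitly, rather than calling `datetime.now()` inside the function, makes the refresh rule easy to test without waiting for a real certificate to age.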

@@ -0,0 +1,529 @@
"""OpsBridge MCP server — exposes bridge and catalog operations as FastMCP tools.
Entry point (stdio):
uv run python src/bridge/mcp_server/server.py
The server imports the Python library directly — no subprocess required.
All tool functions return JSON-serialisable dicts/lists.
"""
from __future__ import annotations
import dataclasses
import json
import os
from pathlib import Path
from typing import Optional
from fastmcp import FastMCP
from bridge.diagnostics import check_all_tunnels, check_tunnel
from bridge.state import StateManager
mcp = FastMCP(
name="ops-bridge",
instructions=(
"OpsBridge MCP server. Use bridge_status to check tunnel health, "
"bridge_up/down/restart to manage lifecycle, bridge_logs for audit history. "
"catalog_* tools require catalog_path to be configured in tunnels.yaml."
),
)
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _state_dir() -> Path:
return Path(os.environ.get("BRIDGE_STATE_DIR", str(Path.home() / ".local" / "state" / "bridge")))
def _load_cfg():
from bridge.config import load_config
return load_config()
def _load_cfg_or_error() -> tuple:
"""Return (cfg, None) or (None, error_dict)."""
try:
return _load_cfg(), None
except Exception as e:
return None, {"error": str(e)}
def _load_catalog(cfg):
"""Return (catalog, None) or (None, error_dict)."""
if cfg.catalog_path is None:
return None, {"error": "catalog_path not configured"}
try:
from bridge.catalog.loader import load_catalog
return load_catalog(cfg.catalog_path), None
except Exception as e:
return None, {"error": f"Failed to load catalog: {e}"}
# ---------------------------------------------------------------------------
# Bridge lifecycle tools
# ---------------------------------------------------------------------------
@mcp.tool()
def bridge_up(tunnel: Optional[str] = None) -> dict:
"""Start one or all configured tunnels.
Args:
tunnel: Tunnel name to start. If omitted, starts all inline tunnels.
Returns:
{"started": [...], "already_running": [...]} or {"error": "..."}
"""
cfg, err = _load_cfg_or_error()
if err:
return err
from bridge.manager import TunnelManager
sd = _state_dir()
started = []
already_running = []
if tunnel:
from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import BridgeNotFound, resolve
catalog = None
if cfg.catalog_path is not None:
try:
catalog = load_catalog(cfg.catalog_path)
except Exception:
pass
try:
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
except BridgeNotFound:
return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
mgr = TunnelManager(tcfg, state_dir=sd)
if mgr.is_running():
already_running.append(tunnel)
else:
mgr.start()
started.append(tunnel)
else:
for name, tcfg in cfg.tunnels.items():
mgr = TunnelManager(tcfg, state_dir=sd)
if mgr.is_running():
already_running.append(name)
else:
mgr.start()
started.append(name)
return {"started": started, "already_running": already_running}
@mcp.tool()
def bridge_down(tunnel: Optional[str] = None) -> dict:
"""Stop one or all configured tunnels.
Args:
tunnel: Tunnel name to stop. If omitted, stops all inline tunnels.
Returns:
{"stopped": [...], "not_running": [...]} or {"error": "..."}
"""
cfg, err = _load_cfg_or_error()
if err:
return err
from bridge.manager import TunnelManager
sd = _state_dir()
stopped = []
not_running = []
if tunnel:
from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import BridgeNotFound, resolve
catalog = None
if cfg.catalog_path is not None:
try:
catalog = load_catalog(cfg.catalog_path)
except Exception:
pass
try:
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
except BridgeNotFound:
return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
mgr = TunnelManager(tcfg, state_dir=sd)
if not mgr.is_running():
not_running.append(tunnel)
else:
mgr.stop()
stopped.append(tunnel)
else:
for name, tcfg in cfg.tunnels.items():
mgr = TunnelManager(tcfg, state_dir=sd)
if not mgr.is_running():
not_running.append(name)
else:
mgr.stop()
stopped.append(name)
return {"stopped": stopped, "not_running": not_running}
@mcp.tool()
def bridge_restart(tunnel: Optional[str] = None) -> dict:
"""Restart one or all configured tunnels.
Reverse tunnels run conditional remote stale-forward cleanup before
reconnecting; healthy forwards are left running.
Args:
tunnel: Tunnel name to restart. If omitted, restarts all inline tunnels.
Returns:
{"actions": [{"tunnel", "action", "detail"}, ...]} or {"error": "..."}
"""
cfg, err = _load_cfg_or_error()
if err:
return err
from bridge.cleanup import restart_all_tunnels, restart_tunnel
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
if tunnel:
from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import BridgeNotFound, resolve
catalog = None
if cfg.catalog_path is not None:
try:
catalog = load_catalog(cfg.catalog_path)
except Exception:
pass
try:
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
except BridgeNotFound:
return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
actions = [restart_tunnel(tcfg, state_mgr)]
else:
actions = restart_all_tunnels(cfg, state_mgr)
payload = {
"actions": [
{"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
for a in actions
],
}
if any(a.action == "error" for a in actions):
payload["error"] = "one or more tunnels failed to restart"
return payload
@mcp.tool()
def bridge_status() -> list[dict]:
"""Return status of all configured tunnels.
Returns:
List of tunnel status dicts, each with keys:
tunnel, state, actor, host, pid, uptime, health
(uptime and health are currently always None)
"""
cfg, err = _load_cfg_or_error()
if err:
return [err]
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
rows = []
for name, tcfg in cfg.tunnels.items():
state = state_mgr.read_state(name)
pid = state_mgr.read_pid(name)
rows.append({
"tunnel": name,
"state": state.value,
"actor": tcfg.actor,
"host": tcfg.host,
"pid": pid,
"uptime": None,
"health": None,
})
return rows
@mcp.tool()
def bridge_logs(tunnel: str, lines: int = 50) -> list[dict]:
"""Return recent audit log entries for a tunnel.
Args:
tunnel: Tunnel name.
lines: Maximum number of log entries to return (default 50).
Returns:
List of audit event dicts (timestamp, event, actor, detail).
"""
cfg, err = _load_cfg_or_error()
if err:
return [err]
from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import BridgeNotFound, resolve
catalog = None
if cfg.catalog_path is not None:
try:
catalog = load_catalog(cfg.catalog_path)
except Exception:
pass
try:
resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
except BridgeNotFound:
return [{"error": f"Tunnel '{tunnel}' not found in config or catalog"}]
from bridge.audit import AuditLogger
sd = _state_dir()
logger = AuditLogger(state_dir=sd)
events = logger.read_events(tunnel)
# Guard against lines <= 0: events[-0:] would return every entry.
return events[-lines:] if events and lines > 0 else []
# ---------------------------------------------------------------------------
# Catalog tools
# ---------------------------------------------------------------------------
@mcp.tool()
def catalog_list_targets(domain: Optional[str] = None) -> list[dict]:
"""List all infrastructure targets from the OpsCatalog.
Args:
domain: Optional domain filter.
Returns:
List of target dicts (id, domain, kind, description, reachable_via).
Returns [{"error": "..."}] when catalog is not configured or fails to load.
"""
cfg, err = _load_cfg_or_error()
if err:
return [err]
catalog, err = _load_catalog(cfg)
if err:
return [err]
targets = []
for t in catalog.targets.values():
if domain and t.domain != domain:
continue
targets.append({
"id": t.id,
"domain": t.domain,
"kind": t.kind,
"description": t.description or "",
"reachable_via": list(t.reachable_via),
})
return targets
@mcp.tool()
def catalog_show_target(target_id: str) -> dict:
"""Show full metadata for a catalog target.
Args:
target_id: The target identifier.
Returns:
Target metadata dict, or {"error": "..."}.
"""
cfg, err = _load_cfg_or_error()
if err:
return err
catalog, err = _load_catalog(cfg)
if err:
return err
if target_id not in catalog.targets:
return {"error": f"Target '{target_id}' not found"}
t = catalog.targets[target_id]
return {
"id": t.id,
"domain": t.domain,
"kind": t.kind,
"description": t.description or "",
"reachable_via": list(t.reachable_via),
}
@mcp.tool()
def catalog_list_domains() -> list[dict]:
"""List all domains in the OpsCatalog with target and bridge counts.
Returns:
List of domain dicts (id, name, environment, description, target_count, bridge_count).
Returns [{"error": "..."}] when catalog is not configured or fails to load.
"""
cfg, err = _load_cfg_or_error()
if err:
return [err]
catalog, err = _load_catalog(cfg)
if err:
return [err]
domains = []
for d in catalog.domains.values():
target_count = sum(1 for t in catalog.targets.values() if t.domain == d.id)
bridge_count = sum(1 for b in catalog.bridges.values() if b.domain == d.id)
domains.append({
"id": d.id,
"name": d.name,
"environment": d.environment,
"description": d.description or "",
"target_count": target_count,
"bridge_count": bridge_count,
})
return domains
@mcp.tool()
def catalog_validate() -> dict:
"""Validate the OpsCatalog for consistency errors.
Returns:
{"valid": True, "errors": []} or {"valid": False, "errors": ["..."]}
"""
cfg, err = _load_cfg_or_error()
if err:
return {"valid": False, "errors": [err["error"]]}
catalog, err = _load_catalog(cfg)
if err:
return {"valid": False, "errors": [err["error"]]}
from bridge.catalog.validator import validate_catalog
errors = validate_catalog(catalog)
if errors:
return {"valid": False, "errors": errors}
return {"valid": True, "errors": []}
@mcp.tool()
def catalog_show_bridge(bridge_id: str) -> dict:
"""Show full metadata for a catalog bridge definition.
Args:
bridge_id: The bridge identifier.
Returns:
Bridge metadata dict, or {"error": "..."}.
"""
cfg, err = _load_cfg_or_error()
if err:
return err
catalog, err = _load_catalog(cfg)
if err:
return err
if bridge_id not in catalog.bridges:
return {"error": f"Bridge '{bridge_id}' not found"}
b = catalog.bridges[bridge_id]
result = {
"id": b.id,
"domain": b.domain,
"target": b.target,
"host": b.host,
"remote_port": b.remote_port,
"local_port": b.local_port,
"ssh_user": b.ssh_user,
"actor": b.actor,
"access_method": b.access_method,
"description": b.description or "",
}
if b.health_check:
result["health_check"] = {
"url": b.health_check.url,
"interval_seconds": b.health_check.interval_seconds,
"timeout_seconds": b.health_check.timeout_seconds,
}
return result
# ---------------------------------------------------------------------------
# Diagnostics tool
# ---------------------------------------------------------------------------
@mcp.tool()
def bridge_check(tunnel: Optional[str] = None) -> list[dict]:
"""End-to-end diagnostics: SSH process alive + remote port listening.
Args:
tunnel: Specific tunnel name, or None for all inline tunnels.
Returns:
List of dicts with keys: tunnel, ssh_process, pid, remote_port,
local_api, latency_ms, stale_state, ok.
Returns [{"error": "..."}] on config load failure.
"""
cfg, err = _load_cfg_or_error()
if err:
return [err]
sd = _state_dir()
state_mgr = StateManager(state_dir=sd)
if tunnel:
from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import BridgeNotFound, resolve
catalog = None
if cfg.catalog_path is not None:
try:
catalog = load_catalog(cfg.catalog_path)
except Exception:
pass
try:
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
except BridgeNotFound:
return [{"error": f"Tunnel '{tunnel}' not found in config or catalog"}]
results = [check_tunnel(tcfg, state_mgr)]
else:
results = check_all_tunnels(cfg, state_mgr)
return [{**dataclasses.asdict(r), "ok": r.ok} for r in results]
# ---------------------------------------------------------------------------
# MCP resources
# ---------------------------------------------------------------------------
@mcp.resource("bridge://status")
def resource_bridge_status() -> str:
"""Live snapshot of all tunnel states as JSON."""
rows = bridge_status()
return json.dumps(rows, indent=2)
@mcp.resource("bridge://check")
def resource_bridge_check() -> str:
"""Live end-to-end diagnostic snapshot for all tunnels."""
return json.dumps(bridge_check(), indent=2)
@mcp.resource("catalog://domains")
def resource_catalog_domains() -> str:
"""List of all catalog domains as JSON."""
domains = catalog_list_domains()
return json.dumps(domains, indent=2)
@mcp.resource("catalog://targets")
def resource_catalog_targets() -> str:
"""List of all catalog targets as JSON."""
targets = catalog_list_targets()
return json.dumps(targets, indent=2)
# ---------------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description="OpsBridge MCP server")
    parser.add_argument("--http", action="store_true", help="Run in SSE/HTTP mode instead of stdio")
    args = parser.parse_args()
    if args.http:
        port = int(os.environ.get("BRIDGE_MCP_PORT", "8002"))
        mcp.run(transport="sse", host="127.0.0.1", port=port)
    else:
        mcp.run(transport="stdio")
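
Review note: `bridge_restart` returns per-tunnel actions and adds a top-level `error` key when any action failed. The sketch below shows one way a caller might interpret that payload. `summarize_restart` and the sample tunnel names are hypothetical; the action literals (`healthy`, `error`) come from the commit message for 10c6fdaec9, not from code visible in this diff.

```python
from typing import Any


def summarize_restart(payload: dict[str, Any]) -> tuple[bool, list[str]]:
    """Return (ok, failed_tunnels) for a bridge_restart-style payload.

    Hypothetical helper: assumes the {"actions": [...], "error": "..."} shape
    described in the bridge_restart docstring.
    """
    actions = payload.get("actions", [])
    failed = [a["tunnel"] for a in actions if a.get("action") == "error"]
    # A config-load failure returns only {"error": ...}, so check the key too.
    ok = "error" not in payload and not failed
    return ok, failed


# Sample payload with made-up tunnel names.
sample = {
    "actions": [
        {"tunnel": "alpha", "action": "healthy", "detail": ""},
        {"tunnel": "beta", "action": "error", "detail": "cleanup failed"},
    ],
    "error": "one or more tunnels failed to restart",
}
ok, failed = summarize_restart(sample)
```

Checking both the `error` key and the individual actions keeps the caller correct even when the payload carries no `actions` list at all.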

61
src/bridge/models.py Normal file

@@ -0,0 +1,61 @@
"""Domain models for OpsBridge."""
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class BridgeState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


class CertAcquisitionError(Exception):
    """Raised when cert_command fails to produce a certificate."""


@dataclass
class ReconnectPolicy:
    max_attempts: int = 0  # 0 = infinite
    backoff_initial: int = 5
    backoff_max: int = 60


@dataclass
class HealthCheckConfig:
    url: str
    interval_seconds: int = 30
    timeout_seconds: int = 5


@dataclass
class TunnelConfig:
    name: str
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str
    actor: str
    reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
    health_check: Optional[HealthCheckConfig] = None
    direction: str = "reverse"  # "reverse" (-R) or "local" (-L)
    cert_command: Optional[str] = None


@dataclass
class ActorInfo:
    name: str
    actor_type: ActorType
    description: str = ""
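
Review note: `ReconnectPolicy` carries `backoff_initial`, `backoff_max`, and `max_attempts` (0 meaning infinite), but this diff does not show how the delays are derived from them. The sketch below assumes a doubling schedule capped at `backoff_max`; that schedule and the `backoff_delays` helper are assumptions for illustration, not the project's confirmed behavior.

```python
from dataclasses import dataclass
from itertools import islice
from typing import Iterator


@dataclass
class ReconnectPolicy:
    # Local copy of the fields shown in src/bridge/models.py.
    max_attempts: int = 0  # 0 = infinite
    backoff_initial: int = 5
    backoff_max: int = 60


def backoff_delays(policy: ReconnectPolicy) -> Iterator[int]:
    """Yield reconnect delays in seconds (assumed doubling, capped at backoff_max)."""
    delay = policy.backoff_initial
    attempt = 0
    while policy.max_attempts == 0 or attempt < policy.max_attempts:
        yield min(delay, policy.backoff_max)
        delay = min(delay * 2, policy.backoff_max)
        attempt += 1


# With the defaults the generator never ends, so take a bounded prefix.
first_six = list(islice(backoff_delays(ReconnectPolicy()), 6))
```

A generator suits the infinite case: the caller decides how many delays to consume instead of the policy precomputing a list.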

83
src/bridge/state.py Normal file

@@ -0,0 +1,83 @@
"""State file management for OpsBridge."""
from __future__ import annotations

import os
from pathlib import Path
from typing import Optional

from bridge.models import BridgeState


def _default_state_dir() -> Path:
    return Path.home() / ".local" / "state" / "bridge"


class StateManager:
    def __init__(self, state_dir: Optional[Path] = None):
        self._dir = Path(state_dir) if state_dir else _default_state_dir()

    def _ensure_dir(self) -> None:
        self._dir.mkdir(parents=True, exist_ok=True)

    def _state_path(self, name: str) -> Path:
        return self._dir / f"{name}.state"

    def _pid_path(self, name: str) -> Path:
        return self._dir / f"{name}.pid"

    def read_state(self, name: str) -> BridgeState:
        path = self._state_path(name)
        if not path.exists():
            return BridgeState.STOPPED
        text = path.read_text().strip()
        try:
            return BridgeState(text)
        except ValueError:
            return BridgeState.STOPPED

    def write_state(self, name: str, state: BridgeState) -> None:
        self._ensure_dir()
        self._state_path(name).write_text(state.value)

    def read_pid(self, name: str) -> Optional[int]:
        path = self._pid_path(name)
        if not path.exists():
            return None
        try:
            pid = int(path.read_text().strip())
        except (ValueError, OSError):
            return None
        if _pid_alive(pid):
            return pid
        return None

    def read_raw_pid(self, name: str) -> Optional[int]:
        """Read PID from file without liveness check. Returns None if file absent/invalid."""
        path = self._pid_path(name)
        if not path.exists():
            return None
        try:
            return int(path.read_text().strip())
        except (ValueError, OSError):
            return None

    def write_pid(self, name: str, pid: int) -> None:
        self._ensure_dir()
        self._pid_path(name).write_text(str(pid))

    def clear_pid(self, name: str) -> None:
        path = self._pid_path(name)
        if path.exists():
            path.unlink()

    def is_running(self, name: str) -> bool:
        return self.read_pid(name) is not None
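

def _pid_alive(pid: int) -> bool:
    """Return True if a process with the given PID exists and can be signalled.

    A PID owned by another user raises PermissionError and is reported as
    not alive, so a recycled PID belonging to someone else is not mistaken
    for a tunnel process.
    """
    try:
        # Signal 0 performs the existence and permission checks without
        # delivering a signal.
        os.kill(pid, 0)
        return True
    except (ProcessLookupError, PermissionError):
        return False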
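
Review note: `read_state` falls back to `STOPPED` both when the state file is missing and when its content is not a known `BridgeState` value. The sketch below reproduces that fallback in isolation. `State`, the free-standing `read_state`, and `demo_read_state` are hypothetical stand-ins written for this note, not part of the diff.

```python
import tempfile
from enum import Enum
from pathlib import Path


class State(str, Enum):
    # Hypothetical two-member stand-in for BridgeState.
    STOPPED = "stopped"
    CONNECTED = "connected"


def read_state(path: Path) -> State:
    """Return the persisted state, or STOPPED if the file is absent or unparseable."""
    if not path.exists():
        return State.STOPPED
    try:
        return State(path.read_text().strip())
    except ValueError:
        return State.STOPPED


def demo_read_state() -> list[State]:
    """Exercise the three cases: missing file, valid value, unknown value."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo.state"
        results = [read_state(path)]      # file absent
        path.write_text("connected\n")
        results.append(read_state(path))  # valid value, trailing newline stripped
        path.write_text("garbage")
        results.append(read_state(path))  # unknown value
        return results
```

Treating an unreadable state as `STOPPED` means a corrupted file cannot make a tunnel look connected; the trade-off is that corruption goes unreported.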

0
tests/__init__.py Normal file

154
tests/conftest.py Normal file

@@ -0,0 +1,154 @@
"""Shared pytest configuration for OpsBridge tests.
Registers capability and access_mode marks, and provides the
collect_capability_coverage() helper used by the cross-mode meta-test.
"""
from __future__ import annotations
import textwrap
from typing import Iterable
import pytest
# ---------------------------------------------------------------------------
# Shared fixtures
# ---------------------------------------------------------------------------
VALID_CONFIG = textwrap.dedent("""\
tunnels:
test-tunnel:
host: host.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: adm-bernd
actors:
adm-bernd:
class: adm
description: Bernd
""")
VALID_CONFIG_WITH_CATALOG = textwrap.dedent("""\
tunnels:
test-tunnel:
host: host.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: adm-bernd
actors:
adm-bernd:
class: adm
description: Bernd
catalog_path: {catalog_path}
""")
@pytest.fixture
def config_file(tmp_path):
f = tmp_path / "tunnels.yaml"
f.write_text(VALID_CONFIG)
return f
@pytest.fixture
def state_dir(tmp_path):
d = tmp_path / "state"
d.mkdir()
return d
@pytest.fixture
def catalog_dir(tmp_path):
"""Minimal catalog directory with one domain, target, and bridge."""
cat = tmp_path / "catalog"
domain_dir = cat / "domains" / "coulombcore"
domain_dir.mkdir(parents=True)
(domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
type: domain
id: coulombcore
name: CoulombCore Infrastructure
description: Core infrastructure domain
environment: production
"""))
targets_dir = domain_dir / "targets"
targets_dir.mkdir()
(targets_dir / "state-hub.yaml").write_text(textwrap.dedent("""\
type: target
id: state-hub
domain: coulombcore
kind: service
description: Infrastructure state coordination service
reachable_via:
- state-hub-coulombcore
"""))
bridges_dir = domain_dir / "bridges"
bridges_dir.mkdir()
(bridges_dir / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
type: bridge
id: state-hub-coulombcore
domain: coulombcore
target: state-hub
description: Bridge to state hub
access_method: ssh-reverse
host: coulombcore.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: agent.claude-coulombcore
reconnect:
max_attempts: 0
backoff_initial: 5
backoff_max: 60
"""))
actors_dir = cat / "actors"
actors_dir.mkdir()
(actors_dir / "agent.yaml").write_text(textwrap.dedent("""\
type: actor
id: agent.claude-coulombcore
class: automation
description: Claude Code agent on CoulombCore
"""))
return cat
@pytest.fixture
def config_file_with_catalog(tmp_path, catalog_dir):
f = tmp_path / "tunnels.yaml"
f.write_text(VALID_CONFIG_WITH_CATALOG.format(catalog_path=str(catalog_dir)))
return f
# ---------------------------------------------------------------------------
# Coverage collector helper
# ---------------------------------------------------------------------------
def collect_capability_coverage(items: Iterable) -> set[tuple[str, str]]:
    """Walk pytest items and return set of (capability_name, access_mode) pairs.

    Each test item is inspected for `capability` and `access_mode` markers.
    A pair is added for every combination of capability × access_mode marks
    found on a single item.

    Args:
        items: Iterable of pytest.Item objects (from session.items or similar).

    Returns:
        Set of (capability_name, access_mode) tuples found across all items.
    """
    covered: set[tuple[str, str]] = set()
    for item in items:
        capabilities = [
            m.args[0] for m in item.iter_markers("capability") if m.args
        ]
        modes = [
            m.args[0] for m in item.iter_markers("access_mode") if m.args
        ]
        for cap in capabilities:
            for mode in modes:
                covered.add((cap, mode))
    return covered
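
Review note: the collector only records a pair when a single test carries both a `capability` and an `access_mode` mark, so a test with just one of the two contributes nothing. The sketch below demonstrates that with fake items. `FakeMark`, `FakeItem`, and `collect_pairs` are hypothetical stand-ins that mimic only the `iter_markers(name)` and `.args` surface the collector relies on.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator


@dataclass
class FakeMark:
    # Stand-in for a pytest mark: a name plus positional args.
    name: str
    args: tuple


@dataclass
class FakeItem:
    # Stand-in for pytest.Item, exposing only iter_markers().
    marks: list = field(default_factory=list)

    def iter_markers(self, name: str) -> Iterator[FakeMark]:
        return (m for m in self.marks if m.name == name)


def collect_pairs(items: Iterable) -> set:
    """Condensed restatement of the capability x access_mode cross product."""
    covered = set()
    for item in items:
        caps = [m.args[0] for m in item.iter_markers("capability") if m.args]
        modes = [m.args[0] for m in item.iter_markers("access_mode") if m.args]
        covered.update((cap, mode) for cap in caps for mode in modes)
    return covered


items = [
    # Both marks present: yields one pair.
    FakeItem([FakeMark("capability", ("catalog_validate",)),
              FakeMark("access_mode", ("cli",))]),
    # Only a capability mark: the empty mode list yields no pairs.
    FakeItem([FakeMark("capability", ("bridge_status",))]),
]
pairs = collect_pairs(items)
```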

89
tests/test_audit.py Normal file

@@ -0,0 +1,89 @@
"""Tests for audit logging."""
import json
import pytest
from bridge.audit import AuditLogger, AuditEvent
@pytest.fixture
def log_dir(tmp_path):
return tmp_path / "bridge"
@pytest.fixture
def logger(log_dir):
return AuditLogger(state_dir=log_dir)
class TestAuditLogger:
def test_log_event_creates_file(self, logger, log_dir):
logger.log(
tunnel="my-tunnel",
event=AuditEvent.BRIDGE_STARTED,
actor="operator.bernd",
actor_type="adm",
)
log_file = log_dir / "my-tunnel.log"
assert log_file.exists()
def test_log_event_is_json_line(self, logger, log_dir):
logger.log(
tunnel="my-tunnel",
event=AuditEvent.BRIDGE_STARTED,
actor="operator.bernd",
actor_type="adm",
)
lines = (log_dir / "my-tunnel.log").read_text().strip().splitlines()
assert len(lines) == 1
entry = json.loads(lines[0])
assert entry["tunnel"] == "my-tunnel"
assert entry["event"] == "bridge_started"
assert entry["actor"] == "operator.bernd"
assert entry["actor_type"] == "adm"
assert "timestamp" in entry
def test_multiple_events_append(self, logger, log_dir):
for event in [AuditEvent.BRIDGE_STARTED, AuditEvent.BRIDGE_CONNECTED, AuditEvent.BRIDGE_STOPPED]:
logger.log(tunnel="t", event=event, actor="a", actor_type="adm")
lines = (log_dir / "t.log").read_text().strip().splitlines()
assert len(lines) == 3
def test_log_with_detail(self, logger, log_dir):
logger.log(
tunnel="t",
event=AuditEvent.HEALTH_CHECK_FAILED,
actor="a",
actor_type="atm",
detail="connection refused",
)
entry = json.loads((log_dir / "t.log").read_text().strip())
assert entry["detail"] == "connection refused"
def test_all_event_types_defined(self):
events = {e.value for e in AuditEvent}
assert "bridge_started" in events
assert "bridge_connected" in events
assert "bridge_disconnected" in events
assert "bridge_reconnecting" in events
assert "health_check_failed" in events
assert "health_check_recovered" in events
assert "bridge_stopped" in events
def test_timestamp_is_iso8601(self, logger, log_dir):
from datetime import datetime
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
entry = json.loads((log_dir / "t.log").read_text().strip())
# Should parse without error; timezone-aware and naive timestamps are both acceptable
dt = datetime.fromisoformat(entry["timestamp"])
assert isinstance(dt, datetime)
def test_read_events(self, logger, log_dir):
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_type="adm")
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
events = logger.read_events("t")
assert len(events) == 2
assert events[0]["event"] == "bridge_started"
def test_read_events_missing_returns_empty(self, logger):
assert logger.read_events("nonexistent") == []

212
tests/test_catalog_cli.py Normal file

@@ -0,0 +1,212 @@
"""Tests for catalog CLI commands (targets, catalog list/validate/show)."""
import json
import textwrap
import pytest
from typer.testing import CliRunner
from bridge.cli import app
runner = CliRunner()
# Config with catalog_path pointing to a fixture
BASE_CONFIG = textwrap.dedent("""\
tunnels: {{}}
actors: {{}}
catalog_path: {catalog_path}
""")
CONFIG_NO_CATALOG = textwrap.dedent("""\
tunnels: {}
actors: {}
""")
@pytest.fixture
def catalog_dir(tmp_path):
root = tmp_path / "opscatalog"
domain_dir = root / "domains" / "coulombcore"
(domain_dir / "targets").mkdir(parents=True)
(domain_dir / "bridges").mkdir(parents=True)
actors_dir = root / "actors"
actors_dir.mkdir(parents=True)
(domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
type: domain
id: coulombcore
name: CoulombCore Infrastructure
description: Core infra
environment: production
"""))
(domain_dir / "targets" / "state-hub.yaml").write_text(textwrap.dedent("""\
type: target
id: state-hub
domain: coulombcore
kind: service
description: State coordination service
reachable_via:
- state-hub-coulombcore
"""))
(domain_dir / "bridges" / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
type: bridge
id: state-hub-coulombcore
domain: coulombcore
target: state-hub
description: Ops bridge for state hub
access_method: ssh-reverse
host: coulombcore.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: agent.claude-coulombcore
"""))
(actors_dir / "agents.yaml").write_text(textwrap.dedent("""\
type: actor
id: agent.claude-coulombcore
class: automation
description: Claude Code agent
"""))
return root
@pytest.fixture
def config_file(tmp_path, catalog_dir):
f = tmp_path / "tunnels.yaml"
f.write_text(BASE_CONFIG.format(catalog_path=str(catalog_dir)))
return f
@pytest.fixture
def env(config_file, tmp_path):
return {
"BRIDGE_CONFIG": str(config_file),
"BRIDGE_STATE_DIR": str(tmp_path / "state"),
}
class TestTargetsCommand:
@pytest.mark.capability("catalog_list_targets")
@pytest.mark.access_mode("cli")
def test_targets_shows_table(self, env):
result = runner.invoke(app, ["targets"], env=env)
assert result.exit_code == 0
assert "state-hub" in result.output
def test_targets_json(self, env):
result = runner.invoke(app, ["targets", "--json"], env=env)
assert result.exit_code == 0
data = json.loads(result.output)
assert isinstance(data, list)
assert any(t["target"] == "state-hub" for t in data)
assert any(t["domain"] == "coulombcore" for t in data)
def test_targets_domain_filter(self, env):
result = runner.invoke(app, ["targets", "--domain", "coulombcore"], env=env)
assert result.exit_code == 0
assert "state-hub" in result.output
def test_targets_domain_filter_unknown(self, env):
result = runner.invoke(app, ["targets", "--domain", "nonexistent"], env=env)
assert result.exit_code == 0
# No results but no crash
def test_targets_no_catalog_configured(self, tmp_path):
f = tmp_path / "tunnels.yaml"
f.write_text(CONFIG_NO_CATALOG)
result = runner.invoke(app, ["targets"], env={"BRIDGE_CONFIG": str(f)})
assert result.exit_code == 1
assert "catalog" in result.output.lower()
@pytest.mark.capability("catalog_show_target")
@pytest.mark.access_mode("cli")
def test_targets_show_subcommand(self, env):
result = runner.invoke(app, ["targets", "show", "state-hub"], env=env)
assert result.exit_code == 0
assert "state-hub" in result.output
assert "coulombcore" in result.output
def test_targets_show_unknown(self, env):
result = runner.invoke(app, ["targets", "show", "nonexistent"], env=env)
assert result.exit_code == 1
class TestCatalogCommand:
@pytest.mark.capability("catalog_list_domains")
@pytest.mark.access_mode("cli")
def test_catalog_list(self, env):
result = runner.invoke(app, ["catalog", "list"], env=env)
assert result.exit_code == 0
assert "coulombcore" in result.output
def test_catalog_list_json(self, env):
result = runner.invoke(app, ["catalog", "list", "--json"], env=env)
assert result.exit_code == 0
data = json.loads(result.output)
assert isinstance(data, list)
assert any(d["domain"] == "coulombcore" for d in data)
@pytest.mark.capability("catalog_validate")
@pytest.mark.access_mode("cli")
def test_catalog_validate_clean(self, env):
result = runner.invoke(app, ["catalog", "validate"], env=env)
assert result.exit_code == 0
assert "valid" in result.output.lower() or "ok" in result.output.lower() or "0" in result.output
def test_catalog_validate_with_errors(self, tmp_path):
# Catalog with dangling reference
root = tmp_path / "bad-catalog"
domain_dir = root / "domains" / "d"
(domain_dir / "targets").mkdir(parents=True)
(domain_dir / "domain.yaml").write_text(
"type: domain\nid: d\nname: D\n"
)
(domain_dir / "targets" / "t.yaml").write_text(
"type: target\nid: t\ndomain: d\nkind: service\nreachable_via:\n - missing-bridge\n"
)
f = tmp_path / "tunnels.yaml"
f.write_text(BASE_CONFIG.format(catalog_path=str(root)))
result = runner.invoke(app, ["catalog", "validate"], env={"BRIDGE_CONFIG": str(f)})
assert result.exit_code == 1
assert "missing-bridge" in result.output
@pytest.mark.capability("catalog_show_bridge")
@pytest.mark.access_mode("cli")
def test_catalog_show(self, env):
result = runner.invoke(app, ["catalog", "show", "state-hub-coulombcore"], env=env)
assert result.exit_code == 0
assert "state-hub-coulombcore" in result.output
assert "coulombcore.local" in result.output
def test_catalog_show_unknown(self, env):
result = runner.invoke(app, ["catalog", "show", "nonexistent"], env=env)
assert result.exit_code == 1
def test_catalog_no_catalog_configured(self, tmp_path):
f = tmp_path / "tunnels.yaml"
f.write_text(CONFIG_NO_CATALOG)
result = runner.invoke(app, ["catalog", "list"], env={"BRIDGE_CONFIG": str(f)})
assert result.exit_code == 1
class TestUpWithCatalogFallback:
def test_up_resolves_catalog_bridge(self, env):
"""bridge up <catalog-bridge-name> works when name not in inline tunnels.yaml."""
from unittest.mock import MagicMock, patch
with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
mock_mgr = MagicMock()
mock_mgr.is_running.return_value = False
mock_mgr_cls.return_value = mock_mgr
result = runner.invoke(app, ["up", "state-hub-coulombcore"], env=env)
assert result.exit_code == 0
mock_mgr.start.assert_called_once()
def test_up_unknown_bridge_exit_1(self, env):
result = runner.invoke(app, ["up", "totally-nonexistent"], env=env)
assert result.exit_code == 1


@@ -0,0 +1,195 @@
"""Integration tests for OpsCatalog (T14-T16 from BRIDGE-WP-0002)."""
import json
import textwrap
from unittest.mock import MagicMock, patch
import pytest
from typer.testing import CliRunner
from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import resolve
from bridge.catalog.validator import validate_catalog
from bridge.cli import app
runner = CliRunner()
@pytest.fixture
def catalog_dir(tmp_path):
root = tmp_path / "opscatalog"
domain_dir = root / "domains" / "coulombcore"
(domain_dir / "targets").mkdir(parents=True)
(domain_dir / "bridges").mkdir(parents=True)
(domain_dir / "docs").mkdir(parents=True)
actors_dir = root / "actors"
actors_dir.mkdir(parents=True)
(domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
type: domain
id: coulombcore
name: CoulombCore Infrastructure
description: Core infra
environment: production
"""))
(domain_dir / "targets" / "state-hub.yaml").write_text(textwrap.dedent("""\
type: target
id: state-hub
domain: coulombcore
kind: service
description: State coordination service
reachable_via:
- state-hub-coulombcore
"""))
(domain_dir / "bridges" / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
type: bridge
id: state-hub-coulombcore
domain: coulombcore
target: state-hub
description: Ops bridge for state hub
access_method: ssh-reverse
host: coulombcore.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: agent.claude-coulombcore
reconnect:
max_attempts: 0
backoff_initial: 5
backoff_max: 60
"""))
(actors_dir / "agents.yaml").write_text(textwrap.dedent("""\
type: actor
id: agent.claude-coulombcore
class: automation
description: Claude Code agent on CoulombCore
"""))
(domain_dir / "docs" / "overview.md").write_text(
"# CoulombCore Overview\nCore infrastructure notes."
)
return root
@pytest.fixture
def config_with_catalog(tmp_path, catalog_dir):
f = tmp_path / "tunnels.yaml"
f.write_text(textwrap.dedent(f"""\
catalog_path: {catalog_dir}
tunnels: {{}}
actors: {{}}
"""))
return f
@pytest.fixture
def env(config_with_catalog, tmp_path):
return {
"BRIDGE_CONFIG": str(config_with_catalog),
"BRIDGE_STATE_DIR": str(tmp_path / "state"),
}
class TestT14CatalogLoadAndResolve:
def test_catalog_loads_all_types(self, catalog_dir):
cat = load_catalog(catalog_dir)
assert "coulombcore" in cat.domains
assert "state-hub" in cat.targets
assert "state-hub-coulombcore" in cat.bridges
assert "agent.claude-coulombcore" in cat.actors
def test_resolve_from_catalog(self, catalog_dir):
cat = load_catalog(catalog_dir)
tc = resolve("state-hub-coulombcore", catalog=cat, inline_tunnels={})
assert tc.name == "state-hub-coulombcore"
assert tc.host == "coulombcore.local"
assert tc.remote_port == 18000
def test_bridge_up_with_catalog_bridge(self, env):
with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
mock_mgr = MagicMock()
mock_mgr.is_running.return_value = False
mock_mgr_cls.return_value = mock_mgr
result = runner.invoke(app, ["up", "state-hub-coulombcore"], env=env)
assert result.exit_code == 0
mock_mgr.start.assert_called_once()
# Verify TunnelManager was constructed with correct config
call_args = mock_mgr_cls.call_args
tcfg = call_args[0][0]
assert tcfg.host == "coulombcore.local"
assert tcfg.remote_port == 18000
class TestT15BridgeTargetsOutput:
def test_targets_table(self, env):
result = runner.invoke(app, ["targets"], env=env)
assert result.exit_code == 0
assert "state-hub" in result.output
assert "coulombcore" in result.output
assert "service" in result.output
def test_targets_json_structure(self, env):
result = runner.invoke(app, ["targets", "--json"], env=env)
assert result.exit_code == 0
data = json.loads(result.output)
assert len(data) == 1
t = data[0]
assert t["target"] == "state-hub"
assert t["domain"] == "coulombcore"
assert t["kind"] == "service"
assert "state-hub-coulombcore" in t["bridges"]
def test_targets_show_includes_docs(self, env):
result = runner.invoke(app, ["targets", "show", "state-hub"], env=env)
assert result.exit_code == 0
assert "state-hub" in result.output
assert "coulombcore" in result.output
class TestT16CatalogValidate:
def test_validate_clean_catalog_exit_0(self, env):
result = runner.invoke(app, ["catalog", "validate"], env=env)
assert result.exit_code == 0
assert "ok" in result.output.lower() or "0" in result.output
def test_validate_dangling_reference_exit_1(self, tmp_path):
root = tmp_path / "bad"
domain_dir = root / "domains" / "d"
(domain_dir / "targets").mkdir(parents=True)
(domain_dir / "bridges").mkdir(parents=True)
(root / "actors").mkdir(parents=True)
(domain_dir / "domain.yaml").write_text("type: domain\nid: d\nname: D\n")
(domain_dir / "targets" / "t.yaml").write_text(
"type: target\nid: t\ndomain: d\nkind: service\n"
"reachable_via:\n - nonexistent-bridge\n"
)
(domain_dir / "bridges" / "b.yaml").write_text(
"type: bridge\nid: b\ndomain: d\ntarget: t\n"
"host: h\nremote_port: 1\nlocal_port: 2\n"
"ssh_user: u\nssh_key: k\nactor: missing-actor\n"
)
f = tmp_path / "tunnels.yaml"
f.write_text(f"catalog_path: {root}\ntunnels: {{}}\nactors: {{}}\n")
result = runner.invoke(app, ["catalog", "validate"], env={"BRIDGE_CONFIG": str(f)})
assert result.exit_code == 1
assert "nonexistent-bridge" in result.output or "missing-actor" in result.output
def test_catalog_list_shows_counts(self, env):
result = runner.invoke(app, ["catalog", "list"], env=env)
assert result.exit_code == 0
assert "coulombcore" in result.output
def test_catalog_show_bridge(self, env):
result = runner.invoke(app, ["catalog", "show", "state-hub-coulombcore"], env=env)
assert result.exit_code == 0
assert "coulombcore.local" in result.output
assert "18000" in result.output
def test_validate_using_validator_directly(self, catalog_dir):
cat = load_catalog(catalog_dir)
errors = validate_catalog(cat)
assert errors == []


@@ -0,0 +1,140 @@
"""Tests for catalog loader."""
import textwrap
import pytest
from bridge.catalog.loader import CatalogLoadError, load_catalog
from bridge.catalog.models import Catalog
@pytest.fixture
def catalog_dir(tmp_path):
"""Build a minimal valid catalog fixture."""
    root = tmp_path / "opscatalog"
    domain_dir = root / "domains" / "coulombcore"
    (domain_dir / "targets").mkdir(parents=True)
    (domain_dir / "bridges").mkdir(parents=True)
    (domain_dir / "docs").mkdir(parents=True)
    actors_dir = root / "actors"
    actors_dir.mkdir(parents=True)
    (domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
        type: domain
        id: coulombcore
        name: CoulombCore Infrastructure
        description: Core infra
        environment: production
    """))
    (domain_dir / "targets" / "state-hub.yaml").write_text(textwrap.dedent("""\
        type: target
        id: state-hub
        domain: coulombcore
        kind: service
        description: State coordination service
        reachable_via:
          - state-hub-coulombcore
    """))
    (domain_dir / "bridges" / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
        type: bridge
        id: state-hub-coulombcore
        domain: coulombcore
        target: state-hub
        description: Ops bridge
        access_method: ssh-reverse
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore
        health_check:
          url: http://127.0.0.1:18000/health
          interval_seconds: 30
          timeout_seconds: 5
        reconnect:
          max_attempts: 0
          backoff_initial: 5
          backoff_max: 60
    """))
    (actors_dir / "agents.yaml").write_text(textwrap.dedent("""\
        type: actor
        id: agent.claude-coulombcore
        class: automation
        description: Claude Code agent on CoulombCore
    """))
    (domain_dir / "docs" / "overview.md").write_text("# Overview\nSome ops notes.")
    return root


class TestLoadCatalog:
    def test_loads_domain(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "coulombcore" in cat.domains
        d = cat.domains["coulombcore"]
        assert d.name == "CoulombCore Infrastructure"
        assert d.environment == "production"

    def test_loads_target(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "state-hub" in cat.targets
        t = cat.targets["state-hub"]
        assert t.domain == "coulombcore"
        assert t.kind == "service"
        assert "state-hub-coulombcore" in t.reachable_via

    def test_loads_bridge(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "state-hub-coulombcore" in cat.bridges
        b = cat.bridges["state-hub-coulombcore"]
        assert b.host == "coulombcore.local"
        assert b.remote_port == 18000
        assert b.health_check is not None
        assert b.health_check.url == "http://127.0.0.1:18000/health"
        assert b.reconnect is not None
        assert b.reconnect.max_attempts == 0

    def test_loads_actor(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "agent.claude-coulombcore" in cat.actors
        a = cat.actors["agent.claude-coulombcore"]
        assert a.actor_class == "automation"

    def test_unknown_type_skipped(self, catalog_dir):
        (catalog_dir / "domains" / "coulombcore" / "unknown.yaml").write_text(
            "type: mystery\nid: x\n"
        )
        # Should not raise
        cat = load_catalog(catalog_dir)
        assert isinstance(cat, Catalog)

    def test_empty_catalog_dir(self, tmp_path):
        root = tmp_path / "empty"
        root.mkdir()
        cat = load_catalog(root)
        assert cat.domains == {}
        assert cat.bridges == {}

    def test_missing_required_field_raises(self, tmp_path):
        root = tmp_path / "bad"
        domain_dir = root / "domains" / "x"
        domain_dir.mkdir(parents=True)
        (domain_dir / "domain.yaml").write_text("type: domain\nname: X\n")
        with pytest.raises(CatalogLoadError, match="id"):
            load_catalog(root)

    def test_nonexistent_path_raises(self, tmp_path):
        with pytest.raises(CatalogLoadError, match="not found"):
            load_catalog(tmp_path / "nonexistent")

    def test_invalid_yaml_raises(self, tmp_path):
        root = tmp_path / "bad"
        domain_dir = root / "domains" / "x"
        domain_dir.mkdir(parents=True)
        (domain_dir / "domain.yaml").write_text("type: domain\n[\nbad: yaml")
        with pytest.raises(CatalogLoadError):
            load_catalog(root)

@@ -0,0 +1,115 @@
"""Tests for catalog domain models."""
from bridge.catalog.models import (
    ActorClass,
    Catalog,
    CatalogBridge,
    CatalogDomain,
    CatalogTarget,
)


class TestCatalogDomain:
    def test_required_fields(self):
        d = CatalogDomain(id="coulombcore", name="CoulombCore Infra")
        assert d.id == "coulombcore"
        assert d.name == "CoulombCore Infra"

    def test_optional_fields_default(self):
        d = CatalogDomain(id="x", name="X")
        assert d.description == ""
        assert d.environment == ""


class TestCatalogTarget:
    def test_required_fields(self):
        t = CatalogTarget(id="state-hub", domain="coulombcore", kind="service")
        assert t.id == "state-hub"
        assert t.domain == "coulombcore"
        assert t.kind == "service"

    def test_reachable_via_defaults_empty(self):
        t = CatalogTarget(id="t", domain="d", kind="service")
        assert t.reachable_via == []

    def test_reachable_via(self):
        t = CatalogTarget(id="t", domain="d", kind="service", reachable_via=["b1", "b2"])
        assert t.reachable_via == ["b1", "b2"]


class TestCatalogBridge:
    def test_required_fields(self):
        b = CatalogBridge(
            id="state-hub-coulombcore",
            domain="coulombcore",
            target="state-hub",
            host="coulombcore.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
            actor="agent.claude-coulombcore",
        )
        assert b.id == "state-hub-coulombcore"
        assert b.domain == "coulombcore"
        assert b.host == "coulombcore.local"

    def test_optional_fields_default(self):
        b = CatalogBridge(
            id="b",
            domain="d",
            target="t",
            host="h",
            remote_port=1,
            local_port=2,
            ssh_user="u",
            ssh_key="k",
            actor="a",
        )
        assert b.description == ""
        assert b.access_method == "ssh-reverse"
        assert b.health_check is None
        assert b.reconnect is None

    def test_to_tunnel_config(self):
        from bridge.models import TunnelConfig

        b = CatalogBridge(
            id="state-hub-coulombcore",
            domain="coulombcore",
            target="state-hub",
            host="coulombcore.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
            actor="agent.claude-coulombcore",
        )
        tc = b.to_tunnel_config()
        assert isinstance(tc, TunnelConfig)
        assert tc.name == "state-hub-coulombcore"
        assert tc.host == "coulombcore.local"
        assert tc.remote_port == 18000


class TestActorClass:
    def test_fields(self):
        a = ActorClass(id="agent.claude", actor_class="automation", description="Claude agent")
        assert a.id == "agent.claude"
        assert a.actor_class == "automation"

    def test_optional_description(self):
        a = ActorClass(id="x", actor_class="human")
        assert a.description == ""


class TestCatalog:
    def test_empty_catalog(self):
        c = Catalog()
        assert c.domains == {}
        assert c.targets == {}
        assert c.bridges == {}
        assert c.actors == {}

    def test_add_entries(self):
        c = Catalog()
        c.domains["d"] = CatalogDomain(id="d", name="D")
        assert "d" in c.domains

@@ -0,0 +1,88 @@
"""Tests for catalog resolver."""
import pytest

from bridge.catalog.models import (
    ActorClass,
    Catalog,
    CatalogBridge,
    CatalogDomain,
    CatalogTarget,
)
from bridge.catalog.resolver import BridgeNotFound, resolve
from bridge.models import TunnelConfig, ReconnectPolicy


@pytest.fixture
def catalog():
    cat = Catalog()
    cat.domains["d"] = CatalogDomain(id="d", name="D")
    cat.targets["t"] = CatalogTarget(id="t", domain="d", kind="service")
    cat.bridges["catalog-bridge"] = CatalogBridge(
        id="catalog-bridge",
        domain="d",
        target="t",
        host="catalog-host.local",
        remote_port=19000,
        local_port=9000,
        ssh_user="ubuntu",
        ssh_key="~/.ssh/catalog",
        actor="operator.bernd",
    )
    cat.actors["operator.bernd"] = ActorClass(id="operator.bernd", actor_class="human")
    return cat


@pytest.fixture
def inline_tunnels():
    return {
        "inline-bridge": TunnelConfig(
            name="inline-bridge",
            host="inline-host.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/inline",
            actor="operator.bernd",
        )
    }


class TestResolve:
    def test_inline_takes_precedence(self, catalog, inline_tunnels):
        tc = resolve("inline-bridge", catalog=catalog, inline_tunnels=inline_tunnels)
        assert tc.host == "inline-host.local"

    def test_catalog_fallback(self, catalog, inline_tunnels):
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels=inline_tunnels)
        assert tc.host == "catalog-host.local"
        assert tc.remote_port == 19000

    def test_catalog_fallback_no_inline(self, catalog):
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels={})
        assert tc.name == "catalog-bridge"

    def test_missing_name_raises(self, catalog, inline_tunnels):
        with pytest.raises(BridgeNotFound, match="nonexistent"):
            resolve("nonexistent", catalog=catalog, inline_tunnels=inline_tunnels)

    def test_missing_name_no_catalog_raises(self, inline_tunnels):
        with pytest.raises(BridgeNotFound):
            resolve("nonexistent", catalog=None, inline_tunnels=inline_tunnels)

    def test_inline_bridge_returns_tunnel_config(self, catalog, inline_tunnels):
        tc = resolve("inline-bridge", catalog=catalog, inline_tunnels=inline_tunnels)
        assert isinstance(tc, TunnelConfig)

    def test_catalog_bridge_returns_tunnel_config(self, catalog):
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels={})
        assert isinstance(tc, TunnelConfig)

    def test_catalog_is_none_no_inline_raises(self):
        with pytest.raises(BridgeNotFound):
            resolve("any-name", catalog=None, inline_tunnels={})

    def test_resolve_preserves_reconnect_policy(self, catalog):
        catalog.bridges["catalog-bridge"].reconnect = ReconnectPolicy(
            max_attempts=3, backoff_initial=2, backoff_max=30
        )
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels={})
        assert tc.reconnect.max_attempts == 3
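
Reviewer note: the resolver tests above pin down a lookup order (inline tunnels first, catalog second, `BridgeNotFound` otherwise). The sketch below illustrates that order only. It is not the code in `bridge.catalog.resolver`: `resolve_sketch`, the simplified `TunnelConfig`, and the plain-dict catalog are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Dict, Optional


class BridgeNotFound(Exception):
    """Raised when a name is in neither the inline tunnels nor the catalog."""


@dataclass
class TunnelConfig:
    # Hypothetical, reduced stand-in for bridge.models.TunnelConfig.
    name: str
    host: str


def resolve_sketch(
    name: str,
    catalog: Optional[Dict[str, TunnelConfig]],
    inline_tunnels: Dict[str, TunnelConfig],
) -> TunnelConfig:
    # Inline definitions win over catalog entries with the same name.
    if name in inline_tunnels:
        return inline_tunnels[name]
    # The catalog is optional; fall back to it only when it is present.
    if catalog is not None and name in catalog:
        return catalog[name]
    # Include the name so callers can match on it in error handling.
    raise BridgeNotFound(f"bridge '{name}' not found")
```

Keeping the name in the error message is what lets a test like `match="nonexistent"` succeed.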

@@ -0,0 +1,93 @@
"""Tests for catalog validator."""
from bridge.catalog.models import (
    ActorClass,
    Catalog,
    CatalogBridge,
    CatalogDomain,
    CatalogTarget,
)
from bridge.catalog.validator import validate_catalog


def _make_full_catalog() -> Catalog:
    cat = Catalog()
    cat.domains["coulombcore"] = CatalogDomain(id="coulombcore", name="CoulombCore")
    cat.targets["state-hub"] = CatalogTarget(
        id="state-hub",
        domain="coulombcore",
        kind="service",
        reachable_via=["state-hub-coulombcore"],
    )
    cat.bridges["state-hub-coulombcore"] = CatalogBridge(
        id="state-hub-coulombcore",
        domain="coulombcore",
        target="state-hub",
        host="host.local",
        remote_port=18000,
        local_port=8000,
        ssh_user="ubuntu",
        ssh_key="~/.ssh/id_ops",
        actor="agent.claude-coulombcore",
    )
    cat.actors["agent.claude-coulombcore"] = ActorClass(
        id="agent.claude-coulombcore",
        actor_class="automation",
    )
    return cat


class TestValidateCatalog:
    def test_valid_catalog_no_errors(self):
        cat = _make_full_catalog()
        errors = validate_catalog(cat)
        assert errors == []

    def test_target_domain_must_exist(self):
        cat = _make_full_catalog()
        cat.targets["orphan"] = CatalogTarget(
            id="orphan", domain="nonexistent-domain", kind="service"
        )
        errors = validate_catalog(cat)
        assert any("orphan" in e and "nonexistent-domain" in e for e in errors)

    def test_target_reachable_via_must_exist(self):
        cat = _make_full_catalog()
        cat.targets["state-hub"].reachable_via.append("nonexistent-bridge")
        errors = validate_catalog(cat)
        assert any("nonexistent-bridge" in e for e in errors)

    def test_bridge_domain_must_exist(self):
        cat = _make_full_catalog()
        cat.bridges["state-hub-coulombcore"].domain = "missing-domain"
        errors = validate_catalog(cat)
        assert any("missing-domain" in e for e in errors)

    def test_bridge_target_must_exist(self):
        cat = _make_full_catalog()
        cat.bridges["state-hub-coulombcore"].target = "missing-target"
        errors = validate_catalog(cat)
        assert any("missing-target" in e for e in errors)

    def test_bridge_actor_must_exist(self):
        cat = _make_full_catalog()
        cat.bridges["state-hub-coulombcore"].actor = "nonexistent-actor"
        errors = validate_catalog(cat)
        assert any("nonexistent-actor" in e for e in errors)

    def test_multiple_errors_all_reported(self):
        cat = Catalog()
        # Target with dangling domain and reachable_via
        cat.targets["t1"] = CatalogTarget(
            id="t1", domain="missing", kind="service", reachable_via=["missing-bridge"]
        )
        # Bridge with dangling domain + target + actor
        cat.bridges["b1"] = CatalogBridge(
            id="b1", domain="missing", target="missing", host="h",
            remote_port=1, local_port=2, ssh_user="u", ssh_key="k", actor="missing-actor",
        )
        errors = validate_catalog(cat)
        assert len(errors) >= 4

    def test_empty_catalog_is_valid(self):
        cat = Catalog()
        assert validate_catalog(cat) == []
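
Reviewer note: the validator tests above expect every dangling reference to be reported in one pass rather than failing on the first. A minimal sketch of that collect-all style follows. The `Target` and `Bridge` dataclasses, `validate_refs`, and the message wording are assumptions made for illustration, not the contents of `bridge.catalog.validator`.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Target:
    # Hypothetical, reduced stand-in for CatalogTarget.
    id: str
    domain: str
    reachable_via: List[str] = field(default_factory=list)


@dataclass
class Bridge:
    # Hypothetical, reduced stand-in for CatalogBridge.
    id: str
    domain: str
    target: str
    actor: str


def validate_refs(
    domains: Set[str],
    targets: Dict[str, Target],
    bridges: Dict[str, Bridge],
    actors: Set[str],
) -> List[str]:
    """Return one message per dangling reference; an empty list means valid."""
    errors: List[str] = []
    for t in targets.values():
        if t.domain not in domains:
            errors.append(f"target '{t.id}': unknown domain '{t.domain}'")
        for bridge_id in t.reachable_via:
            if bridge_id not in bridges:
                errors.append(f"target '{t.id}': unknown bridge '{bridge_id}'")
    for b in bridges.values():
        if b.domain not in domains:
            errors.append(f"bridge '{b.id}': unknown domain '{b.domain}'")
        if b.target not in targets:
            errors.append(f"bridge '{b.id}': unknown target '{b.target}'")
        if b.actor not in actors:
            errors.append(f"bridge '{b.id}': unknown actor '{b.actor}'")
    return errors
```

Returning a list instead of raising lets a CLI print the full set of problems in a single run.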

130
tests/test_cleanup.py Normal file
@@ -0,0 +1,130 @@
"""Tests for stale SSH forward cleanup."""
from __future__ import annotations

import textwrap
from unittest.mock import MagicMock, patch

from typer.testing import CliRunner

from bridge.cleanup import (
    CleanupAction,
    build_cron_line,
    cleanup_all_tunnels,
    remote_forward_health_url,
    should_cleanup_tunnel,
)
from bridge.cli import app
from bridge.config import load_config
from bridge.models import HealthCheckConfig, TunnelConfig
from bridge.state import StateManager


def _tunnel(**overrides) -> TunnelConfig:
    base = dict(
        name="state-hub-railiance01",
        host="92.205.62.239",
        remote_port=18000,
        local_port=8000,
        ssh_user="tegwick",
        ssh_key="~/.ssh/id_ops",
        actor="agt-claude-railiance01",
        health_check=HealthCheckConfig(
            url="http://127.0.0.1:8000/state/health",
            timeout_seconds=5,
        ),
    )
    base.update(overrides)
    return TunnelConfig(**base)


class TestRemoteForwardHealthUrl:
    def test_maps_local_port_to_remote(self):
        cfg = _tunnel()
        assert remote_forward_health_url(cfg) == "http://127.0.0.1:18000/state/health"

    def test_returns_none_for_local_tunnel(self):
        cfg = _tunnel(direction="local")
        assert remote_forward_health_url(cfg) is None


class TestShouldCleanupTunnel:
    def test_skips_healthy_remote_forward(self, tmp_path):
        cfg = _tunnel()
        state_mgr = StateManager(state_dir=tmp_path)
        with (
            patch("bridge.cleanup.remote_port_listening", return_value=True),
            patch("bridge.cleanup.probe_remote_forward", return_value=(True, "ok")),
        ):
            needed, reason = should_cleanup_tunnel(cfg, state_mgr)
        assert needed is False

    def test_detects_stale_forward_when_local_ok_remote_fails(self, tmp_path):
        cfg = _tunnel()
        state_mgr = StateManager(state_dir=tmp_path)
        with (
            patch("bridge.cleanup.remote_port_listening", return_value=True),
            patch("bridge.cleanup.probe_remote_forward", return_value=(False, "timeout")),
            patch("bridge.cleanup.local_service_healthy", return_value=True),
            patch(
                "bridge.cleanup.check_tunnel",
                return_value=MagicMock(ssh_process="ok", remote_port="listening"),
            ),
        ):
            needed, reason = should_cleanup_tunnel(cfg, state_mgr)
        assert needed is True
        assert "stale forward" in reason


class TestCleanupAllTunnels:
    def test_reports_cleaned_tunnel(self, tmp_path, monkeypatch):
        monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "tunnels.yaml"))
        (tmp_path / "tunnels.yaml").write_text(
            textwrap.dedent(
                """\
                tunnels:
                  state-hub-railiance01:
                    host: 92.205.62.239
                    remote_port: 18000
                    local_port: 8000
                    ssh_user: tegwick
                    ssh_key: ~/.ssh/id_ops
                    actor: agt-claude-railiance01
                    health_check:
                      url: http://127.0.0.1:8000/state/health
                actors:
                  agt-claude-railiance01:
                    class: agt
                """
            )
        )
        cfg = load_config()
        state_mgr = StateManager(state_dir=tmp_path / "state")
        with patch(
            "bridge.cleanup.cleanup_tunnel",
            return_value=CleanupAction("state-hub-railiance01", "cleaned", "cleared"),
        ):
            report = cleanup_all_tunnels(cfg, state_mgr, restart=False)
        assert report.cleaned_count == 1
        assert report.actions[0].action == "cleaned"


class TestMaintenanceCli:
    def test_cleanup_help(self):
        runner = CliRunner()
        result = runner.invoke(app, ["maintenance", "cleanup", "--help"])
        assert result.exit_code == 0
        assert "restart" in result.output.lower()

    def test_show_cron_prints_template_when_not_installed(self):
        runner = CliRunner()
        with patch("bridge.cli.read_installed_cron", return_value=None):
            result = runner.invoke(app, ["maintenance", "show-cron"])
        assert result.exit_code == 0
        assert "0 3 * * *" in result.output


def test_build_cron_line_contains_marker():
    line = build_cron_line()
    assert "0 3 * * *" in line
    assert "maintenance cleanup --restart" in line
    assert "ops-bridge: maintenance cleanup" in line
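
Reviewer note: `test_maps_local_port_to_remote` expects the configured health URL (which points at the local port) to be rewritten to the remote forward port. One way to do that with the standard library is sketched below. `remote_health_url` and its signature are hypothetical; the real `remote_forward_health_url` takes a `TunnelConfig` and may work differently.

```python
from urllib.parse import urlsplit, urlunsplit


def remote_health_url(url: str, local_port: int, remote_port: int) -> str:
    """Swap the local port in a health URL for the remote forward port."""
    parts = urlsplit(url)
    # Assumption: URLs that do not target the local port are left untouched.
    if parts.port != local_port:
        return url
    # Rebuild only the host:port portion; scheme, path and query are preserved.
    # IPv6 literals would need bracket handling, which is omitted here.
    netloc = f"{parts.hostname}:{remote_port}"
    return urlunsplit(parts._replace(netloc=netloc))
```

Parsing the URL instead of doing a string replace avoids accidentally rewriting a matching digit sequence in the path.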

411
tests/test_cli.py Normal file
@@ -0,0 +1,411 @@
"""Tests for CLI commands."""
import json
import textwrap
from unittest.mock import MagicMock, patch

import pytest
from typer.testing import CliRunner

from bridge.cli import app

VALID_CONFIG = textwrap.dedent("""\
    tunnels:
      test-tunnel:
        host: host.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: adm-bernd
    actors:
      adm-bernd:
        class: adm
        description: Bernd
""")

runner = CliRunner()


@pytest.fixture
def config_file(tmp_path):
    f = tmp_path / "tunnels.yaml"
    f.write_text(VALID_CONFIG)
    return f


@pytest.fixture
def state_dir(tmp_path):
    return tmp_path / "state"


@pytest.fixture
def env(config_file, state_dir):
    return {"BRIDGE_CONFIG": str(config_file), "BRIDGE_STATE_DIR": str(state_dir)}


class TestHelpCommand:
    def test_app_help(self):
        result = runner.invoke(app, ["--help"])
        assert result.exit_code == 0
        assert "bridge" in result.output.lower() or "Usage" in result.output

    def test_up_help(self):
        result = runner.invoke(app, ["up", "--help"])
        assert result.exit_code == 0

    def test_down_help(self):
        result = runner.invoke(app, ["down", "--help"])
        assert result.exit_code == 0

    def test_status_help(self):
        result = runner.invoke(app, ["status", "--help"])
        assert result.exit_code == 0

    def test_logs_help(self):
        result = runner.invoke(app, ["logs", "--help"])
        assert result.exit_code == 0

    def test_restart_help(self):
        result = runner.invoke(app, ["restart", "--help"])
        assert result.exit_code == 0


class TestStatusCommand:
    @pytest.mark.capability("bridge_status")
    @pytest.mark.access_mode("cli")
    def test_status_shows_tunnels(self, env, state_dir):
        result = runner.invoke(app, ["status"], env=env)
        assert result.exit_code == 0
        assert "test-tunnel" in result.output

    def test_status_json_flag(self, env, state_dir):
        result = runner.invoke(app, ["status", "--json"], env=env)
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert isinstance(data, list)
        assert len(data) == 1
        assert data[0]["tunnel"] == "test-tunnel"
        assert "state" in data[0]
        assert "actor" in data[0]
        assert "host" in data[0]

    def test_status_shows_state(self, env, state_dir):
        result = runner.invoke(app, ["status"], env=env)
        assert result.exit_code == 0
        assert "stopped" in result.output.lower()

    def test_status_unknown_config_exit_1(self, tmp_path):
        result = runner.invoke(app, ["status"], env={"BRIDGE_CONFIG": str(tmp_path / "no.yaml")})
        assert result.exit_code == 1
class TestUpCommand:
    def test_up_unknown_tunnel_exit_1(self, env):
        result = runner.invoke(app, ["up", "nonexistent"], env=env)
        assert result.exit_code == 1
        assert "nonexistent" in result.output

    @pytest.mark.capability("bridge_up")
    @pytest.mark.access_mode("cli")
    def test_up_calls_manager_start(self, env, state_dir):
        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_mgr_cls.return_value = mock_mgr
            result = runner.invoke(app, ["up", "test-tunnel"], env=env)
        assert result.exit_code == 0
        mock_mgr.start.assert_called_once()

    def test_up_already_running_exit_2(self, env, state_dir):
        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = True
            mock_mgr_cls.return_value = mock_mgr
            result = runner.invoke(app, ["up", "test-tunnel"], env=env)
        assert result.exit_code == 2


class TestDownCommand:
    def test_down_unknown_tunnel_exit_1(self, env):
        result = runner.invoke(app, ["down", "nonexistent"], env=env)
        assert result.exit_code == 1

    @pytest.mark.capability("bridge_down")
    @pytest.mark.access_mode("cli")
    def test_down_calls_manager_stop(self, env, state_dir):
        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = True
            mock_mgr_cls.return_value = mock_mgr
            result = runner.invoke(app, ["down", "test-tunnel"], env=env)
        assert result.exit_code == 0
        mock_mgr.stop.assert_called_once()

    def test_down_not_running_exit_2(self, env, state_dir):
        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_mgr_cls.return_value = mock_mgr
            result = runner.invoke(app, ["down", "test-tunnel"], env=env)
        assert result.exit_code == 2


class TestLogsCommand:
    def test_logs_unknown_tunnel_exit_1(self, env):
        result = runner.invoke(app, ["logs", "nonexistent"], env=env)
        assert result.exit_code == 1

    def test_logs_no_log_file_shows_empty(self, env, state_dir):
        result = runner.invoke(app, ["logs", "test-tunnel"], env=env)
        assert result.exit_code == 0

    @pytest.mark.capability("bridge_logs")
    @pytest.mark.access_mode("cli")
    def test_logs_shows_events(self, env, state_dir):
        import json as _json

        state_dir.mkdir(parents=True, exist_ok=True)
        log_file = state_dir / "test-tunnel.log"
        log_file.write_text(
            _json.dumps({
                "timestamp": "2026-01-01T00:00:00+00:00",
                "tunnel": "test-tunnel",
                "actor": "operator.bernd",
                "actor_class": "human",
                "event": "bridge_started",
            }) + "\n"
        )
        result = runner.invoke(app, ["logs", "test-tunnel"], env=env)
        assert result.exit_code == 0
        assert "bridge_started" in result.output
class TestCheckCommand:
    def test_check_help(self):
        result = runner.invoke(app, ["check", "--help"])
        assert result.exit_code == 0

    @pytest.mark.capability("bridge_check")
    @pytest.mark.access_mode("cli")
    def test_check_all_pass(self, env):
        from bridge.diagnostics import TunnelCheckResult

        ok_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="ok",
            pid=12345,
            remote_port="listening",
            local_api=None,
            latency_ms=None,
            stale_state=False,
        )
        with patch("bridge.cli.check_all_tunnels", return_value=[ok_result]):
            result = runner.invoke(app, ["check"], env=env)
        assert result.exit_code == 0

    def test_check_any_fail(self, env):
        from bridge.diagnostics import TunnelCheckResult

        fail_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="dead",
            pid=None,
            remote_port="closed",
            local_api=None,
            latency_ms=None,
            stale_state=True,
        )
        with patch("bridge.cli.check_all_tunnels", return_value=[fail_result]):
            result = runner.invoke(app, ["check"], env=env)
        assert result.exit_code == 1

    def test_check_json_flag(self, env):
        from bridge.diagnostics import TunnelCheckResult

        ok_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="ok",
            pid=12345,
            remote_port="listening",
            local_api=None,
            latency_ms=None,
            stale_state=False,
        )
        with patch("bridge.cli.check_all_tunnels", return_value=[ok_result]):
            result = runner.invoke(app, ["check", "--json"], env=env)
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert isinstance(data, list)
        assert len(data) == 1
        assert data[0]["ok"] is True
        assert data[0]["tunnel"] == "test-tunnel"
        assert data[0]["ssh_process"] == "ok"

    def test_check_specific_tunnel(self, env):
        from bridge.diagnostics import TunnelCheckResult

        ok_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="ok",
            pid=12345,
            remote_port="listening",
            local_api=None,
            latency_ms=None,
            stale_state=False,
        )
        with patch("bridge.cli.check_tunnel", return_value=ok_result):
            result = runner.invoke(app, ["check", "test-tunnel"], env=env)
        assert result.exit_code == 0

    def test_check_unknown_tunnel_exit_1(self, env):
        result = runner.invoke(app, ["check", "nonexistent"], env=env)
        assert result.exit_code == 1


REVERSE_CONFIG = VALID_CONFIG

LOCAL_TUNNEL_CONFIG = textwrap.dedent("""\
    tunnels:
      k3s-api:
        host: host.local
        remote_port: 6443
        local_port: 6443
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: adm-bernd
        direction: local
    actors:
      adm-bernd:
        class: adm
        description: Bernd
""")
class TestRestartCommand:
    def test_restart_unknown_tunnel_exit_1(self, env):
        result = runner.invoke(app, ["restart", "nonexistent"], env=env)
        assert result.exit_code == 1

    def test_restart_help_mentions_remote_cleanup(self):
        result = runner.invoke(app, ["restart", "--help"])
        assert result.exit_code == 0
        assert "stale-forward" in result.output.lower() or "remote" in result.output.lower()

    @pytest.mark.capability("bridge_restart")
    @pytest.mark.access_mode("cli")
    def test_restart_reverse_tunnel_delegates_to_cleanup(self, env):
        from bridge.cleanup import CleanupAction

        with patch("bridge.cli.restart_tunnel") as mock_restart:
            mock_restart.return_value = CleanupAction(
                "test-tunnel", "healthy", "remote forward healthy"
            )
            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
        assert result.exit_code == 0
        mock_restart.assert_called_once()
        assert "test-tunnel: healthy" in result.output

    def test_restart_reverse_tunnel_reports_cleaned_and_restarted(self, env):
        from bridge.cleanup import CleanupAction

        with patch("bridge.cli.restart_tunnel") as mock_restart:
            mock_restart.return_value = CleanupAction(
                "test-tunnel",
                "cleaned_and_restarted",
                "stale forward; restarted tunnel; cleared",
            )
            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
        assert result.exit_code == 0
        assert "cleaned_and_restarted" in result.output

    def test_restart_reverse_tunnel_error_exit_1(self, env):
        from bridge.cleanup import CleanupAction

        with patch("bridge.cli.restart_tunnel") as mock_restart:
            mock_restart.return_value = CleanupAction(
                "test-tunnel", "error", "cleanup failed: still_listening"
            )
            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
        assert result.exit_code == 1
        assert "error" in result.output

    def test_restart_local_tunnel_uses_stop_start(self, tmp_path, state_dir):
        config_file = tmp_path / "tunnels.yaml"
        config_file.write_text(LOCAL_TUNNEL_CONFIG)
        env = {
            "BRIDGE_CONFIG": str(config_file),
            "BRIDGE_STATE_DIR": str(state_dir),
        }
        with patch("bridge.cleanup.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr_cls.return_value = mock_mgr
            call_order = []
            mock_mgr.stop.side_effect = lambda: call_order.append("stop")
            mock_mgr.start.side_effect = lambda: call_order.append("start")
            result = runner.invoke(app, ["restart", "k3s-api"], env=env)
        assert result.exit_code == 0
        assert call_order == ["stop", "start"]
        assert "k3s-api: restarted" in result.output


class TestCertStatusCommand:
    @pytest.mark.capability("bridge_cert_status")
    @pytest.mark.access_mode("cli")
    def test_cert_status_no_cert_shows_static_key(self, env, state_dir):
        result = runner.invoke(app, ["cert-status"], env=env)
        assert result.exit_code == 0
        assert "static-key" in result.output

    def test_cert_status_json_no_cert(self, env, state_dir):
        result = runner.invoke(app, ["cert-status", "--json"], env=env)
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert data[0]["mode"] == "static-key"

    def test_cert_status_exit_1_on_expired(self, env, state_dir, tmp_path):
        # Write a fake cert file in state dir; mock ssh-keygen to report expired
        state_dir.mkdir(parents=True, exist_ok=True)
        cert_file = state_dir / "test-tunnel-cert.pub"
        cert_file.write_text("fake cert")
        with patch("subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(
                stdout=(
                    "test-tunnel-cert.pub:\n"
                    " Key ID: \"agt-test\"\n"
                    " Valid: from 2026-01-01T00:00:00 to 2026-01-02T00:00:00\n"
                ),
                returncode=0,
            )
            result = runner.invoke(app, ["cert-status"], env=env)
        assert result.exit_code == 1
        assert "EXPIRED" in result.output

    def test_cert_status_json_with_cert(self, env, state_dir):
        state_dir.mkdir(parents=True, exist_ok=True)
        cert_file = state_dir / "test-tunnel-cert.pub"
        cert_file.write_text("fake cert")
        with patch("subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(
                stdout=(
                    "test-tunnel-cert.pub:\n"
                    " Key ID: \"agt-test\"\n"
                    " Valid: from 2030-01-01T00:00:00 to 2030-01-02T00:00:00\n"
                ),
                returncode=0,
            )
            result = runner.invoke(app, ["cert-status", "--json"], env=env)
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert data[0]["mode"] == "cert"
        assert data[0]["key_id"] == "agt-test"
        assert data[0]["expired"] is False
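
Reviewer note: the cert-status tests mock `ssh-keygen` output containing a `Valid: from ... to ...` line and expect an expired/not-expired verdict. A possible way to derive that verdict is sketched below. `cert_expired` is a hypothetical helper, and how `bridge.cli` actually parses the output is not shown in this diff. Passing `now` explicitly is an assumption made to keep the sketch deterministic.

```python
import re
from datetime import datetime
from typing import Optional

# Matches the validity line used in the mocked output, e.g.
# "Valid: from 2026-01-01T00:00:00 to 2026-01-02T00:00:00".
_VALID_RE = re.compile(r"Valid:\s+from\s+(\S+)\s+to\s+(\S+)")


def cert_expired(keygen_output: str, now: datetime) -> Optional[bool]:
    """Return True/False for expiry, or None when no validity window is found."""
    match = _VALID_RE.search(keygen_output)
    if match is None:
        # Covers output without a bounded window (for example "Valid: forever").
        return None
    not_after = datetime.fromisoformat(match.group(2))
    # Both datetimes are naive here; they are compared as-is.
    return now >= not_after
```

Injecting the clock avoids tests that silently flip once a hard-coded date passes.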

299
tests/test_config.py Normal file
@@ -0,0 +1,299 @@
"""Tests for config loading."""
import textwrap
import warnings

import pytest

from bridge.config import ConfigError, load_config
from bridge.models import ActorType

VALID_YAML = textwrap.dedent("""\
    tunnels:
      state-hub-coulombcore:
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agt-claude-coulombcore
        health_check:
          url: http://127.0.0.1:18000/health
          interval_seconds: 30
          timeout_seconds: 5
        reconnect:
          max_attempts: 0
          backoff_initial: 5
          backoff_max: 60
    actors:
      agt-claude-coulombcore:
        class: agt
        description: Claude Code agent on CoulombCore
      adm-bernd:
        class: adm
        description: Bernd Worsch
""")


@pytest.fixture
def config_file(tmp_path):
    f = tmp_path / "tunnels.yaml"
    f.write_text(VALID_YAML)
    return f


def test_load_valid_config(config_file, monkeypatch):
    monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
    cfg = load_config()
    assert "state-hub-coulombcore" in cfg.tunnels
    t = cfg.tunnels["state-hub-coulombcore"]
    assert t.host == "coulombcore.local"
    assert t.remote_port == 18000
    assert t.local_port == 8000
    assert t.ssh_user == "ubuntu"
    assert t.actor == "agt-claude-coulombcore"


def test_health_check_loaded(config_file, monkeypatch):
    monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
    cfg = load_config()
    t = cfg.tunnels["state-hub-coulombcore"]
    assert t.health_check is not None
    assert t.health_check.url == "http://127.0.0.1:18000/health"
    assert t.health_check.interval_seconds == 30


def test_reconnect_policy_loaded(config_file, monkeypatch):
    monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
    cfg = load_config()
    t = cfg.tunnels["state-hub-coulombcore"]
    assert t.reconnect.max_attempts == 0
    assert t.reconnect.backoff_initial == 5
    assert t.reconnect.backoff_max == 60


def test_actors_loaded(config_file, monkeypatch):
    monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
    cfg = load_config()
    assert "agt-claude-coulombcore" in cfg.actors
    a = cfg.actors["agt-claude-coulombcore"]
    assert a.actor_type == ActorType.AGT
    assert "adm-bernd" in cfg.actors


def test_missing_required_field_raises(tmp_path, monkeypatch):
    f = tmp_path / "bad.yaml"
    f.write_text(textwrap.dedent("""\
        tunnels:
          broken:
            remote_port: 18000
            local_port: 8000
        actors: {}
    """))
    monkeypatch.setenv("BRIDGE_CONFIG", str(f))
    with pytest.raises(ConfigError, match="host"):
        load_config()


def test_invalid_yaml_raises(tmp_path, monkeypatch):
    f = tmp_path / "bad.yaml"
    f.write_text("tunnels: [\nnot: valid: yaml")
    monkeypatch.setenv("BRIDGE_CONFIG", str(f))
    with pytest.raises(ConfigError):
        load_config()


def test_missing_config_file_raises(tmp_path, monkeypatch):
    monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "nonexistent.yaml"))
    with pytest.raises(ConfigError, match="not found"):
        load_config()


def test_tunnel_without_health_check(tmp_path, monkeypatch):
    f = tmp_path / "tunnels.yaml"
    f.write_text(textwrap.dedent("""\
        tunnels:
          simple:
            host: host.local
            remote_port: 9000
            local_port: 8000
            ssh_user: ubuntu
            ssh_key: ~/.ssh/id_rsa
            actor: adm-bernd
        actors:
          adm-bernd:
            class: adm
            description: Bernd
    """))
    monkeypatch.setenv("BRIDGE_CONFIG", str(f))
    cfg = load_config()
    assert cfg.tunnels["simple"].health_check is None
class TestActorTypeValidation:
    def test_canonical_agt_accepted(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: agt-claude
            actors:
              agt-claude:
                class: agt
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        cfg = load_config()
        assert cfg.actors["agt-claude"].actor_type == ActorType.AGT

    def test_canonical_atm_accepted(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: atm-backup
            actors:
              atm-backup:
                class: atm
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        cfg = load_config()
        assert cfg.actors["atm-backup"].actor_type == ActorType.ATM

    def test_wrong_prefix_raises_config_error(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: adm-bernd
            actors:
              adm-bernd:
                class: agt
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        with pytest.raises(ConfigError, match="must start with 'agt-'"):
            load_config()

    def test_missing_prefix_raises_config_error(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: operator.bernd
            actors:
              operator.bernd:
                class: adm
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        with pytest.raises(ConfigError, match="must start with 'adm-'"):
            load_config()

    def test_unknown_class_raises_config_error(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: adm-bernd
            actors:
              adm-bernd:
                class: wizard
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        with pytest.raises(ConfigError, match="unknown class"):
            load_config()

    def test_legacy_human_maps_to_adm_with_warning(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: adm-bernd
            actors:
              adm-bernd:
                class: human
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            cfg = load_config()
        assert cfg.actors["adm-bernd"].actor_type == ActorType.ADM
        assert any("deprecated" in str(x.message).lower() for x in w)

    def test_legacy_automation_maps_to_atm_with_warning(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: atm-cron
            actors:
              atm-cron:
                class: automation
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        with warnings.catch_warnings(record=True) as w:
            warnings.simplefilter("always")
            cfg = load_config()
        assert cfg.actors["atm-cron"].actor_type == ActorType.ATM
        assert any("deprecated" in str(x.message).lower() for x in w)
class TestCertCommandConfig:
    def test_cert_command_parsed(self, tmp_path, monkeypatch):
        f = tmp_path / "t.yaml"
        f.write_text(textwrap.dedent("""\
            tunnels:
              t:
                host: h
                remote_port: 1
                local_port: 2
                ssh_user: u
                ssh_key: ~/.ssh/k
                actor: agt-bridge
                cert_command: "warden sign agt-bridge --pubkey /tmp/k.pub"
            actors:
              agt-bridge:
                class: agt
        """))
        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
        cfg = load_config()
        assert cfg.tunnels["t"].cert_command == "warden sign agt-bridge --pubkey /tmp/k.pub"

    def test_no_cert_command_is_none(self, config_file, monkeypatch):
        monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
        cfg = load_config()
        assert cfg.tunnels["state-hub-coulombcore"].cert_command is None
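
Reviewer note: `TestActorTypeValidation` encodes three rules: the actor id must carry the class prefix, the legacy classes `human` and `automation` map to `adm` and `atm` with a deprecation warning, and unknown classes are rejected. The sketch below shows one way those rules could fit together. `parse_actor`, the local `ActorType` and `ConfigError`, and the exact messages are assumptions; only the three members seen in the tests are modelled.

```python
import warnings
from enum import Enum


class ConfigError(Exception):
    """Hypothetical stand-in for bridge.config.ConfigError."""


class ActorType(Enum):
    # Only the members exercised by the tests; the real enum may have more.
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


# Legacy class names and their canonical replacements, as the tests imply.
LEGACY_CLASSES = {"human": "adm", "automation": "atm"}


def parse_actor(actor_id: str, raw_class: str) -> ActorType:
    if raw_class in LEGACY_CLASSES:
        canonical = LEGACY_CLASSES[raw_class]
        warnings.warn(
            f"actor class '{raw_class}' is deprecated; use '{canonical}'",
            DeprecationWarning,
            stacklevel=2,
        )
        raw_class = canonical
    try:
        actor_type = ActorType(raw_class)
    except ValueError:
        raise ConfigError(f"actor '{actor_id}': unknown class '{raw_class}'") from None
    prefix = f"{actor_type.value}-"
    if not actor_id.startswith(prefix):
        raise ConfigError(f"actor '{actor_id}' must start with '{prefix}'")
    return actor_type
```

Mapping legacy names before the prefix check means an old `class: human` entry is still held to the `adm-` naming rule.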


@@ -0,0 +1,229 @@
"""Cross-mode capability coverage meta-test.

Enforces that every capability in the registry has at least one test
marked with @pytest.mark.capability(name) and @pytest.mark.access_mode(mode)
for each of its required_access_modes.

The test discovers coverage by walking all collected test items, so it will
only pass when the full test suite is collected (i.e. run without -k filters
that exclude capability-marked tests).

Also validates that the registry itself is self-consistent.
"""
from __future__ import annotations
import pytest
from bridge.capabilities import CAPABILITIES, CAPABILITIES_BY_NAME
from tests.conftest import collect_capability_coverage
# ---------------------------------------------------------------------------
# Registry self-consistency
# ---------------------------------------------------------------------------
def test_registry_has_capabilities():
"""Sanity: registry must be non-empty."""
assert len(CAPABILITIES) > 0
def test_registry_names_are_unique():
names = [c.name for c in CAPABILITIES]
assert len(names) == len(set(names)), "Duplicate capability names in registry"
def test_registry_access_modes_are_valid():
valid = {"cli", "mcp", "skill"}
for cap in CAPABILITIES:
unknown = cap.required_access_modes - valid
assert not unknown, (
f"Capability '{cap.name}' has unknown access modes: {unknown}"
)
def test_registry_each_capability_has_at_least_one_mode():
for cap in CAPABILITIES:
assert cap.required_access_modes, (
f"Capability '{cap.name}' has no required_access_modes"
)
# ---------------------------------------------------------------------------
# Cross-mode coverage completeness (session-scope fixture)
# ---------------------------------------------------------------------------
@pytest.fixture(scope="session")
def capability_coverage(request) -> set[tuple[str, str]]:
"""Collect all (capability, access_mode) pairs from the test session."""
return collect_capability_coverage(request.session.items)
def test_all_required_modes_have_tests(capability_coverage):
"""Every (capability, mode) pair in the registry must have a test."""
missing: list[str] = []
for cap in CAPABILITIES:
for mode in sorted(cap.required_access_modes):
if (cap.name, mode) not in capability_coverage:
missing.append(f" {cap.name!r} × {mode!r}")
if missing:
pytest.fail(
"Missing test coverage for the following (capability, access_mode) pairs:\n"
+ "\n".join(missing)
+ "\n\nAdd a test with @pytest.mark.capability(<name>) and "
"@pytest.mark.access_mode(<mode>)."
)
# ---------------------------------------------------------------------------
# T02 — Registry completeness against CLI commands and MCP tools
# ---------------------------------------------------------------------------
def test_registry_cli_capabilities_have_matching_commands():
    """Every capability requiring CLI must have a corresponding CLI command.

    Checks that the registry doesn't list CLI requirements for operations that
    don't actually exist as CLI commands. Uses the Typer app's callback names.
    """
    from bridge.cli import app, targets_app, catalog_app

    # Top-level commands: callback name "status" maps to capability "bridge_status".
    top_level = {f"bridge_{cmd.callback.__name__}" for cmd in app.registered_commands}

    # Sub-command callbacks map to capability names through explicit tables.
    targets_map = {"targets_show": "catalog_show_target"}
    catalog_map = {
        "catalog_list": "catalog_list_domains",
        "catalog_validate": "catalog_validate",
        "catalog_show": "catalog_show_bridge",
    }
    targets_cmds = {
        targets_map[cmd.callback.__name__]
        for cmd in targets_app.registered_commands
        if cmd.callback.__name__ in targets_map
    }
    catalog_cmds = {
        catalog_map[cmd.callback.__name__]
        for cmd in catalog_app.registered_commands
        if cmd.callback.__name__ in catalog_map
    }

    # catalog_list_targets is served by the targets app root command (which
    # lists targets), so it is added explicitly rather than discovered.
    all_cli_caps = top_level | targets_cmds | catalog_cmds | {"catalog_list_targets"}
    for cap in CAPABILITIES:
        if "cli" in cap.required_access_modes:
            assert cap.name in all_cli_caps, (
                f"Capability '{cap.name}' requires CLI coverage but no matching "
                f"CLI command was found. Either add the command or update the registry."
            )
async def test_mcp_tools_in_registry():
"""Every MCP tool name must appear as a capability in the registry."""
from fastmcp import Client
from bridge.mcp_server.server import mcp
async with Client(mcp) as c:
tools = await c.list_tools()
tool_names = {t.name for t in tools}
registered_cap_names = set(CAPABILITIES_BY_NAME)
for name in tool_names:
assert name in registered_cap_names, (
f"MCP tool '{name}' is not registered as a capability. "
f"Add it to src/bridge/capabilities.py."
)
# ---------------------------------------------------------------------------
# T12 — Self-validation: sentinel fixture proves the gap-checker catches gaps
# ---------------------------------------------------------------------------
def test_meta_test_catches_missing_mode_gap():
"""Self-validation: the coverage checker must detect a missing-mode gap.
Injects a synthetic _test_sentinel capability requiring both cli and mcp.
Creates mock items with *only* a cli test for it (deliberately omitting mcp).
Asserts that collect_capability_coverage reports the mcp gap — proving the
meta-test mechanism is functional, not a silent no-op.
This test validates Goal #4 from BRIDGE-WP-0003:
"The gap-detection mechanism is itself tested: a synthetic missing-mode
fixture asserts the meta-test catches it."
"""
from bridge.capabilities import Capability
sentinel = Capability(
name="_test_sentinel",
description="Synthetic capability for meta-test self-validation",
required_access_modes=frozenset({"cli", "mcp"}),
)
patched_caps = CAPABILITIES + [sentinel]
# Minimal mock: an iterable of items that respond to iter_markers()
class _Mark:
def __init__(self, arg: str):
self.args = (arg,)
class _MockItem:
def __init__(self, capability: str, mode: str):
self._cap = capability
self._mode = mode
def iter_markers(self, name: str):
if name == "capability":
return [_Mark(self._cap)]
if name == "access_mode":
return [_Mark(self._mode)]
return []
# Only supply a cli test for the sentinel — the mcp test is intentionally absent
mock_items = [_MockItem("_test_sentinel", "cli")]
covered = collect_capability_coverage(mock_items)
# The cli mode should be registered
assert ("_test_sentinel", "cli") in covered, (
"collect_capability_coverage failed to record the cli mock item"
)
# The mcp mode must NOT be covered — this is the gap we want to catch
assert ("_test_sentinel", "mcp") not in covered, (
"collect_capability_coverage incorrectly registered an mcp test that was not provided"
)
# Run the same gap-detection logic used by test_all_required_modes_have_tests
gaps = [
(cap.name, mode)
for cap in patched_caps
for mode in cap.required_access_modes
if (cap.name, mode) not in covered
]
assert ("_test_sentinel", "mcp") in gaps, (
"Gap checker failed to detect the missing mcp mode for _test_sentinel. "
"The meta-test mechanism is broken."
)
# Sanity: cli mode should NOT appear as a gap (it was covered)
assert ("_test_sentinel", "cli") not in gaps
def test_no_orphan_capability_marks(capability_coverage):
"""Every (capability, mode) pair in the test suite must exist in the registry.
This prevents tests from referencing stale or misspelled capability names.
"""
orphans: list[str] = []
for cap_name, mode in sorted(capability_coverage):
if cap_name not in CAPABILITIES_BY_NAME:
orphans.append(f" {cap_name!r} (mode={mode!r}) — not in registry")
else:
cap = CAPABILITIES_BY_NAME[cap_name]
if mode not in cap.required_access_modes:
orphans.append(
f" {cap_name!r} × {mode!r} — mode not required for this capability"
)
if orphans:
pytest.fail(
"Test suite references capability/mode pairs not in registry:\n"
+ "\n".join(orphans)
)
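
The meta-tests above rely on `collect_capability_coverage` from `tests/conftest.py`, which is not shown in this diff. As a rough illustration of the technique (walking items and pairing their `capability` and `access_mode` markers), here is a minimal sketch. `collect_pairs`, `find_gaps`, and `FakeItem` are hypothetical names, and the real helper may differ, for example in how it handles items that carry several markers.

```python
from __future__ import annotations

from itertools import product
from types import SimpleNamespace
from typing import Iterable


def collect_pairs(items: Iterable) -> set[tuple[str, str]]:
    """Return every (capability, access_mode) pair declared by the items.

    Each item only needs an iter_markers(name) method, as pytest items have.
    """
    covered: set[tuple[str, str]] = set()
    for item in items:
        caps = [m.args[0] for m in item.iter_markers("capability") if m.args]
        modes = [m.args[0] for m in item.iter_markers("access_mode") if m.args]
        # An item lacking either marker contributes no pairs.
        covered.update(product(caps, modes))
    return covered


def find_gaps(required: Iterable[tuple[str, str]], covered: set) -> list:
    """List required pairs that no item covers, in sorted order."""
    return sorted(set(required) - covered)


class FakeItem:
    """Stand-in for a pytest item, used here only to exercise the sketch."""

    def __init__(self, **marks: str):
        self._marks = marks

    def iter_markers(self, name: str):
        value = self._marks.get(name)
        return [SimpleNamespace(args=(value,))] if value is not None else []


items = [
    FakeItem(capability="bridge_status", access_mode="cli"),
    FakeItem(capability="bridge_status"),  # no access_mode marker: ignored
]
covered = collect_pairs(items)
gaps = find_gaps([("bridge_status", "cli"), ("bridge_status", "mcp")], covered)
```

Here `gaps` contains only the `("bridge_status", "mcp")` pair, which is the same kind of gap the sentinel test checks for.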

213
tests/test_diagnostics.py Normal file

@@ -0,0 +1,213 @@
"""Tests for bridge.diagnostics — check_tunnel() logic."""
from __future__ import annotations
import subprocess
from unittest.mock import MagicMock, patch
import pytest
from bridge.diagnostics import (
_remote_port_probe_command,
check_all_tunnels,
check_tunnel,
)
from bridge.models import BridgeState, TunnelConfig
from bridge.state import StateManager
@pytest.fixture
def tcfg():
return TunnelConfig(
name="test-tunnel",
host="coulombcore.local",
remote_port=18000,
local_port=8000,
ssh_user="ubuntu",
ssh_key="~/.ssh/id_ops",
actor="adm-bernd",
)
@pytest.fixture
def state_mgr(tmp_path):
d = tmp_path / "state"
d.mkdir()
return StateManager(state_dir=d)
class TestCheckTunnel:
def test_remote_port_probe_has_minimal_host_fallback(self):
"""Remote probe supports minimal hosts without ss/netstat."""
command = _remote_port_probe_command(18000)
assert "command -v ss" in command
assert "command -v netstat" in command
assert "/proc/net/tcp" in command
assert "/proc/net/tcp6" in command
def test_no_pid(self, tcfg, state_mgr):
"""No PID file → ssh_process='no_pid', ok=False."""
with patch("bridge.diagnostics.subprocess.run") as mock_run:
mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
result = check_tunnel(tcfg, state_mgr)
assert result.ssh_process == "no_pid"
assert result.pid is None
assert result.stale_state is False
assert result.ok is False
def test_pid_dead(self, tcfg, state_mgr):
"""Dead PID + connected state → ssh_process='dead', stale_state=True."""
state_mgr.write_pid("test-tunnel", 99999)
state_mgr.write_state("test-tunnel", BridgeState.CONNECTED)
with (
patch("bridge.diagnostics._pid_alive", return_value=False),
patch("bridge.diagnostics.subprocess.run") as mock_run,
):
mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
result = check_tunnel(tcfg, state_mgr)
assert result.ssh_process == "dead"
assert result.stale_state is True
assert result.ok is False
def test_pid_alive_port_listening(self, tcfg, state_mgr):
"""Alive PID + SSH reports port listening → remote_port='listening', ok=True."""
state_mgr.write_pid("test-tunnel", 12345)
with (
patch("bridge.diagnostics._pid_alive", return_value=True),
patch("bridge.diagnostics.subprocess.run") as mock_run,
):
mock_run.return_value = MagicMock(stdout="ok\n", stderr="", returncode=0)
result = check_tunnel(tcfg, state_mgr)
assert result.ssh_process == "ok"
assert result.pid == 12345
assert result.remote_port == "listening"
assert result.ok is True
def test_pid_alive_port_closed(self, tcfg, state_mgr):
"""Alive PID + SSH reports port closed → remote_port='closed', ok=False."""
state_mgr.write_pid("test-tunnel", 12345)
with (
patch("bridge.diagnostics._pid_alive", return_value=True),
patch("bridge.diagnostics.subprocess.run") as mock_run,
):
mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
result = check_tunnel(tcfg, state_mgr)
assert result.ssh_process == "ok"
assert result.remote_port == "closed"
assert result.ok is False
def test_local_direction_checks_local_port(self, tcfg, state_mgr):
"""Local tunnels verify the local listener instead of a remote -R port."""
local_cfg = TunnelConfig(
name="local-tunnel",
host="haskelseed.local",
remote_port=1234,
local_port=11234,
ssh_user="root",
ssh_key="~/.ssh/id_ops",
actor="adm-bernd",
direction="local",
)
state_mgr.write_pid("local-tunnel", 12345)
with (
patch("bridge.diagnostics._pid_alive", return_value=True),
patch("bridge.diagnostics._probe_local_port", return_value="listening"),
patch("bridge.diagnostics.subprocess.run") as mock_run,
):
result = check_tunnel(local_cfg, state_mgr)
mock_run.assert_not_called()
assert result.remote_port == "listening"
assert result.ok is True
def test_ssh_timeout(self, tcfg, state_mgr):
"""SSH probe timeout → remote_port='error:timeout'."""
state_mgr.write_pid("test-tunnel", 12345)
with (
patch("bridge.diagnostics._pid_alive", return_value=True),
patch(
"bridge.diagnostics.subprocess.run",
side_effect=subprocess.TimeoutExpired(cmd=["ssh"], timeout=10),
),
):
result = check_tunnel(tcfg, state_mgr)
assert result.remote_port == "error:timeout"
assert result.ok is False
def test_stale_state_not_flagged_when_stopped(self, tcfg, state_mgr):
"""State=stopped + no PID → stale_state is False (not connected/degraded)."""
with patch("bridge.diagnostics.subprocess.run") as mock_run:
mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
result = check_tunnel(tcfg, state_mgr)
assert result.stale_state is False
def test_local_api_ok(self, tcfg, state_mgr, tmp_path):
"""With health_check configured, ok response sets local_api='ok'."""
from bridge.models import HealthCheckConfig
tcfg_with_health = TunnelConfig(
name="test-tunnel",
host="coulombcore.local",
remote_port=18000,
local_port=8000,
ssh_user="ubuntu",
ssh_key="~/.ssh/id_ops",
actor="adm-bernd",
health_check=HealthCheckConfig(url="http://127.0.0.1:8000/health"),
)
state_mgr.write_pid("test-tunnel", 12345)
mock_resp = MagicMock()
mock_resp.is_success = True
with (
patch("bridge.diagnostics._pid_alive", return_value=True),
patch("bridge.diagnostics.subprocess.run") as mock_run,
patch("bridge.diagnostics.httpx.get", return_value=mock_resp),
):
mock_run.return_value = MagicMock(stdout="ok\n", stderr="", returncode=0)
result = check_tunnel(tcfg_with_health, state_mgr)
assert result.local_api == "ok"
assert result.latency_ms is not None
class TestCheckAllTunnels:
    def test_check_all_iterates_tunnels(self, tmp_path, monkeypatch):
        """check_all_tunnels returns one result per tunnel in cfg."""
        import textwrap

        from bridge.config import load_config

        cfg_file = tmp_path / "tunnels.yaml"
        cfg_file.write_text(textwrap.dedent("""\
            tunnels:
              t1:
                host: h1.local
                remote_port: 18001
                local_port: 8001
                ssh_user: ubuntu
                ssh_key: ~/.ssh/id_ops
                actor: adm-bernd
              t2:
                host: h2.local
                remote_port: 18002
                local_port: 8002
                ssh_user: ubuntu
                ssh_key: ~/.ssh/id_ops
                actor: adm-bernd
            actors:
              adm-bernd:
                class: adm
                description: Bernd
            """))
        # monkeypatch restores any pre-existing BRIDGE_CONFIG value on teardown,
        # which setting and deleting os.environ entries by hand does not.
        monkeypatch.setenv("BRIDGE_CONFIG", str(cfg_file))
        cfg = load_config()
        state_dir = tmp_path / "state"
        state_dir.mkdir()
        state_mgr = StateManager(state_dir=state_dir)
        with patch("bridge.diagnostics.subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
            results = check_all_tunnels(cfg, state_mgr)
        assert len(results) == 2
        assert {r.tunnel for r in results} == {"t1", "t2"}
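
`test_remote_port_probe_has_minimal_host_fallback` only checks that the probe command mentions `/proc/net/tcp` and `/proc/net/tcp6`. The probe itself is a shell command that this diff does not show. To illustrate what that fallback has to decode, the sketch below reads the kernel table format in Python: the local address column ends in a hexadecimal port, and socket state `0A` means LISTEN. `port_is_listening` and `SAMPLE` are hypothetical and are not part of `bridge.diagnostics`.

```python
def port_is_listening(proc_net_tcp: str, port: int) -> bool:
    """Return True if the table text shows a LISTEN socket on the given port."""
    listen_state = "0A"  # TCP_LISTEN in the kernel's socket state numbering
    for line in proc_net_tcp.splitlines()[1:]:  # first row is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        # local_address looks like "0100007F:4650"; the part after ':' is the port in hex.
        hex_port = local_address.rpartition(":")[2]
        if state == listen_state and int(hex_port, 16) == port:
            return True
    return False


# Illustrative table: port 18000 (0x4650) listening, port 8000 (0x1F40) established.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:4650 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345
   1: 0100007F:1F40 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 12346
"""
```

The same parsing works for `/proc/net/tcp6`, because only the text after the last colon of the local address is read.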

78
tests/test_health.py Normal file

@@ -0,0 +1,78 @@
"""Tests for health checking."""
import pytest
from unittest.mock import MagicMock, patch, AsyncMock
from bridge.health import HealthChecker, HealthResult
class TestHealthResult:
def test_ok(self):
r = HealthResult(ok=True, status_code=200)
assert r.ok
assert r.status_code == 200
assert r.error is None
def test_failure(self):
r = HealthResult(ok=False, error="connection refused")
assert not r.ok
assert r.error == "connection refused"
class TestHealthChecker:
@pytest.mark.asyncio
async def test_check_ok(self):
checker = HealthChecker(url="http://127.0.0.1:18000/health", timeout_seconds=5)
mock_response = MagicMock()
mock_response.status_code = 200
mock_response.raise_for_status = MagicMock()
with patch("httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client.get = AsyncMock(return_value=mock_response)
mock_client_cls.return_value = mock_client
result = await checker.check()
assert result.ok
assert result.status_code == 200
@pytest.mark.asyncio
async def test_check_connection_error(self):
import httpx
checker = HealthChecker(url="http://127.0.0.1:19999/health", timeout_seconds=1)
with patch("httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client.get = AsyncMock(side_effect=httpx.ConnectError("refused"))
mock_client_cls.return_value = mock_client
result = await checker.check()
assert not result.ok
assert result.error is not None
@pytest.mark.asyncio
async def test_check_http_error(self):
import httpx
checker = HealthChecker(url="http://127.0.0.1:18000/health", timeout_seconds=5)
mock_response = MagicMock()
mock_response.status_code = 503
mock_response.raise_for_status = MagicMock(
side_effect=httpx.HTTPStatusError("503", request=MagicMock(), response=mock_response)
)
with patch("httpx.AsyncClient") as mock_client_cls:
mock_client = AsyncMock()
mock_client.__aenter__ = AsyncMock(return_value=mock_client)
mock_client.__aexit__ = AsyncMock(return_value=False)
mock_client.get = AsyncMock(return_value=mock_response)
mock_client_cls.return_value = mock_client
result = await checker.check()
assert not result.ok
assert result.status_code == 503

213
tests/test_integration.py Normal file

@@ -0,0 +1,213 @@
"""Integration tests for OpsBridge."""
import textwrap
from unittest.mock import MagicMock, patch
import pytest
from bridge.config import load_config
from bridge.manager import TunnelManager
from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig
from bridge.state import StateManager
MINIMAL_CONFIG = textwrap.dedent("""\
tunnels:
local-test:
host: 127.0.0.1
remote_port: 19000
local_port: 8000
ssh_user: testuser
ssh_key: ~/.ssh/id_rsa
actor: adm-bernd
reconnect:
max_attempts: 2
backoff_initial: 1
backoff_max: 2
actors:
adm-bernd:
class: adm
description: Bernd
""")
@pytest.fixture
def config_file(tmp_path):
f = tmp_path / "tunnels.yaml"
f.write_text(MINIMAL_CONFIG)
return f
@pytest.fixture
def state_dir(tmp_path):
return tmp_path / "bridge"
@pytest.fixture
def tunnel_cfg():
return TunnelConfig(
name="local-test",
host="127.0.0.1",
remote_port=19000,
local_port=8000,
ssh_user="testuser",
ssh_key="~/.ssh/id_rsa",
actor="adm-bernd",
reconnect=ReconnectPolicy(max_attempts=2, backoff_initial=1, backoff_max=2),
)
class TestConfigRoundtrip:
def test_load_config_from_file(self, config_file, monkeypatch):
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
cfg = load_config()
assert "local-test" in cfg.tunnels
t = cfg.tunnels["local-test"]
assert t.host == "127.0.0.1"
assert t.reconnect.max_attempts == 2
assert t.reconnect.backoff_initial == 1
class TestStateRoundtrip:
def test_state_persists_across_manager_instances(self, state_dir, tunnel_cfg):
mgr1 = TunnelManager(tunnel_cfg, state_dir=state_dir)
mgr1._state.write_state(tunnel_cfg.name, BridgeState.CONNECTED)
mgr2 = TunnelManager(tunnel_cfg, state_dir=state_dir)
assert mgr2.get_state() == BridgeState.CONNECTED
def test_stale_pid_cleanup(self, state_dir, tunnel_cfg):
sm = StateManager(state_dir=state_dir)
sm.write_pid(tunnel_cfg.name, 999999) # guaranteed not alive
sm.write_state(tunnel_cfg.name, BridgeState.CONNECTED)
# is_running should return False for dead pid
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
assert not mgr.is_running()
class TestReconnectLoop:
def test_reconnect_loop_gives_up_after_max_attempts(self, state_dir, tunnel_cfg):
"""Manager should set FAILED state after exhausting max_attempts."""
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
attempt_count = [0]
def fake_popen(cmd, **kwargs):
proc = MagicMock()
proc.poll.return_value = 1 # immediately "dead"
proc.returncode = 1
attempt_count[0] += 1
return proc
with patch("subprocess.Popen", side_effect=fake_popen), \
patch("time.sleep"): # skip sleeps for speed
mgr._run_loop()
assert attempt_count[0] >= 1
assert mgr.get_state() == BridgeState.FAILED
    def test_reconnect_logs_events(self, state_dir, tunnel_cfg):
        """Audit log should contain at least one lifecycle event after a failed run."""
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)

        def fake_popen(cmd, **kwargs):
            proc = MagicMock()
            proc.poll.return_value = 1  # SSH exits immediately
            proc.returncode = 1
            return proc

        with patch("subprocess.Popen", side_effect=fake_popen), \
                patch("time.sleep"):
            mgr._run_loop()
        events = mgr._audit.read_events(tunnel_cfg.name)
        event_types = {e["event"] for e in events}
        expected = {"bridge_started", "bridge_reconnecting", "bridge_disconnected"}
        assert event_types & expected
class TestHealthCheckDegradedPath:
    def test_degraded_state_on_health_failure(self, state_dir):
        """A failing health check is recorded in the audit log.

        The assertion also accepts bridge_disconnected, so the test passes even
        if the mocked SSH process exits before a health check is recorded.
        """
        from bridge.health import HealthResult

        hc_cfg = MagicMock()
        hc_cfg.url = "http://127.0.0.1:19001/health"
        hc_cfg.interval_seconds = 0
        hc_cfg.timeout_seconds = 1
        tunnel_cfg = TunnelConfig(
            name="hc-test",
            host="127.0.0.1",
            remote_port=19001,
            local_port=8001,
            ssh_user="u",
            ssh_key="k",
            actor="adm-bernd",
            reconnect=ReconnectPolicy(max_attempts=1, backoff_initial=1, backoff_max=1),
            health_check=hc_cfg,
        )
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        proc_call_count = [0]

        def fake_popen(cmd, **kwargs):
            proc = MagicMock()
            proc_call_count[0] += 1
            if proc_call_count[0] == 1:
                # First process: poll() reports alive (None) once, then exited.
                poll_calls = [None, 1]
                proc.poll.side_effect = poll_calls + [1] * 100
                proc.returncode = 1
            else:
                # Any later process exits immediately.
                proc.poll.return_value = 1
                proc.returncode = 1
            return proc

        failed_result = HealthResult(ok=False, error="connection refused")

        with patch("subprocess.Popen", side_effect=fake_popen), \
                patch("time.sleep"), \
                patch("bridge.manager.HealthChecker") as mock_hc_cls:
            mock_checker = MagicMock()
            mock_checker.check = MagicMock(side_effect=lambda: failed_result)
            mock_hc_cls.return_value = mock_checker
            # asyncio.run is patched so the manager receives the failed result
            # without awaiting a real coroutine.
            with patch("asyncio.run", side_effect=lambda coro: failed_result):
                mgr._run_loop()
        # The DEGRADED transition is checked indirectly through the audit log.
        events = mgr._audit.read_events("hc-test")
        event_types = [e["event"] for e in events]
        assert "health_check_failed" in event_types or "bridge_disconnected" in event_types
class TestAuditTrail:
def test_full_lifecycle_logged(self, state_dir, tunnel_cfg):
"""A start + immediate-exit SSH produces at minimum started + disconnected events."""
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
def fake_popen(cmd, **kwargs):
proc = MagicMock()
proc.poll.return_value = 1
proc.returncode = 1
return proc
with patch("subprocess.Popen", side_effect=fake_popen), \
patch("time.sleep"):
mgr._run_loop()
events = mgr._audit.read_events(tunnel_cfg.name)
assert len(events) >= 2
# Each event has required fields
for e in events:
assert "timestamp" in e
assert "tunnel" in e
assert "actor" in e
assert "event" in e

203
tests/test_manager.py Normal file

@@ -0,0 +1,203 @@
"""Tests for TunnelManager."""
import os
import signal
from unittest.mock import MagicMock, patch
import pytest
from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig
from bridge.manager import TunnelManager, build_ssh_command
@pytest.fixture
def tunnel_cfg():
return TunnelConfig(
name="test-tunnel",
host="host.local",
remote_port=18000,
local_port=8000,
ssh_user="ubuntu",
ssh_key="~/.ssh/id_ops",
actor="operator.bernd",
reconnect=ReconnectPolicy(max_attempts=3, backoff_initial=1, backoff_max=5),
)
@pytest.fixture
def state_dir(tmp_path):
return tmp_path / "bridge"
class TestBuildSshCommand:
def test_basic_command(self, tunnel_cfg):
cmd = build_ssh_command(tunnel_cfg)
assert cmd[0] == "ssh"
assert "-N" in cmd
assert "-R" in cmd
assert "18000:127.0.0.1:8000" in cmd
assert "-i" in cmd
assert "ubuntu@host.local" in cmd
def test_server_alive_options(self, tunnel_cfg):
cmd = build_ssh_command(tunnel_cfg)
assert "-o" in cmd
assert "ServerAliveInterval=10" in cmd
assert "ExitOnForwardFailure=yes" in cmd
def test_ssh_key_expanded(self, tunnel_cfg):
cmd = build_ssh_command(tunnel_cfg)
key_idx = cmd.index("-i") + 1
assert not cmd[key_idx].startswith("~")
class TestTunnelManager:
def test_get_state_initial(self, tunnel_cfg, state_dir):
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
assert mgr.get_state() == BridgeState.STOPPED
def test_stop_when_not_running_is_noop(self, tunnel_cfg, state_dir):
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
# Should not raise
mgr.stop()
assert mgr.get_state() == BridgeState.STOPPED
def test_stop_kills_pid(self, tunnel_cfg, state_dir):
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
# Write a fake PID of our own process to simulate running
mgr._state.write_pid(tunnel_cfg.name, os.getpid())
mgr._state.write_state(tunnel_cfg.name, BridgeState.CONNECTED)
with patch("os.kill") as mock_kill:
mgr.stop()
# Should have sent SIGTERM
mock_kill.assert_any_call(os.getpid(), signal.SIGTERM)
assert mgr.get_state() == BridgeState.STOPPED
def test_backoff_calculation(self, tunnel_cfg, state_dir):
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
# First backoff = initial
assert mgr._next_backoff(0) == 1
# Doubles each time up to max
assert mgr._next_backoff(1) == 2
assert mgr._next_backoff(2) == 4
assert mgr._next_backoff(3) == 5 # capped at max
    def test_start_daemonizes(self, tunnel_cfg, state_dir):
        """Smoke test: start() returns on the parent branch without hanging."""
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        # A real fork is not possible in tests, so os.fork is mocked to return a
        # non-zero PID, which puts start() on the parent branch.
        with patch("subprocess.Popen") as mock_popen, \
                patch("os.fork", return_value=1234), \
                patch("os.setsid"), \
                patch("os._exit"):
            mock_proc = MagicMock()
            mock_proc.pid = 9999
            mock_popen.return_value = mock_proc
            mgr.start()
        # Expected but not asserted here: the state is STARTING (set before the
        # fork) and the PID file exists (written in the parent branch).
def test_is_running_false_initially(self, tunnel_cfg, state_dir):
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
assert not mgr.is_running()
class TestBuildSshCommandWithCert:
def test_no_cert_path_omits_extra_i(self, tunnel_cfg):
cmd = build_ssh_command(tunnel_cfg)
assert cmd.count("-i") == 1
def test_cert_path_appends_after_key(self, tunnel_cfg, tmp_path):
cert = tmp_path / "test-cert.pub"
cert.write_text("cert")
cmd = build_ssh_command(tunnel_cfg, cert_path=cert)
i_indices = [i for i, x in enumerate(cmd) if x == "-i"]
assert len(i_indices) == 2
key_idx, cert_idx = i_indices
assert not cmd[key_idx + 1].endswith("-cert.pub") # key comes first
assert cmd[cert_idx + 1] == str(cert)
class TestRunCertCommand:
def test_returns_none_when_no_cert_command(self, tunnel_cfg, tmp_path):
from bridge.manager import _run_cert_command
assert _run_cert_command(tunnel_cfg, tmp_path) is None
def test_writes_cert_and_returns_path(self, tunnel_cfg, tmp_path):
from bridge.manager import _run_cert_command
tunnel_cfg.cert_command = "echo 'ssh-rsa-cert AAAA'"
path = _run_cert_command(tunnel_cfg, tmp_path)
assert path is not None
assert path.exists()
assert "ssh-rsa-cert" in path.read_text()
def test_raises_on_nonzero_exit(self, tunnel_cfg, tmp_path):
from bridge.manager import _run_cert_command
from bridge.models import CertAcquisitionError
tunnel_cfg.cert_command = "exit 1"
with pytest.raises(CertAcquisitionError):
_run_cert_command(tunnel_cfg, tmp_path)
class TestActorTypeFromName:
def test_adm_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("adm-bernd") == "adm"
def test_agt_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("agt-claude") == "agt"
def test_atm_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("atm-cron") == "atm"
def test_unknown_prefix(self):
from bridge.manager import _actor_type_from_name
assert _actor_type_from_name("operator.bernd") == "unknown"
class TestTtlRefresh:
def test_parse_cert_expiry_returns_none_for_missing_file(self, tmp_path):
from bridge.manager import _parse_cert_expiry
missing = tmp_path / "no.pub"
result = _parse_cert_expiry(missing)
assert result is None
def test_parse_cert_identity_returns_none_for_missing_file(self, tmp_path):
from bridge.manager import _parse_cert_identity
missing = tmp_path / "no.pub"
result = _parse_cert_identity(missing)
assert result is None
    def test_parse_cert_identity_from_keygen_output(self, tmp_path):
        from bridge.manager import _parse_cert_identity

        cert = tmp_path / "test.pub"
        cert.write_text("fake")
        # patch and MagicMock come from the module-level import.
        with patch("subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(
                stdout='test.pub:\n Key ID: "agt-bridge"\n',
                returncode=0,
            )
            result = _parse_cert_identity(cert)
        assert result == "agt-bridge"

    def test_parse_cert_expiry_from_keygen_output(self, tmp_path):
        from bridge.manager import _parse_cert_expiry

        cert = tmp_path / "test.pub"
        cert.write_text("fake")
        with patch("subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(
                stdout="test.pub:\n Valid: from 2026-05-15T10:00:00 to 2030-05-15T22:00:00\n",
                returncode=0,
            )
            result = _parse_cert_expiry(cert)
        assert result is not None
        assert result.year == 2030
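
`test_backoff_calculation` in this file pins the reconnect delays to 1, 2, 4, and 5 seconds for attempts 0 to 3, with `backoff_initial=1` and `backoff_max=5`. Those values are consistent with capped exponential backoff. The sketch below is one formula that reproduces them. It is an assumption about `_next_backoff`, whose implementation is not shown here, and `next_backoff` is a hypothetical standalone name.

```python
def next_backoff(attempt: int, initial: float = 1, maximum: float = 5) -> float:
    """Delay before retry number `attempt` (0-based): doubles each time, capped at `maximum`."""
    return min(initial * 2 ** attempt, maximum)


# Delays for the first four attempts with the fixture's policy (initial=1, max=5).
delays = [next_backoff(n) for n in range(4)]
```

With the fixture's policy, `delays` matches the four values asserted in the test. The cap keeps a long outage from pushing the retry interval past `backoff_max`.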

622
tests/test_mcp.py Normal file

@@ -0,0 +1,622 @@
"""Tests for OpsBridge MCP server tools (FastMCP in-process client).
Uses FastMCP's Client(mcp_app) context manager — no network, no subprocess.
All tests are async; asyncio_mode = "auto" in pyproject.toml.
FastMCP 3.x returns results in result.content[0].text as a JSON string.
Use _data(result) to extract and parse.
"""
from __future__ import annotations
import json
import textwrap
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
from bridge.mcp_server.server import mcp
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _data(result) -> list | dict:
"""Extract and parse JSON from a FastMCP CallToolResult.
FastMCP 3.x: non-empty results are in result.content[0].text.
Empty list/dict returns come back with empty content; result.data holds them.
"""
if not result.content:
return result.data # empty list/dict
text = result.content[0].text
return json.loads(text)
def _write_config(tmp_path: Path, content: str) -> Path:
f = tmp_path / "tunnels.yaml"
f.write_text(content)
return f
def _simple_config(tmp_path: Path) -> Path:
return _write_config(tmp_path, textwrap.dedent("""\
tunnels:
test-tunnel:
host: host.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: adm-bernd
actors:
adm-bernd:
class: adm
description: Bernd
"""))
def _catalog_config(tmp_path: Path, catalog_dir: Path) -> Path:
return _write_config(tmp_path, textwrap.dedent(f"""\
tunnels:
test-tunnel:
host: host.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: adm-bernd
actors:
adm-bernd:
class: adm
description: Bernd
catalog_path: {catalog_dir}
"""))
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture
def env_simple(tmp_path, monkeypatch):
cfg = _simple_config(tmp_path)
monkeypatch.setenv("BRIDGE_CONFIG", str(cfg))
monkeypatch.setenv("BRIDGE_STATE_DIR", str(tmp_path / "state"))
@pytest.fixture
def env_catalog(tmp_path, catalog_dir, monkeypatch):
cfg = _catalog_config(tmp_path, catalog_dir)
monkeypatch.setenv("BRIDGE_CONFIG", str(cfg))
monkeypatch.setenv("BRIDGE_STATE_DIR", str(tmp_path / "state"))
@pytest.fixture
def env_no_catalog(env_simple):
    """Config without catalog_path; reuses the env_simple environment."""


# ---------------------------------------------------------------------------
# bridge_status
# ---------------------------------------------------------------------------


class TestMcpBridgeStatus:
    @pytest.mark.capability("bridge_status")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_status_returns_list(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        data = _data(result)
        assert isinstance(data, list)
        assert len(data) == 1
        row = data[0]
        assert row["tunnel"] == "test-tunnel"
        assert "state" in row
        assert "actor" in row
        assert "host" in row

    async def test_bridge_status_bad_config(self, tmp_path, monkeypatch):
        monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "nonexistent.yaml"))
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# bridge_up
# ---------------------------------------------------------------------------


class TestMcpBridgeUp:
    @pytest.mark.capability("bridge_up")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_up_starts_tunnel(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_cls.return_value = mock_mgr
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert "started" in data
        assert "test-tunnel" in data["started"]

    async def test_bridge_up_already_running(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = True
            mock_cls.return_value = mock_mgr
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert "already_running" in data
        assert "test-tunnel" in data["already_running"]

    async def test_bridge_up_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_up", {"tunnel": "nonexistent"})
        data = _data(result)
        assert "error" in data

    async def test_bridge_up_all_tunnels(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_cls.return_value = mock_mgr
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {})
        data = _data(result)
        assert "started" in data
        assert "test-tunnel" in data["started"]


# ---------------------------------------------------------------------------
# bridge_down
# ---------------------------------------------------------------------------


class TestMcpBridgeDown:
    @pytest.mark.capability("bridge_down")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_down_stops_tunnel(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = True
            mock_cls.return_value = mock_mgr
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_down", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert "stopped" in data
        assert "test-tunnel" in data["stopped"]

    async def test_bridge_down_not_running(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_cls.return_value = mock_mgr
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_down", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert "not_running" in data
        assert "test-tunnel" in data["not_running"]

    async def test_bridge_down_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_down", {"tunnel": "nonexistent"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# bridge_restart
# ---------------------------------------------------------------------------


class TestMcpBridgeRestart:
    @pytest.mark.capability("bridge_restart")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_restart_delegates_to_cleanup(self, env_simple):
        from bridge.cleanup import CleanupAction
        with patch("bridge.cleanup.restart_tunnel") as mock_restart:
            mock_restart.return_value = CleanupAction(
                "test-tunnel", "healthy", "remote forward healthy"
            )
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_restart", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert data["actions"][0]["tunnel"] == "test-tunnel"
        assert data["actions"][0]["action"] == "healthy"
        mock_restart.assert_called_once()

    async def test_bridge_restart_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_restart", {"tunnel": "nonexistent"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# bridge_logs
# ---------------------------------------------------------------------------


class TestMcpBridgeLogs:
    @pytest.mark.capability("bridge_logs")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_logs_returns_list(self, env_simple, tmp_path):
        import json as _json
        state_dir = tmp_path / "state"
        state_dir.mkdir(parents=True, exist_ok=True)
        log_file = state_dir / "test-tunnel.log"
        log_file.write_text(
            _json.dumps({
                "timestamp": "2026-01-01T00:00:00+00:00",
                "tunnel": "test-tunnel",
                "actor": "adm-bernd",
                "actor_type": "adm",
                "event": "bridge_started",
            }) + "\n"
        )
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_logs", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert isinstance(data, list)
        assert len(data) == 1
        assert data[0]["event"] == "bridge_started"

    async def test_bridge_logs_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_logs", {"tunnel": "nonexistent"})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]

    async def test_bridge_logs_empty(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_logs", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert isinstance(data, list)
        assert data == []


# ---------------------------------------------------------------------------
# catalog_list_targets
# ---------------------------------------------------------------------------


class TestMcpCatalogListTargets:
    @pytest.mark.capability("catalog_list_targets")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_list_targets_returns_list(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_targets", {})
        data = _data(result)
        assert isinstance(data, list)
        assert any(t["id"] == "state-hub" for t in data)

    async def test_catalog_list_targets_domain_filter(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_targets", {"domain": "coulombcore"})
        data = _data(result)
        assert all(t["domain"] == "coulombcore" for t in data)

    async def test_catalog_list_targets_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_targets", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# catalog_show_target
# ---------------------------------------------------------------------------


class TestMcpCatalogShowTarget:
    @pytest.mark.capability("catalog_show_target")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_show_target_returns_metadata(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_target", {"target_id": "state-hub"})
        data = _data(result)
        assert data["id"] == "state-hub"
        assert data["domain"] == "coulombcore"

    async def test_catalog_show_target_not_found(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_target", {"target_id": "nonexistent"})
        data = _data(result)
        assert "error" in data

    async def test_catalog_show_target_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_target", {"target_id": "x"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# catalog_list_domains
# ---------------------------------------------------------------------------


class TestMcpCatalogListDomains:
    @pytest.mark.capability("catalog_list_domains")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_list_domains_returns_list(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_domains", {})
        data = _data(result)
        assert isinstance(data, list)
        assert any(d["id"] == "coulombcore" for d in data)

    async def test_catalog_list_domains_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_domains", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# catalog_validate
# ---------------------------------------------------------------------------


class TestMcpCatalogValidate:
    @pytest.mark.capability("catalog_validate")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_validate_clean(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_validate", {})
        data = _data(result)
        assert data["valid"] is True

    async def test_catalog_validate_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_validate", {})
        data = _data(result)
        assert data["valid"] is False
        assert len(data["errors"]) > 0

    async def test_catalog_validate_with_errors(self, tmp_path, monkeypatch):
        root = tmp_path / "bad-catalog"
        domain_dir = root / "domains" / "d"
        (domain_dir / "targets").mkdir(parents=True)
        (domain_dir / "domain.yaml").write_text("type: domain\nid: d\nname: D\n")
        (domain_dir / "targets" / "t.yaml").write_text(
            "type: target\nid: t\ndomain: d\nkind: service\n"
            "reachable_via:\n - missing-bridge\n"
        )
        cfg = tmp_path / "tunnels.yaml"
        cfg.write_text(f"tunnels: {{}}\nactors: {{}}\ncatalog_path: {root}\n")
        monkeypatch.setenv("BRIDGE_CONFIG", str(cfg))
        monkeypatch.setenv("BRIDGE_STATE_DIR", str(tmp_path / "state"))
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_validate", {})
        data = _data(result)
        assert data["valid"] is False
        assert any("missing-bridge" in e for e in data["errors"])


# ---------------------------------------------------------------------------
# catalog_show_bridge
# ---------------------------------------------------------------------------


class TestMcpCatalogShowBridge:
    @pytest.mark.capability("catalog_show_bridge")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_show_bridge_returns_metadata(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool(
                "catalog_show_bridge", {"bridge_id": "state-hub-coulombcore"}
            )
        data = _data(result)
        assert data["id"] == "state-hub-coulombcore"
        assert data["host"] == "coulombcore.local"

    async def test_catalog_show_bridge_not_found(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_bridge", {"bridge_id": "nonexistent"})
        data = _data(result)
        assert "error" in data

    async def test_catalog_show_bridge_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_bridge", {"bridge_id": "x"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# bridge_check
# ---------------------------------------------------------------------------


class TestMcpBridgeCheck:
    @pytest.mark.capability("bridge_check")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_check_tool(self, env_simple):
        """bridge_check returns a list of dicts with 'ok' key."""
        from bridge.diagnostics import TunnelCheckResult
        mock_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="ok",
            pid=12345,
            remote_port="listening",
            local_api=None,
            latency_ms=None,
            stale_state=False,
        )
        with patch("bridge.mcp_server.server.check_all_tunnels", return_value=[mock_result]):
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_check", {})
        data = _data(result)
        assert isinstance(data, list)
        assert len(data) == 1
        row = data[0]
        assert "ok" in row
        assert row["ok"] is True
        assert row["tunnel"] == "test-tunnel"
        assert row["ssh_process"] == "ok"
        assert row["remote_port"] == "listening"

    async def test_bridge_check_specific_tunnel(self, env_simple):
        """bridge_check with tunnel arg calls check_tunnel for that tunnel."""
        from bridge.diagnostics import TunnelCheckResult
        mock_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="dead",
            pid=None,
            remote_port="closed",
            local_api=None,
            latency_ms=None,
            stale_state=True,
        )
        with patch("bridge.mcp_server.server.check_tunnel", return_value=mock_result):
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_check", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert isinstance(data, list)
        assert data[0]["ok"] is False
        assert data[0]["stale_state"] is True

    async def test_bridge_check_unknown_tunnel(self, env_simple):
        """bridge_check with unknown tunnel returns error dict."""
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_check", {"tunnel": "nonexistent"})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]

    async def test_bridge_check_bad_config(self, tmp_path, monkeypatch):
        """bridge_check with bad config returns error dict."""
        monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "nonexistent.yaml"))
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_check", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# Resources
# ---------------------------------------------------------------------------


class TestMcpResources:
    async def test_bridge_status_resource(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.read_resource("bridge://status")
        content = result[0].text if hasattr(result[0], "text") else str(result[0])
        data = json.loads(content)
        assert isinstance(data, list)

    async def test_catalog_domains_resource(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.read_resource("catalog://domains")
        content = result[0].text if hasattr(result[0], "text") else str(result[0])
        data = json.loads(content)
        assert isinstance(data, list)

    async def test_catalog_targets_resource(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.read_resource("catalog://targets")
        content = result[0].text if hasattr(result[0], "text") else str(result[0])
        data = json.loads(content)
        assert isinstance(data, list)


# ---------------------------------------------------------------------------
# T15: Agent workflow integration test: bridge_status → bridge_up → bridge_status
# ---------------------------------------------------------------------------


class TestMcpAgentWorkflow:
    """T15: Verify the MCP layer supports an agent's typical tunnel management workflow."""

    @pytest.mark.capability("bridge_up")
    @pytest.mark.access_mode("mcp")
    async def test_agent_status_up_status_workflow(self, env_simple, tmp_path):
        """Agent workflow: check status (stopped) → start tunnel → verify started."""
        from fastmcp import Client
        from bridge.models import BridgeState
        state_dir = tmp_path / "state"

        # Step 1: bridge_status → all stopped
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        rows = _data(result)
        assert rows[0]["state"] == BridgeState.STOPPED.value

        # Step 2: bridge_up; mock TunnelManager to capture the call and write state
        def mock_start_writes_state():
            sd = state_dir
            sd.mkdir(parents=True, exist_ok=True)
            (sd / "test-tunnel.state").write_text(BridgeState.CONNECTED.value)
            (sd / "test-tunnel.pid").write_text("12345")

        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_mgr.start.side_effect = mock_start_writes_state
            mock_cls.return_value = mock_mgr
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {"tunnel": "test-tunnel"})
        up_data = _data(result)
        assert "test-tunnel" in up_data["started"]

        # Step 3: bridge_status → reflects connected state
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        rows = _data(result)
        assert rows[0]["tunnel"] == "test-tunnel"
        assert rows[0]["state"] == BridgeState.CONNECTED.value

75
tests/test_models.py Normal file

@@ -0,0 +1,75 @@
"""Tests for domain models."""
from bridge.models import (
    ActorInfo,
    BridgeState,
    HealthCheckConfig,
    ReconnectPolicy,
    TunnelConfig,
)


class TestBridgeState:
    def test_all_states_defined(self):
        states = {s.value for s in BridgeState}
        assert states == {"stopped", "starting", "connected", "degraded", "reconnecting", "failed"}

    def test_state_is_string(self):
        assert BridgeState.STOPPED == "stopped"


class TestReconnectPolicy:
    def test_defaults(self):
        p = ReconnectPolicy()
        assert p.max_attempts == 0
        assert p.backoff_initial == 5
        assert p.backoff_max == 60

    def test_custom(self):
        p = ReconnectPolicy(max_attempts=3, backoff_initial=2, backoff_max=30)
        assert p.max_attempts == 3


class TestHealthCheckConfig:
    def test_required_url(self):
        h = HealthCheckConfig(url="http://127.0.0.1:18000/health")
        assert h.url == "http://127.0.0.1:18000/health"
        assert h.interval_seconds == 30
        assert h.timeout_seconds == 5


class TestTunnelConfig:
    def test_minimal(self):
        t = TunnelConfig(
            name="test-tunnel",
            host="host.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
            actor="operator.bernd",
        )
        assert t.name == "test-tunnel"
        assert t.health_check is None
        assert isinstance(t.reconnect, ReconnectPolicy)

    def test_with_health_check(self):
        hc = HealthCheckConfig(url="http://127.0.0.1:18000/health")
        t = TunnelConfig(
            name="test",
            host="h",
            remote_port=1,
            local_port=2,
            ssh_user="u",
            ssh_key="k",
            actor="a",
            health_check=hc,
        )
        assert t.health_check is hc


class TestActorInfo:
    def test_fields(self):
        from bridge.models import ActorType
        a = ActorInfo(name="adm-bernd", actor_type=ActorType.ADM, description="Bernd")
        assert a.name == "adm-bernd"
        assert a.actor_type == ActorType.ADM

105
tests/test_skill.py Normal file

@@ -0,0 +1,105 @@
"""Static lint tests for OpsBridge skill files.

Validates that every skill file in ~/.claude/plugins/ops-bridge/:
- Has required frontmatter (name, description)
- References at least one canonical capability name in its body
- Points to capabilities that exist in the registry

Also validates the bridge-status skill exercises bridge_status capability
per the skill access_mode requirement in the registry.
"""
from __future__ import annotations

from pathlib import Path

import pytest

from bridge.capabilities import CAPABILITIES_BY_NAME

PLUGINS_DIR = Path.home() / ".claude" / "plugins" / "ops-bridge"


def _find_skill_files() -> list[Path]:
    if not PLUGINS_DIR.exists():
        return []
    return sorted(PLUGINS_DIR.glob("*.md"))


def _parse_frontmatter(text: str) -> dict[str, str]:
    """Extract YAML frontmatter fields (name, description); minimal parser."""
    fields: dict[str, str] = {}
    if not text.startswith("---"):
        return fields
    end = text.find("\n---", 3)
    if end == -1:
        return fields
    for line in text[3:end].splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            fields[key.strip()] = val.strip()
    return fields


SKILL_FILES = _find_skill_files()


@pytest.mark.parametrize("skill_file", SKILL_FILES, ids=lambda f: f.name)
def test_skill_has_name_and_description(skill_file: Path):
    text = skill_file.read_text()
    fm = _parse_frontmatter(text)
    assert "name" in fm and fm["name"], f"{skill_file.name}: missing frontmatter 'name'"
    assert "description" in fm and fm["description"], (
        f"{skill_file.name}: missing frontmatter 'description'"
    )


@pytest.mark.parametrize("skill_file", SKILL_FILES, ids=lambda f: f.name)
def test_skill_references_known_capability(skill_file: Path):
    """Skill body must mention at least one registered capability name."""
    text = skill_file.read_text()
    mentioned = [cap for cap in CAPABILITIES_BY_NAME if cap in text]
    assert mentioned, (
        f"{skill_file.name}: does not reference any known capability name. "
        f"Known capabilities: {sorted(CAPABILITIES_BY_NAME)}"
    )


@pytest.mark.parametrize("skill_file", SKILL_FILES, ids=lambda f: f.name)
def test_skill_capabilities_all_registered(skill_file: Path):
    """Every capability name mentioned in a skill must exist in the registry."""
    text = skill_file.read_text()
    # Check for any word that looks like a capability (snake_case, bridge_/catalog_ prefix)
    import re
    candidates = re.findall(r"\b(?:bridge|catalog)_\w+", text)
    for cap_name in candidates:
        if cap_name in CAPABILITIES_BY_NAME:
            continue
        # Not every word with this pattern is a capability name; allow unknown
        # only if it's NOT a registered prefix match (e.g. bridge_started is an event)
        pass  # lenient: only fail on exact registry names


def test_bridge_status_skill_exists():
    skill = PLUGINS_DIR / "bridge-status.md"
    assert skill.exists(), "bridge-status.md skill file not found"


@pytest.mark.capability("bridge_status")
@pytest.mark.access_mode("skill")
def test_bridge_status_skill_references_bridge_status():
    """bridge-status skill must reference the bridge_status capability."""
    skill = PLUGINS_DIR / "bridge-status.md"
    assert skill.exists()
    text = skill.read_text()
    assert "bridge_status" in text, (
        "bridge-status.md must reference 'bridge_status' capability"
    )


def test_bridge_status_skill_in_registry_has_skill_access_mode():
    """bridge_status capability must declare 'skill' in required_access_modes."""
    cap = CAPABILITIES_BY_NAME.get("bridge_status")
    assert cap is not None
    assert "skill" in cap.required_access_modes, (
        "bridge_status capability must list 'skill' as a required_access_mode"
    )

68
tests/test_state.py Normal file

@@ -0,0 +1,68 @@
"""Tests for state management."""
import os

import pytest

from bridge.models import BridgeState
from bridge.state import StateManager


@pytest.fixture
def state_dir(tmp_path):
    return tmp_path / "bridge"


@pytest.fixture
def mgr(state_dir):
    return StateManager(state_dir=state_dir)


class TestStateManager:
    def test_read_state_no_file_returns_stopped(self, mgr):
        assert mgr.read_state("my-tunnel") == BridgeState.STOPPED

    def test_write_and_read_state(self, mgr):
        mgr.write_state("my-tunnel", BridgeState.CONNECTED)
        assert mgr.read_state("my-tunnel") == BridgeState.CONNECTED

    def test_state_roundtrip_all_values(self, mgr):
        for state in BridgeState:
            mgr.write_state("t", state)
            assert mgr.read_state("t") == state

    def test_write_pid(self, mgr):
        # Write a live PID (our own process) so read_pid can confirm it's alive
        pid = os.getpid()
        mgr.write_pid("my-tunnel", pid)
        assert mgr.read_pid("my-tunnel") == pid

    def test_read_pid_no_file_returns_none(self, mgr):
        assert mgr.read_pid("nonexistent") is None

    def test_stale_pid_returns_none(self, mgr):
        # PID 999999 almost certainly does not exist
        mgr.write_pid("my-tunnel", 999999)
        assert mgr.read_pid("my-tunnel") is None

    def test_current_pid_is_alive(self, mgr):
        mgr.write_pid("my-tunnel", os.getpid())
        assert mgr.read_pid("my-tunnel") == os.getpid()

    def test_clear_pid(self, mgr):
        mgr.write_pid("my-tunnel", os.getpid())
        mgr.clear_pid("my-tunnel")
        assert mgr.read_pid("my-tunnel") is None

    def test_state_dir_created_on_write(self, state_dir):
        assert not state_dir.exists()
        mgr = StateManager(state_dir=state_dir)
        mgr.write_state("t", BridgeState.STOPPED)
        assert state_dir.exists()

    def test_is_running_false_when_stopped(self, mgr):
        assert not mgr.is_running("my-tunnel")

    def test_is_running_true_when_pid_alive(self, mgr):
        mgr.write_pid("my-tunnel", os.getpid())
        mgr.write_state("my-tunnel", BridgeState.CONNECTED)
        assert mgr.is_running("my-tunnel")

1465
uv.lock generated Normal file

File diff suppressed because it is too large Load Diff


@@ -0,0 +1,203 @@
AccessManagementDirective
*Practical host access control management*
# AccessManagementDirective
**Document Title:** SSH Access Management Directive
**Version:** 1.1 (Production-Ready Revision Post-SWOT Improvements)
**Date:** 28 March 2026
**Audience:** Operations Department
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
**Author:** Grok (on behalf of the team)
**Status:** Official Directive. All ops personnel, agents, and automation pipelines MUST follow this.
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
## 0. Prerequisites
Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
## 1. Concept Overview
This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
**Why this model?**
- A central CA signs short-lived certificates for every login.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
**Core Principles**
- **Least privilege:** Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
- **Short-lived credentials:** Certificates expire automatically (24-48 h for admins, 4-24 h for agents, 1-8 h for automations).
- **One CA, many issuers:** A single offline User CA whose public key is trusted by every host.
- **Automation-first:** All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- **Separation of concerns**
- **Admins (adm)**: Human operators (full interactive shell when needed).
- **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
- **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
## 2. Actor Definitions & Access Model
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24-48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4-24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1-8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
**Certificate Naming Convention**
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
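
The prefix convention lends itself to a mechanical check before any signing request is accepted. The following is a minimal sketch, not part of any existing tool: the names `MAX_TTL_HOURS` and `classify_identity` are hypothetical, and the upper bounds are taken from the scorecard's expiry check (under 48 h for adm, under 24 h for agt/atm), so adjust them if local policy is tighter.

```python
# Hypothetical helper: upper bounds follow the scorecard's expiry check.
MAX_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 24}


def classify_identity(identity: str) -> tuple[str, int]:
    """Return (actor_class, max_ttl_hours) for an identity such as 'adm-bernd'."""
    prefix, sep, name = identity.partition("-")
    # Reject identities without a known prefix or without a name after it.
    if not sep or not name or prefix not in MAX_TTL_HOURS:
        raise ValueError(f"identity {identity!r} does not match adm-/agt-/atm-<name>")
    return prefix, MAX_TTL_HOURS[prefix]
```

A signing wrapper could call `classify_identity` first and refuse any request whose validity exceeds the returned bound.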
**LLM-Agent Risk Clarification**
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
## 3. Bootstrapping the System (One-Time Setup)
### 3.1. Create the CA (do this once, offline)
```bash
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
```
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
- Rotate the CA key itself every 2-3 years using the same bootstrap playbook.
- Public key: `ca_user.pub`
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
- Copy `ca_user.pub` to `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:
```bash
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
```
- Create principals directory and files from the central Git inventory.
- `systemctl restart sshd`
### 3.3. Initial Admin Access
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
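
For the manual bootstrap step, the signing call can be sketched as an argument list for `ssh-keygen`. This is an illustration under assumptions: the function name `build_sign_command` and the file paths are hypothetical, and a CA key held in an HSM-backed Vault would be signed through Vault rather than from a key file. The flags used (`-s`, `-I`, `-n`, `-V`) are standard `ssh-keygen` certificate options.

```python
def build_sign_command(ca_key, pubkey, identity, principals, validity="+48h"):
    """Build the argv for signing a user public key with the User CA.

    -s: CA private key, -I: certificate identity,
    -n: comma-separated principals, -V: validity interval.
    """
    return [
        "ssh-keygen",
        "-s", ca_key,
        "-I", identity,
        "-n", ",".join(principals),
        "-V", validity,
        pubkey,
    ]


# Example: 48-hour bootstrap certificate for the first admin (paths are placeholders).
cmd = build_sign_command(
    "/secure/vault/ca_user", "id_ed25519.pub", "adm-bernd", ["adm-bootstrap"]
)
```

Passing the list to `subprocess.run(cmd, check=True)` would write the certificate next to the public key as `id_ed25519-cert.pub`.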
## 4. Automatic Management of Access Rights
### 4.1. Daily / On-Demand Workflow
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
- **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
- **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
- **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
- `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
- Example inventory snippet:
```yaml
hosts:
- name: prod-db-01
allowed_principals:
adm: [adm-full]
agt: [agt-incident-resolver-v2]
atm: [atm-backup-daily, atm-logrotate]
```
3. **Revocation & Rotation**
- Short expiry = automatic revocation.
- For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
- Agents/automations never store long-lived private keys on disk.
4. **Concrete Agent & Automation Wrapper Example** (Python snippet; place in `/usr/local/bin/ops-ssh-wrapper`)
```python
#!/usr/bin/env python3
import os
import subprocess
import sys
import tempfile

# Request short-lived cert from Vault for the public key in $SSH_PUBKEY
cert = subprocess.check_output([
    "vault", "write", "-field=signed_key", "ssh/sign/agt-role",
    f"public_key={os.environ['SSH_PUBKEY']}",
]).decode().strip()
with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
    f.write(cert.encode())
    cert_path = f.name
# Load into ssh-agent and exec the real command (passed as wrapper arguments)
subprocess.run(["ssh-add", cert_path])
os.execvp(sys.argv[1], sys.argv[1:])
```
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
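
Step 2 of the workflow renders the `auth_principals/` files from the Git inventory. A minimal sketch of that rendering follows, assuming one local account per actor class (`adm`, `agt`, `atm`), which the directive does not specify; the function name `render_principals` is hypothetical, and the host entry mirrors the inventory snippet. Each file lists one principal per line, which is the format sshd expects for `AuthorizedPrincipalsFile`.

```python
def render_principals(host: dict) -> dict[str, str]:
    """Map each auth_principals file path to its content for one host entry."""
    return {
        f"/etc/ssh/auth_principals/{account}": "\n".join(principals) + "\n"
        for account, principals in sorted(host["allowed_principals"].items())
    }


# Host entry as in the inventory snippet (already parsed from YAML).
host = {
    "name": "prod-db-01",
    "allowed_principals": {
        "adm": ["adm-full"],
        "agt": ["agt-incident-resolver-v2"],
        "atm": ["atm-backup-daily", "atm-logrotate"],
    },
}
files = render_principals(host)
```

In practice an Ansible template would do the same job; the sketch only makes the inventory-to-file mapping explicit.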
### 4.2. Human UX Guidance
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases.
### 4.3. Emergency Break-Glass Procedure
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
4. After recovery, immediately rotate the CA and run a full scorecard.
## 5. AccessManagement Scorecard (Checklist)
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f *.pub` |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | 10/10 = **Operational** | - | - |
**Scorecard Execution Command** (run from ops laptop):
```bash
ansible all -m command -a "ssh-access-scorecard.sh" --become
```
## 6. Scope & Operational Boundaries
### 6.1. When Bootstrapping Is Officially Closed
The system is **fully operational** when **ALL** of the following are true:
- Scorecard passes 10/10 on every host.
- Central Git repo contains the authoritative principals inventory.
- First three admins have successfully used signed certificates for 7 consecutive days.
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
- CI/CD pipeline for host config updates is green and runs hourly.
- Emergency break-glass procedure has been tested once.
**Declaration:** Ops Lead signs off with date in the Git commit message.
### 6.2. Scope Boundary: When to Switch to Sophisticated Tooling
Stay with **native OpenSSH CA + Ansible + Vault** while:
- ≤ 200 hosts
- ≤ 50 distinct agent/automation identities
- No regulatory requirement for SSO or full session recording
**Switch triggers** (any one):
- > 200 hosts OR rapid daily growth
- Need for human SSO (Okta/Google) integration
- Requirement for audited web-based SSH sessions or just-in-time access approval
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
- Audit/compliance demands central policy engine or session recording
**Recommended next-level tools** (in order):
1. **Teleport**: Best for mixed human + agent workloads (SSO + Machine ID).
2. **HashiCorp Vault SSH + Boundary**: When you already use Vault heavily.
3. **step-ca + smallstep**: If you prefer a pure open-source CA with OIDC.
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
## 7. Enforcement & Review
- **Quarterly review** of this directive and scorecard results.
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
- **Questions / improvements** → create PR against this file in the ops repo.
**End of Document**
Approved for immediate use across all production and staging environments.

View File

@@ -157,31 +157,82 @@ Just controlled operational access when you need it.
Start a bridge:
```
ob up hostA=hostB
bridge up state-hub-railiance01
```
Check active bridges:
```
ob status
bridge status
```
Investigate infrastructure targets:
```
ob targets
bridge targets
```
Stop the bridge when finished:
```
ob down hostA=hostB
bridge down state-hub-railiance01
```
OpsBridge handles the lifecycle so operators can focus on solving the problem.
---
# Tunnel lifecycle commands
| Command | Purpose |
|---------|---------|
| `bridge up` | Start tunnel(s) that are not already running |
| `bridge down` | Stop tunnel(s) that are running |
| `bridge restart` | Blank-slate recovery — get tunnel(s) operational again |
| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |
## `bridge restart` — blank-slate recovery
`bridge restart` means *operational again*, not merely cycling the local manager
PID while a broken remote listener still holds the port.
For **reverse** tunnels (State Hub exposure on remote hosts), restart:
1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
2. Clears orphan listeners on the remote host when needed
3. Reconnects the tunnel (stop + start) only when cleanup was required
When the remote forward is already healthy, restart reports `healthy` and leaves
the working tunnel running — no unnecessary disruption.
For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g.
`k3s-api-coulombcore`), restart uses local stop/start only; no remote cleanup.
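
The restart contract above can be sketched as a small decision function. This is an illustration only, not the ops-bridge implementation: `restart_action`, `needs_cleanup` (standing in for the result of `should_cleanup_tunnel`), and `cleanup_ok` are assumed names. The action strings are the ones named in the commit message for this change.

```python
def restart_action(direction: str, needs_cleanup: bool, cleanup_ok: bool = True) -> str:
    """Return the per-tunnel action a restart would report (hypothetical helper)."""
    if direction == "local":
        # Local-direction tunnels: plain stop/start, no remote cleanup.
        return "restarted"
    if not needs_cleanup:
        # The remote forward is already healthy: leave the tunnel running.
        return "healthy"
    if not cleanup_ok:
        # The orphan listener could not be cleared: surface the failure.
        return "error"
    # Stale forward was cleared, so the tunnel is reconnected (stop + start).
    return "cleaned_and_restarted"
```

A caller could map `"error"` to a non-zero exit code, which matches the documented behaviour on cleanup failure.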
Use `bridge maintenance cleanup` for scheduled or manual hygiene without the
restart contract. The nightly cron (`bridge maintenance install-cron`) runs
`maintenance cleanup --restart` at 03:00.
**Incident context:** stale orphan `sshd` remote forwards after laptop sleep
blocked `bridge restart` until operators discovered the maintenance subcommand.
See `state-hub/history/20260621-weekend-automation-assessment.md` and
`BRIDGE-WP-0005` in this repo.
## Host roles
Tunnels in `~/.config/bridge/tunnels.yaml` serve three host roles:
| Role | Hosts | Behaviour |
|------|-------|-----------|
| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |
Conditional remote cleanup before restart benefits all reverse tunnels.
`should_cleanup_tunnel` skips healthy forwards — VPS tunnels with live working
forwards are untouched.
---
# The Philosophy Behind OpsBridge
Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:

View File

@@ -0,0 +1,56 @@
---
id: ADHOC-2026-06-14
type: workplan
title: "Ad hoc ops-bridge fixes for 2026-06-14"
domain: custodian
repo: ops-bridge
status: finished
owner: codex
topic_slug: ops-bridge
created: "2026-06-14"
updated: "2026-06-14"
state_hub_workstream_id: "fbc2ef7e-626f-4c6a-bdf8-c69bf29097ce"
---
## Fix haskelseed bridge diagnostics
```task
id: ADHOC-2026-06-14-T01
status: done
priority: medium
state_hub_task_id: "ffe6b8d8-889c-4ec4-8b64-00b77f86e39f"
```
`haskelseed` is an Alpine host without `ss`, so `bridge check` reported
reverse tunnel ports as closed even while SSH reverse listeners were present.
Updated diagnostics to fall back from `ss` to `netstat` and then
`/proc/net/tcp`/`tcp6`. Also fixed local-direction diagnostics so
`nix-daemon-haskelseed` checks the local `-L` listener instead of probing a
remote reverse port.
Verification:
- `state-hub-haskelseed` responded through `127.0.0.1:18000/state/health`.
- `bridge check --json` reported all configured tunnels `ok: true`.
- `python3 -m pytest tests/test_cli.py tests/test_diagnostics.py` passed.
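
The last step of the fallback chain (`/proc/net/tcp` when neither `ss` nor `netstat` is available) might look like the sketch below. `listening_ports` is a hypothetical helper, not the function in the repo; it relies only on the kernel's documented table layout, where the port is the hexadecimal suffix of `local_address` and state `0A` means LISTEN.

```python
def listening_ports(proc_net_tcp: str) -> set[int]:
    """Extract TCP ports in LISTEN state from /proc/net/tcp-style text."""
    ports: set[int] = set()
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == "0A":  # 0A is the kernel's code for LISTEN
            # The port is the hex value after the last colon (works for tcp6 too).
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports
```

The same function can be fed the contents of `/proc/net/tcp6`, since only the part after the last colon is parsed.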
## Make default target safe and add setup
```task
id: ADHOC-2026-06-14-T02
status: done
priority: medium
state_hub_task_id: "3b932955-0d75-4b95-9821-92bfa2dadbd0"
```
Changed `make` to default to a help listing that only shows targets with
`##` comments. Added `make setup` to run `uv sync --all-groups` and reinstall
the editable `bridge` CLI wrapper through `uv tool install -e . --force`.
Verification:
- `uv sync --all-groups` succeeded and installed the project environment.
- `make` listed targets only and did not run tests or setup.
- `make setup` succeeded and installed the `bridge` executable.
- `make test` passed all 235 tests.
- `make lint` passed.

View File

@@ -0,0 +1,420 @@
---
id: BRIDGE-WP-0001
type: workplan
title: "OpsBridge Initial Implementation"
domain: infotech
repo: ops-bridge
status: completed
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: 79112cff-9c0a-42ad-aa3d-916013001aee
created: "2026-03-11"
updated: "2026-03-12"
---
# BRIDGE-WP-0001 — OpsBridge Initial Implementation
**Scope:** Full implementation of the `bridge` CLI tool as specified in the PRD and FRS.
**Out of scope:** OpsCatalog integration (deferred to a future workplan).
---
## Goal
Deliver a working `bridge` CLI installable via `uv tool install` that manages named SSH reverse tunnels with auto-reconnect, optional HTTP health checks, actor attribution, and an operational audit log.
---
## Reference Documents
| Document | Location |
|---|---|
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
| CLAUDE.md | `CLAUDE.md` |
---
## Architecture Summary
```
~/.config/bridge/tunnels.yaml   # static config: tunnels + actors
~/.local/state/bridge/          # runtime state
  <name>.pid                    # PID of tunnel subprocess manager
  <name>.log                    # reconnect + health event log
  <name>.state                  # current state string (for status cmd)
src/bridge/
  __init__.py
  cli.py                        # Typer app, all commands
  config.py                     # load + validate tunnels.yaml
  models.py                     # dataclasses: TunnelConfig, BridgeState, ActorInfo
  manager.py                    # TunnelManager: start/stop subprocess, reconnect loop
  health.py                     # HTTP health check via httpx
  state.py                      # read/write PID + state files
  audit.py                      # structured event log writer
```
**Bridge state machine:** `stopped → starting → connected → degraded → failed`
- `degraded` = SSH process alive but HTTP health check failing
- `failed` = reconnect attempts exhausted (configurable max)
---
## Config Schema (`~/.config/bridge/tunnels.yaml`)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore.local
    remote_port: 18000
    local_port: 8000
    ssh_user: ubuntu
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-coulombcore
    health_check:
      url: http://127.0.0.1:18000/health   # checked from remote side
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0        # 0 = infinite
      backoff_initial: 5
      backoff_max: 60
actors:
  agent.claude-coulombcore:
    class: automation
    description: Claude Code agent on CoulombCore
  operator.bernd:
    class: human
    description: Bernd Worsch
```
---
## Phase 1 — Project Scaffolding
**Acceptance:** `bridge --help` lists all commands.
### T01 — Create pyproject.toml
```task
id: BRIDGE-WP-0001-T01
state_hub_task_id: 76c9ee58-10bf-4060-87bb-b73fa8cf25ea
status: done
priority: high
```
Set up `[project]`, `[project.scripts]` (entry point `bridge = bridge.cli:app`), and dependencies: `typer`, `pyyaml`, `httpx`. Run `uv lock`.
### T02 — Create package skeleton
```task
id: BRIDGE-WP-0001-T02
state_hub_task_id: b2be974c-6173-457d-9276-080ac551c105
status: done
priority: high
```
Create `src/bridge/__init__.py` and empty module stubs: `cli.py`, `config.py`, `models.py`, `manager.py`, `health.py`, `state.py`, `audit.py`.
### T03 — Verify uv tool install
```task
id: BRIDGE-WP-0001-T03
state_hub_task_id: 82f70483-91ae-4545-88af-44fe693ecb79
status: done
priority: medium
```
Verify `uv tool install -e .` produces a working `bridge --help`.
---
## Phase 2 — Config Loading (FR-2, FC-1)
**Acceptance:** `config.load()` returns typed config objects; clear error message on bad YAML.
### T04 — Define config dataclasses in models.py
```task
id: BRIDGE-WP-0001-T04
state_hub_task_id: 495e4257-40ad-4a1b-8a71-3a311476d41e
status: done
priority: high
```
Define `TunnelConfig`, `ReconnectPolicy`, `HealthCheckConfig`, `ActorInfo` as dataclasses.
### T05 — Implement config.py
```task
id: BRIDGE-WP-0001-T05
state_hub_task_id: b6782df4-e692-49e1-b3a3-d65d07826907
status: done
priority: high
```
Load `~/.config/bridge/tunnels.yaml`, validate required fields, raise clear errors. Support `BRIDGE_CONFIG` env var override for testing.
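
A minimal sketch of the two behaviours described here: the `BRIDGE_CONFIG` override and required-field validation. It works on an already-parsed mapping so it stays stdlib-only. `ConfigError`, `config_path`, `validate_tunnels`, and the choice of required fields are assumptions for illustration, not the real `config.py`.

```python
import os
from pathlib import Path

DEFAULT_CONFIG = "~/.config/bridge/tunnels.yaml"
# Assumed minimal set of required fields; the real list may differ.
REQUIRED_TUNNEL_FIELDS = ("host", "remote_port", "local_port")


class ConfigError(ValueError):
    """Raised with a human-readable message when the config is invalid."""


def config_path(env=None) -> Path:
    """Resolve the config file, honouring the BRIDGE_CONFIG override."""
    env = os.environ if env is None else env
    return Path(env.get("BRIDGE_CONFIG", DEFAULT_CONFIG)).expanduser()


def validate_tunnels(raw: dict) -> dict:
    """Check that every tunnel entry carries the required fields."""
    tunnels = raw.get("tunnels")
    if not isinstance(tunnels, dict) or not tunnels:
        raise ConfigError("config must define at least one entry under 'tunnels'")
    for name, spec in tunnels.items():
        missing = [f for f in REQUIRED_TUNNEL_FIELDS if f not in (spec or {})]
        if missing:
            raise ConfigError(
                f"tunnel '{name}' is missing required field(s): {', '.join(missing)}"
            )
    return tunnels
```

Passing the environment as an argument keeps the override easy to exercise in unit tests without mutating `os.environ`.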
### T06 — Unit tests for config loading
```task
id: BRIDGE-WP-0001-T06
state_hub_task_id: 341c866f-8f4b-4165-9fa5-f10fe37c9252
status: done
priority: medium
```
Test: valid config, missing required field, unknown tunnel name.
---
## Phase 3 — State Management (FR-4, FR-7, FR-14)
**Acceptance:** State round-trips correctly; stale PIDs detected without error.
### T07 — Implement state.py
```task
id: BRIDGE-WP-0001-T07
state_hub_task_id: ae5e2566-a4b1-426f-9c32-4a2c025f2927
status: done
priority: high
```
Read/write PID file and state file under `~/.local/state/bridge/`. Check if PID is alive. Create state dir on first write.
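
One common way to check PID liveness on POSIX systems is to send signal 0, as sketched below. `pid_alive` and `live_pid` are hypothetical names, and treating a missing or unparsable PID file as "no live process" is an assumption about how stale state should be handled.

```python
import os
from pathlib import Path
from typing import Optional


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs the existence check without signalling
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def live_pid(pid_file: Path) -> Optional[int]:
    """Read a PID file; return the PID only if that process is still running."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None  # missing or corrupt file is treated as stale, not as an error
    return pid if pid_alive(pid) else None
```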
### T08 — Define BridgeState enum
```task
id: BRIDGE-WP-0001-T08
state_hub_task_id: 456a3cb5-50fa-4fed-9283-57e2d1c6fbb9
status: done
priority: medium
```
States: `STOPPED`, `STARTING`, `CONNECTED`, `DEGRADED`, `RECONNECTING`, `FAILED`.
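
The six states could be modelled as a string-valued enum, with a helper that derives the state of a started tunnel from what the manager observes. The lowercase values and the `running_state` helper are assumptions for illustration; `max_attempts == 0` means infinite retries, as in the config schema.

```python
from enum import Enum


class BridgeState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


def running_state(ssh_alive: bool, health_ok: bool,
                  attempts: int, max_attempts: int) -> BridgeState:
    """State of a started tunnel; STOPPED and STARTING are set explicitly elsewhere."""
    if ssh_alive:
        # SSH is up: the health check decides between connected and degraded.
        return BridgeState.CONNECTED if health_ok else BridgeState.DEGRADED
    if max_attempts > 0 and attempts >= max_attempts:
        # Reconnect budget exhausted (0 would mean retry forever).
        return BridgeState.FAILED
    return BridgeState.RECONNECTING
```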
### T09 — Unit tests for state management
```task
id: BRIDGE-WP-0001-T09
state_hub_task_id: 0accc0b7-d013-43ad-a810-3269e64fb096
status: done
priority: medium
```
Test: write/read state round-trip, stale PID detection without error.
---
## Phase 4 — Tunnel Process Manager (FR-1, FR-3, FR-12, FR-13)
**Acceptance:** `bridge up <name>` starts tunnel; killing SSH process triggers reconnect; `bridge down <name>` stops cleanly.
### T10 — Implement TunnelManager — SSH subprocess wrapper
```task
id: BRIDGE-WP-0001-T10
state_hub_task_id: d0341e90-b48d-48ab-9e6d-82f4c365afec
status: done
priority: high
```
SSH command: `ssh -N -R {remote_port}:127.0.0.1:{local_port} -i {key} -o ServerAliveInterval=10 -o ExitOnForwardFailure=yes {user}@{host}`. Manager runs as a daemonised child process; parent writes PID and exits.
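
The command line above can be assembled as an argument list, which avoids shell quoting issues when it is passed to `subprocess`. `build_ssh_argv` is a hypothetical helper name; the flags are exactly those listed in this task.

```python
import os


def build_ssh_argv(host: str, ssh_user: str, ssh_key: str,
                   remote_port: int, local_port: int) -> list[str]:
    """Build the argv for a reverse tunnel: remote_port on the host -> local_port here."""
    return [
        "ssh", "-N",
        "-R", f"{remote_port}:127.0.0.1:{local_port}",
        "-i", os.path.expanduser(ssh_key),  # expand a leading ~ in the key path
        "-o", "ServerAliveInterval=10",
        "-o", "ExitOnForwardFailure=yes",
        f"{ssh_user}@{host}",
    ]
```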
### T11 — Implement reconnect backoff loop
```task
id: BRIDGE-WP-0001-T11
state_hub_task_id: f5c91eff-fca3-4f66-b073-276a733b5a27
status: done
priority: high
```
Exponential backoff between `backoff_initial` and `backoff_max`. Respect `max_attempts` (0 = infinite). On disconnect: state → `RECONNECTING`, log event, restart SSH.
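
A sketch of the delay schedule, assuming a doubling factor (the workplan says "exponential" but does not fix the base). `backoff_delays` is a hypothetical generator name.

```python
from typing import Iterator


def backoff_delays(initial: float, maximum: float, max_attempts: int = 0) -> Iterator[float]:
    """Yield reconnect delays, doubling from `initial` and capped at `maximum`.

    max_attempts == 0 means retry forever, matching the config schema.
    """
    attempt = 0
    delay = initial
    while max_attempts == 0 or attempt < max_attempts:
        yield min(delay, maximum)
        delay = min(delay * 2, maximum)
        attempt += 1
```

With the schema defaults (`backoff_initial: 5`, `backoff_max: 60`) the delays grow 5, 10, 20, 40 and then stay at 60 seconds.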
### T12 — Implement graceful shutdown
```task
id: BRIDGE-WP-0001-T12
state_hub_task_id: 3f4df535-0d6a-49e8-9d3a-c3926d7f230c
status: done
priority: medium
```
Catch SIGTERM/SIGINT, kill SSH subprocess, write `STOPPED` state.
---
## Phase 5 — Health Monitoring (FR-15, FR-16, FR-17)
**Acceptance:** With a non-responsive health URL, `bridge status` shows `degraded`.
### T13 — Implement health.py
```task
id: BRIDGE-WP-0001-T13
state_hub_task_id: 5aaa0e35-f32a-4c68-8707-1a1e037b76f4
status: done
priority: medium
```
Async HTTP GET via `httpx` to configured health URL. Run health check loop inside manager process. On failure: state → `DEGRADED`; on recovery: state → `CONNECTED`.
### T14 — Write health check result to state dir
```task
id: BRIDGE-WP-0001-T14
state_hub_task_id: 599d4e28-88c8-4c2a-80ac-ca57824af467
status: done
priority: low
```
Persist timestamp, status, HTTP code or error for display in `bridge status`.
---
## Phase 6 — Audit Logging (FR-24, FR-25, FR-26)
**Acceptance:** All lifecycle events appear in the log with actor attribution.
### T15 — Implement audit.py
```task
id: BRIDGE-WP-0001-T15
state_hub_task_id: 2f124b16-f1e7-4e9f-ad23-9f08543db3b7
status: done
priority: medium
```
Append JSON-lines to `~/.local/state/bridge/<name>.log`. Events: `bridge_started`, `bridge_connected`, `bridge_disconnected`, `bridge_reconnecting`, `health_check_failed`, `health_check_recovered`, `bridge_stopped`. Each entry: `timestamp` (ISO-8601), `tunnel`, `actor`, `actor_class`, `event`, `detail`.
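
A minimal sketch of a JSON-lines writer with the entry fields listed above. `audit_entry` and `append_audit` are hypothetical names, and using UTC for the timestamp is an assumption.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def audit_entry(tunnel: str, actor: str, actor_class: str,
                event: str, detail: str = "") -> dict:
    """Build one audit record with the fields named in this task."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC assumed
        "tunnel": tunnel,
        "actor": actor,
        "actor_class": actor_class,
        "event": event,
        "detail": detail,
    }


def append_audit(log_path: Path, entry: dict) -> None:
    """Append the record as a single JSON line, creating the state dir if needed."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
```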
---
## Phase 7 — CLI Commands (FR-1, FR-5, FR-8, FR-10, FR-11)
**Acceptance:** All commands work end-to-end; `--help` on each command shows correct usage.
Status table columns: `TUNNEL`, `STATE`, `ACTOR`, `HOST`, `UPTIME`, `HEALTH`. Exit codes: 0 = success, 1 = tunnel not found / config error, 2 = tunnel already in requested state. `--json` flag on `status` for automation.
### T16 — CLI: bridge up
```task
id: BRIDGE-WP-0001-T16
state_hub_task_id: 2c22b8fe-8a35-4887-89b2-f8fb7f43e0b6
status: done
priority: high
```
Start named tunnel or all tunnels if name omitted.
### T17 — CLI: bridge down
```task
id: BRIDGE-WP-0001-T17
state_hub_task_id: 768e1a8b-fdf7-4718-b00e-bc2401f57657
status: done
priority: high
```
Stop named tunnel or all tunnels if name omitted.
### T18 — CLI: bridge restart
```task
id: BRIDGE-WP-0001-T18
state_hub_task_id: 8fd6486d-af4f-4295-a57a-a5fabbf25681
status: done
priority: medium
```
Down then up for named tunnel or all.
### T19 — CLI: bridge status
```task
id: BRIDGE-WP-0001-T19
state_hub_task_id: 28f3f392-9e94-43e7-811a-fa036f588e10
status: done
priority: high
```
Table output with `--json` flag for automation.
### T20 — CLI: bridge logs
```task
id: BRIDGE-WP-0001-T20
state_hub_task_id: 43582657-b1b9-4113-88e1-2109b30f3732
status: done
priority: medium
```
Tail log file. Defaults to last 50 lines. `--follow` for live tail. `--lines N` to override.
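
The default "last 50 lines" behaviour can be sketched with a bounded deque, which keeps memory use proportional to the number of lines requested rather than the file size. `tail_lines` is a hypothetical helper; `--follow` is not covered here.

```python
from collections import deque


def tail_lines(path, lines: int = 50) -> list[str]:
    """Return the last `lines` lines of a text file, without trailing newlines."""
    with open(path, encoding="utf-8") as fh:
        # deque with maxlen discards older lines as newer ones are read.
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]
```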
---
## Phase 8 — Integration Tests
**Acceptance:** `uv run pytest` passes cleanly.
### T21 — Integration test: up/status/down cycle
```task
id: BRIDGE-WP-0001-T21
state_hub_task_id: 5e3c7ac6-03fd-45e9-af64-11bde1d03ab8
status: done
priority: medium
```
Test fixture with minimal `tunnels.yaml` pointing to localhost. Test full `up → status → down` cycle against loopback SSH target or mocked subprocess.
### T22 — Integration test: reconnect behaviour
```task
id: BRIDGE-WP-0001-T22
state_hub_task_id: 8b6ac68e-d0ab-4826-8df5-ebdf30a1e23e
status: done
priority: medium
```
Test reconnect loop with a subprocess that exits immediately.
### T23 — Integration test: health check degraded path
```task
id: BRIDGE-WP-0001-T23
state_hub_task_id: c472bb1a-2fe2-4a88-aa6b-e18f732a3fde
status: done
priority: medium
```
Test degraded state with a mock HTTP server that returns failures.
---
## FRS Traceability
| FRS Requirement Group | Phase |
|---|---|
| FR-1 to FR-4 — Bridge creation | 4 |
| FR-5 to FR-7 — Bridge termination | 4 |
| FR-8 to FR-9 — Bridge restart | 7 |
| FR-10 to FR-11 — Status inspection | 7 |
| FR-12 to FR-14 — Lifecycle monitoring | 4 |
| FR-15 to FR-17 — Health monitoring | 5 |
| FR-18 to FR-20 — Actor attribution | 2, 6 |
| FR-24 to FR-26 — Audit logging | 6 |
| FC-1 — Config dependency | 2 |
| FC-2 — External connectivity | 4 |
*FR-21 to FR-23 (target discovery) and FR-27 to FR-29 (identity integration) are deferred — they depend on OpsCatalog and an identity provider respectively.*
---
## Deferred
- **FR-21 to FR-23**: Infrastructure target discovery (`bridge targets`); requires OpsCatalog
- **FR-27 to FR-29**: Identity provider integration (privacyIDEA / SSH CA); requires external identity infrastructure
- **OpsCatalog**: Separate workplan (`BRIDGE-WP-0002`)

View File

@@ -0,0 +1,404 @@
---
id: BRIDGE-WP-0002
type: workplan
title: "OpsCatalog Extension"
domain: infotech
repo: ops-bridge
status: completed
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: f38bfcdb-f115-4431-88b5-ce906a24199c
created: "2026-03-11"
updated: "2026-03-12"
---
# BRIDGE-WP-0002 — OpsCatalog Extension
**Scope:** Implement OpsCatalog as a Git-backed YAML knowledge repository and
integrate it with the `bridge` CLI.
**Depends on:** BRIDGE-WP-0001 complete (bridge CLI operational).
**Out of scope:** Identity provider integration (FR-27 to FR-29, deferred indefinitely).
---
## Goal
Deliver the OpsCatalog subsystem: a structured YAML catalog of operations
domains, targets, bridges, and actor classes stored in a Git repository.
OpsBridge loads the catalog at runtime to resolve bridge identifiers, orient
operators, and expose the `bridge targets` and `bridge catalog` commands.
---
## Reference Documents
| Document | Location |
|---|---|
| OpsCatalog Spec (PRD + FRS + Schemas) | `wiki/OpsCatalogSpecification.md` |
| OpsBridge FRS (deferred FRs) | `wiki/OpsBridgeFrs.md` §5.8, §5.10 |
| CLAUDE.md | `CLAUDE.md` |
---
## Architecture Summary
```
~/.config/bridge/tunnels.yaml
  catalog_path: ~/ops-catalog   # path to the OpsCatalog Git repo
ops-catalog/                    # separate Git repo, consumed by bridge
  domains/
    <domain>/
      domain.yaml               # type: domain
      targets/
        <target>.yaml           # type: target
      bridges/
        <bridge>.yaml           # type: bridge
      docs/
        *.md                    # operations notes
  actors/
    <actor>.yaml                # type: actor
  schemas/
    domain.schema.yaml
    target.schema.yaml
    bridge.schema.yaml
    actor.schema.yaml
src/bridge/
  catalog/
    __init__.py
    loader.py                   # walk catalog_path, parse YAML files into typed objects
    models.py                   # CatalogDomain, CatalogTarget, CatalogBridge, ActorClass
    validator.py                # validate catalog entries against schemas
    resolver.py                 # resolve tunnel name → CatalogBridge → TunnelConfig
```
**Integration points with existing bridge code:**
- `config.py`: read `catalog_path` from `tunnels.yaml`; pass to catalog loader
- `manager.py`: use `resolver.py` to look up bridge config from catalog when
tunnel is not defined inline in `tunnels.yaml`
- `cli.py`: add `bridge targets` and `bridge catalog` commands
---
## YAML Schemas
### domain.yaml
```yaml
type: domain
id: coulombcore
name: CoulombCore Infrastructure
description: Core infrastructure domain for operational services
environment: production
```
### target.yaml
```yaml
type: target
id: state-hub
domain: coulombcore
kind: service
description: Infrastructure state coordination service
reachable_via:
- state-hub-coulombcore
```
### bridge.yaml
```yaml
type: bridge
id: state-hub-coulombcore
domain: coulombcore
target: state-hub
description: Operations bridge for state hub diagnostics
access_method: ssh-reverse
host: coulombcore.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: agent.claude-coulombcore
health_check:
  url: http://127.0.0.1:18000/health
  interval_seconds: 30
  timeout_seconds: 5
reconnect:
  max_attempts: 0
  backoff_initial: 5
  backoff_max: 60
```
### actor.yaml
```yaml
type: actor
id: agent.claude-remediator
class: automation
description: Automated remediation agent
```
---
## Phase 1 — Catalog Data Models
**Acceptance:** All catalog YAML types parse into typed Python objects.
### T01 — Define catalog dataclasses in catalog/models.py
```task
id: BRIDGE-WP-0002-T01
state_hub_task_id: 21b90574-a27c-467c-8e9d-d4029a659171
status: done
priority: high
```
Define `CatalogDomain`, `CatalogTarget`, `CatalogBridge`, `ActorClass` dataclasses.
`CatalogBridge` must be mergeable with `TunnelConfig` (catalog supplies defaults;
inline `tunnels.yaml` entries can override).
---
## Phase 2 — Catalog Loader (FR-14)
**Acceptance:** `catalog.load(path)` returns a populated `Catalog` object from a
directory tree; unknown `type:` values are skipped with a warning.
### T02 — Implement catalog/loader.py
```task
id: BRIDGE-WP-0002-T02
state_hub_task_id: 782b5b4d-1f3f-4e5d-ad46-dc57b345bda3
status: done
priority: high
```
Walk `catalog_path` recursively, parse every `*.yaml` file, dispatch on `type:`
field. Build in-memory index: domains, targets, bridges, actors.
### T03 — Unit tests for catalog loader
```task
id: BRIDGE-WP-0002-T03
state_hub_task_id: 41fed4f8-7818-4ca1-bb48-6ac1089220e8
status: done
priority: medium
```
Test: full catalog directory fixture loads correctly; missing required field raises
clear error; unknown type is skipped; empty catalog returns empty index.
---
## Phase 3 — Catalog Validation (FR-15)
**Acceptance:** `bridge catalog validate` exits non-zero and prints all violations
when the catalog contains invalid entries.
### T04 — Implement catalog/validator.py
```task
id: BRIDGE-WP-0002-T04
state_hub_task_id: 32946d15-5516-4599-8f27-8c653dec6786
status: done
priority: medium
```
Validate required fields per type. Cross-reference checks: target's `domain` must
exist; target's `reachable_via` bridge IDs must exist; bridge's `target` and
`domain` must exist; actor referenced by bridge must exist.
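
The cross-reference checks listed above might be sketched as follows. This operates on plain dictionaries keyed by ID rather than the typed catalog models, and `find_violations` is a hypothetical name; it collects every violation instead of stopping at the first, in line with the acceptance criterion for `bridge catalog validate`.

```python
def find_violations(domains: set, targets: dict, bridges: dict, actors: set) -> list[str]:
    """Return a message for every dangling reference in the catalog."""
    violations: list[str] = []
    for tid, target in targets.items():
        domain = target.get("domain")
        if domain not in domains:
            violations.append(f"target '{tid}': unknown domain '{domain}'")
        for bid in target.get("reachable_via", []):
            if bid not in bridges:
                violations.append(f"target '{tid}': unknown bridge '{bid}' in reachable_via")
    for bid, bridge in bridges.items():
        domain, target_id, actor = bridge.get("domain"), bridge.get("target"), bridge.get("actor")
        if domain not in domains:
            violations.append(f"bridge '{bid}': unknown domain '{domain}'")
        if target_id not in targets:
            violations.append(f"bridge '{bid}': unknown target '{target_id}'")
        # The actor is only checked when the bridge names one (an assumption).
        if actor is not None and actor not in actors:
            violations.append(f"bridge '{bid}': unknown actor '{actor}'")
    return violations
```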
### T05 — Unit tests for catalog validation
```task
id: BRIDGE-WP-0002-T05
state_hub_task_id: 6061a6eb-9966-4be9-aa5e-ea7edf7fd085
status: done
priority: medium
```
Test: valid catalog passes; dangling `reachable_via` reference fails; missing
required field fails.
---
## Phase 4 — Bridge Resolver (FR-2 integration)
**Acceptance:** `bridge up state-hub-coulombcore` resolves the bridge config from
the catalog when no inline entry exists in `tunnels.yaml`.
### T06 — Implement catalog/resolver.py
```task
id: BRIDGE-WP-0002-T06
state_hub_task_id: a92d97c8-4eec-4dd5-9b90-d9c1cba813ac
status: done
priority: high
```
`resolve(name, catalog, inline_config) → TunnelConfig`. Lookup order: inline
`tunnels.yaml` entry wins; fall back to catalog bridge by ID. Merge catalog
bridge fields into `TunnelConfig`. Raise `BridgeNotFound` if neither source
has the name.
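
The lookup order can be sketched with plain dictionaries standing in for the catalog index and the inline config. This is a simplification: the real resolver returns a `TunnelConfig`, and the shallow merge shown here (a nested block such as `health_check` is replaced wholesale by the inline value) is an assumption about merge depth.

```python
class BridgeNotFound(KeyError):
    """Raised when neither tunnels.yaml nor the catalog defines the name."""


def resolve(name: str, catalog_bridges: dict, inline_tunnels: dict) -> dict:
    """Merge catalog defaults with inline overrides for one tunnel name."""
    inline = inline_tunnels.get(name)
    from_catalog = catalog_bridges.get(name)
    if inline is None and from_catalog is None:
        raise BridgeNotFound(name)
    merged = dict(from_catalog or {})  # catalog supplies defaults
    merged.update(inline or {})        # inline tunnels.yaml fields win
    return merged
```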
### T07 — Integrate resolver into config.py and manager.py
```task
id: BRIDGE-WP-0002-T07
state_hub_task_id: 23799377-64f2-4c13-aa72-364770d80f91
status: done
priority: high
```
Read `catalog_path` from `tunnels.yaml` (optional; catalog disabled if absent).
Pass resolved `TunnelConfig` to `TunnelManager` unchanged — manager stays
catalog-unaware.
### T08 — Unit tests for resolver
```task
id: BRIDGE-WP-0002-T08
state_hub_task_id: d2313182-975f-409f-9d4f-ebabf66b44df
status: done
priority: medium
```
Test: inline entry takes precedence; catalog fallback works; inline overrides
catalog fields; missing name raises `BridgeNotFound`.
---
## Phase 5 — CLI: bridge targets (FR-21, FR-22, FR-23)
**Acceptance:** `bridge targets` prints a table of domains, targets, and which
bridges provide access to each target.
### T09 — CLI: bridge targets command
```task
id: BRIDGE-WP-0002-T09
state_hub_task_id: f9e508db-a19f-42be-9437-b4bdeb00a534
status: done
priority: medium
```
Table columns: `DOMAIN`, `TARGET`, `KIND`, `BRIDGES`. `--domain <name>` filter.
`--json` flag for automation. Requires catalog to be configured; clear error if
`catalog_path` not set.
### T10 — CLI: bridge targets show <target>
```task
id: BRIDGE-WP-0002-T10
state_hub_task_id: e288a1d3-d676-404a-a3eb-25dbb241502d
status: done
priority: low
```
Show full metadata for a single target: domain, kind, description, reachable_via
bridges, and any operations notes from `docs/*.md` files in the domain directory.
---
## Phase 6 — CLI: bridge catalog commands
**Acceptance:** Operators can inspect and validate the catalog from the CLI.
### T11 — CLI: bridge catalog list
```task
id: BRIDGE-WP-0002-T11
state_hub_task_id: 73899b70-b0ac-4f48-b362-cc2455a66f41
status: done
priority: medium
```
List all domains and a count of targets and bridges per domain.
### T12 — CLI: bridge catalog validate
```task
id: BRIDGE-WP-0002-T12
state_hub_task_id: e091daa2-7c20-4169-b634-1fcc469513ea
status: done
priority: medium
```
Run `validator.py` and print all violations. Exit 0 if clean, 1 if violations
found. Useful in CI pipelines for the catalog repo.
### T13 — CLI: bridge catalog show <bridge-id>
```task
id: BRIDGE-WP-0002-T13
state_hub_task_id: 9f5f4f30-bfe6-40fd-b178-2fbb396816ee
status: done
priority: low
```
Print full resolved bridge metadata including target and domain context.
---
## Phase 7 — Integration Tests
**Acceptance:** `uv run pytest` passes cleanly with catalog fixtures.
### T14 — Integration test: catalog load and resolve
```task
id: BRIDGE-WP-0002-T14
state_hub_task_id: 5ccb2b4b-7ea5-4c38-8246-d59b8f7d4419
status: done
priority: medium
```
Fixture: minimal catalog directory with one domain, one target, one bridge.
Test `bridge up <catalog-bridge-name>` resolves and starts tunnel.
### T15 — Integration test: bridge targets output
```task
id: BRIDGE-WP-0002-T15
state_hub_task_id: 72c9f686-c474-46c4-a759-bfd47e2d4211
status: done
priority: medium
```
Test `bridge targets` output matches catalog fixture. Test `--json` flag.
### T16 — Integration test: bridge catalog validate
```task
id: BRIDGE-WP-0002-T16
state_hub_task_id: 83c0734e-0dc2-49ce-8b6a-a4d5e26ff33a
status: done
priority: medium
```
Test clean catalog exits 0; catalog with a dangling reference exits 1 with a
clear message.
---
## FRS Traceability
| FRS Requirement Group | Phase |
|---|---|
| FR-14 — Catalog retrieval | 2 |
| FR-15 — Catalog validation | 3 |
| FR-1 to FR-3 — Domain management | 2, 5 |
| FR-4 to FR-6 — Target management | 2, 5 |
| FR-7 to FR-9 — Bridge definition | 2, 4 |
| FR-10 to FR-11 — Actor classification | 2 |
| FR-12 to FR-13 — Operational annotations | 5 (docs/*.md) |
| FR-21 to FR-23 — Infrastructure target discovery (OpsBridge FRS) | 5 |
*FR-27 to FR-29 (identity integration) remain deferred; they require external identity
provider infrastructure.*
---
## Deferred
- **FR-27 to FR-29**: Identity provider integration (privacyIDEA / SSH CA); separate
workplan when identity infrastructure is available.
- **Operations notes search**: full-text search across `docs/*.md` files; nice
to have, not required for MVP.

View File

@@ -0,0 +1,526 @@
---
id: BRIDGE-WP-0003
type: workplan
title: "OpsBridge MCP Server, Skill, and Cross-Mode Test Coverage"
domain: infotech
repo: ops-bridge
status: done
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: 97009d3f-fd92-4fd9-a308-6c2445b4d623
created: "2026-03-12"
updated: "2026-03-12"
---
# BRIDGE-WP-0003 — OpsBridge MCP Server, Skill, and Cross-Mode Test Coverage
**Scope:** Expose OpsBridge and OpsCatalog functionality as a FastMCP server
and a Claude Code skill. Introduce a capability registry and cross-access-mode
test suite that enforces test coverage parity across CLI, MCP, and skill for
every operation — including a meta-test that validates the test suite itself is
complete.
**Depends on:** BRIDGE-WP-0001 and BRIDGE-WP-0002 complete.
**Out of scope:** Identity provider integration (FR-27 to FR-29, deferred indefinitely).
---
## Goal
After this workplan:
1. Any Claude Code agent can call `bridge_up()`, `bridge_status()`,
`catalog_list_targets()` etc. as first-class MCP tools — no Bash
required, structured JSON in/out.
2. Human operators can invoke `/bridge-status` as a skill to get an
immediate, natural-language summary of tunnel health.
3. Adding any new capability (CLI command, MCP tool) without writing tests
for all required access modes causes `uv run pytest` to fail with a
clear capability × mode gap report.
4. The gap-detection mechanism is itself tested: a synthetic missing-mode
fixture asserts the meta-test catches it.
---
## Reference Documents
| Document | Location |
|---|---|
| Architecture note | `CLAUDE.md` — Architecture section |
| OpsBridge FRS | `wiki/OpsBridgeFrs.md` |
| State Hub MCP server (reference impl) | `~/the-custodian/state-hub/mcp_server/server.py` |
---
## Architecture Summary
```
src/bridge/
  capabilities.py               # canonical capability registry
  mcp_server/
    __init__.py
    server.py                   # FastMCP app, stdio entry point
.mcp.json                       # project-scope MCP registration
scripts/
  register_mcp.py               # user-scope registration helper
~/.claude/plugins/
  ops-bridge/
    bridge-status.md            # /bridge-status skill
tests/
  conftest.py                   # capability + access_mode marks, collector helper
  test_cli.py                   # existing, annotated with marks (T09)
  test_mcp.py                   # new: FastMCP in-process client tests
  test_skill.py                 # new: static skill coverage lint
  test_coverage_completeness.py # new: cross-mode meta-test
```
### Capability Registry
```python
# src/bridge/capabilities.py
from dataclasses import dataclass

ACCESS_MODES = {"cli", "mcp", "skill"}


@dataclass
class Capability:
    name: str
    description: str
    required_access_modes: frozenset[str]


CAPABILITIES: list[Capability] = [
    Capability("bridge_up", "Start one or all tunnels", frozenset({"cli", "mcp"})),
    Capability("bridge_down", "Stop one or all tunnels", frozenset({"cli", "mcp"})),
    Capability("bridge_restart", "Restart one or all tunnels", frozenset({"cli", "mcp"})),
    Capability("bridge_status", "Show tunnel status", frozenset({"cli", "mcp", "skill"})),
    Capability("bridge_logs", "Tail tunnel audit log", frozenset({"cli", "mcp"})),
    Capability("catalog_list_targets", "List catalog targets", frozenset({"cli", "mcp"})),
    Capability("catalog_show_target", "Show target metadata", frozenset({"cli", "mcp"})),
    Capability("catalog_list_domains", "List catalog domains", frozenset({"cli", "mcp"})),
    Capability("catalog_validate", "Validate catalog consistency", frozenset({"cli", "mcp"})),
    Capability("catalog_show_bridge", "Show bridge metadata", frozenset({"cli", "mcp"})),
]
```
### Cross-Mode Test Marks
Every test that exercises a capability against an access mode carries two marks:
```python
@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("cli")
def test_bridge_up_cli(runner, config_file):
    result = runner.invoke(app, ["up", "my-tunnel"])
    assert result.exit_code == 0


@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("mcp")
async def test_bridge_up_mcp(mcp_client):
    result = await mcp_client.call_tool("bridge_up", {"tunnel": "my-tunnel"})
    assert result["started"] == ["my-tunnel"]
```
### Meta-Test Mechanism
`test_coverage_completeness.py` uses a pytest plugin hook to collect all
test items, read their marks, and assert the coverage matrix is complete:
```
capability            cli  mcp  skill
bridge_up             ✓    ✓    —      (not required for skill)
bridge_status         ✓    ✓    ✓
catalog_list_targets  ✓    ✓    —
...
```
The test fails with a table of gaps. The meta-test is itself validated by a fixture
that injects a synthetic `Capability("_test_sentinel", "synthetic sentinel", frozenset({"cli", "mcp"}))`,
deliberately omits the `mcp` test, and asserts the checker raises.
---
## Phase 1 — Capability Registry
**Acceptance:** `from bridge.capabilities import CAPABILITIES` works; every
existing CLI command and the planned MCP tool set appears in the registry.
### T01 — Define capability registry module (src/bridge/capabilities.py)
```task
id: BRIDGE-WP-0003-T01
state_hub_task_id: 1397a838-b225-4452-ad53-29ad65388060
status: done
priority: high
```
`Capability` dataclass with `name`, `description`, `required_access_modes`.
List all 10 capabilities as shown in the architecture above. No external
dependencies — pure stdlib.
### T02 — Meta-test: registry completeness against CLI commands and MCP tools
```task
id: BRIDGE-WP-0003-T02
state_hub_task_id: 97467243-9237-4e63-a860-cc49587546ad
status: done
priority: high
```
Introspect `app.registered_commands` (Typer) and `mcp.list_tools()` (FastMCP).
Assert every name appears in `{c.name for c in CAPABILITIES}`. Fails fast if
a developer adds a CLI command or MCP tool without updating the registry.
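
The check in T02 reduces to a set difference. The following is a minimal sketch, assuming the Typer command names and FastMCP tool names have already been normalised to registry names (for example `up` to `bridge_up`); the helper name `unregistered_names` is hypothetical and not part of the codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    name: str
    description: str
    required_access_modes: frozenset[str]


def unregistered_names(exposed: set[str], capabilities: list[Capability]) -> set[str]:
    """Return exposed command/tool names that have no registry entry.

    `exposed` is assumed to hold names already mapped to registry form.
    """
    registered = {cap.name for cap in capabilities}
    return exposed - registered
```

A meta-test could then assert that `unregistered_names(...)` is empty and print the offending names otherwise.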
---
## Phase 2 — MCP Server
**Acceptance:** `uv run python src/bridge/mcp_server/server.py` starts without
error; `bridge_status()` returns a list of tunnel dicts; `bridge_up("x")`
returns `{"started": ["x"]}` or `{"already_running": ["x"]}`.
### T03 — Add fastmcp dependency and mcp_server package skeleton
```task
id: BRIDGE-WP-0003-T03
state_hub_task_id: f2fd64f5-31c6-493b-b48b-d13980467cca
status: done
priority: high
```
Add `fastmcp>=2.0.0` to `[project.dependencies]` in `pyproject.toml`. Create
`src/bridge/mcp_server/__init__.py` (empty) and `server.py` with:
```python
from fastmcp import FastMCP

mcp = FastMCP(name="ops-bridge", instructions="...")

if __name__ == "__main__":
    mcp.run(transport="stdio")
```
### T04 — Implement bridge lifecycle MCP tools (up, down, restart, status, logs)
```task
id: BRIDGE-WP-0003-T04
state_hub_task_id: 1bfc9b36-2be3-4606-a6e9-d611d1ac33ab
status: done
priority: high
```
`@mcp.tool()` wrappers that import and call the Python library directly (no
subprocess). Signatures:
```python
def bridge_up(tunnel: str | None = None) -> dict: ...
def bridge_down(tunnel: str | None = None) -> dict: ...
def bridge_restart(tunnel: str | None = None) -> dict: ...
def bridge_status() -> list[dict]: ...
def bridge_logs(tunnel: str, lines: int = 50) -> list[dict]: ...
```
All return JSON-serialisable dicts/lists. `tunnel=None` means all tunnels.
### T05 — Implement catalog MCP tools
```task
id: BRIDGE-WP-0003-T05
state_hub_task_id: ef7fa23c-d2e1-4fe0-9e26-994c1a6ce1fb
status: done
priority: high
```
```python
def catalog_list_targets(domain: str | None = None) -> list[dict]: ...
def catalog_show_target(target_id: str) -> dict | None: ...
def catalog_list_domains() -> list[dict]: ...
def catalog_validate() -> dict: ...  # {"valid": bool, "errors": list[str]}
def catalog_show_bridge(bridge_id: str) -> dict | None: ...
```
When `catalog_path` is not configured in `tunnels.yaml`, return
`{"error": "catalog_path not configured"}` rather than raising.
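
One way to express the "return an error dict rather than raise" rule is a small guard shared by all catalog tools. This is a sketch only: `catalog_guard` and its `query` callback are hypothetical names, and how `catalog_path` is read from `tunnels.yaml` is not shown.

```python
from typing import Any, Callable, Optional


def catalog_guard(catalog_path: Optional[str], query: Callable[[str], Any]) -> Any:
    """Run `query` against the catalog, or report that no catalog is configured.

    Returns the error dict instead of raising, so an MCP client always
    receives a JSON-serialisable result.
    """
    if not catalog_path:
        return {"error": "catalog_path not configured"}
    return query(catalog_path)
```

Each catalog tool would call the guard with its own query function, which keeps the unconfigured case identical across all five tools.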
### T06 — Implement bridge:// and catalog:// MCP resources
```task
id: BRIDGE-WP-0003-T06
state_hub_task_id: 71c9ee45-6928-416c-b4f3-dfb785a0ec8f
status: done
priority: medium
```
```python
@mcp.resource("bridge://status")
def resource_bridge_status() -> str:
    """Live snapshot of all tunnel states."""


@mcp.resource("catalog://domains")
def resource_catalog_domains() -> str: ...


@mcp.resource("catalog://targets")
def resource_catalog_targets() -> str: ...
```
Resources are for cheap orientation reads; tools are for actions and
parameterised queries. Both are needed.
### T07 — Add .mcp.json project-scope registration config
```task
id: BRIDGE-WP-0003-T07
state_hub_task_id: 618c011d-bd1b-4c8f-8750-f3d2f9fcaf88
status: done
priority: medium
```
```json
{
  "mcpServers": {
    "ops-bridge": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "python", "src/bridge/mcp_server/server.py"],
      "cwd": "/home/worsch/ops-bridge"
    }
  }
}
```
Project-scope: Claude Code sessions inside `ops-bridge/` get the tools
automatically. See T14 for user-scope (machine-global) registration.
---
## Phase 3 — Skill
**Acceptance:** `/bridge-status` invoked in Claude Code runs the skill,
calls `bridge_status` MCP tool, and returns a natural-language health summary.
### T08 — Implement /bridge-status skill for human operators
```task
id: BRIDGE-WP-0003-T08
state_hub_task_id: 2c070f34-12b5-4dd9-ab24-bb7b6836773c
status: done
priority: medium
```
Skill file at `~/.claude/plugins/ops-bridge/bridge-status.md`. Prompt instructs
Claude to:
1. Call `bridge_status` MCP tool
2. Report each tunnel: name, state (with colour hint), host, uptime
3. Flag any `degraded` or `failed` tunnels and suggest `bridge restart <name>`
4. If catalog is configured, offer `catalog_list_targets` for discovery context
Skill prompt **must** reference the canonical capability names (`bridge_status`,
`catalog_list_targets`) so `test_skill.py` can assert coverage statically.
---
## Phase 4 — Cross-Access-Mode Test Suite
**Acceptance:** `uv run pytest` fails if any capability is missing a test for
any of its required access modes. The failure message is a capability × mode
gap matrix. The meta-test is itself verified by a synthetic failing fixture.
### T09 — CLI test layer: annotate existing tests with capability/access_mode marks
```task
id: BRIDGE-WP-0003-T09
state_hub_task_id: a8f3f5fb-fcd6-47e9-aad5-85dc803f796d
status: done
priority: high
```
Retrofit `tests/test_cli.py` (and other CLI test files) with:
```python
@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("cli")
def test_bridge_up_starts_tunnel(...): ...
```
Every capability whose `required_access_modes` includes `"cli"` must have at
least one marked test in the CLI layer.
### T10 — MCP test layer: tests/test_mcp.py with FastMCP in-process test client
```task
id: BRIDGE-WP-0003-T10
state_hub_task_id: acb7ada6-111d-4b8d-b201-45748c394c43
status: done
priority: high
```
Use FastMCP's `Client(mcp_app)` context manager (in-process, no network):
```python
@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("mcp")
async def test_bridge_up_mcp(mcp_client, mock_tunnel_manager):
    result = await mcp_client.call_tool("bridge_up", {"tunnel": "t1"})
    assert result["started"] == ["t1"]
```
Cover: correct return schema, missing tunnel name handled, catalog tools
graceful when `catalog_path` unset, resource URIs return valid JSON.
### T11 — Skill test layer: tests/test_skill.py — static skill coverage lint
```task
id: BRIDGE-WP-0003-T11
state_hub_task_id: 071adfa4-2ccb-466b-b298-35130876267f
status: done
priority: medium
```
Parse the skill markdown file. Assert:
- File is syntactically valid (frontmatter parseable)
- Each capability with `"skill"` in `required_access_modes` has its `name`
appearing in the skill body text
This is a static lint, not an LLM invocation — fast and deterministic.
```python
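from pathlib import Path
import pytest
from bridge.capabilities import CAPABILITIES

@pytest.mark.capability("bridge_status")  # both marks are needed for the meta-test to count this
@pytest.mark.access_mode("skill")
def test_skill_covers_required_capabilities():
    # expanduser() is required: Path does not expand "~" on its own.
    skill_text = Path("~/.claude/plugins/ops-bridge/bridge-status.md").expanduser().read_text()
    for cap in CAPABILITIES:
        if "skill" in cap.required_access_modes:
            assert cap.name in skill_text, f"Skill missing capability: {cap.name}"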
```
### T12 — Cross-mode completeness meta-test: tests/test_coverage_completeness.py
```task
id: BRIDGE-WP-0003-T12
state_hub_task_id: f1277a48-1790-42bd-8c70-8ba10c68312b
status: done
priority: critical
```
The centrepiece. Uses a pytest collection hook in `conftest.py` (for example
`pytest_collection_modifyitems`) to collect all test items, read their marks,
build the coverage matrix, and assert completeness:
```python
def test_all_capabilities_have_all_required_mode_tests(pytestconfig):
    covered = collect_capability_coverage(pytestconfig)
    gaps = []
    for cap in CAPABILITIES:
        for mode in cap.required_access_modes:
            if (cap.name, mode) not in covered:
                gaps.append(f"  {cap.name:<30} {mode}")
    if gaps:
        pytest.fail("Missing capability × mode coverage:\n" + "\n".join(gaps))
```
**Self-validation fixture:** a separate test injects a synthetic capability
`Capability("_test_sentinel", "synthetic sentinel", frozenset({"cli", "mcp"}))` into a copy of
`CAPABILITIES`, provides only a `cli`-marked test for it, and asserts that
calling `collect_capability_coverage` on this patched set reports the `mcp`
gap.
### T13 — conftest.py: pytest marks registration and coverage collector helper
```task
id: BRIDGE-WP-0003-T13
state_hub_task_id: c518662a-9a5b-40de-86f5-582a16489cd3
status: done
priority: medium
```
Register custom marks to silence `PytestUnknownMarkWarning`:
```toml
# pyproject.toml
[tool.pytest.ini_options]
markers = [
    "capability(name): the bridge capability under test",
    "access_mode(mode): access mode being tested (cli, mcp, skill)",
]
```
Implement `collect_capability_coverage(session_or_items)` in `conftest.py`
that walks collected items and returns `set[tuple[str, str]]` of
`(capability_name, access_mode)` pairs.
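
The collector could look roughly like this. It is a sketch under the assumption that each collected item exposes pytest's `iter_markers(name=...)` and that each mark carries its value as the first positional argument; how the items are obtained (session, config, or hook) is left open.

```python
from typing import Iterable


def collect_capability_coverage(items: Iterable) -> set[tuple[str, str]]:
    """Return the (capability_name, access_mode) pairs covered by the given items.

    A test counts only when it carries both a `capability` and an
    `access_mode` mark; tests with a single mark contribute nothing.
    """
    covered: set[tuple[str, str]] = set()
    for item in items:
        capabilities = [m.args[0] for m in item.iter_markers(name="capability") if m.args]
        modes = [m.args[0] for m in item.iter_markers(name="access_mode") if m.args]
        for capability in capabilities:
            for mode in modes:
                covered.add((capability, mode))
    return covered
```

Because the function only relies on `iter_markers`, it can be unit-tested with lightweight fake items, which is what the self-validation fixture in T12 needs.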
---
## Phase 5 — Registration and Documentation
**Acceptance:** `python scripts/register_mcp.py` registers ops-bridge MCP at
user scope; `bridge --help` still works; `uv run pytest` passes.
### T14 — User-scope registration guide and patch script
```task
id: BRIDGE-WP-0003-T14
state_hub_task_id: b86916ba-59f3-44c1-b874-8af92d30e470
status: done
priority: medium
```
`scripts/register_mcp.py` modelled on `state-hub/scripts/patch_mcp_cwd.py`:
reads `.mcp.json`, registers at user scope via `claude mcp add-json -s user`,
then patches `cwd` directly in `~/.claude.json`. Update `README.txt` with:
```
MCP INTEGRATION
---------------
Project-scope (auto, inside ops-bridge/):
Already configured in .mcp.json.
User-scope (machine-global, any repo):
python scripts/register_mcp.py
```
### T15 — Integration test: agent workflow (bridge_status → bridge_up → bridge_status)
```task
id: BRIDGE-WP-0003-T15
state_hub_task_id: d826764f-e2f1-4f6a-842c-a1852a88b209
status: done
priority: medium
```
End-to-end MCP flow with mocked `TunnelManager`:
1. `bridge_status()` → all tunnels `stopped`
2. `bridge_up("test-tunnel")` → `{"started": ["test-tunnel"]}`
3. `bridge_status()` → `test-tunnel` now `connected`
Verifies the MCP layer correctly delegates to the library and state is
reflected. Marked `@pytest.mark.capability("bridge_up") @pytest.mark.access_mode("mcp")`.
---
## Capability × Mode Coverage Target
| Capability | CLI | MCP | Skill |
|-------------------------|-----|-----|-------|
| bridge_up | ✓ | ✓ | |
| bridge_down | ✓ | ✓ | |
| bridge_restart | ✓ | ✓ | |
| bridge_status | ✓ | ✓ | ✓ |
| bridge_logs | ✓ | ✓ | |
| catalog_list_targets | ✓ | ✓ | |
| catalog_show_target | ✓ | ✓ | |
| catalog_list_domains | ✓ | ✓ | |
| catalog_validate | ✓ | ✓ | |
| catalog_show_bridge | ✓ | ✓ | |
Only `bridge_status` requires skill coverage. The skill prompt also references
`catalog_list_targets` as optional discovery context (T08), but the registry does
not require skill coverage for it. All other capabilities are CLI+MCP only.
---
## Deferred
- **FR-2729** — Identity provider integration — separate workplan.
- **Skill coverage for lifecycle operations** — `/bridge-up`, `/bridge-down`
skills for human operators are low value; agents use MCP tools directly.
- **Remote MCP transport (SSE/HTTP)** — stdio is sufficient for local use;
remote transport is a future concern when ops-bridge runs on a headless node.


@@ -0,0 +1,340 @@
---
id: BRIDGE-WP-0004
type: workplan
title: "AccessManagementDirective Alignment"
domain: infotech
repo: ops-bridge
status: done
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
state_hub_workstream_id: "e3451b70-688e-4e19-bff5-0c82c0f009a7"
---
# BRIDGE-WP-0004 — AccessManagementDirective Alignment
**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
preserving full backward compatibility with the existing static-key mode.
**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
---
## Goal
After this workplan:
1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
`cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
handled transparently by the tunnel manager.
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
the directive, with config validation that enforces naming conventions.
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
§5 SIEM traceability requirement.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
---
## Design Decisions
### Static key mode stays first-class
If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
explicitly supported for:
- Lab/dev environments without a CA
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
- Environments below the directive's complexity threshold
### cert_command interface
```yaml
# tunnels.yaml — optional cert_command field
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519  # private key (always required)
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
on tunnel stop.
`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
### TTL-aware cert refresh
After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to
determine `cert_expires_at`. It schedules a pre-emptive cert refresh
(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
failure, no reconnect backoff triggered.
If `cert_command` is absent, no TTL logic runs.
### Actor type model
`actor_class: str # "human" | "automation"` is replaced by:
```python
class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline
```
Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
canonical values.
Config validation: if `actor` name is set, it must start with the prefix matching its type
(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
SIEM auditability.
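
The legacy mapping and the prefix rule can be sketched together as follows. The function names `parse_actor_type` and `validate_actor_name` are hypothetical, and `ConfigError` here is a stand-in for the project's own exception class.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


class ConfigError(ValueError):
    """Stand-in for the project's configuration error type."""


# One-way migration aid for pre-directive configs.
_LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    """Map a config value to ActorType, warning on legacy values."""
    if raw in _LEGACY_CLASSES:
        canonical = _LEGACY_CLASSES[raw]
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {canonical.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return canonical
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor class: {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Hard-fail when the actor name lacks the prefix for its type."""
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
```

Raising rather than warning on a prefix mismatch follows the directive's auditability requirement: a mislabelled actor never reaches the audit log.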
---
## Tasks
### T1 — ActorType enum
```task
id: BRIDGE-WP-0004-T1
state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
status: done
priority: high
```
- [x] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
- [x] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
  `ActorType.ATM` with a `DeprecationWarning`; reject unknown values
- [x] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
  `atm-*` for ATM; raise `ConfigError` on mismatch
- [x] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
- [x] Update tests
### T2 — cert_command config field
```task
id: BRIDGE-WP-0004-T2
state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
status: done
priority: high
```
- [x] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
- [x] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
content (shell-level freedom intentional)
- [x] Document in config example / SCOPE.md
### T3 — Cert acquisition in manager
```task
id: BRIDGE-WP-0004-T3
state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
status: done
priority: high
```
- [x] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
  - If `cfg.cert_command` is None: return None (static key mode)
  - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
  - Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
  - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
- [x] `build_ssh_command`: accept optional `cert_path`; when set, insert
  `-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
- [x] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
  so every reconnect gets a fresh cert
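
The acquisition steps above can be sketched as a standalone function. This is not the code in `manager.py`: the signature takes the tunnel name, command, and state directory explicitly (an assumption made so the sketch is self-contained), and `CertAcquisitionError` is redefined locally as a stand-in.

```python
import subprocess
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Stand-in for the exception defined in models.py."""


def acquire_cert(tunnel_name: str, cert_command: Optional[str], state_dir: Path) -> Optional[Path]:
    """Run cert_command and store its stdout as the tunnel's cert file.

    Returns None in static key mode (no cert_command configured).
    """
    if cert_command is None:
        return None
    # cert_command is a raw shell string by design, hence shell=True.
    result = subprocess.run(cert_command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    state_dir.mkdir(parents=True, exist_ok=True)
    cert_path = state_dir / f"{tunnel_name}-cert.pub"
    cert_path.write_text(result.stdout)  # overwritten on every acquisition
    return cert_path
```

Calling this at the top of every reconnect iteration, rather than once at startup, is what lets a reconnect after expiry pick up a fresh cert.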
### T4 — cert_identity in audit log
```task
id: BRIDGE-WP-0004-T4
state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
status: done
priority: high
```
- [x] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
extract `Key ID` (the `-I` value from signing time)
- [x] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
JSON entry when present
- [x] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
- [x] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
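
Extracting the identity is a small text-parsing step. The sketch below assumes `ssh-keygen -L` prints the identity on a line of the form `Key ID: "<id>"`; the function name `parse_cert_identity` is hypothetical.

```python
import re
from typing import Optional

# Assumed output shape:  Key ID: "agt-state-hub-bridge"
_KEY_ID_RE = re.compile(r'^\s*Key ID:\s*"(?P<key_id>.*)"\s*$', re.MULTILINE)


def parse_cert_identity(keygen_output: str) -> Optional[str]:
    """Return the Key ID from `ssh-keygen -L` output, or None if absent."""
    match = _KEY_ID_RE.search(keygen_output)
    return match.group("key_id") if match else None
```

Returning `None` rather than raising keeps `cert_identity` optional in the audit entry, matching the static key case.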
### T5 — TTL-aware cert refresh
```task
id: BRIDGE-WP-0004-T5
state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
status: done
priority: high
```
- [x] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp
from `ssh-keygen -L` output → `cert_expires_at: datetime`
- [x] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
on each iteration
- [x] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
next iteration)
- [x] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
`AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
- [x] If `cert_command` is absent, skip all TTL logic entirely
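
A minimal sketch of the expiry parsing and refresh check. It assumes `ssh-keygen -L` prints the validity window on a line of the form `Valid: from <start> to <end>` with timestamps in local time, which should be checked against the installed OpenSSH version; both function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed output shape:  Valid: from 2026-06-21T10:00:00 to 2026-06-21T11:00:00
_VALID_RE = re.compile(r"Valid:\s+from\s+\S+\s+to\s+(?P<end>\S+)")
REFRESH_MARGIN = timedelta(minutes=5)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the cert expiry as an aware UTC datetime, or None if not bounded."""
    match = _VALID_RE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever"
    try:
        local_naive = datetime.fromisoformat(match.group("end"))
    except ValueError:
        return None
    # A naive datetime is interpreted as local time by astimezone().
    return local_naive.astimezone(timezone.utc)


def refresh_due(
    expires_at: datetime,
    now: Optional[datetime] = None,
    margin: timedelta = REFRESH_MARGIN,
) -> bool:
    """True once `now` is within `margin` of expiry (or past it)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= expires_at - margin
```

The health-check loop would call `refresh_due(cert_expires_at)` on each iteration and terminate the SSH child when it returns true.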
### T6 — `bridge cert-status` command
```task
id: BRIDGE-WP-0004-T6
state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
status: done
priority: medium
```
- [x] `cli.py`: add `cert-status [TUNNEL]` subcommand
- [x] For each tunnel (or the named one): read cert file from state dir if present,
run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
time-to-expiry (or "static key / no cert" if absent)
- [x] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
- [x] `--json` flag for machine-readable output
### T7 — CertAcquisitionError handling
```task
id: BRIDGE-WP-0004-T7
state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
status: done
priority: high
```
- [x] New exception `CertAcquisitionError` in `models.py`
- [x] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
(cert failures are transient — e.g., Vault briefly unreachable)
- [x] After `max_attempts` consecutive cert failures, transition to `FAILED` state
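
The retry behaviour can be illustrated in isolation. This sketch assumes an exponential backoff with a cap, which the workplan does not specify (it only says "normal backoff"); `acquire_with_backoff` and its parameters are hypothetical, and the returned event strings merely imitate audit entries.

```python
import time
from typing import Callable


class CertAcquisitionError(RuntimeError):
    """Stand-in for the exception defined in models.py."""


def acquire_with_backoff(
    acquire: Callable[[], object],
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
    base_delay: float = 1.0,  # assumed value
    max_delay: float = 60.0,  # assumed value
) -> tuple[str, list[str]]:
    """Retry `acquire` until it succeeds or max_attempts consecutive failures occur."""
    events: list[str] = []
    failures = 0
    while failures < max_attempts:
        try:
            acquire()
        except CertAcquisitionError as exc:
            failures += 1
            events.append(f"BRIDGE_DISCONNECTED cert acquisition failed: {exc}")
            if failures < max_attempts:
                sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
            continue
        return "connected", events
    return "failed", events
```

Injecting `sleep` keeps the loop testable without real delays.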
### T8 — SCOPE.md and documentation updates
```task
id: BRIDGE-WP-0004-T8
state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
status: done
priority: medium
```
- [x] Update `SCOPE.md`: Current State updated to reflect completion; directive alignment done
- [x] `wiki/OpsBridgeFrs.md` §5.7 already covers actor attribution abstractly — no changes needed
- [x] `.claude/rules/architecture.md` already documents cert_command mode and actor vocab
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency (deferred)
### T9 — Tests
```task
id: BRIDGE-WP-0004-T9
state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
status: done
priority: high
```
- [x] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
cert_command parse
- [x] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
args; verify `CertAcquisitionError` on non-zero exit; TTL logic helpers
- [x] `test_audit.py`: `cert_identity` field; actor_type rename
- [x] `test_cli.py`: `cert-status` exit codes; JSON output shape
- [x] 233 tests, 0 failures
---
## Config Schema — Before / After
### Before
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: ops-agent
    ssh_key: ~/.ssh/id_ed25519
    actor: automation-agent
actors:
  automation-agent:
    class: automation
    description: "state hub bridge agent"
```
### After (static key mode — unchanged behavior)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```
### After (cert_command mode — ops-warden or any CA)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```
---
## Acceptance Criteria
- [x] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
warning only); tunnel behaves identically
- [x] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
- [x] Config with `cert_command` set: SSH process launched with both `-i key` and
`-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
- [x] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
no TTL logic runs
- [x] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
- [x] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
- [x] `bridge cert-status` shows valid cert info; exits 1 on expired cert
- [x] All tests pass: `uv run pytest` (233 passed)
- [x] All lints pass: `uv run ruff check .`


@@ -0,0 +1,194 @@
---
id: BRIDGE-WP-0005
type: workplan
title: "Restart includes remote cleanup (blank-slate recovery)"
domain: infotech
repo: ops-bridge
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-21"
updated: "2026-06-21"
state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0"
---
# BRIDGE-WP-0005 — Restart includes remote cleanup
**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan
`sshd` remote forward on Railiance01 port `18000` blocked
`bridge restart state-hub-railiance01` from producing a working tunnel. Operators
had to discover `bridge maintenance cleanup <tunnel> --restart` separately.
**Operator expectation:** `bridge restart` should mean *operational again* — a
blank-slate recovery — not merely "cycle the local manager PID while a broken
remote listener still holds the port."
## Topology and failure modes (refined)
Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles.
Cleanup policy must respect all of them.
### A. Workstation (laptop WSL) — tunnel **origin**
The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on
remote hosts:
| Remote host | Tunnels (reverse) | Role |
|-------------|-------------------|------|
| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot |
| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot |
| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved |
**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill
local bridge processes without graceful remote SSH teardown. Orphan `sshd`
listeners on **all three remotes** are common after wake — especially
`18000`/`18001` on VPS hosts that activity-core and remote agents depend on.
### B. Haskelseed — also intermittently offline
Haskelseed is not a datacenter VPS; it may be powered down or unreachable on
different networks. The same orphan-forward pattern applies to its reverse ports
when the workstation-side tunnel dies uncleanly.
### C. VPS remotes (coulombcore, railiance01)
Normally always-on. Maintenance reboots clear remote kernel state, but:
- a VPS reboot does **not** fix a workstation that is still in `reconnecting`
with a dead local SSH child;
- when the laptop returns, orphan forwards from the **previous** session may
still block new `-R` binds if the VPS did not reboot.
**Conclusion:** conditional remote cleanup before restart benefits **all reverse
tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already
skips healthy forwards — VPS tunnels with live working forwards are untouched.
### D. Local-direction tunnels — no remote cleanup
`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use
forward mode from workstation to remote services. They do not bind remote reverse
ports for State Hub. **`restart` stays local stop/start only** for these.
## Design (decided)
| Command | Behaviour after this workplan |
|---------|-------------------------------|
| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. |
| `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". |
| `bridge up` | Out of scope here (see T4 optional follow-up). |
Implementation sketch: replace the body of `cli.restart()` with a call to
`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels,
or per-tunnel `cleanup_tunnel` when a single tunnel is named.
Emit the same action summary strings cleanup already uses (`healthy`,
`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran.
## Out of scope
- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false
positive during T2).
- Auto-cleanup inside the reconnect backoff loop (stretch — T4).
- Renaming tunnels or changing `tunnels.yaml` host entries.
---
## T1 — Wire restart through cleanup path
```task
id: BRIDGE-WP-0005-T01
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
```
Refactor `bridge/cli.py` `restart()` so reverse tunnels call
`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare
`TunnelManager.stop()` + `start()`.
Requirements:
- Single-tunnel and all-tunnel restart both work.
- Local-direction tunnels keep stop/start only.
- Exit codes: preserve today's semantics where practical; exit non-zero if any
named tunnel ends in `CleanupAction.action == "error"`.
- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`,
etc.), not only "Restarted tunnel".
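
The dispatch rule in these requirements can be sketched without the real CLI. `TunnelCfg`, `restart_tunnels`, and the two callbacks are hypothetical stand-ins for the project's config model, `cleanup_tunnel(..., restart=True)`, and `TunnelManager.stop()` + `start()`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TunnelCfg:
    """Minimal stand-in for the real tunnel config."""
    name: str
    direction: str  # "reverse" or "local"


def restart_tunnels(
    tunnels: Iterable[TunnelCfg],
    cleanup: Callable[[TunnelCfg], str],
    stop_start: Callable[[TunnelCfg], None],
) -> tuple[dict[str, str], int]:
    """Restart each tunnel and return (per-tunnel action, exit code)."""
    actions: dict[str, str] = {}
    for cfg in tunnels:
        if cfg.direction == "reverse":
            # cleanup reports "healthy", "cleaned_and_restarted", or "error"
            actions[cfg.name] = cleanup(cfg)
        else:
            # local-direction tunnels keep plain stop/start
            stop_start(cfg)
            actions[cfg.name] = "restarted"
    exit_code = 1 if "error" in actions.values() else 0
    return actions, exit_code
```

Returning the action map alongside the exit code lets the CLI print one line per tunnel while scripts rely only on the exit status.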
## T2 — Tests and regression coverage
```task
id: BRIDGE-WP-0005-T02
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
```
Update `tests/test_cli.py`:
- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for
reverse tunnels.
- Add cases: healthy forward (no remote kill), stale forward (remote cleanup
invoked), local-direction tunnel (no cleanup call).
- Reuse mocks from `tests/test_cleanup.py` patterns.
`make test` and `make lint` pass.
## T3 — Operator docs and CLI help
```task
id: BRIDGE-WP-0005-T03
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
```
Document the blank-slate restart contract:
- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down.
- `bridge restart --help` — mention conditional remote stale-forward cleanup.
- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS
maintenance — matching this workplan's topology section.
- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md`
incident note (one line each way).
## T4 — Optional: reconnect-loop hygiene (stretch)
```task
id: BRIDGE-WP-0005-T04
status: cancel
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
```
Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup
once after repeated exit-255 bind failures (laptop wake without operator running
`bridge restart`). Defer unless T1-T3 are done; mark `cancel` if heuristic risk
outweighs benefit.
**Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect
loop risks killing a legitimately healthy orphan forward owned by another session
or operator. `bridge restart` now covers the operator-facing blank-slate path;
nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if
wake-from-sleep reconnect failures remain frequent after a month of observation.
## T5 — Live verification on workstation + VPS
```task
id: BRIDGE-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
```
After T1-T2 ship, verify on real config:
1. **railiance01**: `state-hub-mcp-railiance01` was `reconnecting` with stale
forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached
`connected`.
2. **haskelseed**: not exercised (all tunnels already healthy); Alpine netstat
path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
3. **coulombcore**: `bridge restart state-hub-coulombcore` reported `healthy`,
PID unchanged (4116), forward undisturbed.
State Hub progress logged (2026-06-21). Workplan marked `finished`.


@@ -0,0 +1,164 @@
---
id: OPS-WP-0001
type: workplan
title: "ops-bridge diagnostics and flow improvements"
domain: infotech
repo: ops-bridge
status: done
owner: claude
topic_slug: custodian
created: "2026-03-20"
updated: "2026-03-20"
state_hub_workstream_id: "6726cea2-447a-40b2-b0a0-edf495f07942"
---
# OPS-WP-0001 — ops-bridge diagnostics and flow improvements
**Scope:** Add `bridge check` end-to-end diagnostics command, fix `bridge status` to
surface live PID liveness and flag stale state, add a `bridge_check` MCP tool, and
wire Makefile convenience targets in state-hub.
**Context:** During a session, `bridge status` reported "connected" but the reverse
port forwarding was not active — stale `.state` files written by the daemon. The
status command does not verify the SSH process is alive or that the remote port is
actually listening.
---
## Task: Add `read_raw_pid()` to StateManager
```task
id: OPS-WP-0001-T01
status: done
priority: high
state_hub_task_id: "05e98e85-699a-4982-bb3e-8f2538cde2c7"
```
Add `read_raw_pid(name)` to `src/bridge/state.py` — reads PID from file without
liveness check. Existing `read_pid()` (which also checks liveness) stays unchanged.
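
The split between the two readers can be sketched as follows. This is a minimal illustration, not the code in `src/bridge/state.py`: the `<name>.pid` file layout, the constructor argument, and the `pid_is_alive` helper are assumptions, and the signal-0 probe assumes a POSIX host.

```python
import os
from pathlib import Path
from typing import Optional


def pid_is_alive(pid: int) -> bool:
    """Probe a PID with signal 0, which checks existence without signalling."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


class StateManager:
    """Hypothetical stand-in for the real StateManager."""

    def __init__(self, state_dir: Path) -> None:
        self.state_dir = state_dir

    def _pid_path(self, name: str) -> Path:
        # Assumed layout: one "<tunnel>.pid" file per tunnel.
        return self.state_dir / f"{name}.pid"

    def read_raw_pid(self, name: str) -> Optional[int]:
        """Return the PID recorded on disk, without any liveness check."""
        try:
            return int(self._pid_path(name).read_text().strip())
        except (FileNotFoundError, ValueError):
            return None

    def read_pid(self, name: str) -> Optional[int]:
        """Return the recorded PID only if that process still exists."""
        pid = self.read_raw_pid(name)
        if pid is None or not pid_is_alive(pid):
            return None
        return pid
```

A diagnostics caller can compare the two results: a tunnel whose `read_raw_pid()` returns a number while `read_pid()` returns `None` has a PID file pointing at a dead process.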
---
## Task: Create `src/bridge/diagnostics.py`
```task
id: OPS-WP-0001-T02
status: done
priority: high
state_hub_task_id: "b68d7b1e-850b-469a-9de2-8b5d3d1f1c05"
```
New module with `TunnelCheckResult` dataclass (ssh_process, pid, remote_port,
local_api, latency_ms, stale_state, ok property) and `check_tunnel()` /
`check_all_tunnels()` functions. SSH probe via subprocess; optional httpx health check.
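
A possible shape for the result type is sketched below. The field names come from the task description; the field types and the exact rule behind `ok` are assumptions made for illustration (here a `local_api` of `None` means the optional health check was not run).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TunnelCheckResult:
    ssh_process: bool            # is the SSH process alive?
    pid: Optional[int]           # PID recorded for the tunnel, if any
    remote_port: bool            # is the remote port listening?
    local_api: Optional[bool]    # health check result; None if not probed
    latency_ms: Optional[float]  # round-trip time of the probe, if measured
    stale_state: bool            # state file disagrees with reality

    @property
    def ok(self) -> bool:
        """Assumed rule: process alive, port listening, state not stale,
        and the optional API probe did not explicitly fail."""
        return (
            self.ssh_process
            and self.remote_port
            and not self.stale_state
            and self.local_api is not False
        )
```

Under this rule, the failure described in the Context paragraph (state says "connected" but nothing is forwarding) shows up as `stale_state=True` and therefore `ok == False`.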
---
## Task: Fix `bridge status` and add `bridge check` to CLI
```task
id: OPS-WP-0001-T03
status: done
priority: high
state_hub_task_id: "e87c6c5d-170c-4af3-905c-a48fae2edbe5"
```
Fix `status` to show live PID liveness (LIVE column) and flag stale state.
Add `check` command with `--json` flag; exit 1 if any tunnel not ok.
Add `_print_check_table` helper.
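
The exit-code contract for `check` can be sketched independently of the CLI framework. The function name `summarize_check`, the plain-text layout, and the JSON shape are all hypothetical; only the "exit 1 if any tunnel not ok" rule and the existence of a `--json` flag come from the task.

```python
import json
from typing import Dict, Tuple


def summarize_check(results: Dict[str, bool], as_json: bool = False) -> Tuple[str, int]:
    """Return (text to print, process exit code) for per-tunnel ok flags."""
    # Exit 1 as soon as any tunnel is not ok; an empty result set exits 0.
    exit_code = 0 if all(results.values()) else 1
    if as_json:
        payload = {"ok": exit_code == 0, "tunnels": results}
        return json.dumps(payload, sort_keys=True), exit_code
    lines = [
        f"{name}: {'ok' if ok else 'FAIL'}"
        for name, ok in sorted(results.items())
    ]
    return "\n".join(lines), exit_code
```

Keeping the decision in a pure function makes cases such as `test_check_all_pass` and `test_check_any_fail` easy to cover without spawning SSH.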
---
## Task: Add `bridge_check` MCP tool and `bridge://check` resource
```task
id: OPS-WP-0001-T04
status: done
priority: high
state_hub_task_id: "7e97c112-20e2-4e2e-b853-53b10998392b"
```
Add `bridge_check(tunnel?)` tool and `bridge://check` resource to
`src/bridge/mcp_server/server.py`.
---
## Task: Register `bridge_check` capability
```task
id: OPS-WP-0001-T05
status: done
priority: high
state_hub_task_id: "c69fc748-a706-46db-a4d5-30d60222452b"
```
Add `bridge_check` entry to `src/bridge/capabilities.py` with
`required_access_modes=frozenset({"cli", "mcp"})`.
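
As a rough sketch of what such an entry might look like: the `Capability` dataclass, the `CAPABILITIES` registry, the description string, and the `missing_access_modes` helper are hypothetical and may not match the real layout of `src/bridge/capabilities.py`. Only the capability name and the `required_access_modes` value are taken from the task.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable


@dataclass(frozen=True)
class Capability:
    name: str
    description: str
    required_access_modes: FrozenSet[str]


# Hypothetical registry keyed by capability name.
CAPABILITIES: Dict[str, Capability] = {
    "bridge_check": Capability(
        name="bridge_check",
        description="End-to-end tunnel diagnostics",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
}


def missing_access_modes(capability: Capability, covered: Iterable[str]) -> FrozenSet[str]:
    """Return the access modes a capability requires that are not covered yet."""
    return capability.required_access_modes - frozenset(covered)
```

With a registry like this, a test suite can assert that every required mode has at least one test marked for it, which is what the `capability(...)` and `access_mode(...)` markers in T07 and T08 appear to support.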
---
## Task: Write `tests/test_diagnostics.py`
```task
id: OPS-WP-0001-T06
status: done
priority: high
state_hub_task_id: "070ed088-74a6-48d3-81cf-739c2a2fd21b"
```
Unit tests: test_no_pid, test_pid_dead, test_pid_alive_port_listening,
test_pid_alive_port_closed, test_ssh_timeout.
---
## Task: Add `TestCheckCommand` to `tests/test_cli.py`
```task
id: OPS-WP-0001-T07
status: done
priority: high
state_hub_task_id: "aae5ddc5-f823-4647-a536-8604ddb97946"
```
Tests: test_check_help, test_check_all_pass (marked capability+mode),
test_check_any_fail, test_check_json_flag, test_check_specific_tunnel.
---
## Task: Add `TestMcpBridgeCheck` to `tests/test_mcp.py`
```task
id: OPS-WP-0001-T08
status: done
priority: high
state_hub_task_id: "ed492a3d-7a5f-465e-8cc3-d2f992f5462c"
```
Test: test_bridge_check_tool marked capability("bridge_check") + access_mode("mcp").
---
## Task: Add tunnels targets to state-hub Makefile
```task
id: OPS-WP-0001-T09
status: done
priority: medium
state_hub_task_id: "a3c77062-cff5-40e3-936c-b210b05f8839"
```
Add `tunnels-up`, `tunnels-status`, `tunnels-check` targets delegating to `bridge`.
Add to `.PHONY` line.
---
## Task: Run test suite and verify
```task
id: OPS-WP-0001-T10
status: done
priority: high
state_hub_task_id: "e42de76c-fab7-4924-8929-38fa9eaca478"
```
`cd /home/worsch/ops-bridge && uv run pytest tests/ -v` — all tests green.

View File


@@ -0,0 +1,221 @@
---
id: OPS-WP-0002
type: workplan
title: "Agent Usability — MCP Registration, Skill, and Worker Orientation"
domain: infotech
repo: ops-bridge
status: done
owner: custodian
topic_slug: custodian
created: "2026-03-21"
updated: "2026-03-26"
depends_on: OPS-WP-0001
state_hub_workstream_id: "c195cc40-8be7-462e-be26-a7d6bda34cd5"
---
# OPS-WP-0002 — Agent Usability: MCP Registration, Skill, and Worker Orientation
## Problem
The ops-bridge MCP server (`src/bridge/mcp_server/server.py`) is fully
implemented with tools for `bridge_up/down/restart/status/check/logs` and
catalog operations. But no agent can use it because:
1. **Not registered** — the server isn't in `~/.claude.json` and has no
persistent transport mode. It only runs on stdio today.
2. **No slash command** — agents working ad-hoc (not via MCP) have no
quick way to check or restore tunnels.
3. **No worker orientation** — agents on remote machines (CoulombCore,
Railiance) don't know that bridge is available or how to use it when
their state-hub connection drops.
## Goal
Any agent — on the workstation or a remote machine — can:
- Check tunnel health in one call
- Bring up a dropped tunnel without manual intervention
- Recover the state-hub connection if it goes down mid-session
## Design
### MCP server (workstation, persistent)
Run as an SSE service on port 8002 (same pattern as state-hub on 8001).
Registered at user scope in `~/.claude.json` so it's available to all
Claude Code sessions.
FastMCP already supports the SSE transport, so the only change needed is
for the entry point to accept an `--http` flag or read a
`BRIDGE_MCP_PORT` env var and pass the matching transport to `mcp.run()`.
### Slash command skill (all machines)
A `/bridge` skill at `~/.claude/commands/bridge.md` (global scope) that:
- Reads `bridge status` output
- Surfaces any tunnel that is down or stale
- Offers to bring it up
- Useful on machines that don't have the MCP server registered
### Worker agent orientation (remote machines)
Update `CLAUDE.md` (global) and `ops-bridge` session protocol to tell
worker agents:
- Check `bridge status` at session start when on a machine with
ops-bridge installed
- If state-hub tunnel is down: run `bridge up state-hub-<machine>` to
restore it before making any state-hub API calls
- If no bridge command: fall back to direct API URL if reachable
---
## Tasks
### T01 — SSE transport mode for MCP server
```task
id: OPS-WP-0002-T01
status: done
priority: high
state_hub_task_id: "27fc6fa1-6d0e-438a-b4a3-c6091931da88"
```
Add `--http` flag and `BRIDGE_MCP_PORT` env var to `server.py` entry
point. When `--http` is set, run `mcp.run(transport="sse", port=PORT)`
instead of stdio.
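
The flag and env var handling can be sketched without touching FastMCP itself. The helper name `resolve_transport` and the constant `DEFAULT_HTTP_PORT` are hypothetical; the `--http` flag, the `BRIDGE_MCP_PORT` variable, and the 8002 default come from this workplan.

```python
import argparse
from typing import Mapping, Optional, Sequence, Tuple

DEFAULT_HTTP_PORT = 8002


def resolve_transport(
    argv: Sequence[str], env: Mapping[str, str]
) -> Tuple[str, Optional[int]]:
    """Map CLI args and environment to (transport, port)."""
    parser = argparse.ArgumentParser(prog="server.py")
    parser.add_argument(
        "--http", action="store_true", help="serve over SSE instead of stdio"
    )
    args = parser.parse_args(list(argv))
    if not args.http:
        # Default behaviour stays unchanged: stdio, no port.
        return ("stdio", None)
    # An unset or empty BRIDGE_MCP_PORT falls back to the default port.
    port = int(env.get("BRIDGE_MCP_PORT") or DEFAULT_HTTP_PORT)
    return ("sse", port)
```

The entry point would then call something like `resolve_transport(sys.argv[1:], os.environ)` and hand the result to `mcp.run()`. Whether `run()` takes the port as a keyword argument or expects it in the server settings depends on the FastMCP package and version in use, so that call is left out of the sketch.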
Add `make mcp-http` target to `Makefile`:
```makefile
mcp-http: ## Start MCP server in SSE mode (default port 8002)
	BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http
```
Add `make mcp-stop` target that kills any running MCP server on port
8002.
Gate: `bridge_status()` tool callable via SSE on localhost:8002 after
`make mcp-http`.
---
### T02 — Register MCP server in ~/.claude.json
```task
id: OPS-WP-0002-T02
status: done
priority: high
state_hub_task_id: "2216457d-035e-4804-b685-18975f3c6d1f"
```
Register the ops-bridge MCP server at user scope:
```bash
claude mcp add-json -s user ops-bridge \
'{"type":"sse","url":"http://127.0.0.1:8002/sse"}'
```
Document in `ops-bridge` CLAUDE.md:
```
To start the MCP server:
cd ~/ops-bridge && make mcp-http
To verify registration:
python3 -c "import json,os; d=json.load(open(os.path.expanduser('~/.claude.json'))); print(list(d.get('mcpServers',{}).keys()))"
```
Update global `~/.claude/CLAUDE.md` to list `ops-bridge` MCP server
alongside `state-hub`.
Gate: `ops-bridge` appears in Claude Code MCP tool list after `make
mcp-http`.
---
### T03 — `/bridge` slash command skill
```task
id: OPS-WP-0002-T03
status: done
priority: medium
state_hub_task_id: "4b2e39eb-4585-4e60-ab16-9e7909eced74"
```
Create `~/.claude/commands/bridge.md` — a global Claude Code skill for
tunnel management.
**Behaviour:**
1. Run `bridge status` and parse output
2. Report each tunnel: name, state, LIVE column
3. For any tunnel that is `stopped`, `reconnecting`, or `[STALE]`:
- Offer to run `bridge up <tunnel-name>`
- After `bridge up`, re-check with `bridge check <tunnel-name>`
4. If all tunnels are `connected` and LIVE: report green and exit
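
The decision logic in the steps above can be sketched as a small planner. It assumes the `bridge status` output has already been parsed into rows; the `TunnelRow` type and both function names are hypothetical, and the real output format is not shown here.

```python
from dataclasses import dataclass
from typing import List, Optional

# States that step 3 treats as recoverable.
RECOVERABLE_STATES = {"stopped", "reconnecting"}


@dataclass
class TunnelRow:
    name: str
    state: str           # e.g. "connected", "stopped", "reconnecting"
    live: bool           # value of the LIVE column
    stale: bool = False  # True when the row carries the [STALE] flag


def needs_recovery(row: TunnelRow) -> bool:
    """A tunnel needs recovery if it is stopped, reconnecting, or stale."""
    return row.stale or row.state in RECOVERABLE_STATES


def plan_recovery(rows: List[TunnelRow], only: Optional[str] = None) -> List[List[str]]:
    """Return the commands to offer: `bridge up` then `bridge check` per tunnel.

    `only` mirrors the optional tunnel-name argument of the skill.
    """
    commands: List[List[str]] = []
    for row in rows:
        if only is not None and row.name != only:
            continue
        if needs_recovery(row):
            commands.append(["bridge", "up", row.name])
            commands.append(["bridge", "check", row.name])
    return commands
```

An empty plan corresponds to step 4: nothing to recover, so the skill reports green and exits.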
**Skill definition:**
```yaml
---
description: >
  Check ops-bridge tunnel health and restore any dropped tunnels.
  Reports status of all configured tunnels and offers to bring up
  any that are stopped or stale.
argument-hint: "[tunnel-name]"
allowed-tools:
  - Bash(bridge status)
  - Bash(bridge up*)
  - Bash(bridge down*)
  - Bash(bridge check*)
  - Bash(bridge logs*)
---
```
If an optional tunnel name is passed as `$ARGUMENTS`, scope all
operations to that tunnel only.
Gate: `/bridge` skill runs cleanly when all tunnels are up; correctly
identifies and recovers a manually-stopped tunnel.
---
### T04 — Worker agent orientation in CLAUDE.md
```task
id: OPS-WP-0002-T04
status: done
priority: medium
state_hub_task_id: "cc64bb07-ea5d-498a-8c14-bb653581efe7"
```
Update global `~/.claude/CLAUDE.md` — add a **Worker Agent — Bridge
Protocol** section:
```markdown
## Worker Agent — Bridge Protocol
When working on a remote machine (CoulombCore, Railiance nodes):
1. At session start, check if `bridge` is installed:
`which bridge && bridge status`
2. If state-hub tunnel is down: `bridge up state-hub-<machine-slug>`
Wait for state `connected` before making state-hub API calls.
3. If `bridge` is not installed, check if the state-hub API is directly
reachable: `curl -s http://127.0.0.1:8000/state/health`
4. Only proceed without state-hub if absolutely necessary — log a
progress note about the outage when connectivity is restored.
```
Also add a one-liner reminder to the ops-bridge session protocol in
`.claude/rules/session-protocol.md`:
> At session start: `bridge status` — bring up any stopped tunnels
> before accessing remote services.
Gate: `~/.claude/CLAUDE.md` contains the Worker Agent section; ops-bridge
session protocol references bridge status check.
---
## Done Criteria
- [x] `make mcp-http` starts the MCP server on port 8002 (SSE)
- [x] `bridge_status` and `bridge_check` callable as MCP tools from Claude Code
- [x] `ops-bridge` registered in `~/.claude.json` at user scope
- [x] `/bridge` skill surfaces tunnel states and recovers a stopped tunnel
- [x] Global CLAUDE.md has worker agent bridge protocol
- [x] All existing tests pass after T01 changes (`make test`)