generated from coulomb/repo-seed
Compare commits
51 Commits
dc1422fcaa...main
| Author | SHA1 | Date |
|---|---|---|
| | 00671f5133 | |
| | 09f2cd4b7a | |
| | c3b4fb9d55 | |
| | fab7409c66 | |
| | 1dd664c792 | |
| | 10c6fdaec9 | |
| | 8c11acc00c | |
| | 499b8781cc | |
| | 4e9882909f | |
| | a6857fb8f7 | |
| | 675772ab3b | |
| | 6eb0b1c52f | |
| | d949f3e93e | |
| | de984736ca | |
| | 28ecef121e | |
| | 860c08f1db | |
| | bd169a07e2 | |
| | 22601ef3e6 | |
| | 569de1497c | |
| | fafd04ed2e | |
| | c1d87b47df | |
| | 204bf48bc8 | |
| | 595c495f7c | |
| | 90eda27a14 | |
| | 1361727e15 | |
| | 18e3c118dd | |
| | 621de64ee0 | |
| | f3a7236c5d | |
| | 4f3c8646b3 | |
| | 431beef31b | |
| | 1c7c6eedf8 | |
| | 75a559780e | |
| | d73b7be45d | |
| | a55c685f89 | |
| | bebd542a2e | |
| | 30bbaf303d | |
| | 101244bd1d | |
| | 6673cb0e48 | |
| | 60c742a456 | |
| | 3be41c315e | |
| | d4b5854483 | |
| | 365c0d611a | |
| | 44b5a9426a | |
| | af2d419bf6 | |
| | d248f14a9f | |
| | baee28eda2 | |
| | 91d031ae20 | |
| | a7eaf59ced | |
| | 2c7c440ea7 | |
| | 1364cbcece | |
| | 482edcd7eb | |
20
.claude/rules/agents.md
Normal file
@@ -0,0 +1,20 @@
## Kaizen Agents

Specialized agent personas available on demand via the state-hub MCP.

**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them

Common agents:

| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |

All 17 agents: call `list_kaizen_agents()` for the full list.
8
.claude/rules/architecture.md
Normal file
@@ -0,0 +1,8 @@
## Architecture

<!-- TODO: Describe the key design decisions and component structure.
     Key modules, data flows, external integrations, state machines, etc. -->

## Quick Reference

`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
50
.claude/rules/credential-routing.md
Normal file
@@ -0,0 +1,50 @@
# Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
38
.claude/rules/first-session.md
Normal file
@@ -0,0 +1,38 @@
## First Session Protocol

Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.

**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs

**Step 2 — Survey in-progress work**
Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.

**Step 3 — Propose workstreams to Bernd**
Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
roadmap phase. **Wait for approval before creating.**

**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/BRIDGE-WP-NNNN-<slug>.md   ← write this first
```
Then register in the hub:
```
create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
```

**Step 5 — Record the setup**
```
add_progress_event(
  summary="First session: structured infotech into N workstreams, M tasks",
  event_type="milestone",
  topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
  detail={"workstreams": [...], "tasks_created": M}
)
```

<!-- Delete or archive this file once past first session -->
8
.claude/rules/repo-boundary.md
Normal file
@@ -0,0 +1,8 @@
## Repo boundary

This repo owns **ops-bridge** only. It does not own:

<!-- TODO: List what belongs in adjacent repos, e.g.:
     - SSH key management → railiance-infra/
     - State hub code → state-hub/
-->
5
.claude/rules/repo-identity.md
Normal file
@@ -0,0 +1,5 @@
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.

**Domain:** infotech
**Repo slug:** ops-bridge
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
85
.claude/rules/session-protocol.md
Normal file
@@ -0,0 +1,85 @@
## Session Protocol

Dev Hub (State Hub API): http://127.0.0.1:8000
MCP server name in `~/.claude.json`: `dev-hub`

**Step 1 — Orient**

Read the offline-safe brief first — it works without a live hub connection:
```bash
cat .custodian-brief.md
```
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
```
get_domain_summary("infotech")
```
If MCP tools are unavailable in the current agent session, use the REST API:
```bash
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
```
If the hub is offline: `cd ~/state-hub && make api`

**Step 2 — Check inbox**
With MCP tools:
```
get_messages(to_agent="ops-bridge", unread_only=True)
```
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
requests before proceeding.

Without MCP tools:
```bash
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
  | python3 -m json.tool
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```
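
For agents that script the hub rather than shelling out to `curl`, the two REST calls above can be sketched in Python. This is a minimal standard-library sketch, not part of State Hub: the helper names `inbox_url`, `mark_read_request`, and `fetch_unread` are hypothetical, and the endpoint paths are copied from the `curl` commands above.

```python
import json
import urllib.parse
import urllib.request

HUB = "http://127.0.0.1:8000"  # State Hub base URL stated in this protocol


def inbox_url(agent: str, base: str = HUB) -> str:
    """Build the unread-inbox URL used by the curl command above."""
    query = urllib.parse.urlencode({"to_agent": agent, "unread_only": "true"})
    return f"{base}/messages/?{query}"


def mark_read_request(message_id: str, base: str = HUB) -> urllib.request.Request:
    """Build (but do not send) the PATCH request that marks a message read."""
    return urllib.request.Request(
        f"{base}/messages/{message_id}/read",
        data=json.dumps({}).encode(),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


def fetch_unread(agent: str, base: str = HUB) -> object:
    """Fetch and decode the unread messages; requires a running hub."""
    with urllib.request.urlopen(inbox_url(agent, base), timeout=5) as resp:
        return json.load(resp)
```

Building the request separately from sending it keeps the URL logic checkable without a live hub; pass the result of `mark_read_request(...)` to `urllib.request.urlopen` to actually send it.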

**Step 3 — Scan workplans**
```bash
ls workplans/
```
For each file with `status: ready`, `active`, or `blocked`, note pending
`wait`/`todo`/`progress` tasks.

**Step 4 — Present brief**

1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
2. **Pending tasks** from `workplans/` + any `[repo:ops-bridge]` hub tasks
3. **Goal guidance** — if `goal_guidance` in summary:
   - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
   - `alignment_warnings`: flag if active work is not aligned with current goal
4. **Suggested next action** — highest-priority open item
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo

If no workstreams: follow First Session Protocol (`first-session.md`).

**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`

> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).

**Session close:**
With MCP tools:
```
add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
```
Without MCP tools:
```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
```
If workplan files were modified, ensure the local copy is up to date first:
```bash
git -C <repo_path> pull --ff-only
cd ~/state-hub && make fix-consistency REPO=ops-bridge
```
For repos where implementation runs on a remote machine (e.g. CoulombCore),
use the combined target which pulls before fixing:
```bash
cd ~/state-hub && make fix-consistency-remote REPO=ops-bridge
```
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
until you pull — intentional to prevent clobbering remote progress.
19
.claude/rules/stack-and-commands.md
Normal file
@@ -0,0 +1,19 @@
## Stack

<!-- TODO: Fill in language, frameworks, and key dependencies -->
- **Language:**
- **Key deps:**

## Dev Commands

```bash
# TODO: Fill in the standard commands for this repo

# Install dependencies

# Run tests

# Lint / type check

# Build / package (if applicable)
```
40
.claude/rules/workplan-convention.md
Normal file
@@ -0,0 +1,40 @@
## Workplan Convention (ADR-001)

File location: `workplans/BRIDGE-WP-NNNN-<slug>.md`
ID prefix: `BRIDGE-WP-`

Work items originate as files in this repo **before** being registered in the hub.

Canonical workplan/workstream frontmatter statuses are:
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
Use `proposed` for a newly drafted plan, `ready` after review against current
repo state, and `finished` when implementation is complete. `stalled` and
`needs_review` are derived health labels, not stored statuses.

Closed workplans may be moved to `workplans/archived/` with a completion-date
prefix: `YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The frontmatter id remains
unchanged; the prefix is only for quick visual reference.
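
The archive naming rule above can be sketched as a small path helper. This is only an illustration under stated assumptions: `archived_path` is a hypothetical name, and the workplan file name used in the usage note is invented for the example.

```python
from datetime import date
from pathlib import PurePosixPath


def archived_path(workplan: str, completed: date) -> str:
    """Return the archive path for a closed workplan file.

    Only the file name gains a YYMMDD completion-date prefix; the
    frontmatter id inside the file is left untouched.
    """
    name = PurePosixPath(workplan).name
    return f"workplans/archived/{completed:%y%m%d}-{name}"
```

For example, `archived_path("workplans/BRIDGE-WP-0007-health-checks.md", date(2026, 6, 22))` yields `workplans/archived/260622-BRIDGE-WP-0007-health-checks.md`.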

Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
directly. Promote anything requiring analysis, design, approval, dependencies, or
multiple planned phases into a normal workplan.

Ecosystem todos from other agents arrive as `[repo:ops-bridge]` hub tasks —
visible at session start. Pick one up by creating the workplan file, then registering
the workstream.

Task blocks use this shape:

```task
id: BRIDGE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>"  # written by fix-consistency — do not edit
```

Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
blocked work and `cancel` for stopped work.
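
A task block of this shape is simple enough to validate mechanically. The sketch below is not the parser that `fix-consistency` uses (that code is not shown here); `parse_task_block` is a hypothetical helper that reads the `key: value` lines inside a task fence and rejects statuses or priorities outside the sets listed above.

```python
TASK_STATUSES = {"wait", "todo", "progress", "done", "cancel"}
PRIORITIES = {"high", "medium", "low"}


def parse_task_block(body: str) -> dict[str, str]:
    """Parse the `key: value` lines found inside a task fence."""
    fields: dict[str, str] = {}
    for raw in body.splitlines():
        line = raw.split(" #", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a key: value line: {raw!r}")
        fields[key.strip()] = value.strip().strip('"')
    if fields.get("status") not in TASK_STATUSES:
        raise ValueError(f"unknown status: {fields.get('status')!r}")
    if fields.get("priority") not in PRIORITIES:
        raise ValueError(f"unknown priority: {fields.get('priority')!r}")
    return fields
```

A derived health label such as `stalled` would be rejected as a status, which matches the rule that such labels are never stored in files.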

<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
5
.claude/settings.json
Normal file
@@ -0,0 +1,5 @@
{
  "enabledPlugins": {
    "commit-commands@claude-plugins-official": true
  }
}
7
.codex/config.toml
Normal file
@@ -0,0 +1,7 @@
[mcp_servers.ops-bridge]
command = "uv"
args = [
  "run",
  "python",
  "src/bridge/mcp_server/server.py",
]
18
.custodian-brief.md
Normal file
@@ -0,0 +1,18 @@
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
# Custodian Brief — ops-bridge

**Domain:** custodian
**Last synced:** 2026-06-21 18:12 UTC
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*

## Active Workstreams

*(none — repo may need first-session setup)*

---
## MCP Orientation (when available)

If the state-hub MCP server is reachable, call:
`get_domain_summary("custodian")`
This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source.
10
.mcp.json
Normal file
@@ -0,0 +1,10 @@
{
  "mcpServers": {
    "ops-bridge": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "python", "src/bridge/mcp_server/server.py"],
      "cwd": "/home/worsch/ops-bridge"
    }
  }
}
26
.repo-classification.yaml
Normal file
@@ -0,0 +1,26 @@
# Repo classification (Repo Classification Standard v1.0).

repo_classification:
  standard: Repo Classification Standard
  version: '1.0'
  classified_at: '2026-06-22'
  classified_by: human
  category: tooling
  domain: infotech
  secondary_domains: []
  capability_tags:
  - operations
  - access-control
  - platform
  - observability
  - orchestration
  business_stake:
  - operations
  - technology
  - automation
  business_mechanics:
  - control
  - operation
  - adaptation
  notes: SSH reverse-tunnel lifecycle manager keeping remote environments connected to the
    State Hub. Operational tooling -> product.
219
AGENTS.md
Normal file
@@ -0,0 +1,219 @@
# ops-bridge — Agent Instructions

## Repo Identity

**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.

**Domain:** infotech
**Repo slug:** ops-bridge
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `BRIDGE-WP-`

---

## State Hub Integration

The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
there is no MCP server for Codex agents.

| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |

### Orient at session start

```bash
# Offline brief — works without hub connection
cat .custodian-brief.md

# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
  | python3 -m json.tool

# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
  | python3 -m json.tool
```

Mark a message read:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```

### Log progress (required at session close)

```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{
    "summary": "what was done",
    "event_type": "note",
    "author": "codex",
    "workstream_id": "<uuid>",
    "task_id": "<uuid>"
  }'
```

Omit `workstream_id` / `task_id` when not applicable.
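
The "omit when not applicable" rule is easy to get wrong when a payload is assembled by hand, so here is a minimal Python sketch of it. The function name `progress_payload` is hypothetical; the field names and the defaults `"note"` and `"codex"` are taken from the `curl` example above.

```python
import json
from typing import Optional


def progress_payload(
    summary: str,
    author: str = "codex",
    event_type: str = "note",
    workstream_id: Optional[str] = None,
    task_id: Optional[str] = None,
) -> str:
    """Serialize a progress event, leaving out ids that do not apply."""
    body = {"summary": summary, "event_type": event_type, "author": author}
    if workstream_id:
        body["workstream_id"] = workstream_id
    if task_id:
        body["task_id"] = task_id
    return json.dumps(body)
```

The returned string can be passed to `curl -d` or sent as the body of a `POST /progress/` request.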

### Update task status

```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"status": "progress"}'
# values: wait | todo | progress | done | cancel
```

### Flag a task for human review

```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"needs_human": true, "intervention_note": "reason"}'
```

---

## Session Protocol

**Start:**
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=ops-bridge&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check human-needed tasks: `GET /tasks/?needs_human=true`

**During work:**
- Update task statuses in workplan files as tasks progress
- Record significant decisions via `POST /decisions/`

**Close:**
1. Update workplan file task statuses to reflect progress
2. Log: `POST /progress/` with a summary of what changed
3. Note for the custodian operator: after workplan file changes, run from
   `~/state-hub`:
   ```bash
   make fix-consistency REPO=ops-bridge
   ```
   This syncs task status from files into the hub DB.

---

## Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`

<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
     The state-hub template sync preserves content after this line. -->

---

## Workplan Convention (ADR-001)

Work items originate as files in this repo — not in the hub. The hub is a
read/cache/index layer that rebuilds from files.

**File location:** `workplans/BRIDGE-WP-NNNN-<slug>.md`

**Archived location:** finished workplans may move to
`workplans/archived/YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
the completion/archive date; the frontmatter `id` does not change.

**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
this only for low-risk work completed directly; create a normal workplan for
anything needing analysis, design, approval, dependencies, or multiple phases.

**Frontmatter:**

```yaml
---
id: BRIDGE-WP-NNNN
type: workplan
title: "..."
domain: infotech
repo: ops-bridge
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>"  # written by fix-consistency — do not edit
---
```

Use `proposed` for a new draft, `ready` after review against current repo
state, and `finished` after implementation. `stalled` and `needs_review` are
derived health labels, not frontmatter statuses.

**Task block format** (one per `##` section):

```
## Task Title

` ` `task
id: BRIDGE-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>"  # written by fix-consistency — do not edit
` ` `

Task description text.
```

Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.

To create a new workplan:
1. Write the file following the format above
2. Notify the custodian operator to run `make fix-consistency REPO=ops-bridge`
   (or send a message to the hub agent via `POST /messages/`)
12
CLAUDE.md
Normal file
@@ -0,0 +1,12 @@
# ops-bridge — Claude Code Instructions

@SCOPE.md
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
@.claude/rules/credential-routing.md
@.claude/rules/agents.md
92
INTENT.md
Normal file
@@ -0,0 +1,92 @@
# INTENT

## Purpose

This repository exists to provide a **reliable, inspectable, and controllable connectivity layer**
between distributed dev, build, test, and execution environments for dev and ops personnel, both human and agentic.

Its role is to ensure that remote machines can **consistently and safely “phone home”** without requiring complex network infrastructure or manual intervention.

---

## Primary Utility

The repository provides a **managed SSH reverse tunneling system** that:

* Maintains continuous connectivity between remote systems and a central hub
* Makes connectivity **observable, auditable, and controllable**
* Exposes this capability as both a **CLI tool and an MCP-accessible service**

It transforms raw SSH port-forwarding into a **first-class operational primitive**.

---

## Intended Users

* Human operators (`adm`) managing infrastructure and connectivity
* LLM-based agents (`agt`) requiring stable access to local services
* Deterministic automations (`atm`) coordinating distributed workloads

---

## Strategic Role in the System

This repository acts as the **connectivity backbone** of the custodian ecosystem:

* It enables remote agents and services to participate in a **locally anchored control plane**
* It decouples **execution location** from **control location**
* It supports a **hub-and-spoke topology** where the Custodian State Hub remains central

---

## Strategic Boundaries

This repository is **not** intended to:

* Replace SSH as a general-purpose access mechanism
* Act as a credential authority or security policy engine
* Provide full network virtualization (e.g., VPN, mesh networking)
* Host or orchestrate application workloads

Its responsibility ends at **secure, observable, and managed connectivity via tunnels**.

---

## Design Principles

* **Continuity over convenience**
  Connectivity must persist across failures without manual recovery

* **Observability as a first-class concern**
  All lifecycle events must be traceable and attributable

* **Actor-aware operations**
  Every action is tied to a clearly defined actor type (`adm`, `agt`, `atm`)

* **Pluggable security integration**
  Works with both static keys and external certificate authorities without owning them

* **Toolability**
  All capabilities should be accessible programmatically (MCP) and operationally (CLI)

---

## Maturity Target

A mature version of this repository should:

* Provide **fully autonomous tunnel lifecycle management** across heterogeneous environments
* Integrate seamlessly with **centralized access control and certificate systems**
* Serve as a **standardized connectivity primitive** across all Custodian-managed systems
* Offer **complete operational transparency** for all connectivity-related actions
* Be robust enough to act as the **default connectivity layer** for distributed agent systems

---

## Stability Note

Changes to this file represent a **deliberate shift in repository purpose or role** within the system architecture.

Such changes should be rare and made with explicit intent.
31
Makefile
Normal file
@@ -0,0 +1,31 @@
.DEFAULT_GOAL := help

.PHONY: help setup test lint install mcp-http mcp-stop cron-install-cron cron-uninstall-cron

help: ## List available make targets
	@awk 'BEGIN {FS = ":.*## "}; /^[a-zA-Z0-9_.-]+:.*## / {printf " %-16s %s\n", $$1, $$2}' $(MAKEFILE_LIST)

setup: ## Sync dependencies and install the bridge CLI wrapper
	uv sync --all-groups
	uv tool install -e . --force

test: ## Run the test suite
	uv run pytest

lint: ## Run ruff lint checks
	uv run ruff check .

install: ## Install the bridge CLI wrapper
	uv tool install -e . --force

mcp-http: ## Start MCP server in SSE mode (default port 8002)
	BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http

mcp-stop: ## Stop MCP server running on port 8002
	@lsof -ti:$${BRIDGE_MCP_PORT:-8002} | xargs -r kill -TERM && echo "MCP server stopped" || echo "No MCP server running on port $${BRIDGE_MCP_PORT:-8002}"

cron-install-cron: ## Install 03:00 nightly stale-forward cleanup cron
	bridge maintenance install-cron

cron-uninstall-cron: ## Remove nightly stale-forward cleanup cron
	bridge maintenance uninstall-cron
@@ -1,3 +0,0 @@
# repo-seed

A git repository template to bootstrap coulomb projects from.
318
README.txt
Normal file
@@ -0,0 +1,318 @@
ops-bridge
==========

SSH reverse tunnel lifecycle manager. Keeps remote execution environments
(COULOMBCORE, Railiance nodes) connected to the local Custodian State Hub
so Claude Code sessions on those machines have full MCP connectivity.


WHAT IT DOES
------------

`bridge` is a CLI tool that manages named SSH reverse tunnels. Each tunnel:

- Is identified by a human-readable name (e.g. state-hub-coulombcore)
- Runs as an SSH reverse port-forward: ssh -R remote:127.0.0.1:local host
- Auto-reconnects on drop using exponential backoff
- Optionally runs an HTTP health check to confirm the forwarded service
  is actually reachable (not just the SSH process alive)
- Records structured audit events (bridge_started, bridge_connected,
  health_check_failed, etc.) to a JSON log per tunnel
|
||||
|
||||
Bridge states: stopped -> starting -> connected <-> degraded -> reconnecting
|
||||
|
||||
|
||||
INSTALL
|
||||
-------
|
||||
|
||||
Requires Python 3.11+ and uv (https://docs.astral.sh/uv/).
|
||||
|
||||
uv tool install /path/to/ops-bridge
|
||||
|
||||
This registers the `bridge` command globally. For development:
|
||||
|
||||
cd /path/to/ops-bridge
|
||||
uv tool install -e .
|
||||
|
||||
Verify:
|
||||
|
||||
bridge --help
|
||||
|
||||
|
||||
CONFIGURATION
|
||||
-------------
|
||||
|
||||
Config file: ~/.config/bridge/tunnels.yaml
|
||||
Override with: BRIDGE_CONFIG=/path/to/config.yaml
|
||||
|
||||
Minimal example:
|
||||
|
||||
tunnels:
|
||||
state-hub-coulombcore:
|
||||
host: coulombcore.local
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: agent.claude-coulombcore
|
||||
|
||||
actors:
|
||||
agent.claude-coulombcore:
|
||||
class: automation
|
||||
description: Claude Code agent on CoulombCore
|
||||
|
||||
With health check and reconnect policy:
|
||||
|
||||
tunnels:
|
||||
state-hub-coulombcore:
|
||||
host: coulombcore.local
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: agent.claude-coulombcore
|
||||
|
||||
health_check:
|
||||
url: http://127.0.0.1:18000/health # checked from the REMOTE host
|
||||
interval_seconds: 30
|
||||
timeout_seconds: 5
|
||||
|
||||
reconnect:
|
||||
max_attempts: 0 # 0 = retry forever
|
||||
backoff_initial: 5
|
||||
backoff_max: 60
|
||||
|
||||
actors:
|
||||
agent.claude-coulombcore:
|
||||
class: automation # "human" or "automation"
|
||||
description: Claude Code agent on CoulombCore
|
||||
operator.bernd:
|
||||
class: human
|
||||
description: Bernd Worsch
|
||||
|
||||
Required tunnel fields: host, remote_port, local_port, ssh_user, ssh_key, actor
|
||||
Required actor fields: class (must be "human" or "automation")
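
To make the reconnect settings concrete, here is a minimal sketch of a backoff schedule driven by backoff_initial and backoff_max. The doubling factor and the name `backoff_delays` are assumptions for illustration; this README does not state the exact growth rule that bridge applies.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial: float, maximum: float, factor: float = 2.0) -> Iterator[float]:
    """Yield reconnect delays in seconds, growing by `factor` and capped at `maximum`.

    The factor of 2.0 is an assumption, not taken from the bridge source.
    """
    delay = float(initial)
    cap = float(maximum)
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


# backoff_initial: 5 and backoff_max: 60, as in the example config above.
first_six = list(islice(backoff_delays(5, 60), 6))
print(first_six)
```

Under these assumptions the delay grows until it reaches backoff_max and then stays there, which is what lets max_attempts: 0 retry forever without the wait growing unbounded.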


CLI COMMANDS
------------

Lifecycle:

    bridge up [TUNNEL]        Start one tunnel, or all if no name given
    bridge down [TUNNEL]      Stop one tunnel, or all
    bridge restart [TUNNEL]   Restart one tunnel, or all

Observation:

    bridge status             Show all tunnels: state, uptime, last event
    bridge status --json      Machine-readable JSON output
    bridge logs TUNNEL        Tail the audit log for a tunnel
    bridge logs TUNNEL --lines 100 --follow

Examples:

    bridge up state-hub-coulombcore
    bridge status
    bridge logs state-hub-coulombcore --follow
    bridge down state-hub-coulombcore


OPSCATALOG EXTENSION (optional)
--------------------------------

If you maintain a Git-backed YAML catalog of your infrastructure, point
bridge at it in your config:

    catalog_path: ~/ops-infra/opscatalog/

Catalog layout:

    opscatalog/
      domains/
        <domain-id>/
          domain.yaml
          targets/
            <target-id>.yaml
          bridges/
            <bridge-id>.yaml

Then you can use:

    bridge targets [--domain DOMAIN]   List all targets (optionally filtered)
    bridge targets show TARGET_ID      Show full target metadata
    bridge catalog list                List domains with counts
    bridge catalog validate            Check catalog for consistency errors
    bridge catalog show BRIDGE_ID      Show a catalog bridge's full metadata

Bridges defined in the catalog are resolved the same way as inline tunnels.
Inline tunnels (in tunnels.yaml) take precedence over catalog bridges when
both define the same name.
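
The precedence rule can be sketched as a dictionary merge. This is an illustration only: `resolve_tunnels` and the sample host names are hypothetical, and the real resolver may work differently.

```python
def resolve_tunnels(inline: dict[str, dict], catalog: dict[str, dict]) -> dict[str, dict]:
    """Merge catalog bridges with inline tunnels; inline wins on a name clash."""
    merged = dict(catalog)   # start from the catalog-defined bridges
    merged.update(inline)    # same-named inline tunnels replace catalog entries
    return merged


# Hypothetical sample data: one name is defined in both places.
catalog = {
    "state-hub-coulombcore": {"host": "catalog.example"},
    "temporal": {"host": "temporal.example"},
}
inline = {"state-hub-coulombcore": {"host": "coulombcore.local"}}

resolved = resolve_tunnels(inline, catalog)
print(resolved["state-hub-coulombcore"]["host"])
```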


STATE FILES
-----------

Runtime state is stored in ~/.local/state/bridge/:

    {name}.pid      Manager process ID
    {name}.state    Current bridge state (e.g. "connected")
    {name}.log      Audit log, one JSON object per line

Override the state directory with: BRIDGE_STATE_DIR=/path/to/dir
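
As a rough illustration of the layout above, the following sketch derives the three per-tunnel paths and honors the BRIDGE_STATE_DIR override. The helper names `state_dir` and `state_files` are hypothetical and not part of the bridge package.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def state_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the state directory, preferring BRIDGE_STATE_DIR when it is set."""
    env = os.environ if env is None else env
    override = env.get("BRIDGE_STATE_DIR")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".local" / "state" / "bridge"


def state_files(name: str, env: Optional[Mapping[str, str]] = None) -> dict[str, Path]:
    """Map each documented suffix (pid, state, log) to its path for one tunnel."""
    base = state_dir(env)
    return {suffix: base / f"{name}.{suffix}" for suffix in ("pid", "state", "log")}


paths = state_files("state-hub-coulombcore", {"BRIDGE_STATE_DIR": "/tmp/bridge-state"})
print(paths["log"])
```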


AUDIT LOG FORMAT
----------------

Each event is written as one JSON object per line (wrapped here for
readability):

    {
      "timestamp": "2026-03-12T14:23:01.456789+00:00",
      "tunnel": "state-hub-coulombcore",
      "actor": "agent.claude-coulombcore",
      "actor_type": "automation",
      "event": "bridge_connected"
    }

The optional fields "detail" and "cert_identity" are written only when they
are non-empty.

Event types: bridge_started, bridge_connected, bridge_disconnected,
  bridge_reconnecting, health_check_failed, health_check_recovered,
  bridge_stopped, cert_expiring
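
Because the log is one JSON object per line, it can be summarized with a few lines of standard-library Python. The sketch below counts events by type and skips lines that do not parse; `count_events` and the sample log content are made up for illustration.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_events(log_path: Path) -> Counter:
    """Count audit events by their "event" field, ignoring unparsable lines."""
    counts: Counter = Counter()
    for line in log_path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a partially written line
        counts[record.get("event", "unknown")] += 1
    return counts


# Demo against a throwaway log file with one corrupt line.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "demo.log"
    log.write_text(
        '{"event": "bridge_started"}\n'
        '{"event": "bridge_connected"}\n'
        'not json\n'
        '{"event": "bridge_connected"}\n'
    )
    demo_counts = count_events(log)

print(dict(demo_counts))
```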


MCP INTEGRATION
---------------

OpsBridge exposes its capabilities as a FastMCP server so Claude Code agents
can call bridge_up(), bridge_status(), catalog_list_targets(), etc. as
first-class MCP tools — no Bash required, structured JSON in/out.

Available tools: bridge_up, bridge_down, bridge_restart, bridge_status,
  bridge_logs, catalog_list_targets, catalog_show_target,
  catalog_list_domains, catalog_validate, catalog_show_bridge

Available resources: bridge://status, catalog://domains, catalog://targets

Project-scope (auto, inside ops-bridge/):
  Already configured in .mcp.json. Claude Code sessions inside this repo
  see the tools automatically.

User-scope (machine-global, any repo):
  python scripts/register_mcp.py

Human operator skill:
  /bridge-status — natural-language tunnel health summary
  (skill file: ~/.claude/plugins/ops-bridge/bridge-status.md)

Run the server directly (for debugging):
  uv run python src/bridge/mcp_server/server.py


DEVELOPMENT
-----------

    uv run pytest                         Run all tests
    uv run pytest tests/test_cli.py -v    Run a specific test file
    uv run ruff check .                   Lint

Source layout:

    src/bridge/
      cli.py       Typer CLI (entry point)
      models.py    Core dataclasses and enums
      config.py    Config loading from tunnels.yaml
      manager.py   Tunnel lifecycle (subprocess, reconnect loop)
      state.py     PID and state file management
      audit.py     Audit event logging
      health.py    HTTP health checker (async, httpx)
      catalog/     OpsCatalog extension


SERVER PREREQUISITES
--------------------

For reliable auto-reconnect after reboots or network drops, the remote sshd
needs two settings in /etc/ssh/sshd_config:

    ClientAliveInterval 30
    ClientAliveCountMax 3

Without these, dead SSH sessions hold their remote port forward open (the OS
has not yet cleaned up the socket), so the next reconnect attempt hits
"remote port forwarding failed" and exits with code 255. With ClientAlive
enabled, sshd evicts stale sessions within ~90 seconds and frees the port.

NIGHTLY STALE-FORWARD CLEANUP
------------------------------

When a bridge client dies without tearing down its SSH session, the remote
host can keep port 18000 (etc.) bound to a zombie sshd listener. The port
accepts connections but never forwards them, which breaks in-cluster proxies
such as actcore-state-hub-bridge on railiance01.

Install a 03:00 local-time cron job that probes each reverse tunnel's remote
forward, kills stale listeners when the local service is healthy but the
remote forward is not, and restarts the tunnel:

    bridge maintenance install-cron

Manual run:

    bridge maintenance cleanup --restart

Inspect or remove the cron entry:

    bridge maintenance show-cron
    bridge maintenance uninstall-cron

Logs append to ~/.local/state/bridge/cleanup.log

Apply the sshd_config settings from SERVER PREREQUISITES on the remote host
and reload sshd (existing sessions are not disconnected):

    sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
    sudo sed -i 's/#ClientAliveCountMax 3/ClientAliveCountMax 3/' /etc/ssh/sshd_config
    sudo kill -HUP $(cat /run/sshd.pid)

If fail2ban is running on the remote, whitelist the bridge host IP so rapid
reconnect storms (e.g. after a key auth failure) do not trigger a ban.
Add the client IP to ignoreip in /etc/fail2ban/jail.local:

    [DEFAULT]
    ignoreip = 127.0.0.1/8 ::1 <your-bridge-host-ip>

Then reload: sudo systemctl reload fail2ban

Note: health_check.url must point to a LOCAL port (the local side of the
tunnel), not the remote forwarded port. For a reverse tunnel
(remote_port=18000, local_port=8000), the correct health check URL is
http://127.0.0.1:8000/... — NOT http://127.0.0.1:18000/...
For SSE endpoints (MCP), use a non-streaming endpoint from the same service
(e.g. the state-hub /state/health) since the health checker waits for the
response to complete.
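
The health_check.url rule noted above (local port, not the remote forward) is easy to get wrong, so a config check along the following lines may help. This is a sketch under assumptions: `health_url_problem` is a hypothetical helper, not an existing bridge function.

```python
from typing import Optional
from urllib.parse import urlsplit


def health_url_problem(url: str, local_port: int, remote_port: int) -> Optional[str]:
    """Return a description of the problem, or None if the URL targets local_port."""
    port = urlsplit(url).port  # None when the URL carries no explicit port
    if port == local_port:
        return None
    if port == remote_port:
        return f"health_check.url targets remote_port {remote_port}; use local_port {local_port}"
    return f"health_check.url targets port {port}; expected local_port {local_port}"


print(health_url_problem("http://127.0.0.1:18000/health", local_port=8000, remote_port=18000))
print(health_url_problem("http://127.0.0.1:8000/state/health", local_port=8000, remote_port=18000))
```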


DESIGN NOTES
------------

- No system daemons. Tunnel processes are managed as subprocesses; PIDs
  are tracked in ~/.local/state/bridge/.
- Graceful shutdown: SIGTERM to the tunnel manager process allows a clean
  exit; SIGKILL follows after 5 seconds if it is unresponsive.
- Actor attribution on every log event (human vs. automation) supports
  audit traceability (FRS §5.7).
- SSH command invoked: ssh -N -R remote_port:127.0.0.1:local_port
  -i ssh_key ssh_user@host
- ExitOnForwardFailure=yes is set, so SSH exits immediately if the remote
  port is already in use. This is intentional — it forces a clean reconnect
  rather than silently running without the port forward active.
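
For orientation, the SSH invocation described in these notes could be assembled as an argument list like the one below. `build_ssh_argv` is a hypothetical helper, and passing ExitOnForwardFailure through `-o` is an assumption about how the option is set; the manager may add further options.

```python
import os


def build_ssh_argv(
    host: str, ssh_user: str, ssh_key: str, remote_port: int, local_port: int
) -> list[str]:
    """Build the argv for a reverse port-forward with no remote command (-N)."""
    return [
        "ssh",
        "-N",                                            # forward only, run nothing remotely
        "-o", "ExitOnForwardFailure=yes",                # exit if the remote port cannot be bound
        "-R", f"{remote_port}:127.0.0.1:{local_port}",   # remote listener -> local service
        "-i", os.path.expanduser(ssh_key),
        f"{ssh_user}@{host}",
    ]


argv = build_ssh_argv("coulombcore.local", "ubuntu", "/keys/id_ops", 18000, 8000)
print(" ".join(argv))
```

Building a list rather than a shell string keeps the key path and host from being re-parsed by a shell when the list is handed to `subprocess`.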


REPO STRUCTURE
--------------

    src/bridge/       Main source
    tests/            Test suite
    wiki/             PRD, FRS, OpsCatalog specification
    workplans/        Custodian State Hub workplan files (BRIDGE-WP-*)
    pyproject.toml    Build config and dependencies
134
SCOPE.md
Normal file
@@ -0,0 +1,134 @@
# SCOPE

> This file helps you quickly understand what this repository is about,
> when it is relevant, and when it is not.
> It is intentionally lightweight and may be incomplete.

---

## One-liner

SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.

---

## Core Idea

Claude Code sessions run locally; the Custodian State Hub API runs locally. Remote machines (Railiance nodes, Temporal workers, Markitect services) need to reach the hub. Ops-bridge manages named SSH reverse tunnels with auto-reconnect, health checks, audit logging, and an MCP server so Claude Code can start/stop/inspect tunnels as tools.

---

## In Scope

- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
- Auto-reconnect with exponential backoff and configurable retry policy
- Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
- Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
  with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
  works without any CA or external tooling
- **cert_command mode** (optional): pluggable shell command that issues a short-lived
  CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
  `cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
- PID + state file management in `~/.local/state/bridge/`
- MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
- OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
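
The "TTL-aware pre-emptive cert refresh" item above can be read as a simple time comparison. The sketch below is one possible interpretation: `needs_refresh` and the five-minute margin are assumptions, not values taken from ops-bridge.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(
    expires_at: datetime,
    now: datetime,
    margin: timedelta = timedelta(minutes=5),  # assumed safety margin
) -> bool:
    """True once the certificate is within `margin` of expiry, or already expired."""
    return now >= expires_at - margin


expiry = datetime(2026, 3, 12, 12, 0, tzinfo=timezone.utc)
print(needs_refresh(expiry, expiry - timedelta(minutes=10)))  # well before expiry
print(needs_refresh(expiry, expiry - timedelta(minutes=4)))   # inside the margin
```

Refreshing before expiry rather than at it means a reconnect never has to start with a certificate that lapses mid-handshake.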

---

## Out of Scope

- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
  certs via the `cert_command` interface but never signs anything itself)
- SSH key generation for human admins (self-service: `ssh-keygen`)
- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
- Long-running application hosting on remote machines (port-forward only, not deployment)
- VPN or layer-3 connectivity
- Monitoring/alerting beyond JSON audit logs
- Replacing SSH for general interactive access

---

## Relevant When

- Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
- An audit trail is needed of which actor (`adm` / `agt` / `atm`) started or stopped tunnels
- Setting up a new machine in the Railiance ecosystem that must phone home to the hub
- Diagnosing connectivity issues between local hub and remote services
- Checking certificate validity for active tunnels (`bridge cert-status`)
- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials

---

## Not Relevant When

- All work is local (no remote services involved)
- Manually running `ssh -R` is acceptable
- No need for audit tracing of tunnel state changes

---

## Current State

- Status: active (v0.1 core complete; AccessManagementDirective alignment done — BRIDGE-WP-0004)
- Implementation: ~80% — CLI tunneling fully functional, MCP integration working, health
  checks and audit logging complete; ActorType enum (adm/agt/atm) enforced; cert_command
  mode implemented with TTL-aware refresh and cert_identity audit logging; OpsCatalog
  framework present but not yet populated
- Stability: stable tunnel lifecycle; tested under network drops and SSH failures
- Usage: running in lab for daily Railiance/Temporal connectivity

---

## How It Fits

- Upstream dependencies: SSH (system), OpenSSH server on remote hosts
- Downstream consumers: all remote Claude Code agents depend on ops-bridge to reach local hub MCP; activity-core Temporal server reachable via bridge tunnel
- Often used with: the-custodian (health checks point to hub API), activity-core (Temporal port-forwarding)

---

## Terminology

- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
  cert_command, cert_identity
- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
- Also known as: "the bridge"
- Potentially confusing: "bridge state" is a tunnel-specific state machine
  (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)
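
The actor types, the `adm-*` / `agt-*` / `atm-*` naming convention, and the legacy mapping listed above can be sketched as follows. This is an illustration, not the project's `ActorType` implementation: the function `actor_type_from_name` and the `LEGACY_ACTOR_CLASS` table are hypothetical names.

```python
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM agent
    ATM = "atm"  # deterministic automation


# Deprecated actor_class values and their replacements, per the list above.
LEGACY_ACTOR_CLASS = {"human": ActorType.ADM, "automation": ActorType.ATM}


def actor_type_from_name(name: str) -> ActorType:
    """Derive the actor type from an `adm-*`, `agt-*`, or `atm-*` actor name."""
    prefix, sep, rest = name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {name!r} does not follow '<type>-<id>'")
    try:
        return ActorType(prefix)
    except ValueError:
        raise ValueError(f"actor name {name!r} has unknown type prefix {prefix!r}") from None


print(actor_type_from_name("agt-claude-coulombcore").value)
```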

---

## Related / Overlapping

- `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
  `cert_command` when short-lived certificates are required
- `activity-core` — Temporal server on remote reached via ops-bridge tunnel
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns
  host-side principal deployment (`/etc/ssh/auth_principals/`)

---

## Provided Capabilities

```capability
type: infrastructure
title: SSH reverse tunnel connectivity
description: Named, auto-reconnecting SSH reverse tunnels with health checks and audit logging — keeps remote execution environments continuously connected to the local Custodian State Hub.
keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge]
```

---

## Getting Oriented

- Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
  `~/.local/state/bridge/` (PID/state/cert files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
  MCP: `bridge_status()`
- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)
55
architecture/adr-001-cross-mode-capability-registry.md
Normal file
@@ -0,0 +1,55 @@
---
id: ADR-001
title: Cross-Mode Capability Registry and Coverage Enforcement
status: accepted
date: 2026-03-12
---

## Context

OpsBridge exposes its operations through three access modes: CLI (`bridge` CLI), MCP server
(FastMCP stdio), and Skills (Claude plugin prompts). As the capability surface grows, there is
no guarantee that a new capability will be implemented consistently across all required modes,
or that tests exist for each mode.

## Decision

Introduce a canonical **Capability Registry** (`src/bridge/capabilities.py`) that:

1. Lists every operation as a `Capability(name, description, required_access_modes)` dataclass.
2. Declares which access modes each capability must support.
3. Is imported by the cross-mode meta-test to enforce complete test coverage.

### Test coverage enforcement

Pytest marks `@pytest.mark.capability(name)` and `@pytest.mark.access_mode(mode)` are placed
on the canonical test for each (capability, mode) pair. `tests/test_coverage_completeness.py`
collects these marks at session scope and fails if any pair required by the registry has no
corresponding test.
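
At its core, the completeness check is a set difference between the pairs the registry requires and the pairs that have a marked test. A minimal sketch, with pytest mark collection left out and `missing_pairs` as a hypothetical name:

```python
def missing_pairs(
    required: dict[str, frozenset[str]],
    tested: set[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return (capability, mode) pairs the registry requires but no test covers."""
    needed = {(name, mode) for name, modes in required.items() for mode in modes}
    return sorted(needed - tested)


# Registry excerpt: bridge_status also requires the skill mode.
required = {
    "bridge_up": frozenset({"cli", "mcp"}),
    "bridge_status": frozenset({"cli", "mcp", "skill"}),
}
# Pairs found on marked tests (here the skill test is missing).
tested = {
    ("bridge_up", "cli"),
    ("bridge_up", "mcp"),
    ("bridge_status", "cli"),
    ("bridge_status", "mcp"),
}

gaps = missing_pairs(required, tested)
print(gaps)
```

A meta-test built this way would fail whenever `gaps` is non-empty, which forces each new capability to arrive together with its tests.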

### FastMCP in-process testing

MCP tools are tested in `tests/test_mcp.py` using `fastmcp.Client(mcp_app)` — an in-process
client that calls tools without spawning a subprocess or opening a network socket. This is the
preferred approach because:

- Tests run in the same process as the server code, so patches/mocks work normally.
- No port allocation, no cleanup, no flakiness from network timeouts.
- FastMCP 3.x returns results via `result.content[0].text` (JSON string) for non-empty
  responses, and `result.data` (empty list/dict) when the return value is empty.

### Skill static lint

`tests/test_skill.py` validates skill Markdown files in `~/.claude/plugins/ops-bridge/`:

- Required frontmatter: `name`, `description`.
- Body must reference at least one registered capability name.
- The `bridge_status` skill must reference `bridge_status` and the registry must declare
  `skill` as a required mode for that capability.
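
The first two lint rules above can be approximated with standard-library Python alone. The sketch below is not the code in `tests/test_skill.py`: `parse_frontmatter` and `lint_skill` are hypothetical names, and the frontmatter parser only handles flat `key: value` lines.

```python
def parse_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a skill file into (frontmatter fields, body). Flat key: value only."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1) if line.strip() == "---")
    except StopIteration:
        return {}, text  # no closing delimiter: treat everything as body
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta, "\n".join(lines[end + 1:])


def lint_skill(text: str, capability_names: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the skill file passes."""
    meta, body = parse_frontmatter(text)
    problems = [
        f"missing frontmatter field: {field}"
        for field in ("name", "description")
        if not meta.get(field)
    ]
    if not any(name in body for name in capability_names):
        problems.append("body references no registered capability")
    return problems


skill = "---\nname: bridge-status\ndescription: Tunnel health summary\n---\nCall bridge_status and summarize.\n"
print(lint_skill(skill, {"bridge_status"}))
```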

## Consequences

- Every new capability must be added to the registry before or alongside its implementation.
- Every new (capability, mode) pair requires a marked test or the meta-test fails.
- The registry is the single source of truth for "what does OpsBridge do and where".
- Skills must reference capability names by their canonical registry IDs.
40
pyproject.toml
Normal file
@@ -0,0 +1,40 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "ops-bridge"
version = "0.1.0"
description = "SSH reverse tunnel lifecycle manager"
requires-python = ">=3.11"
dependencies = [
    "typer>=0.12",
    "pyyaml>=6.0",
    "httpx>=0.27",
    "fastmcp>=2.0.0,<3.1.0",
]

[project.scripts]
bridge = "bridge.cli:app"

[tool.hatch.build.targets.wheel]
packages = ["src/bridge"]

[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
asyncio_mode = "auto"
markers = [
    "capability(name): the bridge capability under test",
    "access_mode(mode): access mode being tested (cli, mcp, skill)",
]

[tool.ruff]
line-length = 88

[dependency-groups]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.23",
    "ruff>=0.4",
]
12
registry/README.md
Normal file
@@ -0,0 +1,12 @@
# Capability Registry

Markdown-first capability index for federation and reuse planning.

## Authoring

1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
2. Add the row to `indexes/capabilities.yaml`.
3. Run `reuse-surface validate` from a checkout with the CLI installed.
4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.

Federation contract: reuse-surface `docs/RegistryFederation.md`.
0
registry/capabilities/.gitkeep
Normal file
4
registry/indexes/capabilities.yaml
Normal file
@@ -0,0 +1,4 @@
version: 1
updated: '2026-06-16'
domain: helix_forge
capabilities: []
96
scripts/register_mcp.py
Normal file
@@ -0,0 +1,96 @@
#!/usr/bin/env python3
"""Register the ops-bridge MCP server at user scope in ~/.claude.json.

Usage:
    python scripts/register_mcp.py [--dry-run]

This script:
  1. Reads the MCP server config from .mcp.json in the repo root.
  2. Calls `claude mcp add-json -s user ops-bridge <config>` to register.
  3. Patches the `cwd` field in ~/.claude.json (claude mcp add-json silently drops it).

After running, all Claude Code sessions on this machine have access to the
`ops-bridge` MCP tools — even when opened outside the ops-bridge repo directory.
"""
from __future__ import annotations

import argparse
import json
import subprocess
import sys
from pathlib import Path


REPO_ROOT = Path(__file__).parent.parent
MCP_JSON = REPO_ROOT / ".mcp.json"
CLAUDE_JSON = Path.home() / ".claude.json"
SERVER_NAME = "ops-bridge"


def load_server_config() -> dict:
    data = json.loads(MCP_JSON.read_text())
    servers = data.get("mcpServers", {})
    if SERVER_NAME not in servers:
        raise SystemExit(f"ERROR: '{SERVER_NAME}' not found in {MCP_JSON}")
    return servers[SERVER_NAME]


def register(config: dict, dry_run: bool) -> None:
    config_json = json.dumps(config)
    cmd = ["claude", "mcp", "add-json", "-s", "user", SERVER_NAME, config_json]
    print(f"→ Running: {' '.join(cmd[:6])} '<config>'")
    if not dry_run:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"FAILED:\n{result.stderr}", file=sys.stderr)
            raise SystemExit(1)
        print(f" OK: {result.stdout.strip()}")


def patch_cwd(cwd: str, dry_run: bool) -> None:
    """Patch the cwd field that claude mcp add-json silently drops."""
    if not CLAUDE_JSON.exists():
        print(f"WARNING: {CLAUDE_JSON} not found — skipping cwd patch")
        return

    data = json.loads(CLAUDE_JSON.read_text())
    servers = data.setdefault("mcpServers", {})
    if SERVER_NAME not in servers:
        print(f"WARNING: '{SERVER_NAME}' not found in {CLAUDE_JSON} after registration")
        return

    current_cwd = servers[SERVER_NAME].get("cwd")
    if current_cwd == cwd:
        print(f"→ cwd already correct: {cwd}")
        return

    servers[SERVER_NAME]["cwd"] = cwd
    print(f"→ Patching cwd: {cwd}")
    if not dry_run:
        CLAUDE_JSON.write_text(json.dumps(data, indent=2) + "\n")
        print(" OK")


def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument("--dry-run", action="store_true", help="Show what would be done without making changes")
    args = parser.parse_args()

    if args.dry_run:
        print("[DRY RUN] No changes will be made.\n")

    config = load_server_config()
    cwd = config.get("cwd", str(REPO_ROOT))

    print(f"Registering ops-bridge MCP server from {MCP_JSON}")
    register(config, dry_run=args.dry_run)
    patch_cwd(cwd, dry_run=args.dry_run)

    if not args.dry_run:
        print("\nDone. Restart Claude Code for the changes to take effect.")
    else:
        print("\n[DRY RUN complete]")


if __name__ == "__main__":
    main()
0
src/bridge/__init__.py
Normal file
69
src/bridge/audit.py
Normal file
@@ -0,0 +1,69 @@
"""Audit logging for OpsBridge lifecycle events."""
from __future__ import annotations

import json
from datetime import datetime, timezone
from enum import Enum
from pathlib import Path
from typing import Any, Dict, List, Optional


class AuditEvent(str, Enum):
    BRIDGE_STARTED = "bridge_started"
    BRIDGE_CONNECTED = "bridge_connected"
    BRIDGE_DISCONNECTED = "bridge_disconnected"
    BRIDGE_RECONNECTING = "bridge_reconnecting"
    HEALTH_CHECK_FAILED = "health_check_failed"
    HEALTH_CHECK_RECOVERED = "health_check_recovered"
    BRIDGE_STOPPED = "bridge_stopped"
    CERT_EXPIRING = "cert_expiring"


def _default_state_dir() -> Path:
    return Path.home() / ".local" / "state" / "bridge"


class AuditLogger:
    def __init__(self, state_dir: Optional[Path] = None):
        self._dir = Path(state_dir) if state_dir else _default_state_dir()

    def _log_path(self, tunnel: str) -> Path:
        return self._dir / f"{tunnel}.log"

    def log(
        self,
        tunnel: str,
        event: AuditEvent,
        actor: str,
        actor_type: str,
        detail: str = "",
        cert_identity: Optional[str] = None,
    ) -> None:
        self._dir.mkdir(parents=True, exist_ok=True)
        entry: Dict[str, Any] = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tunnel": tunnel,
            "actor": actor,
            "actor_type": actor_type,
            "event": event.value,
        }
        if detail:
            entry["detail"] = detail
        if cert_identity:
            entry["cert_identity"] = cert_identity
        with self._log_path(tunnel).open("a") as f:
            f.write(json.dumps(entry) + "\n")

    def read_events(self, tunnel: str) -> List[Dict[str, Any]]:
        path = self._log_path(tunnel)
        if not path.exists():
            return []
        events = []
        for line in path.read_text().splitlines():
            line = line.strip()
            if line:
                try:
                    events.append(json.loads(line))
                except json.JSONDecodeError:
                    pass
        return events
83
src/bridge/capabilities.py
Normal file
@@ -0,0 +1,83 @@
"""Canonical capability registry for OpsBridge.

Every operation that can be invoked via CLI, MCP, or Skill must be listed here.
The cross-mode test suite uses this registry to enforce test coverage parity.
"""
from __future__ import annotations

from dataclasses import dataclass

ACCESS_MODES = frozenset({"cli", "mcp", "skill"})


@dataclass(frozen=True)
class Capability:
    name: str
    description: str
    required_access_modes: frozenset[str]


CAPABILITIES: list[Capability] = [
    Capability(
        name="bridge_up",
        description="Start one or all tunnels",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="bridge_down",
        description="Stop one or all tunnels",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="bridge_restart",
        description="Restart one or all tunnels",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="bridge_status",
        description="Show tunnel status",
        required_access_modes=frozenset({"cli", "mcp", "skill"}),
    ),
    Capability(
        name="bridge_logs",
        description="Tail tunnel audit log",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="catalog_list_targets",
        description="List catalog targets",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="catalog_show_target",
        description="Show target metadata",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="catalog_list_domains",
        description="List catalog domains",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="catalog_validate",
        description="Validate catalog consistency",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="catalog_show_bridge",
        description="Show bridge metadata",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="bridge_check",
        description="End-to-end tunnel diagnostics via SSH: SSH PID alive + remote port listening",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="bridge_cert_status",
        description="Show certificate status for tunnels using cert_command mode",
        required_access_modes=frozenset({"cli"}),
    ),
]

CAPABILITIES_BY_NAME: dict[str, Capability] = {c.name: c for c in CAPABILITIES}
0
src/bridge/catalog/__init__.py
Normal file
141
src/bridge/catalog/loader.py
Normal file
@@ -0,0 +1,141 @@
"""Catalog loader — walks a catalog directory tree and parses YAML files."""
from __future__ import annotations

import logging
from pathlib import Path
from typing import Any

import yaml

from bridge.catalog.models import (
    ActorClass,
    Catalog,
    CatalogBridge,
    CatalogDomain,
    CatalogTarget,
)
from bridge.models import HealthCheckConfig, ReconnectPolicy

log = logging.getLogger(__name__)


class CatalogLoadError(Exception):
    """Raised when catalog loading fails."""


def load_catalog(path: Path) -> Catalog:
    """Walk the catalog directory and return a populated Catalog."""
    path = Path(path)
    if not path.exists():
        raise CatalogLoadError(f"Catalog path not found: {path}")

    catalog = Catalog()
    for yaml_file in sorted(path.rglob("*.yaml")):
        _load_file(yaml_file, catalog)
    return catalog


def _load_file(path: Path, catalog: Catalog) -> None:
    try:
        with path.open() as f:
            data = yaml.safe_load(f)
    except yaml.YAMLError as e:
        raise CatalogLoadError(f"Invalid YAML in {path}: {e}") from e

    if not isinstance(data, dict):
        log.warning("Skipping %s: not a YAML mapping", path)
        return

    entry_type = data.get("type")
    if not entry_type:
        log.warning("Skipping %s: no 'type' field", path)
        return

    try:
        if entry_type == "domain":
            entry = _parse_domain(data, path)
            catalog.domains[entry.id] = entry
        elif entry_type == "target":
            entry = _parse_target(data, path)
            catalog.targets[entry.id] = entry
        elif entry_type == "bridge":
            entry = _parse_bridge(data, path)
            catalog.bridges[entry.id] = entry
        elif entry_type == "actor":
            entry = _parse_actor(data, path)
            catalog.actors[entry.id] = entry
        else:
            log.warning("Skipping %s: unknown type '%s'", path, entry_type)
    except CatalogLoadError:
        raise
    except Exception as e:
        raise CatalogLoadError(f"Error parsing {path}: {e}") from e


def _require(data: dict, field: str, path: Path) -> Any:
    if field not in data:
        raise CatalogLoadError(f"Missing required field '{field}' in {path}")
    return data[field]


def _parse_domain(data: dict, path: Path) -> CatalogDomain:
    return CatalogDomain(
        id=str(_require(data, "id", path)),
        name=str(_require(data, "name", path)),
        description=str(data.get("description", "")),
        environment=str(data.get("environment", "")),
    )


def _parse_target(data: dict, path: Path) -> CatalogTarget:
    return CatalogTarget(
        id=str(_require(data, "id", path)),
        domain=str(_require(data, "domain", path)),
        kind=str(_require(data, "kind", path)),
        description=str(data.get("description", "")),
        reachable_via=list(data.get("reachable_via") or []),
    )


def _parse_bridge(data: dict, path: Path) -> CatalogBridge:
    health_check = None
    if "health_check" in data and data["health_check"]:
        hc = data["health_check"]
        health_check = HealthCheckConfig(
            url=str(_require(hc, "url", path)),
            interval_seconds=int(hc.get("interval_seconds", 30)),
            timeout_seconds=int(hc.get("timeout_seconds", 5)),
        )

    reconnect = None
    if "reconnect" in data and data["reconnect"]:
        r = data["reconnect"]
        reconnect = ReconnectPolicy(
            max_attempts=int(r.get("max_attempts", 0)),
            backoff_initial=int(r.get("backoff_initial", 5)),
            backoff_max=int(r.get("backoff_max", 60)),
        )

    return CatalogBridge(
        id=str(_require(data, "id", path)),
        domain=str(_require(data, "domain", path)),
        target=str(_require(data, "target", path)),
        host=str(_require(data, "host", path)),
        remote_port=int(_require(data, "remote_port", path)),
        local_port=int(_require(data, "local_port", path)),
        ssh_user=str(_require(data, "ssh_user", path)),
        ssh_key=str(_require(data, "ssh_key", path)),
        actor=str(_require(data, "actor", path)),
        description=str(data.get("description", "")),
        access_method=str(data.get("access_method", "ssh-reverse")),
        health_check=health_check,
        reconnect=reconnect,
    )


def _parse_actor(data: dict, path: Path) -> ActorClass:
    return ActorClass(
        id=str(_require(data, "id", path)),
        actor_class=str(_require(data, "class", path)),
        description=str(data.get("description", "")),
    )
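The loader's skip-or-raise behaviour can be hard to see at a glance, so here is a minimal sketch of the same dispatch applied to already-parsed documents (no YAML involved). The plain-dict catalog and the `file_entry` and `require` names are assumptions made for illustration; the real loader builds typed dataclasses instead.

```python
from __future__ import annotations

from typing import Any

# Maps the document's "type" field to the catalog section that stores it.
SECTIONS = {"domain": "domains", "target": "targets", "bridge": "bridges", "actor": "actors"}


class CatalogLoadError(Exception):
    """Stand-in for the loader's exception."""


def require(data: dict, field: str, source: str) -> Any:
    if field not in data:
        raise CatalogLoadError(f"Missing required field '{field}' in {source}")
    return data[field]


def file_entry(data: Any, source: str, catalog: dict) -> bool:
    """Store one parsed document; return False when it is skipped."""
    if not isinstance(data, dict) or not data.get("type"):
        return False  # not a mapping, or no 'type': skipped with a warning in the loader
    section = SECTIONS.get(data["type"])
    if section is None:
        return False  # unknown type: also skipped
    catalog[section][str(require(data, "id", source))] = data
    return True


catalog = {name: {} for name in SECTIONS.values()}
stored = file_entry({"type": "domain", "id": "prod", "name": "Production"}, "prod.yaml", catalog)
skipped_list = file_entry(["not", "a", "mapping"], "list.yaml", catalog)
skipped_unknown = file_entry({"type": "widget", "id": "w"}, "widget.yaml", catalog)

try:
    file_entry({"type": "target"}, "broken.yaml", catalog)
    error_message = ""
except CatalogLoadError as exc:
    error_message = str(exc)
```

The point of the split is that malformed or unrelated files are tolerated, while a recognised entry with a missing required field fails the whole load.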
69
src/bridge/catalog/models.py
Normal file
@@ -0,0 +1,69 @@
"""Domain models for OpsCatalog."""
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Optional

from bridge.models import HealthCheckConfig, ReconnectPolicy, TunnelConfig


@dataclass
class CatalogDomain:
    id: str
    name: str
    description: str = ""
    environment: str = ""


@dataclass
class CatalogTarget:
    id: str
    domain: str
    kind: str
    description: str = ""
    reachable_via: List[str] = field(default_factory=list)


@dataclass
class CatalogBridge:
    id: str
    domain: str
    target: str
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str
    actor: str
    description: str = ""
    access_method: str = "ssh-reverse"
    health_check: Optional[HealthCheckConfig] = None
    reconnect: Optional[ReconnectPolicy] = None

    def to_tunnel_config(self) -> TunnelConfig:
        return TunnelConfig(
            name=self.id,
            host=self.host,
            remote_port=self.remote_port,
            local_port=self.local_port,
            ssh_user=self.ssh_user,
            ssh_key=self.ssh_key,
            actor=self.actor,
            reconnect=self.reconnect if self.reconnect is not None else ReconnectPolicy(),
            health_check=self.health_check,
        )


@dataclass
class ActorClass:
    id: str
    actor_class: str
    description: str = ""


@dataclass
class Catalog:
    domains: Dict[str, CatalogDomain] = field(default_factory=dict)
    targets: Dict[str, CatalogTarget] = field(default_factory=dict)
    bridges: Dict[str, CatalogBridge] = field(default_factory=dict)
    actors: Dict[str, ActorClass] = field(default_factory=dict)
35
src/bridge/catalog/resolver.py
Normal file
@@ -0,0 +1,35 @@
"""Catalog resolver — resolves a bridge name to a TunnelConfig."""
from __future__ import annotations

from typing import Dict, Optional

from bridge.catalog.models import Catalog
from bridge.models import TunnelConfig


class BridgeNotFound(Exception):
    """Raised when a bridge name cannot be resolved from inline config or catalog."""


def resolve(
    name: str,
    catalog: Optional[Catalog],
    inline_tunnels: Dict[str, TunnelConfig],
) -> TunnelConfig:
    """Resolve bridge name to TunnelConfig.

    Lookup order:
      1. inline_tunnels (from tunnels.yaml) — wins if present
      2. catalog bridges — fallback
      3. raises BridgeNotFound if neither has the name
    """
    if name in inline_tunnels:
        return inline_tunnels[name]

    if catalog is not None and name in catalog.bridges:
        return catalog.bridges[name].to_tunnel_config()

    raise BridgeNotFound(
        f"Bridge '{name}' not found in inline config"
        + (" or catalog" if catalog is not None else " (no catalog configured)")
    )
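To make the lookup order in `resolve` concrete, the following sketch replays it with plain strings standing in for `TunnelConfig` objects and a plain dict standing in for `catalog.bridges`. Those simplifications, and the example names `db` and `cache`, are assumptions for illustration only.

```python
from __future__ import annotations

from typing import Optional


class BridgeNotFound(Exception):
    """Stand-in for the resolver's exception."""


def resolve(name: str, catalog: Optional[dict], inline_tunnels: dict) -> str:
    # Same precedence as the resolver: inline entries shadow catalog entries.
    if name in inline_tunnels:
        return inline_tunnels[name]
    if catalog is not None and name in catalog:
        return catalog[name]
    raise BridgeNotFound(name)


inline = {"db": "inline-db"}
catalog = {"db": "catalog-db", "cache": "catalog-cache"}

shadowed = resolve("db", catalog, inline)     # defined in both: inline wins
fallback = resolve("cache", catalog, inline)  # only the catalog defines it

try:
    resolve("cache", None, inline)            # no catalog configured at all
    missing = False
except BridgeNotFound:
    missing = True
```

Letting inline entries win means a local `tunnels.yaml` can override a shared catalog definition without editing the catalog.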
42
src/bridge/catalog/validator.py
Normal file
@@ -0,0 +1,42 @@
"""Catalog validator — cross-reference checks for catalog consistency."""
from __future__ import annotations

from typing import List

from bridge.catalog.models import Catalog


class ValidationError(Exception):
    """Raised when catalog validation fails (used for programmatic access)."""


def validate_catalog(catalog: Catalog) -> List[str]:
    """Return a list of validation error strings (empty = valid)."""
    errors: List[str] = []

    for target in catalog.targets.values():
        if target.domain not in catalog.domains:
            errors.append(
                f"Target '{target.id}': domain '{target.domain}' does not exist in catalog"
            )
        for bridge_id in target.reachable_via:
            if bridge_id not in catalog.bridges:
                errors.append(
                    f"Target '{target.id}': reachable_via references unknown bridge '{bridge_id}'"
                )

    for bridge in catalog.bridges.values():
        if bridge.domain not in catalog.domains:
            errors.append(
                f"Bridge '{bridge.id}': domain '{bridge.domain}' does not exist in catalog"
            )
        if bridge.target not in catalog.targets:
            errors.append(
                f"Bridge '{bridge.id}': target '{bridge.target}' does not exist in catalog"
            )
        if bridge.actor not in catalog.actors:
            errors.append(
                f"Bridge '{bridge.id}': actor '{bridge.actor}' does not exist in catalog"
            )

    return errors
328
src/bridge/cleanup.py
Normal file
@@ -0,0 +1,328 @@
"""Nightly maintenance: detect and clear stale SSH remote port forwards."""
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse, urlunparse

import httpx

from bridge.diagnostics import _remote_port_probe_command, check_tunnel
from bridge.manager import TunnelManager
from bridge.models import TunnelConfig
from bridge.state import StateManager


@dataclass
class CleanupAction:
    tunnel: str
    action: str  # skipped | healthy | cleaned | cleaned_and_restarted | restarted | error
    detail: str = ""


@dataclass
class CleanupReport:
    actions: list[CleanupAction]

    @property
    def cleaned_count(self) -> int:
        return sum(1 for a in self.actions if a.action.startswith("cleaned"))


def remote_forward_health_url(cfg: TunnelConfig) -> Optional[str]:
    """Map the local health_check URL to the remote forwarded port."""
    if cfg.health_check is None or cfg.direction == "local":
        return None
    parsed = urlparse(cfg.health_check.url)
    if not parsed.hostname:
        return None
    netloc = f"{parsed.hostname}:{cfg.remote_port}"
    return urlunparse(parsed._replace(netloc=netloc))


def _ssh_base_cmd(cfg: TunnelConfig) -> list[str]:
    from pathlib import Path

    return [
        "ssh",
        "-i",
        str(Path(cfg.ssh_key).expanduser()),
        "-o",
        "BatchMode=yes",
        "-o",
        "ConnectTimeout=10",
        "-o",
        "StrictHostKeyChecking=accept-new",
        f"{cfg.ssh_user}@{cfg.host}",
    ]


def _run_ssh(cfg: TunnelConfig, remote_command: str, *, timeout: float = 30) -> subprocess.CompletedProcess[str]:
    return subprocess.run(
        [*_ssh_base_cmd(cfg), remote_command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def remote_port_listening(cfg: TunnelConfig) -> bool:
    proc = _run_ssh(cfg, _remote_port_probe_command(cfg.remote_port), timeout=15)
    return proc.stdout.strip() == "ok"


def probe_remote_forward(cfg: TunnelConfig) -> tuple[bool, str]:
    """Return (healthy, detail) for the remote forwarded service."""
    import shlex

    url = remote_forward_health_url(cfg)
    if url is None:
        return True, "no remote health url configured"
    timeout = cfg.health_check.timeout_seconds if cfg.health_check else 5
    # Quote the URL for the remote shell; repr() is not a shell-quoting function.
    remote_cmd = (
        f"curl -sf --max-time {timeout} {shlex.quote(url)} >/dev/null "
        "&& echo ok || echo fail"
    )
    try:
        proc = _run_ssh(cfg, remote_cmd, timeout=timeout + 15)
    except subprocess.TimeoutExpired:
        return False, "remote health probe timed out"
    output = proc.stdout.strip()
    if output == "ok":
        return True, "remote forward healthy"
    if proc.returncode != 0 and proc.stderr.strip():
        return False, proc.stderr.strip()
    return False, "remote forward unhealthy"


def local_service_healthy(cfg: TunnelConfig) -> Optional[bool]:
    if cfg.health_check is None:
        return None
    try:
        resp = httpx.get(
            cfg.health_check.url,
            timeout=cfg.health_check.timeout_seconds,
        )
        return resp.is_success
    except Exception:
        return False


def _remote_cleanup_script(port: int) -> str:
    return f"""set -eu
port={port}
pids=""
if command -v lsof >/dev/null 2>&1; then
  pids=$(sudo -n lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
  if [ -z "$pids" ]; then
    pids=$(lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
  fi
fi
if [ -z "$pids" ] && command -v fuser >/dev/null 2>&1; then
  pids=$(fuser -n tcp $port 2>/dev/null | tr -s ' ' '\\n' | grep -E '^[0-9]+$' || true)
fi
if [ -z "$pids" ]; then
  echo "no_listeners"
  exit 0
fi
echo "killing:$pids"
for pid in $pids; do
  kill "$pid" 2>/dev/null || sudo -n kill "$pid" 2>/dev/null || true
done
sleep 1
if ss -tln 2>/dev/null | grep -q ":$port "; then
  echo "still_listening"
else
  echo "cleared"
fi
"""


def clear_stale_remote_binding(cfg: TunnelConfig) -> tuple[bool, str]:
    try:
        proc = _run_ssh(cfg, _remote_cleanup_script(cfg.remote_port), timeout=30)
    except subprocess.TimeoutExpired:
        return False, "remote cleanup timed out"
    output = proc.stdout.strip()
    if "cleared" in output:
        return True, output
    if "no_listeners" in output:
        return True, "no listeners found"
    if "still_listening" in output:
        return False, output
    detail = output or proc.stderr.strip() or f"exit {proc.returncode}"
    return False, detail


def should_cleanup_tunnel(
    cfg: TunnelConfig,
    state_mgr: StateManager,
) -> tuple[bool, str]:
    """Decide whether a reverse tunnel's remote binding looks stale."""
    if cfg.direction == "local":
        return False, "local tunnel"

    if not remote_port_listening(cfg):
        return False, "remote port closed"

    remote_ok, remote_detail = probe_remote_forward(cfg)
    if remote_ok:
        return False, remote_detail

    check = check_tunnel(cfg, state_mgr)
    local_ok = local_service_healthy(cfg)

    if local_ok is True and not remote_ok:
        return True, f"stale forward: {remote_detail}"

    if check.ssh_process != "ok" and check.remote_port == "listening":
        return True, f"orphan forward while ssh {check.ssh_process}: {remote_detail}"

    if check.ssh_process == "ok" and not remote_ok:
        return True, f"broken forward with live client: {remote_detail}"

    return False, remote_detail


def cleanup_tunnel(
    cfg: TunnelConfig,
    state_mgr: StateManager,
    *,
    restart: bool,
) -> CleanupAction:
    name = cfg.name
    try:
        needed, reason = should_cleanup_tunnel(cfg, state_mgr)
        if not needed:
            return CleanupAction(name, "healthy", reason)

        ok, detail = clear_stale_remote_binding(cfg)
        if not ok:
            return CleanupAction(name, "error", f"cleanup failed: {detail}")

        if not restart:
            return CleanupAction(name, "cleaned", f"{reason}; {detail}")

        mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
        was_running = mgr.is_running()
        if was_running:
            mgr.stop()
        mgr.start()
        action = "cleaned_and_restarted"
        verb = "restarted" if was_running else "started"
        return CleanupAction(name, action, f"{reason}; {verb} tunnel; {detail}")
    except Exception as exc:
        return CleanupAction(name, "error", str(exc))


def restart_tunnel(
    cfg: TunnelConfig,
    state_mgr: StateManager,
) -> CleanupAction:
    """Restart one tunnel with blank-slate recovery for reverse tunnels."""
    if cfg.direction == "local":
        mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
        mgr.stop()
        mgr.start()
        return CleanupAction(cfg.name, "restarted", "local tunnel stop/start")
    return cleanup_tunnel(cfg, state_mgr, restart=True)


def restart_all_tunnels(
    cfg,
    state_mgr: StateManager,
) -> list[CleanupAction]:
    """Restart every inline tunnel (reverse via cleanup path, local via stop/start)."""
    return [restart_tunnel(tcfg, state_mgr) for tcfg in cfg.tunnels.values()]


def cleanup_all_tunnels(
    cfg,
    state_mgr: StateManager,
    *,
    restart: bool,
    tunnel_name: Optional[str] = None,
) -> CleanupReport:
    tunnels = cfg.tunnels.values()
    if tunnel_name is not None:
        if tunnel_name not in cfg.tunnels:
            raise KeyError(tunnel_name)
        tunnels = [cfg.tunnels[tunnel_name]]

    actions = [
        cleanup_tunnel(tcfg, state_mgr, restart=restart)
        for tcfg in tunnels
        if tcfg.direction != "local"
    ]
    return CleanupReport(actions=actions)


CRON_MARKER = "# ops-bridge: maintenance cleanup"
CRON_SCHEDULE = "0 3 * * *"
CRON_LOG = "~/.local/state/bridge/cleanup.log"


def build_cron_line() -> str:
    bridge_bin = "~/.local/bin/bridge"
    return (
        f"{CRON_SCHEDULE} BRIDGE_CONFIG=~/.config/bridge/tunnels.yaml "
        f"{bridge_bin} maintenance cleanup --restart "
        f">> {CRON_LOG} 2>&1 {CRON_MARKER}"
    )


def read_installed_cron() -> Optional[str]:
    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if proc.returncode != 0:
        return None
    for line in proc.stdout.splitlines():
        if CRON_MARKER in line:
            return line.strip()
    return None


def install_cleanup_cron() -> tuple[bool, str]:
    existing = read_installed_cron()
    if existing:
        return False, f"cron already installed: {existing}"

    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    current = proc.stdout if proc.returncode == 0 else ""
    new_line = build_cron_line()
    body = current.rstrip("\n")
    if body:
        body += "\n"
    body += new_line + "\n"
    write = subprocess.run(
        ["crontab", "-"],
        input=body,
        capture_output=True,
        text=True,
    )
    if write.returncode != 0:
        return False, write.stderr.strip() or "crontab write failed"
    return True, new_line


def uninstall_cleanup_cron() -> tuple[bool, str]:
    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if proc.returncode != 0:
        return False, "no crontab installed"
    kept = [
        line
        for line in proc.stdout.splitlines()
        if CRON_MARKER not in line
    ]
    if len(kept) == len(proc.stdout.splitlines()):
        return False, "cleanup cron not found"
    body = "\n".join(kept).rstrip("\n")
    if body:
        body += "\n"
    write = subprocess.run(
        ["crontab", "-"],
        input=body,
        capture_output=True,
        text=True,
    )
    if write.returncode != 0:
        return False, write.stderr.strip() or "crontab write failed"
    return True, "removed cleanup cron entry"
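The install and uninstall functions above rely on a trailing marker comment to find their own crontab line. The text manipulation can be exercised without touching a real crontab; the sketch below isolates it as two pure functions. `add_entry` and `remove_entry` are hypothetical names, and the sample crontab lines are made up for illustration.

```python
MARKER = "# ops-bridge: maintenance cleanup"


def add_entry(crontab_text: str, new_line: str) -> str:
    """Append new_line unless a marked line is already present (idempotent)."""
    if any(MARKER in line for line in crontab_text.splitlines()):
        return crontab_text
    body = crontab_text.rstrip("\n")
    if body:
        body += "\n"
    return body + new_line + "\n"


def remove_entry(crontab_text: str) -> str:
    """Drop every marked line and keep all other entries untouched."""
    kept = [line for line in crontab_text.splitlines() if MARKER not in line]
    body = "\n".join(kept).rstrip("\n")
    return body + "\n" if body else ""


existing = "0 1 * * * backup\n"
entry = f"0 3 * * * bridge maintenance cleanup --restart {MARKER}"

installed = add_entry(existing, entry)
installed_twice = add_entry(installed, entry)  # second install changes nothing
removed = remove_entry(installed)              # other entries survive removal
```

Keeping the marker on the same line as the command means one substring test is enough to detect, skip, or remove the entry.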
773
src/bridge/cli.py
Normal file
@@ -0,0 +1,773 @@
"""CLI for OpsBridge — bridge command."""
from __future__ import annotations

import dataclasses
import json
import os
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Optional

import typer

from bridge.audit import AuditLogger
from bridge.cleanup import (
    CleanupAction,
    build_cron_line,
    cleanup_all_tunnels,
    install_cleanup_cron,
    read_installed_cron,
    restart_all_tunnels,
    restart_tunnel,
    uninstall_cleanup_cron,
)
from bridge.config import ConfigError, load_config
from bridge.diagnostics import check_all_tunnels, check_tunnel
from bridge.manager import TunnelManager
from bridge.state import StateManager, _pid_alive

app = typer.Typer(
    name="bridge",
    help="OpsBridge — SSH reverse tunnel lifecycle manager.",
    no_args_is_help=True,
)

targets_app = typer.Typer(help="Inspect infrastructure targets from the OpsCatalog.")
catalog_app = typer.Typer(help="Inspect and validate the OpsCatalog.")
maintenance_app = typer.Typer(help="Scheduled maintenance for tunnel hygiene.")

app.add_typer(targets_app, name="targets")
app.add_typer(catalog_app, name="catalog")
app.add_typer(maintenance_app, name="maintenance")


def _state_dir() -> Path:
    return Path(os.environ.get("BRIDGE_STATE_DIR", str(Path.home() / ".local" / "state" / "bridge")))


def _load_or_exit():
    try:
        return load_config()
    except ConfigError as e:
        typer.echo(f"Error: {e}", err=True)
        raise typer.Exit(1)


def _load_catalog_or_exit(cfg):
    from bridge.catalog.loader import load_catalog
    if cfg.catalog_path is None:
        typer.echo("Error: catalog_path not configured in tunnels.yaml", err=True)
        raise typer.Exit(1)
    try:
        return load_catalog(cfg.catalog_path)
    except Exception as e:
        typer.echo(f"Error loading catalog: {e}", err=True)
        raise typer.Exit(1)


def _resolve_tunnel(cfg, name: str):
    """Resolve tunnel name: inline first, then catalog, then error."""
    from bridge.catalog.loader import load_catalog
    from bridge.catalog.resolver import BridgeNotFound, resolve

    catalog = None
    if cfg.catalog_path is not None:
        try:
            catalog = load_catalog(cfg.catalog_path)
        except Exception:
            pass

    try:
        return resolve(name, catalog=catalog, inline_tunnels=cfg.tunnels)
    except BridgeNotFound:
        typer.echo(f"Error: tunnel '{name}' not found in config or catalog", err=True)
        raise typer.Exit(1)


def _all_tunnel_names(cfg):
    """Return names from inline config (all-tunnels operations use inline only)."""
    return list(cfg.tunnels.keys())


# ─── Tunnel lifecycle commands ────────────────────────────────────────────────

@app.command()
def up(
    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
):
    """Start one or all tunnels."""
    cfg = _load_or_exit()
    sd = _state_dir()

    if tunnel:
        tcfg = _resolve_tunnel(cfg, tunnel)
        mgr = TunnelManager(tcfg, state_dir=sd)
        if mgr.is_running():
            typer.echo(f"Tunnel '{tunnel}' is already running.")
            raise typer.Exit(2)
        mgr.start()
        typer.echo(f"Started tunnel '{tunnel}'.")
    else:
        names = _all_tunnel_names(cfg)
        any_already_running = False
        for name in names:
            tcfg = cfg.tunnels[name]
            mgr = TunnelManager(tcfg, state_dir=sd)
            if mgr.is_running():
                typer.echo(f"Tunnel '{name}' is already running.")
                any_already_running = True
            else:
                mgr.start()
                typer.echo(f"Started tunnel '{name}'.")
        if any_already_running and len(names) == 1:
            raise typer.Exit(2)


@app.command()
def down(
    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
):
    """Stop one or all tunnels."""
    cfg = _load_or_exit()
    sd = _state_dir()

    if tunnel:
        tcfg = _resolve_tunnel(cfg, tunnel)
        mgr = TunnelManager(tcfg, state_dir=sd)
        if not mgr.is_running():
            typer.echo(f"Tunnel '{tunnel}' is not running.")
            raise typer.Exit(2)
        mgr.stop()
        typer.echo(f"Stopped tunnel '{tunnel}'.")
    else:
        names = _all_tunnel_names(cfg)
        any_not_running = False
        for name in names:
            tcfg = cfg.tunnels[name]
            mgr = TunnelManager(tcfg, state_dir=sd)
            if not mgr.is_running():
                typer.echo(f"Tunnel '{name}' is not running.")
                any_not_running = True
            else:
                mgr.stop()
                typer.echo(f"Stopped tunnel '{name}'.")
        if any_not_running and len(names) == 1:
            raise typer.Exit(2)


def _emit_restart_actions(actions: list[CleanupAction]) -> None:
    any_error = False
    for action in actions:
        typer.echo(f"{action.tunnel}: {action.action} — {action.detail}")
        if action.action == "error":
            any_error = True
    if any_error:
        raise typer.Exit(1)


@app.command()
def restart(
    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
):
    """Restart one or all tunnels.

    Reverse tunnels run conditional remote stale-forward cleanup before
    reconnecting; healthy forwards are left running. Local-direction tunnels
    use local stop/start only.
    """
    cfg = _load_or_exit()
    sd = _state_dir()
    state_mgr = StateManager(state_dir=sd)

    if tunnel:
        tcfg = _resolve_tunnel(cfg, tunnel)
        actions = [restart_tunnel(tcfg, state_mgr)]
    else:
        actions = restart_all_tunnels(cfg, state_mgr)

    _emit_restart_actions(actions)


@app.command()
def status(
    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
    """Show status of all tunnels."""
    cfg = _load_or_exit()
    sd = _state_dir()
    state_mgr = StateManager(state_dir=sd)

    rows = []
    for name, tcfg in cfg.tunnels.items():
        state = state_mgr.read_state(name)
        raw_pid = state_mgr.read_raw_pid(name)
        pid_alive_val = _pid_alive(raw_pid) if raw_pid is not None else None
        stale = (
            state.value in ("connected", "degraded")
            and pid_alive_val is not True
        )
        rows.append({
            "tunnel": name,
            "state": state.value,
            "actor": tcfg.actor,
            "host": tcfg.host,
            "pid": raw_pid,
            "pid_alive": pid_alive_val,
            "stale": stale,
            "uptime": None,
            "health": None,
        })

    if as_json:
        typer.echo(json.dumps(rows, indent=2))
    else:
        _print_status_table(rows)


def _print_status_table(rows):
    if not rows:
        typer.echo("No tunnels configured.")
        return

    def _state_display(row):
        s = row["state"]
        if row.get("stale"):
            s += " [STALE]"
        return s

    def _live_display(row):
        alive = row.get("pid_alive")
        if alive is True:
            return "yes"
        elif alive is False:
            return "no"
        return "\u2014"

    headers = ["TUNNEL", "STATE", "ACTOR", "HOST", "PID", "LIVE"]
    col_widths = [
        max(len("TUNNEL"), max((len(row["tunnel"]) for row in rows), default=0)),
        max(len("STATE"), max((len(_state_display(row)) for row in rows), default=0)),
        max(len("ACTOR"), max((len(str(row.get("actor", "") or "")) for row in rows), default=0)),
        max(len("HOST"), max((len(str(row.get("host", "") or "")) for row in rows), default=0)),
        max(len("PID"), max((len(str(row["pid"] or "")) for row in rows), default=0)),
        max(len("LIVE"), max((len(_live_display(row)) for row in rows), default=0)),
    ]

    def _fmt_row(vals):
        return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))

    typer.echo(_fmt_row(headers))
    typer.echo(_fmt_row(["-" * w for w in col_widths]))
    for row in rows:
        typer.echo(_fmt_row([
            row["tunnel"],
            _state_display(row),
            row["actor"],
            row["host"],
            str(row["pid"] or ""),
            _live_display(row),
        ]))


@app.command()
def logs(
    tunnel: str = typer.Argument(..., help="Tunnel name"),
    lines: int = typer.Option(50, "--lines", "-n", help="Number of lines to show"),
    follow: bool = typer.Option(False, "--follow", "-f", help="Follow the log"),
):
    """Show audit log for a tunnel."""
    cfg = _load_or_exit()
    _resolve_tunnel(cfg, tunnel)  # validate name

    sd = _state_dir()
    logger = AuditLogger(state_dir=sd)
    events = logger.read_events(tunnel)

    if not events:
        typer.echo(f"No log entries for tunnel '{tunnel}'.")
        return

    for entry in events[-lines:]:
        ts = entry.get("timestamp", "")
        event = entry.get("event", "")
        actor = entry.get("actor", "")
        detail = entry.get("detail", "")
        parts = [ts, event, f"actor={actor}"]
        if detail:
            parts.append(detail)
        typer.echo(" ".join(parts))

    if follow:
        import time
        log_path = sd / f"{tunnel}.log"
        try:
            with log_path.open() as f:
                f.seek(0, 2)
                while True:
                    line = f.readline()
                    if line:
                        try:
                            entry = json.loads(line)
                            ts = entry.get("timestamp", "")
                            event = entry.get("event", "")
                            actor = entry.get("actor", "")
                            detail = entry.get("detail", "")
                            parts = [ts, event, f"actor={actor}"]
                            if detail:
                                parts.append(detail)
                            typer.echo(" ".join(parts))
                        except json.JSONDecodeError:
                            pass
                    else:
                        time.sleep(0.5)
        except KeyboardInterrupt:
            pass


@app.command()
def check(
    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
    """End-to-end diagnostics: verify SSH PID alive and remote port listening."""
    cfg = _load_or_exit()
    sd = _state_dir()
    state_mgr = StateManager(state_dir=sd)

    if tunnel:
        results = [check_tunnel(_resolve_tunnel(cfg, tunnel), state_mgr)]
    else:
        results = check_all_tunnels(cfg, state_mgr)

    if as_json:
        typer.echo(json.dumps(
            [{**dataclasses.asdict(r), "ok": r.ok} for r in results],
            indent=2,
        ))
    else:
        _print_check_table(results)

    if any(not r.ok for r in results):
        raise typer.Exit(1)


def _print_check_table(results):
    if not results:
        typer.echo("No tunnels configured.")
        return
    headers = ["TUNNEL", "SSH", "PID", "PORT", "API", "OK"]
    rows_data = []
    for r in results:
        rows_data.append([
            r.tunnel,
            r.ssh_process,
            str(r.pid or ""),
            r.remote_port,
            r.local_api or "\u2014",
            "yes" if r.ok else "no",
        ])
    col_widths = [
        max(len(h), max((len(row[i]) for row in rows_data), default=0))
        for i, h in enumerate(headers)
    ]

    def _fmt(vals):
        return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))

    typer.echo(_fmt(headers))
    typer.echo(_fmt(["-" * w for w in col_widths]))
    for row in rows_data:
        typer.echo(_fmt(row))


@app.command("cert-status")
|
||||
def cert_status(
|
||||
tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
|
||||
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
|
||||
):
|
||||
"""Show certificate status for tunnels using cert_command mode."""
|
||||
cfg = _load_or_exit()
|
||||
sd = _state_dir()
|
||||
|
||||
names = [tunnel] if tunnel else list(cfg.tunnels.keys())
|
||||
rows = []
|
||||
any_expired = False
|
||||
|
||||
for name in names:
|
||||
cert_file = sd / f"{name}-cert.pub"
|
||||
if not cert_file.exists():
|
||||
rows.append({"tunnel": name, "mode": "static-key", "cert_file": None})
|
||||
continue
|
||||
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["ssh-keygen", "-L", "-f", str(cert_file)],
|
||||
capture_output=True, text=True, check=False,
|
||||
)
|
||||
info = {"tunnel": name, "mode": "cert", "cert_file": str(cert_file)}
|
||||
for line in result.stdout.splitlines():
|
||||
line = line.strip()
|
||||
if line.startswith("Key ID:"):
|
||||
info["key_id"] = line.split(":", 1)[1].strip().strip('"')
|
||||
elif line.startswith("Valid:"):
|
||||
parts = line.split()
|
||||
if len(parts) >= 5 and parts[1] == "from" and parts[3] == "to":
|
||||
info["valid_from"] = parts[2]
|
||||
info["valid_until"] = parts[4]
|
||||
try:
|
||||
expires = datetime.fromisoformat(parts[4])
|
||||
now = datetime.now()
|
||||
remaining = expires - now
|
||||
if remaining.total_seconds() <= 0:
|
||||
info["expired"] = True
|
||||
any_expired = True
|
||||
else:
|
||||
info["expired"] = False
|
||||
mins = int(remaining.total_seconds() // 60)
|
||||
info["ttl_remaining"] = f"{mins}m"
|
||||
except ValueError:
|
||||
pass
|
||||
rows.append(info)
|
||||
except FileNotFoundError:
|
||||
rows.append({"tunnel": name, "mode": "cert", "error": "ssh-keygen not found"})
|
||||
|
||||
if as_json:
|
||||
typer.echo(json.dumps(rows, indent=2))
|
||||
else:
|
||||
for row in rows:
|
||||
mode = row.get("mode", "unknown")
|
||||
if mode == "static-key":
|
||||
typer.echo(f"{row['tunnel']} static-key / no cert")
|
||||
elif "error" in row:
|
||||
typer.echo(f"{row['tunnel']} ERROR: {row['error']}")
|
||||
else:
|
||||
parts = [row["tunnel"]]
|
||||
if "key_id" in row:
|
||||
parts.append(f"id={row['key_id']}")
|
||||
if "valid_from" in row:
|
||||
parts.append(f"from={row['valid_from']}")
|
||||
if "valid_until" in row:
|
||||
parts.append(f"until={row['valid_until']}")
|
||||
if row.get("expired"):
|
||||
parts.append("EXPIRED")
|
||||
elif "ttl_remaining" in row:
|
||||
parts.append(f"ttl={row['ttl_remaining']}")
|
||||
typer.echo(" ".join(parts))
|
||||
|
||||
if any_expired:
|
||||
raise typer.Exit(1)
|
||||
|
||||
|
||||
# ─── targets commands ─────────────────────────────────────────────────────────
|
||||
|
||||
@targets_app.callback(invoke_without_command=True)
|
||||
def targets_default(
|
||||
ctx: typer.Context,
|
||||
domain: Optional[str] = typer.Option(None, "--domain", help="Filter by domain"),
|
||||
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
|
||||
):
|
||||
"""List infrastructure targets from the OpsCatalog."""
|
||||
if ctx.invoked_subcommand is not None:
|
||||
return
|
||||
cfg = _load_or_exit()
|
||||
cat = _load_catalog_or_exit(cfg)
|
||||
|
||||
rows = []
|
||||
for t in cat.targets.values():
|
||||
if domain and t.domain != domain:
|
||||
continue
|
||||
rows.append({
|
||||
"domain": t.domain,
|
||||
"target": t.id,
|
||||
"kind": t.kind,
|
||||
"description": t.description,
|
||||
"bridges": t.reachable_via,
|
||||
})
|
||||
|
||||
if as_json:
|
||||
typer.echo(json.dumps(rows, indent=2))
|
||||
else:
|
||||
if not rows:
|
||||
typer.echo("No targets found.")
|
||||
return
|
||||
headers = ["DOMAIN", "TARGET", "KIND", "BRIDGES"]
|
||||
col_widths = [
|
||||
max(len(h), max((len(str(r.get(h.lower(), "") or "")) for r in rows), default=0))
|
||||
for h in headers
|
||||
]
|
||||
|
||||
def _fmt(vals):
|
||||
return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))
|
||||
|
||||
typer.echo(_fmt(headers))
|
||||
typer.echo(_fmt(["-" * w for w in col_widths]))
|
||||
for row in rows:
|
||||
typer.echo(_fmt([
|
||||
row["domain"],
|
||||
row["target"],
|
||||
row["kind"],
|
||||
", ".join(row["bridges"]),
|
||||
]))
|
||||
|
||||
|
||||
@targets_app.command("show")
|
||||
def targets_show(
|
||||
target: str = typer.Argument(..., help="Target ID"),
|
||||
):
|
||||
"""Show full metadata for a target."""
|
||||
cfg = _load_or_exit()
|
||||
cat = _load_catalog_or_exit(cfg)
|
||||
|
||||
if target not in cat.targets:
|
||||
typer.echo(f"Error: target '{target}' not found in catalog", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
t = cat.targets[target]
|
||||
typer.echo(f"Target: {t.id}")
|
||||
typer.echo(f"Domain: {t.domain}")
|
||||
typer.echo(f"Kind: {t.kind}")
|
||||
if t.description:
|
||||
typer.echo(f"Description: {t.description}")
|
||||
if t.reachable_via:
|
||||
typer.echo(f"Bridges: {', '.join(t.reachable_via)}")
|
||||
|
||||
# Show ops notes from docs/ if available
|
||||
if cfg.catalog_path:
|
||||
docs_dir = cfg.catalog_path / "domains" / t.domain / "docs"
|
||||
if docs_dir.exists():
|
||||
for md_file in sorted(docs_dir.glob("*.md")):
|
||||
typer.echo(f"\n--- {md_file.name} ---")
|
||||
typer.echo(md_file.read_text())
|
||||
|
||||
|
||||
# ─── catalog commands ─────────────────────────────────────────────────────────
|
||||
|
||||
@catalog_app.callback(invoke_without_command=True)
|
||||
def catalog_default(ctx: typer.Context):
|
||||
"""Inspect and validate the OpsCatalog."""
|
||||
if ctx.invoked_subcommand is None:
|
||||
typer.echo(ctx.get_help())
|
||||
|
||||
|
||||
@catalog_app.command("list")
|
||||
def catalog_list(
|
||||
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
|
||||
):
|
||||
"""List all domains with target and bridge counts."""
|
||||
cfg = _load_or_exit()
|
||||
cat = _load_catalog_or_exit(cfg)
|
||||
|
||||
rows = []
|
||||
for domain in cat.domains.values():
|
||||
target_count = sum(1 for t in cat.targets.values() if t.domain == domain.id)
|
||||
bridge_count = sum(1 for b in cat.bridges.values() if b.domain == domain.id)
|
||||
rows.append({
|
||||
"domain": domain.id,
|
||||
"name": domain.name,
|
||||
"environment": domain.environment,
|
||||
"targets": target_count,
|
||||
"bridges": bridge_count,
|
||||
})
|
||||
|
||||
if as_json:
|
||||
typer.echo(json.dumps(rows, indent=2))
|
||||
else:
|
||||
if not rows:
|
||||
typer.echo("Catalog is empty.")
|
||||
return
|
||||
headers = ["DOMAIN", "NAME", "ENV", "TARGETS", "BRIDGES"]
|
||||
# Column widths: the wider of each header and its longest cell
|
||||
col_widths = [
|
||||
max(len("DOMAIN"), max((len(r["domain"]) for r in rows), default=0)),
|
||||
max(len("NAME"), max((len(r["name"]) for r in rows), default=0)),
|
||||
max(len("ENV"), max((len(r["environment"]) for r in rows), default=0)),
|
||||
max(len("TARGETS"), max((len(str(r["targets"])) for r in rows), default=0)),
|
||||
max(len("BRIDGES"), max((len(str(r["bridges"])) for r in rows), default=0)),
|
||||
]
|
||||
|
||||
def _fmt(vals):
|
||||
return " ".join(str(v).ljust(w) for v, w in zip(vals, col_widths))
|
||||
|
||||
typer.echo(_fmt(headers))
|
||||
typer.echo(_fmt(["-" * w for w in col_widths]))
|
||||
for row in rows:
|
||||
typer.echo(_fmt([
|
||||
row["domain"], row["name"], row["environment"],
|
||||
str(row["targets"]), str(row["bridges"]),
|
||||
]))
|
||||
|
||||
|
||||
@catalog_app.command("validate")
|
||||
def catalog_validate():
|
||||
"""Validate catalog for consistency errors."""
|
||||
from bridge.catalog.validator import validate_catalog
|
||||
|
||||
cfg = _load_or_exit()
|
||||
cat = _load_catalog_or_exit(cfg)
|
||||
|
||||
errors = validate_catalog(cat)
|
||||
if errors:
|
||||
typer.echo(f"Catalog has {len(errors)} violation(s):")
|
||||
for err in errors:
|
||||
typer.echo(f" - {err}")
|
||||
raise typer.Exit(1)
|
||||
else:
|
||||
typer.echo(f"Catalog OK — {len(cat.domains)} domain(s), {len(cat.targets)} target(s), {len(cat.bridges)} bridge(s).")
|
||||
|
||||
|
||||
@catalog_app.command("show")
|
||||
def catalog_show(
|
||||
bridge_id: str = typer.Argument(..., help="Bridge ID"),
|
||||
):
|
||||
"""Show full metadata for a bridge."""
|
||||
cfg = _load_or_exit()
|
||||
cat = _load_catalog_or_exit(cfg)
|
||||
|
||||
if bridge_id not in cat.bridges:
|
||||
typer.echo(f"Error: bridge '{bridge_id}' not found in catalog", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
b = cat.bridges[bridge_id]
|
||||
typer.echo(f"Bridge: {b.id}")
|
||||
typer.echo(f"Domain: {b.domain}")
|
||||
typer.echo(f"Target: {b.target}")
|
||||
typer.echo(f"Host: {b.host}")
|
||||
typer.echo(f"Ports: {b.remote_port} -> {b.local_port}")
|
||||
typer.echo(f"SSH user: {b.ssh_user}")
|
||||
typer.echo(f"Actor: {b.actor}")
|
||||
typer.echo(f"Method: {b.access_method}")
|
||||
if b.description:
|
||||
typer.echo(f"Description: {b.description}")
|
||||
if b.health_check:
|
||||
typer.echo(f"Health: {b.health_check.url} (every {b.health_check.interval_seconds}s)")
|
||||
|
||||
# Domain context
|
||||
if b.domain in cat.domains:
|
||||
d = cat.domains[b.domain]
|
||||
typer.echo(f"\nDomain context: {d.name} [{d.environment}]")
|
||||
|
||||
# Target context
|
||||
if b.target in cat.targets:
|
||||
t = cat.targets[b.target]
|
||||
typer.echo(f"Target: {t.description or t.id} ({t.kind})")
|
||||
|
||||
|
||||
_CONVENTIONS_TEXT = """\
|
||||
Actor Naming Conventions (from AccessManagementDirective.md §2)
|
||||
|
||||
Every actor declared under `actors:` in ~/.config/bridge/tunnels.yaml must have
|
||||
a `class` field, and the actor name must start with the class-specific prefix:
|
||||
|
||||
class   prefix   purpose
-----   ------   ------------------------------------------------------------
adm     adm-     Human operator (interactive shell when needed)
agt     agt-     LLM-powered autonomous agent (Claude Code, etc.)
atm     atm-     Deterministic script / cron job / pipeline
|
||||
|
||||
Legacy class aliases (deprecated, still accepted with a warning):
|
||||
human -> adm
|
||||
automation -> atm
|
||||
|
||||
Examples:
|
||||
adm-bernd: { class: adm, description: Bernd Worsch }
|
||||
agt-claude-coulombcore: { class: agt, description: Claude Code on CoulombCore }
|
||||
atm-backup-daily: { class: atm, description: Nightly DB backup }
|
||||
|
||||
Full specification:
|
||||
<ops-bridge repo>/wiki/AccessManagementDirective.md
|
||||
"""
|
||||
|
||||
|
||||
@maintenance_app.command("cleanup")
|
||||
def maintenance_cleanup(
|
||||
tunnel: Optional[str] = typer.Argument(
|
||||
None,
|
||||
help="Tunnel name (omit for all reverse tunnels)",
|
||||
),
|
||||
restart: bool = typer.Option(
|
||||
False,
|
||||
"--restart",
|
||||
help="Restart tunnels after clearing stale remote bindings",
|
||||
),
|
||||
as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
|
||||
):
|
||||
"""Clear stale SSH remote port forwards that block tunnel reconnects."""
|
||||
cfg = _load_or_exit()
|
||||
sd = _state_dir()
|
||||
state_mgr = StateManager(state_dir=sd)
|
||||
|
||||
try:
|
||||
report = cleanup_all_tunnels(
|
||||
cfg,
|
||||
state_mgr,
|
||||
restart=restart,
|
||||
tunnel_name=tunnel,
|
||||
)
|
||||
except KeyError:
|
||||
typer.echo(f"Error: tunnel '{tunnel}' not found in config", err=True)
|
||||
raise typer.Exit(1)
|
||||
|
||||
if as_json:
|
||||
payload = {
|
||||
"cleaned_count": report.cleaned_count,
|
||||
"actions": [
|
||||
{"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
|
||||
for a in report.actions
|
||||
],
|
||||
}
|
||||
typer.echo(json.dumps(payload, indent=2))
|
||||
return
|
||||
|
||||
if not report.actions:
|
||||
typer.echo("No reverse tunnels configured.")
|
||||
return
|
||||
|
||||
for action in report.actions:
|
||||
typer.echo(f"{action.tunnel}: {action.action} — {action.detail}")
|
||||
typer.echo(f"done ({report.cleaned_count} cleaned)")
|
||||
|
||||
|
||||
@maintenance_app.command("install-cron")
|
||||
def maintenance_install_cron():
|
||||
"""Install a 03:00 daily cron job for `bridge maintenance cleanup --restart`."""
|
||||
installed, message = install_cleanup_cron()
|
||||
if installed:
|
||||
typer.echo("Installed nightly cleanup cron:")
|
||||
typer.echo(f" {message}")
|
||||
else:
|
||||
typer.echo(message)
|
||||
raise typer.Exit(2)
|
||||
|
||||
|
||||
@maintenance_app.command("uninstall-cron")
|
||||
def maintenance_uninstall_cron():
|
||||
"""Remove the nightly cleanup cron job."""
|
||||
removed, message = uninstall_cleanup_cron()
|
||||
if removed:
|
||||
typer.echo(message)
|
||||
else:
|
||||
typer.echo(message)
|
||||
raise typer.Exit(2)
|
||||
|
||||
|
||||
@maintenance_app.command("show-cron")
|
||||
def maintenance_show_cron():
|
||||
"""Show the configured nightly cleanup cron line."""
|
||||
existing = read_installed_cron()
|
||||
if existing:
|
||||
typer.echo(existing)
|
||||
else:
|
||||
typer.echo("Nightly cleanup cron is not installed.")
|
||||
typer.echo("Would install:")
|
||||
typer.echo(f" {build_cron_line()}")
|
||||
|
||||
|
||||
@app.command()
|
||||
def conventions():
|
||||
"""Show the actor naming conventions enforced by tunnels.yaml."""
|
||||
typer.echo(_CONVENTIONS_TEXT)
|
||||
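The `check`, `targets` and `catalog list` commands above all print tables the same way: each column is as wide as the longer of its header and its widest cell, and every value is left-justified to that width. A minimal standalone sketch of that pattern follows. `format_table` is a hypothetical helper name, not a function in this repository, and the two-space column separator is an assumption.

```python
from typing import List, Sequence


def format_table(headers: Sequence[str], rows: Sequence[Sequence[object]]) -> List[str]:
    """Return aligned table lines: header, dashed rule, then one line per row."""
    # Stringify every cell once so widths and output agree.
    cells = [[str(v) for v in row] for row in rows]
    # Each column is as wide as its header or its widest cell, whichever is longer.
    widths = [
        max(len(h), max((len(row[i]) for row in cells), default=0))
        for i, h in enumerate(headers)
    ]

    def fmt(vals: Sequence[str]) -> str:
        # Left-justify each value to its column width (separator is an assumption).
        return "  ".join(v.ljust(w) for v, w in zip(vals, widths))

    return [fmt(headers), fmt(["-" * w for w in widths])] + [fmt(r) for r in cells]


if __name__ == "__main__":
    sample = [["db-prod", 4242, "yes"], ["web", "", "no"]]
    for line in format_table(["TUNNEL", "PID", "OK"], sample):
        print(line)
```

Factoring the width calculation into one helper like this would let the three commands share it instead of each redefining a local `_fmt`.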
165
src/bridge/config.py
Normal file
@@ -0,0 +1,165 @@
|
||||
"""Config loading for OpsBridge."""
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import warnings
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Dict, Optional
|
||||
|
||||
import yaml
|
||||
|
||||
from bridge.models import ActorInfo, ActorType, HealthCheckConfig, ReconnectPolicy, TunnelConfig
|
||||
|
||||
|
||||
class ConfigError(Exception):
|
||||
"""Raised when config is invalid or missing."""
|
||||
|
||||
|
||||
@dataclass
|
||||
class BridgeConfig:
|
||||
tunnels: Dict[str, TunnelConfig]
|
||||
actors: Dict[str, ActorInfo]
|
||||
catalog_path: Optional[Path] = None
|
||||
|
||||
|
||||
def _default_config_path() -> Path:
|
||||
return Path.home() / ".config" / "bridge" / "tunnels.yaml"
|
||||
|
||||
|
||||
def load_config() -> BridgeConfig:
|
||||
"""Load and validate tunnels.yaml. Respects BRIDGE_CONFIG env var."""
|
||||
path = Path(os.environ.get("BRIDGE_CONFIG", str(_default_config_path())))
|
||||
|
||||
if not path.exists():
|
||||
raise ConfigError(f"Config file not found: {path}")
|
||||
|
||||
try:
|
||||
with path.open() as f:
|
||||
raw = yaml.safe_load(f)
|
||||
except yaml.YAMLError as e:
|
||||
raise ConfigError(f"Invalid YAML in {path}: {e}") from e
|
||||
|
||||
if not isinstance(raw, dict):
|
||||
raise ConfigError(f"Config must be a YAML mapping, got: {type(raw).__name__}")
|
||||
|
||||
tunnels = _parse_tunnels(raw.get("tunnels") or {})
|
||||
actors = _parse_actors(raw.get("actors") or {})
|
||||
|
||||
catalog_path = None
|
||||
if "catalog_path" in raw and raw["catalog_path"]:
|
||||
catalog_path = Path(os.path.expanduser(str(raw["catalog_path"])))
|
||||
|
||||
return BridgeConfig(tunnels=tunnels, actors=actors, catalog_path=catalog_path)
|
||||
|
||||
|
||||
def _parse_tunnels(raw: dict) -> Dict[str, TunnelConfig]:
|
||||
tunnels = {}
|
||||
for name, data in raw.items():
|
||||
if not isinstance(data, dict):
|
||||
raise ConfigError(f"Tunnel '{name}' must be a mapping")
|
||||
tunnels[name] = _parse_tunnel(name, data)
|
||||
return tunnels
|
||||
|
||||
|
||||
def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
|
||||
required = ["host", "remote_port", "local_port", "ssh_user", "ssh_key", "actor"]
|
||||
for field in required:
|
||||
if field not in data:
|
||||
raise ConfigError(f"Tunnel '{name}' missing required field: {field}")
|
||||
|
||||
reconnect = ReconnectPolicy()
|
||||
if "reconnect" in data and data["reconnect"]:
|
||||
r = data["reconnect"]
|
||||
reconnect = ReconnectPolicy(
|
||||
max_attempts=r.get("max_attempts", 0),
|
||||
backoff_initial=r.get("backoff_initial", 5),
|
||||
backoff_max=r.get("backoff_max", 60),
|
||||
)
|
||||
|
||||
health_check = None
|
||||
if "health_check" in data and data["health_check"]:
|
||||
hc = data["health_check"]
|
||||
if "url" not in hc:
|
||||
raise ConfigError(f"Tunnel '{name}' health_check missing required field: url")
|
||||
health_check = HealthCheckConfig(
|
||||
url=hc["url"],
|
||||
interval_seconds=hc.get("interval_seconds", 30),
|
||||
timeout_seconds=hc.get("timeout_seconds", 5),
|
||||
)
|
||||
|
||||
direction = str(data.get("direction", "reverse"))
|
||||
if direction not in ("reverse", "local"):
|
||||
raise ConfigError(f"Tunnel '{name}' direction must be 'reverse' or 'local', got: {direction!r}")
|
||||
|
||||
cert_command = data.get("cert_command") or None
|
||||
if cert_command is not None:
|
||||
cert_command = str(cert_command)
|
||||
|
||||
return TunnelConfig(
|
||||
name=name,
|
||||
host=str(data["host"]),
|
||||
remote_port=int(data["remote_port"]),
|
||||
local_port=int(data["local_port"]),
|
||||
ssh_user=str(data["ssh_user"]),
|
||||
ssh_key=str(data["ssh_key"]),
|
||||
actor=str(data["actor"]),
|
||||
reconnect=reconnect,
|
||||
health_check=health_check,
|
||||
direction=direction,
|
||||
cert_command=cert_command,
|
||||
)
|
||||
|
||||
|
||||
_LEGACY_CLASS_MAP = {
|
||||
"human": ActorType.ADM,
|
||||
"automation": ActorType.ATM,
|
||||
}
|
||||
|
||||
_ACTOR_TYPE_PREFIXES = {
|
||||
ActorType.ADM: "adm-",
|
||||
ActorType.AGT: "agt-",
|
||||
ActorType.ATM: "atm-",
|
||||
}
|
||||
|
||||
|
||||
def _parse_actor_type(name: str, raw_class: str) -> ActorType:
|
||||
if raw_class in _LEGACY_CLASS_MAP:
|
||||
warnings.warn(
|
||||
f"Actor '{name}': class '{raw_class}' is deprecated; "
|
||||
f"use '{_LEGACY_CLASS_MAP[raw_class].value}' instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=4,
|
||||
)
|
||||
return _LEGACY_CLASS_MAP[raw_class]
|
||||
try:
|
||||
return ActorType(raw_class)
|
||||
except ValueError:
|
||||
raise ConfigError(
|
||||
f"Actor '{name}' has unknown class '{raw_class}'; "
|
||||
f"must be one of: adm, agt, atm (or legacy: human, automation). "
|
||||
f"Run `bridge conventions` for the full naming rules."
|
||||
) from None
|
||||
|
||||
|
||||
def _parse_actors(raw: dict) -> Dict[str, ActorInfo]:
|
||||
actors = {}
|
||||
for name, data in raw.items():
|
||||
if not isinstance(data, dict):
|
||||
raise ConfigError(f"Actor '{name}' must be a mapping")
|
||||
if "class" not in data:
|
||||
raise ConfigError(f"Actor '{name}' missing required field: class")
|
||||
actor_type = _parse_actor_type(name, str(data["class"]))
|
||||
required_prefix = _ACTOR_TYPE_PREFIXES[actor_type]
|
||||
if not name.startswith(required_prefix):
|
||||
raise ConfigError(
|
||||
f"Actor '{name}' has type '{actor_type.value}' but name must start "
|
||||
f"with '{required_prefix}' (got '{name}'). "
|
||||
f"Run `bridge conventions` for the full naming rules."
|
||||
)
|
||||
actors[name] = ActorInfo(
|
||||
name=name,
|
||||
actor_type=actor_type,
|
||||
description=str(data.get("description", "")),
|
||||
)
|
||||
return actors
|
||||
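The actor rules enforced by `_parse_actor_type` and `_parse_actors` can be exercised without a YAML file. The sketch below is a simplified standalone version, assuming plain strings instead of the `ActorType` enum and `ValueError` instead of `ConfigError`. `normalize_actor_class` is a hypothetical name used only for illustration.

```python
import warnings

# Class -> required name prefix, mirroring the table in `bridge conventions`.
PREFIXES = {"adm": "adm-", "agt": "agt-", "atm": "atm-"}
# Deprecated class names that are still accepted, with a warning.
LEGACY_ALIASES = {"human": "adm", "automation": "atm"}


def normalize_actor_class(name: str, raw_class: str) -> str:
    """Return the canonical class for an actor, or raise ValueError."""
    if raw_class in LEGACY_ALIASES:
        warnings.warn(
            f"Actor '{name}': class '{raw_class}' is deprecated; "
            f"use '{LEGACY_ALIASES[raw_class]}' instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        raw_class = LEGACY_ALIASES[raw_class]
    if raw_class not in PREFIXES:
        raise ValueError(f"Actor '{name}' has unknown class '{raw_class}'")
    if not name.startswith(PREFIXES[raw_class]):
        raise ValueError(
            f"Actor '{name}' has class '{raw_class}' but name must start "
            f"with '{PREFIXES[raw_class]}'"
        )
    return raw_class


if __name__ == "__main__":
    print(normalize_actor_class("adm-bernd", "adm"))
    print(normalize_actor_class("atm-backup-daily", "atm"))
```

Note that a legacy class is mapped first and the prefix check is applied afterwards, so an actor declared with `class: human` still needs an `adm-` name.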
146
src/bridge/diagnostics.py
Normal file
@@ -0,0 +1,146 @@
|
||||
"""End-to-end tunnel diagnostics for OpsBridge."""
|
||||
from __future__ import annotations
|
||||
|
||||
import socket
|
||||
import subprocess
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import httpx
|
||||
|
||||
from bridge.models import BridgeState, TunnelConfig
|
||||
from bridge.state import StateManager, _pid_alive
|
||||
|
||||
|
||||
def _remote_port_probe_command(remote_port: int) -> str:
|
||||
"""Build a portable remote shell probe for a listening TCP port."""
|
||||
return (
|
||||
f"port={remote_port}; "
|
||||
"if command -v ss >/dev/null 2>&1; then "
|
||||
"ss -tnlp 2>/dev/null | grep -q \":$port \" && echo ok || echo closed; "
|
||||
"elif command -v netstat >/dev/null 2>&1; then "
|
||||
"netstat -tnlp 2>/dev/null | "
|
||||
"grep -q \"[.:]$port[[:space:]]\" && echo ok || echo closed; "
|
||||
"else "
|
||||
"hex=$(printf '%04X' \"$port\"); "
|
||||
"awk -v p=\":$hex\" "
|
||||
"'NR > 1 && $4 == \"0A\" && index($2, p) { found = 1 } "
|
||||
"END { print found ? \"ok\" : \"closed\" }' "
|
||||
"/proc/net/tcp /proc/net/tcp6 2>/dev/null; "
|
||||
"fi"
|
||||
)
|
||||
|
||||
|
||||
def _probe_local_port(local_port: int) -> str:
|
||||
"""Check whether the local side of an SSH -L tunnel is accepting TCP."""
|
||||
try:
|
||||
with socket.create_connection(("127.0.0.1", local_port), timeout=5):
|
||||
return "listening"
|
||||
except ConnectionRefusedError:
|
||||
return "closed"
|
||||
except socket.timeout:
|
||||
return "error:timeout"
|
||||
except OSError as e:
|
||||
return f"error:{e}"
|
||||
|
||||
|
||||
@dataclass
|
||||
class TunnelCheckResult:
|
||||
tunnel: str
|
||||
ssh_process: str # "ok" | "dead" | "no_pid"
|
||||
pid: Optional[int]
|
||||
remote_port: str # "listening" | "closed" | "error:<msg>"
|
||||
local_api: Optional[str] # "ok" | "error:<msg>" | None
|
||||
latency_ms: Optional[float]
|
||||
stale_state: bool  # state file says connected/degraded but the SSH process is not running
|
||||
|
||||
@property
|
||||
def ok(self) -> bool:
|
||||
return self.ssh_process == "ok" and self.remote_port == "listening"
|
||||
|
||||
|
||||
def check_tunnel(cfg: TunnelConfig, state_mgr: StateManager) -> TunnelCheckResult:
|
||||
"""Run end-to-end diagnostics for a single tunnel.
|
||||
|
||||
Checks SSH PID liveness, remote port listening via SSH probe, and optional
|
||||
local API health check. Returns a TunnelCheckResult with all findings.
|
||||
"""
|
||||
name = cfg.name
|
||||
|
||||
# 1. PID liveness
|
||||
pid = state_mgr.read_raw_pid(name)
|
||||
if pid is None:
|
||||
ssh_process = "no_pid"
|
||||
elif _pid_alive(pid):
|
||||
ssh_process = "ok"
|
||||
else:
|
||||
ssh_process = "dead"
|
||||
|
||||
# 2. Stale state: state file says connected/degraded but process is dead
|
||||
state = state_mgr.read_state(name)
|
||||
stale_state = (
|
||||
state in (BridgeState.CONNECTED, BridgeState.DEGRADED)
|
||||
and ssh_process != "ok"
|
||||
)
|
||||
|
||||
# 3. Port probe: reverse tunnels listen remotely; local tunnels listen here.
|
||||
if cfg.direction == "local":
|
||||
remote_port = _probe_local_port(cfg.local_port)
|
||||
else:
|
||||
key_path = str(Path(cfg.ssh_key).expanduser())
|
||||
cmd = [
|
||||
"ssh",
|
||||
"-i", key_path,
|
||||
"-o", "BatchMode=yes",
|
||||
"-o", "ConnectTimeout=5",
|
||||
"-o", "StrictHostKeyChecking=accept-new",
|
||||
f"{cfg.ssh_user}@{cfg.host}",
|
||||
_remote_port_probe_command(cfg.remote_port),
|
||||
]
|
||||
try:
|
||||
proc = subprocess.run(
|
||||
cmd,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=10,
|
||||
)
|
||||
output = proc.stdout.strip()
|
||||
if output == "ok":
|
||||
remote_port = "listening"
|
||||
elif output == "closed":
|
||||
remote_port = "closed"
|
||||
else:
|
||||
remote_port = f"error:{proc.stderr.strip() or 'unknown'}"
|
||||
except subprocess.TimeoutExpired:
|
||||
remote_port = "error:timeout"
|
||||
except Exception as e:
|
||||
remote_port = f"error:{e}"
|
||||
|
||||
# 4. Local API health check (optional)
|
||||
local_api: Optional[str] = None
|
||||
latency_ms: Optional[float] = None
|
||||
if cfg.health_check is not None:
|
||||
try:
|
||||
t0 = time.monotonic()
|
||||
resp = httpx.get(cfg.health_check.url, timeout=cfg.health_check.timeout_seconds)
|
||||
latency_ms = (time.monotonic() - t0) * 1000
|
||||
local_api = "ok" if resp.is_success else f"error:http_{resp.status_code}"
|
||||
except Exception as e:
|
||||
local_api = f"error:{e}"
|
||||
|
||||
return TunnelCheckResult(
|
||||
tunnel=name,
|
||||
ssh_process=ssh_process,
|
||||
pid=pid,
|
||||
remote_port=remote_port,
|
||||
local_api=local_api,
|
||||
latency_ms=latency_ms,
|
||||
stale_state=stale_state,
|
||||
)
|
||||
|
||||
|
||||
def check_all_tunnels(cfg, state_mgr: StateManager) -> list[TunnelCheckResult]:
|
||||
"""Run diagnostics for all configured inline tunnels."""
|
||||
return [check_tunnel(tcfg, state_mgr) for tcfg in cfg.tunnels.values()]
|
||||
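The last fallback in `_remote_port_probe_command` reads `/proc/net/tcp` directly: the port is formatted as four uppercase hex digits, and a socket counts as listening when its state column is `0A` (TCP LISTEN). The following Python sketch mirrors that logic on a text sample so it can be checked locally. `port_is_listening` and `SAMPLE` are hypothetical names, and the sketch matches the port as a suffix of the local address, which is slightly stricter than the substring match done by `index()` in the awk script.

```python
def port_is_listening(proc_net_tcp: str, port: int) -> bool:
    """Return True if the /proc/net/tcp text shows a LISTEN socket on `port`."""
    suffix = f":{port:04X}"  # same encoding as printf '%04X' in the shell probe
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields[1] is local_address (HEXIP:HEXPORT), fields[3] is the socket state.
        if len(fields) >= 4 and fields[3] == "0A" and fields[1].endswith(suffix):
            return True
    return False


# Illustrative sample: 127.0.0.1:8080 in LISTEN, 127.0.0.1:22 in ESTABLISHED (01).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345
   1: 0100007F:0016 0100007F:D431 01 00000000:00000000 00:00000000 00000000     0        0 12346
"""

if __name__ == "__main__":
    print(port_is_listening(SAMPLE, 8080))
    print(port_is_listening(SAMPLE, 22))
```

The second row shows why the state check matters: port 22 appears in the table, but only as an established connection, so it is not reported as listening.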
31
src/bridge/health.py
Normal file
@@ -0,0 +1,31 @@
|
||||
"""HTTP health checker for OpsBridge."""
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
import httpx
|
||||
|
||||
|
||||
@dataclass
|
||||
class HealthResult:
|
||||
ok: bool
|
||||
status_code: Optional[int] = None
|
||||
error: Optional[str] = None
|
||||
|
||||
|
||||
class HealthChecker:
|
||||
def __init__(self, url: str, timeout_seconds: int = 5):
|
||||
self._url = url
|
||||
self._timeout = timeout_seconds
|
||||
|
||||
async def check(self) -> HealthResult:
|
||||
try:
|
||||
async with httpx.AsyncClient(timeout=self._timeout) as client:
|
||||
response = await client.get(self._url)
|
||||
response.raise_for_status()
|
||||
return HealthResult(ok=True, status_code=response.status_code)
|
||||
except httpx.HTTPStatusError as e:
|
||||
return HealthResult(ok=False, status_code=e.response.status_code, error=str(e))
|
||||
except Exception as e:
|
||||
return HealthResult(ok=False, error=str(e))
|
||||
380
src/bridge/manager.py
Normal file
@@ -0,0 +1,380 @@
|
||||
"""Tunnel lifecycle manager for OpsBridge."""
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
import os
|
||||
import signal
|
||||
import subprocess
|
||||
import time
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import List, Optional
|
||||
|
||||
from bridge.audit import AuditEvent, AuditLogger
|
||||
from bridge.health import HealthChecker
|
||||
from bridge.models import BridgeState, CertAcquisitionError, TunnelConfig
|
||||
from bridge.state import StateManager
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def _actor_type_from_name(name: str) -> str:
|
||||
for prefix in ("adm", "agt", "atm"):
|
||||
if name.startswith(f"{prefix}-"):
|
||||
return prefix
|
||||
return "unknown"
|
||||
|
||||
|
||||
def build_ssh_command(cfg: TunnelConfig, cert_path: Optional[Path] = None) -> List[str]:
|
||||
"""Build the SSH tunnel command (reverse -R or local -L)."""
|
||||
key = os.path.expanduser(cfg.ssh_key)
|
||||
if cfg.direction == "local":
|
||||
forward_flag = ["-L", f"{cfg.local_port}:127.0.0.1:{cfg.remote_port}"]
|
||||
else:
|
||||
forward_flag = ["-R", f"{cfg.remote_port}:127.0.0.1:{cfg.local_port}"]
|
||||
cmd = [
|
||||
"ssh",
|
||||
"-N",
|
||||
*forward_flag,
|
||||
"-i", key,
|
||||
]
|
||||
if cert_path is not None:
|
||||
cmd += ["-i", str(cert_path)]
|
||||
cmd += [
|
||||
"-o", "ServerAliveInterval=10",
|
||||
"-o", "ServerAliveCountMax=3",
|
||||
"-o", "ExitOnForwardFailure=yes",
|
||||
"-o", "StrictHostKeyChecking=accept-new",
|
||||
f"{cfg.ssh_user}@{cfg.host}",
|
||||
]
|
||||
return cmd
|
||||
|
||||
|
||||
def _run_cert_command(cfg: TunnelConfig, state_dir: Path) -> Optional[Path]:
|
||||
"""Run cert_command and write cert to state dir. Returns cert path or None."""
|
||||
if cfg.cert_command is None:
|
||||
return None
|
||||
result = subprocess.run(
|
||||
cfg.cert_command,
|
||||
shell=True,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
raise CertAcquisitionError(result.stderr.strip())
|
||||
cert_path = state_dir / f"{cfg.name}-cert.pub"
|
||||
cert_path.write_text(result.stdout)
|
||||
return cert_path
|
||||
|
||||
|
||||
def _parse_cert_identity(cert_path: Path) -> Optional[str]:
|
||||
"""Parse Key ID from ssh-keygen -L output."""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["ssh-keygen", "-L", "-f", str(cert_path)],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
for line in result.stdout.splitlines():
|
||||
line = line.strip()
|
||||
if line.startswith("Key ID:"):
|
||||
return line.split(":", 1)[1].strip().strip('"')
|
||||
except Exception:
|
||||
pass
|
||||
return None
|
||||
|
||||
|
||||
def _parse_cert_expiry(cert_path: Path) -> Optional[datetime]:
|
||||
"""Parse Valid-before datetime from ssh-keygen -L output."""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["ssh-keygen", "-L", "-f", str(cert_path)],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
for line in result.stdout.splitlines():
|
||||
line = line.strip()
|
||||
if line.startswith("Valid:"):
|
||||
# "Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00"
|
||||
parts = line.split()
|
||||
if len(parts) >= 5 and parts[3] == "to":
|
||||
return datetime.fromisoformat(parts[4])
|
||||
except Exception:
|
||||
pass
|
||||
return None
|
||||
|
||||
|
||||
class TunnelManager:
|
||||
"""Manages a single named SSH tunnel (reverse -R or local -L).
|
||||
|
||||
start() daemonises: forks a child that runs the reconnect loop, then the
|
||||
parent returns immediately after writing the manager PID.
|
||||
"""
|
||||
|
||||
def __init__(self, cfg: TunnelConfig, state_dir: Optional[Path] = None):
|
||||
self._cfg = cfg
|
||||
self._state = StateManager(state_dir=state_dir)
|
||||
self._audit = AuditLogger(state_dir=state_dir)
|
||||
|
||||
def get_state(self) -> BridgeState:
|
||||
return self._state.read_state(self._cfg.name)
|
||||
|
||||
def is_running(self) -> bool:
|
||||
return self._state.is_running(self._cfg.name)
|
||||
|
||||
def _actor_info(self):
|
||||
actor = self._cfg.actor
|
||||
return actor, _actor_type_from_name(actor)
|
||||
|
||||
def _next_backoff(self, attempt: int) -> int:
|
||||
initial = self._cfg.reconnect.backoff_initial
|
||||
max_b = self._cfg.reconnect.backoff_max
|
||||
value = initial * (2 ** attempt)
|
||||
return min(value, max_b)
|
||||
|
||||
def start(self) -> None:
|
||||
"""Start the tunnel manager as a daemonised subprocess."""
|
||||
if self.is_running():
|
||||
log.info("Tunnel %s already running", self._cfg.name)
|
||||
return
|
||||
|
||||
self._state.write_state(self._cfg.name, BridgeState.STARTING)
|
||||
actor, actor_type = self._actor_info()
|
||||
self._audit.log(
|
||||
tunnel=self._cfg.name,
|
||||
event=AuditEvent.BRIDGE_STARTED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
|
||||
pid = os.fork()
|
||||
if pid > 0:
|
||||
# Parent: record manager PID and return
|
||||
self._state.write_pid(self._cfg.name, pid)
|
||||
return
|
||||
|
||||
# Child: become a daemon
|
||||
os.setsid()
|
||||
|
||||
try:
|
||||
self._run_loop()
|
||||
except Exception as e:
|
||||
log.exception("Tunnel manager loop crashed: %s", e)
|
||||
finally:
|
||||
self._state.write_state(self._cfg.name, BridgeState.STOPPED)
|
||||
self._state.clear_pid(self._cfg.name)
|
||||
self._audit.log(
|
||||
tunnel=self._cfg.name,
|
||||
event=AuditEvent.BRIDGE_STOPPED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
|
||||
os._exit(0)
|
||||
|
||||
def stop(self) -> None:
|
||||
"""Stop the running tunnel manager."""
|
||||
pid = self._state.read_pid(self._cfg.name)
|
||||
if pid is None:
|
||||
self._state.write_state(self._cfg.name, BridgeState.STOPPED)
|
||||
return
|
||||
|
||||
try:
|
||||
os.kill(pid, signal.SIGTERM)
|
||||
# Give up to 5 seconds for graceful shutdown
|
||||
for _ in range(50):
|
||||
try:
|
||||
os.kill(pid, 0)
|
||||
time.sleep(0.1)
|
||||
except ProcessLookupError:
|
||||
break
|
||||
else:
|
||||
# Force kill if still running
|
||||
try:
|
||||
os.kill(pid, signal.SIGKILL)
|
||||
except ProcessLookupError:
|
||||
pass
|
||||
except ProcessLookupError:
|
||||
pass
|
||||
|
||||
self._state.clear_pid(self._cfg.name)
|
||||
self._state.write_state(self._cfg.name, BridgeState.STOPPED)
|
||||
actor, actor_type = self._actor_info()
|
||||
self._audit.log(
|
||||
tunnel=self._cfg.name,
|
||||
event=AuditEvent.BRIDGE_STOPPED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
|
||||
def _run_loop(self) -> None:
|
||||
"""Reconnect loop running in daemon child."""
|
||||
import asyncio
|
||||
|
||||
cfg = self._cfg
|
||||
actor, actor_type = self._actor_info()
|
||||
attempt = 0
|
||||
max_attempts = cfg.reconnect.max_attempts # 0 = infinite
|
||||
state_dir = self._state._dir
|
||||
|
||||
_stop = [False]
|
||||
|
||||
def _on_term(signum, frame):
|
||||
_stop[0] = True
|
||||
|
||||
signal.signal(signal.SIGTERM, _on_term)
|
||||
signal.signal(signal.SIGINT, _on_term)
|
||||
|
||||
while not _stop[0]:
|
||||
if max_attempts > 0 and attempt >= max_attempts:
|
||||
self._state.write_state(cfg.name, BridgeState.FAILED)
|
||||
break
|
||||
|
||||
# Acquire cert before each SSH launch (T3, T7)
|
||||
try:
|
||||
cert_path = _run_cert_command(cfg, state_dir)
|
||||
except CertAcquisitionError as e:
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_DISCONNECTED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
detail=f"cert acquisition failed: {e}",
|
||||
)
|
||||
attempt += 1
|
||||
if max_attempts > 0 and attempt >= max_attempts:
|
||||
self._state.write_state(cfg.name, BridgeState.FAILED)
|
||||
break
|
||||
backoff = self._next_backoff(attempt - 1)
|
||||
self._state.write_state(cfg.name, BridgeState.RECONNECTING)
|
||||
log.info("Cert acquisition failed, retrying in %ds", backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
|
||||
cert_identity = _parse_cert_identity(cert_path) if cert_path else None
|
||||
cert_expires_at = _parse_cert_expiry(cert_path) if cert_path else None
|
||||
|
||||
cmd = build_ssh_command(cfg, cert_path=cert_path)
|
||||
log.info("Starting SSH: %s", " ".join(cmd))
|
||||
self._state.write_state(cfg.name, BridgeState.STARTING)
|
||||
|
||||
try:
|
||||
proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
|
||||
except FileNotFoundError:
|
||||
self._state.write_state(cfg.name, BridgeState.FAILED)
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_DISCONNECTED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
detail="ssh binary not found",
|
||||
)
|
||||
break
|
||||
|
||||
time.sleep(2)
|
||||
_ttl_refresh = False
|
||||
if proc.poll() is None:
|
||||
self._state.write_state(cfg.name, BridgeState.CONNECTED)
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_CONNECTED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
cert_identity=cert_identity,
|
||||
)
|
||||
attempt = 0
|
||||
|
||||
def _check_ttl() -> bool:
|
||||
"""Return True if cert is within 5 min of expiry and SSH should restart."""
|
||||
if cert_expires_at is None:
|
||||
return False
|
||||
return datetime.now() >= cert_expires_at - timedelta(minutes=5)
|
||||
|
||||
if cfg.health_check:
|
||||
checker = HealthChecker(
|
||||
url=cfg.health_check.url,
|
||||
timeout_seconds=cfg.health_check.timeout_seconds,
|
||||
)
|
||||
health_failing = False
|
||||
while not _stop[0] and proc.poll() is None:
|
||||
if _check_ttl():
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.CERT_EXPIRING,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
cert_identity=cert_identity,
|
||||
detail=str(cert_expires_at),
|
||||
)
|
||||
proc.terminate()
|
||||
_ttl_refresh = True
|
||||
break
|
||||
result = asyncio.run(checker.check())
|
||||
if result.ok:
|
||||
if health_failing:
|
||||
health_failing = False
|
||||
self._state.write_state(cfg.name, BridgeState.CONNECTED)
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.HEALTH_CHECK_RECOVERED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
else:
|
||||
if not health_failing:
|
||||
health_failing = True
|
||||
self._state.write_state(cfg.name, BridgeState.DEGRADED)
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.HEALTH_CHECK_FAILED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
detail=result.error or f"HTTP {result.status_code}",
|
||||
)
|
||||
time.sleep(cfg.health_check.interval_seconds)
|
||||
else:
|
||||
while not _stop[0] and proc.poll() is None:
|
||||
if _check_ttl():
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.CERT_EXPIRING,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
cert_identity=cert_identity,
|
||||
detail=str(cert_expires_at),
|
||||
)
|
||||
proc.terminate()
|
||||
_ttl_refresh = True
|
||||
break
|
||||
time.sleep(1)
|
||||
|
||||
if _ttl_refresh:
|
||||
# Planned cert refresh — don't count as failure, no backoff
|
||||
continue
|
||||
|
||||
if proc.poll() is not None:
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_DISCONNECTED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
detail=f"exit code {proc.returncode}",
|
||||
)
|
||||
|
||||
if _stop[0]:
|
||||
if proc.poll() is None:
|
||||
proc.terminate()
|
||||
break
|
||||
|
||||
attempt += 1
|
||||
backoff = self._next_backoff(attempt - 1)
|
||||
self._state.write_state(cfg.name, BridgeState.RECONNECTING)
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_RECONNECTING,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
detail=f"retry {attempt}, backoff {backoff}s",
|
||||
)
|
||||
log.info("Reconnecting in %ds (attempt %d)", backoff, attempt)
|
||||
time.sleep(backoff)
|
||||
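The reconnect delay computed by `_next_backoff` doubles with each failed attempt and is capped at `backoff_max`. The following is a minimal standalone sketch of that schedule, assuming the `ReconnectPolicy` defaults (5 s initial, 60 s cap). The free function `next_backoff` is a hypothetical stand-in for the method, not part of the codebase.

```python
def next_backoff(attempt: int, initial: int = 5, max_b: int = 60) -> int:
    """Delay in seconds before retry number `attempt` (0-based), capped at max_b."""
    return min(initial * (2 ** attempt), max_b)


# Delays for the first six consecutive failures.
schedule = [next_backoff(a) for a in range(6)]
print(schedule)  # [5, 10, 20, 40, 60, 60]
```

Because the loop resets `attempt` to 0 after a successful connection, the schedule restarts from the initial delay on the next disconnect.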
0
src/bridge/mcp_server/__init__.py
Normal file
529
src/bridge/mcp_server/server.py
Normal file
@@ -0,0 +1,529 @@
|
||||
"""OpsBridge MCP server — exposes bridge and catalog operations as FastMCP tools.
|
||||
|
||||
Entry point (stdio):
|
||||
uv run python src/bridge/mcp_server/server.py
|
||||
|
||||
The server imports the Python library directly — no subprocess required.
|
||||
All tool functions return JSON-serialisable dicts/lists.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import dataclasses
|
||||
import json
|
||||
import os
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from fastmcp import FastMCP
|
||||
|
||||
from bridge.diagnostics import check_all_tunnels, check_tunnel
|
||||
from bridge.state import StateManager
|
||||
|
||||
mcp = FastMCP(
|
||||
name="ops-bridge",
|
||||
instructions=(
|
||||
"OpsBridge MCP server. Use bridge_status to check tunnel health, "
|
||||
"bridge_up/down/restart to manage lifecycle, bridge_logs for audit history. "
|
||||
"catalog_* tools require catalog_path to be configured in tunnels.yaml."
|
||||
),
|
||||
)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Helpers
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
def _state_dir() -> Path:
|
||||
return Path(os.environ.get("BRIDGE_STATE_DIR", str(Path.home() / ".local" / "state" / "bridge")))
|
||||
|
||||
|
||||
def _load_cfg():
|
||||
from bridge.config import load_config
|
||||
return load_config()
|
||||
|
||||
|
||||
def _load_cfg_or_error() -> tuple:
|
||||
"""Return (cfg, None) or (None, error_dict)."""
|
||||
try:
|
||||
return _load_cfg(), None
|
||||
except Exception as e:
|
||||
return None, {"error": str(e)}
|
||||
|
||||
|
||||
def _load_catalog(cfg):
|
||||
"""Return (catalog, None) or (None, error_dict)."""
|
||||
if cfg.catalog_path is None:
|
||||
return None, {"error": "catalog_path not configured"}
|
||||
try:
|
||||
from bridge.catalog.loader import load_catalog
|
||||
return load_catalog(cfg.catalog_path), None
|
||||
except Exception as e:
|
||||
return None, {"error": f"Failed to load catalog: {e}"}
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Bridge lifecycle tools
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@mcp.tool()
|
||||
def bridge_up(tunnel: Optional[str] = None) -> dict:
|
||||
"""Start one or all configured tunnels.
|
||||
|
||||
Args:
|
||||
tunnel: Tunnel name to start. If omitted, starts all inline tunnels.
|
||||
|
||||
Returns:
|
||||
{"started": [...], "already_running": [...]} or {"error": "..."}
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return err
|
||||
|
||||
from bridge.manager import TunnelManager
|
||||
sd = _state_dir()
|
||||
started = []
|
||||
already_running = []
|
||||
|
||||
if tunnel:
|
||||
from bridge.catalog.loader import load_catalog
|
||||
from bridge.catalog.resolver import BridgeNotFound, resolve
|
||||
catalog = None
|
||||
if cfg.catalog_path is not None:
|
||||
try:
|
||||
catalog = load_catalog(cfg.catalog_path)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
|
||||
except BridgeNotFound:
|
||||
return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
|
||||
mgr = TunnelManager(tcfg, state_dir=sd)
|
||||
if mgr.is_running():
|
||||
already_running.append(tunnel)
|
||||
else:
|
||||
mgr.start()
|
||||
started.append(tunnel)
|
||||
else:
|
||||
for name, tcfg in cfg.tunnels.items():
|
||||
mgr = TunnelManager(tcfg, state_dir=sd)
|
||||
if mgr.is_running():
|
||||
already_running.append(name)
|
||||
else:
|
||||
mgr.start()
|
||||
started.append(name)
|
||||
|
||||
return {"started": started, "already_running": already_running}
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def bridge_down(tunnel: Optional[str] = None) -> dict:
|
||||
"""Stop one or all configured tunnels.
|
||||
|
||||
Args:
|
||||
tunnel: Tunnel name to stop. If omitted, stops all inline tunnels.
|
||||
|
||||
Returns:
|
||||
{"stopped": [...], "not_running": [...]} or {"error": "..."}
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return err
|
||||
|
||||
from bridge.manager import TunnelManager
|
||||
sd = _state_dir()
|
||||
stopped = []
|
||||
not_running = []
|
||||
|
||||
if tunnel:
|
||||
from bridge.catalog.loader import load_catalog
|
||||
from bridge.catalog.resolver import BridgeNotFound, resolve
|
||||
catalog = None
|
||||
if cfg.catalog_path is not None:
|
||||
try:
|
||||
catalog = load_catalog(cfg.catalog_path)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
|
||||
except BridgeNotFound:
|
||||
return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
|
||||
mgr = TunnelManager(tcfg, state_dir=sd)
|
||||
if not mgr.is_running():
|
||||
not_running.append(tunnel)
|
||||
else:
|
||||
mgr.stop()
|
||||
stopped.append(tunnel)
|
||||
else:
|
||||
for name, tcfg in cfg.tunnels.items():
|
||||
mgr = TunnelManager(tcfg, state_dir=sd)
|
||||
if not mgr.is_running():
|
||||
not_running.append(name)
|
||||
else:
|
||||
mgr.stop()
|
||||
stopped.append(name)
|
||||
|
||||
return {"stopped": stopped, "not_running": not_running}
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def bridge_restart(tunnel: Optional[str] = None) -> dict:
|
||||
"""Restart one or all configured tunnels.
|
||||
|
||||
Reverse tunnels run conditional remote stale-forward cleanup before
|
||||
reconnecting; healthy forwards are left running.
|
||||
|
||||
Args:
|
||||
tunnel: Tunnel name to restart. If omitted, restarts all inline tunnels.
|
||||
|
||||
Returns:
|
||||
{"actions": [{"tunnel", "action", "detail"}, ...]} or {"error": "..."}
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return err
|
||||
|
||||
from bridge.cleanup import restart_all_tunnels, restart_tunnel
|
||||
sd = _state_dir()
|
||||
state_mgr = StateManager(state_dir=sd)
|
||||
|
||||
if tunnel:
|
||||
from bridge.catalog.loader import load_catalog
|
||||
from bridge.catalog.resolver import BridgeNotFound, resolve
|
||||
catalog = None
|
||||
if cfg.catalog_path is not None:
|
||||
try:
|
||||
catalog = load_catalog(cfg.catalog_path)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
|
||||
except BridgeNotFound:
|
||||
return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
|
||||
actions = [restart_tunnel(tcfg, state_mgr)]
|
||||
else:
|
||||
actions = restart_all_tunnels(cfg, state_mgr)
|
||||
|
||||
payload = {
|
||||
"actions": [
|
||||
{"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
|
||||
for a in actions
|
||||
],
|
||||
}
|
||||
if any(a.action == "error" for a in actions):
|
||||
payload["error"] = "one or more tunnels failed to restart"
|
||||
return payload
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def bridge_status() -> list[dict]:
|
||||
"""Return status of all configured tunnels.
|
||||
|
||||
Returns:
|
||||
List of tunnel status dicts, each with keys:
|
||||
tunnel, state, actor, host, pid, uptime, health (uptime and health are currently always None).
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return [err]
|
||||
|
||||
sd = _state_dir()
|
||||
state_mgr = StateManager(state_dir=sd)
|
||||
|
||||
rows = []
|
||||
for name, tcfg in cfg.tunnels.items():
|
||||
state = state_mgr.read_state(name)
|
||||
pid = state_mgr.read_pid(name)
|
||||
rows.append({
|
||||
"tunnel": name,
|
||||
"state": state.value,
|
||||
"actor": tcfg.actor,
|
||||
"host": tcfg.host,
|
||||
"pid": pid,
|
||||
"uptime": None,
|
||||
"health": None,
|
||||
})
|
||||
return rows
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def bridge_logs(tunnel: str, lines: int = 50) -> list[dict]:
|
||||
"""Return recent audit log entries for a tunnel.
|
||||
|
||||
Args:
|
||||
tunnel: Tunnel name.
|
||||
lines: Maximum number of log entries to return (default 50).
|
||||
|
||||
Returns:
|
||||
List of audit event dicts (timestamp, event, actor, detail), or [{"error": "..."}] on failure.
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return [err]
|
||||
|
||||
from bridge.catalog.loader import load_catalog
|
||||
from bridge.catalog.resolver import BridgeNotFound, resolve
|
||||
catalog = None
|
||||
if cfg.catalog_path is not None:
|
||||
try:
|
||||
catalog = load_catalog(cfg.catalog_path)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
|
||||
except BridgeNotFound:
|
||||
return [{"error": f"Tunnel '{tunnel}' not found in config or catalog"}]
|
||||
|
||||
from bridge.audit import AuditLogger
|
||||
sd = _state_dir()
|
||||
logger = AuditLogger(state_dir=sd)
|
||||
events = logger.read_events(tunnel)
|
||||
return events[-lines:] if events and lines > 0 else []  # lines <= 0 yields no entries (events[-0:] would return everything)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Catalog tools
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@mcp.tool()
|
||||
def catalog_list_targets(domain: Optional[str] = None) -> list[dict]:
|
||||
"""List all infrastructure targets from the OpsCatalog.
|
||||
|
||||
Args:
|
||||
domain: Optional domain filter.
|
||||
|
||||
Returns:
|
||||
List of target dicts (id, domain, kind, description, reachable_via).
|
||||
Returns [{"error": "..."}] when catalog is not configured or fails to load.
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return [err]
|
||||
catalog, err = _load_catalog(cfg)
|
||||
if err:
|
||||
return [err]
|
||||
|
||||
targets = []
|
||||
for t in catalog.targets.values():
|
||||
if domain and t.domain != domain:
|
||||
continue
|
||||
targets.append({
|
||||
"id": t.id,
|
||||
"domain": t.domain,
|
||||
"kind": t.kind,
|
||||
"description": t.description or "",
|
||||
"reachable_via": list(t.reachable_via),
|
||||
})
|
||||
return targets
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def catalog_show_target(target_id: str) -> dict:
|
||||
"""Show full metadata for a catalog target.
|
||||
|
||||
Args:
|
||||
target_id: The target identifier.
|
||||
|
||||
Returns:
|
||||
Target metadata dict, or {"error": "..."}.
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return err
|
||||
catalog, err = _load_catalog(cfg)
|
||||
if err:
|
||||
return err
|
||||
|
||||
if target_id not in catalog.targets:
|
||||
return {"error": f"Target '{target_id}' not found"}
|
||||
|
||||
t = catalog.targets[target_id]
|
||||
return {
|
||||
"id": t.id,
|
||||
"domain": t.domain,
|
||||
"kind": t.kind,
|
||||
"description": t.description or "",
|
||||
"reachable_via": list(t.reachable_via),
|
||||
}
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def catalog_list_domains() -> list[dict]:
|
||||
"""List all domains in the OpsCatalog with target and bridge counts.
|
||||
|
||||
Returns:
|
||||
List of domain dicts (id, name, environment, description, target_count, bridge_count).
|
||||
Returns [{"error": "..."}] when catalog is not configured or fails to load.
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return [err]
|
||||
catalog, err = _load_catalog(cfg)
|
||||
if err:
|
||||
return [err]
|
||||
|
||||
domains = []
|
||||
for d in catalog.domains.values():
|
||||
target_count = sum(1 for t in catalog.targets.values() if t.domain == d.id)
|
||||
bridge_count = sum(1 for b in catalog.bridges.values() if b.domain == d.id)
|
||||
domains.append({
|
||||
"id": d.id,
|
||||
"name": d.name,
|
||||
"environment": d.environment,
|
||||
"description": d.description or "",
|
||||
"target_count": target_count,
|
||||
"bridge_count": bridge_count,
|
||||
})
|
||||
return domains
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def catalog_validate() -> dict:
|
||||
"""Validate the OpsCatalog for consistency errors.
|
||||
|
||||
Returns:
|
||||
{"valid": True, "errors": []} or {"valid": False, "errors": ["..."]}
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return {"valid": False, "errors": [err["error"]]}
|
||||
catalog, err = _load_catalog(cfg)
|
||||
if err:
|
||||
return {"valid": False, "errors": [err["error"]]}
|
||||
|
||||
from bridge.catalog.validator import validate_catalog
|
||||
errors = validate_catalog(catalog)
|
||||
if errors:
|
||||
return {"valid": False, "errors": errors}
|
||||
return {"valid": True, "errors": []}
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
def catalog_show_bridge(bridge_id: str) -> dict:
|
||||
"""Show full metadata for a catalog bridge definition.
|
||||
|
||||
Args:
|
||||
bridge_id: The bridge identifier.
|
||||
|
||||
Returns:
|
||||
Bridge metadata dict, or {"error": "..."}.
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return err
|
||||
catalog, err = _load_catalog(cfg)
|
||||
if err:
|
||||
return err
|
||||
|
||||
if bridge_id not in catalog.bridges:
|
||||
return {"error": f"Bridge '{bridge_id}' not found"}
|
||||
|
||||
b = catalog.bridges[bridge_id]
|
||||
result = {
|
||||
"id": b.id,
|
||||
"domain": b.domain,
|
||||
"target": b.target,
|
||||
"host": b.host,
|
||||
"remote_port": b.remote_port,
|
||||
"local_port": b.local_port,
|
||||
"ssh_user": b.ssh_user,
|
||||
"actor": b.actor,
|
||||
"access_method": b.access_method,
|
||||
"description": b.description or "",
|
||||
}
|
||||
if b.health_check:
|
||||
result["health_check"] = {
|
||||
"url": b.health_check.url,
|
||||
"interval_seconds": b.health_check.interval_seconds,
|
||||
"timeout_seconds": b.health_check.timeout_seconds,
|
||||
}
|
||||
return result
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Diagnostics tool
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@mcp.tool()
|
||||
def bridge_check(tunnel: Optional[str] = None) -> list[dict]:
|
||||
"""End-to-end diagnostics: SSH process alive + remote port listening.
|
||||
|
||||
Args:
|
||||
tunnel: Specific tunnel name, or None for all inline tunnels.
|
||||
|
||||
Returns:
|
||||
List of dicts with keys: tunnel, ssh_process, pid, remote_port,
|
||||
local_api, latency_ms, stale_state, ok.
|
||||
Returns [{"error": "..."}] on config load failure.
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return [err]
|
||||
sd = _state_dir()
|
||||
state_mgr = StateManager(state_dir=sd)
|
||||
|
||||
if tunnel:
|
||||
from bridge.catalog.loader import load_catalog
|
||||
from bridge.catalog.resolver import BridgeNotFound, resolve
|
||||
catalog = None
|
||||
if cfg.catalog_path is not None:
|
||||
try:
|
||||
catalog = load_catalog(cfg.catalog_path)
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
|
||||
except BridgeNotFound:
|
||||
return [{"error": f"Tunnel '{tunnel}' not found in config or catalog"}]
|
||||
results = [check_tunnel(tcfg, state_mgr)]
|
||||
else:
|
||||
results = check_all_tunnels(cfg, state_mgr)
|
||||
|
||||
return [{**dataclasses.asdict(r), "ok": r.ok} for r in results]
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# MCP resources
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
@mcp.resource("bridge://status")
|
||||
def resource_bridge_status() -> str:
|
||||
"""Live snapshot of all tunnel states as JSON."""
|
||||
rows = bridge_status()
|
||||
return json.dumps(rows, indent=2)
|
||||
|
||||
|
||||
@mcp.resource("bridge://check")
|
||||
def resource_bridge_check() -> str:
|
||||
"""Live end-to-end diagnostic snapshot for all tunnels."""
|
||||
return json.dumps(bridge_check(), indent=2)
|
||||
|
||||
|
||||
@mcp.resource("catalog://domains")
|
||||
def resource_catalog_domains() -> str:
|
||||
"""List of all catalog domains as JSON."""
|
||||
domains = catalog_list_domains()
|
||||
return json.dumps(domains, indent=2)
|
||||
|
||||
|
||||
@mcp.resource("catalog://targets")
|
||||
def resource_catalog_targets() -> str:
|
||||
"""List of all catalog targets as JSON."""
|
||||
targets = catalog_list_targets()
|
||||
return json.dumps(targets, indent=2)
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Entry point
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
if __name__ == "__main__":
|
||||
import argparse
|
||||
parser = argparse.ArgumentParser(description="OpsBridge MCP server")
|
||||
parser.add_argument("--http", action="store_true", help="Run in SSE/HTTP mode instead of stdio")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.http:
|
||||
port = int(os.environ.get("BRIDGE_MCP_PORT", "8002"))
|
||||
mcp.run(transport="sse", host="127.0.0.1", port=port)
|
||||
else:
|
||||
mcp.run(transport="stdio")
|
||||
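The tools above share one error convention: helpers return a `(value, error)` pair, dict-returning tools pass the error dict through, and list-returning tools wrap it in a one-element list. The sketch below illustrates that convention in isolation. `load_or_error`, `failing_loader`, and `list_tool` are hypothetical names used only for illustration, and the failure message is made up.

```python
from typing import Any, Callable, List, Optional, Tuple


def load_or_error(loader: Callable[[], Any]) -> Tuple[Optional[Any], Optional[dict]]:
    """Return (value, None) on success or (None, {"error": message}) on failure."""
    try:
        return loader(), None
    except Exception as e:
        return None, {"error": str(e)}


def failing_loader() -> dict:
    # Hypothetical failure standing in for a missing or invalid config file.
    raise FileNotFoundError("tunnels.yaml not found")


def list_tool(loader: Callable[[], Any]) -> List[dict]:
    """A list-returning tool wraps the error dict so the return type stays a list."""
    cfg, err = load_or_error(loader)
    if err:
        return [err]
    return [{"tunnel": name} for name in cfg["tunnels"]]


cfg, err = load_or_error(lambda: {"tunnels": {"test-tunnel": {}}})
cfg2, err2 = load_or_error(failing_loader)
print(err2)                       # {'error': 'tunnels.yaml not found'}
print(list_tool(failing_loader))  # [{'error': 'tunnels.yaml not found'}]
```

Callers of list-returning tools therefore need to check whether the first element carries an `error` key before treating the result as data.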
61
src/bridge/models.py
Normal file
@@ -0,0 +1,61 @@
"""Domain models for OpsBridge."""
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class BridgeState(str, Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    RECONNECTING = "reconnecting"
    FAILED = "failed"


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


class CertAcquisitionError(Exception):
    """Raised when cert_command fails to produce a certificate."""


@dataclass
class ReconnectPolicy:
    max_attempts: int = 0  # 0 = infinite
    backoff_initial: int = 5
    backoff_max: int = 60


@dataclass
class HealthCheckConfig:
    url: str
    interval_seconds: int = 30
    timeout_seconds: int = 5


@dataclass
class TunnelConfig:
    name: str
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str
    actor: str
    reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
    health_check: Optional[HealthCheckConfig] = None
    direction: str = "reverse"  # "reverse" (-R) or "local" (-L)
    cert_command: Optional[str] = None


@dataclass
class ActorInfo:
    name: str
    actor_type: ActorType
    description: str = ""
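`BridgeState` mixes `str` into `Enum`, so members compare equal to their string values and serialise directly to JSON. That is what lets state files and tool payloads carry plain strings. The sketch below shows the round trip on a trimmed copy of the enum. `parse_state` is a hypothetical helper that mirrors the fallback-to-STOPPED behaviour used when reading state files.

```python
import json
from enum import Enum


class BridgeState(str, Enum):
    STOPPED = "stopped"
    CONNECTED = "connected"
    DEGRADED = "degraded"


def parse_state(text: str) -> BridgeState:
    """Map file contents to a state; unknown or empty text falls back to STOPPED."""
    try:
        return BridgeState(text.strip())
    except ValueError:
        return BridgeState.STOPPED


state = parse_state("connected\n")
print(state is BridgeState.CONNECTED)             # True
print(state == "connected")                       # True, thanks to the str mixin
print(parse_state("garbage"))                     # BridgeState.STOPPED
print(json.dumps({"state": BridgeState.DEGRADED}))  # {"state": "degraded"}
```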
83
src/bridge/state.py
Normal file
@@ -0,0 +1,83 @@
"""State file management for OpsBridge."""
from __future__ import annotations

import os
from pathlib import Path
from typing import Optional

from bridge.models import BridgeState


def _default_state_dir() -> Path:
    return Path.home() / ".local" / "state" / "bridge"


class StateManager:
    def __init__(self, state_dir: Optional[Path] = None):
        self._dir = Path(state_dir) if state_dir else _default_state_dir()

    def _ensure_dir(self) -> None:
        self._dir.mkdir(parents=True, exist_ok=True)

    def _state_path(self, name: str) -> Path:
        return self._dir / f"{name}.state"

    def _pid_path(self, name: str) -> Path:
        return self._dir / f"{name}.pid"

    def read_state(self, name: str) -> BridgeState:
        path = self._state_path(name)
        if not path.exists():
            return BridgeState.STOPPED
        text = path.read_text().strip()
        try:
            return BridgeState(text)
        except ValueError:
            return BridgeState.STOPPED

    def write_state(self, name: str, state: BridgeState) -> None:
        self._ensure_dir()
        self._state_path(name).write_text(state.value)

    def read_pid(self, name: str) -> Optional[int]:
        path = self._pid_path(name)
        if not path.exists():
            return None
        try:
            pid = int(path.read_text().strip())
        except (ValueError, OSError):
            return None
        if _pid_alive(pid):
            return pid
        return None

    def read_raw_pid(self, name: str) -> Optional[int]:
        """Read PID from file without liveness check. Returns None if file absent/invalid."""
        path = self._pid_path(name)
        if not path.exists():
            return None
        try:
            return int(path.read_text().strip())
        except (ValueError, OSError):
            return None

    def write_pid(self, name: str, pid: int) -> None:
        self._ensure_dir()
        self._pid_path(name).write_text(str(pid))

    def clear_pid(self, name: str) -> None:
        path = self._pid_path(name)
        if path.exists():
            path.unlink()

    def is_running(self, name: str) -> bool:
        return self.read_pid(name) is not None


def _pid_alive(pid: int) -> bool:
    """Return True if the process with given PID exists."""
    try:
        os.kill(pid, 0)
        return True
    except (ProcessLookupError, PermissionError):
        return False
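The liveness check relies on the POSIX convention that sending signal 0 performs the existence and permission checks without delivering a signal. The sketch below repeats that check as a standalone function, assuming a POSIX system. `pid_alive` is a hypothetical public copy of `_pid_alive`.

```python
import os


def pid_alive(pid: int) -> bool:
    """Probe a PID with signal 0: nothing is delivered, only the checks run."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        # No process with this PID exists.
        return False
    except PermissionError:
        # A process exists but belongs to another user; treated as "not ours".
        return False


# The current process is always alive and signalable by itself.
print(pid_alive(os.getpid()))  # True
```

Treating `PermissionError` as "not alive" means that a stale PID file whose number has been reused by another user's process is reported as not running.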
0
tests/__init__.py
Normal file
154
tests/conftest.py
Normal file
@@ -0,0 +1,154 @@
|
||||
"""Shared pytest configuration for OpsBridge tests.
|
||||
|
||||
Registers capability and access_mode marks, and provides the
|
||||
collect_capability_coverage() helper used by the cross-mode meta-test.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import textwrap
|
||||
from typing import Iterable
|
||||
|
||||
import pytest
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Shared fixtures
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
VALID_CONFIG = textwrap.dedent("""\
|
||||
tunnels:
|
||||
test-tunnel:
|
||||
host: host.local
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
""")
|
||||
|
||||
VALID_CONFIG_WITH_CATALOG = textwrap.dedent("""\
|
||||
tunnels:
|
||||
test-tunnel:
|
||||
host: host.local
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
catalog_path: {catalog_path}
|
||||
""")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def config_file(tmp_path):
|
||||
f = tmp_path / "tunnels.yaml"
|
||||
f.write_text(VALID_CONFIG)
|
||||
return f
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def state_dir(tmp_path):
|
||||
d = tmp_path / "state"
|
||||
d.mkdir()
|
||||
return d
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def catalog_dir(tmp_path):
|
||||
"""Minimal catalog directory with one domain, target, and bridge."""
|
||||
cat = tmp_path / "catalog"
|
||||
domain_dir = cat / "domains" / "coulombcore"
|
||||
domain_dir.mkdir(parents=True)
|
||||
(domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
|
||||
type: domain
|
||||
id: coulombcore
|
||||
name: CoulombCore Infrastructure
|
||||
description: Core infrastructure domain
|
||||
environment: production
|
||||
"""))
|
||||
targets_dir = domain_dir / "targets"
|
||||
targets_dir.mkdir()
|
||||
(targets_dir / "state-hub.yaml").write_text(textwrap.dedent("""\
|
||||
type: target
|
||||
id: state-hub
        domain: coulombcore
        kind: service
        description: Infrastructure state coordination service
        reachable_via:
          - state-hub-coulombcore
    """))
    bridges_dir = domain_dir / "bridges"
    bridges_dir.mkdir()
    (bridges_dir / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
        type: bridge
        id: state-hub-coulombcore
        domain: coulombcore
        target: state-hub
        description: Bridge to state hub
        access_method: ssh-reverse
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore
        reconnect:
          max_attempts: 0
          backoff_initial: 5
          backoff_max: 60
    """))
    actors_dir = cat / "actors"
    actors_dir.mkdir()
    (actors_dir / "agent.yaml").write_text(textwrap.dedent("""\
        type: actor
        id: agent.claude-coulombcore
        class: automation
        description: Claude Code agent on CoulombCore
    """))
    return cat


@pytest.fixture
def config_file_with_catalog(tmp_path, catalog_dir):
    f = tmp_path / "tunnels.yaml"
    f.write_text(VALID_CONFIG_WITH_CATALOG.format(catalog_path=str(catalog_dir)))
    return f


# ---------------------------------------------------------------------------
# Coverage collector helper
# ---------------------------------------------------------------------------

def collect_capability_coverage(items: Iterable) -> set[tuple[str, str]]:
    """Walk pytest items and return set of (capability_name, access_mode) pairs.

    Each test item is inspected for `capability` and `access_mode` markers.
    A pair is added for every combination of capability × access_mode marks
    found on a single item.

    Args:
        items: Iterable of pytest.Item objects (from session.items or similar).

    Returns:
        Set of (capability_name, access_mode) tuples found across all items.
    """
    covered: set[tuple[str, str]] = set()
    for item in items:
        capabilities = [
            m.args[0] for m in item.iter_markers("capability") if m.args
        ]
        modes = [
            m.args[0] for m in item.iter_markers("access_mode") if m.args
        ]
        for cap in capabilities:
            for mode in modes:
                covered.add((cap, mode))
    return covered
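The pairing rule described in the docstring above can be tried without a pytest session. The sketch below repeats the helper so that it runs standalone and feeds it a hypothetical `FakeItem` class that models only `iter_markers()`; the `"mcp"` access mode is an invented value used purely for illustration.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_capability_coverage(items: Iterable) -> set:
    """Same logic as the helper above, repeated so the sketch is self-contained."""
    covered = set()
    for item in items:
        capabilities = [m.args[0] for m in item.iter_markers("capability") if m.args]
        modes = [m.args[0] for m in item.iter_markers("access_mode") if m.args]
        for cap in capabilities:
            for mode in modes:
                covered.add((cap, mode))
    return covered


class FakeItem:
    """Hypothetical stand-in for pytest.Item; only iter_markers() is modelled."""

    def __init__(self, **marks):
        # Maps a marker name to a list of positional-argument tuples.
        self._marks = marks

    def iter_markers(self, name):
        for args in self._marks.get(name, []):
            yield SimpleNamespace(name=name, args=args)


items = [
    FakeItem(capability=[("catalog_list_targets",)], access_mode=[("cli",)]),
    # Two access modes on one item yield two pairs ("mcp" is an assumed mode).
    FakeItem(capability=[("catalog_validate",)], access_mode=[("cli",), ("mcp",)]),
    # A capability without an access_mode marker contributes nothing.
    FakeItem(capability=[("catalog_show_bridge",)]),
    FakeItem(),
]
coverage = collect_capability_coverage(items)
```

Because the pairs come from a nested loop per item, a test that carries only one of the two markers never appears in the result, which is why the CLI tests below apply both markers together.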
89
tests/test_audit.py
Normal file
@@ -0,0 +1,89 @@
"""Tests for audit logging."""
import json

import pytest

from bridge.audit import AuditLogger, AuditEvent


@pytest.fixture
def log_dir(tmp_path):
    return tmp_path / "bridge"


@pytest.fixture
def logger(log_dir):
    return AuditLogger(state_dir=log_dir)


class TestAuditLogger:
    def test_log_event_creates_file(self, logger, log_dir):
        logger.log(
            tunnel="my-tunnel",
            event=AuditEvent.BRIDGE_STARTED,
            actor="operator.bernd",
            actor_type="adm",
        )
        log_file = log_dir / "my-tunnel.log"
        assert log_file.exists()

    def test_log_event_is_json_line(self, logger, log_dir):
        logger.log(
            tunnel="my-tunnel",
            event=AuditEvent.BRIDGE_STARTED,
            actor="operator.bernd",
            actor_type="adm",
        )
        lines = (log_dir / "my-tunnel.log").read_text().strip().splitlines()
        assert len(lines) == 1
        entry = json.loads(lines[0])
        assert entry["tunnel"] == "my-tunnel"
        assert entry["event"] == "bridge_started"
        assert entry["actor"] == "operator.bernd"
        assert entry["actor_type"] == "adm"
        assert "timestamp" in entry

    def test_multiple_events_append(self, logger, log_dir):
        for event in [AuditEvent.BRIDGE_STARTED, AuditEvent.BRIDGE_CONNECTED, AuditEvent.BRIDGE_STOPPED]:
            logger.log(tunnel="t", event=event, actor="a", actor_type="adm")
        lines = (log_dir / "t.log").read_text().strip().splitlines()
        assert len(lines) == 3

    def test_log_with_detail(self, logger, log_dir):
        logger.log(
            tunnel="t",
            event=AuditEvent.HEALTH_CHECK_FAILED,
            actor="a",
            actor_type="atm",
            detail="connection refused",
        )
        entry = json.loads((log_dir / "t.log").read_text().strip())
        assert entry["detail"] == "connection refused"

    def test_all_event_types_defined(self):
        events = {e.value for e in AuditEvent}
        assert "bridge_started" in events
        assert "bridge_connected" in events
        assert "bridge_disconnected" in events
        assert "bridge_reconnecting" in events
        assert "health_check_failed" in events
        assert "health_check_recovered" in events
        assert "bridge_stopped" in events

    def test_timestamp_is_iso8601(self, logger, log_dir):
        from datetime import datetime
        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
        entry = json.loads((log_dir / "t.log").read_text().strip())
        # fromisoformat() raises ValueError if the timestamp is not ISO 8601
        dt = datetime.fromisoformat(entry["timestamp"])
        # Timezone-aware (UTC) and naive timestamps are both acceptable
        assert isinstance(dt, datetime)

    def test_read_events(self, logger, log_dir):
        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_type="adm")
        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
        events = logger.read_events("t")
        assert len(events) == 2
        assert events[0]["event"] == "bridge_started"

    def test_read_events_missing_returns_empty(self, logger):
        assert logger.read_events("nonexistent") == []
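The tests above pin down an on-disk contract: one JSON object per line in `<state_dir>/<tunnel>.log`, an ISO 8601 `timestamp`, an optional `detail`, and an empty list for a tunnel with no log. The sketch below shows one way that contract could be met; it is not the `bridge.audit` implementation, and the function names `append_event` and `read_events` are assumptions chosen for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_event(state_dir, tunnel, event, actor, actor_type, detail=None):
    """Append one JSON line to <state_dir>/<tunnel>.log (hypothetical helper)."""
    state_dir = Path(state_dir)
    # The log_dir fixture returns a path that does not exist yet, so create it.
    state_dir.mkdir(parents=True, exist_ok=True)
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tunnel": tunnel,
        "event": event,
        "actor": actor,
        "actor_type": actor_type,
    }
    if detail is not None:
        entry["detail"] = detail
    with (state_dir / f"{tunnel}.log").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def read_events(state_dir, tunnel):
    """Return all logged entries for a tunnel, or [] if no log file exists."""
    path = Path(state_dir) / f"{tunnel}.log"
    if not path.exists():
        return []
    lines = path.read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp) / "bridge"
    append_event(log_dir, "t", "bridge_started", "a", "adm")
    append_event(log_dir, "t", "health_check_failed", "a", "atm", detail="connection refused")
    events = read_events(log_dir, "t")
    missing = read_events(log_dir, "nonexistent")
```

Opening the file in append mode keeps earlier entries intact, which is the property `test_multiple_events_append` checks.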
212
tests/test_catalog_cli.py
Normal file
@@ -0,0 +1,212 @@
"""Tests for catalog CLI commands (targets, catalog list/validate/show)."""
import json
import textwrap

import pytest
from typer.testing import CliRunner

from bridge.cli import app

runner = CliRunner()

# Config with catalog_path pointing to a fixture
BASE_CONFIG = textwrap.dedent("""\
    tunnels: {{}}
    actors: {{}}
    catalog_path: {catalog_path}
""")

CONFIG_NO_CATALOG = textwrap.dedent("""\
    tunnels: {}
    actors: {}
""")


@pytest.fixture
def catalog_dir(tmp_path):
    root = tmp_path / "opscatalog"
    domain_dir = root / "domains" / "coulombcore"
    (domain_dir / "targets").mkdir(parents=True)
    (domain_dir / "bridges").mkdir(parents=True)
    actors_dir = root / "actors"
    actors_dir.mkdir(parents=True)

    (domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
        type: domain
        id: coulombcore
        name: CoulombCore Infrastructure
        description: Core infra
        environment: production
    """))

    (domain_dir / "targets" / "state-hub.yaml").write_text(textwrap.dedent("""\
        type: target
        id: state-hub
        domain: coulombcore
        kind: service
        description: State coordination service
        reachable_via:
          - state-hub-coulombcore
    """))

    (domain_dir / "bridges" / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
        type: bridge
        id: state-hub-coulombcore
        domain: coulombcore
        target: state-hub
        description: Ops bridge for state hub
        access_method: ssh-reverse
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore
    """))

    (actors_dir / "agents.yaml").write_text(textwrap.dedent("""\
        type: actor
        id: agent.claude-coulombcore
        class: automation
        description: Claude Code agent
    """))
    return root


@pytest.fixture
def config_file(tmp_path, catalog_dir):
    f = tmp_path / "tunnels.yaml"
    f.write_text(BASE_CONFIG.format(catalog_path=str(catalog_dir)))
    return f


@pytest.fixture
def env(config_file, tmp_path):
    return {
        "BRIDGE_CONFIG": str(config_file),
        "BRIDGE_STATE_DIR": str(tmp_path / "state"),
    }


class TestTargetsCommand:
    @pytest.mark.capability("catalog_list_targets")
    @pytest.mark.access_mode("cli")
    def test_targets_shows_table(self, env):
        result = runner.invoke(app, ["targets"], env=env)
        assert result.exit_code == 0
        assert "state-hub" in result.output

    def test_targets_json(self, env):
        result = runner.invoke(app, ["targets", "--json"], env=env)
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert isinstance(data, list)
        assert any(t["target"] == "state-hub" for t in data)
        assert any(t["domain"] == "coulombcore" for t in data)

    def test_targets_domain_filter(self, env):
        result = runner.invoke(app, ["targets", "--domain", "coulombcore"], env=env)
        assert result.exit_code == 0
        assert "state-hub" in result.output

    def test_targets_domain_filter_unknown(self, env):
        result = runner.invoke(app, ["targets", "--domain", "nonexistent"], env=env)
        assert result.exit_code == 0
        # No results but no crash

    def test_targets_no_catalog_configured(self, tmp_path):
        f = tmp_path / "tunnels.yaml"
        f.write_text(CONFIG_NO_CATALOG)
        result = runner.invoke(app, ["targets"], env={"BRIDGE_CONFIG": str(f)})
        assert result.exit_code == 1
        assert "catalog" in result.output.lower()

    @pytest.mark.capability("catalog_show_target")
    @pytest.mark.access_mode("cli")
    def test_targets_show_subcommand(self, env):
        result = runner.invoke(app, ["targets", "show", "state-hub"], env=env)
        assert result.exit_code == 0
        assert "state-hub" in result.output
        assert "coulombcore" in result.output

    def test_targets_show_unknown(self, env):
        result = runner.invoke(app, ["targets", "show", "nonexistent"], env=env)
        assert result.exit_code == 1


class TestCatalogCommand:
    @pytest.mark.capability("catalog_list_domains")
    @pytest.mark.access_mode("cli")
    def test_catalog_list(self, env):
        result = runner.invoke(app, ["catalog", "list"], env=env)
        assert result.exit_code == 0
        assert "coulombcore" in result.output

    def test_catalog_list_json(self, env):
        result = runner.invoke(app, ["catalog", "list", "--json"], env=env)
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert isinstance(data, list)
        assert any(d["domain"] == "coulombcore" for d in data)

    @pytest.mark.capability("catalog_validate")
    @pytest.mark.access_mode("cli")
    def test_catalog_validate_clean(self, env):
        result = runner.invoke(app, ["catalog", "validate"], env=env)
        assert result.exit_code == 0
        assert "valid" in result.output.lower() or "ok" in result.output.lower() or "0" in result.output

    def test_catalog_validate_with_errors(self, tmp_path):
        # Catalog with dangling reference
        root = tmp_path / "bad-catalog"
        domain_dir = root / "domains" / "d"
        (domain_dir / "targets").mkdir(parents=True)
        (domain_dir / "domain.yaml").write_text(
            "type: domain\nid: d\nname: D\n"
        )
        (domain_dir / "targets" / "t.yaml").write_text(
            "type: target\nid: t\ndomain: d\nkind: service\nreachable_via:\n - missing-bridge\n"
        )
        f = tmp_path / "tunnels.yaml"
        f.write_text(BASE_CONFIG.format(catalog_path=str(root)))
        result = runner.invoke(app, ["catalog", "validate"], env={"BRIDGE_CONFIG": str(f)})
        assert result.exit_code == 1
        assert "missing-bridge" in result.output

    @pytest.mark.capability("catalog_show_bridge")
    @pytest.mark.access_mode("cli")
    def test_catalog_show(self, env):
        result = runner.invoke(app, ["catalog", "show", "state-hub-coulombcore"], env=env)
        assert result.exit_code == 0
        assert "state-hub-coulombcore" in result.output
        assert "coulombcore.local" in result.output

    def test_catalog_show_unknown(self, env):
        result = runner.invoke(app, ["catalog", "show", "nonexistent"], env=env)
        assert result.exit_code == 1

    def test_catalog_no_catalog_configured(self, tmp_path):
        f = tmp_path / "tunnels.yaml"
        f.write_text(CONFIG_NO_CATALOG)
        result = runner.invoke(app, ["catalog", "list"], env={"BRIDGE_CONFIG": str(f)})
        assert result.exit_code == 1


class TestUpWithCatalogFallback:
    def test_up_resolves_catalog_bridge(self, env):
        """bridge up <catalog-bridge-name> works when name not in inline tunnels.yaml."""
        from unittest.mock import MagicMock, patch

        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_mgr_cls.return_value = mock_mgr

            result = runner.invoke(app, ["up", "state-hub-coulombcore"], env=env)

        assert result.exit_code == 0
        mock_mgr.start.assert_called_once()

    def test_up_unknown_bridge_exit_1(self, env):
        result = runner.invoke(app, ["up", "totally-nonexistent"], env=env)
        assert result.exit_code == 1
195
tests/test_catalog_integration.py
Normal file
@@ -0,0 +1,195 @@
"""Integration tests for OpsCatalog (T14-T16 from BRIDGE-WP-0002)."""
import json
import textwrap
from unittest.mock import MagicMock, patch

import pytest
from typer.testing import CliRunner

from bridge.catalog.loader import load_catalog
from bridge.catalog.resolver import resolve
from bridge.catalog.validator import validate_catalog
from bridge.cli import app

runner = CliRunner()


@pytest.fixture
def catalog_dir(tmp_path):
    root = tmp_path / "opscatalog"
    domain_dir = root / "domains" / "coulombcore"
    (domain_dir / "targets").mkdir(parents=True)
    (domain_dir / "bridges").mkdir(parents=True)
    (domain_dir / "docs").mkdir(parents=True)
    actors_dir = root / "actors"
    actors_dir.mkdir(parents=True)

    (domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
        type: domain
        id: coulombcore
        name: CoulombCore Infrastructure
        description: Core infra
        environment: production
    """))
    (domain_dir / "targets" / "state-hub.yaml").write_text(textwrap.dedent("""\
        type: target
        id: state-hub
        domain: coulombcore
        kind: service
        description: State coordination service
        reachable_via:
          - state-hub-coulombcore
    """))
    (domain_dir / "bridges" / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
        type: bridge
        id: state-hub-coulombcore
        domain: coulombcore
        target: state-hub
        description: Ops bridge for state hub
        access_method: ssh-reverse
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore
        reconnect:
          max_attempts: 0
          backoff_initial: 5
          backoff_max: 60
    """))
    (actors_dir / "agents.yaml").write_text(textwrap.dedent("""\
        type: actor
        id: agent.claude-coulombcore
        class: automation
        description: Claude Code agent on CoulombCore
    """))
    (domain_dir / "docs" / "overview.md").write_text(
        "# CoulombCore Overview\nCore infrastructure notes."
    )
    return root


@pytest.fixture
def config_with_catalog(tmp_path, catalog_dir):
    f = tmp_path / "tunnels.yaml"
    f.write_text(textwrap.dedent(f"""\
        catalog_path: {catalog_dir}
        tunnels: {{}}
        actors: {{}}
    """))
    return f


@pytest.fixture
def env(config_with_catalog, tmp_path):
    return {
        "BRIDGE_CONFIG": str(config_with_catalog),
        "BRIDGE_STATE_DIR": str(tmp_path / "state"),
    }


class TestT14CatalogLoadAndResolve:
    def test_catalog_loads_all_types(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "coulombcore" in cat.domains
        assert "state-hub" in cat.targets
        assert "state-hub-coulombcore" in cat.bridges
        assert "agent.claude-coulombcore" in cat.actors

    def test_resolve_from_catalog(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        tc = resolve("state-hub-coulombcore", catalog=cat, inline_tunnels={})
        assert tc.name == "state-hub-coulombcore"
        assert tc.host == "coulombcore.local"
        assert tc.remote_port == 18000

    def test_bridge_up_with_catalog_bridge(self, env):
        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_mgr_cls.return_value = mock_mgr

            result = runner.invoke(app, ["up", "state-hub-coulombcore"], env=env)

        assert result.exit_code == 0
        mock_mgr.start.assert_called_once()
        # Verify TunnelManager was constructed with correct config
        call_args = mock_mgr_cls.call_args
        tcfg = call_args[0][0]
        assert tcfg.host == "coulombcore.local"
        assert tcfg.remote_port == 18000


class TestT15BridgeTargetsOutput:
    def test_targets_table(self, env):
        result = runner.invoke(app, ["targets"], env=env)
        assert result.exit_code == 0
        assert "state-hub" in result.output
        assert "coulombcore" in result.output
        assert "service" in result.output

    def test_targets_json_structure(self, env):
        result = runner.invoke(app, ["targets", "--json"], env=env)
        assert result.exit_code == 0
        data = json.loads(result.output)
        assert len(data) == 1
        t = data[0]
        assert t["target"] == "state-hub"
        assert t["domain"] == "coulombcore"
        assert t["kind"] == "service"
        assert "state-hub-coulombcore" in t["bridges"]

    def test_targets_show_includes_docs(self, env):
        result = runner.invoke(app, ["targets", "show", "state-hub"], env=env)
        assert result.exit_code == 0
        assert "state-hub" in result.output
        assert "coulombcore" in result.output


class TestT16CatalogValidate:
    def test_validate_clean_catalog_exit_0(self, env):
        result = runner.invoke(app, ["catalog", "validate"], env=env)
        assert result.exit_code == 0
        assert "ok" in result.output.lower() or "0" in result.output

    def test_validate_dangling_reference_exit_1(self, tmp_path):
        root = tmp_path / "bad"
        domain_dir = root / "domains" / "d"
        (domain_dir / "targets").mkdir(parents=True)
        (domain_dir / "bridges").mkdir(parents=True)
        (root / "actors").mkdir(parents=True)

        (domain_dir / "domain.yaml").write_text("type: domain\nid: d\nname: D\n")
        (domain_dir / "targets" / "t.yaml").write_text(
            "type: target\nid: t\ndomain: d\nkind: service\n"
            "reachable_via:\n - nonexistent-bridge\n"
        )
        (domain_dir / "bridges" / "b.yaml").write_text(
            "type: bridge\nid: b\ndomain: d\ntarget: t\n"
            "host: h\nremote_port: 1\nlocal_port: 2\n"
            "ssh_user: u\nssh_key: k\nactor: missing-actor\n"
        )

        f = tmp_path / "tunnels.yaml"
        f.write_text(f"catalog_path: {root}\ntunnels: {{}}\nactors: {{}}\n")

        result = runner.invoke(app, ["catalog", "validate"], env={"BRIDGE_CONFIG": str(f)})
        assert result.exit_code == 1
        assert "nonexistent-bridge" in result.output or "missing-actor" in result.output

    def test_catalog_list_shows_counts(self, env):
        result = runner.invoke(app, ["catalog", "list"], env=env)
        assert result.exit_code == 0
        assert "coulombcore" in result.output

    def test_catalog_show_bridge(self, env):
        result = runner.invoke(app, ["catalog", "show", "state-hub-coulombcore"], env=env)
        assert result.exit_code == 0
        assert "coulombcore.local" in result.output
        assert "18000" in result.output

    def test_validate_using_validator_directly(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        errors = validate_catalog(cat)
        assert errors == []
140
tests/test_catalog_loader.py
Normal file
@@ -0,0 +1,140 @@
"""Tests for catalog loader."""
import textwrap

import pytest

from bridge.catalog.loader import CatalogLoadError, load_catalog
from bridge.catalog.models import Catalog


@pytest.fixture
def catalog_dir(tmp_path):
    """Build a minimal valid catalog fixture."""
    root = tmp_path / "opscatalog"
    domain_dir = root / "domains" / "coulombcore"
    (domain_dir / "targets").mkdir(parents=True)
    (domain_dir / "bridges").mkdir(parents=True)
    (domain_dir / "docs").mkdir(parents=True)
    actors_dir = root / "actors"
    actors_dir.mkdir(parents=True)

    (domain_dir / "domain.yaml").write_text(textwrap.dedent("""\
        type: domain
        id: coulombcore
        name: CoulombCore Infrastructure
        description: Core infra
        environment: production
    """))

    (domain_dir / "targets" / "state-hub.yaml").write_text(textwrap.dedent("""\
        type: target
        id: state-hub
        domain: coulombcore
        kind: service
        description: State coordination service
        reachable_via:
          - state-hub-coulombcore
    """))

    (domain_dir / "bridges" / "state-hub-coulombcore.yaml").write_text(textwrap.dedent("""\
        type: bridge
        id: state-hub-coulombcore
        domain: coulombcore
        target: state-hub
        description: Ops bridge
        access_method: ssh-reverse
        host: coulombcore.local
        remote_port: 18000
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
        actor: agent.claude-coulombcore
        health_check:
          url: http://127.0.0.1:18000/health
          interval_seconds: 30
          timeout_seconds: 5
        reconnect:
          max_attempts: 0
          backoff_initial: 5
          backoff_max: 60
    """))

    (actors_dir / "agents.yaml").write_text(textwrap.dedent("""\
        type: actor
        id: agent.claude-coulombcore
        class: automation
        description: Claude Code agent on CoulombCore
    """))

    (domain_dir / "docs" / "overview.md").write_text("# Overview\nSome ops notes.")

    return root


class TestLoadCatalog:
    def test_loads_domain(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "coulombcore" in cat.domains
        d = cat.domains["coulombcore"]
        assert d.name == "CoulombCore Infrastructure"
        assert d.environment == "production"

    def test_loads_target(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "state-hub" in cat.targets
        t = cat.targets["state-hub"]
        assert t.domain == "coulombcore"
        assert t.kind == "service"
        assert "state-hub-coulombcore" in t.reachable_via

    def test_loads_bridge(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "state-hub-coulombcore" in cat.bridges
        b = cat.bridges["state-hub-coulombcore"]
        assert b.host == "coulombcore.local"
        assert b.remote_port == 18000
        assert b.health_check is not None
        assert b.health_check.url == "http://127.0.0.1:18000/health"
        assert b.reconnect is not None
        assert b.reconnect.max_attempts == 0

    def test_loads_actor(self, catalog_dir):
        cat = load_catalog(catalog_dir)
        assert "agent.claude-coulombcore" in cat.actors
        a = cat.actors["agent.claude-coulombcore"]
        assert a.actor_class == "automation"

    def test_unknown_type_skipped(self, catalog_dir):
        (catalog_dir / "domains" / "coulombcore" / "unknown.yaml").write_text(
            "type: mystery\nid: x\n"
        )
        # Should not raise
        cat = load_catalog(catalog_dir)
        assert isinstance(cat, Catalog)

    def test_empty_catalog_dir(self, tmp_path):
        root = tmp_path / "empty"
        root.mkdir()
        cat = load_catalog(root)
        assert cat.domains == {}
        assert cat.bridges == {}

    def test_missing_required_field_raises(self, tmp_path):
        root = tmp_path / "bad"
        domain_dir = root / "domains" / "x"
        domain_dir.mkdir(parents=True)
        (domain_dir / "domain.yaml").write_text("type: domain\nname: X\n")
        with pytest.raises(CatalogLoadError, match="id"):
            load_catalog(root)

    def test_nonexistent_path_raises(self, tmp_path):
        with pytest.raises(CatalogLoadError, match="not found"):
            load_catalog(tmp_path / "nonexistent")

    def test_invalid_yaml_raises(self, tmp_path):
        root = tmp_path / "bad"
        domain_dir = root / "domains" / "x"
        domain_dir.mkdir(parents=True)
        (domain_dir / "domain.yaml").write_text("type: domain\n[\nbad: yaml")
        with pytest.raises(CatalogLoadError):
            load_catalog(root)
115
tests/test_catalog_models.py
Normal file
@@ -0,0 +1,115 @@
"""Tests for catalog domain models."""
from bridge.catalog.models import (
    ActorClass,
    Catalog,
    CatalogBridge,
    CatalogDomain,
    CatalogTarget,
)


class TestCatalogDomain:
    def test_required_fields(self):
        d = CatalogDomain(id="coulombcore", name="CoulombCore Infra")
        assert d.id == "coulombcore"
        assert d.name == "CoulombCore Infra"

    def test_optional_fields_default(self):
        d = CatalogDomain(id="x", name="X")
        assert d.description == ""
        assert d.environment == ""


class TestCatalogTarget:
    def test_required_fields(self):
        t = CatalogTarget(id="state-hub", domain="coulombcore", kind="service")
        assert t.id == "state-hub"
        assert t.domain == "coulombcore"
        assert t.kind == "service"

    def test_reachable_via_defaults_empty(self):
        t = CatalogTarget(id="t", domain="d", kind="service")
        assert t.reachable_via == []

    def test_reachable_via(self):
        t = CatalogTarget(id="t", domain="d", kind="service", reachable_via=["b1", "b2"])
        assert t.reachable_via == ["b1", "b2"]


class TestCatalogBridge:
    def test_required_fields(self):
        b = CatalogBridge(
            id="state-hub-coulombcore",
            domain="coulombcore",
            target="state-hub",
            host="coulombcore.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
            actor="agent.claude-coulombcore",
        )
        assert b.id == "state-hub-coulombcore"
        assert b.domain == "coulombcore"
        assert b.host == "coulombcore.local"

    def test_optional_fields_default(self):
        b = CatalogBridge(
            id="b",
            domain="d",
            target="t",
            host="h",
            remote_port=1,
            local_port=2,
            ssh_user="u",
            ssh_key="k",
            actor="a",
        )
        assert b.description == ""
        assert b.access_method == "ssh-reverse"
        assert b.health_check is None
        assert b.reconnect is None

    def test_to_tunnel_config(self):
        from bridge.models import TunnelConfig
        b = CatalogBridge(
            id="state-hub-coulombcore",
            domain="coulombcore",
            target="state-hub",
            host="coulombcore.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
            actor="agent.claude-coulombcore",
        )
        tc = b.to_tunnel_config()
        assert isinstance(tc, TunnelConfig)
        assert tc.name == "state-hub-coulombcore"
        assert tc.host == "coulombcore.local"
        assert tc.remote_port == 18000


class TestActorClass:
    def test_fields(self):
        a = ActorClass(id="agent.claude", actor_class="automation", description="Claude agent")
        assert a.id == "agent.claude"
        assert a.actor_class == "automation"

    def test_optional_description(self):
        a = ActorClass(id="x", actor_class="human")
        assert a.description == ""


class TestCatalog:
    def test_empty_catalog(self):
        c = Catalog()
        assert c.domains == {}
        assert c.targets == {}
        assert c.bridges == {}
        assert c.actors == {}

    def test_add_entries(self):
        c = Catalog()
        c.domains["d"] = CatalogDomain(id="d", name="D")
        assert "d" in c.domains
88
tests/test_catalog_resolver.py
Normal file
@@ -0,0 +1,88 @@
"""Tests for catalog resolver."""
import pytest
from bridge.catalog.models import (
    ActorClass,
    Catalog,
    CatalogBridge,
    CatalogDomain,
    CatalogTarget,
)
from bridge.catalog.resolver import BridgeNotFound, resolve
from bridge.models import TunnelConfig, ReconnectPolicy


@pytest.fixture
def catalog():
    cat = Catalog()
    cat.domains["d"] = CatalogDomain(id="d", name="D")
    cat.targets["t"] = CatalogTarget(id="t", domain="d", kind="service")
    cat.bridges["catalog-bridge"] = CatalogBridge(
        id="catalog-bridge",
        domain="d",
        target="t",
        host="catalog-host.local",
        remote_port=19000,
        local_port=9000,
        ssh_user="ubuntu",
        ssh_key="~/.ssh/catalog",
        actor="operator.bernd",
    )
    cat.actors["operator.bernd"] = ActorClass(id="operator.bernd", actor_class="human")
    return cat


@pytest.fixture
def inline_tunnels():
    return {
        "inline-bridge": TunnelConfig(
            name="inline-bridge",
            host="inline-host.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/inline",
            actor="operator.bernd",
        )
    }


class TestResolve:
    def test_inline_takes_precedence(self, catalog, inline_tunnels):
        tc = resolve("inline-bridge", catalog=catalog, inline_tunnels=inline_tunnels)
        assert tc.host == "inline-host.local"

    def test_catalog_fallback(self, catalog, inline_tunnels):
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels=inline_tunnels)
        assert tc.host == "catalog-host.local"
        assert tc.remote_port == 19000

    def test_catalog_fallback_no_inline(self, catalog):
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels={})
        assert tc.name == "catalog-bridge"

    def test_missing_name_raises(self, catalog, inline_tunnels):
        with pytest.raises(BridgeNotFound, match="nonexistent"):
            resolve("nonexistent", catalog=catalog, inline_tunnels=inline_tunnels)

    def test_missing_name_no_catalog_raises(self, inline_tunnels):
        with pytest.raises(BridgeNotFound):
            resolve("nonexistent", catalog=None, inline_tunnels=inline_tunnels)

    def test_inline_bridge_returns_tunnel_config(self, catalog, inline_tunnels):
        tc = resolve("inline-bridge", catalog=catalog, inline_tunnels=inline_tunnels)
        assert isinstance(tc, TunnelConfig)

    def test_catalog_bridge_returns_tunnel_config(self, catalog):
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels={})
        assert isinstance(tc, TunnelConfig)

    def test_catalog_is_none_no_inline_raises(self):
        with pytest.raises(BridgeNotFound):
            resolve("any-name", catalog=None, inline_tunnels={})

    def test_resolve_preserves_reconnect_policy(self, catalog):
        catalog.bridges["catalog-bridge"].reconnect = ReconnectPolicy(
            max_attempts=3, backoff_initial=2, backoff_max=30
        )
        tc = resolve("catalog-bridge", catalog=catalog, inline_tunnels={})
        assert tc.reconnect.max_attempts == 3
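The lookup order these resolver tests describe (inline tunnels first, then the catalog, otherwise `BridgeNotFound`) can be sketched in a few lines. Everything below is a reduced stand-in written for illustration: the dataclasses carry only three fields, and the error message format is an assumption, so this should not be read as the contents of `bridge.catalog.resolver`.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


class BridgeNotFound(Exception):
    """Raised when a name is in neither the inline tunnels nor the catalog."""


@dataclass
class TunnelConfig:
    # Reduced stand-in: the tests above show more fields (ports, SSH settings, actor).
    name: str
    host: str
    remote_port: int


@dataclass
class CatalogBridge:
    id: str
    host: str
    remote_port: int

    def to_tunnel_config(self) -> TunnelConfig:
        # The catalog id becomes the tunnel name, as test_to_tunnel_config expects.
        return TunnelConfig(name=self.id, host=self.host, remote_port=self.remote_port)


@dataclass
class Catalog:
    bridges: Dict[str, CatalogBridge] = field(default_factory=dict)


def resolve(name: str, catalog: Optional[Catalog],
            inline_tunnels: Dict[str, TunnelConfig]) -> TunnelConfig:
    """Inline definitions win; the catalog is only a fallback."""
    if name in inline_tunnels:
        return inline_tunnels[name]
    if catalog is not None and name in catalog.bridges:
        return catalog.bridges[name].to_tunnel_config()
    # Assumed message format; the tests only require the name to appear in it.
    raise BridgeNotFound(f"bridge not found: {name}")


catalog = Catalog(bridges={
    "catalog-bridge": CatalogBridge("catalog-bridge", "catalog-host.local", 19000),
})
inline = {"inline-bridge": TunnelConfig("inline-bridge", "inline-host.local", 18000)}
```

Checking `catalog is not None` before the membership test is what lets the same function serve configurations that have no `catalog_path` at all.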
93
tests/test_catalog_validator.py
Normal file
@@ -0,0 +1,93 @@
"""Tests for catalog validator."""
from bridge.catalog.models import (
    ActorClass,
    Catalog,
    CatalogBridge,
    CatalogDomain,
    CatalogTarget,
)
from bridge.catalog.validator import validate_catalog


def _make_full_catalog() -> Catalog:
    cat = Catalog()
    cat.domains["coulombcore"] = CatalogDomain(id="coulombcore", name="CoulombCore")
    cat.targets["state-hub"] = CatalogTarget(
        id="state-hub",
        domain="coulombcore",
        kind="service",
        reachable_via=["state-hub-coulombcore"],
    )
    cat.bridges["state-hub-coulombcore"] = CatalogBridge(
        id="state-hub-coulombcore",
        domain="coulombcore",
        target="state-hub",
        host="host.local",
        remote_port=18000,
        local_port=8000,
        ssh_user="ubuntu",
        ssh_key="~/.ssh/id_ops",
        actor="agent.claude-coulombcore",
    )
    cat.actors["agent.claude-coulombcore"] = ActorClass(
        id="agent.claude-coulombcore",
        actor_class="automation",
    )
    return cat


class TestValidateCatalog:
    def test_valid_catalog_no_errors(self):
        cat = _make_full_catalog()
        errors = validate_catalog(cat)
        assert errors == []

    def test_target_domain_must_exist(self):
        cat = _make_full_catalog()
        cat.targets["orphan"] = CatalogTarget(
            id="orphan", domain="nonexistent-domain", kind="service"
        )
        errors = validate_catalog(cat)
        assert any("orphan" in e and "nonexistent-domain" in e for e in errors)

    def test_target_reachable_via_must_exist(self):
        cat = _make_full_catalog()
        cat.targets["state-hub"].reachable_via.append("nonexistent-bridge")
        errors = validate_catalog(cat)
        assert any("nonexistent-bridge" in e for e in errors)

    def test_bridge_domain_must_exist(self):
        cat = _make_full_catalog()
        cat.bridges["state-hub-coulombcore"].domain = "missing-domain"
        errors = validate_catalog(cat)
        assert any("missing-domain" in e for e in errors)

    def test_bridge_target_must_exist(self):
        cat = _make_full_catalog()
        cat.bridges["state-hub-coulombcore"].target = "missing-target"
        errors = validate_catalog(cat)
        assert any("missing-target" in e for e in errors)

    def test_bridge_actor_must_exist(self):
        cat = _make_full_catalog()
        cat.bridges["state-hub-coulombcore"].actor = "nonexistent-actor"
        errors = validate_catalog(cat)
        assert any("nonexistent-actor" in e for e in errors)

    def test_multiple_errors_all_reported(self):
        cat = Catalog()
        # Target with dangling domain and reachable_via
|
||||
cat.targets["t1"] = CatalogTarget(
|
||||
id="t1", domain="missing", kind="service", reachable_via=["missing-bridge"]
|
||||
)
|
||||
# Bridge with dangling domain + target + actor
|
||||
cat.bridges["b1"] = CatalogBridge(
|
||||
id="b1", domain="missing", target="missing", host="h",
|
||||
remote_port=1, local_port=2, ssh_user="u", ssh_key="k", actor="missing-actor",
|
||||
)
|
||||
errors = validate_catalog(cat)
|
||||
assert len(errors) >= 4
|
||||
|
||||
def test_empty_catalog_is_valid(self):
|
||||
cat = Catalog()
|
||||
assert validate_catalog(cat) == []
|
||||
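The validator tests pin down a contract: the function returns a list of error strings instead of raising, it reports every dangling reference rather than stopping at the first, and an empty catalog is valid. A minimal sketch of a checker with that behavior, assuming simplified stand-in dataclasses; `Target`, `Bridge`, `MiniCatalog` and `check_references` are hypothetical names, not the real `bridge.catalog` models or `validate_catalog`, and the message wording is invented:

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Target:
    id: str
    domain: str
    reachable_via: list[str] = field(default_factory=list)


@dataclass
class Bridge:
    id: str
    domain: str
    target: str
    actor: str


@dataclass
class MiniCatalog:
    domains: set[str] = field(default_factory=set)
    actors: set[str] = field(default_factory=set)
    targets: dict[str, Target] = field(default_factory=dict)
    bridges: dict[str, Bridge] = field(default_factory=dict)


def check_references(cat: MiniCatalog) -> list[str]:
    """Collect every dangling reference; an empty list means the catalog is valid."""
    errors: list[str] = []
    for t in cat.targets.values():
        if t.domain not in cat.domains:
            errors.append(f"target '{t.id}': unknown domain '{t.domain}'")
        for bridge_id in t.reachable_via:
            if bridge_id not in cat.bridges:
                errors.append(f"target '{t.id}': unknown bridge '{bridge_id}'")
    for b in cat.bridges.values():
        if b.domain not in cat.domains:
            errors.append(f"bridge '{b.id}': unknown domain '{b.domain}'")
        if b.target not in cat.targets:
            errors.append(f"bridge '{b.id}': unknown target '{b.target}'")
        if b.actor not in cat.actors:
            errors.append(f"bridge '{b.id}': unknown actor '{b.actor}'")
    return errors
```

Each message names both the referencing object and the missing id, which is why assertions such as `"orphan" in e and "nonexistent-domain" in e` can match a single error string.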
130
tests/test_cleanup.py
Normal file
@@ -0,0 +1,130 @@
|
||||
"""Tests for stale SSH forward cleanup."""
|
||||
from __future__ import annotations
|
||||
|
||||
import textwrap
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
from typer.testing import CliRunner
|
||||
|
||||
from bridge.cleanup import (
|
||||
CleanupAction,
|
||||
build_cron_line,
|
||||
cleanup_all_tunnels,
|
||||
remote_forward_health_url,
|
||||
should_cleanup_tunnel,
|
||||
)
|
||||
from bridge.cli import app
|
||||
from bridge.config import load_config
|
||||
from bridge.models import HealthCheckConfig, TunnelConfig
|
||||
from bridge.state import StateManager
|
||||
|
||||
|
||||
def _tunnel(**overrides) -> TunnelConfig:
|
||||
base = dict(
|
||||
name="state-hub-railiance01",
|
||||
host="92.205.62.239",
|
||||
remote_port=18000,
|
||||
local_port=8000,
|
||||
ssh_user="tegwick",
|
||||
ssh_key="~/.ssh/id_ops",
|
||||
actor="agt-claude-railiance01",
|
||||
health_check=HealthCheckConfig(
|
||||
url="http://127.0.0.1:8000/state/health",
|
||||
timeout_seconds=5,
|
||||
),
|
||||
)
|
||||
base.update(overrides)
|
||||
return TunnelConfig(**base)
|
||||
|
||||
|
||||
class TestRemoteForwardHealthUrl:
|
||||
def test_maps_local_port_to_remote(self):
|
||||
cfg = _tunnel()
|
||||
assert remote_forward_health_url(cfg) == "http://127.0.0.1:18000/state/health"
|
||||
|
||||
def test_returns_none_for_local_tunnel(self):
|
||||
cfg = _tunnel(direction="local")
|
||||
assert remote_forward_health_url(cfg) is None
|
||||
|
||||
|
||||
class TestShouldCleanupTunnel:
|
||||
def test_skips_healthy_remote_forward(self, tmp_path):
|
||||
cfg = _tunnel()
|
||||
state_mgr = StateManager(state_dir=tmp_path)
|
||||
with (
|
||||
patch("bridge.cleanup.remote_port_listening", return_value=True),
|
||||
patch("bridge.cleanup.probe_remote_forward", return_value=(True, "ok")),
|
||||
):
|
||||
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
|
||||
assert needed is False
|
||||
|
||||
def test_detects_stale_forward_when_local_ok_remote_fails(self, tmp_path):
|
||||
cfg = _tunnel()
|
||||
state_mgr = StateManager(state_dir=tmp_path)
|
||||
with (
|
||||
patch("bridge.cleanup.remote_port_listening", return_value=True),
|
||||
patch("bridge.cleanup.probe_remote_forward", return_value=(False, "timeout")),
|
||||
patch("bridge.cleanup.local_service_healthy", return_value=True),
|
||||
patch(
|
||||
"bridge.cleanup.check_tunnel",
|
||||
return_value=MagicMock(ssh_process="ok", remote_port="listening"),
|
||||
),
|
||||
):
|
||||
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
|
||||
assert needed is True
|
||||
assert "stale forward" in reason
|
||||
|
||||
|
||||
class TestCleanupAllTunnels:
|
||||
def test_reports_cleaned_tunnel(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "tunnels.yaml"))
|
||||
(tmp_path / "tunnels.yaml").write_text(
|
||||
textwrap.dedent(
|
||||
"""\
|
||||
tunnels:
|
||||
state-hub-railiance01:
|
||||
host: 92.205.62.239
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: tegwick
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: agt-claude-railiance01
|
||||
health_check:
|
||||
url: http://127.0.0.1:8000/state/health
|
||||
actors:
|
||||
agt-claude-railiance01:
|
||||
class: agt
|
||||
"""
|
||||
)
|
||||
)
|
||||
cfg = load_config()
|
||||
state_mgr = StateManager(state_dir=tmp_path / "state")
|
||||
with patch(
|
||||
"bridge.cleanup.cleanup_tunnel",
|
||||
return_value=CleanupAction("state-hub-railiance01", "cleaned", "cleared"),
|
||||
):
|
||||
report = cleanup_all_tunnels(cfg, state_mgr, restart=False)
|
||||
assert report.cleaned_count == 1
|
||||
assert report.actions[0].action == "cleaned"
|
||||
|
||||
|
||||
class TestMaintenanceCli:
|
||||
def test_cleanup_help(self):
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(app, ["maintenance", "cleanup", "--help"])
|
||||
assert result.exit_code == 0
|
||||
assert "restart" in result.output.lower()
|
||||
|
||||
def test_show_cron_prints_template_when_not_installed(self):
|
||||
runner = CliRunner()
|
||||
with patch("bridge.cli.read_installed_cron", return_value=None):
|
||||
result = runner.invoke(app, ["maintenance", "show-cron"])
|
||||
assert result.exit_code == 0
|
||||
assert "0 3 * * *" in result.output
|
||||
|
||||
|
||||
def test_build_cron_line_contains_marker():
|
||||
line = build_cron_line()
|
||||
assert "0 3 * * *" in line
|
||||
assert "maintenance cleanup --restart" in line
|
||||
assert "ops-bridge: maintenance cleanup" in line
|
||||
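`TestRemoteForwardHealthUrl` expects the configured health URL, which points at the local service port, to be rewritten to the remote end of the forward, and expects `None` for tunnels with `direction="local"`. One way to get that behavior, as a sketch only: `rewrite_health_url` and its parameters are hypothetical, the default direction name `"reverse"` is an assumption (only `"local"` appears in the tests), and the real `remote_forward_health_url` takes a `TunnelConfig` and may work differently.

```python
from __future__ import annotations

from urllib.parse import urlsplit, urlunsplit


def rewrite_health_url(
    url: str, remote_port: int, direction: str = "reverse"
) -> str | None:
    """Point a health-check URL at the remote-forward port.

    Returns None for local forwards, where no remote listener exists to probe.
    """
    if direction == "local":
        return None
    parts = urlsplit(url)
    # Keep scheme, host, path and query; swap only the port.
    host = parts.hostname or "127.0.0.1"
    return urlunsplit(parts._replace(netloc=f"{host}:{remote_port}"))
```

For example, `rewrite_health_url("http://127.0.0.1:8000/state/health", 18000)` yields the `:18000` URL asserted in the test. The sketch ignores IPv6 literals and credentials embedded in the URL.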
411
tests/test_cli.py
Normal file
@@ -0,0 +1,411 @@
|
||||
"""Tests for CLI commands."""
|
||||
import json
|
||||
import textwrap
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
from typer.testing import CliRunner
|
||||
|
||||
from bridge.cli import app
|
||||
|
||||
|
||||
VALID_CONFIG = textwrap.dedent("""\
|
||||
tunnels:
|
||||
test-tunnel:
|
||||
host: host.local
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
""")
|
||||
|
||||
runner = CliRunner()
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def config_file(tmp_path):
|
||||
f = tmp_path / "tunnels.yaml"
|
||||
f.write_text(VALID_CONFIG)
|
||||
return f
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def state_dir(tmp_path):
|
||||
return tmp_path / "state"
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def env(config_file, state_dir):
|
||||
return {"BRIDGE_CONFIG": str(config_file), "BRIDGE_STATE_DIR": str(state_dir)}
|
||||
|
||||
|
||||
class TestHelpCommand:
|
||||
def test_app_help(self):
|
||||
result = runner.invoke(app, ["--help"])
|
||||
assert result.exit_code == 0
|
||||
assert "bridge" in result.output.lower() or "Usage" in result.output
|
||||
|
||||
def test_up_help(self):
|
||||
result = runner.invoke(app, ["up", "--help"])
|
||||
assert result.exit_code == 0
|
||||
|
||||
def test_down_help(self):
|
||||
result = runner.invoke(app, ["down", "--help"])
|
||||
assert result.exit_code == 0
|
||||
|
||||
def test_status_help(self):
|
||||
result = runner.invoke(app, ["status", "--help"])
|
||||
assert result.exit_code == 0
|
||||
|
||||
def test_logs_help(self):
|
||||
result = runner.invoke(app, ["logs", "--help"])
|
||||
assert result.exit_code == 0
|
||||
|
||||
def test_restart_help(self):
|
||||
result = runner.invoke(app, ["restart", "--help"])
|
||||
assert result.exit_code == 0
|
||||
|
||||
|
||||
class TestStatusCommand:
|
||||
@pytest.mark.capability("bridge_status")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_status_shows_tunnels(self, env, state_dir):
|
||||
result = runner.invoke(app, ["status"], env=env)
|
||||
assert result.exit_code == 0
|
||||
assert "test-tunnel" in result.output
|
||||
|
||||
def test_status_json_flag(self, env, state_dir):
|
||||
result = runner.invoke(app, ["status", "--json"], env=env)
|
||||
assert result.exit_code == 0
|
||||
data = json.loads(result.output)
|
||||
assert isinstance(data, list)
|
||||
assert len(data) == 1
|
||||
assert data[0]["tunnel"] == "test-tunnel"
|
||||
assert "state" in data[0]
|
||||
assert "actor" in data[0]
|
||||
assert "host" in data[0]
|
||||
|
||||
def test_status_shows_state(self, env, state_dir):
|
||||
result = runner.invoke(app, ["status"], env=env)
|
||||
assert result.exit_code == 0
|
||||
assert "stopped" in result.output.lower()
|
||||
|
||||
def test_status_unknown_config_exit_1(self, tmp_path):
|
||||
result = runner.invoke(app, ["status"], env={"BRIDGE_CONFIG": str(tmp_path / "no.yaml")})
|
||||
assert result.exit_code == 1
|
||||
|
||||
|
||||
class TestUpCommand:
|
||||
def test_up_unknown_tunnel_exit_1(self, env):
|
||||
result = runner.invoke(app, ["up", "nonexistent"], env=env)
|
||||
assert result.exit_code == 1
|
||||
assert "nonexistent" in result.output
|
||||
|
||||
@pytest.mark.capability("bridge_up")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_up_calls_manager_start(self, env, state_dir):
|
||||
with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
|
||||
mock_mgr = MagicMock()
|
||||
mock_mgr.is_running.return_value = False
|
||||
mock_mgr_cls.return_value = mock_mgr
|
||||
|
||||
result = runner.invoke(app, ["up", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
mock_mgr.start.assert_called_once()
|
||||
|
||||
def test_up_already_running_exit_2(self, env, state_dir):
|
||||
with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
|
||||
mock_mgr = MagicMock()
|
||||
mock_mgr.is_running.return_value = True
|
||||
mock_mgr_cls.return_value = mock_mgr
|
||||
|
||||
result = runner.invoke(app, ["up", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 2
|
||||
|
||||
|
||||
class TestDownCommand:
|
||||
def test_down_unknown_tunnel_exit_1(self, env):
|
||||
result = runner.invoke(app, ["down", "nonexistent"], env=env)
|
||||
assert result.exit_code == 1
|
||||
|
||||
@pytest.mark.capability("bridge_down")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_down_calls_manager_stop(self, env, state_dir):
|
||||
with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
|
||||
mock_mgr = MagicMock()
|
||||
mock_mgr.is_running.return_value = True
|
||||
mock_mgr_cls.return_value = mock_mgr
|
||||
|
||||
result = runner.invoke(app, ["down", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
mock_mgr.stop.assert_called_once()
|
||||
|
||||
def test_down_not_running_exit_2(self, env, state_dir):
|
||||
with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
|
||||
mock_mgr = MagicMock()
|
||||
mock_mgr.is_running.return_value = False
|
||||
mock_mgr_cls.return_value = mock_mgr
|
||||
|
||||
result = runner.invoke(app, ["down", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 2
|
||||
|
||||
|
||||
class TestLogsCommand:
|
||||
def test_logs_unknown_tunnel_exit_1(self, env):
|
||||
result = runner.invoke(app, ["logs", "nonexistent"], env=env)
|
||||
assert result.exit_code == 1
|
||||
|
||||
def test_logs_no_log_file_shows_empty(self, env, state_dir):
|
||||
result = runner.invoke(app, ["logs", "test-tunnel"], env=env)
|
||||
assert result.exit_code == 0
|
||||
|
||||
@pytest.mark.capability("bridge_logs")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_logs_shows_events(self, env, state_dir):
|
||||
import json as _json
|
||||
state_dir.mkdir(parents=True, exist_ok=True)
|
||||
log_file = state_dir / "test-tunnel.log"
|
||||
log_file.write_text(
|
||||
_json.dumps({
|
||||
"timestamp": "2026-01-01T00:00:00+00:00",
|
||||
"tunnel": "test-tunnel",
|
||||
"actor": "operator.bernd",
|
||||
"actor_class": "human",
|
||||
"event": "bridge_started",
|
||||
}) + "\n"
|
||||
)
|
||||
result = runner.invoke(app, ["logs", "test-tunnel"], env=env)
|
||||
assert result.exit_code == 0
|
||||
assert "bridge_started" in result.output
|
||||
|
||||
|
||||
class TestCheckCommand:
|
||||
def test_check_help(self):
|
||||
result = runner.invoke(app, ["check", "--help"])
|
||||
assert result.exit_code == 0
|
||||
|
||||
@pytest.mark.capability("bridge_check")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_check_all_pass(self, env):
|
||||
from bridge.diagnostics import TunnelCheckResult
|
||||
ok_result = TunnelCheckResult(
|
||||
tunnel="test-tunnel",
|
||||
ssh_process="ok",
|
||||
pid=12345,
|
||||
remote_port="listening",
|
||||
local_api=None,
|
||||
latency_ms=None,
|
||||
stale_state=False,
|
||||
)
|
||||
with patch("bridge.cli.check_all_tunnels", return_value=[ok_result]):
|
||||
result = runner.invoke(app, ["check"], env=env)
|
||||
assert result.exit_code == 0
|
||||
|
||||
def test_check_any_fail(self, env):
|
||||
from bridge.diagnostics import TunnelCheckResult
|
||||
fail_result = TunnelCheckResult(
|
||||
tunnel="test-tunnel",
|
||||
ssh_process="dead",
|
||||
pid=None,
|
||||
remote_port="closed",
|
||||
local_api=None,
|
||||
latency_ms=None,
|
||||
stale_state=True,
|
||||
)
|
||||
with patch("bridge.cli.check_all_tunnels", return_value=[fail_result]):
|
||||
result = runner.invoke(app, ["check"], env=env)
|
||||
assert result.exit_code == 1
|
||||
|
||||
def test_check_json_flag(self, env):
|
||||
from bridge.diagnostics import TunnelCheckResult
|
||||
ok_result = TunnelCheckResult(
|
||||
tunnel="test-tunnel",
|
||||
ssh_process="ok",
|
||||
pid=12345,
|
||||
remote_port="listening",
|
||||
local_api=None,
|
||||
latency_ms=None,
|
||||
stale_state=False,
|
||||
)
|
||||
with patch("bridge.cli.check_all_tunnels", return_value=[ok_result]):
|
||||
result = runner.invoke(app, ["check", "--json"], env=env)
|
||||
assert result.exit_code == 0
|
||||
data = json.loads(result.output)
|
||||
assert isinstance(data, list)
|
||||
assert len(data) == 1
|
||||
assert data[0]["ok"] is True
|
||||
assert data[0]["tunnel"] == "test-tunnel"
|
||||
assert data[0]["ssh_process"] == "ok"
|
||||
|
||||
def test_check_specific_tunnel(self, env):
|
||||
from bridge.diagnostics import TunnelCheckResult
|
||||
ok_result = TunnelCheckResult(
|
||||
tunnel="test-tunnel",
|
||||
ssh_process="ok",
|
||||
pid=12345,
|
||||
remote_port="listening",
|
||||
local_api=None,
|
||||
latency_ms=None,
|
||||
stale_state=False,
|
||||
)
|
||||
with patch("bridge.cli.check_tunnel", return_value=ok_result):
|
||||
result = runner.invoke(app, ["check", "test-tunnel"], env=env)
|
||||
assert result.exit_code == 0
|
||||
|
||||
def test_check_unknown_tunnel_exit_1(self, env):
|
||||
result = runner.invoke(app, ["check", "nonexistent"], env=env)
|
||||
assert result.exit_code == 1
|
||||
|
||||
|
||||
REVERSE_CONFIG = VALID_CONFIG
|
||||
|
||||
LOCAL_TUNNEL_CONFIG = textwrap.dedent("""\
|
||||
tunnels:
|
||||
k3s-api:
|
||||
host: host.local
|
||||
remote_port: 6443
|
||||
local_port: 6443
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: adm-bernd
|
||||
direction: local
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
""")
|
||||
|
||||
|
||||
class TestRestartCommand:
|
||||
def test_restart_unknown_tunnel_exit_1(self, env):
|
||||
result = runner.invoke(app, ["restart", "nonexistent"], env=env)
|
||||
assert result.exit_code == 1
|
||||
|
||||
def test_restart_help_mentions_remote_cleanup(self):
|
||||
result = runner.invoke(app, ["restart", "--help"])
|
||||
assert result.exit_code == 0
|
||||
assert "stale-forward" in result.output.lower() or "remote" in result.output.lower()
|
||||
|
||||
@pytest.mark.capability("bridge_restart")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_restart_reverse_tunnel_delegates_to_cleanup(self, env):
|
||||
from bridge.cleanup import CleanupAction
|
||||
|
||||
with patch("bridge.cli.restart_tunnel") as mock_restart:
|
||||
mock_restart.return_value = CleanupAction(
|
||||
"test-tunnel", "healthy", "remote forward healthy"
|
||||
)
|
||||
result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
mock_restart.assert_called_once()
|
||||
assert "test-tunnel: healthy" in result.output
|
||||
|
||||
def test_restart_reverse_tunnel_reports_cleaned_and_restarted(self, env):
|
||||
from bridge.cleanup import CleanupAction
|
||||
|
||||
with patch("bridge.cli.restart_tunnel") as mock_restart:
|
||||
mock_restart.return_value = CleanupAction(
|
||||
"test-tunnel",
|
||||
"cleaned_and_restarted",
|
||||
"stale forward; restarted tunnel; cleared",
|
||||
)
|
||||
result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
assert "cleaned_and_restarted" in result.output
|
||||
|
||||
def test_restart_reverse_tunnel_error_exit_1(self, env):
|
||||
from bridge.cleanup import CleanupAction
|
||||
|
||||
with patch("bridge.cli.restart_tunnel") as mock_restart:
|
||||
mock_restart.return_value = CleanupAction(
|
||||
"test-tunnel", "error", "cleanup failed: still_listening"
|
||||
)
|
||||
result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 1
|
||||
assert "error" in result.output
|
||||
|
||||
def test_restart_local_tunnel_uses_stop_start(self, tmp_path, state_dir):
|
||||
config_file = tmp_path / "tunnels.yaml"
|
||||
config_file.write_text(LOCAL_TUNNEL_CONFIG)
|
||||
env = {
|
||||
"BRIDGE_CONFIG": str(config_file),
|
||||
"BRIDGE_STATE_DIR": str(state_dir),
|
||||
}
|
||||
|
||||
with patch("bridge.cleanup.TunnelManager") as mock_mgr_cls:
|
||||
mock_mgr = MagicMock()
|
||||
mock_mgr_cls.return_value = mock_mgr
|
||||
call_order = []
|
||||
mock_mgr.stop.side_effect = lambda: call_order.append("stop")
|
||||
mock_mgr.start.side_effect = lambda: call_order.append("start")
|
||||
|
||||
result = runner.invoke(app, ["restart", "k3s-api"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
assert call_order == ["stop", "start"]
|
||||
assert "k3s-api: restarted" in result.output
|
||||
|
||||
|
||||
class TestCertStatusCommand:
|
||||
@pytest.mark.capability("bridge_cert_status")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_cert_status_no_cert_shows_static_key(self, env, state_dir):
|
||||
result = runner.invoke(app, ["cert-status"], env=env)
|
||||
assert result.exit_code == 0
|
||||
assert "static-key" in result.output
|
||||
|
||||
def test_cert_status_json_no_cert(self, env, state_dir):
|
||||
result = runner.invoke(app, ["cert-status", "--json"], env=env)
|
||||
assert result.exit_code == 0
|
||||
data = json.loads(result.output)
|
||||
assert data[0]["mode"] == "static-key"
|
||||
|
||||
def test_cert_status_exit_1_on_expired(self, env, state_dir, tmp_path):
|
||||
# Write a fake cert file in state dir; mock ssh-keygen to report expired
|
||||
state_dir.mkdir(parents=True, exist_ok=True)
|
||||
cert_file = state_dir / "test-tunnel-cert.pub"
|
||||
cert_file.write_text("fake cert")
|
||||
with patch("subprocess.run") as mock_run:
|
||||
mock_run.return_value = MagicMock(
|
||||
stdout=(
|
||||
"test-tunnel-cert.pub:\n"
|
||||
" Key ID: \"agt-test\"\n"
|
||||
" Valid: from 2026-01-01T00:00:00 to 2026-01-02T00:00:00\n"
|
||||
),
|
||||
returncode=0,
|
||||
)
|
||||
result = runner.invoke(app, ["cert-status"], env=env)
|
||||
assert result.exit_code == 1
|
||||
assert "EXPIRED" in result.output
|
||||
|
||||
def test_cert_status_json_with_cert(self, env, state_dir):
|
||||
state_dir.mkdir(parents=True, exist_ok=True)
|
||||
cert_file = state_dir / "test-tunnel-cert.pub"
|
||||
cert_file.write_text("fake cert")
|
||||
with patch("subprocess.run") as mock_run:
|
||||
mock_run.return_value = MagicMock(
|
||||
stdout=(
|
||||
"test-tunnel-cert.pub:\n"
|
||||
" Key ID: \"agt-test\"\n"
|
||||
" Valid: from 2030-01-01T00:00:00 to 2030-01-02T00:00:00\n"
|
||||
),
|
||||
returncode=0,
|
||||
)
|
||||
result = runner.invoke(app, ["cert-status", "--json"], env=env)
|
||||
assert result.exit_code == 0
|
||||
data = json.loads(result.output)
|
||||
assert data[0]["mode"] == "cert"
|
||||
assert data[0]["key_id"] == "agt-test"
|
||||
assert data[0]["expired"] is False
|
||||
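The cert-status tests feed the command mocked `ssh-keygen` output containing a `Key ID:` line and a `Valid: from <start> to <end>` line, and expect an expired certificate to be flagged. A minimal sketch of that parsing step, assuming the same line layout as the mocked output and timestamps without a timezone; `parse_cert_info` is a hypothetical helper, not the function `bridge` actually uses:

```python
from __future__ import annotations

import re
from datetime import datetime

_KEY_ID = re.compile(r'Key ID:\s*"([^"]*)"')
_VALID = re.compile(r"Valid:\s*from\s+(\S+)\s+to\s+(\S+)")


def parse_cert_info(text: str, now: datetime) -> dict | None:
    """Extract key id and validity window from certificate listing text.

    Returns None when no 'Valid: from ... to ...' line is present
    (for example a certificate listed as valid forever).
    """
    valid = _VALID.search(text)
    if valid is None:
        return None
    key_id = _KEY_ID.search(text)
    valid_to = datetime.fromisoformat(valid.group(2))
    return {
        "key_id": key_id.group(1) if key_id else None,
        "valid_from": valid.group(1),
        "valid_to": valid.group(2),
        # Expired once the current time reaches the end of the window.
        "expired": now >= valid_to,
    }
```

Passing `now` explicitly keeps the expiry check deterministic in tests, which avoids depending on the wall clock the way the fixed 2026 and 2030 dates in the tests above do.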
299
tests/test_config.py
Normal file
@@ -0,0 +1,299 @@
|
||||
"""Tests for config loading."""
|
||||
import textwrap
|
||||
import warnings
|
||||
|
||||
import pytest
|
||||
|
||||
from bridge.config import ConfigError, load_config
|
||||
from bridge.models import ActorType
|
||||
|
||||
|
||||
VALID_YAML = textwrap.dedent("""\
|
||||
tunnels:
|
||||
state-hub-coulombcore:
|
||||
host: coulombcore.local
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: agt-claude-coulombcore
|
||||
health_check:
|
||||
url: http://127.0.0.1:18000/health
|
||||
interval_seconds: 30
|
||||
timeout_seconds: 5
|
||||
reconnect:
|
||||
max_attempts: 0
|
||||
backoff_initial: 5
|
||||
backoff_max: 60
|
||||
|
||||
actors:
|
||||
agt-claude-coulombcore:
|
||||
class: agt
|
||||
description: Claude Code agent on CoulombCore
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd Worsch
|
||||
""")
|
||||
|
||||
|
||||
@pytest.fixture
|
||||
def config_file(tmp_path):
|
||||
f = tmp_path / "tunnels.yaml"
|
||||
f.write_text(VALID_YAML)
|
||||
return f
|
||||
|
||||
|
||||
def test_load_valid_config(config_file, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
|
||||
cfg = load_config()
|
||||
assert "state-hub-coulombcore" in cfg.tunnels
|
||||
t = cfg.tunnels["state-hub-coulombcore"]
|
||||
assert t.host == "coulombcore.local"
|
||||
assert t.remote_port == 18000
|
||||
assert t.local_port == 8000
|
||||
assert t.ssh_user == "ubuntu"
|
||||
assert t.actor == "agt-claude-coulombcore"
|
||||
|
||||
|
||||
def test_health_check_loaded(config_file, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
|
||||
cfg = load_config()
|
||||
t = cfg.tunnels["state-hub-coulombcore"]
|
||||
assert t.health_check is not None
|
||||
assert t.health_check.url == "http://127.0.0.1:18000/health"
|
||||
assert t.health_check.interval_seconds == 30
|
||||
|
||||
|
||||
def test_reconnect_policy_loaded(config_file, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
|
||||
cfg = load_config()
|
||||
t = cfg.tunnels["state-hub-coulombcore"]
|
||||
assert t.reconnect.max_attempts == 0
|
||||
assert t.reconnect.backoff_initial == 5
|
||||
assert t.reconnect.backoff_max == 60
|
||||
|
||||
|
||||
def test_actors_loaded(config_file, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
|
||||
cfg = load_config()
|
||||
assert "agt-claude-coulombcore" in cfg.actors
|
||||
a = cfg.actors["agt-claude-coulombcore"]
|
||||
assert a.actor_type == ActorType.AGT
|
||||
assert "adm-bernd" in cfg.actors
|
||||
|
||||
|
||||
def test_missing_required_field_raises(tmp_path, monkeypatch):
|
||||
f = tmp_path / "bad.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
broken:
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
actors: {}
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError, match="host"):
|
||||
load_config()
|
||||
|
||||
|
||||
def test_invalid_yaml_raises(tmp_path, monkeypatch):
|
||||
f = tmp_path / "bad.yaml"
|
||||
f.write_text("tunnels: [\nnot: valid: yaml")
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError):
|
||||
load_config()
|
||||
|
||||
|
||||
def test_missing_config_file_raises(tmp_path, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "nonexistent.yaml"))
|
||||
with pytest.raises(ConfigError, match="not found"):
|
||||
load_config()
|
||||
|
||||
|
||||
def test_tunnel_without_health_check(tmp_path, monkeypatch):
|
||||
f = tmp_path / "tunnels.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
simple:
|
||||
host: host.local
|
||||
remote_port: 9000
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_rsa
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.tunnels["simple"].health_check is None
|
||||
|
||||
|
||||
class TestActorTypeValidation:
|
||||
def test_canonical_agt_accepted(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: agt-claude
|
||||
actors:
|
||||
agt-claude:
|
||||
class: agt
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.actors["agt-claude"].actor_type == ActorType.AGT
|
||||
|
||||
def test_canonical_atm_accepted(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: atm-backup
|
||||
actors:
|
||||
atm-backup:
|
||||
class: atm
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.actors["atm-backup"].actor_type == ActorType.ATM
|
||||
|
||||
def test_wrong_prefix_raises_config_error(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: agt
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError, match="must start with 'agt-'"):
|
||||
load_config()
|
||||
|
||||
def test_missing_prefix_raises_config_error(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: operator.bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: adm
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError, match="must start with 'adm-'"):
|
||||
load_config()
|
||||
|
||||
def test_unknown_class_raises_config_error(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: wizard
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError, match="unknown class"):
|
||||
load_config()
|
||||
|
||||
def test_legacy_human_maps_to_adm_with_warning(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: human
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with warnings.catch_warnings(record=True) as w:
|
||||
warnings.simplefilter("always")
|
||||
cfg = load_config()
|
||||
assert cfg.actors["adm-bernd"].actor_type == ActorType.ADM
|
||||
assert any("deprecated" in str(x.message).lower() for x in w)
|
||||
|
||||
def test_legacy_automation_maps_to_atm_with_warning(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: atm-cron
|
||||
actors:
|
||||
atm-cron:
|
||||
class: automation
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with warnings.catch_warnings(record=True) as w:
|
||||
warnings.simplefilter("always")
|
||||
cfg = load_config()
|
||||
assert cfg.actors["atm-cron"].actor_type == ActorType.ATM
|
||||
assert any("deprecated" in str(x.message).lower() for x in w)
|
||||
|
||||
|
||||
class TestCertCommandConfig:
|
||||
def test_cert_command_parsed(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: agt-bridge
|
||||
cert_command: "warden sign agt-bridge --pubkey /tmp/k.pub"
|
||||
actors:
|
||||
agt-bridge:
|
||||
class: agt
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.tunnels["t"].cert_command == "warden sign agt-bridge --pubkey /tmp/k.pub"
|
||||
|
||||
def test_no_cert_command_is_none(self, config_file, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
|
||||
cfg = load_config()
|
||||
assert cfg.tunnels["state-hub-coulombcore"].cert_command is None
|
||||
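`TestActorTypeValidation` implies three rules: the `class` must be a known value (`agt`, `atm` and `adm` are the ones seen in these tests), the actor name must carry the matching `<class>-` prefix, and the legacy values `human` and `automation` are still accepted but mapped to `adm` and `atm` with a deprecation warning. A sketch of those rules, with caveats: `normalize_actor_class` is a hypothetical helper, the error wording beyond the fragments matched in the tests is invented, and the real loader raises `ConfigError` rather than `ValueError`.

```python
from __future__ import annotations

import warnings

CANONICAL = {"agt", "atm", "adm"}  # values seen in the tests; the real set may be larger
LEGACY = {"human": "adm", "automation": "atm"}


def normalize_actor_class(name: str, raw_class: str) -> str:
    """Return the canonical class for an actor, enforcing the name prefix."""
    cls = raw_class
    if cls in LEGACY:
        warnings.warn(
            f"actor '{name}': class '{raw_class}' is deprecated; "
            f"use '{LEGACY[raw_class]}'",
            DeprecationWarning,
            stacklevel=2,
        )
        cls = LEGACY[cls]
    if cls not in CANONICAL:
        raise ValueError(f"actor '{name}': unknown class '{raw_class}'")
    if not name.startswith(f"{cls}-"):
        raise ValueError(f"actor '{name}': name must start with '{cls}-'")
    return cls
```

Mapping the legacy value before the prefix check means `adm-bernd` with `class: human` passes, while `operator.bernd` with `class: adm` fails on the prefix, matching the two corresponding tests.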
229
tests/test_coverage_completeness.py
Normal file
@@ -0,0 +1,229 @@
|
||||
"""Cross-mode capability coverage meta-test.
|
||||
|
||||
Enforces that every capability in the registry has at least one test
|
||||
marked with @pytest.mark.capability(name) and @pytest.mark.access_mode(mode)
|
||||
for each of its required_access_modes.
|
||||
|
||||
The test discovers coverage by walking all collected test items, so it will
|
||||
only pass when the full test suite is collected (i.e. run without -k filters
|
||||
that exclude capability-marked tests).
|
||||
|
||||
Also validates the registry itself is self-consistent.
|
||||
"""
|
||||
from __future__ import annotations
|
||||
|
||||
import pytest
|
||||
|
||||
from bridge.capabilities import CAPABILITIES, CAPABILITIES_BY_NAME
|
||||
from tests.conftest import collect_capability_coverage
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
# Registry self-consistency
# ---------------------------------------------------------------------------

def test_registry_has_capabilities():
    """Sanity: registry must be non-empty."""
    assert len(CAPABILITIES) > 0


def test_registry_names_are_unique():
    names = [c.name for c in CAPABILITIES]
    assert len(names) == len(set(names)), "Duplicate capability names in registry"


def test_registry_access_modes_are_valid():
    valid = {"cli", "mcp", "skill"}
    for cap in CAPABILITIES:
        unknown = cap.required_access_modes - valid
        assert not unknown, (
            f"Capability '{cap.name}' has unknown access modes: {unknown}"
        )


def test_registry_each_capability_has_at_least_one_mode():
    for cap in CAPABILITIES:
        assert cap.required_access_modes, (
            f"Capability '{cap.name}' has no required_access_modes"
        )


# ---------------------------------------------------------------------------
# Cross-mode coverage completeness (session-scope fixture)
# ---------------------------------------------------------------------------

@pytest.fixture(scope="session")
def capability_coverage(request) -> set[tuple[str, str]]:
    """Collect all (capability, access_mode) pairs from the test session."""
    return collect_capability_coverage(request.session.items)


def test_all_required_modes_have_tests(capability_coverage):
    """Every (capability, mode) pair in the registry must have a test."""
    missing: list[str] = []
    for cap in CAPABILITIES:
        for mode in sorted(cap.required_access_modes):
            if (cap.name, mode) not in capability_coverage:
                missing.append(f" {cap.name!r} × {mode!r}")

    if missing:
        pytest.fail(
            "Missing test coverage for the following (capability, access_mode) pairs:\n"
            + "\n".join(missing)
            + "\n\nAdd a test with @pytest.mark.capability(<name>) and "
            "@pytest.mark.access_mode(<mode>)."
        )


# ---------------------------------------------------------------------------
# T02 - Registry completeness against CLI commands and MCP tools
# ---------------------------------------------------------------------------

def test_registry_cli_capabilities_have_matching_commands():
    """Every capability requiring CLI must have a corresponding CLI command.

    Checks that the registry doesn't list CLI requirements for operations that
    don't actually exist as CLI commands. Uses the Typer app's callback names.
    """
    from bridge.cli import app, targets_app, catalog_app

    # Collect all CLI callback function names (canonical command identity)
    top_level = {f"bridge_{cmd.callback.__name__}" for cmd in app.registered_commands}
    # targets sub-commands: callback name "targets_show" → "catalog_show_target"
    targets_cmds = set()
    for cmd in targets_app.registered_commands:
        fn = cmd.callback.__name__
        if fn == "targets_show":
            targets_cmds.add("catalog_show_target")
    catalog_cmds = set()
    for cmd in catalog_app.registered_commands:
        fn = cmd.callback.__name__
        if fn == "catalog_list":
            catalog_cmds.add("catalog_list_domains")
        elif fn == "catalog_validate":
            catalog_cmds.add("catalog_validate")
        elif fn == "catalog_show":
            catalog_cmds.add("catalog_show_bridge")

    # Also include catalog_list_targets (from targets_app without sub-command filter)
    # The targets app root command lists targets
    all_cli_caps = top_level | targets_cmds | catalog_cmds | {"catalog_list_targets"}

    for cap in CAPABILITIES:
        if "cli" in cap.required_access_modes:
            assert cap.name in all_cli_caps, (
                f"Capability '{cap.name}' requires CLI coverage but no matching "
                f"CLI command was found. Either add the command or update the registry."
            )


async def test_mcp_tools_in_registry():
    """Every MCP tool name must appear as a capability in the registry."""
    from fastmcp import Client
    from bridge.mcp_server.server import mcp

    async with Client(mcp) as c:
        tools = await c.list_tools()
    tool_names = {t.name for t in tools}

    registered_cap_names = set(CAPABILITIES_BY_NAME)
    for name in tool_names:
        assert name in registered_cap_names, (
            f"MCP tool '{name}' is not registered as a capability. "
            f"Add it to src/bridge/capabilities.py."
        )


# ---------------------------------------------------------------------------
# T12 - Self-validation: sentinel fixture proves the gap-checker catches gaps
# ---------------------------------------------------------------------------

def test_meta_test_catches_missing_mode_gap():
    """Self-validation: the coverage checker must detect a missing-mode gap.

    Injects a synthetic _test_sentinel capability requiring both cli and mcp.
    Creates mock items with *only* a cli test for it (deliberately omitting mcp).
    Asserts that collect_capability_coverage reports the mcp gap, proving the
    meta-test mechanism is functional, not a silent no-op.

    This test validates Goal #4 from BRIDGE-WP-0003:
    "The gap-detection mechanism is itself tested: a synthetic missing-mode
    fixture asserts the meta-test catches it."
    """
    from bridge.capabilities import Capability

    sentinel = Capability(
        name="_test_sentinel",
        description="Synthetic capability for meta-test self-validation",
        required_access_modes=frozenset({"cli", "mcp"}),
    )
    patched_caps = CAPABILITIES + [sentinel]

    # Minimal mock: an iterable of items that respond to iter_markers()
    class _Mark:
        def __init__(self, arg: str):
            self.args = (arg,)

    class _MockItem:
        def __init__(self, capability: str, mode: str):
            self._cap = capability
            self._mode = mode

        def iter_markers(self, name: str):
            if name == "capability":
                return [_Mark(self._cap)]
            if name == "access_mode":
                return [_Mark(self._mode)]
            return []

    # Only supply a cli test for the sentinel; the mcp test is intentionally absent
    mock_items = [_MockItem("_test_sentinel", "cli")]

    covered = collect_capability_coverage(mock_items)

    # The cli mode should be registered
    assert ("_test_sentinel", "cli") in covered, (
        "collect_capability_coverage failed to record the cli mock item"
    )
    # The mcp mode must NOT be covered: this is the gap we want to catch
    assert ("_test_sentinel", "mcp") not in covered, (
        "collect_capability_coverage incorrectly registered an mcp test that was not provided"
    )

    # Run the same gap-detection logic used by test_all_required_modes_have_tests
    gaps = [
        (cap.name, mode)
        for cap in patched_caps
        for mode in cap.required_access_modes
        if (cap.name, mode) not in covered
    ]

    assert ("_test_sentinel", "mcp") in gaps, (
        "Gap checker failed to detect the missing mcp mode for _test_sentinel. "
        "The meta-test mechanism is broken."
    )
    # Sanity: cli mode should NOT appear as a gap (it was covered)
    assert ("_test_sentinel", "cli") not in gaps


def test_no_orphan_capability_marks(capability_coverage):
    """Every (capability, mode) pair in the test suite must exist in the registry.

    This prevents tests from referencing stale or misspelled capability names.
    """
    orphans: list[str] = []
    for cap_name, mode in sorted(capability_coverage):
        if cap_name not in CAPABILITIES_BY_NAME:
            orphans.append(f" {cap_name!r} (mode={mode!r}) - not in registry")
        else:
            cap = CAPABILITIES_BY_NAME[cap_name]
            if mode not in cap.required_access_modes:
                orphans.append(
                    f" {cap_name!r} × {mode!r} - mode not required for this capability"
                )

    if orphans:
        pytest.fail(
            "Test suite references capability/mode pairs not in registry:\n"
            + "\n".join(orphans)
        )
213
tests/test_diagnostics.py
Normal file
@@ -0,0 +1,213 @@
"""Tests for bridge.diagnostics: check_tunnel() logic."""
from __future__ import annotations

import subprocess
from unittest.mock import MagicMock, patch

import pytest

from bridge.diagnostics import (
    _remote_port_probe_command,
    check_all_tunnels,
    check_tunnel,
)
from bridge.models import BridgeState, TunnelConfig
from bridge.state import StateManager


@pytest.fixture
def tcfg():
    return TunnelConfig(
        name="test-tunnel",
        host="coulombcore.local",
        remote_port=18000,
        local_port=8000,
        ssh_user="ubuntu",
        ssh_key="~/.ssh/id_ops",
        actor="adm-bernd",
    )


@pytest.fixture
def state_mgr(tmp_path):
    d = tmp_path / "state"
    d.mkdir()
    return StateManager(state_dir=d)


class TestCheckTunnel:
    def test_remote_port_probe_has_minimal_host_fallback(self):
        """Remote probe supports minimal hosts without ss/netstat."""
        command = _remote_port_probe_command(18000)
        assert "command -v ss" in command
        assert "command -v netstat" in command
        assert "/proc/net/tcp" in command
        assert "/proc/net/tcp6" in command

    def test_no_pid(self, tcfg, state_mgr):
        """No PID file → ssh_process='no_pid', ok=False."""
        with patch("bridge.diagnostics.subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
            result = check_tunnel(tcfg, state_mgr)
        assert result.ssh_process == "no_pid"
        assert result.pid is None
        assert result.stale_state is False
        assert result.ok is False

    def test_pid_dead(self, tcfg, state_mgr):
        """Dead PID + connected state → ssh_process='dead', stale_state=True."""
        state_mgr.write_pid("test-tunnel", 99999)
        state_mgr.write_state("test-tunnel", BridgeState.CONNECTED)
        with (
            patch("bridge.diagnostics._pid_alive", return_value=False),
            patch("bridge.diagnostics.subprocess.run") as mock_run,
        ):
            mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
            result = check_tunnel(tcfg, state_mgr)
        assert result.ssh_process == "dead"
        assert result.stale_state is True
        assert result.ok is False

    def test_pid_alive_port_listening(self, tcfg, state_mgr):
        """Alive PID + SSH reports port listening → remote_port='listening', ok=True."""
        state_mgr.write_pid("test-tunnel", 12345)
        with (
            patch("bridge.diagnostics._pid_alive", return_value=True),
            patch("bridge.diagnostics.subprocess.run") as mock_run,
        ):
            mock_run.return_value = MagicMock(stdout="ok\n", stderr="", returncode=0)
            result = check_tunnel(tcfg, state_mgr)
        assert result.ssh_process == "ok"
        assert result.pid == 12345
        assert result.remote_port == "listening"
        assert result.ok is True

    def test_pid_alive_port_closed(self, tcfg, state_mgr):
        """Alive PID + SSH reports port closed → remote_port='closed', ok=False."""
        state_mgr.write_pid("test-tunnel", 12345)
        with (
            patch("bridge.diagnostics._pid_alive", return_value=True),
            patch("bridge.diagnostics.subprocess.run") as mock_run,
        ):
            mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
            result = check_tunnel(tcfg, state_mgr)
        assert result.ssh_process == "ok"
        assert result.remote_port == "closed"
        assert result.ok is False

    def test_local_direction_checks_local_port(self, tcfg, state_mgr):
        """Local tunnels verify the local listener instead of a remote -R port."""
        local_cfg = TunnelConfig(
            name="local-tunnel",
            host="haskelseed.local",
            remote_port=1234,
            local_port=11234,
            ssh_user="root",
            ssh_key="~/.ssh/id_ops",
            actor="adm-bernd",
            direction="local",
        )
        state_mgr.write_pid("local-tunnel", 12345)
        with (
            patch("bridge.diagnostics._pid_alive", return_value=True),
            patch("bridge.diagnostics._probe_local_port", return_value="listening"),
            patch("bridge.diagnostics.subprocess.run") as mock_run,
        ):
            result = check_tunnel(local_cfg, state_mgr)
        mock_run.assert_not_called()
        assert result.remote_port == "listening"
        assert result.ok is True

    def test_ssh_timeout(self, tcfg, state_mgr):
        """SSH probe timeout → remote_port='error:timeout'."""
        state_mgr.write_pid("test-tunnel", 12345)
        with (
            patch("bridge.diagnostics._pid_alive", return_value=True),
            patch(
                "bridge.diagnostics.subprocess.run",
                side_effect=subprocess.TimeoutExpired(cmd=["ssh"], timeout=10),
            ),
        ):
            result = check_tunnel(tcfg, state_mgr)
        assert result.remote_port == "error:timeout"
        assert result.ok is False

    def test_stale_state_not_flagged_when_stopped(self, tcfg, state_mgr):
        """State=stopped + no PID → stale_state is False (not connected/degraded)."""
        with patch("bridge.diagnostics.subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
            result = check_tunnel(tcfg, state_mgr)
        assert result.stale_state is False

    def test_local_api_ok(self, tcfg, state_mgr, tmp_path):
        """With health_check configured, ok response sets local_api='ok'."""
        from bridge.models import HealthCheckConfig
        tcfg_with_health = TunnelConfig(
            name="test-tunnel",
            host="coulombcore.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
            actor="adm-bernd",
            health_check=HealthCheckConfig(url="http://127.0.0.1:8000/health"),
        )
        state_mgr.write_pid("test-tunnel", 12345)
        mock_resp = MagicMock()
        mock_resp.is_success = True
        with (
            patch("bridge.diagnostics._pid_alive", return_value=True),
            patch("bridge.diagnostics.subprocess.run") as mock_run,
            patch("bridge.diagnostics.httpx.get", return_value=mock_resp),
        ):
            mock_run.return_value = MagicMock(stdout="ok\n", stderr="", returncode=0)
            result = check_tunnel(tcfg_with_health, state_mgr)
        assert result.local_api == "ok"
        assert result.latency_ms is not None


class TestCheckAllTunnels:
    def test_check_all_iterates_tunnels(self, tmp_path):
        """check_all_tunnels returns one result per tunnel in cfg."""
        from bridge.config import load_config
        import textwrap
        import os

        cfg_file = tmp_path / "tunnels.yaml"
        cfg_file.write_text(textwrap.dedent("""\
            tunnels:
              t1:
                host: h1.local
                remote_port: 18001
                local_port: 8001
                ssh_user: ubuntu
                ssh_key: ~/.ssh/id_ops
                actor: adm-bernd
              t2:
                host: h2.local
                remote_port: 18002
                local_port: 8002
                ssh_user: ubuntu
                ssh_key: ~/.ssh/id_ops
                actor: adm-bernd
            actors:
              adm-bernd:
                class: adm
                description: Bernd
        """))
        os.environ["BRIDGE_CONFIG"] = str(cfg_file)
        try:
            cfg = load_config()
        finally:
            del os.environ["BRIDGE_CONFIG"]

        state_dir = tmp_path / "state"
        state_dir.mkdir()
        state_mgr = StateManager(state_dir=state_dir)

        with patch("bridge.diagnostics.subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(stdout="closed\n", stderr="", returncode=1)
            results = check_all_tunnels(cfg, state_mgr)

        assert len(results) == 2
        assert {r.tunnel for r in results} == {"t1", "t2"}
78
tests/test_health.py
Normal file
@@ -0,0 +1,78 @@
"""Tests for health checking."""
import pytest
from unittest.mock import MagicMock, patch, AsyncMock

from bridge.health import HealthChecker, HealthResult


class TestHealthResult:
    def test_ok(self):
        r = HealthResult(ok=True, status_code=200)
        assert r.ok
        assert r.status_code == 200
        assert r.error is None

    def test_failure(self):
        r = HealthResult(ok=False, error="connection refused")
        assert not r.ok
        assert r.error == "connection refused"


class TestHealthChecker:
    @pytest.mark.asyncio
    async def test_check_ok(self):
        checker = HealthChecker(url="http://127.0.0.1:18000/health", timeout_seconds=5)
        mock_response = MagicMock()
        mock_response.status_code = 200
        mock_response.raise_for_status = MagicMock()

        with patch("httpx.AsyncClient") as mock_client_cls:
            mock_client = AsyncMock()
            mock_client.__aenter__ = AsyncMock(return_value=mock_client)
            mock_client.__aexit__ = AsyncMock(return_value=False)
            mock_client.get = AsyncMock(return_value=mock_response)
            mock_client_cls.return_value = mock_client

            result = await checker.check()

        assert result.ok
        assert result.status_code == 200

    @pytest.mark.asyncio
    async def test_check_connection_error(self):
        import httpx
        checker = HealthChecker(url="http://127.0.0.1:19999/health", timeout_seconds=1)

        with patch("httpx.AsyncClient") as mock_client_cls:
            mock_client = AsyncMock()
            mock_client.__aenter__ = AsyncMock(return_value=mock_client)
            mock_client.__aexit__ = AsyncMock(return_value=False)
            mock_client.get = AsyncMock(side_effect=httpx.ConnectError("refused"))
            mock_client_cls.return_value = mock_client

            result = await checker.check()

        assert not result.ok
        assert result.error is not None

    @pytest.mark.asyncio
    async def test_check_http_error(self):
        import httpx
        checker = HealthChecker(url="http://127.0.0.1:18000/health", timeout_seconds=5)
        mock_response = MagicMock()
        mock_response.status_code = 503
        mock_response.raise_for_status = MagicMock(
            side_effect=httpx.HTTPStatusError("503", request=MagicMock(), response=mock_response)
        )

        with patch("httpx.AsyncClient") as mock_client_cls:
            mock_client = AsyncMock()
            mock_client.__aenter__ = AsyncMock(return_value=mock_client)
            mock_client.__aexit__ = AsyncMock(return_value=False)
            mock_client.get = AsyncMock(return_value=mock_response)
            mock_client_cls.return_value = mock_client

            result = await checker.check()

        assert not result.ok
        assert result.status_code == 503
213
tests/test_integration.py
Normal file
@@ -0,0 +1,213 @@
"""Integration tests for OpsBridge."""
import textwrap
from unittest.mock import MagicMock, patch

import pytest

from bridge.config import load_config
from bridge.manager import TunnelManager
from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig
from bridge.state import StateManager


MINIMAL_CONFIG = textwrap.dedent("""\
    tunnels:
      local-test:
        host: 127.0.0.1
        remote_port: 19000
        local_port: 8000
        ssh_user: testuser
        ssh_key: ~/.ssh/id_rsa
        actor: adm-bernd
        reconnect:
          max_attempts: 2
          backoff_initial: 1
          backoff_max: 2
    actors:
      adm-bernd:
        class: adm
        description: Bernd
""")


@pytest.fixture
def config_file(tmp_path):
    f = tmp_path / "tunnels.yaml"
    f.write_text(MINIMAL_CONFIG)
    return f


@pytest.fixture
def state_dir(tmp_path):
    return tmp_path / "bridge"


@pytest.fixture
def tunnel_cfg():
    return TunnelConfig(
        name="local-test",
        host="127.0.0.1",
        remote_port=19000,
        local_port=8000,
        ssh_user="testuser",
        ssh_key="~/.ssh/id_rsa",
        actor="adm-bernd",
        reconnect=ReconnectPolicy(max_attempts=2, backoff_initial=1, backoff_max=2),
    )


class TestConfigRoundtrip:
    def test_load_config_from_file(self, config_file, monkeypatch):
        monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
        cfg = load_config()
        assert "local-test" in cfg.tunnels
        t = cfg.tunnels["local-test"]
        assert t.host == "127.0.0.1"
        assert t.reconnect.max_attempts == 2
        assert t.reconnect.backoff_initial == 1


class TestStateRoundtrip:
    def test_state_persists_across_manager_instances(self, state_dir, tunnel_cfg):
        mgr1 = TunnelManager(tunnel_cfg, state_dir=state_dir)
        mgr1._state.write_state(tunnel_cfg.name, BridgeState.CONNECTED)

        mgr2 = TunnelManager(tunnel_cfg, state_dir=state_dir)
        assert mgr2.get_state() == BridgeState.CONNECTED

    def test_stale_pid_cleanup(self, state_dir, tunnel_cfg):
        sm = StateManager(state_dir=state_dir)
        sm.write_pid(tunnel_cfg.name, 999999)  # guaranteed not alive
        sm.write_state(tunnel_cfg.name, BridgeState.CONNECTED)

        # is_running should return False for dead pid
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        assert not mgr.is_running()


class TestReconnectLoop:
    def test_reconnect_loop_gives_up_after_max_attempts(self, state_dir, tunnel_cfg):
        """Manager should set FAILED state after exhausting max_attempts."""
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)

        attempt_count = [0]

        def fake_popen(cmd, **kwargs):
            proc = MagicMock()
            proc.poll.return_value = 1  # immediately "dead"
            proc.returncode = 1
            attempt_count[0] += 1
            return proc

        with patch("subprocess.Popen", side_effect=fake_popen), \
             patch("time.sleep"):  # skip sleeps for speed
            mgr._run_loop()

        assert attempt_count[0] >= 1
        assert mgr.get_state() == BridgeState.FAILED

    def test_reconnect_logs_events(self, state_dir, tunnel_cfg):
        """Audit log should contain reconnect events."""
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)

        def fake_popen(cmd, **kwargs):
            proc = MagicMock()
            proc.poll.return_value = 1
            proc.returncode = 1
            return proc

        with patch("subprocess.Popen", side_effect=fake_popen), \
             patch("time.sleep"):
            mgr._run_loop()

        events = mgr._audit.read_events(tunnel_cfg.name)
        event_types = [e["event"] for e in events]
        assert "bridge_started" in event_types or "bridge_reconnecting" in event_types or "bridge_disconnected" in event_types


class TestHealthCheckDegradedPath:
    def test_degraded_state_on_health_failure(self, state_dir):
        """Health check failure sets state to DEGRADED."""
        from bridge.health import HealthResult

        hc_cfg = MagicMock()
        hc_cfg.url = "http://127.0.0.1:19001/health"
        hc_cfg.interval_seconds = 0
        hc_cfg.timeout_seconds = 1

        tunnel_cfg = TunnelConfig(
            name="hc-test",
            host="127.0.0.1",
            remote_port=19001,
            local_port=8001,
            ssh_user="u",
            ssh_key="k",
            actor="adm-bernd",
            reconnect=ReconnectPolicy(max_attempts=1, backoff_initial=1, backoff_max=1),
            health_check=hc_cfg,
        )

        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)

        proc_call_count = [0]

        def fake_popen(cmd, **kwargs):
            proc = MagicMock()
            # First call: "alive" for 1 health check cycle then dies
            proc_call_count[0] += 1
            if proc_call_count[0] == 1:
                # Poll returns None (alive) once then dies
                poll_calls = [None, 1]
                proc.poll.side_effect = poll_calls + [1] * 100
                proc.returncode = 1
            else:
                proc.poll.return_value = 1
                proc.returncode = 1
            return proc

        failed_result = HealthResult(ok=False, error="connection refused")

        async def fake_check_failing():
            return failed_result

        with patch("subprocess.Popen", side_effect=fake_popen), \
             patch("time.sleep"), \
             patch("bridge.manager.HealthChecker") as mock_hc_cls:
            mock_checker = MagicMock()
            mock_checker.check = MagicMock(side_effect=lambda: failed_result)
            # Use asyncio.run compatibility
            mock_hc_cls.return_value = mock_checker

            with patch("asyncio.run", side_effect=lambda coro: failed_result):
                mgr._run_loop()

        # Should have set degraded at some point; check audit log
        events = mgr._audit.read_events("hc-test")
        event_types = [e["event"] for e in events]
        assert "health_check_failed" in event_types or "bridge_disconnected" in event_types


class TestAuditTrail:
    def test_full_lifecycle_logged(self, state_dir, tunnel_cfg):
        """A start + immediate-exit SSH produces at minimum started + disconnected events."""
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)

        def fake_popen(cmd, **kwargs):
            proc = MagicMock()
            proc.poll.return_value = 1
            proc.returncode = 1
            return proc

        with patch("subprocess.Popen", side_effect=fake_popen), \
             patch("time.sleep"):
            mgr._run_loop()

        events = mgr._audit.read_events(tunnel_cfg.name)
        assert len(events) >= 2
        # Each event has required fields
        for e in events:
            assert "timestamp" in e
            assert "tunnel" in e
            assert "actor" in e
            assert "event" in e
203
tests/test_manager.py
Normal file
@@ -0,0 +1,203 @@
"""Tests for TunnelManager."""
import os
import signal
from unittest.mock import MagicMock, patch

import pytest

from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig
from bridge.manager import TunnelManager, build_ssh_command


@pytest.fixture
def tunnel_cfg():
    return TunnelConfig(
        name="test-tunnel",
        host="host.local",
        remote_port=18000,
        local_port=8000,
        ssh_user="ubuntu",
        ssh_key="~/.ssh/id_ops",
        actor="operator.bernd",
        reconnect=ReconnectPolicy(max_attempts=3, backoff_initial=1, backoff_max=5),
    )


@pytest.fixture
def state_dir(tmp_path):
    return tmp_path / "bridge"


class TestBuildSshCommand:
    def test_basic_command(self, tunnel_cfg):
        cmd = build_ssh_command(tunnel_cfg)
        assert cmd[0] == "ssh"
        assert "-N" in cmd
        assert "-R" in cmd
        assert "18000:127.0.0.1:8000" in cmd
        assert "-i" in cmd
        assert "ubuntu@host.local" in cmd

    def test_server_alive_options(self, tunnel_cfg):
        cmd = build_ssh_command(tunnel_cfg)
        assert "-o" in cmd
        assert "ServerAliveInterval=10" in cmd
        assert "ExitOnForwardFailure=yes" in cmd

    def test_ssh_key_expanded(self, tunnel_cfg):
        cmd = build_ssh_command(tunnel_cfg)
        key_idx = cmd.index("-i") + 1
        assert not cmd[key_idx].startswith("~")


class TestTunnelManager:
    def test_get_state_initial(self, tunnel_cfg, state_dir):
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        assert mgr.get_state() == BridgeState.STOPPED

    def test_stop_when_not_running_is_noop(self, tunnel_cfg, state_dir):
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        # Should not raise
        mgr.stop()
        assert mgr.get_state() == BridgeState.STOPPED

    def test_stop_kills_pid(self, tunnel_cfg, state_dir):
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        # Write a fake PID of our own process to simulate running
        mgr._state.write_pid(tunnel_cfg.name, os.getpid())
        mgr._state.write_state(tunnel_cfg.name, BridgeState.CONNECTED)

        with patch("os.kill") as mock_kill:
            mgr.stop()

        # Should have sent SIGTERM
        mock_kill.assert_any_call(os.getpid(), signal.SIGTERM)
        assert mgr.get_state() == BridgeState.STOPPED

    def test_backoff_calculation(self, tunnel_cfg, state_dir):
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        # First backoff = initial
        assert mgr._next_backoff(0) == 1
        # Doubles each time up to max
        assert mgr._next_backoff(1) == 2
        assert mgr._next_backoff(2) == 4
        assert mgr._next_backoff(3) == 5  # capped at max

    def test_start_daemonizes(self, tunnel_cfg, state_dir):
        """Verify start() forks without hanging."""
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)

        # We can't actually fork in tests; verify state transitions via mock
        with patch("subprocess.Popen") as mock_popen, \
             patch("os.fork", return_value=1234), \
             patch("os.setsid"), \
             patch("os._exit"):
            mock_proc = MagicMock()
            mock_proc.pid = 9999
            mock_popen.return_value = mock_proc

            # When fork returns non-zero we're the parent; just check PID written
            mgr.start()

        # After start the state should be STARTING (set before fork)
        # and PID file should exist (written in parent branch)

    def test_is_running_false_initially(self, tunnel_cfg, state_dir):
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        assert not mgr.is_running()


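`test_backoff_calculation` in `TestTunnelManager` pins down the reconnect delay sequence 1, 2, 4, 5 for `backoff_initial=1` and `backoff_max=5`. A capped exponential backoff would produce exactly that sequence. The sketch below is one way to write it; the function name and signature are hypothetical, and the real `TunnelManager._next_backoff` is not shown in this diff.

```python
def next_backoff(attempt: int, initial: float = 1, maximum: float = 5) -> float:
    """Capped exponential backoff: initial * 2**attempt, never above maximum.

    Hypothetical helper mirroring the values asserted in test_backoff_calculation.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    return min(initial * (2 ** attempt), maximum)


# Delay sequence for the first four reconnect attempts.
delays = [next_backoff(n) for n in range(4)]
```

The cap means the delay stays at `backoff_max` for every attempt after the doubling would exceed it.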
class TestBuildSshCommandWithCert:
    def test_no_cert_path_omits_extra_i(self, tunnel_cfg):
        cmd = build_ssh_command(tunnel_cfg)
        assert cmd.count("-i") == 1

    def test_cert_path_appends_after_key(self, tunnel_cfg, tmp_path):
        cert = tmp_path / "test-cert.pub"
        cert.write_text("cert")
        cmd = build_ssh_command(tunnel_cfg, cert_path=cert)
        i_indices = [i for i, x in enumerate(cmd) if x == "-i"]
        assert len(i_indices) == 2
        key_idx, cert_idx = i_indices
        assert not cmd[key_idx + 1].endswith("-cert.pub")  # key comes first
        assert cmd[cert_idx + 1] == str(cert)


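Taken together, `TestBuildSshCommand` and `TestBuildSshCommandWithCert` describe the argv that a reverse tunnel is expected to use: `ssh -N`, two `-o` keep-alive options, `-R remote:127.0.0.1:local`, the expanded key after `-i`, an optional second `-i` for the certificate, and `user@host`. The sketch below assembles such a list from plain arguments. The function name, parameter list, and option ordering are assumptions; only the individual tokens come from the tests, and the real `build_ssh_command` takes a `TunnelConfig`.

```python
import os
from typing import Optional


def sketch_ssh_command(
    user: str,
    host: str,
    remote_port: int,
    local_port: int,
    ssh_key: str,
    cert_path: Optional[str] = None,
) -> list[str]:
    """Build an argv list for a reverse (-R) SSH tunnel. Illustrative only."""
    cmd = [
        "ssh",
        "-N",                                  # no remote command, forwarding only
        "-o", "ServerAliveInterval=10",
        "-o", "ExitOnForwardFailure=yes",
        "-R", f"{remote_port}:127.0.0.1:{local_port}",
        "-i", os.path.expanduser(ssh_key),     # "~" is expanded before ssh sees it
    ]
    if cert_path is not None:
        # The tests expect the certificate as a second -i, after the key.
        cmd += ["-i", str(cert_path)]
    cmd.append(f"{user}@{host}")
    return cmd


plain = sketch_ssh_command("ubuntu", "host.local", 18000, 8000, "~/.ssh/id_ops")
with_cert = sketch_ssh_command(
    "ubuntu", "host.local", 18000, 8000, "~/.ssh/id_ops", cert_path="/tmp/test-cert.pub"
)
```

Returning a list rather than a shell string keeps the command safe to pass to `subprocess.Popen` without shell quoting.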
class TestRunCertCommand:
    def test_returns_none_when_no_cert_command(self, tunnel_cfg, tmp_path):
        from bridge.manager import _run_cert_command
        assert _run_cert_command(tunnel_cfg, tmp_path) is None

    def test_writes_cert_and_returns_path(self, tunnel_cfg, tmp_path):
        from bridge.manager import _run_cert_command
        tunnel_cfg.cert_command = "echo 'ssh-rsa-cert AAAA'"
        path = _run_cert_command(tunnel_cfg, tmp_path)
        assert path is not None
        assert path.exists()
        assert "ssh-rsa-cert" in path.read_text()

    def test_raises_on_nonzero_exit(self, tunnel_cfg, tmp_path):
        from bridge.manager import _run_cert_command
        from bridge.models import CertAcquisitionError
        tunnel_cfg.cert_command = "exit 1"
        with pytest.raises(CertAcquisitionError):
            _run_cert_command(tunnel_cfg, tmp_path)


class TestActorTypeFromName:
    def test_adm_prefix(self):
        from bridge.manager import _actor_type_from_name
        assert _actor_type_from_name("adm-bernd") == "adm"

    def test_agt_prefix(self):
        from bridge.manager import _actor_type_from_name
        assert _actor_type_from_name("agt-claude") == "agt"

    def test_atm_prefix(self):
        from bridge.manager import _actor_type_from_name
        assert _actor_type_from_name("atm-cron") == "atm"

    def test_unknown_prefix(self):
        from bridge.manager import _actor_type_from_name
        assert _actor_type_from_name("operator.bernd") == "unknown"


class TestTtlRefresh:
    def test_parse_cert_expiry_returns_none_for_missing_file(self, tmp_path):
        from bridge.manager import _parse_cert_expiry
        missing = tmp_path / "no.pub"
        result = _parse_cert_expiry(missing)
        assert result is None

    def test_parse_cert_identity_returns_none_for_missing_file(self, tmp_path):
        from bridge.manager import _parse_cert_identity
        missing = tmp_path / "no.pub"
        result = _parse_cert_identity(missing)
        assert result is None

    def test_parse_cert_identity_from_keygen_output(self, tmp_path):
        from unittest.mock import patch, MagicMock
        from bridge.manager import _parse_cert_identity
        cert = tmp_path / "test.pub"
        cert.write_text("fake")
        with patch("subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(
                stdout='test.pub:\n Key ID: "agt-bridge"\n',
                returncode=0,
            )
            result = _parse_cert_identity(cert)
        assert result == "agt-bridge"

    def test_parse_cert_expiry_from_keygen_output(self, tmp_path):
        from unittest.mock import patch, MagicMock
        from bridge.manager import _parse_cert_expiry
        cert = tmp_path / "test.pub"
        cert.write_text("fake")
        with patch("subprocess.run") as mock_run:
            mock_run.return_value = MagicMock(
                stdout="test.pub:\n Valid: from 2026-05-15T10:00:00 to 2030-05-15T22:00:00\n",
                returncode=0,
            )
            result = _parse_cert_expiry(cert)
        assert result is not None
        assert result.year == 2030
622
tests/test_mcp.py
Normal file
@@ -0,0 +1,622 @@
"""Tests for OpsBridge MCP server tools (FastMCP in-process client).

Uses FastMCP's Client(mcp_app) context manager — no network, no subprocess.
All tests are async; asyncio_mode = "auto" in pyproject.toml.

FastMCP 3.x returns results in result.content[0].text as a JSON string.
Use _data(result) to extract and parse.
"""
from __future__ import annotations

import json
import textwrap
from pathlib import Path
from unittest.mock import MagicMock, patch

import pytest

from bridge.mcp_server.server import mcp


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _data(result) -> list | dict:
    """Extract and parse JSON from a FastMCP CallToolResult.

    FastMCP 3.x: non-empty results are in result.content[0].text.
    Empty list/dict returns come back with empty content; result.data holds them.
    """
    if not result.content:
        return result.data  # empty list/dict
    text = result.content[0].text
    return json.loads(text)


def _write_config(tmp_path: Path, content: str) -> Path:
    f = tmp_path / "tunnels.yaml"
    f.write_text(content)
    return f


def _simple_config(tmp_path: Path) -> Path:
    return _write_config(tmp_path, textwrap.dedent("""\
        tunnels:
          test-tunnel:
            host: host.local
            remote_port: 18000
            local_port: 8000
            ssh_user: ubuntu
            ssh_key: ~/.ssh/id_ops
            actor: adm-bernd
        actors:
          adm-bernd:
            class: adm
            description: Bernd
    """))


def _catalog_config(tmp_path: Path, catalog_dir: Path) -> Path:
    return _write_config(tmp_path, textwrap.dedent(f"""\
        tunnels:
          test-tunnel:
            host: host.local
            remote_port: 18000
            local_port: 8000
            ssh_user: ubuntu
            ssh_key: ~/.ssh/id_ops
            actor: adm-bernd
        actors:
          adm-bernd:
            class: adm
            description: Bernd
        catalog_path: {catalog_dir}
    """))


# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------

@pytest.fixture
def env_simple(tmp_path, monkeypatch):
    cfg = _simple_config(tmp_path)
    monkeypatch.setenv("BRIDGE_CONFIG", str(cfg))
    monkeypatch.setenv("BRIDGE_STATE_DIR", str(tmp_path / "state"))


@pytest.fixture
def env_catalog(tmp_path, catalog_dir, monkeypatch):
    cfg = _catalog_config(tmp_path, catalog_dir)
    monkeypatch.setenv("BRIDGE_CONFIG", str(cfg))
    monkeypatch.setenv("BRIDGE_STATE_DIR", str(tmp_path / "state"))


@pytest.fixture
def env_no_catalog(tmp_path, monkeypatch):
    cfg = _simple_config(tmp_path)
    monkeypatch.setenv("BRIDGE_CONFIG", str(cfg))
    monkeypatch.setenv("BRIDGE_STATE_DIR", str(tmp_path / "state"))


# ---------------------------------------------------------------------------
# bridge_status
# ---------------------------------------------------------------------------

class TestMcpBridgeStatus:
    @pytest.mark.capability("bridge_status")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_status_returns_list(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        data = _data(result)
        assert isinstance(data, list)
        assert len(data) == 1
        row = data[0]
        assert row["tunnel"] == "test-tunnel"
        assert "state" in row
        assert "actor" in row
        assert "host" in row

    async def test_bridge_status_bad_config(self, tmp_path, monkeypatch):
        monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "nonexistent.yaml"))
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# bridge_up
# ---------------------------------------------------------------------------

class TestMcpBridgeUp:
    @pytest.mark.capability("bridge_up")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_up_starts_tunnel(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_cls.return_value = mock_mgr

            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {"tunnel": "test-tunnel"})

        data = _data(result)
        assert "started" in data
        assert "test-tunnel" in data["started"]

    async def test_bridge_up_already_running(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = True
            mock_cls.return_value = mock_mgr

            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {"tunnel": "test-tunnel"})

        data = _data(result)
        assert "already_running" in data
        assert "test-tunnel" in data["already_running"]

    async def test_bridge_up_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_up", {"tunnel": "nonexistent"})
        data = _data(result)
        assert "error" in data

    async def test_bridge_up_all_tunnels(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_cls.return_value = mock_mgr

            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {})

        data = _data(result)
        assert "started" in data
        assert "test-tunnel" in data["started"]


# ---------------------------------------------------------------------------
# bridge_down
# ---------------------------------------------------------------------------

class TestMcpBridgeDown:
    @pytest.mark.capability("bridge_down")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_down_stops_tunnel(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = True
            mock_cls.return_value = mock_mgr

            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_down", {"tunnel": "test-tunnel"})

        data = _data(result)
        assert "stopped" in data
        assert "test-tunnel" in data["stopped"]

    async def test_bridge_down_not_running(self, env_simple):
        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_cls.return_value = mock_mgr

            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_down", {"tunnel": "test-tunnel"})

        data = _data(result)
        assert "not_running" in data
        assert "test-tunnel" in data["not_running"]

    async def test_bridge_down_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_down", {"tunnel": "nonexistent"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# bridge_restart
# ---------------------------------------------------------------------------

class TestMcpBridgeRestart:
    @pytest.mark.capability("bridge_restart")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_restart_delegates_to_cleanup(self, env_simple):
        from bridge.cleanup import CleanupAction

        with patch("bridge.cleanup.restart_tunnel") as mock_restart:
            mock_restart.return_value = CleanupAction(
                "test-tunnel", "healthy", "remote forward healthy"
            )

            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_restart", {"tunnel": "test-tunnel"})

        data = _data(result)
        assert data["actions"][0]["tunnel"] == "test-tunnel"
        assert data["actions"][0]["action"] == "healthy"
        mock_restart.assert_called_once()

    async def test_bridge_restart_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_restart", {"tunnel": "nonexistent"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# bridge_logs
# ---------------------------------------------------------------------------

class TestMcpBridgeLogs:
    @pytest.mark.capability("bridge_logs")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_logs_returns_list(self, env_simple, tmp_path):
        import json as _json
        state_dir = tmp_path / "state"
        state_dir.mkdir(parents=True, exist_ok=True)
        log_file = state_dir / "test-tunnel.log"
        log_file.write_text(
            _json.dumps({
                "timestamp": "2026-01-01T00:00:00+00:00",
                "tunnel": "test-tunnel",
                "actor": "adm-bernd",
                "actor_type": "adm",
                "event": "bridge_started",
            }) + "\n"
        )
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_logs", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert isinstance(data, list)
        assert len(data) == 1
        assert data[0]["event"] == "bridge_started"

    async def test_bridge_logs_unknown_tunnel(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_logs", {"tunnel": "nonexistent"})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]

    async def test_bridge_logs_empty(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_logs", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert isinstance(data, list)
        assert data == []


# ---------------------------------------------------------------------------
# catalog_list_targets
# ---------------------------------------------------------------------------

class TestMcpCatalogListTargets:
    @pytest.mark.capability("catalog_list_targets")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_list_targets_returns_list(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_targets", {})
        data = _data(result)
        assert isinstance(data, list)
        assert any(t["id"] == "state-hub" for t in data)

    async def test_catalog_list_targets_domain_filter(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_targets", {"domain": "coulombcore"})
        data = _data(result)
        assert all(t["domain"] == "coulombcore" for t in data)

    async def test_catalog_list_targets_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_targets", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# catalog_show_target
# ---------------------------------------------------------------------------

class TestMcpCatalogShowTarget:
    @pytest.mark.capability("catalog_show_target")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_show_target_returns_metadata(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_target", {"target_id": "state-hub"})
        data = _data(result)
        assert data["id"] == "state-hub"
        assert data["domain"] == "coulombcore"

    async def test_catalog_show_target_not_found(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_target", {"target_id": "nonexistent"})
        data = _data(result)
        assert "error" in data

    async def test_catalog_show_target_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_target", {"target_id": "x"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# catalog_list_domains
# ---------------------------------------------------------------------------

class TestMcpCatalogListDomains:
    @pytest.mark.capability("catalog_list_domains")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_list_domains_returns_list(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_domains", {})
        data = _data(result)
        assert isinstance(data, list)
        assert any(d["id"] == "coulombcore" for d in data)

    async def test_catalog_list_domains_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_list_domains", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# catalog_validate
# ---------------------------------------------------------------------------

class TestMcpCatalogValidate:
    @pytest.mark.capability("catalog_validate")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_validate_clean(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_validate", {})
        data = _data(result)
        assert data["valid"] is True

    async def test_catalog_validate_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_validate", {})
        data = _data(result)
        assert data["valid"] is False
        assert len(data["errors"]) > 0

    async def test_catalog_validate_with_errors(self, tmp_path, monkeypatch):
        root = tmp_path / "bad-catalog"
        domain_dir = root / "domains" / "d"
        (domain_dir / "targets").mkdir(parents=True)
        (domain_dir / "domain.yaml").write_text("type: domain\nid: d\nname: D\n")
        (domain_dir / "targets" / "t.yaml").write_text(
            "type: target\nid: t\ndomain: d\nkind: service\n"
            "reachable_via:\n - missing-bridge\n"
        )
        cfg = tmp_path / "tunnels.yaml"
        cfg.write_text(f"tunnels: {{}}\nactors: {{}}\ncatalog_path: {root}\n")
        monkeypatch.setenv("BRIDGE_CONFIG", str(cfg))
        monkeypatch.setenv("BRIDGE_STATE_DIR", str(tmp_path / "state"))

        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_validate", {})
        data = _data(result)
        assert data["valid"] is False
        assert any("missing-bridge" in e for e in data["errors"])


# ---------------------------------------------------------------------------
# catalog_show_bridge
# ---------------------------------------------------------------------------

class TestMcpCatalogShowBridge:
    @pytest.mark.capability("catalog_show_bridge")
    @pytest.mark.access_mode("mcp")
    async def test_catalog_show_bridge_returns_metadata(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool(
                "catalog_show_bridge", {"bridge_id": "state-hub-coulombcore"}
            )
        data = _data(result)
        assert data["id"] == "state-hub-coulombcore"
        assert data["host"] == "coulombcore.local"

    async def test_catalog_show_bridge_not_found(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_bridge", {"bridge_id": "nonexistent"})
        data = _data(result)
        assert "error" in data

    async def test_catalog_show_bridge_no_catalog(self, env_no_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("catalog_show_bridge", {"bridge_id": "x"})
        data = _data(result)
        assert "error" in data


# ---------------------------------------------------------------------------
# bridge_check
# ---------------------------------------------------------------------------

class TestMcpBridgeCheck:
    @pytest.mark.capability("bridge_check")
    @pytest.mark.access_mode("mcp")
    async def test_bridge_check_tool(self, env_simple):
        """bridge_check returns a list of dicts with 'ok' key."""
        from bridge.diagnostics import TunnelCheckResult
        mock_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="ok",
            pid=12345,
            remote_port="listening",
            local_api=None,
            latency_ms=None,
            stale_state=False,
        )
        with patch("bridge.mcp_server.server.check_all_tunnels", return_value=[mock_result]):
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_check", {})
        data = _data(result)
        assert isinstance(data, list)
        assert len(data) == 1
        row = data[0]
        assert "ok" in row
        assert row["ok"] is True
        assert row["tunnel"] == "test-tunnel"
        assert row["ssh_process"] == "ok"
        assert row["remote_port"] == "listening"

    async def test_bridge_check_specific_tunnel(self, env_simple):
        """bridge_check with tunnel arg calls check_tunnel for that tunnel."""
        from bridge.diagnostics import TunnelCheckResult
        mock_result = TunnelCheckResult(
            tunnel="test-tunnel",
            ssh_process="dead",
            pid=None,
            remote_port="closed",
            local_api=None,
            latency_ms=None,
            stale_state=True,
        )
        with patch("bridge.mcp_server.server.check_tunnel", return_value=mock_result):
            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_check", {"tunnel": "test-tunnel"})
        data = _data(result)
        assert isinstance(data, list)
        assert data[0]["ok"] is False
        assert data[0]["stale_state"] is True

    async def test_bridge_check_unknown_tunnel(self, env_simple):
        """bridge_check with unknown tunnel returns error dict."""
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_check", {"tunnel": "nonexistent"})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]

    async def test_bridge_check_bad_config(self, tmp_path, monkeypatch):
        """bridge_check with bad config returns error dict."""
        monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "nonexistent.yaml"))
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_check", {})
        data = _data(result)
        assert isinstance(data, list)
        assert "error" in data[0]


# ---------------------------------------------------------------------------
# Resources
# ---------------------------------------------------------------------------

class TestMcpResources:
    async def test_bridge_status_resource(self, env_simple):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.read_resource("bridge://status")
        content = result[0].text if hasattr(result[0], "text") else str(result[0])
        data = json.loads(content)
        assert isinstance(data, list)

    async def test_catalog_domains_resource(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.read_resource("catalog://domains")
        content = result[0].text if hasattr(result[0], "text") else str(result[0])
        data = json.loads(content)
        assert isinstance(data, list)

    async def test_catalog_targets_resource(self, env_catalog):
        from fastmcp import Client
        async with Client(mcp) as c:
            result = await c.read_resource("catalog://targets")
        content = result[0].text if hasattr(result[0], "text") else str(result[0])
        data = json.loads(content)
        assert isinstance(data, list)


# ---------------------------------------------------------------------------
# T15 — Agent workflow integration test: bridge_status → bridge_up → bridge_status
# ---------------------------------------------------------------------------

class TestMcpAgentWorkflow:
    """T15: Verify the MCP layer supports an agent's typical tunnel management workflow."""

    @pytest.mark.capability("bridge_up")
    @pytest.mark.access_mode("mcp")
    async def test_agent_status_up_status_workflow(self, env_simple, tmp_path):
        """Agent workflow: check status (stopped) → start tunnel → verify started."""
        from fastmcp import Client
        from bridge.models import BridgeState

        state_dir = tmp_path / "state"

        # Step 1: bridge_status → all stopped
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        rows = _data(result)
        assert rows[0]["state"] == BridgeState.STOPPED.value

        # Step 2: bridge_up — mock TunnelManager to capture the call and write state
        def mock_start_writes_state():
            sd = state_dir
            sd.mkdir(parents=True, exist_ok=True)
            (sd / "test-tunnel.state").write_text(BridgeState.CONNECTED.value)
            (sd / "test-tunnel.pid").write_text("12345")

        with patch("bridge.manager.TunnelManager") as mock_cls:
            mock_mgr = MagicMock()
            mock_mgr.is_running.return_value = False
            mock_mgr.start.side_effect = mock_start_writes_state
            mock_cls.return_value = mock_mgr

            async with Client(mcp) as c:
                result = await c.call_tool("bridge_up", {"tunnel": "test-tunnel"})

        up_data = _data(result)
        assert "test-tunnel" in up_data["started"]

        # Step 3: bridge_status → reflects connected state
        async with Client(mcp) as c:
            result = await c.call_tool("bridge_status", {})
        rows = _data(result)
        assert rows[0]["tunnel"] == "test-tunnel"
        assert rows[0]["state"] == BridgeState.CONNECTED.value
75
tests/test_models.py
Normal file
@@ -0,0 +1,75 @@
"""Tests for domain models."""
from bridge.models import (
    ActorInfo,
    BridgeState,
    HealthCheckConfig,
    ReconnectPolicy,
    TunnelConfig,
)


class TestBridgeState:
    def test_all_states_defined(self):
        states = {s.value for s in BridgeState}
        assert states == {"stopped", "starting", "connected", "degraded", "reconnecting", "failed"}

    def test_state_is_string(self):
        assert BridgeState.STOPPED == "stopped"


class TestReconnectPolicy:
    def test_defaults(self):
        p = ReconnectPolicy()
        assert p.max_attempts == 0
        assert p.backoff_initial == 5
        assert p.backoff_max == 60

    def test_custom(self):
        p = ReconnectPolicy(max_attempts=3, backoff_initial=2, backoff_max=30)
        assert p.max_attempts == 3


class TestHealthCheckConfig:
    def test_required_url(self):
        h = HealthCheckConfig(url="http://127.0.0.1:18000/health")
        assert h.url == "http://127.0.0.1:18000/health"
        assert h.interval_seconds == 30
        assert h.timeout_seconds == 5


class TestTunnelConfig:
    def test_minimal(self):
        t = TunnelConfig(
            name="test-tunnel",
            host="host.local",
            remote_port=18000,
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
            actor="operator.bernd",
        )
        assert t.name == "test-tunnel"
        assert t.health_check is None
        assert isinstance(t.reconnect, ReconnectPolicy)

    def test_with_health_check(self):
        hc = HealthCheckConfig(url="http://127.0.0.1:18000/health")
        t = TunnelConfig(
            name="test",
            host="h",
            remote_port=1,
            local_port=2,
            ssh_user="u",
            ssh_key="k",
            actor="a",
            health_check=hc,
        )
        assert t.health_check is hc


class TestActorInfo:
    def test_fields(self):
        from bridge.models import ActorType
        a = ActorInfo(name="adm-bernd", actor_type=ActorType.ADM, description="Bernd")
        assert a.name == "adm-bernd"
        assert a.actor_type == ActorType.ADM
105
tests/test_skill.py
Normal file
@@ -0,0 +1,105 @@
"""Static lint tests for OpsBridge skill files.

Validates that every skill file in ~/.claude/plugins/ops-bridge/:
- Has required frontmatter (name, description)
- References at least one canonical capability name in its body
- Points to capabilities that exist in the registry

Also validates the bridge-status skill exercises bridge_status capability
per the skill access_mode requirement in the registry.
"""
from __future__ import annotations

from pathlib import Path

import pytest

from bridge.capabilities import CAPABILITIES_BY_NAME

PLUGINS_DIR = Path.home() / ".claude" / "plugins" / "ops-bridge"


def _find_skill_files() -> list[Path]:
    if not PLUGINS_DIR.exists():
        return []
    return sorted(PLUGINS_DIR.glob("*.md"))


def _parse_frontmatter(text: str) -> dict[str, str]:
    """Extract YAML frontmatter fields (name, description) — minimal parser."""
    fields: dict[str, str] = {}
    if not text.startswith("---"):
        return fields
    end = text.find("\n---", 3)
    if end == -1:
        return fields
    for line in text[3:end].splitlines():
        if ":" in line:
            key, _, val = line.partition(":")
            fields[key.strip()] = val.strip()
    return fields


SKILL_FILES = _find_skill_files()


@pytest.mark.parametrize("skill_file", SKILL_FILES, ids=lambda f: f.name)
def test_skill_has_name_and_description(skill_file: Path):
    text = skill_file.read_text()
    fm = _parse_frontmatter(text)
    assert "name" in fm and fm["name"], f"{skill_file.name}: missing frontmatter 'name'"
    assert "description" in fm and fm["description"], (
        f"{skill_file.name}: missing frontmatter 'description'"
    )


@pytest.mark.parametrize("skill_file", SKILL_FILES, ids=lambda f: f.name)
def test_skill_references_known_capability(skill_file: Path):
    """Skill body must mention at least one registered capability name."""
    text = skill_file.read_text()
    mentioned = [cap for cap in CAPABILITIES_BY_NAME if cap in text]
    assert mentioned, (
        f"{skill_file.name}: does not reference any known capability name. "
        f"Known capabilities: {sorted(CAPABILITIES_BY_NAME)}"
    )


@pytest.mark.parametrize("skill_file", SKILL_FILES, ids=lambda f: f.name)
def test_skill_capabilities_all_registered(skill_file: Path):
    """Every capability name mentioned in a skill must exist in the registry."""
    text = skill_file.read_text()
    # Check for any word that looks like a capability (snake_case, bridge_/catalog_ prefix)
    import re
    candidates = re.findall(r"\b(?:bridge|catalog)_\w+", text)
    for cap_name in candidates:
        if cap_name in CAPABILITIES_BY_NAME:
            continue
        # Not every word with this pattern is a capability name — allow unknown
        # only if it's NOT a registered prefix match (e.g. bridge_started is an event)
        pass  # lenient: only fail on exact registry names


def test_bridge_status_skill_exists():
|
||||
skill = PLUGINS_DIR / "bridge-status.md"
|
||||
assert skill.exists(), "bridge-status.md skill file not found"
|
||||
|
||||
|
||||
@pytest.mark.capability("bridge_status")
|
||||
@pytest.mark.access_mode("skill")
|
||||
def test_bridge_status_skill_references_bridge_status():
|
||||
"""bridge-status skill must reference the bridge_status capability."""
|
||||
skill = PLUGINS_DIR / "bridge-status.md"
|
||||
assert skill.exists()
|
||||
text = skill.read_text()
|
||||
assert "bridge_status" in text, (
|
||||
"bridge-status.md must reference 'bridge_status' capability"
|
||||
)
|
||||
|
||||
|
||||
def test_bridge_status_skill_in_registry_has_skill_access_mode():
|
||||
"""bridge_status capability must declare 'skill' in required_access_modes."""
|
||||
cap = CAPABILITIES_BY_NAME.get("bridge_status")
|
||||
assert cap is not None
|
||||
assert "skill" in cap.required_access_modes, (
|
||||
"bridge_status capability must list 'skill' as a required_access_mode"
|
||||
)
|
||||
68
tests/test_state.py
Normal file
@@ -0,0 +1,68 @@
|
||||
"""Tests for state management."""
|
||||
import os

import pytest

from bridge.models import BridgeState
from bridge.state import StateManager


@pytest.fixture
def state_dir(tmp_path):
    return tmp_path / "bridge"


@pytest.fixture
def mgr(state_dir):
    return StateManager(state_dir=state_dir)


class TestStateManager:
    def test_read_state_no_file_returns_stopped(self, mgr):
        assert mgr.read_state("my-tunnel") == BridgeState.STOPPED

    def test_write_and_read_state(self, mgr):
        mgr.write_state("my-tunnel", BridgeState.CONNECTED)
        assert mgr.read_state("my-tunnel") == BridgeState.CONNECTED

    def test_state_roundtrip_all_values(self, mgr):
        for state in BridgeState:
            mgr.write_state("t", state)
            assert mgr.read_state("t") == state

    def test_write_pid(self, mgr):
        # Write a live PID (our own process) so read_pid can confirm it's alive
        pid = os.getpid()
        mgr.write_pid("my-tunnel", pid)
        assert mgr.read_pid("my-tunnel") == pid

    def test_read_pid_no_file_returns_none(self, mgr):
        assert mgr.read_pid("nonexistent") is None

    def test_stale_pid_returns_none(self, mgr):
        # PID 999999 almost certainly does not exist
        mgr.write_pid("my-tunnel", 999999)
        assert mgr.read_pid("my-tunnel") is None

    def test_current_pid_is_alive(self, mgr):
        mgr.write_pid("my-tunnel", os.getpid())
        assert mgr.read_pid("my-tunnel") == os.getpid()

    def test_clear_pid(self, mgr):
        mgr.write_pid("my-tunnel", os.getpid())
        mgr.clear_pid("my-tunnel")
        assert mgr.read_pid("my-tunnel") is None

    def test_state_dir_created_on_write(self, state_dir):
        assert not state_dir.exists()
        mgr = StateManager(state_dir=state_dir)
        mgr.write_state("t", BridgeState.STOPPED)
        assert state_dir.exists()

    def test_is_running_false_when_stopped(self, mgr):
        assert not mgr.is_running("my-tunnel")

    def test_is_running_true_when_pid_alive(self, mgr):
        mgr.write_pid("my-tunnel", os.getpid())
        mgr.write_state("my-tunnel", BridgeState.CONNECTED)
        assert mgr.is_running("my-tunnel")
|
||||
203
wiki/AccessManagementDirective.md
Normal file
@@ -0,0 +1,203 @@
|
||||
AccessManagementDirective
|
||||
|
||||
*Practical host access control management*
|
||||
|
||||
# AccessManagementDirective
|
||||
|
||||
**Document Title:** SSH Access Management Directive
|
||||
**Version:** 1.1 (Production-Ready Revision – Post-SWOT Improvements)
|
||||
**Date:** 28 March 2026
|
||||
**Audience:** Operations Department
|
||||
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
|
||||
**Author:** Grok (on behalf of the team)
|
||||
**Status:** Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.
|
||||
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
|
||||
|
||||
## 0. Prerequisites
|
||||
|
||||
Before bootstrapping, the following must be in place:
|
||||
- Ansible (or equivalent config-management tool) with a central inventory.
|
||||
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
|
||||
- GitOps repository containing the authoritative principals inventory.
|
||||
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
|
||||
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
|
||||
|
||||
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
|
||||
|
||||
## 1. Concept Overview
|
||||
|
||||
This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
|
||||
|
||||
**Why this model?**
|
||||
- A central CA signs short-lived certificates for every login.
|
||||
- No more manual key copying, key sprawl, or painful revocation.
|
||||
- Built-in expiration, role-based principals, and auditability.
|
||||
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
|
||||
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
|
||||
|
||||
**Core Principles**
|
||||
- **Least privilege** – Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
|
||||
- **Short-lived credentials** – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
|
||||
- **One CA, many issuers** – A single offline User CA whose public key is trusted by every host.
|
||||
- **Automation-first** – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
|
||||
- **Separation of concerns** –
|
||||
- **Admins (adm)**: Human operators (full interactive shell when needed).
|
||||
- **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
|
||||
- **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
|
||||
|
||||
## 2. Actor Definitions & Access Model
|
||||
|
||||
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
|
||||
|
||||
**Certificate Naming Convention**
|
||||
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
|
||||
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
|
||||
|
||||
**LLM-Agent Risk Clarification**
|
||||
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
|
||||
|
||||
## 3. Bootstrapping the System (One-Time Setup)
|
||||
|
||||
### 3.1. Create the CA (do this once, offline)
|
||||
```bash
|
||||
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
|
||||
```
|
||||
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
|
||||
- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
|
||||
- Public key: `ca_user.pub`
|
||||
|
||||
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
|
||||
- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:
  ```
  TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
  AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
  PubkeyAuthentication yes
  PasswordAuthentication no
  PermitRootLogin no
  ```
- Create principals directory and files from the central Git inventory.
- Validate the new configuration with `sshd -t`, then run `systemctl restart sshd`.
|
||||
|
||||
### 3.3. Initial Admin Access
|
||||
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
|
||||
|
||||
## 4. Automatic Management of Access Rights
|
||||
|
||||
### 4.1. Daily / On-Demand Workflow
|
||||
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
|
||||
- **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
|
||||
- **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
|
||||
- **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
|
||||
|
||||
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
|
||||
- `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
|
||||
- Example inventory snippet:
|
||||
```yaml
|
||||
hosts:
  - name: prod-db-01
    allowed_principals:
      adm: [adm-full]
      agt: [agt-incident-resolver-v2]
      atm: [atm-backup-daily, atm-logrotate]
|
||||
```
|
||||
|
||||
3. **Revocation & Rotation**
|
||||
- Short expiry = automatic revocation.
|
||||
- For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
|
||||
- Agents/automations never store long-lived private keys on disk.
|
||||
|
||||
4. **Concrete Agent & Automation Wrapper Example** (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`)
|
||||
```python
|
||||
#!/usr/bin/env python3
import os
import subprocess
import sys

# Assumption: SSH_KEY_PATH names the private key file; SSH_PUBKEY holds the public key text.
key_path = os.environ["SSH_KEY_PATH"]

# Request short-lived cert from Vault
cert = subprocess.check_output(
    ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
     f"public_key={os.environ['SSH_PUBKEY']}"],
    text=True,
).strip()

# ssh-add takes the private key and picks up the certificate from <key>-cert.pub
with open(f"{key_path}-cert.pub", "w") as f:
    f.write(cert + "\n")

# Load key and cert into ssh-agent and exec the real command
subprocess.run(["ssh-add", key_path], check=True)
os.execvp(sys.argv[1], sys.argv[1:])
|
||||
```
|
||||
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
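
The rendering described in step 2 can be sketched as follows. This is only an illustration: it assumes the inventory has already been parsed into a Python dict shaped like the YAML snippet, and that each actor class (`adm`, `agt`, `atm`) corresponds to a login account of the same name so that it matches `%u` in `AuthorizedPrincipalsFile`. The function name `render_principals` is hypothetical and not taken from the ops repo.

```python
import tempfile
from pathlib import Path


def render_principals(host: dict, out_dir: Path) -> list[Path]:
    """Write one principals file per login account for a single host entry."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for account, principals in sorted(host["allowed_principals"].items()):
        path = out_dir / account
        # sshd reads AuthorizedPrincipalsFile as one principal per line.
        path.write_text("\n".join(principals) + "\n")
        written.append(path)
    return written


# Inventory shaped like the example snippet (already parsed from YAML).
inventory = {
    "hosts": [
        {
            "name": "prod-db-01",
            "allowed_principals": {
                "adm": ["adm-full"],
                "agt": ["agt-incident-resolver-v2"],
                "atm": ["atm-backup-daily", "atm-logrotate"],
            },
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    files = render_principals(inventory["hosts"][0], Path(tmp))
    rendered = {p.name: p.read_text().split() for p in files}
```

In a real run the output directory would be `/etc/ssh/auth_principals/` on the target host, written by the Ansible play rather than by hand.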
|
||||
|
||||
### 4.2. Human UX Guidance
|
||||
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for seamless experience. Manual `ssh-keygen -s` is only for edge cases.
|
||||
|
||||
### 4.3. Emergency Break-Glass Procedure
|
||||
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
|
||||
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
|
||||
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
|
||||
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
|
||||
4. After recovery, immediately rotate the CA and run a full scorecard.
|
||||
|
||||
## 5. AccessManagement Scorecard (Checklist)
|
||||
|
||||
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
|
||||
|
||||
| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have a `Valid:` window of ≤ 48h (adm), ≤ 24h (agt), or ≤ 8h (atm) | Last 100 certs | `ssh-keygen -L -f <cert>` per certificate |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | 10/10 = **Operational** | - | - |
|
||||
|
||||
**Scorecard Execution Command** (run from ops laptop):
|
||||
```bash
|
||||
ansible all -m command -a "ssh-access-scorecard.sh" --become
|
||||
```
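
The Expiry Policy check could be automated along these lines. The sketch parses the `Valid: from … to …` line that `ssh-keygen -L` prints for a certificate and compares the window against the upper bounds given under Core Principles (48 h adm, 24 h agt, 8 h atm). The function name `lifetime_ok` and the idea of deriving the actor class from the identity prefix are assumptions for illustration, not part of `ssh-access-scorecard.sh`.

```python
import re
from datetime import datetime, timedelta

# Upper bounds taken from the Core Principles section of this directive.
MAX_LIFETIME = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=8),
}

VALID_RE = re.compile(r"Valid: from (\S+) to (\S+)")


def lifetime_ok(identity: str, keygen_output: str) -> bool:
    """Return True if the certificate's validity window fits the actor's limit."""
    limit = MAX_LIFETIME.get(identity.split("-", 1)[0])
    match = VALID_RE.search(keygen_output)
    if limit is None or match is None:
        # Unknown actor prefix, or no bounded window (e.g. "Valid: forever").
        return False
    start, end = (datetime.fromisoformat(ts) for ts in match.groups())
    return end - start <= limit


sample = "        Valid: from 2026-03-28T10:00:00 to 2026-03-28T18:00:00"
atm_result = lifetime_ok("atm-backup-daily", sample)
forever_result = lifetime_ok("adm-bernd", "        Valid: forever")
```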
|
||||
|
||||
## 6. Scope & Operational Boundaries
|
||||
|
||||
### 6.1. When Bootstrapping Is Officially Closed
|
||||
The system is **fully operational** when **ALL** of the following are true:
|
||||
- Scorecard passes 10/10 on every host.
|
||||
- Central Git repo contains the authoritative principals inventory.
|
||||
- First three admins have successfully used signed certificates for 7 consecutive days.
|
||||
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
|
||||
- CI/CD pipeline for host config updates is green and runs hourly.
|
||||
- Emergency break-glass procedure has been tested once.
|
||||
|
||||
**Declaration:** Ops Lead signs off with date in the Git commit message.
|
||||
|
||||
### 6.2. Scope Boundary – When to Switch to Sophisticated Tooling
|
||||
Stay with **native OpenSSH CA + Ansible + Vault** while:
|
||||
- ≤ 200 hosts
|
||||
- ≤ 50 distinct agent/automation identities
|
||||
- No regulatory requirement for SSO or full session recording
|
||||
|
||||
**Switch triggers** (any one):
- More than 200 hosts, or rapid daily growth
|
||||
- Need for human SSO (Okta/Google) integration
|
||||
- Requirement for audited web-based SSH sessions or just-in-time access approval
|
||||
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
|
||||
- Audit/compliance demands central policy engine or session recording
|
||||
|
||||
**Recommended next-level tools** (in order):
|
||||
1. **Teleport** – Best for mixed human + agent workloads (SSO + Machine ID).
|
||||
2. **HashiCorp Vault SSH + Boundary** – When you already use Vault heavily.
|
||||
3. **step-ca + smallstep** – If you prefer a pure open-source CA with OIDC.
|
||||
|
||||
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
|
||||
|
||||
## 7. Enforcement & Review
|
||||
- **Quarterly review** of this directive and scorecard results.
|
||||
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
|
||||
- **Questions / improvements** → create PR against this file in the ops repo.
|
||||
|
||||
**End of Document**
|
||||
Approved for immediate use across all production and staging environments.
|
||||
|
||||
|
||||
@@ -157,31 +157,82 @@ Just controlled operational access when you need it.
|
||||
Start a bridge:
|
||||
|
||||
```
|
||||
ob up hostA=hostB
|
||||
bridge up state-hub-railiance01
|
||||
```
|
||||
|
||||
Check active bridges:
|
||||
|
||||
```
|
||||
ob status
|
||||
bridge status
|
||||
```
|
||||
|
||||
Investigate infrastructure targets:
|
||||
|
||||
```
|
||||
ob targets
|
||||
bridge targets
|
||||
```
|
||||
|
||||
Stop the bridge when finished:
|
||||
|
||||
```
|
||||
ob down hostA=hostB
|
||||
bridge down state-hub-railiance01
|
||||
```
|
||||
|
||||
OpsBridge handles the lifecycle so operators can focus on solving the problem.
|
||||
|
||||
---
|
||||
|
||||
# Tunnel lifecycle commands
|
||||
|
||||
| Command | Purpose |
|---------|---------|
| `bridge up` | Start tunnel(s) that are not already running |
| `bridge down` | Stop tunnel(s) that are running |
| `bridge restart` | Blank-slate recovery: get tunnel(s) operational again |
| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |
|
||||
|
||||
## `bridge restart` — blank-slate recovery
|
||||
|
||||
`bridge restart` means *operational again*, not merely cycling the local manager
|
||||
PID while a broken remote listener still holds the port.
|
||||
|
||||
For **reverse** tunnels (State Hub exposure on remote hosts), restart:
|
||||
|
||||
1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
|
||||
2. Clears orphan listeners on the remote host when needed
|
||||
3. Reconnects the tunnel (stop + start) only when cleanup was required
|
||||
|
||||
When the remote forward is already healthy, restart reports `healthy` and leaves
|
||||
the working tunnel running — no unnecessary disruption.
|
||||
|
||||
For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g.
|
||||
`k3s-api-coulombcore`), restart uses local stop/start only; no remote cleanup.
|
||||
|
||||
Use `bridge maintenance cleanup` for scheduled or manual hygiene without the
|
||||
restart contract. The nightly cron (`bridge maintenance install-cron`) runs
|
||||
`maintenance cleanup --restart` at 03:00.
|
||||
|
||||
**Incident context:** stale orphan `sshd` remote forwards after laptop sleep
|
||||
blocked `bridge restart` until operators discovered the maintenance subcommand.
|
||||
See `state-hub/history/20260621-weekend-automation-assessment.md` and
|
||||
`BRIDGE-WP-0005` in this repo.
|
||||
|
||||
## Host roles
|
||||
|
||||
Tunnels in `~/.config/bridge/tunnels.yaml` serve three host roles:
|
||||
|
||||
| Role | Hosts | Behaviour |
|------|-------|-----------|
| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |
|
||||
|
||||
Conditional remote cleanup before restart benefits all reverse tunnels.
|
||||
`should_cleanup_tunnel` skips healthy forwards — VPS tunnels with live working
|
||||
forwards are untouched.
|
||||
|
||||
---
|
||||
|
||||
# The Philosophy Behind OpsBridge
|
||||
|
||||
Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:
|
||||
|
||||
56
workplans/ADHOC-2026-06-14.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
id: ADHOC-2026-06-14
|
||||
type: workplan
|
||||
title: "Ad hoc ops-bridge fixes for 2026-06-14"
|
||||
domain: custodian
|
||||
repo: ops-bridge
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: ops-bridge
|
||||
created: "2026-06-14"
|
||||
updated: "2026-06-14"
|
||||
state_hub_workstream_id: "fbc2ef7e-626f-4c6a-bdf8-c69bf29097ce"
|
||||
---
|
||||
|
||||
## Fix haskelseed bridge diagnostics
|
||||
|
||||
```task
|
||||
id: ADHOC-2026-06-14-T01
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "ffe6b8d8-889c-4ec4-8b64-00b77f86e39f"
|
||||
```
|
||||
|
||||
`haskelseed` is an Alpine host without `ss`, so `bridge check` reported
|
||||
reverse tunnel ports as closed even while SSH reverse listeners were present.
|
||||
Updated diagnostics to fall back from `ss` to `netstat` and then
|
||||
`/proc/net/tcp`/`tcp6`. Also fixed local-direction diagnostics so
|
||||
`nix-daemon-haskelseed` checks the local `-L` listener instead of probing a
|
||||
remote reverse port.
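
For reference, the last fallback step (reading `/proc/net/tcp` when neither `ss` nor `netstat` is available) can be sketched as below. This is a minimal illustration rather than the code shipped in `bridge check`; the function name `listening_ports` is hypothetical. It relies on the kernel's table layout, where the second column is the hex `address:port` and the fourth column is the socket state, with `0A` meaning LISTEN.

```python
def listening_ports(proc_net_tcp: str) -> set[int]:
    """Return the TCP ports in LISTEN state from /proc/net/tcp (or tcp6) text."""
    ports: set[int] = set()
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == "0A":  # 0A = TCP_LISTEN
            # The port is the hex number after the last colon.
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports


SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:4650 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345
   1: 0100007F:0016 0100007F:C350 01 00000000:00000000 00:00000000 00000000     0        0 12346
"""

open_ports = listening_ports(SAMPLE)
reverse_port_open = 18000 in open_ports
```

The same parser works for `/proc/net/tcp6`, because only the portion after the last colon of the local address is read.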
|
||||
|
||||
Verification:
|
||||
|
||||
- `state-hub-haskelseed` responded through `127.0.0.1:18000/state/health`.
|
||||
- `bridge check --json` reported all configured tunnels `ok: true`.
|
||||
- `python3 -m pytest tests/test_cli.py tests/test_diagnostics.py` passed.
|
||||
|
||||
## Make default target safe and add setup
|
||||
|
||||
```task
|
||||
id: ADHOC-2026-06-14-T02
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "3b932955-0d75-4b95-9821-92bfa2dadbd0"
|
||||
```
|
||||
|
||||
Changed `make` to default to a help listing that only shows targets with
|
||||
`##` comments. Added `make setup` to run `uv sync --all-groups` and reinstall
|
||||
the editable `bridge` CLI wrapper through `uv tool install -e . --force`.
|
||||
|
||||
Verification:
|
||||
|
||||
- `uv sync --all-groups` succeeded and installed the project environment.
|
||||
- `make` listed targets only and did not run tests or setup.
|
||||
- `make setup` succeeded and installed the `bridge` executable.
|
||||
- `make test` passed all 235 tests.
|
||||
- `make lint` passed.
|
||||
420
workplans/BRIDGE-WP-0001-initial-implementation.md
Normal file
@@ -0,0 +1,420 @@
|
||||
---
|
||||
id: BRIDGE-WP-0001
|
||||
type: workplan
|
||||
title: "OpsBridge Initial Implementation"
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: completed
|
||||
owner: Bernd
|
||||
topic_slug: custodian
|
||||
state_hub_workstream_id: 79112cff-9c0a-42ad-aa3d-916013001aee
|
||||
created: "2026-03-11"
|
||||
updated: "2026-03-12"
|
||||
---
|
||||
|
||||
# BRIDGE-WP-0001 — OpsBridge Initial Implementation
|
||||
**Scope:** Full implementation of the `bridge` CLI tool as specified in the PRD and FRS.
|
||||
**Out of scope:** OpsCatalog integration (deferred to a future workplan).
|
||||
|
||||
---
|
||||
|
||||
## Goal
|
||||
|
||||
Deliver a working `bridge` CLI installable via `uv tool install` that manages named SSH reverse tunnels with auto-reconnect, optional HTTP health checks, actor attribution, and an operational audit log.
|
||||
|
||||
---
|
||||
|
||||
## Reference Documents
|
||||
|
||||
| Document | Location |
|---|---|
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
| CLAUDE.md | `CLAUDE.md` |
|
||||
|
||||
---
|
||||
|
||||
## Architecture Summary
|
||||
|
||||
```
|
||||
~/.config/bridge/tunnels.yaml   # static config: tunnels + actors
~/.local/state/bridge/          # runtime state
  <name>.pid                    # PID of tunnel subprocess manager
  <name>.log                    # reconnect + health event log
  <name>.state                  # current state string (for status cmd)

src/bridge/
  __init__.py
  cli.py        # Typer app, all commands
  config.py     # load + validate tunnels.yaml
  models.py     # dataclasses: TunnelConfig, BridgeState, ActorInfo
  manager.py    # TunnelManager: start/stop subprocess, reconnect loop
  health.py     # HTTP health check via httpx
  state.py      # read/write PID + state files
  audit.py      # structured event log writer
|
||||
```
|
||||
|
||||
**Bridge state machine:** `stopped → starting → connected → degraded → failed`
- `degraded` = SSH process alive but HTTP health check failing
- `reconnecting` = SSH process lost and the backoff loop is restarting it (see T08 and T11)
- `failed` = reconnect attempts exhausted (configurable max)
|
||||
|
||||
---
|
||||
|
||||
## Config Schema (`~/.config/bridge/tunnels.yaml`)
|
||||
|
||||
```yaml
|
||||
tunnels:
  state-hub-coulombcore:
    host: coulombcore.local
    remote_port: 18000
    local_port: 8000
    ssh_user: ubuntu
    ssh_key: ~/.ssh/id_ops
    actor: agent.claude-coulombcore
    health_check:
      url: http://127.0.0.1:18000/health   # checked from remote side
      interval_seconds: 30
      timeout_seconds: 5
    reconnect:
      max_attempts: 0   # 0 = infinite
      backoff_initial: 5
      backoff_max: 60

actors:
  agent.claude-coulombcore:
    class: automation
    description: Claude Code agent on CoulombCore
  operator.bernd:
    class: human
    description: Bernd Worsch
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Project Scaffolding
|
||||
|
||||
**Acceptance:** `bridge --help` lists all commands.
|
||||
|
||||
### T01 — Create pyproject.toml
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T01
|
||||
state_hub_task_id: 76c9ee58-10bf-4060-87bb-b73fa8cf25ea
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Set up `[project]`, `[project.scripts]` (entry point `bridge = bridge.cli:app`), and dependencies: `typer`, `pyyaml`, `httpx`. Run `uv lock`.
|
||||
|
||||
### T02 — Create package skeleton
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T02
|
||||
state_hub_task_id: b2be974c-6173-457d-9276-080ac551c105
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Create `src/bridge/__init__.py` and empty module stubs: `cli.py`, `config.py`, `models.py`, `manager.py`, `health.py`, `state.py`, `audit.py`.
|
||||
|
||||
### T03 — Verify uv tool install
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T03
|
||||
state_hub_task_id: 82f70483-91ae-4545-88af-44fe693ecb79
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Verify `uv tool install -e .` produces a working `bridge --help`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Config Loading (FR-2, FC-1)
|
||||
|
||||
**Acceptance:** `config.load()` returns typed config objects; clear error message on bad YAML.
|
||||
|
||||
### T04 — Define config dataclasses in models.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T04
|
||||
state_hub_task_id: 495e4257-40ad-4a1b-8a71-3a311476d41e
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Define `TunnelConfig`, `ReconnectPolicy`, `HealthCheckConfig`, `ActorInfo` as dataclasses.
|
||||
|
||||
### T05 — Implement config.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T05
|
||||
state_hub_task_id: b6782df4-e692-49e1-b3a3-d65d07826907
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Load `~/.config/bridge/tunnels.yaml`, validate required fields, raise clear errors. Support `BRIDGE_CONFIG` env var override for testing.
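
A minimal sketch of the two behaviours described here, path resolution with the `BRIDGE_CONFIG` override and required-field validation, is shown below. It operates on an already-parsed dict so that it stays free of third-party imports. The set of required fields, the `ConfigError` name, and the error wording are assumptions for illustration and may differ from the real `config.py`.

```python
import os
from pathlib import Path

# Assumption: these are the fields every tunnel entry must carry.
REQUIRED_TUNNEL_FIELDS = ("host", "remote_port", "local_port", "ssh_user", "ssh_key", "actor")


class ConfigError(ValueError):
    """Raised when tunnels.yaml is missing required data."""


def config_path() -> Path:
    """Resolve the config file, honouring the BRIDGE_CONFIG override."""
    override = os.environ.get("BRIDGE_CONFIG")
    if override:
        return Path(override)
    return Path.home() / ".config" / "bridge" / "tunnels.yaml"


def validate(raw: dict) -> dict:
    """Check every tunnel for required fields and return the tunnels mapping."""
    tunnels = raw.get("tunnels") or {}
    for name, tunnel in tunnels.items():
        missing = [field for field in REQUIRED_TUNNEL_FIELDS if field not in tunnel]
        if missing:
            raise ConfigError(
                f"tunnel '{name}': missing required field(s): {', '.join(missing)}"
            )
    return tunnels
```

Naming the tunnel and the missing fields in the message is one way to meet the "clear error message" acceptance criterion for this phase.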
|
||||
|
||||
### T06 — Unit tests for config loading
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T06
|
||||
state_hub_task_id: 341c866f-8f4b-4165-9fa5-f10fe37c9252
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test: valid config, missing required field, unknown tunnel name.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — State Management (FR-4, FR-7, FR-14)
|
||||
|
||||
**Acceptance:** State round-trips correctly; stale PIDs detected without error.
|
||||
|
||||
### T07 — Implement state.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T07
|
||||
state_hub_task_id: ae5e2566-a4b1-426f-9c32-4a2c025f2927
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Read/write PID file and state file under `~/.local/state/bridge/`. Check if PID is alive. Create state dir on first write.
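
The "check if PID is alive" step is commonly done on POSIX systems by sending signal 0, which performs the existence and permission checks without delivering a signal. The sketch below shows that technique; the name `pid_alive` is hypothetical and the real `state.py` may use a different approach. It is POSIX-only and should not be used on Windows.

```python
import os


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0: existence check, nothing is delivered
    except ProcessLookupError:
        return False  # no such process, so the PID file is stale
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


own_process_alive = pid_alive(os.getpid())
```

A stale PID file is then simply one whose recorded PID makes `pid_alive` return `False`, which lets `read_pid` return `None` instead of raising.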
|
||||
|
||||
### T08 — Define BridgeState enum
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T08
|
||||
state_hub_task_id: 456a3cb5-50fa-4fed-9283-57e2d1c6fbb9
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
States: `STOPPED`, `STARTING`, `CONNECTED`, `DEGRADED`, `RECONNECTING`, `FAILED`.
|
||||
|
||||
### T09 — Unit tests for state management
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T09
|
||||
state_hub_task_id: 0accc0b7-d013-43ad-a810-3269e64fb096
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test: write/read state round-trip, stale PID detection without error.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Tunnel Process Manager (FR-1, FR-3, FR-12, FR-13)
|
||||
|
||||
**Acceptance:** `bridge up <name>` starts tunnel; killing SSH process triggers reconnect; `bridge down <name>` stops cleanly.
|
||||
|
||||
### T10 — Implement TunnelManager — SSH subprocess wrapper
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T10
|
||||
state_hub_task_id: d0341e90-b48d-48ab-9e6d-82f4c365afec
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
SSH command: `ssh -N -R {remote_port}:127.0.0.1:{local_port} -i {key} -o ServerAliveInterval=10 -o ExitOnForwardFailure=yes {user}@{host}`. Manager runs as a daemonised child process; parent writes PID and exits.
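
Building that command as an argument list (rather than a shell string) avoids quoting problems when it is handed to `subprocess`. The sketch below mirrors the command line given in this task; the `Tunnel` dataclass is a cut-down stand-in for `TunnelConfig`, and `ssh_argv` is a hypothetical helper name.

```python
import os
from dataclasses import dataclass


@dataclass
class Tunnel:
    """Subset of the tunnel config needed to build the SSH command."""
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str


def ssh_argv(t: Tunnel) -> list[str]:
    """Return the argv for a reverse-forward SSH process, one token per element."""
    return [
        "ssh", "-N",
        "-R", f"{t.remote_port}:127.0.0.1:{t.local_port}",
        "-i", os.path.expanduser(t.ssh_key),  # expand a leading ~ in the key path
        "-o", "ServerAliveInterval=10",
        "-o", "ExitOnForwardFailure=yes",
        f"{t.ssh_user}@{t.host}",
    ]


argv = ssh_argv(Tunnel("coulombcore.local", 18000, 8000, "ubuntu", "/keys/id_ops"))
```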
|
||||
|
||||
### T11 — Implement reconnect backoff loop
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T11
|
||||
state_hub_task_id: f5c91eff-fca3-4f66-b073-276a733b5a27
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Exponential backoff between `backoff_initial` and `backoff_max`. Respect `max_attempts` (0 = infinite). On disconnect: state → `RECONNECTING`, log event, restart SSH.
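
One way to express this policy is a generator that yields the delay before each reconnect attempt. The sketch assumes a doubling factor, which the workplan does not specify, and uses the hypothetical name `backoff_delays`; the parameters map onto `backoff_initial`, `backoff_max`, and `max_attempts` from the config schema.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial: float, maximum: float, max_attempts: int = 0) -> Iterator[float]:
    """Yield the wait before each reconnect attempt; max_attempts=0 means unlimited."""
    attempt = 0
    delay = initial
    while max_attempts == 0 or attempt < max_attempts:
        yield min(delay, maximum)
        delay = min(delay * 2, maximum)  # double, capped at the configured maximum
        attempt += 1


# With the schema defaults (5 s initial, 60 s cap) the first six waits are:
first_six = list(islice(backoff_delays(5, 60), 6))  # [5, 10, 20, 40, 60, 60]
bounded = list(backoff_delays(5, 60, max_attempts=3))  # [5, 10, 20]
```

When the generator is exhausted, the manager would move the tunnel to `FAILED`; with `max_attempts: 0` it never is.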
|
||||
|
||||
### T12 — Implement graceful shutdown
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T12
|
||||
state_hub_task_id: 3f4df535-0d6a-49e8-9d3a-c3926d7f230c
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Catch SIGTERM/SIGINT, kill SSH subprocess, write `STOPPED` state.
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Health Monitoring (FR-15, FR-16, FR-17)
|
||||
|
||||
**Acceptance:** With a non-responsive health URL, `bridge status` shows `degraded`.
|
||||
|
||||
### T13 — Implement health.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T13
|
||||
state_hub_task_id: 5aaa0e35-f32a-4c68-8707-1a1e037b76f4
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Async HTTP GET via `httpx` to configured health URL. Run health check loop inside manager process. On failure: state → `DEGRADED`; on recovery: state → `CONNECTED`.
|
||||
|
||||
### T14 — Write health check result to state dir
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T14
|
||||
state_hub_task_id: 599d4e28-88c8-4c2a-80ac-ca57824af467
|
||||
status: done
|
||||
priority: low
|
||||
```
|
||||
|
||||
Persist timestamp, status, HTTP code or error for display in `bridge status`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 6 — Audit Logging (FR-24, FR-25, FR-26)
|
||||
|
||||
**Acceptance:** All lifecycle events appear in the log with actor attribution.
|
||||
|
||||
### T15 — Implement audit.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T15
|
||||
state_hub_task_id: 2f124b16-f1e7-4e9f-ad23-9f08543db3b7
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Append JSON-lines to `~/.local/state/bridge/<name>.log`. Events: `bridge_started`, `bridge_connected`, `bridge_disconnected`, `bridge_reconnecting`, `health_check_failed`, `health_check_recovered`, `bridge_stopped`. Each entry: `timestamp` (ISO-8601), `tunnel`, `actor`, `actor_class`, `event`, `detail`.
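
A JSON-lines writer with these fields can be very small. The sketch below is an illustration, not the contents of `audit.py`: the function name `append_event` and its signature are assumptions, while the field names and the `<name>.log` file layout follow the description above.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def append_event(log_dir: Path, tunnel: str, actor: str, actor_class: str,
                 event: str, detail: str = "") -> dict:
    """Append one audit entry as a single JSON line to <log_dir>/<tunnel>.log."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO-8601, UTC
        "tunnel": tunnel,
        "actor": actor,
        "actor_class": actor_class,
        "event": event,
        "detail": detail,
    }
    log_dir.mkdir(parents=True, exist_ok=True)
    with (log_dir / f"{tunnel}.log").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry


with tempfile.TemporaryDirectory() as tmp:
    log_dir = Path(tmp)
    append_event(log_dir, "state-hub-coulombcore", "operator.bernd", "human", "bridge_started")
    append_event(log_dir, "state-hub-coulombcore", "operator.bernd", "human",
                 "bridge_connected", detail="remote_port=18000")
    lines = (log_dir / "state-hub-coulombcore.log").read_text().splitlines()
```

One entry per line keeps the log appendable and lets `bridge logs` tail it without parsing the whole file.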
|
||||
|
||||
---
|
||||
|
||||
## Phase 7 — CLI Commands (FR-1, FR-5, FR-8, FR-10, FR-11)
|
||||
|
||||
**Acceptance:** All commands work end-to-end; `--help` on each command shows correct usage.
|
||||
|
||||
Status table columns: `TUNNEL`, `STATE`, `ACTOR`, `HOST`, `UPTIME`, `HEALTH`. Exit codes: 0 = success, 1 = tunnel not found / config error, 2 = tunnel already in requested state. `--json` flag on `status` for automation.
|
||||
|
||||
### T16 — CLI: bridge up
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T16
|
||||
state_hub_task_id: 2c22b8fe-8a35-4887-89b2-f8fb7f43e0b6
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Start named tunnel or all tunnels if name omitted.
|
||||
|
||||
### T17 — CLI: bridge down
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T17
|
||||
state_hub_task_id: 768e1a8b-fdf7-4718-b00e-bc2401f57657
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Stop named tunnel or all tunnels if name omitted.
|
||||
|
||||
### T18 — CLI: bridge restart
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T18
|
||||
state_hub_task_id: 8fd6486d-af4f-4295-a57a-a5fabbf25681
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Down then up for named tunnel or all.
|
||||
|
||||
### T19 — CLI: bridge status
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T19
|
||||
state_hub_task_id: 28f3f392-9e94-43e7-811a-fa036f588e10
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Table output with `--json` flag for automation.
|
||||
|
||||
### T20 — CLI: bridge logs
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T20
|
||||
state_hub_task_id: 43582657-b1b9-4113-88e1-2109b30f3732
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Tail log file. Defaults to last 50 lines. `--follow` for live tail. `--lines N` to override.
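
The default (non-follow) behaviour can be sketched with a bounded deque, which keeps only the last `lines` entries while streaming through the file. The helper name `tail_lines` is hypothetical, and `--follow` is not covered here.

```python
from collections import deque
from pathlib import Path


def tail_lines(log_path: Path, lines: int = 50) -> list[str]:
    """Return the last `lines` lines of the log, oldest first."""
    with log_path.open(encoding="utf-8") as fh:
        # deque(maxlen=N) discards older lines as newer ones arrive
        return [line.rstrip("\n") for line in deque(fh, maxlen=lines)]
```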
|
||||
|
||||
---
|
||||
|
||||
## Phase 8 — Integration Tests
|
||||
|
||||
**Acceptance:** `uv run pytest` passes cleanly.
|
||||
|
||||
### T21 — Integration test: up/status/down cycle
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T21
|
||||
state_hub_task_id: 5e3c7ac6-03fd-45e9-af64-11bde1d03ab8
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test fixture with minimal `tunnels.yaml` pointing to localhost. Test full `up → status → down` cycle against loopback SSH target or mocked subprocess.
|
||||
|
||||
### T22 — Integration test: reconnect behaviour
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T22
|
||||
state_hub_task_id: 8b6ac68e-d0ab-4826-8df5-ebdf30a1e23e
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test reconnect loop with a subprocess that exits immediately.
|
||||
|
||||
### T23 — Integration test: health check degraded path
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0001-T23
|
||||
state_hub_task_id: c472bb1a-2fe2-4a88-aa6b-e18f732a3fde
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test degraded state with a mock HTTP server that returns failures.
|
||||
|
||||
---
|
||||
|
||||
## FRS Traceability
|
||||
|
||||
| FRS Requirement Group | Phase |
|---|---|
| FR-1 to FR-4 — Bridge creation | 4 |
| FR-5 to FR-7 — Bridge termination | 4 |
| FR-8 to FR-9 — Bridge restart | 7 |
| FR-10 to FR-11 — Status inspection | 7 |
| FR-12 to FR-14 — Lifecycle monitoring | 4 |
| FR-15 to FR-17 — Health monitoring | 5 |
| FR-18 to FR-20 — Actor attribution | 2, 6 |
| FR-24 to FR-26 — Audit logging | 6 |
| FC-1 — Config dependency | 2 |
| FC-2 — External connectivity | 4 |
|
||||
|
||||
*FR-21 to FR-23 (target discovery) and FR-27 to FR-29 (identity integration) are deferred — they depend on OpsCatalog and an identity provider respectively.*
|
||||
|
||||
---
|
||||
|
||||
## Deferred
|
||||
|
||||
- **FR-21–FR-23** — Infrastructure target discovery (`bridge targets`) — requires OpsCatalog
|
||||
- **FR-27–FR-29** — Identity provider integration (privacyIDEA / SSH CA) — requires external identity infrastructure
|
||||
- **OpsCatalog** — Separate workplan (`BRIDGE-WP-0002`)
|
||||
404
workplans/BRIDGE-WP-0002-opscatalog-extension.md
Normal file
@@ -0,0 +1,404 @@
|
||||
---
id: BRIDGE-WP-0002
type: workplan
title: "OpsCatalog Extension"
domain: infotech
repo: ops-bridge
status: completed
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: f38bfcdb-f115-4431-88b5-ce906a24199c
created: "2026-03-11"
updated: "2026-03-12"
---
|
||||
|
||||
# BRIDGE-WP-0002 — OpsCatalog Extension
|
||||
|
||||
**Scope:** Implement OpsCatalog as a Git-backed YAML knowledge repository and
|
||||
integrate it with the `bridge` CLI.
|
||||
**Depends on:** BRIDGE-WP-0001 complete (bridge CLI operational).
|
||||
**Out of scope:** Identity provider integration (FR-27–29, deferred indefinitely).
|
||||
|
||||
---
|
||||
|
||||
## Goal
|
||||
|
||||
Deliver the OpsCatalog subsystem: a structured YAML catalog of operations
|
||||
domains, targets, bridges, and actor classes stored in a Git repository.
|
||||
OpsBridge loads the catalog at runtime to resolve bridge identifiers, orient
|
||||
operators, and expose the `bridge targets` and `bridge catalog` commands.
|
||||
|
||||
---
|
||||
|
||||
## Reference Documents
|
||||
|
||||
| Document | Location |
|---|---|
| OpsCatalog Spec (PRD + FRS + Schemas) | `wiki/OpsCatalogSpecification.md` |
| OpsBridge FRS (deferred FRs) | `wiki/OpsBridgeFrs.md` §5.8, §5.10 |
| CLAUDE.md | `CLAUDE.md` |
|
||||
|
||||
---
|
||||
|
||||
## Architecture Summary
|
||||
|
||||
```
~/.config/bridge/tunnels.yaml
  catalog_path: ~/ops-catalog     # path to the OpsCatalog Git repo

ops-catalog/                      # separate Git repo, consumed by bridge
  domains/
    <domain>/
      domain.yaml                 # type: domain
      targets/
        <target>.yaml             # type: target
      bridges/
        <bridge>.yaml             # type: bridge
      docs/
        *.md                      # operations notes
  actors/
    <actor>.yaml                  # type: actor
  schemas/
    domain.schema.yaml
    target.schema.yaml
    bridge.schema.yaml
    actor.schema.yaml

src/bridge/
  catalog/
    __init__.py
    loader.py                     # walk catalog_path, parse YAML files into typed objects
    models.py                     # CatalogDomain, CatalogTarget, CatalogBridge, ActorClass
    validator.py                  # validate catalog entries against schemas
    resolver.py                   # resolve tunnel name → CatalogBridge → TunnelConfig
```
|
||||
|
||||
**Integration points with existing bridge code:**
|
||||
- `config.py`: read `catalog_path` from `tunnels.yaml`; pass to catalog loader
|
||||
- `manager.py`: use `resolver.py` to look up bridge config from catalog when
|
||||
tunnel is not defined inline in `tunnels.yaml`
|
||||
- `cli.py`: add `bridge targets` and `bridge catalog` commands
|
||||
|
||||
---
|
||||
|
||||
## YAML Schemas
|
||||
|
||||
### domain.yaml
|
||||
```yaml
type: domain
id: coulombcore
name: CoulombCore Infrastructure
description: Core infrastructure domain for operational services
environment: production
```
|
||||
|
||||
### target.yaml
|
||||
```yaml
type: target
id: state-hub
domain: coulombcore
kind: service
description: Infrastructure state coordination service
reachable_via:
  - state-hub-coulombcore
```
|
||||
|
||||
### bridge.yaml
|
||||
```yaml
type: bridge
id: state-hub-coulombcore
domain: coulombcore
target: state-hub
description: Operations bridge for state hub diagnostics
access_method: ssh-reverse
host: coulombcore.local
remote_port: 18000
local_port: 8000
ssh_user: ubuntu
ssh_key: ~/.ssh/id_ops
actor: agent.claude-coulombcore
health_check:
  url: http://127.0.0.1:18000/health
  interval_seconds: 30
  timeout_seconds: 5
reconnect:
  max_attempts: 0
  backoff_initial: 5
  backoff_max: 60
```
|
||||
|
||||
### actor.yaml
|
||||
```yaml
type: actor
id: agent.claude-remediator
class: automation
description: Automated remediation agent
```
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Catalog Data Models
|
||||
|
||||
**Acceptance:** All catalog YAML types parse into typed Python objects.
|
||||
|
||||
### T01 — Define catalog dataclasses in catalog/models.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T01
|
||||
state_hub_task_id: 21b90574-a27c-467c-8e9d-d4029a659171
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Define `CatalogDomain`, `CatalogTarget`, `CatalogBridge`, `ActorClass` dataclasses.
|
||||
`CatalogBridge` must be mergeable with `TunnelConfig` (catalog supplies defaults;
|
||||
inline `tunnels.yaml` entries can override).
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — Catalog Loader (FR-14)
|
||||
|
||||
**Acceptance:** `catalog.load(path)` returns a populated `Catalog` object from a
|
||||
directory tree; unknown `type:` values are skipped with a warning.
|
||||
|
||||
### T02 — Implement catalog/loader.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T02
|
||||
state_hub_task_id: 782b5b4d-1f3f-4e5d-ad46-dc57b345bda3
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Walk `catalog_path` recursively, parse every `*.yaml` file, dispatch on `type:`
|
||||
field. Build in-memory index: domains, targets, bridges, actors.
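
The dispatch step can be sketched without committing to a YAML library. This example assumes each `*.yaml` file has already been parsed into a plain dict; `build_index`, the logger name, and the dict-of-dicts index shape are hypothetical stand-ins for the typed `Catalog` object.

```python
import logging
from typing import Iterable

log = logging.getLogger("bridge.catalog")

KNOWN_TYPES = ("domain", "target", "bridge", "actor")


def build_index(documents: Iterable[dict]) -> dict[str, dict[str, dict]]:
    """Group parsed catalog documents by `type:`, keyed by their `id`."""
    index: dict[str, dict[str, dict]] = {kind: {} for kind in KNOWN_TYPES}
    for doc in documents:
        kind = doc.get("type")
        if kind not in index:
            # Unknown types are skipped with a warning, not treated as errors.
            log.warning("skipping catalog entry with unknown type %r", kind)
            continue
        if "id" not in doc:
            raise ValueError(f"{kind} entry is missing required field 'id'")
        index[kind][doc["id"]] = doc
    return index
```

An empty input yields an index with four empty sections, which matches the "empty catalog returns empty index" case listed in the loader tests.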
|
||||
|
||||
### T03 — Unit tests for catalog loader
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T03
|
||||
state_hub_task_id: 41fed4f8-7818-4ca1-bb48-6ac1089220e8
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test: full catalog directory fixture loads correctly; missing required field raises
|
||||
clear error; unknown type is skipped; empty catalog returns empty index.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Catalog Validation (FR-15)
|
||||
|
||||
**Acceptance:** `bridge catalog validate` exits non-zero and prints all violations
|
||||
when the catalog contains invalid entries.
|
||||
|
||||
### T04 — Implement catalog/validator.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T04
|
||||
state_hub_task_id: 32946d15-5516-4599-8f27-8c653dec6786
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Validate required fields per type. Cross-reference checks: target's `domain` must
|
||||
exist; target's `reachable_via` bridge IDs must exist; bridge's `target` and
|
||||
`domain` must exist; actor referenced by bridge must exist.
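
The cross-reference checks listed above might look like the following sketch. It assumes the catalog is available as a dict of dicts keyed by type and ID; that shape and the name `validate_references` are illustrative assumptions, not the actual `validator.py` interface.

```python
def validate_references(index: dict[str, dict[str, dict]]) -> list[str]:
    """Return one message per dangling reference; an empty list means valid."""
    domains = index.get("domain", {})
    targets = index.get("target", {})
    bridges = index.get("bridge", {})
    actors = index.get("actor", {})
    violations: list[str] = []

    for tid, target in targets.items():
        if target.get("domain") not in domains:
            violations.append(f"target {tid}: unknown domain {target.get('domain')!r}")
        for bid in target.get("reachable_via", []):
            if bid not in bridges:
                violations.append(f"target {tid}: unknown bridge {bid!r} in reachable_via")

    for bid, bridge in bridges.items():
        if bridge.get("domain") not in domains:
            violations.append(f"bridge {bid}: unknown domain {bridge.get('domain')!r}")
        if bridge.get("target") not in targets:
            violations.append(f"bridge {bid}: unknown target {bridge.get('target')!r}")
        actor = bridge.get("actor")
        if actor is not None and actor not in actors:
            violations.append(f"bridge {bid}: unknown actor {actor!r}")

    return violations
```

Collecting every violation before returning, rather than stopping at the first, lets a `validate` command print the full list in one run.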
|
||||
|
||||
### T05 — Unit tests for catalog validation
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T05
|
||||
state_hub_task_id: 6061a6eb-9966-4be9-aa5e-ea7edf7fd085
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test: valid catalog passes; dangling `reachable_via` reference fails; missing
|
||||
required field fails.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Bridge Resolver (FR-2 integration)
|
||||
|
||||
**Acceptance:** `bridge up state-hub-coulombcore` resolves the bridge config from
|
||||
the catalog when no inline entry exists in `tunnels.yaml`.
|
||||
|
||||
### T06 — Implement catalog/resolver.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T06
|
||||
state_hub_task_id: a92d97c8-4eec-4dd5-9b90-d9c1cba813ac
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
`resolve(name, catalog, inline_config) → TunnelConfig`. Lookup order: inline
|
||||
`tunnels.yaml` entry wins; fall back to catalog bridge by ID. Merge catalog
|
||||
bridge fields into `TunnelConfig`. Raise `BridgeNotFound` if neither source
|
||||
has the name.
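
A minimal sketch of the lookup and merge order, using plain dicts in place of `CatalogBridge` and `TunnelConfig`. The dict-based signature is an assumption made to keep the example self-contained; only the name `BridgeNotFound` and the precedence rule come from the task text.

```python
class BridgeNotFound(KeyError):
    """Raised when neither tunnels.yaml nor the catalog defines the name."""


def resolve(name: str, catalog_bridges: dict[str, dict],
            inline_config: dict[str, dict]) -> dict:
    """Merge catalog defaults with inline overrides for one tunnel."""
    inline = inline_config.get(name)
    from_catalog = catalog_bridges.get(name)
    if inline is None and from_catalog is None:
        raise BridgeNotFound(name)
    merged = dict(from_catalog or {})   # catalog supplies the defaults
    merged.update(inline or {})         # inline tunnels.yaml fields win
    return merged
```

Because the result is an ordinary config value, the manager can consume it without knowing whether a catalog was involved.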
|
||||
|
||||
### T07 — Integrate resolver into config.py and manager.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T07
|
||||
state_hub_task_id: 23799377-64f2-4c13-aa72-364770d80f91
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Read `catalog_path` from `tunnels.yaml` (optional; catalog disabled if absent).
|
||||
Pass resolved `TunnelConfig` to `TunnelManager` unchanged — manager stays
|
||||
catalog-unaware.
|
||||
|
||||
### T08 — Unit tests for resolver
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T08
|
||||
state_hub_task_id: d2313182-975f-409f-9d4f-ebabf66b44df
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test: inline entry takes precedence; catalog fallback works; inline overrides
|
||||
catalog fields; missing name raises `BridgeNotFound`.
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — CLI: bridge targets (FR-21, FR-22, FR-23)
|
||||
|
||||
**Acceptance:** `bridge targets` prints a table of domains, targets, and which
|
||||
bridges provide access to each target.
|
||||
|
||||
### T09 — CLI: bridge targets command
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T09
|
||||
state_hub_task_id: f9e508db-a19f-42be-9437-b4bdeb00a534
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Table columns: `DOMAIN`, `TARGET`, `KIND`, `BRIDGES`. `--domain <name>` filter.
|
||||
`--json` flag for automation. Requires catalog to be configured; clear error if
|
||||
`catalog_path` not set.
|
||||
|
||||
### T10 — CLI: bridge targets show <target>
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T10
|
||||
state_hub_task_id: e288a1d3-d676-404a-a3eb-25dbb241502d
|
||||
status: done
|
||||
priority: low
|
||||
```
|
||||
|
||||
Show full metadata for a single target: domain, kind, description, reachable_via
|
||||
bridges, and any operations notes from `docs/*.md` files in the domain directory.
|
||||
|
||||
---
|
||||
|
||||
## Phase 6 — CLI: bridge catalog commands
|
||||
|
||||
**Acceptance:** Operators can inspect and validate the catalog from the CLI.
|
||||
|
||||
### T11 — CLI: bridge catalog list
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T11
|
||||
state_hub_task_id: 73899b70-b0ac-4f48-b362-cc2455a66f41
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
List all domains and a count of targets and bridges per domain.
|
||||
|
||||
### T12 — CLI: bridge catalog validate
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T12
|
||||
state_hub_task_id: e091daa2-7c20-4169-b634-1fcc469513ea
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Run `validator.py` and print all violations. Exit 0 if clean, 1 if violations
|
||||
found. Useful in CI pipelines for the catalog repo.
|
||||
|
||||
### T13 — CLI: bridge catalog show <bridge-id>
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T13
|
||||
state_hub_task_id: 9f5f4f30-bfe6-40fd-b178-2fbb396816ee
|
||||
status: done
|
||||
priority: low
|
||||
```
|
||||
|
||||
Print full resolved bridge metadata including target and domain context.
|
||||
|
||||
---
|
||||
|
||||
## Phase 7 — Integration Tests
|
||||
|
||||
**Acceptance:** `uv run pytest` passes cleanly with catalog fixtures.
|
||||
|
||||
### T14 — Integration test: catalog load and resolve
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T14
|
||||
state_hub_task_id: 5ccb2b4b-7ea5-4c38-8246-d59b8f7d4419
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Fixture: minimal catalog directory with one domain, one target, one bridge.
|
||||
Test `bridge up <catalog-bridge-name>` resolves and starts tunnel.
|
||||
|
||||
### T15 — Integration test: bridge targets output
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T15
|
||||
state_hub_task_id: 72c9f686-c474-46c4-a759-bfd47e2d4211
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test `bridge targets` output matches catalog fixture. Test `--json` flag.
|
||||
|
||||
### T16 — Integration test: bridge catalog validate
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0002-T16
|
||||
state_hub_task_id: 83c0734e-0dc2-49ce-8b6a-a4d5e26ff33a
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Test clean catalog exits 0; catalog with a dangling reference exits 1 with a
|
||||
clear message.
|
||||
|
||||
---
|
||||
|
||||
## FRS Traceability
|
||||
|
||||
| FRS Requirement Group | Phase |
|---|---|
| FR-14 — Catalog retrieval | 2 |
| FR-15 — Catalog validation | 3 |
| FR-1 to FR-3 — Domain management | 2, 5 |
| FR-4 to FR-6 — Target management | 2, 5 |
| FR-7 to FR-9 — Bridge definition | 2, 4 |
| FR-10 to FR-11 — Actor classification | 2 |
| FR-12 to FR-13 — Operational annotations | 5 (docs/*.md) |
| FR-21 to FR-23 — Infrastructure target discovery (OpsBridge FRS) | 5 |
|
||||
|
||||
*FR-27–29 (identity integration) remain deferred — require external identity
|
||||
provider infrastructure.*
|
||||
|
||||
---
|
||||
|
||||
## Deferred
|
||||
|
||||
- **FR-27–29** — Identity provider integration (privacyIDEA / SSH CA) — separate
|
||||
workplan when identity infrastructure is available.
|
||||
- **Operations notes search** — full-text search across `docs/*.md` files — nice
|
||||
to have, not required for MVP.
|
||||
526
workplans/BRIDGE-WP-0003-mcp-skill-cross-mode-tests.md
Normal file
@@ -0,0 +1,526 @@
|
||||
---
id: BRIDGE-WP-0003
type: workplan
title: "OpsBridge MCP Server, Skill, and Cross-Mode Test Coverage"
domain: infotech
repo: ops-bridge
status: done
owner: Bernd
topic_slug: custodian
state_hub_workstream_id: 97009d3f-fd92-4fd9-a308-6c2445b4d623
created: "2026-03-12"
updated: "2026-03-12"
---
|
||||
|
||||
# BRIDGE-WP-0003 — OpsBridge MCP Server, Skill, and Cross-Mode Test Coverage
|
||||
|
||||
**Scope:** Expose OpsBridge and OpsCatalog functionality as a FastMCP server
|
||||
and a Claude Code skill. Introduce a capability registry and cross-access-mode
|
||||
test suite that enforces test coverage parity across CLI, MCP, and skill for
|
||||
every operation — including a meta-test that validates the test suite itself is
|
||||
complete.
|
||||
|
||||
**Depends on:** BRIDGE-WP-0001 and BRIDGE-WP-0002 complete.
|
||||
**Out of scope:** Identity provider integration (FR-27–29, deferred indefinitely).
|
||||
|
||||
---
|
||||
|
||||
## Goal
|
||||
|
||||
After this workplan:
|
||||
|
||||
1. Any Claude Code agent can call `bridge_up()`, `bridge_status()`,
|
||||
`catalog_list_targets()` etc. as first-class MCP tools — no Bash
|
||||
required, structured JSON in/out.
|
||||
2. Human operators can invoke `/bridge-status` as a skill to get an
|
||||
immediate, natural-language summary of tunnel health.
|
||||
3. Adding any new capability (CLI command, MCP tool) without writing tests
|
||||
for all required access modes causes `uv run pytest` to fail with a
|
||||
clear capability × mode gap report.
|
||||
4. The gap-detection mechanism is itself tested: a synthetic missing-mode
|
||||
fixture asserts the meta-test catches it.
|
||||
|
||||
---
|
||||
|
||||
## Reference Documents
|
||||
|
||||
| Document | Location |
|---|---|
| Architecture note | `CLAUDE.md` — Architecture section |
| OpsBridge FRS | `wiki/OpsBridgeFrs.md` |
| State Hub MCP server (reference impl) | `~/the-custodian/state-hub/mcp_server/server.py` |
|
||||
|
||||
---
|
||||
|
||||
## Architecture Summary
|
||||
|
||||
```
src/bridge/
  capabilities.py                  # canonical capability registry
  mcp_server/
    __init__.py
    server.py                      # FastMCP app, stdio entry point

.mcp.json                          # project-scope MCP registration
scripts/
  register_mcp.py                  # user-scope registration helper

~/.claude/plugins/
  ops-bridge/
    bridge-status.md               # /bridge-status skill

tests/
  conftest.py                      # capability + access_mode marks, collector helper
  test_cli.py                      # existing — annotated with marks (T09)
  test_mcp.py                      # new — FastMCP in-process client tests
  test_skill.py                    # new — static skill coverage lint
  test_coverage_completeness.py    # new — cross-mode meta-test
```
|
||||
|
||||
### Capability Registry
|
||||
|
||||
```python
# src/bridge/capabilities.py
from dataclasses import dataclass

ACCESS_MODES = {"cli", "mcp", "skill"}

@dataclass
class Capability:
    name: str
    description: str
    required_access_modes: frozenset[str]

CAPABILITIES: list[Capability] = [
    Capability("bridge_up", "Start one or all tunnels", frozenset({"cli", "mcp"})),
    Capability("bridge_down", "Stop one or all tunnels", frozenset({"cli", "mcp"})),
    Capability("bridge_restart", "Restart one or all tunnels", frozenset({"cli", "mcp"})),
    Capability("bridge_status", "Show tunnel status", frozenset({"cli", "mcp", "skill"})),
    Capability("bridge_logs", "Tail tunnel audit log", frozenset({"cli", "mcp"})),
    Capability("catalog_list_targets", "List catalog targets", frozenset({"cli", "mcp"})),
    Capability("catalog_show_target", "Show target metadata", frozenset({"cli", "mcp"})),
    Capability("catalog_list_domains", "List catalog domains", frozenset({"cli", "mcp"})),
    Capability("catalog_validate", "Validate catalog consistency", frozenset({"cli", "mcp"})),
    Capability("catalog_show_bridge", "Show bridge metadata", frozenset({"cli", "mcp"})),
]
```
|
||||
|
||||
### Cross-Mode Test Marks
|
||||
|
||||
Every test that exercises a capability against an access mode carries two marks:
|
||||
|
||||
```python
@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("cli")
def test_bridge_up_cli(runner, config_file):
    result = runner.invoke(app, ["up", "my-tunnel"])
    assert result.exit_code == 0

@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("mcp")
async def test_bridge_up_mcp(mcp_client):
    result = await mcp_client.call_tool("bridge_up", {"tunnel": "my-tunnel"})
    assert result["started"] == ["my-tunnel"]
```
|
||||
|
||||
### Meta-Test Mechanism
|
||||
|
||||
`test_coverage_completeness.py` uses a pytest plugin hook to collect all
|
||||
test items, read their marks, and assert the coverage matrix is complete:
|
||||
|
||||
```
capability              cli   mcp   skill
bridge_up               ✓     ✓     —      (not required for skill)
bridge_status           ✓     ✓     ✓
catalog_list_targets    ✓     ✓     —
...
```
|
||||
|
||||
Fails with a table of gaps. The meta-test is itself validated by a fixture
|
||||
that injects a synthetic `Capability("_test_sentinel", "Synthetic sentinel", frozenset({"cli", "mcp"}))`,
|
||||
deliberately omits the `mcp` test, and asserts the checker raises.
|
||||
|
||||
---
|
||||
|
||||
## Phase 1 — Capability Registry
|
||||
|
||||
**Acceptance:** `from bridge.capabilities import CAPABILITIES` works; every
|
||||
existing CLI command and the planned MCP tool set appears in the registry.
|
||||
|
||||
### T01 — Define capability registry module (src/bridge/capabilities.py)
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T01
|
||||
state_hub_task_id: 1397a838-b225-4452-ad53-29ad65388060
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
`Capability` dataclass with `name`, `description`, `required_access_modes`.
|
||||
List all 10 capabilities as shown in the architecture above. No external
|
||||
dependencies — pure stdlib.
|
||||
|
||||
### T02 — Meta-test: registry completeness against CLI commands and MCP tools
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T02
|
||||
state_hub_task_id: 97467243-9237-4e63-a860-cc49587546ad
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Introspect `app.registered_commands` (Typer) and `mcp.list_tools()` (FastMCP).
|
||||
Assert every name appears in `{c.name for c in CAPABILITIES}`. Fails fast if
|
||||
a developer adds a CLI command or MCP tool without updating the registry.
|
||||
|
||||
---
|
||||
|
||||
## Phase 2 — MCP Server
|
||||
|
||||
**Acceptance:** `uv run python src/bridge/mcp_server/server.py` starts without
|
||||
error; `bridge_status()` returns a list of tunnel dicts; `bridge_up("x")`
|
||||
returns `{"started": ["x"]}` or `{"already_running": ["x"]}`.
|
||||
|
||||
### T03 — Add fastmcp dependency and mcp_server package skeleton
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T03
|
||||
state_hub_task_id: f2fd64f5-31c6-493b-b48b-d13980467cca
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Add `fastmcp>=2.0.0` to `[project.dependencies]` in `pyproject.toml`. Create
|
||||
`src/bridge/mcp_server/__init__.py` (empty) and `server.py` with:
|
||||
|
||||
```python
from fastmcp import FastMCP

mcp = FastMCP(name="ops-bridge", instructions="...")

if __name__ == "__main__":
    mcp.run(transport="stdio")
```
|
||||
|
||||
### T04 — Implement bridge lifecycle MCP tools (up, down, restart, status, logs)
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T04
|
||||
state_hub_task_id: 1bfc9b36-2be3-4606-a6e9-d611d1ac33ab
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
`@mcp.tool()` wrappers that import and call the Python library directly (no
|
||||
subprocess). Signatures:
|
||||
|
||||
```python
# Signatures only; bodies are elided.
def bridge_up(tunnel: str | None = None) -> dict: ...
def bridge_down(tunnel: str | None = None) -> dict: ...
def bridge_restart(tunnel: str | None = None) -> dict: ...
def bridge_status() -> list[dict]: ...
def bridge_logs(tunnel: str, lines: int = 50) -> list[dict]: ...
```
|
||||
|
||||
All return JSON-serialisable dicts/lists. `tunnel=None` means all tunnels.
|
||||
|
||||
### T05 — Implement catalog MCP tools
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T05
|
||||
state_hub_task_id: ef7fa23c-d2e1-4fe0-9e26-994c1a6ce1fb
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
```python
# Signatures only; bodies are elided.
def catalog_list_targets(domain: str | None = None) -> list[dict]: ...
def catalog_show_target(target_id: str) -> dict | None: ...
def catalog_list_domains() -> list[dict]: ...
def catalog_validate() -> dict: ...  # {"valid": bool, "errors": list[str]}
def catalog_show_bridge(bridge_id: str) -> dict | None: ...
```
|
||||
|
||||
When `catalog_path` is not configured in `tunnels.yaml`, return
|
||||
`{"error": "catalog_path not configured"}` rather than raising.
|
||||
|
||||
### T06 — Implement bridge:// and catalog:// MCP resources
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T06
|
||||
state_hub_task_id: 71c9ee45-6928-416c-b4f3-dfb785a0ec8f
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
```python
@mcp.resource("bridge://status")
def resource_bridge_status() -> str:
    """Live snapshot of all tunnel states."""

@mcp.resource("catalog://domains")
def resource_catalog_domains() -> str: ...

@mcp.resource("catalog://targets")
def resource_catalog_targets() -> str: ...
```
|
||||
|
||||
Resources are for cheap orientation reads; tools are for actions and
|
||||
parameterised queries. Both are needed.
|
||||
|
||||
### T07 — Add .mcp.json project-scope registration config
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T07
|
||||
state_hub_task_id: 618c011d-bd1b-4c8f-8750-f3d2f9fcaf88
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
```json
{
  "mcpServers": {
    "ops-bridge": {
      "type": "stdio",
      "command": "uv",
      "args": ["run", "python", "src/bridge/mcp_server/server.py"],
      "cwd": "/home/worsch/ops-bridge"
    }
  }
}
```
|
||||
|
||||
Project-scope: Claude Code sessions inside `ops-bridge/` get the tools
|
||||
automatically. See T14 for user-scope (machine-global) registration.
|
||||
|
||||
---
|
||||
|
||||
## Phase 3 — Skill
|
||||
|
||||
**Acceptance:** `/bridge-status` invoked in Claude Code runs the skill,
|
||||
calls `bridge_status` MCP tool, and returns a natural-language health summary.
|
||||
|
||||
### T08 — Implement /bridge-status skill for human operators
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T08
|
||||
state_hub_task_id: 2c070f34-12b5-4dd9-ab24-bb7b6836773c
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Skill file at `~/.claude/plugins/ops-bridge/bridge-status.md`. Prompt instructs
|
||||
Claude to:
|
||||
1. Call `bridge_status` MCP tool
|
||||
2. Report each tunnel: name, state (with colour hint), host, uptime
|
||||
3. Flag any `degraded` or `failed` tunnels and suggest `bridge restart <name>`
|
||||
4. If catalog is configured, offer `catalog_list_targets` for discovery context
|
||||
|
||||
Skill prompt **must** reference the canonical capability names (`bridge_status`,
|
||||
`catalog_list_targets`) so `test_skill.py` can assert coverage statically.
|
||||
|
||||
---
|
||||
|
||||
## Phase 4 — Cross-Access-Mode Test Suite
|
||||
|
||||
**Acceptance:** `uv run pytest` fails if any capability is missing a test for
|
||||
any of its required access modes. The failure message is a capability × mode
|
||||
gap matrix. The meta-test is itself verified by a synthetic failing fixture.
|
||||
|
||||
### T09 — CLI test layer: annotate existing tests with capability/access_mode marks
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T09
|
||||
state_hub_task_id: a8f3f5fb-fcd6-47e9-aad5-85dc803f796d
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Retrofit `tests/test_cli.py` (and other CLI test files) with:
|
||||
|
||||
```python
@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("cli")
def test_bridge_up_starts_tunnel(...): ...
```
|
||||
|
||||
Every capability whose `required_access_modes` includes `"cli"` must have at
|
||||
least one marked test in the CLI layer.
|
||||
|
||||
### T10 — MCP test layer: tests/test_mcp.py with FastMCP in-process test client
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T10
|
||||
state_hub_task_id: acb7ada6-111d-4b8d-b201-45748c394c43
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
Use FastMCP's `Client(mcp_app)` context manager (in-process, no network):
|
||||
|
||||
```python
@pytest.mark.capability("bridge_up")
@pytest.mark.access_mode("mcp")
async def test_bridge_up_mcp(mcp_client, mock_tunnel_manager):
    result = await mcp_client.call_tool("bridge_up", {"tunnel": "t1"})
    assert result["started"] == ["t1"]
```
|
||||
|
||||
Cover: correct return schema, missing tunnel name handled, catalog tools
|
||||
graceful when `catalog_path` unset, resource URIs return valid JSON.
|
||||
|
||||
### T11 — Skill test layer: tests/test_skill.py — static skill coverage lint
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T11
|
||||
state_hub_task_id: 071adfa4-2ccb-466b-b298-35130876267f
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Parse the skill markdown file. Assert:
|
||||
- File is syntactically valid (frontmatter parseable)
|
||||
- Each capability with `"skill"` in `required_access_modes` has its `name`
|
||||
appearing in the skill body text
|
||||
|
||||
This is a static lint, not an LLM invocation — fast and deterministic.
|
||||
|
||||
```python
from pathlib import Path

import pytest

from bridge.capabilities import CAPABILITIES

# Path does not expand "~" on its own, so expanduser() is required.
SKILL_PATH = Path("~/.claude/plugins/ops-bridge/bridge-status.md").expanduser()

@pytest.mark.access_mode("skill")
def test_skill_covers_required_capabilities():
    skill_text = SKILL_PATH.read_text()
    for cap in CAPABILITIES:
        if "skill" in cap.required_access_modes:
            assert cap.name in skill_text, f"Skill missing capability: {cap.name}"
```
|
||||
|
||||
### T12 — Cross-mode completeness meta-test: tests/test_coverage_completeness.py
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T12
|
||||
state_hub_task_id: f1277a48-1790-42bd-8c70-8ba10c68312b
|
||||
status: done
|
||||
priority: critical
|
||||
```
|
||||
|
||||
The centrepiece. Uses a pytest collection hook in `conftest.py` (for example
`pytest_collection_modifyitems`) to collect all test items, read their marks,
build the coverage matrix, and assert completeness:
|
||||
|
||||
```python
def test_all_capabilities_have_all_required_mode_tests(pytestconfig):
    covered = collect_capability_coverage(pytestconfig)
    gaps = []
    for cap in CAPABILITIES:
        for mode in cap.required_access_modes:
            if (cap.name, mode) not in covered:
                gaps.append(f"  {cap.name:<30} {mode}")
    if gaps:
        pytest.fail("Missing capability × mode coverage:\n" + "\n".join(gaps))
```
|
||||
|
||||
**Self-validation fixture:** a separate test injects a synthetic capability
|
||||
`Capability("_test_sentinel", frozenset({"cli","mcp"}))` into a copy of
|
||||
`CAPABILITIES`, provides only a `cli`-marked test for it, and asserts that
|
||||
calling `collect_capability_coverage` on this patched set reports the `mcp`
|
||||
gap.
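
The self-validation idea can be sketched as follows. This is a minimal illustration, not the repository's code: `find_coverage_gaps` is a hypothetical helper, and the `Capability` dataclass only mirrors the two attributes the tests above rely on (`name`, `required_access_modes`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    # Assumed shape, inferred from how the tests above use a capability.
    name: str
    required_access_modes: frozenset


def find_coverage_gaps(capabilities, covered):
    """Return sorted (capability_name, mode) pairs that have no marked test."""
    return sorted(
        (cap.name, mode)
        for cap in capabilities
        for mode in cap.required_access_modes
        if (cap.name, mode) not in covered
    )


# Synthetic capability that requires two modes but only has a cli-marked test.
sentinel = Capability("_test_sentinel", frozenset({"cli", "mcp"}))
covered = {("_test_sentinel", "cli")}
gaps = find_coverage_gaps([sentinel], covered)
```

If the gap detection works, `gaps` contains exactly the missing `mcp` pair for the sentinel.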
|
||||
|
||||
### T13 — conftest.py: pytest marks registration and coverage collector helper
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T13
|
||||
state_hub_task_id: c518662a-9a5b-40de-86f5-582a16489cd3
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Register custom marks to silence `PytestUnknownMarkWarning`:
|
||||
|
||||
```toml
|
||||
# pyproject.toml
|
||||
[tool.pytest.ini_options]
|
||||
markers = [
|
||||
"capability(name): the bridge capability under test",
|
||||
"access_mode(mode): access mode being tested (cli, mcp, skill)",
|
||||
]
|
||||
```
|
||||
|
||||
Implement `collect_capability_coverage(session_or_items)` in `conftest.py`
|
||||
that walks collected items and returns `set[tuple[str, str]]` of
|
||||
`(capability_name, access_mode)` pairs.
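
One possible shape for the collector, as a sketch under stated assumptions: it accepts an iterable of collected items only (the session variant is left out), and it relies on pytest's `item.iter_markers(name=...)` to read marks. `FakeItem` is a hypothetical stand-in so the example runs without a pytest session.

```python
from types import SimpleNamespace


def collect_capability_coverage(items):
    """Return the set of (capability_name, access_mode) pairs found on items."""
    covered = set()
    for item in items:
        capabilities = [m.args[0] for m in item.iter_markers(name="capability") if m.args]
        modes = [m.args[0] for m in item.iter_markers(name="access_mode") if m.args]
        # A test counts only when it carries both marks.
        for cap in capabilities:
            for mode in modes:
                covered.add((cap, mode))
    return covered


class FakeItem:
    """Hypothetical stand-in for a pytest item; exposes iter_markers only."""

    def __init__(self, **marks):
        self._marks = marks

    def iter_markers(self, name=None):
        for mark_name, value in self._marks.items():
            if name is None or mark_name == name:
                yield SimpleNamespace(name=mark_name, args=(value,))


items = [
    FakeItem(capability="bridge_up", access_mode="mcp"),
    FakeItem(access_mode="skill"),  # no capability mark, so no pair is recorded
]
covered_pairs = collect_capability_coverage(items)
```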
|
||||
|
||||
---
|
||||
|
||||
## Phase 5 — Registration and Documentation
|
||||
|
||||
**Acceptance:** `python scripts/register_mcp.py` registers ops-bridge MCP at
|
||||
user scope; `bridge --help` still works; `uv run pytest` passes.
|
||||
|
||||
### T14 — User-scope registration guide and patch script
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T14
|
||||
state_hub_task_id: b86916ba-59f3-44c1-b874-8af92d30e470
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
`scripts/register_mcp.py` modelled on `state-hub/scripts/patch_mcp_cwd.py`:
|
||||
reads `.mcp.json`, registers at user scope via `claude mcp add-json -s user`,
|
||||
then patches `cwd` directly in `~/.claude.json`. Update `README.txt` with:
|
||||
|
||||
```
MCP INTEGRATION
---------------
Project-scope (auto, inside ops-bridge/):
  Already configured in .mcp.json.

User-scope (machine-global, any repo):
  python scripts/register_mcp.py
```
|
||||
|
||||
### T15 — Integration test: agent workflow (bridge_status → bridge_up → bridge_status)
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0003-T15
|
||||
state_hub_task_id: d826764f-e2f1-4f6a-842c-a1852a88b209
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
End-to-end MCP flow with mocked `TunnelManager`:
|
||||
|
||||
1. `bridge_status()` → all tunnels `stopped`
|
||||
2. `bridge_up("test-tunnel")` → `{"started": ["test-tunnel"]}`
|
||||
3. `bridge_status()` → `test-tunnel` now `connected`
|
||||
|
||||
Verifies the MCP layer correctly delegates to the library and state is
|
||||
reflected. Marked `@pytest.mark.capability("bridge_up") @pytest.mark.access_mode("mcp")`.
|
||||
|
||||
---
|
||||
|
||||
## Capability × Mode Coverage Target
|
||||
|
||||
| Capability | CLI | MCP | Skill |
|
||||
|-------------------------|-----|-----|-------|
|
||||
| bridge_up | ✓ | ✓ | |
|
||||
| bridge_down | ✓ | ✓ | |
|
||||
| bridge_restart | ✓ | ✓ | |
|
||||
| bridge_status | ✓ | ✓ | ✓ |
|
||||
| bridge_logs | ✓ | ✓ | |
|
||||
| catalog_list_targets | ✓ | ✓ | ✓ |
|
||||
| catalog_show_target | ✓ | ✓ | |
|
||||
| catalog_list_domains | ✓ | ✓ | |
|
||||
| catalog_validate | ✓ | ✓ | |
|
||||
| catalog_show_bridge | ✓ | ✓ | |
|
||||
|
||||
The skill only requires `bridge_status` and `catalog_list_targets` — the
|
||||
two capabilities needed for a health summary. All others are CLI+MCP only.
|
||||
|
||||
---
|
||||
|
||||
## Deferred
|
||||
|
||||
- **FR-27–29** — Identity provider integration — separate workplan.
|
||||
- **Skill coverage for lifecycle operations** — `/bridge-up`, `/bridge-down`
|
||||
skills for human operators are low value; agents use MCP tools directly.
|
||||
- **Remote MCP transport (SSE/HTTP)** — stdio is sufficient for local use;
|
||||
remote transport is a future concern when ops-bridge runs on a headless node.
|
||||
340
workplans/BRIDGE-WP-0004-directive-alignment.md
Normal file
@@ -0,0 +1,340 @@
|
||||
---
|
||||
id: BRIDGE-WP-0004
|
||||
type: workplan
|
||||
title: "AccessManagementDirective Alignment"
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: done
|
||||
owner: Bernd
|
||||
topic_slug: custodian
|
||||
created: "2026-03-28"
|
||||
updated: "2026-03-28"
|
||||
state_hub_workstream_id: "e3451b70-688e-4e19-bff5-0c82c0f009a7"
|
||||
---
|
||||
|
||||
# BRIDGE-WP-0004 — AccessManagementDirective Alignment
|
||||
|
||||
**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
|
||||
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
|
||||
preserving full backward compatibility with the existing static-key mode.
|
||||
|
||||
**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
|
||||
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
|
||||
|
||||
---
|
||||
|
||||
## Goal
|
||||
|
||||
After this workplan:
|
||||
|
||||
1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
|
||||
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
|
||||
`cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
|
||||
handled transparently by the tunnel manager.
|
||||
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
|
||||
the directive, with config validation that enforces naming conventions.
|
||||
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
|
||||
§5 SIEM traceability requirement.
|
||||
|
||||
---
|
||||
|
||||
## Reference Documents
|
||||
|
||||
| Document | Location |
|
||||
|---|---|
|
||||
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
|
||||
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
|
||||
| PRD | `wiki/OpsBridgePrd.md` |
|
||||
| FRS | `wiki/OpsBridgeFrs.md` |
|
||||
|
||||
---
|
||||
|
||||
## Design Decisions
|
||||
|
||||
### Static key mode stays first-class
|
||||
|
||||
If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
|
||||
`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
|
||||
explicitly supported for:
|
||||
- Lab/dev environments without a CA
|
||||
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
|
||||
- Environments below the directive's complexity threshold
|
||||
|
||||
### cert_command interface
|
||||
|
||||
```yaml
# tunnels.yaml — optional cert_command field
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
|
||||
|
||||
When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
|
||||
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
|
||||
`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
|
||||
on tunnel stop.
|
||||
|
||||
`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
|
||||
`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
|
||||
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
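
A minimal sketch of the acquisition step described above, assuming a POSIX shell. The function names, the explicit `state_dir` parameter, and the `RuntimeError` base class are illustrative choices, not the actual `manager.py` code.

```python
import subprocess
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command exits non-zero (base class is an assumption)."""


def acquire_cert(tunnel: str, cert_command: Optional[str], state_dir: Path) -> Optional[Path]:
    if cert_command is None:
        return None  # static key mode: nothing to acquire
    # cert_command is a raw shell string by design, hence shell=True.
    proc = subprocess.run(cert_command, shell=True, capture_output=True, text=True)
    if proc.returncode != 0:
        raise CertAcquisitionError(proc.stderr.strip())
    cert_path = state_dir / f"{tunnel}-cert.pub"
    cert_path.write_text(proc.stdout)  # overwritten on every acquisition
    return cert_path


def build_ssh_identity_args(key_path: Path, cert_path: Optional[Path]) -> list:
    """Return the -i arguments: key first, cert second when present."""
    args = ["-i", str(key_path)]
    if cert_path is not None:
        args += ["-i", str(cert_path)]
    return args
```

Because the command string is executed by a shell, it should come only from trusted configuration.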
|
||||
|
||||
### TTL-aware cert refresh
|
||||
|
||||
After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to
|
||||
determine `cert_expires_at`. It schedules a pre-emptive cert refresh
|
||||
(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
|
||||
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
|
||||
failure, no reconnect backoff triggered.
|
||||
|
||||
If `cert_command` is absent, no TTL logic runs.
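
The refresh check can be sketched as below. Two assumptions are made loudly: the parser only handles a validity line of the form `Valid: from <start> to <end>` in the `ssh-keygen -L` output, and it attaches the time zone passed in `tz` rather than detecting one, so the caller must know which zone the tool printed. Function names are hypothetical.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed line format: "Valid: from 2026-03-28T10:00:00 to 2026-03-28T11:00:00"
VALID_LINE = re.compile(r"Valid:\s+from\s+(\S+)\s+to\s+(\S+)")


def parse_cert_expiry(keygen_output: str, tz=timezone.utc) -> Optional[datetime]:
    """Return the expiry timestamp, or None when no 'from ... to ...' line is found."""
    match = VALID_LINE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever"
    return datetime.fromisoformat(match.group(2)).replace(tzinfo=tz)


def refresh_due(cert_expires_at: datetime, now: Optional[datetime] = None,
                margin: timedelta = timedelta(minutes=5)) -> bool:
    """True once the current time is within `margin` of the cert expiry."""
    now = now or datetime.now(timezone.utc)
    return now >= cert_expires_at - margin
```

The health-check loop would call `refresh_due` on each iteration and terminate the SSH child when it returns `True`.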
|
||||
|
||||
### Actor type model
|
||||
|
||||
`actor_class: str # "human" | "automation"` is replaced by:
|
||||
|
||||
```python
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline
```
|
||||
|
||||
Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
|
||||
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
|
||||
canonical values.
|
||||
|
||||
Config validation: if `actor` name is set, it must start with the prefix matching its type
|
||||
(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
|
||||
SIEM auditability.
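
The mapping and the prefix rule could look roughly like this. It is a sketch: `parse_actor_type`, `validate_actor_name`, and the `ValueError` base of `ConfigError` are assumptions, and how the real loader combines the prefix check with legacy-class configs is not shown here.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


class ConfigError(ValueError):
    """Base class is an assumption made for this sketch."""


LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    """Map a config value to ActorType, warning on legacy values."""
    if raw in LEGACY_CLASSES:
        mapped = LEGACY_CLASSES[raw]
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {mapped.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return mapped
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor class: {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Raise ConfigError unless the name carries the prefix of its type."""
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
```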
|
||||
|
||||
---
|
||||
|
||||
## Tasks
|
||||
|
||||
### T1 — ActorType enum
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T1
|
||||
state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
|
||||
- [x] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
|
||||
`ActorType.ATM` with a `DeprecationWarning`; reject unknown values
|
||||
- [x] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
|
||||
`atm-*` for ATM; raise `ConfigError` on mismatch
|
||||
- [x] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
|
||||
- [x] Update tests
|
||||
|
||||
### T2 — cert_command config field
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T2
|
||||
state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
|
||||
- [x] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
|
||||
content (shell-level freedom intentional)
|
||||
- [x] Document in config example / SCOPE.md
|
||||
|
||||
### T3 — Cert acquisition in manager
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T3
|
||||
state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
  - If `cfg.cert_command` is None: return None (static key mode)
  - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
  - Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
  - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
- [x] `build_ssh_command`: accept optional `cert_path`; when set, insert
  `-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
- [x] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
  so every reconnect gets a fresh cert
|
||||
|
||||
### T4 — cert_identity in audit log
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T4
|
||||
state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
|
||||
extract `Key ID` (the `-I` value from signing time)
|
||||
- [x] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
|
||||
JSON entry when present
|
||||
- [x] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
|
||||
- [x] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
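
Extracting the identity and attaching it to an audit entry might look like this. The sketch assumes the `ssh-keygen -L` output contains a line of the form `Key ID: "<value>"`; the entry field names other than `cert_identity` are hypothetical.

```python
import re
from typing import Optional

# Assumed line format: Key ID: "agt-state-hub-bridge"
KEY_ID_LINE = re.compile(r'^\s*Key ID:\s*"(.*)"\s*$', re.MULTILINE)


def parse_cert_identity(keygen_output: str) -> Optional[str]:
    """Return the Key ID from ssh-keygen -L text, or None if absent."""
    match = KEY_ID_LINE.search(keygen_output)
    return match.group(1) if match else None


def audit_entry(event: str, tunnel: str, cert_identity: Optional[str] = None) -> dict:
    """Build a JSON-ready entry; cert_identity appears only when a cert was used."""
    entry = {"event": event, "tunnel": tunnel}
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return entry
```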
|
||||
|
||||
### T5 — TTL-aware cert refresh
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T5
|
||||
state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp
|
||||
from `ssh-keygen -L` output → `cert_expires_at: datetime`
|
||||
- [x] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
|
||||
on each iteration
|
||||
- [x] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
|
||||
reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
|
||||
next iteration)
|
||||
- [x] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
|
||||
`AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
|
||||
- [x] If `cert_command` is absent, skip all TTL logic entirely
|
||||
|
||||
### T6 — `bridge cert-status` command
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T6
|
||||
state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
- [x] `cli.py`: add `cert-status [TUNNEL]` subcommand
|
||||
- [x] For each tunnel (or the named one): read cert file from state dir if present,
|
||||
run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
|
||||
time-to-expiry (or "static key / no cert" if absent)
|
||||
- [x] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
|
||||
- [x] `--json` flag for machine-readable output
|
||||
|
||||
### T7 — CertAcquisitionError handling
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T7
|
||||
state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] New exception `CertAcquisitionError` in `models.py`
|
||||
- [x] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
|
||||
with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
|
||||
(cert failures are transient — e.g., Vault briefly unreachable)
|
||||
- [x] After `max_attempts` consecutive cert failures, transition to `FAILED` state
|
||||
|
||||
### T8 — SCOPE.md and documentation updates
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T8
|
||||
state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
- [x] Update `SCOPE.md`: Current State updated to reflect completion; directive alignment done
|
||||
- [x] `wiki/OpsBridgeFrs.md` §5.7 already covers actor attribution abstractly — no changes needed
|
||||
- [x] `.claude/rules/architecture.md` already documents cert_command mode and actor vocab
|
||||
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency (deferred)
|
||||
|
||||
### T9 — Tests
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T9
|
||||
state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
|
||||
cert_command parse
|
||||
- [x] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
|
||||
args; verify `CertAcquisitionError` on non-zero exit; TTL logic helpers
|
||||
- [x] `test_audit.py`: `cert_identity` field; actor_type rename
|
||||
- [x] `test_cli.py`: `cert-status` exit codes; JSON output shape
|
||||
- [x] 233 tests, 0 failures
|
||||
|
||||
---
|
||||
|
||||
## Config Schema — Before / After
|
||||
|
||||
### Before
|
||||
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: ops-agent
    ssh_key: ~/.ssh/id_ed25519
    actor: automation-agent

actors:
  automation-agent:
    class: automation
    description: "state hub bridge agent"
```
|
||||
|
||||
### After (static key mode — unchanged behavior)
|
||||
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```
|
||||
|
||||
### After (cert_command mode — ops-warden or any CA)
|
||||
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```
|
||||
|
||||
---
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [x] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
|
||||
warning only); tunnel behaves identically
|
||||
- [x] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
|
||||
- [x] Config with `cert_command` set: SSH process launched with both `-i key` and
|
||||
`-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
|
||||
- [x] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
|
||||
no TTL logic runs
|
||||
- [x] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
|
||||
logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
|
||||
- [x] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
|
||||
- [x] `bridge cert-status` shows valid cert info; exits 1 on expired cert
|
||||
- [x] All tests pass: `uv run pytest` (233 passed)
|
||||
- [x] All lints pass: `uv run ruff check .`
|
||||
194
workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md
Normal file
@@ -0,0 +1,194 @@
|
||||
---
|
||||
id: BRIDGE-WP-0005
|
||||
type: workplan
|
||||
title: "Restart includes remote cleanup (blank-slate recovery)"
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: custodian
|
||||
created: "2026-06-21"
|
||||
updated: "2026-06-21"
|
||||
state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0"
|
||||
---
|
||||
|
||||
# BRIDGE-WP-0005 — Restart includes remote cleanup
|
||||
|
||||
**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan
|
||||
`sshd` remote forward on Railiance01 port `18000` blocked
|
||||
`bridge restart state-hub-railiance01` from producing a working tunnel. Operators
|
||||
had to discover `bridge maintenance cleanup <tunnel> --restart` separately.
|
||||
|
||||
**Operator expectation:** `bridge restart` should mean *operational again* — a
|
||||
blank-slate recovery — not merely "cycle the local manager PID while a broken
|
||||
remote listener still holds the port."
|
||||
|
||||
## Topology and failure modes (refined)
|
||||
|
||||
Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles.
|
||||
Cleanup policy must respect all of them.
|
||||
|
||||
### A. Workstation (laptop WSL) — tunnel **origin**
|
||||
|
||||
The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on
|
||||
remote hosts:
|
||||
|
||||
| Remote host | Tunnels (reverse) | Role |
|
||||
|-------------|-------------------|------|
|
||||
| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot |
|
||||
| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot |
|
||||
| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved |
|
||||
|
||||
**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill
|
||||
local bridge processes without graceful remote SSH teardown. Orphan `sshd`
|
||||
listeners on **all three remotes** are common after wake — especially
|
||||
`18000`/`18001` on VPS hosts that activity-core and remote agents depend on.
|
||||
|
||||
### B. Haskelseed — also intermittently offline
|
||||
|
||||
Haskelseed is not a datacenter VPS; it may be powered down or unreachable on
|
||||
different networks. The same orphan-forward pattern applies to its reverse ports
|
||||
when the workstation-side tunnel dies uncleanly.
|
||||
|
||||
### C. VPS remotes (coulombcore, railiance01)
|
||||
|
||||
Normally always-on. Maintenance reboots clear remote kernel state, but:
|
||||
|
||||
- a VPS reboot does **not** fix a workstation that is still in `reconnecting`
|
||||
with a dead local SSH child;
|
||||
- when the laptop returns, orphan forwards from the **previous** session may
|
||||
still block new `-R` binds if the VPS did not reboot.
|
||||
|
||||
**Conclusion:** conditional remote cleanup before restart benefits **all reverse
|
||||
tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already
|
||||
skips healthy forwards — VPS tunnels with live working forwards are untouched.
|
||||
|
||||
### D. Local-direction tunnels — no remote cleanup
|
||||
|
||||
`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use
|
||||
forward mode from workstation to remote services. They do not bind remote reverse
|
||||
ports for State Hub. **`restart` stays local stop/start only** for these.
|
||||
|
||||
## Design (decided)
|
||||
|
||||
| Command | Behaviour after this workplan |
|
||||
|---------|-------------------------------|
|
||||
| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. |
|
||||
| `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". |
|
||||
| `bridge up` | Out of scope here (see T4 optional follow-up). |
|
||||
|
||||
Implementation sketch: replace the body of `cli.restart()` with a call to
|
||||
`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels,
|
||||
or per-tunnel `cleanup_tunnel` when a single tunnel is named.
|
||||
|
||||
Emit the same action summary strings cleanup already uses (`healthy`,
|
||||
`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran.
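
The dispatch described in this section can be sketched with injected callables. Everything here is illustrative: `TunnelCfg`, the `"reverse"` direction value, the `"restarted"` summary string, and a `cleanup_tunnel` callable that returns the action string directly are assumptions, not the real `cli.py` signatures.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TunnelCfg:
    name: str
    direction: str  # assumed values: "reverse" or "local"


def restart_tunnel(cfg: TunnelCfg, cleanup_tunnel: Callable,
                   stop: Callable, start: Callable) -> str:
    """Return an action summary string for one tunnel."""
    if cfg.direction == "reverse":
        # Conditional remote cleanup, then start; the callable decides
        # whether the remote listener is stale.
        return cleanup_tunnel(cfg, restart=True)
    # Local-direction tunnels keep the plain stop/start path.
    stop(cfg)
    start(cfg)
    return "restarted"


def restart_exit_code(actions: Iterable[str]) -> int:
    """Non-zero when any tunnel ended in the 'error' action."""
    return 1 if "error" in list(actions) else 0
```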
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false
|
||||
positive during T2).
|
||||
- Auto-cleanup inside the reconnect backoff loop (stretch — T4).
|
||||
- Renaming tunnels or changing `tunnels.yaml` host entries.
|
||||
|
||||
---
|
||||
|
||||
## T1 — Wire restart through cleanup path
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0005-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
|
||||
```
|
||||
|
||||
Refactor `bridge/cli.py` `restart()` so reverse tunnels call
|
||||
`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare
|
||||
`TunnelManager.stop()` + `start()`.
|
||||
|
||||
Requirements:
|
||||
|
||||
- Single-tunnel and all-tunnel restart both work.
|
||||
- Local-direction tunnels keep stop/start only.
|
||||
- Exit codes: preserve today’s semantics where practical; exit non-zero if any
|
||||
named tunnel ends in `CleanupAction.action == "error"`.
|
||||
- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`,
|
||||
etc.), not only "Restarted tunnel".
|
||||
|
||||
## T2 — Tests and regression coverage
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0005-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
|
||||
```
|
||||
|
||||
Update `tests/test_cli.py`:
|
||||
|
||||
- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for
|
||||
reverse tunnels.
|
||||
- Add cases: healthy forward (no remote kill), stale forward (remote cleanup
|
||||
invoked), local-direction tunnel (no cleanup call).
|
||||
- Reuse mocks from `tests/test_cleanup.py` patterns.
|
||||
|
||||
`make test` and `make lint` pass.
|
||||
|
||||
## T3 — Operator docs and CLI help
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0005-T03
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
|
||||
```
|
||||
|
||||
Document the blank-slate restart contract:
|
||||
|
||||
- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down.
|
||||
- `bridge restart --help` — mention conditional remote stale-forward cleanup.
|
||||
- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS
|
||||
maintenance — matching this workplan's topology section.
|
||||
- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md`
|
||||
incident note (one line each way).
|
||||
|
||||
## T4 — Optional: reconnect-loop hygiene (stretch)
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0005-T04
|
||||
status: cancel
|
||||
priority: low
|
||||
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
|
||||
```
|
||||
|
||||
Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup
|
||||
once after repeated exit-255 bind failures (laptop wake without operator running
|
||||
`bridge restart`). Defer unless T1–T3 are done; mark `cancel` if heuristic risk
|
||||
outweighs benefit.
|
||||
|
||||
**Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect
|
||||
loop risks killing a legitimately healthy orphan forward owned by another session
|
||||
or operator. `bridge restart` now covers the operator-facing blank-slate path;
|
||||
nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if
|
||||
wake-from-sleep reconnect failures remain frequent after a month of observation.
|
||||
|
||||
## T5 — Live verification on workstation + VPS
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0005-T05
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
|
||||
```
|
||||
|
||||
After T1–T2 ship, verify on real config:
|
||||
|
||||
1. **railiance01** — `state-hub-mcp-railiance01` was `reconnecting` with stale
|
||||
forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached
|
||||
`connected`.
|
||||
2. **haskelseed** — not exercised (all tunnels already healthy); Alpine netstat
|
||||
path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
|
||||
3. **coulombcore** — `bridge restart state-hub-coulombcore` reported `healthy`,
|
||||
PID unchanged (4116), forward undisturbed.
|
||||
|
||||
State Hub progress logged (2026-06-21). Workplan marked `finished`.
|
||||
164
workplans/OPS-WP-0001-diagnostics.md
Normal file
@@ -0,0 +1,164 @@
|
||||
---
|
||||
id: OPS-WP-0001
|
||||
type: workplan
|
||||
title: "ops-bridge diagnostics and flow improvements"
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: done
|
||||
owner: claude
|
||||
topic_slug: custodian
|
||||
created: "2026-03-20"
|
||||
updated: "2026-03-20"
|
||||
state_hub_workstream_id: "6726cea2-447a-40b2-b0a0-edf495f07942"
|
||||
---
|
||||
|
||||
# OPS-WP-0001 — ops-bridge diagnostics and flow improvements
|
||||
|
||||
**Scope:** Add `bridge check` end-to-end diagnostics command, fix `bridge status` to
|
||||
surface live PID liveness and flag stale state, add a `bridge_check` MCP tool, and
|
||||
wire Makefile convenience targets in state-hub.
|
||||
|
||||
**Context:** During a session, `bridge status` reported "connected" but the reverse
|
||||
port forwarding was not active — stale `.state` files written by the daemon. The
|
||||
status command does not verify the SSH process is alive or that the remote port is
|
||||
actually listening.
|
||||
|
||||
---
|
||||
|
||||
## Task: Add `read_raw_pid()` to StateManager
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "05e98e85-699a-4982-bb3e-8f2538cde2c7"
|
||||
```
|
||||
|
||||
Add `read_raw_pid(name)` to `src/bridge/state.py` — reads PID from file without
|
||||
liveness check. Existing `read_pid()` (which also checks liveness) stays unchanged.
|
||||
|
||||
---
|
||||
|
||||
## Task: Create `src/bridge/diagnostics.py`
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "b68d7b1e-850b-469a-9de2-8b5d3d1f1c05"
|
||||
```
|
||||
|
||||
New module with `TunnelCheckResult` dataclass (ssh_process, pid, remote_port,
|
||||
local_api, latency_ms, stale_state, ok property) and `check_tunnel()` /
|
||||
`check_all_tunnels()` functions. SSH probe via subprocess; optional httpx health check.
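
A possible shape for the result type, as a sketch only: the field names come from the task description, while the field types, the defaults, and the exact rule behind `ok` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TunnelCheckResult:
    ssh_process: bool                   # local SSH child is alive
    pid: Optional[int]                  # PID read from state, if any
    remote_port: bool                   # remote port is listening
    local_api: Optional[bool] = None    # None when the optional HTTP check is skipped
    latency_ms: Optional[float] = None
    stale_state: bool = False           # state file claims more than reality shows

    @property
    def ok(self) -> bool:
        # Assumed rule: process and port must be up, state must not be stale,
        # and the HTTP check must not have failed when it was run.
        return (
            self.ssh_process
            and self.remote_port
            and self.local_api is not False
            and not self.stale_state
        )
```

With a rule like this, the incident described in the context paragraph (state says connected, no live forward) surfaces as `ok == False`.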
|
||||
|
||||
---
|
||||
|
||||
## Task: Fix `bridge status` and add `bridge check` to CLI
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T03
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "e87c6c5d-170c-4af3-905c-a48fae2edbe5"
|
||||
```
|
||||
|
||||
Fix `status` to show live PID liveness (LIVE column) and flag stale state.
|
||||
Add `check` command with `--json` flag; exit 1 if any tunnel not ok.
|
||||
Add `_print_check_table` helper.
|
||||
|
||||
---
|
||||
|
||||
## Task: Add `bridge_check` MCP tool and `bridge://check` resource
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T04
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "7e97c112-20e2-4e2e-b853-53b10998392b"
|
||||
```
|
||||
|
||||
Add `bridge_check(tunnel?)` tool and `bridge://check` resource to
|
||||
`src/bridge/mcp_server/server.py`.
|
||||
|
||||
---
|
||||
|
||||
## Task: Register `bridge_check` capability
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T05
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "c69fc748-a706-46db-a4d5-30d60222452b"
|
||||
```
|
||||
|
||||
Add `bridge_check` entry to `src/bridge/capabilities.py` with
|
||||
`required_access_modes=frozenset({"cli", "mcp"})`.
|
||||
|
||||
---
|
||||
|
||||
## Task: Write `tests/test_diagnostics.py`
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T06
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "070ed088-74a6-48d3-81cf-739c2a2fd21b"
|
||||
```
|
||||
|
||||
Unit tests: test_no_pid, test_pid_dead, test_pid_alive_port_listening,
|
||||
test_pid_alive_port_closed, test_ssh_timeout.
|
||||
|
||||
---
|
||||
|
||||
## Task: Add `TestCheckCommand` to `tests/test_cli.py`
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T07
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "aae5ddc5-f823-4647-a536-8604ddb97946"
|
||||
```
|
||||
|
||||
Tests: test_check_help, test_check_all_pass (marked capability+mode),
|
||||
test_check_any_fail, test_check_json_flag, test_check_specific_tunnel.
|
||||
|
||||
---
|
||||
|
||||
## Task: Add `TestMcpBridgeCheck` to `tests/test_mcp.py`
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T08
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "ed492a3d-7a5f-465e-8cc3-d2f992f5462c"
|
||||
```
|
||||
|
||||
Test: test_bridge_check_tool marked capability("bridge_check") + access_mode("mcp").
|
||||
|
||||
---
|
||||
|
||||
## Task: Add tunnels targets to state-hub Makefile
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T09
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "a3c77062-cff5-40e3-936c-b210b05f8839"
|
||||
```
|
||||
|
||||
Add `tunnels-up`, `tunnels-status`, `tunnels-check` targets delegating to `bridge`.
|
||||
Add to `.PHONY` line.
|
||||
|
||||
---
|
||||
|
||||
## Task: Run test suite and verify
|
||||
|
||||
```task
|
||||
id: OPS-WP-0001-T10
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "e42de76c-fab7-4924-8929-38fa9eaca478"
|
||||
```
|
||||
|
||||
`cd /home/worsch/ops-bridge && uv run pytest tests/ -v` — all tests green.
|
||||
221
workplans/OPS-WP-0002-agent-usability.md
Normal file
@@ -0,0 +1,221 @@
|
||||
---
|
||||
id: OPS-WP-0002
|
||||
type: workplan
|
||||
title: "Agent Usability — MCP Registration, Skill, and Worker Orientation"
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: done
|
||||
owner: custodian
|
||||
topic_slug: custodian
|
||||
created: "2026-03-21"
|
||||
updated: "2026-03-26"
|
||||
depends_on: OPS-WP-0001
|
||||
state_hub_workstream_id: "c195cc40-8be7-462e-be26-a7d6bda34cd5"
|
||||
---
|
||||
|
||||
# OPS-WP-0002 — Agent Usability: MCP Registration, Skill, and Worker Orientation
|
||||
|
||||
## Problem
|
||||
|
||||
The ops-bridge MCP server (`src/bridge/mcp_server/server.py`) is fully
|
||||
implemented with tools for `bridge_up/down/restart/status/check/logs` and
|
||||
catalog operations. But no agent can use it because:
|
||||
|
||||
1. **Not registered** — the server isn't in `~/.claude.json` and has no
|
||||
persistent transport mode. It only runs on stdio today.
|
||||
2. **No slash command** — agents working ad-hoc (not via MCP) have no
|
||||
quick way to check or restore tunnels.
|
||||
3. **No worker orientation** — agents on remote machines (CoulombCore,
|
||||
Railiance) don't know that bridge is available or how to use it when
|
||||
their state-hub connection drops.
|
||||
|
||||
## Goal
|
||||
|
||||
Any agent — on the workstation or a remote machine — can:
|
||||
- Check tunnel health in one call
|
||||
- Bring up a dropped tunnel without manual intervention
|
||||
- Recover the state-hub connection if it goes down mid-session
|
||||
|
||||
## Design
|
||||
|
||||
### MCP server (workstation, persistent)
|
||||
|
||||
Run as an SSE service on port 8002 (same pattern as state-hub on 8001).
|
||||
Registered at user scope in `~/.claude.json` so it's available to all
|
||||
Claude Code sessions.
|
||||
|
||||
The SSE transport is already supported by FastMCP, so the only change is
|
||||
in the entry point: select SSE for the `mcp.run()` call when an `--http`
|
||||
flag is passed, and read the port from a `BRIDGE_MCP_PORT` env var.
|
||||
|
||||
### Slash command skill (all machines)
|
||||
|
||||
A `/bridge` skill at `~/.claude/commands/bridge.md` (global scope) that:
|
||||
- Reads `bridge status` output
|
||||
- Surfaces any tunnel that is down or stale
|
||||
- Offers to bring it up
|
||||
- Useful on machines that don't have the MCP server registered
|
||||
|
||||
### Worker agent orientation (remote machines)
|
||||
|
||||
Update `CLAUDE.md` (global) and `ops-bridge` session protocol to tell
|
||||
worker agents:
|
||||
- Check `bridge status` at session start when on a machine with
|
||||
ops-bridge installed
|
||||
- If state-hub tunnel is down: run `bridge up state-hub-<machine>` to
|
||||
restore it before making any state-hub API calls
|
||||
- If the `bridge` command is not installed: fall back to the direct API URL if it is reachable
|
||||
|
||||
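The orientation rules above amount to a small decision procedure. The sketch below is illustrative only: the function name, its parameters, the returned strings and the `coulombcore` slug are assumptions made for this example, not part of ops-bridge.

```python
def session_start_action(bridge_installed, tunnel_connected, api_reachable, machine):
    """Pick the next step for a worker agent at session start.

    All names here are hypothetical; only the `bridge up state-hub-<machine>`
    command shape comes from the workplan text.
    """
    if bridge_installed:
        if tunnel_connected:
            return "proceed"
        # Restore the tunnel before making any state-hub API calls.
        return f"bridge up state-hub-{machine}"
    if api_reachable:
        # No bridge command on this machine: use the API directly.
        return "use direct API URL"
    return "state-hub unavailable"


print(session_start_action(True, False, False, "coulombcore"))
```

Keeping the decision in a pure function like this would make the three branches easy to unit test without touching real tunnels.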
---
|
||||
|
||||
## Tasks
|
||||
|
||||
### T01 — SSE transport mode for MCP server
|
||||
|
||||
```task
|
||||
id: OPS-WP-0002-T01
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "27fc6fa1-6d0e-438a-b4a3-c6091931da88"
|
||||
```
|
||||
|
||||
Add `--http` flag and `BRIDGE_MCP_PORT` env var to `server.py` entry
|
||||
point. When `--http` is set, run `mcp.run(transport="sse", port=PORT)`
|
||||
instead of stdio.
|
||||
|
||||
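A minimal sketch of the flag and env-var handling described above, assuming the entry point uses `argparse`. The helper name `resolve_transport` and the constant `DEFAULT_PORT` are hypothetical; the actual `mcp.run(...)` call is deliberately left out because its exact signature depends on the FastMCP version in use.

```python
import argparse
import os

DEFAULT_PORT = 8002  # default port named in this workplan


def resolve_transport(argv, env):
    """Return (transport, port) for the MCP server entry point.

    transport is "sse" when --http is given, otherwise "stdio".
    port is read from BRIDGE_MCP_PORT, falling back to DEFAULT_PORT,
    and is None for stdio because no socket is opened.
    """
    parser = argparse.ArgumentParser(description="ops-bridge MCP server")
    parser.add_argument("--http", action="store_true",
                        help="serve over SSE instead of stdio")
    args = parser.parse_args(argv)
    if not args.http:
        return "stdio", None
    # An unset or empty variable falls back to the default port.
    port = int(env.get("BRIDGE_MCP_PORT") or DEFAULT_PORT)
    return "sse", port


if __name__ == "__main__":
    print(resolve_transport(["--http"], dict(os.environ)))
```

The entry point would then pass the resolved transport and port on to the server start-up call.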
Add `make mcp-http` target to `Makefile`:
|
||||
```makefile
|
||||
mcp-http: ## Start MCP server in SSE mode (default port 8002)
|
||||
	BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http
|
||||
```
|
||||
|
||||
Add `make mcp-stop` target that kills any running MCP server on port
|
||||
8002.
|
||||
|
||||
Gate: `bridge_status()` tool callable via SSE on localhost:8002 after
|
||||
`make mcp-http`.
|
||||
|
||||
---
|
||||
|
||||
### T02 — Register MCP server in ~/.claude.json
|
||||
|
||||
```task
|
||||
id: OPS-WP-0002-T02
|
||||
status: done
|
||||
priority: high
|
||||
state_hub_task_id: "2216457d-035e-4804-b685-18975f3c6d1f"
|
||||
```
|
||||
|
||||
Register the ops-bridge MCP server at user scope:
|
||||
```bash
|
||||
claude mcp add-json -s user ops-bridge \
|
||||
'{"type":"sse","url":"http://127.0.0.1:8002/sse"}'
|
||||
```
|
||||
|
||||
Document in `ops-bridge` CLAUDE.md:
|
||||
```
|
||||
To start the MCP server:
|
||||
cd ~/ops-bridge && make mcp-http
|
||||
|
||||
To verify registration:
|
||||
python3 -c "import json,os; d=json.load(open(os.path.expanduser('~/.claude.json'))); print(list(d.get('mcpServers',{}).keys()))"
|
||||
```
|
||||
|
||||
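The verification one-liner above can also be written as a small helper, which is easier to reuse in a test or a gate script. This is a sketch: the function name `registered_mcp_servers` is hypothetical, and it assumes, as the one-liner does, that user-scope servers live under the top-level `mcpServers` key.

```python
import json
import os


def registered_mcp_servers(config_path="~/.claude.json"):
    """Return the sorted names of MCP servers listed in the config file."""
    path = os.path.expanduser(config_path)
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    # A missing key means no servers are registered.
    return sorted(config.get("mcpServers", {}).keys())
```

A gate check could then be as simple as `"ops-bridge" in registered_mcp_servers()`.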
Update global `~/.claude/CLAUDE.md` to list `ops-bridge` MCP server
|
||||
alongside `state-hub`.
|
||||
|
||||
Gate: `ops-bridge` appears in the Claude Code MCP tool list after
|
||||
`make mcp-http`.
|
||||
|
||||
---
|
||||
|
||||
### T03 — `/bridge` slash command skill
|
||||
|
||||
```task
|
||||
id: OPS-WP-0002-T03
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "4b2e39eb-4585-4e60-ab16-9e7909eced74"
|
||||
```
|
||||
|
||||
Create `~/.claude/commands/bridge.md` — a global Claude Code skill for
|
||||
tunnel management.
|
||||
|
||||
**Behaviour:**
|
||||
1. Run `bridge status` and parse output
|
||||
2. Report each tunnel: name, state, LIVE column
|
||||
3. For any tunnel that is `stopped`, `reconnecting`, or `[STALE]`:
|
||||
- Offer to run `bridge up <tunnel-name>`
|
||||
- After `bridge up`, re-check with `bridge check <tunnel-name>`
|
||||
4. If all tunnels are `connected` and LIVE: report green and exit
|
||||
|
||||
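The selection rule in step 3 can be sketched as a filter over already-parsed status rows. The `TunnelRow` type and `tunnels_to_recover` function are hypothetical, and the sketch does not assume any particular `bridge status` output format; it also accepts the `[STALE]` marker in either the state or the LIVE column, since the text above does not say which one carries it.

```python
from dataclasses import dataclass

RECOVERABLE_STATES = {"stopped", "reconnecting"}


@dataclass
class TunnelRow:
    name: str
    state: str  # e.g. "connected", "stopped", "reconnecting"
    live: str   # text of the LIVE column


def tunnels_to_recover(rows, only=None):
    """Return names of tunnels that should be offered `bridge up`.

    If `only` is given, restrict the check to that tunnel name.
    """
    selected = [r for r in rows if only is None or r.name == only]
    return [
        r.name
        for r in selected
        if r.state in RECOVERABLE_STATES
        or "[STALE]" in r.state
        or "[STALE]" in r.live
    ]
```

An empty result corresponds to step 4: every tunnel is healthy and the skill can report green.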
**Skill definition:**
|
||||
```yaml
|
||||
---
|
||||
description: >
|
||||
Check ops-bridge tunnel health and restore any dropped tunnels.
|
||||
Reports status of all configured tunnels and offers to bring up
|
||||
any that are stopped or stale.
|
||||
argument-hint: "[tunnel-name]"
|
||||
allowed-tools:
|
||||
- Bash(bridge status)
|
||||
- Bash(bridge up*)
|
||||
- Bash(bridge down*)
|
||||
- Bash(bridge check*)
|
||||
- Bash(bridge logs*)
|
||||
---
|
||||
```
|
||||
|
||||
If an optional tunnel name is passed as `$ARGUMENTS`, scope all
|
||||
operations to that tunnel only.
|
||||
|
||||
Gate: `/bridge` skill runs cleanly when all tunnels are up; correctly
|
||||
identifies and recovers a manually-stopped tunnel.
|
||||
|
||||
---
|
||||
|
||||
### T04 — Worker agent orientation in CLAUDE.md
|
||||
|
||||
```task
|
||||
id: OPS-WP-0002-T04
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "cc64bb07-ea5d-498a-8c14-bb653581efe7"
|
||||
```
|
||||
|
||||
Update global `~/.claude/CLAUDE.md` — add a **Worker Agent — Bridge
|
||||
Protocol** section:
|
||||
|
||||
```markdown
|
||||
## Worker Agent — Bridge Protocol
|
||||
|
||||
When working on a remote machine (CoulombCore, Railiance nodes):
|
||||
|
||||
1. At session start, check if `bridge` is installed:
|
||||
`which bridge && bridge status`
|
||||
2. If state-hub tunnel is down: `bridge up state-hub-<machine-slug>`
|
||||
Wait for state `connected` before making state-hub API calls.
|
||||
3. If `bridge` is not installed, check if the state-hub API is directly
|
||||
reachable: `curl -s http://127.0.0.1:8000/state/health`
|
||||
4. Only proceed without state-hub if absolutely necessary — log a
|
||||
progress note about the outage when connectivity is restored.
|
||||
```
|
||||
|
||||
Also add a one-liner reminder to the ops-bridge session protocol in
|
||||
`.claude/rules/session-protocol.md`:
|
||||
> At session start: `bridge status` — bring up any stopped tunnels
|
||||
> before accessing remote services.
|
||||
|
||||
Gate: `~/.claude/CLAUDE.md` contains the Worker Agent section; ops-bridge
|
||||
session protocol references bridge status check.
|
||||
|
||||
---
|
||||
|
||||
## Done Criteria
|
||||
|
||||
- [x] `make mcp-http` starts the MCP server on port 8002 (SSE)
|
||||
- [x] `bridge_status` and `bridge_check` callable as MCP tools from Claude Code
|
||||
- [x] `ops-bridge` registered in `~/.claude.json` at user scope
|
||||
- [x] `/bridge` skill surfaces tunnel states and recovers a stopped tunnel
|
||||
- [x] Global CLAUDE.md has worker agent bridge protocol
|
||||
- [x] All existing tests pass after T01 changes (`make test`)
|
||||