generated from coulomb/repo-seed
Compare commits: `a55c685f89...main` (35 commits)
| SHA1 | | | |
|---|---|---|---|
| 6572a2ac99 | |||
| ce0aa728b1 | |||
| 00671f5133 | |||
| 09f2cd4b7a | |||
| c3b4fb9d55 | |||
| fab7409c66 | |||
| 1dd664c792 | |||
| 10c6fdaec9 | |||
| 8c11acc00c | |||
| 499b8781cc | |||
| 4e9882909f | |||
| a6857fb8f7 | |||
| 675772ab3b | |||
| 6eb0b1c52f | |||
| d949f3e93e | |||
| de984736ca | |||
| 28ecef121e | |||
| 860c08f1db | |||
| bd169a07e2 | |||
| 22601ef3e6 | |||
| 569de1497c | |||
| fafd04ed2e | |||
| c1d87b47df | |||
| 204bf48bc8 | |||
| 595c495f7c | |||
| 90eda27a14 | |||
| 1361727e15 | |||
| 18e3c118dd | |||
| 621de64ee0 | |||
| f3a7236c5d | |||
| 4f3c8646b3 | |||
| 431beef31b | |||
| 1c7c6eedf8 | |||
| 75a559780e | |||
| d73b7be45d |
20
.claude/rules/agents.md
Normal file
@@ -0,0 +1,20 @@
## Kaizen Agents

Specialized agent personas available on demand via the state-hub MCP.

**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them

Common agents:

| Agent | Category | When to use |
|-------|----------|-------------|
| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
| `code-refactoring` | quality | Code quality analysis and safe refactoring |
| `test-maintenance` | testing | Diagnose and fix failing tests |
| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
| `keepaTodofile` | process | Maintain TODO.md during work |
| `project-management` | process | Track status, determine next steps |
| `datamodel-optimization` | quality | Optimize dataclasses and data structures |

All 17 agents: call `list_kaizen_agents()` for the full list.
@@ -1,31 +1,8 @@
|
||||
## Architecture
|
||||
|
||||
OpsBridge has two logical components:
|
||||
|
||||
**1. OpsBridge — tunnel lifecycle manager** (this repo)
|
||||
Manages named SSH reverse tunnels defined in `~/.config/bridge/tunnels.yaml`.
|
||||
Each tunnel runs in a subprocess with a reconnect backoff loop; PIDs are tracked
|
||||
in `~/.local/state/bridge/`. Bridge states: `stopped → starting → connected →
|
||||
degraded → failed`. The `degraded` state means SSH is up but the optional HTTP
|
||||
health check is failing.
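
The state list above can be modeled as a small enumeration. This is only a sketch: the `BridgeState` name and the `classify` helper are hypothetical and not taken from the OpsBridge source. Only the rule that `degraded` means "SSH up, optional health check failing" comes from the text above; treating a started tunnel with SSH down as `failed` is an assumption.

```python
from enum import Enum
from typing import Optional


class BridgeState(Enum):
    """Bridge states named in the architecture notes (hypothetical enum)."""

    STOPPED = "stopped"
    STARTING = "starting"
    CONNECTED = "connected"
    DEGRADED = "degraded"
    FAILED = "failed"


def classify(ssh_up: bool, health_ok: Optional[bool]) -> BridgeState:
    """Map two observations to a state for a tunnel that has been started.

    health_ok is None when no HTTP health check is configured.
    Assumption: a started tunnel whose SSH process is down counts as failed.
    """
    if not ssh_up:
        return BridgeState.FAILED
    if health_ok is False:
        # SSH is up but the optional health check is failing.
        return BridgeState.DEGRADED
    return BridgeState.CONNECTED
```

A tunnel without a configured health check can never be reported as `degraded` in this sketch, which matches the "optional" wording above.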
|
||||
|
||||
**2. OpsCatalog — operations knowledge repository** (planned extension)
|
||||
A Git-backed YAML catalog of operations domains, targets, bridges, and actor
|
||||
classes. OpsBridge consumes this catalog to resolve bridge identifiers and
|
||||
orient operators. Schema examples are in `wiki/OpsCatalogSpecification.md`.
|
||||
The catalog layout follows: `opscatalog/domains/<domain>/{domain.yaml,
|
||||
targets/, bridges/, docs/}`.
|
||||
|
||||
Key design constraints:
|
||||
- OpsBridge owns lifecycle management only; it does not own identity/credentials
|
||||
- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
|
||||
in config, CLI args, and log filenames must stay consistent
|
||||
- Actor attribution (human operator vs. automation agent) is tracked per bridge
|
||||
for audit log traceability (FRS §5.7)
|
||||
|
||||
Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
|
||||
(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
|
||||
<!-- TODO: Describe the key design decisions and component structure.
|
||||
Key modules, data flows, external integrations, state machines, etc. -->
|
||||
|
||||
## Quick Reference
|
||||
|
||||
`~/the-custodian/state-hub/mcp_server/TOOLS.md`
|
||||
`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
|
||||
|
||||
50
.claude/rules/credential-routing.md
Normal file
@@ -0,0 +1,50 @@
# Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
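
For agents that drive the lookup from Python rather than a shell, the `find` command above could be wrapped roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, and the only thing taken from this document is the command line itself. The shape of the JSON output is not described here, so the decoded value is returned unchanged.

```python
import json
import subprocess


def route_find_argv(need: str) -> list[str]:
    """Build the argv for `warden route find "<need>" --json`."""
    return ["warden", "route", "find", need, "--json"]


def route_find(need: str):
    """Run the lookup and decode its JSON output.

    Assumption: the command prints a single JSON document on stdout.
    Its schema is not documented here, so it is returned as-is.
    """
    result = subprocess.run(
        route_find_argv(need), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

Passing the need as a single argv element avoids shell quoting problems when the description contains spaces or quotes.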

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
38
.claude/rules/first-session.md
Normal file
@@ -0,0 +1,38 @@
## First Session Protocol

Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
The project is registered but work has not yet been structured.

**Step 1 — Read, don't write**
- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
- Scan repo root: README, directory structure, existing code or docs

**Step 2 — Survey in-progress work**
Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.

**Step 3 — Propose workstreams to Bernd**
Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
roadmap phase. **Wait for approval before creating.**

**Step 4 — Create workplan file first, then DB record (ADR-001)**
```
workplans/BRIDGE-WP-NNNN-<slug>.md ← write this first
```
Then register in the hub:
```
create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
create_task(workstream_id="<id>", title="...", priority="high|medium|low")
```

**Step 5 — Record the setup**
```
add_progress_event(
    summary="First session: structured infotech into N workstreams, M tasks",
    event_type="milestone",
    topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
    detail={"workstreams": [...], "tasks_created": M}
)
```

<!-- Delete or archive this file once past first session -->
@@ -1,6 +1,8 @@
|
||||
## Repo boundary
|
||||
|
||||
This repo owns **tunnel lifecycle management only**. It does not own:
|
||||
- State hub code → `the-custodian/state-hub/`
|
||||
- SSH key management → `railiance-infra/` (S1) or user dotfiles
|
||||
- Ansible/provisioning → `railiance-infra/`
|
||||
This repo owns **ops-bridge** only. It does not own:
|
||||
|
||||
<!-- TODO: List what belongs in adjacent repos, e.g.:
|
||||
- SSH key management → railiance-infra/
|
||||
- State hub code → state-hub/
|
||||
-->
|
||||
|
||||
@@ -1,7 +1,5 @@
|
||||
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution
|
||||
environments (COULOMBCORE, Railiance nodes) connected to the local Custodian
|
||||
State Hub so Claude Code sessions on those machines have full MCP connectivity.
|
||||
**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.
|
||||
|
||||
**Domain:** custodian
|
||||
**Domain:** infotech
|
||||
**Repo slug:** ops-bridge
|
||||
**Repo ID:** 1bf99f56-6e94-4379-a9ea-295a4c181889
|
||||
**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
|
||||
|
||||
@@ -1,24 +1,85 @@
|
||||
## Custodian State Hub Integration
|
||||
## Session Protocol
|
||||
|
||||
State Hub: http://127.0.0.1:8000
|
||||
|
||||
### Session Protocol
|
||||
Dev Hub (State Hub API): http://127.0.0.1:8000
|
||||
MCP server name in `~/.claude.json`: `dev-hub`
|
||||
|
||||
**Step 1 — Orient**
|
||||
|
||||
Read the offline-safe brief first — it works without a live hub connection:
|
||||
```bash
|
||||
cat .custodian-brief.md
|
||||
```
|
||||
get_domain_summary("custodian")
|
||||
Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
|
||||
```
|
||||
get_domain_summary("infotech")
|
||||
```
|
||||
If MCP tools are unavailable in the current agent session, use the REST API:
|
||||
```bash
|
||||
curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
|
||||
```
|
||||
If the hub is offline: `cd ~/state-hub && make api`
|
||||
|
||||
**Step 2 — Check inbox**
|
||||
With MCP tools:
|
||||
```
|
||||
get_messages(to_agent="ops-bridge", unread_only=True)
|
||||
```
|
||||
Mark read with `mark_message_read(message_id)`. Reply or act on coordination
|
||||
requests before proceeding.
|
||||
|
||||
Without MCP tools:
|
||||
```bash
|
||||
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
|
||||
| python3 -m json.tool
|
||||
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
|
||||
-H "Content-Type: application/json" -d '{}'
|
||||
```
|
||||
|
||||
**Step 2 — Scan workplans**
|
||||
```
|
||||
**Step 3 — Scan workplans**
|
||||
```bash
|
||||
ls workplans/
|
||||
```
|
||||
For each file with `status: ready`, `active`, or `blocked`, note pending
|
||||
`wait`/`todo`/`progress` tasks.
|
||||
|
||||
**During work:** use `record_decision()`, `add_progress_event()`, `resolve_decision()`.
|
||||
**Step 4 — Present brief**
|
||||
|
||||
**Session close:** `add_progress_event()` with workstream_id.
|
||||
1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
|
||||
2. **Pending tasks** from `workplans/` + any `[repo:ops-bridge]` hub tasks
|
||||
3. **Goal guidance** — if `goal_guidance` in summary:
|
||||
- `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
|
||||
- `alignment_warnings`: flag if active work is not aligned with current goal
|
||||
4. **Suggested next action** — highest-priority open item
|
||||
5. **SBOM status** — flag if `last_sbom_at` is unset for this repo
|
||||
|
||||
If workplan files were modified, run from `~/the-custodian/state-hub/`:
|
||||
```bash
|
||||
make fix-consistency REPO=ops-bridge
|
||||
If no workstreams: follow First Session Protocol (`first-session.md`).
|
||||
|
||||
**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
|
||||
|
||||
> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
|
||||
> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
|
||||
|
||||
**Session close:**
|
||||
With MCP tools:
|
||||
```
|
||||
add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
|
||||
```
|
||||
Without MCP tools:
|
||||
```bash
|
||||
curl -s -X POST http://127.0.0.1:8000/progress/ \
|
||||
-H "Content-Type: application/json" \
|
||||
-d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
|
||||
```
|
||||
If workplan files were modified, ensure the local copy is up to date first:
|
||||
```bash
|
||||
git -C <repo_path> pull --ff-only
|
||||
cd ~/state-hub && make fix-consistency REPO=ops-bridge
|
||||
```
|
||||
For repos where implementation runs on a remote machine (e.g. CoulombCore),
|
||||
use the combined target which pulls before fixing:
|
||||
```bash
|
||||
cd ~/state-hub && make fix-consistency-remote REPO=ops-bridge
|
||||
```
|
||||
**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
|
||||
will sync the file to match DB. **C-16** (repo behind remote) blocks all writes
|
||||
until you pull — intentional to prevent clobbering remote progress.
|
||||
|
||||
@@ -1,46 +1,19 @@
|
||||
## What this repo builds
|
||||
|
||||
A CLI tool (`bridge`) that manages named SSH reverse tunnels:
|
||||
|
||||
```
|
||||
bridge up [TUNNEL] # start tunnel(s)
|
||||
bridge down [TUNNEL] # stop tunnel(s)
|
||||
bridge restart [TUNNEL] # restart tunnel(s)
|
||||
bridge status # show all tunnels: state, uptime, last health check
|
||||
bridge logs [TUNNEL] # tail reconnect log
|
||||
```
|
||||
|
||||
Config file: `~/.config/bridge/tunnels.yaml`
|
||||
|
||||
Each tunnel:
|
||||
- Named (e.g. `state-hub-coulombcore`)
|
||||
- Reverse SSH port-forward: `ssh -R remote_port:127.0.0.1:local_port host`
|
||||
- Auto-reconnects on drop (backoff loop)
|
||||
- Optional HTTP health check to confirm the forwarded service is reachable
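
The per-tunnel behaviour listed above can be sketched in a few lines. The function names are hypothetical and nothing here is taken from the actual `bridge` implementation. The `-R` forward mirrors the command shown in the list, while the `-N` flag (no remote command) and the exponential schedule with a 60-second cap are assumptions chosen for illustration.

```python
def reverse_forward_argv(host: str, remote_port: int, local_port: int) -> list[str]:
    """Build the argv for `ssh -R remote_port:127.0.0.1:local_port host`."""
    # -N (do not run a remote command) is an assumption for a tunnel-only session.
    return ["ssh", "-N", "-R", f"{remote_port}:127.0.0.1:{local_port}", host]


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    Exponential backoff: base * 2**attempt, never more than `cap`.
    The exponent is clamped so very large attempt counts cannot overflow.
    """
    return min(cap, base * (2 ** min(attempt, 32)))
```

A reconnect loop would start the process built by `reverse_forward_argv`, wait for it to exit, sleep for `backoff_delay(attempt)`, and reset `attempt` to zero after a connection that stayed up.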
|
||||
|
||||
PRD: `workplans/BRIDGE-WP-0001-initial-implementation.md`
|
||||
|
||||
## Stack
|
||||
|
||||
- **Language:** Python 3.11+
|
||||
- **CLI framework:** Typer
|
||||
- **Dependencies:** typer, pyyaml, httpx
|
||||
- **Packaging:** `uv tool install` (single command install, no venv activation)
|
||||
- **No system daemons** — process management is internal, PID tracked in
|
||||
`~/.local/state/bridge/`
|
||||
<!-- TODO: Fill in language, frameworks, and key dependencies -->
|
||||
- **Language:**
|
||||
- **Key deps:**
|
||||
|
||||
## Dev Commands
|
||||
|
||||
```bash
|
||||
# Install locally for development
|
||||
uv tool install -e .
|
||||
# TODO: Fill in the standard commands for this repo
|
||||
|
||||
# Install dependencies
|
||||
|
||||
# Run tests
|
||||
uv run pytest
|
||||
|
||||
# Run a single test
|
||||
uv run pytest tests/test_tunnel.py::test_name -v
|
||||
# Lint / type check
|
||||
|
||||
# Lint
|
||||
uv run ruff check .
|
||||
# Build / package (if applicable)
|
||||
```
|
||||
|
||||
@@ -1,6 +1,40 @@
|
||||
### Workplan Convention (ADR-001)
|
||||
## Workplan Convention (ADR-001)
|
||||
|
||||
File location: `workplans/BRIDGE-WP-NNNN-<slug>.md`
|
||||
Prefix: `BRIDGE-WP`
|
||||
ID prefix: `BRIDGE-WP-`
|
||||
|
||||
<!-- Ralph Loop rules are defined globally in ~/.claude/CLAUDE.md — do not duplicate here -->
|
||||
Work items originate as files in this repo **before** being registered in the hub.
|
||||
|
||||
Canonical workplan/workstream frontmatter statuses are:
|
||||
`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
|
||||
Use `proposed` for a newly drafted plan, `ready` after review against current
|
||||
repo state, and `finished` when implementation is complete. `stalled` and
|
||||
`needs_review` are derived health labels, not stored statuses.
|
||||
|
||||
Closed workplans may be moved to `workplans/archived/` with a completion-date
|
||||
prefix: `YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The frontmatter id remains
|
||||
unchanged; the prefix is only for quick visual reference.
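
The archive naming rule can be expressed as a small helper. This is a sketch only: `archived_path` is a hypothetical name, and the function computes the target path without moving any file or touching the frontmatter.

```python
from datetime import date
from pathlib import PurePosixPath


def archived_path(workplan: str, completed: date) -> PurePosixPath:
    """Return the archive location for a closed workplan file.

    The original file name is kept and prefixed with the completion
    date in YYMMDD form, as described in the convention above.
    """
    name = PurePosixPath(workplan).name
    prefix = completed.strftime("%y%m%d")
    return PurePosixPath("workplans/archived") / f"{prefix}-{name}"
```

For example, `archived_path("workplans/BRIDGE-WP-0001-initial-implementation.md", date(2026, 7, 3))` yields `workplans/archived/260703-BRIDGE-WP-0001-initial-implementation.md`; the date here is illustrative.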
|
||||
|
||||
Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
|
||||
`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
|
||||
`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
|
||||
directly. Promote anything requiring analysis, design, approval, dependencies, or
|
||||
multiple planned phases into a normal workplan.
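
The three ad hoc naming patterns share one date stamp, which the following sketch makes explicit. `adhoc_names` is a hypothetical helper, not part of the state-hub tooling.

```python
from datetime import date


def adhoc_names(day: date, task_count: int) -> dict:
    """Derive the ad hoc workplan file, workstream slug, and task ids for a day."""
    stamp = day.isoformat()  # YYYY-MM-DD
    return {
        "file": f"workplans/ADHOC-{stamp}.md",
        "workstream_slug": f"adhoc-{stamp}",
        # Task ids are numbered T01, T02, ... in order of creation.
        "task_ids": [f"ADHOC-{stamp}-T{n:02d}" for n in range(1, task_count + 1)],
    }
```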
|
||||
|
||||
Ecosystem todos from other agents arrive as `[repo:ops-bridge]` hub tasks —
|
||||
visible at session start. Pick one up by creating the workplan file, then registering
|
||||
the workstream.
|
||||
|
||||
Task blocks use this shape:
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-NNNN-T01
|
||||
status: wait | todo | progress | done | cancel
|
||||
priority: high | medium | low
|
||||
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
|
||||
```
|
||||
|
||||
Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
|
||||
blocked work and `cancel` for stopped work.
|
||||
|
||||
<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
|
||||
|
||||
7
.codex/config.toml
Normal file
@@ -0,0 +1,7 @@
[mcp_servers.ops-bridge]
command = "uv"
args = [
  "run",
  "python",
  "src/bridge/mcp_server/server.py",
]
18
.custodian-brief.md
Normal file
@@ -0,0 +1,18 @@
<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
# Custodian Brief — ops-bridge

**Domain:** infotech
**Last synced:** 2026-07-03 16:52 UTC
**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*

## Active Workstreams

*(none — repo may need first-session setup)*

---
## MCP Orientation (when available)

If the state-hub MCP server is reachable, call:
`get_domain_summary("infotech")`
This provides richer cross-domain context.
If the MCP call fails, use this file as your orientation source.
26
.repo-classification.yaml
Normal file
@@ -0,0 +1,26 @@
# Repo classification (Repo Classification Standard v1.0).

repo_classification:
  standard: Repo Classification Standard
  version: '1.0'
  classified_at: '2026-06-22'
  classified_by: human
  category: tooling
  domain: infotech
  secondary_domains: []
  capability_tags:
  - operations
  - access-control
  - platform
  - observability
  - orchestration
  business_stake:
  - operations
  - technology
  - automation
  business_mechanics:
  - control
  - operation
  - adaptation
  notes: SSH reverse-tunnel lifecycle manager keeping remote environments connected to the
    State Hub. Operational tooling -> product.
219
AGENTS.md
Normal file
@@ -0,0 +1,219 @@
# ops-bridge — Agent Instructions

## Repo Identity

**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.

**Domain:** infotech
**Repo slug:** ops-bridge
**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
**Workplan prefix:** `BRIDGE-WP-`

---

## State Hub Integration

The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
there is no MCP server for Codex agents.

| Context | URL |
|---------|-----|
| Local workstation | `http://127.0.0.1:8000` |
| Remote via tunnel | `http://127.0.0.1:18000` |

### Orient at session start

```bash
# Offline brief — works without hub connection
cat .custodian-brief.md

# Active workstreams for this domain
curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
  | python3 -m json.tool

# Check inbox
curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
  | python3 -m json.tool
```

Mark a message read:
```bash
curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
  -H "Content-Type: application/json" -d '{}'
```

### Log progress (required at session close)

```bash
curl -s -X POST http://127.0.0.1:8000/progress/ \
  -H "Content-Type: application/json" \
  -d '{
    "summary": "what was done",
    "event_type": "note",
    "author": "codex",
    "workstream_id": "<uuid>",
    "task_id": "<uuid>"
  }'
```

Omit `workstream_id` / `task_id` when not applicable.
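
When the progress event is sent from a script rather than `curl`, the optional fields can be left out programmatically. The sketch below uses only the Python standard library; the helper names are hypothetical, and the field names and URL are the ones shown in the `curl` example. The request object is built but not sent.

```python
import json
import urllib.request
from typing import Optional

PROGRESS_URL = "http://127.0.0.1:8000/progress/"


def progress_payload(
    summary: str,
    author: str = "codex",
    event_type: str = "note",
    workstream_id: Optional[str] = None,
    task_id: Optional[str] = None,
) -> dict:
    """Build the JSON body, leaving out ids that do not apply."""
    payload = {"summary": summary, "event_type": event_type, "author": author}
    if workstream_id is not None:
        payload["workstream_id"] = workstream_id
    if task_id is not None:
        payload["task_id"] = task_id
    return payload


def build_progress_request(payload: dict, url: str = PROGRESS_URL) -> urllib.request.Request:
    """Prepare (but do not send) the POST request for a progress event."""
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of passing the request to `urllib.request.urlopen`. On a remote machine the tunnel URL from the table above would be passed as `url` instead of the default.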

### Update task status

```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"status": "progress"}'
# values: wait | todo | progress | done | cancel
```

### Flag a task for human review

```bash
curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
  -H "Content-Type: application/json" \
  -d '{"needs_human": true, "intervention_note": "reason"}'
```

---

## Session Protocol

**Start:**
1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
2. Check inbox: `GET /messages/?to_agent=ops-bridge&unread_only=true`; mark read
3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
4. Check human-needed tasks: `GET /tasks/?needs_human=true`

**During work:**
- Update task statuses in workplan files as tasks progress
- Record significant decisions via `POST /decisions/`

**Close:**
1. Update workplan file task statuses to reflect progress
2. Log: `POST /progress/` with a summary of what changed
3. Note for the custodian operator: after workplan file changes, run from
   `~/state-hub`:
   ```bash
   make fix-consistency REPO=ops-bridge
   ```
   This syncs task status from files into the hub DB.

---

## Credential and access routing

**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
for inference. Run this check **before** requesting secrets, API keys, SSH access,
login tokens, or database passwords — in any repo, not only `ops-warden`.

ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
other credential need belongs to another subsystem. **Do not** message
`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.

### Lookup (do this first)

```bash
warden route find "<describe your need>" --json
warden route show <catalog-id> --json
```

Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).

| Agent runtime | How to orient |
| --- | --- |
| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |

### Quick routing table

| I need… | Owner | ops-warden executes? |
| --- | --- | --- |
| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
| Authorization decision | flex-auth | No — route only |
| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |

### Anti-patterns (do not do these)

- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
- Pasting secrets into Git, State Hub, workplans, logs, or chat

### Other capabilities (reuse-surface)

Non-credential capabilities are usually discovered through **reuse-surface** federation
(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
every repo's agent instructions because it is high-frequency, high-risk, and easy to
get wrong.

**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`

<!-- REPO-AGENTS-EXTENSIONS -->
<!-- Append repo-specific agent instructions below this marker.
The state-hub template sync preserves content after this line. -->

---

## Workplan Convention (ADR-001)

Work items originate as files in this repo — not in the hub. The hub is a
read/cache/index layer that rebuilds from files.

**File location:** `workplans/OPS-WP-NNNN-<slug>.md`

**Archived location:** finished workplans may move to
`workplans/archived/YYMMDD-OPS-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
the completion/archive date; the frontmatter `id` does not change.

**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
this only for low-risk work completed directly; create a normal workplan for
anything needing analysis, design, approval, dependencies, or multiple phases.

**Frontmatter:**

```yaml
---
id: OPS-WP-NNNN
type: workplan
title: "..."
domain: infotech
repo: ops-bridge
status: proposed | ready | active | blocked | backlog | finished | archived
owner: codex
topic_slug: ...
created: "YYYY-MM-DD"
updated: "YYYY-MM-DD"
state_hub_workstream_id: "<uuid>" # written by fix-consistency — do not edit
---
```

Use `proposed` for a new draft, `ready` after review against current repo
state, and `finished` after implementation. `stalled` and `needs_review` are
derived health labels, not frontmatter statuses.

**Task block format** (one per `##` section):

```
## Task Title

` ` `task
id: OPS-WP-NNNN-T01
status: wait | todo | progress | done | cancel
priority: high | medium | low
state_hub_task_id: "<uuid>" # written by fix-consistency — do not edit
` ` `

Task description text.
```

Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
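
The task block fields are simple enough to check mechanically before handing a workplan to `fix-consistency`. The following is a sketch and not the hub's own parser: the function names are hypothetical, the parser assumes one `key: value` pair per line, and it treats everything after a `#` as a comment, so values containing `#` are not supported.

```python
TASK_STATUSES = ("wait", "todo", "progress", "done", "cancel")
TASK_PRIORITIES = ("high", "medium", "low")


def parse_task_block(text: str) -> dict:
    """Parse the `key: value` lines found inside a task block."""
    fields = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key.strip()] = value.strip().strip('"')
    return fields


def validate_task(fields: dict) -> list:
    """Return a list of problems; an empty list means the block looks valid."""
    problems = []
    if not fields.get("id"):
        problems.append("missing id")
    if fields.get("status") not in TASK_STATUSES:
        problems.append(f"unknown status: {fields.get('status')!r}")
    if fields.get("priority") not in TASK_PRIORITIES:
        problems.append(f"unknown priority: {fields.get('priority')!r}")
    return problems
```

Note that `stalled` and `needs_review` are rejected as task statuses here, consistent with their description as derived health labels rather than stored values.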

To create a new workplan:
1. Write the file following the format above
2. Notify the custodian operator to run `make fix-consistency REPO=ops-bridge`
   (or send a message to the hub agent via `POST /messages/`)
@@ -1,8 +1,12 @@
|
||||
# ops-bridge — Claude Code Instructions
|
||||
|
||||
@SCOPE.md
|
||||
@.claude/rules/repo-identity.md
|
||||
@.claude/rules/session-protocol.md
|
||||
@.claude/rules/first-session.md
|
||||
@.claude/rules/workplan-convention.md
|
||||
@.claude/rules/stack-and-commands.md
|
||||
@.claude/rules/architecture.md
|
||||
@.claude/rules/repo-boundary.md
|
||||
@.claude/rules/credential-routing.md
|
||||
@.claude/rules/agents.md
|
||||
|
||||
92
INTENT.md
Normal file
@@ -0,0 +1,92 @@
# INTENT

## Purpose

This repository exists to provide a **reliable, inspectable, and controllable connectivity layer**
between distributed dev, build, test, and execution environments for dev and ops personnel, both human and agentic.

Its role is to ensure that remote machines can **consistently and safely “phone home”** without requiring complex network infrastructure or manual intervention.

---

## Primary Utility

The repository provides a **managed SSH reverse tunneling system** that:

* Maintains continuous connectivity between remote systems and a central hub
* Makes connectivity **observable, auditable, and controllable**
* Exposes this capability as both a **CLI tool and an MCP-accessible service**

It transforms raw SSH port-forwarding into a **first-class operational primitive**.

---

## Intended Users

* Human operators (`adm`) managing infrastructure and connectivity
* LLM-based agents (`agt`) requiring stable access to local services
* Deterministic automations (`atm`) coordinating distributed workloads

---

## Strategic Role in the System

This repository acts as the **connectivity backbone** of the custodian ecosystem:

* It enables remote agents and services to participate in a **locally anchored control plane**
* It decouples **execution location** from **control location**
* It supports a **hub-and-spoke topology** where the Custodian State Hub remains central

---

## Strategic Boundaries

This repository is **not** intended to:

* Replace SSH as a general-purpose access mechanism
* Act as a credential authority or security policy engine
* Provide full network virtualization (e.g., VPN, mesh networking)
* Host or orchestrate application workloads

Its responsibility ends at **secure, observable, and managed connectivity via tunnels**.

---

## Design Principles

* **Continuity over convenience**
  Connectivity must persist across failures without manual recovery

* **Observability as a first-class concern**
  All lifecycle events must be traceable and attributable

* **Actor-aware operations**
  Every action is tied to a clearly defined actor type (`adm`, `agt`, `atm`)

* **Pluggable security integration**
  Works with both static keys and external certificate authorities without owning them

* **Toolability**
  All capabilities should be accessible programmatically (MCP) and operationally (CLI)

---

## Maturity Target

A mature version of this repository should:

* Provide **fully autonomous tunnel lifecycle management** across heterogeneous environments
* Integrate seamlessly with **centralized access control and certificate systems**
* Serve as a **standardized connectivity primitive** across all Custodian-managed systems
* Offer **complete operational transparency** for all connectivity-related actions
* Be robust enough to act as the **default connectivity layer** for distributed agent systems

---

## Stability Note

Changes to this file represent a **deliberate shift in repository purpose or role** within the system architecture.

Such changes should be rare and made with explicit intent.

31
Makefile
@@ -1,10 +1,31 @@
.PHONY: test lint install
.DEFAULT_GOAL := help

test:
.PHONY: help setup test lint install mcp-http mcp-stop cron-install-cron cron-uninstall-cron

help: ## List available make targets
	@awk 'BEGIN {FS = ":.*## "}; /^[a-zA-Z0-9_.-]+:.*## / {printf " %-16s %s\n", $$1, $$2}' $(MAKEFILE_LIST)

setup: ## Sync dependencies and install the bridge CLI wrapper
	uv sync --all-groups
	uv tool install -e . --force

test: ## Run the test suite
	uv run pytest

lint:
lint: ## Run ruff lint checks
	uv run ruff check .

install:
	uv tool install -e .
install: ## Install the bridge CLI wrapper
	uv tool install -e . --force

mcp-http: ## Start MCP server in SSE mode (default port 8002)
	BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http

mcp-stop: ## Stop MCP server running on port 8002
	@lsof -ti:$${BRIDGE_MCP_PORT:-8002} | xargs -r kill -TERM && echo "MCP server stopped" || echo "No MCP server running on port $${BRIDGE_MCP_PORT:-8002}"

cron-install-cron: ## Install 03:00 nightly stale-forward cleanup cron
	bridge maintenance install-cron

cron-uninstall-cron: ## Remove nightly stale-forward cleanup cron
	bridge maintenance uninstall-cron

25
README.txt
@@ -243,6 +243,31 @@ has not yet cleaned up the socket), so the next reconnect attempt hits
"remote port forwarding failed" and exits with code 255. With ClientAlive
enabled, sshd evicts stale sessions within ~90 seconds and frees the port.

NIGHTLY STALE-FORWARD CLEANUP
------------------------------

When a bridge client dies without tearing down its SSH session, the remote
host can keep port 18000 (etc.) bound to a zombie sshd listener. The port
accepts connections but never forwards them, which breaks in-cluster proxies
such as actcore-state-hub-bridge on railiance01.

Install a 03:00 local-time cron job that probes each reverse tunnel's remote
forward, kills stale listeners when the local service is healthy but the
remote forward is not, and restarts the tunnel:

    bridge maintenance install-cron

Manual run:

    bridge maintenance cleanup --restart

Inspect or remove the cron entry:

    bridge maintenance show-cron
    bridge maintenance uninstall-cron

Logs append to ~/.local/state/bridge/cleanup.log
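
The rule described above (kill the remote listener only when the local service is healthy but the remote forward is not) can be sketched as a small pure function. This is a minimal illustration, not the bridge implementation: `ForwardProbe` and `looks_stale` are hypothetical names, and the three booleans are assumed to have been collected by earlier probes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardProbe:
    """Hypothetical snapshot of probe results for one reverse tunnel."""

    remote_port_listening: bool   # something is bound to the remote port
    remote_forward_healthy: bool  # the service answers through that port
    local_service_healthy: bool   # the service answers when asked directly


def looks_stale(probe: ForwardProbe) -> bool:
    """Return True when the remote listener looks like a zombie.

    Stale means: the remote port is bound and the local service is fine,
    yet requests through the remote forward fail.
    """
    if not probe.remote_port_listening:
        return False  # nothing is bound, so there is nothing to clear
    if probe.remote_forward_healthy:
        return False  # the forward works end to end
    return probe.local_service_healthy


if __name__ == "__main__":
    zombie = ForwardProbe(
        remote_port_listening=True,
        remote_forward_healthy=False,
        local_service_healthy=True,
    )
    print(looks_stale(zombie))
```

Keeping the decision separate from the SSH and HTTP probes makes it easy to unit-test without a remote host.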

Apply and reload (no disconnect):

    sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config

51
SCOPE.md
@@ -8,7 +8,7 @@

## One-liner

SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards.
SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.

---

@@ -20,11 +20,17 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

## In Scope

- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs`)
- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
- Auto-reconnect with exponential backoff and configurable retry policy
- Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
- Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
- Actor attribution: per-tunnel actor class (human / automation) for audit traceability
- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
  with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
  works without any CA or external tooling
- **cert_command mode** (optional): pluggable shell command that issues a short-lived
  CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
  `cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
- PID + state file management in `~/.local/state/bridge/`
- MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
- OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
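
The "auto-reconnect with exponential backoff" item in the list above can be illustrated with a short generator. This is a sketch under stated assumptions: the parameter names (`base`, `factor`, `cap`, `max_retries`, `jitter`) are hypothetical and are not taken from the bridge retry-policy configuration, which this diff does not show.

```python
import random
from typing import Iterator


def backoff_delays(
    base: float = 1.0,
    factor: float = 2.0,
    cap: float = 60.0,
    max_retries: int = 8,
    jitter: float = 0.0,
) -> Iterator[float]:
    """Yield the wait in seconds before each reconnect attempt.

    The delay grows geometrically (base * factor**attempt) and is
    clamped at `cap`. `jitter` is a fraction of the delay added at
    random so that many clients do not reconnect in lockstep.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * factor**attempt)
        yield delay + random.uniform(0.0, jitter * delay)


if __name__ == "__main__":
    for seconds in backoff_delays(max_retries=5):
        print(f"wait {seconds:.1f}s before the next attempt")
```

With `jitter=0.0` the sequence is deterministic, which keeps a retry policy like this easy to test.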
@@ -33,7 +39,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

## Out of Scope

- Identity/credential management (uses existing SSH keys)
- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
  certs via the `cert_command` interface but never signs anything itself)
- SSH key generation for human admins (self-service: `ssh-keygen`)
- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
- Long-running application hosting on remote machines (port-forward only, not deployment)
- VPN or layer-3 connectivity
- Monitoring/alerting beyond JSON audit logs
@@ -44,9 +53,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Relevant When

- Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
- Need audit trail of which actor (human vs. automation) started/stopped tunnels
- Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels
- Setting up a new machine in the Railiance ecosystem that must phone home to the hub
- Diagnosing connectivity issues between local hub and remote services
- Checking certificate validity for active tunnels (`bridge cert-status`)
- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials

---

@@ -60,8 +71,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

## Current State

- Status: experimental → active (v0.1 core complete; OpsCatalog planned but not yet shipped)
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health checks and audit logging complete; OpsCatalog framework present but not populated
- Status: active (v0.1 core complete; AccessManagementDirective alignment done — BRIDGE-WP-0004)
- Implementation: ~80% — CLI tunneling fully functional, MCP integration working, health
  checks and audit logging complete; ActorType enum (adm/agt/atm) enforced; cert_command
  mode implemented with TTL-aware refresh and cert_identity audit logging; OpsCatalog
  framework present but not yet populated
- Stability: stable tunnel lifecycle; tested under network drops and SSH failures
- Usage: running in lab for daily Railiance/Temporal connectivity

@@ -77,17 +91,24 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

## Terminology

- Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check
- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
  cert_command, cert_identity
- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
- Also known as: "the bridge"
- Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
- Potentially confusing: "bridge state" is a tunnel-specific state machine
  (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)
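
The actor types and legacy aliases listed above, together with the `adm-*` / `agt-*` / `atm-*` naming convention, could be checked along the following lines. This is a hedged sketch, not the project's `ActorType` enum: `normalize_actor_type` and `validate_actor` are hypothetical helper names used only for illustration.

```python
# Legacy values from the Terminology list, mapped to their replacements.
LEGACY_ALIASES = {"human": "adm", "automation": "atm"}
ACTOR_TYPES = ("adm", "agt", "atm")


def normalize_actor_type(value: str) -> str:
    """Map a legacy alias to its current actor type and reject unknown values."""
    value = LEGACY_ALIASES.get(value, value)
    if value not in ACTOR_TYPES:
        raise ValueError(f"unknown actor type: {value!r}")
    return value


def validate_actor(name: str, actor_type: str) -> str:
    """Return the canonical actor type if `name` carries the matching prefix."""
    canonical = normalize_actor_type(actor_type)
    if not name.startswith(f"{canonical}-"):
        raise ValueError(f"actor {name!r} must start with '{canonical}-'")
    return canonical


if __name__ == "__main__":
    print(validate_actor("adm-bernd", "human"))
    print(validate_actor("agt-claude-coulombcore", "agt"))
```

Normalizing first and checking the prefix second means a legacy `human` actor is still held to the `adm-` prefix.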

---

## Related / Overlapping Repositories
## Related / Overlapping

- `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
  `cert_command` when short-lived certificates are required
- `activity-core` — Temporal server on remote reached via ops-bridge tunnel
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns
  host-side principal deployment (`/etc/ssh/auth_principals/`)

---

@@ -105,5 +126,9 @@ keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge
## Getting Oriented

- Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), `~/.local/state/bridge/` (PID/state files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; MCP: `bridge_status()`
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
  `~/.local/state/bridge/` (PID/state/cert files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
  MCP: `bridge_status()`
- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)

@@ -11,7 +11,7 @@ dependencies = [
    "typer>=0.12",
    "pyyaml>=6.0",
    "httpx>=0.27",
    "fastmcp>=2.0.0",
    "fastmcp>=2.0.0,<3.1.0",
]

[project.scripts]

12
registry/README.md
Normal file
@@ -0,0 +1,12 @@
# Capability Registry

Markdown-first capability index for federation and reuse planning.

## Authoring

1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
2. Add the row to `indexes/capabilities.yaml`.
3. Run `reuse-surface validate` from a checkout with the CLI installed.
4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.

Federation contract: reuse-surface `docs/RegistryFederation.md`.
0
registry/capabilities/.gitkeep
Normal file
4
registry/indexes/capabilities.yaml
Normal file
@@ -0,0 +1,4 @@
version: 1
updated: '2026-06-16'
domain: helix_forge
capabilities: []
@@ -16,6 +16,7 @@ class AuditEvent(str, Enum):
    HEALTH_CHECK_FAILED = "health_check_failed"
    HEALTH_CHECK_RECOVERED = "health_check_recovered"
    BRIDGE_STOPPED = "bridge_stopped"
    CERT_EXPIRING = "cert_expiring"


def _default_state_dir() -> Path:
@@ -34,19 +35,22 @@ class AuditLogger:
        tunnel: str,
        event: AuditEvent,
        actor: str,
        actor_class: str,
        actor_type: str,
        detail: str = "",
        cert_identity: Optional[str] = None,
    ) -> None:
        self._dir.mkdir(parents=True, exist_ok=True)
        entry: Dict[str, Any] = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tunnel": tunnel,
            "actor": actor,
            "actor_class": actor_class,
            "actor_type": actor_type,
            "event": event.value,
        }
        if detail:
            entry["detail"] = detail
        if cert_identity:
            entry["cert_identity"] = cert_identity
        with self._log_path(tunnel).open("a") as f:
            f.write(json.dumps(entry) + "\n")

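
The audit format touched by this hunk is one JSON object per line, with `detail` and `cert_identity` written only when set. A self-contained sketch of writing and reading such a log follows. The field names come from the diff above; `append_event`, `read_events`, the file name `demo.jsonl`, and the sample actor and identity values are hypothetical and used only for illustration.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional


def append_event(
    log_path: Path,
    tunnel: str,
    event: str,
    actor: str,
    actor_type: str,
    detail: str = "",
    cert_identity: Optional[str] = None,
) -> Dict[str, Any]:
    """Append one audit entry as a single JSON line and return it."""
    entry: Dict[str, Any] = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tunnel": tunnel,
        "actor": actor,
        "actor_type": actor_type,
        "event": event,
    }
    # Optional fields are omitted rather than written as empty values.
    if detail:
        entry["detail"] = detail
    if cert_identity:
        entry["cert_identity"] = cert_identity
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def read_events(log_path: Path) -> List[Dict[str, Any]]:
    """Parse every non-empty line of the log back into a dict."""
    return [json.loads(line) for line in log_path.read_text().splitlines() if line]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "demo.jsonl"
        append_event(path, "demo", "connected", "adm-example", "adm")
        append_event(path, "demo", "cert_expiring", "agt-example", "agt",
                     cert_identity="agt-example@demo")
        for row in read_events(path):
            print(row["event"], row.get("cert_identity"))
```

Because each entry is a complete line, the log can be tailed or filtered line by line without parsing the whole file.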
@@ -73,6 +73,11 @@ CAPABILITIES: list[Capability] = [
        description="End-to-end tunnel diagnostics via SSH: SSH PID alive + remote port listening",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
    Capability(
        name="bridge_cert_status",
        description="Show certificate status for tunnels using cert_command mode",
        required_access_modes=frozenset({"cli"}),
    ),
]

CAPABILITIES_BY_NAME: dict[str, Capability] = {c.name: c for c in CAPABILITIES}

328
src/bridge/cleanup.py
Normal file
@@ -0,0 +1,328 @@
"""Nightly maintenance: detect and clear stale SSH remote port forwards."""
from __future__ import annotations

import subprocess
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse, urlunparse

import httpx

from bridge.diagnostics import _remote_port_probe_command, check_tunnel
from bridge.manager import TunnelManager
from bridge.models import TunnelConfig
from bridge.state import StateManager


@dataclass
class CleanupAction:
    tunnel: str
    action: str  # skipped | healthy | cleaned | cleaned_and_restarted | error
    detail: str = ""


@dataclass
class CleanupReport:
    actions: list[CleanupAction]

    @property
    def cleaned_count(self) -> int:
        return sum(1 for a in self.actions if a.action.startswith("cleaned"))


def remote_forward_health_url(cfg: TunnelConfig) -> Optional[str]:
    """Map the local health_check URL to the remote forwarded port."""
    if cfg.health_check is None or cfg.direction == "local":
        return None
    parsed = urlparse(cfg.health_check.url)
    if not parsed.hostname:
        return None
    netloc = f"{parsed.hostname}:{cfg.remote_port}"
    return urlunparse(parsed._replace(netloc=netloc))


def _ssh_base_cmd(cfg: TunnelConfig) -> list[str]:
    from pathlib import Path

    return [
        "ssh",
        "-i",
        str(Path(cfg.ssh_key).expanduser()),
        "-o",
        "BatchMode=yes",
        "-o",
        "ConnectTimeout=10",
        "-o",
        "StrictHostKeyChecking=accept-new",
        f"{cfg.ssh_user}@{cfg.host}",
    ]


def _run_ssh(cfg: TunnelConfig, remote_command: str, *, timeout: float = 30) -> subprocess.CompletedProcess[str]:
    return subprocess.run(
        [*_ssh_base_cmd(cfg), remote_command],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def remote_port_listening(cfg: TunnelConfig) -> bool:
    proc = _run_ssh(cfg, _remote_port_probe_command(cfg.remote_port), timeout=15)
    return proc.stdout.strip() == "ok"


def probe_remote_forward(cfg: TunnelConfig) -> tuple[bool, str]:
    """Return (healthy, detail) for the remote forwarded service."""
    url = remote_forward_health_url(cfg)
    if url is None:
        return True, "no remote health url configured"
    timeout = cfg.health_check.timeout_seconds if cfg.health_check else 5
    remote_cmd = (
        f"curl -sf --max-time {timeout} {url!r} >/dev/null "
        "&& echo ok || echo fail"
    )
    try:
        proc = _run_ssh(cfg, remote_cmd, timeout=timeout + 15)
    except subprocess.TimeoutExpired:
        return False, "remote health probe timed out"
    output = proc.stdout.strip()
    if output == "ok":
        return True, "remote forward healthy"
    if proc.returncode != 0 and proc.stderr.strip():
        return False, proc.stderr.strip()
    return False, "remote forward unhealthy"


def local_service_healthy(cfg: TunnelConfig) -> Optional[bool]:
    if cfg.health_check is None:
        return None
    try:
        resp = httpx.get(
            cfg.health_check.url,
            timeout=cfg.health_check.timeout_seconds,
        )
        return resp.is_success
    except Exception:
        return False


def _remote_cleanup_script(port: int) -> str:
    return f"""set -eu
port={port}
pids=""
if command -v lsof >/dev/null 2>&1; then
  pids=$(sudo -n lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
  if [ -z "$pids" ]; then
    pids=$(lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
  fi
fi
if [ -z "$pids" ] && command -v fuser >/dev/null 2>&1; then
  pids=$(fuser -n tcp $port 2>/dev/null | tr -s ' ' '\\n' | grep -E '^[0-9]+$' || true)
fi
if [ -z "$pids" ]; then
  echo "no_listeners"
  exit 0
fi
echo "killing:$pids"
for pid in $pids; do
  kill "$pid" 2>/dev/null || sudo -n kill "$pid" 2>/dev/null || true
done
sleep 1
if ss -tln 2>/dev/null | grep -q ":$port "; then
  echo "still_listening"
else
  echo "cleared"
fi
"""


def clear_stale_remote_binding(cfg: TunnelConfig) -> tuple[bool, str]:
    try:
        proc = _run_ssh(cfg, _remote_cleanup_script(cfg.remote_port), timeout=30)
    except subprocess.TimeoutExpired:
        return False, "remote cleanup timed out"
    output = proc.stdout.strip()
    if "cleared" in output:
        return True, output
    if "no_listeners" in output:
        return True, "no listeners found"
    if "still_listening" in output:
        return False, output
    detail = output or proc.stderr.strip() or f"exit {proc.returncode}"
    return False, detail


def should_cleanup_tunnel(
    cfg: TunnelConfig,
    state_mgr: StateManager,
) -> tuple[bool, str]:
    """Decide whether a reverse tunnel's remote binding looks stale."""
    if cfg.direction == "local":
        return False, "local tunnel"

    if not remote_port_listening(cfg):
        return False, "remote port closed"

    remote_ok, remote_detail = probe_remote_forward(cfg)
    if remote_ok:
        return False, remote_detail

    check = check_tunnel(cfg, state_mgr)
    local_ok = local_service_healthy(cfg)

    if local_ok is True and not remote_ok:
        return True, f"stale forward: {remote_detail}"

    if check.ssh_process != "ok" and check.remote_port == "listening":
        return True, f"orphan forward while ssh {check.ssh_process}: {remote_detail}"

    if check.ssh_process == "ok" and not remote_ok:
        return True, f"broken forward with live client: {remote_detail}"

    return False, remote_detail


def cleanup_tunnel(
    cfg: TunnelConfig,
    state_mgr: StateManager,
    *,
    restart: bool,
) -> CleanupAction:
    name = cfg.name
    try:
        needed, reason = should_cleanup_tunnel(cfg, state_mgr)
        if not needed:
            return CleanupAction(name, "healthy", reason)

        ok, detail = clear_stale_remote_binding(cfg)
        if not ok:
            return CleanupAction(name, "error", f"cleanup failed: {detail}")

        if not restart:
            return CleanupAction(name, "cleaned", f"{reason}; {detail}")

        mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
        was_running = mgr.is_running()
        if was_running:
            mgr.stop()
        mgr.start()
        action = "cleaned_and_restarted"
        verb = "restarted" if was_running else "started"
        return CleanupAction(name, action, f"{reason}; {verb} tunnel; {detail}")
    except Exception as exc:
        return CleanupAction(name, "error", str(exc))


def restart_tunnel(
    cfg: TunnelConfig,
    state_mgr: StateManager,
) -> CleanupAction:
    """Restart one tunnel with blank-slate recovery for reverse tunnels."""
    if cfg.direction == "local":
        mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
        mgr.stop()
        mgr.start()
        return CleanupAction(cfg.name, "restarted", "local tunnel stop/start")
    return cleanup_tunnel(cfg, state_mgr, restart=True)


def restart_all_tunnels(
    cfg,
    state_mgr: StateManager,
) -> list[CleanupAction]:
    """Restart every inline tunnel (reverse via cleanup path, local via stop/start)."""
    return [restart_tunnel(tcfg, state_mgr) for tcfg in cfg.tunnels.values()]


def cleanup_all_tunnels(
    cfg,
    state_mgr: StateManager,
    *,
    restart: bool,
    tunnel_name: Optional[str] = None,
) -> CleanupReport:
    tunnels = cfg.tunnels.values()
    if tunnel_name is not None:
        if tunnel_name not in cfg.tunnels:
            raise KeyError(tunnel_name)
        tunnels = [cfg.tunnels[tunnel_name]]

    actions = [
        cleanup_tunnel(tcfg, state_mgr, restart=restart)
        for tcfg in tunnels
        if tcfg.direction != "local"
    ]
    return CleanupReport(actions=actions)


CRON_MARKER = "# ops-bridge: maintenance cleanup"
CRON_SCHEDULE = "0 3 * * *"
CRON_LOG = "~/.local/state/bridge/cleanup.log"


def build_cron_line() -> str:
    bridge_bin = "~/.local/bin/bridge"
    return (
        f"{CRON_SCHEDULE} BRIDGE_CONFIG=~/.config/bridge/tunnels.yaml "
        f"{bridge_bin} maintenance cleanup --restart "
        f">> {CRON_LOG} 2>&1 {CRON_MARKER}"
    )


def read_installed_cron() -> Optional[str]:
    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if proc.returncode != 0:
        return None
    for line in proc.stdout.splitlines():
        if CRON_MARKER in line:
            return line.strip()
    return None


def install_cleanup_cron() -> tuple[bool, str]:
    existing = read_installed_cron()
    if existing:
        return False, f"cron already installed: {existing}"

    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    current = proc.stdout if proc.returncode == 0 else ""
    new_line = build_cron_line()
    body = current.rstrip("\n")
    if body:
        body += "\n"
    body += new_line + "\n"
    write = subprocess.run(
        ["crontab", "-"],
        input=body,
        capture_output=True,
        text=True,
    )
    if write.returncode != 0:
        return False, write.stderr.strip() or "crontab write failed"
    return True, new_line


def uninstall_cleanup_cron() -> tuple[bool, str]:
    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if proc.returncode != 0:
        return False, "no crontab installed"
    kept = [
        line
        for line in proc.stdout.splitlines()
        if CRON_MARKER not in line
    ]
    if len(kept) == len(proc.stdout.splitlines()):
        return False, "cleanup cron not found"
    body = "\n".join(kept).rstrip("\n")
    if body:
        body += "\n"
    write = subprocess.run(
        ["crontab", "-"],
        input=body,
        capture_output=True,
        text=True,
    )
    if write.returncode != 0:
        return False, write.stderr.strip() or "crontab write failed"
    return True, "removed cleanup cron entry"
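
The cron helpers above stay idempotent by tagging their line with a marker comment and searching for that marker before writing. The same idea can be exercised without touching a real crontab by working on plain strings. The sketch below makes that assumption: `add_entry` and `remove_entry` are hypothetical names, and the crontab content is passed in and returned as text rather than piped through `crontab`.

```python
from typing import Tuple

MARKER = "# ops-bridge: maintenance cleanup"


def add_entry(crontab: str, line: str) -> Tuple[bool, str]:
    """Return (changed, new_crontab); no-op if a marked line already exists."""
    if any(MARKER in existing for existing in crontab.splitlines()):
        return False, crontab
    body = crontab.rstrip("\n")
    if body:
        body += "\n"
    return True, body + line + "\n"


def remove_entry(crontab: str) -> Tuple[bool, str]:
    """Return (changed, new_crontab) with every marked line dropped."""
    lines = crontab.splitlines()
    kept = [entry for entry in lines if MARKER not in entry]
    if len(kept) == len(lines):
        return False, crontab
    body = "\n".join(kept)
    return True, (body + "\n" if body else "")


if __name__ == "__main__":
    job = f"0 3 * * * bridge maintenance cleanup --restart {MARKER}"
    changed, table = add_entry("MAILTO=ops\n", job)
    print(changed)
    print(table, end="")
```

Separating the text transformation from the `crontab` subprocess calls lets the install and uninstall logic be tested on any machine.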
@@ -4,12 +4,24 @@ from __future__ import annotations
import dataclasses
import json
import os
import subprocess
from datetime import datetime
from pathlib import Path
from typing import Optional

import typer

from bridge.audit import AuditLogger
from bridge.cleanup import (
    CleanupAction,
    build_cron_line,
    cleanup_all_tunnels,
    install_cleanup_cron,
    read_installed_cron,
    restart_all_tunnels,
    restart_tunnel,
    uninstall_cleanup_cron,
)
from bridge.config import ConfigError, load_config
from bridge.diagnostics import check_all_tunnels, check_tunnel
from bridge.manager import TunnelManager
@@ -23,9 +35,11 @@ app = typer.Typer(

targets_app = typer.Typer(help="Inspect infrastructure targets from the OpsCatalog.")
catalog_app = typer.Typer(help="Inspect and validate the OpsCatalog.")
maintenance_app = typer.Typer(help="Scheduled maintenance for tunnel hygiene.")

app.add_typer(targets_app, name="targets")
app.add_typer(catalog_app, name="catalog")
app.add_typer(maintenance_app, name="maintenance")


def _state_dir() -> Path:
@@ -142,27 +156,37 @@ def down(
        raise typer.Exit(2)


def _emit_restart_actions(actions: list[CleanupAction]) -> None:
    any_error = False
    for action in actions:
        typer.echo(f"{action.tunnel}: {action.action} — {action.detail}")
        if action.action == "error":
            any_error = True
    if any_error:
        raise typer.Exit(1)


@app.command()
def restart(
    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
):
    """Restart one or all tunnels."""
    """Restart one or all tunnels.

    Reverse tunnels run conditional remote stale-forward cleanup before
    reconnecting; healthy forwards are left running. Local-direction tunnels
    use local stop/start only.
    """
    cfg = _load_or_exit()
    sd = _state_dir()
    state_mgr = StateManager(state_dir=sd)

    if tunnel:
        tcfg = _resolve_tunnel(cfg, tunnel)
        mgr = TunnelManager(tcfg, state_dir=sd)
        mgr.stop()
        mgr.start()
        typer.echo(f"Restarted tunnel '{tunnel}'.")
        actions = [restart_tunnel(tcfg, state_mgr)]
    else:
        for name in _all_tunnel_names(cfg):
            tcfg = cfg.tunnels[name]
            mgr = TunnelManager(tcfg, state_dir=sd)
            mgr.stop()
            mgr.start()
            typer.echo(f"Restarted tunnel '{name}'.")
        actions = restart_all_tunnels(cfg, state_mgr)

    _emit_restart_actions(actions)


@app.command()
@@ -357,6 +381,84 @@ def _print_check_table(results):
        typer.echo(_fmt(row))


@app.command("cert-status")
def cert_status(
    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
    """Show certificate status for tunnels using cert_command mode."""
    cfg = _load_or_exit()
    sd = _state_dir()

    names = [tunnel] if tunnel else list(cfg.tunnels.keys())
    rows = []
    any_expired = False

    for name in names:
        cert_file = sd / f"{name}-cert.pub"
        if not cert_file.exists():
            rows.append({"tunnel": name, "mode": "static-key", "cert_file": None})
            continue

        try:
            result = subprocess.run(
                ["ssh-keygen", "-L", "-f", str(cert_file)],
                capture_output=True, text=True, check=False,
            )
            info = {"tunnel": name, "mode": "cert", "cert_file": str(cert_file)}
            for line in result.stdout.splitlines():
                line = line.strip()
                if line.startswith("Key ID:"):
                    info["key_id"] = line.split(":", 1)[1].strip().strip('"')
                elif line.startswith("Valid:"):
                    parts = line.split()
                    if len(parts) >= 5 and parts[1] == "from" and parts[3] == "to":
                        info["valid_from"] = parts[2]
                        info["valid_until"] = parts[4]
                        try:
                            expires = datetime.fromisoformat(parts[4])
                            now = datetime.now()
                            remaining = expires - now
                            if remaining.total_seconds() <= 0:
                                info["expired"] = True
                                any_expired = True
                            else:
                                info["expired"] = False
                                mins = int(remaining.total_seconds() // 60)
                                info["ttl_remaining"] = f"{mins}m"
                        except ValueError:
                            pass
            rows.append(info)
        except FileNotFoundError:
            rows.append({"tunnel": name, "mode": "cert", "error": "ssh-keygen not found"})

    if as_json:
        typer.echo(json.dumps(rows, indent=2))
    else:
        for row in rows:
            mode = row.get("mode", "unknown")
            if mode == "static-key":
                typer.echo(f"{row['tunnel']} static-key / no cert")
            elif "error" in row:
                typer.echo(f"{row['tunnel']} ERROR: {row['error']}")
            else:
                parts = [row["tunnel"]]
                if "key_id" in row:
                    parts.append(f"id={row['key_id']}")
                if "valid_from" in row:
                    parts.append(f"from={row['valid_from']}")
                if "valid_until" in row:
                    parts.append(f"until={row['valid_until']}")
                if row.get("expired"):
                    parts.append("EXPIRED")
                elif "ttl_remaining" in row:
                    parts.append(f"ttl={row['ttl_remaining']}")
                typer.echo(" ".join(parts))

    if any_expired:
        raise typer.Exit(1)


# ─── targets commands ─────────────────────────────────────────────────────────

@targets_app.callback(invoke_without_command=True)
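
The validity handling inside `cert_status` above can be isolated into a pure function that takes the current time as an argument, which makes the expiry arithmetic easy to check. This is a sketch under stated assumptions: `parse_validity` is a hypothetical helper, the input is assumed to be a single `Valid: from <start> to <end>` line as printed by `ssh-keygen -L`, and both timestamps are treated as naive local times, as in the command above.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_validity(line: str, now: datetime) -> Optional[timedelta]:
    """Return the time remaining for a 'Valid: from X to Y' line.

    Returns None when the line has another shape (for example a
    certificate without an expiry) or the end timestamp is not ISO 8601.
    A zero or negative result means the certificate has expired.
    """
    parts = line.strip().split()
    if len(parts) < 5 or parts[0] != "Valid:":
        return None
    if parts[1] != "from" or parts[3] != "to":
        return None
    try:
        expires = datetime.fromisoformat(parts[4])
    except ValueError:
        return None
    return expires - now


if __name__ == "__main__":
    sample = "        Valid: from 2025-01-01T11:00:00 to 2025-01-01T12:30:00"
    remaining = parse_validity(sample, now=datetime(2025, 1, 1, 12, 0, 0))
    if remaining is not None:
        print(f"{int(remaining.total_seconds() // 60)}m remaining")
```

Passing `now` explicitly avoids a hidden dependency on the wall clock and keeps the expired and not-expired branches testable.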
@@ -553,3 +655,119 @@ def catalog_show(
        if b.target in cat.targets:
            t = cat.targets[b.target]
            typer.echo(f"Target: {t.description or t.id} ({t.kind})")


_CONVENTIONS_TEXT = """\
Actor Naming Conventions (from AccessManagementDirective.md §2)

Every actor declared under `actors:` in ~/.config/bridge/tunnels.yaml must have
a `class` field, and the actor name must start with the class-specific prefix:

  class  prefix  purpose
  -----  ------  ------------------------------------------------------------
  adm    adm-    Human operator (interactive shell when needed)
  agt    agt-    LLM-powered autonomous agent (Claude Code, etc.)
  atm    atm-    Deterministic script / cron job / pipeline

Legacy class aliases (deprecated, still accepted with a warning):
  human -> adm
  automation -> atm

Examples:
  adm-bernd: { class: adm, description: Bernd Worsch }
  agt-claude-coulombcore: { class: agt, description: Claude Code on CoulombCore }
  atm-backup-daily: { class: atm, description: Nightly DB backup }

Full specification:
  <ops-bridge repo>/wiki/AccessManagementDirective.md
"""


@maintenance_app.command("cleanup")
def maintenance_cleanup(
    tunnel: Optional[str] = typer.Argument(
        None,
        help="Tunnel name (omit for all reverse tunnels)",
    ),
    restart: bool = typer.Option(
        False,
        "--restart",
        help="Restart tunnels after clearing stale remote bindings",
    ),
    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
):
    """Clear stale SSH remote port forwards that block tunnel reconnects."""
    cfg = _load_or_exit()
    sd = _state_dir()
    state_mgr = StateManager(state_dir=sd)

    try:
        report = cleanup_all_tunnels(
            cfg,
            state_mgr,
            restart=restart,
            tunnel_name=tunnel,
        )
    except KeyError:
        typer.echo(f"Error: tunnel '{tunnel}' not found in config", err=True)
        raise typer.Exit(1)

    if as_json:
        payload = {
            "cleaned_count": report.cleaned_count,
            "actions": [
                {"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
                for a in report.actions
            ],
        }
        typer.echo(json.dumps(payload, indent=2))
        return

    if not report.actions:
        typer.echo("No reverse tunnels configured.")
        return

    for action in report.actions:
        typer.echo(f"{action.tunnel}: {action.action} — {action.detail}")
    typer.echo(f"done ({report.cleaned_count} cleaned)")


@maintenance_app.command("install-cron")
def maintenance_install_cron():
    """Install a 03:00 daily cron job for `bridge maintenance cleanup --restart`."""
    installed, message = install_cleanup_cron()
    if installed:
        typer.echo("Installed nightly cleanup cron:")
        typer.echo(f" {message}")
    else:
        typer.echo(message)
        raise typer.Exit(2)


@maintenance_app.command("uninstall-cron")
def maintenance_uninstall_cron():
    """Remove the nightly cleanup cron job."""
    removed, message = uninstall_cleanup_cron()
    if removed:
        typer.echo(message)
    else:
        typer.echo(message)
        raise typer.Exit(2)


@maintenance_app.command("show-cron")
def maintenance_show_cron():
|
||||
"""Show the configured nightly cleanup cron line."""
|
||||
existing = read_installed_cron()
|
||||
if existing:
|
||||
typer.echo(existing)
|
||||
else:
|
||||
typer.echo("Nightly cleanup cron is not installed.")
|
||||
typer.echo("Would install:")
|
||||
typer.echo(f" {build_cron_line()}")
|
||||
|
||||
|
||||
@app.command()
|
||||
def conventions():
|
||||
"""Show the actor naming conventions enforced by tunnels.yaml."""
|
||||
typer.echo(_CONVENTIONS_TEXT)
|
||||
|
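Review note: the prefix rule printed by `bridge conventions` can be sketched as a standalone check. This is a minimal illustration under stated assumptions, not the code in `bridge.config`: `check_actor_name`, `PREFIXES`, and `LEGACY` are hypothetical names, and the real parser raises `ConfigError` and emits a `DeprecationWarning` instead of returning a message.

```python
from typing import Optional

# Class -> required name prefix, as listed in the conventions text.
PREFIXES = {"adm": "adm-", "agt": "agt-", "atm": "atm-"}
# Deprecated class aliases that are still accepted.
LEGACY = {"human": "adm", "automation": "atm"}


def check_actor_name(name: str, raw_class: str) -> Optional[str]:
    """Return None if the actor is valid, else a short error message.

    Hypothetical helper for illustration only.
    """
    actor_class = LEGACY.get(raw_class, raw_class)
    if actor_class not in PREFIXES:
        return f"unknown class {raw_class!r}"
    prefix = PREFIXES[actor_class]
    if not name.startswith(prefix):
        return f"{name!r} must start with {prefix!r}"
    return None


if __name__ == "__main__":
    samples = [("adm-bernd", "adm"), ("operator.bernd", "human"), ("agt-x", "bot")]
    for name, cls in samples:
        print(name, cls, check_actor_name(name, cls))
```

Under this rule a legacy class paired with an old-style name such as `operator.bernd` still fails the prefix check, which is consistent with the test fixtures in this comparison being renamed to `adm-bernd`.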
||||
@@ -2,13 +2,14 @@
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import warnings
|
||||
from dataclasses import dataclass
|
||||
from pathlib import Path
|
||||
from typing import Dict, Optional
|
||||
|
||||
import yaml
|
||||
|
||||
from bridge.models import ActorInfo, HealthCheckConfig, ReconnectPolicy, TunnelConfig
|
||||
from bridge.models import ActorInfo, ActorType, HealthCheckConfig, ReconnectPolicy, TunnelConfig
|
||||
|
||||
|
||||
class ConfigError(Exception):
|
||||
@@ -91,6 +92,10 @@ def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
|
||||
if direction not in ("reverse", "local"):
|
||||
raise ConfigError(f"Tunnel '{name}' direction must be 'reverse' or 'local', got: {direction!r}")
|
||||
|
||||
cert_command = data.get("cert_command") or None
|
||||
if cert_command is not None:
|
||||
cert_command = str(cert_command)
|
||||
|
||||
return TunnelConfig(
|
||||
name=name,
|
||||
host=str(data["host"]),
|
||||
@@ -102,9 +107,42 @@ def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
|
||||
reconnect=reconnect,
|
||||
health_check=health_check,
|
||||
direction=direction,
|
||||
remote_host=str(data.get("remote_host", "127.0.0.1")),
|
||||
cert_command=cert_command,
|
||||
)
|
||||
|
||||
|
||||
_LEGACY_CLASS_MAP = {
|
||||
"human": ActorType.ADM,
|
||||
"automation": ActorType.ATM,
|
||||
}
|
||||
|
||||
_ACTOR_TYPE_PREFIXES = {
|
||||
ActorType.ADM: "adm-",
|
||||
ActorType.AGT: "agt-",
|
||||
ActorType.ATM: "atm-",
|
||||
}
|
||||
|
||||
|
||||
def _parse_actor_type(name: str, raw_class: str) -> ActorType:
|
||||
if raw_class in _LEGACY_CLASS_MAP:
|
||||
warnings.warn(
|
||||
f"Actor '{name}': class '{raw_class}' is deprecated; "
|
||||
f"use '{_LEGACY_CLASS_MAP[raw_class].value}' instead.",
|
||||
DeprecationWarning,
|
||||
stacklevel=4,
|
||||
)
|
||||
return _LEGACY_CLASS_MAP[raw_class]
|
||||
try:
|
||||
return ActorType(raw_class)
|
||||
except ValueError:
|
||||
raise ConfigError(
|
||||
f"Actor '{name}' has unknown class '{raw_class}'; "
|
||||
f"must be one of: adm, agt, atm (or legacy: human, automation). "
|
||||
f"Run `bridge conventions` for the full naming rules."
|
||||
)
|
||||
|
||||
|
||||
def _parse_actors(raw: dict) -> Dict[str, ActorInfo]:
|
||||
actors = {}
|
||||
for name, data in raw.items():
|
||||
@@ -112,9 +150,17 @@ def _parse_actors(raw: dict) -> Dict[str, ActorInfo]:
|
||||
raise ConfigError(f"Actor '{name}' must be a mapping")
|
||||
if "class" not in data:
|
||||
raise ConfigError(f"Actor '{name}' missing required field: class")
|
||||
actor_type = _parse_actor_type(name, str(data["class"]))
|
||||
required_prefix = _ACTOR_TYPE_PREFIXES[actor_type]
|
||||
if not name.startswith(required_prefix):
|
||||
raise ConfigError(
|
||||
f"Actor '{name}' has type '{actor_type.value}' but name must start "
|
||||
f"with '{required_prefix}' (got '{name}'). "
|
||||
f"Run `bridge conventions` for the full naming rules."
|
||||
)
|
||||
actors[name] = ActorInfo(
|
||||
name=name,
|
||||
actor_class=str(data["class"]),
|
||||
actor_type=actor_type,
|
||||
description=str(data.get("description", "")),
|
||||
)
|
||||
return actors
|
||||
|
||||
@@ -1,6 +1,7 @@
|
||||
"""End-to-end tunnel diagnostics for OpsBridge."""
|
||||
from __future__ import annotations
|
||||
|
||||
import socket
|
||||
import subprocess
|
||||
import time
|
||||
from dataclasses import dataclass
|
||||
@@ -13,6 +14,38 @@ from bridge.models import BridgeState, TunnelConfig
|
||||
from bridge.state import StateManager, _pid_alive
|
||||
|
||||
|
||||
def _remote_port_probe_command(remote_port: int) -> str:
|
||||
"""Build a portable remote shell probe for a listening TCP port."""
|
||||
return (
|
||||
f"port={remote_port}; "
|
||||
"if command -v ss >/dev/null 2>&1; then "
|
||||
"ss -tnlp 2>/dev/null | grep -q \":$port \" && echo ok || echo closed; "
|
||||
"elif command -v netstat >/dev/null 2>&1; then "
|
||||
"netstat -tnlp 2>/dev/null | "
|
||||
"grep -q \"[.:]$port[[:space:]]\" && echo ok || echo closed; "
|
||||
"else "
|
||||
"hex=$(printf '%04X' \"$port\"); "
|
||||
"awk -v p=\":$hex\" "
|
||||
"'NR > 1 && $4 == \"0A\" && index($2, p) { found = 1 } "
|
||||
"END { print found ? \"ok\" : \"closed\" }' "
|
||||
"/proc/net/tcp /proc/net/tcp6 2>/dev/null; "
|
||||
"fi"
|
||||
)
|
||||
|
||||
|
||||
def _probe_local_port(local_port: int) -> str:
|
||||
"""Check whether the local side of an SSH -L tunnel is accepting TCP."""
|
||||
try:
|
||||
with socket.create_connection(("127.0.0.1", local_port), timeout=5):
|
||||
return "listening"
|
||||
except ConnectionRefusedError:
|
||||
return "closed"
|
||||
except socket.timeout:
|
||||
return "error:timeout"
|
||||
except OSError as e:
|
||||
return f"error:{e}"
|
||||
|
||||
|
||||
@dataclass
|
||||
class TunnelCheckResult:
|
||||
tunnel: str
|
||||
@@ -52,35 +85,38 @@ def check_tunnel(cfg: TunnelConfig, state_mgr: StateManager) -> TunnelCheckResul
|
||||
and ssh_process != "ok"
|
||||
)
|
||||
|
||||
# 3. SSH probe for remote port
|
||||
key_path = str(Path(cfg.ssh_key).expanduser())
|
||||
cmd = [
|
||||
"ssh",
|
||||
"-i", key_path,
|
||||
"-o", "BatchMode=yes",
|
||||
"-o", "ConnectTimeout=5",
|
||||
"-o", "StrictHostKeyChecking=accept-new",
|
||||
f"{cfg.ssh_user}@{cfg.host}",
|
||||
f"ss -tnlp 2>/dev/null | grep -q ':{cfg.remote_port} ' && echo ok || echo closed",
|
||||
]
|
||||
try:
|
||||
proc = subprocess.run(
|
||||
cmd,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=10,
|
||||
)
|
||||
output = proc.stdout.strip()
|
||||
if output == "ok":
|
||||
remote_port = "listening"
|
||||
elif output == "closed":
|
||||
remote_port = "closed"
|
||||
else:
|
||||
remote_port = f"error:{proc.stderr.strip() or 'unknown'}"
|
||||
except subprocess.TimeoutExpired:
|
||||
remote_port = "error:timeout"
|
||||
except Exception as e:
|
||||
remote_port = f"error:{e}"
|
||||
# 3. Port probe: reverse tunnels listen remotely; local tunnels listen here.
|
||||
if cfg.direction == "local":
|
||||
remote_port = _probe_local_port(cfg.local_port)
|
||||
else:
|
||||
key_path = str(Path(cfg.ssh_key).expanduser())
|
||||
cmd = [
|
||||
"ssh",
|
||||
"-i", key_path,
|
||||
"-o", "BatchMode=yes",
|
||||
"-o", "ConnectTimeout=5",
|
||||
"-o", "StrictHostKeyChecking=accept-new",
|
||||
f"{cfg.ssh_user}@{cfg.host}",
|
||||
_remote_port_probe_command(cfg.remote_port),
|
||||
]
|
||||
try:
|
||||
proc = subprocess.run(
|
||||
cmd,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
timeout=10,
|
||||
)
|
||||
output = proc.stdout.strip()
|
||||
if output == "ok":
|
||||
remote_port = "listening"
|
||||
elif output == "closed":
|
||||
remote_port = "closed"
|
||||
else:
|
||||
remote_port = f"error:{proc.stderr.strip() or 'unknown'}"
|
||||
except subprocess.TimeoutExpired:
|
||||
remote_port = "error:timeout"
|
||||
except Exception as e:
|
||||
remote_port = f"error:{e}"
|
||||
|
||||
# 4. Local API health check (optional)
|
||||
local_api: Optional[str] = None
|
||||
|
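Review note: the last fallback in `_remote_port_probe_command` reads `/proc/net/tcp` and looks for a socket in state `0A` (LISTEN) whose local address ends in the port written as four upper-case hex digits. The sketch below restates that matching logic in Python so it can be checked against sample text. `port_is_listening` and `SAMPLE` are hypothetical, and the sample rows are made up for illustration, not captured from a real host.

```python
def port_is_listening(proc_net_tcp: str, port: int) -> bool:
    """Return True if a LISTEN socket in /proc/net/tcp-style text uses `port`.

    Hypothetical helper mirroring the awk fallback; illustration only.
    """
    suffix = f":{port:04X}"  # same encoding as printf '%04X' in the shell probe
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row (NR > 1)
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]  # awk's $2 and $4
        if state == "0A" and local_address.endswith(suffix):
            return True
    return False


# Made-up sample: port 18000 (0x4650) listening, port 8000 (0x1F40) established.
SAMPLE = (
    "  sl  local_address rem_address   st tx_queue rx_queue\n"
    "   0: 0100007F:4650 00000000:0000 0A 00000000:00000000\n"
    "   1: 0100007F:1F40 0100007F:C350 01 00000000:00000000\n"
)

if __name__ == "__main__":
    print(port_is_listening(SAMPLE, 18000))
    print(port_is_listening(SAMPLE, 8000))
```

The shell version uses a substring match (`index($2, p)`) where this sketch uses a suffix match; for well-formed rows the two agree, because the only colon in the local address field is the one before the port.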
||||
@@ -6,35 +6,102 @@ import os
|
||||
import signal
|
||||
import subprocess
|
||||
import time
|
||||
from datetime import datetime, timedelta
|
||||
from pathlib import Path
|
||||
from typing import List, Optional
|
||||
|
||||
from bridge.audit import AuditEvent, AuditLogger
|
||||
from bridge.health import HealthChecker
|
||||
from bridge.models import BridgeState, TunnelConfig
|
||||
from bridge.models import BridgeState, CertAcquisitionError, TunnelConfig
|
||||
from bridge.state import StateManager
|
||||
|
||||
log = logging.getLogger(__name__)
|
||||
|
||||
|
||||
def build_ssh_command(cfg: TunnelConfig) -> List[str]:
|
||||
def _actor_type_from_name(name: str) -> str:
|
||||
for prefix in ("adm", "agt", "atm"):
|
||||
if name.startswith(f"{prefix}-"):
|
||||
return prefix
|
||||
return "unknown"
|
||||
|
||||
|
||||
def build_ssh_command(cfg: TunnelConfig, cert_path: Optional[Path] = None) -> List[str]:
|
||||
"""Build the SSH tunnel command (reverse -R or local -L)."""
|
||||
key = os.path.expanduser(cfg.ssh_key)
|
||||
if cfg.direction == "local":
|
||||
forward_flag = ["-L", f"{cfg.local_port}:127.0.0.1:{cfg.remote_port}"]
|
||||
forward_flag = ["-L", f"{cfg.local_port}:{cfg.remote_host}:{cfg.remote_port}"]
|
||||
else:
|
||||
forward_flag = ["-R", f"{cfg.remote_port}:127.0.0.1:{cfg.local_port}"]
|
||||
return [
|
||||
forward_flag = ["-R", f"{cfg.remote_port}:{cfg.remote_host}:{cfg.local_port}"]
|
||||
cmd = [
|
||||
"ssh",
|
||||
"-N",
|
||||
*forward_flag,
|
||||
"-i", key,
|
||||
]
|
||||
if cert_path is not None:
|
||||
cmd += ["-i", str(cert_path)]
|
||||
cmd += [
|
||||
"-o", "ServerAliveInterval=10",
|
||||
"-o", "ServerAliveCountMax=3",
|
||||
"-o", "ExitOnForwardFailure=yes",
|
||||
"-o", "StrictHostKeyChecking=accept-new",
|
||||
f"{cfg.ssh_user}@{cfg.host}",
|
||||
]
|
||||
return cmd
|
||||
|
||||
|
||||
def _run_cert_command(cfg: TunnelConfig, state_dir: Path) -> Optional[Path]:
|
||||
"""Run cert_command and write cert to state dir. Returns cert path or None."""
|
||||
if cfg.cert_command is None:
|
||||
return None
|
||||
result = subprocess.run(
|
||||
cfg.cert_command,
|
||||
shell=True,
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
if result.returncode != 0:
|
||||
raise CertAcquisitionError(result.stderr.strip())
|
||||
cert_path = state_dir / f"{cfg.name}-cert.pub"
|
||||
cert_path.write_text(result.stdout)
|
||||
return cert_path
|
||||
|
||||
|
||||
def _parse_cert_identity(cert_path: Path) -> Optional[str]:
|
||||
"""Parse Key ID from ssh-keygen -L output."""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["ssh-keygen", "-L", "-f", str(cert_path)],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
for line in result.stdout.splitlines():
|
||||
line = line.strip()
|
||||
if line.startswith("Key ID:"):
|
||||
return line.split(":", 1)[1].strip().strip('"')
|
||||
except Exception:
|
||||
pass
|
||||
return None
|
||||
|
||||
|
||||
def _parse_cert_expiry(cert_path: Path) -> Optional[datetime]:
|
||||
"""Parse Valid-before datetime from ssh-keygen -L output."""
|
||||
try:
|
||||
result = subprocess.run(
|
||||
["ssh-keygen", "-L", "-f", str(cert_path)],
|
||||
capture_output=True,
|
||||
text=True,
|
||||
)
|
||||
for line in result.stdout.splitlines():
|
||||
line = line.strip()
|
||||
if line.startswith("Valid:"):
|
||||
# "Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00"
|
||||
parts = line.split()
|
||||
if len(parts) >= 5 and parts[3] == "to":
|
||||
return datetime.fromisoformat(parts[4])
|
||||
except Exception:
|
||||
pass
|
||||
return None
|
||||
|
||||
|
||||
class TunnelManager:
|
||||
@@ -56,7 +123,8 @@ class TunnelManager:
|
||||
return self._state.is_running(self._cfg.name)
|
||||
|
||||
def _actor_info(self):
|
||||
return self._cfg.actor, "unknown"
|
||||
actor = self._cfg.actor
|
||||
return actor, _actor_type_from_name(actor)
|
||||
|
||||
def _next_backoff(self, attempt: int) -> int:
|
||||
initial = self._cfg.reconnect.backoff_initial
|
||||
@@ -71,12 +139,12 @@ class TunnelManager:
|
||||
return
|
||||
|
||||
self._state.write_state(self._cfg.name, BridgeState.STARTING)
|
||||
actor, actor_class = self._actor_info()
|
||||
actor, actor_type = self._actor_info()
|
||||
self._audit.log(
|
||||
tunnel=self._cfg.name,
|
||||
event=AuditEvent.BRIDGE_STARTED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
|
||||
pid = os.fork()
|
||||
@@ -99,7 +167,7 @@ class TunnelManager:
|
||||
tunnel=self._cfg.name,
|
||||
event=AuditEvent.BRIDGE_STOPPED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
|
||||
os._exit(0)
|
||||
@@ -131,12 +199,12 @@ class TunnelManager:
|
||||
|
||||
self._state.clear_pid(self._cfg.name)
|
||||
self._state.write_state(self._cfg.name, BridgeState.STOPPED)
|
||||
actor, actor_class = self._actor_info()
|
||||
actor, actor_type = self._actor_info()
|
||||
self._audit.log(
|
||||
tunnel=self._cfg.name,
|
||||
event=AuditEvent.BRIDGE_STOPPED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
|
||||
def _run_loop(self) -> None:
|
||||
@@ -144,11 +212,11 @@ class TunnelManager:
|
||||
import asyncio
|
||||
|
||||
cfg = self._cfg
|
||||
actor, actor_class = self._actor_info()
|
||||
actor, actor_type = self._actor_info()
|
||||
attempt = 0
|
||||
max_attempts = cfg.reconnect.max_attempts # 0 = infinite
|
||||
state_dir = self._state._dir
|
||||
|
||||
# Setup signal handler for graceful shutdown
|
||||
_stop = [False]
|
||||
|
||||
def _on_term(signum, frame):
|
||||
@@ -162,7 +230,31 @@ class TunnelManager:
|
||||
self._state.write_state(cfg.name, BridgeState.FAILED)
|
||||
break
|
||||
|
||||
cmd = build_ssh_command(cfg)
|
||||
# Acquire cert before each SSH launch (T3, T7)
|
||||
try:
|
||||
cert_path = _run_cert_command(cfg, state_dir)
|
||||
except CertAcquisitionError as e:
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_DISCONNECTED,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
detail=f"cert acquisition failed: {e}",
|
||||
)
|
||||
attempt += 1
|
||||
if max_attempts > 0 and attempt >= max_attempts:
|
||||
self._state.write_state(cfg.name, BridgeState.FAILED)
|
||||
break
|
||||
backoff = self._next_backoff(attempt - 1)
|
||||
self._state.write_state(cfg.name, BridgeState.RECONNECTING)
|
||||
log.info("Cert acquisition failed, retrying in %ds", backoff)
|
||||
time.sleep(backoff)
|
||||
continue
|
||||
|
||||
cert_identity = _parse_cert_identity(cert_path) if cert_path else None
|
||||
cert_expires_at = _parse_cert_expiry(cert_path) if cert_path else None
|
||||
|
||||
cmd = build_ssh_command(cfg, cert_path=cert_path)
|
||||
log.info("Starting SSH: %s", " ".join(cmd))
|
||||
self._state.write_state(cfg.name, BridgeState.STARTING)
|
||||
|
||||
@@ -174,24 +266,30 @@ class TunnelManager:
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_DISCONNECTED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
detail="ssh binary not found",
|
||||
)
|
||||
break
|
||||
|
||||
# Wait briefly then assume connected if still running
|
||||
time.sleep(2)
|
||||
_ttl_refresh = False
|
||||
if proc.poll() is None:
|
||||
self._state.write_state(cfg.name, BridgeState.CONNECTED)
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_CONNECTED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
cert_identity=cert_identity,
|
||||
)
|
||||
attempt = 0
|
||||
|
||||
# Health check loop
|
||||
def _check_ttl() -> bool:
|
||||
"""Return True if cert is within 5 min of expiry and SSH should restart."""
|
||||
if cert_expires_at is None:
|
||||
return False
|
||||
return datetime.now() >= cert_expires_at - timedelta(minutes=5)
|
||||
|
||||
if cfg.health_check:
|
||||
checker = HealthChecker(
|
||||
url=cfg.health_check.url,
|
||||
@@ -199,6 +297,18 @@ class TunnelManager:
|
||||
)
|
||||
health_failing = False
|
||||
while not _stop[0] and proc.poll() is None:
|
||||
if _check_ttl():
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.CERT_EXPIRING,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
cert_identity=cert_identity,
|
||||
detail=str(cert_expires_at),
|
||||
)
|
||||
proc.terminate()
|
||||
_ttl_refresh = True
|
||||
break
|
||||
result = asyncio.run(checker.check())
|
||||
if result.ok:
|
||||
if health_failing:
|
||||
@@ -208,7 +318,7 @@ class TunnelManager:
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.HEALTH_CHECK_RECOVERED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
)
|
||||
else:
|
||||
if not health_failing:
|
||||
@@ -218,21 +328,36 @@ class TunnelManager:
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.HEALTH_CHECK_FAILED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
detail=result.error or f"HTTP {result.status_code}",
|
||||
)
|
||||
time.sleep(cfg.health_check.interval_seconds)
|
||||
else:
|
||||
while not _stop[0] and proc.poll() is None:
|
||||
if _check_ttl():
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.CERT_EXPIRING,
|
||||
actor=actor,
|
||||
actor_type=actor_type,
|
||||
cert_identity=cert_identity,
|
||||
detail=str(cert_expires_at),
|
||||
)
|
||||
proc.terminate()
|
||||
_ttl_refresh = True
|
||||
break
|
||||
time.sleep(1)
|
||||
|
||||
# SSH exited
|
||||
if _ttl_refresh:
|
||||
# Planned cert refresh — don't count as failure, no backoff
|
||||
continue
|
||||
|
||||
if proc.poll() is not None:
|
||||
self._audit.log(
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_DISCONNECTED,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
detail=f"exit code {proc.returncode}",
|
||||
)
|
||||
|
||||
@@ -248,7 +373,7 @@ class TunnelManager:
|
||||
tunnel=cfg.name,
|
||||
event=AuditEvent.BRIDGE_RECONNECTING,
|
||||
actor=actor,
|
||||
actor_class=actor_class,
|
||||
actor_type=actor_type,
|
||||
detail=f"retry {attempt}, backoff {backoff}s",
|
||||
)
|
||||
log.info("Reconnecting in %ds (attempt %d)", backoff, attempt)
|
||||
|
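Review note: the certificate handling in the manager changes reduces to two pure steps: read the end of the validity window from a `Valid: from X to Y` line, then decide whether the current time is within five minutes of it. The sketch below isolates those steps so they can be tested without running `ssh-keygen`. `parse_valid_until`, `needs_refresh`, and `REFRESH_MARGIN` are hypothetical names, and the sample line format is taken from the comment in `_parse_cert_expiry`.

```python
from datetime import datetime, timedelta
from typing import Optional

# Restart SSH this long before the certificate expires (5 minutes in the diff).
REFRESH_MARGIN = timedelta(minutes=5)


def parse_valid_until(keygen_output: str) -> Optional[datetime]:
    """Return the 'to' timestamp of a 'Valid: from X to Y' line, or None."""
    for line in keygen_output.splitlines():
        parts = line.strip().split()
        if len(parts) >= 5 and parts[0] == "Valid:" and parts[3] == "to":
            try:
                return datetime.fromisoformat(parts[4])
            except ValueError:
                return None
    return None


def needs_refresh(expires_at: Optional[datetime], now: datetime) -> bool:
    """Return True once `now` is inside the refresh margin before expiry."""
    if expires_at is None:
        return False
    return now >= expires_at - REFRESH_MARGIN


if __name__ == "__main__":
    text = "        Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00\n"
    expiry = parse_valid_until(text)
    print(expiry, needs_refresh(expiry, datetime(2026, 5, 15, 21, 56)))
```

Both timestamps are naive here, as in the diff, so the comparison is only meaningful if the certificate times and the clock reading share a time zone. A line without a `to` timestamp yields `None`, which disables the refresh check.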
||||
@@ -169,19 +169,22 @@ def bridge_down(tunnel: Optional[str] = None) -> dict:
|
||||
def bridge_restart(tunnel: Optional[str] = None) -> dict:
|
||||
"""Restart one or all configured tunnels.
|
||||
|
||||
Reverse tunnels run conditional remote stale-forward cleanup before
|
||||
reconnecting; healthy forwards are left running.
|
||||
|
||||
Args:
|
||||
tunnel: Tunnel name to restart. If omitted, restarts all inline tunnels.
|
||||
|
||||
Returns:
|
||||
{"restarted": [...]} or {"error": "..."}
|
||||
{"actions": [{"tunnel", "action", "detail"}, ...]} or {"error": "..."}
|
||||
"""
|
||||
cfg, err = _load_cfg_or_error()
|
||||
if err:
|
||||
return err
|
||||
|
||||
from bridge.manager import TunnelManager
|
||||
from bridge.cleanup import restart_all_tunnels, restart_tunnel
|
||||
sd = _state_dir()
|
||||
restarted = []
|
||||
state_mgr = StateManager(state_dir=sd)
|
||||
|
||||
if tunnel:
|
||||
from bridge.catalog.loader import load_catalog
|
||||
@@ -196,18 +199,19 @@ def bridge_restart(tunnel: Optional[str] = None) -> dict:
|
||||
tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
|
||||
except BridgeNotFound:
|
||||
return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
|
||||
mgr = TunnelManager(tcfg, state_dir=sd)
|
||||
mgr.stop()
|
||||
mgr.start()
|
||||
restarted.append(tunnel)
|
||||
actions = [restart_tunnel(tcfg, state_mgr)]
|
||||
else:
|
||||
for name, tcfg in cfg.tunnels.items():
|
||||
mgr = TunnelManager(tcfg, state_dir=sd)
|
||||
mgr.stop()
|
||||
mgr.start()
|
||||
restarted.append(name)
|
||||
actions = restart_all_tunnels(cfg, state_mgr)
|
||||
|
||||
return {"restarted": restarted}
|
||||
payload = {
|
||||
"actions": [
|
||||
{"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
|
||||
for a in actions
|
||||
],
|
||||
}
|
||||
if any(a.action == "error" for a in actions):
|
||||
payload["error"] = "one or more tunnels failed to restart"
|
||||
return payload
|
||||
|
||||
|
||||
@mcp.tool()
|
||||
@@ -513,4 +517,13 @@ def resource_catalog_targets() -> str:
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
if __name__ == "__main__":
|
||||
mcp.run(transport="stdio")
|
||||
import argparse
|
||||
parser = argparse.ArgumentParser(description="OpsBridge MCP server")
|
||||
parser.add_argument("--http", action="store_true", help="Run in SSE/HTTP mode instead of stdio")
|
||||
args = parser.parse_args()
|
||||
|
||||
if args.http:
|
||||
port = int(os.environ.get("BRIDGE_MCP_PORT", "8002"))
|
||||
mcp.run(transport="sse", host="127.0.0.1", port=port)
|
||||
else:
|
||||
mcp.run(transport="stdio")
|
||||
|
||||
@@ -15,6 +15,16 @@ class BridgeState(str, Enum):
|
||||
FAILED = "failed"
|
||||
|
||||
|
||||
class ActorType(str, Enum):
|
||||
ADM = "adm" # human operator
|
||||
AGT = "agt" # LLM-powered autonomous agent
|
||||
ATM = "atm" # deterministic script / pipeline
|
||||
|
||||
|
||||
class CertAcquisitionError(Exception):
|
||||
"""Raised when cert_command fails to produce a certificate."""
|
||||
|
||||
|
||||
@dataclass
|
||||
class ReconnectPolicy:
|
||||
max_attempts: int = 0 # 0 = infinite
|
||||
@@ -41,10 +51,15 @@ class TunnelConfig:
|
||||
reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
|
||||
health_check: Optional[HealthCheckConfig] = None
|
||||
direction: str = "reverse" # "reverse" (-R) or "local" (-L)
|
||||
# Forward-destination host as seen from the remote end (direction "local")
|
||||
# or from this workstation (direction "reverse"). Defaults to loopback;
|
||||
# set e.g. a k3s ClusterIP to tunnel to an in-cluster Service.
|
||||
remote_host: str = "127.0.0.1"
|
||||
cert_command: Optional[str] = None
|
||||
|
||||
|
||||
@dataclass
|
||||
class ActorInfo:
|
||||
name: str
|
||||
actor_class: str # "human" or "automation"
|
||||
actor_type: ActorType
|
||||
description: str = ""
|
||||
|
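Review note: the new `remote_host` field changes the forward specification passed to `ssh`. The sketch below shows how the field would appear in the `-L` and `-R` arguments, following the format strings in `build_ssh_command`. `forward_args` is a hypothetical helper, and the address `10.43.0.1` is only an illustrative ClusterIP.

```python
from typing import List


def forward_args(
    direction: str,
    local_port: int,
    remote_port: int,
    remote_host: str = "127.0.0.1",
) -> List[str]:
    """Build the ssh forwarding flag and spec for one tunnel.

    Hypothetical helper for illustration; it follows the format strings
    used in build_ssh_command.
    """
    if direction == "local":
        # Listen on this workstation, forward to remote_host as seen from the SSH server.
        return ["-L", f"{local_port}:{remote_host}:{remote_port}"]
    if direction == "reverse":
        # Listen on the SSH server, forward to remote_host as seen from this workstation.
        return ["-R", f"{remote_port}:{remote_host}:{local_port}"]
    raise ValueError(f"direction must be 'reverse' or 'local', got: {direction!r}")


if __name__ == "__main__":
    print(forward_args("local", 6443, 6443, "10.43.0.1"))
    print(forward_args("reverse", 8000, 18000))
```

With the default `remote_host` the reverse case produces the same spec as before the change, so existing configurations should be unaffected.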
||||
@@ -23,10 +23,10 @@ VALID_CONFIG = textwrap.dedent("""\
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
""")
|
||||
|
||||
@@ -38,10 +38,10 @@ VALID_CONFIG_WITH_CATALOG = textwrap.dedent("""\
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
catalog_path: {catalog_path}
|
||||
""")
|
||||
|
||||
@@ -22,7 +22,7 @@ class TestAuditLogger:
|
||||
tunnel="my-tunnel",
|
||||
event=AuditEvent.BRIDGE_STARTED,
|
||||
actor="operator.bernd",
|
||||
actor_class="human",
|
||||
actor_type="adm",
|
||||
)
|
||||
log_file = log_dir / "my-tunnel.log"
|
||||
assert log_file.exists()
|
||||
@@ -32,7 +32,7 @@ class TestAuditLogger:
|
||||
tunnel="my-tunnel",
|
||||
event=AuditEvent.BRIDGE_STARTED,
|
||||
actor="operator.bernd",
|
||||
actor_class="human",
|
||||
actor_type="adm",
|
||||
)
|
||||
lines = (log_dir / "my-tunnel.log").read_text().strip().splitlines()
|
||||
assert len(lines) == 1
|
||||
@@ -40,12 +40,12 @@ class TestAuditLogger:
|
||||
assert entry["tunnel"] == "my-tunnel"
|
||||
assert entry["event"] == "bridge_started"
|
||||
assert entry["actor"] == "operator.bernd"
|
||||
assert entry["actor_class"] == "human"
|
||||
assert entry["actor_type"] == "adm"
|
||||
assert "timestamp" in entry
|
||||
|
||||
def test_multiple_events_append(self, logger, log_dir):
|
||||
for event in [AuditEvent.BRIDGE_STARTED, AuditEvent.BRIDGE_CONNECTED, AuditEvent.BRIDGE_STOPPED]:
|
||||
logger.log(tunnel="t", event=event, actor="a", actor_class="human")
|
||||
logger.log(tunnel="t", event=event, actor="a", actor_type="adm")
|
||||
lines = (log_dir / "t.log").read_text().strip().splitlines()
|
||||
assert len(lines) == 3
|
||||
|
||||
@@ -54,7 +54,7 @@ class TestAuditLogger:
|
||||
tunnel="t",
|
||||
event=AuditEvent.HEALTH_CHECK_FAILED,
|
||||
actor="a",
|
||||
actor_class="automation",
|
||||
actor_type="atm",
|
||||
detail="connection refused",
|
||||
)
|
||||
entry = json.loads((log_dir / "t.log").read_text().strip())
|
||||
@@ -72,15 +72,15 @@ class TestAuditLogger:
|
||||
|
||||
def test_timestamp_is_iso8601(self, logger, log_dir):
|
||||
from datetime import datetime
|
||||
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_class="human")
|
||||
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
|
||||
entry = json.loads((log_dir / "t.log").read_text().strip())
|
||||
# Should parse without error
|
||||
dt = datetime.fromisoformat(entry["timestamp"])
|
||||
assert dt.tzinfo is not None or True # UTC or naive both acceptable
|
||||
|
||||
def test_read_events(self, logger, log_dir):
|
||||
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_class="human")
|
||||
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_class="human")
|
||||
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_type="adm")
|
||||
logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
|
||||
events = logger.read_events("t")
|
||||
assert len(events) == 2
|
||||
assert events[0]["event"] == "bridge_started"
|
||||
|
||||
130
tests/test_cleanup.py
Normal file
@@ -0,0 +1,130 @@
|
||||
"""Tests for stale SSH forward cleanup."""
|
||||
from __future__ import annotations
|
||||
|
||||
import textwrap
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
from typer.testing import CliRunner
|
||||
|
||||
from bridge.cleanup import (
|
||||
CleanupAction,
|
||||
build_cron_line,
|
||||
cleanup_all_tunnels,
|
||||
remote_forward_health_url,
|
||||
should_cleanup_tunnel,
|
||||
)
|
||||
from bridge.cli import app
|
||||
from bridge.config import load_config
|
||||
from bridge.models import HealthCheckConfig, TunnelConfig
|
||||
from bridge.state import StateManager
|
||||
|
||||
|
||||
def _tunnel(**overrides) -> TunnelConfig:
|
||||
base = dict(
|
||||
name="state-hub-railiance01",
|
||||
host="92.205.62.239",
|
||||
remote_port=18000,
|
||||
local_port=8000,
|
||||
ssh_user="tegwick",
|
||||
ssh_key="~/.ssh/id_ops",
|
||||
actor="agt-claude-railiance01",
|
||||
health_check=HealthCheckConfig(
|
||||
url="http://127.0.0.1:8000/state/health",
|
||||
timeout_seconds=5,
|
||||
),
|
||||
)
|
||||
base.update(overrides)
|
||||
return TunnelConfig(**base)
|
||||
|
||||
|
||||
class TestRemoteForwardHealthUrl:
|
||||
def test_maps_local_port_to_remote(self):
|
||||
cfg = _tunnel()
|
||||
assert remote_forward_health_url(cfg) == "http://127.0.0.1:18000/state/health"
|
||||
|
||||
def test_returns_none_for_local_tunnel(self):
|
||||
cfg = _tunnel(direction="local")
|
||||
assert remote_forward_health_url(cfg) is None
|
||||
|
||||
|
||||
class TestShouldCleanupTunnel:
|
||||
def test_skips_healthy_remote_forward(self, tmp_path):
|
||||
cfg = _tunnel()
|
||||
state_mgr = StateManager(state_dir=tmp_path)
|
||||
with (
|
||||
patch("bridge.cleanup.remote_port_listening", return_value=True),
|
||||
patch("bridge.cleanup.probe_remote_forward", return_value=(True, "ok")),
|
||||
):
|
||||
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
|
||||
assert needed is False
|
||||
|
||||
def test_detects_stale_forward_when_local_ok_remote_fails(self, tmp_path):
|
||||
cfg = _tunnel()
|
||||
state_mgr = StateManager(state_dir=tmp_path)
|
||||
with (
|
||||
patch("bridge.cleanup.remote_port_listening", return_value=True),
|
||||
patch("bridge.cleanup.probe_remote_forward", return_value=(False, "timeout")),
|
||||
patch("bridge.cleanup.local_service_healthy", return_value=True),
|
||||
patch(
|
||||
"bridge.cleanup.check_tunnel",
|
||||
return_value=MagicMock(ssh_process="ok", remote_port="listening"),
|
||||
),
|
||||
):
|
||||
needed, reason = should_cleanup_tunnel(cfg, state_mgr)
|
||||
assert needed is True
|
||||
assert "stale forward" in reason
|
||||
|
||||
|
||||
class TestCleanupAllTunnels:
|
||||
def test_reports_cleaned_tunnel(self, tmp_path, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "tunnels.yaml"))
|
||||
(tmp_path / "tunnels.yaml").write_text(
|
||||
textwrap.dedent(
|
||||
"""\
|
||||
tunnels:
|
||||
state-hub-railiance01:
|
||||
host: 92.205.62.239
|
||||
remote_port: 18000
|
||||
local_port: 8000
|
||||
ssh_user: tegwick
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: agt-claude-railiance01
|
||||
health_check:
|
||||
url: http://127.0.0.1:8000/state/health
|
||||
actors:
|
||||
agt-claude-railiance01:
|
||||
class: agt
|
||||
"""
|
||||
)
|
||||
)
|
||||
cfg = load_config()
|
||||
state_mgr = StateManager(state_dir=tmp_path / "state")
|
||||
with patch(
|
||||
"bridge.cleanup.cleanup_tunnel",
|
||||
return_value=CleanupAction("state-hub-railiance01", "cleaned", "cleared"),
|
||||
):
|
||||
report = cleanup_all_tunnels(cfg, state_mgr, restart=False)
|
||||
assert report.cleaned_count == 1
|
||||
assert report.actions[0].action == "cleaned"
|
||||
|
||||
|
||||
class TestMaintenanceCli:
|
||||
def test_cleanup_help(self):
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(app, ["maintenance", "cleanup", "--help"])
|
||||
assert result.exit_code == 0
|
||||
assert "restart" in result.output.lower()
|
||||
|
||||
def test_show_cron_prints_template_when_not_installed(self):
|
||||
runner = CliRunner()
|
||||
with patch("bridge.cli.read_installed_cron", return_value=None):
|
||||
result = runner.invoke(app, ["maintenance", "show-cron"])
|
||||
assert result.exit_code == 0
|
||||
assert "0 3 * * *" in result.output
|
||||
|
||||
|
||||
def test_build_cron_line_contains_marker():
|
||||
line = build_cron_line()
|
||||
assert "0 3 * * *" in line
|
||||
assert "maintenance cleanup --restart" in line
|
||||
assert "ops-bridge: maintenance cleanup" in line
|
||||
@@ -17,10 +17,10 @@ VALID_CONFIG = textwrap.dedent("""\
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
""")
|
||||
|
||||
@@ -266,22 +266,146 @@ class TestCheckCommand:
|
||||
assert result.exit_code == 1
|
||||
|
||||
|
||||
REVERSE_CONFIG = VALID_CONFIG
|
||||
|
||||
LOCAL_TUNNEL_CONFIG = textwrap.dedent("""\
|
||||
tunnels:
|
||||
k3s-api:
|
||||
host: host.local
|
||||
remote_port: 6443
|
||||
local_port: 6443
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: adm-bernd
|
||||
direction: local
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
""")
|
||||
|
||||
|
||||
class TestRestartCommand:
|
||||
def test_restart_unknown_tunnel_exit_1(self, env):
|
||||
result = runner.invoke(app, ["restart", "nonexistent"], env=env)
|
||||
assert result.exit_code == 1
|
||||
|
||||
def test_restart_help_mentions_remote_cleanup(self):
|
||||
result = runner.invoke(app, ["restart", "--help"])
|
||||
assert result.exit_code == 0
|
||||
assert "stale-forward" in result.output.lower() or "remote" in result.output.lower()
|
||||
|
||||
@pytest.mark.capability("bridge_restart")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_restart_calls_stop_then_start(self, env):
|
||||
with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
|
||||
def test_restart_reverse_tunnel_delegates_to_cleanup(self, env):
|
||||
from bridge.cleanup import CleanupAction
|
||||
|
||||
with patch("bridge.cli.restart_tunnel") as mock_restart:
|
||||
mock_restart.return_value = CleanupAction(
|
||||
"test-tunnel", "healthy", "remote forward healthy"
|
||||
)
|
||||
result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
mock_restart.assert_called_once()
|
||||
assert "test-tunnel: healthy" in result.output
|
||||
|
||||
def test_restart_reverse_tunnel_reports_cleaned_and_restarted(self, env):
|
||||
from bridge.cleanup import CleanupAction
|
||||
|
||||
with patch("bridge.cli.restart_tunnel") as mock_restart:
|
||||
mock_restart.return_value = CleanupAction(
|
||||
"test-tunnel",
|
||||
"cleaned_and_restarted",
|
||||
"stale forward; restarted tunnel; cleared",
|
||||
)
|
||||
result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
assert "cleaned_and_restarted" in result.output
|
||||
|
||||
def test_restart_reverse_tunnel_error_exit_1(self, env):
|
||||
from bridge.cleanup import CleanupAction
|
||||
|
||||
with patch("bridge.cli.restart_tunnel") as mock_restart:
|
||||
mock_restart.return_value = CleanupAction(
|
||||
"test-tunnel", "error", "cleanup failed: still_listening"
|
||||
)
|
||||
result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
|
||||
|
||||
assert result.exit_code == 1
|
||||
assert "error" in result.output
|
||||
|
||||
def test_restart_local_tunnel_uses_stop_start(self, tmp_path, state_dir):
|
||||
config_file = tmp_path / "tunnels.yaml"
|
||||
config_file.write_text(LOCAL_TUNNEL_CONFIG)
|
||||
env = {
|
||||
"BRIDGE_CONFIG": str(config_file),
|
||||
"BRIDGE_STATE_DIR": str(state_dir),
|
||||
}
|
||||
|
||||
with patch("bridge.cleanup.TunnelManager") as mock_mgr_cls:
|
||||
mock_mgr = MagicMock()
|
||||
mock_mgr_cls.return_value = mock_mgr
|
||||
call_order = []
|
||||
mock_mgr.stop.side_effect = lambda: call_order.append("stop")
|
||||
mock_mgr.start.side_effect = lambda: call_order.append("start")
|
||||
|
||||
result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
|
||||
result = runner.invoke(app, ["restart", "k3s-api"], env=env)
|
||||
|
||||
assert result.exit_code == 0
|
||||
assert call_order == ["stop", "start"]
|
||||
assert "k3s-api: restarted" in result.output
|
||||
|
||||
|
||||
class TestCertStatusCommand:
|
||||
@pytest.mark.capability("bridge_cert_status")
|
||||
@pytest.mark.access_mode("cli")
|
||||
def test_cert_status_no_cert_shows_static_key(self, env, state_dir):
|
||||
result = runner.invoke(app, ["cert-status"], env=env)
|
||||
assert result.exit_code == 0
|
||||
assert "static-key" in result.output
|
||||
|
||||
def test_cert_status_json_no_cert(self, env, state_dir):
|
||||
result = runner.invoke(app, ["cert-status", "--json"], env=env)
|
||||
assert result.exit_code == 0
|
||||
data = json.loads(result.output)
|
||||
assert data[0]["mode"] == "static-key"
|
||||
|
||||
def test_cert_status_exit_1_on_expired(self, env, state_dir, tmp_path):
|
||||
# Write a fake cert file in state dir; mock ssh-keygen to report expired
|
||||
state_dir.mkdir(parents=True, exist_ok=True)
|
||||
cert_file = state_dir / "test-tunnel-cert.pub"
|
||||
cert_file.write_text("fake cert")
|
||||
with patch("subprocess.run") as mock_run:
|
||||
mock_run.return_value = MagicMock(
|
||||
stdout=(
|
||||
"test-tunnel-cert.pub:\n"
|
||||
" Key ID: \"agt-test\"\n"
|
||||
" Valid: from 2026-01-01T00:00:00 to 2026-01-02T00:00:00\n"
|
||||
),
|
||||
returncode=0,
|
||||
)
|
||||
result = runner.invoke(app, ["cert-status"], env=env)
|
||||
assert result.exit_code == 1
|
||||
assert "EXPIRED" in result.output
|
||||
|
||||
def test_cert_status_json_with_cert(self, env, state_dir):
|
||||
state_dir.mkdir(parents=True, exist_ok=True)
|
||||
cert_file = state_dir / "test-tunnel-cert.pub"
|
||||
cert_file.write_text("fake cert")
|
||||
with patch("subprocess.run") as mock_run:
|
||||
mock_run.return_value = MagicMock(
|
||||
stdout=(
|
||||
"test-tunnel-cert.pub:\n"
|
||||
" Key ID: \"agt-test\"\n"
|
||||
" Valid: from 2030-01-01T00:00:00 to 2030-01-02T00:00:00\n"
|
||||
),
|
||||
returncode=0,
|
||||
)
|
||||
result = runner.invoke(app, ["cert-status", "--json"], env=env)
|
||||
assert result.exit_code == 0
|
||||
data = json.loads(result.output)
|
||||
assert data[0]["mode"] == "cert"
|
||||
assert data[0]["key_id"] == "agt-test"
|
||||
assert data[0]["expired"] is False
|
||||
|
||||
@@ -1,9 +1,11 @@
|
||||
"""Tests for config loading."""
|
||||
import textwrap
|
||||
import warnings
|
||||
|
||||
import pytest
|
||||
|
||||
from bridge.config import ConfigError, load_config
|
||||
from bridge.models import ActorType
|
||||
|
||||
|
||||
VALID_YAML = textwrap.dedent("""\
|
||||
@@ -14,7 +16,7 @@ VALID_YAML = textwrap.dedent("""\
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: agent.claude-coulombcore
|
||||
actor: agt-claude-coulombcore
|
||||
health_check:
|
||||
url: http://127.0.0.1:18000/health
|
||||
interval_seconds: 30
|
||||
@@ -25,11 +27,11 @@ VALID_YAML = textwrap.dedent("""\
|
||||
backoff_max: 60
|
||||
|
||||
actors:
|
||||
agent.claude-coulombcore:
|
||||
class: automation
|
||||
agt-claude-coulombcore:
|
||||
class: agt
|
||||
description: Claude Code agent on CoulombCore
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd Worsch
|
||||
""")
|
||||
|
||||
@@ -50,7 +52,7 @@ def test_load_valid_config(config_file, monkeypatch):
|
||||
assert t.remote_port == 18000
|
||||
assert t.local_port == 8000
|
||||
assert t.ssh_user == "ubuntu"
|
||||
assert t.actor == "agent.claude-coulombcore"
|
||||
assert t.actor == "agt-claude-coulombcore"
|
||||
|
||||
|
||||
def test_health_check_loaded(config_file, monkeypatch):
|
||||
@@ -74,10 +76,10 @@ def test_reconnect_policy_loaded(config_file, monkeypatch):
|
||||
def test_actors_loaded(config_file, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
|
||||
cfg = load_config()
|
||||
assert "agent.claude-coulombcore" in cfg.actors
|
||||
a = cfg.actors["agent.claude-coulombcore"]
|
||||
assert a.actor_class == "automation"
|
||||
assert "operator.bernd" in cfg.actors
|
||||
assert "agt-claude-coulombcore" in cfg.actors
|
||||
a = cfg.actors["agt-claude-coulombcore"]
|
||||
assert a.actor_type == ActorType.AGT
|
||||
assert "adm-bernd" in cfg.actors
|
||||
|
||||
|
||||
def test_missing_required_field_raises(tmp_path, monkeypatch):
|
||||
@@ -118,12 +120,180 @@ def test_tunnel_without_health_check(tmp_path, monkeypatch):
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_rsa
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.tunnels["simple"].health_check is None
|
||||
|
||||
|
||||
class TestActorTypeValidation:
|
||||
def test_canonical_agt_accepted(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: agt-claude
|
||||
actors:
|
||||
agt-claude:
|
||||
class: agt
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.actors["agt-claude"].actor_type == ActorType.AGT
|
||||
|
||||
def test_canonical_atm_accepted(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: atm-backup
|
||||
actors:
|
||||
atm-backup:
|
||||
class: atm
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.actors["atm-backup"].actor_type == ActorType.ATM
|
||||
|
||||
def test_wrong_prefix_raises_config_error(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: agt
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError, match="must start with 'agt-'"):
|
||||
load_config()
|
||||
|
||||
def test_missing_prefix_raises_config_error(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: operator.bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: adm
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError, match="must start with 'adm-'"):
|
||||
load_config()
|
||||
|
||||
def test_unknown_class_raises_config_error(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: wizard
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with pytest.raises(ConfigError, match="unknown class"):
|
||||
load_config()
|
||||
|
||||
def test_legacy_human_maps_to_adm_with_warning(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
adm-bernd:
|
||||
class: human
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with warnings.catch_warnings(record=True) as w:
|
||||
warnings.simplefilter("always")
|
||||
cfg = load_config()
|
||||
assert cfg.actors["adm-bernd"].actor_type == ActorType.ADM
|
||||
assert any("deprecated" in str(x.message).lower() for x in w)
|
||||
|
||||
def test_legacy_automation_maps_to_atm_with_warning(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: atm-cron
|
||||
actors:
|
||||
atm-cron:
|
||||
class: automation
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
with warnings.catch_warnings(record=True) as w:
|
||||
warnings.simplefilter("always")
|
||||
cfg = load_config()
|
||||
assert cfg.actors["atm-cron"].actor_type == ActorType.ATM
|
||||
assert any("deprecated" in str(x.message).lower() for x in w)
|
||||
|
||||
|
||||
class TestCertCommandConfig:
|
||||
def test_cert_command_parsed(self, tmp_path, monkeypatch):
|
||||
f = tmp_path / "t.yaml"
|
||||
f.write_text(textwrap.dedent("""\
|
||||
tunnels:
|
||||
t:
|
||||
host: h
|
||||
remote_port: 1
|
||||
local_port: 2
|
||||
ssh_user: u
|
||||
ssh_key: ~/.ssh/k
|
||||
actor: agt-bridge
|
||||
cert_command: "warden sign agt-bridge --pubkey /tmp/k.pub"
|
||||
actors:
|
||||
agt-bridge:
|
||||
class: agt
|
||||
"""))
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(f))
|
||||
cfg = load_config()
|
||||
assert cfg.tunnels["t"].cert_command == "warden sign agt-bridge --pubkey /tmp/k.pub"
|
||||
|
||||
def test_no_cert_command_is_none(self, config_file, monkeypatch):
|
||||
monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
|
||||
cfg = load_config()
|
||||
assert cfg.tunnels["state-hub-coulombcore"].cert_command is None
|
||||
|
||||
@@ -6,7 +6,11 @@ from unittest.mock import MagicMock, patch
|
||||
|
||||
import pytest
|
||||
|
||||
from bridge.diagnostics import TunnelCheckResult, check_all_tunnels, check_tunnel
|
||||
from bridge.diagnostics import (
|
||||
_remote_port_probe_command,
|
||||
check_all_tunnels,
|
||||
check_tunnel,
|
||||
)
|
||||
from bridge.models import BridgeState, TunnelConfig
|
||||
from bridge.state import StateManager
|
||||
|
||||
@@ -20,7 +24,7 @@ def tcfg():
|
||||
local_port=8000,
|
||||
ssh_user="ubuntu",
|
||||
ssh_key="~/.ssh/id_ops",
|
||||
actor="operator.bernd",
|
||||
actor="adm-bernd",
|
||||
)
|
||||
|
||||
|
||||
@@ -32,6 +36,14 @@ def state_mgr(tmp_path):
|
||||
|
||||
|
||||
class TestCheckTunnel:
|
||||
def test_remote_port_probe_has_minimal_host_fallback(self):
|
||||
"""Remote probe supports minimal hosts without ss/netstat."""
|
||||
command = _remote_port_probe_command(18000)
|
||||
assert "command -v ss" in command
|
||||
assert "command -v netstat" in command
|
||||
assert "/proc/net/tcp" in command
|
||||
assert "/proc/net/tcp6" in command
|
||||
|
||||
def test_no_pid(self, tcfg, state_mgr):
|
||||
"""No PID file → ssh_process='no_pid', ok=False."""
|
||||
with patch("bridge.diagnostics.subprocess.run") as mock_run:
|
||||
@@ -83,6 +95,29 @@ class TestCheckTunnel:
|
||||
assert result.remote_port == "closed"
|
||||
assert result.ok is False
|
||||
|
||||
def test_local_direction_checks_local_port(self, tcfg, state_mgr):
|
||||
"""Local tunnels verify the local listener instead of a remote -R port."""
|
||||
local_cfg = TunnelConfig(
|
||||
name="local-tunnel",
|
||||
host="haskelseed.local",
|
||||
remote_port=1234,
|
||||
local_port=11234,
|
||||
ssh_user="root",
|
||||
ssh_key="~/.ssh/id_ops",
|
||||
actor="adm-bernd",
|
||||
direction="local",
|
||||
)
|
||||
state_mgr.write_pid("local-tunnel", 12345)
|
||||
with (
|
||||
patch("bridge.diagnostics._pid_alive", return_value=True),
|
||||
patch("bridge.diagnostics._probe_local_port", return_value="listening"),
|
||||
patch("bridge.diagnostics.subprocess.run") as mock_run,
|
||||
):
|
||||
result = check_tunnel(local_cfg, state_mgr)
|
||||
mock_run.assert_not_called()
|
||||
assert result.remote_port == "listening"
|
||||
assert result.ok is True
|
||||
|
||||
def test_ssh_timeout(self, tcfg, state_mgr):
|
||||
"""SSH probe timeout → remote_port='error:timeout'."""
|
||||
state_mgr.write_pid("test-tunnel", 12345)
|
||||
@@ -114,7 +149,7 @@ class TestCheckTunnel:
|
||||
local_port=8000,
|
||||
ssh_user="ubuntu",
|
||||
ssh_key="~/.ssh/id_ops",
|
||||
actor="operator.bernd",
|
||||
actor="adm-bernd",
|
||||
health_check=HealthCheckConfig(url="http://127.0.0.1:8000/health"),
|
||||
)
|
||||
state_mgr.write_pid("test-tunnel", 12345)
|
||||
@@ -135,7 +170,8 @@ class TestCheckAllTunnels:
|
||||
def test_check_all_iterates_tunnels(self, tmp_path):
|
||||
"""check_all_tunnels returns one result per tunnel in cfg."""
|
||||
from bridge.config import load_config
|
||||
import textwrap, os
|
||||
import textwrap
|
||||
import os
|
||||
|
||||
cfg_file = tmp_path / "tunnels.yaml"
|
||||
cfg_file.write_text(textwrap.dedent("""\
|
||||
@@ -146,17 +182,17 @@ class TestCheckAllTunnels:
|
||||
local_port: 8001
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
t2:
|
||||
host: h2.local
|
||||
remote_port: 18002
|
||||
local_port: 8002
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
"""))
|
||||
os.environ["BRIDGE_CONFIG"] = str(cfg_file)
|
||||
|
||||
@@ -18,14 +18,14 @@ MINIMAL_CONFIG = textwrap.dedent("""\
|
||||
local_port: 8000
|
||||
ssh_user: testuser
|
||||
ssh_key: ~/.ssh/id_rsa
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
reconnect:
|
||||
max_attempts: 2
|
||||
backoff_initial: 1
|
||||
backoff_max: 2
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
""")
|
||||
|
||||
@@ -51,7 +51,7 @@ def tunnel_cfg():
|
||||
local_port=8000,
|
||||
ssh_user="testuser",
|
||||
ssh_key="~/.ssh/id_rsa",
|
||||
actor="operator.bernd",
|
||||
actor="adm-bernd",
|
||||
reconnect=ReconnectPolicy(max_attempts=2, backoff_initial=1, backoff_max=2),
|
||||
)
|
||||
|
||||
@@ -142,7 +142,7 @@ class TestHealthCheckDegradedPath:
|
||||
local_port=8001,
|
||||
ssh_user="u",
|
||||
ssh_key="k",
|
||||
actor="operator.bernd",
|
||||
actor="adm-bernd",
|
||||
reconnect=ReconnectPolicy(max_attempts=1, backoff_initial=1, backoff_max=1),
|
||||
health_check=hc_cfg,
|
||||
)
|
||||
|
||||
@@ -3,6 +3,8 @@ import os
|
||||
import signal
|
||||
from unittest.mock import MagicMock, patch
|
||||
|
||||
from dataclasses import replace
|
||||
|
||||
import pytest
|
||||
|
||||
from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig
|
||||
@@ -38,6 +40,16 @@ class TestBuildSshCommand:
|
||||
assert "-i" in cmd
|
||||
assert "ubuntu@host.local" in cmd
|
||||
|
||||
def test_remote_host_override_local(self, tunnel_cfg):
|
||||
cfg = replace(tunnel_cfg, direction="local", remote_host="10.43.103.154")
|
||||
cmd = build_ssh_command(cfg)
|
||||
assert "-L" in cmd
|
||||
assert f"{cfg.local_port}:10.43.103.154:{cfg.remote_port}" in cmd
|
||||
|
||||
def test_remote_host_default_loopback(self, tunnel_cfg):
|
||||
cmd = build_ssh_command(tunnel_cfg)
|
||||
assert "18000:127.0.0.1:8000" in cmd
|
||||
|
||||
def test_server_alive_options(self, tunnel_cfg):
|
||||
cmd = build_ssh_command(tunnel_cfg)
|
||||
assert "-o" in cmd
|
||||
@@ -105,3 +117,99 @@ class TestTunnelManager:
|
||||
def test_is_running_false_initially(self, tunnel_cfg, state_dir):
|
||||
mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
|
||||
assert not mgr.is_running()
|
||||
|
||||
|
||||
class TestBuildSshCommandWithCert:
|
||||
def test_no_cert_path_omits_extra_i(self, tunnel_cfg):
|
||||
cmd = build_ssh_command(tunnel_cfg)
|
||||
assert cmd.count("-i") == 1
|
||||
|
||||
def test_cert_path_appends_after_key(self, tunnel_cfg, tmp_path):
|
||||
cert = tmp_path / "test-cert.pub"
|
||||
cert.write_text("cert")
|
||||
cmd = build_ssh_command(tunnel_cfg, cert_path=cert)
|
||||
i_indices = [i for i, x in enumerate(cmd) if x == "-i"]
|
||||
assert len(i_indices) == 2
|
||||
key_idx, cert_idx = i_indices
|
||||
assert not cmd[key_idx + 1].endswith("-cert.pub") # key comes first
|
||||
assert cmd[cert_idx + 1] == str(cert)
|
||||
|
||||
|
||||
class TestRunCertCommand:
|
||||
def test_returns_none_when_no_cert_command(self, tunnel_cfg, tmp_path):
|
||||
from bridge.manager import _run_cert_command
|
||||
assert _run_cert_command(tunnel_cfg, tmp_path) is None
|
||||
|
||||
def test_writes_cert_and_returns_path(self, tunnel_cfg, tmp_path):
|
||||
from bridge.manager import _run_cert_command
|
||||
tunnel_cfg.cert_command = "echo 'ssh-rsa-cert AAAA'"
|
||||
path = _run_cert_command(tunnel_cfg, tmp_path)
|
||||
assert path is not None
|
||||
assert path.exists()
|
||||
assert "ssh-rsa-cert" in path.read_text()
|
||||
|
||||
def test_raises_on_nonzero_exit(self, tunnel_cfg, tmp_path):
|
||||
from bridge.manager import _run_cert_command
|
||||
from bridge.models import CertAcquisitionError
|
||||
tunnel_cfg.cert_command = "exit 1"
|
||||
with pytest.raises(CertAcquisitionError):
|
||||
_run_cert_command(tunnel_cfg, tmp_path)
|
||||
|
||||
|
||||
class TestActorTypeFromName:
|
||||
def test_adm_prefix(self):
|
||||
from bridge.manager import _actor_type_from_name
|
||||
assert _actor_type_from_name("adm-bernd") == "adm"
|
||||
|
||||
def test_agt_prefix(self):
|
||||
from bridge.manager import _actor_type_from_name
|
||||
assert _actor_type_from_name("agt-claude") == "agt"
|
||||
|
||||
def test_atm_prefix(self):
|
||||
from bridge.manager import _actor_type_from_name
|
||||
assert _actor_type_from_name("atm-cron") == "atm"
|
||||
|
||||
def test_unknown_prefix(self):
|
||||
from bridge.manager import _actor_type_from_name
|
||||
assert _actor_type_from_name("operator.bernd") == "unknown"
|
||||
|
||||
|
||||
class TestTtlRefresh:
|
||||
def test_parse_cert_expiry_returns_none_for_missing_file(self, tmp_path):
|
||||
from bridge.manager import _parse_cert_expiry
|
||||
missing = tmp_path / "no.pub"
|
||||
result = _parse_cert_expiry(missing)
|
||||
assert result is None
|
||||
|
||||
def test_parse_cert_identity_returns_none_for_missing_file(self, tmp_path):
|
||||
from bridge.manager import _parse_cert_identity
|
||||
missing = tmp_path / "no.pub"
|
||||
result = _parse_cert_identity(missing)
|
||||
assert result is None
|
||||
|
||||
def test_parse_cert_identity_from_keygen_output(self, tmp_path):
|
||||
from unittest.mock import patch, MagicMock
|
||||
from bridge.manager import _parse_cert_identity
|
||||
cert = tmp_path / "test.pub"
|
||||
cert.write_text("fake")
|
||||
with patch("subprocess.run") as mock_run:
|
||||
mock_run.return_value = MagicMock(
|
||||
stdout='test.pub:\n Key ID: "agt-bridge"\n',
|
||||
returncode=0,
|
||||
)
|
||||
result = _parse_cert_identity(cert)
|
||||
assert result == "agt-bridge"
|
||||
|
||||
def test_parse_cert_expiry_from_keygen_output(self, tmp_path):
|
||||
from unittest.mock import patch, MagicMock
|
||||
from bridge.manager import _parse_cert_expiry
|
||||
cert = tmp_path / "test.pub"
|
||||
cert.write_text("fake")
|
||||
with patch("subprocess.run") as mock_run:
|
||||
mock_run.return_value = MagicMock(
|
||||
stdout="test.pub:\n Valid: from 2026-05-15T10:00:00 to 2030-05-15T22:00:00\n",
|
||||
returncode=0,
|
||||
)
|
||||
result = _parse_cert_expiry(cert)
|
||||
assert result is not None
|
||||
assert result.year == 2030
|
||||
|
||||
@@ -49,10 +49,10 @@ def _simple_config(tmp_path: Path) -> Path:
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
"""))
|
||||
|
||||
@@ -66,10 +66,10 @@ def _catalog_config(tmp_path: Path, catalog_dir: Path) -> Path:
|
||||
local_port: 8000
|
||||
ssh_user: ubuntu
|
||||
ssh_key: ~/.ssh/id_ops
|
||||
actor: operator.bernd
|
||||
actor: adm-bernd
|
||||
actors:
|
||||
operator.bernd:
|
||||
class: human
|
||||
adm-bernd:
|
||||
class: adm
|
||||
description: Bernd
|
||||
catalog_path: {catalog_dir}
|
||||
"""))
|
||||
@@ -237,22 +237,22 @@ class TestMcpBridgeDown:
|
||||
class TestMcpBridgeRestart:
|
||||
@pytest.mark.capability("bridge_restart")
|
||||
@pytest.mark.access_mode("mcp")
|
||||
async def test_bridge_restart_calls_stop_then_start(self, env_simple):
|
||||
with patch("bridge.manager.TunnelManager") as mock_cls:
|
||||
mock_mgr = MagicMock()
|
||||
call_order = []
|
||||
mock_mgr.stop.side_effect = lambda: call_order.append("stop")
|
||||
mock_mgr.start.side_effect = lambda: call_order.append("start")
|
||||
mock_cls.return_value = mock_mgr
|
||||
async def test_bridge_restart_delegates_to_cleanup(self, env_simple):
|
||||
from bridge.cleanup import CleanupAction
|
||||
|
||||
with patch("bridge.cleanup.restart_tunnel") as mock_restart:
|
||||
mock_restart.return_value = CleanupAction(
|
||||
"test-tunnel", "healthy", "remote forward healthy"
|
||||
)
|
||||
|
||||
from fastmcp import Client
|
||||
async with Client(mcp) as c:
|
||||
result = await c.call_tool("bridge_restart", {"tunnel": "test-tunnel"})
|
||||
|
||||
data = _data(result)
|
||||
assert "restarted" in data
|
||||
assert "test-tunnel" in data["restarted"]
|
||||
assert call_order == ["stop", "start"]
|
||||
assert data["actions"][0]["tunnel"] == "test-tunnel"
|
||||
assert data["actions"][0]["action"] == "healthy"
|
||||
mock_restart.assert_called_once()
|
||||
|
||||
async def test_bridge_restart_unknown_tunnel(self, env_simple):
|
||||
from fastmcp import Client
|
||||
@@ -278,8 +278,8 @@ class TestMcpBridgeLogs:
|
||||
_json.dumps({
|
||||
"timestamp": "2026-01-01T00:00:00+00:00",
|
||||
"tunnel": "test-tunnel",
|
||||
"actor": "operator.bernd",
|
||||
"actor_class": "human",
|
||||
"actor": "adm-bernd",
|
||||
"actor_type": "adm",
|
||||
"event": "bridge_started",
|
||||
}) + "\n"
|
||||
)
|
||||
|
||||
@@ -69,6 +69,7 @@ class TestTunnelConfig:
|
||||
|
||||
class TestActorInfo:
|
||||
def test_fields(self):
|
||||
a = ActorInfo(name="operator.bernd", actor_class="human", description="Bernd")
|
||||
assert a.name == "operator.bernd"
|
||||
assert a.actor_class == "human"
|
||||
from bridge.models import ActorType
|
||||
a = ActorInfo(name="adm-bernd", actor_type=ActorType.ADM, description="Bernd")
|
||||
assert a.name == "adm-bernd"
|
||||
assert a.actor_type == ActorType.ADM
|
||||
|
||||
18
uv.lock
generated
@@ -345,7 +345,7 @@ wheels = [
|
||||
|
||||
[[package]]
|
||||
name = "fastmcp"
|
||||
version = "3.1.0"
|
||||
version = "3.0.2"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
dependencies = [
|
||||
{ name = "authlib" },
|
||||
@@ -365,14 +365,13 @@ dependencies = [
|
||||
{ name = "python-dotenv" },
|
||||
{ name = "pyyaml" },
|
||||
{ name = "rich" },
|
||||
{ name = "uncalled-for" },
|
||||
{ name = "uvicorn" },
|
||||
{ name = "watchfiles" },
|
||||
{ name = "websockets" },
|
||||
]
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/0a/70/862026c4589441f86ad3108f05bfb2f781c6b322ad60a982f40b303b47d7/fastmcp-3.1.0.tar.gz", hash = "sha256:e25264794c734b9977502a51466961eeecff92a0c2f3b49c40c070993628d6d0", size = 17347083 }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/11/6b/1a7ec89727797fb07ec0928e9070fa2f45e7b35718e1fe01633a34c35e45/fastmcp-3.0.2.tar.gz", hash = "sha256:6bd73b4a3bab773ee6932df5249dcbcd78ed18365ed0aeeb97bb42702a7198d7", size = 17239351 }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/17/07/516f5b20d88932e5a466c2216b628e5358a71b3a9f522215607c3281de05/fastmcp-3.1.0-py3-none-any.whl", hash = "sha256:b1f73b56fd3b0cb2bd9e2a144fc650d5cc31587ed129d996db7710e464ae8010", size = 633749 },
|
||||
{ url = "https://files.pythonhosted.org/packages/0a/5a/f410a9015cfde71adf646dab4ef2feae49f92f34f6050fcfb265eb126b30/fastmcp-3.0.2-py3-none-any.whl", hash = "sha256:f513d80d4b30b54749fe8950116b1aab843f3c293f5cb971fc8665cb48dbb028", size = 606268 },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
@@ -664,7 +663,7 @@ dev = [
|
||||
|
||||
[package.metadata]
|
||||
requires-dist = [
|
||||
{ name = "fastmcp", specifier = ">=2.0.0" },
|
||||
{ name = "fastmcp", specifier = ">=2.0.0,<3.1.0" },
|
||||
{ name = "httpx", specifier = ">=0.27" },
|
||||
{ name = "pyyaml", specifier = ">=6.0" },
|
||||
{ name = "typer", specifier = ">=0.12" },
|
||||
@@ -1297,15 +1296,6 @@ wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/dc/9b/47798a6c91d8bdb567fe2698fe81e0c6b7cb7ef4d13da4114b41d239f65d/typing_inspection-0.4.2-py3-none-any.whl", hash = "sha256:4ed1cacbdc298c220f1bd249ed5287caa16f34d44ef4e9c3d0cbad5b521545e7", size = 14611 },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "uncalled-for"
|
||||
version = "0.2.0"
|
||||
source = { registry = "https://pypi.org/simple" }
|
||||
sdist = { url = "https://files.pythonhosted.org/packages/02/7c/b5b7d8136f872e3f13b0584e576886de0489d7213a12de6bebf29ff6ebfc/uncalled_for-0.2.0.tar.gz", hash = "sha256:b4f8fdbcec328c5a113807d653e041c5094473dd4afa7c34599ace69ccb7e69f", size = 49488 }
|
||||
wheels = [
|
||||
{ url = "https://files.pythonhosted.org/packages/ff/7f/4320d9ce3be404e6310b915c3629fe27bf1e2f438a1a7a3cb0396e32e9a9/uncalled_for-0.2.0-py3-none-any.whl", hash = "sha256:2c0bd338faff5f930918f79e7eb9ff48290df2cb05fcc0b40a7f334e55d4d85f", size = 11351 },
|
||||
]
|
||||
|
||||
[[package]]
|
||||
name = "uvicorn"
|
||||
version = "0.41.0"
|
||||
|
||||
203
wiki/AccessManagementDirective.md
Normal file
@@ -0,0 +1,203 @@
AccessManagementDirective

*Practical host access control management*

# AccessManagementDirective

**Document Title:** SSH Access Management Directive
**Version:** 1.1 (Production-Ready Revision – Post-SWOT Improvements)
**Date:** 28 March 2026
**Audience:** Operations Department
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
**Author:** Grok (on behalf of the team)
**Status:** Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.

## 0. Prerequisites

Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.

If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.

## 1. Concept Overview

This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.

**Why this model?**
- A central CA signs short-lived certificates for every login.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.

**Core Principles**
- **Least privilege** – Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
- **Short-lived credentials** – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
- **One CA, many issuers** – A single offline User CA whose public key is trusted by every host.
- **Automation-first** – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- **Separation of concerns** –
  - **Admins (adm)**: Human operators (full interactive shell when needed).
  - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
  - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
|
||||
|
||||
## 2. Actor Definitions & Access Model
|
||||
|
||||
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|
||||
|------------|-------------------|-------------|------------------------------|---------------------------|
|
||||
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
|
||||
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
|
||||
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
|
||||
|
||||
**Certificate Naming Convention**
|
||||
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
|
||||
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
|
||||
|
||||
**LLM-Agent Risk Clarification**
|
||||
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
|
||||
|
||||
## 3. Bootstrapping the System (One-Time Setup)
|
||||
|
||||
### 3.1. Create the CA (do this once, offline)
|
||||
```bash
|
||||
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
|
||||
```
|
||||
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
|
||||
- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
|
||||
- Public key: `ca_user.pub`
|
||||
|
||||
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
|
||||
- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
|
||||
- Update `/etc/ssh/sshd_config`:
|
||||
```bash
|
||||
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
|
||||
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
|
||||
PubkeyAuthentication yes
|
||||
PasswordAuthentication no
|
||||
PermitRootLogin no
|
||||
```
|
||||
- Create principals directory and files from the central Git inventory.
|
||||
- `systemctl restart sshd`
|
||||
|
||||
### 3.3. Initial Admin Access
|
||||
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
|
||||
|
||||
## 4. Automatic Management of Access Rights
|
||||
|
||||
### 4.1. Daily / On-Demand Workflow
|
||||
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
|
||||
- **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
|
||||
- **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
|
||||
- **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
|
||||
|
||||
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
|
||||
- `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
|
||||
- Example inventory snippet:
|
||||
```yaml
|
||||
hosts:
|
||||
- name: prod-db-01
|
||||
allowed_principals:
|
||||
adm: [adm-full]
|
||||
agt: [agt-incident-resolver-v2]
|
||||
atm: [atm-backup-daily, atm-logrotate]
|
||||
```
|
||||
|
||||
3. **Revocation & Rotation**
|
||||
- Short expiry acts as automatic revocation: a certificate that is not renewed stops being accepted once it expires.
|
||||
- For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
|
||||
- Agents/automations never store long-lived private keys on disk.
|
||||
|
||||
4. **Concrete Agent & Automation Wrapper Example** (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`)
|
||||
```python
#!/usr/bin/env python3
import os, subprocess, sys, tempfile  # sys is needed for sys.argv below

# Request a short-lived cert from Vault for the public key held in $SSH_PUBKEY
cert = subprocess.check_output(["vault", "write", "-field=signed_key", "ssh/sign/agt-role", f"public_key={os.environ['SSH_PUBKEY']}"]).decode().strip()

# delete=False keeps the cert file on disk after the with-block closes
with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
    f.write(cert.encode())
    cert_path = f.name

# Load into ssh-agent and exec the real command.
# NOTE: ssh-add normally takes a private-key path and picks up the matching
# <key>-cert.pub next to it; adapt this call to your key layout.
subprocess.run(["ssh-add", cert_path])
os.execvp(sys.argv[1], sys.argv[1:])
```
|
||||
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
|
||||
|
||||
### 4.2. Human UX Guidance
|
||||
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases.
|
||||
|
||||
### 4.3. Emergency Break-Glass Procedure
|
||||
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
|
||||
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
|
||||
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
|
||||
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
|
||||
4. After recovery, immediately rotate the CA and run a full scorecard.
|
||||
|
||||
## 5. AccessManagement Scorecard (Checklist)
|
||||
|
||||
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
|
||||
|
||||
| Category | Check | Target | Tool |
|
||||
|----------|-------|--------|------|
|
||||
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
|
||||
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
|
||||
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
|
||||
| **Expiry Policy** | All issued certs have a validity of ≤ 48 h (adm), ≤ 24 h (agt), or ≤ 8 h (atm) | Last 100 certs | `ssh-keygen -L -f <cert>` |
|
||||
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
|
||||
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
|
||||
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
|
||||
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
|
||||
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
|
||||
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
|
||||
| **Score** | 10/10 = **Operational** | - | - |
|
||||
|
||||
**Scorecard Execution Command** (run from ops laptop):
|
||||
```bash
|
||||
ansible all -m command -a "ssh-access-scorecard.sh" --become
|
||||
```
|
||||
|
||||
## 6. Scope & Operational Boundaries
|
||||
|
||||
### 6.1. When Bootstrapping Is Officially Closed
|
||||
The system is **fully operational** when **ALL** of the following are true:
|
||||
- Scorecard passes 10/10 on every host.
|
||||
- Central Git repo contains the authoritative principals inventory.
|
||||
- First three admins have successfully used signed certificates for 7 consecutive days.
|
||||
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
|
||||
- CI/CD pipeline for host config updates is green and runs hourly.
|
||||
- Emergency break-glass procedure has been tested once.
|
||||
|
||||
**Declaration:** Ops Lead signs off with date in the Git commit message.
|
||||
|
||||
### 6.2. Scope Boundary – When to Switch to Sophisticated Tooling
|
||||
Stay with **native OpenSSH CA + Ansible + Vault** while:
|
||||
- ≤ 200 hosts
|
||||
- ≤ 50 distinct agent/automation identities
|
||||
- No regulatory requirement for SSO or full session recording
|
||||
|
||||
**Switch triggers** (any one):
|
||||
- > 200 hosts OR rapid daily growth
|
||||
- Need for human SSO (Okta/Google) integration
|
||||
- Requirement for audited web-based SSH sessions or just-in-time access approval
|
||||
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
|
||||
- Audit/compliance demands central policy engine or session recording
|
||||
|
||||
**Recommended next-level tools** (in order):
|
||||
1. **Teleport** – Best for mixed human + agent workloads (SSO + Machine ID).
|
||||
2. **HashiCorp Vault SSH + Boundary** – When you already use Vault heavily.
|
||||
3. **step-ca + smallstep** – If you prefer a pure open-source CA with OIDC.
|
||||
|
||||
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
|
||||
|
||||
## 7. Enforcement & Review
|
||||
- **Quarterly review** of this directive and scorecard results.
|
||||
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
|
||||
- **Questions / improvements** → create PR against this file in the ops repo.
|
||||
|
||||
**End of Document**
|
||||
Approved for immediate use across all production and staging environments.
|
||||
|
||||
|
||||
@@ -157,31 +157,82 @@ Just controlled operational access when you need it.
|
||||
Start a bridge:
|
||||
|
||||
```
|
||||
ob up hostA=hostB
|
||||
bridge up state-hub-railiance01
|
||||
```
|
||||
|
||||
Check active bridges:
|
||||
|
||||
```
|
||||
ob status
|
||||
bridge status
|
||||
```
|
||||
|
||||
Investigate infrastructure targets:
|
||||
|
||||
```
|
||||
ob targets
|
||||
bridge targets
|
||||
```
|
||||
|
||||
Stop the bridge when finished:
|
||||
|
||||
```
|
||||
ob down hostA=hostB
|
||||
bridge down state-hub-railiance01
|
||||
```
|
||||
|
||||
OpsBridge handles the lifecycle so operators can focus on solving the problem.
|
||||
|
||||
---
|
||||
|
||||
# Tunnel lifecycle commands
|
||||
|
||||
| Command | Purpose |
|
||||
|---------|---------|
|
||||
| `bridge up` | Start tunnel(s) that are not already running |
|
||||
| `bridge down` | Stop tunnel(s) that are running |
|
||||
| `bridge restart` | Blank-slate recovery — get tunnel(s) operational again |
|
||||
| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |
|
||||
|
||||
## `bridge restart` — blank-slate recovery
|
||||
|
||||
`bridge restart` means *operational again*, not merely cycling the local manager
|
||||
PID while a broken remote listener still holds the port.
|
||||
|
||||
For **reverse** tunnels (State Hub exposure on remote hosts), restart:
|
||||
|
||||
1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
|
||||
2. Clears orphan listeners on the remote host when needed
|
||||
3. Reconnects the tunnel (stop + start) only when cleanup was required
|
||||
|
||||
When the remote forward is already healthy, restart reports `healthy` and leaves
|
||||
the working tunnel running — no unnecessary disruption.
|
||||
|
||||
For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g.
|
||||
`k3s-api-coulombcore`), restart uses local stop/start only; no remote cleanup.
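
The restart contract described above can be sketched as a small decision function. This is only an illustration, not the ops-bridge implementation: `Tunnel`, the three callables, and the return values `"recovered"` and `"restarted"` are hypothetical; only `should_cleanup_tunnel`, the `direction` field, and the `healthy` result come from the text.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Tunnel:
    name: str
    direction: str  # "reverse" or "local", mirroring `direction` in tunnels.yaml


def restart(
    tunnel: Tunnel,
    needs_cleanup: Callable[[Tunnel], bool],   # stand-in for should_cleanup_tunnel
    cleanup_remote: Callable[[Tunnel], None],  # hypothetical: clears orphan listeners
    reconnect: Callable[[Tunnel], None],       # hypothetical: local stop + start
) -> str:
    """Return "healthy", "recovered", or "restarted" (the last two are made-up labels)."""
    if tunnel.direction == "local":
        # Local-direction tunnels never touch the remote host.
        reconnect(tunnel)
        return "restarted"
    if not needs_cleanup(tunnel):
        # A working reverse forward is left running.
        return "healthy"
    cleanup_remote(tunnel)
    reconnect(tunnel)
    return "recovered"
```

Passing the helpers in as callables keeps the decision logic testable without SSH access; a real manager would presumably call its own methods directly.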
|
||||
|
||||
Use `bridge maintenance cleanup` for scheduled or manual hygiene without the
|
||||
restart contract. The nightly cron (`bridge maintenance install-cron`) runs
|
||||
`maintenance cleanup --restart` at 03:00.
|
||||
|
||||
**Incident context:** stale orphan `sshd` remote forwards after laptop sleep
|
||||
blocked `bridge restart` until operators discovered the maintenance subcommand.
|
||||
See `state-hub/history/20260621-weekend-automation-assessment.md` and
|
||||
`BRIDGE-WP-0005` in this repo.
|
||||
|
||||
## Host roles
|
||||
|
||||
Tunnels in `~/.config/bridge/tunnels.yaml` serve three host roles:
|
||||
|
||||
| Role | Hosts | Behaviour |
|
||||
|------|-------|-----------|
|
||||
| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
|
||||
| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
|
||||
| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |
|
||||
|
||||
Conditional remote cleanup before restart benefits all reverse tunnels.
|
||||
`should_cleanup_tunnel` skips healthy forwards — VPS tunnels with live working
|
||||
forwards are untouched.
|
||||
|
||||
---
|
||||
|
||||
# The Philosophy Behind OpsBridge
|
||||
|
||||
Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:
|
||||
|
||||
56
workplans/ADHOC-2026-06-14.md
Normal file
@@ -0,0 +1,56 @@
|
||||
---
|
||||
id: ADHOC-2026-06-14
|
||||
type: workplan
|
||||
title: "Ad hoc ops-bridge fixes for 2026-06-14"
|
||||
domain: custodian
|
||||
repo: ops-bridge
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: ops-bridge
|
||||
created: "2026-06-14"
|
||||
updated: "2026-06-14"
|
||||
state_hub_workstream_id: "fbc2ef7e-626f-4c6a-bdf8-c69bf29097ce"
|
||||
---
|
||||
|
||||
## Fix haskelseed bridge diagnostics
|
||||
|
||||
```task
|
||||
id: ADHOC-2026-06-14-T01
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "ffe6b8d8-889c-4ec4-8b64-00b77f86e39f"
|
||||
```
|
||||
|
||||
`haskelseed` is an Alpine host without `ss`, so `bridge check` reported
|
||||
reverse tunnel ports as closed even while SSH reverse listeners were present.
|
||||
Updated diagnostics to fall back from `ss` to `netstat` and then
|
||||
`/proc/net/tcp`/`tcp6`. Also fixed local-direction diagnostics so
|
||||
`nix-daemon-haskelseed` checks the local `-L` listener instead of probing a
|
||||
remote reverse port.
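
As an illustration of the last step in that fallback chain, the sketch below extracts listening ports from the text of `/proc/net/tcp` (or `tcp6`). It is not the code shipped in ops-bridge; the function name and sample data are hypothetical. It relies on the kernel's table layout, where the second column is `hex-address:hex-port` and state `0A` means LISTEN.

```python
from __future__ import annotations


def listening_ports(proc_net_tcp_text: str) -> set[int]:
    """Return the local ports in LISTEN state from /proc/net/tcp-style text."""
    ports: set[int] = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == "0A":  # 0A is TCP_LISTEN
            # The port is the hex number after the last colon (works for tcp6 too).
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports


# Made-up sample: one listener on 127.0.0.1:18000, one established connection.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:4650 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 11111 1
   1: 0100007F:1F40 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 22222 1
"""

if __name__ == "__main__":
    print(listening_ports(SAMPLE))
```

On a remote host the text would come from something like `cat /proc/net/tcp /proc/net/tcp6` over SSH, which needs neither `ss` nor `netstat`.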
|
||||
|
||||
Verification:
|
||||
|
||||
- `state-hub-haskelseed` responded through `127.0.0.1:18000/state/health`.
|
||||
- `bridge check --json` reported all configured tunnels `ok: true`.
|
||||
- `python3 -m pytest tests/test_cli.py tests/test_diagnostics.py` passed.
|
||||
|
||||
## Make default target safe and add setup
|
||||
|
||||
```task
|
||||
id: ADHOC-2026-06-14-T02
|
||||
status: done
|
||||
priority: medium
|
||||
state_hub_task_id: "3b932955-0d75-4b95-9821-92bfa2dadbd0"
|
||||
```
|
||||
|
||||
Changed `make` to default to a help listing that only shows targets with
|
||||
`##` comments. Added `make setup` to run `uv sync --all-groups` and reinstall
|
||||
the editable `bridge` CLI wrapper through `uv tool install -e . --force`.
|
||||
|
||||
Verification:
|
||||
|
||||
- `uv sync --all-groups` succeeded and installed the project environment.
|
||||
- `make` listed targets only and did not run tests or setup.
|
||||
- `make setup` succeeded and installed the `bridge` executable.
|
||||
- `make test` passed all 235 tests.
|
||||
- `make lint` passed.
|
||||
@@ -2,7 +2,7 @@
|
||||
id: BRIDGE-WP-0001
|
||||
type: workplan
|
||||
title: "OpsBridge Initial Implementation"
|
||||
domain: custodian
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: completed
|
||||
owner: Bernd
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
id: BRIDGE-WP-0002
|
||||
type: workplan
|
||||
title: "OpsCatalog Extension"
|
||||
domain: custodian
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: completed
|
||||
owner: Bernd
|
||||
|
||||
@@ -2,7 +2,7 @@
|
||||
id: BRIDGE-WP-0003
|
||||
type: workplan
|
||||
title: "OpsBridge MCP Server, Skill, and Cross-Mode Test Coverage"
|
||||
domain: custodian
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: done
|
||||
owner: Bernd
|
||||
|
||||
340
workplans/BRIDGE-WP-0004-directive-alignment.md
Normal file
@@ -0,0 +1,340 @@
|
||||
---
|
||||
id: BRIDGE-WP-0004
|
||||
type: workplan
|
||||
title: "AccessManagementDirective Alignment"
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: done
|
||||
owner: Bernd
|
||||
topic_slug: custodian
|
||||
created: "2026-03-28"
|
||||
updated: "2026-03-28"
|
||||
state_hub_workstream_id: "e3451b70-688e-4e19-bff5-0c82c0f009a7"
|
||||
---
|
||||
|
||||
# BRIDGE-WP-0004 — AccessManagementDirective Alignment
|
||||
|
||||
**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
|
||||
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
|
||||
preserving full backward compatibility with the existing static-key mode.
|
||||
|
||||
**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
|
||||
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
|
||||
|
||||
---
|
||||
|
||||
## Goal
|
||||
|
||||
After this workplan:
|
||||
|
||||
1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
|
||||
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
|
||||
`cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
|
||||
handled transparently by the tunnel manager.
|
||||
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
|
||||
the directive, with config validation that enforces naming conventions.
|
||||
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
|
||||
§5 SIEM traceability requirement.
|
||||
|
||||
---
|
||||
|
||||
## Reference Documents
|
||||
|
||||
| Document | Location |
|
||||
|---|---|
|
||||
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
|
||||
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
|
||||
| PRD | `wiki/OpsBridgePrd.md` |
|
||||
| FRS | `wiki/OpsBridgeFrs.md` |
|
||||
|
||||
---
|
||||
|
||||
## Design Decisions
|
||||
|
||||
### Static key mode stays first-class
|
||||
|
||||
If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
|
||||
`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
|
||||
explicitly supported for:
|
||||
- Lab/dev environments without a CA
|
||||
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
|
||||
- Environments below the directive's complexity threshold
|
||||
|
||||
### cert_command interface
|
||||
|
||||
```yaml
|
||||
# tunnels.yaml — optional cert_command field
|
||||
tunnels:
|
||||
state-hub-coulombcore:
|
||||
host: coulombcore
|
||||
remote_port: 8001
|
||||
local_port: 8000
|
||||
ssh_user: agt-state-hub-bridge
|
||||
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519 # private key (always required)
|
||||
actor: agt-state-hub-bridge
|
||||
cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
|
||||
```
|
||||
|
||||
When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
|
||||
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
|
||||
`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
|
||||
on tunnel stop.
|
||||
|
||||
`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
|
||||
`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
|
||||
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
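
A minimal sketch of the acquisition step, assuming the behaviour described above: run the shell string, treat stdout as the certificate text, and fail on a non-zero exit. The function name `acquire_cert` and the explicit `cert_path` parameter are illustrative choices, and `CertAcquisitionError` is redefined here only to keep the example self-contained.

```python
import subprocess
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command exits non-zero (stand-in for the real exception)."""


def acquire_cert(cert_command: Optional[str], cert_path: Path) -> Optional[Path]:
    """Run cert_command and store its stdout as the certificate file."""
    if cert_command is None:
        return None  # static key mode: nothing to acquire
    result = subprocess.run(cert_command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    cert_path.write_text(result.stdout)  # overwritten on every call
    return cert_path
```

Because the command is passed to the shell verbatim, the config file has to be as trusted as the user's shell profile; that is the price of keeping the interface free of Vault or warden imports.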
|
||||
|
||||
### TTL-aware cert refresh
|
||||
|
||||
After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to
|
||||
determine `cert_expires_at`. It schedules a pre-emptive cert refresh
|
||||
(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
|
||||
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
|
||||
failure, no reconnect backoff triggered.
|
||||
|
||||
If `cert_command` is absent, no TTL logic runs.
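
The refresh check can be sketched as follows. The sketch assumes the validity window appears in `ssh-keygen -L` output on a line of the form `Valid: from <start> to <end>` with ISO-style timestamps and no zone suffix; verify that against your OpenSSH version. The function names are hypothetical, and the timestamps are treated as naive datetimes, so the caller must compare them with a naive `now` in the same zone.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches e.g. "        Valid: from 2026-03-28T10:00:00 to 2026-03-28T14:00:00"
VALID_RE = re.compile(r"^\s*Valid: from (\S+) to (\S+)\s*$", re.MULTILINE)


def cert_expires_at(keygen_output: str) -> Optional[datetime]:
    """Return the end of the validity window, or None if no window is printed."""
    match = VALID_RE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever" for certs without an expiry
    return datetime.fromisoformat(match.group(2))


def refresh_due(expires_at: datetime, now: datetime,
                margin: timedelta = timedelta(minutes=5)) -> bool:
    """True once `now` is inside the pre-emptive refresh margin."""
    return now >= expires_at - margin
```

Calling `refresh_due` on every health-check iteration gives the pre-emptive restart described above without a separate timer thread.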
|
||||
|
||||
### Actor type model
|
||||
|
||||
`actor_class: str # "human" | "automation"` is replaced by:
|
||||
|
||||
```python
|
||||
class ActorType(str, Enum):
|
||||
ADM = "adm" # human operator
|
||||
AGT = "agt" # LLM-powered autonomous agent
|
||||
ATM = "atm" # deterministic script / pipeline
|
||||
```
|
||||
|
||||
Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
|
||||
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
|
||||
canonical values.
|
||||
|
||||
Config validation: if `actor` name is set, it must start with the prefix matching its type
|
||||
(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
|
||||
SIEM auditability.
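
The mapping and prefix rule can be sketched together. `parse_actor` and the local `ConfigError` are illustrative names. Skipping the prefix check for legacy values is an assumption made here so that an old config such as `automation-agent` with `class: automation` still loads with only a warning.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


class ConfigError(ValueError):
    """Stand-in for the project's config error type."""


LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor(name: str, raw_class: str) -> ActorType:
    """Map a config `class` value to ActorType and enforce the name prefix."""
    if raw_class in LEGACY_CLASSES:
        warnings.warn(
            f"actor class {raw_class!r} is deprecated; use adm, agt or atm",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: legacy entries are exempt from the prefix rule.
        return LEGACY_CLASSES[raw_class]
    try:
        actor_type = ActorType(raw_class)
    except ValueError:
        raise ConfigError(f"unknown actor class {raw_class!r}") from None
    if not name.startswith(f"{actor_type.value}-"):
        raise ConfigError(f"actor {name!r} must start with {actor_type.value}-")
    return actor_type
```

For example, `parse_actor("agt-state-hub-bridge", "agt")` returns `ActorType.AGT`, while `parse_actor("state-hub-bridge", "agt")` raises `ConfigError`.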
|
||||
|
||||
---
|
||||
|
||||
## Tasks
|
||||
|
||||
### T1 — ActorType enum
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T1
|
||||
state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
|
||||
- [x] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
|
||||
`ActorType.ATM` with a `DeprecationWarning`; reject unknown values
|
||||
- [x] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
|
||||
`atm-*` for ATM; raise `ConfigError` on mismatch
|
||||
- [x] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
|
||||
- [x] Update tests
|
||||
|
||||
### T2 — cert_command config field
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T2
|
||||
state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
|
||||
- [x] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
|
||||
content (shell-level freedom intentional)
|
||||
- [x] Document in config example / SCOPE.md
|
||||
|
||||
### T3 — Cert acquisition in manager
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T3
|
||||
state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
|
||||
- If `cfg.cert_command` is None: return None (static key mode)
|
||||
- Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
|
||||
- Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
|
||||
- Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
|
||||
- [x] `build_ssh_command`: accept optional `cert_path`; when set, insert
|
||||
`-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
|
||||
- [x] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
|
||||
so every reconnect gets a fresh cert
|
||||
|
||||
### T4 — cert_identity in audit log
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T4
|
||||
state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
|
||||
extract `Key ID` (the `-I` value from signing time)
|
||||
- [x] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
|
||||
JSON entry when present
|
||||
- [x] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
|
||||
- [x] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
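
A sketch of the identity extraction and the resulting audit entry, assuming `ssh-keygen -L` prints the identity on a line of the form `Key ID: "<identity>"`. The helper names and the JSON field names other than `cert_identity` are hypothetical and do not describe the real `AuditLogger` schema.

```python
import json
import re
from typing import Optional

# Matches e.g. '        Key ID: "agt-state-hub-bridge"'
KEY_ID_RE = re.compile(r'^\s*Key ID: "(.*)"\s*$', re.MULTILINE)


def parse_key_id(keygen_output: str) -> Optional[str]:
    """Return the certificate identity (the -I value), or None if absent."""
    match = KEY_ID_RE.search(keygen_output)
    return match.group(1) if match else None


def audit_entry(event: str, tunnel: str, cert_identity: Optional[str] = None) -> str:
    """Build one JSON audit line; cert_identity is included only when present."""
    entry = {"event": event, "tunnel": tunnel}
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return json.dumps(entry, sort_keys=True)
```

Omitting the key entirely in static key mode, instead of writing `null`, lets a SIEM query distinguish certificate logins by field presence alone.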
|
||||
|
||||
### T5 — TTL-aware cert refresh
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T5
|
||||
state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp
|
||||
from `ssh-keygen -L` output → `cert_expires_at: datetime`
|
||||
- [x] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
|
||||
on each iteration
|
||||
- [x] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
|
||||
reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
|
||||
next iteration)
|
||||
- [x] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
|
||||
`AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
|
||||
- [x] If `cert_command` is absent, skip all TTL logic entirely
|
||||
|
||||
### T6 — `bridge cert-status` command
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T6
|
||||
state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
- [x] `cli.py`: add `cert-status [TUNNEL]` subcommand
|
||||
- [x] For each tunnel (or the named one): read cert file from state dir if present,
|
||||
run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
|
||||
time-to-expiry (or "static key / no cert" if absent)
|
||||
- [x] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
|
||||
- [x] `--json` flag for machine-readable output
|
||||
|
||||
### T7 — CertAcquisitionError handling
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T7
|
||||
state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] New exception `CertAcquisitionError` in `models.py`
|
||||
- [x] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
|
||||
with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
|
||||
(cert failures are transient — e.g., Vault briefly unreachable)
|
||||
- [x] After `max_attempts` consecutive cert failures, transition to `FAILED` state
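
The retry policy above can be sketched as a loop that counts consecutive failures. The exponential delay, the default values, and the `"CONNECTED"` return label are assumptions for illustration; the text only says "normal backoff" and names the `FAILED` state.

```python
from typing import Callable


class CertAcquisitionError(RuntimeError):
    """Stand-in for the exception defined in models.py."""


def run_with_retries(
    acquire: Callable[[], None],
    max_attempts: int,
    sleep: Callable[[float], None],
    base_delay: float = 1.0,   # assumed starting delay in seconds
    max_delay: float = 60.0,   # assumed cap on the delay
) -> str:
    """Return "CONNECTED" on success or "FAILED" after max_attempts failures."""
    failures = 0
    while failures < max_attempts:
        try:
            acquire()
            return "CONNECTED"
        except CertAcquisitionError:
            failures += 1
            if failures < max_attempts:
                # Double the delay after each consecutive failure, up to the cap.
                sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
    return "FAILED"
```

Injecting `sleep` makes the loop testable without waiting; production code would pass `time.sleep`.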
|
||||
|
||||
### T8 — SCOPE.md and documentation updates
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T8
|
||||
state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
|
||||
status: done
|
||||
priority: medium
|
||||
```
|
||||
|
||||
- [x] Update `SCOPE.md`: Current State updated to reflect completion; directive alignment done
|
||||
- [x] `wiki/OpsBridgeFrs.md` §5.7 already covers actor attribution abstractly — no changes needed
|
||||
- [x] `.claude/rules/architecture.md` already documents cert_command mode and actor vocab
|
||||
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency (deferred)
|
||||
|
||||
### T9 — Tests
|
||||
|
||||
```task
|
||||
id: BRIDGE-WP-0004-T9
|
||||
state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
|
||||
status: done
|
||||
priority: high
|
||||
```
|
||||
|
||||
- [x] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
|
||||
cert_command parse
|
||||
- [x] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
|
||||
args; verify `CertAcquisitionError` on non-zero exit; TTL logic helpers
|
||||
- [x] `test_audit.py`: `cert_identity` field; actor_type rename
|
||||
- [x] `test_cli.py`: `cert-status` exit codes; JSON output shape
|
||||
- [x] 233 tests, 0 failures
|
||||
|
||||
---
|
||||
|
||||
## Config Schema — Before / After
|
||||
|
||||
### Before
|
||||
```yaml
|
||||
tunnels:
|
||||
state-hub-coulombcore:
|
||||
host: coulombcore
|
||||
remote_port: 8001
|
||||
local_port: 8000
|
||||
ssh_user: ops-agent
|
||||
ssh_key: ~/.ssh/id_ed25519
|
||||
actor: automation-agent
|
||||
|
||||
actors:
|
||||
automation-agent:
|
||||
class: automation
|
||||
description: "state hub bridge agent"
|
||||
```
|
||||
|
||||
### After (static key mode — unchanged behavior)
|
||||
```yaml
|
||||
tunnels:
|
||||
state-hub-coulombcore:
|
||||
host: coulombcore
|
||||
remote_port: 8001
|
||||
local_port: 8000
|
||||
ssh_user: agt-state-hub-bridge
|
||||
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
|
||||
actor: agt-state-hub-bridge
|
||||
|
||||
actors:
|
||||
agt-state-hub-bridge:
|
||||
class: agt
|
||||
description: "state hub bridge agent"
|
||||
```
|
||||
|
||||
### After (cert_command mode — ops-warden or any CA)
|
||||
```yaml
|
||||
tunnels:
|
||||
state-hub-coulombcore:
|
||||
host: coulombcore
|
||||
remote_port: 8001
|
||||
local_port: 8000
|
||||
ssh_user: agt-state-hub-bridge
|
||||
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
|
||||
actor: agt-state-hub-bridge
|
||||
cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
|
||||
|
||||
actors:
|
||||
agt-state-hub-bridge:
|
||||
class: agt
|
||||
description: "state hub bridge agent"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Acceptance Criteria
|
||||
|
||||
- [x] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
|
||||
warning only); tunnel behaves identically
|
||||
- [x] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
|
||||
- [x] Config with `cert_command` set: SSH process launched with both `-i key` and
|
||||
`-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
|
||||
- [x] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
|
||||
no TTL logic runs
|
||||
- [x] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
|
||||
logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
|
||||
- [x] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
|
||||
- [x] `bridge cert-status` shows valid cert info; exits 1 on expired cert
|
||||
- [x] All tests pass: `uv run pytest` (233 passed)
|
||||
- [x] All lints pass: `uv run ruff check .`
|
||||
194
workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md
Normal file
@@ -0,0 +1,194 @@
|
||||
---
|
||||
id: BRIDGE-WP-0005
|
||||
type: workplan
|
||||
title: "Restart includes remote cleanup (blank-slate recovery)"
|
||||
domain: infotech
|
||||
repo: ops-bridge
|
||||
status: finished
|
||||
owner: codex
|
||||
topic_slug: custodian
|
||||
created: "2026-06-21"
|
||||
updated: "2026-06-21"
|
||||
state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0"
|
||||
---
|
||||
|
||||
# BRIDGE-WP-0005 — Restart includes remote cleanup
|
||||
|
||||
**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan
|
||||
`sshd` remote forward on Railiance01 port `18000` blocked
|
||||
`bridge restart state-hub-railiance01` from producing a working tunnel. Operators
|
||||
had to discover `bridge maintenance cleanup <tunnel> --restart` separately.
|
||||
|
||||
**Operator expectation:** `bridge restart` should mean *operational again* — a
|
||||
blank-slate recovery — not merely "cycle the local manager PID while a broken
|
||||
remote listener still holds the port."
|
||||
|
||||
## Topology and failure modes (refined)
|
||||
|
||||
Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles.
|
||||
Cleanup policy must respect all of them.
|
||||
|
||||
### A. Workstation (laptop WSL) — tunnel **origin**
|
||||
|
||||
The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on
|
||||
remote hosts:
|
||||
|
||||
| Remote host | Tunnels (reverse) | Role |
|-------------|-------------------|------|
| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS: stable, occasional maintenance reboot |
| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS: stable, occasional maintenance reboot |
| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder: may sleep/reboot when moved |
|
||||
|
||||
**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill
|
||||
local bridge processes without graceful remote SSH teardown. Orphan `sshd`
|
||||
listeners on **all three remotes** are common after wake — especially
|
||||
`18000`/`18001` on VPS hosts that activity-core and remote agents depend on.
|
||||
|
||||
### B. Haskelseed — also intermittently offline
|
||||
|
||||
Haskelseed is not a datacenter VPS; it may be powered down or unreachable on
|
||||
different networks. The same orphan-forward pattern applies to its reverse ports
|
||||
when the workstation-side tunnel dies uncleanly.
|
||||
|
||||
### C. VPS remotes (coulombcore, railiance01)
|
||||
|
||||
Normally always-on. Maintenance reboots clear remote kernel state, but:
|
||||
|
||||
- a VPS reboot does **not** fix a workstation that is still in `reconnecting`
|
||||
with a dead local SSH child;
|
||||
- when the laptop returns, orphan forwards from the **previous** session may
|
||||
still block new `-R` binds if the VPS did not reboot.
|
||||
|
||||
**Conclusion:** conditional remote cleanup before restart benefits **all reverse
|
||||
tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already
|
||||
skips healthy forwards — VPS tunnels with live working forwards are untouched.
|
||||
|
||||
### D. Local-direction tunnels — no remote cleanup
|
||||
|
||||
`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use
|
||||
forward mode from workstation to remote services. They do not bind remote reverse
|
||||
ports for State Hub. **`restart` stays local stop/start only** for these.
|
||||
|
||||
## Design (decided)
|
||||
|
||||
| Command | Behaviour after this workplan |
|---------|-------------------------------|
| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)`, which runs `should_cleanup_tunnel`, clears a stale remote listener if needed, then starts the tunnel. For **local** tunnels: existing `stop()` + `start()`. |
| `bridge maintenance cleanup` | Unchanged: proactive hygiene cron / manual sweep without implying user-facing "restart". |
| `bridge up` | Out of scope here (see T4 optional follow-up). |
|
||||
|
||||
Implementation sketch: replace the body of `cli.restart()` with a call to
|
||||
`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels,
|
||||
or per-tunnel `cleanup_tunnel` when a single tunnel is named.
|
||||
|
||||
Emit the same action summary strings cleanup already uses (`healthy`,
|
||||
`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran.
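The dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code in `bridge/cli.py`: `TunnelConfig`, the shape of `CleanupAction`, the injected `cleanup_tunnel` / `stop_start` callables, and the `"restarted"` label for local tunnels are all assumptions made so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class TunnelConfig:
    """Hypothetical stand-in for one entry in tunnels.yaml."""
    name: str
    direction: str  # "reverse" or "local"


@dataclass
class CleanupAction:
    """Hypothetical result record; only the `action` field is named in the workplan."""
    tunnel: str
    action: str  # e.g. "healthy", "cleaned_and_restarted", "error"


def restart_tunnels(
    tunnels: Iterable[TunnelConfig],
    cleanup_tunnel: Callable[[TunnelConfig], CleanupAction],
    stop_start: Callable[[TunnelConfig], None],
) -> list[CleanupAction]:
    """Send reverse tunnels through cleanup; keep local tunnels on stop/start."""
    results = []
    for cfg in tunnels:
        if cfg.direction == "reverse":
            # Cleanup decides on its own whether the remote forward is stale.
            results.append(cleanup_tunnel(cfg))
        else:
            stop_start(cfg)
            # "restarted" is an illustrative label, not one of the cleanup strings.
            results.append(CleanupAction(cfg.name, "restarted"))
    return results
```

Passing the two behaviours in as callables keeps the routing decision separate from SSH and process handling, which is also what makes it straightforward to mock in tests.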
|
||||
|
||||
## Out of scope
|
||||
|
||||
- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false
|
||||
positive during T2).
|
||||
- Auto-cleanup inside the reconnect backoff loop (stretch — T4).
|
||||
- Renaming tunnels or changing `tunnels.yaml` host entries.
|
||||
|
||||
---
|
||||
|
||||
## T1 — Wire restart through cleanup path
|
||||
|
||||
```task
id: BRIDGE-WP-0005-T01
status: done
priority: high
state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
```
|
||||
|
||||
Refactor `bridge/cli.py` `restart()` so reverse tunnels call
|
||||
`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare
|
||||
`TunnelManager.stop()` + `start()`.
|
||||
|
||||
Requirements:
|
||||
|
||||
- Single-tunnel and all-tunnel restart both work.
|
||||
- Local-direction tunnels keep stop/start only.
|
||||
- Exit codes: preserve today’s semantics where practical; exit non-zero if any
|
||||
named tunnel ends in `CleanupAction.action == "error"`.
|
||||
- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`,
|
||||
etc.), not only "Restarted tunnel".
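The exit-code and stdout requirements could be met with something like the sketch below. It assumes results arrive as a mapping from tunnel name to action string and that `1` is the non-zero code; neither detail is fixed by this workplan, and `summarize` is a hypothetical helper name.

```python
def summarize(results: dict[str, str]) -> tuple[list[str], int]:
    """Build one stdout line per tunnel and derive the process exit code.

    `results` maps tunnel name -> action string ("healthy",
    "cleaned_and_restarted", "error", ...). The exit code is non-zero
    as soon as any tunnel ended in "error".
    """
    lines = [f"{name}: {action}" for name, action in results.items()]
    exit_code = 1 if any(action == "error" for action in results.values()) else 0
    return lines, exit_code


if __name__ == "__main__":
    lines, code = summarize(
        {"state-hub-railiance01": "cleaned_and_restarted", "state-hub-coulombcore": "healthy"}
    )
    print("\n".join(lines))
    raise SystemExit(code)
```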
|
||||
|
||||
## T2 — Tests and regression coverage
|
||||
|
||||
```task
id: BRIDGE-WP-0005-T02
status: done
priority: high
state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
```
|
||||
|
||||
Update `tests/test_cli.py`:
|
||||
|
||||
- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for
|
||||
reverse tunnels.
|
||||
- Add cases: healthy forward (no remote kill), stale forward (remote cleanup
|
||||
invoked), local-direction tunnel (no cleanup call).
|
||||
- Reuse mocks from `tests/test_cleanup.py` patterns.
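As an illustration of the test shape, the sketch below checks the two routing cases with `unittest.mock`. The `restart` function here is a simplified stand-in defined inside the example, not the real `cli.restart()`, and its argument list is an assumption.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def restart(tunnels, cleanup_tunnel, manager):
    """Simplified stand-in for the CLI command under test."""
    for cfg in tunnels:
        if cfg.direction == "reverse":
            cleanup_tunnel(cfg, restart=True)
        else:
            manager.stop(cfg.name)
            manager.start(cfg.name)


def test_reverse_tunnel_delegates_to_cleanup():
    cleanup, manager = MagicMock(), MagicMock()
    cfg = SimpleNamespace(name="state-hub-railiance01", direction="reverse")
    restart([cfg], cleanup, manager)
    cleanup.assert_called_once_with(cfg, restart=True)
    manager.stop.assert_not_called()
    manager.start.assert_not_called()


def test_local_tunnel_skips_cleanup():
    cleanup, manager = MagicMock(), MagicMock()
    cfg = SimpleNamespace(name="k3s-api-coulombcore", direction="local")
    restart([cfg], cleanup, manager)
    cleanup.assert_not_called()
    manager.stop.assert_called_once_with("k3s-api-coulombcore")
    manager.start.assert_called_once_with("k3s-api-coulombcore")
```

The healthy-forward and stale-forward cases would additionally need a mocked remote probe, which depends on how `tests/test_cleanup.py` already fakes the SSH layer.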
|
||||
|
||||
`make test` and `make lint` pass.
|
||||
|
||||
## T3 — Operator docs and CLI help
|
||||
|
||||
```task
id: BRIDGE-WP-0005-T03
status: done
priority: medium
state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
```
|
||||
|
||||
Document the blank-slate restart contract:
|
||||
|
||||
- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down.
|
||||
- `bridge restart --help` — mention conditional remote stale-forward cleanup.
|
||||
- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS
|
||||
maintenance — matching this workplan's topology section.
|
||||
- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md`
|
||||
incident note (one line each way).
|
||||
|
||||
## T4 — Optional: reconnect-loop hygiene (stretch)
|
||||
|
||||
```task
id: BRIDGE-WP-0005-T04
status: cancel
priority: low
state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
```
|
||||
|
||||
Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup
|
||||
once after repeated exit-255 bind failures (laptop wake without operator running
|
||||
`bridge restart`). Defer until T1–T3 are done; mark `cancel` if heuristic risk
|
||||
outweighs benefit.
|
||||
|
||||
**Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect
|
||||
loop risks killing a forward that looks orphaned but is legitimately owned by another session
|
||||
or operator. `bridge restart` now covers the operator-facing blank-slate path;
|
||||
nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if
|
||||
wake-from-sleep reconnect failures remain frequent after a month of observation.
|
||||
|
||||
## T5 — Live verification on workstation + VPS
|
||||
|
||||
```task
id: BRIDGE-WP-0005-T05
status: done
priority: medium
state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
```
|
||||
|
||||
After T1–T2 ship, verify on real config:
|
||||
|
||||
1. **railiance01** — `state-hub-mcp-railiance01` was `reconnecting` with stale
|
||||
forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached
|
||||
`connected`.
|
||||
2. **haskelseed** — not exercised (all tunnels already healthy); Alpine netstat
|
||||
path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
|
||||
3. **coulombcore** — `bridge restart state-hub-coulombcore` reported `healthy`,
|
||||
PID unchanged (4116), forward undisturbed.
|
||||
|
||||
State Hub progress logged (2026-06-21). Workplan marked `finished`.
|
||||
@@ -2,7 +2,7 @@
|
||||
id: OPS-WP-0001
|
||||
type: workplan
|
||||
title: "ops-bridge diagnostics and flow improvements"
|
||||
-domain: custodian
+domain: infotech
|
||||
repo: ops-bridge
|
||||
status: done
|
||||
owner: claude
|
||||
|
||||
221
workplans/OPS-WP-0002-agent-usability.md
Normal file
@@ -0,0 +1,221 @@
|
||||
---
id: OPS-WP-0002
type: workplan
title: "Agent Usability — MCP Registration, Skill, and Worker Orientation"
domain: infotech
repo: ops-bridge
status: done
owner: custodian
topic_slug: custodian
created: "2026-03-21"
updated: "2026-03-26"
depends_on: OPS-WP-0001
state_hub_workstream_id: "c195cc40-8be7-462e-be26-a7d6bda34cd5"
---
|
||||
|
||||
# OPS-WP-0002 — Agent Usability: MCP Registration, Skill, and Worker Orientation
|
||||
|
||||
## Problem
|
||||
|
||||
The ops-bridge MCP server (`src/bridge/mcp_server/server.py`) is fully
|
||||
implemented with tools for `bridge_up/down/restart/status/check/logs` and
|
||||
catalog operations. But no agent can use it because:
|
||||
|
||||
1. **Not registered** — the server isn't in `~/.claude.json` and has no
|
||||
persistent transport mode. It only runs on stdio today.
|
||||
2. **No slash command** — agents working ad-hoc (not via MCP) have no
|
||||
quick way to check or restore tunnels.
|
||||
3. **No worker orientation** — agents on remote machines (CoulombCore,
|
||||
Railiance) don't know that bridge is available or how to use it when
|
||||
their state-hub connection drops.
|
||||
|
||||
## Goal
|
||||
|
||||
Any agent — on the workstation or a remote machine — can:
|
||||
- Check tunnel health in one call
|
||||
- Bring up a dropped tunnel without manual intervention
|
||||
- Recover the state-hub connection if it goes down mid-session
|
||||
|
||||
## Design
|
||||
|
||||
### MCP server (workstation, persistent)
|
||||
|
||||
Run as an SSE service on port 8002 (same pattern as state-hub on 8001).
|
||||
Registered at user scope in `~/.claude.json` so it's available to all
|
||||
Claude Code sessions.
|
||||
|
||||
FastMCP already supports the SSE transport; the entry point only needs to
accept an `--http` flag (or read a `BRIDGE_MCP_PORT` env var) and pass the
chosen transport to `mcp.run()`.
|
||||
|
||||
### Slash command skill (all machines)
|
||||
|
||||
A `/bridge` skill at `~/.claude/commands/bridge.md` (global scope) that:
|
||||
- Reads `bridge status` output
|
||||
- Surfaces any tunnel that is down or stale
|
||||
- Offers to bring it up
|
||||
- Useful on machines that don't have the MCP server registered
|
||||
|
||||
### Worker agent orientation (remote machines)
|
||||
|
||||
Update `CLAUDE.md` (global) and `ops-bridge` session protocol to tell
|
||||
worker agents:
|
||||
- Check `bridge status` at session start when on a machine with
|
||||
ops-bridge installed
|
||||
- If state-hub tunnel is down: run `bridge up state-hub-<machine>` to
|
||||
restore it before making any state-hub API calls
|
||||
- If no bridge command: fall back to direct API URL if reachable
|
||||
|
||||
---
|
||||
|
||||
## Tasks
|
||||
|
||||
### T01 — SSE transport mode for MCP server
|
||||
|
||||
```task
id: OPS-WP-0002-T01
status: done
priority: high
state_hub_task_id: "27fc6fa1-6d0e-438a-b4a3-c6091931da88"
```
|
||||
|
||||
Add `--http` flag and `BRIDGE_MCP_PORT` env var to `server.py` entry
|
||||
point. When `--http` is set, run `mcp.run(transport="sse", port=PORT)`
|
||||
instead of stdio.
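One way the entry point might resolve its transport is sketched below. `resolve_transport` and the returned dict shape are hypothetical, and the final hand-off to `mcp.run()` is deliberately left out because how the port is passed depends on the FastMCP version in use.

```python
import argparse
import os

DEFAULT_PORT = 8002  # default port named in this workplan


def resolve_transport(argv, env):
    """Decide between stdio and SSE from CLI args and environment.

    `argv` is the argument list without the program name; `env` is a
    mapping such as os.environ. Returns a small dict that the caller can
    translate into the actual server start call.
    """
    parser = argparse.ArgumentParser(prog="bridge-mcp")
    parser.add_argument("--http", action="store_true",
                        help="serve over SSE instead of stdio")
    args = parser.parse_args(argv)
    if not args.http:
        return {"transport": "stdio"}
    # Fall back to the default when the variable is unset or empty.
    port = int(env.get("BRIDGE_MCP_PORT") or DEFAULT_PORT)
    return {"transport": "sse", "port": port}


if __name__ == "__main__":
    import sys
    print(resolve_transport(sys.argv[1:], os.environ))
```

Keeping the decision in a pure function means the stdio default can be covered by `make test` without starting a server.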
|
||||
|
||||
Add `make mcp-http` target to `Makefile`:
|
||||
```makefile
mcp-http: ## Start MCP server in SSE mode (default port 8002)
	BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http
```
|
||||
|
||||
Add `make mcp-stop` target that kills any running MCP server on port
|
||||
8002.
|
||||
|
||||
Gate: `bridge_status()` tool callable via SSE on localhost:8002 after
|
||||
`make mcp-http`.
|
||||
|
||||
---
|
||||
|
||||
### T02 — Register MCP server in ~/.claude.json
|
||||
|
||||
```task
id: OPS-WP-0002-T02
status: done
priority: high
state_hub_task_id: "2216457d-035e-4804-b685-18975f3c6d1f"
```
|
||||
|
||||
Register the ops-bridge MCP server at user scope:
|
||||
```bash
claude mcp add-json -s user ops-bridge \
  '{"type":"sse","url":"http://127.0.0.1:8002/sse"}'
```
|
||||
|
||||
Document in `ops-bridge` CLAUDE.md:
|
||||
```
To start the MCP server:
  cd ~/ops-bridge && make mcp-http

To verify registration:
  python3 -c "import json,os; d=json.load(open(os.path.expanduser('~/.claude.json'))); print(list(d.get('mcpServers',{}).keys()))"
```
|
||||
|
||||
Update global `~/.claude/CLAUDE.md` to list `ops-bridge` MCP server
|
||||
alongside `state-hub`.
|
||||
|
||||
Gate: `ops-bridge` appears in Claude Code MCP tool list after `make
|
||||
mcp-http`.
|
||||
|
||||
---
|
||||
|
||||
### T03 — `/bridge` slash command skill
|
||||
|
||||
```task
id: OPS-WP-0002-T03
status: done
priority: medium
state_hub_task_id: "4b2e39eb-4585-4e60-ab16-9e7909eced74"
```
|
||||
|
||||
Create `~/.claude/commands/bridge.md` — a global Claude Code skill for
|
||||
tunnel management.
|
||||
|
||||
**Behaviour:**
|
||||
1. Run `bridge status` and parse output
|
||||
2. Report each tunnel: name, state, LIVE column
|
||||
3. For any tunnel that is `stopped`, `reconnecting`, or `[STALE]`:
|
||||
- Offer to run `bridge up <tunnel-name>`
|
||||
- After `bridge up`, re-check with `bridge check <tunnel-name>`
|
||||
4. If all tunnels are `connected` and LIVE: report green and exit
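The decision logic in these steps can be sketched as below. It works on already-parsed rows because the exact `bridge status` output format is not specified here; `TunnelRow`, its `stale` flag, and both function names are assumptions for illustration only.

```python
from dataclasses import dataclass

# States that the behaviour list treats as needing recovery.
NEEDS_RECOVERY_STATES = {"stopped", "reconnecting"}


@dataclass
class TunnelRow:
    """Hypothetical parsed row of `bridge status` output."""
    name: str
    state: str
    stale: bool = False  # True when the row carries the [STALE] marker


def tunnels_to_recover(rows, only=None):
    """Return names of tunnels that are stopped, reconnecting, or stale.

    `only` optionally restricts the check to a single tunnel name.
    """
    selected = [r for r in rows if only is None or r.name == only]
    return [r.name for r in selected
            if r.state in NEEDS_RECOVERY_STATES or r.stale]


def recovery_commands(rows, only=None):
    """List the commands to offer: bring each tunnel up, then re-check it."""
    commands = []
    for name in tunnels_to_recover(rows, only):
        commands.append(f"bridge up {name}")
        commands.append(f"bridge check {name}")
    return commands
```

An empty result from `tunnels_to_recover` corresponds to step 4: everything is connected and live, so the skill reports green and exits.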
|
||||
|
||||
**Skill definition:**
|
||||
```yaml
---
description: >
  Check ops-bridge tunnel health and restore any dropped tunnels.
  Reports status of all configured tunnels and offers to bring up
  any that are stopped or stale.
argument-hint: "[tunnel-name]"
allowed-tools:
  - Bash(bridge status)
  - Bash(bridge up*)
  - Bash(bridge down*)
  - Bash(bridge check*)
  - Bash(bridge logs*)
---
```
|
||||
|
||||
If an optional tunnel name is passed as `$ARGUMENTS`, scope all
|
||||
operations to that tunnel only.
|
||||
|
||||
Gate: `/bridge` skill runs cleanly when all tunnels are up; correctly
|
||||
identifies and recovers a manually-stopped tunnel.
|
||||
|
||||
---
|
||||
|
||||
### T04 — Worker agent orientation in CLAUDE.md
|
||||
|
||||
```task
id: OPS-WP-0002-T04
status: done
priority: medium
state_hub_task_id: "cc64bb07-ea5d-498a-8c14-bb653581efe7"
```
|
||||
|
||||
Update global `~/.claude/CLAUDE.md` — add a **Worker Agent — Bridge
|
||||
Protocol** section:
|
||||
|
||||
```markdown
|
||||
## Worker Agent — Bridge Protocol
|
||||
|
||||
When working on a remote machine (CoulombCore, Railiance nodes):
|
||||
|
||||
1. At session start, check if `bridge` is installed:
|
||||
`which bridge && bridge status`
|
||||
2. If state-hub tunnel is down: `bridge up state-hub-<machine-slug>`
|
||||
Wait for state `connected` before making state-hub API calls.
|
||||
3. If `bridge` is not installed, check if the state-hub API is directly
|
||||
reachable: `curl -s http://127.0.0.1:8000/state/health`
|
||||
4. Only proceed without state-hub if absolutely necessary — log a
|
||||
progress note about the outage when connectivity is restored.
|
||||
```
|
||||
|
||||
Also add a one-liner reminder to the ops-bridge session protocol in
|
||||
`.claude/rules/session-protocol.md`:
|
||||
> At session start: `bridge status` — bring up any stopped tunnels
|
||||
> before accessing remote services.
|
||||
|
||||
Gate: `~/.claude/CLAUDE.md` contains the Worker Agent section; ops-bridge
|
||||
session protocol references bridge status check.
|
||||
|
||||
---
|
||||
|
||||
## Done Criteria
|
||||
|
||||
- [x] `make mcp-http` starts the MCP server on port 8002 (SSE)
|
||||
- [x] `bridge_status` and `bridge_check` callable as MCP tools from Claude Code
|
||||
- [x] `ops-bridge` registered in `~/.claude.json` at user scope
|
||||
- [x] `/bridge` skill surfaces tunnel states and recovers a stopped tunnel
|
||||
- [x] Global CLAUDE.md has worker agent bridge protocol
|
||||
- [x] All existing tests pass after T01 changes (`make test`)
|
||||