chore(consistency): sync task status from DB [auto]

Updated by fix-consistency on 2026-07-03:
- update .custodian-brief.md for ops-bridge
tunnels: optional remote_host forward destination (default 127.0.0.1)
2026-07-03 18:52:51 +02:00 · 2026-07-02 14:18:18 +02:00 · 2026-06-22 23:16:27 +02:00 · 2026-06-22 11:40:44 +02:00 · 2026-06-22 03:06:02 +02:00 · 2026-06-22 02:44:47 +02:00
52 changed files with 3295 additions and 275 deletions
--- a/.claude/rules/agents.md
+++ b/.claude/rules/agents.md
@@ -0,0 +1,20 @@
+## Kaizen Agents
+
+Specialized agent personas available on demand via the state-hub MCP.
+
+**Discover:** `list_kaizen_agents()` — returns all agents with name, description, category
+**Load:** `get_kaizen_agent("tdd-workflow")` — returns full instructions; read and follow them
+
+Common agents:
+
+| Agent | Category | When to use |
+|-------|----------|-------------|
+| `tdd-workflow` | testing | Step-by-step TDD8 workflow for any feature |
+| `code-refactoring` | quality | Code quality analysis and safe refactoring |
+| `test-maintenance` | testing | Diagnose and fix failing tests |
+| `requirements-engineering` | process | Prevent interface/mock mismatches upfront |
+| `keepaTodofile` | process | Maintain TODO.md during work |
+| `project-management` | process | Track status, determine next steps |
+| `datamodel-optimization` | quality | Optimize dataclasses and data structures |
+
+All 17 agents: call `list_kaizen_agents()` for the full list.
--- a/.claude/rules/architecture.md
+++ b/.claude/rules/architecture.md
@@ -1,31 +1,8 @@
 ## Architecture

-OpsBridge has two logical components:
-
-**1. OpsBridge — tunnel lifecycle manager** (this repo)
-Manages named SSH reverse tunnels defined in `~/.config/bridge/tunnels.yaml`.
-Each tunnel runs in a subprocess with a reconnect backoff loop; PIDs are tracked
-in `~/.local/state/bridge/`. Bridge states: `stopped → starting → connected →
-degraded → failed`. The `degraded` state means SSH is up but the optional HTTP
-health check is failing.
-
-**2. OpsCatalog — operations knowledge repository** (planned extension)
-A Git-backed YAML catalog of operations domains, targets, bridges, and actor
-classes. OpsBridge consumes this catalog to resolve bridge identifiers and
-orient operators. Schema examples are in `wiki/OpsCatalogSpecification.md`.
-The catalog layout follows: `opscatalog/domains/<domain>/{domain.yaml,
-targets/, bridges/, docs/}`.
-
-Key design constraints:
-- OpsBridge owns lifecycle management only; it does not own identity/credentials
-- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
-  in config, CLI args, and log filenames must stay consistent
-- Actor attribution (human operator vs. automation agent) is tracked per bridge
-  for audit log traceability (FRS §5.7)
-
-Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
-(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
+<!-- TODO: Describe the key design decisions and component structure.
+     Key modules, data flows, external integrations, state machines, etc. -->

 ## Quick Reference

-`~/the-custodian/state-hub/mcp_server/TOOLS.md`
+`~/state-hub/mcp_server/TOOLS.md` — MCP tool reference
--- a/.claude/rules/credential-routing.md
+++ b/.claude/rules/credential-routing.md
@@ -0,0 +1,50 @@
+# Credential and access routing
+
+**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
+for inference. Run this check **before** requesting secrets, API keys, SSH access,
+login tokens, or database passwords — in any repo, not only `ops-warden`.
+
+ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
+other credential need belongs to another subsystem. **Do not** message
+`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
+
+### Lookup (do this first)
+
+```bash
+warden route find "<describe your need>" --json
+warden route show <catalog-id> --json
+```
+
+Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
+
+| Agent runtime | How to orient |
+| --- | --- |
+| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
+| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
+| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
+
+### Quick routing table
+
+| I need… | Owner | ops-warden executes? |
+| --- | --- | --- |
+| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
+| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
+| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
+| Authorization decision | flex-auth | No — route only |
+| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
+| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
+
+### Anti-patterns (do not do these)
+
+- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
+- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
+- Pasting secrets into Git, State Hub, workplans, logs, or chat
+
+### Other capabilities (reuse-surface)
+
+Non-credential capabilities are usually discovered through **reuse-surface** federation
+(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
+every repo's agent instructions because it is high-frequency, high-risk, and easy to
+get wrong.
+
+**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
--- a/.claude/rules/first-session.md
+++ b/.claude/rules/first-session.md
@@ -0,0 +1,38 @@
+## First Session Protocol
+
+Triggered when `get_domain_summary("infotech")` shows **no workstreams**.
+The project is registered but work has not yet been structured.
+
+**Step 1 — Read, don't write**
+- `~/the-custodian/canon/projects/infotech/project_charter_v0.1.md` — purpose, scope
+- `~/the-custodian/canon/projects/infotech/roadmap_v0.1.md` — planned phases
+- Scan repo root: README, directory structure, existing code or docs
+
+**Step 2 — Survey in-progress work**
+Look for TODOs, open branches, half-finished files. Note done vs. started but incomplete.
+
+**Step 3 — Propose workstreams to Bernd**
+Propose 1–3 workstreams — each a coherent strand, weeks to months, anchored to a
+roadmap phase. **Wait for approval before creating.**
+
+**Step 4 — Create workplan file first, then DB record (ADR-001)**
+```
+workplans/BRIDGE-WP-NNNN-<slug>.md   ← write this first
+```
+Then register in the hub:
+```
+create_workstream(topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", title="...", owner="...", description="...")
+create_task(workstream_id="<id>", title="...", priority="high|medium|low")
+```
+
+**Step 5 — Record the setup**
+```
+add_progress_event(
+    summary="First session: structured infotech into N workstreams, M tasks",
+    event_type="milestone",
+    topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a",
+    detail={"workstreams": [...], "tasks_created": M}
+)
+```
+
+<!-- Delete or archive this file once past first session -->
--- a/.claude/rules/repo-boundary.md
+++ b/.claude/rules/repo-boundary.md
@@ -1,6 +1,8 @@
 ## Repo boundary

-This repo owns **tunnel lifecycle management only**. It does not own:
-- State hub code → `the-custodian/state-hub/`
-- SSH key management → `railiance-infra/` (S1) or user dotfiles
-- Ansible/provisioning → `railiance-infra/`
+This repo owns **ops-bridge** only. It does not own:
+
+<!-- TODO: List what belongs in adjacent repos, e.g.:
+- SSH key management → railiance-infra/
+- State hub code     → state-hub/
+-->
--- a/.claude/rules/repo-identity.md
+++ b/.claude/rules/repo-identity.md
@@ -1,7 +1,5 @@
-**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution
-environments (COULOMBCORE, Railiance nodes) connected to the local Custodian
-State Hub so Claude Code sessions on those machines have full MCP connectivity.
+**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.

-**Domain:** custodian
+**Domain:** infotech
 **Repo slug:** ops-bridge
-**Repo ID:** 1bf99f56-6e94-4379-a9ea-295a4c181889
+**Topic ID:** cee7bedf-2b48-46ef-8601-006474f2ad7a
--- a/.claude/rules/session-protocol.md
+++ b/.claude/rules/session-protocol.md
@@ -1,24 +1,85 @@
-## Custodian State Hub Integration
+## Session Protocol

-State Hub: http://127.0.0.1:8000
-
-### Session Protocol
+Dev Hub (State Hub API): http://127.0.0.1:8000
+MCP server name in `~/.claude.json`: `dev-hub`

 **Step 1 — Orient**
+
+Read the offline-safe brief first — it works without a live hub connection:
+```bash
+cat .custodian-brief.md
 ```
-get_domain_summary("custodian")
+Then call the MCP tool for richer cross-domain context when MCP tools are exposed:
+```
+get_domain_summary("infotech")
+```
+If MCP tools are unavailable in the current agent session, use the REST API:
+```bash
+curl -s "http://127.0.0.1:8000/state/summary" | python3 -m json.tool
+```
+If the hub is offline: `cd ~/state-hub && make api`
+
+**Step 2 — Check inbox**
+With MCP tools:
+```
+get_messages(to_agent="ops-bridge", unread_only=True)
+```
+Mark read with `mark_message_read(message_id)`. Reply or act on coordination
+requests before proceeding.
+
+Without MCP tools:
+```bash
+curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
+  | python3 -m json.tool
+curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
+  -H "Content-Type: application/json" -d '{}'
 ```

-**Step 2 — Scan workplans**
-```
+**Step 3 — Scan workplans**
+```bash
 ls workplans/
 ```
+For each file with `status: ready`, `active`, or `blocked`, note pending
+`wait`/`todo`/`progress` tasks.

-**During work:** use `record_decision()`, `add_progress_event()`, `resolve_decision()`.
+**Step 4 — Present brief**

-**Session close:** `add_progress_event()` with workstream_id.
+1. **Active workstreams** for `infotech` — title, task counts, blocking decisions
+2. **Pending tasks** from `workplans/` + any `[repo:ops-bridge]` hub tasks
+3. **Goal guidance** — if `goal_guidance` in summary:
+   - `needs_workplan`: surface as top action — *"Repo goal '{title}' has no workplan yet"*
+   - `alignment_warnings`: flag if active work is not aligned with current goal
+4. **Suggested next action** — highest-priority open item
+5. **SBOM status** — flag if `last_sbom_at` is unset for this repo

-If workplan files were modified, run from `~/the-custodian/state-hub/`:
-```bash
-make fix-consistency REPO=ops-bridge
+If no workstreams: follow First Session Protocol (`first-session.md`).
+
+**During work:** `record_decision()` · `add_progress_event()` · `resolve_decision()`
+
+> State Hub is a *read model*. Bootstrap tools (`create_workstream`, `create_task`)
+> are First Session Protocol only. Work structure belongs in repo files (ADR-001).
+
+**Session close:**
+With MCP tools:
 ```
+add_progress_event(summary="...", topic_id="cee7bedf-2b48-46ef-8601-006474f2ad7a", workstream_id="<uuid>")
+```
+Without MCP tools:
+```bash
+curl -s -X POST http://127.0.0.1:8000/progress/ \
+  -H "Content-Type: application/json" \
+  -d '{"topic_id":"cee7bedf-2b48-46ef-8601-006474f2ad7a","workstream_id":"<uuid>","event_type":"note","summary":"what changed","author":"codex"}'
+```
+If workplan files were modified, ensure the local copy is up to date first:
+```bash
+git -C <repo_path> pull --ff-only
+cd ~/state-hub && make fix-consistency REPO=ops-bridge
+```
+For repos where implementation runs on a remote machine (e.g. CoulombCore),
+use the combined target which pulls before fixing:
+```bash
+cd ~/state-hub && make fix-consistency-remote REPO=ops-bridge
+```
+**C-15** (DB task ahead of file) is normal in multi-machine workflows — writeback
+will sync the file to match the DB. **C-16** (repo behind remote) blocks all writes
+until you pull — intentional to prevent clobbering remote progress.
--- a/.claude/rules/stack-and-commands.md
+++ b/.claude/rules/stack-and-commands.md
@@ -1,46 +1,19 @@
-## What this repo builds
-
-A CLI tool (`bridge`) that manages named SSH reverse tunnels:
-
-```
-bridge up [TUNNEL]      # start tunnel(s)
-bridge down [TUNNEL]    # stop tunnel(s)
-bridge restart [TUNNEL] # restart tunnel(s)
-bridge status           # show all tunnels: state, uptime, last health check
-bridge logs [TUNNEL]    # tail reconnect log
-```
-
-Config file: `~/.config/bridge/tunnels.yaml`
-
-Each tunnel:
-- Named (e.g. `state-hub-coulombcore`)
-- Reverse SSH port-forward: `ssh -R remote_port:127.0.0.1:local_port host`
-- Auto-reconnects on drop (backoff loop)
-- Optional HTTP health check to confirm the forwarded service is reachable
-
-PRD: `workplans/BRIDGE-WP-0001-initial-implementation.md`
-
 ## Stack

-- **Language:** Python 3.11+
-- **CLI framework:** Typer
-- **Dependencies:** typer, pyyaml, httpx
-- **Packaging:** `uv tool install` (single command install, no venv activation)
-- **No system daemons** — process management is internal, PID tracked in
-  `~/.local/state/bridge/`
+<!-- TODO: Fill in language, frameworks, and key dependencies -->
+- **Language:**
+- **Key deps:**

 ## Dev Commands

 ```bash
-# Install locally for development
-uv tool install -e .
+# TODO: Fill in the standard commands for this repo
+
+# Install dependencies

 # Run tests
-uv run pytest

-# Run a single test
-uv run pytest tests/test_tunnel.py::test_name -v
+# Lint / type check

-# Lint
-uv run ruff check .
+# Build / package (if applicable)
 ```
--- a/.claude/rules/workplan-convention.md
+++ b/.claude/rules/workplan-convention.md
@@ -1,6 +1,40 @@
-### Workplan Convention (ADR-001)
+## Workplan Convention (ADR-001)

 File location: `workplans/BRIDGE-WP-NNNN-<slug>.md`
-Prefix: `BRIDGE-WP`
+ID prefix: `BRIDGE-WP-`

-<!-- Ralph Loop rules are defined globally in ~/.claude/CLAUDE.md — do not duplicate here -->
+Work items originate as files in this repo **before** being registered in the hub.
+
+Canonical workplan/workstream frontmatter statuses are:
+`proposed`, `ready`, `active`, `blocked`, `backlog`, `finished`, `archived`.
+Use `proposed` for a newly drafted plan, `ready` after review against current
+repo state, and `finished` when implementation is complete. `stalled` and
+`needs_review` are derived health labels, not stored statuses.
+
+Closed workplans may be moved to `workplans/archived/` with a completion-date
+prefix: `YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The frontmatter id remains
+unchanged; the prefix is only for quick visual reference.
+
+Small opportunistic tasks discovered during another session use **Ad Hoc Tasks**:
+`workplans/ADHOC-YYYY-MM-DD.md`, workstream slug `adhoc-YYYY-MM-DD`, and task ids
+`ADHOC-YYYY-MM-DD-T01`, `T02`, etc. Use adhocs only for low-risk work completed
+directly. Promote anything requiring analysis, design, approval, dependencies, or
+multiple planned phases into a normal workplan.
+
+Ecosystem todos from other agents arrive as `[repo:ops-bridge]` hub tasks —
+visible at session start. Pick one up by creating the workplan file, then registering
+the workstream.
+
+Task blocks use this shape:
+
+```task
+id: BRIDGE-WP-NNNN-T01
+status: wait | todo | progress | done | cancel
+priority: high | medium | low
+state_hub_task_id: "<uuid>"         # written by fix-consistency — do not edit
+```
+
+Status progression is `todo` → `progress` → `done`; use `wait` for waiting or
+blocked work and `cancel` for stopped work.
+
+<!-- Ralph Loop rules and HEUREKA sequence: ~/.claude/CLAUDE.md — do not duplicate here -->
--- a/.codex/config.toml
+++ b/.codex/config.toml
@@ -0,0 +1,7 @@
+[mcp_servers.ops-bridge]
+command = "uv"
+args = [
+    "run",
+    "python",
+    "src/bridge/mcp_server/server.py",
+]
--- a/.custodian-brief.md
+++ b/.custodian-brief.md
@@ -0,0 +1,18 @@
+<!-- custodian-brief: generated by fix-consistency — do not edit manually -->
+# Custodian Brief — ops-bridge
+
+**Domain:** infotech  
+**Last synced:** 2026-07-03 16:52 UTC  
+**State Hub:** http://127.0.0.1:8000 *(adjust if running on a remote machine)*
+
+## Active Workstreams
+
+*(none — repo may need first-session setup)*
+
+---
+## MCP Orientation (when available)
+
+If the state-hub MCP server is reachable, call:
+`get_domain_summary("infotech")`
+This provides richer cross-domain context.
+If the MCP call fails, use this file as your orientation source.
--- a/.repo-classification.yaml
+++ b/.repo-classification.yaml
@@ -0,0 +1,26 @@
+# Repo classification (Repo Classification Standard v1.0).
+
+repo_classification:
+  standard: Repo Classification Standard
+  version: '1.0'
+  classified_at: '2026-06-22'
+  classified_by: human
+  category: tooling
+  domain: infotech
+  secondary_domains: []
+  capability_tags:
+  - operations
+  - access-control
+  - platform
+  - observability
+  - orchestration
+  business_stake:
+  - operations
+  - technology
+  - automation
+  business_mechanics:
+  - control
+  - operation
+  - adaptation
+  notes: SSH reverse-tunnel lifecycle manager keeping remote environments connected to the
+    State Hub. Operational tooling -> product.
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -0,0 +1,219 @@
+# ops-bridge — Agent Instructions
+
+## Repo Identity
+
+**Purpose:** SSH reverse tunnel lifecycle manager. Keeps remote execution environments (COULOMBCORE, Railiance nodes) connected to the local state hub. Small CLI tool: bridge up/down/status/logs per named tunnel config.
+
+**Domain:** infotech
+**Repo slug:** ops-bridge
+**Topic ID:** `cee7bedf-2b48-46ef-8601-006474f2ad7a`
+**Workplan prefix:** `BRIDGE-WP-`
+
+---
+
+## State Hub Integration
+
+The Custodian State Hub tracks work across all domains. Interact via HTTP REST —
+there is no MCP server for Codex agents.
+
+| Context | URL |
+|---------|-----|
+| Local workstation | `http://127.0.0.1:8000` |
+| Remote via tunnel | `http://127.0.0.1:18000` |
+
+### Orient at session start
+
+```bash
+# Offline brief — works without hub connection
+cat .custodian-brief.md
+
+# Active workstreams for this domain
+curl -s "http://127.0.0.1:8000/workstreams/?topic_id=cee7bedf-2b48-46ef-8601-006474f2ad7a&status=active" \
+  | python3 -m json.tool
+
+# Check inbox
+curl -s "http://127.0.0.1:8000/messages/?to_agent=ops-bridge&unread_only=true" \
+  | python3 -m json.tool
+```
+
+Mark a message read:
+```bash
+curl -s -X PATCH "http://127.0.0.1:8000/messages/<id>/read" \
+  -H "Content-Type: application/json" -d '{}'
+```
+
+### Log progress (required at session close)
+
+```bash
+curl -s -X POST http://127.0.0.1:8000/progress/ \
+  -H "Content-Type: application/json" \
+  -d '{
+    "summary": "what was done",
+    "event_type": "note",
+    "author": "codex",
+    "workstream_id": "<uuid>",
+    "task_id": "<uuid>"
+  }'
+```
+
+Omit `workstream_id` / `task_id` when not applicable.
+
+### Update task status
+
+```bash
+curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
+  -H "Content-Type: application/json" \
+  -d '{"status": "progress"}'
+# values: wait | todo | progress | done | cancel
+```
+
+### Flag a task for human review
+
+```bash
+curl -s -X PATCH "http://127.0.0.1:8000/tasks/<task_id>" \
+  -H "Content-Type: application/json" \
+  -d '{"needs_human": true, "intervention_note": "reason"}'
+```
+
+---
+
+## Session Protocol
+
+**Start:**
+1. `cat .custodian-brief.md` — domain goal and open workstreams (offline-safe)
+2. Check inbox: `GET /messages/?to_agent=ops-bridge&unread_only=true`; mark read
+3. Scan workplans: `ls workplans/` — note `status: ready`, `active`, or `blocked` files and open tasks
+4. Check human-needed tasks: `GET /tasks/?needs_human=true`
+
+**During work:**
+- Update task statuses in workplan files as tasks progress
+- Record significant decisions via `POST /decisions/`
+
+**Close:**
+1. Update workplan file task statuses to reflect progress
+2. Log: `POST /progress/` with a summary of what changed
+3. Note for the custodian operator: after workplan file changes, run from
+   `~/state-hub`:
+   ```bash
+   make fix-consistency REPO=ops-bridge
+   ```
+   This syncs task status from files into the hub DB.
+
+---
+
+## Credential and access routing
+
+**Audience:** Codex, Claude Code, Grok, and custodian agents that call **llm-connect**
+for inference. Run this check **before** requesting secrets, API keys, SSH access,
+login tokens, or database passwords — in any repo, not only `ops-warden`.
+
+ops-warden **issues SSH certificates only** (`warden sign`, `cert_command`). Every
+other credential need belongs to another subsystem. **Do not** message
+`ops-warden` on State Hub expecting a secret value; the reply is a pointer, not a key.
+
+### Lookup (do this first)
+
+```bash
+warden route find "<describe your need>" --json
+warden route show <catalog-id> --json
+```
+
+Requires the `warden` CLI from `~/ops-warden` (`uv tool install .` or `uv run warden`).
+
+| Agent runtime | How to orient |
+| --- | --- |
+| **Codex / Grok** (shell, HTTP State Hub) | `warden route` commands above; inbox `to_agent=ops-bridge` is for coordination, not secret vending |
+| **Claude Code** (MCP when available) | `get_domain_summary("custodian")` for workstreams; **still** use `warden route` for credential ownership |
+| **llm-connect** (inference service) | Never put secret retrieval in prompts; route custody to OpenBao/operator paths surfaced by `warden route` |
+
+### Quick routing table
+
+| I need… | Owner | ops-warden executes? |
+| --- | --- | --- |
+| SSH cert (`adm`/`agt`/`atm`) | ops-warden | **Yes** — `warden sign` |
+| API key, DB password, provider token | OpenBao (`railiance-platform`) | No — route only |
+| Login / OIDC / MFA | key-cape / Keycloak | No — route only |
+| Authorization decision | flex-auth | No — route only |
+| activity-core → issue-core emission | activity-core + issue-core | No — `warden route show activity-core-issue-sink` |
+| SSH tunnel | ops-bridge (+ `cert_command` from warden) | No — route only |
+
+### Anti-patterns (do not do these)
+
+- `POST /messages/` to `ops-warden` asking for `ISSUE_CORE_API_KEY`, `OPENROUTER_API_KEY`, etc.
+- Inventing `warden secret`, `warden login`, `warden bao`, `warden tunnel` — they do not exist
+- Pasting secrets into Git, State Hub, workplans, logs, or chat
+
+### Other capabilities (reuse-surface)
+
+Non-credential capabilities are usually discovered through **reuse-surface** federation
+(`reuse-surface` registry / `capability.*` indexes). Credential routing is inlined in
+every repo's agent instructions because it is high-frequency, high-risk, and easy to
+get wrong.
+
+**Canon:** `~/ops-warden/wiki/CredentialRouting.md` · catalog `~/ops-warden/registry/routing/catalog.yaml`
+
+<!-- REPO-AGENTS-EXTENSIONS -->
+<!-- Append repo-specific agent instructions below this marker.
+     The state-hub template sync preserves content after this line. -->
+
+---
+
+## Workplan Convention (ADR-001)
+
+Work items originate as files in this repo — not in the hub. The hub is a
+read/cache/index layer that rebuilds from files.
+
+**File location:** `workplans/BRIDGE-WP-NNNN-<slug>.md`
+
+**Archived location:** finished workplans may move to
+`workplans/archived/YYMMDD-BRIDGE-WP-NNNN-<slug>.md`. The `YYMMDD` prefix is
+the completion/archive date; the frontmatter `id` does not change.
+
+**Ad Hoc Tasks:** small opportunistic fixes discovered during a session use
+`workplans/ADHOC-YYYY-MM-DD.md` with task ids `ADHOC-YYYY-MM-DD-T01`, etc. Use
+this only for low-risk work completed directly; create a normal workplan for
+anything needing analysis, design, approval, dependencies, or multiple phases.
+
+**Frontmatter:**
+
+```yaml
+---
+id: BRIDGE-WP-NNNN
+type: workplan
+title: "..."
+domain: infotech
+repo: ops-bridge
+status: proposed | ready | active | blocked | backlog | finished | archived
+owner: codex
+topic_slug: ...
+created: "YYYY-MM-DD"
+updated: "YYYY-MM-DD"
+state_hub_workstream_id: "<uuid>"   # written by fix-consistency — do not edit
+---
+```
+
+Use `proposed` for a new draft, `ready` after review against current repo
+state, and `finished` after implementation. `stalled` and `needs_review` are
+derived health labels, not frontmatter statuses.
+
+**Task block format** (one per `##` section):
+
+```
+## Task Title
+
+` ` `task
+id: BRIDGE-WP-NNNN-T01
+status: wait | todo | progress | done | cancel
+priority: high | medium | low
+state_hub_task_id: "<uuid>"         # written by fix-consistency — do not edit
+` ` `
+
+Task description text.
+```
+
+Status progression: `todo` → `progress` → `done`; use `wait` for waiting/blocked work and `cancel` for stopped work.
+
+To create a new workplan:
+1. Write the file following the format above
+2. Notify the custodian operator to run `make fix-consistency REPO=ops-bridge`
+   (or send a message to the hub agent via `POST /messages/`)
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,8 +1,12 @@
 # ops-bridge — Claude Code Instructions

+@SCOPE.md
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
+@.claude/rules/first-session.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md
+@.claude/rules/credential-routing.md
+@.claude/rules/agents.md
--- a/INTENT.md
+++ b/INTENT.md
@@ -0,0 +1,92 @@
+# INTENT
+
+## Purpose
+
+This repository exists to provide a **reliable, inspectable, and controllable connectivity layer**
+between distributed dev, build, test, and execution environments, serving both human and agentic dev and ops personnel.
+
+Its role is to ensure that remote machines can **consistently and safely “phone home”** without requiring complex network infrastructure or manual intervention.
+
+---
+
+## Primary Utility
+
+The repository provides a **managed SSH reverse tunneling system** that:
+
+* Maintains continuous connectivity between remote systems and a central hub
+* Makes connectivity **observable, auditable, and controllable**
+* Exposes this capability as both a **CLI tool and an MCP-accessible service**
+
+It transforms raw SSH port-forwarding into a **first-class operational primitive**.
+
+---
+
+## Intended Users
+
+* Human operators (`adm`) managing infrastructure and connectivity
+* LLM-based agents (`agt`) requiring stable access to local services
+* Deterministic automations (`atm`) coordinating distributed workloads
+
+---
+
+## Strategic Role in the System
+
+This repository acts as the **connectivity backbone** of the custodian ecosystem:
+
+* It enables remote agents and services to participate in a **locally anchored control plane**
+* It decouples **execution location** from **control location**
+* It supports a **hub-and-spoke topology** where the Custodian State Hub remains central
+
+---
+
+## Strategic Boundaries
+
+This repository is **not** intended to:
+
+* Replace SSH as a general-purpose access mechanism
+* Act as a credential authority or security policy engine
+* Provide full network virtualization (e.g., VPN, mesh networking)
+* Host or orchestrate application workloads
+
+Its responsibility ends at **secure, observable, and managed connectivity via tunnels**.
+
+---
+
+## Design Principles
+
+* **Continuity over convenience**
+  Connectivity must persist across failures without manual recovery
+
+* **Observability as a first-class concern**
+  All lifecycle events must be traceable and attributable
+
+* **Actor-aware operations**
+  Every action is tied to a clearly defined actor type (`adm`, `agt`, `atm`)
+
+* **Pluggable security integration**
+  Works with both static keys and external certificate authorities without owning them
+
+* **Toolability**
+  All capabilities should be accessible programmatically (MCP) and operationally (CLI)
+
+---
+
+## Maturity Target
+
+A mature version of this repository should:
+
+* Provide **fully autonomous tunnel lifecycle management** across heterogeneous environments
+* Integrate seamlessly with **centralized access control and certificate systems**
+* Serve as a **standardized connectivity primitive** across all Custodian-managed systems
+* Offer **complete operational transparency** for all connectivity-related actions
+* Be robust enough to act as the **default connectivity layer** for distributed agent systems
+
+---
+
+## Stability Note
+
+Changes to this file represent a **deliberate shift in repository purpose or role** within the system architecture.
+
+Such changes should be rare and made with explicit intent.
+
+
--- a/Makefile
+++ b/Makefile
@@ -1,10 +1,31 @@
-.PHONY: test lint install
+.DEFAULT_GOAL := help

-test:
+.PHONY: help setup test lint install mcp-http mcp-stop cron-install-cron cron-uninstall-cron
+
+help: ## List available make targets
+	@awk 'BEGIN {FS = ":.*## "}; /^[a-zA-Z0-9_.-]+:.*## / {printf "  %-16s %s\n", $$1, $$2}' $(MAKEFILE_LIST)
+
+setup: ## Sync dependencies and install the bridge CLI wrapper
+	uv sync --all-groups
+	uv tool install -e . --force
+
+test: ## Run the test suite
 	uv run pytest

-lint:
+lint: ## Run ruff lint checks
 	uv run ruff check .

-install:
-	uv tool install -e .
+install: ## Install the bridge CLI wrapper
+	uv tool install -e . --force
+
+mcp-http: ## Start MCP server in SSE mode (default port 8002)
+	BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http
+
+mcp-stop: ## Stop MCP server (default port 8002)
+	@pids=$$(lsof -ti:$${BRIDGE_MCP_PORT:-8002}); if [ -n "$$pids" ]; then kill -TERM $$pids && echo "MCP server stopped"; else echo "No MCP server running on port $${BRIDGE_MCP_PORT:-8002}"; fi
+
+cron-install-cron: ## Install 03:00 nightly stale-forward cleanup cron
+	bridge maintenance install-cron
+
+cron-uninstall-cron: ## Remove nightly stale-forward cleanup cron
+	bridge maintenance uninstall-cron
--- a/README.txt
+++ b/README.txt
@@ -243,6 +243,31 @@ has not yet cleaned up the socket), so the next reconnect attempt hits
 "remote port forwarding failed" and exits with code 255. With ClientAlive
 enabled, sshd evicts stale sessions within ~90 seconds and frees the port.

+NIGHTLY STALE-FORWARD CLEANUP
+------------------------------
+
+When a bridge client dies without tearing down its SSH session, the remote
+host can keep port 18000 (etc.) bound to a zombie sshd listener. The port
+accepts connections but never forwards them, which breaks in-cluster proxies
+such as actcore-state-hub-bridge on railiance01.
+
+Install a 03:00 local-time cron job that probes each reverse tunnel's remote
+forward, kills stale listeners when the local service is healthy but the
+remote forward is not, and restarts the tunnel:
+
+  bridge maintenance install-cron
+
+Manual run:
+
+  bridge maintenance cleanup --restart
+
+Inspect or remove the cron entry:
+
+  bridge maintenance show-cron
+  bridge maintenance uninstall-cron
+
+Logs append to ~/.local/state/bridge/cleanup.log
+
 Apply and reload (no disconnect):

  sudo sed -i 's/#ClientAliveInterval 0/ClientAliveInterval 30/' /etc/ssh/sshd_config
--- a/SCOPE.md
+++ b/SCOPE.md
@@ -8,7 +8,7 @@

 ## One-liner

-SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards.
+SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.

 ---

@@ -20,11 +20,17 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## In Scope

-- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs`)
+- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
 - Auto-reconnect with exponential backoff and configurable retry policy
 - Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
 - Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
- Actor attribution: per-tunnel actor class (human / automation) for audit traceability
+- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
+  with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
+- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
+  works without any CA or external tooling
+- **cert_command mode** (optional): pluggable shell command that issues a short-lived
+  CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
+  `cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
 - PID + state file management in `~/.local/state/bridge/`
 - MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
 - OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
@@ -33,7 +39,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## Out of Scope

- Identity/credential management (uses existing SSH keys)
+- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
+  certs via the `cert_command` interface but never signs anything itself)
+- SSH key generation for human admins (self-service: `ssh-keygen`)
+- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
 - Long-running application hosting on remote machines (port-forward only, not deployment)
 - VPN or layer-3 connectivity
 - Monitoring/alerting beyond JSON audit logs
@@ -44,9 +53,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
 ## Relevant When

 - Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
- Need audit trail of which actor (human vs. automation) started/stopped tunnels
+- Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels
 - Setting up a new machine in the Railiance ecosystem that must phone home to the hub
 - Diagnosing connectivity issues between local hub and remote services
+- Checking certificate validity for active tunnels (`bridge cert-status`)
+- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials

 ---

@@ -60,8 +71,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## Current State

- Status: experimental → active (v0.1 core complete; OpsCatalog planned but not yet shipped)
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health checks and audit logging complete; OpsCatalog framework present but not populated
+- Status: active (v0.1 core complete; AccessManagementDirective alignment done — BRIDGE-WP-0004)
+- Implementation: ~80% — CLI tunneling fully functional, MCP integration working, health
+  checks and audit logging complete; ActorType enum (adm/agt/atm) enforced; cert_command
+  mode implemented with TTL-aware refresh and cert_identity audit logging; OpsCatalog
+  framework present but not yet populated
 - Stability: stable tunnel lifecycle; tested under network drops and SSH failures
 - Usage: running in lab for daily Railiance/Temporal connectivity

@@ -77,17 +91,24 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## Terminology

- Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check
+- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
+  cert_command, cert_identity
+- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
 - Also known as: "the bridge"
- Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
+- Potentially confusing: "bridge state" is a tunnel-specific state machine
+  (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
+- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)

 ---

-## Related / Overlapping Repositories
+## Related / Overlapping

 - `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
+- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
+  `cert_command` when short-lived certificates are required
 - `activity-core` — Temporal server on remote reached via ops-bridge tunnel
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home
+- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home;
+  `railiance-infra` owns host-side principal deployment (`/etc/ssh/auth_principals/`)

 ---

@@ -105,5 +126,9 @@ keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge
 ## Getting Oriented

 - Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), `~/.local/state/bridge/` (PID/state files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; MCP: `bridge_status()`
+- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
+  `~/.local/state/bridge/` (PID/state/cert files)
+- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
+  MCP: `bridge_status()`
+- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
+- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -11,7 +11,7 @@ dependencies = [
    "typer>=0.12",
    "pyyaml>=6.0",
    "httpx>=0.27",
-    "fastmcp>=2.0.0",
+    "fastmcp>=2.0.0,<3.1.0",
 ]

 [project.scripts]
--- a/registry/README.md
+++ b/registry/README.md
@@ -0,0 +1,12 @@
+# Capability Registry
+
+Markdown-first capability index for federation and reuse planning.
+
+## Authoring
+
+1. Copy a capability entry template (see reuse-surface `templates/capability-entry.template.md`).
+2. Add the row to `indexes/capabilities.yaml`.
+3. Run `reuse-surface validate` from a checkout with the CLI installed.
+4. Merge to `main` and verify publish with `reuse-surface establish --publish-check`.
+
+Federation contract: reuse-surface `docs/RegistryFederation.md`.
--- a/registry/capabilities/.gitkeep
+++ b/registry/capabilities/.gitkeep
--- a/registry/indexes/capabilities.yaml
+++ b/registry/indexes/capabilities.yaml
@@ -0,0 +1,4 @@
+version: 1
+updated: '2026-06-16'
+domain: helix_forge
+capabilities: []
--- a/src/bridge/audit.py
+++ b/src/bridge/audit.py
@@ -16,6 +16,7 @@ class AuditEvent(str, Enum):
    HEALTH_CHECK_FAILED = "health_check_failed"
    HEALTH_CHECK_RECOVERED = "health_check_recovered"
    BRIDGE_STOPPED = "bridge_stopped"
+    CERT_EXPIRING = "cert_expiring"


 def _default_state_dir() -> Path:
@@ -34,19 +35,22 @@ class AuditLogger:
        tunnel: str,
        event: AuditEvent,
        actor: str,
-        actor_class: str,
+        actor_type: str,
        detail: str = "",
+        cert_identity: Optional[str] = None,
    ) -> None:
        self._dir.mkdir(parents=True, exist_ok=True)
        entry: Dict[str, Any] = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "tunnel": tunnel,
            "actor": actor,
-            "actor_class": actor_class,
+            "actor_type": actor_type,
            "event": event.value,
        }
        if detail:
            entry["detail"] = detail
+        if cert_identity:
+            entry["cert_identity"] = cert_identity
        with self._log_path(tunnel).open("a") as f:
            f.write(json.dumps(entry) + "\n")
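
Review note on the audit change above: a minimal sketch of the entry shape after the rename to `actor_type` and the optional `cert_identity` field. `build_audit_entry` is a hypothetical helper written for illustration (the diff builds the dict inline in `AuditLogger`), and the tunnel name and identity string in the usage note are made-up sample values.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def build_audit_entry(
    tunnel: str,
    event: str,
    actor: str,
    actor_type: str,
    detail: str = "",
    cert_identity: Optional[str] = None,
) -> Dict[str, Any]:
    """Build one audit record; optional fields are omitted when empty."""
    entry: Dict[str, Any] = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "tunnel": tunnel,
        "actor": actor,
        "actor_type": actor_type,
        "event": event,
    }
    if detail:
        entry["detail"] = detail
    if cert_identity:
        # Only present for tunnels that run in cert_command mode.
        entry["cert_identity"] = cert_identity
    return entry
```

For example, `build_audit_entry("state-hub", "cert_expiring", "agt-claude-coulombcore", "agt", cert_identity="sample-identity")` yields a record with a `cert_identity` key, while a static-key tunnel produces the same record without it.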

--- a/src/bridge/capabilities.py
+++ b/src/bridge/capabilities.py
@@ -73,6 +73,11 @@ CAPABILITIES: list[Capability] = [
        description="End-to-end tunnel diagnostics via SSH: SSH PID alive + remote port listening",
        required_access_modes=frozenset({"cli", "mcp"}),
    ),
+    Capability(
+        name="bridge_cert_status",
+        description="Show certificate status for tunnels using cert_command mode",
+        required_access_modes=frozenset({"cli"}),
+    ),
 ]

 CAPABILITIES_BY_NAME: dict[str, Capability] = {c.name: c for c in CAPABILITIES}
--- a/src/bridge/cleanup.py
+++ b/src/bridge/cleanup.py
@@ -0,0 +1,328 @@
+"""Nightly maintenance: detect and clear stale SSH remote port forwards."""
+from __future__ import annotations
+
+import subprocess
+from dataclasses import dataclass
+from typing import Optional
+from urllib.parse import urlparse, urlunparse
+
+import httpx
+
+from bridge.diagnostics import _remote_port_probe_command, check_tunnel
+from bridge.manager import TunnelManager
+from bridge.models import TunnelConfig
+from bridge.state import StateManager
+
+
+@dataclass
+class CleanupAction:
+    tunnel: str
+    action: str  # skipped | healthy | cleaned | cleaned_and_restarted | error
+    detail: str = ""
+
+
+@dataclass
+class CleanupReport:
+    actions: list[CleanupAction]
+
+    @property
+    def cleaned_count(self) -> int:
+        return sum(1 for a in self.actions if a.action.startswith("cleaned"))
+
+
+def remote_forward_health_url(cfg: TunnelConfig) -> Optional[str]:
+    """Map the local health_check URL to the remote forwarded port."""
+    if cfg.health_check is None or cfg.direction == "local":
+        return None
+    parsed = urlparse(cfg.health_check.url)
+    if not parsed.hostname:
+        return None
+    netloc = f"{parsed.hostname}:{cfg.remote_port}"
+    return urlunparse(parsed._replace(netloc=netloc))
+
+
+def _ssh_base_cmd(cfg: TunnelConfig) -> list[str]:
+    from pathlib import Path
+
+    return [
+        "ssh",
+        "-i",
+        str(Path(cfg.ssh_key).expanduser()),
+        "-o",
+        "BatchMode=yes",
+        "-o",
+        "ConnectTimeout=10",
+        "-o",
+        "StrictHostKeyChecking=accept-new",
+        f"{cfg.ssh_user}@{cfg.host}",
+    ]
+
+
+def _run_ssh(cfg: TunnelConfig, remote_command: str, *, timeout: float = 30) -> subprocess.CompletedProcess[str]:
+    return subprocess.run(
+        [*_ssh_base_cmd(cfg), remote_command],
+        capture_output=True,
+        text=True,
+        timeout=timeout,
+    )
+
+
+def remote_port_listening(cfg: TunnelConfig) -> bool:
+    proc = _run_ssh(cfg, _remote_port_probe_command(cfg.remote_port), timeout=15)
+    return proc.stdout.strip() == "ok"
+
+
+def probe_remote_forward(cfg: TunnelConfig) -> tuple[bool, str]:
+    """Return (healthy, detail) for the remote forwarded service."""
+    url = remote_forward_health_url(cfg)
+    if url is None:
+        return True, "no remote health url configured"
+    timeout = cfg.health_check.timeout_seconds if cfg.health_check else 5
+    remote_cmd = (
+        f"curl -sf --max-time {timeout} {url!r} >/dev/null "
+        "&& echo ok || echo fail"
+    )
+    try:
+        proc = _run_ssh(cfg, remote_cmd, timeout=timeout + 15)
+    except subprocess.TimeoutExpired:
+        return False, "remote health probe timed out"
+    output = proc.stdout.strip()
+    if output == "ok":
+        return True, "remote forward healthy"
+    if proc.returncode != 0 and proc.stderr.strip():
+        return False, proc.stderr.strip()
+    return False, "remote forward unhealthy"
+
+
+def local_service_healthy(cfg: TunnelConfig) -> Optional[bool]:
+    if cfg.health_check is None:
+        return None
+    try:
+        resp = httpx.get(
+            cfg.health_check.url,
+            timeout=cfg.health_check.timeout_seconds,
+        )
+        return resp.is_success
+    except Exception:
+        return False
+
+
+def _remote_cleanup_script(port: int) -> str:
+    return f"""set -eu
+port={port}
+pids=""
+if command -v lsof >/dev/null 2>&1; then
+  pids=$(sudo -n lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
+  if [ -z "$pids" ]; then
+    pids=$(lsof -t -iTCP:$port -sTCP:LISTEN 2>/dev/null || true)
+  fi
+fi
+if [ -z "$pids" ] && command -v fuser >/dev/null 2>&1; then
+  pids=$(fuser -n tcp $port 2>/dev/null | tr -s ' ' '\\n' | grep -E '^[0-9]+$' || true)
+fi
+if [ -z "$pids" ]; then
+  echo "no_listeners"
+  exit 0
+fi
+echo "killing:$pids"
+for pid in $pids; do
+  kill "$pid" 2>/dev/null || sudo -n kill "$pid" 2>/dev/null || true
+done
+sleep 1
+if ss -tln 2>/dev/null | grep -q ":$port "; then
+  echo "still_listening"
+else
+  echo "cleared"
+fi
+"""
+
+
+def clear_stale_remote_binding(cfg: TunnelConfig) -> tuple[bool, str]:
+    try:
+        proc = _run_ssh(cfg, _remote_cleanup_script(cfg.remote_port), timeout=30)
+    except subprocess.TimeoutExpired:
+        return False, "remote cleanup timed out"
+    output = proc.stdout.strip()
+    if "cleared" in output:
+        return True, output
+    if "no_listeners" in output:
+        return True, "no listeners found"
+    if "still_listening" in output:
+        return False, output
+    detail = output or proc.stderr.strip() or f"exit {proc.returncode}"
+    return False, detail
+
+
+def should_cleanup_tunnel(
+    cfg: TunnelConfig,
+    state_mgr: StateManager,
+) -> tuple[bool, str]:
+    """Decide whether a reverse tunnel's remote binding looks stale."""
+    if cfg.direction == "local":
+        return False, "local tunnel"
+
+    if not remote_port_listening(cfg):
+        return False, "remote port closed"
+
+    remote_ok, remote_detail = probe_remote_forward(cfg)
+    if remote_ok:
+        return False, remote_detail
+
+    check = check_tunnel(cfg, state_mgr)
+    local_ok = local_service_healthy(cfg)
+
+    if local_ok is True and not remote_ok:
+        return True, f"stale forward: {remote_detail}"
+
+    if check.ssh_process != "ok" and check.remote_port == "listening":
+        return True, f"orphan forward while ssh {check.ssh_process}: {remote_detail}"
+
+    if check.ssh_process == "ok" and not remote_ok:
+        return True, f"broken forward with live client: {remote_detail}"
+
+    return False, remote_detail
+
+
+def cleanup_tunnel(
+    cfg: TunnelConfig,
+    state_mgr: StateManager,
+    *,
+    restart: bool,
+) -> CleanupAction:
+    name = cfg.name
+    try:
+        needed, reason = should_cleanup_tunnel(cfg, state_mgr)
+        if not needed:
+            return CleanupAction(name, "healthy", reason)
+
+        ok, detail = clear_stale_remote_binding(cfg)
+        if not ok:
+            return CleanupAction(name, "error", f"cleanup failed: {detail}")
+
+        if not restart:
+            return CleanupAction(name, "cleaned", f"{reason}; {detail}")
+
+        mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
+        was_running = mgr.is_running()
+        if was_running:
+            mgr.stop()
+        mgr.start()
+        action = "cleaned_and_restarted"
+        verb = "restarted" if was_running else "started"
+        return CleanupAction(name, action, f"{reason}; {verb} tunnel; {detail}")
+    except Exception as exc:
+        return CleanupAction(name, "error", str(exc))
+
+
+def restart_tunnel(
+    cfg: TunnelConfig,
+    state_mgr: StateManager,
+) -> CleanupAction:
+    """Restart one tunnel with blank-slate recovery for reverse tunnels."""
+    if cfg.direction == "local":
+        mgr = TunnelManager(cfg, state_dir=state_mgr._dir)
+        mgr.stop()
+        mgr.start()
+        return CleanupAction(cfg.name, "restarted", "local tunnel stop/start")
+    return cleanup_tunnel(cfg, state_mgr, restart=True)
+
+
+def restart_all_tunnels(
+    cfg,
+    state_mgr: StateManager,
+) -> list[CleanupAction]:
+    """Restart every inline tunnel (reverse via cleanup path, local via stop/start)."""
+    return [restart_tunnel(tcfg, state_mgr) for tcfg in cfg.tunnels.values()]
+
+
+def cleanup_all_tunnels(
+    cfg,
+    state_mgr: StateManager,
+    *,
+    restart: bool,
+    tunnel_name: Optional[str] = None,
+) -> CleanupReport:
+    tunnels = cfg.tunnels.values()
+    if tunnel_name is not None:
+        if tunnel_name not in cfg.tunnels:
+            raise KeyError(tunnel_name)
+        tunnels = [cfg.tunnels[tunnel_name]]
+
+    actions = [
+        cleanup_tunnel(tcfg, state_mgr, restart=restart)
+        for tcfg in tunnels
+        if tcfg.direction != "local"
+    ]
+    return CleanupReport(actions=actions)
+
+
+CRON_MARKER = "# ops-bridge: maintenance cleanup"
+CRON_SCHEDULE = "0 3 * * *"
+CRON_LOG = "~/.local/state/bridge/cleanup.log"
+
+
+def build_cron_line() -> str:
+    bridge_bin = "~/.local/bin/bridge"
+    return (
+        f"{CRON_SCHEDULE} BRIDGE_CONFIG=~/.config/bridge/tunnels.yaml "
+        f"{bridge_bin} maintenance cleanup --restart "
+        f">> {CRON_LOG} 2>&1 {CRON_MARKER}"
+    )
+
+
+def read_installed_cron() -> Optional[str]:
+    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
+    if proc.returncode != 0:
+        return None
+    for line in proc.stdout.splitlines():
+        if CRON_MARKER in line:
+            return line.strip()
+    return None
+
+
+def install_cleanup_cron() -> tuple[bool, str]:
+    existing = read_installed_cron()
+    if existing:
+        return False, f"cron already installed: {existing}"
+
+    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
+    current = proc.stdout if proc.returncode == 0 else ""
+    new_line = build_cron_line()
+    body = current.rstrip("\n")
+    if body:
+        body += "\n"
+    body += new_line + "\n"
+    write = subprocess.run(
+        ["crontab", "-"],
+        input=body,
+        capture_output=True,
+        text=True,
+    )
+    if write.returncode != 0:
+        return False, write.stderr.strip() or "crontab write failed"
+    return True, new_line
+
+
+def uninstall_cleanup_cron() -> tuple[bool, str]:
+    proc = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
+    if proc.returncode != 0:
+        return False, "no crontab installed"
+    kept = [
+        line
+        for line in proc.stdout.splitlines()
+        if CRON_MARKER not in line
+    ]
+    if len(kept) == len(proc.stdout.splitlines()):
+        return False, "cleanup cron not found"
+    body = "\n".join(kept).rstrip("\n")
+    if body:
+        body += "\n"
+    write = subprocess.run(
+        ["crontab", "-"],
+        input=body,
+        capture_output=True,
+        text=True,
+    )
+    if write.returncode != 0:
+        return False, write.stderr.strip() or "crontab write failed"
+    return True, "removed cleanup cron entry"
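
Review note on `probe_remote_forward` above: the remote `curl` command interpolates the URL with `{url!r}`, which works only as long as Python's repr happens to look like valid shell quoting. The sketch below shows the same command built with `shlex.quote` from the standard library, a swapped-in technique rather than what the diff does. `remap_port` and `build_remote_probe` are hypothetical names; `remap_port` mirrors the idea of `remote_forward_health_url` without depending on `TunnelConfig`.

```python
import shlex
from urllib.parse import urlparse, urlunparse


def remap_port(url: str, remote_port: int) -> str:
    """Point a health-check URL at the remote forwarded port."""
    parsed = urlparse(url)
    if not parsed.hostname:
        raise ValueError(f"URL has no hostname: {url!r}")
    netloc = f"{parsed.hostname}:{remote_port}"
    return urlunparse(parsed._replace(netloc=netloc))


def build_remote_probe(url: str, timeout: float) -> str:
    """Build the remote shell probe; the URL is quoted for a POSIX shell."""
    return (
        f"curl -sf --max-time {timeout} {shlex.quote(url)} >/dev/null "
        "&& echo ok || echo fail"
    )
```

With this form, a health URL containing `&`, `?`, or a single quote still reaches `curl` as one argument on the remote side.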
--- a/src/bridge/cli.py
+++ b/src/bridge/cli.py
@@ -4,12 +4,24 @@ from __future__ import annotations
 import dataclasses
 import json
 import os
+import subprocess
+from datetime import datetime
 from pathlib import Path
 from typing import Optional

 import typer

 from bridge.audit import AuditLogger
+from bridge.cleanup import (
+    CleanupAction,
+    build_cron_line,
+    cleanup_all_tunnels,
+    install_cleanup_cron,
+    read_installed_cron,
+    restart_all_tunnels,
+    restart_tunnel,
+    uninstall_cleanup_cron,
+)
 from bridge.config import ConfigError, load_config
 from bridge.diagnostics import check_all_tunnels, check_tunnel
 from bridge.manager import TunnelManager
@@ -23,9 +35,11 @@ app = typer.Typer(

 targets_app = typer.Typer(help="Inspect infrastructure targets from the OpsCatalog.")
 catalog_app = typer.Typer(help="Inspect and validate the OpsCatalog.")
+maintenance_app = typer.Typer(help="Scheduled maintenance for tunnel hygiene.")

 app.add_typer(targets_app, name="targets")
 app.add_typer(catalog_app, name="catalog")
+app.add_typer(maintenance_app, name="maintenance")


 def _state_dir() -> Path:
@@ -142,27 +156,37 @@ def down(
            raise typer.Exit(2)


+def _emit_restart_actions(actions: list[CleanupAction]) -> None:
+    any_error = False
+    for action in actions:
+        typer.echo(f"{action.tunnel}: {action.action} — {action.detail}")
+        if action.action == "error":
+            any_error = True
+    if any_error:
+        raise typer.Exit(1)
+
+
 @app.command()
 def restart(
    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
 ):
-    """Restart one or all tunnels."""
+    """Restart one or all tunnels.
+
+    Reverse tunnels run conditional remote stale-forward cleanup before
+    reconnecting; healthy forwards are left running. Local-direction tunnels
+    use local stop/start only.
+    """
    cfg = _load_or_exit()
    sd = _state_dir()
+    state_mgr = StateManager(state_dir=sd)

    if tunnel:
        tcfg = _resolve_tunnel(cfg, tunnel)
-        mgr = TunnelManager(tcfg, state_dir=sd)
-        mgr.stop()
-        mgr.start()
-        typer.echo(f"Restarted tunnel '{tunnel}'.")
+        actions = [restart_tunnel(tcfg, state_mgr)]
    else:
-        for name in _all_tunnel_names(cfg):
-            tcfg = cfg.tunnels[name]
-            mgr = TunnelManager(tcfg, state_dir=sd)
-            mgr.stop()
-            mgr.start()
-            typer.echo(f"Restarted tunnel '{name}'.")
+        actions = restart_all_tunnels(cfg, state_mgr)
+
+    _emit_restart_actions(actions)


 @app.command()
@@ -357,6 +381,84 @@ def _print_check_table(results):
        typer.echo(_fmt(row))


+@app.command("cert-status")
+def cert_status(
+    tunnel: Optional[str] = typer.Argument(None, help="Tunnel name (omit for all inline)"),
+    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
+):
+    """Show certificate status for tunnels using cert_command mode."""
+    cfg = _load_or_exit()
+    sd = _state_dir()
+
+    names = [_resolve_tunnel(cfg, tunnel).name] if tunnel else list(cfg.tunnels.keys())
+    rows = []
+    any_expired = False
+
+    for name in names:
+        cert_file = sd / f"{name}-cert.pub"
+        if not cert_file.exists():
+            rows.append({"tunnel": name, "mode": "static-key", "cert_file": None})
+            continue
+
+        try:
+            result = subprocess.run(
+                ["ssh-keygen", "-L", "-f", str(cert_file)],
+                capture_output=True, text=True, check=False,
+            )
+            info = {"tunnel": name, "mode": "cert", "cert_file": str(cert_file)}
+            for line in result.stdout.splitlines():
+                line = line.strip()
+                if line.startswith("Key ID:"):
+                    info["key_id"] = line.split(":", 1)[1].strip().strip('"')
+                elif line.startswith("Valid:"):
+                    parts = line.split()
+                    if len(parts) >= 5 and parts[1] == "from" and parts[3] == "to":
+                        info["valid_from"] = parts[2]
+                        info["valid_until"] = parts[4]
+                        try:
+                            expires = datetime.fromisoformat(parts[4])
+                            now = datetime.now()
+                            remaining = expires - now
+                            if remaining.total_seconds() <= 0:
+                                info["expired"] = True
+                                any_expired = True
+                            else:
+                                info["expired"] = False
+                                mins = int(remaining.total_seconds() // 60)
+                                info["ttl_remaining"] = f"{mins}m"
+                        except ValueError:
+                            pass
+            rows.append(info)
+        except FileNotFoundError:
+            rows.append({"tunnel": name, "mode": "cert", "error": "ssh-keygen not found"})
+
+    if as_json:
+        typer.echo(json.dumps(rows, indent=2))
+    else:
+        for row in rows:
+            mode = row.get("mode", "unknown")
+            if mode == "static-key":
+                typer.echo(f"{row['tunnel']}  static-key / no cert")
+            elif "error" in row:
+                typer.echo(f"{row['tunnel']}  ERROR: {row['error']}")
+            else:
+                parts = [row["tunnel"]]
+                if "key_id" in row:
+                    parts.append(f"id={row['key_id']}")
+                if "valid_from" in row:
+                    parts.append(f"from={row['valid_from']}")
+                if "valid_until" in row:
+                    parts.append(f"until={row['valid_until']}")
+                if row.get("expired"):
+                    parts.append("EXPIRED")
+                elif "ttl_remaining" in row:
+                    parts.append(f"ttl={row['ttl_remaining']}")
+                typer.echo("  ".join(parts))
+
+    if any_expired:
+        raise typer.Exit(1)
+
+
 # ─── targets commands ─────────────────────────────────────────────────────────

 @targets_app.callback(invoke_without_command=True)
@@ -553,3 +655,119 @@ def catalog_show(
    if b.target in cat.targets:
        t = cat.targets[b.target]
        typer.echo(f"Target:         {t.description or t.id} ({t.kind})")
+
+
+_CONVENTIONS_TEXT = """\
+Actor Naming Conventions (from AccessManagementDirective.md §2)
+
+Every actor declared under `actors:` in ~/.config/bridge/tunnels.yaml must have
+a `class` field, and the actor name must start with the class-specific prefix:
+
+  class   prefix   purpose
+  -----   ------   ------------------------------------------------------------
+  adm     adm-     Human operator (interactive shell when needed)
+  agt     agt-     LLM-powered autonomous agent (Claude Code, etc.)
+  atm     atm-     Deterministic script / cron job / pipeline
+
+Legacy class aliases (deprecated, still accepted with a warning):
+  human       -> adm
+  automation  -> atm
+
+Examples:
+  adm-bernd:              { class: adm, description: Bernd Worsch }
+  agt-claude-coulombcore: { class: agt, description: Claude Code on CoulombCore }
+  atm-backup-daily:       { class: atm, description: Nightly DB backup }
+
+Full specification:
+  <ops-bridge repo>/wiki/AccessManagementDirective.md
+"""
+
+
+@maintenance_app.command("cleanup")
+def maintenance_cleanup(
+    tunnel: Optional[str] = typer.Argument(
+        None,
+        help="Tunnel name (omit for all reverse tunnels)",
+    ),
+    restart: bool = typer.Option(
+        False,
+        "--restart",
+        help="Restart tunnels after clearing stale remote bindings",
+    ),
+    as_json: bool = typer.Option(False, "--json", help="Output as JSON"),
+):
+    """Clear stale SSH remote port forwards that block tunnel reconnects."""
+    cfg = _load_or_exit()
+    sd = _state_dir()
+    state_mgr = StateManager(state_dir=sd)
+
+    try:
+        report = cleanup_all_tunnels(
+            cfg,
+            state_mgr,
+            restart=restart,
+            tunnel_name=tunnel,
+        )
+    except KeyError:
+        typer.echo(f"Error: tunnel '{tunnel}' not found in config", err=True)
+        raise typer.Exit(1)
+
+    if as_json:
+        payload = {
+            "cleaned_count": report.cleaned_count,
+            "actions": [
+                {"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
+                for a in report.actions
+            ],
+        }
+        typer.echo(json.dumps(payload, indent=2))
+        return
+
+    if not report.actions:
+        typer.echo("No reverse tunnels configured.")
+        return
+
+    for action in report.actions:
+        typer.echo(f"{action.tunnel}: {action.action} — {action.detail}")
+    typer.echo(f"done ({report.cleaned_count} cleaned)")
+
+
+@maintenance_app.command("install-cron")
+def maintenance_install_cron():
+    """Install a 03:00 daily cron job for `bridge maintenance cleanup --restart`."""
+    installed, message = install_cleanup_cron()
+    if installed:
+        typer.echo("Installed nightly cleanup cron:")
+        typer.echo(f"  {message}")
+    else:
+        typer.echo(message)
+        raise typer.Exit(2)
+
+
+@maintenance_app.command("uninstall-cron")
+def maintenance_uninstall_cron():
+    """Remove the nightly cleanup cron job."""
+    removed, message = uninstall_cleanup_cron()
+    if removed:
+        typer.echo(message)
+    else:
+        typer.echo(message)
+        raise typer.Exit(2)
+
+
+@maintenance_app.command("show-cron")
+def maintenance_show_cron():
+    """Show the configured nightly cleanup cron line."""
+    existing = read_installed_cron()
+    if existing:
+        typer.echo(existing)
+    else:
+        typer.echo("Nightly cleanup cron is not installed.")
+        typer.echo("Would install:")
+        typer.echo(f"  {build_cron_line()}")
+
+
+@app.command()
+def conventions():
+    """Show the actor naming conventions enforced by tunnels.yaml."""
+    typer.echo(_CONVENTIONS_TEXT)
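
Review note on `cert-status` above: the validity parsing is easier to test when it is separated from the `ssh-keygen` call and takes the current time as a parameter. This is a minimal sketch under that assumption; `parse_validity` is a hypothetical helper, and the line format (`Valid: from <start> to <end>`, naive local timestamps) is the one the command in this diff already expects.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def parse_validity(line: str, now: datetime) -> Optional[Dict[str, Any]]:
    """Parse one 'Valid: from A to B' line; return None for any other line."""
    parts = line.split()
    if len(parts) < 5 or parts[0] != "Valid:" or parts[1] != "from" or parts[3] != "to":
        return None
    info: Dict[str, Any] = {"valid_from": parts[2], "valid_until": parts[4]}
    try:
        expires = datetime.fromisoformat(parts[4])
    except ValueError:
        # Keep the raw strings but make no claim about expiry.
        return info
    remaining = (expires - now).total_seconds()
    info["expired"] = remaining <= 0
    if remaining > 0:
        info["ttl_remaining"] = f"{int(remaining // 60)}m"
    return info
```

Passing `now` explicitly keeps the expiry branch deterministic in tests; the CLI would pass `datetime.now()`.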
--- a/src/bridge/config.py
+++ b/src/bridge/config.py
@@ -2,13 +2,14 @@
 from __future__ import annotations

 import os
+import warnings
 from dataclasses import dataclass
 from pathlib import Path
 from typing import Dict, Optional

 import yaml

-from bridge.models import ActorInfo, HealthCheckConfig, ReconnectPolicy, TunnelConfig
+from bridge.models import ActorInfo, ActorType, HealthCheckConfig, ReconnectPolicy, TunnelConfig


 class ConfigError(Exception):
@@ -91,6 +92,10 @@ def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
    if direction not in ("reverse", "local"):
        raise ConfigError(f"Tunnel '{name}' direction must be 'reverse' or 'local', got: {direction!r}")

+    cert_command = data.get("cert_command") or None
+    if cert_command is not None:
+        cert_command = str(cert_command)
+
    return TunnelConfig(
        name=name,
        host=str(data["host"]),
@@ -102,9 +107,42 @@ def _parse_tunnel(name: str, data: dict) -> TunnelConfig:
        reconnect=reconnect,
        health_check=health_check,
        direction=direction,
+        remote_host=str(data.get("remote_host", "127.0.0.1")),
+        cert_command=cert_command,
    )


+_LEGACY_CLASS_MAP = {
+    "human": ActorType.ADM,
+    "automation": ActorType.ATM,
+}
+
+_ACTOR_TYPE_PREFIXES = {
+    ActorType.ADM: "adm-",
+    ActorType.AGT: "agt-",
+    ActorType.ATM: "atm-",
+}
+
+
+def _parse_actor_type(name: str, raw_class: str) -> ActorType:
+    if raw_class in _LEGACY_CLASS_MAP:
+        warnings.warn(
+            f"Actor '{name}': class '{raw_class}' is deprecated; "
+            f"use '{_LEGACY_CLASS_MAP[raw_class].value}' instead.",
+            DeprecationWarning,
+            stacklevel=4,
+        )
+        return _LEGACY_CLASS_MAP[raw_class]
+    try:
+        return ActorType(raw_class)
+    except ValueError:
+        raise ConfigError(
+            f"Actor '{name}' has unknown class '{raw_class}'; "
+            "must be one of: adm, agt, atm (or legacy: human, automation). "
+            "Run `bridge conventions` for the full naming rules."
+        ) from None
+
+
 def _parse_actors(raw: dict) -> Dict[str, ActorInfo]:
    actors = {}
    for name, data in raw.items():
@@ -112,9 +150,17 @@ def _parse_actors(raw: dict) -> Dict[str, ActorInfo]:
            raise ConfigError(f"Actor '{name}' must be a mapping")
        if "class" not in data:
            raise ConfigError(f"Actor '{name}' missing required field: class")
+        actor_type = _parse_actor_type(name, str(data["class"]))
+        required_prefix = _ACTOR_TYPE_PREFIXES[actor_type]
+        if not name.startswith(required_prefix):
+            raise ConfigError(
+                f"Actor '{name}' has type '{actor_type.value}' but name must start "
+                f"with '{required_prefix}' (got '{name}'). "
+                f"Run `bridge conventions` for the full naming rules."
+            )
        actors[name] = ActorInfo(
            name=name,
-            actor_class=str(data["class"]),
+            actor_type=actor_type,
            description=str(data.get("description", "")),
        )
    return actors
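
Review note on the actor parsing above: a self-contained sketch of the class-to-prefix rule, useful as a starting point for unit tests. The `ActorType` enum here is a stand-in for `bridge.models.ActorType`, whose definition is not part of this excerpt; `resolve_actor_type` is a hypothetical name, and it raises `ValueError` where the real code raises `ConfigError`.

```python
from enum import Enum


class ActorType(str, Enum):
    # Stand-in for bridge.models.ActorType (assumed values).
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


LEGACY_CLASS_MAP = {"human": ActorType.ADM, "automation": ActorType.ATM}


def resolve_actor_type(name: str, raw_class: str) -> ActorType:
    """Map a class string to ActorType and enforce the name prefix."""
    if raw_class in LEGACY_CLASS_MAP:
        actor_type = LEGACY_CLASS_MAP[raw_class]
    else:
        # Raises ValueError for anything other than adm / agt / atm.
        actor_type = ActorType(raw_class)
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ValueError(f"actor {name!r} must start with {prefix!r}")
    return actor_type
```

Note that a legacy class still has to satisfy the new prefix rule: `automation` maps to `atm`, so the actor name must start with `atm-`.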
--- a/src/bridge/diagnostics.py
+++ b/src/bridge/diagnostics.py
@@ -1,6 +1,7 @@
 """End-to-end tunnel diagnostics for OpsBridge."""
 from __future__ import annotations

+import socket
 import subprocess
 import time
 from dataclasses import dataclass
@@ -13,6 +14,38 @@ from bridge.models import BridgeState, TunnelConfig
 from bridge.state import StateManager, _pid_alive


+def _remote_port_probe_command(remote_port: int) -> str:
+    """Build a portable remote shell probe for a listening TCP port."""
+    return (
+        f"port={remote_port}; "
+        "if command -v ss >/dev/null 2>&1; then "
+        "ss -tnlp 2>/dev/null | grep -q \":$port \" && echo ok || echo closed; "
+        "elif command -v netstat >/dev/null 2>&1; then "
+        "netstat -tnlp 2>/dev/null | "
+        "grep -q \"[.:]$port[[:space:]]\" && echo ok || echo closed; "
+        "else "
+        "hex=$(printf '%04X' \"$port\"); "
+        "awk -v p=\":$hex\" "
+        "'NR > 1 && $4 == \"0A\" && index($2, p) { found = 1 } "
+        "END { print found ? \"ok\" : \"closed\" }' "
+        "/proc/net/tcp /proc/net/tcp6 2>/dev/null; "
+        "fi"
+    )
+
+
+def _probe_local_port(local_port: int) -> str:
+    """Check whether the local side of an SSH -L tunnel is accepting TCP."""
+    try:
+        with socket.create_connection(("127.0.0.1", local_port), timeout=5):
+            return "listening"
+    except ConnectionRefusedError:
+        return "closed"
+    except socket.timeout:
+        return "error:timeout"
+    except OSError as e:
+        return f"error:{e}"
+
+
 @dataclass
 class TunnelCheckResult:
    tunnel: str
@@ -52,35 +85,38 @@ def check_tunnel(cfg: TunnelConfig, state_mgr: StateManager) -> TunnelCheckResul
        and ssh_process != "ok"
    )

-    # 3. SSH probe for remote port
-    key_path = str(Path(cfg.ssh_key).expanduser())
-    cmd = [
-        "ssh",
-        "-i", key_path,
-        "-o", "BatchMode=yes",
-        "-o", "ConnectTimeout=5",
-        "-o", "StrictHostKeyChecking=accept-new",
-        f"{cfg.ssh_user}@{cfg.host}",
-        f"ss -tnlp 2>/dev/null | grep -q ':{cfg.remote_port} ' && echo ok || echo closed",
-    ]
-    try:
-        proc = subprocess.run(
-            cmd,
-            capture_output=True,
-            text=True,
-            timeout=10,
-        )
-        output = proc.stdout.strip()
-        if output == "ok":
-            remote_port = "listening"
-        elif output == "closed":
-            remote_port = "closed"
-        else:
-            remote_port = f"error:{proc.stderr.strip() or 'unknown'}"
-    except subprocess.TimeoutExpired:
-        remote_port = "error:timeout"
-    except Exception as e:
-        remote_port = f"error:{e}"
+    # 3. Port probe: reverse tunnels listen remotely; local tunnels listen here.
+    if cfg.direction == "local":
+        remote_port = _probe_local_port(cfg.local_port)
+    else:
+        key_path = str(Path(cfg.ssh_key).expanduser())
+        cmd = [
+            "ssh",
+            "-i", key_path,
+            "-o", "BatchMode=yes",
+            "-o", "ConnectTimeout=5",
+            "-o", "StrictHostKeyChecking=accept-new",
+            f"{cfg.ssh_user}@{cfg.host}",
+            _remote_port_probe_command(cfg.remote_port),
+        ]
+        try:
+            proc = subprocess.run(
+                cmd,
+                capture_output=True,
+                text=True,
+                timeout=10,
+            )
+            output = proc.stdout.strip()
+            if output == "ok":
+                remote_port = "listening"
+            elif output == "closed":
+                remote_port = "closed"
+            else:
+                remote_port = f"error:{proc.stderr.strip() or 'unknown'}"
+        except subprocess.TimeoutExpired:
+            remote_port = "error:timeout"
+        except Exception as e:
+            remote_port = f"error:{e}"

    # 4. Local API health check (optional)
    local_api: Optional[str] = None
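
Review note: the last fallback in `_remote_port_probe_command` reads `/proc/net/tcp` with awk, matching the hex-encoded port in the local-address column and state `0A` (LISTEN). The same logic can be sketched in Python to make the field positions explicit. `port_is_listening` is a hypothetical helper used only for illustration; it is not part of the diff.

```python
def port_is_listening(proc_net_tcp: str, port: int) -> bool:
    """Return True if the /proc/net/tcp text shows a LISTEN socket on `port`."""
    suffix = f":{port:04X}"  # ports are stored as 4-digit uppercase hex
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # fields[1] is local_address (HEXIP:HEXPORT), fields[3] is the state.
        if len(fields) >= 4 and fields[3] == "0A" and fields[1].endswith(suffix):
            return True
    return False


SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:4650 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345
   1: 0100007F:1F40 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 12346
"""

print(port_is_listening(SAMPLE, 18000))  # 0x4650 in LISTEN state
print(port_is_listening(SAMPLE, 8000))   # 0x1F40 is ESTABLISHED (01), not LISTEN
```

The sketch anchors the port at the end of the address field, whereas the awk version uses a substring match on `:HEX`; both give the same result here because the address column contains a single colon.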
--- a/src/bridge/manager.py
+++ b/src/bridge/manager.py
@@ -6,35 +6,102 @@ import os
 import signal
 import subprocess
 import time
+from datetime import datetime, timedelta
 from pathlib import Path
 from typing import List, Optional

 from bridge.audit import AuditEvent, AuditLogger
 from bridge.health import HealthChecker
-from bridge.models import BridgeState, TunnelConfig
+from bridge.models import BridgeState, CertAcquisitionError, TunnelConfig
 from bridge.state import StateManager

 log = logging.getLogger(__name__)


-def build_ssh_command(cfg: TunnelConfig) -> List[str]:
+def _actor_type_from_name(name: str) -> str:
+    for prefix in ("adm", "agt", "atm"):
+        if name.startswith(f"{prefix}-"):
+            return prefix
+    return "unknown"
+
+
+def build_ssh_command(cfg: TunnelConfig, cert_path: Optional[Path] = None) -> List[str]:
    """Build the SSH tunnel command (reverse -R or local -L)."""
    key = os.path.expanduser(cfg.ssh_key)
    if cfg.direction == "local":
-        forward_flag = ["-L", f"{cfg.local_port}:127.0.0.1:{cfg.remote_port}"]
+        forward_flag = ["-L", f"{cfg.local_port}:{cfg.remote_host}:{cfg.remote_port}"]
    else:
-        forward_flag = ["-R", f"{cfg.remote_port}:127.0.0.1:{cfg.local_port}"]
-    return [
+        forward_flag = ["-R", f"{cfg.remote_port}:{cfg.remote_host}:{cfg.local_port}"]
+    cmd = [
        "ssh",
        "-N",
        *forward_flag,
        "-i", key,
+    ]
+    if cert_path is not None:
+        cmd += ["-i", str(cert_path)]
+    cmd += [
        "-o", "ServerAliveInterval=10",
        "-o", "ServerAliveCountMax=3",
        "-o", "ExitOnForwardFailure=yes",
        "-o", "StrictHostKeyChecking=accept-new",
        f"{cfg.ssh_user}@{cfg.host}",
    ]
+    return cmd
+
+
+def _run_cert_command(cfg: TunnelConfig, state_dir: Path) -> Optional[Path]:
+    """Run cert_command and write cert to state dir. Returns cert path or None."""
+    if cfg.cert_command is None:
+        return None
+    result = subprocess.run(
+        cfg.cert_command,
+        shell=True,
+        capture_output=True,
+        text=True,
+    )
+    if result.returncode != 0:
+        raise CertAcquisitionError(result.stderr.strip())
+    cert_path = state_dir / f"{cfg.name}-cert.pub"
+    cert_path.write_text(result.stdout)
+    return cert_path
+
+
+def _parse_cert_identity(cert_path: Path) -> Optional[str]:
+    """Parse Key ID from ssh-keygen -L output."""
+    try:
+        result = subprocess.run(
+            ["ssh-keygen", "-L", "-f", str(cert_path)],
+            capture_output=True,
+            text=True,
+        )
+        for line in result.stdout.splitlines():
+            line = line.strip()
+            if line.startswith("Key ID:"):
+                return line.split(":", 1)[1].strip().strip('"')
+    except Exception:
+        pass
+    return None
+
+
+def _parse_cert_expiry(cert_path: Path) -> Optional[datetime]:
+    """Parse Valid-before datetime from ssh-keygen -L output."""
+    try:
+        result = subprocess.run(
+            ["ssh-keygen", "-L", "-f", str(cert_path)],
+            capture_output=True,
+            text=True,
+        )
+        for line in result.stdout.splitlines():
+            line = line.strip()
+            if line.startswith("Valid:"):
+                # "Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00"
+                parts = line.split()
+                if len(parts) >= 5 and parts[3] == "to":
+                    return datetime.fromisoformat(parts[4])
+    except Exception:
+        pass
+    return None


 class TunnelManager:
@@ -56,7 +123,8 @@ class TunnelManager:
        return self._state.is_running(self._cfg.name)

    def _actor_info(self):
-        return self._cfg.actor, "unknown"
+        actor = self._cfg.actor
+        return actor, _actor_type_from_name(actor)

    def _next_backoff(self, attempt: int) -> int:
        initial = self._cfg.reconnect.backoff_initial
@@ -71,12 +139,12 @@ class TunnelManager:
            return

        self._state.write_state(self._cfg.name, BridgeState.STARTING)
-        actor, actor_class = self._actor_info()
+        actor, actor_type = self._actor_info()
        self._audit.log(
            tunnel=self._cfg.name,
            event=AuditEvent.BRIDGE_STARTED,
            actor=actor,
-            actor_class=actor_class,
+            actor_type=actor_type,
        )

        pid = os.fork()
@@ -99,7 +167,7 @@ class TunnelManager:
                tunnel=self._cfg.name,
                event=AuditEvent.BRIDGE_STOPPED,
                actor=actor,
-                actor_class=actor_class,
+                actor_type=actor_type,
            )

        os._exit(0)
@@ -131,12 +199,12 @@ class TunnelManager:

        self._state.clear_pid(self._cfg.name)
        self._state.write_state(self._cfg.name, BridgeState.STOPPED)
-        actor, actor_class = self._actor_info()
+        actor, actor_type = self._actor_info()
        self._audit.log(
            tunnel=self._cfg.name,
            event=AuditEvent.BRIDGE_STOPPED,
            actor=actor,
-            actor_class=actor_class,
+            actor_type=actor_type,
        )

    def _run_loop(self) -> None:
@@ -144,11 +212,11 @@ class TunnelManager:
        import asyncio

        cfg = self._cfg
-        actor, actor_class = self._actor_info()
+        actor, actor_type = self._actor_info()
        attempt = 0
        max_attempts = cfg.reconnect.max_attempts  # 0 = infinite
+        state_dir = self._state._dir

-        # Setup signal handler for graceful shutdown
        _stop = [False]

        def _on_term(signum, frame):
@@ -162,7 +230,31 @@ class TunnelManager:
                self._state.write_state(cfg.name, BridgeState.FAILED)
                break

-            cmd = build_ssh_command(cfg)
+            # Acquire cert before each SSH launch (T3, T7)
+            try:
+                cert_path = _run_cert_command(cfg, state_dir)
+            except CertAcquisitionError as e:
+                self._audit.log(
+                    tunnel=cfg.name,
+                    event=AuditEvent.BRIDGE_DISCONNECTED,
+                    actor=actor,
+                    actor_type=actor_type,
+                    detail=f"cert acquisition failed: {e}",
+                )
+                attempt += 1
+                if max_attempts > 0 and attempt >= max_attempts:
+                    self._state.write_state(cfg.name, BridgeState.FAILED)
+                    break
+                backoff = self._next_backoff(attempt - 1)
+                self._state.write_state(cfg.name, BridgeState.RECONNECTING)
+                log.info("Cert acquisition failed, retrying in %ds", backoff)
+                time.sleep(backoff)
+                continue
+
+            cert_identity = _parse_cert_identity(cert_path) if cert_path else None
+            cert_expires_at = _parse_cert_expiry(cert_path) if cert_path else None
+
+            cmd = build_ssh_command(cfg, cert_path=cert_path)
            log.info("Starting SSH: %s", " ".join(cmd))
            self._state.write_state(cfg.name, BridgeState.STARTING)

@@ -174,24 +266,30 @@ class TunnelManager:
                    tunnel=cfg.name,
                    event=AuditEvent.BRIDGE_DISCONNECTED,
                    actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
                    detail="ssh binary not found",
                )
                break

-            # Wait briefly then assume connected if still running
            time.sleep(2)
+            _ttl_refresh = False
            if proc.poll() is None:
                self._state.write_state(cfg.name, BridgeState.CONNECTED)
                self._audit.log(
                    tunnel=cfg.name,
                    event=AuditEvent.BRIDGE_CONNECTED,
                    actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
+                    cert_identity=cert_identity,
                )
                attempt = 0

-                # Health check loop
+                def _check_ttl() -> bool:
+                    """Return True if cert is within 5 min of expiry and SSH should restart."""
+                    if cert_expires_at is None:
+                        return False
+                    return datetime.now() >= cert_expires_at - timedelta(minutes=5)
+
                if cfg.health_check:
                    checker = HealthChecker(
                        url=cfg.health_check.url,
@@ -199,6 +297,18 @@ class TunnelManager:
                    )
                    health_failing = False
                    while not _stop[0] and proc.poll() is None:
+                        if _check_ttl():
+                            self._audit.log(
+                                tunnel=cfg.name,
+                                event=AuditEvent.CERT_EXPIRING,
+                                actor=actor,
+                                actor_type=actor_type,
+                                cert_identity=cert_identity,
+                                detail=str(cert_expires_at),
+                            )
+                            proc.terminate()
+                            _ttl_refresh = True
+                            break
                        result = asyncio.run(checker.check())
                        if result.ok:
                            if health_failing:
@@ -208,7 +318,7 @@ class TunnelManager:
                                    tunnel=cfg.name,
                                    event=AuditEvent.HEALTH_CHECK_RECOVERED,
                                    actor=actor,
-                                    actor_class=actor_class,
+                                    actor_type=actor_type,
                                )
                        else:
                            if not health_failing:
@@ -218,21 +328,36 @@ class TunnelManager:
                                    tunnel=cfg.name,
                                    event=AuditEvent.HEALTH_CHECK_FAILED,
                                    actor=actor,
-                                    actor_class=actor_class,
+                                    actor_type=actor_type,
                                    detail=result.error or f"HTTP {result.status_code}",
                                )
                        time.sleep(cfg.health_check.interval_seconds)
                else:
                    while not _stop[0] and proc.poll() is None:
+                        if _check_ttl():
+                            self._audit.log(
+                                tunnel=cfg.name,
+                                event=AuditEvent.CERT_EXPIRING,
+                                actor=actor,
+                                actor_type=actor_type,
+                                cert_identity=cert_identity,
+                                detail=str(cert_expires_at),
+                            )
+                            proc.terminate()
+                            _ttl_refresh = True
+                            break
                        time.sleep(1)

-            # SSH exited
+            if _ttl_refresh:
+                # Planned cert refresh — don't count as failure, no backoff
+                continue
+
            if proc.poll() is not None:
                self._audit.log(
                    tunnel=cfg.name,
                    event=AuditEvent.BRIDGE_DISCONNECTED,
                    actor=actor,
-                    actor_class=actor_class,
+                    actor_type=actor_type,
                    detail=f"exit code {proc.returncode}",
                )

@@ -248,7 +373,7 @@ class TunnelManager:
                tunnel=cfg.name,
                event=AuditEvent.BRIDGE_RECONNECTING,
                actor=actor,
-                actor_class=actor_class,
+                actor_type=actor_type,
                detail=f"retry {attempt}, backoff {backoff}s",
            )
            log.info("Reconnecting in %ds (attempt %d)", backoff, attempt)
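
Review note: `_parse_cert_identity` and `_parse_cert_expiry` each run `ssh-keygen -L` and scan the same output, and `_check_ttl` restarts SSH once the certificate is within five minutes of expiry. The sketch below parses both fields from one captured listing and applies the same margin. `parse_cert_listing` and `needs_refresh` are hypothetical names, and the sample text mirrors the format quoted in the diff's own comment rather than captured tool output.

```python
from datetime import datetime, timedelta
from typing import Optional, Tuple


def parse_cert_listing(listing: str) -> Tuple[Optional[str], Optional[datetime]]:
    """Extract (Key ID, valid-before) from `ssh-keygen -L` style text."""
    key_id: Optional[str] = None
    expires: Optional[datetime] = None
    for raw in listing.splitlines():
        line = raw.strip()
        if line.startswith("Key ID:"):
            key_id = line.split(":", 1)[1].strip().strip('"')
        elif line.startswith("Valid:"):
            # Expected shape: "Valid: from <start> to <end>"
            parts = line.split()
            if len(parts) >= 5 and parts[3] == "to":
                expires = datetime.fromisoformat(parts[4])
    return key_id, expires


def needs_refresh(expires: Optional[datetime], now: datetime,
                  margin: timedelta = timedelta(minutes=5)) -> bool:
    """True when the cert is within `margin` of expiry; False without a cert."""
    return expires is not None and now >= expires - margin


LISTING = (
    "demo-cert.pub:\n"
    '        Key ID: "agt-test"\n'
    "        Valid: from 2026-05-15T10:00:00 to 2026-05-15T22:00:00\n"
)
key_id, expires = parse_cert_listing(LISTING)
print(key_id, expires)
print(needs_refresh(expires, datetime(2026, 5, 15, 21, 56)))
```

Parsing once per launch would avoid the second `ssh-keygen` subprocess; whether that is worth changing in the diff is a judgment call for the author. A line such as `Valid: forever` has too few fields to match and yields no expiry, so no refresh is scheduled.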
--- a/src/bridge/mcp_server/server.py
+++ b/src/bridge/mcp_server/server.py
@@ -169,19 +169,22 @@ def bridge_down(tunnel: Optional[str] = None) -> dict:
 def bridge_restart(tunnel: Optional[str] = None) -> dict:
    """Restart one or all configured tunnels.

+    Reverse tunnels run conditional remote stale-forward cleanup before
+    reconnecting; healthy forwards are left running.
+
    Args:
        tunnel: Tunnel name to restart. If omitted, restarts all inline tunnels.

    Returns:
-        {"restarted": [...]} or {"error": "..."}
+        {"actions": [{"tunnel", "action", "detail"}, ...]} or {"error": "..."}
    """
    cfg, err = _load_cfg_or_error()
    if err:
        return err

-    from bridge.manager import TunnelManager
+    from bridge.cleanup import restart_all_tunnels, restart_tunnel
    sd = _state_dir()
-    restarted = []
+    state_mgr = StateManager(state_dir=sd)

    if tunnel:
        from bridge.catalog.loader import load_catalog
@@ -196,18 +199,19 @@ def bridge_restart(tunnel: Optional[str] = None) -> dict:
            tcfg = resolve(tunnel, catalog=catalog, inline_tunnels=cfg.tunnels)
        except BridgeNotFound:
            return {"error": f"Tunnel '{tunnel}' not found in config or catalog"}
-        mgr = TunnelManager(tcfg, state_dir=sd)
-        mgr.stop()
-        mgr.start()
-        restarted.append(tunnel)
+        actions = [restart_tunnel(tcfg, state_mgr)]
    else:
-        for name, tcfg in cfg.tunnels.items():
-            mgr = TunnelManager(tcfg, state_dir=sd)
-            mgr.stop()
-            mgr.start()
-            restarted.append(name)
+        actions = restart_all_tunnels(cfg, state_mgr)

-    return {"restarted": restarted}
+    payload = {
+        "actions": [
+            {"tunnel": a.tunnel, "action": a.action, "detail": a.detail}
+            for a in actions
+        ],
+    }
+    if any(a.action == "error" for a in actions):
+        payload["error"] = "one or more tunnels failed to restart"
+    return payload


 @mcp.tool()
@@ -513,4 +517,13 @@ def resource_catalog_targets() -> str:
 # ---------------------------------------------------------------------------

 if __name__ == "__main__":
-    mcp.run(transport="stdio")
+    import argparse
+    parser = argparse.ArgumentParser(description="OpsBridge MCP server")
+    parser.add_argument("--http", action="store_true", help="Run in SSE/HTTP mode instead of stdio")
+    args = parser.parse_args()
+
+    if args.http:
+        port = int(os.environ.get("BRIDGE_MCP_PORT", "8002"))
+        mcp.run(transport="sse", host="127.0.0.1", port=port)
+    else:
+        mcp.run(transport="stdio")
--- a/src/bridge/models.py
+++ b/src/bridge/models.py
@@ -15,6 +15,16 @@ class BridgeState(str, Enum):
    FAILED = "failed"


+class ActorType(str, Enum):
+    ADM = "adm"  # human operator
+    AGT = "agt"  # LLM-powered autonomous agent
+    ATM = "atm"  # deterministic script / pipeline
+
+
+class CertAcquisitionError(Exception):
+    """Raised when cert_command fails to produce a certificate."""
+
+
 @dataclass
 class ReconnectPolicy:
    max_attempts: int = 0  # 0 = infinite
@@ -41,10 +51,15 @@ class TunnelConfig:
    reconnect: ReconnectPolicy = field(default_factory=ReconnectPolicy)
    health_check: Optional[HealthCheckConfig] = None
    direction: str = "reverse"  # "reverse" (-R) or "local" (-L)
+    # Forward-destination host as seen from the remote end (direction "local")
+    # or from this workstation (direction "reverse"). Defaults to loopback;
+    # set e.g. a k3s ClusterIP to tunnel to an in-cluster Service.
+    remote_host: str = "127.0.0.1"
+    cert_command: Optional[str] = None


 @dataclass
 class ActorInfo:
    name: str
-    actor_class: str  # "human" or "automation"
+    actor_type: ActorType
    description: str = ""
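
Review note: the new `remote_host` field replaces the hard-coded `127.0.0.1` in both forward specs built by `build_ssh_command`. A small sketch of the resulting `-L` / `-R` arguments is shown below. `forward_args` is a hypothetical helper, and the ClusterIP `10.43.0.1` is an illustrative value, not one taken from the repository.

```python
from typing import List


def forward_args(direction: str, local_port: int, remote_port: int,
                 remote_host: str = "127.0.0.1") -> List[str]:
    """Build the ssh forwarding flag the way build_ssh_command does in the diff."""
    if direction == "local":
        # -L: listen locally, connect to remote_host:remote_port from the remote end.
        return ["-L", f"{local_port}:{remote_host}:{remote_port}"]
    # -R: listen remotely, connect to remote_host:local_port from this workstation.
    return ["-R", f"{remote_port}:{remote_host}:{local_port}"]


print(forward_args("local", 6443, 6443, "10.43.0.1"))
print(forward_args("reverse", 8000, 18000))
```

Because the default stays at loopback, existing tunnel definitions that omit `remote_host` produce the same command line as before the change.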
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -23,10 +23,10 @@ VALID_CONFIG = textwrap.dedent("""\
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
-        actor: operator.bernd
+        actor: adm-bernd
    actors:
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
        description: Bernd
 """)

@@ -38,10 +38,10 @@ VALID_CONFIG_WITH_CATALOG = textwrap.dedent("""\
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
-        actor: operator.bernd
+        actor: adm-bernd
    actors:
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
        description: Bernd
    catalog_path: {catalog_path}
 """)
--- a/tests/test_audit.py
+++ b/tests/test_audit.py
@@ -22,7 +22,7 @@ class TestAuditLogger:
            tunnel="my-tunnel",
            event=AuditEvent.BRIDGE_STARTED,
            actor="operator.bernd",
-            actor_class="human",
+            actor_type="adm",
        )
        log_file = log_dir / "my-tunnel.log"
        assert log_file.exists()
@@ -32,7 +32,7 @@ class TestAuditLogger:
            tunnel="my-tunnel",
            event=AuditEvent.BRIDGE_STARTED,
            actor="operator.bernd",
-            actor_class="human",
+            actor_type="adm",
        )
        lines = (log_dir / "my-tunnel.log").read_text().strip().splitlines()
        assert len(lines) == 1
@@ -40,12 +40,12 @@ class TestAuditLogger:
        assert entry["tunnel"] == "my-tunnel"
        assert entry["event"] == "bridge_started"
        assert entry["actor"] == "operator.bernd"
-        assert entry["actor_class"] == "human"
+        assert entry["actor_type"] == "adm"
        assert "timestamp" in entry

    def test_multiple_events_append(self, logger, log_dir):
        for event in [AuditEvent.BRIDGE_STARTED, AuditEvent.BRIDGE_CONNECTED, AuditEvent.BRIDGE_STOPPED]:
-            logger.log(tunnel="t", event=event, actor="a", actor_class="human")
+            logger.log(tunnel="t", event=event, actor="a", actor_type="adm")
        lines = (log_dir / "t.log").read_text().strip().splitlines()
        assert len(lines) == 3

@@ -54,7 +54,7 @@ class TestAuditLogger:
            tunnel="t",
            event=AuditEvent.HEALTH_CHECK_FAILED,
            actor="a",
-            actor_class="automation",
+            actor_type="atm",
            detail="connection refused",
        )
        entry = json.loads((log_dir / "t.log").read_text().strip())
@@ -72,15 +72,15 @@ class TestAuditLogger:

    def test_timestamp_is_iso8601(self, logger, log_dir):
        from datetime import datetime
-        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_class="human")
+        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
        entry = json.loads((log_dir / "t.log").read_text().strip())
        # Should parse without error
        dt = datetime.fromisoformat(entry["timestamp"])
        assert dt.tzinfo is not None or True  # UTC or naive both acceptable

    def test_read_events(self, logger, log_dir):
-        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_class="human")
-        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_class="human")
+        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STARTED, actor="a", actor_type="adm")
+        logger.log(tunnel="t", event=AuditEvent.BRIDGE_STOPPED, actor="a", actor_type="adm")
        events = logger.read_events("t")
        assert len(events) == 2
        assert events[0]["event"] == "bridge_started"
--- a/tests/test_cleanup.py
+++ b/tests/test_cleanup.py
@@ -0,0 +1,130 @@
+"""Tests for stale SSH forward cleanup."""
+from __future__ import annotations
+
+import textwrap
+from unittest.mock import MagicMock, patch
+
+from typer.testing import CliRunner
+
+from bridge.cleanup import (
+    CleanupAction,
+    build_cron_line,
+    cleanup_all_tunnels,
+    remote_forward_health_url,
+    should_cleanup_tunnel,
+)
+from bridge.cli import app
+from bridge.config import load_config
+from bridge.models import HealthCheckConfig, TunnelConfig
+from bridge.state import StateManager
+
+
+def _tunnel(**overrides) -> TunnelConfig:
+    base = dict(
+        name="state-hub-railiance01",
+        host="92.205.62.239",
+        remote_port=18000,
+        local_port=8000,
+        ssh_user="tegwick",
+        ssh_key="~/.ssh/id_ops",
+        actor="agt-claude-railiance01",
+        health_check=HealthCheckConfig(
+            url="http://127.0.0.1:8000/state/health",
+            timeout_seconds=5,
+        ),
+    )
+    base.update(overrides)
+    return TunnelConfig(**base)
+
+
+class TestRemoteForwardHealthUrl:
+    def test_maps_local_port_to_remote(self):
+        cfg = _tunnel()
+        assert remote_forward_health_url(cfg) == "http://127.0.0.1:18000/state/health"
+
+    def test_returns_none_for_local_tunnel(self):
+        cfg = _tunnel(direction="local")
+        assert remote_forward_health_url(cfg) is None
+
+
+class TestShouldCleanupTunnel:
+    def test_skips_healthy_remote_forward(self, tmp_path):
+        cfg = _tunnel()
+        state_mgr = StateManager(state_dir=tmp_path)
+        with (
+            patch("bridge.cleanup.remote_port_listening", return_value=True),
+            patch("bridge.cleanup.probe_remote_forward", return_value=(True, "ok")),
+        ):
+            needed, reason = should_cleanup_tunnel(cfg, state_mgr)
+        assert needed is False
+
+    def test_detects_stale_forward_when_local_ok_remote_fails(self, tmp_path):
+        cfg = _tunnel()
+        state_mgr = StateManager(state_dir=tmp_path)
+        with (
+            patch("bridge.cleanup.remote_port_listening", return_value=True),
+            patch("bridge.cleanup.probe_remote_forward", return_value=(False, "timeout")),
+            patch("bridge.cleanup.local_service_healthy", return_value=True),
+            patch(
+                "bridge.cleanup.check_tunnel",
+                return_value=MagicMock(ssh_process="ok", remote_port="listening"),
+            ),
+        ):
+            needed, reason = should_cleanup_tunnel(cfg, state_mgr)
+        assert needed is True
+        assert "stale forward" in reason
+
+
+class TestCleanupAllTunnels:
+    def test_reports_cleaned_tunnel(self, tmp_path, monkeypatch):
+        monkeypatch.setenv("BRIDGE_CONFIG", str(tmp_path / "tunnels.yaml"))
+        (tmp_path / "tunnels.yaml").write_text(
+            textwrap.dedent(
+                """\
+                tunnels:
+                  state-hub-railiance01:
+                    host: 92.205.62.239
+                    remote_port: 18000
+                    local_port: 8000
+                    ssh_user: tegwick
+                    ssh_key: ~/.ssh/id_ops
+                    actor: agt-claude-railiance01
+                    health_check:
+                      url: http://127.0.0.1:8000/state/health
+                actors:
+                  agt-claude-railiance01:
+                    class: agt
+                """
+            )
+        )
+        cfg = load_config()
+        state_mgr = StateManager(state_dir=tmp_path / "state")
+        with patch(
+            "bridge.cleanup.cleanup_tunnel",
+            return_value=CleanupAction("state-hub-railiance01", "cleaned", "cleared"),
+        ):
+            report = cleanup_all_tunnels(cfg, state_mgr, restart=False)
+        assert report.cleaned_count == 1
+        assert report.actions[0].action == "cleaned"
+
+
+class TestMaintenanceCli:
+    def test_cleanup_help(self):
+        runner = CliRunner()
+        result = runner.invoke(app, ["maintenance", "cleanup", "--help"])
+        assert result.exit_code == 0
+        assert "restart" in result.output.lower()
+
+    def test_show_cron_prints_template_when_not_installed(self):
+        runner = CliRunner()
+        with patch("bridge.cli.read_installed_cron", return_value=None):
+            result = runner.invoke(app, ["maintenance", "show-cron"])
+        assert result.exit_code == 0
+        assert "0 3 * * *" in result.output
+
+
+def test_build_cron_line_contains_marker():
+    line = build_cron_line()
+    assert "0 3 * * *" in line
+    assert "maintenance cleanup --restart" in line
+    assert "ops-bridge: maintenance cleanup" in line
--- a/tests/test_cli.py
+++ b/tests/test_cli.py
@@ -17,10 +17,10 @@ VALID_CONFIG = textwrap.dedent("""\
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
-        actor: operator.bernd
+        actor: adm-bernd
    actors:
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
        description: Bernd
 """)

@@ -266,22 +266,146 @@ class TestCheckCommand:
        assert result.exit_code == 1


+REVERSE_CONFIG = VALID_CONFIG
+
+LOCAL_TUNNEL_CONFIG = textwrap.dedent("""\
+    tunnels:
+      k3s-api:
+        host: host.local
+        remote_port: 6443
+        local_port: 6443
+        ssh_user: ubuntu
+        ssh_key: ~/.ssh/id_ops
+        actor: adm-bernd
+        direction: local
+    actors:
+      adm-bernd:
+        class: adm
+        description: Bernd
+""")
+
+
 class TestRestartCommand:
    def test_restart_unknown_tunnel_exit_1(self, env):
        result = runner.invoke(app, ["restart", "nonexistent"], env=env)
        assert result.exit_code == 1

+    def test_restart_help_mentions_remote_cleanup(self):
+        result = runner.invoke(app, ["restart", "--help"])
+        assert result.exit_code == 0
+        assert "stale-forward" in result.output.lower() or "remote" in result.output.lower()
+
    @pytest.mark.capability("bridge_restart")
    @pytest.mark.access_mode("cli")
-    def test_restart_calls_stop_then_start(self, env):
-        with patch("bridge.cli.TunnelManager") as mock_mgr_cls:
+    def test_restart_reverse_tunnel_delegates_to_cleanup(self, env):
+        from bridge.cleanup import CleanupAction
+
+        with patch("bridge.cli.restart_tunnel") as mock_restart:
+            mock_restart.return_value = CleanupAction(
+                "test-tunnel", "healthy", "remote forward healthy"
+            )
+            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
+
+        assert result.exit_code == 0
+        mock_restart.assert_called_once()
+        assert "test-tunnel: healthy" in result.output
+
+    def test_restart_reverse_tunnel_reports_cleaned_and_restarted(self, env):
+        from bridge.cleanup import CleanupAction
+
+        with patch("bridge.cli.restart_tunnel") as mock_restart:
+            mock_restart.return_value = CleanupAction(
+                "test-tunnel",
+                "cleaned_and_restarted",
+                "stale forward; restarted tunnel; cleared",
+            )
+            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
+
+        assert result.exit_code == 0
+        assert "cleaned_and_restarted" in result.output
+
+    def test_restart_reverse_tunnel_error_exit_1(self, env):
+        from bridge.cleanup import CleanupAction
+
+        with patch("bridge.cli.restart_tunnel") as mock_restart:
+            mock_restart.return_value = CleanupAction(
+                "test-tunnel", "error", "cleanup failed: still_listening"
+            )
+            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
+
+        assert result.exit_code == 1
+        assert "error" in result.output
+
+    def test_restart_local_tunnel_uses_stop_start(self, tmp_path, state_dir):
+        config_file = tmp_path / "tunnels.yaml"
+        config_file.write_text(LOCAL_TUNNEL_CONFIG)
+        env = {
+            "BRIDGE_CONFIG": str(config_file),
+            "BRIDGE_STATE_DIR": str(state_dir),
+        }
+
+        with patch("bridge.cleanup.TunnelManager") as mock_mgr_cls:
            mock_mgr = MagicMock()
            mock_mgr_cls.return_value = mock_mgr
            call_order = []
            mock_mgr.stop.side_effect = lambda: call_order.append("stop")
            mock_mgr.start.side_effect = lambda: call_order.append("start")

-            result = runner.invoke(app, ["restart", "test-tunnel"], env=env)
+            result = runner.invoke(app, ["restart", "k3s-api"], env=env)

        assert result.exit_code == 0
        assert call_order == ["stop", "start"]
+        assert "k3s-api: restarted" in result.output
+
+
+class TestCertStatusCommand:
+    @pytest.mark.capability("bridge_cert_status")
+    @pytest.mark.access_mode("cli")
+    def test_cert_status_no_cert_shows_static_key(self, env, state_dir):
+        result = runner.invoke(app, ["cert-status"], env=env)
+        assert result.exit_code == 0
+        assert "static-key" in result.output
+
+    def test_cert_status_json_no_cert(self, env, state_dir):
+        result = runner.invoke(app, ["cert-status", "--json"], env=env)
+        assert result.exit_code == 0
+        data = json.loads(result.output)
+        assert data[0]["mode"] == "static-key"
+
+    def test_cert_status_exit_1_on_expired(self, env, state_dir, tmp_path):
+        # Write a fake cert file in state dir; mock ssh-keygen to report expired
+        state_dir.mkdir(parents=True, exist_ok=True)
+        cert_file = state_dir / "test-tunnel-cert.pub"
+        cert_file.write_text("fake cert")
+        with patch("subprocess.run") as mock_run:
+            mock_run.return_value = MagicMock(
+                stdout=(
+                    "test-tunnel-cert.pub:\n"
+                    "        Key ID: \"agt-test\"\n"
+                    "        Valid: from 2026-01-01T00:00:00 to 2026-01-02T00:00:00\n"
+                ),
+                returncode=0,
+            )
+            result = runner.invoke(app, ["cert-status"], env=env)
+        assert result.exit_code == 1
+        assert "EXPIRED" in result.output
+
+    def test_cert_status_json_with_cert(self, env, state_dir):
+        state_dir.mkdir(parents=True, exist_ok=True)
+        cert_file = state_dir / "test-tunnel-cert.pub"
+        cert_file.write_text("fake cert")
+        with patch("subprocess.run") as mock_run:
+            mock_run.return_value = MagicMock(
+                stdout=(
+                    "test-tunnel-cert.pub:\n"
+                    "        Key ID: \"agt-test\"\n"
+                    "        Valid: from 2030-01-01T00:00:00 to 2030-01-02T00:00:00\n"
+                ),
+                returncode=0,
+            )
+            result = runner.invoke(app, ["cert-status", "--json"], env=env)
+        assert result.exit_code == 0
+        data = json.loads(result.output)
+        assert data[0]["mode"] == "cert"
+        assert data[0]["key_id"] == "agt-test"
+        assert data[0]["expired"] is False
--- a/tests/test_config.py
+++ b/tests/test_config.py
@@ -1,9 +1,11 @@
 """Tests for config loading."""
 import textwrap
+import warnings

 import pytest

 from bridge.config import ConfigError, load_config
+from bridge.models import ActorType


 VALID_YAML = textwrap.dedent("""\
@@ -14,7 +16,7 @@ VALID_YAML = textwrap.dedent("""\
        local_port: 8000
        ssh_user: ubuntu
        ssh_key: ~/.ssh/id_ops
-        actor: agent.claude-coulombcore
+        actor: agt-claude-coulombcore
        health_check:
          url: http://127.0.0.1:18000/health
          interval_seconds: 30
@@ -25,11 +27,11 @@ VALID_YAML = textwrap.dedent("""\
          backoff_max: 60

    actors:
-      agent.claude-coulombcore:
-        class: automation
+      agt-claude-coulombcore:
+        class: agt
        description: Claude Code agent on CoulombCore
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
        description: Bernd Worsch
 """)

@@ -50,7 +52,7 @@ def test_load_valid_config(config_file, monkeypatch):
    assert t.remote_port == 18000
    assert t.local_port == 8000
    assert t.ssh_user == "ubuntu"
-    assert t.actor == "agent.claude-coulombcore"
+    assert t.actor == "agt-claude-coulombcore"


 def test_health_check_loaded(config_file, monkeypatch):
@@ -74,10 +76,10 @@ def test_reconnect_policy_loaded(config_file, monkeypatch):
 def test_actors_loaded(config_file, monkeypatch):
    monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
    cfg = load_config()
-    assert "agent.claude-coulombcore" in cfg.actors
-    a = cfg.actors["agent.claude-coulombcore"]
-    assert a.actor_class == "automation"
-    assert "operator.bernd" in cfg.actors
+    assert "agt-claude-coulombcore" in cfg.actors
+    a = cfg.actors["agt-claude-coulombcore"]
+    assert a.actor_type == ActorType.AGT
+    assert "adm-bernd" in cfg.actors


 def test_missing_required_field_raises(tmp_path, monkeypatch):
@@ -118,12 +120,180 @@ def test_tunnel_without_health_check(tmp_path, monkeypatch):
            local_port: 8000
            ssh_user: ubuntu
            ssh_key: ~/.ssh/id_rsa
-            actor: operator.bernd
+            actor: adm-bernd
        actors:
-          operator.bernd:
-            class: human
+          adm-bernd:
+            class: adm
            description: Bernd
    """))
    monkeypatch.setenv("BRIDGE_CONFIG", str(f))
    cfg = load_config()
    assert cfg.tunnels["simple"].health_check is None
+
+
+class TestActorTypeValidation:
+    def test_canonical_agt_accepted(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: agt-claude
+            actors:
+              agt-claude:
+                class: agt
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        cfg = load_config()
+        assert cfg.actors["agt-claude"].actor_type == ActorType.AGT
+
+    def test_canonical_atm_accepted(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: atm-backup
+            actors:
+              atm-backup:
+                class: atm
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        cfg = load_config()
+        assert cfg.actors["atm-backup"].actor_type == ActorType.ATM
+
+    def test_wrong_prefix_raises_config_error(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: adm-bernd
+            actors:
+              adm-bernd:
+                class: agt
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        with pytest.raises(ConfigError, match="must start with 'agt-'"):
+            load_config()
+
+    def test_missing_prefix_raises_config_error(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: operator.bernd
+            actors:
+              operator.bernd:
+                class: adm
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        with pytest.raises(ConfigError, match="must start with 'adm-'"):
+            load_config()
+
+    def test_unknown_class_raises_config_error(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: adm-bernd
+            actors:
+              adm-bernd:
+                class: wizard
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        with pytest.raises(ConfigError, match="unknown class"):
+            load_config()
+
+    def test_legacy_human_maps_to_adm_with_warning(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: adm-bernd
+            actors:
+              adm-bernd:
+                class: human
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        with warnings.catch_warnings(record=True) as w:
+            warnings.simplefilter("always")
+            cfg = load_config()
+        assert cfg.actors["adm-bernd"].actor_type == ActorType.ADM
+        assert any("deprecated" in str(x.message).lower() for x in w)
+
+    def test_legacy_automation_maps_to_atm_with_warning(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: atm-cron
+            actors:
+              atm-cron:
+                class: automation
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        with warnings.catch_warnings(record=True) as w:
+            warnings.simplefilter("always")
+            cfg = load_config()
+        assert cfg.actors["atm-cron"].actor_type == ActorType.ATM
+        assert any("deprecated" in str(x.message).lower() for x in w)
+
+
+class TestCertCommandConfig:
+    def test_cert_command_parsed(self, tmp_path, monkeypatch):
+        f = tmp_path / "t.yaml"
+        f.write_text(textwrap.dedent("""\
+            tunnels:
+              t:
+                host: h
+                remote_port: 1
+                local_port: 2
+                ssh_user: u
+                ssh_key: ~/.ssh/k
+                actor: agt-bridge
+                cert_command: "warden sign agt-bridge --pubkey /tmp/k.pub"
+            actors:
+              agt-bridge:
+                class: agt
+        """))
+        monkeypatch.setenv("BRIDGE_CONFIG", str(f))
+        cfg = load_config()
+        assert cfg.tunnels["t"].cert_command == "warden sign agt-bridge --pubkey /tmp/k.pub"
+
+    def test_no_cert_command_is_none(self, config_file, monkeypatch):
+        monkeypatch.setenv("BRIDGE_CONFIG", str(config_file))
+        cfg = load_config()
+        assert cfg.tunnels["state-hub-coulombcore"].cert_command is None
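Note on `TestActorTypeValidation` above: the cases pin down a naming rule. The `class` must be one of `adm`, `agt`, or `atm`; the actor name must carry the matching prefix; and the legacy values `human` and `automation` still load but emit a deprecation warning. The following is a minimal sketch of a validator that would satisfy those cases. It is not the code from `bridge.config`, which this diff does not show. `resolve_actor_type` is a hypothetical helper, and `ActorType` and `ConfigError` are local stand-ins whose shape is assumed.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    """Stand-in for bridge.models.ActorType (assumed shape)."""
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


class ConfigError(ValueError):
    """Stand-in for bridge.config.ConfigError."""


# Legacy class names that the tests expect to keep loading, with a warning.
LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def resolve_actor_type(name: str, cls: str) -> ActorType:
    """Hypothetical helper: map an actor's `class` value to an ActorType
    and check that the actor name carries the matching prefix."""
    if cls in LEGACY_CLASSES:
        actor_type = LEGACY_CLASSES[cls]
        warnings.warn(
            f"actor {name!r}: class {cls!r} is deprecated, use {actor_type.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
    else:
        try:
            actor_type = ActorType(cls)
        except ValueError:
            raise ConfigError(f"actor {name!r}: unknown class {cls!r}") from None

    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
    return actor_type
```

Under this sketch the prefix check runs after the legacy mapping, so `operator.bernd` with `class: adm` fails with the `'adm-'` message that `test_missing_prefix_raises_config_error` matches on.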
--- a/tests/test_diagnostics.py
+++ b/tests/test_diagnostics.py
@@ -6,7 +6,11 @@ from unittest.mock import MagicMock, patch

 import pytest

-from bridge.diagnostics import TunnelCheckResult, check_all_tunnels, check_tunnel
+from bridge.diagnostics import (
+    _remote_port_probe_command,
+    check_all_tunnels,
+    check_tunnel,
+)
 from bridge.models import BridgeState, TunnelConfig
 from bridge.state import StateManager

@@ -20,7 +24,7 @@ def tcfg():
        local_port=8000,
        ssh_user="ubuntu",
        ssh_key="~/.ssh/id_ops",
-        actor="operator.bernd",
+        actor="adm-bernd",
    )


@@ -32,6 +36,14 @@ def state_mgr(tmp_path):


 class TestCheckTunnel:
+    def test_remote_port_probe_has_minimal_host_fallback(self):
+        """Remote probe supports minimal hosts without ss/netstat."""
+        command = _remote_port_probe_command(18000)
+        assert "command -v ss" in command
+        assert "command -v netstat" in command
+        assert "/proc/net/tcp" in command
+        assert "/proc/net/tcp6" in command
+
    def test_no_pid(self, tcfg, state_mgr):
        """No PID file → ssh_process='no_pid', ok=False."""
        with patch("bridge.diagnostics.subprocess.run") as mock_run:
@@ -83,6 +95,29 @@ class TestCheckTunnel:
        assert result.remote_port == "closed"
        assert result.ok is False

+    def test_local_direction_checks_local_port(self, tcfg, state_mgr):
+        """Local tunnels verify the local listener instead of a remote -R port."""
+        local_cfg = TunnelConfig(
+            name="local-tunnel",
+            host="haskelseed.local",
+            remote_port=1234,
+            local_port=11234,
+            ssh_user="root",
+            ssh_key="~/.ssh/id_ops",
+            actor="adm-bernd",
+            direction="local",
+        )
+        state_mgr.write_pid("local-tunnel", 12345)
+        with (
+            patch("bridge.diagnostics._pid_alive", return_value=True),
+            patch("bridge.diagnostics._probe_local_port", return_value="listening"),
+            patch("bridge.diagnostics.subprocess.run") as mock_run,
+        ):
+            result = check_tunnel(local_cfg, state_mgr)
+        mock_run.assert_not_called()
+        assert result.remote_port == "listening"
+        assert result.ok is True
+
    def test_ssh_timeout(self, tcfg, state_mgr):
        """SSH probe timeout → remote_port='error:timeout'."""
        state_mgr.write_pid("test-tunnel", 12345)
@@ -114,7 +149,7 @@ class TestCheckTunnel:
            local_port=8000,
            ssh_user="ubuntu",
            ssh_key="~/.ssh/id_ops",
-            actor="operator.bernd",
+            actor="adm-bernd",
            health_check=HealthCheckConfig(url="http://127.0.0.1:8000/health"),
        )
        state_mgr.write_pid("test-tunnel", 12345)
@@ -135,7 +170,8 @@ class TestCheckAllTunnels:
    def test_check_all_iterates_tunnels(self, tmp_path):
        """check_all_tunnels returns one result per tunnel in cfg."""
        from bridge.config import load_config
-        import textwrap, os
+        import textwrap
+        import os

        cfg_file = tmp_path / "tunnels.yaml"
        cfg_file.write_text(textwrap.dedent("""\
@@ -146,17 +182,17 @@ class TestCheckAllTunnels:
                local_port: 8001
                ssh_user: ubuntu
                ssh_key: ~/.ssh/id_ops
-                actor: operator.bernd
+                actor: adm-bernd
              t2:
                host: h2.local
                remote_port: 18002
                local_port: 8002
                ssh_user: ubuntu
                ssh_key: ~/.ssh/id_ops
-                actor: operator.bernd
+                actor: adm-bernd
            actors:
-              operator.bernd:
-                class: human
+              adm-bernd:
+                class: adm
                description: Bernd
        """))
        os.environ["BRIDGE_CONFIG"] = str(cfg_file)
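Note on `test_remote_port_probe_has_minimal_host_fallback` above: the test only checks that the probe command mentions `ss`, `netstat`, `/proc/net/tcp`, and `/proc/net/tcp6`. It does not show how the `/proc` fallback decides that a port is listening. As an illustration of that last step, here is a small sketch that scans `/proc/net/tcp`-style text for a socket in LISTEN state. The function name `port_is_listening` and the sample data are invented for this example; the real probe runs as a remote shell command and may work differently.

```python
def port_is_listening(proc_net_tcp: str, port: int) -> bool:
    """Return True if `port` shows up in LISTEN state in /proc/net/tcp-style text."""
    want = f"{port:04X}"  # the kernel prints the port as 4 upper-case hex digits
    for line in proc_net_tcp.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        # local_address is HEXIP:HEXPORT (tcp6 just has a longer HEXIP);
        # state 0A is TCP_LISTEN.
        hex_port = local_address.rsplit(":", 1)[-1].upper()
        if hex_port == want and state.upper() == "0A":
            return True
    return False


# Invented sample: 127.0.0.1:18000 listening, 127.0.0.1:8000 established.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:4650 00000000:0000 0A 00000000:00000000 00:00000000 00000000  1000        0 12345 1 0000000000000000 100 0 0 10 0
   1: 0100007F:1F40 0100007F:C350 01 00000000:00000000 00:00000000 00000000  1000        0 12346 1 0000000000000000 20 4 30 10 -1
"""

print(port_is_listening(SAMPLE, 18000))
print(port_is_listening(SAMPLE, 8000))
```

Matching on the state column matters here: a port that only appears on an established connection should not count as an open forward.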
--- a/tests/test_integration.py
+++ b/tests/test_integration.py
@@ -18,14 +18,14 @@ MINIMAL_CONFIG = textwrap.dedent("""\
        local_port: 8000
        ssh_user: testuser
        ssh_key: ~/.ssh/id_rsa
-        actor: operator.bernd
+        actor: adm-bernd
        reconnect:
          max_attempts: 2
          backoff_initial: 1
          backoff_max: 2
    actors:
-      operator.bernd:
-        class: human
+      adm-bernd:
+        class: adm
        description: Bernd
 """)

@@ -51,7 +51,7 @@ def tunnel_cfg():
        local_port=8000,
        ssh_user="testuser",
        ssh_key="~/.ssh/id_rsa",
-        actor="operator.bernd",
+        actor="adm-bernd",
        reconnect=ReconnectPolicy(max_attempts=2, backoff_initial=1, backoff_max=2),
    )

@@ -142,7 +142,7 @@ class TestHealthCheckDegradedPath:
            local_port=8001,
            ssh_user="u",
            ssh_key="k",
-            actor="operator.bernd",
+            actor="adm-bernd",
            reconnect=ReconnectPolicy(max_attempts=1, backoff_initial=1, backoff_max=1),
            health_check=hc_cfg,
        )
--- a/tests/test_manager.py
+++ b/tests/test_manager.py
@@ -3,6 +3,8 @@ import os
 import signal
 from unittest.mock import MagicMock, patch

+from dataclasses import replace
+
 import pytest

 from bridge.models import BridgeState, ReconnectPolicy, TunnelConfig
@@ -38,6 +40,16 @@ class TestBuildSshCommand:
        assert "-i" in cmd
        assert "ubuntu@host.local" in cmd

+    def test_remote_host_override_local(self, tunnel_cfg):
+        cfg = replace(tunnel_cfg, direction="local", remote_host="10.43.103.154")
+        cmd = build_ssh_command(cfg)
+        assert "-L" in cmd
+        assert f"{cfg.local_port}:10.43.103.154:{cfg.remote_port}" in cmd
+
+    def test_remote_host_default_loopback(self, tunnel_cfg):
+        cmd = build_ssh_command(tunnel_cfg)
+        assert "18000:127.0.0.1:8000" in cmd
+
    def test_server_alive_options(self, tunnel_cfg):
        cmd = build_ssh_command(tunnel_cfg)
        assert "-o" in cmd
@@ -105,3 +117,99 @@ class TestTunnelManager:
    def test_is_running_false_initially(self, tunnel_cfg, state_dir):
        mgr = TunnelManager(tunnel_cfg, state_dir=state_dir)
        assert not mgr.is_running()
+
+
+class TestBuildSshCommandWithCert:
+    def test_no_cert_path_omits_extra_i(self, tunnel_cfg):
+        cmd = build_ssh_command(tunnel_cfg)
+        assert cmd.count("-i") == 1
+
+    def test_cert_path_appends_after_key(self, tunnel_cfg, tmp_path):
+        cert = tmp_path / "test-cert.pub"
+        cert.write_text("cert")
+        cmd = build_ssh_command(tunnel_cfg, cert_path=cert)
+        i_indices = [i for i, x in enumerate(cmd) if x == "-i"]
+        assert len(i_indices) == 2
+        key_idx, cert_idx = i_indices
+        assert not cmd[key_idx + 1].endswith("-cert.pub")  # key comes first
+        assert cmd[cert_idx + 1] == str(cert)
+
+
+class TestRunCertCommand:
+    def test_returns_none_when_no_cert_command(self, tunnel_cfg, tmp_path):
+        from bridge.manager import _run_cert_command
+        assert _run_cert_command(tunnel_cfg, tmp_path) is None
+
+    def test_writes_cert_and_returns_path(self, tunnel_cfg, tmp_path):
+        from bridge.manager import _run_cert_command
+        tunnel_cfg.cert_command = "echo 'ssh-rsa-cert AAAA'"
+        path = _run_cert_command(tunnel_cfg, tmp_path)
+        assert path is not None
+        assert path.exists()
+        assert "ssh-rsa-cert" in path.read_text()
+
+    def test_raises_on_nonzero_exit(self, tunnel_cfg, tmp_path):
+        from bridge.manager import _run_cert_command
+        from bridge.models import CertAcquisitionError
+        tunnel_cfg.cert_command = "exit 1"
+        with pytest.raises(CertAcquisitionError):
+            _run_cert_command(tunnel_cfg, tmp_path)
+
+
+class TestActorTypeFromName:
+    def test_adm_prefix(self):
+        from bridge.manager import _actor_type_from_name
+        assert _actor_type_from_name("adm-bernd") == "adm"
+
+    def test_agt_prefix(self):
+        from bridge.manager import _actor_type_from_name
+        assert _actor_type_from_name("agt-claude") == "agt"
+
+    def test_atm_prefix(self):
+        from bridge.manager import _actor_type_from_name
+        assert _actor_type_from_name("atm-cron") == "atm"
+
+    def test_unknown_prefix(self):
+        from bridge.manager import _actor_type_from_name
+        assert _actor_type_from_name("operator.bernd") == "unknown"
+
+
+class TestTtlRefresh:
+    def test_parse_cert_expiry_returns_none_for_missing_file(self, tmp_path):
+        from bridge.manager import _parse_cert_expiry
+        missing = tmp_path / "no.pub"
+        result = _parse_cert_expiry(missing)
+        assert result is None
+
+    def test_parse_cert_identity_returns_none_for_missing_file(self, tmp_path):
+        from bridge.manager import _parse_cert_identity
+        missing = tmp_path / "no.pub"
+        result = _parse_cert_identity(missing)
+        assert result is None
+
+    def test_parse_cert_identity_from_keygen_output(self, tmp_path):
+        from unittest.mock import patch, MagicMock
+        from bridge.manager import _parse_cert_identity
+        cert = tmp_path / "test.pub"
+        cert.write_text("fake")
+        with patch("subprocess.run") as mock_run:
+            mock_run.return_value = MagicMock(
+                stdout='test.pub:\n        Key ID: "agt-bridge"\n',
+                returncode=0,
+            )
+            result = _parse_cert_identity(cert)
+        assert result == "agt-bridge"
+
+    def test_parse_cert_expiry_from_keygen_output(self, tmp_path):
+        from unittest.mock import patch, MagicMock
+        from bridge.manager import _parse_cert_expiry
+        cert = tmp_path / "test.pub"
+        cert.write_text("fake")
+        with patch("subprocess.run") as mock_run:
+            mock_run.return_value = MagicMock(
+                stdout="test.pub:\n        Valid: from 2026-05-15T10:00:00 to 2030-05-15T22:00:00\n",
+                returncode=0,
+            )
+            result = _parse_cert_expiry(cert)
+        assert result is not None
+        assert result.year == 2030
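Note on `TestTtlRefresh` above: the tests mock `subprocess.run` and feed it text with `Key ID:` and `Valid: from ... to ...` lines, which is the layout `ssh-keygen -L` prints for a certificate. The sketch below shows one way to pull the identity and expiry out of such text. `parse_key_id` and `parse_expiry` are hypothetical names operating on a string; the `_parse_cert_identity` and `_parse_cert_expiry` functions in `bridge.manager` take a path and are not shown in this diff, so their behaviour may differ.

```python
import re
from datetime import datetime
from typing import Optional

# Patterns for the two lines the tests rely on.
KEY_ID_RE = re.compile(r'^\s*Key ID:\s*"(?P<key_id>[^"]*)"', re.MULTILINE)
VALID_RE = re.compile(
    r"^\s*Valid:\s*from\s+(?P<start>\S+)\s+to\s+(?P<end>\S+)", re.MULTILINE
)


def parse_key_id(keygen_output: str) -> Optional[str]:
    """Return the certificate identity, or None if no Key ID line is present."""
    match = KEY_ID_RE.search(keygen_output)
    return match.group("key_id") if match else None


def parse_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the 'to' timestamp of the validity window as a naive datetime.

    Returns None when there is no bounded window, for example when the
    line reads 'Valid: forever' or is missing entirely.
    """
    match = VALID_RE.search(keygen_output)
    if match is None:
        return None
    return datetime.fromisoformat(match.group("end"))


sample = (
    "test.pub:\n"
    '        Key ID: "agt-bridge"\n'
    "        Valid: from 2026-05-15T10:00:00 to 2030-05-15T22:00:00\n"
)
print(parse_key_id(sample), parse_expiry(sample))
```

An expiry check such as the one `cert-status` reports could then compare the parsed value against `datetime.now()`; whether the timestamps should be treated as local time or UTC is an assumption the caller has to settle.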
--- a/tests/test_mcp.py
+++ b/tests/test_mcp.py
@@ -49,10 +49,10 @@ def _simple_config(tmp_path: Path) -> Path:
            local_port: 8000
            ssh_user: ubuntu
            ssh_key: ~/.ssh/id_ops
-            actor: operator.bernd
+            actor: adm-bernd
        actors:
-          operator.bernd:
-            class: human
+          adm-bernd:
+            class: adm
            description: Bernd
    """))

@@ -66,10 +66,10 @@ def _catalog_config(tmp_path: Path, catalog_dir: Path) -> Path:
            local_port: 8000
            ssh_user: ubuntu
            ssh_key: ~/.ssh/id_ops
-            actor: operator.bernd
+            actor: adm-bernd
        actors:
-          operator.bernd:
-            class: human
+          adm-bernd:
+            class: adm
            description: Bernd
        catalog_path: {catalog_dir}
    """))
@@ -237,22 +237,22 @@ class TestMcpBridgeDown:
 class TestMcpBridgeRestart:
    @pytest.mark.capability("bridge_restart")
    @pytest.mark.access_mode("mcp")
-    async def test_bridge_restart_calls_stop_then_start(self, env_simple):
-        with patch("bridge.manager.TunnelManager") as mock_cls:
-            mock_mgr = MagicMock()
-            call_order = []
-            mock_mgr.stop.side_effect = lambda: call_order.append("stop")
-            mock_mgr.start.side_effect = lambda: call_order.append("start")
-            mock_cls.return_value = mock_mgr
+    async def test_bridge_restart_delegates_to_cleanup(self, env_simple):
+        from bridge.cleanup import CleanupAction
+
+        with patch("bridge.cleanup.restart_tunnel") as mock_restart:
+            mock_restart.return_value = CleanupAction(
+                "test-tunnel", "healthy", "remote forward healthy"
+            )

            from fastmcp import Client
            async with Client(mcp) as c:
                result = await c.call_tool("bridge_restart", {"tunnel": "test-tunnel"})

        data = _data(result)
-        assert "restarted" in data
-        assert "test-tunnel" in data["restarted"]
-        assert call_order == ["stop", "start"]
+        assert data["actions"][0]["tunnel"] == "test-tunnel"
+        assert data["actions"][0]["action"] == "healthy"
+        mock_restart.assert_called_once()

    async def test_bridge_restart_unknown_tunnel(self, env_simple):
        from fastmcp import Client
@@ -278,8 +278,8 @@ class TestMcpBridgeLogs:
            _json.dumps({
                "timestamp": "2026-01-01T00:00:00+00:00",
                "tunnel": "test-tunnel",
-                "actor": "operator.bernd",
-                "actor_class": "human",
+                "actor": "adm-bernd",
+                "actor_type": "adm",
                "event": "bridge_started",
            }) + "\n"
        )
--- a/tests/test_models.py
+++ b/tests/test_models.py
@@ -69,6 +69,7 @@ class TestTunnelConfig:

 class TestActorInfo:
    def test_fields(self):
-        a = ActorInfo(name="operator.bernd", actor_class="human", description="Bernd")
-        assert a.name == "operator.bernd"
-        assert a.actor_class == "human"
+        from bridge.models import ActorType
+        a = ActorInfo(name="adm-bernd", actor_type=ActorType.ADM, description="Bernd")
+        assert a.name == "adm-bernd"
+        assert a.actor_type == ActorType.ADM
--- a/uv.lock
+++ b/uv.lock
@@ -345,7 +345,7 @@ wheels = [

 [[package]]
 name = "fastmcp"
-version = "3.1.0"
+version = "3.0.2"
 source = { registry = "https://pypi.org/simple" }
 dependencies = [
    { name = "authlib" },
@@ -365,14 +365,13 @@ dependencies = [
    { name = "python-dotenv" },
    { name = "pyyaml" },
    { name = "rich" },
-    { name = "uncalled-for" },
    { name = "uvicorn" },
    { name = "watchfiles" },
    { name = "websockets" },
 ]
-sdist = { url = "https://files.pythonhosted.org/packages/0a/70/862026c4589441f86ad3108f05bfb2f781c6b322ad60a982f40b303b47d7/fastmcp-3.1.0.tar.gz", hash = "sha256:e25264794c734b9977502a51466961eeecff92a0c2f3b49c40c070993628d6d0", size = 17347083 }
+sdist = { url = "https://files.pythonhosted.org/packages/11/6b/1a7ec89727797fb07ec0928e9070fa2f45e7b35718e1fe01633a34c35e45/fastmcp-3.0.2.tar.gz", hash = "sha256:6bd73b4a3bab773ee6932df5249dcbcd78ed18365ed0aeeb97bb42702a7198d7", size = 17239351 }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/17/07/516f5b20d88932e5a466c2216b628e5358a71b3a9f522215607c3281de05/fastmcp-3.1.0-py3-none-any.whl", hash = "sha256:b1f73b56fd3b0cb2bd9e2a144fc650d5cc31587ed129d996db7710e464ae8010", size = 633749 },
+    { url = "https://files.pythonhosted.org/packages/0a/5a/f410a9015cfde71adf646dab4ef2feae49f92f34f6050fcfb265eb126b30/fastmcp-3.0.2-py3-none-any.whl", hash = "sha256:f513d80d4b30b54749fe8950116b1aab843f3c293f5cb971fc8665cb48dbb028", size = 606268 },
 ]

 [[package]]
@@ -664,7 +663,7 @@ dev = [

 [package.metadata]
 requires-dist = [
-    { name = "fastmcp", specifier = ">=2.0.0" },
+    { name = "fastmcp", specifier = ">=2.0.0,<3.1.0" },
    { name = "httpx", specifier = ">=0.27" },
    { name = "pyyaml", specifier = ">=6.0" },
    { name = "typer", specifier = ">=0.12" },
@@ -1297,15 +1296,6 @@ wheels = [
    { url = "https://files.pythonhosted.org/packages/dc/9b/47798a6c91d8bdb567fe2698fe81e0c6b7cb7ef4d13da4114b41d239f65d/typing_inspection-0.4.2-py3-none-any.whl", hash = "sha256:4ed1cacbdc298c220f1bd249ed5287caa16f34d44ef4e9c3d0cbad5b521545e7", size = 14611 },
 ]

-[[package]]
-name = "uncalled-for"
-version = "0.2.0"
-source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/02/7c/b5b7d8136f872e3f13b0584e576886de0489d7213a12de6bebf29ff6ebfc/uncalled_for-0.2.0.tar.gz", hash = "sha256:b4f8fdbcec328c5a113807d653e041c5094473dd4afa7c34599ace69ccb7e69f", size = 49488 }
-wheels = [
-    { url = "https://files.pythonhosted.org/packages/ff/7f/4320d9ce3be404e6310b915c3629fe27bf1e2f438a1a7a3cb0396e32e9a9/uncalled_for-0.2.0-py3-none-any.whl", hash = "sha256:2c0bd338faff5f930918f79e7eb9ff48290df2cb05fcc0b40a7f334e55d4d85f", size = 11351 },
-]
-
 [[package]]
 name = "uvicorn"
 version = "0.41.0"
--- a/wiki/AccessManagementDirective.md
+++ b/wiki/AccessManagementDirective.md
@@ -0,0 +1,203 @@
+AccessManagementDirective
+
+*Practical host access control management*
+
+# AccessManagementDirective
+
+**Document Title:** SSH Access Management Directive  
+**Version:** 1.1 (Production-Ready Revision – Post-SWOT Improvements)  
+**Date:** 28 March 2026  
+**Audience:** Operations Department  
+**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).  
+**Author:** Grok (on behalf of the team)  
+**Status:** Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.  
+**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
+
+## 0. Prerequisites
+
+Before bootstrapping, the following must be in place:
+- Ansible (or equivalent config-management tool) with a central inventory.
+- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
+- GitOps repository containing the authoritative principals inventory.
+- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
+- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
+
+If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
+
+## 1. Concept Overview
+
+This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
+
+**Why this model?**  
+- A central CA signs short-lived certificates for every login.  
+- No more manual key copying, key sprawl, or painful revocation.  
+- Built-in expiration, role-based principals, and auditability.  
+- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.  
+- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
+
+**Core Principles**  
+- **Least privilege** – Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.  
+- **Short-lived credentials** – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).  
+- **One CA, many issuers** – A single offline User CA whose public key is trusted by every host.  
+- **Automation-first** – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).  
+- **Separation of concerns** –  
+  - **Admins (adm)**: Human operators (full interactive shell when needed).  
+  - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.  
+  - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
+
+## 2. Actor Definitions & Access Model
+
+| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
+|------------|-------------------|-------------|------------------------------|---------------------------|
+| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
+| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
+| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
+
+**Certificate Naming Convention**  
+- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`  
+- Principals (`-n`): comma-separated list of roles embedded in the certificate; each host lists the principals it accepts per account in `/etc/ssh/auth_principals/%u`
+
+**LLM-Agent Risk Clarification**  
+Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
+
+## 3. Bootstrapping the System (One-Time Setup)
+
+### 3.1. Create the CA (do this once, offline)
+```bash
+ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
+```
+- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.  
+- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.  
+- Public key: `ca_user.pub`
+
+### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
+- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).  
+- Update `/etc/ssh/sshd_config`:
+  ```bash
+  TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
+  AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
+  PubkeyAuthentication yes
+  PasswordAuthentication no
+  PermitRootLogin no
+  ```
+- Create principals directory and files from the central Git inventory.  
+- `systemctl restart sshd`
+
+### 3.3. Initial Admin Access
+The first admin generates a personal keypair and submits the `.pub` file; the CA then signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
+
+## 4. Automatic Management of Access Rights
+
+### 4.1. Daily / On-Demand Workflow
+1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)  
+   - **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.  
+   - **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).  
+   - **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
+
+2. **Ansible-Driven Host Updates** (run hourly via CI/CD)  
+   - `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).  
+   - Example inventory snippet:
+     ```yaml
+     hosts:
+       - name: prod-db-01
+         allowed_principals:
+           adm: [adm-full]
+           agt: [agt-incident-resolver-v2]
+           atm: [atm-backup-daily, atm-logrotate]
+     ```
+
+3. **Revocation & Rotation**  
+   - Short expiry = automatic revocation.  
+   - For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).  
+   - Agents/automations never store long-lived private keys on disk.
+
+4. **Concrete Agent & Automation Wrapper Example** (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`)
+   ```python
+   #!/usr/bin/env python3
+   import os
+   import subprocess
+   import sys
+   import tempfile
+
+   # Request a short-lived cert from Vault for the public key in $SSH_PUBKEY.
+   cert = subprocess.check_output(
+       ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
+        f"public_key={os.environ['SSH_PUBKEY']}"]
+   ).decode().strip()
+   # Write the cert to a temp file; the -cert.pub suffix follows OpenSSH naming.
+   with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
+       f.write(cert.encode())
+       cert_path = f.name
+   # Load into ssh-agent and exec the real command.
+   # Note: ssh-add normally takes a private key path and picks up <key>-cert.pub next to it.
+   subprocess.run(["ssh-add", cert_path])
+   os.execvp(sys.argv[1], sys.argv[1:])
+   ```
+   Agents call this wrapper; it auto-refreshes the cert on every wake-up.
+
+### 4.2. Human UX Guidance
+Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases.
+
+### 4.3. Emergency Break-Glass Procedure
+In case of total lockout (CA offline, misconfigured Ansible push, etc.):
+1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).  
+2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).  
+3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.  
+4. After recovery, immediately rotate the CA and run a full scorecard.
+
+## 5. AccessManagement Scorecard (Checklist)
+
+Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
+
+| Category | Check | Target | Tool |
+|----------|-------|--------|------|
+| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
+| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
+| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
+| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f <cert>` per certificate |
+| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
+| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
+| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
+| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
+| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
+| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
+| **Score** | 10/10 = **Operational** | - | - |
+
+**Scorecard Execution Command** (run from ops laptop):
+```bash
+ansible all -m command -a "ssh-access-scorecard.sh" --become
+```
+
+## 6. Scope & Operational Boundaries
+
+### 6.1. When Bootstrapping Is Officially Closed
+The system is **fully operational** when **ALL** of the following are true:
+- Scorecard passes 10/10 on every host.
+- Central Git repo contains the authoritative principals inventory.
+- First three admins have successfully used signed certificates for 7 consecutive days.
+- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
+- CI/CD pipeline for host config updates is green and runs hourly.
+- Emergency break-glass procedure has been tested once.
+
+**Declaration:** Ops Lead signs off with date in the Git commit message.
+
+### 6.2. Scope Boundary – When to Switch to Sophisticated Tooling
+Stay with **native OpenSSH CA + Ansible + Vault** while:
+- ≤ 200 hosts
+- ≤ 50 distinct agent/automation identities
+- No regulatory requirement for SSO or full session recording
+
+**Switch triggers** (any one):
+- More than 200 hosts OR rapid daily growth
+- Need for human SSO (Okta/Google) integration
+- Requirement for audited web-based SSH sessions or just-in-time access approval
+- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
+- Audit/compliance demands central policy engine or session recording
+
+**Recommended next-level tools** (in order):
+1. **Teleport** – Best for mixed human + agent workloads (SSO + Machine ID).  
+2. **HashiCorp Vault SSH + Boundary** – When you already use Vault heavily.  
+3. **step-ca + smallstep** – If you prefer a pure open-source CA with OIDC.
+
+**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
+
+## 7. Enforcement & Review
+- **Quarterly review** of this directive and scorecard results.  
+- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.  
+- **Questions / improvements** → create PR against this file in the ops repo.
+
+**End of Document**  
+Approved for immediate use across all production and staging environments.
+
--- a/wiki/OpsBridge.md
+++ b/wiki/OpsBridge.md
@@ -157,31 +157,82 @@ Just controlled operational access when you need it.
 Start a bridge:

 ```
-ob up hostA=hostB
+bridge up state-hub-railiance01
 ```

 Check active bridges:

 ```
-ob status
+bridge status
 ```

 Investigate infrastructure targets:

 ```
-ob targets
+bridge targets
 ```

 Stop the bridge when finished:

 ```
-ob down hostA=hostB
+bridge down state-hub-railiance01
 ```

 OpsBridge handles the lifecycle so operators can focus on solving the problem.

 ---

+# Tunnel lifecycle commands
+
+| Command | Purpose |
+|---------|---------|
+| `bridge up` | Start tunnel(s) that are not already running |
+| `bridge down` | Stop tunnel(s) that are running |
+| `bridge restart` | Blank-slate recovery — get tunnel(s) operational again |
+| `bridge maintenance cleanup` | Proactive hygiene sweep without implying restart |
+
+## `bridge restart` — blank-slate recovery
+
+`bridge restart` means *operational again*, not merely cycling the local manager
+PID while a broken remote listener still holds the port.
+
+For **reverse** tunnels (State Hub exposure on remote hosts), restart:
+
+1. Runs `should_cleanup_tunnel` to detect stale SSH remote forwards
+2. Clears orphan listeners on the remote host when needed
+3. Reconnects the tunnel (stop + start) only when cleanup was required
+
+When the remote forward is already healthy, restart reports `healthy` and leaves
+the working tunnel running — no unnecessary disruption.
+
+For **local-direction** tunnels (`direction: local` in `tunnels.yaml`, e.g.
+`k3s-api-coulombcore`), restart uses local stop/start only; no remote cleanup.
+
+Use `bridge maintenance cleanup` for scheduled or manual hygiene without the
+restart contract. The nightly cron (`bridge maintenance install-cron`) runs
+`maintenance cleanup --restart` at 03:00.
+
+**Incident context:** stale orphan `sshd` remote forwards after laptop sleep
+blocked `bridge restart` until operators discovered the maintenance subcommand.
+See `state-hub/history/20260621-weekend-automation-assessment.md` and
+`BRIDGE-WP-0005` in this repo.
+
+## Host roles
+
+Tunnels in `~/.config/bridge/tunnels.yaml` serve three host roles:
+
+| Role | Hosts | Behaviour |
+|------|-------|-----------|
+| **Workstation origin** | WSL laptop | Shutdown, sleep, and network changes kill local bridge processes without graceful remote SSH teardown. Orphan forwards on all remotes are common after wake. |
+| **VPS remotes** | coulombcore, railiance01 | Normally always-on. Maintenance reboots clear kernel state, but laptop return can leave orphan forwards from the previous session if the VPS did not reboot. |
+| **LAN builder** | haskelseed | Intermittently offline; same orphan-forward pattern when the workstation-side tunnel dies uncleanly. |
+
+Conditional remote cleanup before restart benefits all reverse tunnels.
+`should_cleanup_tunnel` skips healthy forwards — VPS tunnels with live working
+forwards are untouched.
+
+---
+
 # The Philosophy Behind OpsBridge

 Infrastructure teams succeed or fail based on how effectively they bridge the gaps between:
--- a/workplans/ADHOC-2026-06-14.md
+++ b/workplans/ADHOC-2026-06-14.md
@@ -0,0 +1,56 @@
+---
+id: ADHOC-2026-06-14
+type: workplan
+title: "Ad hoc ops-bridge fixes for 2026-06-14"
+domain: custodian
+repo: ops-bridge
+status: finished
+owner: codex
+topic_slug: ops-bridge
+created: "2026-06-14"
+updated: "2026-06-14"
+state_hub_workstream_id: "fbc2ef7e-626f-4c6a-bdf8-c69bf29097ce"
+---
+
+## Fix haskelseed bridge diagnostics
+
+```task
+id: ADHOC-2026-06-14-T01
+status: done
+priority: medium
+state_hub_task_id: "ffe6b8d8-889c-4ec4-8b64-00b77f86e39f"
+```
+
+`haskelseed` is an Alpine host without `ss`, so `bridge check` reported
+reverse tunnel ports as closed even while SSH reverse listeners were present.
+Updated diagnostics to fall back from `ss` to `netstat` and then
+`/proc/net/tcp`/`tcp6`. Also fixed local-direction diagnostics so
+`nix-daemon-haskelseed` checks the local `-L` listener instead of probing a
+remote reverse port.
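+
+The last fallback step can be sketched as below. This is an illustration only:
+the helper name `listening_ports` is hypothetical, and the sketch assumes the
+standard `/proc/net/tcp` layout (hex `address:port` in the second column,
+state code `0A` for LISTEN in the fourth), not the exact code in the
+diagnostics module.
+
+```python
+from typing import Iterable, Set
+
+TCP_LISTEN = "0A"  # kernel state code for LISTEN in /proc/net/tcp
+
+
+def listening_ports(proc_net_tcp_lines: Iterable[str]) -> Set[int]:
+    """Return the set of TCP ports in LISTEN state (hypothetical helper)."""
+    ports = set()
+    for line in proc_net_tcp_lines:
+        fields = line.split()
+        # Skip the header row and malformed lines.
+        if len(fields) < 4 or ":" not in fields[1]:
+            continue
+        if fields[3] != TCP_LISTEN:
+            continue
+        # local_address is "<hex ip>:<hex port>".
+        _, port_hex = fields[1].rsplit(":", 1)
+        ports.add(int(port_hex, 16))
+    return ports
+```
+
+Feeding it the lines of `/proc/net/tcp` and `/proc/net/tcp6` in turn covers
+hosts that ship neither `ss` nor `netstat`.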
+
+Verification:
+
+- `state-hub-haskelseed` responded through `127.0.0.1:18000/state/health`.
+- `bridge check --json` reported all configured tunnels `ok: true`.
+- `python3 -m pytest tests/test_cli.py tests/test_diagnostics.py` passed.
+
+## Make default target safe and add setup
+
+```task
+id: ADHOC-2026-06-14-T02
+status: done
+priority: medium
+state_hub_task_id: "3b932955-0d75-4b95-9821-92bfa2dadbd0"
+```
+
+Changed `make` to default to a help listing that only shows targets with
+`##` comments. Added `make setup` to run `uv sync --all-groups` and reinstall
+the editable `bridge` CLI wrapper through `uv tool install -e . --force`.
+
+Verification:
+
+- `uv sync --all-groups` succeeded and installed the project environment.
+- `make` listed targets only and did not run tests or setup.
+- `make setup` succeeded and installed the `bridge` executable.
+- `make test` passed all 235 tests.
+- `make lint` passed.
--- a/workplans/BRIDGE-WP-0001-initial-implementation.md
+++ b/workplans/BRIDGE-WP-0001-initial-implementation.md
@@ -2,7 +2,7 @@
 id: BRIDGE-WP-0001
 type: workplan
 title: "OpsBridge Initial Implementation"
-domain: custodian
+domain: infotech
 repo: ops-bridge
 status: completed
 owner: Bernd
--- a/workplans/BRIDGE-WP-0002-opscatalog-extension.md
+++ b/workplans/BRIDGE-WP-0002-opscatalog-extension.md
@@ -2,7 +2,7 @@
 id: BRIDGE-WP-0002
 type: workplan
 title: "OpsCatalog Extension"
-domain: custodian
+domain: infotech
 repo: ops-bridge
 status: completed
 owner: Bernd
--- a/workplans/BRIDGE-WP-0003-mcp-skill-cross-mode-tests.md
+++ b/workplans/BRIDGE-WP-0003-mcp-skill-cross-mode-tests.md
@@ -2,7 +2,7 @@
 id: BRIDGE-WP-0003
 type: workplan
 title: "OpsBridge MCP Server, Skill, and Cross-Mode Test Coverage"
-domain: custodian
+domain: infotech
 repo: ops-bridge
 status: done
 owner: Bernd
--- a/workplans/BRIDGE-WP-0004-directive-alignment.md
+++ b/workplans/BRIDGE-WP-0004-directive-alignment.md
@@ -0,0 +1,340 @@
+---
+id: BRIDGE-WP-0004
+type: workplan
+title: "AccessManagementDirective Alignment"
+domain: infotech
+repo: ops-bridge
+status: done
+owner: Bernd
+topic_slug: custodian
+created: "2026-03-28"
+updated: "2026-03-28"
+state_hub_workstream_id: "e3451b70-688e-4e19-bff5-0c82c0f009a7"
+---
+
+# BRIDGE-WP-0004 — AccessManagementDirective Alignment
+
+**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
+optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
+preserving full backward compatibility with the existing static-key mode.
+
+**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
+deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
+
+---
+
+## Goal
+
+After this workplan:
+
+1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
+2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
+   `cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
+   handled transparently by the tunnel manager.
+3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
+   the directive, with config validation that enforces naming conventions.
+4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
+   §5 SIEM traceability requirement.
+
+---
+
+## Reference Documents
+
+| Document | Location |
+|---|---|
+| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
+| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
+| PRD | `wiki/OpsBridgePrd.md` |
+| FRS | `wiki/OpsBridgeFrs.md` |
+
+---
+
+## Design Decisions
+
+### Static key mode stays first-class
+
+If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
+`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
+explicitly supported for:
+- Lab/dev environments without a CA
+- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
+- Environments below the directive's complexity threshold
+
+### cert_command interface
+
+```yaml
+# tunnels.yaml — optional cert_command field
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
+    actor: agt-state-hub-bridge
+    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
+```
+
+When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
+captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
+`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
+on tunnel stop.
+
+`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
+`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
+dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
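+
+A minimal sketch of this interface follows. The helper names (`acquire_cert`,
+`identity_args`) and the caller-supplied cert path are assumptions made for
+illustration; the real `manager.py` code may be organised differently.
+
+```python
+import subprocess
+from pathlib import Path
+from typing import List, Optional
+
+
+class CertAcquisitionError(RuntimeError):
+    """Raised when cert_command exits non-zero."""
+
+
+def acquire_cert(cert_command: Optional[str], cert_path: Path) -> Optional[Path]:
+    """Run cert_command and store its stdout as the certificate file."""
+    if cert_command is None:
+        return None  # static key mode: nothing to acquire
+    result = subprocess.run(
+        cert_command, shell=True, capture_output=True, text=True
+    )
+    if result.returncode != 0:
+        raise CertAcquisitionError(result.stderr.strip())
+    cert_path.write_text(result.stdout)  # overwritten on every acquisition
+    return cert_path
+
+
+def identity_args(key_path: Path, cert_path: Optional[Path]) -> List[str]:
+    """Build the -i arguments: key first, then the cert when one exists."""
+    args = ["-i", str(key_path)]
+    if cert_path is not None:
+        args += ["-i", str(cert_path)]
+    return args
+```
+
+Because the command is an opaque shell string, a test can substitute `echo`
+for `warden` without touching the SSH code path.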
+
+### TTL-aware cert refresh
+
+After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to
+determine `cert_expires_at`. It schedules a pre-emptive cert refresh
+(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
+fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
+failure, no reconnect backoff triggered.
+
+If `cert_command` is absent, no TTL logic runs.
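+
+The refresh check can be sketched as follows. The function names are
+hypothetical, and the sketch assumes `ssh-keygen -L` prints the validity as
+`Valid: from <start> to <end>`, `Valid: before <end>`, or `Valid: forever`,
+with naive ISO timestamps in local time; verify this against the OpenSSH
+version in use.
+
+```python
+import re
+from datetime import datetime, timedelta
+from typing import Optional
+
+REFRESH_MARGIN = timedelta(minutes=5)
+
+# Captures the end timestamp of a bounded validity line (assumed format).
+_VALID_RE = re.compile(
+    r"^\s*Valid:\s+(?:from\s+\S+\s+to|before)\s+(\S+)\s*$", re.MULTILINE
+)
+
+
+def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
+    """Return the expiry time, or None when the cert has no end date."""
+    match = _VALID_RE.search(keygen_output)
+    if match is None:
+        return None
+    return datetime.fromisoformat(match.group(1))
+
+
+def refresh_due(expires_at: Optional[datetime], now: datetime) -> bool:
+    """True once `now` is within REFRESH_MARGIN of the expiry."""
+    if expires_at is None:
+        return False
+    return now >= expires_at - REFRESH_MARGIN
+```
+
+Keeping the parse and the comparison as pure functions lets the wait loop call
+`refresh_due(cert_expires_at, datetime.now())` on each iteration and keeps
+both pieces testable without a real certificate.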
+
+### Actor type model
+
+`actor_class: str  # "human" | "automation"` is replaced by:
+
+```python
+class ActorType(str, Enum):
+    ADM = "adm"   # human operator
+    AGT = "agt"   # LLM-powered autonomous agent
+    ATM = "atm"   # deterministic script / pipeline
+```
+
+Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
+The mapping is a one-way migration aid with a deprecation warning; new configs must use the
+canonical values.
+
+Config validation: if `actor` name is set, it must start with the prefix matching its type
+(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
+SIEM auditability.
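+
+The load-time rules above could look roughly like this. `parse_actor` and the
+local `ConfigError` class are illustrative names, and skipping the prefix check
+for legacy classes is an assumption made here so that existing configs such as
+`automation-agent` keep loading; the shipped `config.py` may handle that case
+differently.
+
+```python
+import warnings
+from enum import Enum
+
+
+class ConfigError(ValueError):
+    """Raised when an actor entry violates the naming rules."""
+
+
+class ActorType(str, Enum):
+    ADM = "adm"   # human operator
+    AGT = "agt"   # LLM-powered autonomous agent
+    ATM = "atm"   # deterministic script / pipeline
+
+
+# Legacy values accepted only as a one-way migration aid.
+_LEGACY = {"human": ActorType.ADM, "automation": ActorType.ATM}
+
+
+def parse_actor(name: str, raw_class: str) -> ActorType:
+    """Map a raw `class` value to ActorType and enforce the name prefix."""
+    if raw_class in _LEGACY:
+        actor_type = _LEGACY[raw_class]
+        warnings.warn(
+            f"actor class {raw_class!r} is deprecated; use {actor_type.value!r}",
+            DeprecationWarning,
+            stacklevel=2,
+        )
+        return actor_type  # assumption: legacy names are not prefix-checked
+    try:
+        actor_type = ActorType(raw_class)
+    except ValueError:
+        raise ConfigError(f"unknown actor class: {raw_class!r}") from None
+    if not name.startswith(f"{actor_type.value}-"):
+        raise ConfigError(f"actor {name!r} must start with '{actor_type.value}-'")
+    return actor_type
+```
+
+For example, `parse_actor("agt-state-hub-bridge", "agt")` returns
+`ActorType.AGT`, while `parse_actor("state-hub-bridge", "agt")` raises
+`ConfigError`.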
+
+---
+
+## Tasks
+
+### T1 — ActorType enum
+
+```task
+id: BRIDGE-WP-0004-T1
+state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
+status: done
+priority: high
+```
+
+- [x] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
+- [x] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
+      `ActorType.ATM` with a `DeprecationWarning`; reject unknown values
+- [x] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
+      `atm-*` for ATM; raise `ConfigError` on mismatch
+- [x] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
+- [x] Update tests
+
+### T2 — cert_command config field
+
+```task
+id: BRIDGE-WP-0004-T2
+state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
+status: done
+priority: high
+```
+
+- [x] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
+- [x] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
+      content (shell-level freedom intentional)
+- [x] Document in config example / SCOPE.md
+
+### T3 — Cert acquisition in manager
+
+```task
+id: BRIDGE-WP-0004-T3
+state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
+status: done
+priority: high
+```
+
+- [x] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
+      - If `cfg.cert_command` is None: return None (static key mode)
+      - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
+      - Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
+      - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
+- [x] `build_ssh_command`: accept optional `cert_path`; when set, insert
+      `-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
+- [x] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
+      so every reconnect gets a fresh cert
+
+### T4 — cert_identity in audit log
+
+```task
+id: BRIDGE-WP-0004-T4
+state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
+status: done
+priority: high
+```
+
+- [x] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
+      extract `Key ID` (the `-I` value from signing time)
+- [x] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
+      JSON entry when present
+- [x] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
+- [x] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
+
+### T5 — TTL-aware cert refresh
+
+```task
+id: BRIDGE-WP-0004-T5
+state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
+status: done
+priority: high
+```
+
+- [x] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp
+      from `ssh-keygen -L` output → `cert_expires_at: datetime`
+- [x] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
+      on each iteration
+- [x] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
+      reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
+      next iteration)
+- [x] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
+      `AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
+- [x] If `cert_command` is absent, skip all TTL logic entirely
+
+### T6 — `bridge cert-status` command
+
+```task
+id: BRIDGE-WP-0004-T6
+state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
+status: done
+priority: medium
+```
+
+- [x] `cli.py`: add `cert-status [TUNNEL]` subcommand
+- [x] For each tunnel (or the named one): read cert file from state dir if present,
+      run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
+      time-to-expiry (or "static key / no cert" if absent)
+- [x] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
+- [x] `--json` flag for machine-readable output
+
+### T7 — CertAcquisitionError handling
+
+```task
+id: BRIDGE-WP-0004-T7
+state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
+status: done
+priority: high
+```
+
+- [x] New exception `CertAcquisitionError` in `models.py`
+- [x] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
+      with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
+      (cert failures are transient — e.g., Vault briefly unreachable)
+- [x] After `max_attempts` consecutive cert failures, transition to `FAILED` state
+
+### T8 — SCOPE.md and documentation updates
+
+```task
+id: BRIDGE-WP-0004-T8
+state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
+status: done
+priority: medium
+```
+
+- [x] Update `SCOPE.md`: Current State updated to reflect completion; directive alignment done
+- [x] `wiki/OpsBridgeFrs.md` §5.7 already covers actor attribution abstractly — no changes needed
+- [x] `.claude/rules/architecture.md` already documents cert_command mode and actor vocab
+- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency (deferred)
+
+### T9 — Tests
+
+```task
+id: BRIDGE-WP-0004-T9
+state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
+status: done
+priority: high
+```
+
+- [x] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
+      cert_command parse
+- [x] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
+      args; verify `CertAcquisitionError` on non-zero exit; TTL logic helpers
+- [x] `test_audit.py`: `cert_identity` field; actor_type rename
+- [x] `test_cli.py`: `cert-status` exit codes; JSON output shape
+- [x] 233 tests, 0 failures
+
+---
+
+## Config Schema — Before / After
+
+### Before
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: ops-agent
+    ssh_key: ~/.ssh/id_ed25519
+    actor: automation-agent
+
+actors:
+  automation-agent:
+    class: automation
+    description: "state hub bridge agent"
+```
+
+### After (static key mode — unchanged behavior)
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
+    actor: agt-state-hub-bridge
+
+actors:
+  agt-state-hub-bridge:
+    class: agt
+    description: "state hub bridge agent"
+```
+
+### After (cert_command mode — ops-warden or any CA)
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
+    actor: agt-state-hub-bridge
+    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
+
+actors:
+  agt-state-hub-bridge:
+    class: agt
+    description: "state hub bridge agent"
+```
+
+---
+
+## Acceptance Criteria
+
+- [x] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
+      warning only); tunnel behaves identically
+- [x] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
+- [x] Config with `cert_command` set: SSH process launched with both `-i key` and
+      `-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
+- [x] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
+      no TTL logic runs
+- [x] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
+      logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
+- [x] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
+- [x] `bridge cert-status` shows valid cert info; exits 1 on expired cert
+- [x] All tests pass: `uv run pytest` (233 passed)
+- [x] All lints pass: `uv run ruff check .`
--- a/workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md
+++ b/workplans/BRIDGE-WP-0005-restart-includes-remote-cleanup.md
@@ -0,0 +1,194 @@
+---
+id: BRIDGE-WP-0005
+type: workplan
+title: "Restart includes remote cleanup (blank-slate recovery)"
+domain: infotech
+repo: ops-bridge
+status: finished
+owner: codex
+topic_slug: custodian
+created: "2026-06-21"
+updated: "2026-06-21"
+state_hub_workstream_id: "9565491f-e664-4add-bea4-27c4fb015ee0"
+---
+
+# BRIDGE-WP-0005 — Restart includes remote cleanup
+
+**Origin:** `STATE-WP-0063` weekend automation repair (2026-06-21). A stale orphan
+`sshd` remote forward on Railiance01 port `18000` blocked
+`bridge restart state-hub-railiance01` from producing a working tunnel. Operators
+had to discover `bridge maintenance cleanup <tunnel> --restart` separately.
+
+**Operator expectation:** `bridge restart` should mean *operational again* — a
+blank-slate recovery — not merely "cycle the local manager PID while a broken
+remote listener still holds the port."
+
+## Topology and failure modes (refined)
+
+Tunnels in `~/.config/bridge/tunnels.yaml` serve three distinct host roles.
+Cleanup policy must respect all of them.
+
+### A. Workstation (laptop WSL) — tunnel **origin**
+
+The State Hub API runs locally (`127.0.0.1:8000`). Reverse tunnels expose it on
+remote hosts:
+
+| Remote host | Tunnels (reverse) | Role |
+|-------------|-------------------|------|
+| **coulombcore** (`92.205.130.254`) | `state-hub-coulombcore`, `state-hub-mcp-coulombcore` | VPS — stable, occasional maintenance reboot |
+| **railiance01** (`92.205.62.239`) | `state-hub-railiance01`, `state-hub-mcp-railiance01` | VPS — stable, occasional maintenance reboot |
+| **haskelseed** (`192.168.178.135`) | `state-hub-haskelseed`, `state-hub-mcp-haskelseed` | LAN builder — may sleep/reboot when moved |
+
+**Laptop behaviour:** shutdown, sleep, and location changes (home ↔ office) kill
+local bridge processes without graceful remote SSH teardown. Orphan `sshd`
+listeners on **all three remotes** are common after wake — especially
+`18000`/`18001` on VPS hosts that activity-core and remote agents depend on.
+
+### B. Haskelseed — also intermittently offline
+
+Haskelseed is not a datacenter VPS; it may be powered down or unreachable on
+different networks. The same orphan-forward pattern applies to its reverse ports
+when the workstation-side tunnel dies uncleanly.
+
+### C. VPS remotes (coulombcore, railiance01)
+
+Normally always-on. Maintenance reboots clear remote kernel state, but:
+
+- a VPS reboot does **not** fix a workstation that is still in `reconnecting`
+  with a dead local SSH child;
+- when the laptop returns, orphan forwards from the **previous** session may
+  still block new `-R` binds if the VPS did not reboot.
+
+**Conclusion:** conditional remote cleanup before restart benefits **all reverse
+tunnels**, not only laptop-adjacent hosts. `should_cleanup_tunnel()` already
+skips healthy forwards — VPS tunnels with live working forwards are untouched.
+
+### D. Local-direction tunnels — no remote cleanup
+
+`direction: local` tunnels (`k3s-api-coulombcore`, `nix-daemon-haskelseed`) use
+forward mode from workstation to remote services. They do not bind remote reverse
+ports for State Hub. **`restart` stays local stop/start only** for these.
+
+## Design (decided)
+
+| Command | Behaviour after this workplan |
+|---------|-------------------------------|
+| `bridge restart [tunnel]` | For each **reverse** tunnel: `cleanup_tunnel(..., restart=True)` — run `should_cleanup_tunnel`; clear stale remote listener if needed; then start. For **local** tunnels: existing `stop()` + `start()`. |
+| `bridge maintenance cleanup` | Unchanged — proactive hygiene cron / manual sweep without implying user-facing "restart". |
+| `bridge up` | Out of scope here (see T4 optional follow-up). |
+
+Implementation sketch: replace the body of `cli.restart()` with a call to
+`cleanup_all_tunnels(..., restart=True, tunnel_name=...)` for reverse tunnels,
+or per-tunnel `cleanup_tunnel` when a single tunnel is named.
+
+Emit the same action summary strings cleanup already uses (`healthy`,
+`cleaned_and_restarted`, `error`) so operators see whether remote hygiene ran.
+
+## Out of scope
+
+- Changing `should_cleanup_tunnel` heuristics (unless tests expose a VPS false
+  positive during T2).
+- Auto-cleanup inside the reconnect backoff loop (stretch — T4).
+- Renaming tunnels or changing `tunnels.yaml` host entries.
+
+---
+
+## T1 — Wire restart through cleanup path
+
+```task
+id: BRIDGE-WP-0005-T01
+status: done
+priority: high
+state_hub_task_id: "b61c5d45-1198-416d-aa15-f2063fc5eb14"
+```
+
+Refactor `bridge/cli.py` `restart()` so reverse tunnels call
+`cleanup_tunnel(cfg, state_mgr, restart=True)` instead of bare
+`TunnelManager.stop()` + `start()`.
+
+Requirements:
+
+- Single-tunnel and all-tunnel restart both work.
+- Local-direction tunnels keep stop/start only.
+- Exit codes: preserve today’s semantics where practical; exit non-zero if any
+  named tunnel ends in `CleanupAction.action == "error"`.
+- Stdout tells the operator what happened (`healthy`, `cleaned_and_restarted`,
+  etc.), not only "Restarted tunnel".
+
+## T2 — Tests and regression coverage
+
+```task
+id: BRIDGE-WP-0005-T02
+status: done
+priority: high
+state_hub_task_id: "b4ad0525-6936-4799-bead-3603d05c49af"
+```
+
+Update `tests/test_cli.py`:
+
+- `test_restart_calls_stop_then_start` → assert restart delegates to cleanup for
+  reverse tunnels.
+- Add cases: healthy forward (no remote kill), stale forward (remote cleanup
+  invoked), local-direction tunnel (no cleanup call).
+- Reuse mocks from `tests/test_cleanup.py` patterns.
+
+`make test` and `make lint` pass.
+
+## T3 — Operator docs and CLI help
+
+```task
+id: BRIDGE-WP-0005-T03
+status: done
+priority: medium
+state_hub_task_id: "60586375-b0b4-4d4c-ba87-0699e76bf30c"
+```
+
+Document the blank-slate restart contract:
+
+- `wiki/OpsBridge.md` — restart vs maintenance cleanup vs up/down.
+- `bridge restart --help` — mention conditional remote stale-forward cleanup.
+- Short "host roles" subsection: laptop origin, haskelseed intermittency, VPS
+  maintenance — matching this workplan's topology section.
+- Cross-link from `state-hub` `STATE-WP-0063` / `history/20260621-weekend-automation-assessment.md`
+  incident note (one line each way).
+
+## T4 — Optional: reconnect-loop hygiene (stretch)
+
+```task
+id: BRIDGE-WP-0005-T04
+status: cancel
+priority: low
+state_hub_task_id: "518f1b5e-3098-42aa-9662-bdab1d7d269b"
+```
+
+Evaluate whether `TunnelManager` reconnect backoff should invoke remote cleanup
+once after repeated exit-255 bind failures (laptop wake without operator running
+`bridge restart`). Defer until T1–T3 are done; mark `cancel` if heuristic risk
+outweighs benefit.
+
+**Decision (2026-06-21): cancelled for now.** Auto-cleanup inside the reconnect
+loop risks killing a legitimately healthy orphan forward owned by another session
+or operator. `bridge restart` now covers the operator-facing blank-slate path;
+nightly `maintenance cleanup --restart` covers unattended hygiene. Revisit only if
+wake-from-sleep reconnect failures remain frequent after a month of observation.
+
+## T5 — Live verification on workstation + VPS
+
+```task
+id: BRIDGE-WP-0005-T05
+status: done
+priority: medium
+state_hub_task_id: "b5d305ef-5b5d-4afe-a992-e0960d07af79"
+```
+
+After T1–T2 ship, verify on real config:
+
+1. **railiance01** — `state-hub-mcp-railiance01` was `reconnecting` with stale
+   forward; `bridge restart` reported `cleaned_and_restarted` and tunnel reached
+   `connected`.
+2. **haskelseed** — not exercised (all tunnels already healthy); Alpine netstat
+   path unchanged from ADHOC-2026-06-14 and covered by existing cleanup tests.
+3. **coulombcore** — `bridge restart state-hub-coulombcore` reported `healthy`,
+   PID unchanged (4116), forward undisturbed.
+
+State Hub progress logged (2026-06-21). Workplan marked `finished`.
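
The restart contract that T2 tests and T5 verifies can be summarised as a small decision function. The following is a minimal sketch, not the code in this repository: the `Tunnel` record, the `probe` and `cleanup` callables, and the `restart_action` name are hypothetical. Only the four action labels (`healthy`, `cleaned_and_restarted`, `restarted`, `error`) and the reverse/local distinction come from the workplan and its commit message.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tunnel:
    """Hypothetical minimal tunnel record; the real config model is not shown here."""
    name: str
    direction: str  # "reverse" or "local"


def restart_action(
    tunnel: Tunnel,
    probe: Callable[[Tunnel], str],
    cleanup: Callable[[Tunnel], object],
) -> str:
    """Return the per-tunnel action label for a blank-slate restart.

    `probe` is assumed to classify the remote forward as "healthy", "stale",
    or anything else; `cleanup` is assumed to kill the stale remote listener.
    """
    # Local-direction tunnels never touch the remote host: plain stop/start.
    if tunnel.direction != "reverse":
        return "restarted"

    state = probe(tunnel)

    # A healthy forward is left running; nothing is killed or restarted.
    if state == "healthy":
        return "healthy"

    # A stale forward gets remote cleanup before the tunnel reconnects.
    if state == "stale":
        try:
            cleanup(tunnel)
        except Exception:
            # A cleanup failure is reported so the CLI can exit non-zero.
            return "error"
        return "cleaned_and_restarted"

    # Any other state (for example, a stopped tunnel) is a plain restart.
    return "restarted"
```

Passing the probe and the cleanup step as callables keeps the three T2 cases (healthy forward, stale forward, local-direction tunnel) checkable with plain mocks and without SSH access.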
--- a/workplans/OPS-WP-0001-diagnostics.md
+++ b/workplans/OPS-WP-0001-diagnostics.md
@@ -2,7 +2,7 @@
 id: OPS-WP-0001
 type: workplan
 title: "ops-bridge diagnostics and flow improvements"
-domain: custodian
+domain: infotech
 repo: ops-bridge
 status: done
 owner: claude
--- a/workplans/OPS-WP-0002-agent-usability.md
+++ b/workplans/OPS-WP-0002-agent-usability.md
@@ -0,0 +1,221 @@
+---
+id: OPS-WP-0002
+type: workplan
+title: "Agent Usability — MCP Registration, Skill, and Worker Orientation"
+domain: infotech
+repo: ops-bridge
+status: done
+owner: custodian
+topic_slug: custodian
+created: "2026-03-21"
+updated: "2026-03-26"
+depends_on: OPS-WP-0001
+state_hub_workstream_id: "c195cc40-8be7-462e-be26-a7d6bda34cd5"
+---
+
+# OPS-WP-0002 — Agent Usability: MCP Registration, Skill, and Worker Orientation
+
+## Problem
+
+The ops-bridge MCP server (`src/bridge/mcp_server/server.py`) is fully
+implemented with tools for `bridge_up/down/restart/status/check/logs` and
+catalog operations. But no agent can use it because:
+
+1. **Not registered** — the server isn't in `~/.claude.json` and has no
+   persistent transport mode. It only runs on stdio today.
+2. **No slash command** — agents working ad-hoc (not via MCP) have no
+   quick way to check or restore tunnels.
+3. **No worker orientation** — agents on remote machines (CoulombCore,
+   Railiance) don't know that bridge is available or how to use it when
+   their state-hub connection drops.
+
+## Goal
+
+Any agent — on the workstation or a remote machine — can:
+- Check tunnel health in one call
+- Bring up a dropped tunnel without manual intervention
+- Recover the state-hub connection if it goes down mid-session
+
+## Design
+
+### MCP server (workstation, persistent)
+
+Run as an SSE service on port 8002 (same pattern as state-hub on 8001).
+Registered at user scope in `~/.claude.json` so it's available to all
+Claude Code sessions.
+
+FastMCP already supports the SSE transport; the `server.py` entry point only
+needs to accept an `--http` flag and read a `BRIDGE_MCP_PORT` env var, then
+pass the SSE transport to `mcp.run()`.
+
+### Slash command skill (all machines)
+
+A `/bridge` skill at `~/.claude/commands/bridge.md` (global scope) that:
+- Reads `bridge status` output
+- Surfaces any tunnel that is down or stale
+- Offers to bring it up
+- Works on machines that don't have the MCP server registered
+
+### Worker agent orientation (remote machines)
+
+Update `CLAUDE.md` (global) and the `ops-bridge` session protocol to tell
+worker agents:
+- Check `bridge status` at session start when on a machine with
+  ops-bridge installed
+- If state-hub tunnel is down: run `bridge up state-hub-<machine>` to
+  restore it before making any state-hub API calls
+- If the `bridge` command is not installed: fall back to the direct API URL if it is reachable
+
+---
+
+## Tasks
+
+### T01 — SSE transport mode for MCP server
+
+```task
+id: OPS-WP-0002-T01
+status: done
+priority: high
+state_hub_task_id: "27fc6fa1-6d0e-438a-b4a3-c6091931da88"
+```
+
+Add `--http` flag and `BRIDGE_MCP_PORT` env var to `server.py` entry
+point. When `--http` is set, run `mcp.run(transport="sse", port=PORT)`
+instead of stdio.
+
+Add `make mcp-http` target to `Makefile`:
+```makefile
+mcp-http: ## Start MCP server in SSE mode (default port 8002)
+    BRIDGE_MCP_PORT=$${BRIDGE_MCP_PORT:-8002} uv run python src/bridge/mcp_server/server.py --http
+```
+
+Add `make mcp-stop` target that kills any running MCP server on port
+8002.
+
+Gate: `bridge_status()` tool callable via SSE on localhost:8002 after
+`make mcp-http`.
+
+---
+
+### T02 — Register MCP server in ~/.claude.json
+
+```task
+id: OPS-WP-0002-T02
+status: done
+priority: high
+state_hub_task_id: "2216457d-035e-4804-b685-18975f3c6d1f"
+```
+
+Register the ops-bridge MCP server at user scope:
+```bash
+claude mcp add-json -s user ops-bridge \
+  '{"type":"sse","url":"http://127.0.0.1:8002/sse"}'
+```
+
+Document in `ops-bridge` CLAUDE.md:
+```
+To start the MCP server:
+    cd ~/ops-bridge && make mcp-http
+
+To verify registration:
+    python3 -c "import json,os; d=json.load(open(os.path.expanduser('~/.claude.json'))); print(list(d.get('mcpServers',{}).keys()))"
+```
+
+Update global `~/.claude/CLAUDE.md` to list `ops-bridge` MCP server
+alongside `state-hub`.
+
+Gate: `ops-bridge` appears in the Claude Code MCP tool list after
+`make mcp-http`.
+
+---
+
+### T03 — `/bridge` slash command skill
+
+```task
+id: OPS-WP-0002-T03
+status: done
+priority: medium
+state_hub_task_id: "4b2e39eb-4585-4e60-ab16-9e7909eced74"
+```
+
+Create `~/.claude/commands/bridge.md` — a global Claude Code skill for
+tunnel management.
+
+**Behaviour:**
+1. Run `bridge status` and parse output
+2. Report each tunnel: name, state, LIVE column
+3. For any tunnel that is `stopped`, `reconnecting`, or `[STALE]`:
+   - Offer to run `bridge up <tunnel-name>`
+   - After `bridge up`, re-check with `bridge check <tunnel-name>`
+4. If all tunnels are `connected` and LIVE: report green and exit
+
+**Skill definition:**
+```yaml
+---
+description: >
+  Check ops-bridge tunnel health and restore any dropped tunnels.
+  Reports status of all configured tunnels and offers to bring up
+  any that are stopped or stale.
+argument-hint: "[tunnel-name]"
+allowed-tools:
+  - Bash(bridge status)
+  - Bash(bridge up*)
+  - Bash(bridge down*)
+  - Bash(bridge check*)
+  - Bash(bridge logs*)
+---
+```
+
+If an optional tunnel name is passed as `$ARGUMENTS`, scope all
+operations to that tunnel only.
+
+Gate: `/bridge` skill runs cleanly when all tunnels are up; correctly
+identifies and recovers a manually-stopped tunnel.
+
+---
+
+### T04 — Worker agent orientation in CLAUDE.md
+
+```task
+id: OPS-WP-0002-T04
+status: done
+priority: medium
+state_hub_task_id: "cc64bb07-ea5d-498a-8c14-bb653581efe7"
+```
+
+Update global `~/.claude/CLAUDE.md` — add a **Worker Agent — Bridge
+Protocol** section:
+
+```markdown
+## Worker Agent — Bridge Protocol
+
+When working on a remote machine (CoulombCore, Railiance nodes):
+
+1. At session start, check if `bridge` is installed:
+   `which bridge && bridge status`
+2. If state-hub tunnel is down: `bridge up state-hub-<machine-slug>`
+   Wait for state `connected` before making state-hub API calls.
+3. If `bridge` is not installed, check if the state-hub API is directly
+   reachable: `curl -s http://127.0.0.1:8000/state/health`
+4. Only proceed without state-hub if absolutely necessary — log a
+   progress note about the outage when connectivity is restored.
+```
+
+Also add a one-liner reminder to the ops-bridge session protocol in
+`.claude/rules/session-protocol.md`:
+> At session start: `bridge status` — bring up any stopped tunnels
+> before accessing remote services.
+
+Gate: `~/.claude/CLAUDE.md` contains the Worker Agent section; ops-bridge
+session protocol references bridge status check.
+
+---
+
+## Done Criteria
+
+- [x] `make mcp-http` starts the MCP server on port 8002 (SSE)
+- [x] `bridge_status` and `bridge_check` callable as MCP tools from Claude Code
+- [x] `ops-bridge` registered in `~/.claude.json` at user scope
+- [x] `/bridge` skill surfaces tunnel states and recovers a stopped tunnel
+- [x] Global CLAUDE.md has worker agent bridge protocol
+- [x] All existing tests pass after T01 changes (`make test`)
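
The transport switch described in T01 can be illustrated with a small, testable helper. This is a sketch under assumptions: `resolve_run_kwargs` and `DEFAULT_PORT` are hypothetical names, and the returned dictionary simply mirrors the `mcp.run(transport="sse", port=PORT)` call quoted in T01 rather than a verified FastMCP signature.

```python
import argparse
from typing import Any, Dict, Mapping, Sequence

# Default SSE port named in the workplan; the constant name is an assumption.
DEFAULT_PORT = 8002


def resolve_run_kwargs(argv: Sequence[str], env: Mapping[str, str]) -> Dict[str, Any]:
    """Translate CLI arguments and environment into keyword arguments for mcp.run()."""
    parser = argparse.ArgumentParser(prog="server.py")
    parser.add_argument(
        "--http",
        action="store_true",
        help="serve over SSE instead of stdio",
    )
    args = parser.parse_args(list(argv))

    # Without --http the server keeps its existing stdio behaviour.
    if not args.http:
        return {"transport": "stdio"}

    # An unset or empty BRIDGE_MCP_PORT falls back to the default, matching
    # the ${BRIDGE_MCP_PORT:-8002} expansion in the Makefile target.
    port = int(env.get("BRIDGE_MCP_PORT") or DEFAULT_PORT)
    return {"transport": "sse", "port": port}
```

The entry point could then call `mcp.run(**resolve_run_kwargs(sys.argv[1:], os.environ))`, assuming `mcp.run` accepts those keyword arguments as T01 states. Keeping the parsing separate from the server start means the flag and port handling can be covered by `make test` without opening a socket.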
Author	SHA1	Message	Date
tegwick	6572a2ac99	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-07-03: - update .custodian-brief.md for ops-bridge	2026-07-03 18:52:51 +02:00
tegwick	ce0aa728b1	tunnels: optional remote_host forward destination (default 127.0.0.1) Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>	2026-07-02 14:18:18 +02:00
tegwick	00671f5133	Normalize agent instructions and workplan frontmatter (STATE-WP-0067) - Align agent files with on-disk workplan prefixes (infer from workplan ids) - Set workplan domain to registered domain_slug; add topic_slug where applicable - Repair frontmatter delimiter formatting; migrate legacy task status literals - Regenerate AGENTS.md, CLAUDE.md, and .claude/rules from State Hub templates	2026-06-22 23:16:27 +02:00
tegwick	09f2cd4b7a	Mark .repo-classification.yaml human-reviewed (CUST-WP-0050 T02) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 11:40:44 +02:00
tegwick	c3b4fb9d55	Reclassify as tooling (CUST-WP-0050 T02) Apply the new 'tooling' category (reusable internal tooling/infrastructure) from the Repo Classification Standard. First-pass agent classification. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 03:06:02 +02:00
tegwick	fab7409c66	Add repo classification (CUST-WP-0050 T02) First-pass agent classification per the Repo Classification Standard v1.0 (canon-repo-classification); pending human review. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-22 02:44:47 +02:00
tegwick	1dd664c792	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-21: - update .custodian-brief.md for ops-bridge	2026-06-21 20:12:38 +02:00
tegwick	10c6fdaec9	feat(restart): route reverse tunnels through stale-forward cleanup bridge restart now means blank-slate recovery: reverse tunnels run should_cleanup_tunnel and clear orphan remote listeners before reconnecting; healthy forwards are left running. Local-direction tunnels keep stop/start only. CLI and MCP report per-tunnel actions (healthy, cleaned_and_restarted, restarted, error) and exit non-zero on cleanup failure. Closes BRIDGE-WP-0005.	2026-06-21 20:12:13 +02:00
tegwick	8c11acc00c	docs(ops-bridge): BRIDGE-WP-0005 restart includes remote cleanup Add workplan to make bridge restart perform conditional stale-forward cleanup before start (blank-slate recovery). Refines topology for laptop workstation origin, intermittently offline haskelseed, and stable VPS remotes (coulombcore, railiance01). Origin: STATE-WP-0063 tunnel incident. Registered in State Hub via fix-consistency.	2026-06-21 20:02:18 +02:00
tegwick	499b8781cc	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-21: - update .custodian-brief.md for ops-bridge	2026-06-21 20:02:10 +02:00
tegwick	4e9882909f	feat(maintenance): nightly stale SSH forward cleanup at 03:00 Add bridge maintenance cleanup to detect reverse tunnels whose remote port is bound but no longer forwards (zombie sshd sessions), kill the stale listeners on the remote host, and optionally restart the tunnel. Includes install-cron/uninstall-cron/show-cron helpers and README notes for the actcore-state-hub-bridge failure mode we hit on railiance01.	2026-06-19 15:59:27 +02:00
tegwick	a6857fb8f7	Add credential routing instructions for all agent runtimes Propagate shared credential-routing section (Codex, Claude, Grok, llm-connect) from state-hub template via scripts/propagate_credential_routing.py.	2026-06-18 22:48:39 +02:00
tegwick	675772ab3b	Add capability registry scaffold (REUSE-WP-0014-T06 B04)	2026-06-16 01:55:58 +02:00
tegwick	6eb0b1c52f	Fixing bridge to haskelseed	2026-06-14 19:46:06 +02:00
tegwick	d949f3e93e	Refresh agent instruction files	2026-05-18 16:55:47 +02:00
tegwick	de984736ca	feat(cli): add `bridge conventions` and link from actor errors Surfaces the actor naming rules (adm-/agt-/atm- prefixes, legacy class aliases) so users hitting a ConfigError have an in-CLI way to read the spec without grepping the wiki. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-16 23:21:37 +02:00
tegwick	28ecef121e	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-15: - update .custodian-brief.md for ops-bridge	2026-05-15 12:19:50 +02:00
tegwick	860c08f1db	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-15: - update .custodian-brief.md for ops-bridge	2026-05-15 09:39:01 +02:00
tegwick	bd169a07e2	feat(directive): implement BRIDGE-WP-0004 AccessManagementDirective alignment - ActorType enum (adm/agt/atm) replaces actor_class string; config validates naming convention (adm-/agt-/atm-*) with hard ConfigError on mismatch; legacy 'human'/'automation' values accepted with DeprecationWarning - cert_command: pluggable shell string run before each SSH launch; cert written to state dir; -i cert appended to SSH command alongside -i key - TTL-aware cert refresh: parses Valid-to via ssh-keygen -L; pre-emptive restart 5 min before expiry (no backoff, no attempt increment); CERT_EXPIRING logged - CertAcquisitionError: cert failures trigger normal backoff/retry loop - cert_identity: Key ID parsed from cert and recorded in BRIDGE_CONNECTED event - bridge cert-status: new CLI command; exit 1 on expired cert; --json flag - 233 tests passing, ruff clean Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 09:38:29 +02:00
tegwick	22601ef3e6	chore(workplans): sync BRIDGE-WP-0004 and WARDEN-WP-0001 tasks to state hub Both workplans had been registered as active workstreams but tasks were never ingested — the markdown checkbox format was invisible to the consistency checker, which requires task code blocks. Activated both workplans (draft→active) and added task blocks with state_hub_task_id for all 19 tasks (9 + 10). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-05-15 00:29:51 +02:00
tegwick	569de1497c	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-06: - update .custodian-brief.md for ops-bridge	2026-05-06 04:24:17 +02:00
tegwick	fafd04ed2e	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-06: - update .custodian-brief.md for ops-bridge	2026-05-06 02:41:26 +02:00
tegwick	c1d87b47df	Added INTENT.md file	2026-05-02 23:17:22 +02:00
tegwick	204bf48bc8	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-01: - update .custodian-brief.md for ops-bridge	2026-05-01 23:22:08 +02:00
tegwick	595c495f7c	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-01: - update .custodian-brief.md for ops-bridge	2026-05-01 23:07:50 +02:00
tegwick	90eda27a14	Scope update from repo-scoping refactor	2026-05-01 12:28:27 +02:00
tegwick	1361727e15	Added untracked workplans	2026-04-25 17:06:05 +02:00
tegwick	18e3c118dd	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-04-21: - update .custodian-brief.md for ops-bridge	2026-04-21 02:14:25 +02:00
Bernd Worsch	621de64ee0	chore: merge origin/main — reconcile divergent branches Integrates remote changes (session protocol, .custodian-brief.md, MCP SSE/HTTP mode, workplan OPS-WP-0002 completion) with local changes (AccessManagementDirective alignment, architecture docs, BRIDGE-WP-0004 and WARDEN-WP-0001 workplans). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 01:05:11 +00:00
Bernd Worsch	f3a7236c5d	docs: align architecture and scope with AccessManagementDirective Expands architecture constraints and SCOPE.md to reflect the three-actor vocabulary (adm/agt/atm), two credential modes (static key + cert_command), and ops-warden boundary. Adds directive wiki doc and two new workplans (BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-28 00:59:38 +00:00
tegwick	4f3c8646b3	feat(mcp): SSE/HTTP mode, workplan OPS-WP-0002 done - Add --http flag to MCP server for SSE transport on port 8002 - Add make mcp-http / mcp-stop targets - Pin fastmcp<3.1.0 to stabilize dependency - Update session-protocol: Step 0 tunnel health check before orient - Mark OPS-WP-0002 and all its tasks done Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-27 14:10:49 +01:00
tegwick	431beef31b	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-03-26: - update .custodian-brief.md for ops-bridge	2026-03-26 22:46:07 +01:00
tegwick	1c7c6eedf8	chore(session): read .custodian-brief.md before MCP call in session init Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-26 17:48:52 +01:00
tegwick	75a559780e	New workplan	2026-03-21 15:27:02 +01:00
tegwick	d73b7be45d	docs(workplan): OPS-WP-0002 — agent usability via MCP registration and /bridge skill Plan to make ops-bridge fully usable by worker agents: - T01: SSE transport mode + make mcp-http target - T02: register in ~/.claude.json at user scope - T03: /bridge global slash command skill - T04: worker agent bridge protocol in global CLAUDE.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-21 15:15:42 +01:00