docs: align architecture and scope with AccessManagementDirective

Expands architecture constraints and SCOPE.md to reflect the three-actor vocabulary (adm/agt/atm), two credential modes (static key + cert_command), and ops-warden boundary. Adds directive wiki doc and two new workplans (BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 00:59:38 +00:00
parent 75a559780e
commit f3a7236c5d
5 changed files with 773 additions and 15 deletions
--- a/.claude/rules/architecture.md
+++ b/.claude/rules/architecture.md
@@ -17,11 +17,18 @@ The catalog layout follows: `opscatalog/domains/<domain>/{domain.yaml,
 targets/, bridges/, docs/}`.

 Key design constraints:
- OpsBridge owns lifecycle management only; it does not own identity/credentials
+- OpsBridge owns lifecycle management only; it does not own credential issuance or CA
+  operations (those belong to `ops-warden`)
 - Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
  in config, CLI args, and log filenames must stay consistent
- Actor attribution (human operator vs. automation agent) is tracked per bridge
-  for audit log traceability (FRS §5.7)
+- Actor attribution is tracked per bridge using the three-actor vocabulary from the
+  AccessManagementDirective: `adm` (human), `agt` (LLM agent), `atm` (automation);
+  actor names must carry the matching prefix (`adm-*`, `agt-*`, `atm-*`) (FRS §5.7)
+- Two credential modes are first-class and must remain independently functional:
+  1. **Static key mode** (default) — `ssh_key` only; no TTL, no cert logic
+  2. **cert_command mode** — a pluggable shell command that issues a CA-signed cert
+     before each SSH launch; TTL parsed from the cert; pre-emptive refresh ~5 min
+     before expiry; `cert_identity` logged in every `BRIDGE_CONNECTED` event

 Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
 (`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
--- a/SCOPE.md
+++ b/SCOPE.md
@@ -8,7 +8,7 @@

 ## One-liner

-SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards.
+SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.

 ---

@@ -20,11 +20,17 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## In Scope

- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs`)
+- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
 - Auto-reconnect with exponential backoff and configurable retry policy
 - Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
 - Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
- Actor attribution: per-tunnel actor class (human / automation) for audit traceability
+- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
+  with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
+- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
+  works without any CA or external tooling
+- **cert_command mode** (optional): pluggable shell command that issues a short-lived
+  CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
+  `cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
 - PID + state file management in `~/.local/state/bridge/`
 - MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
 - OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
@@ -33,7 +39,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## Out of Scope

- Identity/credential management (uses existing SSH keys)
+- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
+  certs via the `cert_command` interface but never signs anything itself)
+- SSH key generation for human admins (self-service: `ssh-keygen`)
+- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
 - Long-running application hosting on remote machines (port-forward only, not deployment)
 - VPN or layer-3 connectivity
 - Monitoring/alerting beyond JSON audit logs
@@ -44,9 +53,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
 ## Relevant When

 - Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
- Need audit trail of which actor (human vs. automation) started/stopped tunnels
+- Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels
 - Setting up a new machine in the Railiance ecosystem that must phone home to the hub
 - Diagnosing connectivity issues between local hub and remote services
+- Checking certificate validity for active tunnels (`bridge cert-status`)
+- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials

 ---

@@ -60,8 +71,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## Current State

- Status: experimental → active (v0.1 core complete; OpsCatalog planned but not yet shipped)
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health checks and audit logging complete; OpsCatalog framework present but not populated
+- Status: active (v0.1 core complete; directive alignment in progress — BRIDGE-WP-0004)
+- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health
+  checks and audit logging complete; OpsCatalog framework present but not populated;
+  cert_command / ActorType alignment not yet implemented
 - Stability: stable tunnel lifecycle; tested under network drops and SSH failures
 - Usage: running in lab for daily Railiance/Temporal connectivity

@@ -77,17 +90,24 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo

 ## Terminology

- Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check
+- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
+  cert_command, cert_identity
+- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
 - Also known as: "the bridge"
- Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
+- Potentially confusing: "bridge state" is a tunnel-specific state machine
+  (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
+- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)

 ---

 ## Related / Overlapping Repositories

 - `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
+- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
+  `cert_command` when short-lived certificates are required
 - `activity-core` — Temporal server on remote reached via ops-bridge tunnel
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home
+- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns
+  host-side principal deployment (`/etc/ssh/auth_principals/`)

 ---

@@ -105,5 +125,9 @@ keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge
 ## Getting Oriented

 - Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), `~/.local/state/bridge/` (PID/state files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; MCP: `bridge_status()`
+- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
+  `~/.local/state/bridge/` (PID/state/cert files)
+- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
+  MCP: `bridge_status()`
+- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
+- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)
--- a/wiki/AccessManagementDirective.md
+++ b/wiki/AccessManagementDirective.md
@@ -0,0 +1,203 @@
+AccessManagementDirective
+
+*Practical host access control management*
+
+# AccessManagementDirective
+
+**Document Title:** SSH Access Management Directive  
+**Version:** 1.1 (Production-Ready Revision – Post-SWOT Improvements)  
+**Date:** 28 March 2026  
+**Audience:** Operations Department  
+**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).  
+**Author:** Grok (on behalf of the team)  
+**Status:** Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.  
+**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
+
+## 0. Prerequisites
+
+Before bootstrapping, the following must be in place:
+- Ansible (or equivalent config-management tool) with a central inventory.
+- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
+- GitOps repository containing the authoritative principals inventory.
+- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
+- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
+
+If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
+
+## 1. Concept Overview
+
+This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
+
+**Why this model?**  
+- A central CA signs short-lived certificates for every login.  
+- No more manual key copying, key sprawl, or painful revocation.  
+- Built-in expiration, role-based principals, and auditability.  
+- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.  
+- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
+
+**Core Principles**  
+- **Least privilege** – Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.  
+- **Short-lived credentials** – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).  
+- **One CA, many issuers** – A single offline User CA whose public key is trusted by every host.  
+- **Automation-first** – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).  
+- **Separation of concerns** –  
+  - **Admins (adm)**: Human operators (full interactive shell when needed).  
+  - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.  
+  - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
+
+## 2. Actor Definitions & Access Model
+
+| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
+|------------|-------------------|-------------|------------------------------|---------------------------|
+| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
+| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
+| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
+
+**Certificate Naming Convention**  
+- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`  
+- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
+
+**LLM-Agent Risk Clarification**  
+Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
+
+## 3. Bootstrapping the System (One-Time Setup)
+
+### 3.1. Create the CA (do this once, offline)
+```bash
+ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
+```
+- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.  
+- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.  
+- Public key: `ca_user.pub`
+
+### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
+- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).  
+- Update `/etc/ssh/sshd_config`:
+  ```bash
+  TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
+  AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
+  PubkeyAuthentication yes
+  PasswordAuthentication no
+  PermitRootLogin no
+  ```
+- Create principals directory and files from the central Git inventory.  
+- `systemctl restart sshd`
+
+### 3.3. Initial Admin Access
+First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
+
+## 4. Automatic Management of Access Rights
+
+### 4.1. Daily / On-Demand Workflow
+1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)  
+   - **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.  
+   - **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).  
+   - **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
+
+2. **Ansible-Driven Host Updates** (run hourly via CI/CD)  
+   - `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).  
+   - Example inventory snippet:
+     ```yaml
+     hosts:
+       - name: prod-db-01
+         allowed_principals:
+           adm: [adm-full]
+           agt: [agt-incident-resolver-v2]
+           atm: [atm-backup-daily, atm-logrotate]
+     ```
+
+3. **Revocation & Rotation**  
+   - Short expiry = automatic revocation.  
+   - For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).  
+   - Agents/automations never store long-lived private keys on disk.
+
+4. **Concrete Agent & Automation Wrapper Example** (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`)
+   ```python
+   #!/usr/bin/env python3
+   import os, subprocess, sys, tempfile
+   # Request short-lived cert from Vault
+   cert = subprocess.check_output(["vault", "write", "-field=signed_key", "ssh/sign/agt-role", f"public_key={os.environ['SSH_PUBKEY']}"]).decode().strip()
+   with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
+       f.write(cert.encode())
+       cert_path = f.name
+   # Load into ssh-agent and exec the real command
+   subprocess.run(["ssh-add", cert_path])
+   os.execvp(sys.argv[1], sys.argv[1:])
+   ```
+   Agents call this wrapper; it auto-refreshes the cert on every wake-up.
+
+### 4.2. Human UX Guidance
+Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for seamless experience. Manual `ssh-keygen -s` is only for edge cases.
+
+### 4.3. Emergency Break-Glass Procedure
+In case of total lockout (CA offline, misconfigured Ansible push, etc.):
+1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).  
+2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).  
+3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.  
+4. After recovery, immediately rotate the CA and run a full scorecard.
+
+## 5. AccessManagement Scorecard (Checklist)
+
+Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
+
+| Category | Check | Target | Tool |
+|----------|-------|--------|------|
+| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
+| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
+| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
+| **Expiry Policy** | All issued certs have `Valid: ≤ 48h` (adm), `≤ 24h` (agt), or `≤ 8h` (atm) | Last 100 certs | `ssh-keygen -L -f *.pub` |
+| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
+| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
+| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
+| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
+| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
+| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
+| **Score** | 10/10 = **Operational** | - | - |
+
+**Scorecard Execution Command** (run from ops laptop):
+```bash
+ansible all -m command -a "ssh-access-scorecard.sh" --become
+```
+
+## 6. Scope & Operational Boundaries
+
+### 6.1. When Bootstrapping Is Officially Closed
+The system is **fully operational** when **ALL** of the following are true:
+- Scorecard passes 10/10 on every host.
+- Central Git repo contains the authoritative principals inventory.
+- First three admins have successfully used signed certificates for 7 consecutive days.
+- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
+- CI/CD pipeline for host config updates is green and runs hourly.
+- Emergency break-glass procedure has been tested once.
+
+**Declaration:** Ops Lead signs off with date in the Git commit message.
+
+### 6.2. Scope Boundary – When to Switch to Sophisticated Tooling
+Stay with **native OpenSSH CA + Ansible + Vault** while:
+- ≤ 200 hosts
+- ≤ 50 distinct agent/automation identities
+- No regulatory requirement for SSO or full session recording
+
+**Switch triggers** (any one):
+- More than 200 hosts OR rapid daily growth
+- Need for human SSO (Okta/Google) integration
+- Requirement for audited web-based SSH sessions or just-in-time access approval
+- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
+- Audit/compliance demands central policy engine or session recording
+
+**Recommended next-level tools** (in order):
+1. **Teleport** – Best for mixed human + agent workloads (SSO + Machine ID).  
+2. **HashiCorp Vault SSH + Boundary** – When you already use Vault heavily.  
+3. **step-ca + smallstep** – If you prefer a pure open-source CA with OIDC.
+
+**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
+
+## 7. Enforcement & Review
+- **Quarterly review** of this directive and scorecard results.  
+- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.  
+- **Questions / improvements** → create PR against this file in the ops repo.
+
+**End of Document**  
+Approved for immediate use across all production and staging environments.
+
--- a/workplans/BRIDGE-WP-0004-directive-alignment.md
+++ b/workplans/BRIDGE-WP-0004-directive-alignment.md
@@ -0,0 +1,272 @@
+---
+id: BRIDGE-WP-0004
+type: workplan
+title: "AccessManagementDirective Alignment"
+domain: custodian
+repo: ops-bridge
+status: draft
+owner: Bernd
+topic_slug: custodian
+created: "2026-03-28"
+updated: "2026-03-28"
+---
+
+# BRIDGE-WP-0004 — AccessManagementDirective Alignment
+
+**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
+optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
+preserving full backward compatibility with the existing static-key mode.
+
+**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
+deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
+
+---
+
+## Goal
+
+After this workplan:
+
+1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
+2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
+   `cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
+   handled transparently by the tunnel manager.
+3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
+   the directive, with config validation that enforces naming conventions.
+4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
+   §5 SIEM traceability requirement.
+
+---
+
+## Reference Documents
+
+| Document | Location |
+|---|---|
+| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
+| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
+| PRD | `wiki/OpsBridgePrd.md` |
+| FRS | `wiki/OpsBridgeFrs.md` |
+
+---
+
+## Design Decisions
+
+### Static key mode stays first-class
+
+If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
+`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
+explicitly supported for:
+- Lab/dev environments without a CA
+- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
+- Environments below the directive's complexity threshold
+
+### cert_command interface
+
+```yaml
+# tunnels.yaml — optional cert_command field
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
+    actor: agt-state-hub-bridge
+    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
+```
+
+When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
+captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
+`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
+on tunnel stop.
+
+`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
+`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
+dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
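The acquisition step described above could be sketched roughly as follows. This is a minimal illustration, not the actual `manager.py` code: the function names `acquire_cert` and `identity_args` and their signatures are assumptions made for this example, while `CertAcquisitionError` and the `<tunnel>-cert.pub` file name are taken from the task list in this workplan.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command exits non-zero (name taken from the workplan)."""


def acquire_cert(cert_command: Optional[str], tunnel: str, state_dir: Path) -> Optional[Path]:
    """Run cert_command and store its stdout as <tunnel>-cert.pub in state_dir.

    Hypothetical helper: returns None in static key mode.
    """
    if cert_command is None:
        return None  # static key mode: nothing to acquire
    # cert_command is a raw shell string by design, hence shell=True.
    result = subprocess.run(cert_command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    cert_path = state_dir / f"{tunnel}-cert.pub"
    cert_path.write_text(result.stdout)  # overwritten on every acquisition
    return cert_path


def identity_args(key_path: str, cert_path: Optional[Path]) -> List[str]:
    """Return the -i arguments for ssh: the private key first, then the cert if any."""
    args = ["-i", key_path]
    if cert_path is not None:
        args += ["-i", str(cert_path)]
    return args
```

Because the command string is executed by a shell, it should only ever come from the operator's own `tunnels.yaml`, never from untrusted input.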
+
+### TTL-aware cert refresh
+
+After acquiring a cert, `manager.py` parses the end timestamp of the `Valid:` line in
+`ssh-keygen -L` output to determine `cert_expires_at`. It schedules a pre-emptive cert refresh
+(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
+fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
+failure, no reconnect backoff triggered.
+
+If `cert_command` is absent, no TTL logic runs.
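The expiry parsing and the five-minute refresh check might look like the sketch below. It assumes that `ssh-keygen -L` prints validity as `Valid: from <start> to <end>` (or `Valid: before <end>`, or `Valid: forever`) with ISO-style local timestamps; the helper names `parse_cert_expiry` and `refresh_due` are hypothetical and not part of ops-bridge.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Captures the end timestamp of the validity line; "forever" and "after <t>"
# carry no expiry and therefore do not match.
VALID_RE = re.compile(
    r"^[ \t]*Valid:.*?\b(?:to|before)[ \t]+(\S+)[ \t]*$", re.MULTILINE
)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the expiry found in `ssh-keygen -L` text, or None if there is none."""
    match = VALID_RE.search(keygen_output)
    if match is None:
        return None
    # Assumption: the timestamp is naive local time, e.g. 2026-03-28T04:00:00.
    return datetime.fromisoformat(match.group(1))


def refresh_due(
    expires_at: datetime, now: datetime, margin: timedelta = timedelta(minutes=5)
) -> bool:
    """True once `now` is inside the pre-emptive refresh window before expiry."""
    return now >= expires_at - margin
```

Since the parsed timestamp is naive, `now` must also be a naive local `datetime` in this sketch; a real implementation would need to settle on one timezone convention before comparing.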
+
+### Actor type model
+
+`actor_class: str  # "human" | "automation"` is replaced by:
+
+```python
+class ActorType(str, Enum):
+    ADM = "adm"   # human operator
+    AGT = "agt"   # LLM-powered autonomous agent
+    ATM = "atm"   # deterministic script / pipeline
+```
+
+Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
+The mapping is a one-way migration aid with a deprecation warning; new configs must use the
+canonical values.
+
+Config validation: if `actor` name is set, it must start with the prefix matching its type
+(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
+SIEM auditability.
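A minimal sketch of the load-time mapping and prefix check follows. `ActorType` and `ConfigError` are named in this workplan; the `resolve_actor` function is hypothetical. One point the text leaves open is whether legacy-mapped entries are exempt from the prefix rule: this sketch assumes they are, so that the existing "Before" config (`automation-agent` with `class: automation`) still loads with only a warning.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline


class ConfigError(ValueError):
    """Raised for invalid tunnel or actor configuration."""


# One-way migration aid for the legacy actor_class values.
_LEGACY = {"human": ActorType.ADM, "automation": ActorType.ATM}


def resolve_actor(name: str, raw_class: str) -> ActorType:
    """Map a configured class string to ActorType and enforce the name prefix."""
    if raw_class in _LEGACY:
        actor_type = _LEGACY[raw_class]
        warnings.warn(
            f"actor class {raw_class!r} is deprecated; use {actor_type.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: legacy entries skip the prefix check to stay loadable.
        return actor_type
    try:
        actor_type = ActorType(raw_class)
    except ValueError:
        raise ConfigError(f"unknown actor type: {raw_class!r}") from None
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
    return actor_type
```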
+
+---
+
+## Tasks
+
+### T1 — ActorType enum
+- [ ] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
+- [ ] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
+      `ActorType.ATM` with a `DeprecationWarning`; reject unknown values
+- [ ] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
+      `atm-*` for ATM; raise `ConfigError` on mismatch
+- [ ] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
+- [ ] Update tests
+
+### T2 — cert_command config field
+- [ ] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
+- [ ] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
+      content (shell-level freedom intentional)
+- [ ] Document in config example / SCOPE.md
+
+### T3 — Cert acquisition in manager
+- [ ] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
+      - If `cfg.cert_command` is None: return None (static key mode)
+      - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
+      - Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
+      - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
+- [ ] `build_ssh_command`: accept optional `cert_path`; when set, insert
+      `-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
+- [ ] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
+      so every reconnect gets a fresh cert
+
+### T4 — cert_identity in audit log
+- [ ] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
+      extract `Key ID` (the `-I` value from signing time)
+- [ ] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
+      JSON entry when present
+- [ ] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
+- [ ] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
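The identity extraction in T4 could be sketched as below, assuming `ssh-keygen -L` prints the identity as `Key ID: "<value>"`. The helper names and the `event` / `tunnel` JSON keys are illustrative assumptions; only `cert_identity` comes from this workplan.

```python
import re
from typing import Dict, Optional

# Matches the quoted identity that was passed with -I at signing time.
KEY_ID_RE = re.compile(r'^[ \t]*Key ID:[ \t]*"(.*)"[ \t]*$', re.MULTILINE)


def parse_cert_identity(keygen_output: str) -> Optional[str]:
    """Return the Key ID from `ssh-keygen -L` text, or None if not found."""
    match = KEY_ID_RE.search(keygen_output)
    return match.group(1) if match else None


def audit_entry(event: str, tunnel: str, cert_identity: Optional[str] = None) -> Dict[str, str]:
    """Build an audit record; cert_identity is included only when a cert was used."""
    entry = {"event": event, "tunnel": tunnel}  # hypothetical field names
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return entry
```

Omitting the key entirely in static key mode, rather than writing `null`, matches the acceptance criterion that `cert_identity` is absent when no cert was used.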
+
+### T5 — TTL-aware cert refresh
+- [ ] `manager.py`: after successful cert acquisition, parse the end timestamp of the
+      `Valid:` line in `ssh-keygen -L` output → `cert_expires_at: datetime`
+- [ ] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
+      on each iteration
+- [ ] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
+      reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
+      next iteration)
+- [ ] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
+      `AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
+- [ ] If `cert_command` is absent, skip all TTL logic entirely
+
+### T6 — `bridge cert-status` command
+- [ ] `cli.py`: add `cert-status [TUNNEL]` subcommand
+- [ ] For each tunnel (or the named one): read cert file from state dir if present,
+      run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
+      time-to-expiry (or "static key / no cert" if absent)
+- [ ] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
+- [ ] `--json` flag for machine-readable output
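The exit-code rule for `cert-status` can be illustrated with a small sketch. The function names and the mapping from tunnel name to expiry are assumptions for this example; a `None` expiry stands for a tunnel in static key mode.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def describe_cert(expires_at: Optional[datetime], now: datetime) -> str:
    """Human-readable status line for one tunnel."""
    if expires_at is None:
        return "static key / no cert"
    remaining = expires_at - now
    if remaining <= timedelta(0):
        return "expired"
    return f"expires in {remaining}"


def cert_status_exit_code(expiries: Dict[str, Optional[datetime]], now: datetime) -> int:
    """Return 1 if any tunnel's cert is expired, otherwise 0."""
    expired = any(exp is not None and exp <= now for exp in expiries.values())
    return 1 if expired else 0
```

Static-key tunnels never contribute to a non-zero exit code here, so a script can use `bridge cert-status` as a check without special-casing tunnels that have no cert.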
+
+### T7 — CertAcquisitionError handling
+- [ ] New exception `CertAcquisitionError` in `models.py`
+- [ ] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
+      with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
+      (cert failures are transient — e.g., Vault briefly unreachable)
+- [ ] After `max_attempts` consecutive cert failures, transition to `FAILED` state
+
+### T8 — SCOPE.md and documentation updates
+- [ ] Update `SCOPE.md`: replace "Identity/credential management (uses existing SSH keys)"
+      with the pluggable cert_command model; add ops-warden as related repo; update
+      actor terminology to adm/agt/atm; update Current State
+- [ ] Update `wiki/OpsBridgeFrs.md` §5.7 (actor attribution): note three-actor model,
+      cert_identity field, cert_command interface
+- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency
+- [ ] Update config example in README / `wiki/` to show both static and cert_command modes
+- [ ] Update `.claude/rules/architecture.md`: add cert lifecycle to architecture description
+
+### T9 — Tests
+- [ ] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
+      cert_command parse
+- [ ] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
+      args; verify `CertAcquisitionError` on non-zero exit
+- [ ] `test_manager.py`: TTL logic — mock `cert_expires_at` in past; verify refresh triggers
+- [ ] `test_audit.py`: `cert_identity` field present in CONNECTED event when cert was used;
+      absent in static-key mode
+- [ ] `test_cli.py`: `cert-status` exit codes; JSON output shape
+
+---
+
+## Config Schema — Before / After
+
+### Before
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: ops-agent
+    ssh_key: ~/.ssh/id_ed25519
+    actor: automation-agent
+
+actors:
+  automation-agent:
+    class: automation
+    description: "state hub bridge agent"
+```
+
+### After (static key mode — unchanged behavior)
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
+    actor: agt-state-hub-bridge
+
+actors:
+  agt-state-hub-bridge:
+    class: agt
+    description: "state hub bridge agent"
+```
+
+### After (cert_command mode — ops-warden or any CA)
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
+    actor: agt-state-hub-bridge
+    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
+
+actors:
+  agt-state-hub-bridge:
+    class: agt
+    description: "state hub bridge agent"
+```
+
+---
+
+## Acceptance Criteria
+
+- [ ] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
+      warning only); tunnel behaves identically
+- [ ] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
+- [ ] Config with `cert_command` set: SSH process launched with both `-i key` and
+      `-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
+- [ ] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
+      no TTL logic runs
+- [ ] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
+      logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
+- [ ] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
+- [ ] `bridge cert-status` shows valid cert info; exits 1 on expired cert
+- [ ] All tests pass: `uv run pytest`
+- [ ] All lints pass: `uv run ruff check .`
--- a/workplans/WARDEN-WP-0001-initial-implementation.md
+++ b/workplans/WARDEN-WP-0001-initial-implementation.md
@@ -0,0 +1,252 @@
+---
+id: WARDEN-WP-0001
+type: workplan
+title: "OpsWarden Initial Implementation"
+domain: custodian
+repo: ops-warden
+status: draft
+owner: Bernd
+topic_slug: custodian
+created: "2026-03-28"
+updated: "2026-03-28"
+---
+
+# WARDEN-WP-0001 — OpsWarden Initial Implementation
+
+> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist.
+> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the
+> first commit action.
+
+**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that
+implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`.
+
+**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
+(those live in `railiance-infra`), session recording, and SSO integration (deferred until the
+scale thresholds in §6.2 of the directive are hit).
+
+---
+
+## Goal
+
+Create a new `ops-warden` repository that owns **credential issuance only** — the CA,
+certificate signing, actor identity registry, and scorecard tooling. Its sole public surface
+to sibling repos is a well-defined `cert_command` interface that any tool (principally
+`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor.
+
+---
+
+## Reference Documents
+
+| Document | Location |
+|---|---|
+| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
+| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |
+
+---
+
+## Architecture
+
+```
+ops-warden/
+├── SCOPE.md
+├── CLAUDE.md
+├── pyproject.toml
+├── src/warden/
+│   ├── cli.py          # Typer CLI: sign / issue / status / inventory / scorecard
+│   ├── models.py       # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
+│   ├── ca.py           # LocalCA backend (file-based, for dev / non-Vault)
+│   ├── vault.py        # VaultCA backend (Vault SSH engine, for production)
+│   ├── inventory.py    # YAML principals inventory read/write
+│   ├── scorecard.py    # §5 compliance checks
+│   └── config.py       # ~/.config/warden/warden.yaml loader
+├── tests/
+└── wiki/               # (symlink or copy of AccessManagementDirective.md)
+```
+
+**Backends are swappable.** Config key `backend: local | vault` selects which CA
+implementation is used. This means the tool is fully functional without Vault for local lab
+use, and production-grade with Vault — the same CLI surface, the same `cert_command`
+interface, the same principals inventory format.
+
+**cert_command interface contract:**
+```
+warden sign <actor-name> --pubkey <path>
+```
+On success, writes the signed certificate text to stdout and exits 0. Exits non-zero on failure.
+`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.
+
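A caller-side view of this contract might look like the sketch below. It is an illustration under stated assumptions, not the ops-bridge implementation: `run_cert_command` and `CertCommandError` are hypothetical names, and the choice to split the string with `shlex` and expand `~` by hand (rather than passing it to a shell) is an assumption about how the caller treats the configured command.

```python
import os
import shlex
import subprocess


class CertCommandError(RuntimeError):
    """Hypothetical error type: cert_command failed or produced no output."""


def run_cert_command(cert_command: str, timeout: float = 30.0) -> str:
    """Run a cert_command string and return the certificate text from stdout."""
    # Assumption: no shell is involved, so "~" is expanded explicitly per argument.
    argv = [os.path.expanduser(arg) for arg in shlex.split(cert_command)]
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        # The contract says non-zero means failure; keep stderr for the audit log.
        raise CertCommandError(
            f"cert_command exited {result.returncode}: {result.stderr.strip()}"
        )
    cert_text = result.stdout.strip()
    if not cert_text:
        raise CertCommandError("cert_command exited 0 but wrote no certificate")
    return cert_text
```

With the tunnel config shown earlier, the call would be `run_cert_command("warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub")`. A hung command raises `subprocess.TimeoutExpired`, which a caller would presumably treat like any other signing failure.
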
+---
+
+## Stack
+
+- **Language:** Python 3.11+
+- **CLI framework:** Typer
+- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
+- **Vault SDK:** `hvac` (optional; only required for vault backend)
+- **Packaging:** `uv tool install`
+
+---
+
+## Tasks
+
+### T1 — Repository bootstrap
+- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
+      `workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
+- [ ] Write `SCOPE.md` (see the "SCOPE.md Template" section below)
+- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
+- [ ] Register repo with state-hub (`register_repo`)
+- [ ] Create state-hub workstream for this workplan
+
+### T2 — Models and config
+- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
+      ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
+- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
+      `ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
+      `inventory_path`, `state_dir`
+- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)
+
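The prefix rule in T2 could be sketched roughly as follows. The enum values come from the task list; the function name `validate_actor_name` and the use of `ValueError` are assumptions made for illustration (the workplan does not say which exception `ops-warden` raises).

```python
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM agent
    ATM = "atm"  # automation


def validate_actor_name(name: str, declared: ActorType) -> None:
    """Raise ValueError unless `name` looks like '<declared>-<something>'."""
    prefix, sep, rest = name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {name!r} must look like '<type>-<name>'")
    if prefix != declared.value:
        raise ValueError(
            f"actor {name!r} is declared as {declared.value!r} "
            f"but carries prefix {prefix!r}"
        )
```

For example, `validate_actor_name("agt-state-hub-bridge", ActorType.AGT)` returns silently, while the same name declared as `ActorType.ATM` raises.
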
+### T3 — LocalCA backend
+- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
+      - Calls `ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>`
+      - Parses `ssh-keygen -L -f <cert>` output to extract the `Valid:` window (its `to`
+        timestamp is the cert's valid-before time), `Key ID`, and `Principals`
+      - Returns `CertRecord`; writes cert to `~/.local/state/warden/<actor>.cert.pub`
+- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
+      (overridable per actor in inventory)
+- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
+      actors that do not bring their own key
+
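The two `ssh-keygen` steps in T3 might be sketched as below: one helper builds the signing argv, the other pulls the fields out of an `ssh-keygen -L` listing. The helper names and the returned dict shape are hypothetical. The parser assumes the listing contains a bounded `Valid: from <ts> to <ts>` line with ISO-style timestamps (which `ssh-keygen` prints in local time) and that principals are listed one per line, indented deeper than the `Principals:` label; a cert that is valid forever is rejected here rather than handled.

```python
import re
from datetime import datetime

# Default TTLs per actor class, as stated in the task list above.
DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}

_VALID_RE = re.compile(
    r"^[ \t]*Valid:[ \t]+from[ \t]+(\S+)[ \t]+to[ \t]+(\S+)[ \t]*$", re.MULTILINE
)
_KEY_ID_RE = re.compile(r'^[ \t]*Key ID:[ \t]+"(.*)"[ \t]*$', re.MULTILINE)


def build_sign_argv(ca_key, identity, principals, ttl_hours, pubkey):
    """Build the ssh-keygen argv for signing `pubkey` with the CA key."""
    return [
        "ssh-keygen", "-s", ca_key, "-I", identity,
        "-n", ",".join(principals), "-V", f"+{ttl_hours}h", pubkey,
    ]


def parse_cert_listing(listing):
    """Extract identity, validity window, and principals from `ssh-keygen -L` text."""
    valid = _VALID_RE.search(listing)
    key_id = _KEY_ID_RE.search(listing)
    if valid is None or key_id is None:
        raise ValueError("listing lacks a 'Key ID' or bounded 'Valid:' line")
    principals = []
    lines = listing.splitlines()
    for index, line in enumerate(lines):
        if line.strip().startswith("Principals:"):
            base_indent = len(line) - len(line.lstrip())
            # Principals are the following lines indented deeper than the label.
            for follower in lines[index + 1:]:
                indent = len(follower) - len(follower.lstrip())
                if not follower.strip() or indent <= base_indent:
                    break
                principals.append(follower.strip())
            break
    return {
        "identity": key_id.group(1),
        "valid_after": datetime.fromisoformat(valid.group(1)),
        "valid_before": datetime.fromisoformat(valid.group(2)),
        "principals": principals,
    }
```

A `LocalCA.sign` implementation would presumably run the argv with `subprocess.run`, then feed the `-L` output through the parser to populate `CertRecord`.
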
+### T4 — VaultCA backend
+- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
+      - `POST /v1/ssh/sign/<role>` with `public_key`, `valid_principals`, `ttl`
+      - Parse response `signed_key` field; write to state dir; extract metadata via
+        `ssh-keygen -L`
+- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
+- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)
+
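The request and response handling in T4 can be separated from the HTTP call itself, which keeps it testable without a Vault server. The sketch below is one way to do that; the function names are hypothetical, and the `mount="ssh"` default simply mirrors the `/v1/ssh/sign/<role>` path given in the task (a differently mounted engine would need another value).

```python
def build_vault_sign_request(vault_addr, role, public_key, principals, ttl_hours,
                             mount="ssh"):
    """Return (url, json_payload) for a Vault SSH sign request."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/sign/{role}"
    payload = {
        "public_key": public_key.strip(),
        # Vault takes the principals as one comma-separated string.
        "valid_principals": ",".join(principals),
        "ttl": f"{ttl_hours}h",
    }
    return url, payload


def extract_signed_key(response_body):
    """Pull the certificate text out of a decoded Vault JSON response."""
    try:
        signed_key = response_body["data"]["signed_key"]
    except (KeyError, TypeError) as exc:
        raise ValueError("Vault response has no data.signed_key") from exc
    return signed_key.strip()
```

With `httpx` from the stack list, the call itself would be along the lines of `httpx.post(url, json=payload, headers={"X-Vault-Token": token})`, followed by `extract_signed_key(response.json())`. How the token is obtained is left open here.
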
+### T5 — Principals inventory
+- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
+      ```yaml
+      actors:
+        agt-state-hub-bridge:
+          type: agt
+          principals: [agt-task-bridge]
+          ttl_hours: 24
+          description: "ops-bridge tunnel actor"
+      hosts:
+        coulombcore:
+          allowed_principals:
+            agt: [agt-task-bridge]
+            atm: [atm-backup-daily]
+      ```
+- [ ] `warden inventory list` — print table
+- [ ] `warden inventory add <actor-name> --type <adm|agt|atm> --principals <...>`
+- [ ] `warden inventory remove <actor-name>`
+
+### T6 — CLI commands
+- [ ] `warden sign <actor-name> --pubkey <path>` — sign existing pubkey; write cert to
+      stdout (the `cert_command` interface for ops-bridge)
+- [ ] `warden issue <actor-name>` — generate keypair + sign; output JSON with
+      `privkey`, `cert`, `valid_before`, `identity`
+- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
+      remaining; `--all` flag to show all actors in state dir
+- [ ] `warden scorecard` — run §5 checks (see T7)
+- [ ] `warden inventory <subcommand>` (list / add / remove)
+
+### T7 — Scorecard runner
+- [ ] `scorecard.py`: implement each §5 row as a named check function returning
+      `CheckResult(name, passed, detail)`
+- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
+      - All certs in state dir respect TTL policy for their `ActorType`
+      - No actor in inventory lacks a `principals` entry
+      - Actor name prefix matches declared type
+      - No cert expired by more than 5 min still present in state dir (stale cleanup)
+- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
+      — those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
+- [ ] `warden scorecard --json` for machine-readable output
+
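Two of the local checks from T7 could take roughly the shape below. `CheckResult` follows the `(name, passed, detail)` signature given in the task; the check function names, the check identifiers, and the plain-dict view of `inventory.yaml` actors are assumptions for the sake of a self-contained sketch.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_prefix_matches_type(actors: dict[str, dict]) -> CheckResult:
    """Every actor name must start with '<type>-' for its declared type."""
    bad = sorted(
        name for name, entry in actors.items()
        if not name.startswith(f"{entry.get('type')}-")
    )
    return CheckResult("actor-prefix-matches-type", not bad, ", ".join(bad) or "ok")


def check_principals_present(actors: dict[str, dict]) -> CheckResult:
    """Every actor must declare at least one principal."""
    missing = sorted(
        name for name, entry in actors.items() if not entry.get("principals")
    )
    return CheckResult("actor-has-principals", not missing, ", ".join(missing) or "ok")


def run_checks(actors: dict[str, dict]) -> list[CheckResult]:
    """Run the inventory-only checks; cert-side checks would be appended here."""
    return [check_prefix_matches_type(actors), check_principals_present(actors)]
```

Returning one `CheckResult` per check, rather than raising on the first failure, keeps `warden scorecard --json` straightforward: the list can be serialized as-is.
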
+### T8 — ops-ssh-wrapper script
+- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
+      - Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
+      - Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
+      - Loads cert via `ssh-add`; execs the given command
+- [ ] Install the wrapper alongside the `warden` entry point so that `uv tool install` puts it on PATH
+
+### T9 — Tests
+- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
+- [ ] Unit tests for inventory YAML round-trip
+- [ ] Unit tests for actor name prefix validation
+- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
+- [ ] Scorecard unit tests (mock cert records)
+
+### T10 — Documentation
+- [ ] `SCOPE.md` (see below)
+- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
+- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
+- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)
+
+---
+
+## SCOPE.md Template
+
+```
+# SCOPE
+
+## One-liner
+SSH Certificate Authority and credential issuance for the ops fleet —
+signs short-lived certs for adm/agt/atm actors; provides the cert_command
+interface consumed by ops-bridge and other tooling.
+
+## Core Idea
+Implements AccessManagementDirective §§1–5. Owns the CA key, actor inventory,
+signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
+or SSH key generation for humans.
+
+## In Scope
+- Local CA backend (ssh-keygen -s) for lab / non-Vault use
+- Vault SSH engine backend for production
+- Actor identity registry (inventory.yaml)
+- cert_command CLI interface: `warden sign <actor> --pubkey <path>`
+- TTL policy enforcement per ActorType (adm/agt/atm)
+- Certificate status and stale-cert cleanup
+- Scorecard checks (local / cert-side only)
+- ops-ssh-wrapper script for agt/atm startup automation
+
+## Out of Scope
+- Host-side principal deployment (railiance-infra Ansible)
+- SSH key generation for human admins (self-service: ssh-keygen)
+- Vault cluster setup / HA
+- Session recording, audit forwarding to SIEM (host-side)
+- Tunnel lifecycle (ops-bridge)
+- SSO / Teleport (trigger when §6.2 scale thresholds are hit)
+
+## Relevant When
+- Issuing or refreshing a cert for any adm/agt/atm actor
+- Checking cert validity / scorecard compliance
+- ops-bridge needs cert_command to be defined
+- Adding a new actor to the principals inventory
+
+## Not Relevant When
+- Managing tunnel lifecycle (ops-bridge)
+- Deploying SSH config to hosts (railiance-infra)
+- All access is via static keys with no TTL (legacy mode)
+
+## Current State
+Status: planned (WARDEN-WP-0001 not yet started)
+
+## Related Repositories
+- ops-bridge — primary consumer of cert_command interface
+- railiance-infra — owns host-side principal deployment
+- the-custodian/state-hub — registers domain/workstreams
+```
+
+---
+
+## Acceptance Criteria
+
+- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
+- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
+- [ ] `warden scorecard` reports every in-scope check (see T7) as passed on a clean test inventory
+- [ ] `warden sign` called from ops-bridge `cert_command` in an integration test tunnel
+- [ ] All tests pass: `uv run pytest`
+- [ ] All lints pass: `uv run ruff check .`