docs: align architecture and scope with AccessManagementDirective

Expands architecture constraints and SCOPE.md to reflect the three-actor vocabulary (adm/agt/atm), two credential modes (static key + cert_command), and the ops-warden boundary. Adds directive wiki doc and two new workplans (BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 00:59:38 +00:00
parent 75a559780e
commit f3a7236c5d
5 changed files with 773 additions and 15 deletions
--- a/workplans/BRIDGE-WP-0004-directive-alignment.md
+++ b/workplans/BRIDGE-WP-0004-directive-alignment.md
@@ -0,0 +1,272 @@
+---
+id: BRIDGE-WP-0004
+type: workplan
+title: "AccessManagementDirective Alignment"
+domain: custodian
+repo: ops-bridge
+status: draft
+owner: Bernd
+topic_slug: custodian
+created: "2026-03-28"
+updated: "2026-03-28"
+---
+
+# BRIDGE-WP-0004 — AccessManagementDirective Alignment
+
+**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
+optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
+preserving full backward compatibility with the existing static-key mode.
+
+**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
+deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
+
+---
+
+## Goal
+
+After this workplan:
+
+1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
+2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
+   `cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
+   handled transparently by the tunnel manager.
+3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
+   the directive, with config validation that enforces naming conventions.
+4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
+   §5 SIEM traceability requirement.
+
+---
+
+## Reference Documents
+
+| Document | Location |
+|---|---|
+| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
+| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
+| PRD | `wiki/OpsBridgePrd.md` |
+| FRS | `wiki/OpsBridgeFrs.md` |
+
+---
+
+## Design Decisions
+
+### Static key mode stays first-class
+
+If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
+`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
+explicitly supported for:
+- Lab/dev environments without a CA
+- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
+- Environments below the directive's complexity threshold
+
+### cert_command interface
+
+```yaml
+# tunnels.yaml — optional cert_command field
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
+    actor: agt-state-hub-bridge
+    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
+```
+
+When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
+captures stdout as the cert text, writes it to a per-tunnel cert file in the state dir, and
+adds `-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned
+up on tunnel stop.
+
+`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
+`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
+dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
+
+### TTL-aware cert refresh
+
+After acquiring a cert, `manager.py` parses the `Valid:` line of `ssh-keygen -L` output
+(`from <start> to <end>`) to determine `cert_expires_at`. It schedules a pre-emptive cert
+refresh at `cert_expires_at - 5 min` inside the health-check/wait loop. When the refresh
+timer fires, the SSH subprocess is gracefully restarted with a freshly signed cert, so no
+auth failure occurs and no reconnect backoff is triggered.
+
+If `cert_command` is absent, no TTL logic runs.
+
+### Actor type model
+
+`actor_class: str  # "human" | "automation"` is replaced by:
+
+```python
+class ActorType(str, Enum):
+    ADM = "adm"   # human operator
+    AGT = "agt"   # LLM-powered autonomous agent
+    ATM = "atm"   # deterministic script / pipeline
+```
+
+Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
+The mapping is a one-way migration aid with a deprecation warning; new configs must use the
+canonical values.
+
+Config validation: if `actor` name is set, it must start with the prefix matching its type
+(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
+SIEM auditability.
+
+---
+
+## Tasks
+
+### T1 — ActorType enum
+- [ ] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
+- [ ] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
+      `ActorType.ATM` with a `DeprecationWarning`; reject unknown values
+- [ ] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
+      `atm-*` for ATM; raise `ConfigError` on mismatch
+- [ ] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
+- [ ] Update tests
+
+### T2 — cert_command config field
+- [ ] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
+- [ ] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
+      content (shell-level freedom intentional)
+- [ ] Document in config example / SCOPE.md
+
+### T3 — Cert acquisition in manager
+- [ ] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
+      - If `cfg.cert_command` is None: return None (static key mode)
+      - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
+      - Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
+      - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
+- [ ] `build_ssh_command`: accept optional `cert_path`; when set, insert
+      `-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
+- [ ] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
+      so every reconnect gets a fresh cert
+
+### T4 — cert_identity in audit log
+- [ ] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
+      extract `Key ID` (the `-I` value from signing time)
+- [ ] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
+      JSON entry when present
+- [ ] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
+- [ ] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
+
+### T5 — TTL-aware cert refresh
+- [ ] `manager.py`: after successful cert acquisition, parse the end timestamp from the
+      `Valid: from <start> to <end>` line of `ssh-keygen -L` output → `cert_expires_at: datetime`
+- [ ] In the health-check/wait loop, check
+      `datetime.now(timezone.utc) >= cert_expires_at - timedelta(minutes=5)` on each iteration
+- [ ] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
+      reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
+      next iteration)
+- [ ] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
+      `AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
+- [ ] If `cert_command` is absent, skip all TTL logic entirely
+
+### T6 — `bridge cert-status` command
+- [ ] `cli.py`: add `cert-status [TUNNEL]` subcommand
+- [ ] For each tunnel (or the named one): read cert file from state dir if present,
+      run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
+      time-to-expiry (or "static key / no cert" if absent)
+- [ ] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
+- [ ] `--json` flag for machine-readable output
+
+### T7 — CertAcquisitionError handling
+- [ ] New exception `CertAcquisitionError` in `models.py`
+- [ ] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
+      with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
+      (cert failures are transient — e.g., Vault briefly unreachable)
+- [ ] After `max_attempts` consecutive cert failures, transition to `FAILED` state
+
+### T8 — SCOPE.md and documentation updates
+- [ ] Update `SCOPE.md`: replace "Identity/credential management (uses existing SSH keys)"
+      with the pluggable cert_command model; add ops-warden as related repo; update
+      actor terminology to adm/agt/atm; update Current State
+- [ ] Update `wiki/OpsBridgeFrs.md` §5.7 (actor attribution): note three-actor model,
+      cert_identity field, cert_command interface
+- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency
+- [ ] Update config example in README / `wiki/` to show both static and cert_command modes
+- [ ] Update `.claude/rules/architecture.md`: add cert lifecycle to architecture description
+
+### T9 — Tests
+- [ ] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
+      cert_command parse
+- [ ] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
+      args; verify `CertAcquisitionError` on non-zero exit
+- [ ] `test_manager.py`: TTL logic — mock `cert_expires_at` in past; verify refresh triggers
+- [ ] `test_audit.py`: `cert_identity` field present in CONNECTED event when cert was used;
+      absent in static-key mode
+- [ ] `test_cli.py`: `cert-status` exit codes; JSON output shape
+
+---
+
+## Config Schema — Before / After
+
+### Before
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: ops-agent
+    ssh_key: ~/.ssh/id_ed25519
+    actor: automation-agent
+
+actors:
+  automation-agent:
+    class: automation
+    description: "state hub bridge agent"
+```
+
+### After (static key mode — unchanged behavior)
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
+    actor: agt-state-hub-bridge
+
+actors:
+  agt-state-hub-bridge:
+    class: agt
+    description: "state hub bridge agent"
+```
+
+### After (cert_command mode — ops-warden or any CA)
+```yaml
+tunnels:
+  state-hub-coulombcore:
+    host: coulombcore
+    remote_port: 8001
+    local_port: 8000
+    ssh_user: agt-state-hub-bridge
+    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
+    actor: agt-state-hub-bridge
+    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
+
+actors:
+  agt-state-hub-bridge:
+    class: agt
+    description: "state hub bridge agent"
+```
+
+---
+
+## Acceptance Criteria
+
+- [ ] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
+      warning only); tunnel behaves identically
+- [ ] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
+- [ ] Config with `cert_command` set: SSH process launched with both `-i key` and
+      `-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
+- [ ] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
+      no TTL logic runs
+- [ ] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
+      logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
+- [ ] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
+- [ ] `bridge cert-status` shows valid cert info; exits 1 on expired cert
+- [ ] All tests pass: `uv run pytest`
+- [ ] All lints pass: `uv run ruff check .`
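The cert metadata steps in T4 and T5 of BRIDGE-WP-0004 above can be sketched roughly as follows. This is a minimal illustration, not the planned `manager.py` code: the names `parse_cert_listing` and `refresh_due` are hypothetical, the sample listing is made up, and the layout of the `Key ID` and `Valid:` lines is assumed from typical OpenSSH `ssh-keygen -L` output, which prints timestamps in local time without an offset.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

# Abridged, made-up example of `ssh-keygen -L -f <cert>` output.
SAMPLE = """\
/tmp/example-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Key ID: "agt-state-hub-bridge"
        Serial: 0
        Valid: from 2026-04-14T12:00:00 to 2026-04-15T12:00:00
        Principals:
                agt-task-bridge
"""

# Refresh window taken from the workplan: 5 minutes before expiry.
REFRESH_MARGIN = timedelta(minutes=5)


def parse_cert_listing(text: str) -> Tuple[Optional[str], Optional[datetime]]:
    """Return (cert_identity, cert_expires_at) parsed from `ssh-keygen -L` text.

    cert_expires_at is None when the listing has no end time
    (for example `Valid: forever`).
    """
    key_id = re.search(r'^\s*Key ID:\s*"(.*)"\s*$', text, re.MULTILINE)
    valid = re.search(r"^\s*Valid:.*\b(?:to|before)\s+(\S+)\s*$", text, re.MULTILINE)
    identity = key_id.group(1) if key_id else None
    expires = None
    if valid:
        naive = datetime.strptime(valid.group(1), "%Y-%m-%dT%H:%M:%S")
        # Assumption: the timestamp is local time, so attach the local zone.
        expires = naive.astimezone()
    return identity, expires


def refresh_due(
    expires_at: Optional[datetime],
    now: Optional[datetime] = None,
    margin: timedelta = REFRESH_MARGIN,
) -> bool:
    """True once `now` is inside the refresh window before expiry."""
    if expires_at is None:
        return False  # static key mode or non-expiring cert: no TTL logic
    now = now or datetime.now(timezone.utc)
    return now >= expires_at - margin


if __name__ == "__main__":
    identity, expires_at = parse_cert_listing(SAMPLE)
    print(identity, expires_at, refresh_due(expires_at))
```

Keeping the parser a pure function of the listing text means the tests in T9 could feed it canned output instead of mocking a subprocess, and the `now` parameter lets the TTL test pin the clock.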
--- a/workplans/WARDEN-WP-0001-initial-implementation.md
+++ b/workplans/WARDEN-WP-0001-initial-implementation.md
@@ -0,0 +1,252 @@
+---
+id: WARDEN-WP-0001
+type: workplan
+title: "OpsWarden Initial Implementation"
+domain: custodian
+repo: ops-warden
+status: draft
+owner: Bernd
+topic_slug: custodian
+created: "2026-03-28"
+updated: "2026-03-28"
+---
+
+# WARDEN-WP-0001 — OpsWarden Initial Implementation
+
+> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist.
+> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the
+> first commit action.
+
+**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that
+implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`.
+
+**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
+(those live in `railiance-infra`), session recording, and SSO integration (deferred until
+the scale thresholds in §6.2 of the directive are reached).
+
+---
+
+## Goal
+
+Create a new `ops-warden` repository that owns **credential issuance only** — the CA,
+certificate signing, actor identity registry, and scorecard tooling. Its sole public surface
+to sibling repos is a well-defined `cert_command` interface that any tool (principally
+`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor.
+
+---
+
+## Reference Documents
+
+| Document | Location |
+|---|---|
+| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
+| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |
+
+---
+
+## Architecture
+
+```
+ops-warden/
+├── SCOPE.md
+├── CLAUDE.md
+├── pyproject.toml
+├── src/warden/
+│   ├── cli.py          # Typer CLI: sign / issue / status / inventory / scorecard
+│   ├── models.py       # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
+│   ├── ca.py           # LocalCA backend (file-based, for dev / non-Vault)
+│   ├── vault.py        # VaultCA backend (Vault SSH engine, for production)
+│   ├── inventory.py    # YAML principals inventory read/write
+│   ├── scorecard.py    # §5 compliance checks
+│   └── config.py       # ~/.config/warden/warden.yaml loader
+├── tests/
+└── wiki/               # (symlink or copy of AccessManagementDirective.md)
+```
+
+**Backends are swappable.** Config key `backend: local | vault` selects which CA
+implementation is used. This means the tool is fully functional without Vault for local lab
+use, and production-grade with Vault — the same CLI surface, the same `cert_command`
+interface, the same principals inventory format.
+
+**cert_command interface contract:**
+```
+warden sign <actor-name> --pubkey <path>
+```
+Writes the signed certificate to stdout (the cert text). Exits non-zero on failure.
+`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.
+
+---
+
+## Stack
+
+- **Language:** Python 3.11+
+- **CLI framework:** Typer
+- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
+- **Vault SDK:** `hvac` (optional; only required for vault backend)
+- **Packaging:** `uv tool install`
+
+---
+
+## Tasks
+
+### T1 — Repository bootstrap
+- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
+      `workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
+- [ ] Write `SCOPE.md` (see the "SCOPE.md Template" section below)
+- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
+- [ ] Register repo with state-hub (`register_repo`)
+- [ ] Create state-hub workstream for this workplan
+
+### T2 — Models and config
+- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
+      ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
+- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
+      `ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
+      `inventory_path`, `state_dir`
+- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)
+
+### T3 — LocalCA backend
+- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
+      - Calls `ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>`
+      - Parses `ssh-keygen -L -f <cert>` output to extract the validity end (`Valid:` line),
+        `Key ID`, and `Principals`
+      - Returns `CertRecord`; writes cert to `~/.local/state/warden/<actor>.cert.pub`
+- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
+      (overridable per actor in inventory)
+- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
+      actors that do not bring their own key
+
+### T4 — VaultCA backend
+- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
+      - `POST /v1/ssh/sign/<role>` with `public_key`, `valid_principals`, `ttl`
+      - Parse response `signed_key` field; write to state dir; extract metadata via
+        `ssh-keygen -L`
+- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
+- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)
+
+### T5 — Principals inventory
+- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
+      ```yaml
+      actors:
+        agt-state-hub-bridge:
+          type: agt
+          principals: [agt-task-bridge]
+          ttl_hours: 24
+          description: "ops-bridge tunnel actor"
+      hosts:
+        coulombcore:
+          allowed_principals:
+            agt: [agt-task-bridge]
+            atm: [atm-backup-daily]
+      ```
+- [ ] `warden inventory list` — print table
+- [ ] `warden inventory add <actor-name> --type <adm|agt|atm> --principals <...>`
+- [ ] `warden inventory remove <actor-name>`
+
+### T6 — CLI commands
+- [ ] `warden sign <actor-name> --pubkey <path>` — sign existing pubkey; write cert to
+      stdout (the `cert_command` interface for ops-bridge)
+- [ ] `warden issue <actor-name>` — generate keypair + sign; output JSON with
+      `privkey`, `cert`, `valid_before`, `identity`
+- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
+      remaining; `--all` flag to show all actors in state dir
+- [ ] `warden scorecard` — run §5 checks (see T7)
+- [ ] `warden inventory <subcommand>` (list / add / remove)
+
+### T7 — Scorecard runner
+- [ ] `scorecard.py`: implement each §5 row as a named check function returning
+      `CheckResult(name, passed, detail)`
+- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
+      - All certs in state dir respect TTL policy for their `ActorType`
+      - No actor in inventory lacks a `principals` entry
+      - Actor name prefix matches declared type
+      - No cert expired by more than 5 min still present in state dir (stale cleanup)
+- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
+      — those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
+- [ ] `warden scorecard --json` for machine-readable output
+
+### T8 — ops-ssh-wrapper script
+- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
+      - Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
+      - Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
+      - Loads cert via `ssh-add`; execs the given command
+- [ ] Install as part of `uv tool install` entry points
+
+### T9 — Tests
+- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
+- [ ] Unit tests for inventory YAML round-trip
+- [ ] Unit tests for actor name prefix validation
+- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
+- [ ] Scorecard unit tests (mock cert records)
+
+### T10 — Documentation
+- [ ] `SCOPE.md` (see below)
+- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
+- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
+- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)
+
+---
+
+## SCOPE.md Template
+
+```
+# SCOPE
+
+## One-liner
+SSH Certificate Authority and credential issuance for the ops fleet —
+signs short-lived certs for adm/agt/atm actors; provides the cert_command
+interface consumed by ops-bridge and other tooling.
+
+## Core Idea
+Implements AccessManagementDirective §§1–5. Owns the CA key, actor inventory,
+signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
+or SSH key generation for humans.
+
+## In Scope
+- Local CA backend (ssh-keygen -s) for lab / non-Vault use
+- Vault SSH engine backend for production
+- Actor identity registry (inventory.yaml)
+- cert_command CLI interface: `warden sign <actor> --pubkey <path>`
+- TTL policy enforcement per ActorType (adm/agt/atm)
+- Certificate status and stale-cert cleanup
+- Scorecard checks (local / cert-side only)
+- ops-ssh-wrapper script for agt/atm startup automation
+
+## Out of Scope
+- Host-side principal deployment (railiance-infra Ansible)
+- SSH key generation for human admins (self-service: ssh-keygen)
+- Vault cluster setup / HA
+- Session recording, audit forwarding to SIEM (host-side)
+- Tunnel lifecycle (ops-bridge)
+- SSO / Teleport (trigger when §6.2 scale thresholds are hit)
+
+## Relevant When
+- Issuing or refreshing a cert for any adm/agt/atm actor
+- Checking cert validity / scorecard compliance
+- ops-bridge needs cert_command to be defined
+- Adding a new actor to the principals inventory
+
+## Not Relevant When
+- Managing tunnel lifecycle (ops-bridge)
+- Deploying SSH config to hosts (railiance-infra)
+- All access is via static keys with no TTL (legacy mode)
+
+## Current State
+Status: planned (WARDEN-WP-0001 not yet started)
+
+## Related Repositories
+- ops-bridge — primary consumer of cert_command interface
+- railiance-infra — owns host-side principal deployment
+- the-custodian/state-hub — registers domain/workstreams
+```
+
+---
+
+## Acceptance Criteria
+
+- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
+- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
+- [ ] `warden scorecard` reports every in-scope check as passing on a clean test inventory
+- [ ] `warden sign` succeeds when called from ops-bridge `cert_command` in an integration test tunnel
+- [ ] All tests pass: `uv run pytest`
+- [ ] All lints pass: `uv run ruff check .`
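The actor model and the LocalCA signing call from T2 and T3 of WARDEN-WP-0001 above could look roughly like the sketch below. It only builds the `ssh-keygen -s` argument list and does not run it. The helper names `actor_type_from_name` and `build_sign_argv` are hypothetical, and using the actor name as the `-I` identity is an assumption; the default TTLs are the ones stated in T3.

```python
from enum import Enum
from typing import List, Optional


class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline


# Default TTLs per actor type, as listed in T3 (overridable per actor).
DEFAULT_TTL_HOURS = {ActorType.ADM: 48, ActorType.AGT: 24, ActorType.ATM: 8}


def actor_type_from_name(name: str) -> ActorType:
    """Derive the ActorType from the `adm-` / `agt-` / `atm-` name prefix."""
    prefix, sep, rest = name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {name!r} must look like '<type>-<name>'")
    try:
        return ActorType(prefix)
    except ValueError:
        raise ValueError(f"actor name {name!r} has unknown prefix {prefix!r}") from None


def build_sign_argv(
    ca_key: str,
    actor_name: str,
    principals: List[str],
    pubkey: str,
    ttl_hours: Optional[int] = None,
) -> List[str]:
    """Build the ssh-keygen argv that signs `pubkey` with the CA key."""
    actor_type = actor_type_from_name(actor_name)
    ttl = ttl_hours if ttl_hours is not None else DEFAULT_TTL_HOURS[actor_type]
    return [
        "ssh-keygen",
        "-s", ca_key,                 # CA private key
        "-I", actor_name,             # key identity (assumed: the actor name)
        "-n", ",".join(principals),   # comma-separated principals
        "-V", f"+{ttl}h",             # validity: from now until now + ttl hours
        pubkey,
    ]


if __name__ == "__main__":
    print(build_sign_argv("/tmp/ca_key", "agt-state-hub-bridge",
                          ["agt-task-bridge"], "/tmp/test.pub"))
```

Returning an argv list rather than a shell string keeps actor names and paths out of shell parsing, which seems worthwhile for a tool that holds the CA key.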