---
id: BRIDGE-WP-0004
type: workplan
title: "AccessManagementDirective Alignment"
domain: custodian
repo: ops-bridge
status: done
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
state_hub_workstream_id: "e3451b70-688e-4e19-bff5-0c82c0f009a7"
---

# BRIDGE-WP-0004 — AccessManagementDirective Alignment

**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model, optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while preserving full backward compatibility with the existing static-key mode.

**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).

---

## Goal

After this workplan:

1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible `cert_command`) — cert acquisition, cert rotation, and cert identity logging are all handled transparently by the tunnel manager.
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from the directive, with config validation that enforces naming conventions.
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's §5 SIEM traceability requirement.

---

## Reference Documents

| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |

---

## Design Decisions

### Static key mode stays first-class

If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today: `ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings.

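As a rough illustration of the static-key path, the sketch below builds an SSH argument list that carries only the private key when no cert is available. The name `build_ssh_command` and its optional `cert_path` parameter are taken from T3; the other parameters, the argument order, and the bare `user@host` target are assumptions, and the real command presumably carries port-forwarding options that are omitted here.

```python
from pathlib import Path
from typing import List, Optional


def build_ssh_command(
    ssh_user: str,
    host: str,
    ssh_key: Path,
    cert_path: Optional[Path] = None,
) -> List[str]:
    """Hypothetical sketch: assemble the ssh arguments for one tunnel."""
    cmd = ["ssh", "-i", str(ssh_key)]
    if cert_path is not None:
        # cert_command mode: pass the cert as a second identity file.
        cmd += ["-i", str(cert_path)]
    # Static key mode falls through with only the private key.
    cmd.append(f"{ssh_user}@{host}")
    return cmd


# Static key mode: no cert_path, so only one -i option is emitted.
key = Path("~/.ssh/id_ed25519").expanduser()
static_cmd = build_ssh_command("ops-agent", "coulombcore", key)
```

Keeping the cert as an optional argument means the static-key call site does not change at all, which matches the "no deprecation, no warnings" intent.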
Static keys are explicitly supported for:

- Lab/dev environments without a CA
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
- Environments below the directive's complexity threshold

### cert_command interface

```yaml
# tunnels.yaml — optional cert_command field
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```

When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch, captures stdout as the cert text, writes it to a tempfile in the state dir, and adds a second `-i` option pointing at that cert file alongside the `-i` option for the private key. The cert file is cleaned up on tunnel stop.

`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes `warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface dependency-free — no Vault SDK, no warden import needed inside ops-bridge.

### TTL-aware cert refresh

After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to determine `cert_expires_at`. It schedules a pre-emptive cert refresh (`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth failure, no reconnect backoff triggered.

If `cert_command` is absent, no TTL logic runs.

### Actor type model

`actor_class: str  # "human" | "automation"` is replaced by:

```python
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline
```

Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
The mapping is a one-way migration aid with a deprecation warning; new configs must use the canonical values.

Config validation: if `actor` name is set, it must start with the prefix matching its type (`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for SIEM auditability.

---

## Tasks

### T1 — ActorType enum

```task
id: BRIDGE-WP-0004-T1
state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
status: done
priority: high
```

- [x] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
- [x] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` → `ActorType.ATM` with a `DeprecationWarning`; reject unknown values
- [x] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT, `atm-*` for ATM; raise `ConfigError` on mismatch
- [x] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
- [x] Update tests

### T2 — cert_command config field

```task
id: BRIDGE-WP-0004-T2
state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
status: done
priority: high
```

- [x] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
- [x] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string content (shell-level freedom intentional)
- [x] Document in config example / SCOPE.md

### T3 — Cert acquisition in manager

```task
id: BRIDGE-WP-0004-T3
state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
status: done
priority: high
```

- [x] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
  - If `cfg.cert_command` is None: return None (static key mode)
  - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
  - Write stdout to `~/.local/state/bridge/-cert.pub` (overwrite each time)
  - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
- [x] `build_ssh_command`: accept optional `cert_path`; when set, insert a second `-i` for the cert path after the `-i` for the private key (OpenSSH loads both
  automatically)
- [x] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup) so every reconnect gets a fresh cert

### T4 — cert_identity in audit log

```task
id: BRIDGE-WP-0004-T4
state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
status: done
priority: high
```

- [x] `manager.py`: after cert acquisition, parse the output of `ssh-keygen -L -f` run on the cert file to extract `Key ID` (the `-I` value from signing time)
- [x] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in JSON entry when present
- [x] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
- [x] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events

### T5 — TTL-aware cert refresh

```task
id: BRIDGE-WP-0004-T5
state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
status: done
priority: high
```

- [x] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp from `ssh-keygen -L` output → `cert_expires_at: datetime`
- [x] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)` on each iteration
- [x] When refresh is due: call `proc.terminate()`, break inner loop, let the outer reconnect loop restart naturally (T3 will re-acquire the cert at the top of the next iteration)
- [x] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to `AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
- [x] If `cert_command` is absent, skip all TTL logic entirely

### T6 — `bridge cert-status` command

```task
id: BRIDGE-WP-0004-T6
state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
status: done
priority: medium
```

- [x] `cli.py`: add `cert-status [TUNNEL]` subcommand
- [x] For each tunnel (or the named one): read cert file from state dir if present, run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until, time-to-expiry (or "static key / no cert" if absent)
- [x] Exit code 1 if any cert is expired; exit
  code 0 otherwise (scriptable)
- [x] `--json` flag for machine-readable output

### T7 — CertAcquisitionError handling

```task
id: BRIDGE-WP-0004-T7
state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
status: done
priority: high
```

- [x] New exception `CertAcquisitionError` in `models.py`
- [x] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED` with `detail="cert acquisition failed: "`, apply normal backoff and retry (cert failures are transient — e.g., Vault briefly unreachable)
- [x] After `max_attempts` consecutive cert failures, transition to `FAILED` state

### T8 — SCOPE.md and documentation updates

```task
id: BRIDGE-WP-0004-T8
state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
status: done
priority: medium
```

- [x] Update `SCOPE.md`: Current State updated to reflect completion; directive alignment done
- [x] `wiki/OpsBridgeFrs.md` §5.7 already covers actor attribution abstractly — no changes needed
- [x] `.claude/rules/architecture.md` already documents cert_command mode and actor vocab
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency (deferred)

### T9 — Tests

```task
id: BRIDGE-WP-0004-T9
state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
status: done
priority: high
```

- [x] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping; cert_command parse
- [x] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH args; verify `CertAcquisitionError` on non-zero exit; TTL logic helpers
- [x] `test_audit.py`: `cert_identity` field; actor_type rename
- [x] `test_cli.py`: `cert-status` exit codes; JSON output shape
- [x] 233 tests, 0 failures

---

## Config Schema — Before / After

### Before

```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: ops-agent
    ssh_key: ~/.ssh/id_ed25519
    actor: automation-agent

actors:
  automation-agent:
    class: automation
    description: "state hub bridge agent"
```

### After (static key mode — unchanged behavior)

```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```

### After (cert_command mode — ops-warden or any CA)

```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```

---

## Acceptance Criteria

- [x] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation warning only); tunnel behaves identically
- [x] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
- [x] Config with `cert_command` set: SSH process launched with both `-i key` and `-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
- [x] Config without `cert_command`: no cert file written; `cert_identity` absent in audit; no TTL logic runs
- [x] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED` logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
- [x] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
- [x] `bridge cert-status` shows valid cert info; exits 1 on expired cert
- [x] All tests pass: `uv run pytest` (233 passed)
- [x] All lints pass: `uv run ruff check .`
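
The two actor-related criteria above can be illustrated with a minimal sketch of the class mapping and the name-prefix check. `ActorType`, `ConfigError`, and the `DeprecationWarning` are named in this workplan; the helper names `parse_actor_type` and `validate_actor_name`, the base class of `ConfigError`, and the message wording are hypothetical. How the prefix check treats legacy actor names such as `automation-agent` is not spelled out here, so the two steps are kept as separate functions.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline


class ConfigError(ValueError):
    """Stand-in for the project's ConfigError (base class is an assumption)."""


# Legacy values accepted only as a one-way migration aid.
_LEGACY = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    """Map a config `class:` value to ActorType, warning on legacy values."""
    if raw in _LEGACY:
        mapped = _LEGACY[raw]
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {mapped.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return mapped
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor class: {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Raise ConfigError unless the name carries its type's prefix."""
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")


# Canonical config: passes both steps without warnings.
validate_actor_name("agt-state-hub-bridge", parse_actor_type("agt"))
```

Raising on a prefix mismatch rather than warning mirrors the "hard error" decision in the Actor type model section.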