Files
ops-bridge/workplans/BRIDGE-WP-0004-directive-alignment.md
tegwick bd169a07e2 feat(directive): implement BRIDGE-WP-0004 AccessManagementDirective alignment
- ActorType enum (adm/agt/atm) replaces actor_class string; config validates
  naming convention (adm-*/agt-*/atm-*) with hard ConfigError on mismatch;
  legacy 'human'/'automation' values accepted with DeprecationWarning
- cert_command: pluggable shell string run before each SSH launch; cert written
  to state dir; -i cert appended to SSH command alongside -i key
- TTL-aware cert refresh: parses Valid-to via ssh-keygen -L; pre-emptive restart
  5 min before expiry (no backoff, no attempt increment); CERT_EXPIRING logged
- CertAcquisitionError: cert failures trigger normal backoff/retry loop
- cert_identity: Key ID parsed from cert and recorded in BRIDGE_CONNECTED event
- bridge cert-status: new CLI command; exit 1 on expired cert; --json flag
- 233 tests passing, ruff clean

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 09:38:29 +02:00

12 KiB

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id
id type title domain repo status owner topic_slug created updated state_hub_workstream_id
BRIDGE-WP-0004 workplan AccessManagementDirective Alignment custodian ops-bridge done Bernd custodian 2026-03-28 2026-03-28 e3451b70-688e-4e19-bff5-0c82c0f009a7

BRIDGE-WP-0004 — AccessManagementDirective Alignment

Scope: Align ops-bridge with wiki/AccessManagementDirective.md — three-actor model, optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while preserving full backward compatibility with the existing static-key mode.

Out of scope: CA/signing logic itself (lives in ops-warden), host-side principal deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).


Goal

After this workplan:

  1. ops-bridge works unchanged for anyone using plain, non-expiring SSH keys.
  2. ops-bridge works with CA-signed short-lived certs via ops-warden (or any compatible cert_command) — cert acquisition, cert rotation, and cert identity logging are all handled transparently by the tunnel manager.
  3. Actor attribution is expressed in the three-actor vocabulary (adm | agt | atm) from the directive, with config validation that enforces naming conventions.
  4. The audit log carries cert_identity when a cert was used, satisfying the directive's §5 SIEM traceability requirement.

Reference Documents

Document Location
AccessManagementDirective wiki/AccessManagementDirective.md
WARDEN-WP-0001 workplans/WARDEN-WP-0001-initial-implementation.md
PRD wiki/OpsBridgePrd.md
FRS wiki/OpsBridgeFrs.md

Design Decisions

Static key mode stays first-class

If cert_command is absent from a tunnel config, ops-bridge behaves exactly as today: ssh_key is passed directly to ssh -i. No deprecation, no warnings. Static keys are explicitly supported for:

  • Lab/dev environments without a CA
  • Tunnels owned by adm-class humans who manage their own cert refresh externally
  • Environments below the directive's complexity threshold

cert_command interface

# tunnels.yaml — optional cert_command field
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"

When cert_command is present, manager.py runs it before every SSH subprocess launch, captures stdout as the cert text, writes it to a tempfile in the state dir, and adds -i <cert_path> alongside -i <key_path> to the SSH command. The cert file is cleaned up on tunnel stop.

cert_command is a raw shell string, intentionally. The caller decides whether it invokes warden, vault write, ssh-keygen -s, or any other tool. This keeps the interface dependency-free — no Vault SDK, no warden import needed inside ops-bridge.

TTL-aware cert refresh

After acquiring a cert, manager.py parses Valid before: via ssh-keygen -L to determine cert_expires_at. It schedules a pre-emptive cert refresh (cert_expires_at - 5 min) inside the health-check/wait loop. When the refresh timer fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth failure, no reconnect backoff triggered.

If cert_command is absent, no TTL logic runs.

Actor type model

actor_class: str # "human" | "automation" is replaced by:

class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline

Backward-compat mapping at config load time: "human"adm, "automation"atm. The mapping is a one-way migration aid with a deprecation warning; new configs must use the canonical values.

Config validation: if actor name is set, it must start with the prefix matching its type (adm-*, agt-*, atm-*). Hard error, not a warning — the directive requires this for SIEM auditability.


Tasks

T1 — ActorType enum

id: BRIDGE-WP-0004-T1
state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
status: done
priority: high
  • models.py: replace actor_class: str in ActorInfo with actor_type: ActorType
  • config.py: accept legacy "human"ActorType.ADM and "automation"ActorType.ATM with a DeprecationWarning; reject unknown values
  • config.py: enforce actor name prefix: adm-* for ADM, agt-* for AGT, atm-* for ATM; raise ConfigError on mismatch
  • Update manager.py / audit.py call sites: actor_classactor_type.value
  • Update tests

T2 — cert_command config field

id: BRIDGE-WP-0004-T2
state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
status: done
priority: high
  • models.py: add cert_command: Optional[str] = None to TunnelConfig
  • config.py: parse cert_command from tunnel YAML; no validation of the string content (shell-level freedom intentional)
  • Document in config example / SCOPE.md

T3 — Cert acquisition in manager

id: BRIDGE-WP-0004-T3
state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
status: done
priority: high
  • manager.py: extract cert acquisition into _acquire_cert(cfg) -> Optional[Path] - If cfg.cert_command is None: return None (static key mode) - Run cert_command via subprocess.run(shell=True, capture_output=True) - Write stdout to ~/.local/state/bridge/<tunnel>-cert.pub (overwrite each time) - Return path; on non-zero exit code: raise CertAcquisitionError with stderr
  • build_ssh_command: accept optional cert_path; when set, insert -i <cert_path> after -i <key_path> (OpenSSH loads both automatically)
  • Call _acquire_cert at the top of each reconnect iteration (not once at startup) so every reconnect gets a fresh cert

T4 — cert_identity in audit log

id: BRIDGE-WP-0004-T4
state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
status: done
priority: high
  • manager.py: after cert acquisition, parse ssh-keygen -L -f <cert> output to extract Key ID (the -I value from signing time)
  • Add cert_identity: Optional[str] to AuditLogger.log() signature; include in JSON entry when present
  • Log cert_identity in BRIDGE_CONNECTED and BRIDGE_STARTED events
  • AuditEvent: no new events needed; cert_identity is metadata on existing events

T5 — TTL-aware cert refresh

id: BRIDGE-WP-0004-T5
state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
status: done
priority: high
  • manager.py: after successful cert acquisition, parse Valid before: timestamp from ssh-keygen -L output → cert_expires_at: datetime
  • In the health-check/wait loop, check datetime.now(utc) >= cert_expires_at - timedelta(minutes=5) on each iteration
  • When refresh is due: call proc.terminate(), break inner loop, let the outer reconnect loop restart naturally (T3 will re-acquire the cert at the top of the next iteration)
  • Log a new AuditEvent.CERT_EXPIRING event when refresh is triggered (add to AuditEvent enum); include cert_identity and cert_expires_at in detail field
  • If cert_command is absent, skip all TTL logic entirely

T6 — bridge cert-status command

id: BRIDGE-WP-0004-T6
state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
status: done
priority: medium
  • cli.py: add cert-status [TUNNEL] subcommand
  • For each tunnel (or the named one): read cert file from state dir if present, run ssh-keygen -L, display: identity, principals, valid-from, valid-until, time-to-expiry (or "static key / no cert" if absent)
  • Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
  • --json flag for machine-readable output

T7 — CertAcquisitionError handling

id: BRIDGE-WP-0004-T7
state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
status: done
priority: high
  • New exception CertAcquisitionError in models.py
  • In _run_loop: catch CertAcquisitionError, log AuditEvent.BRIDGE_DISCONNECTED with detail="cert acquisition failed: <stderr>", apply normal backoff and retry (cert failures are transient — e.g., Vault briefly unreachable)
  • After max_attempts consecutive cert failures, transition to FAILED state

T8 — SCOPE.md and documentation updates

id: BRIDGE-WP-0004-T8
state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
status: done
priority: medium
  • Update SCOPE.md: Current State updated to reflect completion; directive alignment done
  • wiki/OpsBridgeFrs.md §5.7 already covers actor attribution abstractly — no changes needed
  • .claude/rules/architecture.md already documents cert_command mode and actor vocab
  • Update wiki/OpsBridgePrd.md: note directive alignment, ops-warden dependency (deferred)

T9 — Tests

id: BRIDGE-WP-0004-T9
state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
status: done
priority: high
  • test_config.py: actor name prefix validation (adm/agt/atm); legacy class mapping; cert_command parse
  • test_manager.py: mock cert_command subprocess; verify cert path appended to SSH args; verify CertAcquisitionError on non-zero exit; TTL logic helpers
  • test_audit.py: cert_identity field; actor_type rename
  • test_cli.py: cert-status exit codes; JSON output shape
  • 233 tests, 0 failures

Config Schema — Before / After

Before

tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: ops-agent
    ssh_key: ~/.ssh/id_ed25519
    actor: automation-agent

actors:
  automation-agent:
    class: automation
    description: "state hub bridge agent"

After (static key mode — unchanged behavior)

tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"

After (cert_command mode — ops-warden or any CA)

tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"

Acceptance Criteria

  • Existing tunnels.yaml with class: automation loads without error (deprecation warning only); tunnel behaves identically
  • New config with class: agt and actor name not prefixed agt- raises ConfigError
  • Config with cert_command set: SSH process launched with both -i key and -i cert; cert_identity present in BRIDGE_CONNECTED audit event
  • Config without cert_command: no cert file written; cert_identity absent in audit; no TTL logic runs
  • cert_command exits non-zero: tunnel enters backoff/retry, BRIDGE_DISCONNECTED logged with stderr detail; eventually reaches FAILED after max_attempts
  • Cert within 5 min of expiry: SSH restarted with fresh cert; CERT_EXPIRING logged
  • bridge cert-status shows valid cert info; exits 1 on expired cert
  • All tests pass: uv run pytest (233 passed)
  • All lints pass: uv run ruff check .