Files
ops-bridge/workplans/BRIDGE-WP-0004-directive-alignment.md
2026-04-25 17:06:05 +02:00

11 KiB

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id
id type title domain repo status owner topic_slug created updated state_hub_workstream_id
BRIDGE-WP-0004 workplan AccessManagementDirective Alignment custodian ops-bridge draft Bernd custodian 2026-03-28 2026-03-28 e3451b70-688e-4e19-bff5-0c82c0f009a7

BRIDGE-WP-0004 — AccessManagementDirective Alignment

Scope: Align ops-bridge with wiki/AccessManagementDirective.md — three-actor model, optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while preserving full backward compatibility with the existing static-key mode.

Out of scope: CA/signing logic itself (lives in ops-warden), host-side principal deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).


Goal

After this workplan:

  1. ops-bridge works unchanged for anyone using plain, non-expiring SSH keys.
  2. ops-bridge works with CA-signed short-lived certs via ops-warden (or any compatible cert_command) — cert acquisition, cert rotation, and cert identity logging are all handled transparently by the tunnel manager.
  3. Actor attribution is expressed in the three-actor vocabulary (adm | agt | atm) from the directive, with config validation that enforces naming conventions.
  4. The audit log carries cert_identity when a cert was used, satisfying the directive's §5 SIEM traceability requirement.

Reference Documents

Document Location
AccessManagementDirective wiki/AccessManagementDirective.md
WARDEN-WP-0001 workplans/WARDEN-WP-0001-initial-implementation.md
PRD wiki/OpsBridgePrd.md
FRS wiki/OpsBridgeFrs.md

Design Decisions

Static key mode stays first-class

If cert_command is absent from a tunnel config, ops-bridge behaves exactly as today: ssh_key is passed directly to ssh -i. No deprecation, no warnings. Static keys are explicitly supported for:

  • Lab/dev environments without a CA
  • Tunnels owned by adm-class humans who manage their own cert refresh externally
  • Environments below the directive's complexity threshold

cert_command interface

# tunnels.yaml — optional cert_command field
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"

When cert_command is present, manager.py runs it before every SSH subprocess launch, captures stdout as the cert text, writes it to a tempfile in the state dir, and adds -i <cert_path> alongside -i <key_path> to the SSH command. The cert file is cleaned up on tunnel stop.

cert_command is a raw shell string, intentionally. The caller decides whether it invokes warden, vault write, ssh-keygen -s, or any other tool. This keeps the interface dependency-free — no Vault SDK, no warden import needed inside ops-bridge.

TTL-aware cert refresh

After acquiring a cert, manager.py parses Valid before: via ssh-keygen -L to determine cert_expires_at. It schedules a pre-emptive cert refresh (cert_expires_at - 5 min) inside the health-check/wait loop. When the refresh timer fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth failure, no reconnect backoff triggered.

If cert_command is absent, no TTL logic runs.

Actor type model

actor_class: str # "human" | "automation" is replaced by:

class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline

Backward-compat mapping at config load time: "human"adm, "automation"atm. The mapping is a one-way migration aid with a deprecation warning; new configs must use the canonical values.

Config validation: if actor name is set, it must start with the prefix matching its type (adm-*, agt-*, atm-*). Hard error, not a warning — the directive requires this for SIEM auditability.


Tasks

T1 — ActorType enum

  • models.py: replace actor_class: str in ActorInfo with actor_type: ActorType
  • config.py: accept legacy "human"ActorType.ADM and "automation"ActorType.ATM with a DeprecationWarning; reject unknown values
  • config.py: enforce actor name prefix: adm-* for ADM, agt-* for AGT, atm-* for ATM; raise ConfigError on mismatch
  • Update manager.py / audit.py call sites: actor_classactor_type.value
  • Update tests

T2 — cert_command config field

  • models.py: add cert_command: Optional[str] = None to TunnelConfig
  • config.py: parse cert_command from tunnel YAML; no validation of the string content (shell-level freedom intentional)
  • Document in config example / SCOPE.md

T3 — Cert acquisition in manager

  • manager.py: extract cert acquisition into _acquire_cert(cfg) -> Optional[Path] - If cfg.cert_command is None: return None (static key mode) - Run cert_command via subprocess.run(shell=True, capture_output=True) - Write stdout to ~/.local/state/bridge/<tunnel>-cert.pub (overwrite each time) - Return path; on non-zero exit code: raise CertAcquisitionError with stderr
  • build_ssh_command: accept optional cert_path; when set, insert -i <cert_path> after -i <key_path> (OpenSSH loads both automatically)
  • Call _acquire_cert at the top of each reconnect iteration (not once at startup) so every reconnect gets a fresh cert

T4 — cert_identity in audit log

  • manager.py: after cert acquisition, parse ssh-keygen -L -f <cert> output to extract Key ID (the -I value from signing time)
  • Add cert_identity: Optional[str] to AuditLogger.log() signature; include in JSON entry when present
  • Log cert_identity in BRIDGE_CONNECTED and BRIDGE_STARTED events
  • AuditEvent: no new events needed; cert_identity is metadata on existing events

T5 — TTL-aware cert refresh

  • manager.py: after successful cert acquisition, parse Valid before: timestamp from ssh-keygen -L output → cert_expires_at: datetime
  • In the health-check/wait loop, check datetime.now(utc) >= cert_expires_at - timedelta(minutes=5) on each iteration
  • When refresh is due: call proc.terminate(), break inner loop, let the outer reconnect loop restart naturally (T3 will re-acquire the cert at the top of the next iteration)
  • Log a new AuditEvent.CERT_EXPIRING event when refresh is triggered (add to AuditEvent enum); include cert_identity and cert_expires_at in detail field
  • If cert_command is absent, skip all TTL logic entirely

T6 — bridge cert-status command

  • cli.py: add cert-status [TUNNEL] subcommand
  • For each tunnel (or the named one): read cert file from state dir if present, run ssh-keygen -L, display: identity, principals, valid-from, valid-until, time-to-expiry (or "static key / no cert" if absent)
  • Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
  • --json flag for machine-readable output

T7 — CertAcquisitionError handling

  • New exception CertAcquisitionError in models.py
  • In _run_loop: catch CertAcquisitionError, log AuditEvent.BRIDGE_DISCONNECTED with detail="cert acquisition failed: <stderr>", apply normal backoff and retry (cert failures are transient — e.g., Vault briefly unreachable)
  • After max_attempts consecutive cert failures, transition to FAILED state

T8 — SCOPE.md and documentation updates

  • Update SCOPE.md: replace "Identity/credential management (uses existing SSH keys)" with the pluggable cert_command model; add ops-warden as related repo; update actor terminology to adm/agt/atm; update Current State
  • Update wiki/OpsBridgeFrs.md §5.7 (actor attribution): note three-actor model, cert_identity field, cert_command interface
  • Update wiki/OpsBridgePrd.md: note directive alignment, ops-warden dependency
  • Update config example in README / wiki/ to show both static and cert_command modes
  • Update .claude/rules/architecture.md: add cert lifecycle to architecture description

T9 — Tests

  • test_config.py: actor name prefix validation (adm/agt/atm); legacy class mapping; cert_command parse
  • test_manager.py: mock cert_command subprocess; verify cert path appended to SSH args; verify CertAcquisitionError on non-zero exit
  • test_manager.py: TTL logic — mock cert_expires_at in past; verify refresh triggers
  • test_audit.py: cert_identity field present in CONNECTED event when cert was used; absent in static-key mode
  • test_cli.py: cert-status exit codes; JSON output shape

Config Schema — Before / After

Before

tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: ops-agent
    ssh_key: ~/.ssh/id_ed25519
    actor: automation-agent

actors:
  automation-agent:
    class: automation
    description: "state hub bridge agent"

After (static key mode — unchanged behavior)

tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"

After (cert_command mode — ops-warden or any CA)

tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"

Acceptance Criteria

  • Existing tunnels.yaml with class: automation loads without error (deprecation warning only); tunnel behaves identically
  • New config with class: agt and actor name not prefixed agt- raises ConfigError
  • Config with cert_command set: SSH process launched with both -i key and -i cert; cert_identity present in BRIDGE_CONNECTED audit event
  • Config without cert_command: no cert file written; cert_identity absent in audit; no TTL logic runs
  • cert_command exits non-zero: tunnel enters backoff/retry, BRIDGE_DISCONNECTED logged with stderr detail; eventually reaches FAILED after max_attempts
  • Cert within 5 min of expiry: SSH restarted with fresh cert; CERT_EXPIRING logged
  • bridge cert-status shows valid cert info; exits 1 on expired cert
  • All tests pass: uv run pytest
  • All lints pass: uv run ruff check .