| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|
| BRIDGE-WP-0004 | workplan | AccessManagementDirective Alignment | custodian | ops-bridge | done | Bernd | custodian | 2026-03-28 | 2026-03-28 | e3451b70-688e-4e19-bff5-0c82c0f009a7 |
BRIDGE-WP-0004 — AccessManagementDirective Alignment
Scope: Align ops-bridge with wiki/AccessManagementDirective.md — three-actor model,
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
preserving full backward compatibility with the existing static-key mode.
Out of scope: CA/signing logic itself (lives in ops-warden), host-side principal
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
Goal
After this workplan:
- ops-bridge works unchanged for anyone using plain, non-expiring SSH keys.
- ops-bridge works with CA-signed short-lived certs via ops-warden (or any compatible cert_command): cert acquisition, cert rotation, and cert identity logging are all handled transparently by the tunnel manager.
- Actor attribution is expressed in the three-actor vocabulary (adm | agt | atm) from the directive, with config validation that enforces naming conventions.
- The audit log carries cert_identity when a cert was used, satisfying the directive's §5 SIEM traceability requirement.
Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | wiki/AccessManagementDirective.md |
| WARDEN-WP-0001 | workplans/WARDEN-WP-0001-initial-implementation.md |
| PRD | wiki/OpsBridgePrd.md |
| FRS | wiki/OpsBridgeFrs.md |
Design Decisions
Static key mode stays first-class
If cert_command is absent from a tunnel config, ops-bridge behaves exactly as today:
ssh_key is passed directly to ssh -i. No deprecation, no warnings. Static keys are
explicitly supported for:
- Lab/dev environments without a CA
- Tunnels owned by adm-class humans who manage their own cert refresh externally
- Environments below the directive's complexity threshold
cert_command interface
    # tunnels.yaml: optional cert_command field
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: agt-state-hub-bridge
        ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519  # private key (always required)
        actor: agt-state-hub-bridge
        cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
When cert_command is present, manager.py runs it before every SSH subprocess launch,
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
-i <cert_path> alongside -i <key_path> to the SSH command. The cert file is cleaned up
on tunnel stop.
cert_command is a raw shell string, intentionally. The caller decides whether it invokes
warden, vault write, ssh-keygen -s, or any other tool. This keeps the interface
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
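The resulting SSH invocation can be sketched as follows. This is a minimal illustration, not the actual build_ssh_command from manager.py: the parameter list, the -N flag, and the local-forward (-L) direction are assumptions made for the example. Only the ordering, -i <key_path> followed by -i <cert_path>, comes from the description above.

```python
from pathlib import Path
from typing import List, Optional


def build_ssh_command(
    host: str,
    ssh_user: str,
    key_path: Path,
    local_port: int,
    remote_port: int,
    cert_path: Optional[Path] = None,
) -> List[str]:
    """Build the ssh argv; add the cert only when one was acquired."""
    cmd = ["ssh", "-N", "-i", str(key_path)]
    if cert_path is not None:
        # cert_command mode: the cert goes right after the private key
        cmd += ["-i", str(cert_path)]
    # Assumption: the tunnel is a local forward; the real direction may differ.
    cmd += ["-L", f"{local_port}:localhost:{remote_port}", f"{ssh_user}@{host}"]
    return cmd
```

With cert_path left as None the command is identical to the static-key invocation, which is what keeps the two modes compatible.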
TTL-aware cert refresh
After acquiring a cert, manager.py runs ssh-keygen -L and parses the end of the
validity interval (the Valid: line of the output) to determine cert_expires_at. It
schedules a pre-emptive cert refresh at cert_expires_at - 5 min inside the
health-check/wait loop. When the refresh timer fires, the SSH subprocess is gracefully
restarted with a freshly signed cert, so no auth failure occurs and no reconnect
backoff is triggered.
If cert_command is absent, no TTL logic runs.
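A minimal sketch of the expiry parsing, under two assumptions: that ssh-keygen -L prints the validity as a Valid: line of the form "from <start> to <end>", "before <end>", "after <start>", or "forever", and that the timestamps are in local time. The function name parse_cert_expiry is hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Matches "Valid: from <start> to <end>" and "Valid: before <end>".
_VALID_RE = re.compile(
    r"^\s*Valid:\s*(?:from\s+\S+\s+to|before)\s+(\S+)\s*$", re.MULTILINE
)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the cert expiry as an aware UTC datetime, or None if it never expires."""
    match = _VALID_RE.search(keygen_output)
    if match is None:
        return None  # "forever", "after <start>", or no Valid: line at all
    naive_local = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    # Assumption: the timestamp is local time; astimezone() interprets it that way.
    return naive_local.astimezone(timezone.utc)
```

Returning None for a cert without an end time lets the caller treat it like static key mode and skip the refresh timer.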
Actor type model
actor_class: str  # "human" | "automation"

is replaced by:

    from enum import Enum


    class ActorType(str, Enum):
        ADM = "adm"  # human operator
        AGT = "agt"  # LLM-powered autonomous agent
        ATM = "atm"  # deterministic script / pipeline
Backward-compat mapping at config load time: "human" → adm, "automation" → atm.
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
canonical values.
Config validation: if actor name is set, it must start with the prefix matching its type
(adm-*, agt-*, atm-*). Hard error, not a warning — the directive requires this for
SIEM auditability.
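The mapping and the prefix rule could be combined roughly as below. This is a sketch, not the code in config.py: load_actor is a hypothetical helper, and ConfigError is redefined here only to keep the example self-contained. The text does not say whether actors that still use a legacy class value are exempt from the prefix check; the sketch assumes they are, because the acceptance criteria require existing class: automation configs to keep loading.

```python
import warnings
from enum import Enum


class ConfigError(ValueError):
    """Stand-in for the project's ConfigError."""


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


_LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def load_actor(name: str, raw_class: str) -> ActorType:
    """Resolve an actor's class value and enforce the naming convention."""
    if raw_class in _LEGACY_CLASSES:
        mapped = _LEGACY_CLASSES[raw_class]
        warnings.warn(
            f"actor class {raw_class!r} is deprecated; use {mapped.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: legacy actors skip the prefix check (see lead-in).
        return mapped
    try:
        actor_type = ActorType(raw_class)
    except ValueError:
        raise ConfigError(f"unknown actor class {raw_class!r}") from None
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
    return actor_type
```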
Tasks
T1 — ActorType enum
id: BRIDGE-WP-0004-T1
state_hub_task_id: 40c7f818-8233-4b84-9a0e-5f5359a47504
status: done
priority: high
- models.py: replace actor_class: str in ActorInfo with actor_type: ActorType
- config.py: accept legacy "human" → ActorType.ADM and "automation" → ActorType.ATM with a DeprecationWarning; reject unknown values
- config.py: enforce actor name prefix: adm-* for ADM, agt-* for AGT, atm-* for ATM; raise ConfigError on mismatch
- Update manager.py / audit.py call sites: actor_class → actor_type.value
- Update tests
T2 — cert_command config field
id: BRIDGE-WP-0004-T2
state_hub_task_id: d69ac3b8-6c68-4da0-976f-0cce2ee626d6
status: done
priority: high
- models.py: add cert_command: Optional[str] = None to TunnelConfig
- config.py: parse cert_command from tunnel YAML; no validation of the string content (shell-level freedom intentional)
- Document in config example / SCOPE.md
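As an illustration of the new field, the sketch below models TunnelConfig as a dataclass and parses one tunnel entry from an already-loaded YAML mapping. Whether the real TunnelConfig is a dataclass, and the parse_tunnel helper itself, are assumptions; the field names are the keys used in the tunnels.yaml examples in this workplan.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class TunnelConfig:
    name: str
    host: str
    remote_port: int
    local_port: int
    ssh_user: str
    ssh_key: str
    actor: Optional[str] = None
    cert_command: Optional[str] = None  # None means static key mode


def parse_tunnel(name: str, raw: Dict[str, Any]) -> TunnelConfig:
    """Build a TunnelConfig from one entry under `tunnels:`."""
    return TunnelConfig(
        name=name,
        host=raw["host"],
        remote_port=int(raw["remote_port"]),
        local_port=int(raw["local_port"]),
        ssh_user=raw["ssh_user"],
        ssh_key=raw["ssh_key"],
        actor=raw.get("actor"),
        # Passed through verbatim: the shell string is never inspected.
        cert_command=raw.get("cert_command"),
    )
```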
T3 — Cert acquisition in manager
id: BRIDGE-WP-0004-T3
state_hub_task_id: b93be1e4-dd32-4e9c-a085-c5bf81108d97
status: done
priority: high
- manager.py: extract cert acquisition into _acquire_cert(cfg) -> Optional[Path]
  - If cfg.cert_command is None: return None (static key mode)
  - Run cert_command via subprocess.run(shell=True, capture_output=True)
  - Write stdout to ~/.local/state/bridge/<tunnel>-cert.pub (overwrite each time)
  - Return path; on non-zero exit code: raise CertAcquisitionError with stderr
- build_ssh_command: accept optional cert_path; when set, insert -i <cert_path> after -i <key_path> (OpenSSH loads both automatically)
- Call _acquire_cert at the top of each reconnect iteration (not once at startup) so every reconnect gets a fresh cert
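The acquisition steps listed for T3 can be sketched as follows. This is an approximation under stated assumptions, not the code in manager.py: the function takes the command, tunnel name, and state directory as plain arguments instead of a cfg object, and CertAcquisitionError is redefined locally so the example runs on its own.

```python
import subprocess
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Stand-in for the project's CertAcquisitionError."""


def acquire_cert(
    cert_command: Optional[str], tunnel: str, state_dir: Path
) -> Optional[Path]:
    """Run cert_command and store its stdout as <tunnel>-cert.pub in state_dir."""
    if cert_command is None:
        return None  # static key mode: nothing to acquire
    result = subprocess.run(
        cert_command, shell=True, capture_output=True, text=True
    )
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    state_dir.mkdir(parents=True, exist_ok=True)
    cert_path = state_dir / f"{tunnel}-cert.pub"
    cert_path.write_text(result.stdout)  # overwritten on every acquisition
    return cert_path
```

Calling this at the top of each reconnect iteration, as the last bullet requires, means a stale cert file is never reused across restarts.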
T4 — cert_identity in audit log
id: BRIDGE-WP-0004-T4
state_hub_task_id: bc29cc2a-1d77-48d8-97d3-54a49de0550e
status: done
priority: high
- manager.py: after cert acquisition, parse ssh-keygen -L -f <cert> output to extract Key ID (the -I value from signing time)
- Add cert_identity: Optional[str] to AuditLogger.log() signature; include in JSON entry when present
- Log cert_identity in BRIDGE_CONNECTED and BRIDGE_STARTED events
- AuditEvent: no new events needed; cert_identity is metadata on existing events
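A sketch of the Key ID extraction and of an audit entry that carries cert_identity only when a cert was used. It assumes ssh-keygen -L prints a line of the form Key ID: "<id>". The helper names and every JSON field other than cert_identity are hypothetical, since the real AuditLogger schema is not shown in this workplan.

```python
import json
import re
from datetime import datetime, timezone
from typing import Optional

# Quotes around the ID are optional in the pattern to tolerate format variations.
_KEY_ID_RE = re.compile(r'^\s*Key ID:\s*"?(.*?)"?\s*$', re.MULTILINE)


def parse_cert_identity(keygen_output: str) -> Optional[str]:
    """Extract the certificate Key ID from ssh-keygen -L output."""
    match = _KEY_ID_RE.search(keygen_output)
    return match.group(1) if match else None


def audit_entry(event: str, tunnel: str, cert_identity: Optional[str] = None) -> str:
    """Serialize one audit line; cert_identity is omitted in static key mode."""
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),  # hypothetical field name
        "event": event,
        "tunnel": tunnel,
    }
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return json.dumps(entry)
```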
T5 — TTL-aware cert refresh
id: BRIDGE-WP-0004-T5
state_hub_task_id: cc3aee49-7821-4a11-a331-be562aa88d91
status: done
priority: high
- manager.py: after successful cert acquisition, parse the validity end timestamp (the Valid: line) from ssh-keygen -L output → cert_expires_at: datetime
- In the health-check/wait loop, check datetime.now(utc) >= cert_expires_at - timedelta(minutes=5) on each iteration
- When refresh is due: call proc.terminate(), break inner loop, let the outer reconnect loop restart naturally (T3 will re-acquire the cert at the top of the next iteration)
- Log a new AuditEvent.CERT_EXPIRING event when refresh is triggered (add to AuditEvent enum); include cert_identity and cert_expires_at in detail field
- If cert_command is absent, skip all TTL logic entirely
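The inner wait loop with the refresh check might look like the sketch below. The function names, the return values, and the injectable clock are assumptions for illustration; proc is anything with poll() and terminate(), such as a subprocess.Popen. Audit logging of CERT_EXPIRING is left to the caller.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

REFRESH_MARGIN = timedelta(minutes=5)


def cert_refresh_due(cert_expires_at: Optional[datetime], now: datetime) -> bool:
    """True once `now` is within REFRESH_MARGIN of expiry; never in static key mode."""
    if cert_expires_at is None:
        return False
    return now >= cert_expires_at - REFRESH_MARGIN


def wait_for_exit_or_refresh(
    proc,
    cert_expires_at: Optional[datetime],
    poll_interval: float = 1.0,
    clock: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
) -> str:
    """Block until the SSH process exits or the cert needs refreshing."""
    while proc.poll() is None:
        if cert_refresh_due(cert_expires_at, clock()):
            proc.terminate()  # the outer loop re-acquires the cert and relaunches
            return "cert_expiring"
        time.sleep(poll_interval)
    return "exited"
```

Returning a distinct value for the refresh case lets the outer loop skip the backoff and the attempt counter, which is the behaviour described for a pre-emptive restart.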
T6 — bridge cert-status command
id: BRIDGE-WP-0004-T6
state_hub_task_id: b10275fc-bfe2-49a9-a83e-dd0dec796efd
status: done
priority: medium
- cli.py: add cert-status [TUNNEL] subcommand
- For each tunnel (or the named one): read cert file from state dir if present, run ssh-keygen -L, display: identity, principals, valid-from, valid-until, time-to-expiry (or "static key / no cert" if absent)
- Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
- --json flag for machine-readable output
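The exit-code rule and a possible --json payload can be sketched as below. The dictionary keys and helper names are hypothetical, since the actual output shape is not specified in this workplan. Only the rule "exit 1 if any cert is expired" and the "static key / no cert" label are taken from the task list.

```python
import json
from datetime import datetime, timezone
from typing import Dict, List, Optional


def cert_status(
    tunnel: str,
    identity: Optional[str],
    expires_at: Optional[datetime],
    now: datetime,
) -> Dict[str, object]:
    """Summarize one tunnel's cert; expires_at=None means no cert file was found."""
    if expires_at is None:
        return {"tunnel": tunnel, "mode": "static key / no cert", "expired": False}
    remaining = int((expires_at - now).total_seconds())
    return {
        "tunnel": tunnel,
        "mode": "cert",
        "identity": identity,
        "valid_until": expires_at.isoformat(),
        "seconds_to_expiry": remaining,
        "expired": remaining <= 0,
    }


def exit_code(statuses: List[Dict[str, object]]) -> int:
    """1 if any cert is expired, 0 otherwise."""
    return 1 if any(s["expired"] for s in statuses) else 0


def render_json(statuses: List[Dict[str, object]]) -> str:
    return json.dumps(statuses, indent=2)
```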
T7 — CertAcquisitionError handling
id: BRIDGE-WP-0004-T7
state_hub_task_id: de355a7c-f07e-452e-974f-4ddf362b24a6
status: done
priority: high
- New exception CertAcquisitionError in models.py
- In _run_loop: catch CertAcquisitionError, log AuditEvent.BRIDGE_DISCONNECTED with detail="cert acquisition failed: <stderr>", apply normal backoff and retry (cert failures are treated as transient, e.g. Vault briefly unreachable)
- After max_attempts consecutive cert failures, transition to FAILED state
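A reduced sketch of the retry behaviour for cert failures. It is not the real _run_loop: the callables are injected to keep the example self-contained, the exponential backoff with a cap is an assumed policy (the workplan only says "normal backoff"), and backoff for SSH failures is omitted entirely.

```python
import time
from typing import Callable, Optional


class CertAcquisitionError(RuntimeError):
    """Stand-in for the project's CertAcquisitionError."""


def run_loop(
    acquire_cert: Callable[[], Optional[str]],
    run_ssh: Callable[[Optional[str]], bool],
    log: Callable[[str, str], None],
    max_attempts: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Return "STOPPED" on a clean stop, "FAILED" after max_attempts cert failures."""
    failures = 0
    while failures < max_attempts:
        try:
            cert_path = acquire_cert()
        except CertAcquisitionError as exc:
            failures += 1
            log("BRIDGE_DISCONNECTED", f"cert acquisition failed: {exc}")
            # Assumed policy: exponential backoff capped at max_delay.
            sleep(min(base_delay * 2 ** (failures - 1), max_delay))
            continue
        failures = 0  # only consecutive failures count
        if run_ssh(cert_path):  # hypothetical: True means a stop was requested
            return "STOPPED"
    return "FAILED"
```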
T8 — SCOPE.md and documentation updates
id: BRIDGE-WP-0004-T8
state_hub_task_id: 40f5364b-f9e1-41cb-90e5-2b19511108f1
status: done
priority: medium
- Update SCOPE.md: Current State updated to reflect completion; directive alignment done
- wiki/OpsBridgeFrs.md §5.7 already covers actor attribution abstractly; no changes needed
- .claude/rules/architecture.md already documents cert_command mode and actor vocab
- Update wiki/OpsBridgePrd.md: note directive alignment, ops-warden dependency (deferred)
T9 — Tests
id: BRIDGE-WP-0004-T9
state_hub_task_id: fc1d1321-c1d0-4a0a-ae2e-d9ec9939dd6a
status: done
priority: high
- test_config.py: actor name prefix validation (adm/agt/atm); legacy class mapping; cert_command parse
- test_manager.py: mock cert_command subprocess; verify cert path appended to SSH args; verify CertAcquisitionError on non-zero exit; TTL logic helpers
- test_audit.py: cert_identity field; actor_type rename
- test_cli.py: cert-status exit codes; JSON output shape
- 233 tests, 0 failures
Config Schema — Before / After
Before
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: ops-agent
        ssh_key: ~/.ssh/id_ed25519
        actor: automation-agent
    actors:
      automation-agent:
        class: automation
        description: "state hub bridge agent"
After (static key mode — unchanged behavior)
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: agt-state-hub-bridge
        ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
        actor: agt-state-hub-bridge
    actors:
      agt-state-hub-bridge:
        class: agt
        description: "state hub bridge agent"
After (cert_command mode — ops-warden or any CA)
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: agt-state-hub-bridge
        ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
        actor: agt-state-hub-bridge
        cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
    actors:
      agt-state-hub-bridge:
        class: agt
        description: "state hub bridge agent"
Acceptance Criteria
- Existing tunnels.yaml with class: automation loads without error (deprecation warning only); tunnel behaves identically
- New config with class: agt and actor name not prefixed agt- raises ConfigError
- Config with cert_command set: SSH process launched with both -i key and -i cert; cert_identity present in BRIDGE_CONNECTED audit event
- Config without cert_command: no cert file written; cert_identity absent in audit; no TTL logic runs
- cert_command exits non-zero: tunnel enters backoff/retry, BRIDGE_DISCONNECTED logged with stderr detail; eventually reaches FAILED after max_attempts
- Cert within 5 min of expiry: SSH restarted with fresh cert; CERT_EXPIRING logged
- bridge cert-status shows valid cert info; exits 1 on expired cert
- All tests pass: uv run pytest (233 passed)
- All lints pass: uv run ruff check .