| id | type | title | domain | repo | status | owner | topic_slug | created | updated |
|---|---|---|---|---|---|---|---|---|---|
| BRIDGE-WP-0004 | workplan | AccessManagementDirective Alignment | custodian | ops-bridge | draft | Bernd | custodian | 2026-03-28 | 2026-03-28 |
BRIDGE-WP-0004 — AccessManagementDirective Alignment
Scope: Align ops-bridge with wiki/AccessManagementDirective.md — three-actor model,
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
preserving full backward compatibility with the existing static-key mode.
Out of scope: CA/signing logic itself (lives in ops-warden), host-side principal
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
Goal
After this workplan:
- ops-bridge works unchanged for anyone using plain, non-expiring SSH keys.
- ops-bridge works with CA-signed short-lived certs via ops-warden (or any compatible cert_command); cert acquisition, cert rotation, and cert identity logging are all handled transparently by the tunnel manager.
- Actor attribution is expressed in the three-actor vocabulary (adm | agt | atm) from the directive, with config validation that enforces naming conventions.
- The audit log carries cert_identity when a cert was used, satisfying the directive's §5 SIEM traceability requirement.
Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | wiki/AccessManagementDirective.md |
| WARDEN-WP-0001 | workplans/WARDEN-WP-0001-initial-implementation.md |
| PRD | wiki/OpsBridgePrd.md |
| FRS | wiki/OpsBridgeFrs.md |
Design Decisions
Static key mode stays first-class
If cert_command is absent from a tunnel config, ops-bridge behaves exactly as today:
ssh_key is passed directly to ssh -i. No deprecation, no warnings. Static keys are
explicitly supported for:
- Lab/dev environments without a CA
- Tunnels owned by adm-class humans who manage their own cert refresh externally
- Environments below the directive's complexity threshold
cert_command interface
    # tunnels.yaml: optional cert_command field
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: agt-state-hub-bridge
        ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519  # private key (always required)
        actor: agt-state-hub-bridge
        cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
When cert_command is present, manager.py runs it before every SSH subprocess launch,
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
-i <cert_path> alongside -i <key_path> to the SSH command. The cert file is cleaned up
on tunnel stop.
cert_command is a raw shell string, intentionally. The caller decides whether it invokes
warden, vault write, ssh-keygen -s, or any other tool. This keeps the interface
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
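As a rough illustration of the flag ordering described above, the sketch below builds only the identity portion of the SSH argument list. The helper name identity_args and the example paths are hypothetical; the rest of the command (forwarding flags, options) is omitted because this workplan does not specify it.

```python
from pathlib import Path
from typing import List, Optional


def identity_args(key_path: Path, cert_path: Optional[Path] = None) -> List[str]:
    """Return the -i flags for ssh: the private key first, then the cert if any."""
    args = ["-i", str(key_path)]
    if cert_path is not None:
        # cert_command mode: pass the freshly written cert alongside the key.
        args += ["-i", str(cert_path)]
    return args


# Static key mode: only the private key is passed.
static = identity_args(Path("/home/op/.ssh/agt-state-hub-bridge_ed25519"))

# cert_command mode: the key plus the cert written to the state dir.
with_cert = identity_args(
    Path("/home/op/.ssh/agt-state-hub-bridge_ed25519"),
    Path("/home/op/.local/state/bridge/state-hub-coulombcore-cert.pub"),
)
```

Keeping the cert optional means the static-key call path produces exactly the same arguments as before.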
TTL-aware cert refresh
After acquiring a cert, manager.py parses Valid before: via ssh-keygen -L to
determine cert_expires_at. It schedules a pre-emptive cert refresh
(cert_expires_at - 5 min) inside the health-check/wait loop. When the refresh timer
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
failure, no reconnect backoff triggered.
If cert_command is absent, no TTL logic runs.
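A minimal sketch of the expiry parsing, assuming ssh-keygen -L prints the validity window on a single "Valid:" line in one of the forms "from <ts> to <ts>", "before <ts>", "after <ts>", or "forever", with timestamps in local time. The function name parse_cert_expiry is hypothetical, and the exact output format should be verified against the installed OpenSSH version.

```python
import re
from datetime import datetime, timezone
from typing import Optional

# Timestamp shape assumed for the "Valid:" line, e.g. 2026-03-28T11:00:00.
_TS = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}"
_VALID_RE = re.compile(r"^\s*Valid:\s*(?P<body>.+)$", re.MULTILINE)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the cert's valid-before time in UTC, or None if it never expires."""
    match = _VALID_RE.search(keygen_output)
    if match is None:
        return None
    body = match.group("body").strip()
    # The end of the window follows "to" (from/to form) or a leading "before".
    end = re.search(rf"(?:\bto|^before)\s+({_TS})", body)
    if end is None:
        return None  # "forever" or "after <ts>": no upper bound
    naive = datetime.strptime(end.group(1), "%Y-%m-%dT%H:%M:%S")
    # A naive datetime is interpreted as local time by astimezone().
    return naive.astimezone(timezone.utc)
```

Returning None for certs without an upper bound lets the caller treat them like static keys and skip the refresh timer.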
Actor type model
    actor_class: str  # "human" | "automation"

is replaced by:

    from enum import Enum

    class ActorType(str, Enum):
        ADM = "adm"  # human operator
        AGT = "agt"  # LLM-powered autonomous agent
        ATM = "atm"  # deterministic script / pipeline
Backward-compat mapping at config load time: "human" → adm, "automation" → atm.
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
canonical values.
Config validation: if actor name is set, it must start with the prefix matching its type
(adm-*, agt-*, atm-*). Hard error, not a warning — the directive requires this for
SIEM auditability.
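The mapping and prefix rule could be sketched as follows. The function names parse_actor_type and validate_actor_name are hypothetical, and ConfigError is defined locally as a stand-in for whatever exception class config.py actually uses.

```python
import warnings
from enum import Enum


class ConfigError(ValueError):
    """Stand-in for the project's config exception (assumed name)."""


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


# One-way migration aid for legacy class values.
_LEGACY = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    """Map a config value to ActorType, warning on legacy values."""
    if raw in _LEGACY:
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {_LEGACY[raw].value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return _LEGACY[raw]
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor class {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Hard error when the actor name lacks the prefix for its type."""
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
```

Raising rather than warning on a prefix mismatch mirrors the hard-error requirement stated above.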
Tasks
T1 — ActorType enum
- models.py: replace actor_class: str in ActorInfo with actor_type: ActorType
- config.py: accept legacy "human" → ActorType.ADM and "automation" → ActorType.ATM with a DeprecationWarning; reject unknown values
- config.py: enforce actor name prefix: adm-* for ADM, agt-* for AGT, atm-* for ATM; raise ConfigError on mismatch
- Update manager.py / audit.py call sites: actor_class → actor_type.value
- Update tests
T2 — cert_command config field
- models.py: add cert_command: Optional[str] = None to TunnelConfig
- config.py: parse cert_command from tunnel YAML; no validation of the string content (shell-level freedom intentional)
- Document in config example / SCOPE.md
T3 — Cert acquisition in manager
- manager.py: extract cert acquisition into _acquire_cert(cfg) -> Optional[Path]
  - If cfg.cert_command is None: return None (static key mode)
  - Run cert_command via subprocess.run(shell=True, capture_output=True)
  - Write stdout to ~/.local/state/bridge/<tunnel>-cert.pub (overwrite each time)
  - Return path; on non-zero exit code: raise CertAcquisitionError with stderr
- build_ssh_command: accept optional cert_path; when set, insert -i <cert_path> after -i <key_path> (OpenSSH loads both automatically)
- Call _acquire_cert at the top of each reconnect iteration (not once at startup) so every reconnect gets a fresh cert
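The acquisition steps listed in T3 might look roughly like this. It is a sketch under assumptions: TunnelConfig here is a two-field stand-in for the real model, state_dir is passed explicitly instead of being derived from ~/.local/state/bridge, and restricting the file mode to 0600 is an addition not stated in the task list.

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command exits non-zero."""


@dataclass
class TunnelConfig:
    # Minimal stand-in; the real TunnelConfig carries more fields.
    name: str
    cert_command: Optional[str] = None


def acquire_cert(cfg: TunnelConfig, state_dir: Path) -> Optional[Path]:
    """Run cert_command and write its stdout to <state_dir>/<tunnel>-cert.pub."""
    if cfg.cert_command is None:
        return None  # static key mode: nothing to acquire
    result = subprocess.run(
        cfg.cert_command, shell=True, capture_output=True, text=True
    )
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    state_dir.mkdir(parents=True, exist_ok=True)
    cert_path = state_dir / f"{cfg.name}-cert.pub"
    cert_path.write_text(result.stdout)  # overwritten on every call
    cert_path.chmod(0o600)  # assumption: keep the cert private to the user
    return cert_path
```

Because the command runs through the shell, the string in tunnels.yaml is executed as written; that is the intended freedom, but it also means the config file must be trusted.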
T4 — cert_identity in audit log
- manager.py: after cert acquisition, parse ssh-keygen -L -f <cert> output to extract Key ID (the -I value from signing time)
- Add cert_identity: Optional[str] to AuditLogger.log() signature; include in JSON entry when present
- Log cert_identity in BRIDGE_CONNECTED and BRIDGE_STARTED events
- AuditEvent: no new events needed; cert_identity is metadata on existing events
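One way to sketch the Key ID extraction and the optional audit field, assuming ssh-keygen -L prints the identity as a quoted value on a "Key ID:" line. The function names and the JSON field names other than cert_identity are illustrative, not the actual AuditLogger schema.

```python
import json
import re
from typing import Optional

# Assumed line shape in ssh-keygen -L output:  Key ID: "agt-state-hub-bridge"
_KEY_ID_RE = re.compile(r'^\s*Key ID:\s*"(?P<key_id>.*)"\s*$', re.MULTILINE)


def parse_cert_identity(keygen_output: str) -> Optional[str]:
    """Return the certificate's Key ID, or None if no such line is found."""
    match = _KEY_ID_RE.search(keygen_output)
    return match.group("key_id") if match else None


def audit_entry(
    event: str, tunnel: str, actor: str, cert_identity: Optional[str] = None
) -> str:
    """Build a JSON audit line; cert_identity appears only when a cert was used."""
    entry = {"event": event, "tunnel": tunnel, "actor": actor}
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return json.dumps(entry)
```

Omitting the key entirely in static-key mode, rather than writing null, keeps existing log consumers unaffected.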
T5 — TTL-aware cert refresh
- manager.py: after successful cert acquisition, parse Valid before: timestamp from ssh-keygen -L output → cert_expires_at: datetime
- In the health-check/wait loop, check datetime.now(utc) >= cert_expires_at - timedelta(minutes=5) on each iteration
- When refresh is due: call proc.terminate(), break inner loop, let the outer reconnect loop restart naturally (T3 will re-acquire the cert at the top of the next iteration)
- Log a new AuditEvent.CERT_EXPIRING event when refresh is triggered (add to AuditEvent enum); include cert_identity and cert_expires_at in detail field
- If cert_command is absent, skip all TTL logic entirely
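The inner wait loop from T5 could be sketched as below. The function name, the ProcessLike protocol, the return values, and the 5-second polling interval are assumptions made for illustration; the clock and sleep functions are injectable so the loop can be tested without waiting.

```python
import time
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional, Protocol


class ProcessLike(Protocol):
    """The two subprocess.Popen methods this loop relies on."""

    def poll(self) -> Optional[int]: ...
    def terminate(self) -> None: ...


def wait_until_exit_or_refresh(
    proc: ProcessLike,
    cert_expires_at: Optional[datetime],
    *,
    now: Callable[[], datetime] = lambda: datetime.now(timezone.utc),
    sleep: Callable[[float], None] = time.sleep,
    interval: float = 5.0,
    margin: timedelta = timedelta(minutes=5),
) -> str:
    """Return "exited" if ssh died, or "refresh" after terminating it early."""
    while True:
        if proc.poll() is not None:
            return "exited"
        # Static key mode passes cert_expires_at=None, so no TTL logic runs.
        if cert_expires_at is not None and now() >= cert_expires_at - margin:
            proc.terminate()
            return "refresh"  # caller logs CERT_EXPIRING and reconnects
        sleep(interval)
```

Returning a reason instead of raising lets the outer reconnect loop skip backoff for a planned refresh while still applying it to unexpected exits.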
T6 — bridge cert-status command
- cli.py: add cert-status [TUNNEL] subcommand
- For each tunnel (or the named one): read cert file from state dir if present, run ssh-keygen -L, display: identity, principals, valid-from, valid-until, time-to-expiry (or "static key / no cert" if absent)
- Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
- --json flag for machine-readable output
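A possible shape for the status records and exit-code rule, shown without the CLI wiring. The function names and all JSON field names are hypothetical, and principals and valid-from are left out for brevity.

```python
import json
from datetime import datetime
from typing import Dict, List, Optional, Tuple


def cert_status(
    tunnel: str,
    identity: Optional[str],
    valid_until: Optional[datetime],
    now: datetime,
) -> Dict[str, object]:
    """Describe one tunnel; tunnels without a cert report static-key mode."""
    if valid_until is None:
        return {"tunnel": tunnel, "mode": "static key / no cert"}
    remaining = int((valid_until - now).total_seconds())
    return {
        "tunnel": tunnel,
        "identity": identity,
        "valid_until": valid_until.isoformat(),
        "seconds_to_expiry": remaining,
        "expired": remaining <= 0,
    }


def render_json(statuses: List[Dict[str, object]]) -> Tuple[str, int]:
    """Return (--json output, exit code): 1 if any cert is expired, else 0."""
    exit_code = 1 if any(s.get("expired") for s in statuses) else 0
    return json.dumps(statuses, indent=2), exit_code
```

Static-key tunnels never contribute to a non-zero exit code, so mixed configurations stay scriptable.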
T7 — CertAcquisitionError handling
- New exception CertAcquisitionError in models.py
- In _run_loop: catch CertAcquisitionError, log AuditEvent.BRIDGE_DISCONNECTED with detail="cert acquisition failed: <stderr>", apply normal backoff and retry (cert failures are transient, e.g. Vault briefly unreachable)
- After max_attempts consecutive cert failures, transition to FAILED state
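The retry behaviour in T7 can be sketched in isolation from the tunnel loop. The function name acquire_or_fail, the log callback signature, and the exponential backoff shape are assumptions; the workplan only says "normal backoff" without defining it.

```python
import time
from typing import Callable, Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command exits non-zero."""


def acquire_or_fail(
    acquire: Callable[[], str],
    max_attempts: int,
    log: Callable[[str, str], None],
    sleep: Callable[[float], None] = time.sleep,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Return the cert path, or None after max_attempts consecutive failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return acquire()
        except CertAcquisitionError as exc:
            log("BRIDGE_DISCONNECTED", f"cert acquisition failed: {exc}")
            if attempt < max_attempts:
                # Assumed backoff: base_delay doubled on each failed attempt.
                sleep(base_delay * 2 ** (attempt - 1))
    return None  # caller transitions the tunnel to FAILED
```

A success on any attempt returns immediately, so only an unbroken run of failures reaches the FAILED transition.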
T8 — SCOPE.md and documentation updates
- Update SCOPE.md: replace "Identity/credential management (uses existing SSH keys)" with the pluggable cert_command model; add ops-warden as related repo; update actor terminology to adm/agt/atm; update Current State
- Update wiki/OpsBridgeFrs.md §5.7 (actor attribution): note three-actor model, cert_identity field, cert_command interface
- Update wiki/OpsBridgePrd.md: note directive alignment, ops-warden dependency
- Update config example in README / wiki/ to show both static and cert_command modes
- Update .claude/rules/architecture.md: add cert lifecycle to architecture description
T9 — Tests
- test_config.py: actor name prefix validation (adm/agt/atm); legacy class mapping; cert_command parse
- test_manager.py: mock cert_command subprocess; verify cert path appended to SSH args; verify CertAcquisitionError on non-zero exit
- test_manager.py: TTL logic: mock cert_expires_at in past; verify refresh triggers
- test_audit.py: cert_identity field present in CONNECTED event when cert was used; absent in static-key mode
- test_cli.py: cert-status exit codes; JSON output shape
Config Schema — Before / After
Before
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: ops-agent
        ssh_key: ~/.ssh/id_ed25519
        actor: automation-agent
    actors:
      automation-agent:
        class: automation
        description: "state hub bridge agent"
After (static key mode — unchanged behavior)
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: agt-state-hub-bridge
        ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
        actor: agt-state-hub-bridge
    actors:
      agt-state-hub-bridge:
        class: agt
        description: "state hub bridge agent"
After (cert_command mode — ops-warden or any CA)
    tunnels:
      state-hub-coulombcore:
        host: coulombcore
        remote_port: 8001
        local_port: 8000
        ssh_user: agt-state-hub-bridge
        ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
        actor: agt-state-hub-bridge
        cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
    actors:
      agt-state-hub-bridge:
        class: agt
        description: "state hub bridge agent"
Acceptance Criteria
- Existing tunnels.yaml with class: automation loads without error (deprecation warning only); tunnel behaves identically
- New config with class: agt and actor name not prefixed agt- raises ConfigError
- Config with cert_command set: SSH process launched with both -i key and -i cert; cert_identity present in BRIDGE_CONNECTED audit event
- Config without cert_command: no cert file written; cert_identity absent in audit; no TTL logic runs
- cert_command exits non-zero: tunnel enters backoff/retry, BRIDGE_DISCONNECTED logged with stderr detail; eventually reaches FAILED after max_attempts
- Cert within 5 min of expiry: SSH restarted with fresh cert; CERT_EXPIRING logged
- bridge cert-status shows valid cert info; exits 1 on expired cert
- All tests pass: uv run pytest
- All lints pass: uv run ruff check .