Files
ops-bridge/workplans/BRIDGE-WP-0004-directive-alignment.md
2026-04-25 17:06:05 +02:00

274 lines
11 KiB
Markdown

---
id: BRIDGE-WP-0004
type: workplan
title: "AccessManagementDirective Alignment"
domain: custodian
repo: ops-bridge
status: draft
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
state_hub_workstream_id: "e3451b70-688e-4e19-bff5-0c82c0f009a7"
---
# BRIDGE-WP-0004 — AccessManagementDirective Alignment
**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
preserving full backward compatibility with the existing static-key mode.
**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
---
## Goal
After this workplan:
1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
`cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
handled transparently by the tunnel manager.
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
the directive, with config validation that enforces naming conventions.
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
§5 SIEM traceability requirement.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
---
## Design Decisions
### Static key mode stays first-class
If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
explicitly supported for:
- Lab/dev environments without a CA
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
- Environments below the directive's complexity threshold
### cert_command interface
```yaml
# tunnels.yaml — optional cert_command field
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: agt-state-hub-bridge
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519 # private key (always required)
actor: agt-state-hub-bridge
cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
on tunnel stop.
`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
### TTL-aware cert refresh
After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to
determine `cert_expires_at`. It schedules a pre-emptive cert refresh
(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
failure, no reconnect backoff triggered.
If `cert_command` is absent, no TTL logic runs.
### Actor type model
`actor_class: str # "human" | "automation"` is replaced by:
```python
class ActorType(str, Enum):
ADM = "adm" # human operator
AGT = "agt" # LLM-powered autonomous agent
ATM = "atm" # deterministic script / pipeline
```
Backward-compat mapping at config load time: `"human"``adm`, `"automation"``atm`.
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
canonical values.
Config validation: if `actor` name is set, it must start with the prefix matching its type
(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
SIEM auditability.
---
## Tasks
### T1 — ActorType enum
- [ ] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
- [ ] `config.py`: accept legacy `"human"``ActorType.ADM` and `"automation"`
`ActorType.ATM` with a `DeprecationWarning`; reject unknown values
- [ ] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
`atm-*` for ATM; raise `ConfigError` on mismatch
- [ ] Update `manager.py` / `audit.py` call sites: `actor_class``actor_type.value`
- [ ] Update tests
### T2 — cert_command config field
- [ ] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
- [ ] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
content (shell-level freedom intentional)
- [ ] Document in config example / SCOPE.md
### T3 — Cert acquisition in manager
- [ ] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
- If `cfg.cert_command` is None: return None (static key mode)
- Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
- Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
- Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
- [ ] `build_ssh_command`: accept optional `cert_path`; when set, insert
`-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
- [ ] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
so every reconnect gets a fresh cert
### T4 — cert_identity in audit log
- [ ] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
extract `Key ID` (the `-I` value from signing time)
- [ ] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
JSON entry when present
- [ ] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
- [ ] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
### T5 — TTL-aware cert refresh
- [ ] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp
from `ssh-keygen -L` output → `cert_expires_at: datetime`
- [ ] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
on each iteration
- [ ] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
next iteration)
- [ ] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
`AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
- [ ] If `cert_command` is absent, skip all TTL logic entirely
### T6 — `bridge cert-status` command
- [ ] `cli.py`: add `cert-status [TUNNEL]` subcommand
- [ ] For each tunnel (or the named one): read cert file from state dir if present,
run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
time-to-expiry (or "static key / no cert" if absent)
- [ ] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
- [ ] `--json` flag for machine-readable output
### T7 — CertAcquisitionError handling
- [ ] New exception `CertAcquisitionError` in `models.py`
- [ ] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
(cert failures are transient — e.g., Vault briefly unreachable)
- [ ] After `max_attempts` consecutive cert failures, transition to `FAILED` state
### T8 — SCOPE.md and documentation updates
- [ ] Update `SCOPE.md`: replace "Identity/credential management (uses existing SSH keys)"
with the pluggable cert_command model; add ops-warden as related repo; update
actor terminology to adm/agt/atm; update Current State
- [ ] Update `wiki/OpsBridgeFrs.md` §5.7 (actor attribution): note three-actor model,
cert_identity field, cert_command interface
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency
- [ ] Update config example in README / `wiki/` to show both static and cert_command modes
- [ ] Update `.claude/rules/architecture.md`: add cert lifecycle to architecture description
### T9 — Tests
- [ ] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
cert_command parse
- [ ] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
args; verify `CertAcquisitionError` on non-zero exit
- [ ] `test_manager.py`: TTL logic — mock `cert_expires_at` in past; verify refresh triggers
- [ ] `test_audit.py`: `cert_identity` field present in CONNECTED event when cert was used;
absent in static-key mode
- [ ] `test_cli.py`: `cert-status` exit codes; JSON output shape
---
## Config Schema — Before / After
### Before
```yaml
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: ops-agent
ssh_key: ~/.ssh/id_ed25519
actor: automation-agent
actors:
automation-agent:
class: automation
description: "state hub bridge agent"
```
### After (static key mode — unchanged behavior)
```yaml
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: agt-state-hub-bridge
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
actor: agt-state-hub-bridge
actors:
agt-state-hub-bridge:
class: agt
description: "state hub bridge agent"
```
### After (cert_command mode — ops-warden or any CA)
```yaml
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: agt-state-hub-bridge
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
actor: agt-state-hub-bridge
cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
actors:
agt-state-hub-bridge:
class: agt
description: "state hub bridge agent"
```
---
## Acceptance Criteria
- [ ] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
warning only); tunnel behaves identically
- [ ] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
- [ ] Config with `cert_command` set: SSH process launched with both `-i key` and
`-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
- [ ] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
no TTL logic runs
- [ ] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
- [ ] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
- [ ] `bridge cert-status` shows valid cert info; exits 1 on expired cert
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`