---
id: WARDEN-WP-0002
type: workplan
title: "OpsWarden Correctness and Operational Completeness"
domain: custodian
repo: ops-warden
status: active
owner: Bernd
topic_slug: custodian
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "5a9fba2c-6161-49a4-a231-e750fa4ab572"
---

# WARDEN-WP-0002: Correctness and Operational Completeness

**Scope:** Fix three functional gaps identified after WARDEN-WP-0001: TTL max enforcement (directive compliance), stale cert cleanup (promised in SCOPE.md), and an outgoing signatures log (audit traceability for every signing operation).

**Out of scope:** Test coverage improvements (WARDEN-WP-0003), Vault cluster setup, host-side principal deployment.

---

## Goal

After this workplan:

1. `warden sign` and `warden issue` reject TTLs that exceed the type maximum defined in the AccessManagementDirective, so no cert can be silently issued with a longer-than-allowed validity window.
2. Stale or expired certs do not accumulate in the state dir. `warden cleanup` provides an on-demand sweep; `LocalCA.sign()` and `VaultCA.sign()` auto-evict the previous cert for the same actor before writing the new one.
3. Every successful signing operation is recorded in an append-only `signatures.log` in the state dir. `warden log` provides a human-readable and a machine-readable view of the signing history.

---

## Reference Documents

| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| SCOPE.md | `SCOPE.md` |

---

## Design Decisions

### TTL enforcement: reject, don't clamp

When `spec.ttl_hours > DEFAULT_TTL_HOURS[actor_type]`, raise `CAError` rather than silently clamping. A silent clamp would mask configuration errors and hide directive violations from operators. An explicit error forces a deliberate decision.

The check lives in `CABackend.sign()` before the subprocess call so it applies to both `LocalCA` and `VaultCA`. Vault's own role `max_ttl` provides a second layer; this check is the warden-side gate.

### Cleanup: proactive (on sign) + reactive (on demand)

`LocalCA.sign()` removes the previous cert for the same actor before writing the new one, which keeps `state_dir` from growing without bound under normal operation. `warden cleanup` handles the edge cases: certs whose actor is no longer in the inventory, certs from aborted sessions, and certs left behind by actors that were renamed.

`VaultCA.sign()` also evicts before writing (same logic, same helper function).

### Signatures log: JSONL, append-only, in state_dir

One line per signing event, written after a successful `CertRecord` is produced. Format: `{"timestamp": ..., "actor": ..., "actor_type": ..., "identity": ..., "principals": [...], "ttl_hours": ..., "valid_before": ..., "cert_path": ..., "backend": ...}`.

The log lives alongside the certs in `state_dir`, so a single directory backup captures the full operational history. Rotation is not part of this scope; add it in a follow-up if the file grows beyond a few MB in practice.

`warden log` is read-only. The CLI offers no way to delete entries, because the log is an audit artefact.
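The append-only JSONL design described above can be illustrated with a minimal sketch. The function names `append_signature_log` and `read_signature_log` and their signatures are assumptions for illustration only; the workplan's actual helper is `_append_signature_log(record, spec, state_dir, backend)`, whose internals are not specified here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def append_signature_log(state_dir: Path, entry: dict) -> Path:
    """Append one signing event as a single JSON line (hypothetical helper)."""
    log_path = state_dir / "signatures.log"
    line = dict(entry)
    # ISO 8601 UTC timestamp, filled in only if the caller did not supply one.
    line.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    # Mode "a" only ever adds to the end of the file; existing lines stay untouched.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(line, sort_keys=True) + "\n")
    return log_path


def read_signature_log(
    state_dir: Path, actor: Optional[str] = None, last: int = 20
) -> list:
    """Return up to `last` most recent entries, optionally filtered by actor."""
    log_path = state_dir / "signatures.log"
    if not log_path.exists():
        return []  # an absent log reads as an empty history
    entries = [
        json.loads(raw)
        for raw in log_path.read_text(encoding="utf-8").splitlines()
        if raw.strip()
    ]
    if actor is not None:
        entries = [e for e in entries if e.get("actor") == actor]
    return entries[-last:] if last > 0 else []
```

Writing each event as one self-contained line means a reader never has to parse the whole file as a single JSON document, and a line cut short by a crash affects only that entry.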
---

## Tasks

### T1: TTL max enforcement per ActorType

```task
id: WARDEN-WP-0002-T1
state_hub_task_id: b0d0b5f7-a181-4590-be26-c48ae28cd964
status: todo
priority: high
```

- [ ] `models.py`: add `MAX_TTL_HOURS = DEFAULT_TTL_HOURS` alias (same values, explicit name signals policy intent); add helper `enforce_ttl(spec: CertSpec) -> None` that raises `CAError` when `spec.ttl_hours > MAX_TTL_HOURS[spec.actor_type]`
- [ ] `ca.py`: call `enforce_ttl(spec)` at the top of the base `CABackend.sign()` (or in both `LocalCA.sign()` and `VaultCA.sign()` if there is no shared base call)
- [ ] `scorecard.py`: add `check_ttl_policy(state_dir, inventory)`: parse each cert in `state_dir` via `ssh-keygen -L`, compare the duration of the cert's validity window against `MAX_TTL_HOURS[actor_type]`, and flag the cert if the maximum is exceeded
- [ ] Add `check_ttl_policy` to `run_scorecard()`
- [ ] Update tests: `test_ca.py` asserts `CAError` is raised when `ttl_hours` exceeds the max for each type, and asserts no error at exactly the max

### T2: Stale cert cleanup command

```task
id: WARDEN-WP-0002-T2
state_hub_task_id: aeeefbad-c0bd-4ae8-a3fe-9f72321b4caa
status: todo
priority: medium
```

- [ ] `ca.py`: extract `_evict_cert(actor_name, state_dir)`, which removes `state_dir/<actor_name>-cert.pub` if it exists; call it at the top of `LocalCA.sign()` and `VaultCA.sign()` before writing the new cert
- [ ] `cli.py`: add `warden cleanup [actor-name]` command
  - No actor-name: iterate `state_dir/*-cert.pub`, remove any whose `valid_before < now - 5 min`
  - With actor-name: remove only that actor's cert if stale
  - `--dry-run`: print what would be removed without deleting
  - Exit 0 always (cleanup is idempotent; nothing to clean is not an error)
- [ ] Update the detail message of the `check_no_stale_certs` scorecard check to suggest running `warden cleanup`
- [ ] Update tests: verify `_evict_cert` is called during sign; verify the cleanup command removes a stale file; verify `--dry-run` does not delete

### T3: Outgoing signatures log

```task
id: WARDEN-WP-0002-T3
state_hub_task_id: 0194d24f-a8fe-4f6d-88e6-addea3542c0e
status: todo
priority: medium
```

- [ ] `ca.py`: after a successful `CertRecord` is produced in `LocalCA.sign()` and `VaultCA.sign()`, call `_append_signature_log(record, spec, state_dir, backend)`, which appends a JSONL line to `state_dir/signatures.log`
  Fields: `timestamp` (ISO 8601 UTC), `actor`, `actor_type`, `identity`, `principals`, `ttl_hours`, `valid_before`, `cert_path`, `backend`
- [ ] `cli.py`: add `warden log [actor-name]` command
  - Reads `state_dir/signatures.log` (empty list if absent)
  - `--last N` (default 20): show last N entries
  - `--actor <actor-name>`: filter by actor
  - `--json`: output newline-delimited JSON; default: Rich table
  - Exit 0 always
- [ ] Update tests: verify a log entry is written after sign; verify no log entry is written on `CAError`; verify `warden log` filters correctly

---

## Acceptance Criteria

- [ ] `warden sign agt-test --pubkey /tmp/k.pub --ttl 100` fails with `CAError` (agt max is 24h)
- [ ] `warden sign agt-test --pubkey /tmp/k.pub --ttl 24` succeeds
- [ ] `warden scorecard` includes the TTL policy check and fails when a cert exceeds its type max
- [ ] After `warden sign`, `state_dir/signatures.log` has one new line, and that line is valid JSON
- [ ] `warden log` renders a table; `warden log --json` output is parseable
- [ ] `warden log --actor agt-test` returns only entries for that actor
- [ ] `warden cleanup --dry-run` lists stale certs without deleting
- [ ] `warden cleanup` removes stale certs; scorecard `no_stale_certs` passes afterwards
- [ ] Re-signing an actor replaces its cert file (no accumulation)
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`
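
Two of the rules these criteria exercise, the TTL boundary (reject above the type maximum, accept exactly at it) and the staleness test used by cleanup (`valid_before < now - 5 min`), might look roughly like the sketch below. `CAError` and `CertSpec` are reduced stand-ins for the project's own classes, the `MAX_TTL_HOURS` table contains only the `agt` value stated in this workplan, and `is_stale` is a hypothetical helper name.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Only the agt maximum (24h) is stated in this workplan; the real table
# in models.py is assumed to cover every ActorType.
MAX_TTL_HOURS = {"agt": 24}

# Grace period taken from the cleanup rule: valid_before < now - 5 min.
STALE_GRACE = timedelta(minutes=5)


class CAError(Exception):
    """Stand-in for the project's CAError."""


@dataclass
class CertSpec:
    """Reduced stand-in for the project's CertSpec; only the fields used here."""

    actor_type: str
    ttl_hours: int


def enforce_ttl(spec: CertSpec) -> None:
    """Reject (never clamp) a TTL above the maximum for the actor type."""
    max_hours = MAX_TTL_HOURS[spec.actor_type]
    if spec.ttl_hours > max_hours:
        raise CAError(
            f"ttl_hours={spec.ttl_hours} exceeds the "
            f"{spec.actor_type} maximum of {max_hours}h"
        )


def is_stale(valid_before: datetime, now: datetime) -> bool:
    """A cert is stale once it has been expired for more than the grace period."""
    return valid_before < now - STALE_GRACE


if __name__ == "__main__":
    enforce_ttl(CertSpec("agt", 24))  # exactly at the maximum: accepted
    try:
        enforce_ttl(CertSpec("agt", 100))
    except CAError as exc:
        print(f"rejected: {exc}")
    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(minutes=6), now))
```

Passing `now` in explicitly keeps the staleness check deterministic in tests, which matters for boundary cases around the five-minute grace period.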