ops-warden/workplans/WARDEN-WP-0002-correctness-and-completeness.md
tegwick 9857ed1424 feat(warden): implement WARDEN-WP-0002 correctness and operational completeness
T1 — TTL max enforcement:
  - models.py: MAX_TTL_HOURS policy constant
  - ca.py: _enforce_ttl() raises CAError when spec.ttl_hours > type max
  - Called at top of LocalCA.sign() and VaultCA.sign()
  - scorecard.py: check_ttl_policy() — flags certs with issued TTL > type max
  - run_scorecard() now returns 5 checks

T2 — Stale cert cleanup:
  - ca.py: _evict_cert() removes existing cert before writing new one (no accumulation)
  - cli.py: warden cleanup [actor] [--dry-run] command
  - check_no_stale_certs detail suggests 'warden cleanup' when stale certs found

T3 — Outgoing signatures log:
  - ca.py: _append_signature_log() writes JSONL to state_dir/signatures.log
  - Called after every successful sign() in LocalCA and VaultCA
  - cli.py: warden log [actor] [--last N] [--json] command
  - parse_cert_metadata now also returns valid_from (needed for TTL policy check)

61 tests passing, ruff clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:53:10 +02:00


---
id: WARDEN-WP-0002
type: workplan
title: "OpsWarden Correctness and Operational Completeness"
domain: custodian
repo: ops-warden
status: done
owner: Bernd
topic_slug: custodian
planning_priority: high
planning_order: 2
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "5a9fba2c-6161-49a4-a231-e750fa4ab572"
---
# WARDEN-WP-0002 — Correctness and Operational Completeness
**Scope:** Fix three functional gaps identified after WARDEN-WP-0001: TTL max
enforcement (directive compliance), stale cert cleanup (SCOPE.md promises it),
and an outgoing signatures log (audit traceability for every signing operation).

**Out of scope:** Test coverage improvements (WARDEN-WP-0003), Vault cluster
setup, host-side principal deployment.

---
## Goal
After this workplan:
1. `warden sign` and `warden issue` reject TTLs that exceed the type maximum
defined in the AccessManagementDirective — no cert can be silently issued
with a longer-than-allowed validity window.
2. Stale/expired certs do not accumulate in the state dir. `warden cleanup`
provides an on-demand sweep; `LocalCA.sign()` auto-evicts the previous cert
for the same actor before writing the new one.
3. Every successful signing operation is recorded in an append-only
`signatures.log` in the state dir. `warden log` provides a human-readable
and machine-readable view of the signing history.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| SCOPE.md | `SCOPE.md` |
---
## Design Decisions
### TTL enforcement: reject, don't clamp
When `spec.ttl_hours > MAX_TTL_HOURS[spec.actor_type]`, raise `CAError` rather
than silently clamping. A silent clamp would mask configuration errors and hide
directive violations from operators. An explicit error forces a deliberate
decision.

The check runs at the top of `sign()`, before the subprocess call, in both
`LocalCA` and `VaultCA`. Vault's own role `max_ttl` provides a second
layer; this check is the warden-side gate.
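
The reject-instead-of-clamp rule can be sketched as below. This is a minimal illustration, not the code in `ca.py` or `models.py`: the `ActorType` members, the `CertSpec` fields, and the `usr` maximum are placeholders invented for the example. Only the 24h maximum for `agt` comes from the acceptance criteria.

```python
from dataclasses import dataclass
from enum import Enum


class ActorType(str, Enum):
    # Hypothetical members; the real enum lives in models.py.
    AGENT = "agt"
    HUMAN = "usr"


# Placeholder policy table. Only the 24h agt maximum is stated in this workplan;
# the usr value is an assumption made for the example.
MAX_TTL_HOURS = {ActorType.AGENT: 24, ActorType.HUMAN: 8}


class CAError(Exception):
    """Raised when a signing request cannot be honoured."""


@dataclass
class CertSpec:
    # Reduced to the fields the check needs.
    actor_name: str
    actor_type: ActorType
    ttl_hours: int


def enforce_ttl(spec: CertSpec) -> None:
    """Reject (never clamp) a TTL above the maximum for the actor type."""
    max_ttl = MAX_TTL_HOURS[spec.actor_type]
    if spec.ttl_hours > max_ttl:
        raise CAError(
            f"ttl {spec.ttl_hours}h exceeds the {max_ttl}h maximum "
            f"for actor type {spec.actor_type.value!r}"
        )
```

The comparison is strict, so a request at exactly the maximum passes, which matches the `--ttl 24` acceptance criterion.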
### Cleanup: proactive (on sign) + reactive (on demand)
`LocalCA.sign()` removes the previous cert for the same actor before writing the
new one — this keeps state_dir from growing unboundedly under normal operation.
`warden cleanup` handles the edge cases: certs whose actor is no longer in the
inventory, certs from aborted sessions, certs left by actors that were renamed.
`VaultCA.sign()` also evicts before writing (same logic, same helper function).
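
A sketch of the shared eviction helper, assuming certs are stored as `state_dir/<actor_name>-cert.pub` as described in T2. The boolean return value is an addition for illustration; the real `_evict_cert` may return nothing.

```python
from pathlib import Path


def evict_cert(actor_name: str, state_dir: Path) -> bool:
    """Remove the actor's previous cert, if any.

    Returns True when a file was removed, False when there was nothing to evict.
    """
    cert_path = state_dir / f"{actor_name}-cert.pub"
    try:
        cert_path.unlink()
    except FileNotFoundError:
        # First signing for this actor, or the cert was already cleaned up.
        return False
    return True
```

Catching `FileNotFoundError` rather than testing `exists()` first avoids a race between the check and the delete.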
### Signatures log: JSONL, append-only, in state_dir
One line per signing event, written after a successful `CertRecord` is produced.
Format: `{"timestamp": ..., "actor": ..., "actor_type": ..., "identity": ...,
"principals": [...], "ttl_hours": ..., "valid_before": ..., "cert_path": ...,
"backend": ...}`.
The log lives alongside certs in `state_dir` so a single directory backup
captures the full operational history. No rotation at this scope — add rotation
in a follow-up if the file grows beyond a few MB in practice.
`warden log` is read-only. There is no deletion via the CLI because the log is
an audit artefact.

---
## Tasks
### T1 — TTL max enforcement per ActorType
```task
id: WARDEN-WP-0002-T1
state_hub_task_id: b0d0b5f7-a181-4590-be26-c48ae28cd964
status: done
priority: high
```
- [x] `models.py`: add `MAX_TTL_HOURS = DEFAULT_TTL_HOURS` alias (same values,
explicit name signals policy intent); add helper
`enforce_ttl(spec: CertSpec) -> None` that raises `CAError` when
`spec.ttl_hours > MAX_TTL_HOURS[spec.actor_type]`
- [x] `ca.py`: call `enforce_ttl(spec)` at the top of `CABackend.sign()` base
(or in both `LocalCA.sign()` and `VaultCA.sign()` if no shared base call)
- [x] `scorecard.py`: add `check_ttl_policy(state_dir, inventory)` — parse each
cert in state_dir via `ssh-keygen -L`; compare cert validity window
duration against `MAX_TTL_HOURS[actor_type]`; flag if exceeded
- [x] Add `check_ttl_policy` to `run_scorecard()`
- [x] Update tests: `test_ca.py` — assert `CAError` raised when `ttl_hours`
exceeds max for each type; assert no error at exactly the max
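
The scorecard comparison in T1 can be sketched as follows. The example works on the text that `ssh-keygen -L` prints (its `Valid: from ... to ...` line) rather than invoking the tool, and the function names are hypothetical. Treating an unbounded or unparseable window as a violation is an assumption, not something the workplan specifies.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches the validity line printed by `ssh-keygen -L`, e.g.
#   Valid: from 2026-05-15T10:00:00 to 2026-05-16T10:00:00
VALID_RE = re.compile(r"Valid:\s+from\s+(\S+)\s+to\s+(\S+)")


def validity_window(keygen_output: str) -> Optional[timedelta]:
    """Return the cert's validity duration, or None if no bounded window is found."""
    match = VALID_RE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever" or unexpected output
    valid_from, valid_before = (datetime.fromisoformat(g) for g in match.groups())
    return valid_before - valid_from


def exceeds_max_ttl(keygen_output: str, max_ttl_hours: int) -> bool:
    """True when the issued window is longer than the policy maximum."""
    window = validity_window(keygen_output)
    if window is None:
        return True  # assumption: unbounded or unparseable certs count as violations
    return window > timedelta(hours=max_ttl_hours)
```

If the signer backdates `valid_from` to absorb clock skew, the measured window will be slightly longer than the requested TTL, so a real check would likely need a small tolerance.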
### T2 — Stale cert cleanup command
```task
id: WARDEN-WP-0002-T2
state_hub_task_id: aeeefbad-c0bd-4ae8-a3fe-9f72321b4caa
status: done
priority: medium
```
- [x] `ca.py`: extract `_evict_cert(actor_name, state_dir)` — removes
`state_dir/<actor_name>-cert.pub` if it exists; call at the top of
`LocalCA.sign()` and `VaultCA.sign()` before writing the new cert
- [x] `cli.py`: add `warden cleanup [actor-name]` command
  - No actor-name: iterate `state_dir/*-cert.pub`, remove any whose
    `valid_before < now - 5 min`
  - With actor-name: remove only that actor's cert if stale
  - `--dry-run`: print what would be removed without deleting
  - Exit 0 always (cleanup is idempotent; nothing to clean is not an error)
- [x] Update `check_no_stale_certs` scorecard check detail message to suggest
running `warden cleanup`
- [x] Update tests: verify `_evict_cert` is called during sign; verify cleanup
command removes stale file; verify `--dry-run` does not delete
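
The sweep behind `warden cleanup` might look like the sketch below. It assumes the `<actor_name>-cert.pub` naming from `_evict_cert`, and it takes a hypothetical `valid_before_of` callable in place of the real cert parsing (which would go through `parse_cert_metadata`). The CLI wiring and exit code are left out.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Callable, List, Optional

GRACE = timedelta(minutes=5)  # a cert is stale once valid_before < now - 5 min


def cleanup(
    state_dir: Path,
    valid_before_of: Callable[[Path], datetime],
    now: Optional[datetime] = None,
    actor_name: Optional[str] = None,
    dry_run: bool = False,
) -> List[Path]:
    """Return the stale cert paths; delete them unless dry_run is set."""
    now = now or datetime.now(timezone.utc)
    # Limit the sweep to one actor when a name is given, else scan every cert.
    pattern = f"{actor_name}-cert.pub" if actor_name else "*-cert.pub"
    stale: List[Path] = []
    for cert_path in sorted(state_dir.glob(pattern)):
        if valid_before_of(cert_path) < now - GRACE:
            stale.append(cert_path)
            if not dry_run:
                cert_path.unlink()
    return stale
```

Returning the list in both modes lets the command print the same report for `--dry-run` and for a real run, and an empty list is simply the "nothing to clean" case.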
### T3 — Outgoing signatures log
```task
id: WARDEN-WP-0002-T3
state_hub_task_id: 0194d24f-a8fe-4f6d-88e6-addea3542c0e
status: done
priority: medium
```
- [x] `ca.py`: after a successful `CertRecord` is produced in `LocalCA.sign()`
  and `VaultCA.sign()`, call
  `_append_signature_log(record, spec, state_dir, backend)`, which appends a
  JSONL line to `state_dir/signatures.log`.
  Fields: `timestamp` (ISO 8601 UTC), `actor`, `actor_type`, `identity`,
  `principals`, `ttl_hours`, `valid_before`, `cert_path`, `backend`
- [x] `cli.py`: add `warden log [actor-name]` command
  - Reads `state_dir/signatures.log` (empty list if absent)
  - `--last N` (default 20): show last N entries
  - `--actor <name>`: filter by actor
  - `--json`: output newline-delimited JSON; default: Rich table
  - Exit 0 always
- [x] Update tests: verify log entry written after sign; verify log not written
on CAError; verify `warden log` filters correctly
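
A minimal sketch of the JSONL append and the read path behind `warden log`, under stated assumptions: the entry is passed as a plain dict rather than built from `CertRecord` and `CertSpec`, the function names are hypothetical, and the actor filter is applied before the last-N cut (the workplan does not say which comes first). Rich table rendering is omitted.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List, Optional

LOG_NAME = "signatures.log"


def append_signature_log(entry: Dict[str, Any], state_dir: Path) -> None:
    """Append one signing event as a single JSON line."""
    # Stamp the event in ISO 8601 UTC unless the caller supplied a timestamp.
    line = {"timestamp": datetime.now(timezone.utc).isoformat(), **entry}
    with (state_dir / LOG_NAME).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(line, sort_keys=True) + "\n")


def read_signature_log(
    state_dir: Path, actor: Optional[str] = None, last: int = 20
) -> List[Dict[str, Any]]:
    """Return the last `last` entries, optionally filtered by actor."""
    log_path = state_dir / LOG_NAME
    if not log_path.exists():
        return []  # an absent log is an empty history, not an error
    with log_path.open(encoding="utf-8") as fh:
        entries = [json.loads(raw) for raw in fh if raw.strip()]
    if actor is not None:
        entries = [e for e in entries if e.get("actor") == actor]
    return entries[-last:] if last > 0 else []
```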
---
## Acceptance Criteria
- [x] `warden sign agt-test --pubkey /tmp/k.pub --ttl 100` raises `CAError`
(agt max is 24h)
- [x] `warden sign agt-test --pubkey /tmp/k.pub --ttl 24` succeeds
- [x] `warden scorecard` includes TTL policy check; fails when a cert exceeds type max
- [x] After `warden sign`, `state_dir/signatures.log` has one new line; valid JSON
- [x] `warden log` renders a table; `warden log --json` is parseable
- [x] `warden log --actor agt-test` returns only entries for that actor
- [x] `warden cleanup --dry-run` lists stale certs without deleting
- [x] `warden cleanup` removes stale certs; scorecard `no_stale_certs` passes after
- [x] Re-signing an actor replaces its cert file (no accumulation)
- [x] All tests pass: `uv run pytest`
- [x] All lints pass: `uv run ruff check .`