ops-warden/workplans/WARDEN-WP-0002-correctness-and-completeness.md
tegwick 9857ed1424 feat(warden): implement WARDEN-WP-0002 correctness and operational completeness
T1 — TTL max enforcement:
  - models.py: MAX_TTL_HOURS policy constant
  - ca.py: _enforce_ttl() raises CAError when spec.ttl_hours > type max
  - Called at top of LocalCA.sign() and VaultCA.sign()
  - scorecard.py: check_ttl_policy() — flags certs with issued TTL > type max
  - run_scorecard() now returns 5 checks

T2 — Stale cert cleanup:
  - ca.py: _evict_cert() removes existing cert before writing new one (no accumulation)
  - cli.py: warden cleanup [actor] [--dry-run] command
  - check_no_stale_certs detail suggests 'warden cleanup' when stale certs found

T3 — Outgoing signatures log:
  - ca.py: _append_signature_log() writes JSONL to state_dir/signatures.log
  - Called after every successful sign() in LocalCA and VaultCA
  - cli.py: warden log [actor] [--last N] [--json] command
  - parse_cert_metadata now also returns valid_from (needed for TTL policy check)

61 tests passing, ruff clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 15:53:10 +02:00


id: WARDEN-WP-0002
type: workplan
title: OpsWarden Correctness and Operational Completeness
domain: custodian
repo: ops-warden
status: done
owner: Bernd
topic_slug: custodian
planning_priority: high
planning_order: 2
created: 2026-05-15
updated: 2026-05-15
state_hub_workstream_id: 5a9fba2c-6161-49a4-a231-e750fa4ab572

WARDEN-WP-0002 — Correctness and Operational Completeness

Scope: Fix three functional gaps identified after WARDEN-WP-0001: TTL max enforcement (directive compliance), stale cert cleanup (SCOPE.md promises it), and an outgoing signatures log (audit traceability for every signing operation).

Out of scope: Test coverage improvements (WARDEN-WP-0003), Vault cluster setup, host-side principal deployment.


Goal

After this workplan:

  1. warden sign and warden issue reject TTLs that exceed the type maximum defined in the AccessManagementDirective — no cert can be silently issued with a longer-than-allowed validity window.
  2. Stale/expired certs do not accumulate in the state dir. warden cleanup provides an on-demand sweep; LocalCA.sign() auto-evicts the previous cert for the same actor before writing the new one.
  3. Every successful signing operation is recorded in an append-only signatures.log in the state dir. warden log provides a human-readable and machine-readable view of the signing history.

Reference Documents

Document                     Location
AccessManagementDirective    wiki/AccessManagementDirective.md
WARDEN-WP-0001               workplans/WARDEN-WP-0001-initial-implementation.md
SCOPE.md                     SCOPE.md

Design Decisions

TTL enforcement: reject, don't clamp

When spec.ttl_hours > MAX_TTL_HOURS[spec.actor_type] (MAX_TTL_HOURS is an alias of DEFAULT_TTL_HOURS, see T1), raise CAError rather than silently clamping. A silent clamp would mask configuration errors and hide directive violations from operators. An explicit error forces a deliberate decision.

The check lives in CABackend.sign() before the subprocess call so it applies to both LocalCA and VaultCA. Vault's own role max_ttl provides a second layer; this check is the warden-side gate.
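The reject-rather-than-clamp rule can be sketched as follows. This is a minimal illustration, not the code in the repository: the CertSpec fields, the string keys for actor types, and the "adm" entry are assumptions made for the example; only the 24h maximum for agt comes from this workplan.

```python
from dataclasses import dataclass

# Assumed policy table. Only the 24h limit for "agt" is stated in this
# workplan; the "adm" entry is a hypothetical placeholder.
MAX_TTL_HOURS = {"agt": 24, "adm": 8}


class CAError(Exception):
    """Raised when a signing request violates policy or the backend fails."""


@dataclass
class CertSpec:
    # Hypothetical, reduced version of the real spec model.
    actor_name: str
    actor_type: str
    ttl_hours: int


def enforce_ttl(spec: CertSpec) -> None:
    """Reject (never clamp) a TTL above the maximum for the actor type."""
    max_hours = MAX_TTL_HOURS[spec.actor_type]
    if spec.ttl_hours > max_hours:
        raise CAError(
            f"{spec.actor_name}: requested TTL {spec.ttl_hours}h exceeds "
            f"the {max_hours}h maximum for type '{spec.actor_type}'"
        )
```

A TTL exactly at the maximum passes; anything above it fails before any signing subprocess would be started, so no partially valid state is left behind.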

Cleanup: proactive (on sign) + reactive (on demand)

LocalCA.sign() removes the previous cert for the same actor before writing the new one — this keeps state_dir from growing unboundedly under normal operation. warden cleanup handles the edge cases: certs whose actor is no longer in the inventory, certs from aborted sessions, certs left by actors that were renamed.

VaultCA.sign() also evicts before writing (same logic, same helper function).

Signatures log: JSONL, append-only, in state_dir

One line per signing event, written after a successful CertRecord is produced. Format: {"timestamp": ..., "actor": ..., "actor_type": ..., "identity": ..., "principals": [...], "ttl_hours": ..., "valid_before": ..., "backend": ...}.

The log lives alongside certs in state_dir so a single directory backup captures the full operational history. No rotation at this scope — add rotation in a follow-up if the file grows beyond a few MB in practice.

warden log is read-only. No deletion via CLI — the log is an audit artefact.
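The append-only JSONL write can be illustrated with a few lines of Python. This sketch simplifies the signature: it takes a prepared dict rather than the (record, spec, state_dir, backend) arguments named in T3, and the helper name is hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_signature_entry(entry: dict, state_dir: Path) -> None:
    """Append one signing event as a single JSON line to signatures.log."""
    line = dict(entry)
    # Stamp the entry in ISO 8601 UTC unless the caller already set one.
    line.setdefault("timestamp", datetime.now(timezone.utc).isoformat())
    log_path = state_dir / "signatures.log"
    # Mode "a" only appends; existing lines are never rewritten.
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(line, sort_keys=True) + "\n")
```

Because each event is one self-contained line, a truncated final line (for example after a crash) would not corrupt earlier entries.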


Tasks

T1 — TTL max enforcement per ActorType

id: WARDEN-WP-0002-T1
state_hub_task_id: b0d0b5f7-a181-4590-be26-c48ae28cd964
status: done
priority: high
  • models.py: add MAX_TTL_HOURS = DEFAULT_TTL_HOURS alias (same values, explicit name signals policy intent); add helper enforce_ttl(spec: CertSpec) -> None that raises CAError when spec.ttl_hours > MAX_TTL_HOURS[spec.actor_type]
  • ca.py: call enforce_ttl(spec) at the top of CABackend.sign() base (or in both LocalCA.sign() and VaultCA.sign() if no shared base call)
  • scorecard.py: add check_ttl_policy(state_dir, inventory) — parse each cert in state_dir via ssh-keygen -L; compare cert validity window duration against MAX_TTL_HOURS[actor_type]; flag if exceeded
  • Add check_ttl_policy to run_scorecard()
  • Update tests: test_ca.py — assert CAError raised when ttl_hours exceeds max for each type; assert no error at exactly the max
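The scorecard comparison in T1 depends on extracting the validity window from ssh-keygen -L output. The sketch below assumes the usual "Valid: from <start> to <end>" line with ISO-style timestamps and treats anything else (such as "Valid: forever") as a violation; the function names are hypothetical and the real check_ttl_policy() may differ.

```python
import re
from datetime import datetime
from typing import Optional

# Matches the validity line printed by `ssh-keygen -L` for bounded certs.
VALID_RE = re.compile(r"Valid: from (\S+) to (\S+)")


def validity_hours(keygen_output: str) -> Optional[float]:
    """Return the cert validity window in hours, or None if unbounded."""
    match = VALID_RE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever" or unexpected output
    valid_from, valid_before = (datetime.fromisoformat(g) for g in match.groups())
    return (valid_before - valid_from).total_seconds() / 3600


def exceeds_max(keygen_output: str, max_hours: int) -> bool:
    """True when the window is unbounded or longer than max_hours."""
    hours = validity_hours(keygen_output)
    return hours is None or hours > max_hours
```

Both timestamps are printed in the same local time zone, so subtracting them directly is adequate for a policy check, although a window spanning a daylight-saving change could be off by an hour.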

T2 — Stale cert cleanup command

id: WARDEN-WP-0002-T2
state_hub_task_id: aeeefbad-c0bd-4ae8-a3fe-9f72321b4caa
status: done
priority: medium
  • ca.py: extract _evict_cert(actor_name, state_dir) — removes state_dir/<actor_name>-cert.pub if it exists; call at the top of LocalCA.sign() and VaultCA.sign() before writing the new cert
  • cli.py: add warden cleanup [actor-name] command
      - No actor-name: iterate state_dir/*-cert.pub, remove any whose valid_before < now - 5 min
      - With actor-name: remove only that actor's cert if stale
      - --dry-run: print what would be removed without deleting
      - Exit 0 always (cleanup is idempotent; nothing to clean is not an error)
  • Update check_no_stale_certs scorecard check detail message to suggest running warden cleanup
  • Update tests: verify _evict_cert is called during sign; verify cleanup command removes stale file; verify --dry-run does not delete
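The sweep behind warden cleanup could be structured as below. This is a hedged sketch: the function name and parameters are hypothetical, and the expiry lookup is passed in as a callable so the example stays independent of how cert metadata is actually parsed.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Callable, List, Optional

# Grace period from T2: a cert is stale once valid_before < now - 5 min.
GRACE = timedelta(minutes=5)


def sweep_stale_certs(
    state_dir: Path,
    valid_before_of: Callable[[Path], datetime],
    now: datetime,
    actor_name: Optional[str] = None,
    dry_run: bool = False,
) -> List[Path]:
    """Return stale cert paths; delete them unless dry_run is set."""
    pattern = f"{actor_name}-cert.pub" if actor_name else "*-cert.pub"
    stale = []
    for cert in sorted(state_dir.glob(pattern)):
        if valid_before_of(cert) < now - GRACE:
            if not dry_run:
                cert.unlink()
            stale.append(cert)
    return stale
```

Returning the same list in both modes lets a dry run report exactly what a real run would delete, and an empty list maps naturally onto the "exit 0 always" rule.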

T3 — Outgoing signatures log

id: WARDEN-WP-0002-T3
state_hub_task_id: 0194d24f-a8fe-4f6d-88e6-addea3542c0e
status: done
priority: medium
  • ca.py: after a successful CertRecord is produced in LocalCA.sign() and VaultCA.sign(), call _append_signature_log(record, spec, state_dir, backend), which appends a JSONL line to state_dir/signatures.log
      - Fields: timestamp (ISO 8601 UTC), actor, actor_type, identity, principals, ttl_hours, valid_before, cert_path, backend
  • cli.py: add warden log [actor-name] command
      - Reads state_dir/signatures.log (empty list if absent)
      - --last N (default 20): show last N entries
      - --actor <name>: filter by actor
      - --json: output newline-delimited JSON; default: Rich table
      - Exit 0 always
  • Update tests: verify log entry written after sign; verify log not written on CAError; verify warden log filters correctly
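The read side of warden log reduces to loading the JSONL file, filtering, and truncating. The sketch below uses a hypothetical function name and assumes the actor filter is applied before the --last limit, which this workplan does not specify.

```python
import json
from pathlib import Path
from typing import List, Optional


def read_signature_log(
    state_dir: Path, actor: Optional[str] = None, last: int = 20
) -> List[dict]:
    """Return the last `last` log entries, optionally for one actor."""
    log_path = state_dir / "signatures.log"
    if not log_path.exists():
        return []  # an absent log is treated as an empty history
    entries = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                entries.append(json.loads(line))
    if actor is not None:
        # Assumption: filter first, so `last` counts matching entries.
        entries = [e for e in entries if e.get("actor") == actor]
    # Guard against entries[-0:], which would return the whole list.
    return entries[-last:] if last > 0 else []
```

Filtering before truncating means --last 5 with an actor filter yields that actor's five most recent signings rather than whichever of the last five global entries happen to match.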

Acceptance Criteria

  • warden sign agt-test --pubkey /tmp/k.pub --ttl 100 raises CAError (agt max is 24h)
  • warden sign agt-test --pubkey /tmp/k.pub --ttl 24 succeeds
  • warden scorecard includes TTL policy check; fails when a cert exceeds type max
  • After warden sign, state_dir/signatures.log has one new line; valid JSON
  • warden log renders a table; warden log --json is parseable
  • warden log --actor agt-test returns only entries for that actor
  • warden cleanup --dry-run lists stale certs without deleting
  • warden cleanup removes stale certs; scorecard no_stale_certs passes after
  • Re-signing an actor replaces its cert file (no accumulation)
  • All tests pass: uv run pytest
  • All lints pass: uv run ruff check .