
tegwick 9857ed1424 feat(warden): implement WARDEN-WP-0002 correctness and operational completeness

T1 — TTL max enforcement:
  - models.py: MAX_TTL_HOURS policy constant
  - ca.py: _enforce_ttl() raises CAError when spec.ttl_hours > type max
  - Called at top of LocalCA.sign() and VaultCA.sign()
  - scorecard.py: check_ttl_policy() — flags certs with issued TTL > type max
  - run_scorecard() now returns 5 checks

T2 — Stale cert cleanup:
  - ca.py: _evict_cert() removes existing cert before writing new one (no accumulation)
  - cli.py: warden cleanup [actor] [--dry-run] command
  - check_no_stale_certs detail suggests 'warden cleanup' when stale certs found

T3 — Outgoing signatures log:
  - ca.py: _append_signature_log() writes JSONL to state_dir/signatures.log
  - Called after every successful sign() in LocalCA and VaultCA
  - cli.py: warden log [actor] [--last N] [--json] command
  - parse_cert_metadata now also returns valid_from (needed for TTL policy check)

61 tests passing, ruff clean.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-15 15:53:10 +02:00

6.9 KiB



id	type	title	domain	repo	status	owner	topic_slug	planning_priority	planning_order	created	updated	state_hub_workstream_id
WARDEN-WP-0002	workplan	OpsWarden Correctness and Operational Completeness	custodian	ops-warden	done	Bernd	custodian	high	2	2026-05-15	2026-05-15	5a9fba2c-6161-49a4-a231-e750fa4ab572

WARDEN-WP-0002 — Correctness and Operational Completeness

Scope: Fix three functional gaps identified after WARDEN-WP-0001: TTL max enforcement (directive compliance), stale cert cleanup (SCOPE.md promises it), and an outgoing signatures log (audit traceability for every signing operation).

Out of scope: Test coverage improvements (WARDEN-WP-0003), Vault cluster setup, host-side principal deployment.

Goal

After this workplan:

warden sign and warden issue reject TTLs that exceed the type maximum defined in the AccessManagementDirective — no cert can be silently issued with a longer-than-allowed validity window.
Stale/expired certs do not accumulate in the state dir. warden cleanup provides an on-demand sweep; LocalCA.sign() auto-evicts the previous cert for the same actor before writing the new one.
Every successful signing operation is recorded in an append-only signatures.log in the state dir. warden log provides a human-readable and machine-readable view of the signing history.

Reference Documents

Document	Location
AccessManagementDirective	`wiki/AccessManagementDirective.md`
WARDEN-WP-0001	`workplans/WARDEN-WP-0001-initial-implementation.md`
SCOPE.md	`SCOPE.md`

Design Decisions

TTL enforcement: reject, don't clamp

When spec.ttl_hours > MAX_TTL_HOURS[spec.actor_type] (MAX_TTL_HOURS aliases DEFAULT_TTL_HOURS, see T1), raise CAError rather than silently clamping. A silent clamp would mask configuration errors and hide directive violations from operators. An explicit error forces a deliberate decision.

The check lives in CABackend.sign() before the subprocess call so it applies to both LocalCA and VaultCA. Vault's own role max_ttl provides a second layer; this check is the warden-side gate.
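A minimal sketch of the reject-don't-clamp gate described above. The CertSpec fields and the "adm" entry are illustrative assumptions; only the 24h maximum for agt actors is stated in this workplan, and the real classes live in models.py and ca.py.

```python
from dataclasses import dataclass


class CAError(Exception):
    """Raised when a signing request violates CA policy."""


# Assumed policy table. Only the 24h limit for "agt" comes from this
# workplan; the "adm" value is a placeholder for illustration.
MAX_TTL_HOURS = {"agt": 24, "adm": 8}


@dataclass(frozen=True)
class CertSpec:
    # Hypothetical, reduced version of the real CertSpec model.
    actor_name: str
    actor_type: str
    ttl_hours: int


def enforce_ttl(spec: CertSpec) -> None:
    """Reject (never clamp) a TTL above the maximum for the actor type."""
    max_ttl = MAX_TTL_HOURS[spec.actor_type]
    if spec.ttl_hours > max_ttl:
        raise CAError(
            f"TTL {spec.ttl_hours}h exceeds the {max_ttl}h maximum "
            f"for actor type {spec.actor_type!r}"
        )
```

A request at exactly the maximum passes; anything above it fails before any subprocess or Vault call is made, so no partial state is left behind.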

Cleanup: proactive (on sign) + reactive (on demand)

LocalCA.sign() removes the previous cert for the same actor before writing the new one — this keeps state_dir from growing unboundedly under normal operation. warden cleanup handles the edge cases: certs whose actor is no longer in the inventory, certs from aborted sessions, certs left by actors that were renamed.

VaultCA.sign() also evicts before writing (same logic, same helper function).

Signatures log: JSONL, append-only, in state_dir

One line per signing event, written after a successful CertRecord is produced. Format: {"timestamp": ..., "actor": ..., "actor_type": ..., "identity": ..., "principals": [...], "ttl_hours": ..., "valid_before": ..., "backend": ...}.

The log lives alongside certs in state_dir so a single directory backup captures the full operational history. No rotation at this scope — add rotation in a follow-up if the file grows beyond a few MB in practice.

warden log is read-only. No deletion via CLI — the log is an audit artefact.
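As an illustration of the append-only JSONL format, the sketch below writes one entry per call. The function name and keyword-argument signature are simplifications for this example; the planned helper takes a CertRecord and CertSpec instead (see T3).

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import List


def append_signature_entry(
    state_dir: Path,
    *,
    actor: str,
    actor_type: str,
    identity: str,
    principals: List[str],
    ttl_hours: int,
    valid_before: datetime,  # expected to be timezone-aware
    backend: str,
) -> dict:
    """Append one signing event to state_dir/signatures.log as a JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "actor_type": actor_type,
        "identity": identity,
        "principals": principals,
        "ttl_hours": ttl_hours,
        "valid_before": valid_before.astimezone(timezone.utc).isoformat(),
        "backend": backend,
    }
    # Mode "a" only ever adds to the end of the file; existing lines are
    # never rewritten.
    with (state_dir / "signatures.log").open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Because each event is a single self-contained line, the file can be tailed, grepped, or parsed line by line without loading the whole history.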

Tasks

T1 — TTL max enforcement per ActorType

id: WARDEN-WP-0002-T1
state_hub_task_id: b0d0b5f7-a181-4590-be26-c48ae28cd964
status: done
priority: high

models.py: add MAX_TTL_HOURS = DEFAULT_TTL_HOURS alias (same values, explicit name signals policy intent); add helper enforce_ttl(spec: CertSpec) -> None that raises CAError when spec.ttl_hours > MAX_TTL_HOURS[spec.actor_type]
ca.py: call enforce_ttl(spec) at the top of CABackend.sign() base (or in both LocalCA.sign() and VaultCA.sign() if no shared base call)
scorecard.py: add check_ttl_policy(state_dir, inventory) — parse each cert in state_dir via ssh-keygen -L; compare cert validity window duration against MAX_TTL_HOURS[actor_type]; flag if exceeded
Add check_ttl_policy to run_scorecard()
Update tests: test_ca.py — assert CAError raised when ttl_hours exceeds max for each type; assert no error at exactly the max
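The scorecard comparison in T1 can be sketched as follows. It assumes the ssh-keygen -L output contains a line of the form "Valid: from <start> to <end>" with ISO-style local timestamps; the function names are hypothetical, and treating an unparsable or "forever" validity as a violation is a choice made for this example, not something the workplan specifies.

```python
import re
from datetime import datetime
from typing import Optional

# Matches e.g. "Valid: from 2026-05-15T10:00:00 to 2026-05-16T10:00:00".
_VALID_RE = re.compile(r"Valid:\s+from\s+(\S+)\s+to\s+(\S+)")


def validity_hours(keygen_output: str) -> Optional[float]:
    """Return the cert validity window in hours, or None if it cannot be parsed."""
    match = _VALID_RE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever" or unexpected output
    valid_from, valid_before = (datetime.fromisoformat(g) for g in match.groups())
    return (valid_before - valid_from).total_seconds() / 3600


def exceeds_ttl_policy(keygen_output: str, max_ttl_hours: float) -> bool:
    """Flag a cert whose window is longer than the type maximum."""
    hours = validity_hours(keygen_output)
    # Assumption: a window that cannot be measured is treated as a violation.
    return hours is None or hours > max_ttl_hours


SAMPLE = """\
/tmp/k-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Valid: from 2026-05-15T10:00:00 to 2026-05-16T10:00:00
"""
```

With the sample above, the window is exactly 24 hours, so a 24h maximum passes and a shorter maximum is flagged.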

T2 — Stale cert cleanup command

id: WARDEN-WP-0002-T2
state_hub_task_id: aeeefbad-c0bd-4ae8-a3fe-9f72321b4caa
status: done
priority: medium

ca.py: extract _evict_cert(actor_name, state_dir) — removes state_dir/<actor_name>-cert.pub if it exists; call at the top of LocalCA.sign() and VaultCA.sign() before writing the new cert
cli.py: add warden cleanup [actor-name] command
  - No actor-name: iterate state_dir/*-cert.pub, remove any whose valid_before < now - 5 min
  - With actor-name: remove only that actor's cert if stale
  - --dry-run: print what would be removed without deleting
  - Exit 0 always (cleanup is idempotent; nothing to clean is not an error)
Update check_no_stale_certs scorecard check detail message to suggest running warden cleanup
Update tests: verify _evict_cert is called during sign; verify cleanup command removes stale file; verify --dry-run does not delete
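A possible shape for the sweep behind warden cleanup. The function name, the injected valid_before_of callable (standing in for the cert metadata parser), and the now parameter are assumptions made to keep the example self-contained; file names follow the <actor_name>-cert.pub pattern used by _evict_cert.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Callable, List, Optional

# Grace period from T2: a cert is stale once valid_before < now - 5 min.
STALE_GRACE = timedelta(minutes=5)


def sweep_stale_certs(
    state_dir: Path,
    valid_before_of: Callable[[Path], datetime],  # hypothetical metadata lookup
    actor_name: Optional[str] = None,
    dry_run: bool = False,
    now: Optional[datetime] = None,
) -> List[Path]:
    """Return the stale certs found; delete them unless dry_run is set."""
    now = now or datetime.now(timezone.utc)
    pattern = f"{actor_name}-cert.pub" if actor_name else "*-cert.pub"
    stale = []
    for cert in sorted(state_dir.glob(pattern)):
        if valid_before_of(cert) < now - STALE_GRACE:
            if not dry_run:
                cert.unlink()
            stale.append(cert)
    return stale
```

An empty result is a normal outcome, which matches the requirement that the command always exits 0.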

T3 — Outgoing signatures log

id: WARDEN-WP-0002-T3
state_hub_task_id: 0194d24f-a8fe-4f6d-88e6-addea3542c0e
status: done
priority: medium

ca.py: after a successful CertRecord is produced in LocalCA.sign() and VaultCA.sign(), call _append_signature_log(record, spec, state_dir, backend), which appends a JSONL line to state_dir/signatures.log
  - Fields: timestamp (ISO 8601 UTC), actor, actor_type, identity, principals, ttl_hours, valid_before, cert_path, backend
cli.py: add warden log [actor-name] command
  - Reads state_dir/signatures.log (empty list if absent)
  - --last N (default 20): show last N entries
  - --actor <name>: filter by actor
  - --json: output newline-delimited JSON; default: Rich table
  - Exit 0 always
Update tests: verify log entry written after sign; verify log not written on CAError; verify warden log filters correctly
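The read side of warden log might be sketched like this. The function name is hypothetical, and applying the actor filter before taking the last N entries is an assumption; the workplan does not say which comes first.

```python
import json
from pathlib import Path
from typing import List, Optional


def read_signature_log(
    state_dir: Path, actor: Optional[str] = None, last: int = 20
) -> List[dict]:
    """Load entries from signatures.log, optionally filtered by actor."""
    log_path = state_dir / "signatures.log"
    if not log_path.exists():
        return []  # an absent log is an empty history, not an error
    entries = []
    with log_path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if actor is None or entry.get("actor") == actor:
                entries.append(entry)
    # Guard against last <= 0, since entries[-0:] would return everything.
    return entries[-last:] if last > 0 else []
```

The --json output can then be produced by dumping each returned dict on its own line, and the table view by feeding the same list to the renderer.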

Acceptance Criteria

warden sign agt-test --pubkey /tmp/k.pub --ttl 100 raises CAError (agt max is 24h)
warden sign agt-test --pubkey /tmp/k.pub --ttl 24 succeeds
warden scorecard includes TTL policy check; fails when a cert exceeds type max
After warden sign, state_dir/signatures.log has one new line; valid JSON
warden log renders a table; warden log --json is parseable
warden log --actor agt-test returns only entries for that actor
warden cleanup --dry-run lists stale certs without deleting
warden cleanup removes stale certs; scorecard no_stale_certs passes after
Re-signing an actor replaces its cert file (no accumulation)
All tests pass: uv run pytest
All lints pass: uv run ruff check .
