feat(bootstrap): WARDEN-WP-0001 initial implementation — 42 tests passing

- LocalCA: ssh-keygen -s signing, keypair generation, cert parsing via ssh-keygen -L
- VaultCA: Vault SSH engine backend via httpx
- Inventory: YAML actor registry with ActorType, principals, TTL policy
- Scorecard: four cert-side compliance checks (prefixes, principals, no expired/stale)
- CLI: sign (cert_command interface), issue, status, scorecard, inventory subcommands
- ops-ssh-wrapper: acquire cert and exec SSH command
- Fix: principal parser stops at section headers containing ':' (Critical Options, Extensions)
- Move WARDEN-WP-0001 workplan from ops-bridge; register repo in state-hub (74df727e)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 13:27:49 +02:00
parent fee16417b8
commit 42ca370085
7 changed files with 605 additions and 73 deletions


@@ -4,19 +4,22 @@ type: workplan
title: "OpsWarden Initial Implementation"
domain: custodian
repo: ops-warden
status: draft
status: active
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
state_hub_workstream_id: "c3118cc6-adfb-428c-a9c6-edd0ee152ae6"
---
# WARDEN-WP-0001 — OpsWarden Initial Implementation
**Scope:** Deliver a working `warden` CLI that implements the SSH CA and certificate
lifecycle defined in `wiki/AccessManagementDirective.md`. Scaffolding (models, config,
CA backends, inventory, scorecard, CLI) is already present in the repo; this workplan
tracks the remaining implementation, testing, and integration work.
> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist.
> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the
> first commit action.
**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that
implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`.
**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
(those live in `railiance-infra`), session recording, and SSO integration (trigger §6.2 of
@@ -26,14 +29,10 @@ the directive when scale requires it).
## Goal
After this workplan:
1. `warden sign agt-test --pubkey /tmp/test.pub` outputs a valid cert (local backend).
2. `warden status agt-test` shows correct identity, principals, and time-to-expiry.
3. `warden scorecard` returns 4/4 on a clean test inventory.
4. `warden sign` called from ops-bridge `cert_command` works end-to-end in an integration
test tunnel.
5. All tests pass (`uv run pytest`) and lints pass (`uv run ruff check .`).
Create a new `ops-warden` repository that owns **credential issuance only** — the CA,
certificate signing, actor identity registry, and scorecard tooling. Its sole public surface
to sibling repos is a well-defined `cert_command` interface that any tool (principally
`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor.
---
@@ -41,86 +40,294 @@ After this workplan:
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| cert_command interface | `wiki/CertCommandInterface.md` |
| Config reference | `wiki/OpsWardenConfig.md` |
| ops-bridge alignment workplan | `../ops-bridge/workplans/BRIDGE-WP-0004-directive-alignment.md` |
| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |
---
## Architecture Summary
## Architecture
```
~/.config/warden/warden.yaml # backend, ca_key, inventory_path, state_dir
~/.config/warden/inventory.yaml # actor registry (name → type, principals, ttl_hours)
~/.local/state/warden/ # signed certs (*-cert.pub); keypairs (keys/)
ops-warden/
├── SCOPE.md
├── CLAUDE.md
├── pyproject.toml
├── src/warden/
│ ├── cli.py # Typer CLI: sign / issue / status / inventory / scorecard
│ ├── models.py # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
│ ├── ca.py # LocalCA backend (file-based, for dev / non-Vault)
│ ├── vault.py # VaultCA backend (Vault SSH engine, for production)
│ ├── inventory.py # YAML principals inventory read/write
│ ├── scorecard.py # §5 compliance checks
│ └── config.py # ~/.config/warden/warden.yaml loader
├── tests/
└── wiki/ # (symlink or copy of AccessManagementDirective.md)
```
Two swappable CA backends — both expose the same `sign(spec) -> CertRecord` interface:
- `LocalCA` — `ssh-keygen -s`; no Vault dependency; default for dev/lab
- `VaultCA` — Vault SSH engine via httpx
**Backends are swappable.** Config key `backend: local | vault` selects which CA
implementation is used. This means the tool is fully functional without Vault for local lab
use, and production-grade with Vault — the same CLI surface, the same `cert_command`
interface, the same principals inventory format.
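A minimal sketch of how the `backend: local | vault` key could select an implementation behind one shared `sign(spec) -> CertRecord` interface. The `CABackend` protocol, the `select_backend` helper, and the trimmed `CertRecord` fields are assumptions for illustration, not the repository's actual code.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CertSpec:
    actor_name: str
    pubkey_path: str
    ttl_hours: int
    principals: list[str]


@dataclass
class CertRecord:
    # Trimmed for the sketch; the workplan also lists valid_before and signed_at.
    identity: str
    cert_path: str


class CABackend(Protocol):
    """Hypothetical shared interface that both backends satisfy."""

    def sign(self, spec: CertSpec) -> CertRecord: ...


class LocalCA:
    def __init__(self, ca_key: str) -> None:
        self.ca_key = ca_key

    def sign(self, spec: CertSpec) -> CertRecord:
        raise NotImplementedError("would shell out to ssh-keygen -s")


class VaultCA:
    def __init__(self, vault_addr: str) -> None:
        self.vault_addr = vault_addr

    def sign(self, spec: CertSpec) -> CertRecord:
        raise NotImplementedError("would POST to the Vault SSH engine")


def select_backend(config: dict) -> CABackend:
    """Pick a CA implementation from the parsed warden.yaml mapping."""
    backend = config.get("backend")
    if backend == "local":
        return LocalCA(ca_key=config["ca_key"])
    if backend == "vault":
        return VaultCA(vault_addr=config["vault_addr"])
    raise ValueError(f"unknown backend {backend!r} (expected 'local' or 'vault')")
```

Because callers only depend on `sign`, the CLI and the `cert_command` path stay identical whichever backend the config names.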
cert_command interface (consumed by ops-bridge):
**cert_command interface contract:**
```
warden sign <actor-name> --pubkey <path> # → cert text to stdout
warden sign <actor-name> --pubkey <path>
```
Writes the signed certificate to stdout (the cert text). Exits non-zero on failure.
`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.
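The contract above (certificate text on stdout, non-zero exit on failure) can be exercised from the caller's side roughly as follows. This is a sketch of a generic consumer, not how `ops-bridge` is actually implemented; `run_cert_command` is a hypothetical helper name.

```python
import shlex
import subprocess


def run_cert_command(cert_command: str) -> str:
    """Run a cert_command string and return the certificate text from stdout."""
    result = subprocess.run(
        shlex.split(cert_command),  # no shell involved, so "~" is NOT expanded here
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # The contract only promises a non-zero exit code; stderr is best effort.
        raise RuntimeError(
            f"cert_command failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return result.stdout
```

A caller would pass the string from `tunnels.yaml`, for example `run_cert_command("warden sign agt-test --pubkey /tmp/test.pub")`, and write the returned text next to the private key. Since this sketch avoids a shell, paths beginning with `~` would need `os.path.expanduser` applied first.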
---
## Stack
- **Language:** Python 3.11+
- **CLI framework:** Typer
- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
- **Vault SDK:** `hvac` (optional; only required for vault backend)
- **Packaging:** `uv tool install`
---
## Tasks
### T1 — Repository registration
- [ ] Register repo with state-hub (`register_repo`); assign Repo ID; update
`.claude/rules/repo-identity.md`
### T1 — Repository bootstrap
```task
id: WARDEN-WP-0001-T1
state_hub_task_id: 6d643e9d-5e97-4224-9d82-87267b5ba6bc
status: todo
priority: high
```
- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
`workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
- [ ] Write `SCOPE.md` (see template in §SCOPE below)
- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
- [ ] Register repo with state-hub (`register_repo`)
- [ ] Create state-hub workstream for this workplan
### T2 — LocalCA integration test
- [ ] Generate a test CA key: `ssh-keygen -t ed25519 -f /tmp/test-ca -N ""`
- [ ] Run `warden sign` against a real pubkey with the test CA (requires `ssh-keygen` in PATH)
- [ ] Verify cert parses correctly with `ssh-keygen -L`
- [ ] Add to `tests/test_ca.py` as an integration test (skipped if `ssh-keygen` not in PATH)
### T2 — Models and config
### T3 — VaultCA integration test
- [ ] Set up a local Vault dev server (`vault server -dev`)
- [ ] Enable SSH secrets engine: `vault secrets enable ssh`
- [ ] Configure a signing role for `agt`
- [ ] Run `warden sign` with `backend: vault` config
- [ ] Add to `tests/test_vault.py` as an integration test (skipped if Vault not reachable)
```task
id: WARDEN-WP-0001-T2
state_hub_task_id: c66fc65a-0b16-4ba2-9e70-a83d875572ec
status: todo
priority: high
```
### T4 — CLI end-to-end smoke tests
- [ ] `warden inventory add agt-test --type agt --principal agt-task-test`
- [ ] `warden inventory list` shows the actor
- [ ] `warden issue agt-test` (local backend) produces keypair + cert
- [ ] `warden status agt-test` shows valid cert
- [ ] `warden scorecard` returns 4/4
- [ ] `warden inventory remove agt-test` removes actor
- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
`ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
`inventory_path`, `state_dir`
- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)
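The prefix rule in the last item could be checked along these lines. The enum values come from the task description; the helper names `actor_type_from_name` and `validate_actor` are assumptions made for this sketch.

```python
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


def actor_type_from_name(actor_name: str) -> ActorType:
    """Derive the ActorType from the 'adm-', 'agt-', or 'atm-' name prefix."""
    prefix, sep, rest = actor_name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {actor_name!r} must look like '<type>-<name>'")
    try:
        return ActorType(prefix)
    except ValueError:
        raise ValueError(
            f"unknown actor prefix {prefix!r} in {actor_name!r}"
        ) from None


def validate_actor(actor_name: str, declared: ActorType) -> None:
    """Raise if the name prefix disagrees with the declared type."""
    actual = actor_type_from_name(actor_name)
    if actual is not declared:
        raise ValueError(
            f"{actor_name!r} has prefix {actual.value!r} "
            f"but is declared as {declared.value!r}"
        )
```

For example, `validate_actor("agt-state-hub-bridge", ActorType.AGT)` passes silently, while declaring the same name as `ActorType.ADM` raises `ValueError`.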
### T5 — ops-bridge cert_command integration
- [ ] Add `agt-state-hub-bridge` to inventory (or use existing from ops-bridge config)
- [ ] Set `cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"`
in a test `tunnels.yaml`
- [ ] Run `bridge up state-hub-coulombcore`; confirm cert is present in
`~/.local/state/bridge/` and `cert_identity` appears in the audit log
- [ ] Document result in a progress event
### T3 — LocalCA backend
### T6 — CI/CD setup
- [ ] Add `.github/workflows/ci.yml` (or equivalent) running `uv run pytest` and
`uv run ruff check .` on push
- [ ] Tests must pass without Vault (VaultCA integration tests skipped via pytest marker)
```task
id: WARDEN-WP-0001-T3
state_hub_task_id: a5a41e58-1c6d-42a9-9b11-2088f17c29b5
status: todo
priority: high
```
### T7 — Documentation
- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference (already stubbed)
- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (already stubbed)
- [ ] Ensure `wiki/AccessManagementDirective.md` is in sync with `ops-bridge/wiki/`
- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
  - Calls `ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>`
  - Parses `ssh-keygen -L -f <cert>` output to extract `Valid before`, `Key ID`,
    `Principals`
  - Returns `CertRecord`; writes cert to `~/.local/state/warden/<actor>-cert.pub`
- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
(overridable per actor in inventory)
- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
actors that do not bring their own key
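The signing call and the principal extraction described in T3 might look like the sketch below. The default TTLs are the ones listed above; `build_sign_argv`, `parse_principals`, and `SAMPLE_LISTING` are hypothetical names, and the sample text is a hand-written approximation of the `ssh-keygen -L` layout rather than captured output.

```python
from typing import Optional

DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


def build_sign_argv(
    ca_key: str,
    identity: str,
    principals: list[str],
    pubkey: str,
    actor_type: str,
    ttl_hours: Optional[int] = None,
) -> list[str]:
    """Build the ssh-keygen argv; a per-actor ttl_hours overrides the type default."""
    ttl = ttl_hours if ttl_hours is not None else DEFAULT_TTL_HOURS[actor_type]
    return [
        "ssh-keygen", "-s", ca_key,
        "-I", identity,
        "-n", ",".join(principals),
        "-V", f"+{ttl}h",
        pubkey,
    ]


def parse_principals(listing: str) -> list[str]:
    """Collect the lines under 'Principals:' until the next header containing ':'."""
    principals: list[str] = []
    in_section = False
    for line in listing.splitlines():
        stripped = line.strip()
        if stripped.startswith("Principals:"):
            in_section = True
            continue
        if in_section:
            # Headers such as 'Critical Options:' or 'Extensions:' end the list.
            if not stripped or ":" in stripped:
                break
            principals.append(stripped)
    return principals


# Illustrative only: approximates the shape of `ssh-keygen -L` output.
SAMPLE_LISTING = """\
/tmp/test-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Key ID: "agt-test"
        Principals:
                agt-task-test
        Critical Options: (none)
        Extensions:
                permit-pty
"""
```

Stopping at any line that contains `:` is the same rule the commit message describes for the principal parser fix, which keeps `Critical Options` and `Extensions` entries out of the principal list.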
### T4 — VaultCA backend
```task
id: WARDEN-WP-0001-T4
state_hub_task_id: b2067ee6-c9ce-423b-9d60-0d28069fb304
status: todo
priority: medium
```
- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
  - `POST /v1/ssh/sign/<role>` with `public_key`, `valid_principals`, `ttl`
  - Parse response `signed_key` field; write to state dir; extract metadata via
    `ssh-keygen -L`
- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)
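The request and response handling for the Vault path can be sketched without any network code. The field names follow the task list above; `build_sign_request` and `extract_signed_key` are hypothetical helpers, and authentication (the Vault token header) plus the actual `httpx` call are deliberately left out.

```python
def build_sign_request(
    vault_addr: str,
    role: str,
    public_key: str,
    principals: list[str],
    ttl_hours: int,
) -> tuple[str, dict]:
    """Return the URL and JSON body for signing a key against a Vault SSH role."""
    url = f"{vault_addr.rstrip('/')}/v1/ssh/sign/{role}"
    payload = {
        "public_key": public_key,
        # Vault takes principals as one comma-separated string.
        "valid_principals": ",".join(principals),
        "ttl": f"{ttl_hours}h",
    }
    return url, payload


def extract_signed_key(response_json: dict) -> str:
    """Pull the signed certificate out of a Vault sign response body."""
    try:
        return response_json["data"]["signed_key"]
    except (KeyError, TypeError) as exc:
        raise ValueError("Vault response has no data.signed_key field") from exc
```

With a role map such as `{"agt": "agt-role"}`, the caller would look up the role for the actor's type, send the payload, and pass the decoded JSON body to `extract_signed_key`.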
### T5 — Principals inventory
```task
id: WARDEN-WP-0001-T5
state_hub_task_id: 6d13f8cd-1850-44c9-b769-b21250348319
status: todo
priority: high
```
- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
```yaml
actors:
  agt-state-hub-bridge:
    type: agt
    principals: [agt-task-bridge]
    ttl_hours: 24
    description: "ops-bridge tunnel actor"
hosts:
  coulombcore:
    allowed_principals:
      agt: [agt-task-bridge]
      atm: [atm-backup-daily]
```
- [ ] `warden inventory list` — print table
- [ ] `warden inventory add <actor-name> --type <adm|agt|atm> --principals <...>`
- [ ] `warden inventory remove <actor-name>`
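One way to model the `actors` section once the YAML has been parsed into a plain mapping (for instance by `yaml.safe_load`, which is not called here so the sketch stays dependency-free). The `Actor` dataclass and both helper names are assumptions; the TTL defaults are the per-type values from T3.

```python
from dataclasses import dataclass

DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


@dataclass
class Actor:
    name: str
    type: str
    principals: list[str]
    ttl_hours: int
    description: str = ""


def actors_from_mapping(data: dict) -> dict[str, Actor]:
    """Build Actor objects from the parsed 'actors' section of inventory.yaml."""
    actors: dict[str, Actor] = {}
    for name, entry in (data.get("actors") or {}).items():
        actor_type = entry["type"]
        actors[name] = Actor(
            name=name,
            type=actor_type,
            principals=list(entry.get("principals", [])),
            # Fall back to the per-type default when no override is given.
            ttl_hours=entry.get("ttl_hours", DEFAULT_TTL_HOURS[actor_type]),
            description=entry.get("description", ""),
        )
    return actors


def actors_to_mapping(actors: dict[str, Actor]) -> dict:
    """Inverse of actors_from_mapping, ready to be dumped back to YAML."""
    return {
        "actors": {
            actor.name: {
                "type": actor.type,
                "principals": list(actor.principals),
                "ttl_hours": actor.ttl_hours,
                "description": actor.description,
            }
            for actor in actors.values()
        }
    }
```

Keeping load and save as pure mapping transforms makes the YAML round-trip test in T9 straightforward: load, convert, convert back, and compare.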
### T6 — CLI commands
```task
id: WARDEN-WP-0001-T6
state_hub_task_id: 656a4615-92bb-4b5d-9406-e86d24fa15d0
status: todo
priority: high
```
- [ ] `warden sign <actor-name> --pubkey <path>` — sign existing pubkey; write cert to
stdout (the `cert_command` interface for ops-bridge)
- [ ] `warden issue <actor-name>` — generate keypair + sign; output JSON with
`privkey`, `cert`, `valid_before`, `identity`
- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
remaining; `--all` flag to show all actors in state dir
- [ ] `warden scorecard` — run §5 checks (see T7)
- [ ] `warden inventory <subcommand>` (list / add / remove)
### T7 — Scorecard runner
```task
id: WARDEN-WP-0001-T7
state_hub_task_id: 7818bcc5-f40e-4793-b117-d36f653ffeed
status: todo
priority: medium
```
- [ ] `scorecard.py`: implement each §5 row as a named check function returning
`CheckResult(name, passed, detail)`
- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
  - All certs in state dir respect TTL policy for their `ActorType`
  - No actor in inventory lacks a `principals` entry
  - Actor name prefix matches declared type
  - No cert expired by more than 5 min still present in state dir (stale cleanup)
- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
— those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
- [ ] `warden scorecard --json` for machine-readable output
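Two of the four local checks, written as named functions returning `CheckResult(name, passed, detail)`, could look like this. The check names, argument shapes, and the `summarize` helper are assumptions for the sketch; the 5-minute grace period is the one stated above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_prefix_matches_type(actor_types: dict[str, str]) -> CheckResult:
    """actor_types maps actor name to declared type, e.g. {'agt-test': 'agt'}."""
    bad = sorted(n for n, t in actor_types.items() if not n.startswith(f"{t}-"))
    detail = f"mismatched: {', '.join(bad)}" if bad else "ok"
    return CheckResult("prefix-matches-type", not bad, detail)


def check_no_stale_certs(
    valid_before_by_actor: dict[str, datetime],
    now: datetime,
    grace: timedelta = timedelta(minutes=5),
) -> CheckResult:
    """Fail if any cert in the state dir expired more than `grace` ago."""
    stale = sorted(
        name
        for name, valid_before in valid_before_by_actor.items()
        if now - valid_before > grace
    )
    detail = f"stale: {', '.join(stale)}" if stale else "ok"
    return CheckResult("no-stale-certs", not stale, detail)


def summarize(results: list[CheckResult]) -> str:
    """Render the 'passed/total' score the CLI reports."""
    passed = sum(1 for r in results if r.passed)
    return f"{passed}/{len(results)}"
```

Passing `now` in explicitly keeps the checks deterministic under test, and a list of `CheckResult` objects converts directly to the `--json` output via `dataclasses.asdict`.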
### T8 — ops-ssh-wrapper script
```task
id: WARDEN-WP-0001-T8
state_hub_task_id: e9c28152-5785-4995-83a5-439985ed3db9
status: todo
priority: medium
```
- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
  - Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
  - Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
  - Loads cert via `ssh-add`; execs the given command
- [ ] Install as part of `uv tool install` entry points
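The first two wrapper steps (reading the environment and forming the `warden sign` call) can be sketched as a pure function. This is not the §4.1 snippet itself, which is not reproduced in this workplan; `sign_argv_from_env` is a hypothetical name, and the `ssh-add` and exec steps are omitted.

```python
import os
from collections.abc import Mapping
from typing import Optional


def sign_argv_from_env(env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Build the `warden sign` argv from WARDEN_ACTOR and SSH_PUBKEY."""
    env = os.environ if env is None else env
    missing = [name for name in ("WARDEN_ACTOR", "SSH_PUBKEY") if not env.get(name)]
    if missing:
        raise ValueError(
            f"missing required environment variable(s): {', '.join(missing)}"
        )
    # A list argv avoids shell quoting issues with unusual actor names or paths.
    return ["warden", "sign", env["WARDEN_ACTOR"], "--pubkey", env["SSH_PUBKEY"]]
```

Failing early on an unset variable gives a clearer error than letting `warden sign` run with an empty actor name.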
### T9 — Tests
```task
id: WARDEN-WP-0001-T9
state_hub_task_id: 950139ab-cc17-4f1d-9a17-d5744e402ddf
status: todo
priority: high
```
- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
- [ ] Unit tests for inventory YAML round-trip
- [ ] Unit tests for actor name prefix validation
- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
- [ ] Scorecard unit tests (mock cert records)
### T10 — Documentation
```task
id: WARDEN-WP-0001-T10
state_hub_task_id: 271d6759-e359-41ce-80e4-76c574634a87
status: todo
priority: medium
```
- [ ] `SCOPE.md` (see below)
- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)
---
## SCOPE.md Template
```
# SCOPE
## One-liner
SSH Certificate Authority and credential issuance for the ops fleet —
signs short-lived certs for adm/agt/atm actors; provides the cert_command
interface consumed by ops-bridge and other tooling.
## Core Idea
Implements AccessManagementDirective §§1–5. Owns the CA key, actor inventory,
signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
or SSH key generation for humans.
## In Scope
- Local CA backend (ssh-keygen -s) for lab / non-Vault use
- Vault SSH engine backend for production
- Actor identity registry (inventory.yaml)
- cert_command CLI interface: `warden sign <actor> --pubkey <path>`
- TTL policy enforcement per ActorType (adm/agt/atm)
- Certificate status and stale-cert cleanup
- Scorecard checks (local / cert-side only)
- ops-ssh-wrapper script for agt/atm startup automation
## Out of Scope
- Host-side principal deployment (railiance-infra Ansible)
- SSH key generation for human admins (self-service: ssh-keygen)
- Vault cluster setup / HA
- Session recording, audit forwarding to SIEM (host-side)
- Tunnel lifecycle (ops-bridge)
- SSO / Teleport (trigger when §6.2 scale thresholds are hit)
## Relevant When
- Issuing or refreshing a cert for any adm/agt/atm actor
- Checking cert validity / scorecard compliance
- ops-bridge needs cert_command to be defined
- Adding a new actor to the principals inventory
## Not Relevant When
- Managing tunnel lifecycle (ops-bridge)
- Deploying SSH config to hosts (railiance-infra)
- All access is via static keys with no TTL (legacy mode)
## Current State
Status: planned (WARDEN-WP-0001 not yet started)
## Related Repositories
- ops-bridge — primary consumer of cert_command interface
- railiance-infra — owns host-side principal deployment
- the-custodian/state-hub — registers domain/workstreams
```
---
## Acceptance Criteria
- [ ] `warden sign agt-test --pubkey /tmp/test.pub` → valid cert on stdout (local backend)
- [ ] `warden status agt-test` → identity, principals, time-to-expiry shown correctly
- [ ] `warden scorecard` → 4/4 on clean inventory
- [ ] `warden sign` works as `cert_command` in ops-bridge tunnel config
- [ ] All unit tests pass: `uv run pytest`
- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
- [ ] `warden scorecard` returns 4/4 on a clean test inventory
- [ ] `warden sign` called from ops-bridge `cert_command` in an integration test tunnel
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`
- [ ] No secrets (CA private key, certs) committed to repo