ops-bridge/workplans/WARDEN-WP-0001-initial-implementation.md
tegwick 22601ef3e6 chore(workplans): sync BRIDGE-WP-0004 and WARDEN-WP-0001 tasks to state hub
Both workplans had been registered as active workstreams but tasks were
never ingested — the markdown checkbox format was invisible to the
consistency checker, which requires task code blocks. Activated both
workplans (draft→active) and added task blocks with state_hub_task_id
for all 19 tasks (9 + 10).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-05-15 00:29:51 +02:00

---
id: WARDEN-WP-0001
type: workplan
title: "OpsWarden Initial Implementation"
domain: custodian
repo: ops-warden
status: active
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
state_hub_workstream_id: "c3118cc6-adfb-428c-a9c6-edd0ee152ae6"
---
# WARDEN-WP-0001 — OpsWarden Initial Implementation
> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist.
> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the
> first commit action.
**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that
implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`.
**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
(those live in `railiance-infra`), session recording, and SSO integration (to be revisited
when the scale thresholds in §6.2 of the directive are reached).
---
## Goal
Create a new `ops-warden` repository that owns **credential issuance only** — the CA,
certificate signing, actor identity registry, and scorecard tooling. Its sole public surface
to sibling repos is a well-defined `cert_command` interface that any tool (principally
`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |
---
## Architecture
```
ops-warden/
├── SCOPE.md
├── CLAUDE.md
├── pyproject.toml
├── src/warden/
│   ├── cli.py        # Typer CLI: sign / issue / status / inventory / scorecard
│   ├── models.py     # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
│   ├── ca.py         # LocalCA backend (file-based, for dev / non-Vault)
│   ├── vault.py      # VaultCA backend (Vault SSH engine, for production)
│   ├── inventory.py  # YAML principals inventory read/write
│   ├── scorecard.py  # §5 compliance checks
│   └── config.py     # ~/.config/warden/warden.yaml loader
├── tests/
└── wiki/             # (symlink or copy of AccessManagementDirective.md)
```
**Backends are swappable.** Config key `backend: local | vault` selects which CA
implementation is used. This means the tool is fully functional without Vault for local lab
use, and production-grade with Vault — the same CLI surface, the same `cert_command`
interface, the same principals inventory format.
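A minimal sketch of how the `backend` key could select an implementation. The `CABackend` protocol, the simplified `sign` signature, and the `select_backend` helper are illustrative assumptions, not names defined by this workplan; the real backends take a `CertSpec` and return a `CertRecord` (see T3 and T4).

```python
from dataclasses import dataclass
from typing import Protocol


class CABackend(Protocol):
    """Hypothetical common surface shared by both backends."""

    def sign(self, actor_name: str, pubkey_path: str) -> str: ...


@dataclass
class LocalCA:
    ca_key: str  # path to the CA private key (config field `ca_key`)

    def sign(self, actor_name: str, pubkey_path: str) -> str:
        raise NotImplementedError("signing is covered by T3")


@dataclass
class VaultCA:
    vault_addr: str  # config field `vault_addr`

    def sign(self, actor_name: str, pubkey_path: str) -> str:
        raise NotImplementedError("signing is covered by T4")


def select_backend(config: dict) -> CABackend:
    """Pick the CA implementation named by the `backend` config key."""
    backend = config.get("backend")
    if backend == "local":
        return LocalCA(ca_key=config["ca_key"])
    if backend == "vault":
        return VaultCA(vault_addr=config["vault_addr"])
    raise ValueError(f"unknown backend {backend!r}; expected 'local' or 'vault'")
```

Because callers only see the shared `sign` surface, the CLI code path stays identical whichever backend is configured.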
**cert_command interface contract:**
```
warden sign <actor-name> --pubkey <path>
```
Writes the signed certificate to stdout (the cert text). Exits non-zero on failure.
`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.
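From the caller's side, the contract above might be consumed roughly as follows. This is a sketch under assumptions: the helper names `build_sign_argv` and `fetch_cert` and the timeout value are hypothetical, and it presumes `warden` is on `PATH`.

```python
import subprocess


def build_sign_argv(actor_name: str, pubkey_path: str) -> list[str]:
    """Return the argv for the documented `warden sign` invocation."""
    return ["warden", "sign", actor_name, "--pubkey", pubkey_path]


def fetch_cert(actor_name: str, pubkey_path: str, timeout: float = 30.0) -> str:
    """Run `warden sign` and return the certificate text read from stdout.

    A non-zero exit status is treated as failure, as the contract states.
    """
    result = subprocess.run(
        build_sign_argv(actor_name, pubkey_path),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"warden sign failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout.strip()
```

Keeping the argv construction separate from the subprocess call makes the contract testable without a CA being present.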
---
## Stack
- **Language:** Python 3.11+
- **CLI framework:** Typer
- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
- **Vault SDK:** `hvac` (optional; only required for vault backend)
- **Packaging:** `uv tool install`
---
## Tasks
### T1 — Repository bootstrap
```task
id: WARDEN-WP-0001-T1
state_hub_task_id: 6d643e9d-5e97-4224-9d82-87267b5ba6bc
status: todo
priority: high
```
- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
`workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
- [ ] Write `SCOPE.md` (see template in §SCOPE below)
- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
- [ ] Register repo with state-hub (`register_repo`)
- [ ] Create state-hub workstream for this workplan
### T2 — Models and config
```task
id: WARDEN-WP-0001-T2
state_hub_task_id: c66fc65a-0b16-4ba2-9e70-a83d875572ec
status: todo
priority: high
```
- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
`ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
`inventory_path`, `state_dir`
- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)
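The model and validation items above could be sketched like this. Field names follow the task list; the use of frozen dataclasses and the helper name `actor_type_from_name` are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


@dataclass(frozen=True)
class CertSpec:
    actor_name: str
    pubkey_path: str
    ttl_hours: int
    principals: tuple[str, ...]


def actor_type_from_name(actor_name: str) -> ActorType:
    """Derive the ActorType from an `adm-*`, `agt-*`, or `atm-*` actor name."""
    prefix, sep, rest = actor_name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {actor_name!r} must look like '<type>-<name>'")
    try:
        return ActorType(prefix)
    except ValueError:
        raise ValueError(
            f"unknown actor prefix {prefix!r} in {actor_name!r}"
        ) from None
```

For example, `actor_type_from_name("agt-state-hub-bridge")` yields `ActorType.AGT`, while a name such as `root-shell` is rejected.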
### T3 — LocalCA backend
```task
id: WARDEN-WP-0001-T3
state_hub_task_id: a5a41e58-1c6d-42a9-9b11-2088f17c29b5
status: todo
priority: high
```
- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
  - Calls `ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>`
  - Parses `ssh-keygen -L -f <cert>` output to extract `Valid before`, `Key ID`,
    `Principals`
  - Returns `CertRecord`; writes cert to `~/.local/state/warden/<actor>.cert.pub`
- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
(overridable per actor in inventory)
- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
actors that do not bring their own key
### T4 — VaultCA backend
```task
id: WARDEN-WP-0001-T4
state_hub_task_id: b2067ee6-c9ce-423b-9d60-0d28069fb304
status: todo
priority: medium
```
- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
  - `POST /v1/ssh/sign/<role>` with `public_key`, `valid_principals`, `ttl`
  - Parse response `signed_key` field; write to state dir; extract metadata via
    `ssh-keygen -L`
- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)
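The request and response handling for the Vault backend might look roughly like this. The helper names are hypothetical, the HTTP call itself (for example via `httpx`, with a Vault token header) is deliberately left out, and the sketch assumes the signed key sits under the `data` object of the JSON response.

```python
def build_vault_sign_request(
    vault_addr: str,
    role: str,
    public_key: str,
    principals: list[str],
    ttl_hours: int,
) -> tuple[str, dict[str, str]]:
    """Return the URL and JSON payload for `POST /v1/ssh/sign/<role>`."""
    url = f"{vault_addr.rstrip('/')}/v1/ssh/sign/{role}"
    payload = {
        "public_key": public_key,
        "valid_principals": ",".join(principals),
        "ttl": f"{ttl_hours}h",
    }
    return url, payload


def extract_signed_key(response_json: dict) -> str:
    """Pull `signed_key` out of the response body (assumed to be under `data`)."""
    try:
        return response_json["data"]["signed_key"]
    except (KeyError, TypeError):
        raise ValueError("Vault response did not contain data.signed_key") from None


def role_for_actor(actor_name: str, vault_role_map: dict[str, str]) -> str:
    """Look up the Vault role for an actor via its type prefix."""
    return vault_role_map[actor_name.split("-", 1)[0]]
```

With the role map from the config, `role_for_actor("agt-state-hub-bridge", {"adm": "adm-role", "agt": "agt-role", "atm": "atm-role"})` resolves to `agt-role`.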
### T5 — Principals inventory
```task
id: WARDEN-WP-0001-T5
state_hub_task_id: 6d13f8cd-1850-44c9-b769-b21250348319
status: todo
priority: high
```
- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
```yaml
actors:
  agt-state-hub-bridge:
    type: agt
    principals: [agt-task-bridge]
    ttl_hours: 24
    description: "ops-bridge tunnel actor"
hosts:
  coulombcore:
    allowed_principals:
      agt: [agt-task-bridge]
      atm: [atm-backup-daily]
```
- [ ] `warden inventory list` — print table
- [ ] `warden inventory add <actor-name> --type <adm|agt|atm> --principals <...>`
- [ ] `warden inventory remove <actor-name>`
### T6 — CLI commands
```task
id: WARDEN-WP-0001-T6
state_hub_task_id: 656a4615-92bb-4b5d-9406-e86d24fa15d0
status: todo
priority: high
```
- [ ] `warden sign <actor-name> --pubkey <path>` — sign existing pubkey; write cert to
stdout (the `cert_command` interface for ops-bridge)
- [ ] `warden issue <actor-name>` — generate keypair + sign; output JSON with
`privkey`, `cert`, `valid_before`, `identity`
- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
remaining; `--all` flag to show all actors in state dir
- [ ] `warden scorecard` — run §5 checks (see T7)
- [ ] `warden inventory <subcommand>` (list / add / remove)
### T7 — Scorecard runner
```task
id: WARDEN-WP-0001-T7
state_hub_task_id: 7818bcc5-f40e-4793-b117-d36f653ffeed
status: todo
priority: medium
```
- [ ] `scorecard.py`: implement each §5 row as a named check function returning
`CheckResult(name, passed, detail)`
- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
  - All certs in state dir respect TTL policy for their `ActorType`
  - No actor in inventory lacks a `principals` entry
  - Actor name prefix matches declared type
  - No cert expired by more than 5 min still present in state dir (stale cleanup)
- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
— those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
- [ ] `warden scorecard --json` for machine-readable output
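A sketch of how three of the local checks could be expressed as named functions returning `CheckResult`. The check names, the shape of the `actors` mapping (the parsed `actors` section of the inventory), and the `to_json` helper are assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta

STALE_GRACE = timedelta(minutes=5)  # grace period named in the stale-cert check
ACTOR_TYPES = ("adm", "agt", "atm")


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_principals_present(actors: dict[str, dict]) -> CheckResult:
    """Fail if any actor has no (or an empty) `principals` entry."""
    missing = sorted(n for n, e in actors.items() if not e.get("principals"))
    detail = f"missing: {', '.join(missing)}" if missing else "all actors declare principals"
    return CheckResult("principals-present", not missing, detail)


def check_prefix_matches_type(actors: dict[str, dict]) -> CheckResult:
    """Fail if an actor's name prefix differs from its declared type."""
    bad = sorted(
        n for n, e in actors.items()
        if e.get("type") not in ACTOR_TYPES or not n.startswith(f"{e['type']}-")
    )
    detail = f"mismatched: {', '.join(bad)}" if bad else "all prefixes match"
    return CheckResult("prefix-matches-type", not bad, detail)


def check_no_stale_certs(valid_before: dict[str, datetime], now: datetime) -> CheckResult:
    """Fail if any cert expired more than STALE_GRACE before `now`."""
    stale = sorted(n for n, vb in valid_before.items() if now - vb > STALE_GRACE)
    detail = f"stale: {', '.join(stale)}" if stale else "no stale certs"
    return CheckResult("no-stale-certs", not stale, detail)


def to_json(results: list[CheckResult]) -> str:
    """Serialize results for a machine-readable `--json` style output."""
    return json.dumps([asdict(r) for r in results], indent=2)
```

Returning a uniform `CheckResult` keeps the table and JSON renderers independent of the individual checks.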
### T8 — ops-ssh-wrapper script
```task
id: WARDEN-WP-0001-T8
state_hub_task_id: e9c28152-5785-4995-83a5-439985ed3db9
status: todo
priority: medium
```
- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
  - Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
  - Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
  - Loads cert via `ssh-add`; execs the given command
- [ ] Install as part of `uv tool install` entry points
### T9 — Tests
```task
id: WARDEN-WP-0001-T9
state_hub_task_id: 950139ab-cc17-4f1d-9a17-d5744e402ddf
status: todo
priority: high
```
- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
- [ ] Unit tests for inventory YAML round-trip
- [ ] Unit tests for actor name prefix validation
- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
- [ ] Scorecard unit tests (mock cert records)
### T10 — Documentation
```task
id: WARDEN-WP-0001-T10
state_hub_task_id: 271d6759-e359-41ce-80e4-76c574634a87
status: todo
priority: medium
```
- [ ] `SCOPE.md` (see below)
- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)
---
## SCOPE.md Template
```
# SCOPE
## One-liner
SSH Certificate Authority and credential issuance for the ops fleet —
signs short-lived certs for adm/agt/atm actors; provides the cert_command
interface consumed by ops-bridge and other tooling.
## Core Idea
Implements AccessManagementDirective §§1-5. Owns the CA key, actor inventory,
signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
or SSH key generation for humans.
## In Scope
- Local CA backend (ssh-keygen -s) for lab / non-Vault use
- Vault SSH engine backend for production
- Actor identity registry (inventory.yaml)
- cert_command CLI interface: `warden sign <actor> --pubkey <path>`
- TTL policy enforcement per ActorType (adm/agt/atm)
- Certificate status and stale-cert cleanup
- Scorecard checks (local / cert-side only)
- ops-ssh-wrapper script for agt/atm startup automation
## Out of Scope
- Host-side principal deployment (railiance-infra Ansible)
- SSH key generation for human admins (self-service: ssh-keygen)
- Vault cluster setup / HA
- Session recording, audit forwarding to SIEM (host-side)
- Tunnel lifecycle (ops-bridge)
- SSO / Teleport (trigger when §6.2 scale thresholds are hit)
## Relevant When
- Issuing or refreshing a cert for any adm/agt/atm actor
- Checking cert validity / scorecard compliance
- ops-bridge needs cert_command to be defined
- Adding a new actor to the principals inventory
## Not Relevant When
- Managing tunnel lifecycle (ops-bridge)
- Deploying SSH config to hosts (railiance-infra)
- All access is via static keys with no TTL (legacy mode)
## Current State
Status: planned (WARDEN-WP-0001 not yet started)
## Related Repositories
- ops-bridge — primary consumer of cert_command interface
- railiance-infra — owns host-side principal deployment
- the-custodian/state-hub — registers domain/workstreams
```
---
## Acceptance Criteria
- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
- [ ] `warden scorecard` reports every in-scope check from T7 as passing on a clean test inventory
- [ ] `warden sign` called from ops-bridge `cert_command` in an integration test tunnel
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`