ops-bridge/workplans/WARDEN-WP-0001-initial-implementation.md
2026-04-25 17:06:05 +02:00

---
id: WARDEN-WP-0001
type: workplan
title: "OpsWarden Initial Implementation"
domain: custodian
repo: ops-warden
status: draft
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
state_hub_workstream_id: "c3118cc6-adfb-428c-a9c6-edd0ee152ae6"
---
# WARDEN-WP-0001 — OpsWarden Initial Implementation
> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist.
> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the
> first commit action.
**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that
implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`.
**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
(those live in `railiance-infra`), session recording, and SSO integration (trigger §6.2 of
the directive when scale requires it).
---
## Goal
Create a new `ops-warden` repository that owns **credential issuance only** — the CA,
certificate signing, actor identity registry, and scorecard tooling. Its sole public surface
to sibling repos is a well-defined `cert_command` interface that any tool (principally
`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |
---
## Architecture
```
ops-warden/
├── SCOPE.md
├── CLAUDE.md
├── pyproject.toml
├── src/warden/
│   ├── cli.py          # Typer CLI: sign / issue / status / inventory / scorecard
│   ├── models.py       # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
│   ├── ca.py           # LocalCA backend (file-based, for dev / non-Vault)
│   ├── vault.py        # VaultCA backend (Vault SSH engine, for production)
│   ├── inventory.py    # YAML principals inventory read/write
│   ├── scorecard.py    # §5 compliance checks
│   └── config.py       # ~/.config/warden/warden.yaml loader
├── tests/
└── wiki/               # (symlink or copy of AccessManagementDirective.md)
```
**Backends are swappable.** Config key `backend: local | vault` selects which CA
implementation is used. This means the tool is fully functional without Vault for local lab
use, and production-grade with Vault — the same CLI surface, the same `cert_command`
interface, the same principals inventory format.
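
A minimal sketch of how the `backend: local | vault` switch could be wired. The `CABackend` protocol, the `BACKENDS` table, and `select_backend` are hypothetical names introduced for illustration, and the two classes are stubs that stand in for the real signing logic:

```python
from typing import Protocol


class CABackend(Protocol):
    """Hypothetical interface shared by both CA backends."""

    def sign(self, actor_name: str, pubkey_path: str) -> str: ...


class LocalCA:
    """Stub standing in for the ssh-keygen based backend."""

    def sign(self, actor_name: str, pubkey_path: str) -> str:
        return f"local-cert-for:{actor_name}"


class VaultCA:
    """Stub standing in for the Vault SSH engine backend."""

    def sign(self, actor_name: str, pubkey_path: str) -> str:
        return f"vault-cert-for:{actor_name}"


# Maps the value of the `backend` config key to an implementation.
BACKENDS: dict[str, type] = {"local": LocalCA, "vault": VaultCA}


def select_backend(config: dict) -> CABackend:
    """Return the backend named by the `backend` config key."""
    name = config.get("backend")
    if name not in BACKENDS:
        raise ValueError(f"backend must be one of {sorted(BACKENDS)}, got {name!r}")
    return BACKENDS[name]()
```

Because the CLI only talks to the shared `sign` method, swapping backends would not change the `cert_command` surface.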
**cert_command interface contract:**
```
warden sign <actor-name> --pubkey <path>
```
Writes the signed certificate to stdout (the cert text). Exits non-zero on failure.
`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.
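
From the caller's side, the contract could be consumed roughly as follows. This is a sketch, not the ops-bridge implementation; `build_cert_command` and `run_cert_command` are hypothetical helper names:

```python
import subprocess


def build_cert_command(actor_name: str, pubkey_path: str) -> list[str]:
    """Argv for the contract: warden sign <actor-name> --pubkey <path>."""
    return ["warden", "sign", actor_name, "--pubkey", pubkey_path]


def run_cert_command(argv: list[str]) -> str:
    """Run a cert_command and return the certificate text read from stdout.

    A non-zero exit status is treated as failure, as the contract requires.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"cert_command failed ({result.returncode}): {result.stderr.strip()}"
        )
    return result.stdout.strip()
```

Keeping stdout reserved for the certificate text means any diagnostics from `warden sign` would need to go to stderr.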
---
## Stack
- **Language:** Python 3.11+
- **CLI framework:** Typer
- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
- **Vault SDK:** `hvac` (optional; only required for vault backend)
- **Packaging:** `uv tool install`
---
## Tasks
### T1 — Repository bootstrap
- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
`workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
- [ ] Write `SCOPE.md` (see the "SCOPE.md Template" section below)
- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
- [ ] Register repo with state-hub (`register_repo`)
- [ ] Create state-hub workstream for this workplan
### T2 — Models and config
- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
`ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
`inventory_path`, `state_dir`
- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)
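
The T2 models and the prefix rule might look like the sketch below. Field names follow the task list; the field types, the use of frozen dataclasses, and the helper `actor_type_from_name` are assumptions:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from pathlib import Path


class ActorType(str, Enum):
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


@dataclass(frozen=True)
class CertSpec:
    actor_name: str
    pubkey_path: Path
    ttl_hours: int
    principals: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class CertRecord:
    identity: str
    valid_before: datetime
    cert_path: Path
    signed_at: datetime


def actor_type_from_name(actor_name: str) -> ActorType:
    """Derive the ActorType from the adm-*/agt-*/atm-* name prefix."""
    prefix, sep, rest = actor_name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {actor_name!r} must look like <type>-<name>")
    try:
        return ActorType(prefix)
    except ValueError:
        raise ValueError(f"unknown actor prefix {prefix!r} in {actor_name!r}") from None
```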
### T3 — LocalCA backend
- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
  - Calls `ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>`
  - Parses `ssh-keygen -L -f <cert>` output to extract the end of the `Valid:` range
    (stored as `valid_before`), `Key ID`, and `Principals`
  - Returns `CertRecord`; writes cert to `~/.local/state/warden/<actor>.cert.pub`
- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
(overridable per actor in inventory)
- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
actors that do not bring their own key
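
The two `ssh-keygen` steps of `LocalCA.sign` can be sketched as an argv builder plus a parser. The function names are hypothetical, and the parser assumes the usual OpenSSH listing layout (`Key ID: "..."`, `Valid: from <start> to <end>`, principals indented under `Principals:`); it would need checking against the OpenSSH version actually deployed:

```python
import re
from datetime import datetime

# Default TTLs per ActorType, from the task list above.
DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


def build_sign_argv(ca_key: str, identity: str, principals: list[str],
                    ttl_hours: int, pubkey: str) -> list[str]:
    """Argv for: ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>."""
    return ["ssh-keygen", "-s", ca_key, "-I", identity,
            "-n", ",".join(principals), "-V", f"+{ttl_hours}h", pubkey]


def parse_cert_listing(text: str) -> dict:
    """Extract key ID, expiry, and principals from `ssh-keygen -L -f <cert>` output."""
    key_id = re.search(r'^\s*Key ID:\s*"(.*)"\s*$', text, re.MULTILINE)
    valid = re.search(r"^\s*Valid:\s*from\s+(\S+)\s+to\s+(\S+)\s*$", text, re.MULTILINE)

    principals: list[str] = []
    header_indent = None
    for line in text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if stripped.startswith("Principals:"):
            header_indent = indent
            continue
        if header_indent is not None:
            # Principal names are indented deeper than the "Principals:" header.
            if stripped and indent > header_indent:
                principals.append(stripped)
            else:
                break

    return {
        "key_id": key_id.group(1) if key_id else None,
        # None when the cert has no parseable end date (e.g. "Valid: forever").
        "valid_before": datetime.fromisoformat(valid.group(2)) if valid else None,
        "principals": principals,
    }
```

Note that `ssh-keygen -s` writes its output next to the public key as `<name>-cert.pub`, so a real `LocalCA.sign` would still have to copy that file into the state dir.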
### T4 — VaultCA backend
- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
  - `POST /v1/ssh/sign/<role>` with `public_key`, `valid_principals`, `ttl`
  - Parse response `signed_key` field; write to state dir; extract metadata via
    `ssh-keygen -L`
- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)
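
A sketch of the request and response handling for the Vault backend, kept free of any HTTP client so it stays self-contained. `build_sign_request` and `extract_signed_key` are hypothetical names; the sketch assumes the SSH engine is mounted at `ssh/` (as in the path above), that the token is sent in the `X-Vault-Token` header, and that `signed_key` sits under `data` in the JSON response:

```python
def build_sign_request(vault_addr: str, role: str, public_key: str,
                       principals: list[str], ttl_hours: int, token: str) -> dict:
    """Describe the POST /v1/ssh/sign/<role> call as url, headers, and JSON body."""
    return {
        "url": f"{vault_addr.rstrip('/')}/v1/ssh/sign/{role}",
        "headers": {"X-Vault-Token": token},
        "json": {
            "public_key": public_key,
            # Vault takes principals as one comma-separated string.
            "valid_principals": ",".join(principals),
            "ttl": f"{ttl_hours}h",
        },
    }


def extract_signed_key(response_body: dict) -> str:
    """Pull the certificate text out of a decoded sign response."""
    try:
        return response_body["data"]["signed_key"].strip()
    except (KeyError, TypeError, AttributeError):
        raise ValueError("Vault response has no data.signed_key field") from None
```

The returned dict could be passed to `httpx.post(req["url"], headers=req["headers"], json=req["json"])`, with connection errors mapped to the graceful message described above.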
### T5 — Principals inventory
- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
  ```yaml
  actors:
    agt-state-hub-bridge:
      type: agt
      principals: [agt-task-bridge]
      ttl_hours: 24
      description: "ops-bridge tunnel actor"
  hosts:
    coulombcore:
      allowed_principals:
        agt: [agt-task-bridge]
        atm: [atm-backup-daily]
  ```
- [ ] `warden inventory list` — print table
- [ ] `warden inventory add <actor-name> --type <adm|agt|atm> --principals <...>`
- [ ] `warden inventory remove <actor-name>`
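
The `add` / `remove` subcommands and the per-actor TTL override could operate on the parsed inventory as sketched here. The functions work on the plain dict that `yaml.safe_load` would return for the format above; their names and error handling are assumptions:

```python
# Default TTLs per ActorType (T3); an actor's `ttl_hours` overrides them.
DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


def add_actor(inventory: dict, name: str, actor_type: str,
              principals: list[str]) -> None:
    """Add an actor entry, enforcing the type and name-prefix rules."""
    if actor_type not in DEFAULT_TTL_HOURS:
        raise ValueError(f"unknown actor type {actor_type!r}")
    if not name.startswith(f"{actor_type}-"):
        raise ValueError(f"{name!r} does not carry the {actor_type}- prefix")
    actors = inventory.setdefault("actors", {})
    if name in actors:
        raise ValueError(f"actor {name!r} already exists")
    actors[name] = {"type": actor_type, "principals": list(principals)}


def remove_actor(inventory: dict, name: str) -> None:
    """Remove an actor entry; raises KeyError if it is not present."""
    del inventory.get("actors", {})[name]


def effective_ttl_hours(inventory: dict, name: str) -> int:
    """Per-actor `ttl_hours` if set, else the default for the actor's type."""
    actor = inventory["actors"][name]
    return actor.get("ttl_hours", DEFAULT_TTL_HOURS[actor["type"]])
```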
### T6 — CLI commands
- [ ] `warden sign <actor-name> --pubkey <path>` — sign existing pubkey; write cert to
stdout (the `cert_command` interface for ops-bridge)
- [ ] `warden issue <actor-name>` — generate keypair + sign; output JSON with
`privkey`, `cert`, `valid_before`, `identity`
- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
remaining; `--all` flag to show all actors in state dir
- [ ] `warden scorecard` — run §5 checks (see T7)
- [ ] `warden inventory <subcommand>` (list / add / remove)
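
For `warden status`, the "TTL remaining" column reduces to simple datetime arithmetic on `valid_before`. A small sketch with hypothetical helper names and an assumed `Nh MMm` display format:

```python
from datetime import datetime, timedelta


def ttl_remaining(valid_before: datetime, now: datetime) -> timedelta:
    """Time left until the cert expires, clamped at zero."""
    return max(valid_before - now, timedelta(0))


def format_remaining(remaining: timedelta) -> str:
    """Render a remaining TTL as e.g. '22h 30m', or 'expired'."""
    if remaining <= timedelta(0):
        return "expired"
    total_minutes = int(remaining.total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}h {minutes:02d}m"
```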
### T7 — Scorecard runner
- [ ] `scorecard.py`: implement each §5 row as a named check function returning
`CheckResult(name, passed, detail)`
- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
  - All certs in the state dir respect the TTL policy for their `ActorType`
  - Every actor in the inventory has a `principals` entry
  - Actor name prefix matches the declared type
  - No cert that expired more than 5 min ago remains in the state dir (stale cleanup)
- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
— those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
- [ ] `warden scorecard --json` for machine-readable output
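
Two of the local checks, written against the `CheckResult(name, passed, detail)` shape described above. This is a sketch: the check names, the input shapes, and the `summarize` helper are assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


# Certs expired for longer than this count as stale.
STALE_GRACE = timedelta(minutes=5)


def check_no_stale_certs(valid_before_by_actor: dict[str, datetime],
                         now: datetime) -> CheckResult:
    """Fail if any cert expired more than STALE_GRACE ago."""
    stale = sorted(actor for actor, valid_before in valid_before_by_actor.items()
                   if now - valid_before > STALE_GRACE)
    detail = f"stale: {', '.join(stale)}" if stale else "no stale certs in state dir"
    return CheckResult("no-stale-certs", not stale, detail)


def check_principals_present(actors: dict[str, dict]) -> CheckResult:
    """Fail if any inventory actor has a missing or empty principals list."""
    missing = sorted(name for name, actor in actors.items()
                     if not actor.get("principals"))
    detail = f"missing principals: {', '.join(missing)}" if missing else "all actors have principals"
    return CheckResult("principals-present", not missing, detail)


def summarize(results: list[CheckResult]) -> str:
    """Render the passed/total score line."""
    return f"{sum(r.passed for r in results)}/{len(results)}"
```

Since `CheckResult` is a dataclass, `dataclasses.asdict` would give a straightforward basis for the `--json` output.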
### T8 — ops-ssh-wrapper script
- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
  - Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
  - Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
  - Loads cert via `ssh-add`; execs the given command
- [ ] Install as part of `uv tool install` entry points
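
The wrapper flow can be sketched as below. This is not the §4.1 snippet; it assumes the private key sits next to the public key without the `.pub` suffix, and it relies on `ssh-add` picking up a certificate named `<private key>-cert.pub` alongside the key. `wrapper_plan` and `main` are hypothetical names:

```python
import os
import subprocess
from pathlib import Path


def wrapper_plan(env: dict[str, str], command: list[str]) -> dict:
    """Work out the sign, ssh-add, and exec steps from the environment."""
    actor = env.get("WARDEN_ACTOR")
    pubkey = env.get("SSH_PUBKEY")
    if not actor or not pubkey:
        raise ValueError("WARDEN_ACTOR and SSH_PUBKEY must both be set")
    if not command:
        raise ValueError("no command given to exec")

    pubkey_path = Path(pubkey)
    # Assumption: <key>.pub lives next to its private key <key>.
    privkey_path = pubkey_path.with_suffix("") if pubkey_path.suffix == ".pub" else pubkey_path
    # ssh-add also loads <private key>-cert.pub when it exists.
    cert_path = privkey_path.with_name(privkey_path.name + "-cert.pub")
    return {
        "sign_argv": ["warden", "sign", actor, "--pubkey", str(pubkey_path)],
        "cert_path": cert_path,
        "ssh_add_argv": ["ssh-add", str(privkey_path)],
        "exec_argv": list(command),
    }


def main(command: list[str]) -> None:
    """Sign, store the cert, load it into the agent, then exec the command."""
    plan = wrapper_plan(dict(os.environ), command)
    cert = subprocess.run(plan["sign_argv"], check=True,
                          capture_output=True, text=True).stdout
    plan["cert_path"].write_text(cert)
    subprocess.run(plan["ssh_add_argv"], check=True)
    os.execvp(plan["exec_argv"][0], plan["exec_argv"])
```

A script entry point would call `main(sys.argv[1:])`; separating the planning step from the side effects keeps the path logic unit-testable without an SSH agent.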
### T9 — Tests
- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
- [ ] Unit tests for inventory YAML round-trip
- [ ] Unit tests for actor name prefix validation
- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
- [ ] Scorecard unit tests (mock cert records)
### T10 — Documentation
- [ ] `SCOPE.md` (see below)
- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)
---
## SCOPE.md Template
```
# SCOPE
## One-liner
SSH Certificate Authority and credential issuance for the ops fleet —
signs short-lived certs for adm/agt/atm actors; provides the cert_command
interface consumed by ops-bridge and other tooling.
## Core Idea
Implements AccessManagementDirective §§1–5. Owns the CA key, actor inventory,
signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
or SSH key generation for humans.
## In Scope
- Local CA backend (ssh-keygen -s) for lab / non-Vault use
- Vault SSH engine backend for production
- Actor identity registry (inventory.yaml)
- cert_command CLI interface: `warden sign <actor> --pubkey <path>`
- TTL policy enforcement per ActorType (adm/agt/atm)
- Certificate status and stale-cert cleanup
- Scorecard checks (local / cert-side only)
- ops-ssh-wrapper script for agt/atm startup automation
## Out of Scope
- Host-side principal deployment (railiance-infra Ansible)
- SSH key generation for human admins (self-service: ssh-keygen)
- Vault cluster setup / HA
- Session recording, audit forwarding to SIEM (host-side)
- Tunnel lifecycle (ops-bridge)
- SSO / Teleport (trigger when §6.2 scale thresholds are hit)
## Relevant When
- Issuing or refreshing a cert for any adm/agt/atm actor
- Checking cert validity / scorecard compliance
- ops-bridge needs cert_command to be defined
- Adding a new actor to the principals inventory
## Not Relevant When
- Managing tunnel lifecycle (ops-bridge)
- Deploying SSH config to hosts (railiance-infra)
- All access is via static keys with no TTL (legacy mode)
## Current State
Status: planned (WARDEN-WP-0001 not yet started)
## Related Repositories
- ops-bridge — primary consumer of cert_command interface
- railiance-infra — owns host-side principal deployment
- the-custodian/state-hub — registers domain/workstreams
```
---
## Acceptance Criteria
- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
- [ ] `warden scorecard` passes every in-scope check (see T7) on a clean test inventory
- [ ] `warden sign` succeeds when called via ops-bridge `cert_command` in an integration test tunnel
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`