diff --git a/.claude/rules/architecture.md b/.claude/rules/architecture.md index e302e1d..74f3cc4 100644 --- a/.claude/rules/architecture.md +++ b/.claude/rules/architecture.md @@ -17,11 +17,18 @@ The catalog layout follows: `opscatalog/domains//{domain.yaml, targets/, bridges/, docs/}`. Key design constraints: -- OpsBridge owns lifecycle management only; it does not own identity/credentials +- OpsBridge owns lifecycle management only; it does not own credential issuance or CA + operations (those belong to `ops-warden`) - Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used in config, CLI args, and log filenames must stay consistent -- Actor attribution (human operator vs. automation agent) is tracked per bridge - for audit log traceability (FRS §5.7) +- Actor attribution is tracked per bridge using the three-actor vocabulary from the + AccessManagementDirective: `adm` (human), `agt` (LLM agent), `atm` (automation); + actor names must carry the matching prefix (`adm-*`, `agt-*`, `atm-*`) (FRS §5.7) +- Two credential modes are first-class and must remain independently functional: + 1. **Static key mode** (default) — `ssh_key` only; no TTL, no cert logic + 2. **cert_command mode** — a pluggable shell command that issues a CA-signed cert + before each SSH launch; TTL parsed from the cert; pre-emptive refresh ~5 min + before expiry; `cert_identity` logged in every `BRIDGE_CONNECTED` event Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS (`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`). diff --git a/SCOPE.md b/SCOPE.md index d9bdb37..9bb10db 100644 --- a/SCOPE.md +++ b/SCOPE.md @@ -8,7 +8,7 @@ ## One-liner -SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. 
+SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface. --- @@ -20,11 +20,17 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo ## In Scope -- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs`) +- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`) - Auto-reconnect with exponential backoff and configurable retry policy - Optional HTTP health checks (confirm forwarded service is actually reachable from remote) - Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.) -- Actor attribution: per-tunnel actor class (human / automation) for audit traceability +- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability, + with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`) +- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic, + works without any CA or external tooling +- **cert_command mode** (optional): pluggable shell command that issues a short-lived + CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh; + `cert_identity` recorded in audit log — satisfies AccessManagementDirective §5 - PID + state file management in `~/.local/state/bridge/` - MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools - OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges) @@ -33,7 +39,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. 
Remo ## Out of Scope -- Identity/credential management (uses existing SSH keys) +- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes + certs via the `cert_command` interface but never signs anything itself) +- SSH key generation for human admins (self-service: `ssh-keygen`) +- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra` - Long-running application hosting on remote machines (port-forward only, not deployment) - VPN or layer-3 connectivity - Monitoring/alerting beyond JSON audit logs @@ -44,9 +53,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo ## Relevant When - Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP -- Need audit trail of which actor (human vs. automation) started/stopped tunnels +- Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels - Setting up a new machine in the Railiance ecosystem that must phone home to the hub - Diagnosing connectivity issues between local hub and remote services +- Checking certificate validity for active tunnels (`bridge cert-status`) +- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials --- @@ -60,8 +71,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. 
Remo ## Current State -- Status: experimental → active (v0.1 core complete; OpsCatalog planned but not yet shipped) -- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health checks and audit logging complete; OpsCatalog framework present but not populated +- Status: active (v0.1 core complete; directive alignment in progress — BRIDGE-WP-0004) +- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health + checks and audit logging complete; OpsCatalog framework present but not populated; + cert_command / ActorType alignment not yet implemented - Stability: stable tunnel lifecycle; tested under network drops and SSH failures - Usage: running in lab for daily Railiance/Temporal connectivity @@ -77,17 +90,24 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo ## Terminology -- Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check +- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check, + cert_command, cert_identity +- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation) - Also known as: "the bridge" -- Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge +- Potentially confusing: "bridge state" is a tunnel-specific state machine + (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge +- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`) --- ## Related / Overlapping Repositories - `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it +- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via + `cert_command` when short-lived certificates are required - `activity-core` — Temporal server on remote reached via ops-bridge tunnel -- `railiance-cluster` / 
`railiance-infra` — remote hosts that need to phone home +- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns + host-side principal deployment (`/etc/ssh/auth_principals/`) --- @@ -105,5 +125,9 @@ keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge ## Getting Oriented - Start with: `README.txt` (architecture, config format, CLI commands, MCP integration) -- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), `~/.local/state/bridge/` (PID/state files) -- Entry points: `bridge --help`; `bridge up `; MCP: `bridge_status()` +- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), + `~/.local/state/bridge/` (PID/state/cert files) +- Entry points: `bridge --help`; `bridge up `; `bridge cert-status`; + MCP: `bridge_status()` +- AccessManagementDirective context: `wiki/AccessManagementDirective.md` +- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap) diff --git a/wiki/AccessManagementDirective.md b/wiki/AccessManagementDirective.md new file mode 100644 index 0000000..38cb8ed --- /dev/null +++ b/wiki/AccessManagementDirective.md @@ -0,0 +1,203 @@ +AccessManagementDirective + +*Practical host access control management * + +# AccessManagementDirective + +**Document Title:** SSH Access Management Directive +**Version:** 1.1 (Production-Ready Revision – Post-SWOT Improvements) +**Date:** 28 March 2026 +**Audience:** Operations Department +**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm). +**Author:** Grok (on behalf of the team) +**Status:** Official Directive – All ops personnel, agents, and automation pipelines MUST follow this. 
+**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations. + +## 0. Prerequisites + +Before bootstrapping, the following must be in place: +- Ansible (or equivalent config-management tool) with a central inventory. +- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled. +- GitOps repository containing the authoritative principals inventory. +- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent). +- At least two ops personnel trained on Vault SSH signing and Ansible playbooks. + +If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably. + +## 1. Concept Overview + +This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth. + +**Why this model?** +- A central CA signs short-lived certificates for every login. +- No more manual key copying, key sprawl, or painful revocation. +- Built-in expiration, role-based principals, and auditability. +- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts. +- Scales from 5 hosts to 500+ with almost zero per-host maintenance. + +**Core Principles** +- **Least privilege** – Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions. +- **Short-lived credentials** – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations). +- **One CA, many issuers** – A single offline User CA whose public key is trusted by every host. +- **Automation-first** – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault). 
+- **Separation of concerns** – + - **Admins (adm)**: Human operators (full interactive shell when needed). + - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks. + - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights. + +## 2. Actor Definitions & Access Model + +| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions | +|------------|-------------------|-------------|------------------------------|---------------------------| +| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` | +| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-`, limited to specific scripts/directories | +| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-`, `force-command=/usr/local/bin/atm-wrapper.sh` | + +**Certificate Naming Convention** +- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily` +- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts) + +**LLM-Agent Risk Clarification** +Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents. + +## 3. Bootstrapping the System (One-Time Setup) + +### 3.1. Create the CA (do this once, offline) +```bash +ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N "" +``` +- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation. +- Rotate the CA key itself every 2–3 years using the same bootstrap playbook. +- Public key: `ca_user.pub` + +### 3.2. 
Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`) +- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned). +- Update `/etc/ssh/sshd_config`: + ```bash + TrustedUserCAKeys /etc/ssh/ca/ca_user.pub + AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u + PubkeyAuthentication yes + PasswordAuthentication no + PermitRootLogin no + ``` +- Create principals directory and files from the central Git inventory. +- `systemctl restart sshd` + +### 3.3. Initial Admin Access +First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step. + +## 4. Automatic Management of Access Rights + +### 4.1. Daily / On-Demand Workflow +1. **Key/Certificate Issuance Pipeline** (GitOps + Vault) + - **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible. + - **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon). + - **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script. + +2. **Ansible-Driven Host Updates** (run hourly via CI/CD) + - `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git). + - Example inventory snippet: + ```yaml + hosts: + - name: prod-db-01 + allowed_principals: + adm: [adm-full] + agt: [agt-incident-resolver-v2] + atm: [atm-backup-daily, atm-logrotate] + ``` + +3. **Revocation & Rotation** + - Short expiry = automatic revocation. + - For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`). + - Agents/automations never store long-lived private keys on disk. + +4. 
**Concrete Agent & Automation Wrapper Example** (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`) + ```python + #!/usr/bin/env python3 + import os, subprocess, sys, tempfile + # Request short-lived cert from Vault + cert = subprocess.check_output(["vault", "write", "-field=signed_key", "ssh/sign/agt-role", f"public_key={os.environ['SSH_PUBKEY']}"]).decode().strip() + with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f: + f.write(cert.encode()) + cert_path = f.name + # Load into ssh-agent and exec the real command + subprocess.run(["ssh-add", cert_path]) + os.execvp(sys.argv[1], sys.argv[1:]) + ``` + Agents call this wrapper; it auto-refreshes the cert on every wake-up. + +### 4.2. Human UX Guidance +Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases. + +### 4.3. Emergency Break-Glass Procedure +In case of total lockout (CA offline, misconfigured Ansible push, etc.): +1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access). +2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion). +3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`. +4. After recovery, immediately rotate the CA and run a full scorecard. + +## 5. AccessManagement Scorecard (Checklist) + +Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail. 
+ +| Category | Check | Target | Tool | +|----------|-------|--------|------| +| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` | +| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` | +| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff | +| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f *.pub` | +| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` | +| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` | +| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace | +| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query | +| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log | +| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run | +| **Score** | ≥ 10/10 = **Operational** | - | - | + +**Scorecard Execution Command** (run from ops laptop): +```bash +ansible all -m command -a "ssh-access-scorecard.sh" --become +``` + +## 6. Scope & Operational Boundaries + +### 6.1. When Bootstrapping Is Officially Closed +The system is **fully operational** when **ALL** of the following are true: +- Scorecard passes 10/10 on every host. +- Central Git repo contains the authoritative principals inventory. +- First three admins have successfully used signed certificates for 7 consecutive days. +- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate. +- CI/CD pipeline for host config updates is green and runs hourly. +- Emergency break-glass procedure has been tested once. 
+ +**Declaration:** Ops Lead signs off with date in the Git commit message. + +### 6.2. Scope Boundary – When to Switch to Sophisticated Tooling +Stay with **native OpenSSH CA + Ansible + Vault** while: +- ≤ 200 hosts +- ≤ 50 distinct agent/automation identities +- No regulatory requirement for SSO or full session recording + +**Switch triggers** (any one): +- > 200 hosts OR rapid daily growth +- Need for human SSO (Okta/Google) integration +- Requirement for audited web-based SSH sessions or just-in-time access approval +- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot) +- Audit/compliance demands central policy engine or session recording + +**Recommended next-level tools** (in order): +1. **Teleport** – Best for mixed human + agent workloads (SSO + Machine ID). +2. **HashiCorp Vault SSH + Boundary** – When you already use Vault heavily. +3. **step-ca + smallstep** – If you prefer a pure open-source CA with OIDC. + +**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users. + +## 7. Enforcement & Review +- **Quarterly review** of this directive and scorecard results. +- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket. +- **Questions / improvements** → create PR against this file in the ops repo. + +**End of Document** +Approved for immediate use across all production and staging environments. 
+ +xxx diff --git a/workplans/BRIDGE-WP-0004-directive-alignment.md b/workplans/BRIDGE-WP-0004-directive-alignment.md new file mode 100644 index 0000000..f0e4f06 --- /dev/null +++ b/workplans/BRIDGE-WP-0004-directive-alignment.md @@ -0,0 +1,272 @@ +--- +id: BRIDGE-WP-0004 +type: workplan +title: "AccessManagementDirective Alignment" +domain: custodian +repo: ops-bridge +status: draft +owner: Bernd +topic_slug: custodian +created: "2026-03-28" +updated: "2026-03-28" +--- + +# BRIDGE-WP-0004 — AccessManagementDirective Alignment + +**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model, +optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while +preserving full backward compatibility with the existing static-key mode. + +**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal +deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002). + +--- + +## Goal + +After this workplan: + +1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys. +2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible + `cert_command`) — cert acquisition, cert rotation, and cert identity logging are all + handled transparently by the tunnel manager. +3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from + the directive, with config validation that enforces naming conventions. +4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's + §5 SIEM traceability requirement. 
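
Goal 3 (three-actor vocabulary with enforced naming) can be pictured with a small sketch. This is an illustration under assumptions, not the planned `config.py`: the helper names `parse_actor_type` and `validate_actor_name` are hypothetical, while the `adm` / `agt` / `atm` values, the legacy `human` / `automation` mapping, and the `ConfigError` name are taken from this workplan.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


class ConfigError(ValueError):
    """Raised for invalid tunnel/actor configuration (name taken from the workplan)."""


# Legacy values accepted only as a migration aid.
_LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    """Hypothetical helper: map a config string to an ActorType."""
    if raw in _LEGACY_CLASSES:
        canonical = _LEGACY_CLASSES[raw]
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {canonical.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return canonical
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor type: {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Hypothetical helper: require the prefix that matches the actor type."""
    prefix = f"{actor_type.value}-"
    # A bare prefix such as "agt-" is rejected as well: it names no actor.
    if not name.startswith(prefix) or len(name) == len(prefix):
        raise ConfigError(
            f"actor name {name!r} must start with {prefix!r} "
            f"for actor type {actor_type.value!r}"
        )
```

Under this sketch, `validate_actor_name("agt-state-hub-bridge", ActorType.AGT)` passes silently, while the pre-migration name `automation-agent` would raise `ConfigError` once its class is mapped to `atm`.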
+ +--- + +## Reference Documents + +| Document | Location | +|---|---| +| AccessManagementDirective | `wiki/AccessManagementDirective.md` | +| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` | +| PRD | `wiki/OpsBridgePrd.md` | +| FRS | `wiki/OpsBridgeFrs.md` | + +--- + +## Design Decisions + +### Static key mode stays first-class + +If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today: +`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are +explicitly supported for: +- Lab/dev environments without a CA +- Tunnels owned by `adm`-class humans who manage their own cert refresh externally +- Environments below the directive's complexity threshold + +### cert_command interface + +```yaml +# tunnels.yaml — optional cert_command field +tunnels: + state-hub-coulombcore: + host: coulombcore + remote_port: 8001 + local_port: 8000 + ssh_user: agt-state-hub-bridge + ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519 # private key (always required) + actor: agt-state-hub-bridge + cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub" +``` + +When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch, +captures stdout as the cert text, writes it to a tempfile in the state dir, and adds +`-i ` alongside `-i ` to the SSH command. The cert file is cleaned up +on tunnel stop. + +`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes +`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface +dependency-free — no Vault SDK, no warden import needed inside ops-bridge. + +### TTL-aware cert refresh + +After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to +determine `cert_expires_at`. It schedules a pre-emptive cert refresh +(`cert_expires_at - 5 min`) inside the health-check/wait loop. 
When the refresh timer +fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth +failure, no reconnect backoff triggered. + +If `cert_command` is absent, no TTL logic runs. + +### Actor type model + +`actor_class: str # "human" | "automation"` is replaced by: + +```python +class ActorType(str, Enum): + ADM = "adm" # human operator + AGT = "agt" # LLM-powered autonomous agent + ATM = "atm" # deterministic script / pipeline +``` + +Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`. +The mapping is a one-way migration aid with a deprecation warning; new configs must use the +canonical values. + +Config validation: if `actor` name is set, it must start with the prefix matching its type +(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for +SIEM auditability. + +--- + +## Tasks + +### T1 — ActorType enum +- [ ] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType` +- [ ] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` → + `ActorType.ATM` with a `DeprecationWarning`; reject unknown values +- [ ] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT, + `atm-*` for ATM; raise `ConfigError` on mismatch +- [ ] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value` +- [ ] Update tests + +### T2 — cert_command config field +- [ ] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig` +- [ ] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string + content (shell-level freedom intentional) +- [ ] Document in config example / SCOPE.md + +### T3 — Cert acquisition in manager +- [ ] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]` + - If `cfg.cert_command` is None: return None (static key mode) + - Run `cert_command` via `subprocess.run(shell=True, capture_output=True)` + - Write stdout to 
`~/.local/state/bridge/-cert.pub` (overwrite each time) + - Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr +- [ ] `build_ssh_command`: accept optional `cert_path`; when set, insert + `-i ` after `-i ` (OpenSSH loads both automatically) +- [ ] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup) + so every reconnect gets a fresh cert + +### T4 — cert_identity in audit log +- [ ] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f ` output to + extract `Key ID` (the `-I` value from signing time) +- [ ] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in + JSON entry when present +- [ ] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events +- [ ] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events + +### T5 — TTL-aware cert refresh +- [ ] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp + from `ssh-keygen -L` output → `cert_expires_at: datetime` +- [ ] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)` + on each iteration +- [ ] When refresh is due: call `proc.terminate()`, break inner loop, let the outer + reconnect loop restart naturally (T3 will re-acquire the cert at the top of the + next iteration) +- [ ] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to + `AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field +- [ ] If `cert_command` is absent, skip all TTL logic entirely + +### T6 — `bridge cert-status` command +- [ ] `cli.py`: add `cert-status [TUNNEL]` subcommand +- [ ] For each tunnel (or the named one): read cert file from state dir if present, + run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until, + time-to-expiry (or "static key / no cert" if absent) +- [ ] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable) +- [ ] `--json` flag for 
machine-readable output + +### T7 — CertAcquisitionError handling +- [ ] New exception `CertAcquisitionError` in `models.py` +- [ ] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED` + with `detail="cert acquisition failed: "`, apply normal backoff and retry + (cert failures are transient — e.g., Vault briefly unreachable) +- [ ] After `max_attempts` consecutive cert failures, transition to `FAILED` state + +### T8 — SCOPE.md and documentation updates +- [ ] Update `SCOPE.md`: replace "Identity/credential management (uses existing SSH keys)" + with the pluggable cert_command model; add ops-warden as related repo; update + actor terminology to adm/agt/atm; update Current State +- [ ] Update `wiki/OpsBridgeFrs.md` §5.7 (actor attribution): note three-actor model, + cert_identity field, cert_command interface +- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency +- [ ] Update config example in README / `wiki/` to show both static and cert_command modes +- [ ] Update `.claude/rules/architecture.md`: add cert lifecycle to architecture description + +### T9 — Tests +- [ ] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping; + cert_command parse +- [ ] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH + args; verify `CertAcquisitionError` on non-zero exit +- [ ] `test_manager.py`: TTL logic — mock `cert_expires_at` in past; verify refresh triggers +- [ ] `test_audit.py`: `cert_identity` field present in CONNECTED event when cert was used; + absent in static-key mode +- [ ] `test_cli.py`: `cert-status` exit codes; JSON output shape + +--- + +## Config Schema — Before / After + +### Before +```yaml +tunnels: + state-hub-coulombcore: + host: coulombcore + remote_port: 8001 + local_port: 8000 + ssh_user: ops-agent + ssh_key: ~/.ssh/id_ed25519 + actor: automation-agent + +actors: + automation-agent: + class: automation + description: "state hub 
bridge agent" +``` + +### After (static key mode — unchanged behavior) +```yaml +tunnels: + state-hub-coulombcore: + host: coulombcore + remote_port: 8001 + local_port: 8000 + ssh_user: agt-state-hub-bridge + ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519 + actor: agt-state-hub-bridge + +actors: + agt-state-hub-bridge: + class: agt + description: "state hub bridge agent" +``` + +### After (cert_command mode — ops-warden or any CA) +```yaml +tunnels: + state-hub-coulombcore: + host: coulombcore + remote_port: 8001 + local_port: 8000 + ssh_user: agt-state-hub-bridge + ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519 + actor: agt-state-hub-bridge + cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub" + +actors: + agt-state-hub-bridge: + class: agt + description: "state hub bridge agent" +``` + +--- + +## Acceptance Criteria + +- [ ] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation + warning only); tunnel behaves identically +- [ ] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError` +- [ ] Config with `cert_command` set: SSH process launched with both `-i key` and + `-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event +- [ ] Config without `cert_command`: no cert file written; `cert_identity` absent in audit; + no TTL logic runs +- [ ] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED` + logged with stderr detail; eventually reaches `FAILED` after `max_attempts` +- [ ] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged +- [ ] `bridge cert-status` shows valid cert info; exits 1 on expired cert +- [ ] All tests pass: `uv run pytest` +- [ ] All lints pass: `uv run ruff check .` diff --git a/workplans/WARDEN-WP-0001-initial-implementation.md b/workplans/WARDEN-WP-0001-initial-implementation.md new file mode 100644 index 0000000..75e9513 --- /dev/null +++ 
b/workplans/WARDEN-WP-0001-initial-implementation.md @@ -0,0 +1,252 @@ +--- +id: WARDEN-WP-0001 +type: workplan +title: "OpsWarden Initial Implementation" +domain: custodian +repo: ops-warden +status: draft +owner: Bernd +topic_slug: custodian +created: "2026-03-28" +updated: "2026-03-28" +--- + +# WARDEN-WP-0001 — OpsWarden Initial Implementation + +> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist. +> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the +> first commit action. + +**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that +implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`. + +**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment +(those live in `railiance-infra`), session recording, and SSO integration (trigger §6.2 of +the directive when scale requires it). + +--- + +## Goal + +Create a new `ops-warden` repository that owns **credential issuance only** — the CA, +certificate signing, actor identity registry, and scorecard tooling. Its sole public surface +to sibling repos is a well-defined `cert_command` interface that any tool (principally +`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor. 
+
+---
+
+## Reference Documents
+
+| Document | Location |
+|---|---|
+| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
+| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |
+
+---
+
+## Architecture
+
+```
+ops-warden/
+├── SCOPE.md
+├── CLAUDE.md
+├── pyproject.toml
+├── src/warden/
+│   ├── cli.py          # Typer CLI: sign / issue / status / inventory / scorecard
+│   ├── models.py       # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
+│   ├── ca.py           # LocalCA backend (file-based, for dev / non-Vault)
+│   ├── vault.py        # VaultCA backend (Vault SSH engine, for production)
+│   ├── inventory.py    # YAML principals inventory read/write
+│   ├── scorecard.py    # §5 compliance checks
+│   └── config.py       # ~/.config/warden/warden.yaml loader
+├── tests/
+└── wiki/               # (symlink or copy of AccessManagementDirective.md)
+```
+
+**Backends are swappable.** Config key `backend: local | vault` selects which CA
+implementation is used. This means the tool is fully functional without Vault for local lab
+use, and production-grade with Vault — the same CLI surface, the same `cert_command`
+interface, the same principals inventory format.
+
+**cert_command interface contract:**
+```
+warden sign --pubkey
+```
+Writes the signed certificate to stdout (the cert text). Exits non-zero on failure.
+`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.
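The interface contract above (cert text on stdout, non-zero exit on failure) could be consumed by a caller roughly as follows. This is a minimal sketch, not the ops-bridge implementation: `run_cert_command`, `CertCommandError`, and the 30-second timeout are hypothetical names and values chosen for illustration, and it assumes the command is run through a shell so that `~` in paths expands.

```python
import subprocess


class CertCommandError(RuntimeError):
    """Hypothetical error type: the cert_command failed or produced no output."""


def run_cert_command(cert_command: str, timeout: float = 30.0) -> str:
    """Run a cert_command string and return the certificate text from stdout."""
    # Assumption: the command goes through the shell so that `~` expands.
    proc = subprocess.run(
        cert_command,
        shell=True,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        # Keep stderr in the message so the caller can log the failure detail.
        raise CertCommandError(
            f"cert_command exited {proc.returncode}: {proc.stderr.strip()}"
        )
    cert_text = proc.stdout.strip()
    if not cert_text:
        raise CertCommandError("cert_command succeeded but wrote no certificate")
    return cert_text
```

A caller would likely pass the `cert_command` string from `tunnels.yaml` unchanged and write the returned text to a cert file before launching SSH. Treating an empty stdout as a failure is a design choice of this sketch, not something the contract states.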
+
+---
+
+## Stack
+
+- **Language:** Python 3.11+
+- **CLI framework:** Typer
+- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
+- **Vault SDK:** `hvac` (optional; only required for vault backend)
+- **Packaging:** `uv tool install`
+
+---
+
+## Tasks
+
+### T1 — Repository bootstrap
+- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
+      `workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
+- [ ] Write `SCOPE.md` (see template in §SCOPE below)
+- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
+- [ ] Register repo with state-hub (`register_repo`)
+- [ ] Create state-hub workstream for this workplan
+
+### T2 — Models and config
+- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
+      ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
+- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
+      `ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
+      `inventory_path`, `state_dir`
+- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)
+
+### T3 — LocalCA backend
+- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
+  - Calls `ssh-keygen -s -I -n -V +h `
+  - Parses `ssh-keygen -L -f ` output to extract `Valid before`, `Key ID`,
+    `Principals`
+  - Returns `CertRecord`; writes cert to `~/.local/state/warden/.cert.pub`
+- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
+      (overridable per actor in inventory)
+- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
+      actors that do not bring their own key
+
+### T4 — VaultCA backend
+- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
+  - `POST /v1/ssh/sign/` with `public_key`, `valid_principals`, `ttl`
+  - Parse response `signed_key` field; write to state dir; extract metadata via
+    `ssh-keygen -L`
+- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
+- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)
+
+### T5 — Principals inventory
+- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
+  ```yaml
+  actors:
+    agt-state-hub-bridge:
+      type: agt
+      principals: [agt-task-bridge]
+      ttl_hours: 24
+      description: "ops-bridge tunnel actor"
+  hosts:
+    coulombcore:
+      allowed_principals:
+        agt: [agt-task-bridge]
+        atm: [atm-backup-daily]
+  ```
+- [ ] `warden inventory list` — print table
+- [ ] `warden inventory add --type --principals <...>`
+- [ ] `warden inventory remove `
+
+### T6 — CLI commands
+- [ ] `warden sign --pubkey ` — sign existing pubkey; write cert to
+      stdout (the `cert_command` interface for ops-bridge)
+- [ ] `warden issue ` — generate keypair + sign; output JSON with
+      `privkey`, `cert`, `valid_before`, `identity`
+- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
+      remaining; `--all` flag to show all actors in state dir
+- [ ] `warden scorecard` — run §5 checks (see T7)
+- [ ] `warden inventory ` (list / add / remove)
+
+### T7 — Scorecard runner
+- [ ] `scorecard.py`: implement each §5 row as a named check function returning
+      `CheckResult(name, passed, detail)`
+- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
+  - All certs in state dir respect TTL policy for their `ActorType`
+  - No actor in inventory lacks a `principals` entry
+  - Actor name prefix matches declared type
+  - No cert expired by more than 5 min still present in state dir (stale cleanup)
+- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
+      — those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
+- [ ] `warden scorecard --json` for machine-readable output
+
+### T8 — ops-ssh-wrapper script
+- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
+  - Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
+  - Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
+  - Loads cert via `ssh-add`; execs the given command
+- [ ] Install as part of `uv tool install` entry points
+
+### T9 — Tests
+- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
+- [ ] Unit tests for inventory YAML round-trip
+- [ ] Unit tests for actor name prefix validation
+- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
+- [ ] Scorecard unit tests (mock cert records)
+
+### T10 — Documentation
+- [ ] `SCOPE.md` (see below)
+- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
+- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
+- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)
+
+---
+
+## SCOPE.md Template
+
+```
+# SCOPE
+
+## One-liner
+SSH Certificate Authority and credential issuance for the ops fleet —
+signs short-lived certs for adm/agt/atm actors; provides the cert_command
+interface consumed by ops-bridge and other tooling.
+
+## Core Idea
+Implements AccessManagementDirective §§1–5. Owns the CA key, actor inventory,
+signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
+or SSH key generation for humans.
+
+## In Scope
+- Local CA backend (ssh-keygen -s) for lab / non-Vault use
+- Vault SSH engine backend for production
+- Actor identity registry (inventory.yaml)
+- cert_command CLI interface: `warden sign --pubkey `
+- TTL policy enforcement per ActorType (adm/agt/atm)
+- Certificate status and stale-cert cleanup
+- Scorecard checks (local / cert-side only)
+- ops-ssh-wrapper script for agt/atm startup automation
+
+## Out of Scope
+- Host-side principal deployment (railiance-infra Ansible)
+- SSH key generation for human admins (self-service: ssh-keygen)
+- Vault cluster setup / HA
+- Session recording, audit forwarding to SIEM (host-side)
+- Tunnel lifecycle (ops-bridge)
+- SSO / Teleport (trigger when §6.2 scale thresholds are hit)
+
+## Relevant When
+- Issuing or refreshing a cert for any adm/agt/atm actor
+- Checking cert validity / scorecard compliance
+- ops-bridge needs cert_command to be defined
+- Adding a new actor to the principals inventory
+
+## Not Relevant When
+- Managing tunnel lifecycle (ops-bridge)
+- Deploying SSH config to hosts (railiance-infra)
+- All access is via static keys with no TTL (legacy mode)
+
+## Current State
+Status: planned (WARDEN-WP-0001 not yet started)
+
+## Related Repositories
+- ops-bridge — primary consumer of cert_command interface
+- railiance-infra — owns host-side principal deployment
+- the-custodian/state-hub — registers domain/workstreams
+```
+
+---
+
+## Acceptance Criteria
+
+- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
+- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
+- [ ] `warden scorecard` returns 5/5 on a clean test inventory
+- [ ] `warden sign` called from ops-bridge `cert_command` in an integration test tunnel
+- [ ] All tests pass: `uv run pytest`
+- [ ] All lints pass: `uv run ruff check .`
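The prefix rule from T2, the default TTLs from T3, and the `CheckResult` shape from T7 could fit together as in the sketch below. Everything beyond `ActorType`, `CheckResult(name, passed, detail)`, and the 48/24/8 hour defaults is an assumption: `actor_type_from_name`, `effective_ttl_hours`, and `check_prefix` are hypothetical helper names, not part of the workplan.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ActorType(str, Enum):
    ADM = "adm"  # human
    AGT = "agt"  # LLM agent
    ATM = "atm"  # automation


# Default TTLs per actor type, in hours (values taken from T3).
DEFAULT_TTL_HOURS = {ActorType.ADM: 48, ActorType.AGT: 24, ActorType.ATM: 8}


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def actor_type_from_name(actor_name: str) -> ActorType:
    """Derive the ActorType from an `adm-*` / `agt-*` / `atm-*` actor name."""
    prefix, sep, rest = actor_name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {actor_name!r} has no '<type>-' prefix")
    try:
        return ActorType(prefix)
    except ValueError:
        raise ValueError(f"unknown actor prefix {prefix!r} in {actor_name!r}") from None


def effective_ttl_hours(actor_type: ActorType, override: Optional[int] = None) -> int:
    """Return the per-actor override if one is given, else the type default."""
    return override if override is not None else DEFAULT_TTL_HOURS[actor_type]


def check_prefix(actor_name: str, declared: ActorType) -> CheckResult:
    """Scorecard-style check: the name prefix must match the declared type."""
    expected = f"{declared.value}-"
    passed = actor_name.startswith(expected) and len(actor_name) > len(expected)
    detail = "ok" if passed else f"{actor_name!r} must start with {expected!r}"
    return CheckResult("actor-name-prefix", passed, detail)
```

Returning a `CheckResult` instead of raising keeps the scorecard able to report every failing actor in one run, while `actor_type_from_name` raises because a config loader would normally stop at the first invalid name.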