docs: align architecture and scope with AccessManagementDirective

Expands architecture constraints and SCOPE.md to reflect the three-actor
vocabulary (adm/agt/atm), two credential modes (static key + cert_command),
and ops-warden boundary. Adds directive wiki doc and two new workplans
(BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 00:59:38 +00:00
parent 75a559780e
commit f3a7236c5d
5 changed files with 773 additions and 15 deletions


@@ -17,11 +17,18 @@ The catalog layout follows: `opscatalog/domains/<domain>/{domain.yaml,
targets/, bridges/, docs/}`.
Key design constraints:
- OpsBridge owns lifecycle management only; it does not own identity/credentials
- OpsBridge owns lifecycle management only; it does not own credential issuance or CA
operations (those belong to `ops-warden`)
- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
in config, CLI args, and log filenames must stay consistent
- Actor attribution (human operator vs. automation agent) is tracked per bridge
for audit log traceability (FRS §5.7)
- Actor attribution is tracked per bridge using the three-actor vocabulary from the
AccessManagementDirective: `adm` (human), `agt` (LLM agent), `atm` (automation);
actor names must carry the matching prefix (`adm-*`, `agt-*`, `atm-*`) (FRS §5.7)
- Two credential modes are first-class and must remain independently functional:
1. **Static key mode** (default) — `ssh_key` only; no TTL, no cert logic
2. **cert_command mode** — a pluggable shell command that issues a CA-signed cert
before each SSH launch; TTL parsed from the cert; pre-emptive refresh ~5 min
before expiry; `cert_identity` logged in every `BRIDGE_CONNECTED` event
Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).


@@ -8,7 +8,7 @@
## One-liner
SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards.
SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.
---
@@ -20,11 +20,17 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## In Scope
- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs`)
- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
- Auto-reconnect with exponential backoff and configurable retry policy
- Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
- Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
- Actor attribution: per-tunnel actor class (human / automation) for audit traceability
- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
works without any CA or external tooling
- **cert_command mode** (optional): pluggable shell command that issues a short-lived
CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
`cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
- PID + state file management in `~/.local/state/bridge/`
- MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
- OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
@@ -33,7 +39,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Out of Scope
- Identity/credential management (uses existing SSH keys)
- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
certs via the `cert_command` interface but never signs anything itself)
- SSH key generation for human admins (self-service: `ssh-keygen`)
- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
- Long-running application hosting on remote machines (port-forward only, not deployment)
- VPN or layer-3 connectivity
- Monitoring/alerting beyond JSON audit logs
@@ -44,9 +53,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Relevant When
- Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
- Need audit trail of which actor (human vs. automation) started/stopped tunnels
- Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels
- Setting up a new machine in the Railiance ecosystem that must phone home to the hub
- Diagnosing connectivity issues between local hub and remote services
- Checking certificate validity for active tunnels (`bridge cert-status`)
- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials
---
@@ -60,8 +71,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Current State
- Status: experimental → active (v0.1 core complete; OpsCatalog planned but not yet shipped)
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health checks and audit logging complete; OpsCatalog framework present but not populated
- Status: active (v0.1 core complete; directive alignment in progress — BRIDGE-WP-0004)
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health
checks and audit logging complete; OpsCatalog framework present but not populated;
cert_command / ActorType alignment not yet implemented
- Stability: stable tunnel lifecycle; tested under network drops and SSH failures
- Usage: running in lab for daily Railiance/Temporal connectivity
@@ -77,17 +90,24 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
## Terminology
- Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check
- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
cert_command, cert_identity
- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
- Also known as: "the bridge"
- Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
- Potentially confusing: "bridge state" is a tunnel-specific state machine
(stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)
---
## Related / Overlapping Repositories
- `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
`cert_command` when short-lived certificates are required
- `activity-core` — Temporal server on remote reached via ops-bridge tunnel
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns
host-side principal deployment (`/etc/ssh/auth_principals/`)
---
@@ -105,5 +125,9 @@ keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge
## Getting Oriented
- Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), `~/.local/state/bridge/` (PID/state files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; MCP: `bridge_status()`
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
`~/.local/state/bridge/` (PID/state/cert files)
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
MCP: `bridge_status()`
- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)


@@ -0,0 +1,203 @@
AccessManagementDirective
*Practical host access control management*
# AccessManagementDirective
**Document Title:** SSH Access Management Directive
**Version:** 1.1 (Production-Ready Revision Post-SWOT Improvements)
**Date:** 28 March 2026
**Audience:** Operations Department
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
**Author:** Grok (on behalf of the team)
**Status:** Official Directive. All ops personnel, agents, and automation pipelines MUST follow this.
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
## 0. Prerequisites
Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
## 1. Concept Overview
This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
**Why this model?**
- A central CA signs short-lived certificates for every login.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
**Core Principles**
- **Least privilege:** Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
- **Short-lived credentials:** Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
- **One CA, many issuers:** A single offline User CA whose public key is trusted by every host.
- **Automation-first:** All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- **Separation of concerns:**
- **Admins (adm)**: Human operators (full interactive shell when needed).
- **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
- **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
## 2. Actor Definitions & Access Model
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
**Certificate Naming Convention**
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
**LLM-Agent Risk Clarification**
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
## 3. Bootstrapping the System (One-Time Setup)
### 3.1. Create the CA (do this once, offline)
```bash
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
```
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
- Public key: `ca_user.pub`
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:
```bash
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
```
- Create principals directory and files from the central Git inventory.
- `systemctl restart sshd`
### 3.3. Initial Admin Access
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
## 4. Automatic Management of Access Rights
### 4.1. Daily / On-Demand Workflow
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
- **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
- **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
- **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
- `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
- Example inventory snippet:
```yaml
hosts:
- name: prod-db-01
allowed_principals:
adm: [adm-full]
agt: [agt-incident-resolver-v2]
atm: [atm-backup-daily, atm-logrotate]
```
3. **Revocation & Rotation**
- Short expiry = automatic revocation.
- For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
- Agents/automations never store long-lived private keys on disk.
4. **Concrete Agent & Automation Wrapper Example** (Python snippet; place in `/usr/local/bin/ops-ssh-wrapper`)
```python
#!/usr/bin/env python3
import os
import subprocess
import sys
import tempfile

# Request short-lived cert from Vault for the public key given in $SSH_PUBKEY
cert = subprocess.check_output(
    ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
     f"public_key={os.environ['SSH_PUBKEY']}"]
).decode().strip()
# Write the cert to a temp file that outlives the with-block (delete=False)
with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
    f.write(cert.encode())
    cert_path = f.name
# Load into ssh-agent and exec the real command
# (ssh-add usually expects the private key path and reads <key>-cert.pub beside it)
subprocess.run(["ssh-add", cert_path])
os.execvp(sys.argv[1], sys.argv[1:])
```
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
### 4.2. Human UX Guidance
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for seamless experience. Manual `ssh-keygen -s` is only for edge cases.
### 4.3. Emergency Break-Glass Procedure
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
4. After recovery, immediately rotate the CA and run a full scorecard.
## 5. AccessManagement Scorecard (Checklist)
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f *.pub` |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | ≥ 10/10 = **Operational** | - | - |
**Scorecard Execution Command** (run from ops laptop):
```bash
ansible all -m command -a "ssh-access-scorecard.sh" --become
```
## 6. Scope & Operational Boundaries
### 6.1. When Bootstrapping Is Officially Closed
The system is **fully operational** when **ALL** of the following are true:
- Scorecard passes 10/10 on every host.
- Central Git repo contains the authoritative principals inventory.
- First three admins have successfully used signed certificates for 7 consecutive days.
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
- CI/CD pipeline for host config updates is green and runs hourly.
- Emergency break-glass procedure has been tested once.
**Declaration:** Ops Lead signs off with date in the Git commit message.
### 6.2. Scope Boundary: When to Switch to Sophisticated Tooling
Stay with **native OpenSSH CA + Ansible + Vault** while:
- ≤ 200 hosts
- ≤ 50 distinct agent/automation identities
- No regulatory requirement for SSO or full session recording
**Switch triggers** (any one):
- > 200 hosts OR rapid daily growth
- Need for human SSO (Okta/Google) integration
- Requirement for audited web-based SSH sessions or just-in-time access approval
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
- Audit/compliance demands central policy engine or session recording
**Recommended next-level tools** (in order):
1. **Teleport:** Best for mixed human + agent workloads (SSO + Machine ID).
2. **HashiCorp Vault SSH + Boundary:** When you already use Vault heavily.
3. **step-ca + smallstep:** If you prefer a pure open-source CA with OIDC.
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
## 7. Enforcement & Review
- **Quarterly review** of this directive and scorecard results.
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
- **Questions / improvements** → create PR against this file in the ops repo.
**End of Document**
Approved for immediate use across all production and staging environments.


@@ -0,0 +1,272 @@
---
id: BRIDGE-WP-0004
type: workplan
title: "AccessManagementDirective Alignment"
domain: custodian
repo: ops-bridge
status: draft
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
---
# BRIDGE-WP-0004 — AccessManagementDirective Alignment
**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
preserving full backward compatibility with the existing static-key mode.
**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
---
## Goal
After this workplan:
1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
`cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
handled transparently by the tunnel manager.
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
the directive, with config validation that enforces naming conventions.
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
§5 SIEM traceability requirement.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
---
## Design Decisions
### Static key mode stays first-class
If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
explicitly supported for:
- Lab/dev environments without a CA
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
- Environments below the directive's complexity threshold
### cert_command interface
```yaml
# tunnels.yaml — optional cert_command field
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: agt-state-hub-bridge
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519 # private key (always required)
actor: agt-state-hub-bridge
cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
on tunnel stop.
`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
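A minimal sketch of the acquisition step described above, under stated assumptions: the function name `acquire_cert`, its signature, and the empty-output check are illustrative and not taken from `manager.py`. It shows the raw shell string being run, stdout captured as the cert text, and a non-zero exit surfaced as an error.

```python
import subprocess
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command fails (name taken from the workplan tasks)."""


def acquire_cert(cert_command: Optional[str], cert_path: Path) -> Optional[Path]:
    # Static key mode: no cert_command configured, so there is nothing to acquire.
    if cert_command is None:
        return None
    # cert_command is a raw shell string by design, so it is run through the shell.
    result = subprocess.run(cert_command, shell=True, capture_output=True, text=True)
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    cert_text = result.stdout.strip()
    # Assumption: an empty stdout is treated as a failure as well.
    if not cert_text:
        raise CertAcquisitionError("cert_command produced no output")
    cert_path.parent.mkdir(parents=True, exist_ok=True)
    cert_path.write_text(cert_text + "\n")  # overwritten on every acquisition
    return cert_path
```

Because the command string is executed by the shell without validation, it should only ever come from the operator's own `tunnels.yaml`, never from untrusted input.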
### TTL-aware cert refresh
After acquiring a cert, `manager.py` parses `Valid before:` via `ssh-keygen -L` to
determine `cert_expires_at`. It schedules a pre-emptive cert refresh
(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
failure, no reconnect backoff triggered.
If `cert_command` is absent, no TTL logic runs.
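The expiry check can be sketched as follows. This assumes the validity line printed by `ssh-keygen -L` has the form `Valid: from <start> to <end>` (or `Valid: before <end>`), with timestamps in local time; the helper names and the regular expression are illustrative, so verify the exact output format against the OpenSSH version in use.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Assumed line shapes: "Valid: from <start> to <end>" or "Valid: before <end>".
VALID_RE = re.compile(
    r"^\s*Valid:\s+(?:from\s+\S+\s+to|before)\s+(\S+)", re.MULTILINE
)
REFRESH_MARGIN = timedelta(minutes=5)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the cert end time, or None if the cert has no end (e.g. 'forever')."""
    match = VALID_RE.search(keygen_output)
    if match is None:
        return None
    # Naive datetime: the timestamp is assumed to be in the local time zone.
    return datetime.fromisoformat(match.group(1))


def refresh_due(cert_expires_at: datetime, now: datetime) -> bool:
    """True once 'now' is within the refresh margin of the cert's expiry."""
    return now >= cert_expires_at - REFRESH_MARGIN
```

Both arguments to `refresh_due` must be either naive or timezone-aware; mixing the two raises a `TypeError` in Python.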
### Actor type model
`actor_class: str # "human" | "automation"` is replaced by:
```python
class ActorType(str, Enum):
ADM = "adm" # human operator
AGT = "agt" # LLM-powered autonomous agent
ATM = "atm" # deterministic script / pipeline
```
Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
canonical values.
Config validation: if `actor` name is set, it must start with the prefix matching its type
(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
SIEM auditability.
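The mapping and prefix rule could look roughly like this. It is a sketch only: `parse_actor_type`, `validate_actor_name`, and the `ConfigError` base class are assumed names for illustration, not the actual `config.py` API.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


class ConfigError(ValueError):
    """Hypothetical config validation error."""


# One-way migration aid for legacy actor_class values.
LEGACY_CLASSES = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    if raw in LEGACY_CLASSES:
        mapped = LEGACY_CLASSES[raw]
        warnings.warn(
            f"actor class {raw!r} is deprecated; use {mapped.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return mapped
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor type: {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    # Hard error on mismatch, e.g. type agt requires a name starting with "agt-".
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
```

How prefix enforcement should treat legacy actor names such as `automation-agent` is not settled by the text above; the sketch leaves that decision to the caller.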
---
## Tasks
### T1 — ActorType enum
- [ ] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
- [ ] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
`ActorType.ATM` with a `DeprecationWarning`; reject unknown values
- [ ] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
`atm-*` for ATM; raise `ConfigError` on mismatch
- [ ] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
- [ ] Update tests
### T2 — cert_command config field
- [ ] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
- [ ] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
content (shell-level freedom intentional)
- [ ] Document in config example / SCOPE.md
### T3 — Cert acquisition in manager
- [ ] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
- If `cfg.cert_command` is None: return None (static key mode)
- Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
- Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
- Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
- [ ] `build_ssh_command`: accept optional `cert_path`; when set, insert
`-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
- [ ] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
so every reconnect gets a fresh cert
### T4 — cert_identity in audit log
- [ ] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
extract `Key ID` (the `-I` value from signing time)
- [ ] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
JSON entry when present
- [ ] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
- [ ] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events
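As an illustration of T4, the sketch below pulls the `Key ID` out of `ssh-keygen -L` output and attaches it to a JSON audit entry only when a cert was used. The helper names and the `event` / `tunnel` field names are assumptions; the real `AuditLogger.log()` signature may differ.

```python
import json
import re
from typing import Optional

# ssh-keygen -L prints the identity as:  Key ID: "agt-state-hub-bridge"
KEY_ID_RE = re.compile(r'^\s*Key ID:\s+"(.*)"\s*$', re.MULTILINE)


def parse_cert_identity(keygen_output: str) -> Optional[str]:
    """Return the -I value set at signing time, or None if not found."""
    match = KEY_ID_RE.search(keygen_output)
    return match.group(1) if match else None


def build_audit_entry(event: str, tunnel: str,
                      cert_identity: Optional[str] = None) -> str:
    # Hypothetical entry shape; cert_identity is omitted in static key mode.
    entry = {"event": event, "tunnel": tunnel}
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return json.dumps(entry)
```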
### T5 — TTL-aware cert refresh
- [ ] `manager.py`: after successful cert acquisition, parse `Valid before:` timestamp
from `ssh-keygen -L` output → `cert_expires_at: datetime`
- [ ] In the health-check/wait loop, check `datetime.now(utc) >= cert_expires_at - timedelta(minutes=5)`
on each iteration
- [ ] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
next iteration)
- [ ] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
`AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
- [ ] If `cert_command` is absent, skip all TTL logic entirely
### T6 — `bridge cert-status` command
- [ ] `cli.py`: add `cert-status [TUNNEL]` subcommand
- [ ] For each tunnel (or the named one): read cert file from state dir if present,
run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
time-to-expiry (or "static key / no cert" if absent)
- [ ] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
- [ ] `--json` flag for machine-readable output
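The exit-code rule for `cert-status` can be sketched independently of the CLI framework. The function below is hypothetical and takes already-parsed expiry times (with `None` meaning static key mode); the output wording is illustrative only.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple


def cert_status(expiries: Dict[str, Optional[datetime]],
                now: datetime) -> Tuple[List[str], int]:
    """Return human-readable status lines and a scriptable exit code."""
    lines: List[str] = []
    exit_code = 0
    for tunnel, expires_at in sorted(expiries.items()):
        if expires_at is None:
            lines.append(f"{tunnel}: static key / no cert")
        elif expires_at <= now:
            lines.append(f"{tunnel}: EXPIRED at {expires_at.isoformat()}")
            exit_code = 1  # any expired cert makes the whole command exit 1
        else:
            lines.append(f"{tunnel}: valid, expires in {expires_at - now}")
    return lines, exit_code
```

A caller would print the lines and pass the second value to `sys.exit`, so a cron job or health probe can alert on a non-zero status.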
### T7 — CertAcquisitionError handling
- [ ] New exception `CertAcquisitionError` in `models.py`
- [ ] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
(cert failures are transient — e.g., Vault briefly unreachable)
- [ ] After `max_attempts` consecutive cert failures, transition to `FAILED` state
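One possible shape for the retry behaviour in T7, assuming exponential backoff with a cap. The function name, the delay parameters, and the injected `sleep` / `log` callables are illustrative, and the real `_run_loop` would also drive the tunnel state machine rather than return a value.

```python
import time
from typing import Callable, Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command fails (name taken from the workplan tasks)."""


def acquire_with_backoff(
    acquire: Callable[[], str],
    max_attempts: int,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> Optional[str]:
    """Return the cert on success, or None after max_attempts failures (FAILED)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return acquire()
        except CertAcquisitionError as exc:
            # Cert failures are treated as transient, like a dropped connection.
            log(f"BRIDGE_DISCONNECTED detail=cert acquisition failed: {exc}")
            if attempt < max_attempts:
                sleep(min(base_delay * 2 ** (attempt - 1), max_delay))
    return None
```

Injecting `sleep` and `log` keeps the loop testable without real delays, which fits the mocked-subprocess approach listed under T9.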
### T8 — SCOPE.md and documentation updates
- [ ] Update `SCOPE.md`: replace "Identity/credential management (uses existing SSH keys)"
with the pluggable cert_command model; add ops-warden as related repo; update
actor terminology to adm/agt/atm; update Current State
- [ ] Update `wiki/OpsBridgeFrs.md` §5.7 (actor attribution): note three-actor model,
cert_identity field, cert_command interface
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency
- [ ] Update config example in README / `wiki/` to show both static and cert_command modes
- [ ] Update `.claude/rules/architecture.md`: add cert lifecycle to architecture description
### T9 — Tests
- [ ] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
cert_command parse
- [ ] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
args; verify `CertAcquisitionError` on non-zero exit
- [ ] `test_manager.py`: TTL logic — mock `cert_expires_at` in past; verify refresh triggers
- [ ] `test_audit.py`: `cert_identity` field present in CONNECTED event when cert was used;
absent in static-key mode
- [ ] `test_cli.py`: `cert-status` exit codes; JSON output shape
---
## Config Schema — Before / After
### Before
```yaml
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: ops-agent
ssh_key: ~/.ssh/id_ed25519
actor: automation-agent
actors:
automation-agent:
class: automation
description: "state hub bridge agent"
```
### After (static key mode — unchanged behavior)
```yaml
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: agt-state-hub-bridge
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
actor: agt-state-hub-bridge
actors:
agt-state-hub-bridge:
class: agt
description: "state hub bridge agent"
```
### After (cert_command mode — ops-warden or any CA)
```yaml
tunnels:
state-hub-coulombcore:
host: coulombcore
remote_port: 8001
local_port: 8000
ssh_user: agt-state-hub-bridge
ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
actor: agt-state-hub-bridge
cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
actors:
agt-state-hub-bridge:
class: agt
description: "state hub bridge agent"
```
---
## Acceptance Criteria
- [ ] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
warning only); tunnel behaves identically
- [ ] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
- [ ] Config with `cert_command` set: SSH process launched with both `-i key` and
`-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
- [ ] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
no TTL logic runs
- [ ] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
- [ ] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
- [ ] `bridge cert-status` shows valid cert info; exits 1 on expired cert
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`


@@ -0,0 +1,252 @@
---
id: WARDEN-WP-0001
type: workplan
title: "OpsWarden Initial Implementation"
domain: custodian
repo: ops-warden
status: draft
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
---
# WARDEN-WP-0001 — OpsWarden Initial Implementation
> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist.
> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the
> first commit action.
**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that
implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`.
**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
(those live in `railiance-infra`), session recording, and SSO integration (trigger §6.2 of
the directive when scale requires it).
---
## Goal
Create a new `ops-warden` repository that owns **credential issuance only** — the CA,
certificate signing, actor identity registry, and scorecard tooling. Its sole public surface
to sibling repos is a well-defined `cert_command` interface that any tool (principally
`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor.
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |
---
## Architecture
```
ops-warden/
├── SCOPE.md
├── CLAUDE.md
├── pyproject.toml
├── src/warden/
│ ├── cli.py # Typer CLI: sign / issue / status / inventory / scorecard
│ ├── models.py # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
│ ├── ca.py # LocalCA backend (file-based, for dev / non-Vault)
│ ├── vault.py # VaultCA backend (Vault SSH engine, for production)
│ ├── inventory.py # YAML principals inventory read/write
│ ├── scorecard.py # §5 compliance checks
│ └── config.py # ~/.config/warden/warden.yaml loader
├── tests/
└── wiki/ # (symlink or copy of AccessManagementDirective.md)
```
**Backends are swappable.** The config key `backend: local | vault` selects the CA
implementation. The tool is therefore fully functional without Vault for local lab use and
production-grade with Vault; both backends share the same CLI surface, the same
`cert_command` interface, and the same principals inventory format.
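A minimal sketch of how the `backend` key could select an implementation. The `CABackend` protocol, the `select_backend` helper, and the simplified `sign(actor_name, pubkey_path)` signature are assumptions made for illustration; the tasks below specify `sign(spec: CertSpec) -> CertRecord` for the real classes.

```python
from dataclasses import dataclass
from typing import Protocol


class CABackend(Protocol):
    """Hypothetical interface shared by both CA backends."""

    def sign(self, actor_name: str, pubkey_path: str) -> str:
        """Return the signed certificate text."""
        ...


@dataclass
class LocalCA:
    ca_key: str

    def sign(self, actor_name: str, pubkey_path: str) -> str:
        raise NotImplementedError("would shell out to ssh-keygen -s")


@dataclass
class VaultCA:
    vault_addr: str

    def sign(self, actor_name: str, pubkey_path: str) -> str:
        raise NotImplementedError("would call the Vault SSH engine")


def select_backend(config: dict) -> CABackend:
    """Pick the CA implementation named by the `backend` config key."""
    backend = config.get("backend")
    if backend == "local":
        return LocalCA(ca_key=config["ca_key"])
    if backend == "vault":
        return VaultCA(vault_addr=config["vault_addr"])
    raise ValueError(f"unknown backend {backend!r}; expected 'local' or 'vault'")
```

Because callers only see the shared `sign` method, the CLI code would not need to know which backend is active.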
**cert_command interface contract:**
```
warden sign <actor-name> --pubkey <path>
```
Writes the signed certificate text to stdout and exits non-zero on failure.
`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.
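From the caller's side, the contract above might be consumed roughly as follows. This is a sketch only: `run_cert_command` and its timeout default are hypothetical and do not describe how `ops-bridge` actually invokes the command.

```python
import shlex
import subprocess


def run_cert_command(cert_command: str, timeout: float = 30.0) -> str:
    """Run a cert_command string and return the certificate text from stdout."""
    result = subprocess.run(
        shlex.split(cert_command),  # no shell, so the string is split into argv
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        # The contract says: non-zero exit means failure.
        raise RuntimeError(
            f"cert_command failed with exit code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    cert = result.stdout.strip()
    if not cert:
        raise RuntimeError("cert_command succeeded but wrote nothing to stdout")
    return cert
```

A caller would pass the configured string unchanged, for example `run_cert_command("warden sign agt-test-actor --pubkey /tmp/test.pub")`.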
---
## Stack
- **Language:** Python 3.11+
- **CLI framework:** Typer
- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
- **Vault SDK:** `hvac` (optional; only required for vault backend)
- **Packaging:** `uv tool install`
---
## Tasks
### T1 — Repository bootstrap
- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
`workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
- [ ] Write `SCOPE.md` (see the SCOPE.md Template section below)
- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
- [ ] Register repo with state-hub (`register_repo`)
- [ ] Create state-hub workstream for this workplan
### T2 — Models and config
- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
`ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
`inventory_path`, `state_dir`
- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)
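The prefix rule in T2 could be sketched like this. The helper names `actor_type_from_name` and `validate_actor` are hypothetical, and the choice of a `str`-based enum is an assumption.

```python
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"  # human
    AGT = "agt"  # LLM agent
    ATM = "atm"  # automation


def actor_type_from_name(actor_name: str) -> ActorType:
    """Derive the ActorType from an `adm-*`, `agt-*`, or `atm-*` name."""
    prefix, sep, rest = actor_name.partition("-")
    if not sep or not rest:
        raise ValueError(f"actor name {actor_name!r} must look like '<type>-<name>'")
    try:
        return ActorType(prefix)
    except ValueError:
        raise ValueError(
            f"actor name {actor_name!r} has unknown prefix {prefix!r}"
        ) from None


def validate_actor(actor_name: str, declared: ActorType) -> None:
    """Raise if the name prefix disagrees with the declared type."""
    actual = actor_type_from_name(actor_name)
    if actual is not declared:
        raise ValueError(
            f"{actor_name!r} is declared as {declared.value} "
            f"but carries the {actual.value} prefix"
        )
```

Deriving the type from the name keeps the check in one place, so the inventory and the scorecard could both reuse it.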
### T3 — LocalCA backend
- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
- Calls `ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>`
- Parses `ssh-keygen -L -f <cert>` output to extract the `Valid:` interval (its `to`
timestamp becomes `valid_before`), `Key ID`, and `Principals`
- Returns `CertRecord`; writes cert to `~/.local/state/warden/<actor>.cert.pub`
- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
(overridable per actor in inventory)
- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
actors that do not bring their own key
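The two `ssh-keygen` steps in T3 might be sketched as below. `build_sign_argv` and `parse_cert_listing` are hypothetical helper names. The parser assumes the usual OpenSSH listing layout (a quoted `Key ID:` line, a `Valid: from <ts> to <ts>` line, and principals indented under a `Principals:` header) and that the timestamps carry no time zone, so the parsed value is a naive `datetime`.

```python
import re
from datetime import datetime


def build_sign_argv(ca_key, identity, principals, ttl_hours, pubkey):
    """Build the ssh-keygen command line used to sign a public key."""
    return [
        "ssh-keygen", "-s", ca_key, "-I", identity,
        "-n", ",".join(principals), "-V", f"+{ttl_hours}h", pubkey,
    ]


def parse_cert_listing(text: str) -> dict:
    """Extract key ID, expiry, and principals from `ssh-keygen -L` output."""
    key_id = re.search(r'^\s*Key ID:\s*"(.*)"\s*$', text, re.MULTILINE)
    valid = re.search(
        r"^\s*Valid:\s*from\s+(\S+)\s+to\s+(\S+)\s*$", text, re.MULTILINE
    )
    principals = []
    header_indent = None  # indent of the "Principals:" line while inside it
    for line in text.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if header_indent is not None:
            if stripped and indent > header_indent:
                principals.append(stripped)
                continue
            header_indent = None  # reached the next field
        if stripped.startswith("Principals:"):
            header_indent = indent
    return {
        "key_id": key_id.group(1) if key_id else None,
        # A cert listed as "Valid: forever" has no match and yields None.
        "valid_before": datetime.fromisoformat(valid.group(2)) if valid else None,
        "principals": principals,
    }
```

Keeping argv construction separate from execution makes it straightforward to mock the subprocess call in unit tests, as T9 asks.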
### T4 — VaultCA backend
- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
- `POST /v1/ssh/sign/<role>` with `public_key`, `valid_principals`, `ttl`
- Parse response `signed_key` field; write to state dir; extract metadata via
`ssh-keygen -L`
- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)
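The request and response handling for T4 could look roughly like this, shown with the standard library only so the shape of the call is visible. `build_sign_request` and `extract_signed_key` are hypothetical names; the `ssh` mount path and passing the token via `X-Vault-Token` are assumptions about the deployment, and nothing here sends the request.

```python
import json
import urllib.request


def build_sign_request(vault_addr, token, role, public_key, principals,
                       ttl_hours, mount="ssh"):
    """Prepare (but do not send) the POST to Vault's SSH sign endpoint."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount}/sign/{role}"
    payload = {
        "public_key": public_key,
        "valid_principals": ",".join(principals),  # comma-separated string
        "ttl": f"{ttl_hours}h",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
    )


def extract_signed_key(response_body: bytes) -> str:
    """Pull the certificate text out of the sign response."""
    body = json.loads(response_body)
    try:
        return body["data"]["signed_key"].strip()
    except (KeyError, TypeError, AttributeError):
        raise ValueError("Vault response has no data.signed_key field") from None
```

The role would come from `vault_role_map`, keyed by the actor's type.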
### T5 — Principals inventory
- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
```yaml
actors:
agt-state-hub-bridge:
type: agt
principals: [agt-task-bridge]
ttl_hours: 24
description: "ops-bridge tunnel actor"
hosts:
coulombcore:
allowed_principals:
agt: [agt-task-bridge]
atm: [atm-backup-daily]
```
- [ ] `warden inventory list` — print table
- [ ] `warden inventory add <actor-name> --type <adm|agt|atm> --principals <...>`
- [ ] `warden inventory remove <actor-name>`
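Once the YAML above is loaded into plain dicts, a lookup over it might be sketched as follows. `hosts_accepting` is a hypothetical helper, and the matching rule (an actor reaches a host when one of its principals appears under that host's `allowed_principals` for the actor's type) is an assumption about how the two sections relate.

```python
# Mirrors the inventory.yaml example as already-parsed data.
EXAMPLE_INVENTORY = {
    "actors": {
        "agt-state-hub-bridge": {
            "type": "agt",
            "principals": ["agt-task-bridge"],
            "ttl_hours": 24,
            "description": "ops-bridge tunnel actor",
        },
    },
    "hosts": {
        "coulombcore": {
            "allowed_principals": {
                "agt": ["agt-task-bridge"],
                "atm": ["atm-backup-daily"],
            },
        },
    },
}


def hosts_accepting(inventory: dict, actor_name: str) -> list[str]:
    """List hosts whose allowed principals overlap with the actor's."""
    actor = inventory["actors"][actor_name]
    wanted = set(actor.get("principals", []))
    matches = []
    for host, spec in inventory.get("hosts", {}).items():
        allowed = spec.get("allowed_principals", {}).get(actor["type"], [])
        if wanted & set(allowed):
            matches.append(host)
    return sorted(matches)
```

A lookup like this could back a column in `warden inventory list`, though the workplan does not specify the table's contents.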
### T6 — CLI commands
- [ ] `warden sign <actor-name> --pubkey <path>` — sign existing pubkey; write cert to
stdout (the `cert_command` interface for ops-bridge)
- [ ] `warden issue <actor-name>` — generate keypair + sign; output JSON with
`privkey`, `cert`, `valid_before`, `identity`
- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
remaining; `--all` flag to show all actors in state dir
- [ ] `warden scorecard` — run §5 checks (see T7)
- [ ] `warden inventory <subcommand>` (list / add / remove)
### T7 — Scorecard runner
- [ ] `scorecard.py`: implement each §5 row as a named check function returning
`CheckResult(name, passed, detail)`
- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
- All certs in state dir respect TTL policy for their `ActorType`
- No actor in inventory lacks a `principals` entry
- Actor name prefix matches declared type
- No certificate that expired more than 5 min ago remains in the state dir (stale cleanup)
- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
— those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
- [ ] `warden scorecard --json` for machine-readable output
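Two of the local checks could be sketched as named functions returning `CheckResult`. The trimmed `CertRecord` (no `cert_path`), the function names, and the check names are assumptions; the sketch uses only the default TTLs from T3, ignores per-actor overrides, and expects actor names that already passed prefix validation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}
STALE_GRACE = timedelta(minutes=5)


@dataclass(frozen=True)
class CertRecord:
    identity: str
    signed_at: datetime
    valid_before: datetime


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_ttl_policy(certs: list[CertRecord]) -> CheckResult:
    """Fail if any cert lives longer than the default TTL for its type."""
    offenders = []
    for cert in certs:
        actor_type = cert.identity.split("-", 1)[0]
        limit = timedelta(hours=DEFAULT_TTL_HOURS[actor_type])
        if cert.valid_before - cert.signed_at > limit:
            offenders.append(cert.identity)
    detail = "ok" if not offenders else "TTL too long: " + ", ".join(offenders)
    return CheckResult("ttl_policy", not offenders, detail)


def check_no_stale_certs(certs: list[CertRecord], now: datetime) -> CheckResult:
    """Fail if a cert that expired more than 5 min ago is still present."""
    stale = [c.identity for c in certs if now - c.valid_before > STALE_GRACE]
    detail = "ok" if not stale else "stale certs: " + ", ".join(stale)
    return CheckResult("no_stale_certs", not stale, detail)
```

Passing `now` in explicitly keeps the checks deterministic under test, and a list of `CheckResult` values would serialize directly for `--json`.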
### T8 — ops-ssh-wrapper script
- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
- Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
- Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
- Loads cert via `ssh-add`; execs the given command
- [ ] Install as part of `uv tool install` entry points
### T9 — Tests
- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
- [ ] Unit tests for inventory YAML round-trip
- [ ] Unit tests for actor name prefix validation
- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
- [ ] Scorecard unit tests (mock cert records)
### T10 — Documentation
- [ ] `SCOPE.md` (see below)
- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)
---
## SCOPE.md Template
```
# SCOPE
## One-liner
SSH Certificate Authority and credential issuance for the ops fleet —
signs short-lived certs for adm/agt/atm actors; provides the cert_command
interface consumed by ops-bridge and other tooling.
## Core Idea
Implements AccessManagementDirective §§1–5. Owns the CA key, actor inventory,
signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
or SSH key generation for humans.
## In Scope
- Local CA backend (ssh-keygen -s) for lab / non-Vault use
- Vault SSH engine backend for production
- Actor identity registry (inventory.yaml)
- cert_command CLI interface: `warden sign <actor> --pubkey <path>`
- TTL policy enforcement per ActorType (adm/agt/atm)
- Certificate status and stale-cert cleanup
- Scorecard checks (local / cert-side only)
- ops-ssh-wrapper script for agt/atm startup automation
## Out of Scope
- Host-side principal deployment (railiance-infra Ansible)
- SSH key generation for human admins (self-service: ssh-keygen)
- Vault cluster setup / HA
- Session recording, audit forwarding to SIEM (host-side)
- Tunnel lifecycle (ops-bridge)
- SSO / Teleport (trigger when §6.2 scale thresholds are hit)
## Relevant When
- Issuing or refreshing a cert for any adm/agt/atm actor
- Checking cert validity / scorecard compliance
- ops-bridge needs cert_command to be defined
- Adding a new actor to the principals inventory
## Not Relevant When
- Managing tunnel lifecycle (ops-bridge)
- Deploying SSH config to hosts (railiance-infra)
- All access is via static keys with no TTL (legacy mode)
## Current State
Status: planned (WARDEN-WP-0001 not yet started)
## Related Repositories
- ops-bridge — primary consumer of cert_command interface
- railiance-infra — owns host-side principal deployment
- the-custodian/state-hub — registers domain/workstreams
```
---
## Acceptance Criteria
- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
- [ ] `warden scorecard` reports every in-scope check from T7 as passed on a clean test inventory
- [ ] `warden sign` succeeds when invoked through the ops-bridge `cert_command` setting in an integration test tunnel
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`