generated from coulomb/repo-seed
docs: align architecture and scope with AccessManagementDirective
Expands architecture constraints and SCOPE.md to reflect the three-actor vocabulary (adm/agt/atm), two credential modes (static key + cert_command), and the ops-warden boundary. Adds directive wiki doc and two new workplans (BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -17,11 +17,18 @@ The catalog layout follows: `opscatalog/domains/<domain>/{domain.yaml,
|
|||||||
targets/, bridges/, docs/}`.
|
targets/, bridges/, docs/}`.
|
||||||
|
|
||||||
Key design constraints:
|
Key design constraints:
|
||||||
- OpsBridge owns lifecycle management only; it does not own identity/credentials
|
- OpsBridge owns lifecycle management only; it does not own credential issuance or CA
|
||||||
|
operations (those belong to `ops-warden`)
|
||||||
- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
|
- Each tunnel is identified by name (e.g. `state-hub-coulombcore`); names used
|
||||||
in config, CLI args, and log filenames must stay consistent
|
in config, CLI args, and log filenames must stay consistent
|
||||||
- Actor attribution (human operator vs. automation agent) is tracked per bridge
|
- Actor attribution is tracked per bridge using the three-actor vocabulary from the
|
||||||
for audit log traceability (FRS §5.7)
|
AccessManagementDirective: `adm` (human), `agt` (LLM agent), `atm` (automation);
|
||||||
|
actor names must carry the matching prefix (`adm-*`, `agt-*`, `atm-*`) (FRS §5.7)
|
||||||
|
- Two credential modes are first-class and must remain independently functional:
|
||||||
|
1. **Static key mode** (default) — `ssh_key` only; no TTL, no cert logic
|
||||||
|
2. **cert_command mode** — a pluggable shell command that issues a CA-signed cert
|
||||||
|
before each SSH launch; TTL parsed from the cert; pre-emptive refresh ~5 min
|
||||||
|
before expiry; `cert_identity` logged in every `BRIDGE_CONNECTED` event
|
||||||
|
|
||||||
Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
|
Specification docs are in `wiki/`: PRD (`OpsBridgePrd.md`), FRS
|
||||||
(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
|
(`OpsBridgeFrs.md`), and OpsCatalog spec (`OpsCatalogSpecification.md`).
|
||||||
|
|||||||
48
SCOPE.md
@@ -8,7 +8,7 @@
|
|||||||
|
|
||||||
## One-liner
|
## One-liner
|
||||||
|
|
||||||
SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards.
|
SSH reverse tunnel lifecycle manager — keeps remote execution environments continuously connected to the local Custodian State Hub via auto-reconnecting port-forwards. Supports both static SSH keys (no TTL) and CA-signed short-lived certificates via a pluggable `cert_command` interface.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -20,11 +20,17 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
|
|||||||
|
|
||||||
## In Scope
|
## In Scope
|
||||||
|
|
||||||
- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs`)
|
- Named SSH reverse tunnel lifecycle (`bridge up/down/restart/status/logs/cert-status`)
|
||||||
- Auto-reconnect with exponential backoff and configurable retry policy
|
- Auto-reconnect with exponential backoff and configurable retry policy
|
||||||
- Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
|
- Optional HTTP health checks (confirm forwarded service is actually reachable from remote)
|
||||||
- Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
|
- Structured audit logging: JSON events (connected, disconnected, health_check_failed, etc.)
|
||||||
- Actor attribution: per-tunnel actor class (human / automation) for audit traceability
|
- Actor attribution: per-tunnel actor type (`adm` / `agt` / `atm`) for audit traceability,
|
||||||
|
with naming convention enforcement (`adm-*`, `agt-*`, `atm-*`)
|
||||||
|
- **Static key mode** (default): `ssh_key` passed directly to SSH — no TTL, no cert logic,
|
||||||
|
works without any CA or external tooling
|
||||||
|
- **cert_command mode** (optional): pluggable shell command that issues a short-lived
|
||||||
|
CA-signed certificate before each SSH launch; TTL-aware pre-emptive cert refresh;
|
||||||
|
`cert_identity` recorded in audit log — satisfies AccessManagementDirective §5
|
||||||
- PID + state file management in `~/.local/state/bridge/`
|
- PID + state file management in `~/.local/state/bridge/`
|
||||||
- MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
|
- MCP server exposing tunnel lifecycle + OpsCatalog queries as Claude Code tools
|
||||||
- OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
|
- OpsCatalog: optional Git-backed YAML catalog of infrastructure topology (domains/targets/bridges)
|
||||||
@@ -33,7 +39,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
|
|||||||
|
|
||||||
## Out of Scope
|
## Out of Scope
|
||||||
|
|
||||||
- Identity/credential management (uses existing SSH keys)
|
- Credential issuance and CA management (owned by `ops-warden`; ops-bridge consumes
|
||||||
|
certs via the `cert_command` interface but never signs anything itself)
|
||||||
|
- SSH key generation for human admins (self-service: `ssh-keygen`)
|
||||||
|
- Host-side principal deployment (`/etc/ssh/auth_principals/`) — that is `railiance-infra`
|
||||||
- Long-running application hosting on remote machines (port-forward only, not deployment)
|
- Long-running application hosting on remote machines (port-forward only, not deployment)
|
||||||
- VPN or layer-3 connectivity
|
- VPN or layer-3 connectivity
|
||||||
- Monitoring/alerting beyond JSON audit logs
|
- Monitoring/alerting beyond JSON audit logs
|
||||||
@@ -44,9 +53,11 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
|
|||||||
## Relevant When
|
## Relevant When
|
||||||
|
|
||||||
- Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
|
- Remote Temporal workers or Railiance nodes need to reach the local Custodian MCP
|
||||||
- Need audit trail of which actor (human vs. automation) started/stopped tunnels
|
- Need audit trail of which actor (`adm` / `agt` / `atm`) started/stopped tunnels
|
||||||
- Setting up a new machine in the Railiance ecosystem that must phone home to the hub
|
- Setting up a new machine in the Railiance ecosystem that must phone home to the hub
|
||||||
- Diagnosing connectivity issues between local hub and remote services
|
- Diagnosing connectivity issues between local hub and remote services
|
||||||
|
- Checking certificate validity for active tunnels (`bridge cert-status`)
|
||||||
|
- Integrating with a CA (ops-warden or Vault) for short-lived tunnel credentials
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -60,8 +71,10 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
|
|||||||
|
|
||||||
## Current State
|
## Current State
|
||||||
|
|
||||||
- Status: experimental → active (v0.1 core complete; OpsCatalog planned but not yet shipped)
|
- Status: active (v0.1 core complete; directive alignment in progress — BRIDGE-WP-0004)
|
||||||
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health checks and audit logging complete; OpsCatalog framework present but not populated
|
- Implementation: ~75% — CLI tunneling fully functional, MCP integration working, health
|
||||||
|
checks and audit logging complete; OpsCatalog framework present but not populated;
|
||||||
|
cert_command / ActorType alignment not yet implemented
|
||||||
- Stability: stable tunnel lifecycle; tested under network drops and SSH failures
|
- Stability: stable tunnel lifecycle; tested under network drops and SSH failures
|
||||||
- Usage: running in lab for daily Railiance/Temporal connectivity
|
- Usage: running in lab for daily Railiance/Temporal connectivity
|
||||||
|
|
||||||
@@ -77,17 +90,24 @@ Claude Code sessions run locally; the Custodian State Hub API runs locally. Remo
|
|||||||
|
|
||||||
## Terminology
|
## Terminology
|
||||||
|
|
||||||
- Preferred terms: tunnel, bridge, actor, actor_class, reconnect policy, health check
|
- Preferred terms: tunnel, bridge, actor, actor_type, reconnect policy, health check,
|
||||||
|
cert_command, cert_identity
|
||||||
|
- Actor types: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
|
||||||
- Also known as: "the bridge"
|
- Also known as: "the bridge"
|
||||||
- Potentially confusing terms: "bridge state" is a tunnel-specific state machine (stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
|
- Potentially confusing: "bridge state" is a tunnel-specific state machine
|
||||||
|
(stopped → starting → connected ↔ degraded → reconnecting), not a network bridge
|
||||||
|
- Legacy terms (deprecated): `actor_class: human` (→ `adm`), `actor_class: automation` (→ `atm`)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## Related / Overlapping Repositories
|
## Related / Overlapping Repositories
|
||||||
|
|
||||||
- `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
|
- `the-custodian` — primary consumer; ops-bridge keeps remote agents connected to it
|
||||||
|
- `ops-warden` — optional upstream; owns CA and cert issuance; ops-bridge calls it via
|
||||||
|
`cert_command` when short-lived certificates are required
|
||||||
- `activity-core` — Temporal server on remote reached via ops-bridge tunnel
|
- `activity-core` — Temporal server on remote reached via ops-bridge tunnel
|
||||||
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home
|
- `railiance-cluster` / `railiance-infra` — remote hosts that need to phone home; owns
|
||||||
|
host-side principal deployment (`/etc/ssh/auth_principals/`)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -105,5 +125,9 @@ keywords: [ssh, tunnel, reverse-tunnel, connectivity, remote, bridge, ops-bridge
|
|||||||
## Getting Oriented
|
## Getting Oriented
|
||||||
|
|
||||||
- Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
|
- Start with: `README.txt` (architecture, config format, CLI commands, MCP integration)
|
||||||
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config), `~/.local/state/bridge/` (PID/state files)
|
- Key files / directories: `~/.config/bridge/tunnels.yaml` (tunnel config),
|
||||||
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; MCP: `bridge_status()`
|
`~/.local/state/bridge/` (PID/state/cert files)
|
||||||
|
- Entry points: `bridge --help`; `bridge up <tunnel-name>`; `bridge cert-status`;
|
||||||
|
MCP: `bridge_status()`
|
||||||
|
- AccessManagementDirective context: `wiki/AccessManagementDirective.md`
|
||||||
|
- Workplans: BRIDGE-WP-0004 (directive alignment), WARDEN-WP-0001 (ops-warden bootstrap)
|
||||||
|
|||||||
203
wiki/AccessManagementDirective.md
Normal file
@@ -0,0 +1,203 @@
|
|||||||
|
AccessManagementDirective
|
||||||
|
|
||||||
|
*Practical host access control management*
|
||||||
|
|
||||||
|
# AccessManagementDirective
|
||||||
|
|
||||||
|
**Document Title:** SSH Access Management Directive
|
||||||
|
**Version:** 1.1 (Production-Ready Revision – Post-SWOT Improvements)
|
||||||
|
**Date:** 28 March 2026
|
||||||
|
**Audience:** Operations Department
|
||||||
|
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
|
||||||
|
**Author:** Grok (on behalf of the team)
|
||||||
|
**Status:** Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.
|
||||||
|
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
|
||||||
|
|
||||||
|
## 0. Prerequisites
|
||||||
|
|
||||||
|
Before bootstrapping, the following must be in place:
|
||||||
|
- Ansible (or equivalent config-management tool) with a central inventory.
|
||||||
|
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
|
||||||
|
- GitOps repository containing the authoritative principals inventory.
|
||||||
|
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
|
||||||
|
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
|
||||||
|
|
||||||
|
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
|
||||||
|
|
||||||
|
## 1. Concept Overview
|
||||||
|
|
||||||
|
This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
|
||||||
|
|
||||||
|
**Why this model?**
|
||||||
|
- A central CA signs short-lived certificates for every login.
|
||||||
|
- No more manual key copying, key sprawl, or painful revocation.
|
||||||
|
- Built-in expiration, role-based principals, and auditability.
|
||||||
|
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
|
||||||
|
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
|
||||||
|
|
||||||
|
**Core Principles**
|
||||||
|
- **Least privilege** – Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
|
||||||
|
- **Short-lived credentials** – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
|
||||||
|
- **One CA, many issuers** – A single offline User CA whose public key is trusted by every host.
|
||||||
|
- **Automation-first** – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
|
||||||
|
- **Separation of concerns** –
|
||||||
|
- **Admins (adm)**: Human operators (full interactive shell when needed).
|
||||||
|
- **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
|
||||||
|
- **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
|
||||||
|
|
||||||
|
## 2. Actor Definitions & Access Model
|
||||||
|
|
||||||
|
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
|
||||||
|
|
||||||
|
**Certificate Naming Convention**
|
||||||
|
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
|
||||||
|
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
|
||||||
|
|
||||||
|
**LLM-Agent Risk Clarification**
|
||||||
|
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
|
||||||
|
|
||||||
|
## 3. Bootstrapping the System (One-Time Setup)
|
||||||
|
|
||||||
|
### 3.1. Create the CA (do this once, offline)
|
||||||
|
```bash
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
```
|
||||||
|
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
|
||||||
|
- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
|
||||||
|
- Public key: `ca_user.pub`
|
||||||
|
|
||||||
|
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
|
||||||
|
- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
|
||||||
|
- Update `/etc/ssh/sshd_config`:
|
||||||
|
```bash
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
```
|
||||||
|
- Create principals directory and files from the central Git inventory.
|
||||||
|
- `systemctl restart sshd`
|
||||||
|
|
||||||
|
### 3.3. Initial Admin Access
|
||||||
|
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
|
||||||
|
|
||||||
|
## 4. Automatic Management of Access Rights
|
||||||
|
|
||||||
|
### 4.1. Daily / On-Demand Workflow
|
||||||
|
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
|
||||||
|
- **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
|
||||||
|
- **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
|
||||||
|
- **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
|
||||||
|
|
||||||
|
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
|
||||||
|
- `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
|
||||||
|
- Example inventory snippet:
|
||||||
|
```yaml
hosts:
  - name: prod-db-01
    allowed_principals:
      adm: [adm-full]
      agt: [agt-incident-resolver-v2]
      atm: [atm-backup-daily, atm-logrotate]
```
|
||||||
|
|
||||||
|
3. **Revocation & Rotation**
|
||||||
|
- Short expiry = automatic revocation.
|
||||||
|
- For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
|
||||||
|
- Agents/automations never store long-lived private keys on disk.
|
||||||
|
|
||||||
|
4. **Concrete Agent & Automation Wrapper Example** (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`)
|
||||||
|
```python
#!/usr/bin/env python3
import os
import subprocess
import sys  # needed for sys.argv below
import tempfile

# Request short-lived cert from Vault
cert = subprocess.check_output(
    ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
     f"public_key={os.environ['SSH_PUBKEY']}"]
).decode().strip()
with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
    f.write(cert.encode())
    cert_path = f.name
# Load into ssh-agent and exec the real command
subprocess.run(["ssh-add", cert_path])
os.execvp(sys.argv[1], sys.argv[1:])
```
|
||||||
|
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
|
||||||
|
|
||||||
|
### 4.2. Human UX Guidance
|
||||||
|
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases.
|
||||||
|
|
||||||
|
### 4.3. Emergency Break-Glass Procedure
|
||||||
|
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
|
||||||
|
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
|
||||||
|
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
|
||||||
|
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
|
||||||
|
4. After recovery, immediately rotate the CA and run a full scorecard.
|
||||||
|
|
||||||
|
## 5. AccessManagement Scorecard (Checklist)
|
||||||
|
|
||||||
|
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
|
||||||
|
|
||||||
|
| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f *.pub` |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | 10/10 = **Operational** | - | - |
|
||||||
|
|
||||||
|
**Scorecard Execution Command** (run from ops laptop):
|
||||||
|
```bash
ansible all -m command -a "ssh-access-scorecard.sh" --become
```
|
||||||
|
|
||||||
|
## 6. Scope & Operational Boundaries
|
||||||
|
|
||||||
|
### 6.1. When Bootstrapping Is Officially Closed
|
||||||
|
The system is **fully operational** when **ALL** of the following are true:
|
||||||
|
- Scorecard passes 10/10 on every host.
|
||||||
|
- Central Git repo contains the authoritative principals inventory.
|
||||||
|
- First three admins have successfully used signed certificates for 7 consecutive days.
|
||||||
|
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
|
||||||
|
- CI/CD pipeline for host config updates is green and runs hourly.
|
||||||
|
- Emergency break-glass procedure has been tested once.
|
||||||
|
|
||||||
|
**Declaration:** Ops Lead signs off with date in the Git commit message.
|
||||||
|
|
||||||
|
### 6.2. Scope Boundary – When to Switch to Sophisticated Tooling
|
||||||
|
Stay with **native OpenSSH CA + Ansible + Vault** while:
|
||||||
|
- ≤ 200 hosts
|
||||||
|
- ≤ 50 distinct agent/automation identities
|
||||||
|
- No regulatory requirement for SSO or full session recording
|
||||||
|
|
||||||
|
**Switch triggers** (any one):
|
||||||
|
- More than 200 hosts OR rapid daily growth
|
||||||
|
- Need for human SSO (Okta/Google) integration
|
||||||
|
- Requirement for audited web-based SSH sessions or just-in-time access approval
|
||||||
|
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
|
||||||
|
- Audit/compliance demands central policy engine or session recording
|
||||||
|
|
||||||
|
**Recommended next-level tools** (in order):
|
||||||
|
1. **Teleport** – Best for mixed human + agent workloads (SSO + Machine ID).
|
||||||
|
2. **HashiCorp Vault SSH + Boundary** – When you already use Vault heavily.
|
||||||
|
3. **step-ca + smallstep** – If you prefer a pure open-source CA with OIDC.
|
||||||
|
|
||||||
|
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
|
||||||
|
|
||||||
|
## 7. Enforcement & Review
|
||||||
|
- **Quarterly review** of this directive and scorecard results.
|
||||||
|
- **Violations** (e.g., adding static keys) trigger immediate access revocation and an incident ticket.
|
||||||
|
- **Questions / improvements** → create PR against this file in the ops repo.
|
||||||
|
|
||||||
|
**End of Document**
|
||||||
|
Approved for immediate use across all production and staging environments.
|
||||||
|
|
||||||
|
xxx
|
||||||
272
workplans/BRIDGE-WP-0004-directive-alignment.md
Normal file
@@ -0,0 +1,272 @@
|
|||||||
|
---
id: BRIDGE-WP-0004
type: workplan
title: "AccessManagementDirective Alignment"
domain: custodian
repo: ops-bridge
status: draft
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
---
|
||||||
|
|
||||||
|
# BRIDGE-WP-0004 — AccessManagementDirective Alignment
|
||||||
|
|
||||||
|
**Scope:** Align `ops-bridge` with `wiki/AccessManagementDirective.md` — three-actor model,
|
||||||
|
optional CA-signed certificate acquisition, TTL-aware reconnect, richer audit log — while
|
||||||
|
preserving full backward compatibility with the existing static-key mode.
|
||||||
|
|
||||||
|
**Out of scope:** CA/signing logic itself (lives in `ops-warden`), host-side principal
|
||||||
|
deployment, Vault cluster management, OpsCatalog extensions (BRIDGE-WP-0002).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Goal
|
||||||
|
|
||||||
|
After this workplan:
|
||||||
|
|
||||||
|
1. `ops-bridge` works unchanged for anyone using plain, non-expiring SSH keys.
|
||||||
|
2. `ops-bridge` works with CA-signed short-lived certs via `ops-warden` (or any compatible
|
||||||
|
`cert_command`) — cert acquisition, cert rotation, and cert identity logging are all
|
||||||
|
handled transparently by the tunnel manager.
|
||||||
|
3. Actor attribution is expressed in the three-actor vocabulary (`adm | agt | atm`) from
|
||||||
|
the directive, with config validation that enforces naming conventions.
|
||||||
|
4. The audit log carries `cert_identity` when a cert was used, satisfying the directive's
|
||||||
|
§5 SIEM traceability requirement.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Reference Documents
|
||||||
|
|
||||||
|
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| WARDEN-WP-0001 | `workplans/WARDEN-WP-0001-initial-implementation.md` |
| PRD | `wiki/OpsBridgePrd.md` |
| FRS | `wiki/OpsBridgeFrs.md` |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Design Decisions
|
||||||
|
|
||||||
|
### Static key mode stays first-class
|
||||||
|
|
||||||
|
If `cert_command` is absent from a tunnel config, `ops-bridge` behaves exactly as today:
|
||||||
|
`ssh_key` is passed directly to `ssh -i`. No deprecation, no warnings. Static keys are
|
||||||
|
explicitly supported for:
|
||||||
|
- Lab/dev environments without a CA
|
||||||
|
- Tunnels owned by `adm`-class humans who manage their own cert refresh externally
|
||||||
|
- Environments below the directive's complexity threshold
|
||||||
|
|
||||||
|
### cert_command interface
|
||||||
|
|
||||||
|
```yaml
# tunnels.yaml — optional cert_command field
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519   # private key (always required)
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
|
||||||
|
|
||||||
|
When `cert_command` is present, `manager.py` runs it before every SSH subprocess launch,
|
||||||
|
captures stdout as the cert text, writes it to a tempfile in the state dir, and adds
|
||||||
|
`-i <cert_path>` alongside `-i <key_path>` to the SSH command. The cert file is cleaned up
|
||||||
|
on tunnel stop.
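The acquisition step described above could look roughly like the following. This is a minimal sketch and not the actual `manager.py` code: the function name `acquire_cert`, its signature, the empty-output check, and the `0o600` file mode are assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import Optional


class CertAcquisitionError(RuntimeError):
    """Raised when cert_command fails or prints no certificate (hypothetical)."""


def acquire_cert(cert_command: Optional[str], cert_path: Path) -> Optional[Path]:
    """Run cert_command through the shell and store its stdout as the cert file."""
    if cert_command is None:
        return None  # static key mode: nothing to acquire
    result = subprocess.run(
        cert_command, shell=True, capture_output=True, text=True
    )
    if result.returncode != 0:
        # Surface the command's stderr so the operator sees why signing failed.
        raise CertAcquisitionError(result.stderr.strip())
    cert_text = result.stdout.strip()
    if not cert_text:
        # Assumption: an empty stdout is treated as a failure as well.
        raise CertAcquisitionError("cert_command produced no output")
    cert_path.parent.mkdir(parents=True, exist_ok=True)
    cert_path.write_text(cert_text + "\n")  # overwritten on every acquisition
    cert_path.chmod(0o600)
    return cert_path
```

Because the command is a raw shell string, the sketch does not inspect or validate it; whoever writes `tunnels.yaml` is trusted in the same way as whoever can edit the SSH key path.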
|
||||||
|
|
||||||
|
`cert_command` is a raw shell string, intentionally. The caller decides whether it invokes
|
||||||
|
`warden`, `vault write`, `ssh-keygen -s`, or any other tool. This keeps the interface
|
||||||
|
dependency-free — no Vault SDK, no warden import needed inside ops-bridge.
|
||||||
|
|
||||||
|
### TTL-aware cert refresh
|
||||||
|
|
||||||
|
After acquiring a cert, `manager.py` parses the end of the `Valid:` interval printed by `ssh-keygen -L` to
|
||||||
|
determine `cert_expires_at`. It schedules a pre-emptive cert refresh
|
||||||
|
(`cert_expires_at - 5 min`) inside the health-check/wait loop. When the refresh timer
|
||||||
|
fires, the SSH subprocess is gracefully restarted with a freshly signed cert — no auth
|
||||||
|
failure, no reconnect backoff triggered.
|
||||||
|
|
||||||
|
If `cert_command` is absent, no TTL logic runs.
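As an illustration of the expiry handling, the sketch below extracts the end of the validity interval from `ssh-keygen -L` text and decides whether a refresh is due. The helper names `parse_cert_expiry` and `refresh_due` are hypothetical. It assumes the validity line has the form `Valid: from <start> to <end>` (or `Valid: before <end>`) and treats the timestamps as naive local times, which is how `ssh-keygen` prints them; a real implementation would need to settle on a timezone convention.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Matches "Valid: from <start> to <end>" and "Valid: before <end>".
# "Valid: forever" deliberately does not match.
_VALID_RE = re.compile(
    r"^\s*Valid:\s+(?:from\s+\S+\s+to|before)\s+(\S+)\s*$", re.MULTILINE
)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the certificate's expiry, or None if it has no parsable end."""
    match = _VALID_RE.search(keygen_output)
    if match is None:
        return None
    return datetime.fromisoformat(match.group(1))


def refresh_due(
    expires_at: Optional[datetime],
    now: datetime,
    margin: timedelta = timedelta(minutes=5),
) -> bool:
    """True once `now` is within `margin` of the expiry (or past it)."""
    return expires_at is not None and now >= expires_at - margin
```

Keeping the check a pure function of `now` makes it easy to call once per health-check iteration and to unit-test without a real certificate.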
|
||||||
|
|
||||||
|
### Actor type model
|
||||||
|
|
||||||
|
`actor_class: str # "human" | "automation"` is replaced by:
|
||||||
|
|
||||||
|
```python
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline
```
|
||||||
|
|
||||||
|
Backward-compat mapping at config load time: `"human"` → `adm`, `"automation"` → `atm`.
|
||||||
|
The mapping is a one-way migration aid with a deprecation warning; new configs must use the
|
||||||
|
canonical values.
|
||||||
|
|
||||||
|
Config validation: if `actor` name is set, it must start with the prefix matching its type
|
||||||
|
(`adm-*`, `agt-*`, `atm-*`). Hard error, not a warning — the directive requires this for
|
||||||
|
SIEM auditability.
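The load-time mapping and prefix check described above might be sketched as follows. The function names `parse_actor_type` and `validate_actor_name` are hypothetical, and `ConfigError` is defined here only so the sketch is self-contained; the real `config.py` may structure this differently.

```python
import warnings
from enum import Enum


class ActorType(str, Enum):
    ADM = "adm"   # human operator
    AGT = "agt"   # LLM-powered autonomous agent
    ATM = "atm"   # deterministic script / pipeline


class ConfigError(ValueError):
    """Raised for invalid tunnel configuration (stand-in for the real class)."""


# One-way migration aid for legacy actor_class values.
_LEGACY_ACTOR_CLASS = {"human": ActorType.ADM, "automation": ActorType.ATM}


def parse_actor_type(raw: str) -> ActorType:
    """Accept canonical values; map legacy ones with a deprecation warning."""
    if raw in _LEGACY_ACTOR_CLASS:
        mapped = _LEGACY_ACTOR_CLASS[raw]
        warnings.warn(
            f"actor_class {raw!r} is deprecated; use {mapped.value!r}",
            DeprecationWarning,
            stacklevel=2,
        )
        return mapped
    try:
        return ActorType(raw)
    except ValueError:
        raise ConfigError(f"unknown actor type: {raw!r}") from None


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Hard error when the actor name lacks the prefix for its type."""
    prefix = f"{actor_type.value}-"
    if not name.startswith(prefix):
        raise ConfigError(f"actor {name!r} must start with {prefix!r}")
```

For example, `validate_actor_name("agt-state-hub-bridge", ActorType.AGT)` passes silently, while the same name with `ActorType.ATM` raises `ConfigError`.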
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tasks
|
||||||
|
|
||||||
|
### T1 — ActorType enum
|
||||||
|
- [ ] `models.py`: replace `actor_class: str` in `ActorInfo` with `actor_type: ActorType`
|
||||||
|
- [ ] `config.py`: accept legacy `"human"` → `ActorType.ADM` and `"automation"` →
|
||||||
|
`ActorType.ATM` with a `DeprecationWarning`; reject unknown values
|
||||||
|
- [ ] `config.py`: enforce actor name prefix: `adm-*` for ADM, `agt-*` for AGT,
|
||||||
|
`atm-*` for ATM; raise `ConfigError` on mismatch
|
||||||
|
- [ ] Update `manager.py` / `audit.py` call sites: `actor_class` → `actor_type.value`
|
||||||
|
- [ ] Update tests
|
||||||
|
|
||||||
|
### T2 — cert_command config field
|
||||||
|
- [ ] `models.py`: add `cert_command: Optional[str] = None` to `TunnelConfig`
|
||||||
|
- [ ] `config.py`: parse `cert_command` from tunnel YAML; no validation of the string
|
||||||
|
content (shell-level freedom intentional)
|
||||||
|
- [ ] Document in config example / SCOPE.md
|
||||||
|
|
||||||
|
### T3 — Cert acquisition in manager
|
||||||
|
- [ ] `manager.py`: extract cert acquisition into `_acquire_cert(cfg) -> Optional[Path]`
|
||||||
|
- If `cfg.cert_command` is None: return None (static key mode)
|
||||||
|
- Run `cert_command` via `subprocess.run(shell=True, capture_output=True)`
|
||||||
|
- Write stdout to `~/.local/state/bridge/<tunnel>-cert.pub` (overwrite each time)
|
||||||
|
- Return path; on non-zero exit code: raise `CertAcquisitionError` with stderr
|
||||||
|
- [ ] `build_ssh_command`: accept optional `cert_path`; when set, insert
|
||||||
|
`-i <cert_path>` after `-i <key_path>` (OpenSSH loads both automatically)
|
||||||
|
- [ ] Call `_acquire_cert` at the top of each reconnect iteration (not once at startup)
|
||||||
|
so every reconnect gets a fresh cert
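A rough sketch of how `build_ssh_command` could take the optional cert path is shown below. The parameter list and the use of `-N` with a `-R` reverse forward to `localhost` are assumptions made for illustration; the existing function's real signature and option set are not shown in this workplan.

```python
from pathlib import Path
from typing import List, Optional


def build_ssh_command(
    host: str,
    ssh_user: str,
    key_path: Path,
    remote_port: int,
    local_port: int,
    cert_path: Optional[Path] = None,
) -> List[str]:
    """Assemble the ssh argv for a reverse tunnel (illustrative only)."""
    cmd = ["ssh", "-N", "-i", str(key_path)]
    if cert_path is not None:
        # cert_command mode: pass the signed cert right after the private key.
        cmd += ["-i", str(cert_path)]
    # Assumed forward: remote_port on the remote host -> local_port locally.
    cmd += ["-R", f"{remote_port}:localhost:{local_port}", f"{ssh_user}@{host}"]
    return cmd
```

With `cert_path=None` the argv is identical to the static-key case, which keeps the two credential modes independent.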
|
||||||
|
|
||||||
|
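A minimal sketch of the acquisition step described in T3. It assumes a simplified signature (tunnel name and command passed directly instead of a `cfg` object) and a hypothetical `identity_args` helper standing in for the relevant part of `build_ssh_command`; neither is the real `manager.py` API.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Default state directory named in the workplan.
STATE_DIR = Path.home() / ".local" / "state" / "bridge"


class CertAcquisitionError(RuntimeError):
    """cert_command exited non-zero; the message carries its stderr."""


def acquire_cert(tunnel: str, cert_command: Optional[str],
                 state_dir: Path = STATE_DIR) -> Optional[Path]:
    """Run cert_command and store its stdout as <tunnel>-cert.pub."""
    if cert_command is None:
        return None  # static key mode: no cert, no file written
    result = subprocess.run(cert_command, shell=True,
                            capture_output=True, text=True)
    if result.returncode != 0:
        raise CertAcquisitionError(result.stderr.strip())
    state_dir.mkdir(parents=True, exist_ok=True)
    cert_path = state_dir / f"{tunnel}-cert.pub"
    cert_path.write_text(result.stdout)  # overwritten on every call
    return cert_path


def identity_args(key_path: str, cert_path: Optional[Path]) -> list[str]:
    """Build the -i arguments: key first, then the cert when present."""
    args = ["-i", key_path]
    if cert_path is not None:
        args += ["-i", str(cert_path)]
    return args
```

Calling `acquire_cert` at the top of every reconnect iteration, as T3 requires, means the file on disk always reflects the most recent signing.
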
### T4 — cert_identity in audit log
- [ ] `manager.py`: after cert acquisition, parse `ssh-keygen -L -f <cert>` output to
  extract `Key ID` (the `-I` value from signing time)
- [ ] Add `cert_identity: Optional[str]` to `AuditLogger.log()` signature; include in
  JSON entry when present
- [ ] Log `cert_identity` in `BRIDGE_CONNECTED` and `BRIDGE_STARTED` events
- [ ] `AuditEvent`: no new events needed; `cert_identity` is metadata on existing events

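The Key ID extraction and the optional audit field could look roughly like this. The helper names `parse_key_id` and `audit_entry` are hypothetical, and the entry layout is only an assumption about what `AuditLogger.log()` emits; the regex relies on `ssh-keygen -L` printing the identity as `Key ID: "<value>"`.

```python
import json
import re
from typing import Optional

# ssh-keygen -L prints a line of the form:  Key ID: "agt-state-hub-bridge"
_KEY_ID_RE = re.compile(r'^\s*Key ID:\s*"(?P<key_id>.*)"\s*$', re.MULTILINE)


def parse_key_id(keygen_output: str) -> Optional[str]:
    """Return the certificate Key ID, or None if no such line exists."""
    match = _KEY_ID_RE.search(keygen_output)
    return match.group("key_id") if match else None


def audit_entry(event: str, tunnel: str,
                cert_identity: Optional[str] = None) -> str:
    """Serialize an audit record; cert_identity appears only when set."""
    entry = {"event": event, "tunnel": tunnel}
    if cert_identity is not None:
        entry["cert_identity"] = cert_identity
    return json.dumps(entry)
```

Leaving the key out entirely in static key mode, rather than writing `null`, matches the acceptance criterion that `cert_identity` is absent when no cert was used.
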
### T5 — TTL-aware cert refresh
- [ ] `manager.py`: after successful cert acquisition, parse the expiry timestamp from
  `ssh-keygen -L` output (the end of the `Valid: from <start> to <end>` line) →
  `cert_expires_at: datetime`
- [ ] In the health-check/wait loop, check
  `datetime.now(timezone.utc) >= cert_expires_at - timedelta(minutes=5)` on each iteration
- [ ] When refresh is due: call `proc.terminate()`, break inner loop, let the outer
  reconnect loop restart naturally (T3 will re-acquire the cert at the top of the
  next iteration)
- [ ] Log a new `AuditEvent.CERT_EXPIRING` event when refresh is triggered (add to
  `AuditEvent` enum); include `cert_identity` and `cert_expires_at` in detail field
- [ ] If `cert_command` is absent, skip all TTL logic entirely

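A sketch of the expiry parsing and the five-minute check from T5. It assumes `ssh-keygen -L` prints the validity window as local-time timestamps in `YYYY-MM-DDTHH:MM:SS` form without a UTC offset; that should be verified against the OpenSSH version in use. The function names are illustrative only.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

REFRESH_MARGIN = timedelta(minutes=5)

# Matches "Valid: from <start> to <end>" and "Valid: before <end>".
_VALID_RE = re.compile(r"^\s*Valid:.*\b(?:to|before)\s+(\S+)\s*$", re.MULTILINE)


def parse_cert_expiry(keygen_output: str) -> Optional[datetime]:
    """Return the expiry as an aware UTC datetime, or None if unbounded."""
    match = _VALID_RE.search(keygen_output)
    if match is None:
        return None  # e.g. "Valid: forever" or no Valid line at all
    naive = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S")
    # A naive datetime is interpreted as system local time by astimezone().
    return naive.astimezone(timezone.utc)


def refresh_due(cert_expires_at: Optional[datetime],
                now: Optional[datetime] = None) -> bool:
    """True once the cert is within REFRESH_MARGIN of expiry."""
    if cert_expires_at is None:
        return False  # static key mode or unbounded cert: no TTL logic
    now = now or datetime.now(timezone.utc)
    return now >= cert_expires_at - REFRESH_MARGIN
```

Passing `now` explicitly keeps the check testable without patching the clock.
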
### T6 — `bridge cert-status` command
- [ ] `cli.py`: add `cert-status [TUNNEL]` subcommand
- [ ] For each tunnel (or the named one): read cert file from state dir if present,
  run `ssh-keygen -L`, display: identity, principals, valid-from, valid-until,
  time-to-expiry (or "static key / no cert" if absent)
- [ ] Exit code 1 if any cert is expired; exit code 0 otherwise (scriptable)
- [ ] `--json` flag for machine-readable output

### T7 — CertAcquisitionError handling
- [ ] New exception `CertAcquisitionError` in `models.py`
- [ ] In `_run_loop`: catch `CertAcquisitionError`, log `AuditEvent.BRIDGE_DISCONNECTED`
  with `detail="cert acquisition failed: <stderr>"`, apply normal backoff and retry
  (cert failures are transient — e.g., Vault briefly unreachable)
- [ ] After `max_attempts` consecutive cert failures, transition to `FAILED` state

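The retry behaviour in T7 might be sketched as below. The workplan only says "normal backoff", so the exponential schedule, the `base_delay` / `max_delay` parameters, and the `acquire_with_backoff` name are all assumptions; the real `_run_loop` would reuse whatever backoff the tunnel reconnect logic already has.

```python
import time
from typing import Callable, Optional


class CertAcquisitionError(RuntimeError):
    """cert_command exited non-zero; the message carries its stderr."""


def acquire_with_backoff(
    acquire: Callable[[], Optional[str]],
    max_attempts: int,
    log: Callable[[str, str], None],
    sleep: Callable[[float], None] = time.sleep,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
) -> Optional[str]:
    """Retry acquire(); re-raise after max_attempts consecutive failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return acquire()
        except CertAcquisitionError as exc:
            log("BRIDGE_DISCONNECTED", f"cert acquisition failed: {exc}")
            if attempt == max_attempts:
                raise  # caller transitions the tunnel to FAILED
            # Assumed schedule: 1 s, 2 s, 4 s, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
    return None  # not reached; keeps the return type explicit
```

Injecting `log` and `sleep` keeps the loop testable without real delays or a real audit logger.
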
### T8 — SCOPE.md and documentation updates
- [ ] Update `SCOPE.md`: replace "Identity/credential management (uses existing SSH keys)"
  with the pluggable cert_command model; add ops-warden as related repo; update
  actor terminology to adm/agt/atm; update Current State
- [ ] Update `wiki/OpsBridgeFrs.md` §5.7 (actor attribution): note three-actor model,
  cert_identity field, cert_command interface
- [ ] Update `wiki/OpsBridgePrd.md`: note directive alignment, ops-warden dependency
- [ ] Update config example in README / `wiki/` to show both static and cert_command modes
- [ ] Update `.claude/rules/architecture.md`: add cert lifecycle to architecture description

### T9 — Tests
- [ ] `test_config.py`: actor name prefix validation (adm/agt/atm); legacy class mapping;
  cert_command parse
- [ ] `test_manager.py`: mock `cert_command` subprocess; verify cert path appended to SSH
  args; verify `CertAcquisitionError` on non-zero exit
- [ ] `test_manager.py`: TTL logic — mock `cert_expires_at` in past; verify refresh triggers
- [ ] `test_audit.py`: `cert_identity` field present in CONNECTED event when cert was used;
  absent in static-key mode
- [ ] `test_cli.py`: `cert-status` exit codes; JSON output shape

---

## Config Schema — Before / After

### Before
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: ops-agent
    ssh_key: ~/.ssh/id_ed25519
    actor: automation-agent

actors:
  automation-agent:
    class: automation
    description: "state hub bridge agent"
```

### After (static key mode — unchanged behavior)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```

### After (cert_command mode — ops-warden or any CA)
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"

actors:
  agt-state-hub-bridge:
    class: agt
    description: "state hub bridge agent"
```

---

## Acceptance Criteria

- [ ] Existing `tunnels.yaml` with `class: automation` loads without error (deprecation
  warning only); tunnel behaves identically
- [ ] New config with `class: agt` and actor name not prefixed `agt-` raises `ConfigError`
- [ ] Config with `cert_command` set: SSH process launched with both `-i key` and
  `-i cert`; `cert_identity` present in `BRIDGE_CONNECTED` audit event
- [ ] Config without `cert_command`: no cert file written; `cert_identity` absent in audit;
  no TTL logic runs
- [ ] `cert_command` exits non-zero: tunnel enters backoff/retry, `BRIDGE_DISCONNECTED`
  logged with stderr detail; eventually reaches `FAILED` after `max_attempts`
- [ ] Cert within 5 min of expiry: SSH restarted with fresh cert; `CERT_EXPIRING` logged
- [ ] `bridge cert-status` shows valid cert info; exits 1 on expired cert
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`

252
workplans/WARDEN-WP-0001-initial-implementation.md
Normal file
@@ -0,0 +1,252 @@
---
id: WARDEN-WP-0001
type: workplan
title: "OpsWarden Initial Implementation"
domain: custodian
repo: ops-warden
status: draft
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
---

# WARDEN-WP-0001 — OpsWarden Initial Implementation

> **Note:** This workplan is authored in `ops-bridge` because `ops-warden` does not yet exist.
> Move it to `workplans/WARDEN-WP-0001-initial-implementation.md` in the new repo as the
> first commit action.

**Scope:** Bootstrap the `ops-warden` repository and deliver a working `warden` CLI that
implements the SSH CA and certificate lifecycle defined in `wiki/AccessManagementDirective.md`.

**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
(those live in `railiance-infra`), session recording, and SSO integration (trigger §6.2 of
the directive when scale requires it).

---

## Goal

Create a new `ops-warden` repository that owns **credential issuance only** — the CA,
certificate signing, actor identity registry, and scorecard tooling. Its sole public surface
to sibling repos is a well-defined `cert_command` interface that any tool (principally
`ops-bridge`) can call to obtain a short-lived, CA-signed SSH certificate for a named actor.

---

## Reference Documents

| Document | Location |
|---|---|
| AccessManagementDirective | `ops-bridge/wiki/AccessManagementDirective.md` |
| ops-bridge SCOPE.md | `ops-bridge/SCOPE.md` |

---

## Architecture

```
ops-warden/
├── SCOPE.md
├── CLAUDE.md
├── pyproject.toml
├── src/warden/
│   ├── cli.py          # Typer CLI: sign / issue / status / inventory / scorecard
│   ├── models.py       # ActorType enum, CertSpec, CertRecord, PrincipalsInventory
│   ├── ca.py           # LocalCA backend (file-based, for dev / non-Vault)
│   ├── vault.py        # VaultCA backend (Vault SSH engine, for production)
│   ├── inventory.py    # YAML principals inventory read/write
│   ├── scorecard.py    # §5 compliance checks
│   └── config.py       # ~/.config/warden/warden.yaml loader
├── tests/
└── wiki/               # (symlink or copy of AccessManagementDirective.md)
```

**Backends are swappable.** Config key `backend: local | vault` selects which CA
implementation is used. This means the tool is fully functional without Vault for local lab
use, and production-grade with Vault — the same CLI surface, the same `cert_command`
interface, the same principals inventory format.

**cert_command interface contract:**
```
warden sign <actor-name> --pubkey <path>
```
Writes the signed certificate to stdout (the cert text). Exits non-zero on failure.
`ops-bridge` calls this verbatim via `cert_command` in `tunnels.yaml`.

---

## Stack

- **Language:** Python 3.11+
- **CLI framework:** Typer
- **Dependencies:** typer, pyyaml, httpx, cryptography (for cert parsing / TTL reading)
- **Vault SDK:** `hvac` (optional; only required for vault backend)
- **Packaging:** `uv tool install`

---

## Tasks

### T1 — Repository bootstrap
- [ ] Create `ops-warden` repo; copy CLAUDE.md template from `ops-bridge`; add
  `workplans/WARDEN-WP-0001-initial-implementation.md` (this file)
- [ ] Write `SCOPE.md` (see template in §SCOPE below)
- [ ] `pyproject.toml`: `[project.scripts] warden = "warden.cli:app"`
- [ ] Register repo with state-hub (`register_repo`)
- [ ] Create state-hub workstream for this workplan

### T2 — Models and config
- [ ] `models.py`: `ActorType` enum (`adm | agt | atm`); `CertSpec` (actor_name, pubkey_path,
  ttl_hours, principals); `CertRecord` (identity, valid_before, cert_path, signed_at)
- [ ] `config.py`: load `~/.config/warden/warden.yaml`; required fields: `backend`,
  `ca_key` (local) or `vault_addr` + `vault_role_map` (vault); optional:
  `inventory_path`, `state_dir`
- [ ] Validate actor name prefix matches `ActorType` (`adm-*`, `agt-*`, `atm-*`)

### T3 — LocalCA backend
- [ ] `ca.py`: `LocalCA.sign(spec: CertSpec) -> CertRecord`
  - Calls `ssh-keygen -s <ca_key> -I <identity> -n <principals> -V +<ttl>h <pubkey>`
  - Parses `ssh-keygen -L -f <cert>` output to extract `Valid before`, `Key ID`,
    `Principals`
  - Returns `CertRecord`; writes cert to `~/.local/state/warden/<actor>.cert.pub`
- [ ] Default TTLs enforced per `ActorType`: adm → 48 h, agt → 24 h, atm → 8 h
  (overridable per actor in inventory)
- [ ] `LocalCA.generate_keypair(actor_name) -> (privkey_path, pubkey_path)` — for agt/atm
  actors that do not bring their own key

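The signing call and default TTLs above could be assembled roughly as follows. `build_sign_argv` and `cert_path_for` are hypothetical helper names, not part of the planned `LocalCA` API; the sketch only builds the argument list and does not invoke `ssh-keygen`.

```python
from pathlib import Path
from typing import Optional

# Default TTLs per actor type, as listed in T3 (hours).
DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


def build_sign_argv(ca_key: str, identity: str, principals: list[str],
                    pubkey: str, actor_type: str,
                    ttl_hours: Optional[int] = None) -> list[str]:
    """Build the ssh-keygen -s command line for one signing request."""
    ttl = ttl_hours if ttl_hours is not None else DEFAULT_TTL_HOURS[actor_type]
    return [
        "ssh-keygen",
        "-s", ca_key,                 # CA private key
        "-I", identity,               # becomes the cert's Key ID
        "-n", ",".join(principals),   # comma-separated principals
        "-V", f"+{ttl}h",             # validity: now until now + ttl hours
        pubkey,
    ]


def cert_path_for(pubkey: str) -> Path:
    """Where ssh-keygen -s writes the cert: <name>-cert.pub next to the key."""
    path = Path(pubkey)
    stem = path.name[:-4] if path.name.endswith(".pub") else path.name
    return path.with_name(stem + "-cert.pub")
```

Because `ssh-keygen -s` places the certificate beside the public key, `LocalCA.sign` would need a copy or move step to end up with the state-dir path named in T3.
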
### T4 — VaultCA backend
- [ ] `vault.py`: `VaultCA.sign(spec: CertSpec) -> CertRecord`
  - `POST /v1/ssh/sign/<role>` with `public_key`, `valid_principals`, `ttl`
  - Parse response `signed_key` field; write to state dir; extract metadata via
    `ssh-keygen -L`
- [ ] Role map in config: `vault_role_map: {adm: adm-role, agt: agt-role, atm: atm-role}`
- [ ] Graceful error message when Vault is unreachable (with `--backend local` fallback hint)

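As an illustration of the request shape in T4, the sketch below builds the URL and JSON body and unpacks the response without performing any network call. It assumes the SSH secrets engine is mounted at `ssh/` (as the path in T4 implies) and that the response nests `signed_key` under `data`; the function names are hypothetical.

```python
from typing import Any


def build_vault_sign_request(vault_addr: str, role: str, public_key: str,
                             principals: list[str],
                             ttl_hours: int) -> tuple[str, dict[str, str]]:
    """Return (url, json_body) for a Vault SSH sign request."""
    url = f"{vault_addr.rstrip('/')}/v1/ssh/sign/{role}"
    body = {
        "public_key": public_key,
        "valid_principals": ",".join(principals),  # comma-separated string
        "ttl": f"{ttl_hours}h",
    }
    return url, body


def extract_signed_key(response_json: dict[str, Any]) -> str:
    """Pull the signed certificate text out of the response payload."""
    return response_json["data"]["signed_key"].strip()


# Role lookup per actor type, mirroring vault_role_map in warden.yaml.
ROLE_MAP = {"adm": "adm-role", "agt": "agt-role", "atm": "atm-role"}
```

The returned pair can be handed to whichever HTTP client the backend uses (`httpx` or `hvac` per the Stack section), together with the Vault token header.
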
### T5 — Principals inventory
- [ ] `inventory.py`: load/save `inventory.yaml` (format mirrors §4.1 of directive):
  ```yaml
  actors:
    agt-state-hub-bridge:
      type: agt
      principals: [agt-task-bridge]
      ttl_hours: 24
      description: "ops-bridge tunnel actor"
  hosts:
    coulombcore:
      allowed_principals:
        agt: [agt-task-bridge]
        atm: [atm-backup-daily]
  ```
- [ ] `warden inventory list` — print table
- [ ] `warden inventory add <actor-name> --type <adm|agt|atm> --principals <...>`
- [ ] `warden inventory remove <actor-name>`

### T6 — CLI commands
- [ ] `warden sign <actor-name> --pubkey <path>` — sign existing pubkey; write cert to
  stdout (the `cert_command` interface for ops-bridge)
- [ ] `warden issue <actor-name>` — generate keypair + sign; output JSON with
  `privkey`, `cert`, `valid_before`, `identity`
- [ ] `warden status [actor-name]` — show cert validity, identity, principals, TTL
  remaining; `--all` flag to show all actors in state dir
- [ ] `warden scorecard` — run §5 checks (see T7)
- [ ] `warden inventory <subcommand>` (list / add / remove)

### T7 — Scorecard runner
- [ ] `scorecard.py`: implement each §5 row as a named check function returning
  `CheckResult(name, passed, detail)`
- [ ] Checks in scope for `ops-warden` (local checks, not host-side):
  - All certs in state dir respect TTL policy for their `ActorType`
  - No actor in inventory lacks a `principals` entry
  - Actor name prefix matches declared type
  - No cert expired by more than 5 min still present in state dir (stale cleanup)
- [ ] Host-side checks (password auth disabled, root login disabled, etc.) are out of scope
  — those live in the Ansible `ssh-access-audit.yml` playbook in `railiance-infra`
- [ ] `warden scorecard --json` for machine-readable output

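Two of the local checks listed in T7 might look like this. The sketch works on an already-loaded inventory dict in the shape shown in T5; the check names and the `run_scorecard` helper are assumptions, and the cert-based checks (TTL policy, stale certs) are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_principals_present(inventory: dict) -> CheckResult:
    """Every actor in the inventory must list at least one principal."""
    missing = sorted(
        name for name, actor in inventory.get("actors", {}).items()
        if not actor.get("principals")
    )
    detail = "ok" if not missing else "no principals: " + ", ".join(missing)
    return CheckResult("principals-present", not missing, detail)


def check_name_prefix(inventory: dict) -> CheckResult:
    """Actor names must start with their declared type (adm-/agt-/atm-)."""
    bad = sorted(
        name for name, actor in inventory.get("actors", {}).items()
        if not name.startswith(f"{actor.get('type')}-")
    )
    detail = "ok" if not bad else "prefix mismatch: " + ", ".join(bad)
    return CheckResult("name-prefix", not bad, detail)


def run_scorecard(inventory: dict) -> list[CheckResult]:
    """Run the inventory-only checks and return their results in order."""
    checks = (check_principals_present, check_name_prefix)
    return [check(inventory) for check in checks]
```

Returning plain `CheckResult` records keeps the `--json` output a matter of serializing the list.
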
### T8 — ops-ssh-wrapper script
- [ ] Ship `scripts/ops-ssh-wrapper` (the Python snippet from §4.1, hardened):
  - Reads `WARDEN_ACTOR` and `SSH_PUBKEY` env vars
  - Calls `warden sign $WARDEN_ACTOR --pubkey $SSH_PUBKEY`
  - Loads cert via `ssh-add`; execs the given command
- [ ] Install as part of `uv tool install` entry points

### T9 — Tests
- [ ] Unit tests for `LocalCA` (mock `ssh-keygen` subprocess)
- [ ] Unit tests for inventory YAML round-trip
- [ ] Unit tests for actor name prefix validation
- [ ] Integration test: `LocalCA.sign` on a real test keypair (requires `ssh-keygen` in PATH)
- [ ] Scorecard unit tests (mock cert records)

### T10 — Documentation
- [ ] `SCOPE.md` (see below)
- [ ] `wiki/AccessManagementDirective.md` — copy from `ops-bridge/wiki/`
- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference
- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (ops-bridge etc.)

---

## SCOPE.md Template

```
# SCOPE

## One-liner
SSH Certificate Authority and credential issuance for the ops fleet —
signs short-lived certs for adm/agt/atm actors; provides the cert_command
interface consumed by ops-bridge and other tooling.

## Core Idea
Implements AccessManagementDirective §§1–5. Owns the CA key, actor inventory,
signing logic, and scorecard. Does not own tunnel lifecycle, host provisioning,
or SSH key generation for humans.

## In Scope
- Local CA backend (ssh-keygen -s) for lab / non-Vault use
- Vault SSH engine backend for production
- Actor identity registry (inventory.yaml)
- cert_command CLI interface: `warden sign <actor> --pubkey <path>`
- TTL policy enforcement per ActorType (adm/agt/atm)
- Certificate status and stale-cert cleanup
- Scorecard checks (local / cert-side only)
- ops-ssh-wrapper script for agt/atm startup automation

## Out of Scope
- Host-side principal deployment (railiance-infra Ansible)
- SSH key generation for human admins (self-service: ssh-keygen)
- Vault cluster setup / HA
- Session recording, audit forwarding to SIEM (host-side)
- Tunnel lifecycle (ops-bridge)
- SSO / Teleport (trigger when §6.2 scale thresholds are hit)

## Relevant When
- Issuing or refreshing a cert for any adm/agt/atm actor
- Checking cert validity / scorecard compliance
- ops-bridge needs cert_command to be defined
- Adding a new actor to the principals inventory

## Not Relevant When
- Managing tunnel lifecycle (ops-bridge)
- Deploying SSH config to hosts (railiance-infra)
- All access is via static keys with no TTL (legacy mode)

## Current State
Status: planned (WARDEN-WP-0001 not yet started)

## Related Repositories
- ops-bridge — primary consumer of cert_command interface
- railiance-infra — owns host-side principal deployment
- the-custodian/state-hub — registers domain/workstreams
```

---

## Acceptance Criteria

- [ ] `warden sign agt-test-actor --pubkey /tmp/test.pub` outputs a valid cert (local backend)
- [ ] `warden status agt-test-actor` shows correct identity, principals, and time-to-expiry
- [ ] `warden scorecard` returns 5/5 on a clean test inventory
- [ ] `warden sign` called from ops-bridge `cert_command` in an integration test tunnel
- [ ] All tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`