Initial Commit

2026-03-28 00:45:43 +00:00
parent a436a7569d
commit 5ae6b988aa
23 changed files with 2400 additions and 0 deletions

8
CLAUDE.md Normal file

@@ -0,0 +1,8 @@
# ops-warden — Claude Code Instructions
@.claude/rules/repo-identity.md
@.claude/rules/session-protocol.md
@.claude/rules/workplan-convention.md
@.claude/rules/stack-and-commands.md
@.claude/rules/architecture.md
@.claude/rules/repo-boundary.md

129
SCOPE.md Normal file

@@ -0,0 +1,129 @@
# SCOPE
> This file helps you quickly understand what this repository is about,
> when it is relevant, and when it is not.
> It is intentionally lightweight and may be incomplete.
---
## One-liner
SSH Certificate Authority and credential issuance for the ops fleet — signs short-lived
certificates for `adm`/`agt`/`atm` actors; provides the `cert_command` interface consumed
by ops-bridge and other tooling.
---
## Core Idea
Implements `wiki/AccessManagementDirective.md` §§1–5. Owns the CA key, actor identity
inventory, signing logic, and scorecard. Two backends: `local` (ssh-keygen, for labs /
non-Vault use) and `vault` (HashiCorp Vault SSH engine, for production). Both expose the
same CLI surface and the same `cert_command` interface — callers never need to know which
backend is in use.
---
## In Scope
- Local CA backend (`ssh-keygen -s`) — fully functional without Vault
- Vault SSH engine backend — production-grade signing via Vault API
- Actor identity registry (`inventory.yaml`) — maps actors to principals and TTL policy
- `cert_command` interface: `warden sign <actor> --pubkey <path>` → cert text on stdout
- TTL policy enforcement per `ActorType` (`adm` 48 h, `agt` 24 h, `atm` 8 h)
- Certificate status inspection (`warden status`)
- Stale-cert cleanup and scorecard checks (cert-side; see §5 of directive)
- `warden issue` — generate keypair + sign in one step (for `agt`/`atm` actors)
- `ops-ssh-wrapper` script — wraps SSH commands with automatic cert acquisition
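
The TTL policy listed above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes actor names carry their type as a `<type>-` prefix, and the helper names `actor_type_of` and `expiry_for` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Default TTLs in hours, taken from the In Scope list above.
DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


def actor_type_of(actor_name: str) -> str:
    """Infer the actor type from the name prefix (assumed `<type>-<rest>` naming)."""
    prefix, sep, rest = actor_name.partition("-")
    if not sep or not rest or prefix not in DEFAULT_TTL_HOURS:
        raise ValueError(f"cannot infer actor type from {actor_name!r}")
    return prefix


def expiry_for(actor_name: str, issued_at: datetime) -> datetime:
    """Return when a cert issued at `issued_at` would expire under the default TTL."""
    ttl = DEFAULT_TTL_HOURS[actor_type_of(actor_name)]
    return issued_at + timedelta(hours=ttl)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(expiry_for("atm-nightly-backup", now).isoformat())
```

Keeping the policy in one mapping means a per-actor override only has to replace the looked-up value, not the expiry arithmetic.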
---
## Out of Scope
- Tunnel lifecycle management → `ops-bridge`
- Host-side principal deployment (`/etc/ssh/auth_principals/`) → `railiance-infra` Ansible
- SSH key generation for human admins (self-service: `ssh-keygen`)
- Vault cluster setup, HA, or PKI secrets engine
- Session recording, SIEM forwarding, audit log aggregation
- SSO / Teleport integration (trigger when §6.2 scale thresholds are hit)
- Host-side scorecard checks (password auth disabled, root login disabled) → `railiance-infra`
---
## Relevant When
- Issuing or refreshing a cert for any `adm`/`agt`/`atm` actor
- Checking cert validity or running the compliance scorecard
- `ops-bridge` needs a `cert_command` to be defined for a tunnel
- Adding a new actor to the principals inventory
- Bootstrapping the CA for a new environment
---
## Not Relevant When
- Managing tunnel lifecycle (→ `ops-bridge`)
- Deploying SSH principal config to hosts (→ `railiance-infra`)
- All access is via static keys with no TTL (ops-bridge static key mode handles this)
- Human admins manually managing their own certificates
---
## Current State
- Status: planned — WARDEN-WP-0001 not yet started
- Implementation: scaffolding only (models, config, CA, inventory, scorecard, CLI stubs)
---
## How It Fits
- Upstream: CA key (file or Vault); actor inventory in Git
- Downstream consumers: `ops-bridge` calls `warden sign` via `cert_command`; any other
tool needing short-lived SSH certs can use the same interface
- Often used with: `ops-bridge` (primary consumer), `railiance-infra` (host-side principal sync)
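
A downstream consumer could invoke the interface roughly like this. The sketch assumes only what this file states (`warden sign <actor> --pubkey <path>` prints the cert text on stdout); the function names `build_cert_command` and `fetch_cert` are hypothetical and are not part of ops-bridge.

```python
import shlex
import subprocess


def build_cert_command(actor: str, pubkey_path: str) -> list[str]:
    """Build the argv for `warden sign <actor> --pubkey <path>`."""
    return ["warden", "sign", actor, "--pubkey", pubkey_path]


def fetch_cert(actor: str, pubkey_path: str) -> str:
    """Run the cert_command and return the certificate text read from stdout.

    Raises subprocess.CalledProcessError if warden exits non-zero.
    """
    result = subprocess.run(
        build_cert_command(actor, pubkey_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Show the shell form a tunnel config might carry as its cert_command.
    print(shlex.join(build_cert_command("agt-state-hub-bridge", "/tmp/key.pub")))
```

Because the caller only sees an argv and stdout, it stays unaware of whether the local or Vault backend produced the certificate.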
---
## Terminology
- `ActorType`: `adm` (human operator), `agt` (LLM agent), `atm` (deterministic automation)
- `cert_command`: shell command that a caller (e.g. ops-bridge) runs to obtain a cert
- `CertSpec`: signing request (actor name, pubkey path, TTL, principals)
- `CertRecord`: result of signing (identity, valid_before, cert_path, signed_at)
- `principals`: SSH roles embedded in the cert, matched against `/etc/ssh/auth_principals/%u`
- `inventory.yaml`: authoritative registry of actor → principals + TTL policy
- `LocalCA`: file-based CA backend using `ssh-keygen -s`
- `VaultCA`: Vault SSH engine backend
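
To make the `principals` entry concrete, here is a simplified sketch of the host-side match. It assumes an `auth_principals` file with one principal name per line, and it ignores the per-line options that sshd also accepts; both function names are hypothetical.

```python
def parse_auth_principals(text: str) -> set[str]:
    """Return the principal names listed in an auth_principals file body."""
    names: set[str] = set()
    for line in text.splitlines():
        entry = line.strip()
        # Skip blank lines and comments.
        if entry and not entry.startswith("#"):
            names.add(entry)
    return names


def cert_matches(cert_principals: list[str], auth_principals_text: str) -> bool:
    """True if at least one principal in the cert is listed for the target user."""
    allowed = parse_auth_principals(auth_principals_text)
    return any(p in allowed for p in cert_principals)


if __name__ == "__main__":
    body = "# deployed by railiance-infra\nagt-state-hub-bridge\nadm-alice\n"
    print(cert_matches(["agt-state-hub-bridge"], body))
```

A cert with an empty principals list never matches here, which mirrors why the scorecard insists that every actor has at least one principal.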
---
## Related / Overlapping Repositories
- `ops-bridge` — primary consumer; calls `warden sign` via `cert_command` in tunnel config
- `railiance-infra` — owns host-side principal deployment and host-side scorecard checks
- `the-custodian/state-hub` — domain/workstream registry
---
## Provided Capabilities
```capability
type: security
title: SSH certificate issuance
description: Issues short-lived CA-signed SSH certificates for adm/agt/atm actors via a
pluggable cert_command interface; supports local CA (ssh-keygen) and Vault SSH engine backends.
keywords: [ssh, certificate, ca, credential, warden, ops-warden, pki, vault]
```
---
## Getting Oriented
- Start with: `SCOPE.md` (this file), then `wiki/AccessManagementDirective.md`
- Config reference: `wiki/OpsWardenConfig.md`
- cert_command contract: `wiki/CertCommandInterface.md`
- Config files: `~/.config/warden/warden.yaml`, `~/.config/warden/inventory.yaml`
- State: `~/.local/state/warden/` (certs, generated keypairs)
- Entry point: `warden --help`
- Workplan: `workplans/WARDEN-WP-0001-initial-implementation.md`

34
pyproject.toml Normal file

@@ -0,0 +1,34 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[project]
name = "ops-warden"
version = "0.1.0"
description = "SSH CA and certificate lifecycle manager for ops actors"
requires-python = ">=3.11"
dependencies = [
"typer[all]>=0.12",
"pyyaml>=6.0",
"httpx>=0.27",
]
[project.scripts]
warden = "warden.cli:app"
ops-ssh-wrapper = "warden.scripts.ops_ssh_wrapper:main"
[tool.hatch.build.targets.wheel]
packages = ["src/warden"]
[tool.pytest.ini_options]
testpaths = ["tests"]
pythonpath = ["src"]
[tool.ruff]
line-length = 88
[dependency-groups]
dev = [
"pytest>=8.0",
"ruff>=0.4",
]

3
src/warden/__init__.py Normal file

@@ -0,0 +1,3 @@
"""OpsWarden — SSH CA and certificate lifecycle manager."""
__version__ = "0.1.0"

164
src/warden/ca.py Normal file

@@ -0,0 +1,164 @@
"""CA backends for OpsWarden: LocalCA (ssh-keygen) and abstract base."""
from __future__ import annotations

import os
import shutil
import subprocess
import tempfile
from abc import ABC, abstractmethod
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional

from warden.models import CertRecord, CertSpec


class CAError(Exception):
    """Raised when a CA operation fails."""


class CABackend(ABC):
    @abstractmethod
    def sign(self, spec: CertSpec) -> CertRecord:
        """Sign the public key in spec and return a CertRecord."""
        ...


def parse_cert_metadata(cert_path: Path) -> dict:
    """Parse ssh-keygen -L output into identity, valid_before, and principals.

    Note: ssh-keygen -L prints validity timestamps in the local timezone and
    without an offset. We run it with TZ=UTC so the parsed values can be
    labelled as UTC safely.
    """
    result = subprocess.run(
        ["ssh-keygen", "-L", "-f", str(cert_path)],
        capture_output=True,
        text=True,
        env={**os.environ, "TZ": "UTC"},
    )
    if result.returncode != 0:
        raise CAError(f"ssh-keygen -L failed: {result.stderr.strip()}")

    identity: Optional[str] = None
    valid_before: Optional[datetime] = None
    principals: List[str] = []
    in_principals = False

    for line in result.stdout.splitlines():
        stripped = line.strip()
        if stripped.startswith("Key ID:"):
            # Key ID: "agt-state-hub-bridge"
            raw = stripped.split(":", 1)[1].strip()
            identity = raw.strip('"')
        elif stripped.startswith("Valid:"):
            # Valid: from 2026-03-28T10:00:00 to 2026-03-29T10:00:00
            parts = stripped.split(" to ", 1)
            if len(parts) == 2:
                ts_str = parts[1].strip()
                try:
                    dt = datetime.fromisoformat(ts_str)
                    valid_before = dt.replace(tzinfo=timezone.utc)
                except ValueError:
                    pass
        elif stripped == "Principals:":
            in_principals = True
        elif in_principals:
            # The block ends at the next field header. That header is usually
            # "Critical Options: (none)", which does not end with ":", so it
            # has to be matched explicitly or it would be read as a principal.
            is_header = stripped.endswith(":") or stripped.startswith(
                "Critical Options:"
            )
            if stripped and not is_header and stripped != "(none)":
                principals.append(stripped)
            else:
                in_principals = False

    if valid_before is None:
        raise CAError(
            f"Could not parse valid_before from cert at {cert_path}. "
            f"Ensure the cert has a valid TTL."
        )

    return {
        "identity": identity or "",
        "valid_before": valid_before,
        "principals": principals,
    }


class LocalCA(CABackend):
    """File-based CA using ssh-keygen. Requires the CA private key on disk."""

    def __init__(self, ca_key: Path, state_dir: Path) -> None:
        self._ca_key = Path(os.path.expanduser(str(ca_key)))
        self._state_dir = Path(os.path.expanduser(str(state_dir)))

    def sign(self, spec: CertSpec) -> CertRecord:
        """Sign the public key in spec. Returns a CertRecord; cert saved to state_dir."""
        pubkey = Path(os.path.expanduser(str(spec.pubkey_path)))
        if not pubkey.exists():
            raise CAError(f"Public key not found: {pubkey}")
        if not self._ca_key.exists():
            raise CAError(f"CA key not found: {self._ca_key}")

        principals_str = ",".join(spec.principals)

        with tempfile.TemporaryDirectory() as tmpdir:
            tmpdir_path = Path(tmpdir)
            pubkey_copy = tmpdir_path / "key.pub"
            shutil.copy2(pubkey, pubkey_copy)
            # ssh-keygen -s writes cert to <input_stem>-cert.pub
            cert_path_tmp = tmpdir_path / "key-cert.pub"

            cmd = [
                "ssh-keygen",
                "-s", str(self._ca_key),
                "-I", spec.identity,
                "-n", principals_str,
                "-V", f"+{spec.ttl_hours}h",
                str(pubkey_copy),
            ]
            result = subprocess.run(cmd, capture_output=True, text=True)
            if result.returncode != 0:
                raise CAError(f"Signing failed: {result.stderr.strip()}")
            if not cert_path_tmp.exists():
                raise CAError(
                    f"Expected cert not written after signing: {cert_path_tmp}. "
                    f"ssh-keygen stderr: {result.stderr.strip()}"
                )

            meta = parse_cert_metadata(cert_path_tmp)

            self._state_dir.mkdir(parents=True, exist_ok=True)
            dest = self._state_dir / f"{spec.actor_name}-cert.pub"
            shutil.copy2(cert_path_tmp, dest)

        return CertRecord(
            identity=meta["identity"] or spec.identity,
            valid_before=meta["valid_before"],
            cert_path=dest,
            signed_at=datetime.now(timezone.utc),
            principals=meta["principals"],
            actor_name=spec.actor_name,
        )

    def generate_keypair(self, actor_name: str) -> tuple[Path, Path]:
        """Generate an ed25519 keypair for an actor.

        Returns (privkey_path, pubkey_path). Overwrites existing files.
        """
        key_dir = self._state_dir / "keys"
        key_dir.mkdir(parents=True, exist_ok=True)
        privkey = key_dir / f"{actor_name}_ed25519"
        pubkey = key_dir / f"{actor_name}_ed25519.pub"

        for p in (privkey, pubkey):
            if p.exists():
                p.unlink()

        cmd = [
            "ssh-keygen", "-t", "ed25519",
            "-f", str(privkey),
            "-N", "",  # no passphrase
            "-C", actor_name,
        ]
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise CAError(f"Key generation failed: {result.stderr.strip()}")
        return privkey, pubkey

397
src/warden/cli.py Normal file

@@ -0,0 +1,397 @@
"""OpsWarden CLI."""
from __future__ import annotations

import json
import sys
from datetime import datetime, timezone
from pathlib import Path
from typing import Annotated, List, Optional

import typer
from rich.console import Console
from rich.table import Table

from warden.ca import CAError, LocalCA, parse_cert_metadata
from warden.config import ConfigError, WardenConfig, load_config
from warden.inventory import ActorEntry, InventoryError, PrincipalsInventory, load_inventory, save_inventory
from warden.models import ActorType, CertSpec, DEFAULT_TTL_HOURS, validate_actor_name
from warden.scorecard import run_scorecard

app = typer.Typer(
    help="OpsWarden — SSH CA and certificate lifecycle manager",
    no_args_is_help=True,
)
inventory_app = typer.Typer(help="Manage principals inventory", no_args_is_help=True)
app.add_typer(inventory_app, name="inventory")

console = Console()
err = Console(stderr=True)


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def _load_cfg() -> WardenConfig:
    try:
        return load_config()
    except ConfigError as e:
        err.print(f"[red]Config error:[/red] {e}")
        raise typer.Exit(1)


def _load_inventory(cfg: WardenConfig) -> PrincipalsInventory:
    try:
        return load_inventory(cfg.inventory_path)
    except InventoryError as e:
        err.print(f"[red]Inventory error:[/red] {e}")
        raise typer.Exit(1)


def _get_ca(cfg: WardenConfig):
    if cfg.backend == "vault":
        from warden.vault import VaultCA
        return VaultCA(cfg.vault, cfg.state_dir)
    return LocalCA(cfg.ca_key, cfg.state_dir)


# ---------------------------------------------------------------------------
# warden sign
# ---------------------------------------------------------------------------

@app.command()
def sign(
    actor_name: Annotated[str, typer.Argument(help="Actor name (e.g. agt-state-hub-bridge)")],
    pubkey: Annotated[Path, typer.Option("--pubkey", help="Path to actor's public key file")],
    ttl: Annotated[Optional[int], typer.Option("--ttl", help="Override TTL in hours")] = None,
) -> None:
    """Sign a public key for the given actor. Writes cert text to stdout.

    This is the cert_command interface: ops-bridge calls this and uses stdout
    as the certificate passed to SSH alongside the private key.
    """
    cfg = _load_cfg()
    inventory = _load_inventory(cfg)

    entry = inventory.actors.get(actor_name)
    if entry is None:
        err.print(
            f"[red]Actor {actor_name!r} not found in inventory.[/red] "
            f"Add it with: warden inventory add"
        )
        raise typer.Exit(1)

    spec = CertSpec(
        actor_name=actor_name,
        actor_type=entry.actor_type,
        pubkey_path=pubkey,
        ttl_hours=ttl or entry.ttl_hours,
        principals=entry.principals,
        identity=actor_name,
    )

    ca = _get_ca(cfg)
    try:
        record = ca.sign(spec)
    except CAError as e:
        err.print(f"[red]Signing failed:[/red] {e}")
        raise typer.Exit(1)

    # cert_command interface: write cert text to stdout only
    print(record.cert_path.read_text().strip())


# ---------------------------------------------------------------------------
# warden issue
# ---------------------------------------------------------------------------

@app.command()
def issue(
    actor_name: Annotated[str, typer.Argument(help="Actor name")],
    ttl: Annotated[Optional[int], typer.Option("--ttl", help="Override TTL in hours")] = None,
    output_json: Annotated[bool, typer.Option("--json", help="Output JSON")] = False,
) -> None:
    """Generate a new keypair and sign it for the given actor.

    Only supported with the local backend. Outputs keypair + cert paths and metadata.
    """
    cfg = _load_cfg()
    if cfg.backend != "local":
        err.print("[red]warden issue is only supported with the local backend.[/red]")
        raise typer.Exit(1)

    inventory = _load_inventory(cfg)
    entry = inventory.actors.get(actor_name)
    if entry is None:
        err.print(f"[red]Actor {actor_name!r} not found in inventory.[/red]")
        raise typer.Exit(1)

    ca = LocalCA(cfg.ca_key, cfg.state_dir)
    try:
        privkey_path, pubkey_path = ca.generate_keypair(actor_name)
    except CAError as e:
        err.print(f"[red]Key generation failed:[/red] {e}")
        raise typer.Exit(1)

    spec = CertSpec(
        actor_name=actor_name,
        actor_type=entry.actor_type,
        pubkey_path=pubkey_path,
        ttl_hours=ttl or entry.ttl_hours,
        principals=entry.principals,
        identity=actor_name,
    )
    try:
        record = ca.sign(spec)
    except CAError as e:
        err.print(f"[red]Signing failed:[/red] {e}")
        raise typer.Exit(1)

    result = {
        "actor": actor_name,
        "privkey": str(privkey_path),
        "cert": str(record.cert_path),
        "identity": record.identity,
        "principals": record.principals,
        "valid_before": record.valid_before.isoformat(),
        "signed_at": record.signed_at.isoformat(),
    }
    if output_json:
        print(json.dumps(result, indent=2))
    else:
        console.print(f"[green]Issued credentials for {actor_name}[/green]")
        for k, v in result.items():
            console.print(f"  {k}: {v}")


# ---------------------------------------------------------------------------
# warden status
# ---------------------------------------------------------------------------

@app.command()
def status(
    actor_name: Annotated[Optional[str], typer.Argument(help="Actor name (omit for all)")] = None,
    output_json: Annotated[bool, typer.Option("--json", help="Output JSON")] = False,
) -> None:
    """Show certificate status. Exits 1 if any cert is expired."""
    cfg = _load_cfg()
    now = datetime.now(timezone.utc)

    if actor_name:
        cert_path = cfg.state_dir / f"{actor_name}-cert.pub"
        paths = [cert_path] if cert_path.exists() else []
    else:
        paths = sorted(cfg.state_dir.glob("*-cert.pub")) if cfg.state_dir.exists() else []

    if not paths:
        msg = (
            f"No certificate found for {actor_name!r} (static key / no cert)"
            if actor_name
            else "No certificates in state dir."
        )
        console.print(msg)
        return

    rows = []
    for cert_path in paths:
        # Strip only the trailing "-cert"; str.replace would also mangle actor
        # names that contain "-cert" elsewhere (e.g. "agt-cert-rotator").
        name = cert_path.stem.removesuffix("-cert")
        try:
            meta = parse_cert_metadata(cert_path)
            valid_before = meta["valid_before"]
            remaining = valid_before - now
            secs = remaining.total_seconds()
            if secs > 0:
                h, rem = divmod(int(secs), 3600)
                m = rem // 60
                remaining_str = f"{h}h {m}m"
                expired = False
            else:
                remaining_str = "EXPIRED"
                expired = True
            rows.append({
                "actor": name,
                "identity": meta["identity"],
                "principals": ", ".join(meta["principals"]),
                "valid_before": valid_before.isoformat(),
                "remaining": remaining_str,
                "expired": expired,
            })
        except Exception as e:
            rows.append({"actor": name, "error": str(e), "expired": False})

    if output_json:
        print(json.dumps(rows, indent=2))
    else:
        table = Table(title="Certificate Status")
        table.add_column("Actor")
        table.add_column("Identity")
        table.add_column("Principals")
        table.add_column("Valid Before (UTC)")
        table.add_column("Remaining")
        for row in rows:
            if "error" in row:
                table.add_row(row["actor"], "[red]parse error[/red]", "", "", row["error"])
            else:
                rem_styled = (
                    f"[red]{row['remaining']}[/red]" if row["expired"] else row["remaining"]
                )
                table.add_row(
                    row["actor"],
                    row["identity"],
                    row["principals"],
                    row["valid_before"],
                    rem_styled,
                )
        console.print(table)

    if any(r.get("expired") for r in rows):
        raise typer.Exit(1)


# ---------------------------------------------------------------------------
# warden scorecard
# ---------------------------------------------------------------------------

@app.command()
def scorecard(
    output_json: Annotated[bool, typer.Option("--json", help="Output JSON")] = False,
) -> None:
    """Run compliance scorecard checks (AccessManagementDirective §5, cert-side)."""
    cfg = _load_cfg()
    inventory = _load_inventory(cfg)
    results = run_scorecard(cfg.state_dir, inventory)

    passed = sum(1 for r in results if r.passed)
    total = len(results)

    if output_json:
        print(json.dumps(
            [{"check": r.name, "passed": r.passed, "detail": r.detail} for r in results],
            indent=2,
        ))
    else:
        table = Table(title=f"OpsWarden Scorecard ({passed}/{total})")
        table.add_column("Check")
        table.add_column("Status")
        table.add_column("Detail")
        for r in results:
            status_str = "[green]PASS[/green]" if r.passed else "[red]FAIL[/red]"
            table.add_row(r.name, status_str, r.detail)
        console.print(table)
        console.print(
            f"\nScore: {passed}/{total} "
            + ("[green]Operational[/green]" if passed == total else "[yellow]Needs attention[/yellow]")
        )

    if passed < total:
        raise typer.Exit(1)


# ---------------------------------------------------------------------------
# warden inventory
# ---------------------------------------------------------------------------

@inventory_app.command("list")
def inventory_list(
    output_json: Annotated[bool, typer.Option("--json")] = False,
) -> None:
    """List all actors in the principals inventory."""
    cfg = _load_cfg()
    inventory = _load_inventory(cfg)

    if not inventory.actors:
        console.print("No actors in inventory.")
        return

    if output_json:
        print(json.dumps({
            name: {
                "type": e.actor_type.value,
                "principals": e.principals,
                "ttl_hours": e.ttl_hours,
                "description": e.description,
            }
            for name, e in inventory.actors.items()
        }, indent=2))
        return

    table = Table(title=f"Principals Inventory ({cfg.inventory_path})")
    table.add_column("Actor")
    table.add_column("Type")
    table.add_column("Principals")
    table.add_column("TTL (h)")
    table.add_column("Description")
    for name, e in inventory.actors.items():
        table.add_row(
            name,
            e.actor_type.value,
            ", ".join(e.principals),
            str(e.ttl_hours),
            e.description,
        )
    console.print(table)


@inventory_app.command("add")
def inventory_add(
    actor_name: Annotated[str, typer.Argument(help="Actor name (e.g. agt-state-hub-bridge)")],
    actor_type: Annotated[ActorType, typer.Option("--type", "-t", help="adm | agt | atm")],
    principals: Annotated[
        Optional[List[str]],
        typer.Option("--principal", "-p", help="Principal (repeat for multiple)"),
    ] = None,
    ttl: Annotated[Optional[int], typer.Option("--ttl", help="TTL in hours")] = None,
    description: Annotated[str, typer.Option("--description", "-d")] = "",
) -> None:
    """Add an actor to the principals inventory."""
    cfg = _load_cfg()

    try:
        validate_actor_name(actor_name, actor_type)
    except ValueError as e:
        err.print(f"[red]{e}[/red]")
        raise typer.Exit(1)

    resolved_principals: List[str] = principals or [actor_name]

    inventory = _load_inventory(cfg)
    inventory.actors[actor_name] = ActorEntry(
        name=actor_name,
        actor_type=actor_type,
        principals=resolved_principals,
        ttl_hours=ttl or DEFAULT_TTL_HOURS[actor_type],
        description=description,
    )
    try:
        save_inventory(inventory, cfg.inventory_path)
    except Exception as e:
        err.print(f"[red]Failed to save inventory:[/red] {e}")
        raise typer.Exit(1)

    console.print(
        f"[green]Added[/green] {actor_name} "
        f"(type={actor_type.value}, principals={resolved_principals}, ttl={ttl or DEFAULT_TTL_HOURS[actor_type]}h)"
    )


@inventory_app.command("remove")
def inventory_remove(
    actor_name: Annotated[str, typer.Argument(help="Actor name to remove")],
) -> None:
    """Remove an actor from the principals inventory."""
    cfg = _load_cfg()
    inventory = _load_inventory(cfg)
    if actor_name not in inventory.actors:
        err.print(f"[red]Actor {actor_name!r} not in inventory.[/red]")
        raise typer.Exit(1)
    del inventory.actors[actor_name]
    try:
        save_inventory(inventory, cfg.inventory_path)
    except Exception as e:
        err.print(f"[red]Failed to save inventory:[/red] {e}")
        raise typer.Exit(1)
    console.print(f"[green]Removed[/green] {actor_name}")

114
src/warden/config.py Normal file

@@ -0,0 +1,114 @@
"""Config loading for OpsWarden."""
from __future__ import annotations

import os
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, Optional

import yaml


class ConfigError(Exception):
    """Raised when config is invalid or missing."""


@dataclass
class VaultConfig:
    addr: str
    role_map: Dict[str, str]  # ActorType.value -> vault role name
    token_env: str = "VAULT_TOKEN"  # env var holding the Vault token
    mount: str = "ssh"  # Vault secrets engine mount path


@dataclass
class WardenConfig:
    backend: str  # "local" or "vault"
    ca_key: Optional[Path] = None  # required for local backend
    vault: Optional[VaultConfig] = None  # required for vault backend
    inventory_path: Path = field(
        default_factory=lambda: Path.home() / ".config" / "warden" / "inventory.yaml"
    )
    state_dir: Path = field(
        default_factory=lambda: Path.home() / ".local" / "state" / "warden"
    )


def _default_config_path() -> Path:
    return Path.home() / ".config" / "warden" / "warden.yaml"


def load_config(path: Optional[Path] = None) -> WardenConfig:
    """Load and validate warden.yaml. Respects WARDEN_CONFIG env var."""
    config_path = path or Path(
        os.environ.get("WARDEN_CONFIG", str(_default_config_path()))
    )
    if not config_path.exists():
        raise ConfigError(f"Config not found: {config_path}")

    try:
        with config_path.open() as f:
            raw = yaml.safe_load(f)
    except yaml.YAMLError as e:
        raise ConfigError(f"Invalid YAML in {config_path}: {e}") from e

    if not isinstance(raw, dict):
        raise ConfigError("Config must be a YAML mapping")

    backend = str(raw.get("backend", "local"))
    if backend not in ("local", "vault"):
        raise ConfigError(
            f"backend must be 'local' or 'vault', got: {backend!r}"
        )

    ca_key = None
    if "ca_key" in raw and raw["ca_key"]:
        ca_key = Path(os.path.expanduser(str(raw["ca_key"])))

    vault_cfg = None
    if backend == "vault":
        v = raw.get("vault") or {}
        if "addr" not in v:
            raise ConfigError("vault backend requires vault.addr")
        role_map = v.get("role_map") or {
            "adm": "adm-role",
            "agt": "agt-role",
            "atm": "atm-role",
        }
        vault_cfg = VaultConfig(
            addr=str(v["addr"]),
            role_map=dict(role_map),
            token_env=str(v.get("token_env", "VAULT_TOKEN")),
            mount=str(v.get("mount", "ssh")),
        )
    elif backend == "local" and ca_key is None:
        raise ConfigError("local backend requires ca_key")

    inventory_path = Path(
        os.path.expanduser(
            str(
                raw.get(
                    "inventory_path",
                    str(Path.home() / ".config" / "warden" / "inventory.yaml"),
                )
            )
        )
    )
    state_dir = Path(
        os.path.expanduser(
            str(
                raw.get(
                    "state_dir",
                    str(Path.home() / ".local" / "state" / "warden"),
                )
            )
        )
    )

    return WardenConfig(
        backend=backend,
        ca_key=ca_key,
        vault=vault_cfg,
        inventory_path=inventory_path,
        state_dir=state_dir,
    )

108
src/warden/inventory.py Normal file

@@ -0,0 +1,108 @@
"""Principals inventory — actor registry with type, principals, and TTL policy."""
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List

import yaml

from warden.models import ActorType, DEFAULT_TTL_HOURS, validate_actor_name


class InventoryError(Exception):
    """Raised when inventory is invalid."""


@dataclass
class ActorEntry:
    name: str
    actor_type: ActorType
    principals: List[str]
    ttl_hours: int
    description: str = ""


@dataclass
class HostEntry:
    name: str
    allowed_principals: Dict[str, List[str]]  # actor_type.value -> [principal, ...]


@dataclass
class PrincipalsInventory:
    actors: Dict[str, ActorEntry] = field(default_factory=dict)
    hosts: Dict[str, HostEntry] = field(default_factory=dict)


def load_inventory(path: Path) -> PrincipalsInventory:
    """Load inventory.yaml. Returns empty inventory if path does not exist."""
    if not path.exists():
        return PrincipalsInventory()

    try:
        with path.open() as f:
            raw = yaml.safe_load(f) or {}
    except yaml.YAMLError as e:
        raise InventoryError(f"Invalid YAML in {path}: {e}") from e

    actors: Dict[str, ActorEntry] = {}
    for name, data in (raw.get("actors") or {}).items():
        if not isinstance(data, dict):
            raise InventoryError(f"Actor {name!r} must be a mapping")
        type_raw = str(data.get("type", ""))
        try:
            actor_type = ActorType(type_raw)
        except ValueError:
            raise InventoryError(
                f"Actor {name!r} has invalid type: {type_raw!r}. "
                f"Must be one of: adm, agt, atm"
            )
        try:
            validate_actor_name(name, actor_type)
        except ValueError as e:
            raise InventoryError(str(e)) from e

        ttl = int(data.get("ttl_hours", DEFAULT_TTL_HOURS[actor_type]))
        principals = list(data.get("principals") or [name])
        actors[name] = ActorEntry(
            name=name,
            actor_type=actor_type,
            principals=principals,
            ttl_hours=ttl,
            description=str(data.get("description", "")),
        )

    hosts: Dict[str, HostEntry] = {}
    for hostname, data in (raw.get("hosts") or {}).items():
        if not isinstance(data, dict):
            raise InventoryError(f"Host {hostname!r} must be a mapping")
        hosts[hostname] = HostEntry(
            name=hostname,
            allowed_principals=dict(data.get("allowed_principals") or {}),
        )

    return PrincipalsInventory(actors=actors, hosts=hosts)


def save_inventory(inventory: PrincipalsInventory, path: Path) -> None:
    """Write inventory to path, creating parent directories as needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    raw: dict = {
        "actors": {
            name: {
                "type": e.actor_type.value,
                "principals": e.principals,
                "ttl_hours": e.ttl_hours,
                **({"description": e.description} if e.description else {}),
            }
            for name, e in inventory.actors.items()
        },
    }
    if inventory.hosts:
        raw["hosts"] = {
            name: {"allowed_principals": h.allowed_principals}
            for name, h in inventory.hosts.items()
        }
    with path.open("w") as f:
        yaml.dump(raw, f, default_flow_style=False, sort_keys=False)

67
src/warden/models.py Normal file

@@ -0,0 +1,67 @@
"""Domain models for OpsWarden."""
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from pathlib import Path
from typing import List


class ActorType(str, Enum):
    ADM = "adm"  # human operator
    AGT = "agt"  # LLM-powered autonomous agent
    ATM = "atm"  # deterministic script / pipeline


# Default certificate TTLs per ActorType (AccessManagementDirective §2)
DEFAULT_TTL_HOURS: dict[ActorType, int] = {
    ActorType.ADM: 48,
    ActorType.AGT: 24,
    ActorType.ATM: 8,
}

# Required name prefixes per ActorType (directive §2 naming convention)
ACTOR_PREFIX: dict[ActorType, str] = {
    ActorType.ADM: "adm-",
    ActorType.AGT: "agt-",
    ActorType.ATM: "atm-",
}


def validate_actor_name(name: str, actor_type: ActorType) -> None:
    """Raise ValueError if name does not carry the required prefix for actor_type."""
    prefix = ACTOR_PREFIX[actor_type]
    if not name.startswith(prefix):
        raise ValueError(
            f"Actor name {name!r} must start with {prefix!r} for type {actor_type.value!r}. "
            f"(AccessManagementDirective §2 naming convention)"
        )


@dataclass
class CertSpec:
    """Signing request passed to a CABackend."""

    actor_name: str
    actor_type: ActorType
    pubkey_path: Path
    ttl_hours: int
    principals: List[str]
    identity: str = ""  # defaults to actor_name if empty

    def __post_init__(self) -> None:
        if not self.identity:
            self.identity = self.actor_name


@dataclass
class CertRecord:
    """Result returned by a CABackend after signing."""

    identity: str
    valid_before: datetime
    cert_path: Path
    signed_at: datetime
    principals: List[str] = field(default_factory=list)
    actor_name: str = ""

98
src/warden/scorecard.py Normal file

@@ -0,0 +1,98 @@
"""Compliance scorecard — cert-side checks (AccessManagementDirective §5)."""
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import List

from warden.ca import CAError, parse_cert_metadata
from warden.inventory import PrincipalsInventory
from warden.models import ACTOR_PREFIX, ActorType


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


def check_actor_name_prefixes(inventory: PrincipalsInventory) -> CheckResult:
    """All actor names must carry the prefix matching their type."""
    violations = []
    for name, entry in inventory.actors.items():
        expected = ACTOR_PREFIX[entry.actor_type]
        if not name.startswith(expected):
            violations.append(f"{name!r} should start with {expected!r}")
    return CheckResult(
        name="actor_name_prefixes",
        passed=len(violations) == 0,
        detail=(
            "; ".join(violations) if violations else "all actor names match prefix convention"
        ),
    )


def check_all_actors_have_principals(inventory: PrincipalsInventory) -> CheckResult:
    """Every actor in inventory must have at least one principal."""
    missing = [name for name, e in inventory.actors.items() if not e.principals]
    return CheckResult(
        name="actors_have_principals",
        passed=len(missing) == 0,
        detail=f"missing principals: {missing}" if missing else "all actors have principals",
    )


def check_no_expired_certs(state_dir: Path) -> CheckResult:
    """No cert in state_dir should be currently expired."""
    if not state_dir.exists():
        return CheckResult("no_expired_certs", passed=True, detail="no state dir")

    now = datetime.now(timezone.utc)
    expired = []
    for cert_path in state_dir.glob("*-cert.pub"):
        try:
            meta = parse_cert_metadata(cert_path)
        except CAError:
            continue
        if meta["valid_before"] < now:
            # Strip only the trailing "-cert" so actor names that contain
            # "-cert" elsewhere are reported intact.
            expired.append(cert_path.stem.removesuffix("-cert"))
    return CheckResult(
        name="no_expired_certs",
        passed=len(expired) == 0,
        detail=f"expired: {expired}" if expired else "no expired certs",
    )


def check_no_stale_certs(state_dir: Path) -> CheckResult:
    """Certs expired by more than 5 minutes should have been cleaned up."""
    if not state_dir.exists():
        return CheckResult("no_stale_certs", passed=True, detail="no state dir")

    cutoff = datetime.now(timezone.utc) - timedelta(minutes=5)
    stale = []
    for cert_path in state_dir.glob("*-cert.pub"):
        try:
            meta = parse_cert_metadata(cert_path)
        except CAError:
            continue
        if meta["valid_before"] < cutoff:
            stale.append(cert_path.name)
    return CheckResult(
        name="no_stale_certs",
        passed=len(stale) == 0,
        detail=f"stale certs present: {stale}" if stale else "no stale certs",
    )


def run_scorecard(state_dir: Path, inventory: PrincipalsInventory) -> List[CheckResult]:
    """Run all cert-side scorecard checks. Returns list of CheckResult."""
    return [
        check_actor_name_prefixes(inventory),
        check_all_actors_have_principals(inventory),
        check_no_expired_certs(state_dir),
        check_no_stale_certs(state_dir),
    ]

View File

View File

@@ -0,0 +1,82 @@
"""ops-ssh-wrapper — acquire a warden cert and exec the given SSH command.

Usage:
    WARDEN_ACTOR=agt-my-agent SSH_PUBKEY=~/.ssh/agt-my-agent_ed25519.pub \\
        ops-ssh-wrapper ssh -R 8001:127.0.0.1:8000 agt-my-agent@host

Environment:
    WARDEN_ACTOR    Actor name in the warden inventory (e.g. agt-state-hub-bridge)
    SSH_PUBKEY      Path to the actor's SSH public key file

The wrapper requests a fresh cert from warden on every invocation, loads it into
ssh-agent, then execs the given command. Equivalent to the pattern in
AccessManagementDirective §4.1, hardened for production use.
"""
from __future__ import annotations
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def main() -> None:
    actor = os.environ.get("WARDEN_ACTOR")
    pubkey = os.environ.get("SSH_PUBKEY")
    if not actor:
        print("ops-ssh-wrapper: WARDEN_ACTOR not set", file=sys.stderr)
        sys.exit(1)
    if not pubkey:
        print("ops-ssh-wrapper: SSH_PUBKEY not set", file=sys.stderr)
        sys.exit(1)

    pubkey_path = Path(os.path.expanduser(pubkey))
    if not pubkey_path.exists():
        print(f"ops-ssh-wrapper: SSH_PUBKEY not found: {pubkey_path}", file=sys.stderr)
        sys.exit(1)

    # Ask warden for a fresh cert; the cert text arrives on stdout.
    try:
        cert_text = subprocess.check_output(
            ["warden", "sign", actor, "--pubkey", str(pubkey_path)],
            text=True,
        ).strip()
    except subprocess.CalledProcessError as e:
        print(
            f"ops-ssh-wrapper: warden sign failed (exit {e.returncode})", file=sys.stderr
        )
        sys.exit(1)
    except FileNotFoundError:
        print(
            "ops-ssh-wrapper: 'warden' not found in PATH. "
            "Install ops-warden: uv tool install ops-warden",
            file=sys.stderr,
        )
        sys.exit(1)

    with tempfile.NamedTemporaryFile(
        suffix="-cert.pub", mode="w", delete=False, prefix=f"{actor}-"
    ) as f:
        f.write(cert_text + "\n")
        cert_path = f.name

    # Load the cert into ssh-agent; a failure here is reported but not fatal.
    try:
        result = subprocess.run(
            ["ssh-add", cert_path], capture_output=True, text=True
        )
        if result.returncode != 0:
            print(
                f"ops-ssh-wrapper: ssh-add warning: {result.stderr.strip()} "
                f"(ssh-agent may not be running; continuing anyway)",
                file=sys.stderr,
            )
    finally:
        os.unlink(cert_path)

    # Replace this process with the wrapped command, if one was given.
    if len(sys.argv) > 1:
        os.execvp(sys.argv[1], sys.argv[1:])


if __name__ == "__main__":
    main()

97
src/warden/vault.py Normal file
View File

@@ -0,0 +1,97 @@
"""VaultCA backend — HashiCorp Vault SSH engine."""
from __future__ import annotations
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path
import httpx
from warden.ca import CABackend, CAError, parse_cert_metadata
from warden.config import VaultConfig
from warden.models import CertRecord, CertSpec


class VaultCA(CABackend):
    """CA backend that signs via HashiCorp Vault SSH secrets engine."""

    def __init__(self, vault_cfg: VaultConfig, state_dir: Path) -> None:
        self._cfg = vault_cfg
        self._state_dir = Path(os.path.expanduser(str(state_dir)))

    def _token(self) -> str:
        token = os.environ.get(self._cfg.token_env, "")
        if not token:
            raise CAError(
                f"Vault token not found. Set the {self._cfg.token_env!r} "
                f"environment variable, or run: vault login"
            )
        return token

    def sign(self, spec: CertSpec) -> CertRecord:
        """Sign the public key via Vault SSH engine. Returns a CertRecord."""
        pubkey_path = Path(os.path.expanduser(str(spec.pubkey_path)))
        if not pubkey_path.exists():
            raise CAError(f"Public key not found: {pubkey_path}")
        pubkey_text = pubkey_path.read_text().strip()

        role = self._cfg.role_map.get(spec.actor_type.value)
        if not role:
            raise CAError(
                f"No Vault role mapped for actor type {spec.actor_type.value!r}. "
                f"Add it to vault.role_map in warden.yaml."
            )

        url = f"{self._cfg.addr}/v1/{self._cfg.mount}/sign/{role}"
        try:
            response = httpx.post(
                url,
                json={
                    "public_key": pubkey_text,
                    "valid_principals": ",".join(spec.principals),
                    "ttl": f"{spec.ttl_hours}h",
                    "cert_type": "user",
                    "key_id": spec.identity,
                },
                headers={"X-Vault-Token": self._token()},
                timeout=10.0,
            )
            response.raise_for_status()
        except httpx.HTTPStatusError as e:
            raise CAError(
                f"Vault signing failed (HTTP {e.response.status_code}): "
                f"{e.response.text}"
            ) from e
        except httpx.RequestError as e:
            raise CAError(
                f"Vault unreachable at {self._cfg.addr}. "
                f"Is Vault running? Consider --backend local as a fallback.\n{e}"
            ) from e

        cert_text = response.json()["data"]["signed_key"].strip()
        self._state_dir.mkdir(parents=True, exist_ok=True)
        dest = self._state_dir / f"{spec.actor_name}-cert.pub"
        dest.write_text(cert_text + "\n")

        # Parse metadata by writing to a tempfile and running ssh-keygen -L
        with tempfile.NamedTemporaryFile(
            suffix="-cert.pub", mode="w", delete=False
        ) as f:
            f.write(cert_text + "\n")
            tmp_cert = Path(f.name)
        try:
            meta = parse_cert_metadata(tmp_cert)
        finally:
            tmp_cert.unlink(missing_ok=True)

        return CertRecord(
            identity=meta["identity"] or spec.identity,
            valid_before=meta["valid_before"],
            cert_path=dest,
            signed_at=datetime.now(timezone.utc),
            principals=meta["principals"],
            actor_name=spec.actor_name,
        )

0
tests/__init__.py Normal file
View File

180
tests/test_ca.py Normal file
View File

@@ -0,0 +1,180 @@
"""Tests for warden.ca — LocalCA and parse_cert_metadata."""
from datetime import datetime, timezone
from pathlib import Path
from unittest.mock import MagicMock, patch
import pytest
from warden.ca import CAError, LocalCA, parse_cert_metadata
from warden.models import ActorType, CertSpec


SAMPLE_SSHKEYGEN_L = """\
/tmp/key-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Public key: ED25519-CERT SHA256:abc123
        Signing CA: ED25519 SHA256:xyz (using ssh-ed25519)
        Key ID: "agt-state-hub-bridge"
        Serial: 0
        Valid: from 2026-03-28T10:00:00 to 2026-03-29T10:00:00
        Principals:
                agt-task-bridge
        Critical Options: (none)
        Extensions:
                permit-pty
"""

CERT_CONTENT = "ssh-ed25519-cert-v01@openssh.com AAAA_fake_cert_data"


def _mock_run_factory(cert_content: str):
    """Returns a mock subprocess.run that writes the cert file on sign and returns
    SAMPLE_SSHKEYGEN_L on -L."""

    def mock_run(cmd, **kwargs):
        result = MagicMock()
        result.returncode = 0
        result.stdout = ""
        result.stderr = ""
        if not isinstance(cmd, list) or not cmd:
            return result
        if cmd[0] == "ssh-keygen" and "-s" in cmd:
            # Signing: write cert next to the pubkey copy (last arg)
            pubkey_path = Path(cmd[-1])
            cert_path = pubkey_path.parent / (pubkey_path.stem + "-cert.pub")
            cert_path.write_text(cert_content)
        elif cmd[0] == "ssh-keygen" and "-L" in cmd:
            result.stdout = SAMPLE_SSHKEYGEN_L
        return result

    return mock_run


# ---------------------------------------------------------------------------
# parse_cert_metadata
# ---------------------------------------------------------------------------


def test_parse_cert_metadata(tmp_path):
    cert_path = tmp_path / "key-cert.pub"
    cert_path.write_text(CERT_CONTENT)
    mock_result = MagicMock(returncode=0, stdout=SAMPLE_SSHKEYGEN_L, stderr="")
    with patch("warden.ca.subprocess.run", return_value=mock_result):
        meta = parse_cert_metadata(cert_path)
    assert meta["identity"] == "agt-state-hub-bridge"
    assert meta["principals"] == ["agt-task-bridge"]
    assert meta["valid_before"] == datetime(2026, 3, 29, 10, 0, 0, tzinfo=timezone.utc)


def test_parse_cert_metadata_failure(tmp_path):
    cert_path = tmp_path / "key-cert.pub"
    cert_path.write_text("not a cert")
    mock_result = MagicMock(returncode=1, stdout="", stderr="not a certificate")
    with patch("warden.ca.subprocess.run", return_value=mock_result):
        with pytest.raises(CAError, match="ssh-keygen -L failed"):
            parse_cert_metadata(cert_path)


def test_parse_cert_metadata_missing_valid_before(tmp_path):
    cert_path = tmp_path / "key-cert.pub"
    cert_path.write_text(CERT_CONTENT)
    output_no_valid = SAMPLE_SSHKEYGEN_L.replace(
        "        Valid: from 2026-03-28T10:00:00 to 2026-03-29T10:00:00\n", ""
    )
    mock_result = MagicMock(returncode=0, stdout=output_no_valid, stderr="")
    with patch("warden.ca.subprocess.run", return_value=mock_result):
        with pytest.raises(CAError, match="valid_before"):
            parse_cert_metadata(cert_path)


# ---------------------------------------------------------------------------
# LocalCA.sign
# ---------------------------------------------------------------------------


def test_local_ca_sign(tmp_path):
    ca_key = tmp_path / "ca_key"
    ca_key.write_text("fake-ca-private-key")
    pubkey = tmp_path / "key.pub"
    pubkey.write_text("ssh-ed25519 AAAA actor-key")
    spec = CertSpec(
        actor_name="agt-state-hub-bridge",
        actor_type=ActorType.AGT,
        pubkey_path=pubkey,
        ttl_hours=24,
        principals=["agt-task-bridge"],
        identity="agt-state-hub-bridge",
    )
    with patch("warden.ca.subprocess.run", side_effect=_mock_run_factory(CERT_CONTENT)):
        ca = LocalCA(ca_key, tmp_path / "state")
        record = ca.sign(spec)
    assert record.identity == "agt-state-hub-bridge"
    assert record.actor_name == "agt-state-hub-bridge"
    assert record.principals == ["agt-task-bridge"]
    cert_dest = tmp_path / "state" / "agt-state-hub-bridge-cert.pub"
    assert cert_dest.exists()
    assert cert_dest.read_text().strip() == CERT_CONTENT


def test_local_ca_sign_missing_pubkey(tmp_path):
    ca_key = tmp_path / "ca_key"
    ca_key.write_text("fake-ca")
    spec = CertSpec(
        actor_name="agt-test",
        actor_type=ActorType.AGT,
        pubkey_path=tmp_path / "nonexistent.pub",
        ttl_hours=24,
        principals=["agt-test"],
    )
    ca = LocalCA(ca_key, tmp_path / "state")
    with pytest.raises(CAError, match="Public key not found"):
        ca.sign(spec)


def test_local_ca_sign_missing_ca_key(tmp_path):
    pubkey = tmp_path / "key.pub"
    pubkey.write_text("ssh-ed25519 AAAA")
    spec = CertSpec(
        actor_name="agt-test",
        actor_type=ActorType.AGT,
        pubkey_path=pubkey,
        ttl_hours=24,
        principals=["agt-test"],
    )
    ca = LocalCA(tmp_path / "nonexistent_ca", tmp_path / "state")
    with pytest.raises(CAError, match="CA key not found"):
        ca.sign(spec)


def test_local_ca_sign_ssh_keygen_failure(tmp_path):
    ca_key = tmp_path / "ca_key"
    ca_key.write_text("fake-ca")
    pubkey = tmp_path / "key.pub"
    pubkey.write_text("ssh-ed25519 AAAA")
    spec = CertSpec(
        actor_name="agt-test",
        actor_type=ActorType.AGT,
        pubkey_path=pubkey,
        ttl_hours=24,
        principals=["agt-test"],
    )

    def fail_run(cmd, **kwargs):
        result = MagicMock()
        result.returncode = 1
        result.stderr = "load key: invalid format"
        result.stdout = ""
        return result

    ca = LocalCA(ca_key, tmp_path / "state")
    with patch("warden.ca.subprocess.run", side_effect=fail_run):
        with pytest.raises(CAError, match="Signing failed"):
            ca.sign(spec)

84
tests/test_config.py Normal file
View File

@@ -0,0 +1,84 @@
"""Tests for warden.config."""
from pathlib import Path
import pytest
import yaml
from warden.config import ConfigError, load_config


def write_yaml(path: Path, content: dict) -> None:
    with path.open("w") as f:
        yaml.dump(content, f)


def test_load_local_config(tmp_path):
    cfg_path = tmp_path / "warden.yaml"
    write_yaml(cfg_path, {"backend": "local", "ca_key": str(tmp_path / "ca")})
    cfg = load_config(cfg_path)
    assert cfg.backend == "local"
    assert cfg.ca_key == tmp_path / "ca"


def test_local_backend_missing_ca_key_raises(tmp_path):
    cfg_path = tmp_path / "warden.yaml"
    write_yaml(cfg_path, {"backend": "local"})
    with pytest.raises(ConfigError, match="ca_key"):
        load_config(cfg_path)


def test_invalid_backend_raises(tmp_path):
    cfg_path = tmp_path / "warden.yaml"
    write_yaml(cfg_path, {"backend": "magic", "ca_key": "/tmp/ca"})
    with pytest.raises(ConfigError, match="backend"):
        load_config(cfg_path)


def test_vault_backend(tmp_path):
    cfg_path = tmp_path / "warden.yaml"
    write_yaml(cfg_path, {
        "backend": "vault",
        "vault": {
            "addr": "https://vault.example.com",
            "role_map": {"adm": "adm-role", "agt": "agt-role", "atm": "atm-role"},
        },
    })
    cfg = load_config(cfg_path)
    assert cfg.backend == "vault"
    assert cfg.vault is not None
    assert cfg.vault.addr == "https://vault.example.com"
    assert cfg.vault.role_map["agt"] == "agt-role"


def test_vault_backend_missing_addr_raises(tmp_path):
    cfg_path = tmp_path / "warden.yaml"
    write_yaml(cfg_path, {"backend": "vault", "vault": {}})
    with pytest.raises(ConfigError, match="addr"):
        load_config(cfg_path)


def test_missing_config_raises():
    with pytest.raises(ConfigError, match="not found"):
        load_config(Path("/nonexistent/path/warden.yaml"))


def test_custom_state_dir(tmp_path):
    cfg_path = tmp_path / "warden.yaml"
    custom_state = tmp_path / "my-state"
    write_yaml(cfg_path, {
        "backend": "local",
        "ca_key": str(tmp_path / "ca"),
        "state_dir": str(custom_state),
    })
    cfg = load_config(cfg_path)
    assert cfg.state_dir == custom_state


def test_default_vault_token_env(tmp_path):
    cfg_path = tmp_path / "warden.yaml"
    write_yaml(cfg_path, {
        "backend": "vault",
        "vault": {"addr": "https://vault.example.com"},
    })
    cfg = load_config(cfg_path)
    assert cfg.vault.token_env == "VAULT_TOKEN"

87
tests/test_inventory.py Normal file
View File

@@ -0,0 +1,87 @@
"""Tests for warden.inventory."""
from pathlib import Path
import pytest
from warden.inventory import (
ActorEntry,
InventoryError,
PrincipalsInventory,
load_inventory,
save_inventory,
)
from warden.models import ActorType


def test_empty_inventory_on_missing_file(tmp_path):
    inv = load_inventory(tmp_path / "nonexistent.yaml")
    assert inv.actors == {}
    assert inv.hosts == {}


def test_roundtrip(tmp_path):
    inv = PrincipalsInventory()
    inv.actors["agt-test"] = ActorEntry(
        name="agt-test",
        actor_type=ActorType.AGT,
        principals=["agt-task-test"],
        ttl_hours=24,
        description="test actor",
    )
    path = tmp_path / "inventory.yaml"
    save_inventory(inv, path)
    loaded = load_inventory(path)
    assert "agt-test" in loaded.actors
    entry = loaded.actors["agt-test"]
    assert entry.actor_type == ActorType.AGT
    assert entry.principals == ["agt-task-test"]
    assert entry.ttl_hours == 24
    assert entry.description == "test actor"


def test_roundtrip_multiple_actors(tmp_path):
    inv = PrincipalsInventory()
    inv.actors["adm-bernd"] = ActorEntry("adm-bernd", ActorType.ADM, ["adm-full"], 48)
    inv.actors["atm-backup"] = ActorEntry("atm-backup", ActorType.ATM, ["atm-backup-daily"], 8)
    path = tmp_path / "inventory.yaml"
    save_inventory(inv, path)
    loaded = load_inventory(path)
    assert set(loaded.actors) == {"adm-bernd", "atm-backup"}
    assert loaded.actors["adm-bernd"].actor_type == ActorType.ADM


def test_invalid_actor_type_raises(tmp_path):
    path = tmp_path / "inventory.yaml"
    path.write_text("actors:\n  agt-test:\n    type: bogus\n    principals: []\n")
    with pytest.raises(InventoryError, match="invalid type"):
        load_inventory(path)


def test_actor_name_prefix_violation_raises(tmp_path):
    path = tmp_path / "inventory.yaml"
    path.write_text("actors:\n  wrong-name:\n    type: agt\n    principals: [x]\n")
    with pytest.raises(InventoryError):
        load_inventory(path)


def test_default_principal_is_actor_name(tmp_path):
    path = tmp_path / "inventory.yaml"
    path.write_text("actors:\n  agt-bridge:\n    type: agt\n")
    inv = load_inventory(path)
    assert inv.actors["agt-bridge"].principals == ["agt-bridge"]


def test_default_ttl_applied(tmp_path):
    path = tmp_path / "inventory.yaml"
    path.write_text("actors:\n  atm-cron:\n    type: atm\n    principals: [atm-cron]\n")
    inv = load_inventory(path)
    assert inv.actors["atm-cron"].ttl_hours == 8  # DEFAULT_TTL_HOURS[ATM]


def test_invalid_yaml_raises(tmp_path):
    path = tmp_path / "inventory.yaml"
    path.write_text(": : : invalid yaml :::")
    with pytest.raises(InventoryError, match="Invalid YAML"):
        load_inventory(path)

67
tests/test_models.py Normal file
View File

@@ -0,0 +1,67 @@
"""Tests for warden.models."""
from pathlib import Path
import pytest
from warden.models import (
ACTOR_PREFIX,
DEFAULT_TTL_HOURS,
ActorType,
CertSpec,
validate_actor_name,
)


def test_default_ttl_per_type():
    assert DEFAULT_TTL_HOURS[ActorType.ADM] == 48
    assert DEFAULT_TTL_HOURS[ActorType.AGT] == 24
    assert DEFAULT_TTL_HOURS[ActorType.ATM] == 8


def test_actor_prefix_map():
    assert ACTOR_PREFIX[ActorType.ADM] == "adm-"
    assert ACTOR_PREFIX[ActorType.AGT] == "agt-"
    assert ACTOR_PREFIX[ActorType.ATM] == "atm-"


@pytest.mark.parametrize("name,actor_type", [
    ("adm-bernd", ActorType.ADM),
    ("agt-incident-resolver-v2", ActorType.AGT),
    ("atm-backup-daily", ActorType.ATM),
])
def test_validate_actor_name_valid(name, actor_type):
    validate_actor_name(name, actor_type)  # should not raise


@pytest.mark.parametrize("name,actor_type", [
    ("bernd", ActorType.ADM),
    ("automation-backup", ActorType.ATM),
    ("agt-bridge", ActorType.ADM),  # wrong type for prefix
    ("atm-backup", ActorType.AGT),
])
def test_validate_actor_name_invalid(name, actor_type):
    with pytest.raises(ValueError, match="must start with"):
        validate_actor_name(name, actor_type)


def test_certspec_default_identity():
    spec = CertSpec(
        actor_name="agt-test",
        actor_type=ActorType.AGT,
        pubkey_path=Path("/tmp/key.pub"),
        ttl_hours=24,
        principals=["agt-task-bridge"],
    )
    assert spec.identity == "agt-test"


def test_certspec_explicit_identity():
    spec = CertSpec(
        actor_name="agt-test",
        actor_type=ActorType.AGT,
        pubkey_path=Path("/tmp/key.pub"),
        ttl_hours=24,
        principals=["agt-task-bridge"],
        identity="custom-identity",
    )
    assert spec.identity == "custom-identity"

100
tests/test_scorecard.py Normal file
View File

@@ -0,0 +1,100 @@
"""Tests for warden.scorecard."""
from pathlib import Path
import pytest
from warden.inventory import ActorEntry, PrincipalsInventory
from warden.models import ActorType
from warden.scorecard import (
check_actor_name_prefixes,
check_all_actors_have_principals,
check_no_stale_certs,
check_no_expired_certs,
run_scorecard,
)


def make_inventory(*actors):
    inv = PrincipalsInventory()
    for name, atype, principals in actors:
        inv.actors[name] = ActorEntry(
            name=name, actor_type=atype, principals=principals, ttl_hours=24
        )
    return inv


# ---------------------------------------------------------------------------
# check_actor_name_prefixes
# ---------------------------------------------------------------------------


def test_prefix_check_pass():
    inv = make_inventory(
        ("adm-bernd", ActorType.ADM, ["adm-full"]),
        ("agt-bridge", ActorType.AGT, ["agt-task-bridge"]),
        ("atm-cron", ActorType.ATM, ["atm-cron"]),
    )
    result = check_actor_name_prefixes(inv)
    assert result.passed


def test_prefix_check_fail_bad_name():
    # Bypass validate_actor_name by inserting directly
    inv = PrincipalsInventory()
    inv.actors["bad-name"] = ActorEntry(
        name="bad-name", actor_type=ActorType.AGT, principals=["x"], ttl_hours=24
    )
    result = check_actor_name_prefixes(inv)
    assert not result.passed
    assert "bad-name" in result.detail


# ---------------------------------------------------------------------------
# check_all_actors_have_principals
# ---------------------------------------------------------------------------


def test_principals_check_pass():
    inv = make_inventory(("agt-bridge", ActorType.AGT, ["agt-task-bridge"]))
    result = check_all_actors_have_principals(inv)
    assert result.passed


def test_principals_check_fail_empty():
    inv = PrincipalsInventory()
    inv.actors["agt-bridge"] = ActorEntry(
        name="agt-bridge", actor_type=ActorType.AGT, principals=[], ttl_hours=24
    )
    result = check_all_actors_have_principals(inv)
    assert not result.passed
    assert "agt-bridge" in result.detail


# ---------------------------------------------------------------------------
# check_no_stale_certs
# ---------------------------------------------------------------------------


def test_no_stale_certs_nonexistent_dir():
    result = check_no_stale_certs(Path("/nonexistent/state/dir"))
    assert result.passed


def test_no_stale_certs_empty_dir(tmp_path):
    result = check_no_stale_certs(tmp_path)
    assert result.passed


def test_no_expired_certs_empty_dir(tmp_path):
    result = check_no_expired_certs(tmp_path)
    assert result.passed


# ---------------------------------------------------------------------------
# run_scorecard
# ---------------------------------------------------------------------------


def test_run_scorecard_clean(tmp_path):
    inv = make_inventory(
        ("agt-bridge", ActorType.AGT, ["agt-task-bridge"]),
    )
    results = run_scorecard(tmp_path, inv)
    assert all(r.passed for r in results)
    assert len(results) == 4

View File

@@ -0,0 +1,203 @@
AccessManagementDirective
*Practical host access control management*
# AccessManagementDirective
**Document Title:** SSH Access Management Directive
**Version:** 1.1 (Production-Ready Revision Post-SWOT Improvements)
**Date:** 28 March 2026
**Audience:** Operations Department
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
**Author:** Grok (on behalf of the team)
**Status:** Official Directive. All ops personnel, agents, and automation pipelines MUST follow this.
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
## 0. Prerequisites
Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
## 1. Concept Overview
This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
**Why this model?**
- A central CA signs short-lived certificates for every login.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
**Core Principles**
- **Least privilege**: Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
- **Short-lived credentials**: Certificates expire automatically (24-48 h for admins, 4-24 h for agents, 1-8 h for automations).
- **One CA, many issuers**: A single offline User CA whose public key is trusted by every host.
- **Automation-first**: All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- **Separation of concerns**
  - **Admins (adm)**: Human operators (full interactive shell when needed).
  - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
  - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
## 2. Actor Definitions & Access Model
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24-48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4-24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1-8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
**Certificate Naming Convention**
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
**LLM-Agent Risk Clarification**
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
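
The prefix convention and lifetime ranges above can be checked mechanically before a signing request is accepted. The following is a minimal sketch, not the warden implementation: `check_request` and `MAX_TTL_HOURS` are hypothetical names, and treating the upper end of each lifetime range as a hard cap is an assumption.

```python
from enum import Enum


class ActorType(Enum):
    ADM = "adm"
    AGT = "agt"
    ATM = "atm"


# Assumption: the upper bound of each lifetime range in the table is a hard cap.
MAX_TTL_HOURS = {ActorType.ADM: 48, ActorType.AGT: 24, ActorType.ATM: 8}


def check_request(name: str, actor_type: ActorType, ttl_hours: int) -> int:
    """Validate the identity prefix and return the TTL clamped to the cap."""
    prefix = f"{actor_type.value}-"
    # The identity must be "<prefix><something>", e.g. "adm-bernd".
    if not name.startswith(prefix) or len(name) == len(prefix):
        raise ValueError(f"{name!r} must start with {prefix!r} followed by a name")
    if ttl_hours < 1:
        raise ValueError("ttl_hours must be at least 1")
    return min(ttl_hours, MAX_TTL_HOURS[actor_type])
```

Clamping rather than rejecting an oversized TTL is one possible policy; a stricter issuer could raise instead.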
## 3. Bootstrapping the System (One-Time Setup)
### 3.1. Create the CA (do this once, offline)
```bash
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
```
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
- Rotate the CA key itself every 2-3 years using the same bootstrap playbook.
- Public key: `ca_user.pub`
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
- Copy `ca_user.pub` to `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:
```bash
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
```
- Create principals directory and files from the central Git inventory.
- `systemctl restart sshd`
### 3.3. Initial Admin Access
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
## 4. Automatic Management of Access Rights
### 4.1. Daily / On-Demand Workflow
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
   - **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
   - **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
   - **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
   - `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
   - Example inventory snippet:
     ```yaml
     hosts:
       - name: prod-db-01
         allowed_principals:
           adm: [adm-full]
           agt: [agt-incident-resolver-v2]
           atm: [atm-backup-daily, atm-logrotate]
     ```
3. **Revocation & Rotation**
   - Short expiry = automatic revocation.
   - For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
   - Agents/automations never store long-lived private keys on disk.
4. **Concrete Agent & Automation Wrapper Example** (Python snippet; place in `/usr/local/bin/ops-ssh-wrapper`)
   ```python
   #!/usr/bin/env python3
   import os
   import subprocess
   import sys
   import tempfile

   # Request short-lived cert from Vault. SSH_PUBKEY is passed to Vault as-is,
   # so it must hold the public key text (public_key=@<path> reads a file instead).
   cert = subprocess.check_output(
       ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
        f"public_key={os.environ['SSH_PUBKEY']}"]
   ).decode().strip()
   with tempfile.NamedTemporaryFile(suffix="-cert.pub", delete=False) as f:
       f.write(cert.encode())
       cert_path = f.name
   # Load into ssh-agent and exec the real command
   subprocess.run(["ssh-add", cert_path])
   os.execvp(sys.argv[1], sys.argv[1:])
   ```
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
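
Step 2 of this workflow renders `auth_principals/` files from the central inventory. A minimal sketch of that rendering follows. It assumes, purely for illustration, that each key under `allowed_principals` is also the login user name that `%u` expands to; the directive does not specify that mapping, and `render_auth_principals` and `HOST` are hypothetical names. The inventory entry is written as a Python dict so the sketch needs no YAML library.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Mirrors the example inventory snippet for one host.
HOST = {
    "name": "prod-db-01",
    "allowed_principals": {
        "adm": ["adm-full"],
        "agt": ["agt-incident-resolver-v2"],
        "atm": ["atm-backup-daily", "atm-logrotate"],
    },
}


def render_auth_principals(host: dict, out_dir: Path) -> list[Path]:
    """Write one principals file per login user, one principal per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Assumption: the dict key doubles as the login user (the %u in sshd_config).
    for login_user, principals in sorted(host["allowed_principals"].items()):
        path = out_dir / login_user
        path.write_text("\n".join(principals) + "\n")
        path.chmod(0o644)
        written.append(path)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for path in render_auth_principals(HOST, Path(tmp)):
            print(path.name, path.read_text().split())
```

In a real deployment the same output would normally come from an Ansible template; the function only shows the shape of the files sshd reads.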
### 4.2. Human UX Guidance
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for seamless experience. Manual `ssh-keygen -s` is only for edge cases.
### 4.3. Emergency Break-Glass Procedure
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
4. After recovery, immediately rotate the CA and run a full scorecard.
## 5. AccessManagement Scorecard (Checklist)
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f *.pub` |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | ≥ 10/10 = **Operational** | - | - |
**Scorecard Execution Command** (run from ops laptop):
```bash
ansible all -m command -a "ssh-access-scorecard.sh" --become
```
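
The Expiry Policy row can be automated by parsing the `Valid:` and `Key ID:` lines that `ssh-keygen -L` prints for a certificate. The sketch below is one way to do that, under stated assumptions: the actor type is inferred from the Key ID prefix, the 48 h / 24 h targets are treated as inclusive upper bounds, and `lifetime_ok` is a hypothetical helper rather than part of any existing scorecard script.

```python
from __future__ import annotations

import re
from datetime import datetime, timedelta

# Lines as printed by `ssh-keygen -L`, e.g.
#   Key ID: "agt-state-hub-bridge"
#   Valid: from 2026-03-28T10:00:00 to 2026-03-29T10:00:00
KEY_ID_RE = re.compile(r'^[ \t]*Key ID:[ \t]+"([^"]*)"', re.MULTILINE)
VALID_RE = re.compile(
    r"^[ \t]*Valid:[ \t]+from[ \t]+(\S+)[ \t]+to[ \t]+(\S+)[ \t]*$", re.MULTILINE
)

# Targets from the Expiry Policy row (assumed inclusive).
MAX_LIFETIME = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=24),
}


def lifetime_ok(listing: str) -> bool:
    """Return True if the cert's validity window fits its actor type's limit."""
    key_id = KEY_ID_RE.search(listing)
    valid = VALID_RE.search(listing)
    if not key_id or not valid:
        # Covers "Valid: forever" and malformed listings: treat as a failure.
        return False
    limit = MAX_LIFETIME.get(key_id.group(1).split("-", 1)[0])
    if limit is None:
        return False
    start, end = (datetime.fromisoformat(v) for v in valid.groups())
    return end - start <= limit
```

A certificate with no expiry prints `Valid: forever`, which does not match the pattern and therefore fails the check.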
## 6. Scope & Operational Boundaries
### 6.1. When Bootstrapping Is Officially Closed
The system is **fully operational** when **ALL** of the following are true:
- Scorecard passes 10/10 on every host.
- Central Git repo contains the authoritative principals inventory.
- First three admins have successfully used signed certificates for 7 consecutive days.
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
- CI/CD pipeline for host config updates is green and runs hourly.
- Emergency break-glass procedure has been tested once.
**Declaration:** Ops Lead signs off with date in the Git commit message.
### 6.2. Scope Boundary: When to Switch to Sophisticated Tooling
Stay with **native OpenSSH CA + Ansible + Vault** while:
- ≤ 200 hosts
- ≤ 50 distinct agent/automation identities
- No regulatory requirement for SSO or full session recording
**Switch triggers** (any one):
- > 200 hosts OR rapid daily growth
- Need for human SSO (Okta/Google) integration
- Requirement for audited web-based SSH sessions or just-in-time access approval
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
- Audit/compliance demands central policy engine or session recording
**Recommended next-level tools** (in order):
1. **Teleport**: Best for mixed human + agent workloads (SSO + Machine ID).
2. **HashiCorp Vault SSH + Boundary**: When you already use Vault heavily.
3. **step-ca + smallstep**: If you prefer a pure open-source CA with OIDC.
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
## 7. Enforcement & Review
- **Quarterly review** of this directive and scorecard results.
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
- **Questions / improvements** → create PR against this file in the ops repo.
**End of Document**
Approved for immediate use across all production and staging environments.

View File

@@ -0,0 +1,105 @@
# cert_command Interface
**Version:** 1.0
**Date:** 2026-03-28
**Purpose:** Define the contract between OpsWarden (issuer) and callers such as ops-bridge
(consumer) for just-in-time SSH certificate acquisition.
---
## Overview
`cert_command` is a shell string that a caller executes to obtain a short-lived, CA-signed
SSH certificate for a named actor. The caller passes the cert to the SSH process alongside
the actor's private key.
This interface is intentionally tool-agnostic: the caller (`ops-bridge`, a script, a CI
pipeline) does not need to know whether the CA is a local file or HashiCorp Vault. Any
command that writes a cert to stdout and exits 0 satisfies the contract.
---
## Contract
### Invocation
```
warden sign <actor-name> --pubkey <path/to/actor.pub>
```
Or any equivalent shell command:
```
vault write -field=signed_key ssh/sign/agt-role public_key=@/tmp/key.pub
ssh-keygen -s /path/to/ca -I agt-test -n agt-task -V +24h /tmp/key.pub && cat /tmp/key-cert.pub
```
### Success (exit 0)
- Stdout: certificate text only — a single line starting with the key type, e.g.:
```
ssh-ed25519-cert-v01@openssh.com AAAA...
```
- Stderr: ignored by the caller (warden may print warnings there)
- Side effect: cert is also written to `~/.local/state/warden/<actor>-cert.pub` by warden
(for use by `warden status` and `warden scorecard`)
### Failure (exit non-zero)
- Exit code: any non-zero value
- Stdout: ignored
- Stderr: passed through to caller logs / audit detail field
- Caller behaviour: treat as a transient error; apply reconnect backoff and retry
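The success and failure rules above can be sketched from the caller's side as follows. This is a minimal illustration, not ops-bridge's actual code: `run_cert_command`, `CertCommandError`, and the 30-second `timeout` are hypothetical names and values chosen for the example.

```python
import subprocess


class CertCommandError(RuntimeError):
    """Hypothetical error type: cert_command failed or printed no certificate."""


def run_cert_command(cert_command: str, timeout: float = 30.0) -> str:
    """Run cert_command in a shell and return the certificate line from stdout."""
    result = subprocess.run(
        cert_command,
        shell=True,
        capture_output=True,
        text=True,
        timeout=timeout,  # assumption: the contract itself sets no time limit
    )
    if result.returncode != 0:
        # On failure stdout is ignored; stderr is kept for the caller's log detail.
        raise CertCommandError(
            f"cert_command exited {result.returncode}: {result.stderr.strip()}"
        )
    cert = result.stdout.strip()
    fields = cert.split()
    # A certificate is a single line whose first field is a cert key type.
    if "\n" in cert or len(fields) < 2 or not fields[0].endswith("-cert-v01@openssh.com"):
        raise CertCommandError("cert_command exited 0 but stdout is not a certificate")
    return cert
```

A caller would catch `CertCommandError` (and `subprocess.TimeoutExpired`), log the message, and enter its reconnect backoff.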
---
## Caller Responsibilities (ops-bridge)
1. Run `cert_command` via `subprocess.run(shell=True)` before each SSH subprocess launch
2. Write stdout to a tempfile in the state dir: `~/.local/state/bridge/<tunnel>-cert.pub`
3. Add `-i <cert_path>` after `-i <key_path>` in the `ssh` command
4. Parse `ssh-keygen -L -f <cert>` to extract `Key ID` → log as `cert_identity` in audit
5. Parse the validity line (`Valid: from <start> to <end>`) → schedule pre-emptive cert refresh ~5 min before the end time
6. On `cert_command` failure: log `BRIDGE_DISCONNECTED` with stderr; apply backoff
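Steps 4 and 5 can be sketched as a small parser over the text that `ssh-keygen -L -f <cert>` prints. This is an illustration under assumptions: `parse_listing`, `refresh_time`, and `SAMPLE_LISTING` are hypothetical names, the sample is an abridged listing in the layout OpenSSH uses (`Key ID: "..."` and `Valid: from ... to ...`), and the exact layout may differ between OpenSSH versions.

```python
import re
from datetime import datetime, timedelta
from typing import Optional, Tuple

REFRESH_MARGIN = timedelta(minutes=5)

# Abridged, illustrative output of `ssh-keygen -L -f /tmp/key-cert.pub`.
SAMPLE_LISTING = """\
/tmp/key-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Key ID: "agt-test"
        Serial: 0
        Valid: from 2026-03-28T00:45:00 to 2026-03-29T00:45:00
        Principals:
                agt-task
"""


def parse_listing(listing: str) -> Tuple[str, Optional[datetime]]:
    """Return (key_id, expiry); expiry is None if no bounded validity line is found."""
    key_id = re.search(r'^\s*Key ID:\s*"(.*)"\s*$', listing, re.MULTILINE)
    if key_id is None:
        raise ValueError("no Key ID line found in certificate listing")
    valid = re.search(
        r"^\s*Valid:\s*from\s+(\S+)\s+to\s+(\S+)\s*$", listing, re.MULTILINE
    )
    # Timestamps are parsed as naive datetimes, exactly as printed.
    expires = datetime.fromisoformat(valid.group(2)) if valid else None
    return key_id.group(1), expires


def refresh_time(expires: datetime) -> datetime:
    """Moment at which the caller should re-run cert_command."""
    return expires - REFRESH_MARGIN
```

In this sketch the returned key ID would be logged as `cert_identity`, and `refresh_time` gives the point at which to fetch a new cert.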
## What the Caller Must NOT Do
- Cache or reuse a cert across reconnects (always re-run `cert_command` per reconnect)
- Write the cert to disk with world-readable permissions (the cert file must be mode 600)
- Ignore a non-zero exit from `cert_command` (must treat as failure, trigger backoff)
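One way to honour the mode 600 rule is to create the file with restrictive permissions from the start and then move it into place atomically. The sketch below is an assumption about how a caller might do this; `write_cert` is a hypothetical helper, not part of ops-bridge.

```python
import os
import tempfile
from pathlib import Path


def write_cert(cert_text: str, cert_path: Path) -> None:
    """Write the cert with mode 600 and replace any previous cert atomically."""
    cert_path.parent.mkdir(parents=True, exist_ok=True)
    # mkstemp creates the file readable and writable by the owner only.
    fd, tmp_name = tempfile.mkstemp(dir=cert_path.parent, prefix=cert_path.name + ".")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(cert_text.rstrip("\n") + "\n")
        os.chmod(tmp_name, 0o600)  # explicit, in case the default ever differs
        os.replace(tmp_name, cert_path)  # atomic within one filesystem
    except BaseException:
        # Remove the temp file if it was not moved into place.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Writing to a temporary name first means a reader never sees a half-written cert, and the file is never briefly world-readable.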
---
## Example: ops-bridge tunnels.yaml
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    # cert_command is optional. When absent, ssh_key is used directly (static key mode).
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
---
## TTL Guidelines (AccessManagementDirective §2)
| Actor type | Max TTL | Pre-emptive refresh |
|---|---|---|
| `adm` | 48 h | 5 min before expiry |
| `agt` | 24 h | 5 min before expiry |
| `atm` | 8 h | 5 min before expiry |
ops-bridge enforces the refresh schedule. OpsWarden enforces the max TTL at signing time.
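The split of responsibilities in the table can be illustrated with two small functions. This is a sketch under assumptions: `effective_ttl` and `refresh_delay` are hypothetical names, and clamping an over-long request to the maximum (rather than rejecting it) is a guess at how the limit is enforced.

```python
from datetime import timedelta
from typing import Optional

# Max TTL per actor type, taken from the table above.
MAX_TTL = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=8),
}
REFRESH_MARGIN = timedelta(minutes=5)


def effective_ttl(actor_type: str, requested_hours: Optional[float] = None) -> timedelta:
    """Issuer side: never sign for longer than the actor type allows."""
    max_ttl = MAX_TTL[actor_type]  # KeyError for an unknown actor type
    if requested_hours is None:
        return max_ttl
    if requested_hours <= 0:
        raise ValueError("requested TTL must be positive")
    return min(timedelta(hours=requested_hours), max_ttl)


def refresh_delay(ttl: timedelta) -> timedelta:
    """Caller side: how long to wait after issuance before refreshing."""
    return max(ttl - REFRESH_MARGIN, timedelta(0))
```

For an `atm` actor, a request for 24 hours would be signed for 8 hours, and the caller would refresh after 7 h 55 min.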
---
## Backward Compatibility
Callers that do not set `cert_command` continue to use the static key (`ssh_key`) with no
TTL, cert logic, or refresh. The two modes are fully independent.

147
wiki/OpsWardenConfig.md Normal file
View File

@@ -0,0 +1,147 @@
# OpsWarden Configuration Reference
Config file: `~/.config/warden/warden.yaml` (override with `WARDEN_CONFIG` env var)
---
## Local Backend (lab / non-Vault)
```yaml
# Backend selection. "local" uses ssh-keygen -s with a CA key on disk.
backend: local
# Path to the CA private key. Keep this file mode 600 and never commit it.
ca_key: ~/.ssh/ops-ca-user
# Path to the principals inventory (default shown).
inventory_path: ~/.config/warden/inventory.yaml
# Where to store signed certs and generated keypairs (default shown).
state_dir: ~/.local/state/warden
```
### Bootstrapping the local CA key
```bash
# Generate CA keypair once (offline, secure location)
ssh-keygen -t ed25519 -f ~/.ssh/ops-ca-user -C "Ops SSH User CA (2026)" -N ""
chmod 600 ~/.ssh/ops-ca-user
chmod 644 ~/.ssh/ops-ca-user.pub
# Distribute ops-ca-user.pub to every host:
# TrustedUserCAKeys /etc/ssh/ca/ca_user.pub (in sshd_config)
# See railiance-infra bootstrap-ssh-ca.yml playbook.
```
---
## Vault Backend (production)
```yaml
backend: vault
vault:
  # Vault server address.
  addr: https://vault.example.com
  # Vault SSH secrets engine mount path (default: ssh).
  mount: ssh
  # Map from ActorType to Vault signing role name.
  role_map:
    adm: adm-role
    agt: agt-role
    atm: atm-role
  # Environment variable holding the Vault token (default: VAULT_TOKEN).
  token_env: VAULT_TOKEN
inventory_path: ~/.config/warden/inventory.yaml
state_dir: ~/.local/state/warden
```
### Vault setup snippet
```bash
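# Enable the SSH secrets engine at the default mount path (ssh/).
vault secrets enable ssh
# Generate the CA signing key inside Vault; signing fails until a CA is configured.
vault write ssh/config/ca generate_signing_key=true
# Create the signing role for agt actors (repeat for adm-role and atm-role).
vault write ssh/roles/agt-role \
    key_type=ca \
    allowed_users="*" \
    allow_user_certificates=true \
    default_user="agt" \
    ttl=24h max_ttl=24h
# Token used by warden; the variable name matches token_env in warden.yaml.
export VAULT_TOKEN=$(vault token create -field=token)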
```
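For orientation, the signing call against a role like the one above is a POST to the SSH engine's `sign` endpoint. The sketch below only builds the request with the standard library so that it runs without a Vault server. `build_sign_request` is a hypothetical helper and is not how `VaultCA` is written (the workplan says that backend uses httpx).

```python
import json
import urllib.request
from typing import Sequence


def build_sign_request(
    addr: str,
    mount: str,
    role: str,
    token: str,
    public_key: str,
    principals: Sequence[str],
    ttl: str,
) -> urllib.request.Request:
    """Build the POST <addr>/v1/<mount>/sign/<role> request for a user cert."""
    url = f"{addr.rstrip('/')}/v1/{mount}/sign/{role}"
    payload = {
        "public_key": public_key,
        "valid_principals": ",".join(principals),  # comma-separated list
        "cert_type": "user",
        "ttl": ttl,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Vault-Token": token, "Content-Type": "application/json"},
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen` and reading `["data"]["signed_key"]` from the JSON response would yield the certificate text.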
---
## Principals Inventory (`inventory.yaml`)
```yaml
actors:
  # Actor name must carry the prefix matching its type:
  # adm-* for adm, agt-* for agt, atm-* for atm
  agt-state-hub-bridge:
    type: agt
    # Principals embedded in the cert; matched against /etc/ssh/auth_principals/%u
    principals:
      - agt-task-bridge
    # Certificate TTL in hours. Defaults: adm=48, agt=24, atm=8
    ttl_hours: 24
    description: "ops-bridge tunnel agent for state-hub"
  adm-bernd:
    type: adm
    principals:
      - adm-full
    ttl_hours: 48
  atm-backup-daily:
    type: atm
    principals:
      - atm-backup-daily
    ttl_hours: 8
    description: "nightly backup automation"
hosts:
  # Optional: documents which principals are allowed on each host.
  # Not enforced by warden; used for reference and future tooling.
  coulombcore:
    allowed_principals:
      agt:
        - agt-task-bridge
      atm:
        - atm-backup-daily
```
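The naming and TTL rules in the inventory comments lend themselves to a simple validation pass. The sketch below works on an already-parsed actor entry (a plain dict). `validate_actor` is a hypothetical helper, and treating the per-type default as the upper bound for `ttl_hours` is an assumption based on the TTL policy.

```python
from __future__ import annotations

# Defaults from the inventory comment; assumed here to also be the maximum.
DEFAULT_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


def validate_actor(name: str, spec: dict) -> list[str]:
    """Return a list of problems found in one inventory entry (empty if valid)."""
    actor_type = spec.get("type")
    if actor_type not in DEFAULT_TTL_HOURS:
        return [f"{name}: unknown type {actor_type!r}"]
    errors = []
    # The actor name must carry the prefix matching its type.
    if not name.startswith(f"{actor_type}-"):
        errors.append(f"{name}: name must start with '{actor_type}-'")
    if not spec.get("principals"):
        errors.append(f"{name}: at least one principal is required")
    ttl = spec.get("ttl_hours", DEFAULT_TTL_HOURS[actor_type])
    if not 0 < ttl <= DEFAULT_TTL_HOURS[actor_type]:
        errors.append(f"{name}: ttl_hours {ttl} outside 1..{DEFAULT_TTL_HOURS[actor_type]}")
    return errors
```

Running this over every entry under `actors:` before signing would catch a mistyped prefix or an over-long TTL early.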
---
## Environment Variables
| Variable | Default | Description |
|---|---|---|
| `WARDEN_CONFIG` | `~/.config/warden/warden.yaml` | Config file path |
| `VAULT_TOKEN` | — | Vault token (vault backend only; env var name is configurable) |
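The lookup order implied by the table can be sketched as below. `resolve_config_path` and `resolve_vault_token` are hypothetical names, and treating an empty variable as unset is an assumption.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CONFIG = "~/.config/warden/warden.yaml"


def resolve_config_path(environ: Optional[Mapping[str, str]] = None) -> Path:
    """WARDEN_CONFIG wins if set and non-empty; otherwise use the default path."""
    env = os.environ if environ is None else environ
    raw = env.get("WARDEN_CONFIG") or DEFAULT_CONFIG
    return Path(raw).expanduser()


def resolve_vault_token(
    token_env: str = "VAULT_TOKEN",
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Read the Vault token from the variable named by token_env in warden.yaml."""
    env = os.environ if environ is None else environ
    token = env.get(token_env, "")
    if not token:
        raise RuntimeError(f"{token_env} is not set (required by the vault backend)")
    return token
```

Passing an explicit mapping instead of reading `os.environ` directly keeps both functions easy to unit-test.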
---
## cert_command integration with ops-bridge
Add `cert_command` to a tunnel in `~/.config/bridge/tunnels.yaml`:
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
`ops-bridge` runs `cert_command` before each SSH launch, captures stdout as the cert,
and passes it alongside the private key via `ssh -i <key> -i <cert>`.
See `wiki/CertCommandInterface.md` for the full contract.

View File

@@ -0,0 +1,126 @@
---
id: WARDEN-WP-0001
type: workplan
title: "OpsWarden Initial Implementation"
domain: custodian
repo: ops-warden
status: draft
owner: Bernd
topic_slug: custodian
created: "2026-03-28"
updated: "2026-03-28"
---
# WARDEN-WP-0001 — OpsWarden Initial Implementation
**Scope:** Deliver a working `warden` CLI that implements the SSH CA and certificate
lifecycle defined in `wiki/AccessManagementDirective.md`. Scaffolding (models, config,
CA backends, inventory, scorecard, CLI) is already present in the repo; this workplan
tracks the remaining implementation, testing, and integration work.
**Out of scope:** Vault HA/cluster setup, Ansible playbooks for host principal deployment
(those live in `railiance-infra`), session recording, and SSO integration (trigger §6.2 of
the directive when scale requires it).
---
## Goal
After this workplan:
1. `warden sign agt-test --pubkey /tmp/test.pub` outputs a valid cert (local backend).
2. `warden status agt-test` shows correct identity, principals, and time-to-expiry.
3. `warden scorecard` returns 4/4 on a clean test inventory.
4. `warden sign` called from ops-bridge `cert_command` works end-to-end in an integration
test tunnel.
5. All tests pass (`uv run pytest`) and lints pass (`uv run ruff check .`).
---
## Reference Documents
| Document | Location |
|---|---|
| AccessManagementDirective | `wiki/AccessManagementDirective.md` |
| cert_command interface | `wiki/CertCommandInterface.md` |
| Config reference | `wiki/OpsWardenConfig.md` |
| ops-bridge alignment workplan | `../ops-bridge/workplans/BRIDGE-WP-0004-directive-alignment.md` |
---
## Architecture Summary
```
~/.config/warden/warden.yaml # backend, ca_key, inventory_path, state_dir
~/.config/warden/inventory.yaml # actor registry (name → type, principals, ttl_hours)
~/.local/state/warden/ # signed certs (*-cert.pub); keypairs (keys/)
```
Two swappable CA backends — both expose the same `sign(spec) -> CertRecord` interface:
- `LocalCA`: `ssh-keygen -s`; no Vault dependency; default for dev/lab
- `VaultCA`: Vault SSH engine via httpx
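The shared `sign(spec) -> CertRecord` interface could be expressed with a structural type, roughly as sketched here. The field names on `SignSpec` and `CertRecord`, the `CABackend` protocol, and the `FakeCA` stand-in are all hypothetical. The repo's real models may look different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Protocol, Tuple


@dataclass(frozen=True)
class SignSpec:
    """What the caller wants signed (hypothetical fields)."""
    actor: str
    public_key: str
    principals: Tuple[str, ...]
    ttl: timedelta


@dataclass(frozen=True)
class CertRecord:
    """What a backend returns (hypothetical fields)."""
    actor: str
    cert_text: str
    valid_before: datetime


class CABackend(Protocol):
    """Any object with a matching sign() method satisfies the interface."""
    def sign(self, spec: SignSpec) -> CertRecord: ...


class FakeCA:
    """Stand-in backend for illustration; it produces a placeholder, not a real cert."""
    def sign(self, spec: SignSpec) -> CertRecord:
        expires = datetime.now(timezone.utc) + spec.ttl
        return CertRecord(
            spec.actor, f"ssh-ed25519-cert-v01@openssh.com FAKE-{spec.actor}", expires
        )


def issue(backend: CABackend, spec: SignSpec) -> str:
    """The CLI layer depends only on the interface, never on a concrete backend."""
    return backend.sign(spec).cert_text
```

Because `issue` only sees `CABackend`, swapping the local backend for the Vault one would not change any calling code.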
cert_command interface (consumed by ops-bridge):
```
warden sign <actor-name> --pubkey <path> # → cert text to stdout
```
---
## Tasks
### T1 — Repository registration
- [ ] Register repo with state-hub (`register_repo`); assign Repo ID; update
`.claude/rules/repo-identity.md`
- [ ] Create state-hub workstream for this workplan
### T2 — LocalCA integration test
- [ ] Generate a test CA key: `ssh-keygen -t ed25519 -f /tmp/test-ca -N ""`
- [ ] Run `warden sign` against a real pubkey with the test CA (requires `ssh-keygen` in PATH)
- [ ] Verify cert parses correctly with `ssh-keygen -L`
- [ ] Add to `tests/test_ca.py` as an integration test (skipped if `ssh-keygen` not in PATH)
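A possible shape for the T2 integration test is sketched below. It drives `ssh-keygen` directly rather than `warden sign`, because warden's Python API is not described here. The helper name `sign_with_test_ca` is hypothetical, and the skip uses the standard library's `unittest.skipUnless` (pytest also collects such test cases) instead of a pytest marker.

```python
import shutil
import subprocess
import tempfile
import unittest
from pathlib import Path

HAVE_SSH_KEYGEN = shutil.which("ssh-keygen") is not None


def sign_with_test_ca(workdir: Path) -> str:
    """Create a throwaway CA and actor key, sign the key, return the -L listing."""
    ca, key = workdir / "test-ca", workdir / "actor"
    for path in (ca, key):
        subprocess.run(
            ["ssh-keygen", "-q", "-t", "ed25519", "-f", str(path), "-N", ""],
            check=True,
        )
    # Signing actor.pub writes actor-cert.pub next to it.
    subprocess.run(
        ["ssh-keygen", "-q", "-s", str(ca), "-I", "agt-test",
         "-n", "agt-task-test", "-V", "+24h", f"{key}.pub"],
        check=True,
    )
    listing = subprocess.run(
        ["ssh-keygen", "-L", "-f", str(workdir / "actor-cert.pub")],
        check=True, capture_output=True, text=True,
    )
    return listing.stdout


@unittest.skipUnless(HAVE_SSH_KEYGEN, "ssh-keygen not in PATH")
class LocalCARoundTrip(unittest.TestCase):
    def test_cert_lists_identity_and_principal(self):
        with tempfile.TemporaryDirectory() as tmp:
            listing = sign_with_test_ca(Path(tmp))
        self.assertIn('Key ID: "agt-test"', listing)
        self.assertIn("agt-task-test", listing)
```

The real test would swap the direct `ssh-keygen -s` call for `warden sign` against the same throwaway CA.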
### T3 — VaultCA integration test
- [ ] Set up a local Vault dev server (`vault server -dev`)
- [ ] Enable SSH secrets engine: `vault secrets enable ssh`
- [ ] Configure a signing role for `agt`
- [ ] Run `warden sign` with `backend: vault` config
- [ ] Add to `tests/test_vault.py` as an integration test (skipped if Vault not reachable)
### T4 — CLI end-to-end smoke tests
- [ ] `warden inventory add agt-test --type agt --principal agt-task-test`
- [ ] `warden inventory list` shows the actor
- [ ] `warden issue agt-test` (local backend) produces keypair + cert
- [ ] `warden status agt-test` shows valid cert
- [ ] `warden scorecard` returns 4/4
- [ ] `warden inventory remove agt-test` removes actor
### T5 — ops-bridge cert_command integration
- [ ] Add `agt-state-hub-bridge` to inventory (or use existing from ops-bridge config)
- [ ] Set `cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"`
in a test `tunnels.yaml`
- [ ] Run `bridge up state-hub-coulombcore`; confirm cert is present in
`~/.local/state/bridge/` and `cert_identity` appears in the audit log
- [ ] Document result in a progress event
### T6 — CI/CD setup
- [ ] Add `.github/workflows/ci.yml` (or equivalent) running `uv run pytest` and
`uv run ruff check .` on push
- [ ] Tests must pass without Vault (VaultCA integration tests skipped via pytest marker)
### T7 — Documentation
- [ ] `wiki/OpsWardenConfig.md` — annotated `warden.yaml` reference (already stubbed)
- [ ] `wiki/CertCommandInterface.md` — contract for `cert_command` callers (already stubbed)
- [ ] Ensure `wiki/AccessManagementDirective.md` is in sync with `ops-bridge/wiki/`
---
## Acceptance Criteria
- [ ] `warden sign agt-test --pubkey /tmp/test.pub` → valid cert on stdout (local backend)
- [ ] `warden status agt-test` → identity, principals, time-to-expiry shown correctly
- [ ] `warden scorecard` → 4/4 on clean inventory
- [ ] `warden sign` works as `cert_command` in ops-bridge tunnel config
- [ ] All unit tests pass: `uv run pytest`
- [ ] All lints pass: `uv run ruff check .`
- [ ] No secrets (CA private key, certs) committed to repo