ops-warden/wiki/CertCommandInterface.md
2026-03-28 00:45:43 +00:00

# cert_command Interface
**Version:** 1.0
**Date:** 2026-03-28
**Purpose:** Define the contract between OpsWarden (issuer) and callers such as ops-bridge
(consumer) for just-in-time SSH certificate acquisition.

---
## Overview
`cert_command` is a shell string that a caller executes to obtain a short-lived, CA-signed
SSH certificate for a named actor. The caller passes the cert to the SSH process alongside
the actor's private key.

This interface is intentionally tool-agnostic: the caller (`ops-bridge`, a script, a CI
pipeline) does not need to know whether the CA is a local file or HashiCorp Vault. Any
command that writes a cert to stdout and exits 0 satisfies the contract.

---
## Contract
### Invocation
```
warden sign <actor-name> --pubkey <path/to/actor.pub>
```
Or any equivalent shell command:
```
vault write -field=signed_key ssh/sign/agt-role public_key=@/tmp/key.pub
ssh-keygen -s /path/to/ca -I agt-test -n agt-task -V +24h /tmp/key.pub && cat /tmp/key-cert.pub
```
### Success (exit 0)
- Stdout: certificate text only — a single line starting with the key type, e.g.:
```
ssh-ed25519-cert-v01@openssh.com AAAA...
```
- Stderr: ignored by the caller (warden may print warnings there)
- Side effect: cert is also written to `~/.local/state/warden/<actor>-cert.pub` by warden
(for use by `warden status` and `warden scorecard`)
### Failure (exit non-zero)
- Exit code: any non-zero value
- Stdout: ignored
- Stderr: passed through to caller logs / audit detail field
- Caller behaviour: treat as a transient error; apply reconnect backoff and retry
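The success and failure rules above can be sketched as a small Python wrapper. This is a minimal illustration, not ops-bridge's actual code: `run_cert_command` and `CertCommandError` are hypothetical names, and the check that stdout is a single line whose first field ends in `-cert-v01@openssh.com` is an assumption layered on top of the contract.

```python
import subprocess


class CertCommandError(RuntimeError):
    """Hypothetical error type: cert_command failed or printed no certificate."""


def run_cert_command(cert_command: str) -> str:
    """Run cert_command through the shell and return the certificate line."""
    proc = subprocess.run(
        cert_command,
        shell=True,
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # Failure: stdout is ignored, stderr is carried into the caller's logs.
        raise CertCommandError(
            f"cert_command exited {proc.returncode}: {proc.stderr.strip()}"
        )
    # Success: stderr is ignored, stdout must hold the certificate text only.
    cert = proc.stdout.strip()
    fields = cert.split()
    if (
        "\n" in cert
        or len(fields) < 2
        or not fields[0].endswith("-cert-v01@openssh.com")
    ):
        raise CertCommandError("cert_command exited 0 but stdout is not a certificate")
    return cert
```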
---
## Caller Responsibilities (ops-bridge)
1. Run `cert_command` via `subprocess.run(shell=True)` before each SSH subprocess launch
2. Write stdout to a tempfile in the state dir: `~/.local/state/bridge/<tunnel>-cert.pub`
3. Add `-i <cert_path>` after `-i <key_path>` in the `ssh` command
4. Parse `ssh-keygen -L -f <cert>` to extract `Key ID` → log as `cert_identity` in audit
5. Parse the `Valid:` line (`from <start> to <end>`) for the expiry time → schedule pre-emptive cert refresh ~5 min before expiry
6. On `cert_command` failure: log `BRIDGE_DISCONNECTED` with stderr; apply backoff
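Steps 4 and 5 can be sketched as a pure parsing function. This assumes the `Valid: from <start> to <end>` form that `ssh-keygen -L` prints for a certificate with both bounds set; `parse_cert_listing` is a hypothetical name, and `SAMPLE` is a hand-written, abbreviated listing for illustration rather than captured output.

```python
import re
from datetime import datetime


def parse_cert_listing(listing: str) -> tuple[str, datetime]:
    """Extract (key_id, expiry) from `ssh-keygen -L -f <cert>` output."""
    key_id = re.search(r'^\s*Key ID:\s*"(.*)"\s*$', listing, re.MULTILINE)
    valid = re.search(
        r"^\s*Valid:\s*from\s+(\S+)\s+to\s+(\S+)\s*$", listing, re.MULTILINE
    )
    if key_id is None or valid is None:
        raise ValueError("unrecognised ssh-keygen -L output")
    # The timestamps carry no UTC offset, so the result is a naive datetime.
    expiry = datetime.fromisoformat(valid.group(2))
    return key_id.group(1), expiry


# Illustrative listing only; real output contains more fields.
SAMPLE = """\
/tmp/key-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Key ID: "agt-test"
        Serial: 0
        Valid: from 2026-03-28T00:45:00 to 2026-03-29T00:45:00
        Principals:
                agt-task
"""

key_id, expiry = parse_cert_listing(SAMPLE)
print(key_id, expiry.isoformat())  # agt-test 2026-03-29T00:45:00
```

The returned key ID would be logged as `cert_identity`, and the expiry feeds the refresh timer.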
## What the Caller Must NOT Do
- Cache or reuse a cert across reconnects (always re-run `cert_command` per reconnect)
- Write the cert to disk with world-readable permissions (use mode 600, owner read/write only)
- Ignore a non-zero exit from `cert_command` (must treat as failure, trigger backoff)
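The permission rule above can be sketched as follows. This is one possible approach on a POSIX system, not the bridge's actual implementation: `write_cert` is a hypothetical helper, and the temp-file-plus-rename step is an added assumption so that a reconnect never reads a half-written cert.

```python
import os
from pathlib import Path


def write_cert(cert_path: Path, cert_text: str) -> None:
    """Write the certificate so that only the owning user can read it."""
    cert_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = cert_path.with_name(cert_path.name + ".tmp")
    # Create the file with mode 0600 from the start instead of chmod-ing later.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # A stale temp file keeps its old mode on open, so reset it explicitly.
        os.fchmod(fd, 0o600)
        handle.write(cert_text.rstrip("\n") + "\n")
    # Replace any previous cert atomically with the freshly issued one.
    os.replace(tmp_path, cert_path)
```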
---
## Example: ops-bridge tunnels.yaml
```yaml
tunnels:
  state-hub-coulombcore:
    host: coulombcore
    remote_port: 8001
    local_port: 8000
    ssh_user: agt-state-hub-bridge
    ssh_key: ~/.ssh/agt-state-hub-bridge_ed25519
    actor: agt-state-hub-bridge
    # cert_command is optional. When absent, ssh_key is used directly (static key mode).
    cert_command: "warden sign agt-state-hub-bridge --pubkey ~/.ssh/agt-state-hub-bridge_ed25519.pub"
```
---
## TTL Guidelines (AccessManagementDirective §2)
| Actor type | Max TTL | Pre-emptive refresh |
|---|---|---|
| `adm` | 48 h | 5 min before expiry |
| `agt` | 24 h | 5 min before expiry |
| `atm` | 8 h | 5 min before expiry |

ops-bridge enforces the refresh schedule. OpsWarden enforces the max TTL at signing time.
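As a rough sketch, the table translates into a lookup plus two small checks. The function names are hypothetical, `atm-example` is a made-up actor name, and deriving the actor type from the prefix before the first hyphen is an assumption based on names such as `agt-state-hub-bridge`.

```python
from datetime import datetime, timedelta

# Max TTL per actor type, copied from the table above.
MAX_TTL = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=8),
}
REFRESH_MARGIN = timedelta(minutes=5)


def actor_type(actor: str) -> str:
    """Assumption: the actor type is the prefix before the first hyphen."""
    prefix = actor.split("-", 1)[0]
    if prefix not in MAX_TTL:
        raise ValueError(f"unknown actor type for {actor!r}")
    return prefix


def ttl_allowed(actor: str, ttl: timedelta) -> bool:
    """True if the requested TTL is positive and within the max for the actor type."""
    return timedelta(0) < ttl <= MAX_TTL[actor_type(actor)]


def refresh_at(expiry: datetime) -> datetime:
    """When to re-run cert_command: 5 minutes before the cert expires."""
    return expiry - REFRESH_MARGIN


print(ttl_allowed("agt-state-hub-bridge", timedelta(hours=24)))  # True
print(ttl_allowed("atm-example", timedelta(hours=24)))  # False
print(refresh_at(datetime(2026, 3, 29, 0, 45)))  # 2026-03-29 00:40:00
```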
---
## Backward Compatibility
Callers that do not set `cert_command` continue to use the static key (`ssh_key`) with no
TTL, cert logic, or refresh. The two modes are fully independent.
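
The switch between the two modes can be sketched as a function over one tunnel entry. `identity_args` is a hypothetical helper; it only assembles the identity arguments in the order this page gives under Caller Responsibilities and leaves the rest of the `ssh` command line to the caller.

```python
from typing import Optional


def identity_args(tunnel: dict, cert_path: Optional[str] = None) -> list[str]:
    """Return the ssh identity arguments for one tunnels.yaml entry."""
    args = ["-i", tunnel["ssh_key"]]
    if tunnel.get("cert_command"):
        if cert_path is None:
            raise ValueError("cert mode requires the path of a freshly written cert")
        # Cert mode: the cert path follows the key path.
        args += ["-i", cert_path]
    # Static key mode falls through: no cert, no TTL, no refresh.
    return args


static = {"ssh_key": "~/.ssh/agt-state-hub-bridge_ed25519"}
print(identity_args(static))
# ['-i', '~/.ssh/agt-state-hub-bridge_ed25519']
```

Keeping the decision in one place means a tunnel entry without `cert_command` never touches the cert code path.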