# Policy-Gated SSH Signing
Date: 2026-06-23

Status: **implemented (opt-in)** (WARDEN-WP-0007; policy package confirmed in FLEX-WP-0006)

By default, `warden sign` authorizes via the **inventory allow-list** and TTL policy
only. When `policy.enabled: true` is set in `warden.yaml`, ops-warden calls flex-auth
before signing and records the decision id in `signatures.log`.
---
## Flow
```text
warden sign <actor> --pubkey <path>
        |
        v
Load actor from inventory (type, principals, ttl)
        |
        v
policy.enabled?
   no  -> skip
   yes -> flex-auth POST /v1/check
        |
        +-- DENY / unreachable (fail_closed) -> CAError
        |
        v ALLOW
CABackend.sign() (local or OpenBao SSH engine)
        |
        v
Append signatures.log (+ policy_decision_id when set)
```
The same gate runs for `warden issue` (local backend only).
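The flow above can be sketched in Python as a single function. This is a minimal illustration, not ops-warden's actual code: `gated_sign`, `Decision`, and the `check` / `sign` callables are hypothetical stand-ins for the flex-auth call and `CABackend.sign()`, and a plain list stands in for `signatures.log`. `CAError` is the only name taken from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class CAError(Exception):
    """Raised when a sign request is refused (name taken from the flow diagram)."""


@dataclass
class Decision:
    """Hypothetical result of a flex-auth check."""
    allowed: bool
    decision_id: Optional[str] = None
    reason: str = ""


def gated_sign(
    actor: dict,
    policy_enabled: bool,
    check: Callable[[dict], Decision],  # assumption: wraps POST /v1/check
    sign: Callable[[dict], str],        # assumption: wraps CABackend.sign()
    log: list,                          # stand-in for signatures.log
) -> str:
    decision_id = None
    if policy_enabled:
        decision = check(actor)
        if not decision.allowed:
            # DENY (or unreachable with fail_closed): no certificate is issued.
            raise CAError(f"policy denied: {decision.reason}")
        decision_id = decision.decision_id
    cert = sign(actor)
    entry = {"actor": actor["name"]}
    if decision_id is not None:
        # Only recorded when the policy gate actually produced a decision.
        entry["policy_decision_id"] = decision_id
    log.append(entry)
    return cert
```

Passing the check and sign steps in as callables keeps the ordering (policy first, signing second, logging last) visible without committing to any particular HTTP client or CA backend.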
---
## flex-auth request shape
| Field | Source |
| --- | --- |
| `subject.id` | `WARDEN_POLICY_SUBJECT` env var, or actor name |
| `subject.type` | Actor type (`adm` / `agt` / `atm`) |
| `tenant` | `policy.tenant` (default `tenant:platform`) |
| `resource.id` | `ssh-cert:actor/<actor-name>` |
| `resource.type` | `ssh-certificate` |
| `action` | `sign` |
| `context.principals` | From inventory |
| `context.actor_type` | adm \| agt \| atm |
| `context.pubkey_fingerprint` | SHA256 of pubkey text |
| `context.ttl_hours` | Requested TTL |

flex-auth must return `effect: allow` and an `id` (or `request_id`) on allow.
Deny responses include a `reason`, which is surfaced in the CLI error.
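As an illustration, the request fields in the table could be assembled as follows. This sketch assumes the dotted field names map to nested JSON objects and that the fingerprint is a hex SHA-256 digest of the public key text; the exact wire format and digest encoding used by ops-warden are not stated here, and `build_check_request` is a hypothetical helper name.

```python
import hashlib
import os


def build_check_request(
    actor: dict,
    pubkey_text: str,
    ttl_hours: int,
    tenant: str = "tenant:platform",
    subject_env: str = "WARDEN_POLICY_SUBJECT",
) -> dict:
    # Subject id: env var override when set, otherwise the actor name.
    subject_id = os.environ.get(subject_env) or actor["name"]
    # Assumption: hex digest; the real encoding may differ (e.g. base64).
    fingerprint = hashlib.sha256(pubkey_text.encode("utf-8")).hexdigest()
    return {
        "subject": {"id": subject_id, "type": actor["type"]},
        "tenant": tenant,
        "resource": {
            "id": f"ssh-cert:actor/{actor['name']}",
            "type": "ssh-certificate",
        },
        "action": "sign",
        "context": {
            "principals": list(actor["principals"]),
            "actor_type": actor["type"],
            "pubkey_fingerprint": fingerprint,
            "ttl_hours": ttl_hours,
        },
    }
```

The returned dictionary would be serialized as the JSON body of `POST /v1/check`.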
---
## Configuration
```yaml
# warden.yaml — policy gate (opt-in, default off)
policy:
  enabled: false
  flex_auth_url: http://127.0.0.1:8080
  fail_closed: true
  tenant: tenant:platform
  subject_env: WARDEN_POLICY_SUBJECT
  system: ops-warden
```
| Key | Default | Description |
| --- | --- | --- |
| `enabled` | `false` | When `true`, call flex-auth before every sign/issue |
| `flex_auth_url` | `http://127.0.0.1:8080` | flex-auth base URL |
| `fail_closed` | `true` | Deny sign when flex-auth is unreachable or returns HTTP error |
| `tenant` | `tenant:platform` | Tenant sent in subject and resource |
| `subject_env` | `WARDEN_POLICY_SUBJECT` | Env var for IAM subject id override |
| `system` | `ops-warden` | Resource system identifier |

Set `WARDEN_POLICY_SUBJECT` to the caller's IAM profile `sub` when available.
If unset, the actor name is used as the subject id.
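One way to model these keys and defaults in Python is a small dataclass. This is a sketch only: `PolicyConfig` and `from_mapping` are hypothetical names, and rejecting unknown keys is a design choice of this example rather than documented ops-warden behaviour.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class PolicyConfig:
    # Defaults mirror the table above.
    enabled: bool = False
    flex_auth_url: str = "http://127.0.0.1:8080"
    fail_closed: bool = True
    tenant: str = "tenant:platform"
    subject_env: str = "WARDEN_POLICY_SUBJECT"
    system: str = "ops-warden"

    @classmethod
    def from_mapping(cls, raw: Optional[dict]) -> "PolicyConfig":
        """Build a config from the parsed `policy:` mapping (or None)."""
        raw = raw or {}
        known = {f.name for f in fields(cls)}
        unknown = set(raw) - known
        if unknown:
            # A misspelled key (e.g. fail_close) should not silently fall
            # back to a default for a security-relevant setting.
            raise ValueError(f"unknown policy keys: {sorted(unknown)}")
        return cls(**raw)
```

With this shape, a `warden.yaml` that omits the `policy:` section entirely yields a disabled gate with `fail_closed` still set to `True`.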
---
## Versioning
| Version | Gate | Status |
| --- | --- | --- |
| **v1** | Inventory + TTL max | Shipped |
| **v2** | flex-auth opt-in via `policy.enabled` | Shipped (WP-0007) |
| **v2.1** | Identity claims required for `adm` signs | Planned |
| **v3** | Tenant-scoped policies per `tenant:*` | Planned |
---
## What stays in inventory
- Actor registration (name, type, default principals, default TTL)
- Host reference documentation
- Scorecard local checks

flex-auth decides **whether this sign request is allowed now**; inventory
defines **what the actor is allowed to request**.
---
## flex-auth policy package (FLEX-WP-0006)
flex-auth owns the `ssh-certificate` / `sign` policy package. ops-warden consumes
it via `POST /v1/check` when `policy.enabled: true`.

**Handoff (canonical):** `~/flex-auth/docs/ops-warden-policy-gate-handoff.md`

| Asset | flex-auth path |
| --- | --- |
| Policy package | `examples/ops-warden/policy_package.md` |
| Allow/deny fixtures | `examples/ops-warden/policy_fixtures.yaml` |
| Registry snapshot | `examples/ops-warden/registry_snapshot.json` |
| Subject manifest | `examples/ops-warden/subject_manifest.yaml` |
| Resource manifest | `examples/ops-warden/resource_manifest.yaml` |
### Tenant and subject bindings
| Field | Value |
| --- | --- |
| Tenant | `tenant:platform` (`policy.tenant`) |
| Resource system | `ops-warden` (`policy.system`) |
| Resource type | `ssh-certificate` |
| Action | `sign` |
| Resource id | `ssh-cert:actor/<actor-name>` |

| Actor type | Example flex-auth subject | ops-warden inventory name pattern |
| --- | --- | --- |
| `adm` | `platform-steward` | `adm-*` |
| `agt` | `ci-deploy-agent` | `agt-*` |
| `atm` | `backup-automation` | `atm-*` |

**Subject id sent to flex-auth:** `WARDEN_POLICY_SUBJECT` when set, otherwise the
inventory actor name. flex-auth may also allow `iam:<actor-name>` when listed in
`allowed_subjects` on the resource.

**Principals and TTL:** Taken from the sign request (inventory defaults). flex-auth
denies when principals are empty or disallowed, or when the TTL exceeds `max_ttl_hours`
on the registered resource.
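The principals and TTL rules can be illustrated with a few lines of Python. This is not flex-auth's implementation: `check_principals_and_ttl` is a hypothetical function, and the `allowed_principals` key on the resource is an assumed field name (only `max_ttl_hours` and `allowed_subjects` are named in this document).

```python
from typing import List, Tuple


def check_principals_and_ttl(
    principals: List[str], ttl_hours: float, resource: dict
) -> Tuple[bool, str]:
    """Return (allowed, reason) for the principals/TTL part of a sign check."""
    if not principals:
        return False, "principals empty"
    # Assumption: the registered resource lists its permitted principals.
    disallowed = sorted(set(principals) - set(resource["allowed_principals"]))
    if disallowed:
        return False, f"principals not allowed: {disallowed}"
    if ttl_hours > resource["max_ttl_hours"]:
        return False, f"ttl {ttl_hours}h exceeds max {resource['max_ttl_hours']}h"
    return True, "ok"
```

A TTL exactly equal to `max_ttl_hours` passes in this sketch, since the rule as stated only denies when the TTL exceeds the maximum.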
### Fixture coverage (flex-auth)
Allow: `fixture:ops-warden-adm-sign-allow`, `fixture:ops-warden-agt-sign-allow`,
`fixture:ops-warden-atm-sign-allow`.

Deny: `fixture:ops-warden-unknown-subject-deny`,
`fixture:ops-warden-actor-type-mismatch-deny`, `fixture:ops-warden-ttl-above-max-deny`,
`fixture:ops-warden-disallowed-principal-deny`,
`fixture:ops-warden-missing-fingerprint-deny`.
### Local smoke
```bash
# flex-auth (from ~/flex-auth)
flex-auth serve --addr 127.0.0.1:8080 \
  --registry examples/ops-warden/registry_snapshot.json \
  --policy examples/ops-warden/policy_package.md \
  --log /tmp/flex-auth-ops-warden-decisions.jsonl
# warden.yaml — policy.enabled: true, flex_auth_url pointing at flex-auth
# Use an actor registered in the flex-auth registry (example fixtures use
# template names; production needs a registry slice for real inventory actors).
```
Local end-to-end evidence: `history/2026-06-23-flex-auth-policy-gate-local-smoke.md`.
### Production registry from inventory
Build a flex-auth registry snapshot that mirrors `inventory.yaml` actors:
```bash
python scripts/build_flex_auth_registry.py ~/.config/warden/inventory.yaml \
  -o registry/flex-auth/production_registry_snapshot.json
flex-auth load-registry --file registry/flex-auth/production_registry_snapshot.json
```
Re-run after adding or changing actors. Deploy the snapshot to the production
flex-auth runtime together with `~/flex-auth/examples/ops-warden/policy_package.md`.

Smoke (non-secret):
```bash
./scripts/policy_gate_production_smoke.sh
# OpenBao-backed — preferred: credential broker (no manual VAULT_TOKEN):
cd ~/railiance-platform && make credential-exec-ops-warden-smoke
# Manual fallback when broker unavailable:
SMOKE_VAULT=1 ./scripts/policy_gate_production_smoke.sh
```
Evidence: `history/2026-06-23-flex-auth-policy-gate-production-smoke.md`.
---
## Production rollout
**Keep `policy.enabled: false` until flex-auth is reachable** at `policy.flex_auth_url`.
With `fail_closed: true`, an unreachable flex-auth blocks all signs.
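To make the `fail_closed` trade-off concrete, the sketch below maps a flex-auth outcome to a go/no-go result. It is an illustration under assumptions: `interpret_check` is a hypothetical helper, `status=None` represents an unreachable service, and the fail-open branch (proceeding without a decision id when `fail_closed` is `false`) is inferred from the option's name rather than documented here.

```python
from typing import Optional, Tuple


def interpret_check(
    status: Optional[int], body: Optional[dict], fail_closed: bool
) -> Tuple[bool, Optional[str], str]:
    """Return (proceed, decision_id, reason) for one flex-auth check."""
    if status is None or status >= 400 or body is None:
        if fail_closed:
            return False, None, "flex-auth unreachable or HTTP error (fail_closed)"
        # Assumption: with fail_closed disabled, the sign proceeds ungated.
        return True, None, "flex-auth unavailable; gate skipped (fail open)"
    if body.get("effect") == "allow":
        decision_id = body.get("id") or body.get("request_id")
        if decision_id is None:
            # An allow must carry an id; treat its absence as a denial.
            return False, None, "allow response without decision id"
        return True, decision_id, ""
    return False, None, body.get("reason", "denied")
```

With `fail_closed=True`, an outage and an explicit deny both end in a refused sign, which is why the gate should stay off until the service is confirmed reachable.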
### Operator checklist
| Step | Owner | Action |
| --- | --- | --- |
| 1 | flex-auth | Deploy runtime; confirm `curl <flex_auth_url>/healthz` → 200 (**FLEX-WP-0007**) |
| 2 | flex-auth | Load production registry + policy package (`~/flex-auth/examples/ops-warden/`) |
| 3 | ops-warden | Regenerate registry from inventory: `scripts/build_flex_auth_registry.py` |
| 4 | ops-warden | Local smoke: `./scripts/policy_gate_production_smoke.sh` |
| 5 | operator | Vault smoke: `make credential-exec-ops-warden-smoke` in `railiance-platform` (or manual `SMOKE_VAULT=1` fallback) |
| 6 | operator | Set `policy.flex_auth_url` in `~/.config/warden/warden.yaml` |
| 7 | operator | Set `policy.enabled: true`; keep `fail_closed: true` |
| 8 | operator | Allow smoke: `warden sign <actor>` → `signatures.log` has `policy_decision_id` |
| 9 | operator | Deny smoke: e.g. `--ttl` above max; CLI shows flex-auth `reason`, no cert |

Cross-repo references:
- `~/flex-auth/workplans/FLEX-WP-0007-ops-warden-policy-gate-production-deployment.md`
- `history/2026-06-23-flex-auth-production-pickup-suggestion.md`
- `history/2026-06-23-flex-auth-policy-gate-production-smoke.md`
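The health check in step 1 of the operator checklist can also be scripted as a preflight before flipping `policy.enabled`. The sketch below uses only the Python standard library; `flex_auth_healthy` is a hypothetical helper, and the `/healthz` path is taken from the checklist.

```python
import urllib.error
import urllib.request


def flex_auth_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True only when GET <base_url>/healthz answers with HTTP 200."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers connection refused, timeouts, DNS failures, and 4xx/5xx
        # responses (urlopen raises HTTPError, a URLError subclass).
        return False
```

For example, `flex_auth_healthy("http://127.0.0.1:8080")` would be expected to return `True` while the local smoke instance is running and `False` once it is stopped.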
### Summary
1. Deploy the flex-auth registry and policy package to the production flex-auth
runtime — **not** only the example fixtures.
2. Set `policy.flex_auth_url` to the production flex-auth base URL.
3. Set `policy.enabled: true` only after operator checklist steps 1-5 pass.
4. Keep `fail_closed: true` unless an explicit break-glass procedure exists.
5. Smoke allow and deny paths; preserve non-secret evidence only.
### Rollback
If signs are blocked after enabling the gate:
1. Set `policy.enabled: false` in `warden.yaml` (inventory + TTL gate only).
2. Confirm `warden sign` succeeds without flex-auth.
3. File a State Hub note to `flex-auth` with non-secret symptoms (HTTP status,
`fail_closed` behaviour, actor name).
4. Re-enable only after flex-auth runtime and registry are verified.

Evidence fields for the flip: flex-auth health URL, smoke script exit codes, and
`warden activity --kind sign --json` showing `policy_decision_id` on the allow path.
---
## See also
- `wiki/OpsWardenConfig.md` — full config reference
- `wiki/CredentialRouting.md`
- `~/flex-auth/docs/ops-warden-policy-gate-handoff.md` — flex-auth handoff
- `flex-auth/INTENT.md`
- `net-kingdom/docs/platform-identity-security-architecture.md`