
tegwick 90007c2cda feat: close WP-0009/WP-0013 production integration stewardship strand

Ship flex-auth policy gate registry and smoke evidence, archive WP-0009
through WP-0013, and add integration docs: ops-bridge cert_command
migration playbook, operator OpenBao token hygiene, principals drift
check script, and 2026-06-24 INTENT/SCOPE gap analysis.

2026-06-24 12:44:32 +02:00


Policy-Gated SSH Signing

Date: 2026-06-23
Status: implemented (opt-in) — WARDEN-WP-0007; policy package confirmed FLEX-WP-0006

By default, warden sign authorizes using only the inventory allow-list and the TTL policy. When policy.enabled: true is set in warden.yaml, ops-warden calls flex-auth before signing and records the decision id in signatures.log.

Flow

warden sign <actor> --pubkey <path>
        |
        v
Load actor from inventory (type, principals, ttl)
        |
        v
policy.enabled?
  no  -> skip
  yes -> flex-auth POST /v1/check
        |
        +-- DENY / unreachable (fail_closed) -> CAError
        |
        v ALLOW
CABackend.sign()  (local or OpenBao SSH engine)
        |
        v
Append signatures.log (+ policy_decision_id when set)

The same gate runs for warden issue (local backend only).
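The flow above can be sketched in a few lines of Python. This is an illustrative outline, not the ops-warden implementation: `sign_with_gate` and its injected `check`, `sign`, and `log` arguments are hypothetical stand-ins for the flex-auth call, `CABackend.sign()`, and `signatures.log`. Proceeding without a decision id when `fail_closed` is false is an assumption; the source only describes the fail-closed case.

```python
from typing import Callable, Optional


class CAError(Exception):
    """Raised when signing is refused (name taken from the flow diagram)."""


def sign_with_gate(
    actor: dict,
    policy: dict,
    check: Callable[[dict], dict],  # hypothetical: wraps flex-auth POST /v1/check
    sign: Callable[[dict], str],    # hypothetical: wraps CABackend.sign()
    log: list,                      # hypothetical: stands in for signatures.log
) -> str:
    """Run the optional policy gate, then sign and record the result."""
    decision_id: Optional[str] = None
    if policy.get("enabled", False):
        try:
            decision = check(actor)
        except OSError as exc:  # refused connections and timeouts are OSErrors
            if policy.get("fail_closed", True):
                raise CAError(f"flex-auth unreachable: {exc}") from exc
            decision = None  # assumption: fail-open continues without an id
        if decision is not None:
            if decision.get("effect") != "allow":
                reason = decision.get("reason", "no reason given")
                raise CAError(f"policy denied: {reason}")
            decision_id = decision.get("id") or decision.get("request_id")
    cert = sign(actor)
    entry = {"actor": actor["name"]}
    if decision_id is not None:
        entry["policy_decision_id"] = decision_id  # only recorded when set
    log.append(entry)
    return cert
```

Passing the network call and the CA backend in as callables keeps the ordering (gate first, sign second, log last) easy to exercise without a running flex-auth.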

flex-auth request shape

Field	Source
`subject.id`	`WARDEN_POLICY_SUBJECT` env var, or actor name
`subject.type`	Actor type (`adm` / `agt` / `atm`)
`tenant`	`policy.tenant` (default `tenant:platform`)
`resource.id`	`ssh-cert:actor/<actor-name>`
`resource.type`	`ssh-certificate`
`action`	`sign`
`context.principals`	From inventory
`context.actor_type`	`adm` / `agt` / `atm`
`context.pubkey_fingerprint`	SHA256 of pubkey text
`context.ttl_hours`	Requested TTL

On allow, flex-auth must return effect: allow together with an id (or request_id). Deny responses include a reason, which the CLI surfaces in its error message.
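As a rough illustration of the request shape, the sketch below assembles the fields from the table into a payload and extracts the decision id from a response. The helper names `build_check_request` and `decision_id` are hypothetical, the nesting of dotted field names into JSON objects is an assumption, and the fingerprint is computed as a hex SHA-256 of the stripped pubkey text, which may not match the encoding ops-warden actually uses.

```python
import hashlib
from typing import Optional


def build_check_request(
    actor: dict,
    pubkey_text: str,
    ttl_hours: int,
    tenant: str = "tenant:platform",
    subject_id: Optional[str] = None,
) -> dict:
    """Assemble a /v1/check payload from the documented fields (layout assumed)."""
    digest = hashlib.sha256(pubkey_text.strip().encode("utf-8")).hexdigest()
    return {
        "subject": {"id": subject_id or actor["name"], "type": actor["type"]},
        "tenant": tenant,
        "resource": {
            "id": f"ssh-cert:actor/{actor['name']}",
            "type": "ssh-certificate",
        },
        "action": "sign",
        "context": {
            "principals": list(actor["principals"]),
            "actor_type": actor["type"],
            "pubkey_fingerprint": digest,  # assumption: hex digest of the text
            "ttl_hours": ttl_hours,
        },
    }


def decision_id(response: dict) -> str:
    """Return the decision id on allow; raise with the deny reason otherwise."""
    if response.get("effect") != "allow":
        raise PermissionError(response.get("reason", "denied"))
    ident = response.get("id") or response.get("request_id")
    if not ident:
        raise ValueError("allow response carries neither id nor request_id")
    return ident
```

Treating an allow without an id as an error keeps the `policy_decision_id` in the audit log from silently going missing.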

Configuration

# warden.yaml — policy gate (opt-in, default off)
policy:
  enabled: false
  flex_auth_url: http://127.0.0.1:8080
  fail_closed: true
  tenant: tenant:platform
  subject_env: WARDEN_POLICY_SUBJECT
  system: ops-warden

Key	Default	Description
`enabled`	`false`	When `true`, call flex-auth before every sign/issue
`flex_auth_url`	`http://127.0.0.1:8080`	flex-auth base URL
`fail_closed`	`true`	Deny sign when flex-auth is unreachable or returns an HTTP error
`tenant`	`tenant:platform`	Tenant sent in subject and resource
`subject_env`	`WARDEN_POLICY_SUBJECT`	Env var for IAM subject id override
`system`	`ops-warden`	Resource system identifier

Set WARDEN_POLICY_SUBJECT to the caller's IAM profile sub when available. If unset, the actor name is used as subject id.
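A minimal sketch of how the defaults and the subject override could be resolved, assuming warden.yaml has already been parsed into a dict. `effective_policy` and `resolve_subject` are hypothetical names, and rejecting unknown keys is a design choice of this example rather than documented behavior.

```python
import os
from typing import Mapping

# Defaults copied from the configuration table.
POLICY_DEFAULTS = {
    "enabled": False,
    "flex_auth_url": "http://127.0.0.1:8080",
    "fail_closed": True,
    "tenant": "tenant:platform",
    "subject_env": "WARDEN_POLICY_SUBJECT",
    "system": "ops-warden",
}


def effective_policy(warden_cfg: Mapping) -> dict:
    """Overlay the `policy:` mapping on the documented defaults."""
    overrides = warden_cfg.get("policy") or {}
    unknown = set(overrides) - set(POLICY_DEFAULTS)
    if unknown:
        # Assumption: a typo such as `fail_close` should fail loudly.
        raise KeyError(f"unknown policy keys: {sorted(unknown)}")
    return {**POLICY_DEFAULTS, **overrides}


def resolve_subject(
    policy: Mapping, actor_name: str, environ: Mapping = os.environ
) -> str:
    """Use the env override when set and non-empty, else the actor name."""
    return environ.get(policy["subject_env"], "").strip() or actor_name
```

With an empty environment, `resolve_subject(effective_policy({}), "adm-alice", {})` falls back to the actor name, matching the rule stated above.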

Versioning

Version	Gate	Status
v1	Inventory + TTL max	Shipped
v2	flex-auth opt-in via `policy.enabled`	Shipped (WP-0007)
v2.1	Identity claims required for `adm` signs	Planned
v3	Tenant-scoped policies per `tenant:*`	Planned

What stays in inventory

Actor registration (name, type, default principals, default TTL)
Host reference documentation
Scorecard local checks

flex-auth decides whether this sign request is allowed now; inventory defines what the actor is allowed to request.

flex-auth policy package (FLEX-WP-0006)

flex-auth owns the ssh-certificate / sign policy package. ops-warden consumes it via POST /v1/check when policy.enabled: true.

Handoff (canonical): ~/flex-auth/docs/ops-warden-policy-gate-handoff.md

Asset	flex-auth path
Policy package	`examples/ops-warden/policy_package.md`
Allow/deny fixtures	`examples/ops-warden/policy_fixtures.yaml`
Registry snapshot	`examples/ops-warden/registry_snapshot.json`
Subject manifest	`examples/ops-warden/subject_manifest.yaml`
Resource manifest	`examples/ops-warden/resource_manifest.yaml`

Tenant and subject bindings

Field	Value
Tenant	`tenant:platform` (`policy.tenant`)
Resource system	`ops-warden` (`policy.system`)
Resource type	`ssh-certificate`
Action	`sign`
Resource id	`ssh-cert:actor/<actor-name>`

Actor type	Example flex-auth subject	ops-warden inventory name pattern
`adm`	`platform-steward`	`adm-*`
`agt`	`ci-deploy-agent`	`agt-*`
`atm`	`backup-automation`	`atm-*`

Subject id sent to flex-auth: WARDEN_POLICY_SUBJECT when set, otherwise the inventory actor name. flex-auth may also allow iam:<actor-name> when listed in allowed_subjects on the resource.

Principals and TTL: Taken from the sign request (inventory defaults). flex-auth denies when principals are empty/disallowed or TTL exceeds max_ttl_hours on the registered resource.

Fixture coverage (flex-auth)

Allow: fixture:ops-warden-adm-sign-allow, fixture:ops-warden-agt-sign-allow, fixture:ops-warden-atm-sign-allow.

Deny: fixture:ops-warden-unknown-subject-deny, fixture:ops-warden-actor-type-mismatch-deny, fixture:ops-warden-ttl-above-max-deny, fixture:ops-warden-disallowed-principal-deny, fixture:ops-warden-missing-fingerprint-deny.
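To make the deny categories concrete, here is a hedged sketch of the checks a policy evaluation might perform for one registered resource. It is not the flex-auth policy package: `evaluate` is a hypothetical function, `allowed_subjects` and `max_ttl_hours` come from the text above, and the `actor_type` and `allowed_principals` resource fields, the order of checks, and the reason strings are all assumptions.

```python
from typing import Tuple


def evaluate(resource: dict, request: dict) -> Tuple[str, str]:
    """Return (effect, reason) for one sign request against one resource."""
    subject = request["subject"]
    ctx = request["context"]
    if subject["id"] not in resource["allowed_subjects"]:
        return ("deny", "unknown subject")
    if subject["type"] != resource["actor_type"]:
        return ("deny", "actor type mismatch")
    if not ctx.get("pubkey_fingerprint"):
        return ("deny", "missing pubkey fingerprint")
    principals = ctx.get("principals") or []
    if not principals:
        return ("deny", "no principals requested")
    disallowed = sorted(set(principals) - set(resource["allowed_principals"]))
    if disallowed:
        return ("deny", "disallowed principals: " + ", ".join(disallowed))
    if ctx["ttl_hours"] > resource["max_ttl_hours"]:
        return ("deny", "ttl_hours exceeds max_ttl_hours")
    return ("allow", "")
```

Each early return corresponds to one of the deny fixtures listed above, so a table-driven test over the fixtures file would exercise every branch once.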

Local smoke

# flex-auth (from ~/flex-auth)
flex-auth serve --addr 127.0.0.1:8080 \
  --registry examples/ops-warden/registry_snapshot.json \
  --policy examples/ops-warden/policy_package.md \
  --log /tmp/flex-auth-ops-warden-decisions.jsonl

# warden.yaml — policy.enabled: true, flex_auth_url pointing at flex-auth
# Use an actor registered in the flex-auth registry (example fixtures use
# template names; production needs a registry slice for real inventory actors).

Local end-to-end evidence: history/2026-06-23-flex-auth-policy-gate-local-smoke.md.

Production registry from inventory

Build a flex-auth registry snapshot that mirrors inventory.yaml actors:

python scripts/build_flex_auth_registry.py ~/.config/warden/inventory.yaml \
  -o registry/flex-auth/production_registry_snapshot.json
flex-auth load-registry --file registry/flex-auth/production_registry_snapshot.json

Re-run after adding or changing actors. Deploy the snapshot to the production flex-auth runtime together with ~/flex-auth/examples/ops-warden/policy_package.md.
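The real `scripts/build_flex_auth_registry.py` is not reproduced here, and neither the inventory schema nor the snapshot schema is specified in this document. Purely to illustrate the idea of mirroring actors into registry resources, the sketch below assumes an inventory dict with an `actors` list (`name`, `type`, `principals`, `ttl_hours`) and emits a hypothetical `resources` list. Every field name other than the resource id pattern, resource type, tenant, system, and `max_ttl_hours` is an assumption.

```python
import json


def registry_from_inventory(
    inventory: dict,
    tenant: str = "tenant:platform",
    system: str = "ops-warden",
) -> dict:
    """Mirror each inventory actor into one ssh-certificate resource."""
    resources = []
    # Sort by name so regenerating the snapshot yields a stable diff.
    for actor in sorted(inventory.get("actors", []), key=lambda a: a["name"]):
        resources.append(
            {
                "id": f"ssh-cert:actor/{actor['name']}",
                "type": "ssh-certificate",
                "system": system,
                "tenant": tenant,
                "actor_type": actor["type"],                      # assumed field
                "allowed_principals": list(actor["principals"]),  # assumed field
                "max_ttl_hours": actor["ttl_hours"],
            }
        )
    return {"resources": resources}


def dump_snapshot(registry: dict) -> str:
    """Serialize deterministically so the snapshot can be reviewed in git."""
    return json.dumps(registry, indent=2, sort_keys=True) + "\n"
```

Deterministic ordering matters because the snapshot is regenerated whenever actors change; a stable layout keeps review diffs limited to the actors that actually moved.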

Smoke (non-secret):

./scripts/policy_gate_production_smoke.sh
# OpenBao-backed when VAULT_TOKEN is valid:
SMOKE_VAULT=1 ./scripts/policy_gate_production_smoke.sh

Evidence: history/2026-06-23-flex-auth-policy-gate-production-smoke.md.

Production rollout

Keep policy.enabled: false until flex-auth is reachable at policy.flex_auth_url. With fail_closed: true, an unreachable flex-auth blocks all signs.

Operator checklist

Step	Owner	Action
1	flex-auth	Deploy runtime; confirm `curl <flex_auth_url>/healthz` → 200 (FLEX-WP-0007)
2	flex-auth	Load production registry + policy package (`~/flex-auth/examples/ops-warden/`)
3	ops-warden	Regenerate registry from inventory: `scripts/build_flex_auth_registry.py`
4	ops-warden	Local smoke: `./scripts/policy_gate_production_smoke.sh`
5	operator	Vault smoke: `SMOKE_VAULT=1 ./scripts/policy_gate_production_smoke.sh` (valid `VAULT_TOKEN`)
6	operator	Set `policy.flex_auth_url` in `~/.config/warden/warden.yaml`
7	operator	Set `policy.enabled: true`; keep `fail_closed: true`
8	operator	Allow smoke: `warden sign <actor>` — `signatures.log` has `policy_decision_id`
9	operator	Deny smoke: e.g. `--ttl` above max — CLI shows flex-auth `reason`, no cert
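Step 1 of the checklist can also be scripted as a preflight before `policy.enabled` is switched on. The following is a small sketch using only the standard library; `healthz_ok` is a hypothetical helper, and the `opener` parameter exists only so the check can be exercised without a live flex-auth.

```python
import urllib.request


def healthz_ok(
    base_url: str,
    timeout: float = 3.0,
    opener=urllib.request.urlopen,
) -> bool:
    """Return True only when <base_url>/healthz answers with HTTP 200."""
    url = base_url.rstrip("/") + "/healthz"
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError subclass OSError, as do timeouts and
        # refused connections, so any of them counts as "not healthy".
        return False


if __name__ == "__main__":
    import sys

    target = sys.argv[1] if len(sys.argv) > 1 else "http://127.0.0.1:8080"
    sys.exit(0 if healthz_ok(target) else 1)
```

A non-zero exit code makes the check easy to chain in front of the smoke scripts, so a down flex-auth is caught before any sign attempt fails closed.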

Cross-repo references:

~/flex-auth/workplans/FLEX-WP-0007-ops-warden-policy-gate-production-deployment.md
history/2026-06-23-flex-auth-production-pickup-suggestion.md
history/2026-06-23-flex-auth-policy-gate-production-smoke.md

Summary

Deploy the flex-auth registry and policy package to the production flex-auth runtime — not only the example fixtures.
Set policy.flex_auth_url to the production flex-auth base URL.
Set policy.enabled: true only after steps 1–5 pass.
Keep fail_closed: true unless an explicit break-glass procedure exists.
Smoke allow and deny paths; preserve non-secret evidence only.
