# AccessManagementDirective

*Practical host access control management*

**Document Title:** SSH Access Management Directive
**Version:** 1.1 (Production-Ready Revision – Post-SWOT Improvements)
**Date:** 28 March 2026
**Audience:** Operations Department
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
**Author:** Grok (on behalf of the team)
**Status:** Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.

## 0. Prerequisites

Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.

If any of these are missing, complete them first; otherwise the “automatic” parts of this directive will not function reliably.

## 1. Concept Overview

This directive replaces the legacy practice of scattering static SSH public keys across `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.

**Why this model?**
- A central CA signs short-lived certificates; hosts trust the CA rather than individual keys.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.

**Core Principles**
- **Least privilege** – Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
- **Short-lived credentials** – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
- **One CA, many issuers** – A single offline User CA whose public key is trusted by every host.
- **Automation-first** – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- **Separation of concerns** –
  - **Admins (adm)**: Human operators (full interactive shell when needed).
  - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
  - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.

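The lifetime bounds above lend themselves to a mechanical check before any signing request is sent. The following is a minimal sketch, not part of the directive's tooling: the names `MAX_TTL`, `actor_type`, and `ttl_allowed` are hypothetical, and only the upper bounds (48 h, 24 h, 8 h) come from the text.

```python
from datetime import timedelta

# Upper lifetime bounds per actor type, taken from the Core Principles.
MAX_TTL = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=8),
}


def actor_type(identity: str) -> str:
    """Return the actor prefix ('adm', 'agt', 'atm') of an identity string."""
    prefix, sep, name = identity.partition("-")
    if not sep or not name or prefix not in MAX_TTL:
        raise ValueError(f"unknown actor prefix in identity: {identity!r}")
    return prefix


def ttl_allowed(identity: str, requested: timedelta) -> bool:
    """True if the requested lifetime is positive and within the actor's cap."""
    return timedelta(0) < requested <= MAX_TTL[actor_type(identity)]
```

A signing wrapper could call `ttl_allowed("atm-backup-daily", timedelta(hours=2))` and refuse to contact the CA when it returns `False`.
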
## 2. Actor Definitions & Access Model

| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |

**Certificate Naming Convention**
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)

**LLM-Agent Risk Clarification**
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.

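To make the naming convention and the agent rule concrete, here is a hedged sketch that assembles the argument list for a manual `ssh-keygen -s` signing call. The function name `build_sign_command`, the file names, and the `agt-wrapper.sh` path in the usage note are illustrative assumptions; the flags (`-s`, `-I`, `-n`, `-V`, `-O force-command=`, `-O source-address=`) are standard `ssh-keygen` options. The command is only built here, not executed.

```python
from typing import List, Optional


def build_sign_command(
    ca_key: str,
    identity: str,
    principals: List[str],
    validity: str,
    pubkey_path: str,
    force_command: Optional[str] = None,
    source_address: Optional[str] = None,
) -> List[str]:
    """Assemble an `ssh-keygen -s` argument list for one user certificate."""
    if identity.startswith("agt-") and not force_command:
        # Directive: never grant blanket shell access to autonomous agents.
        raise ValueError("agent certificates require a force-command")
    cmd = ["ssh-keygen", "-s", ca_key, "-I", identity,
           "-n", ",".join(principals), "-V", validity]
    if force_command:
        cmd += ["-O", f"force-command={force_command}"]
    if source_address:
        cmd += ["-O", f"source-address={source_address}"]
    # The public key to sign is the final positional argument.
    return cmd + [pubkey_path]
```

For example, `build_sign_command("ca_user", "agt-incident-resolver-v2", ["agt-task-restart"], "+4h", "agent.pub", force_command="/usr/local/bin/agt-wrapper.sh")` yields a list that can be passed to `subprocess.run`.
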
## 3. Bootstrapping the System (One-Time Setup)

### 3.1. Create the CA (do this once, offline)
```bash
# -N is omitted on purpose so that ssh-keygen prompts for a passphrase;
# the CA private key should never be written to disk unencrypted.
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)"
```
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
- Public key: `ca_user.pub`

### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:
  ```text
  TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
  AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
  PubkeyAuthentication yes
  PasswordAuthentication no
  PermitRootLogin no
  ```
- Create the principals directory and files from the central Git inventory.
- Validate the configuration with `sshd -t`, then apply it with `systemctl restart sshd`.

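A playbook run can be double-checked by comparing the effective settings against the five directives listed above. The sketch below is an illustration under assumptions: `REQUIRED`, `parse_sshd_config`, and `missing_settings` are hypothetical names, and the parser handles only flat `Keyword value` lines (it ignores `Match` blocks and `Include` files). It works on either the config file text or the output of `sshd -T`.

```python
# Settings the directive requires on every host (section 3.2).
REQUIRED = {
    "trustedusercakeys": "/etc/ssh/ca/ca_user.pub",
    "authorizedprincipalsfile": "/etc/ssh/auth_principals/%u",
    "pubkeyauthentication": "yes",
    "passwordauthentication": "no",
    "permitrootlogin": "no",
}


def parse_sshd_config(text: str) -> dict:
    """Parse flat 'Keyword value' lines into a dict keyed by lowercase keyword."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            # sshd uses the first value it obtains for a keyword, so keep the first.
            settings.setdefault(parts[0].lower(), parts[1].strip())
    return settings


def missing_settings(text: str) -> list:
    """Return the required keywords that are absent or set to another value."""
    found = parse_sshd_config(text)
    return sorted(k for k, v in REQUIRED.items() if found.get(k) != v)
```

An empty list from `missing_settings(open("/etc/ssh/sshd_config").read())` means all five directives are present with the expected values.
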
### 3.3. Initial Admin Access
The first admin generates a personal keypair and submits the `.pub` file; the CA then signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.

## 4. Automatic Management of Access Rights

### 4.1. Daily / On-Demand Workflow
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
   - **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
   - **Agents (agt)**: At startup, call the Vault SSH engine API (auto-refreshed by a wrapper daemon).
   - **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.

2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
   - `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
   - Example inventory snippet:
   ```yaml
   hosts:
     - name: prod-db-01
       allowed_principals:
         adm: [adm-full]
         agt: [agt-incident-resolver-v2]
         atm: [atm-backup-daily, atm-logrotate]
   ```

3. **Revocation & Rotation**
   - Short expiry = automatic revocation.
   - For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
   - Agents/automations never store long-lived private keys on disk.

4. **Concrete Agent & Automation Wrapper Example** (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`)
   ```python
   #!/usr/bin/env python3
   import os
   import subprocess
   import sys

   # Request a short-lived cert from Vault for the public key in SSH_PUBKEY.
   cert = subprocess.check_output(
       ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
        f"public_key={os.environ['SSH_PUBKEY']}"]
   ).decode().strip()

   # ssh-add picks up a certificate from "<private-key>-cert.pub", so write it
   # next to the private key. SSH_KEY_FILE (private key path) is an assumed
   # variable, not part of the original snippet.
   key_path = os.environ["SSH_KEY_FILE"]
   with open(f"{key_path}-cert.pub", "w") as f:
       f.write(cert + "\n")

   # Load key and certificate into ssh-agent, then exec the real command.
   subprocess.run(["ssh-add", key_path], check=True)
   os.execvp(sys.argv[1], sys.argv[1:])
   ```
   Agents call this wrapper; it auto-refreshes the cert on every wake-up.

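The hourly rendering of `auth_principals/` files from the inventory can be sketched as a pure function. This is an illustration only: `render_principals` is a hypothetical name, the inventory is passed as an already-parsed dictionary, and the keys under `allowed_principals` (`adm`, `agt`, `atm`) are assumed to be the local account names that `%u` expands to. The output format, one principal per line, is what `AuthorizedPrincipalsFile` expects.

```python
def render_principals(inventory: dict) -> dict:
    """Map (host, account) -> file body for /etc/ssh/auth_principals/<account>."""
    files = {}
    for host in inventory.get("hosts", []):
        for account, principals in host.get("allowed_principals", {}).items():
            # One principal per line, sorted and de-duplicated for stable diffs.
            body = "".join(f"{p}\n" for p in sorted(set(principals)))
            files[(host["name"], account)] = body
    return files


# Same structure as the example inventory snippet, parsed into Python objects.
INVENTORY = {
    "hosts": [
        {
            "name": "prod-db-01",
            "allowed_principals": {
                "adm": ["adm-full"],
                "agt": ["agt-incident-resolver-v2"],
                "atm": ["atm-backup-daily", "atm-logrotate"],
            },
        }
    ]
}
```

Keeping the renderer free of I/O makes it easy to diff the result against what is currently deployed before Ansible writes anything.
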
### 4.2. Human UX Guidance
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases.

### 4.3. Emergency Break-Glass Procedure
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
4. After recovery, immediately rotate the CA and run a full scorecard.

## 5. AccessManagement Scorecard (Checklist)

Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.

| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `sshd -T \| grep -i trustedusercakeys` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f <cert>` (once per certificate) |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep -i passwordauthentication` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep -i permitrootlogin` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | 10/10 = **Operational** | - | - |

**Scorecard Execution Command** (run from ops laptop):
```bash
ansible all -m command -a "ssh-access-scorecard.sh" --become
```

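The Expiry Policy check can be automated by reading the `Valid: from ... to ...` line that `ssh-keygen -L` prints for a certificate. The sketch below is an assumption-laden illustration: `cert_lifetime` and `within_policy` are hypothetical names, the input is the captured text output for a single certificate, and a certificate without an end date (`Valid: forever`) is treated as an error. The limit is applied inclusively so that a 48-hour admin certificate, as permitted in section 1, still passes.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "Valid: from 2026-03-28T10:00:00 to 2026-03-29T10:00:00".
VALID_RE = re.compile(r"Valid:\s+from\s+(\S+)\s+to\s+(\S+)")


def cert_lifetime(keygen_output: str) -> timedelta:
    """Return the validity window length from `ssh-keygen -L` output."""
    match = VALID_RE.search(keygen_output)
    if not match:
        raise ValueError("no bounded 'Valid: from ... to ...' line found")
    start, end = (datetime.fromisoformat(t) for t in match.groups())
    return end - start


def within_policy(keygen_output: str, limit: timedelta) -> bool:
    """True if the certificate lifetime does not exceed the given limit."""
    return cert_lifetime(keygen_output) <= limit
```

A scorecard script could feed each of the last 100 certificates through `within_policy` with the limit for its actor type and count the failures.
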
## 6. Scope & Operational Boundaries

### 6.1. When Bootstrapping Is Officially Closed
The system is **fully operational** when **ALL** of the following are true:
- Scorecard passes 10/10 on every host.
- Central Git repo contains the authoritative principals inventory.
- First three admins have successfully used signed certificates for 7 consecutive days.
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
- CI/CD pipeline for host config updates is green and runs hourly.
- Emergency break-glass procedure has been tested once.

**Declaration:** Ops Lead signs off with date in the Git commit message.

### 6.2. Scope Boundary – When to Switch to Sophisticated Tooling
Stay with **native OpenSSH CA + Ansible + Vault** while:
- ≤ 200 hosts
- ≤ 50 distinct agent/automation identities
- No regulatory requirement for SSO or full session recording

**Switch triggers** (any one):
- More than 200 hosts, or rapid daily growth
- Need for human SSO (Okta/Google) integration
- Requirement for audited web-based SSH sessions or just-in-time access approval
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
- Audit/compliance demands central policy engine or session recording

**Recommended next-level tools** (in order):
1. **Teleport** – Best for mixed human + agent workloads (SSO + Machine ID).
2. **HashiCorp Vault SSH + Boundary** – When you already use Vault heavily.
3. **step-ca (Smallstep)** – If you prefer a pure open-source CA with OIDC.

**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.

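The switch criteria are simple enough to evaluate from a few fleet facts during the quarterly review. The following sketch is hypothetical throughout: the `Fleet` fields and `switch_triggers` are illustrative names, "rapid daily growth" is left out because the directive gives no threshold for it, and exceeding 50 machine identities is treated as a trigger because it violates the stay condition above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fleet:
    hosts: int
    machine_identities: int  # distinct agt + atm identities
    needs_sso: bool = False
    needs_session_recording: bool = False
    needs_workload_identity: bool = False


def switch_triggers(fleet: Fleet) -> List[str]:
    """Return the reasons, if any, to move beyond native OpenSSH CA tooling."""
    reasons = []
    if fleet.hosts > 200:
        reasons.append("more than 200 hosts")
    if fleet.machine_identities > 50:
        reasons.append("more than 50 agent/automation identities")
    if fleet.needs_sso:
        reasons.append("human SSO integration required")
    if fleet.needs_session_recording:
        reasons.append("session recording or central policy engine required")
    if fleet.needs_workload_identity:
        reasons.append("agents need built-in workload identity")
    return reasons
```

An empty result means the native stack is still within scope; any entry is a single sufficient reason to start evaluating the next-level tools.
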
## 7. Enforcement & Review
- **Quarterly review** of this directive and scorecard results.
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
- **Questions / improvements** → create PR against this file in the ops repo.

**End of Document**
Approved for immediate use across all production and staging environments.