ops-bridge/wiki/AccessManagementDirective.md
Bernd Worsch f3a7236c5d docs: align architecture and scope with AccessManagementDirective
Expands architecture constraints and SCOPE.md to reflect the three-actor
vocabulary (adm/agt/atm), two credential modes (static key + cert_command),
and ops-warden boundary. Adds directive wiki doc and two new workplans
(BRIDGE-WP-0004 directive alignment, WARDEN-WP-0001 ops-warden bootstrap).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-28 00:59:38 +00:00

*Practical host access control management*
# AccessManagementDirective
**Document Title:** SSH Access Management Directive
**Version:** 1.1 (Production-Ready Revision Post-SWOT Improvements)
**Date:** 28 March 2026
**Audience:** Operations Department
**Purpose:** Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
**Author:** Grok (on behalf of the team)
**Status:** Official Directive. All ops personnel, agents, and automation pipelines MUST follow this.
**Changes in v1.1:** Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
## 0. Prerequisites
Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
## 1. Concept Overview
This directive replaces the legacy practice of scattering static SSH public keys in `~/.ssh/authorized_keys` files. Instead, we adopt **SSH Certificate Authority (CA) based authentication** as the single source of truth.
**Why this model?**
- A central CA signs short-lived certificates for every login.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
**Core Principles**
- **Least privilege:** Every certificate carries explicit *principals* (roles) and optional `force-command` / `source-address` restrictions.
- **Short-lived credentials:** Certificates expire automatically (24-48 h for admins, 4-24 h for agents, 1-8 h for automations).
- **One CA, many issuers:** A single offline User CA whose public key is trusted by every host.
- **Automation-first:** All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- **Separation of concerns:**
  - **Admins (adm)**: Human operators (full interactive shell when needed).
  - **Agents (agt)**: LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
  - **Automations (atm)**: Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
## 2. Actor Definitions & Access Model
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|------------|-------------------|-------------|------------------------------|---------------------------|
| **Admin (adm)** | `adm-` | Human operator (on-call engineers) | 24-48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| **Agent (agt)** | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4-24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| **Automation (atm)** | `atm-` | Deterministic script / pipeline | 1-8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
**Certificate Naming Convention**
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
**LLM-Agent Risk Clarification**
Agent signing policy MUST enforce least-privilege principals + `force-command` wrappers; never grant blanket shell access to autonomous agents.
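
The naming convention, lifetime caps, and agent rule above could be enforced at signing time with a check along these lines. This is a minimal sketch, not the directive's actual policy engine: `SigningRequest` and `validate_request` are hypothetical names, and the caps are simply the upper bounds of the lifetimes in the table.

```python
from dataclasses import dataclass
from typing import Optional

# Upper TTL bounds in hours per actor prefix, taken from the table above.
MAX_TTL_HOURS = {"adm": 48, "agt": 24, "atm": 8}


@dataclass
class SigningRequest:
    identity: str                       # e.g. "agt-incident-resolver-v2"
    ttl_hours: int
    force_command: Optional[str] = None


def validate_request(req: SigningRequest) -> list:
    """Return a list of policy violations; an empty list means the request passes."""
    prefix, _, name = req.identity.partition("-")
    if prefix not in MAX_TTL_HOURS or not name:
        return [f"identity {req.identity!r} must look like adm-/agt-/atm-<name>"]
    problems = []
    cap = MAX_TTL_HOURS[prefix]
    if not 0 < req.ttl_hours <= cap:
        problems.append(f"TTL {req.ttl_hours}h is outside 1..{cap}h for {prefix}")
    # Agents and automations must never receive an unrestricted shell.
    if prefix in ("agt", "atm") and not req.force_command:
        problems.append(f"{prefix} certificates require a force-command wrapper")
    return problems


print(validate_request(SigningRequest("agt-incident-resolver-v2", 48)))
```

The example request is rejected twice: the TTL exceeds the agent cap and no `force-command` is set.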
## 3. Bootstrapping the System (One-Time Setup)
### 3.1. Create the CA (do this once, offline)
```bash
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
```
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with **4-eyes approval** required for any signing operation.
- Rotate the CA key itself every 2-3 years using the same bootstrap playbook.
- Public key: `ca_user.pub`
### 3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)
- Copy `ca_user.pub` to `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:
```bash
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
```
- Create principals directory and files from the central Git inventory.
- `systemctl restart sshd`
### 3.3. Initial Admin Access
First admin generates personal keypair → submits `.pub` → CA signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
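
For illustration, the manual signing step could look roughly like the sketch below, which only assembles the `ssh-keygen -s` command line and does not run it. `build_sign_command` is a hypothetical helper and `id_ed25519.pub` is an assumed file name; the CA path is the one from section 3.1, and in practice the CA key would be reached through the HSM/Vault approval flow rather than as a plain file.

```python
import shlex


def build_sign_command(ca_key, pubkey, identity, principals,
                       validity="+48h", force_command=None):
    """Assemble an `ssh-keygen -s` invocation that signs `pubkey` with the user CA."""
    cmd = [
        "ssh-keygen",
        "-s", ca_key,                # CA private key used for signing
        "-I", identity,              # certificate identity, recorded in sshd logs
        "-n", ",".join(principals),  # comma-separated principals
        "-V", validity,              # validity interval relative to now
    ]
    if force_command:
        cmd += ["-O", f"force-command={force_command}"]
    cmd.append(pubkey)               # ssh-keygen writes <name>-cert.pub next to this file
    return cmd


cmd = build_sign_command("/secure/vault/ca_user", "id_ed25519.pub",
                         "adm-bernd", ["adm-bootstrap"])
print(shlex.join(cmd))
```

Keeping the command as a list makes it safe to hand to `subprocess.run` without shell quoting issues.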
## 4. Automatic Management of Access Rights
### 4.1. Daily / On-Demand Workflow
1. **Key/Certificate Issuance Pipeline** (GitOps + Vault)
   - **Humans (adm)**: Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
   - **Agents (agt)**: At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
   - **Automations (atm)**: Just-in-time cert request via Vault inside a thin wrapper script.
2. **Ansible-Driven Host Updates** (run hourly via CI/CD)
   - `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
   - Example inventory snippet:
```yaml
hosts:
  - name: prod-db-01
    allowed_principals:
      adm: [adm-full]
      agt: [agt-incident-resolver-v2]
      atm: [atm-backup-daily, atm-logrotate]
```
3. **Revocation & Rotation**
   - Short expiry = automatic revocation.
   - For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
   - Agents/automations never store long-lived private keys on disk.
4. **Concrete Agent & Automation Wrapper Example** (Python snippet; place in `/usr/local/bin/ops-ssh-wrapper`)
```python
#!/usr/bin/env python3
import os
import subprocess
import sys

# Assumption: SSH_KEY_PATH (not in the original snippet) names the private key file.
# ssh-add takes a private key and picks up the certificate from "<key>-cert.pub".
key_path = os.environ["SSH_KEY_PATH"]

# Request a short-lived cert from Vault for the public key in SSH_PUBKEY
cert = subprocess.check_output(
    ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
     f"public_key={os.environ['SSH_PUBKEY']}"]
).decode().strip()

# Write the certificate where ssh-add expects to find it
with open(f"{key_path}-cert.pub", "w") as f:
    f.write(cert + "\n")

# Load key + cert into ssh-agent, then exec the real command
subprocess.run(["ssh-add", key_path], check=True)
os.execvp(sys.argv[1], sys.argv[1:])
```
Agents call this wrapper; it auto-refreshes the cert on every wake-up.
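
The inventory-to-`auth_principals/` rendering from step 2 can be sketched as below. It is only an illustration of the idea, not the Ansible template itself: `render_principal_files` is a hypothetical helper, the host entry is the example inventory written as a Python dict, and the sketch assumes the inventory keys (`adm`, `agt`, `atm`) are also the local account names that `%u` expands to.

```python
def render_principal_files(host):
    """Map each auth_principals file path to its content for one inventory host entry."""
    files = {}
    for account, principals in host["allowed_principals"].items():
        # One principal per line; sorted so repeated renders give identical files.
        path = f"/etc/ssh/auth_principals/{account}"
        files[path] = "\n".join(sorted(principals)) + "\n"
    return files


host = {
    "name": "prod-db-01",
    "allowed_principals": {
        "adm": ["adm-full"],
        "agt": ["agt-incident-resolver-v2"],
        "atm": ["atm-backup-daily", "atm-logrotate"],
    },
}

for path, content in render_principal_files(host).items():
    print(path, content.split())
```

Deterministic output matters here because the hourly run should report "changed" only when the inventory actually changed.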
### 4.2. Human UX Guidance
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases.
### 4.3. Emergency Break-Glass Procedure
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
1. Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
2. Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
3. Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
4. After recovery, immediately rotate the CA and run a full scorecard.
## 5. AccessManagement Scorecard (Checklist)
Run via Ansible `ssh-access-audit.yml`. Each item is pass/fail.
| Category | Check | Target | Tool |
|----------|-------|--------|------|
| **CA Trust** | `TrustedUserCAKeys` points to correct file | All hosts | `ssh-audit` |
| **No Static Keys** | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| **Principals Config** | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| **Expiry Policy** | All issued certs have `Valid: < 48h` (adm) or `< 24h` (agt/atm) | Last 100 certs | `ssh-keygen -L -f *.pub` |
| **Password Auth** | Disabled globally | All hosts | `sshd -T \| grep password` |
| **Root Login** | Disabled | All hosts | `sshd -T \| grep permitroot` |
| **Agent/Automation Wrapper** | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| **Audit Logging** | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| **CA Security** | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| **Bootstrap Complete** | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| **Score** | 10/10 = **Operational** | - | - |
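
The Expiry Policy row could be automated by parsing the `Valid:` line that `ssh-keygen -L` prints for a certificate. The following is a rough sketch under assumptions: `cert_lifetime` and `within_policy` are hypothetical helpers, the sample line is made up, and a real check may need a small tolerance because signers can backdate the start time slightly.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "Valid: from 2026-03-28T01:00:00 to 2026-03-28T09:00:00"
VALID_RE = re.compile(r"Valid: from (\S+) to (\S+)")


def cert_lifetime(listing):
    """Return the validity window from `ssh-keygen -L` text, or None if unbounded or absent."""
    match = VALID_RE.search(listing)
    if not match:
        return None  # e.g. "Valid: forever"
    start, end = (datetime.strptime(t, "%Y-%m-%dT%H:%M:%S") for t in match.groups())
    return end - start


def within_policy(listing, max_hours):
    """True only when the certificate has a bounded lifetime of at most max_hours."""
    lifetime = cert_lifetime(listing)
    return lifetime is not None and lifetime <= timedelta(hours=max_hours)


sample = "        Valid: from 2026-03-28T01:00:00 to 2026-03-28T09:00:00\n"
print(cert_lifetime(sample), within_policy(sample, 24))
```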
**Scorecard Execution Command** (run from ops laptop):
```bash
ansible all -m command -a "ssh-access-scorecard.sh" --become
```
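
As an illustration of what the per-host part of such a scorecard script might do, the sketch below compares `sshd -T` output with the settings from section 3.2. `check_sshd_config` and the sample dump are assumptions for the example, not the contents of `ssh-access-scorecard.sh`.

```python
# Expected effective settings, using the lowercase keywords that `sshd -T` prints.
REQUIRED = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
    "pubkeyauthentication": "yes",
    "trustedusercakeys": "/etc/ssh/ca/ca_user.pub",
    "authorizedprincipalsfile": "/etc/ssh/auth_principals/%u",
}


def check_sshd_config(dump):
    """Return {setting: passed} for every required setting in an `sshd -T` dump."""
    effective = {}
    for line in dump.splitlines():
        key, _, value = line.strip().partition(" ")
        effective.setdefault(key.lower(), value.strip())
    return {key: effective.get(key) == want for key, want in REQUIRED.items()}


sample_dump = """\
passwordauthentication no
permitrootlogin yes
pubkeyauthentication yes
trustedusercakeys /etc/ssh/ca/ca_user.pub
authorizedprincipalsfile /etc/ssh/auth_principals/%u
"""

failed = [key for key, ok in check_sshd_config(sample_dump).items() if not ok]
print("failed checks:", failed)
```

A missing setting counts as a failure, so a host that never received the bootstrap playbook fails rather than silently passing.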
## 6. Scope & Operational Boundaries
### 6.1. When Bootstrapping Is Officially Closed
The system is **fully operational** when **ALL** of the following are true:
- Scorecard passes 10/10 on every host.
- Central Git repo contains the authoritative principals inventory.
- First three admins have successfully used signed certificates for 7 consecutive days.
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
- CI/CD pipeline for host config updates is green and runs hourly.
- Emergency break-glass procedure has been tested once.
**Declaration:** Ops Lead signs off with date in the Git commit message.
### 6.2. Scope Boundary: When to Switch to Sophisticated Tooling
Stay with **native OpenSSH CA + Ansible + Vault** while:
- ≤ 200 hosts
- ≤ 50 distinct agent/automation identities
- No regulatory requirement for SSO or full session recording
**Switch triggers** (any one):
- More than 200 hosts OR rapid daily growth
- Need for human SSO (Okta/Google) integration
- Requirement for audited web-based SSH sessions or just-in-time access approval
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
- Audit/compliance demands central policy engine or session recording
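
The thresholds and triggers above could be encoded as a simple check, for example in a quarterly review script. This is a sketch only: `switch_triggers` and the requirement labels are hypothetical shorthand, the 50-identity limit is taken from the "stay with" list, and "rapid daily growth" is left to human judgment.

```python
# Hypothetical labels for the qualitative triggers listed above.
KNOWN_REQUIREMENTS = {
    "sso", "session-recording", "jit-approval", "workload-identity", "policy-engine",
}


def switch_triggers(host_count, machine_identities, requirements=()):
    """Return the reasons to move beyond native OpenSSH CA + Ansible + Vault."""
    reasons = []
    if host_count > 200:
        reasons.append("more than 200 hosts")
    if machine_identities > 50:
        reasons.append("more than 50 agent/automation identities")
    reasons += [f"requirement: {r}" for r in requirements if r in KNOWN_REQUIREMENTS]
    return reasons


print(switch_triggers(250, 30, ["sso"]))
```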
**Recommended next-level tools** (in order):
1. **Teleport:** Best for mixed human + agent workloads (SSO + Machine ID).
2. **HashiCorp Vault SSH + Boundary:** When you already use Vault heavily.
3. **step-ca + smallstep:** If you prefer a pure open-source CA with OIDC.
**Migration path:** The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
## 7. Enforcement & Review
- **Quarterly review** of this directive and scorecard results.
- **Violations** (e.g., adding static keys) trigger immediate access revocation and incident ticket.
- **Questions / improvements** → create PR against this file in the ops repo.
**End of Document**
Approved for immediate use across all production and staging environments.