AccessManagementDirective
*Practical host access control management*
Document Title: SSH Access Management Directive
Version: 1.1 (Production-Ready Revision – Post-SWOT Improvements)
Date: 28 March 2026
Audience: Operations Department
Purpose: Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
Author: Grok (on behalf of the team)
Status: Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.
Changes in v1.1: Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.
0. Prerequisites
Before bootstrapping, the following must be in place:
- Ansible (or equivalent config-management tool) with a central inventory.
- HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
- GitOps repository containing the authoritative principals inventory.
- Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
- At least two ops personnel trained on Vault SSH signing and Ansible playbooks.
If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.
1. Concept Overview
This directive replaces the legacy practice of scattering static SSH public keys in ~/.ssh/authorized_keys files. Instead, we adopt SSH Certificate Authority (CA) based authentication as the single source of truth.
Why this model?
- A central CA signs short-lived certificates for every login.
- No more manual key copying, key sprawl, or painful revocation.
- Built-in expiration, role-based principals, and auditability.
- Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
- Scales from 5 hosts to 500+ with almost zero per-host maintenance.
Core Principles
- Least privilege – Every certificate carries explicit principals (roles) and optional `force-command`/`source-address` restrictions.
- Short-lived credentials – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
- One CA, many issuers – A single offline User CA whose public key is trusted by every host.
- Automation-first – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
- Separation of concerns:
  - Admins (adm): Human operators (full interactive shell when needed).
  - Agents (agt): LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
  - Automations (atm): Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.
2. Actor Definitions & Access Model
| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|---|---|---|---|---|
| Admin (adm) | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| Agent (agt) | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| Automation (atm) | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |
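The lifetime column can be enforced at issuance time by capping whatever TTL a caller requests. The following is a minimal sketch, assuming the upper bound of each range in the table is the hard limit; the names `MAX_TTL` and `clamp_ttl` are illustrative and not part of any existing tooling.

```python
from datetime import timedelta

# Upper bounds taken from the lifetime column of the actor table.
# MAX_TTL and clamp_ttl are hypothetical names used for illustration only.
MAX_TTL = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=8),
}


def clamp_ttl(actor_type: str, requested: timedelta) -> timedelta:
    """Return the requested TTL, capped at the policy maximum for the actor type."""
    if actor_type not in MAX_TTL:
        raise ValueError(f"unknown actor type: {actor_type!r}")
    if requested <= timedelta(0):
        raise ValueError("TTL must be positive")
    return min(requested, MAX_TTL[actor_type])


# A 12-hour request for an automation is capped to the 8-hour maximum.
print(clamp_ttl("atm", timedelta(hours=12)))
```

In a Vault-backed setup the same cap would normally live in the signing role itself; a check like this is only useful as a pre-flight guard in wrapper scripts.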
Certificate Naming Convention
- Identity string (`-I`): `adm-bernd`, `agt-incident-resolver-v2`, `atm-backup-daily`
- Principals (`-n`): comma-separated list of allowed roles (stored in `/etc/ssh/auth_principals/%u` on hosts)
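The naming convention can be checked mechanically before a signing request is accepted. This sketch splits an identity string into its actor prefix and name. The allowed character set for the name part (lowercase letters, digits, single hyphens) is an assumption inferred from the three example identities, and `parse_identity` is a hypothetical helper.

```python
import re

# Prefixes come from the actor table; the name pattern is an assumption
# based on the examples adm-bernd, agt-incident-resolver-v2, atm-backup-daily.
_IDENTITY_RE = re.compile(r"(adm|agt|atm)-([a-z0-9]+(?:-[a-z0-9]+)*)")


def parse_identity(identity: str) -> tuple[str, str]:
    """Split a certificate identity into (actor_type, name) or raise ValueError."""
    match = _IDENTITY_RE.fullmatch(identity)
    if match is None:
        raise ValueError(f"identity does not follow the naming convention: {identity!r}")
    return match.group(1), match.group(2)


for example in ("adm-bernd", "agt-incident-resolver-v2", "atm-backup-daily"):
    print(example, "->", parse_identity(example))
```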
LLM-Agent Risk Clarification
Agent signing policy MUST enforce least-privilege principals + force-command wrappers; never grant blanket shell access to autonomous agents.
3. Bootstrapping the System (One-Time Setup)
3.1. Create the CA (do this once, offline)
ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""
- Store the private key in an HSM-backed Vault (or air-gapped offline storage) with 4-eyes approval required for any signing operation.
- Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
- Public key: `ca_user.pub`
3.2. Deploy Trust on Every Host (Ansible playbook bootstrap-ssh-ca.yml)
- Copy `ca_user.pub` → `/etc/ssh/ca/ca_user.pub` (mode 644, root-owned).
- Update `/etc/ssh/sshd_config`:

      TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
      AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
      PubkeyAuthentication yes
      PasswordAuthentication no
      PermitRootLogin no

- Create principals directory and files from the central Git inventory.
- `systemctl restart sshd`
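Before restarting `sshd`, it can help to confirm that the rendered configuration actually carries the five directives listed in this step. The sketch below is a deliberately simple text check under stated assumptions: it ignores `Match` blocks and `Include` files, and the helper names are hypothetical. On a live host, `sshd -T` remains the authoritative view of the effective configuration.

```python
# Directives and values required by the bootstrap step (keys lowercased for comparison).
REQUIRED = {
    "trustedusercakeys": "/etc/ssh/ca/ca_user.pub",
    "authorizedprincipalsfile": "/etc/ssh/auth_principals/%u",
    "pubkeyauthentication": "yes",
    "passwordauthentication": "no",
    "permitrootlogin": "no",
}


def parse_sshd_config(text: str) -> dict[str, str]:
    """Parse 'Keyword value' lines; the first occurrence of a keyword wins."""
    settings: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        keyword, value = parts
        settings.setdefault(keyword.lower(), value.strip())
    return settings


def missing_directives(text: str) -> list[str]:
    """Return the required keywords that are absent or set to a different value."""
    settings = parse_sshd_config(text)
    return sorted(k for k, v in REQUIRED.items() if settings.get(k) != v)


SAMPLE = """\
# rendered by bootstrap-ssh-ca.yml
TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no
"""

print(missing_directives(SAMPLE))
```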
3.3. Initial Admin Access
The first admin generates a personal key pair and submits the `.pub` file. The CA then signs a bootstrap certificate valid for 48 hours with principal `adm-bootstrap`. This is the ONLY manual step.
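The manual bootstrap signing corresponds to a single `ssh-keygen -s` invocation. As a rough sketch, the helper below only assembles that command line so it can be reviewed before anyone runs it against the CA key. The function name and the public key file name `id_ed25519.pub` are assumptions; the CA key path is the one from section 3.1.

```python
import shlex


def bootstrap_sign_command(
    ca_key: str,
    identity: str,
    pubkey_path: str,
    principal: str = "adm-bootstrap",
    validity: str = "+48h",
) -> list[str]:
    """Build the ssh-keygen argv that signs a user key with the CA."""
    return [
        "ssh-keygen",
        "-s", ca_key,      # CA private key used for signing
        "-I", identity,    # certificate identity, logged by sshd on login
        "-n", principal,   # principal(s) embedded in the certificate
        "-V", validity,    # validity interval relative to now
        pubkey_path,       # the admin's submitted public key
    ]


cmd = bootstrap_sign_command("/secure/vault/ca_user", "adm-bernd", "id_ed25519.pub")
print(shlex.join(cmd))
```

Running the printed command writes the certificate next to the public key as `id_ed25519-cert.pub`, which is then handed back to the admin.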
4. Automatic Management of Access Rights
4.1. Daily / On-Demand Workflow
- Key/Certificate Issuance Pipeline (GitOps + Vault)
  - Humans (adm): Use the recommended CLI wrapper `ops-ssh-sign` (or Teleport `tsh` if adopted early) so signing feels invisible.
  - Agents (agt): At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
  - Automations (atm): Just-in-time cert request via Vault inside a thin wrapper script.
- Ansible-Driven Host Updates (run hourly via CI/CD)
  - `auth_principals/` files are rendered from a central inventory (JSON/YAML in Git).
  - Example inventory snippet:

        hosts:
          - name: prod-db-01
            allowed_principals:
              adm: [adm-full]
              agt: [agt-incident-resolver-v2]
              atm: [atm-backup-daily, atm-logrotate]

- Revocation & Rotation
  - Short expiry = automatic revocation.
  - For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (`RevokedKeys` directive in `sshd_config`).
  - Agents/automations never store long-lived private keys on disk.
- Concrete Agent & Automation Wrapper Example (Python snippet – place in `/usr/local/bin/ops-ssh-wrapper`)

      #!/usr/bin/env python3
      import os
      import subprocess
      import sys

      # SSH_KEY_PATH is an assumed variable: path to the ephemeral private key.
      # ssh-add picks up "<key>-cert.pub" automatically, so the cert is written there.
      key_path = os.environ["SSH_KEY_PATH"]

      # Request short-lived cert from Vault
      cert = subprocess.check_output(
          ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
           f"public_key={os.environ['SSH_PUBKEY']}"],
          text=True,
      ).strip()
      with open(f"{key_path}-cert.pub", "w") as f:
          f.write(cert + "\n")

      # Load key + cert into ssh-agent and exec the real command
      subprocess.run(["ssh-add", key_path], check=True)
      os.execvp(sys.argv[1], sys.argv[1:])

  Agents call this wrapper; it auto-refreshes the cert on every wake-up.
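The host-update step in the workflow above renders one principals file per host from the inventory. A minimal sketch of that rendering follows. It assumes the inventory keys `adm`, `agt`, and `atm` are also the local account names that `%u` expands to, which the directive does not state explicitly, and it uses an inline dict instead of parsing the YAML file. `render_principals` is a hypothetical helper, not the Ansible template itself.

```python
import tempfile
from pathlib import Path


def render_principals(host: dict, out_dir: Path) -> dict[str, str]:
    """Write one file per local account, one allowed principal per line."""
    out_dir.mkdir(parents=True, exist_ok=True)
    rendered: dict[str, str] = {}
    for user, principals in sorted(host["allowed_principals"].items()):
        content = "".join(f"{principal}\n" for principal in principals)
        (out_dir / user).write_text(content)
        rendered[user] = content
    return rendered


# Same structure as the example inventory snippet.
inventory = {
    "hosts": [
        {
            "name": "prod-db-01",
            "allowed_principals": {
                "adm": ["adm-full"],
                "agt": ["agt-incident-resolver-v2"],
                "atm": ["atm-backup-daily", "atm-logrotate"],
            },
        }
    ]
}

with tempfile.TemporaryDirectory() as tmp:
    for host in inventory["hosts"]:
        target = Path(tmp) / host["name"] / "auth_principals"
        files = render_principals(host, target)
        print(host["name"], sorted(files))
```

The one-principal-per-line layout matches what `AuthorizedPrincipalsFile` expects; in production the target directory would be `/etc/ssh/auth_principals/` on each host.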
4.2. Human UX Guidance
Admins are encouraged to use the `ops-ssh-sign` wrapper script (provided in the ops repo) or Teleport `tsh ssh` for a seamless experience. Manual `ssh-keygen -s` is only for edge cases.
4.3. Emergency Break-Glass Procedure
In case of total lockout (CA offline, misconfigured Ansible push, etc.):
- Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
- Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
- Document the exact recovery playbook in the same Git repo under `emergency/break-glass.md`.
- After recovery, immediately rotate the CA and run a full scorecard.
5. AccessManagement Scorecard (Checklist)
Run via Ansible ssh-access-audit.yml. Each item is pass/fail.
| Category | Check | Target | Tool |
|---|---|---|---|
| CA Trust | `TrustedUserCAKeys` points to correct file | All hosts | `sshd -T \| grep trustedusercakeys` |
| No Static Keys | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| Principals Config | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| Expiry Policy | All issued certs have a validity of ≤ 48 h (adm), ≤ 24 h (agt), or ≤ 8 h (atm) | Last 100 certs | `ssh-keygen -L -f <cert>` per certificate |
| Password Auth | Disabled globally | All hosts | `sshd -T \| grep password` |
| Root Login | Disabled | All hosts | `sshd -T \| grep permitroot` |
| Agent/Automation Wrapper | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| Audit Logging | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| CA Security | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| Bootstrap Complete | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| Score | 10/10 = Operational | - | - |
Scorecard Execution Command (run from ops laptop):
ansible all -m command -a "ssh-access-scorecard.sh" --become
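The Expiry Policy check can be automated by parsing the text that `ssh-keygen -L` prints for a certificate. The sketch below is one possible approach, not the contents of `ssh-access-scorecard.sh`. It takes the actor type from the Key ID prefix and uses the upper bound of each lifetime range as the limit. The one-minute `SLACK` is an assumption, there to tolerate issuers that back-date the start time slightly.

```python
import re
from datetime import datetime, timedelta

# Upper bounds of the lifetime ranges per actor type.
MAX_VALIDITY = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=8),
}
SLACK = timedelta(minutes=1)  # assumed tolerance for back-dated start times

KEY_ID_RE = re.compile(r'Key ID: "([^"]+)"')
VALID_RE = re.compile(r"Valid: from (\S+) to (\S+)")


def validity_ok(listing: str) -> bool:
    """Check one `ssh-keygen -L` listing against the lifetime policy."""
    key_id = KEY_ID_RE.search(listing)
    window = VALID_RE.search(listing)
    if key_id is None or window is None:
        return False  # also rejects "Valid: forever"
    limit = MAX_VALIDITY.get(key_id.group(1).split("-", 1)[0])
    if limit is None:
        return False  # identity without a known actor prefix
    start, end = (datetime.fromisoformat(ts) for ts in window.groups())
    return timedelta(0) < end - start <= limit + SLACK


SAMPLE = """\
atm-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Key ID: "atm-backup-daily"
        Serial: 0
        Valid: from 2026-03-28T10:00:00 to 2026-03-28T14:00:00
        Principals:
                atm-backup-daily
"""

print(validity_ok(SAMPLE))
```

A certificate without an expiry, or one whose identity lacks a known prefix, fails the check rather than being skipped, so gaps in the naming convention surface in the scorecard.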
6. Scope & Operational Boundaries
6.1. When Bootstrapping Is Officially Closed
The system is fully operational when ALL of the following are true:
- Scorecard passes 10/10 on every host.
- Central Git repo contains the authoritative principals inventory.
- First three admins have successfully used signed certificates for 7 consecutive days.
- At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
- CI/CD pipeline for host config updates is green and runs hourly.
- Emergency break-glass procedure has been tested once.
Declaration: Ops Lead signs off with date in the Git commit message.
6.2. Scope Boundary – When to Switch to Sophisticated Tooling
Stay with native OpenSSH CA + Ansible + Vault while:
- ≤ 200 hosts
- ≤ 50 distinct agent/automation identities
- No regulatory requirement for SSO or full session recording
Switch triggers (any one):
- More than 200 hosts OR rapid daily growth
- Need for human SSO (Okta/Google) integration
- Requirement for audited web-based SSH sessions or just-in-time access approval
- Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
- Audit/compliance demands central policy engine or session recording
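The switch triggers amount to a simple any-of rule, which a quarterly review could evaluate from a few facts about the fleet. This is an illustrative sketch only. `FleetState` and its field names are hypothetical, the 50-identity limit is carried over from the "stay" conditions, and "rapid daily growth" is left out because the directive gives no number for it.

```python
from dataclasses import dataclass


@dataclass
class FleetState:
    hosts: int
    machine_identities: int  # distinct agt + atm identities
    needs_sso: bool = False
    needs_session_recording: bool = False
    needs_jit_approval: bool = False
    needs_workload_identity: bool = False


def switch_reasons(state: FleetState) -> list[str]:
    """Return every trigger that currently applies; an empty list means stay."""
    checks = [
        (state.hosts > 200, "more than 200 hosts"),
        (state.machine_identities > 50, "more than 50 agent/automation identities"),
        (state.needs_sso, "human SSO integration required"),
        (state.needs_session_recording, "session recording required"),
        (state.needs_jit_approval, "just-in-time access approval required"),
        (state.needs_workload_identity, "built-in workload identity required"),
    ]
    return [reason for hit, reason in checks if hit]


print(switch_reasons(FleetState(hosts=120, machine_identities=30)))
print(switch_reasons(FleetState(hosts=250, machine_identities=30, needs_sso=True)))
```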
Recommended next-level tools (in order):
- Teleport – Best for mixed human + agent workloads (SSO + Machine ID).
- HashiCorp Vault SSH + Boundary – When you already use Vault heavily.
- step-ca + smallstep – If you prefer a pure open-source CA with OIDC.
Migration path: The CA public key and principals model are fully compatible; you can import the existing CA into Teleport/Vault without re-issuing keys to users.
7. Enforcement & Review
- Quarterly review of this directive and scorecard results.
- Violations (e.g., adding static keys) trigger immediate access revocation and incident ticket.
- Questions / improvements → create PR against this file in the ops repo.
End of Document
Approved for immediate use across all production and staging environments.