coulomb/ops-warden

generated from coulomb/repo-seed

Bernd Worsch 5ae6b988aa Initial Commit

2026-03-28 00:45:43 +00:00

AccessManagementDirective

*Practical host access control management*

Document Title: SSH Access Management Directive
Version: 1.1 (Production-Ready Revision – Post-SWOT Improvements)
Date: 28 March 2026
Audience: Operations Department
Purpose: Establish a simple, efficient, scalable, and secure standard for managing SSH access across all hosts for three actor types: Admins (adm), Agents (agt), and Automations (atm).
Author: Grok (on behalf of the team)
Status: Official Directive – All ops personnel, agents, and automation pipelines MUST follow this.
Changes in v1.1: Added prerequisites, emergency break-glass procedure, concrete issuance examples, strengthened CA security, enhanced scorecard, human UX guidance, agent risk clarification, KRL support, and tighter TTL recommendations.

0. Prerequisites

Before bootstrapping, the following must be in place:

Ansible (or equivalent config-management tool) with a central inventory.
HashiCorp Vault (or equivalent secrets manager) with the SSH secrets engine enabled.
GitOps repository containing the authoritative principals inventory.
Basic monitoring/alerting for Vault and SSH logs (e.g., Prometheus + Loki or equivalent).
At least two ops personnel trained on Vault SSH signing and Ansible playbooks.

If any of these are missing, complete them first or the “automatic” parts of this directive will not function reliably.

1. Concept Overview

This directive replaces the legacy practice of scattering static SSH public keys in ~/.ssh/authorized_keys files. Instead, we adopt SSH Certificate Authority (CA) based authentication as the single source of truth.

Why this model?

A central CA signs short-lived certificates for every login.
No more manual key copying, key sprawl, or painful revocation.
Built-in expiration, role-based principals, and auditability.
Works identically for humans, LLM-powered autonomous agents, and deterministic scripts.
Scales from 5 hosts to 500+ with almost zero per-host maintenance.

Core Principles

Least privilege – Every certificate carries explicit principals (roles) and optional force-command / source-address restrictions.
Short-lived credentials – Certificates expire automatically (24–48 h for admins, 4–24 h for agents, 1–8 h for automations).
One CA, many issuers – A single offline User CA whose public key is trusted by every host.
Automation-first – All key issuance, rotation, and host configuration is driven by code (Ansible + Vault).
Separation of concerns –
- Admins (adm): Human operators (full interactive shell when needed).
- Agents (agt): LLM-powered autonomous entities that can self-register wake-up triggers and execute tasks.
- Automations (atm): Deterministic scripts / cron jobs / pipelines with narrow, purpose-specific rights.

2. Actor Definitions & Access Model

| Actor Type | Identifier Prefix | Description | Typical Certificate Lifetime | Principals / Restrictions |
|---|---|---|---|---|
| Admin (adm) | `adm-` | Human operator (on-call engineers) | 24–48 hours (renewable) | `adm-full`, `adm-readonly` + optional `force-command` |
| Agent (agt) | `agt-` | LLM-powered autonomous agent (can schedule own wake-ups) | 4–24 hours (auto-refresh) | `agt-task-<name>`, limited to specific scripts/directories |
| Automation (atm) | `atm-` | Deterministic script / pipeline | 1–8 hours (per invocation) | `atm-<jobname>`, `force-command=/usr/local/bin/atm-wrapper.sh` |

Certificate Naming Convention

Identity string (-I): adm-bernd, agt-incident-resolver-v2, atm-backup-daily
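Principals (-n): comma-separated list of roles embedded in the certificate. Each host lists the principals it accepts for a local account, one per line, in /etc/ssh/auth_principals/%u (%u expands to the local username).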
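
The naming convention above can be enforced before anything is signed. The following is a minimal sketch, assuming identities consist of one of the three prefixes followed by lowercase letters, digits, and hyphens; the exact character set and the helper name `parse_identity` are illustrative assumptions, not part of the directive.

```python
import re

# Prefix -> actor type, taken from the actor table in section 2.
ACTOR_TYPES = {"adm": "Admin", "agt": "Agent", "atm": "Automation"}

# Assumed format: prefix, hyphen, then hyphen-separated lowercase alphanumerics.
IDENTITY_RE = re.compile(r"(adm|agt|atm)-([a-z0-9]+(?:-[a-z0-9]+)*)")


def parse_identity(identity: str) -> tuple[str, str]:
    """Return (actor_type, name) or raise ValueError for a malformed identity."""
    match = IDENTITY_RE.fullmatch(identity)
    if match is None:
        raise ValueError(f"not a valid certificate identity: {identity!r}")
    prefix, name = match.groups()
    return ACTOR_TYPES[prefix], name


print(parse_identity("agt-incident-resolver-v2"))
```

A signing wrapper could call this first and refuse any request whose identity does not parse.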

LLM-Agent Risk Clarification
Agent signing policy MUST enforce least-privilege principals + force-command wrappers; never grant blanket shell access to autonomous agents.
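
One way to make this rule checkable is a pre-signing gate. The sketch below assumes the signing request is available as a list of principals plus a dictionary of critical options; `force-command` is the OpenSSH critical option name, while the function name and the request shape are hypothetical.

```python
def check_agent_request(principals: list[str], critical_options: dict[str, str]) -> list[str]:
    """Return policy violations; an empty list means the request may be signed."""
    problems = []
    if not principals:
        problems.append("no principals requested")
    for principal in principals:
        # Agents may only ask for principals in their own namespace.
        if not principal.startswith("agt-"):
            problems.append(f"principal {principal!r} is outside the agt- namespace")
    # Without a forced command the certificate would allow an interactive shell.
    if not critical_options.get("force-command"):
        problems.append("missing force-command: agents must not receive blanket shell access")
    return problems


print(check_agent_request(["adm-full"], {}))
```

An empty result is the only case in which the request would be passed on for signing.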

3. Bootstrapping the System (One-Time Setup)

3.1. Create the CA (do this once, offline)

ssh-keygen -t ed25519 -f /secure/vault/ca_user -C "Ops SSH User CA (2026)" -N ""

Store the private key in an HSM-backed Vault (or air-gapped offline storage) with 4-eyes approval required for any signing operation.
Rotate the CA key itself every 2–3 years using the same bootstrap playbook.
Public key: ca_user.pub

3.2. Deploy Trust on Every Host (Ansible playbook `bootstrap-ssh-ca.yml`)

Copy ca_user.pub → /etc/ssh/ca/ca_user.pub (mode 644, root-owned).

Update /etc/ssh/sshd_config:

TrustedUserCAKeys /etc/ssh/ca/ca_user.pub
AuthorizedPrincipalsFile /etc/ssh/auth_principals/%u
PubkeyAuthentication yes
PasswordAuthentication no
PermitRootLogin no

Create principals directory and files from the central Git inventory.
Validate the new configuration with sshd -t, then apply it with systemctl restart sshd.

3.3. Initial Admin Access

First admin generates personal keypair → submits .pub → CA signs a bootstrap certificate valid for 48 hours with principal adm-bootstrap. This is the ONLY manual step.
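
For illustration, the bootstrap signing step could look like the sketch below. It only builds the `ssh-keygen` argument list and does not run it. The CA key path is the one from section 3.1; the public key file name `id_ed25519.pub` and the helper name are assumptions.

```python
import shlex


def bootstrap_sign_command(ca_key: str, identity: str, pubkey: str) -> list[str]:
    """Build the ssh-keygen call that signs `pubkey` as a 48-hour bootstrap certificate."""
    return [
        "ssh-keygen",
        "-s", ca_key,            # CA private key used for signing
        "-I", identity,          # certificate identity, e.g. adm-bernd
        "-n", "adm-bootstrap",   # the only principal granted at this stage
        "-V", "+48h",            # valid from now for 48 hours
        pubkey,                  # ssh-keygen writes <name>-cert.pub next to this file
    ]


cmd = bootstrap_sign_command("/secure/vault/ca_user", "adm-bernd", "id_ed25519.pub")
print(shlex.join(cmd))
```

Keeping the command as a list avoids shell quoting problems if it is later passed to `subprocess.run`.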

4. Automatic Management of Access Rights

4.1. Daily / On-Demand Workflow

Key/Certificate Issuance Pipeline (GitOps + Vault)
- Humans (adm): Use the recommended CLI wrapper ops-ssh-sign (or Teleport tsh if adopted early) so signing feels invisible.
- Agents (agt): At startup, call Vault SSH engine API (auto-refreshed by a wrapper daemon).
- Automations (atm): Just-in-time cert request via Vault inside a thin wrapper script.

Ansible-Driven Host Updates (run hourly via CI/CD)

auth_principals/ files are rendered from a central inventory (JSON/YAML in Git).

Example inventory snippet:

hosts:
  - name: prod-db-01
    allowed_principals:
      adm: [adm-full]
      agt: [agt-incident-resolver-v2]
      atm: [atm-backup-daily, atm-logrotate]
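
As a sketch of how the rendering step might work, the function below turns one host entry into the contents of its auth_principals files. It assumes that the keys `adm`, `agt`, and `atm` are also the local account names that `%u` expands to; the directive does not state this, so treat that mapping as hypothetical.

```python
def render_principal_files(host: dict) -> dict[str, str]:
    """Map each auth_principals file path to the text it should contain."""
    files = {}
    for account, principals in host["allowed_principals"].items():
        # sshd reads one principal per line from an AuthorizedPrincipalsFile.
        path = f"/etc/ssh/auth_principals/{account}"
        files[path] = "\n".join(sorted(principals)) + "\n"
    return files


# Same data as the inventory snippet above, already parsed into Python objects.
host = {
    "name": "prod-db-01",
    "allowed_principals": {
        "adm": ["adm-full"],
        "agt": ["agt-incident-resolver-v2"],
        "atm": ["atm-backup-daily", "atm-logrotate"],
    },
}
for path, body in render_principal_files(host).items():
    print(path, body.split())
```

Sorting the principals keeps the rendered files stable between runs, so the hourly job only reports a change when the inventory actually changed.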

Revocation & Rotation
- Short expiry = automatic revocation.
- For emergency revocation of a still-valid cert, maintain a Key Revocation List (KRL) and push it via Ansible (RevokedKeys directive in sshd_config).
- Agents/automations never store long-lived private keys on disk.
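
For emergency revocation, a KRL can be generated from a plain-text specification. The sketch below renders such a specification using `id:` lines, which revoke certificates by their identity string. The function name is an assumption. Turning the specification into a binary KRL is done with `ssh-keygen -k`, and revoking by identity requires passing the CA public key with `-s`.

```python
def krl_spec(revoked_identities: list[str]) -> str:
    """Render a KRL specification that revokes certificates by identity (key ID)."""
    # De-duplicate and sort so repeated runs produce identical output.
    lines = [f"id: {identity}" for identity in sorted(set(revoked_identities))]
    return "\n".join(lines) + "\n"


print(krl_spec(["agt-incident-resolver-v2", "adm-bernd", "adm-bernd"]), end="")
```

The resulting KRL file is what the RevokedKeys directive would point to on each host.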

Concrete Agent & Automation Wrapper Example (Python snippet – place in /usr/local/bin/ops-ssh-wrapper)

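#!/usr/bin/env python3
"""Request a short-lived certificate from Vault, load it into ssh-agent, then exec the real command."""
import os
import subprocess
import sys

# SSH_PUBKEY holds the public key text. SSH_KEY is an assumed variable: the path to the matching private key.
key_path = os.environ["SSH_KEY"]
# Request a short-lived cert from Vault
cert = subprocess.check_output(
    ["vault", "write", "-field=signed_key", "ssh/sign/agt-role",
     f"public_key={os.environ['SSH_PUBKEY']}"]
).decode().strip()
# ssh-add loads <key>-cert.pub automatically when it is asked to add <key>
with open(f"{key_path}-cert.pub", "w") as f:
    f.write(cert + "\n")
# Load key and cert into ssh-agent (abort on failure), then exec the real command
subprocess.run(["ssh-add", key_path], check=True)
os.execvp(sys.argv[1], sys.argv[1:])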

Agents call this wrapper; it auto-refreshes the cert on every wake-up.
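
A long-running agent may also want to renew before the certificate lapses rather than only at wake-up. The following is a minimal sketch of that decision, assuming the caller already knows the certificate's expiry time; the ten-minute safety margin and the function name are illustrative choices, not values from this directive.

```python
from datetime import datetime, timedelta, timezone


def needs_refresh(valid_until: datetime, now: datetime,
                  margin: timedelta = timedelta(minutes=10)) -> bool:
    """True once `now` is within `margin` of the certificate's expiry (or past it)."""
    return now >= valid_until - margin


expiry = datetime(2026, 3, 28, 12, 0, tzinfo=timezone.utc)
print(needs_refresh(expiry, datetime(2026, 3, 28, 11, 55, tzinfo=timezone.utc)))
```

Refreshing slightly early avoids a task failing halfway through because its certificate expired mid-run.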

4.2. Human UX Guidance

Admins are encouraged to use the ops-ssh-sign wrapper script (provided in the ops repo) or Teleport tsh ssh for seamless experience. Manual ssh-keygen -s is only for edge cases.

4.3. Emergency Break-Glass Procedure

In case of total lockout (CA offline, misconfigured Ansible push, etc.):

Use the pre-documented static emergency key pair on a separate bastion host (rotated quarterly, stored in Vault with 4-eyes access).
Or fall back to cloud-provider console access (AWS SSM Session Manager, GCP IAP, Azure Bastion).
Document the exact recovery playbook in the same Git repo under emergency/break-glass.md.
After recovery, immediately rotate the CA and run a full scorecard.

5. AccessManagement Scorecard (Checklist)

Run via Ansible ssh-access-audit.yml. Each item is pass/fail.

| Category | Check | Target | Tool |
|---|---|---|---|
| CA Trust | `TrustedUserCAKeys` points to correct file | All hosts | `sshd -T \| grep trustedusercakeys` |
| No Static Keys | `authorized_keys` files are empty or contain only emergency bootstrap keys | All hosts | `find /home -name authorized_keys -size +0` |
| Principals Config | `/etc/ssh/auth_principals/%u` exists and is up-to-date | All hosts | Ansible inventory diff |
| Expiry Policy | All issued certs are valid for ≤ 48 h (adm), ≤ 24 h (agt), or ≤ 8 h (atm) | Last 100 certs | `ssh-keygen -L -f <cert>` for each certificate |
| Password Auth | Disabled globally | All hosts | `sshd -T \| grep password` |
| Root Login | Disabled | All hosts | `sshd -T \| grep permitroot` |
| Agent/Automation Wrapper | Every agt/atm binary calls Vault for cert | All pipelines | Code review + runtime trace |
| Audit Logging | Every SSH connection logs certificate identity (`-I`) to central SIEM | All hosts | `journalctl -u sshd` + SIEM query |
| CA Security | CA key access is 4-eyes / HSM-backed | Vault policy | Vault audit log |
| Bootstrap Complete | No `adm-bootstrap` principal in use | All hosts | Scorecard run |
| Score | 10/10 = Operational | - | - |
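
The Expiry Policy check can be automated by parsing the text that `ssh-keygen -L` prints for a certificate. The sketch below assumes the limits 48 h (adm), 24 h (agt), and 8 h (atm), taken from the upper bounds in sections 1 and 2. The sample listing is shortened and hand-written for illustration, and the function name is hypothetical.

```python
import re
from datetime import datetime, timedelta

# Upper lifetime bounds per identity prefix (assumed from sections 1 and 2).
MAX_LIFETIME = {
    "adm": timedelta(hours=48),
    "agt": timedelta(hours=24),
    "atm": timedelta(hours=8),
}

KEY_ID_RE = re.compile(r'Key ID: "([^"]+)"')
VALID_RE = re.compile(r"Valid: from (\S+) to (\S+)")


def lifetime_ok(listing: str) -> bool:
    """Check one `ssh-keygen -L` listing against the lifetime limit for its actor type."""
    key_id = KEY_ID_RE.search(listing)
    valid = VALID_RE.search(listing)
    if key_id is None or valid is None:
        return False  # also rejects certificates listed as "Valid: forever"
    limit = MAX_LIFETIME.get(key_id.group(1).split("-", 1)[0])
    if limit is None:
        return False  # unknown identity prefix
    start, end = (datetime.fromisoformat(stamp) for stamp in valid.groups())
    return end - start <= limit


SAMPLE = """id_ed25519-cert.pub:
        Type: ssh-ed25519-cert-v01@openssh.com user certificate
        Key ID: "adm-bernd"
        Valid: from 2026-03-28T00:45:00 to 2026-03-30T00:45:00
"""
print(lifetime_ok(SAMPLE))
```

Treating unparseable listings as failures keeps the check conservative: a certificate it cannot understand does not count as compliant.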

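Scorecard Execution Command (run from ops laptop; the command module executes ssh-access-scorecard.sh on each target, so the script must already be deployed there):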

ansible all -m command -a "ssh-access-scorecard.sh" --become
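
The contents of ssh-access-scorecard.sh are not shown in this directive. As a hedged sketch of what its sshd checks might do, the function below compares the effective configuration printed by `sshd -T` (lowercase keyword, space, value) against the settings from section 3.2. The names `REQUIRED` and `failed_checks` are assumptions.

```python
# Effective settings every host should report, taken from section 3.2.
REQUIRED = {
    "trustedusercakeys": "/etc/ssh/ca/ca_user.pub",
    "authorizedprincipalsfile": "/etc/ssh/auth_principals/%u",
    "passwordauthentication": "no",
    "permitrootlogin": "no",
}


def failed_checks(sshd_t_output: str) -> list[str]:
    """Return the required keywords whose effective value differs from the target."""
    effective = {}
    for line in sshd_t_output.splitlines():
        keyword, _, value = line.strip().partition(" ")
        effective[keyword.lower()] = value.strip()
    return [key for key, want in REQUIRED.items() if effective.get(key) != want]


SAMPLE = """\
permitrootlogin no
passwordauthentication yes
trustedusercakeys /etc/ssh/ca/ca_user.pub
authorizedprincipalsfile /etc/ssh/auth_principals/%u
"""
print(failed_checks(SAMPLE))  # → ['passwordauthentication']
```

A missing keyword counts as a failure, so a host where the CA trust line was never deployed is reported rather than silently passed.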

6. Scope & Operational Boundaries

6.1. When Bootstrapping Is Officially Closed

The system is fully operational when ALL of the following are true:

Scorecard passes 10/10 on every host.
Central Git repo contains the authoritative principals inventory.
First three admins have successfully used signed certificates for 7 consecutive days.
At least one agent (agt) and one automation (atm) have executed a task using a CA-signed certificate.
CI/CD pipeline for host config updates is green and runs hourly.
Emergency break-glass procedure has been tested once.

Declaration: Ops Lead signs off with date in the Git commit message.

6.2. Scope Boundary – When to Switch to Sophisticated Tooling

Stay with native OpenSSH CA + Ansible + Vault while:

≤ 200 hosts
≤ 50 distinct agent/automation identities
No regulatory requirement for SSO or full session recording

Switch triggers (any one):

> 200 hosts OR rapid daily growth
Need for human SSO (Okta/Google) integration
Requirement for audited web-based SSH sessions or just-in-time access approval
Agents need built-in Machine-ID / workload identity (e.g., Teleport tbot)
Audit/compliance demands central policy engine or session recording

Recommended next-level tools (in order):

Teleport – Best for mixed human + agent workloads (SSO + Machine ID).
HashiCorp Vault SSH + Boundary – When you already use Vault heavily.
step-ca (Smallstep) – If you prefer a pure open-source CA with OIDC.

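Migration path: The principals model carries over unchanged. Vault's SSH engine can be configured with the existing CA key pair, so users keep their keys. For tools that generate their own CA, add the new CA public key to the TrustedUserCAKeys file (one key per line) so that both CAs are trusted during the transition.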

7. Enforcement & Review

Quarterly review of this directive and scorecard results.
Violations (e.g., adding static keys) trigger immediate access revocation and incident ticket.
Questions / improvements → create PR against this file in the ops repo.

End of Document
Approved for immediate use across all production and staging environments.
