net-kingdom/sso-mfa/k8s/backup/DR-RUNBOOK.md
Bernd Worsch 6c062e1295 feat(sso-mfa): T07/T08 user mgmt, backups, DR & break-glass (NK-WP-0001-T07/T08)
T07 — User management & self-service:
- k8s/lldap/bootstrap-users.sh: creates net-kingdom-users and net-kingdom-admins
  groups in LLDAP via GraphQL API; idempotent.
- k8s/lldap/break-glass.sh: creates break-glass bypass account in LLDAP,
  sets BREAKGLASS_PASSWORD, assigns to net-kingdom-admins.
- k8s/verify-t07.sh: 6 checks — groups, break-glass, self-service portal,
  KeyCape OIDC client registrations.

T08 — Backups, DR, break-glass:
- k8s/backup/cronjob-sqlite-backups.yaml: daily CronJobs for LLDAP SQLite,
  Authelia SQLite (with scale-down/up RBAC), and privacyIDEA enckey backup.
  7-day retention, 03:00/03:15/03:30 UTC staggered schedule.
- k8s/backup/DR-RUNBOOK.md: full restore runbook — scenarios, restore order,
  LLDAP/Authelia/PI SQLite restore procedure, full node rebuild sequence,
  offsite age-encrypted export.
- k8s/verify-t08.sh: 9 checks — CronJobs, RBAC, run history, backup files
  on PVCs, DR runbook presence, offsite backup (manual confirmation).
- WORKPLAN.md: T07/T08 sections with done-criteria added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-19 09:17:03 +00:00

Disaster Recovery Runbook — net-kingdom SSO/MFA Platform

Stack: LLDAP + Authelia + KeyCape (sso namespace) + privacyIDEA (mfa namespace)
PostgreSQL: Managed separately by CNPG (postgresql/scheduled-backup.yaml)


Recovery scenarios

| Scenario | Impact | Recovery |
|----------|--------|----------|
| Pod crash / OOM | Stateless pods (KeyCape) recover automatically. Stateful pods (LLDAP, Authelia, PI) restart and reload from PVC. | K8s self-heals. Verify with verify-t05.sh. |
| PVC data corruption | Users/sessions/tokens lost. | Restore from SQLite backup (see below). |
| Node failure (single-node K3s) | All pods lost. PVCs intact on host. | Re-apply all manifests (idempotent). Pods re-attach to PVCs. |
| Node total loss (disk gone) | Everything lost. | Full restore from backup + KeePassXC. |
| Stack locked out (SSO broken, can't log in) | No user access to OIDC-protected apps. | Use break-glass account. |
| enckey lost (privacyIDEA) | All enrolled MFA tokens invalid. Users must re-enroll. | Restore from enckey backup or re-enroll all tokens. |

Break-glass access

When the SSO stack is broken and no user can authenticate:

# 1. Access LLDAP admin UI directly (requires VPN / IP-allowlisted access)
#    URL: https://lldap.coulomb.social
#    Username: break-glass
#    Password: from KeePassXC → net-kingdom/Break-glass/break-glass
#
# 2. Or access LLDAP via kubectl exec (no network required)
kubectl exec -it -n sso deployment/lldap -- /bin/sh
# Inside container: use ldapwhoami / ldapsearch to verify directory state

# 3. Access privacyIDEA admin UI
#    URL: https://pink.coulomb.social
#    Username: pi-admin
#    Password: from KeePassXC → net-kingdom/privacyIDEA/pi-admin
#    NOTE: pi-admin has MFA enrolled — if privacyIDEA MFA is down, use:
kubectl exec -n mfa deployment/privacyidea -- pi-manage admin list

Restore order

CRITICAL: Always restore in this order. Components depend on each other at startup: privacyIDEA needs PostgreSQL, Authelia needs LLDAP, and KeyCape needs Authelia, LLDAP, and privacyIDEA.

1. PostgreSQL (databases ns)    — CNPG operator handles restore
2. privacyIDEA (mfa ns)        — needs PG + enckey PVC
3. LLDAP (sso ns)              — standalone
4. Authelia (sso ns)           — needs LLDAP (LDAP bind at startup check)
5. KeyCape (sso ns)            — needs Authelia + LLDAP + privacyIDEA

Restore from SQLite backup (PVC data corruption)

LLDAP

# 1. Scale down LLDAP
kubectl scale deployment/lldap -n sso --replicas=0

# 2. Start a restore pod on the lldap-data PVC
kubectl run -n sso lldap-restore --image=nouchka/sqlite3:latest \
  --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"lldap-data"}}],"containers":[{"name":"lldap-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'

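# 3. Copy the backup file into the pod if needed (normally it is already on the PVC under /data/backups/)
kubectl exec -n sso lldap-restore -- ls /data/backups/

# 4. Restore from the chosen backup. The whole pipeline runs inside the pod
#    (sh -c) so that users.db is written on the PVC, not on the local machine.
#    The damaged database is moved aside first so the dump loads into an empty file.
kubectl exec -n sso lldap-restore -- sh -c '
  if [ -f /data/users.db ]; then mv /data/users.db /data/users.db.corrupt; fi
  sqlite3 /data/backups/users.backup.YYYY-MM-DD ".dump" | sqlite3 /data/users.db'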

# 5. Clean up and restart
kubectl delete pod -n sso lldap-restore
kubectl scale deployment/lldap -n sso --replicas=1
kubectl rollout status deployment/lldap -n sso --timeout=120s
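
A restore is only as good as the backup file it starts from, so it can be worth checking the file before step 4, while the `lldap-restore` pod still exists. The sketch below uses SQLite's `PRAGMA integrity_check`, which prints `ok` for a healthy database. The helper name `backup_is_healthy` is hypothetical; the pod name and backup path follow the steps above.

```shell
#!/bin/sh
# Sketch: check a backup inside the lldap-restore pod before restoring from it.
# backup_is_healthy is a hypothetical helper, not a script shipped in this repo.
backup_is_healthy() {
  backup="$1"   # e.g. /data/backups/users.backup.YYYY-MM-DD
  result="$(kubectl exec -n sso lldap-restore -- \
    sqlite3 "$backup" "PRAGMA integrity_check;")" || return 1
  # SQLite prints the single word "ok" when it finds no problems.
  [ "$result" = "ok" ]
}

# Usage:
#   backup_is_healthy /data/backups/users.backup.YYYY-MM-DD || echo "pick another backup"
```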

Authelia

# Same pattern as LLDAP, using authelia-data PVC and authelia.backup.YYYY-MM-DD
kubectl scale deployment/authelia -n sso --replicas=0
# ... (run restore pod, restore db.sqlite3, scale back up)
kubectl scale deployment/authelia -n sso --replicas=1
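
For reference, the Authelia steps could be spelled out along the lines of the LLDAP procedure. This is only a sketch under stated assumptions: the restore pod mounts the `authelia-data` PVC at `/data`, and `db.sqlite3` sits at the root of that PVC. Neither is confirmed by this runbook, so check the Authelia deployment manifest first. The helper name `restore_authelia` and the pod name `authelia-restore` are hypothetical.

```shell
#!/bin/sh
# Sketch of the Authelia restore, mirroring the LLDAP steps.
# ASSUMPTIONS: PVC mounted at /data in the restore pod, database file at
# /data/db.sqlite3. restore_authelia is a hypothetical helper.
restore_authelia() {
  day="$1"   # backup date, YYYY-MM-DD
  overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"authelia-data"}}],"containers":[{"name":"authelia-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'

  kubectl scale deployment/authelia -n sso --replicas=0 || return 1
  kubectl run -n sso authelia-restore --image=nouchka/sqlite3:latest \
    --restart=Never --overrides="$overrides" || return 1
  kubectl wait -n sso --for=condition=Ready pod/authelia-restore --timeout=120s || return 1

  # Dump and reload inside the pod so db.sqlite3 is written on the PVC.
  kubectl exec -n sso authelia-restore -- sh -c "
    if [ -f /data/db.sqlite3 ]; then mv /data/db.sqlite3 /data/db.sqlite3.corrupt; fi
    sqlite3 /data/backups/authelia.backup.$day .dump | sqlite3 /data/db.sqlite3" || return 1

  kubectl delete pod -n sso authelia-restore
  kubectl scale deployment/authelia -n sso --replicas=1
  kubectl rollout status deployment/authelia -n sso --timeout=120s
}

# Usage:
#   restore_authelia YYYY-MM-DD
```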

privacyIDEA enckey

# If the enckey is lost, restore it from KeePassXC binary attachment PI_ENCFILE.
# Extract it to a local file first, then:
kubectl create secret generic privacyidea-enckey \
  --from-file=PI_ENCFILE=./pi.enc \
  --namespace mfa \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart privacyIDEA to pick up the restored key
kubectl rollout restart deployment/privacyidea -n mfa

# If the enckey is truly lost and unrecoverable:
#   All enrolled MFA tokens are invalid.
#   Generate a new enckey with: kubectl exec -n mfa ... -- pi-manage create_enckey
#   All users must re-enroll their TOTP/hardware tokens.
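
After re-creating the Secret, it can help to confirm that the cluster holds exactly the bytes extracted from KeePassXC before trusting any token validation result. A minimal sketch, assuming GNU coreutils (`sha256sum`, `base64 -d`) on the operator machine. The helper name `enckey_matches` is hypothetical; the Secret name, key and namespace are the ones used in the command above.

```shell
#!/bin/sh
# Sketch: compare the checksum of the local enckey file with the checksum of
# the PI_ENCFILE value stored in the privacyidea-enckey Secret.
# enckey_matches is a hypothetical helper, not a script shipped in this repo.
enckey_matches() {
  local_file="$1"   # e.g. ./pi.enc
  local_sum="$(sha256sum < "$local_file" | awk '{print $1}')"
  cluster_sum="$(kubectl get secret privacyidea-enckey -n mfa \
      -o jsonpath='{.data.PI_ENCFILE}' | base64 -d | sha256sum | awk '{print $1}')"
  # An empty local_sum means the local file could not be read.
  [ -n "$local_sum" ] && [ "$local_sum" = "$cluster_sum" ]
}

# Usage:
#   enckey_matches ./pi.enc && echo "Secret matches the KeePassXC copy"
```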

Full node restore (new host)

# Prerequisites on new host:
#   - K3s installed
#   - Traefik ingress (bundled with K3s)
#   - cert-manager installed (helm install cert-manager ...)
#   - DNS records pointing to new node IP
#   - KeePassXC vault accessible (offline copy or age-encrypted bundle)

# 1. Restore PostgreSQL from CNPG backup
#    (See CNPG documentation for cluster restore from barmanObjectStore)

# 2. Re-apply all manifests in order
cd sso-mfa/k8s
kubectl apply -f namespaces/namespaces.yaml
kubectl apply -f network-policies/
kubectl apply -f cert-manager/issuers.yaml

# 3. Restore secrets from KeePassXC
#    Run each create-secrets.sh in order:
cd postgresql   && ./create-secrets.sh  && cd ..
cd privacyidea  && ./create-secrets.sh  && cd ..
cd lldap        && ./create-secrets.sh  && cd ..
cd authelia     && ./create-secrets.sh  && cd ..
cd keycape      && ./create-secrets.sh  && cd ..

# 4. Apply workloads in restore order
#    (kubectl apply takes one file per -f flag, so each manifest is applied in a loop)
kubectl apply -f postgresql/cluster.yaml
for f in pvc configmap deployment middleware ingress; do kubectl apply -f "privacyidea/$f.yaml"; done
for f in pvc deployment middleware ingress;           do kubectl apply -f "lldap/$f.yaml"; done
for f in pvc configmap deployment ingress;            do kubectl apply -f "authelia/$f.yaml"; done
for f in deployment middleware ingress;               do kubectl apply -f "keycape/$f.yaml"; done

# 5. Wait for everything to be Ready
kubectl rollout status deployment/privacyidea -n mfa --timeout=300s
kubectl rollout status deployment/lldap       -n sso --timeout=120s
kubectl rollout status deployment/authelia    -n sso --timeout=120s
kubectl rollout status deployment/keycape     -n sso --timeout=60s

# 6. Re-run bootstrap scripts if PVC data was lost
cd privacyidea && ./enckey-bootstrap.sh && ./bootstrap-admin.sh && ./bootstrap-realm.sh
cd ../lldap    && ./bootstrap-users.sh  && ./break-glass.sh
cd ../keycape  && ./create-pi-token.sh  && ./create-secrets.sh
kubectl rollout restart deployment/keycape -n sso

# 7. Verify (the verify scripts live in sso-mfa/k8s, so leave the keycape directory first)
cd ..
./verify-t04.sh && ./verify-t05.sh && ./verify-t06.sh && ./verify-t07.sh && ./verify-t08.sh
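
A chain joined with `&&` stops at the first failing script, which hides the state of the later checks. If a full picture is preferred after a rebuild, something like the following sketch could be used. It assumes it is run from `sso-mfa/k8s`, where the verify scripts live; the helper name `run_all_verifies` is hypothetical.

```shell
#!/bin/sh
# Sketch: run every verify script, keep going on failure, and report which
# ones failed. run_all_verifies is a hypothetical helper; run it from the
# directory that contains verify-t04.sh through verify-t08.sh.
run_all_verifies() {
  failed=""
  for t in t04 t05 t06 t07 t08; do
    if ! "./verify-$t.sh"; then
      failed="$failed verify-$t.sh"
    fi
  done
  if [ -n "$failed" ]; then
    echo "failed:$failed" >&2
    return 1
  fi
}

# Usage:
#   run_all_verifies || echo "see the failed checks listed above"
```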

Backup offsite export

The SQLite backup files land on the PVCs but are not offsite until exported. Run this on the node host to pull them out and encrypt for offsite storage:

# Pull backup files from pods
kubectl exec -n sso deployment/lldap -- \
  cat /data/backups/users.backup.$(date +%Y-%m-%d) > /tmp/lldap-backup.db
kubectl exec -n sso deployment/authelia -- \
  cat /data/backups/authelia.backup.$(date +%Y-%m-%d) > /tmp/authelia-backup.db

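# Encrypt both files with age and send offsite (same key as the ops bundle)
RECIPIENT="$(grep 'public key' ~/net-kingdom-ops-bundle.key | awk '{print $NF}')"
age -r "$RECIPIENT" -o /tmp/lldap-backup.db.age    /tmp/lldap-backup.db
age -r "$RECIPIENT" -o /tmp/authelia-backup.db.age /tmp/authelia-backup.db

# Shred plaintext copies only after both .age files have been written
shred -u /tmp/lldap-backup.db /tmp/authelia-backup.db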
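
The export commands assume that a backup dated today exists, which is not the case before the 03:00 UTC run or after a failed CronJob. A minimal sketch for picking the newest file instead, relying on the fact that the YYYY-MM-DD suffix sorts chronologically as plain text. The helper name `latest_backup` is hypothetical, and the sketch assumes `ls` is available in the container, just as the commands above assume `cat`.

```shell
#!/bin/sh
# Sketch: find the newest backup file for a deployment in the sso namespace
# instead of assuming one exists for today's date.
# latest_backup is a hypothetical helper, not a script shipped in this repo.
latest_backup() {
  deployment="$1"   # lldap or authelia
  prefix="$2"       # users.backup or authelia.backup
  kubectl exec -n sso "deployment/$deployment" -- ls -1 /data/backups/ \
    | grep "^$prefix\.[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]\$" \
    | sort | tail -n 1
}

# Usage:
#   f="$(latest_backup lldap users.backup)"
#   [ -n "$f" ] || echo "no LLDAP backup found" >&2
#   kubectl exec -n sso deployment/lldap -- cat "/data/backups/$f" > /tmp/lldap-backup.db
```

An empty result means no dated backup was found, which is worth treating as an alert in its own right.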