Disaster Recovery Runbook — net-kingdom SSO/MFA Platform
Stack: LLDAP + Authelia + KeyCape (sso namespace) + privacyIDEA (mfa namespace)
PostgreSQL: Managed separately by CNPG (postgresql/scheduled-backup.yaml)
Recovery scenarios
| Scenario | Impact | Recovery |
|---|---|---|
| Pod crash / OOM | Stateless pods (KeyCape) recover automatically. Stateful pods (LLDAP, Authelia, PI) restart and reload from PVC. | K8s self-heals. Verify with verify-t05.sh. |
| PVC data corruption | Users/sessions/tokens lost. | Restore from SQLite backup (see below). |
| Node failure (single-node K3s) | All pods lost. PVCs intact on host. | Re-apply all manifests (idempotent). Pods re-attach to PVCs. |
| Node total loss (disk gone) | Everything lost. | Full restore from backup + KeePassXC. |
| Stack locked out (SSO broken, can't log in) | No user access to OIDC-protected apps. | Use break-glass account. |
| enckey lost (privacyIDEA) | All enrolled MFA tokens invalid. Users must re-enroll. | Restore from enckey backup or re-enroll all tokens. |
Break-glass access
When the SSO stack is broken and no user can authenticate:
# 1. Access LLDAP admin UI directly (requires VPN / IP-allowlisted access)
# URL: https://lldap.coulomb.social
# Username: break-glass
# Password: from KeePassXC → net-kingdom/Break-glass/break-glass
#
# 2. Or access LLDAP via kubectl exec (no network required)
kubectl exec -it -n sso deployment/lldap -- /bin/sh
# Inside container: use ldapwhoami / ldapsearch to verify directory state
# 3. Access privacyIDEA admin UI
# URL: https://pink.coulomb.social
# Username: pi-admin
# Password: from KeePassXC → net-kingdom/privacyIDEA/pi-admin
# NOTE: pi-admin has MFA enrolled — if privacyIDEA MFA is down, use:
kubectl exec -n mfa deployment/privacyidea -- pi-manage admin list
Restore order
CRITICAL: Always restore in this order. Components depend on each other at startup: privacyIDEA needs PostgreSQL, KeyCape needs all three.
1. PostgreSQL (databases ns) — CNPG operator handles restore
2. privacyIDEA (mfa ns) — needs PG + enckey PVC
3. LLDAP (sso ns) — standalone
4. Authelia (sso ns) — needs LLDAP (LDAP bind at startup check)
5. KeyCape (sso ns) — needs Authelia + LLDAP + privacyIDEA
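The ordering above can also be enforced when waiting for workloads to come back. The following is a minimal sketch, not part of the repository: the function name `wait_in_restore_order` and the `RESTORE_ORDER` variable are hypothetical, and PostgreSQL is left out because it is a CNPG cluster rather than a Deployment.

```shell
# Sketch: wait for each Deployment in the documented restore order.
# RESTORE_ORDER and wait_in_restore_order are illustrative names (assumptions).
RESTORE_ORDER="mfa/privacyidea sso/lldap sso/authelia sso/keycape"

wait_in_restore_order() {
  local timeout="${1:-120s}" item ns name
  for item in $RESTORE_ORDER; do
    ns="${item%%/*}"     # text before the slash: namespace
    name="${item##*/}"   # text after the slash: deployment name
    echo "waiting for $name in $ns"
    # Stop at the first workload that is not Ready; later ones depend on it.
    kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout" || {
      echo "FAILED: $name ($ns)" >&2
      return 1
    }
  done
}
```

Calling `wait_in_restore_order 300s` after a restore stops at the first component that fails to roll out, which points at the dependency to fix first.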
Restore from SQLite backup (PVC data corruption)
LLDAP
# 1. Scale down LLDAP
kubectl scale deployment/lldap -n sso --replicas=0
# 2. Start a restore pod on the lldap-data PVC
kubectl run -n sso lldap-restore --image=nouchka/sqlite3:latest \
--restart=Never \
--overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"lldap-data"}}],"containers":[{"name":"lldap-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'
# 3. Copy backup file into the pod (or it's already on the PVC under /data/backups/)
kubectl exec -n sso lldap-restore -- ls /data/backups/
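# 4. Restore from the chosen backup. The pipeline is wrapped in sh -c so that both
#    sqlite3 processes run inside the pod; a bare pipe would run the second one locally.
#    The target must not already contain tables (move a damaged users.db aside first).
kubectl exec -n sso lldap-restore -- sh -c \
  'sqlite3 /data/backups/users.backup.YYYY-MM-DD ".dump" | sqlite3 /data/users.db'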
# 5. Clean up and restart
kubectl delete pod -n sso lldap-restore
kubectl scale deployment/lldap -n sso --replicas=1
kubectl rollout status deployment/lldap -n sso --timeout=120s
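Before loading a backup it may be worth confirming that the file itself is intact. The sketch below assumes the restore pod from the steps above is still running and has `sqlite3` on its PATH; the function name `backup_is_healthy` is hypothetical. It relies on `PRAGMA integrity_check` printing the single line `ok` for a healthy database.

```shell
# Sketch: check a backup file inside the restore pod before restoring from it.
# backup_is_healthy is an illustrative name (assumption), not a repository script.
backup_is_healthy() {
  local ns="$1" pod="$2" path="$3" result
  result="$(kubectl exec -n "$ns" "$pod" -- sqlite3 "$path" 'PRAGMA integrity_check;')" || return 1
  # Anything other than the single word "ok" is a list of detected problems.
  [ "$result" = "ok" ]
}
```

For example, `backup_is_healthy sso lldap-restore /data/backups/users.backup.YYYY-MM-DD` returns success only when the chosen file passes the check.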
Authelia
# On single-node k3s (local-path PVCs are hostPath-backed), a restore pod can mount
# authelia-data alongside the running Authelia pod. Scale down only if you need to
# replace the live db.sqlite3 in-place (Authelia must be stopped to avoid corruption).
kubectl scale deployment/authelia -n sso --replicas=0
kubectl run -n sso authelia-restore --image=nouchka/sqlite3:latest \
--restart=Never \
--overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"authelia-data"}}],"containers":[{"name":"authelia-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'
kubectl exec -n sso authelia-restore -- ls /data/backups/
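# Run the whole pipeline inside the pod (sh -c); a bare pipe would run the second
# sqlite3 on the local host. The target must not already contain tables.
kubectl exec -n sso authelia-restore -- sh -c \
  'sqlite3 /data/backups/authelia.backup.YYYY-MM-DD ".dump" | sqlite3 /data/db.sqlite3'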
kubectl delete pod -n sso authelia-restore
kubectl scale deployment/authelia -n sso --replicas=1
kubectl rollout status deployment/authelia -n sso --timeout=120s
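Choosing which dated backup to restore can be scripted from the `ls` output. This is a small sketch that assumes the `<prefix>.YYYY-MM-DD` naming used in the commands above, so that lexical order matches date order; `latest_backup` is a hypothetical helper name.

```shell
# Sketch: read a directory listing on stdin and print the newest backup name.
# latest_backup is an illustrative name (assumption).
latest_backup() {
  local prefix="$1"
  # Keep only names of the form <prefix>.YYYY-MM-DD, then take the last one.
  awk -v p="$prefix." '
    index($0, p) == 1 &&
    substr($0, length(p) + 1) ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/
  ' | sort | tail -n 1
}
```

A possible use is `kubectl exec -n sso authelia-restore -- ls /data/backups/ | latest_backup authelia.backup`, which prints nothing when no matching file exists.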
privacyIDEA enckey
# If the enckey is lost, restore it from KeePassXC binary attachment PI_ENCFILE.
# Extract it to a local file first, then:
kubectl create secret generic privacyidea-enckey \
--from-file=PI_ENCFILE=./pi.enc \
--namespace mfa \
--dry-run=client -o yaml | kubectl apply -f -
# Restart privacyIDEA to pick up the restored key
kubectl rollout restart deployment/privacyidea -n mfa
# If the enckey is truly lost and unrecoverable:
# All enrolled MFA tokens are invalid.
# Generate a new enckey with: kubectl exec -n mfa ... -- pi-manage create_enckey
# All users must re-enroll their TOTP/hardware tokens.
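After re-creating the Secret, it can be useful to confirm that the cluster holds exactly the bytes of the local file before restarting privacyIDEA. The sketch below assumes the Secret name and key used above (`privacyidea-enckey`, `PI_ENCFILE`) and GNU coreutils (`sha256sum`, `base64 -d`); the function name `enckey_matches_secret` is hypothetical.

```shell
# Sketch: compare a local enckey file with the content of the cluster Secret.
# enckey_matches_secret is an illustrative name (assumption).
enckey_matches_secret() {
  local file="$1" local_sum secret_sum
  [ -s "$file" ] || return 1   # refuse an empty or missing local file
  local_sum="$(sha256sum < "$file" | awk '{print $1}')"
  # Secret data is base64-encoded; decode it and hash the raw bytes.
  secret_sum="$(kubectl get secret privacyidea-enckey -n mfa \
      -o jsonpath='{.data.PI_ENCFILE}' | base64 -d | sha256sum | awk '{print $1}')"
  [ "$local_sum" = "$secret_sum" ]
}
```

Running `enckey_matches_secret ./pi.enc` compares checksums only, so the key material never appears in the terminal output.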
Full node restore (new host)
# Prerequisites on new host:
# - K3s installed
# - Traefik ingress (bundled with K3s)
# - cert-manager installed (helm install cert-manager ...)
# - DNS records pointing to new node IP
# - KeePassXC vault accessible (offline copy or age-encrypted bundle)
# 1. Restore PostgreSQL from CNPG backup
# (See CNPG documentation for cluster restore from barmanObjectStore)
# 2. Re-apply all manifests in order
cd sso-mfa/k8s
kubectl apply -f namespaces/namespaces.yaml
kubectl apply -f network-policies/
kubectl apply -f cert-manager/issuers.yaml
# 3. Restore secrets from KeePassXC
# Run each create-secrets.sh in order:
cd postgresql && ./create-secrets.sh && cd ..
cd privacyidea && ./create-secrets.sh && cd ..
cd lldap && ./create-secrets.sh && cd ..
cd authelia && ./create-secrets.sh && cd ..
cd keycape && ./create-secrets.sh && cd ..
# 4. Apply workloads in restore order
kubectl apply -f postgresql/cluster.yaml
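# --filename= is repeated through brace expansion; a bare "-f dir/{a,b}" would pass
# extra positional arguments, which kubectl apply rejects.
kubectl apply --filename=privacyidea/{pvc.yaml,configmap.yaml,deployment.yaml,middleware.yaml,ingress.yaml}
kubectl apply --filename=lldap/{pvc.yaml,deployment.yaml,middleware.yaml,ingress.yaml}
kubectl apply --filename=authelia/{pvc.yaml,configmap.yaml,deployment.yaml,ingress.yaml}
kubectl apply --filename=keycape/{deployment.yaml,middleware.yaml,ingress.yaml}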
# 5. Wait for everything to be Ready
kubectl rollout status deployment/privacyidea -n mfa --timeout=300s
kubectl rollout status deployment/lldap -n sso --timeout=120s
kubectl rollout status deployment/authelia -n sso --timeout=120s
kubectl rollout status deployment/keycape -n sso --timeout=60s
# 6. Re-run bootstrap scripts if PVC data was lost
cd privacyidea && ./enckey-bootstrap.sh && ./bootstrap-admin.sh && ./bootstrap-realm.sh
cd ../lldap && ./bootstrap-users.sh && ./break-glass.sh
cd ../keycape && ./create-pi-token.sh && ./create-secrets.sh
kubectl rollout restart deployment/keycape -n sso
# 7. Verify
./verify-t04.sh && ./verify-t05.sh && ./verify-t06.sh && ./verify-t07.sh && ./verify-t08.sh
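An `&&` chain stops at the first failing verify script, which hides the state of the later ones. As an alternative, the sketch below runs every script and summarises the failures; `run_all_checks` is a hypothetical helper, and it assumes each verify script exits non-zero on failure.

```shell
# Sketch: run every check and report all failures instead of stopping early.
# run_all_checks is an illustrative name (assumption).
run_all_checks() {
  local failed=0 script
  for script in "$@"; do
    if "$script"; then
      echo "PASS $script"
    else
      echo "FAIL $script"
      failed=$((failed + 1))   # plain assignment, safe under set -e
    fi
  done
  echo "failed: $failed"
  [ "$failed" -eq 0 ]
}
```

It could be called as `run_all_checks ./verify-t04.sh ./verify-t05.sh ./verify-t06.sh ./verify-t07.sh ./verify-t08.sh`, from whichever directory holds those scripts.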
Backup offsite export
The SQLite backup files land on the PVCs but are not offsite until exported. Run this on the node host to pull them out and encrypt for offsite storage:
# Pull backup files from pods
kubectl exec -n sso deployment/lldap -- \
cat /data/backups/users.backup.$(date +%Y-%m-%d) > /tmp/lldap-backup.db
kubectl exec -n sso deployment/authelia -- \
cat /data/backups/authelia.backup.$(date +%Y-%m-%d) > /tmp/authelia-backup.db
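# Encrypt both files with age before sending them offsite (same key as the ops bundle)
AGE_RECIPIENT="$(grep 'public key' ~/net-kingdom-ops-bundle.key | awk '{print $NF}')"
age -r "$AGE_RECIPIENT" -o /tmp/lldap-backup.db.age /tmp/lldap-backup.db
age -r "$AGE_RECIPIENT" -o /tmp/authelia-backup.db.age /tmp/authelia-backup.db
# Shred plaintext copies only after both .age files have been written
shred -u /tmp/lldap-backup.db /tmp/authelia-backup.db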
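If the backup for today's date does not exist, `kubectl exec ... cat` fails but the shell redirect still creates an empty local file, which would then be encrypted and shipped as if it were a backup. A minimal guard, assuming the exported files are plain SQLite databases, is to check for the SQLite file header before encrypting; `looks_like_sqlite` is a hypothetical helper name.

```shell
# Sketch: reject an export that is empty or does not start with the SQLite header.
# Every SQLite database file begins with the 16 bytes "SQLite format 3\0".
# looks_like_sqlite is an illustrative name (assumption).
looks_like_sqlite() {
  local file="$1" magic
  [ -s "$file" ] || return 1
  # Compare the first 15 bytes; tr drops NUL bytes so the shell can hold the value.
  magic="$(head -c 15 "$file" | tr -d '\0')"
  [ "$magic" = "SQLite format 3" ]
}
```

For instance, `looks_like_sqlite /tmp/lldap-backup.db || echo "export failed, do not encrypt"` would catch a missing backup before the plaintext copy is shredded.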