# Disaster Recovery Runbook — net-kingdom SSO/MFA Platform

**Stack:** LLDAP + Authelia + KeyCape (sso namespace) + privacyIDEA (mfa namespace)
**PostgreSQL:** Managed separately by CNPG (`postgresql/scheduled-backup.yaml`)

---

## Recovery scenarios

| Scenario | Impact | Recovery |
|----------|--------|----------|
| Pod crash / OOM | Stateless pods (KeyCape) recover automatically. Stateful pods (LLDAP, Authelia, PI) restart and reload from PVC. | K8s self-heals. Verify with `verify-t05.sh`. |
| PVC data corruption | Users/sessions/tokens lost. | Restore from SQLite backup (see below). |
| Node failure (single-node K3s) | All pods lost. PVCs intact on host. | Re-apply all manifests (idempotent). Pods re-attach to PVCs. |
| Node total loss (disk gone) | Everything lost. | Full restore from backup + KeePassXC. |
| Stack locked out (SSO broken, can't log in) | No user access to OIDC-protected apps. | Use break-glass account. |
| enckey lost (privacyIDEA) | All enrolled MFA tokens invalid. Users must re-enroll. | Restore from enckey backup or re-enroll all tokens. |

---

## Break-glass access

When the SSO stack is broken and no user can authenticate:

```bash
# 1. Access LLDAP admin UI directly (requires VPN / IP-allowlisted access)
#    URL:      https://lldap.coulomb.social
#    Username: break-glass
#    Password: from KeePassXC → net-kingdom/Break-glass/break-glass
#
# 2. Or access LLDAP via kubectl exec (no network required).
#    -it attaches a terminal; without it the shell exits immediately.
kubectl exec -it -n sso deployment/lldap -- /bin/sh
#    Inside container: use ldapwhoami / ldapsearch to verify directory state

# 3. Access privacyIDEA admin UI
#    URL:      https://pink.coulomb.social
#    Username: pi-admin
#    Password: from KeePassXC → net-kingdom/privacyIDEA/pi-admin
#    NOTE: pi-admin has MFA enrolled. If privacyIDEA MFA is down, use:
kubectl exec -n mfa deployment/privacyidea -- pi-manage admin list
```

---

## Restore order

**CRITICAL: Always restore in this order.** Components depend on each other
at startup: privacyIDEA needs PostgreSQL, Authelia needs LLDAP, and KeyCape
needs Authelia, LLDAP, and privacyIDEA.

```
1. PostgreSQL  (databases ns)  - CNPG operator handles restore
2. privacyIDEA (mfa ns)        - needs PG + enckey PVC
3. LLDAP       (sso ns)        - standalone
4. Authelia    (sso ns)        - needs LLDAP (LDAP bind at startup check)
5. KeyCape     (sso ns)        - needs Authelia + LLDAP + privacyIDEA
```
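
The order above can be enforced mechanically by waiting for each deployment to become Ready before touching the next one. The following is a minimal sketch, not part of the repository: `wait_in_restore_order` and `RESTORE_ORDER` are illustrative names, the deployment names are the ones used in the commands elsewhere in this runbook, and PostgreSQL is left out because CNPG manages it separately. Calling `wait_in_restore_order 120s` stops at the first component that fails to roll out.

```bash
#!/usr/bin/env bash
# Sketch: block on each deployment in restore order.
# Entries are "namespace/deployment"; names are assumptions taken from this runbook.
RESTORE_ORDER="mfa/privacyidea sso/lldap sso/authelia sso/keycape"

wait_in_restore_order() {
  local timeout="${1:-300s}" entry ns name
  for entry in $RESTORE_ORDER; do
    ns="${entry%%/*}"      # text before the slash: namespace
    name="${entry##*/}"    # text after the slash: deployment name
    echo "waiting for ${name} in namespace ${ns}"
    # Stop at the first failure so dependants are not checked against a broken base.
    kubectl rollout status "deployment/${name}" -n "${ns}" --timeout="${timeout}" || return 1
  done
}
```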

---

## Restore from SQLite backup (PVC data corruption)
### LLDAP

```bash
# 1. Scale down LLDAP
kubectl scale deployment/lldap -n sso --replicas=0

# 2. Start a restore pod on the lldap-data PVC and wait until it is Ready
kubectl run -n sso lldap-restore --image=nouchka/sqlite3:latest \
  --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"lldap-data"}}],"containers":[{"name":"lldap-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'
kubectl wait -n sso --for=condition=Ready pod/lldap-restore --timeout=60s

# 3. Copy backup file into the pod (or it's already on the PVC under /data/backups/)
kubectl exec -n sso lldap-restore -- ls /data/backups/

# 4. Restore from the chosen backup.
#    The whole pipeline runs inside the pod via sh -c; a bare pipe after
#    kubectl exec would run the second sqlite3 on the local machine instead.
#    The damaged file is moved aside so the dump loads into an empty database.
kubectl exec -n sso lldap-restore -- sh -c '
  mv /data/users.db /data/users.db.corrupt &&
  sqlite3 /data/backups/users.backup.YYYY-MM-DD ".dump" | sqlite3 /data/users.db'

# 5. Clean up and restart
kubectl delete pod -n sso lldap-restore
kubectl scale deployment/lldap -n sso --replicas=1
kubectl rollout status deployment/lldap -n sso --timeout=120s
```
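
Before overwriting a live database it is worth confirming that the chosen backup is itself readable. The sketch below uses SQLite's `PRAGMA integrity_check`, which prints the single line `ok` for a healthy file. `check_backup` is a hypothetical helper, not one of the repository scripts, and it assumes `sqlite3` is available wherever it runs (for example inside the restore pod).

```bash
#!/usr/bin/env bash
# Sketch: refuse to restore from a backup that SQLite itself reports as damaged.
check_backup() {
  local file="$1" result
  # An absent or zero-byte file is never a usable backup.
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # PRAGMA integrity_check prints "ok" when no problems are found.
  result="$(sqlite3 "$file" 'PRAGMA integrity_check;' 2>&1 || true)"
  if [ "$result" = "ok" ]; then
    echo "backup looks healthy: $file"
  else
    echo "integrity check failed for $file: $result" >&2
    return 1
  fi
}
```

The same check can be run ad hoc with `kubectl exec -n sso lldap-restore -- sqlite3 /data/backups/users.backup.YYYY-MM-DD 'PRAGMA integrity_check;'`.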
### Authelia

```bash
# On single-node k3s (local-path PVCs are hostPath-backed), a restore pod can mount
# authelia-data alongside the running Authelia pod, so listing or inspecting backups
# needs no downtime. Replacing the live db.sqlite3 in place does require Authelia to
# be stopped to avoid corruption, so scale it down first.
kubectl scale deployment/authelia -n sso --replicas=0
kubectl run -n sso authelia-restore --image=nouchka/sqlite3:latest \
  --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"authelia-data"}}],"containers":[{"name":"authelia-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'
kubectl wait -n sso --for=condition=Ready pod/authelia-restore --timeout=60s
kubectl exec -n sso authelia-restore -- ls /data/backups/

# Run the whole pipeline inside the pod; a bare pipe after kubectl exec would run
# the second sqlite3 locally. Move the damaged file aside before loading the dump.
kubectl exec -n sso authelia-restore -- sh -c '
  mv /data/db.sqlite3 /data/db.sqlite3.corrupt &&
  sqlite3 /data/backups/authelia.backup.YYYY-MM-DD ".dump" | sqlite3 /data/db.sqlite3'

kubectl delete pod -n sso authelia-restore
kubectl scale deployment/authelia -n sso --replicas=1
kubectl rollout status deployment/authelia -n sso --timeout=120s
```
### privacyIDEA enckey

```bash
# If the enckey is lost, restore it from KeePassXC binary attachment PI_ENCFILE.
# Extract it to a local file first, then:
kubectl create secret generic privacyidea-enckey \
  --from-file=PI_ENCFILE=./pi.enc \
  --namespace mfa \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart privacyIDEA to pick up the restored key
kubectl rollout restart deployment/privacyidea -n mfa

# If the enckey is truly lost and unrecoverable:
#   All enrolled MFA tokens are invalid.
#   Generate a new enckey with: kubectl exec -n mfa ... -- pi-manage create_enckey
#   All users must re-enroll their TOTP/hardware tokens.
```
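
Because a wrong or truncated enckey invalidates every enrolled token, it can help to confirm that the secret now holds exactly the bytes of the local file. This is a rough sketch under stated assumptions: `enckey_matches` is a hypothetical helper, the secret name, key, and namespace are the ones from the command above, and `sha256sum` and `base64 -d` are assumed to be the GNU coreutils versions. A call such as `enckey_matches ./pi.enc && echo "enckey secret matches local file"` succeeds only when both digests agree.

```bash
#!/usr/bin/env bash
# Sketch: compare the SHA-256 of the local key file with the copy stored in the secret.
enckey_matches() {
  local file="$1" local_sum cluster_sum
  local_sum="$(sha256sum < "$file" | cut -d' ' -f1)"
  # Secret data is base64-encoded, so decode it before hashing.
  cluster_sum="$(kubectl get secret privacyidea-enckey -n mfa \
      -o jsonpath='{.data.PI_ENCFILE}' | base64 -d | sha256sum | cut -d' ' -f1)"
  [ "$local_sum" = "$cluster_sum" ]
}
```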

---

## Full node restore (new host)

```bash
# Prerequisites on new host:
#   - K3s installed
#   - Traefik ingress (bundled with K3s)
#   - cert-manager installed (helm install cert-manager ...)
#   - DNS records pointing to new node IP
#   - KeePassXC vault accessible (offline copy or age-encrypted bundle)

# 1. Restore PostgreSQL from CNPG backup
#    (See CNPG documentation for cluster restore from barmanObjectStore)

# 2. Re-apply all manifests in order
cd sso-mfa/k8s
kubectl apply -f namespaces/namespaces.yaml
kubectl apply -f network-policies/
kubectl apply -f cert-manager/issuers.yaml

# 3. Restore secrets from KeePassXC
#    Run each create-secrets.sh in order. Subshells keep the working directory
#    at sso-mfa/k8s even if one script fails.
(cd postgresql && ./create-secrets.sh)
(cd privacyidea && ./create-secrets.sh)
(cd lldap && ./create-secrets.sh)
(cd authelia && ./create-secrets.sh)
(cd keycape && ./create-secrets.sh)

# 4. Apply workloads in restore order.
#    One file per kubectl call: "-f dir/{a,b}" expands to "-f dir/a dir/b",
#    and kubectl rejects the extra positional arguments.
kubectl apply -f postgresql/cluster.yaml
for f in pvc configmap deployment middleware ingress; do kubectl apply -f "privacyidea/$f.yaml"; done
for f in pvc deployment middleware ingress; do kubectl apply -f "lldap/$f.yaml"; done
for f in pvc configmap deployment ingress; do kubectl apply -f "authelia/$f.yaml"; done
for f in deployment middleware ingress; do kubectl apply -f "keycape/$f.yaml"; done

# 5. Wait for everything to be Ready
kubectl rollout status deployment/privacyidea -n mfa --timeout=300s
kubectl rollout status deployment/lldap -n sso --timeout=120s
kubectl rollout status deployment/authelia -n sso --timeout=120s
kubectl rollout status deployment/keycape -n sso --timeout=60s

# 6. Re-run bootstrap scripts if PVC data was lost
cd privacyidea && ./enckey-bootstrap.sh && ./bootstrap-admin.sh && ./bootstrap-realm.sh
cd ../lldap && ./bootstrap-users.sh && ./break-glass.sh
cd ../keycape && ./create-pi-token.sh && ./create-secrets.sh
kubectl rollout restart deployment/keycape -n sso

# 7. Verify
./verify-t04.sh && ./verify-t05.sh && ./verify-t06.sh && ./verify-t07.sh && ./verify-t08.sh
```
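
The prerequisites listed at the top of the procedure can be partly checked before any manifest is applied. The sketch below only confirms that the needed command-line tools are on `PATH`. `require_cmds` is a hypothetical helper, not one of the repository scripts, and the tool list in a call such as `require_cmds kubectl helm age || exit 1` is an assumption drawn from the commands this runbook uses.

```bash
#!/usr/bin/env bash
# Sketch: fail early if a tool the restore relies on is missing from PATH.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  # Non-zero when at least one command was not found; all names are still reported.
  return "$missing"
}
```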

---

## Backup offsite export

The SQLite backup files land on the PVCs but are not offsite until exported.
Run this on the node host to pull them out and encrypt for offsite storage:

```bash
# Pull today's backup files from the pods
kubectl exec -n sso deployment/lldap -- \
  cat /data/backups/users.backup.$(date +%Y-%m-%d) > /tmp/lldap-backup.db
kubectl exec -n sso deployment/authelia -- \
  cat /data/backups/authelia.backup.$(date +%Y-%m-%d) > /tmp/authelia-backup.db

# Encrypt both files with age for offsite storage (same key as the ops bundle)
RECIPIENT="$(awk '/public key/ {print $NF}' ~/net-kingdom-ops-bundle.key)"
age -r "$RECIPIENT" -o /tmp/lldap-backup.db.age /tmp/lldap-backup.db
age -r "$RECIPIENT" -o /tmp/authelia-backup.db.age /tmp/authelia-backup.db

# Shred plaintext copies (only after both .age files exist)
shred -u /tmp/lldap-backup.db /tmp/authelia-backup.db
```
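
The export above assumes a backup with today's date already exists. If the CronJob has not yet run, a safer choice is the newest file present. The following sketch picks it by name, relying on the `YYYY-MM-DD` suffix sorting chronologically as plain text. `latest_backup` is a hypothetical helper and has to run where the backup directory is visible, for example inside a pod that mounts the PVC.

```bash
#!/usr/bin/env bash
# Sketch: print the path of the newest dated backup for a given file prefix.
latest_backup() {
  local dir="$1" prefix="$2" newest
  # Keep only names of the form <prefix>.<date>; the last one in sort order is newest.
  newest="$(ls -1 "$dir" 2>/dev/null | grep "^${prefix}\.[0-9-]*$" | sort | tail -n 1 || true)"
  [ -n "$newest" ] || return 1
  echo "${dir}/${newest}"
}
```

For example, `latest_backup /data/backups users.backup` prints the path of the most recent LLDAP backup and returns non-zero when none exists.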