generated from coulomb/repo-seed
T07 — User management & self-service: - k8s/lldap/bootstrap-users.sh: creates net-kingdom-users and net-kingdom-admins groups in LLDAP via GraphQL API; idempotent. - k8s/lldap/break-glass.sh: creates break-glass bypass account in LLDAP, sets BREAKGLASS_PASSWORD, assigns to net-kingdom-admins. - k8s/verify-t07.sh: 6 checks — groups, break-glass, self-service portal, KeyCape OIDC client registrations. T08 — Backups, DR, break-glass: - k8s/backup/cronjob-sqlite-backups.yaml: daily CronJobs for LLDAP SQLite, Authelia SQLite (with scale-down/up RBAC), and privacyIDEA enckey backup. 7-day retention, 03:00/03:15/03:30 UTC staggered schedule. - k8s/backup/DR-RUNBOOK.md: full restore runbook — scenarios, restore order, LLDAP/Authelia/PI SQLite restore procedure, full node rebuild sequence, offsite age-encrypted export. - k8s/verify-t08.sh: 9 checks — CronJobs, RBAC, run history, backup files on PVCs, DR runbook presence, offsite backup (manual confirmation). - WORKPLAN.md: T07/T08 sections with done-criteria added. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
188 lines
7.0 KiB
Markdown
# Disaster Recovery Runbook — net-kingdom SSO/MFA Platform

**Stack:** LLDAP + Authelia + KeyCape (sso namespace) + privacyIDEA (mfa namespace)
**PostgreSQL:** Managed separately by CNPG (`postgresql/scheduled-backup.yaml`)

---

## Recovery scenarios

| Scenario | Impact | Recovery |
|----------|--------|----------|
| Pod crash / OOM | Stateless pods (KeyCape) recover automatically. Stateful pods (LLDAP, Authelia, PI) restart and reload from PVC. | K8s self-heals. Verify with `verify-t05.sh`. |
| PVC data corruption | Users/sessions/tokens lost. | Restore from SQLite backup (see below). |
| Node failure (single-node K3s) | All pods lost. PVCs intact on host. | Re-apply all manifests (idempotent). Pods re-attach to PVCs. |
| Node total loss (disk gone) | Everything lost. | Full restore from backup + KeePassXC. |
| Stack locked out (SSO broken, can't log in) | No user access to OIDC-protected apps. | Use break-glass account. |
| enckey lost (privacyIDEA) | All enrolled MFA tokens invalid. Users must re-enroll. | Restore from enckey backup or re-enroll all tokens. |

---

## Break-glass access

When the SSO stack is broken and no user can authenticate:

```bash
# 1. Access LLDAP admin UI directly (requires VPN / IP-allowlisted access)
#    URL:      https://lldap.coulomb.social
#    Username: break-glass
#    Password: from KeePassXC → net-kingdom/Break-glass/break-glass
#
# 2. Or access LLDAP via kubectl exec (bypasses ingress; needs only cluster API access)
#    -it attaches a terminal so the shell is interactive
kubectl exec -it -n sso deployment/lldap -- /bin/sh
# Inside container: use ldapwhoami / ldapsearch to verify directory state

# 3. Access privacyIDEA admin UI
#    URL:      https://pink.coulomb.social
#    Username: pi-admin
#    Password: from KeePassXC → net-kingdom/privacyIDEA/pi-admin
#    NOTE: pi-admin has MFA enrolled. If privacyIDEA MFA is down, use:
kubectl exec -n mfa deployment/privacyidea -- pi-manage admin list
```

---

## Restore order

**CRITICAL: Always restore in this order.** Components depend on each other
at startup: privacyIDEA needs PostgreSQL, Authelia needs LLDAP, and KeyCape
needs Authelia, LLDAP, and privacyIDEA.

```
1. PostgreSQL    (databases ns)  - CNPG operator handles restore
2. privacyIDEA   (mfa ns)        - needs PG + enckey PVC
3. LLDAP         (sso ns)        - standalone
4. Authelia      (sso ns)        - needs LLDAP (LDAP bind at startup check)
5. KeyCape       (sso ns)        - needs Authelia + LLDAP + privacyIDEA
```

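
The restore order can be encoded so that each component is confirmed Ready before the next one is touched. The following is a minimal sketch, not a script from the repository: the function name `wait_in_restore_order` and the `RESTORE_ORDER` variable are hypothetical, the deployment names and timeouts are the ones used in the full node restore section of this runbook, and PostgreSQL is left out on the assumption that CNPG reports its own readiness. Calling `wait_in_restore_order` after applying the workloads returns non-zero at the first component that does not come up.

```bash
# Hypothetical helper: block until each deployment has rolled out, in restore order.
# Entries are "namespace/deployment/timeout"; names and timeouts mirror this runbook.
RESTORE_ORDER="mfa/privacyidea/300s sso/lldap/120s sso/authelia/120s sso/keycape/60s"

wait_in_restore_order() {
  local entry ns deploy timeout
  for entry in $RESTORE_ORDER; do
    # Split "ns/deploy/timeout" on the slashes.
    IFS=/ read -r ns deploy timeout <<< "$entry"
    echo "waiting for $deploy in namespace $ns"
    # Stop at the first component that fails to become Ready.
    kubectl rollout status "deployment/$deploy" -n "$ns" --timeout="$timeout" || {
      echo "restore halted: $deploy is not Ready" >&2
      return 1
    }
  done
}
```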
---

## Restore from SQLite backup (PVC data corruption)

### LLDAP

```bash
# 1. Scale down LLDAP
kubectl scale deployment/lldap -n sso --replicas=0

# 2. Start a restore pod on the lldap-data PVC and wait until it is running
kubectl run -n sso lldap-restore --image=nouchka/sqlite3:latest \
  --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"lldap-data"}}],"containers":[{"name":"lldap-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'
kubectl wait -n sso --for=condition=Ready pod/lldap-restore --timeout=60s

# 3. Confirm the backup is on the PVC under /data/backups/
#    (copy it into the pod first if it is not)
kubectl exec -n sso lldap-restore -- ls /data/backups/

# 4. Restore from the chosen backup.
#    The pipeline is wrapped in sh -c so that both sqlite3 processes run inside
#    the pod; without it, the shell on your workstation would run the second one.
#    The damaged database is moved aside first so the dump loads into an empty file.
kubectl exec -n sso lldap-restore -- sh -c \
  'mv /data/users.db /data/users.db.corrupt && sqlite3 /data/backups/users.backup.YYYY-MM-DD ".dump" | sqlite3 /data/users.db'

# 5. Clean up and restart
kubectl delete pod -n sso lldap-restore
kubectl scale deployment/lldap -n sso --replicas=1
kubectl rollout status deployment/lldap -n sso --timeout=120s
```

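
Steps 3 and 4 leave the choice of backup to the operator. As a hedged sketch of how that choice could be made less error-prone, the two helpers below pick the newest dated backup and ask SQLite whether the file is intact before it is used. Both function names are hypothetical; the pod name, path, and `users.backup.YYYY-MM-DD` naming come from the procedure above, and `PRAGMA integrity_check` is standard SQLite.

```bash
# Hypothetical helpers for steps 3 and 4 of the LLDAP restore.

# Read file names on stdin and print the newest users.backup.YYYY-MM-DD.
# ISO dates sort lexicographically, so a plain sort is enough.
pick_latest_backup() {
  grep -E '^users\.backup\.[0-9]{4}-[0-9]{2}-[0-9]{2}$' | sort | tail -n 1
}

# Succeed only if SQLite reports the backup file as intact.
# PRAGMA integrity_check prints the single word "ok" for a healthy database.
backup_is_intact() {
  local file="$1" result
  result=$(kubectl exec -n sso lldap-restore -- \
    sqlite3 "/data/backups/$file" "PRAGMA integrity_check;")
  [ "$result" = "ok" ]
}

# Usage, with the restore pod from step 2 running:
#   latest=$(kubectl exec -n sso lldap-restore -- ls /data/backups/ | pick_latest_backup)
#   backup_is_intact "$latest" && echo "restore from $latest"
```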
### Authelia

```bash
# Same pattern as LLDAP, using authelia-data PVC and authelia.backup.YYYY-MM-DD
kubectl scale deployment/authelia -n sso --replicas=0
# ... (run restore pod, restore db.sqlite3, scale back up)
kubectl scale deployment/authelia -n sso --replicas=1
```

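
The elided middle steps could look like the sketch below, which follows the LLDAP pattern with the Authelia names substituted. Treat it as an illustration under assumptions rather than the tested procedure: it assumes `db.sqlite3` sits at the root of the `authelia-data` PVC with backups under `backups/` on the same volume, and the function name `restore_authelia` and pod name `authelia-restore` are made up for this example. If any step fails, the function returns early, leaving the restore pod in place for inspection and Authelia scaled down.

```bash
# Hypothetical sketch of the Authelia restore; see the assumptions in the text above.
restore_authelia() {
  local backup="$1"   # e.g. authelia.backup.2025-01-10
  local overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"authelia-data"}}],"containers":[{"name":"authelia-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'

  # Stop Authelia so nothing writes to the database during the restore.
  kubectl scale deployment/authelia -n sso --replicas=0 || return 1

  # Start a helper pod on the PVC and wait until it is running.
  kubectl run -n sso authelia-restore --image=nouchka/sqlite3:latest \
    --restart=Never --overrides="$overrides" || return 1
  kubectl wait -n sso --for=condition=Ready pod/authelia-restore --timeout=60s || return 1

  # Keep the damaged file, then rebuild the database from the backup.
  # The pipe runs inside the pod because the whole command is passed to sh -c.
  kubectl exec -n sso authelia-restore -- sh -c \
    "mv /data/db.sqlite3 /data/db.sqlite3.corrupt && sqlite3 /data/backups/$backup .dump | sqlite3 /data/db.sqlite3" \
    || return 1

  # Remove the helper pod and bring Authelia back.
  kubectl delete pod -n sso authelia-restore || return 1
  kubectl scale deployment/authelia -n sso --replicas=1 || return 1
  kubectl rollout status deployment/authelia -n sso --timeout=120s
}
```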
### privacyIDEA enckey

```bash
# If the enckey is lost, restore it from KeePassXC binary attachment PI_ENCFILE.
# Extract it to a local file first, then:
kubectl create secret generic privacyidea-enckey \
  --from-file=PI_ENCFILE=./pi.enc \
  --namespace mfa \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart privacyIDEA to pick up the restored key
kubectl rollout restart deployment/privacyidea -n mfa

# If the enckey is truly lost and unrecoverable:
#   All enrolled MFA tokens are invalid.
#   Generate a new enckey with: kubectl exec -n mfa ... -- pi-manage create_enckey
#   All users must re-enroll their TOTP/hardware tokens.
```

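
Because a wrong enckey silently invalidates every enrolled token, it may be worth confirming that the secret in the cluster is byte-identical to the file extracted from KeePassXC before restarting privacyIDEA. The check below is a sketch: `enckey_matches` is a hypothetical name, the secret name and `PI_ENCFILE` key are the ones used above, and GNU coreutils (`base64 -d`, `sha256sum`) are assumed on the machine running `kubectl`. A typical call would be `enckey_matches ./pi.enc && kubectl rollout restart deployment/privacyidea -n mfa`.

```bash
# Hypothetical check: does the PI_ENCFILE stored in the cluster match a local file?
# Assumes GNU coreutils (base64 -d, sha256sum) are available locally.
enckey_matches() {
  local local_file="$1" in_cluster on_disk
  # Refuse to compare against a missing or empty local file.
  [ -s "$local_file" ] || return 1
  # Secret values are base64-encoded in the API; decode before hashing.
  in_cluster=$(kubectl get secret privacyidea-enckey -n mfa \
    -o jsonpath='{.data.PI_ENCFILE}' | base64 -d | sha256sum | cut -d' ' -f1)
  on_disk=$(sha256sum < "$local_file" | cut -d' ' -f1)
  [ "$in_cluster" = "$on_disk" ]
}
```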
---

## Full node restore (new host)

```bash
# Prerequisites on new host:
#   - K3s installed
#   - Traefik ingress (bundled with K3s)
#   - cert-manager installed (helm install cert-manager ...)
#   - DNS records pointing to new node IP
#   - KeePassXC vault accessible (offline copy or age-encrypted bundle)

# 1. Restore PostgreSQL from CNPG backup
#    (See CNPG documentation for cluster restore from barmanObjectStore)

# 2. Re-apply all manifests in order
cd sso-mfa/k8s
kubectl apply -f namespaces/namespaces.yaml
kubectl apply -f network-policies/
kubectl apply -f cert-manager/issuers.yaml

# 3. Restore secrets from KeePassXC
#    Run each create-secrets.sh in order:
cd postgresql  && ./create-secrets.sh && cd ..
cd privacyidea && ./create-secrets.sh && cd ..
cd lldap       && ./create-secrets.sh && cd ..
cd authelia    && ./create-secrets.sh && cd ..
cd keycape     && ./create-secrets.sh && cd ..

# 4. Apply workloads in restore order
#    --filename=dir/{a,b} brace-expands in bash to one --filename= flag per file.
#    A bare "-f dir/{a,b}" would pass only the first file to -f.
kubectl apply -f postgresql/cluster.yaml
kubectl apply --filename=privacyidea/{pvc.yaml,configmap.yaml,deployment.yaml,middleware.yaml,ingress.yaml}
kubectl apply --filename=lldap/{pvc.yaml,deployment.yaml,middleware.yaml,ingress.yaml}
kubectl apply --filename=authelia/{pvc.yaml,configmap.yaml,deployment.yaml,ingress.yaml}
kubectl apply --filename=keycape/{deployment.yaml,middleware.yaml,ingress.yaml}

# 5. Wait for everything to be Ready
kubectl rollout status deployment/privacyidea -n mfa --timeout=300s
kubectl rollout status deployment/lldap -n sso --timeout=120s
kubectl rollout status deployment/authelia -n sso --timeout=120s
kubectl rollout status deployment/keycape -n sso --timeout=60s

# 6. Re-run bootstrap scripts if PVC data was lost
cd privacyidea && ./enckey-bootstrap.sh && ./bootstrap-admin.sh && ./bootstrap-realm.sh
cd ../lldap && ./bootstrap-users.sh && ./break-glass.sh
cd ../keycape && ./create-pi-token.sh && ./create-secrets.sh
kubectl rollout restart deployment/keycape -n sso

# 7. Verify (the verify scripts live in k8s/, so leave keycape/ first)
cd ..
./verify-t04.sh && ./verify-t05.sh && ./verify-t06.sh && ./verify-t07.sh && ./verify-t08.sh
```

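
The `&&` chain in step 7 stops at the first failing script, which hides any later failures. After a rebuild it can be more useful to see every failing check at once. The wrapper below is a hypothetical sketch of that idea: `run_all_verifiers` is not part of the repository, it assumes the five `verify-tNN.sh` scripts named in step 7 sit in one directory (passed as the first argument, defaulting to the current one), and it runs them with `bash` on the assumption that they are bash scripts.

```bash
# Hypothetical wrapper: run every verify script and list the failures,
# instead of stopping at the first one as an && chain does.
run_all_verifiers() {
  local dir="${1:-.}" failed="" t
  for t in t04 t05 t06 t07 t08; do
    # Record the failure and keep going so later checks still run.
    if ! bash "$dir/verify-$t.sh"; then
      failed="$failed verify-$t.sh"
    fi
  done
  if [ -n "$failed" ]; then
    echo "FAILED:$failed"
    return 1
  fi
  echo "all verify scripts passed"
}
```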
---

## Backup offsite export

The SQLite backup files land on the PVCs but are not offsite until exported.
Run this on the node host to pull them out and encrypt for offsite storage:

```bash
# Pull today's backup files from the pods
kubectl exec -n sso deployment/lldap -- \
  cat /data/backups/users.backup.$(date +%Y-%m-%d) > /tmp/lldap-backup.db
kubectl exec -n sso deployment/authelia -- \
  cat /data/backups/authelia.backup.$(date +%Y-%m-%d) > /tmp/authelia-backup.db

# Encrypt both files with age (same key as the ops bundle), then copy the
# .age files offsite
RECIPIENT="$(grep 'public key' ~/net-kingdom-ops-bundle.key | awk '{print $NF}')"
age -r "$RECIPIENT" -o /tmp/lldap-backup.db.age /tmp/lldap-backup.db
age -r "$RECIPIENT" -o /tmp/authelia-backup.db.age /tmp/authelia-backup.db

# Shred plaintext copies only after both .age files exist
shred -u /tmp/lldap-backup.db /tmp/authelia-backup.db
```
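
Shredding a plaintext backup before its encrypted copy exists would destroy the only exported copy. One way to guard against that, sketched here with a hypothetical function name `encrypt_then_shred`, is to tie the two steps together so a file is shredded only once `age` has succeeded and produced a non-empty output. The `age -r ... -o ...` and `shred -u` invocations are the same ones used in the export commands above. With the recipient read from the ops bundle key file, a call would be `encrypt_then_shred "$RECIPIENT" /tmp/lldap-backup.db /tmp/authelia-backup.db`.

```bash
# Hypothetical helper: encrypt each file to the given age recipient and shred
# the plaintext only after the encrypted copy exists and is non-empty.
encrypt_then_shred() {
  local recipient="$1" f
  shift
  for f in "$@"; do
    if age -r "$recipient" -o "$f.age" "$f" && [ -s "$f.age" ]; then
      shred -u "$f"
    else
      # Leave the plaintext in place so the export can be retried.
      echo "encryption failed for $f; plaintext kept" >&2
      return 1
    fi
  done
}
```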