feat(sso-mfa): T07/T08 user mgmt, backups, DR & break-glass (NK-WP-0001-T07/T08)
T07 - User management & self-service:
- k8s/lldap/bootstrap-users.sh: creates net-kingdom-users and net-kingdom-admins groups in LLDAP via GraphQL API; idempotent.
- k8s/lldap/break-glass.sh: creates break-glass bypass account in LLDAP, sets BREAKGLASS_PASSWORD, assigns to net-kingdom-admins.
- k8s/verify-t07.sh: 6 checks: groups, break-glass, self-service portal, KeyCape OIDC client registrations.

T08 - Backups, DR, break-glass:
- k8s/backup/cronjob-sqlite-backups.yaml: daily CronJobs for LLDAP SQLite, Authelia SQLite (with scale-down/up RBAC), and privacyIDEA enckey backup. 7-day retention, 03:00/03:15/03:30 UTC staggered schedule.
- k8s/backup/DR-RUNBOOK.md: full restore runbook: scenarios, restore order, LLDAP/Authelia/PI SQLite restore procedure, full node rebuild sequence, offsite age-encrypted export.
- k8s/verify-t08.sh: 9 checks: CronJobs, RBAC, run history, backup files on PVCs, DR runbook presence, offsite backup (manual confirmation).
- WORKPLAN.md: T07/T08 sections with done-criteria added.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
187
sso-mfa/k8s/backup/DR-RUNBOOK.md
Normal file
@@ -0,0 +1,187 @@
# Disaster Recovery Runbook: net-kingdom SSO/MFA Platform

**Stack:** LLDAP + Authelia + KeyCape (sso namespace) + privacyIDEA (mfa namespace)
**PostgreSQL:** Managed separately by CNPG (`postgresql/scheduled-backup.yaml`)

---

## Recovery scenarios

| Scenario | Impact | Recovery |
|----------|--------|----------|
| Pod crash / OOM | Stateless pods (KeyCape) recover automatically. Stateful pods (LLDAP, Authelia, PI) restart and reload from PVC. | K8s self-heals. Verify with `verify-t05.sh`. |
| PVC data corruption | Users/sessions/tokens lost. | Restore from SQLite backup (see below). |
| Node failure (single-node K3s) | All pods lost. PVCs intact on host. | Re-apply all manifests (idempotent). Pods re-attach to PVCs. |
| Node total loss (disk gone) | Everything lost. | Full restore from backup + KeePassXC. |
| Stack locked out (SSO broken, can't log in) | No user access to OIDC-protected apps. | Use break-glass account. |
| enckey lost (privacyIDEA) | All enrolled MFA tokens invalid. Users must re-enroll. | Restore from enckey backup or re-enroll all tokens. |

---

## Break-glass access

When the SSO stack is broken and no user can authenticate:

```bash
# 1. Access LLDAP admin UI directly (requires VPN / IP-allowlisted access)
#    URL:      https://lldap.coulomb.social
#    Username: break-glass
#    Password: from KeePassXC → net-kingdom/Break-glass/break-glass
#
# 2. Or access LLDAP via kubectl exec (no ingress/VPN path required)
#    -it attaches a terminal; without it the shell exits immediately
kubectl exec -it -n sso deployment/lldap -- /bin/sh
#    Inside container: use ldapwhoami / ldapsearch to verify directory state

# 3. Access privacyIDEA admin UI
#    URL:      https://pink.coulomb.social
#    Username: pi-admin
#    Password: from KeePassXC → net-kingdom/privacyIDEA/pi-admin
#    NOTE: pi-admin has MFA enrolled. If privacyIDEA MFA is down, use:
kubectl exec -n mfa deployment/privacyidea -- pi-manage admin list
```

---

## Restore order

**CRITICAL: Always restore in this order.** Components depend on each other
at startup: privacyIDEA needs PostgreSQL, Authelia needs LLDAP, and KeyCape
needs Authelia, LLDAP, and privacyIDEA.

```
1. PostgreSQL   (databases ns)  - CNPG operator handles restore
2. privacyIDEA  (mfa ns)        - needs PG + enckey PVC
3. LLDAP        (sso ns)        - standalone
4. Authelia     (sso ns)        - needs LLDAP (LDAP bind at startup check)
5. KeyCape      (sso ns)        - needs Authelia + LLDAP + privacyIDEA
```
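
The order above can also be enforced when bringing workloads back up. The following is a minimal sketch, assuming the four Deployments and the timeouts used elsewhere in this runbook. The function name `wait_in_restore_order` is illustrative and not part of the repository, and PostgreSQL is left out because CNPG manages it as a Cluster rather than a Deployment.

```bash
# Wait for each Deployment in restore order and stop at the first one that
# does not become Ready, so later components are not started against a
# broken dependency. The function name is illustrative, not part of the repo.
wait_in_restore_order() {
  # One "<namespace> <deployment> <timeout>" entry per line, in restore order.
  while read -r ns name timeout; do
    echo "waiting for $name ($ns)"
    kubectl rollout status "deployment/$name" -n "$ns" --timeout="$timeout" </dev/null \
      || { echo "stopped: $name is not Ready" >&2; return 1; }
  done <<'EOF'
mfa privacyidea 300s
sso lldap 120s
sso authelia 120s
sso keycape 60s
EOF
}
```

Calling `wait_in_restore_order` after applying the manifests returns non-zero as soon as one component fails to roll out, which makes it easy to see where the dependency chain broke.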

---

## Restore from SQLite backup (PVC data corruption)

### LLDAP

```bash
# 1. Scale down LLDAP
kubectl scale deployment/lldap -n sso --replicas=0

# 2. Start a restore pod on the lldap-data PVC and wait until it is running
kubectl run -n sso lldap-restore --image=nouchka/sqlite3:latest \
  --restart=Never \
  --overrides='{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"lldap-data"}}],"containers":[{"name":"lldap-restore","image":"nouchka/sqlite3:latest","command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}'
kubectl wait -n sso --for=condition=Ready pod/lldap-restore --timeout=60s

# 3. Copy backup file into the pod (or it's already on the PVC under /data/backups/)
kubectl exec -n sso lldap-restore -- ls /data/backups/

# 4. Restore from the chosen backup.
#    The whole pipeline runs inside the pod via sh -c; a bare "|" after
#    kubectl exec would pipe the dump into a sqlite3 on the workstation.
#    The corrupt database is moved aside first so the dump loads into a fresh file.
kubectl exec -n sso lldap-restore -- sh -c '
  mv /data/users.db /data/users.db.corrupt &&
  sqlite3 /data/backups/users.backup.YYYY-MM-DD ".dump" | sqlite3 /data/users.db'

# 5. Clean up and restart
kubectl delete pod -n sso lldap-restore
kubectl scale deployment/lldap -n sso --replicas=1
kubectl rollout status deployment/lldap -n sso --timeout=120s
```
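
Step 4 leaves the choice of backup to the operator. The following sketch picks the most recent file from a listing, assuming backups follow the `<prefix>.YYYY-MM-DD` naming shown above, so that lexical order equals date order. `latest_backup` is a hypothetical helper, not a script in this repository.

```bash
# Read file names on stdin and print the newest "<prefix>.YYYY-MM-DD" entry.
# ISO dates sort lexically, so a plain sort yields chronological order.
# Prints nothing when no file matches.
latest_backup() {
  prefix=$1
  grep "^$prefix\.[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\$" | sort | tail -n 1
}

# Example (hypothetical usage against the restore pod from step 2):
#   kubectl exec -n sso lldap-restore -- ls /data/backups/ | latest_backup users.backup
```

The newest file may already contain the corruption, so an older backup is sometimes the right choice.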

### Authelia

```bash
# Same pattern as LLDAP, using authelia-data PVC and authelia.backup.YYYY-MM-DD
kubectl scale deployment/authelia -n sso --replicas=0
# ... (run restore pod, restore db.sqlite3, scale back up)
kubectl scale deployment/authelia -n sso --replicas=1
```
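
The elided steps follow the LLDAP procedure. One way to avoid hand-editing the long `--overrides` JSON for each component is to generate it. This is a sketch only, assuming the Authelia PVC can be mounted at `/data` in the restore pod just as for LLDAP. `restore_pod_overrides` is a hypothetical helper.

```bash
# Emit the --overrides JSON for a throwaway restore pod that mounts the
# given PVC at /data and sleeps for an hour (same shape as the LLDAP pod).
restore_pod_overrides() {
  pod=$1
  pvc=$2
  printf '{"spec":{"volumes":[{"name":"data","persistentVolumeClaim":{"claimName":"%s"}}],' "$pvc"
  printf '"containers":[{"name":"%s","image":"nouchka/sqlite3:latest",' "$pod"
  printf '"command":["sleep","3600"],"volumeMounts":[{"name":"data","mountPath":"/data"}]}]}}\n'
}

# Example (hypothetical pod name authelia-restore):
#   kubectl run -n sso authelia-restore --image=nouchka/sqlite3:latest --restart=Never \
#     --overrides="$(restore_pod_overrides authelia-restore authelia-data)"
```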

### privacyIDEA enckey

```bash
# If the enckey is lost, restore it from KeePassXC binary attachment PI_ENCFILE.
# Extract it to a local file first, then:
kubectl create secret generic privacyidea-enckey \
  --from-file=PI_ENCFILE=./pi.enc \
  --namespace mfa \
  --dry-run=client -o yaml | kubectl apply -f -

# Restart privacyIDEA to pick up the restored key
kubectl rollout restart deployment/privacyidea -n mfa

# If the enckey is truly lost and unrecoverable:
#   All enrolled MFA tokens are invalid.
#   Generate a new enckey with: kubectl exec -n mfa ... -- pi-manage create_enckey
#   All users must re-enroll their TOTP/hardware tokens.
```
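
Before restarting privacyIDEA it can be worth confirming that the Secret holds the same bytes as the file extracted from KeePassXC. The following is a sketch, assuming GNU `sha256sum` and `base64` are available on the workstation. `enckey_matches` is a hypothetical helper.

```bash
# Compare the SHA-256 of a local enckey file with the PI_ENCFILE entry of
# the privacyidea-enckey Secret in the mfa namespace.
# Returns 0 when both digests are identical, non-zero otherwise.
enckey_matches() {
  file=$1
  local_sum=$(sha256sum < "$file" | awk '{print $1}')
  cluster_sum=$(kubectl get secret privacyidea-enckey -n mfa \
      -o jsonpath='{.data.PI_ENCFILE}' | base64 -d | sha256sum | awk '{print $1}')
  [ "$local_sum" = "$cluster_sum" ]
}

# Example:
#   enckey_matches ./pi.enc && kubectl rollout restart deployment/privacyidea -n mfa
```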

---

## Full node restore (new host)

```bash
# Prerequisites on new host:
#   - K3s installed
#   - Traefik ingress (bundled with K3s)
#   - cert-manager installed (helm install cert-manager ...)
#   - DNS records pointing to new node IP
#   - KeePassXC vault accessible (offline copy or age-encrypted bundle)

# 1. Restore PostgreSQL from CNPG backup
#    (See CNPG documentation for cluster restore from barmanObjectStore)

# 2. Re-apply all manifests in order
cd sso-mfa/k8s
kubectl apply -f namespaces/namespaces.yaml
kubectl apply -f network-policies/
kubectl apply -f cert-manager/issuers.yaml

# 3. Restore secrets from KeePassXC
#    Run each create-secrets.sh in order:
cd postgresql && ./create-secrets.sh && cd ..
cd privacyidea && ./create-secrets.sh && cd ..
cd lldap && ./create-secrets.sh && cd ..
cd authelia && ./create-secrets.sh && cd ..
cd keycape && ./create-secrets.sh && cd ..

# 4. Apply workloads in restore order
#    (one -f per file: with brace expansion only the first path would follow -f)
kubectl apply -f postgresql/cluster.yaml
for f in pvc configmap deployment middleware ingress; do kubectl apply -f "privacyidea/$f.yaml"; done
for f in pvc deployment middleware ingress; do kubectl apply -f "lldap/$f.yaml"; done
for f in pvc configmap deployment ingress; do kubectl apply -f "authelia/$f.yaml"; done
for f in deployment middleware ingress; do kubectl apply -f "keycape/$f.yaml"; done

# 5. Wait for everything to be Ready
kubectl rollout status deployment/privacyidea -n mfa --timeout=300s
kubectl rollout status deployment/lldap -n sso --timeout=120s
kubectl rollout status deployment/authelia -n sso --timeout=120s
kubectl rollout status deployment/keycape -n sso --timeout=60s

# 6. Re-run bootstrap scripts if PVC data was lost
cd privacyidea && ./enckey-bootstrap.sh && ./bootstrap-admin.sh && ./bootstrap-realm.sh
cd ../lldap && ./bootstrap-users.sh && ./break-glass.sh
cd ../keycape && ./create-pi-token.sh && ./create-secrets.sh
cd ..   # back to sso-mfa/k8s, where the verify scripts live
kubectl rollout restart deployment/keycape -n sso

# 7. Verify
./verify-t04.sh && ./verify-t05.sh && ./verify-t06.sh && ./verify-t07.sh && ./verify-t08.sh
```
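
Step 3 runs the same script in five directories. The following sketch shows a loop that keeps the working directory stable and stops at the first failure, assuming it is run from `sso-mfa/k8s`. `run_create_secrets` is a hypothetical helper, not a script in this repository.

```bash
# Run create-secrets.sh in each component directory, in restore order.
# The subshell keeps the caller's working directory unchanged, and the
# loop stops at the first script that exits non-zero.
run_create_secrets() {
  for dir in postgresql privacyidea lldap authelia keycape; do
    echo "creating secrets: $dir"
    ( cd "$dir" && ./create-secrets.sh ) || {
      echo "create-secrets.sh failed in $dir" >&2
      return 1
    }
  done
}
```

Because each script runs in a subshell, a failure halfway through leaves the shell in `sso-mfa/k8s` rather than inside a component directory.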

---

## Backup offsite export

The SQLite backup files land on the PVCs but are not offsite until exported.
Run this on the node host to pull them out and encrypt them for offsite storage:

```bash
# Pull today's backup files from the pods
# (date -u assumes the backup file names carry the UTC date, matching the UTC CronJob schedule)
today=$(date -u +%Y-%m-%d)
kubectl exec -n sso deployment/lldap -- \
  cat "/data/backups/users.backup.$today" > /tmp/lldap-backup.db
kubectl exec -n sso deployment/authelia -- \
  cat "/data/backups/authelia.backup.$today" > /tmp/authelia-backup.db

# Encrypt both files with age and send offsite (same key as the ops bundle)
recipient=$(grep 'public key' ~/net-kingdom-ops-bundle.key | awk '{print $NF}')
age -r "$recipient" -o /tmp/lldap-backup.db.age /tmp/lldap-backup.db
age -r "$recipient" -o /tmp/authelia-backup.db.age /tmp/authelia-backup.db

# Shred plaintext copies (only after both .age files exist)
shred -u /tmp/lldap-backup.db /tmp/authelia-backup.db
```
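
Shredding should only happen once every plaintext file has an encrypted counterpart. The following is a sketch of that guard, assuming `age` and `shred` are installed and the recipient public key is passed as the first argument. `encrypt_then_shred` is a hypothetical helper.

```bash
# Encrypt each file to "<file>.age" for the given age recipient and shred
# the plaintext only if encryption succeeded and produced a non-empty file.
# Stops at the first failure and leaves that plaintext file in place.
encrypt_then_shred() {
  recipient=$1
  shift
  for f in "$@"; do
    if age -r "$recipient" -o "$f.age" "$f" && [ -s "$f.age" ]; then
      shred -u "$f"
    else
      echo "encryption failed, keeping plaintext: $f" >&2
      return 1
    fi
  done
}

# Example:
#   encrypt_then_shred "$recipient" /tmp/lldap-backup.db /tmp/authelia-backup.db
```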