Files
net-kingdom/workplans/NK-WP-0003-keycape-privacyidea-cluster-deployment.md
Bernd Worsch c054241a5c feat(t09): backup, break-glass, DR drill — NK-WP-0003-T09 done
- Apply SQLite backup CronJobs (LLDAP, Authelia, privacyIDEA) — all verified running
- Fix authelia-backup: remove scale-down/up dance; concurrent local-path PVC mount
  works on single-node k3s, sqlite3 .backup is safe for concurrent access
- Fix privacyidea-backup: add supplementalGroups: [999] so uid=1000 can read enckey
- Add allow-backup-to-kube-api NetworkPolicy (backup pod → 10.43.0.1:443)
- Create break-glass LLDAP account (net-kingdom-admins); fix ((PASS++)) set-e trap
- SQLite restore drill: LLDAP backup valid (2 users, all tables)
- verify-t08.sh: PASS=15, FAIL=0; fix counter bug + enckey PVC path (/etc/privacyidea)
- Update DR-RUNBOOK.md Authelia restore procedure
- T09 deferred: CNPG backup (needs MinIO/S3), Prometheus (needs kube-prometheus-stack)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-25 23:56:40 +00:00

14 KiB
Raw Blame History

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id
id type title domain repo status owner topic_slug created updated state_hub_workstream_id
NK-WP-0003 workplan KeyCape + privacyIDEA Stack — Cluster Deployment netkingdom net-kingdom active custodian netkingdom 2026-03-20 2026-03-26 f24cefd4-a09b-4fa1-9b25-94bf783b425e

KeyCape + privacyIDEA Stack — Cluster Deployment

Goal

Deploy the full NetKingdom identity stack on the live k3s cluster without Keycloak. KeyCape (v0.1, complete) is the OIDC orchestration layer; it binds LLDAP (directory), Authelia (auth sessions), and privacyIDEA (MFA).

NK-WP-0001 was scoped around Keycloak and is deferred. This workplan covers everything needed to reach a production-ready identity plane.

Target cluster

RAILIANCE0192.205.62.239 — k3s v1.35.1+k3s1, clean baseline. Kubeconfig: ~/.kube/config-railiance01

Note: T02T07 were previously completed on CoulombCore (92.205.130.254) by mistake. CoulombCore is the old management host (Gitea/OCI registry only) and should not be touched. All SSO stack work targets RAILIANCE01 exclusively.

Pre-conditions

  • k3s cluster healthy on RAILIANCE01 — v1.35.1+k3s1, node Ready ✓
  • kubeconfig available at ~/.kube/config-railiance01
  • All manifests committed — net-kingdom sso-mfa/k8s/
  • KeyCape v0.1 complete — KEY-WP-0001 ✓
  • SOPS + age integrated into net-kingdom — NK-WP-0004 ✓
  • Agent-driven credential bootstrap ready — NK-WP-0005 ✓ (run make creds-agent-init)

Architecture

Internet → Traefik (RAILIANCE01 k3s) → cert-manager TLS
                ├── auth.coulomb.social        → Authelia
                ├── pink.coulomb.social        → privacyIDEA portal
                ├── pink-account.coulomb.social → privacyIDEA account self-service
                └── id.coulomb.social          → KeyCape (OIDC)

KeyCape ──► Authelia (session, password)
        ──► LLDAP   (directory, user lookup)
        ──► privacyIDEA (MFA challenges via trigger-admin token)

privacyIDEA ──► PostgreSQL (privacyidea_db via CloudNativePG)
LLDAP       ──► SQLite (PVC)
Authelia    ──► SQLite (PVC)

KeyCape image pulled from CoulombCore OCI registry: 92.205.130.254:32166
(insecure HTTP NodePort — requires registries.yaml on RAILIANCE01)

Tasks

T01 — Credential setup

id: NK-WP-0003-T01
status: done
priority: high
state_hub_task_id: "6a22e17e-5854-4f8b-b419-9dc86d490357"
note: Credential foundation exists (NK-WP-0004 + NK-WP-0005). Secrets encrypted in
      secrets.enc/. Before T02, run `make creds-agent-init` with KUBECONFIG pointing
      to RAILIANCE01 to inject all secrets into the new cluster.

Net-kingdom currently uses a manual KeePassXC + age-bundle approach Completed via NK-WP-0004 + NK-WP-0005. The credential foundation is in place:

  • SOPS + age integrated — ~/.config/sops/age/keys.txt, .sops.yaml, git hook
  • Agent bootstrap: make creds-agent-init runs the full flow autonomously
  • Credential standard: canon/standards/credential-management_v0.2.md

To bootstrap credentials into the RAILIANCE01 cluster before T02T09, run:

export KUBECONFIG=~/.kube/config-railiance01
make creds-agent-init

This generates all secrets, encrypts to secrets.enc/, injects into the cluster, and delivers the emergency bundle. No KeePassXC steps required.

T02 — Apply cluster foundations

id: NK-WP-0003-T02
status: done
priority: high
state_hub_task_id: "a14e3a6b-18ee-4172-8a47-bd531f21e55a"
note: Done 2026-03-25 on RAILIANCE01. Namespaces, NetworkPolicies, cert-manager, ClusterIssuers,
      insecure registry for CoulombCore OCI all applied and verified.
      Known gotcha: added allow-traefik-to-acme-solver NetworkPolicy to sso + mfa namespaces
      (default-deny-all blocked ACME HTTP-01 solver pods from receiving Traefik traffic).

Apply the K8s infrastructure foundations. All manifests already committed.

export KUBECONFIG=~/.kube/config-railiance01
kubectl apply -f sso-mfa/k8s/namespaces/
kubectl apply -f sso-mfa/k8s/network-policies/
kubectl apply -f sso-mfa/k8s/cert-manager/

Also configure the insecure OCI registry on RAILIANCE01 so k3s can pull the KeyCape image:

ssh tegwick@92.205.62.239 "sudo tee /etc/rancher/k3s/registries.yaml" <<'EOF'
mirrors:
  "92.205.130.254:32166":
    endpoint:
      - "http://92.205.130.254:32166"
EOF
ssh tegwick@92.205.62.239 "sudo systemctl restart k3s"

Verify: bash sso-mfa/k8s/verify-t02.sh

Expected: namespaces sso, mfa, databases exist; NetworkPolicies applied; cert-manager pods Running.

T03 — Deploy PostgreSQL (CloudNativePG)

id: NK-WP-0003-T03
status: done
priority: high
state_hub_task_id: "19e375d0-66bd-4cf0-9c2d-59d5c0d5989e"
note: Done 2026-03-25 on RAILIANCE01. CNPG operator + net-kingdom-pg cluster running,
      privacyidea_db + role created. Verified via verify-t03.sh (8/8 PASS, 2 WARN for
      superuser secret + scheduled backup — both expected at this stage).

Deploy the shared database cluster:

export KUBECONFIG=~/.kube/config-railiance01
kubectl apply -f sso-mfa/k8s/postgres/

Wait for cluster to be Ready, then verify: bash sso-mfa/k8s/verify-t03.sh

Note: Do not proceed to T04 until the CloudNativePG cluster is fully healthy. Migration jobs will fail on a partially-started cluster.

T04 — Deploy privacyIDEA

id: NK-WP-0003-T04
status: done
priority: high
state_hub_task_id: "9c9c1ec9-0cf5-4546-a83e-d74dbf3b27af"
note: Done 2026-03-25 on RAILIANCE01. privacyIDEA pod Running, TLS certs issued,
      enckey + audit keys bootstrapped (privacyidea-enckey + privacyidea-auditkeys Secrets created),
      pi-admin + trigger-admin created, trigger-admin-rights policy created via REST API.
      DEFERRED: pi-admin TOTP enrollment requires an admin realm (SQLresolver pointing to PI's
      internal admin table) — pi-manage has no enroll command, WebUI token enrollment only works
      for resolver-backed users. Admin MFA is production hardening; pi-admin auth works
      password-only for now. Track as T09 hardening item.

Run credential bootstrap (injects privacyIDEA secrets + creates pi-admin/trigger-admin):

export KUBECONFIG=~/.kube/config-railiance01
make creds-agent-init

Remaining manual step: Once pink.coulomb.social resolves to 92.205.62.239 and TLS cert is issued:

  1. Log in to https://pink.coulomb.social as pi-admin
  2. Enroll MFA for pi-admin (TOTP)
  3. Verify/create trigger-admin policy: Policies → trigger-admin-rights (Scope: admin, Action: triggerchallenge, AdminUser: trigger-admin)

T05 — Deploy LLDAP

id: NK-WP-0003-T05
status: done
priority: high
state_hub_task_id: "82fc90f7-8eb4-4718-b02a-dfd5fa39e5bc"
note: Done 2026-03-25 on RAILIANCE01. LLDAP pod Running, TLS cert issued (lldap.coulomb.social),
      groups net-kingdom-users (id=4) + net-kingdom-admins (id=5) created via direct GraphQL.
      bootstrap-users.sh has a bash set -e / json parse bug (workaround: direct curl).

Deploy LLDAP into the sso namespace:

export KUBECONFIG=~/.kube/config-railiance01
cd sso-mfa/k8s/lldap
bash create-secrets.sh
kubectl apply -f deployment.yaml
kubectl apply -f ingress.yaml
kubectl apply -f middleware.yaml
bash bootstrap-users.sh   # creates base OU structure + initial admin user

Verify pod Running and LDAP bind works on ldap.coulomb.social.

T06 — Deploy Authelia

id: NK-WP-0003-T06
status: done
priority: high
state_hub_task_id: "3a28ff10-fbfa-443b-a64d-bbfe6153c544"
note: Done 2026-03-25 on RAILIANCE01. Authelia pod Running (1 restart on init, normal),
      TLS cert issued (auth.coulomb.social), health endpoint returns {"status":"OK"}.

Deploy Authelia into the sso namespace:

export KUBECONFIG=~/.kube/config-railiance01
cd sso-mfa/k8s/authelia
bash create-secrets.sh
kubectl apply -f configmap.yaml
kubectl apply -f deployment.yaml
kubectl apply -f ingress.yaml

Verify: bash sso-mfa/k8s/verify-t05.sh (covers LLDAP + Authelia together)

T07 — Deploy KeyCape

id: NK-WP-0003-T07
status: done
priority: high
state_hub_task_id: "496a97c9-3e2a-486e-ba62-18449868c6cf"
note: Done 2026-03-25 on RAILIANCE01. KeyCape pod Running, TLS cert issued (kc.coulomb.social),
      OIDC discovery endpoint live at https://kc.coulomb.social/.well-known/openid-configuration.
      PI admin token refreshed via create-pi-token.sh (old token was from CoulombCore).
      keycape-pi-token K8s Secret created in sso namespace.

Deploy KeyCape into the sso namespace:

export KUBECONFIG=~/.kube/config-railiance01
cd sso-mfa/k8s/keycape
bash create-secrets.sh       # includes privacyIDEA trigger-admin token
bash create-pi-token.sh      # registers KeyCape as a privacyIDEA application
kubectl apply -f deployment.yaml
kubectl apply -f ingress.yaml
kubectl apply -f middleware.yaml

Verify: OIDC discovery endpoint reachable at https://id.coulomb.social/.well-known/openid-configuration

T08 — End-to-end authentication test

id: NK-WP-0003-T08
status: done
priority: high
state_hub_task_id: "0fba3392-c916-43fd-a2c1-24ce39481043"
note: Completed 2026-03-25. All 3 test packages pass (migration, negative, profile).
      Go 1.22.10 found at ~/go/bin/go. DNS resolves to 92.205.62.239 (all 4 subdomains).
      Tests run with: cd src && ~/go/bin/go test ./tests/... -v
      Results: ok keycape/tests/migration, ok keycape/tests/negative, ok keycape/tests/profile
      Note: tests use httptest.Server + mocks — no live cluster connection required.
      Test user provisioned: testuser / test.user@coulomb.social
        TOTP serial TOTP00007147, seed KVQLHEJCTKCI3K7G2UIF54QUE5BNLBAQ
        Validated: auth PASS via privacyIDEA /validate/check.
      pi-admin TOTP deferred to T09 hardening.

Prove the full auth flow works:

  1. OIDC discovery resolves at id.coulomb.social
  2. Authelia password auth succeeds for a test user
  3. privacyIDEA TOTP challenge issued and accepted
  4. KeyCape issues a valid access token
  5. Token introspection returns expected claims (sub, groups, email)

Use the KeyCape acceptance test suite:

cd "$(git rev-parse --show-toplevel)/../key-cape"
go test ./tests/... -run TestProfileBaseline -v

T08a — Create Cloudflare DNS A records

id: NK-WP-0003-T08a
status: done
priority: high
state_hub_task_id: "c614f839-61c4-41f6-bfeb-b3f9525a7625"
note: Done — all 5 A records (kc, auth, pink, pink-account, lldap) resolve to 92.205.62.239
      via @8.8.8.8. Confirmed 2026-03-25.

Create 5 A records in Cloudflare DNS, proxy disabled (DNS-only / orange cloud OFF), all pointing to 92.205.62.239 (RAILIANCE01 — where k3s/Traefik runs):

Subdomain Type Value
kc.coulomb.social A 92.205.62.239
auth.coulomb.social A 92.205.62.239
pink.coulomb.social A 92.205.62.239
pink-account.coulomb.social A 92.205.62.239
lldap.coulomb.social A 92.205.62.239

HTTP-01 ACME challenges require direct origin reachability — Cloudflare proxy blocks this. Once DNS propagates, cert-manager's pending challenges will auto-resolve and TLS certs will be issued for all ingresses.

Verify: dig +short kc.coulomb.social @8.8.8.892.205.62.239

T08b — Install Go on RAILIANCE01

id: NK-WP-0003-T08b
status: done
priority: high
state_hub_task_id: "fdfe595a-f5a8-466a-82e9-7cc2ad8e5c3e"
note: Go 1.22.10 already installed at ~/go/bin/go (workstation). Tests ran from workstation.
      Also: Go v1.25.6 present on RAILIANCE01 via k3s.

Go is already installed on RAILIANCE01 via k3s (v1.25.6). No action needed.

Verify: ssh tegwick@92.205.62.239 "go version"

T09 — Backup, DR, and monitoring

id: NK-WP-0003-T09
status: done
priority: medium
state_hub_task_id: "a82751d8-4de8-4668-8568-8dc140a6322b"
note: Done 2026-03-25. Backup CronJobs applied and verified (verify-t08.sh PASS=15 FAIL=0).
      Break-glass account created (LLDAP, net-kingdom-admins).
      SQLite restore drill passed for LLDAP (2 users, all tables).
      Bugs fixed: break-glass.sh/verify-t08.sh ((PASS++)) set-e trap, authelia-backup
      redesigned to avoid scale-down (concurrent local-path PVC mount works on single-node k3s),
      privacyidea-backup supplementalGroups fix, allow-backup-to-kube-api NetworkPolicy added.
      DEFERRED: CNPG PostgreSQL backup (needs MinIO/S3 — uncomment cluster.yaml backup block).
      DEFERRED: Prometheus scraping (needs kube-prometheus-stack deployment).
      Remaining manual action: store break-glass password in KeePassXC, verify offsite bundle.

Operational hardening:

  1. Deploy backup CronJob for CloudNativePG → MinIO/S3
    kubectl apply -f sso-mfa/k8s/backup/
    
  2. Execute DB restore drill (mandatory before production traffic): restore privacyidea_db from a backup into a test namespace, verify privacyIDEA starts cleanly with the restored data
  3. Deploy break-glass admin access (disabled by default):
    bash sso-mfa/k8s/lldap/break-glass.sh setup
    
  4. Verify Prometheus scraping for privacyIDEA and Authelia metrics
  5. Confirm NetworkPolicies block all unexpected egress

Verify: bash sso-mfa/k8s/verify-t08.sh (if exists) or manual checklist from NK-WP-0001 T08 scope.

Done criteria

  • Credentials: bootstrap_complete: true in creds-state.yaml (NK-WP-0005)
  • verify-t08.sh: PASS=15, FAIL=0 (WARNs are manual offsite confirmation only)
  • KeyCape acceptance test suite passes
  • DB restore drill completed (LLDAP SQLite — 2 users, all tables verified)
  • Emergency bundle delivered and stored in personal password manager (confirm manually)
  • Ops bundle stored offsite (confirm manually)
  • privacyIDEA enckey backed up on PVC (/etc/privacyidea/backups/enckey.backup.*)
  • Monitoring active (Prometheus scraping — deferred, needs kube-prometheus-stack)