net-kingdom/workplans/NK-WP-0001-sso-mfa-platform.md
tegwick 534906d509 docs(workplan): update NK-WP-0001 with resolved decisions D1/D2/D3
- Add Decisions table summarising D1 (KeePassXC→Vault), D2 (Keycloak-internal
  hybrid + file-based bootstrap), D3 (plain Helm, AI-first philosophy)
- Split T01 into Phase 0a (pre-cluster KeePassXC) and Phase 0b (in-cluster
  Vault transition) per D1
- Update T05 to explicitly reference D3 (plain Helm first)
- Update T06 to state the D2 identity decision rather than re-opening it
- Update T07: remove "decide" language, implement decided approach, add
  D2 bootstrap user management scope note
- Update T08: add Vault unseal key backup to the backup list
- Replace Open Questions with remaining unresolved items (5 items)
- Add DECISIONS.md (decision log auto-generated by State Hub)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-01 22:51:11 +01:00


id: NK-WP-0001
type: workplan
title: SSO & MFA Platform — Keycloak + privacyIDEA on Kubernetes
domain: netkingdom
status: active
owner: worsch
topic_slug: netkingdom
state_hub_workstream_id: 39263c4b-ef70-4053-b782-350834b7e1be
created: 2026-02-28
updated: 2026-03-01

SSO & MFA Platform — Keycloak + privacyIDEA on Kubernetes

Summary

Deploy a hardened SSO and MFA platform on Kubernetes: Keycloak as the OIDC/SAML identity provider, privacyIDEA as the MFA/token engine, integrated via the privacyIDEA Keycloak Provider. This is the foundational security layer for the net-kingdom DevSecOps platform.

Context

Synthesised from two AI protoplans (wiki/WorkplanOneChatgpt.md and wiki/WorkplanOneGrok.md). Both sources converge on the same architecture; this plan picks the most concrete and production-aligned choices from each:

  • Single-credential bootstrap (Grok) — one master secret unlocks the vault; all other credentials are vault-managed and never typed manually.
  • Phase structure (ChatGPT) — eight sequential phases reducing blast radius at each step.
  • Tooling choices (both) — Keycloak Operator or codecentric Helm, gpappsoft privacyIDEA Helm, CloudNativePG for PostgreSQL, cert-manager for TLS, Traefik as ingress (K3s native, aligned with Railiance).
  • Custom Keycloak image (both) — JAR baked into image via kc.sh build rather than kubectl cp; clean GitOps pattern.

Decisions

All three pending decisions from the first session have been resolved (2026-03-01, decided by Tegwick). Full rationale in DECISIONS.md.

| ID | Decision | Outcome |
|----|----------|---------|
| D1 | Vault backend | KeePassXC pre-cluster → HashiCorp Vault in-cluster. Bootstrap on KeePassXC before a cluster is available; transition to Vault once K3s is operational. |
| D2 | Identity source of truth | Hybrid: Keycloak-internal + LDAP/Entra federation for enterprise tier. Plus a file-based bootstrap user store for pre-Keycloak dev/test/sandbox systems. |
| D3 | GitOps tooling | Plain Helm to start, upgrade to Flux when warranted. Development philosophy: AI-first (TDD, API-first/headless, MCP layer, CLI tooling; UI is low-priority and lives in separate repos). |

Architecture

                  Internet
                     │ TLS (cert-manager / Let's Encrypt)
              ┌──────┴──────┐
              │   Traefik   │  (K3s native ingress)
              └──┬───────┬──┘
                 │       │
         keycloak.…  pi.…   pi-account.…
                 │       │         │
          ┌──────┘  ┌────┘         │
          ▼         ▼              │
      [Keycloak]  [privacyIDEA]◄──┘  (self-service portal)
          │         │
          └────┬────┘
               ▼
          [PostgreSQL]  (CloudNativePG, namespace: databases)
               │
          [HashiCorp Vault]  ← single credential unlocks (in-cluster)
          [KeePassXC]        ← pre-cluster bootstrap / dev/test/sandbox

Namespaces: sso (Keycloak), mfa (privacyIDEA), databases

Integration: Keycloak runs the browser login flow; privacyIDEA provides MFA via the privacyIDEA Keycloak Provider JAR (baked into custom image).

Dependencies

  • Depends on: railiance/three-phoenix-ha-cluster — full production deployment targets the ThreePhoenix K3s HA cluster. Development/staging can proceed on a single-node k3s instance.
  • Depends on: railiance/phase-0-operational-baseline — cert-manager, TLS, backup strategy must be operational before going live.

Tasks

T01 — Phase 0: Vault & secret bootstrap (single-credential principle)

id: NK-WP-0001-T01
state_hub_task_id: 7992528c-d533-44e5-bcce-f92aaa2b75b2
status: todo
priority: critical

Decision D1 applies: Two-phase vault strategy.

Phase 0a — Pre-cluster KeePassXC bootstrap (do this first, before K8s):

Create a KeePassXC .kdbx database as the initial secret store. Keep the KeePassXC master password in a personal password manager. Generate and store all bootstrap secrets inside KeePassXC:

  • privacyIDEA: SECRET_KEY (64+ chars), PI_PEPPER (32+ chars), PI_ENCFILE content (pi-manage create_enckey).
  • PostgreSQL: superuser (postgres) + keycloak + privacyidea user passwords.
  • Keycloak: admin bootstrap secret + DB password.
  • TLS: ACME account key (if not delegated fully to cert-manager).
  • Break-glass: admin credentials + offline recovery OTP seed.

Export an age-encrypted ops bundle (encrypted tar of all secret YAML manifests). Store offsite.

Phase 0b — HashiCorp Vault in-cluster (after T02, once K3s is running):

Deploy HashiCorp Vault in the cluster (Helm chart). Migrate secrets from KeePassXC into Vault. Enable encryption at rest for Kubernetes Secrets. Choose and implement a secret injection strategy: External Secrets Operator (ESO) with a Vault backend, or Vault Agent Injector (ESO preferred for GitOps alignment). KeePassXC remains the source of truth for dev/test/sandbox systems that do not connect to the cluster Vault.
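
The ESO option could look roughly like the following. This is a minimal sketch, not a decided design: it assumes External Secrets Operator is installed and Vault's Kubernetes auth method is enabled. The store name, Vault address, KV mount, role, and ServiceAccount are all hypothetical, and the apiVersion depends on the installed ESO release (newer releases serve external-secrets.io/v1).

```yaml
# Sketch only. Every name marked "hypothetical" or "assumed" is an illustration,
# not something defined elsewhere in this workplan.
apiVersion: external-secrets.io/v1beta1
kind: SecretStore
metadata:
  name: vault-backend                  # hypothetical store name
  namespace: mfa
spec:
  provider:
    vault:
      server: "http://vault.vault.svc:8200"   # assumed in-cluster Vault address
      path: "secret"                          # assumed KV mount
      version: "v2"                           # KV secrets engine version 2
      auth:
        kubernetes:
          mountPath: "kubernetes"
          role: "mfa-reader"                  # hypothetical Vault role
          serviceAccountRef:
            name: "external-secrets"          # hypothetical ServiceAccount
---
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: privacyidea-secrets
  namespace: mfa
spec:
  refreshInterval: 1h                  # how often ESO re-reads Vault
  secretStoreRef:
    name: vault-backend
    kind: SecretStore
  target:
    name: privacyidea-secrets          # K8s Secret that ESO creates and keeps in sync
  data:
    - secretKey: SECRET_KEY
      remoteRef:
        key: privacyidea               # assumed path under the KV mount
        property: SECRET_KEY
    - secretKey: PI_PEPPER
      remoteRef:
        key: privacyidea
        property: PI_PEPPER
```

With this shape, only the Vault path and role live in git; the secret values themselves never do, which is the main argument for ESO under a GitOps workflow.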

Done when: KeePassXC created and all secrets generated (0a). Vault deployed in-cluster, secrets migrated, injection strategy operational (0b). Encrypted ops bundle exported and stored offsite.


T02 — Phase 1: K8s foundations (namespaces, NetworkPolicies, cert-manager)

id: NK-WP-0001-T02
state_hub_task_id: 721ca6b2-0cf4-4008-a966-87b1563550fa
status: todo
priority: high

Create namespaces: sso, mfa, databases. Verify cert-manager is installed and functional on the K3s cluster (Traefik ingress). Define and apply NetworkPolicies to prevent lateral movement:

  • Keycloak accepts traffic only from the ingress controller and app pods.
  • privacyIDEA accepts traffic only from the ingress controller (WebUI, self-service portal) and from Keycloak pods (API calls).
  • DB pods are reachable only from the sso and mfa namespaces.
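
As an illustration of the database rule in the list above, a default-deny policy plus one allow rule might look like this. It is a sketch under assumptions: it relies on the kubernetes.io/metadata.name label that recent Kubernetes versions set on every namespace, and the policy names are hypothetical.

```yaml
# Deny all ingress to pods in the databases namespace by default.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress           # hypothetical name
  namespace: databases
spec:
  podSelector: {}                      # empty selector = every pod in the namespace
  policyTypes:
    - Ingress
---
# Then allow PostgreSQL traffic only from the sso and mfa namespaces.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-postgres-from-sso-and-mfa   # hypothetical name
  namespace: databases
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: sso
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: mfa
      ports:
        - protocol: TCP
          port: 5432
```

A database operator may need additional allow rules (for example from its own namespace, or between replicas); those are not shown here and should be checked against the operator's documentation.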

Verify StorageClass for PVCs.

Done when: namespaces exist, NetworkPolicies applied and tested (verify denied paths), cert-manager issues a test certificate.
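
The test certificate could be requested with a manifest along these lines. This is a sketch: the ClusterIssuer name and the hostname are placeholders, and it assumes an issuer has already been configured in cert-manager.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: test-cert                      # hypothetical name
  namespace: sso
spec:
  secretName: test-cert-tls            # Secret where cert-manager stores the key pair
  dnsNames:
    - test.yourdomain.com              # placeholder hostname
  issuerRef:
    name: letsencrypt-staging          # hypothetical ClusterIssuer
    kind: ClusterIssuer
```

The check passes once the Certificate resource reports a Ready condition of True; using a staging issuer for the test avoids burning production rate limits.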


T03 — Phase 2: PostgreSQL deployment (Keycloak + privacyIDEA DBs)

id: NK-WP-0001-T03
state_hub_task_id: 7fa60004-deb2-4db5-a470-f95dda07f6ab
status: todo
priority: high

Deploy PostgreSQL via CloudNativePG operator (preferred: aligns with ThreePhoenix HA posture) or Bitnami Helm chart as fallback. Create:

  • Database keycloak_db, user keycloak
  • Database privacyidea_db, user privacyidea

Store DB credentials as K8s Secrets injected from Vault (T01 Phase 0b must be complete, or use placeholder K8s Secrets until Vault is live). Configure automated DB backups to object storage (S3 or MinIO). Run a restore drill before proceeding — a failed restore later is a critical blocker.
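
For the CloudNativePG path, a Cluster resource for the Keycloak database might be sketched as follows. Cluster name, Secret names, bucket, endpoint, sizes, and retention are assumptions for illustration. The in-tree barmanObjectStore section shown here is superseded by a plugin in newer CloudNativePG releases, so check the installed version.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-pg                    # hypothetical cluster name
  namespace: databases
spec:
  instances: 3                         # assumed HA sizing; 1 is enough on single-node k3s
  storage:
    size: 10Gi                         # assumed size
  bootstrap:
    initdb:
      database: keycloak_db
      owner: keycloak
      secret:
        name: keycloak-db-credentials  # hypothetical basic-auth Secret (username/password)
  backup:
    retentionPolicy: "30d"             # assumed retention
    barmanObjectStore:
      destinationPath: s3://pg-backups/keycloak      # hypothetical bucket
      endpointURL: http://minio.minio.svc:9000       # assumed MinIO endpoint
      s3Credentials:
        accessKeyId:
          name: backup-s3-creds        # hypothetical Secret
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: backup-s3-creds
          key: ACCESS_SECRET_KEY
```

The initdb bootstrap creates a single application database per Cluster, so privacyidea_db would need a second Cluster of the same shape or another provisioning mechanism.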

Done when: both DBs live, credentials in K8s Secrets, backup running, restore drill passed.


T04 — Phase 3: Deploy privacyIDEA (MFA core)

id: NK-WP-0001-T04
state_hub_task_id: 6ad1296a-a488-4031-b665-f77030e971ed
status: todo
priority: high

Deploy privacyIDEA via gpappsoft/privacyidea Helm chart (Artifact Hub) or custom manifests (Deployment + Service + Ingress + PVC + Secrets). Key Helm values:

database:
  password: <from-vault>
privacyidea:
  config:
    SECRET_KEY: <from-vault>
    PI_PEPPER: <from-vault>
  encfile:
    enabled: true
    existingSecret: privacyidea-secrets
    key: PI_ENCFILE
  ingress:
    enabled: true
    hostname: pi.yourdomain.com
    tls: true

Create K8s Secrets: privacyidea-config, privacyidea-enckey, privacyidea-auditkeys. Configure Ingress + TLS. Add rate-limiting and WAF rules at Traefik level.
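
Rate-limiting at the Traefik level can be expressed as a Middleware resource. The sketch below assumes the Traefik CRDs that ship with K3s; the name and the limits are illustrative, and older Traefik v2 installations use the traefik.containo.us/v1alpha1 API group instead.

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: pi-ratelimit                   # hypothetical name
  namespace: mfa
spec:
  rateLimit:
    average: 20                        # assumed sustained requests per second per source
    burst: 40                          # assumed short-term burst allowance
```

To take effect, the middleware is referenced from the privacyIDEA Ingress with the annotation traefik.ingress.kubernetes.io/router.middlewares: mfa-pi-ratelimit@kubernetescrd. WAF rules are not covered by this resource and need a separate component.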

Bootstrap (single-credential moment):

  1. kubectl exec into the privacyIDEA pod and run pi-manage admin add pi-admin; the password comes from the vault (this is the only time a password is typed manually).
  2. Immediately enroll MFA for pi-admin (TOTP or hardware token).
  3. Create trigger-admin with triggerchallenge right only.
  4. Apply policies: WebUI restricted to VPN/office IPs; MFA required for all admin actions.

Done when: privacyIDEA reachable at pi.yourdomain.com with valid TLS, pi-admin enrolled with MFA, trigger-admin created, rate-limiting active.


T05 — Phase 4: Deploy Keycloak (SSO core)

id: NK-WP-0001-T05
state_hub_task_id: b9f73aa6-9035-4643-9905-64e73a29b298
status: todo
priority: high

Build a custom Keycloak image that includes the privacyIDEA Provider JAR:

FROM quay.io/keycloak/keycloak:<version>
COPY PrivacyIDEA-Provider.jar /opt/keycloak/providers/
RUN /opt/keycloak/bin/kc.sh build

Deploy with plain Helm (codecentric KeycloakX chart) or the official CRD-based Keycloak Operator. Decision D3 applies: plain Helm first, Flux later. Configure:

  • DB: keycloak_db (credentials from Vault / K8s Secret)
  • Ingress + TLS: keycloak.yourdomain.com (Traefik + cert-manager)
  • Hostname strictness + proxy mode (Traefik forwarded headers)
  • Metrics/logging (Prometheus annotations)
  • Admin bootstrap secret from vault
  • Realm import strategy: GitOps-friendly (realm JSON in git or CR)
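
Several of the settings above map onto Keycloak's KC_* environment variables. The fragment below is a sketch assuming Keycloak 24 or newer (older versions use KC_PROXY instead of KC_PROXY_HEADERS); the database service name and Secret name are hypothetical, and how the env list is passed in depends on the chosen chart.

```yaml
# Container env fragment for the Keycloak pod (sketch, names are assumptions).
env:
  - name: KC_DB
    value: postgres
  - name: KC_DB_URL
    # hypothetical CloudNativePG read-write service in the databases namespace
    value: jdbc:postgresql://keycloak-pg-rw.databases.svc:5432/keycloak_db
  - name: KC_DB_USERNAME
    valueFrom:
      secretKeyRef:
        name: keycloak-db-credentials  # hypothetical Secret synced from Vault
        key: username
  - name: KC_DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: keycloak-db-credentials
        key: password
  - name: KC_HOSTNAME
    value: keycloak.yourdomain.com
  - name: KC_HOSTNAME_STRICT
    value: "true"
  - name: KC_PROXY_HEADERS
    value: xforwarded                  # trust X-Forwarded-* headers set by Traefik
  - name: KC_HTTP_ENABLED
    value: "true"                      # TLS terminates at Traefik, plain HTTP inside the cluster
  - name: KC_HEALTH_ENABLED
    value: "true"
  - name: KC_METRICS_ENABLED
    value: "true"
```

Health and metrics are build-time options in Keycloak, so with a pre-built custom image they may also need to be set before kc.sh build runs in the image.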

Done when: Keycloak reachable with valid TLS, admin console accessible, custom image with privacyIDEA JAR deployed and verified.


T06 — Phase 5: Realm config & MFA authentication flow

id: NK-WP-0001-T06
state_hub_task_id: 3b6379a4-a27b-4d25-82be-bc600879f036
status: todo
priority: medium

In Keycloak:

  1. Create/configure realm. Decision D2 applies: identity source of truth is Keycloak-internal users. LDAP/AD and Entra federation is deferred to the enterprise tier (not in scope for this workplan phase).
  2. Create Authentication Flow "privacyIDEA Browser":
    • Add privacyIDEA execution step (REQUIRED)
    • Config: privacyIDEA URL = https://pi.yourdomain.com, service account = trigger-admin (secret from K8s Secret)
    • Optional: bypass group (break-glass) with strict restrictions + alerts
  3. Set this flow as the default browser flow.
  4. Require MFA step-up for admin console and sensitive OIDC clients.

Test:

  • Normal user: password → MFA OTP → session established
  • Admin console: MFA required
  • Failure modes: wrong OTP, token missing, privacyIDEA unreachable
  • Break-glass: bypass works, alert fires

Done when: end-to-end auth works for normal and admin paths, all failure modes handled gracefully.


T07 — Phase 6: User management, policies & self-service portal

id: NK-WP-0001-T07
state_hub_task_id: c7cf902a-b480-4545-a536-293070945206
status: todo
priority: medium

Decision D2 applies: identity source of truth is Keycloak-internal with the privacyIDEA Keycloak resolver. Implement (not decide):

  • Configure privacyIDEA 3.12+ Keycloak user resolver to align Keycloak users with privacyIDEA token ownership.
  • LDAP/Entra federation: explicitly out of scope for this phase; tracked as an enterprise-tier extension point.

Define policies in privacyIDEA:

  • Allowed token types: TOTP, hardware (YubiKey), passkey
  • Enrollment rules (who can self-enroll, which token types)
  • Admin rights separation: super-admin vs. helpdesk-admin

Enable self-service portal at pi-account.yourdomain.com for user token enrollment/replacement.

Configure auditing and log shipping: privacyIDEA audit logs and Keycloak events go to centralized logging (ELK/Loki or equivalent). Define token lifecycle policies covering enrollment, revocation, and re-enrollment on device loss.

Bootstrap user management (D2 extension, scope TBD): D2 also specifies a lightweight file-based user store for pre-Keycloak systems (dev/test/sandbox systems that do not connect to the cluster). Users are stored as files in a secure subdirectory of the Linux home directory. The store auto-generates two test users, distinguished by an N suffix on the username and a +testN suffix on the email address. Test users must not spill over into other systems, and a mechanism for mapping sandbox identities to production identities should be provided. This scope is not yet captured in a task; see Open Questions.

Done when: policies documented and applied, self-service portal live, audit logs flowing, Keycloak resolver configured.


T08 — Phase 7: Backups, DR, break-glass & monitoring

id: NK-WP-0001-T08
state_hub_task_id: 9cbd1d89-b5bf-491e-9d16-b1c7d57076fb
status: todo
priority: medium

Backups:

  • DB backups: Keycloak + privacyIDEA (Velero or CloudNativePG scheduled backup to S3/MinIO). Test restore.
  • privacyIDEA encryption/audit key Secrets: encrypted export, versioned.
  • Keycloak realm exports: stored as JSON in git (GitOps-friendly).
  • Vault unseal keys and root token: offline copy in KeePassXC.
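
If CloudNativePG handles the database backups, the schedule itself could be declared like this. The sketch assumes a Cluster named keycloak-pg (hypothetical) that already has an object-store backup target configured.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: keycloak-pg-nightly            # hypothetical name
  namespace: databases
spec:
  schedule: "0 0 2 * * *"              # six fields, seconds first: every day at 02:00:00
  backupOwnerReference: self           # Backup objects are owned by this ScheduledBackup
  cluster:
    name: keycloak-pg                  # hypothetical Cluster to back up
```

Note the six-field cron format with a leading seconds field; a five-field expression copied from a regular crontab would be interpreted differently.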

Disaster recovery drill (mandatory before production):

  1. Restore DB + keys into a fresh namespace.
  2. Verify token validation still works — this catches key/secret mistakes.

Break-glass procedure:

  • Disabled-by-default Keycloak admin path or group exemption.
  • Break-glass credentials stored offline + vault. Alert (PagerDuty/webhook) on every use.

Monitoring:

  • Prometheus scraping Keycloak + privacyIDEA metrics.
  • Grafana dashboards: auth success/failure rates, MFA challenge latency, token count by type.
  • Alert: privacyIDEA unreachable (blocks all logins).
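
The unreachable alert could be sketched as a PrometheusRule. This assumes the Prometheus Operator is installed and that privacyIDEA is scraped (directly or through a probe exporter) under a job label of privacyidea; both are assumptions, as are the rule name and thresholds.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: privacyidea-availability       # hypothetical name
  namespace: mfa
spec:
  groups:
    - name: privacyidea
      rules:
        - alert: PrivacyIDEAUnreachable
          # Fires when the target is down, or when it has vanished from scraping entirely.
          expr: up{job="privacyidea"} == 0 or absent(up{job="privacyidea"})
          for: 2m                      # assumed grace period
          labels:
            severity: critical
          annotations:
            summary: "privacyIDEA is unreachable; MFA logins will fail"
```

The absent() clause matters because a plain up == 0 check stays silent if the scrape target is removed altogether, for example after a failed rollout.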

Final validation:

  • All external traffic: Ingress + HSTS + strict TLS.
  • NetworkPolicies verified (no unintended open paths).
  • End-to-end: app → Keycloak → privacyIDEA OTP → SSO session established.

Done when: DR drill passed, monitoring live, break-glass procedure documented and tested, HSTS and NetworkPolicies verified.


Deliverables Checklist

  • KeePassXC vault created; all secrets generated and encrypted ops bundle exported
  • HashiCorp Vault deployed in-cluster; secrets migrated from KeePassXC
  • Secret injection strategy chosen and operational (ESO + Vault or Vault Agent)
  • sso, mfa, databases namespaces + NetworkPolicies deployed
  • TLS everywhere via cert-manager (Traefik ingress)
  • PostgreSQL live; both DBs created; backup + restore tested
  • privacyIDEA running at pi.yourdomain.com; pi-admin MFA enrolled; trigger-admin created with least-privilege rights
  • Keycloak running from custom image including privacyIDEA Provider JAR
  • Keycloak "privacyIDEA Browser" flow enforced as default
  • Realm exported to git; admin secret from vault
  • Self-service portal live; token lifecycle policies defined
  • DR drill passed; monitoring live; break-glass documented and tested

Open Questions

The three original pending decisions (D1 vault backend, D2 identity source of truth, D3 GitOps tooling) have all been resolved. See DECISIONS.md.

Remaining open items:

  1. Secret injection strategy — D1 resolves the vault backend (Vault in-cluster) but the concrete injection mechanism is still open: External Secrets Operator vs Vault Agent Injector. Should be decided and closed in T01 Phase 0b.

  2. File-based bootstrap user management (D2 extension) — D2 specifies a lightweight file-based user store for pre-Keycloak environments. This is non-trivial scope (file format, test-user generation, isolation controls, production-mapping mechanism) and is not captured in any current task. Needs a decision: is this a task within this workplan, or a separate workplan/repo?

  3. AI-first / MCP layer (D3 extension) — D3 establishes an AI-first development philosophy (TDD, API-first/headless, MCP layer, CLI tooling). This workplan currently covers only infrastructure deployment. Should Keycloak/privacyIDEA operations (user management, policy CRUD, token lifecycle) be wrapped in an MCP server or CLI? If so, this needs a new task or workplan.

  4. LDAP/Entra federation — Explicitly deferred to the enterprise tier (D2). Track as an extension point when the time comes.

  5. Cluster target for dev/test — D1 implies KeePassXC-based systems run independently of the cluster. The plan assumes single-node k3s for dev and ThreePhoenix for production. The sequencing between T01 Phase 0a (pre-cluster) and Phase 0b (in-cluster) should be confirmed once the Railiance cluster timeline is clearer.