From 3466c431ddb8e9144f6ed53ec7981b320b65b07d Mon Sep 17 00:00:00 2001
From: tegwick <bernd.worsch@gmail.com>
Date: Wed, 3 Jun 2026 16:31:08 +0200
Subject: [PATCH] feat(NET-WP-0018-T03): initial retrospective + gap matrix
 (substantial draft)

- Created docs/security-bootstrap-retrospective.md:
  - Exec summary (wins: S6, console/0019 automation, evidence discipline, T02 arch doc; gaps: UE adapters, consolidated guide, tests, proper audit)
  - 9 bumps with diagnosis/now-automated?/remaining (realm, OIDC callbacks, LLDAP claims, OpenBao mapping, tokens, operator-state, secret taint pre-0019, audit correlation, etc.)
  - Full gap matrix table (areas, current status incl. 0019/T02, remaining, priorities)
  - Recommendations (T05 consolidate guide, T07/T08 use 0019+T02 as fixtures, T09 classify UE risk, continue pragmatic)
  - Refs to T02 doc, 0017/0019 evidence, console, assessment gap 7, pragmatic records
- Updated workplan T03 with progress note (still in_progress for expansion)
- Pragmatic tracking: progress events (with task), file notes, commit
- Builds directly on T02 + prior 0017/0019 + Coordination pragmatic guidance

Feeds T05/T06/T08/T09. Next tasks can reference this + T02.
---
 docs/security-bootstrap-retrospective.md      | 138 ++++++++++++++++++
 ...tstrap-automation-and-rebuild-readiness.md |   6 +-
 2 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 docs/security-bootstrap-retrospective.md

diff --git a/docs/security-bootstrap-retrospective.md b/docs/security-bootstrap-retrospective.md
new file mode 100644
index 0000000..6165181
--- /dev/null
+++ b/docs/security-bootstrap-retrospective.md
@@ -0,0 +1,138 @@
+# Security Bootstrap Retrospective and Automation Gap Matrix
+
+**Status:** initial draft (NET-WP-0018-T03)
+**Date:** 2026-06-03
+**Scope:** Captures the first successful NetKingdom bootstrap (through 0015-0017 "Reopen under custody" + 0019 T06-adjacent polish) and the remaining automation gaps. Inputs: the T02 runtime architecture doc, prior workplan notes/evidence (0017/0019), console metadata/.local flags, operator experience, the assessment, and pragmatic records (progress events, evidence.json, git, console validators).
+
+This feeds T05 (smooth guide with evidence per step), T06/T08 (control surface + validations), T09 (rebuild risk/rehearsal), and future optimization.
+
+This is not a full historical dump; it focuses on bumps, diagnosis, current automation status, and remaining fragile/manual assumptions.
+
+## Executive Summary
+
+The first bootstrap proved the stack (LLDAP + Authelia + privacyIDEA + KeyCape + OpenBao + local bootstrap console + State Hub) can deliver a working IAM/security bootstrap path with platform-root custody, MFA, OIDC admin bridging to OpenBao, and non-root user lifecycle.
+
+It also proved the system is easy to derail interactively. The failure points were realm drift, OIDC callback bridging, LLDAP lookup assumptions, OpenBao claim/policy shape, token expiry/revocation, operator-state persistence (.local metadata), secret taint hygiene, and audit correlation.
+
+**Post-0017/0019 state (S6 Reopen under custody):**
+- Many gates now durable/non-secret in .local/security-bootstrap.json + console (platform_reopened, cleanup_complete, oidc_login_verified, audit_core_bootstrap_risk_accepted with owner/review 2026-07-02, etc.).
+- Control surface (console + make + web-ui) + evidence templates/validators provide repeatable safe checks and runbooks.
+- 0019 added repeatable non-root dry-run orchestration (/tmp hygiene + k8s fallback, create --test, verifications, lock/offboard, evidence with 12+ exact bools, validate) + claims helper + cleanup + runbook exposure. This makes the T06 gate automatable and less manual/taint-prone.
+- Pragmatic audit (local-identity TSV + OpenBao PVC+mock + State Hub/console evidence) in use; production Audit Core deferred (risk accepted).
+- T02 arch doc now exists as specific-as-deployed baseline (including pragmatic audit paths and UE gaps).
+
+**Key wins (now automated or evidenced):**
+- S6 reopen + custody approval + MFA self-enroll + KeyCape OIDC admin path to OpenBao.
+- Evidence discipline for cleanup (T03/T04 0017), lifecycle flow (T05), onboarding dry-run (T06 0017/0019) — all validate exact bools + no secrets.
+- Dry-run user lifecycle repeatable/safe (0019 orchestrator + console/make).
+- Non-secret progress in console metadata + State Hub /progress/ (used for tracking 0018 impl itself).
+
+**Remaining gaps / fragile assumptions (see matrix):**
+- Consolidated smooth bootstrap guide (T05; piecemeal docs + console guide exist but not one operator sequence with per-step evidence).
+- Full automated tests/validations for all UI sections/runbooks (T07; 0019 pieces are ready for coverage; more gates needed in T08).
+- UE integration (adapters, claims_enrichment routing, membership sync, audit correlation, bootstrap-to-governed transition): the biggest gap per the assessment; current paths go directly to LLDAP/Keycloak (valid for bootstrap per contract, but a drift risk).
+- Production Audit Core + full correlation (deferred; pragmatic sufficient for now).
+- Scratch rebuild risk rehearsal plan (T09; needs T03/T05/T07/T08 complete).
+- Some interactive repairs still required for drift cases (realm, callbacks, claims); console exposes repair but not fully preventive.
+
+A scratch rebuild should only happen after the guide + validations + T09 say which human interactions are unavoidable.
+
+## Bumps Encountered, Diagnosis, and Current Status
+
+(Compiled from 0017/0019 notes, T02 doc, console, metadata, assessment, operator history. Each entry lists: bump, root cause/diagnosis, whether it is automated now, and what remains manual/fragile.)
+
+1. **Realm drift / privacyIDEA repair (early bootstrap):**
+   - Bump: Realm/resolver not correctly set for LLDAP users; self-enroll or admin MFA broken.
+   - Diagnosis: Manual inspection of privacyIDEA admin + LLDAP queries; repair-realm-live.sh or similar.
+   - Now automated?: Console has "privacyIDEA realm repair" runbook (template + attended steps); validate-t02 etc. cover related checks. A repair script exists.
+   - Remaining: Still attended (not fully declarative/CI); needs live cluster + operator. Gap in T05 guide + T08 validator for "realm healthy for bootstrap users".
+
+2. **OIDC callback bridging / KeyCape client registration:**
+   - Bump: Redirects (localhost:8250 etc.) or client config not matching; login fails for console/OpenBao.
+   - Diagnosis: KeyCape client definition in code (create-secrets.sh); apply + restart; verify via console OIDC login.
+   - Now automated?: KeyCape OpenBao client definition (non-secret in source) + "KeyCape OpenBao client deployed" gate; console preflight + status checks.
+   - Remaining: Manual apply/restart cycle for drift; documented in T02 but no preventive validator yet (T08 target).
+
+3. **LLDAP lookup assumptions / group membership for claims:**
+   - Bump: Groups (net-kingdom-admins/users) not reflected in OIDC claims or policy; platform-admin not granted or over-granted.
+   - Diagnosis: Direct LLDAP queries (GraphQL or scripts); inventory scripts; console claims verification helper (0019).
+   - Now automated?: 0019 dry-run-nonroot-user.sh + create-user.sh --test + verify-openbao-client + onboarding-dry-run-claims (infers from groups + T01 role; warns on root/admins); evidence "lldap_identity_verified", "keycape_oidc_claims_verified"; console status.
+   - Remaining: Direct LLDAP in paths (not yet via UE claims_enrichment adapter per assessment gap 6); for non-dry-run, still some manual verification. UE adapters missing.
+
+4. **OpenBao claim mapping / policy shape / OIDC admin binding:**
+   - Bump: Claims from KeyCape not mapping to expected platform-admin policy; root token or unseal issues post-init.
+   - Diagnosis: OpenBao status + token create with policy; manual claim inspection; attended init ceremony.
+   - Now automated?: OpenBao OIDC auth configured gate + "OIDC admin login verified"; platform-admin via KeyCape/MFA proven; root token disposition (revoked); console openbao-preflight + status.
+   - Remaining: Init/unseal still highly attended (human custody); claim mapping config in KeyCape source (declarative but apply manual). T02 documents current.
+
+5. **Token expiry / revocation / short-lived handling:**
+   - Bump: Tokens (OpenBao helper, sessions) expired or leaked; revocation needed without exposing values.
+   - Diagnosis: Token lookup/revoke commands (accessor or self); manual in console or kubectl exec.
+   - Now automated?: Runbook "OpenBao token revocation" (template + interactive but no plaintext on CLI); console helpers for revoke.
+   - Remaining: Interactive for some cases; no fully non-interactive revocation in dry-run paths yet. Gap for T08.
+
+6. **Operator-state persistence / .local metadata drift:**
+   - Bump: Flags (e.g. oidc verified, cleanup) out of sync with reality; stage stuck; manual edits risky.
+   - Diagnosis: cat .local/security-bootstrap.json; console status/approve/validate flows.
+   - Now automated?: Console metadata-template + approve-custody-mode + save_progress_metadata; validate-* targets; .local updated only via console (non-secret); S6 "platform_reopened" + "cleanup_complete" set.
+   - Remaining: Still file-based (not in cluster secret or State Hub durable for multi-op); risk of manual tamper. T08 should compute more from validators.
+
+7. **Secret taint / hygiene during user lifecycle (pre-0019):**
+   - Bump: Temporary secrets written to sso-mfa/bootstrap/secrets/ for dry-run/test users; not cleaned; plaintext exposure risk.
+   - Diagnosis: Manual steps in early T06; inventory + evidence checks.
+   - Now automated?: 0019 dry-run-nonroot-user.sh (/tmp WORKSPACE + trap EXIT rm; k8s fallback in create-user.sh never touches persistent bootstrap/secrets for --test); evidence "no_secret_material_recorded":true + validator; make security-bootstrap-onboarding-dry-run + cleanup targets; console subcmd; rm in guide updated conceptually.
+   - Remaining: Old manual path in lifecycle_guide T06 section still lists secret-mkdir (minor doc staleness; prefer orchestrator). Good model for other secret handling.
+
+8. **Audit / correlation gaps (ongoing):**
+   - Bump: Audit events not correlated across bootstrap (local-identity TSV, OpenBao PVC, State Hub progress, console evidence) vs. UE audit records or flex-auth decisions.
+   - Diagnosis: Separate systems; assessment gap 7; contract requires shared IDs (request/actor/decision/user_engine_audit/outbox).
+   - Now automated?: Pragmatic layer working (progress events with workstream/task/decision correlation used for 0018 impl tracking itself; console evidence; local audit.py); audit_core posture gate in console (risk accepted).
+   - Remaining: No production Audit Core sink (deferred per metadata 2026-07-02); no UE adapters for audit writer/outbox; bootstrap audit separate. T02/T03 document it; T09 risk item.
+
+9. **Other (realm repair, image delivery/KeyCape config, escrow, restore drill, etc.):**
+   - Many covered in T02 gates + 0017 evidence (restore drill passed, custody roster, etc.).
+   - 0019 added dry-run evidence for user lifecycle (effective preview before save, prevents platform root grant, actor_class checks, reversible lock/offboard).
+   - Realm repair, KeyCape delivery still have attended elements (runbooks exist).
+
+## Automation Gap Matrix
+
+| Area | Bump/Fragile Assumption | Current Automation (0017/0019/T02) | Remaining Manual/Fragile | Priority for T05/T08/T09 | Notes / Evidence |
+|------|-------------------------|------------------------------------|---------------------------|---------------------------|------------------|
+| State persistence | .local metadata drift; stage/flags out of sync | Console approve/validate/metadata flows; S6 flags (platform_reopened, cleanup_complete) set | File-based (tamper risk); not cluster-durable | High (T08 compute from validators) | .local/security-bootstrap.json; console save_progress |
+| privacyIDEA realm repair | Realm/resolver drift; MFA self-enroll broken | Runbook + repair script; some validate-t02 | Attended apply; no full declarative gate | Medium | Console "privacyIDEA realm repair" |
+| KeyCape image/config delivery + client | Client/redirect mismatch; OIDC login fails | Non-secret client def in source; "client deployed" gate; console verify | Manual apply/restart for drift | High (T08) | keycape/create-secrets.sh; T02 clients |
+| OIDC callbacks / bridging | Redirect or client config issues | Documented in T02; console OIDC paths | Manual verification | Medium | kc.coulomb.social + localhost:8250 |
+| OpenBao claim mapping / policy | Claims not granting expected policy | OIDC auth configured + admin login verified gates | Init/policy apply attended | Medium | T02 OpenBao OIDC section |
+| Token revocation / expiry | Leaked/expired tokens hard to revoke safely | Runbook + console revoke helpers (no plaintext CLI) | Some cases still interactive | Medium (T08) | T02 token flows |
+| Audit (pragmatic vs proper) | No correlation; separate systems | Pragmatic: local-identity/audit.py TSV, OpenBao PVC+mock, State Hub/progress/console evidence, audit_core gate (risk accepted) | Production tenant-aware sink + full UE/flex/platform correlation (gap 7) | High (T03/T09) | See T02 "Pragmatic Audit Paths", assessment, metadata audit_core_* |
+| Secret taint / hygiene (user lifecycle) | Plaintext in bootstrap/secrets for tests | 0019 orchestrator (/tmp + trap + k8s fallback); evidence "no_secret..."; validate + cleanup make/console | Old manual path lingers in guide docs | Low (mostly done) | dry-run-nonroot-user.sh; 0019 evidence 12 bools |
+| User lifecycle (onboard/lock/offboard) | Manual, no preview, no evidence, taint risk | 0019 dry-run + claims + cleanup + console + make + evidence (effective before save, actor checks, reversible) | Transition to UE-backed (adapters) | High (T05/T08 use as model) | T02 + 0019; prevents platform-root grant |
+| Restore drill / escrow | No proof of recovery before trust | restore drill passed gate + evidence; custody roster (2of3 planned) | Attended; low-friction upgrade path to escrow | Medium (T09) | 0017 T02 evidence; T02 custody |
+| UE integration (adapters, claims_enrichment, memberships, app onboarding, audit correlation) | Direct LLDAP in bootstrap paths; no adapters | Documented in T02 + assessment; 0019 dry-run proves IAM contract | Adapters missing (biggest gap); claims still direct; memberships not synced with owning semantics | High (T03/T09 classify; T07/T08 testbed) | assessment 7 gaps; T02 UE section; boundary contract |
+| Consolidated guide + per-step evidence | Operator must rediscover sequence | Piecemeal docs + console lifecycle-guide (T05/T06 flows + 0019 dry-run) + evidence templates | No single "smooth bootstrap guide" with evidence per step that also makes wrong-order execution hard | High (T05 primary) | T02 feeds it; link validate-* |
+| Tests / validations for UI/runbooks | No coverage; sections can regress | Layered plan in T07; 0019 pieces (orchestrator, console cmds, claims, validators) ready | Most sections lack unit/fixture/live tests; live gated | High (T07) | Use T02 doc + 0019 artifacts as fixtures |
+| Rebuild risk / rehearsal | Unknown residual human interactions | T02 specific doc + 0019 dry-run model + S6 evidence | Full T03/T05/T07/T08 needed before T09 assessment | High (T09 at end) | Recommend isolated/namespace/scripted first (non-goal: destructive) |
+
+## Recommendations / Next Steps (from this retrospective)
+
+- **T05 priority:** Consolidate into one smooth guide (update operator-journey or new) with explicit evidence per step (link the validate-* and 0019 templates). Update console lifecycle_guide T06 section to prefer orchestrator.
+- **T07/T08:** Use 0019 dry-run + new T02 arch doc + evidence as concrete test cases/fixtures. Add validators for realm health, KeyCape client, audit_core posture (already partial), token revocation success, etc. Static tests for runbook presence.
+- **T03 complete:** Expand this doc with any new bumps from T05-T08 work. Output the matrix as a table in the final version.
+- **T09:** After the above, classify risks (esp. UE integration as high, with mitigation via adapters + updated dry-run). Recommend a rehearsal strategy (scripted dry-run + isolated namespace first).
+- **Cross:** Feed pragmatic records (this retrospective process itself used progress events + file notes + T02 doc) back into T03. Document current audit in T02 (done).
+- **UE:** Per assessment, do not block 0018 on adapters (NK orchestration role), but use T07/T08 + 0019 tooling to prepare integration tests. Create follow-up for UE-side adapter stub if needed.
+- Continue pragmatic tracking for remaining T0x (progress + workplan notes).
+
+## References / Inputs
+
+- docs/NetkingdomRuntimeArchitecture.md (T02)
+- NET-WP-0017 + 0019 workplans + evidence.json examples
+- .local/security-bootstrap.json + console status (S6 + available actions)
+- docs/platform-identity-security-architecture.md, responsibility-map.md, security-bootstrap-*.md (operator-journey, openbao-ceremony-ux, user-lifecycle, handover-cleanup, etc.), user-engine-netkingdom-integration-assessment.md, SCOPE.md, platform-root-custody.md
+- tools/security-bootstrap-console/security_bootstrap_console.py + Makefile
+- sso-mfa/k8s/lldap/dry-run-nonroot-user.sh + related
+- State Hub /progress/ events for 0018 (pragmatic record of impl)
+- canon/standards/* (iam-profile, user-engine-boundary-contract)
+- Assessment gap 7 + contract audit correlation bundle
+
+Update this doc as T03-T09 proceed. It is the "what went wrong / now fixed / still fragile" companion to the runtime architecture doc.
\ No newline at end of file
diff --git a/workplans/NET-WP-0018-bootstrap-automation-and-rebuild-readiness.md b/workplans/NET-WP-0018-bootstrap-automation-and-rebuild-readiness.md
index 2e816e7..3a3a866 100644
--- a/workplans/NET-WP-0018-bootstrap-automation-and-rebuild-readiness.md
+++ b/workplans/NET-WP-0018-bootstrap-automation-and-rebuild-readiness.md
@@ -148,7 +148,7 @@ canon/standards/user-engine-boundary-contract_v0.1.md and the assessment.
 
 ```task
 id: NET-WP-0018-T03
-status: todo
+status: in_progress
 priority: high
 state_hub_task_id: "1a3c4261-4133-4021-bd53-ea3dc77021a0"
 ```
@@ -174,6 +174,10 @@ repeatable automation. Gaps remaining: UE adapter integration (see assessment).
 The first bootstrap's interactive repairs (realm drift, callbacks, claim shape,
 token expiry, operator-state) are now partially automated via console/evidence.
 
+**2026-06-03:** Started T03 (after T02 arch doc complete). Using pragmatic tracking (progress events + file notes). Compiling bumps from 0015-0017/0019 history + T02 doc + console/metadata/evidence examples. Will produce docs/security-bootstrap-retrospective.md + gap matrix (state persistence, privacyIDEA repair, KeyCape delivery, OIDC callbacks, OpenBao claims, token revocation, **audit**, escrow, rebuild verification + new: 0019 dry-run hygiene/automation, console evidence, UE gaps). The matrix separates what is now automated from what remains manual/fragile.
+
+**2026-06-03:** T03 initial substantial progress. Created docs/security-bootstrap-retrospective.md (exec summary, 9 detailed bumps with "now automated?" status, full gap matrix table covering audit + UE + 0019 items, recommendations for T05/T07/T08/T09, references to T02 doc + pragmatic records + evidence). Uses 0019 dry-run/evidence as model. Still in_progress (expand with any new bumps from later T0x work).
+
 ### T04 - Review Repository Intent And Scope Boundaries
 
 ```task