net-kingdom/workplans/NET-WP-0018-bootstrap-automation-and-rebuild-readiness.md
tegwick 7da19ef767 feat(NET-WP-0018-T05): complete smooth bootstrap guide
2026-06-03 16:56:10 +02:00


id: NET-WP-0018
type: workplan
title: Bootstrap Automation And Rebuild Readiness
domain: netkingdom
repo: net-kingdom
status: active
owner: codex
topic_slug: netkingdom
created: 2026-06-01
updated: 2026-06-03
depends_on: NET-WP-0015, NET-WP-0017
state_hub_workstream_id: 800f9f16-bc44-4bbf-a771-58a630a3b698

NET-WP-0018 - Bootstrap Automation And Rebuild Readiness

Goal

Turn the first successful NetKingdom security bootstrap into a repeatable, well-bounded, highly automated setup path that can survive an infrastructure reset with minimal interactive diagnosis.

The first run proved that the stack can work: LLDAP, Authelia, privacyIDEA, KeyCape, OpenBao, the local bootstrap control surface, and State Hub now form a working identity and security bootstrap path. It also proved that the system is still too easy to derail: realm drift, callback bridging, LLDAP lookup assumptions, OpenBao claim shape, token expiry, and operator-state persistence all required interactive repair. This workplan converts those lessons into architecture documentation, bootstrap sequencing, validation coverage, UI automation, and a clear scratch-rebuild risk assessment.

Strategy

Proceed in layers:

  1. close or explicitly hand off the remaining NET-WP-0015 bootstrap gates;
  2. document the runtime architecture that now actually exists;
  3. write down the bootstrap retrospective and automation gaps;
  4. clarify repository boundaries so future fixes land in the right place;
  5. produce a sequence guide for a smooth rebuild;
  6. improve the control-surface UI so it follows that guide;
  7. add tests and validations for every guided bootstrap section; and
  8. assess the residual risk of rebuilding NetKingdom from scratch.

This is not a request to immediately destroy and rebuild the live stack. A scratch rebuild should come only after the guide, validations, and risk review say which interactions remain genuinely unavoidable.

Coordination Notes

  • Avoid duplicating NET-WP-0017: audit durability, escrow, user onboarding, and hardening remain there unless this workplan explicitly turns them into bootstrap-guide or validation work.
  • Keep the bootstrap UI a control surface, not a secret collector. It may run safe checks, generate commands, and store non-secret evidence, but it must not store passwords, OTP seeds, OpenBao tokens, unseal shares, or recovery codes.
  • Prefer validation helpers that are usable both by the UI and by CI or operator command lines.
  • Treat interactive prompts as an explicit design boundary: automate everything that can be automated safely, and document why each remaining human action is required.
  • Pragmatic auditing and tracking while implementing this workplan: use State Hub /progress/ (and /decisions/ for key choices, e.g. during T02/T04), dated notes and task status in this file (source of truth per ADR-001), descriptive git commits, console evidence/validators plus .local/security-bootstrap.json when exercising paths, /tmp evidence, and runbooks. These artifacts (plus bumps encountered while doing T02-T08) directly feed the T03 retrospective and gap matrix (which explicitly covers "audit" among other items), and they enable post-implementation review for optimization potential without requiring production Audit Core first. See the audit_core_* fields in metadata (bootstrap risk accepted=true; production sink ready=false; temporary exception with owner/review 2026-07-02 per .local and console gates). Proper cross-system audit correlation (UE + flex-auth + platform sinks per contract/assessment gap 7) remains a follow-up; document the current pragmatic paths (local-identity/audit.py TSV, OpenBao PVC + mock, State Hub/console evidence, separate bootstrap audit) in the T02 arch doc and T03 matrix. Do not block the start of 0018 on full Audit Core.
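
The "control surface, not a secret collector" rule above can be enforced mechanically before any state is persisted. The following is a minimal sketch, not the console's actual implementation: the function names and the key-name patterns are assumptions chosen to mirror the categories listed above (passwords, OTP seeds, OpenBao tokens, unseal shares, recovery codes).

```python
import json
import re

# Hypothetical deny-list of key-name fragments that suggest secret material.
# The categories come from the coordination note; the patterns are assumptions.
SECRET_KEY_PATTERN = re.compile(
    r"(password|passwd|otp_seed|token|unseal|recovery_code)", re.IGNORECASE
)


def find_secret_like_keys(value, path=""):
    """Return dotted paths of string values stored under secret-looking keys."""
    hits = []
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = f"{path}.{key}" if path else str(key)
            # Boolean flags such as "token_revoked": true are evidence, not
            # secrets, so only non-empty strings are flagged.
            if isinstance(child, str) and child and SECRET_KEY_PATTERN.search(str(key)):
                hits.append(child_path)
            hits.extend(find_secret_like_keys(child, child_path))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            hits.extend(find_secret_like_keys(child, f"{path}[{index}]"))
    return hits


def dump_non_secret_state(state):
    """Serialize state for persistence, refusing anything that looks secret."""
    hits = find_secret_like_keys(state)
    if hits:
        raise ValueError(f"refusing to persist secret-like keys: {hits}")
    return json.dumps(state, indent=2, sort_keys=True)
```

A key-name scan like this is only a guard rail: it cannot detect a secret stored under an innocuous key, so it complements rather than replaces the rule that secrets never enter the UI in the first place.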

Related (post-0019 + assessment)

  • NET-WP-0019 (T06-adjacent user lifecycle dry-run polish; advanced control surface, evidence, claims for T06/T07/T08)
  • docs/user-engine-netkingdom-integration-assessment.md (detailed T04 boundary/intent/scope review for user-engine integration + 7 gaps; cross-referenced from SCOPE etc.)

Tasks

T01 - Close Or Hand Off NET-WP-0015 Remaining Gates

id: NET-WP-0018-T01
status: done
priority: high
state_hub_task_id: "7ff22629-838b-41df-9feb-bb36c5d57cc1"

Review NET-WP-0015 now that platform-root can obtain OpenBao platform-admin through KeyCape/MFA. Close any gates that are truly complete, and explicitly move unfinished production-readiness work to NET-WP-0017 or this workplan when it no longer belongs in the bootstrap ceremony plan.

Done when NET-WP-0015 is either finished and ready to archive, or its remaining tasks have precise owners, target workplans, and non-duplicative acceptance criteria.

2026-06-01: Completed. NET-WP-0015 was scope-closed as finished after the OpenBao admin bridge was proven through KeyCape/MFA. Its remaining production-readiness concerns were reconciled into NET-WP-0017: T02 owns audit, restore, emergency drill evidence, and escrow; T03/T04 own bootstrap path retirement and credential reset/rotation; T07 owns final archive review. NET-WP-0018 now continues with architecture documentation, retrospective, guide, UI automation, validations, and rebuild-risk assessment.

2026-06-03: 0019 polish (dry-run orchestrator, console subcommands/make targets/claims/validators/runbook) and the user-engine/net-kingdom assessment (see T04) are cross-cutting enablers. See per-task notes (T02-T09) for specifics; 0019 advances T06/T07/T08 for lifecycle automation; the assessment fulfills the UE boundary review portion of T04. Related: NET-WP-0019, docs/user-engine-netkingdom-integration-assessment.md.

T02 - Document The Runtime Architecture

id: NET-WP-0018-T02
status: done
priority: high
state_hub_task_id: "121ee797-e3f5-4d3e-9baa-cfa8c92f8a66"

Create docs/NetkingdomRuntimeArchitecture.md documenting what now exists: identity stores, MFA realms, KeyCape OIDC flow, Authelia handoff, OpenBao OIDC admin path, bootstrap UI state, State Hub relation, live DNS/routes, trust boundaries, token flows, and operational assumptions.

The document should explain the working system as deployed, not an idealized future architecture. It should be specific enough to guide a scratch rebuild without requiring the operator to rediscover the same integration details.

2026-06-03 (post 0017/0019 + assessment): The runtime now includes the T06-adjacent dry-run tooling (orchestrator + console/make exposure + evidence discipline) as part of the control surface. Per the persisted assessment, the arch doc must capture: current direct LLDAP/KeyCape paths for bootstrap users (vs. future UE claims_enrichment adapter), membership facts in LLDAP groups vs. UE Membership (owning_system etc.), bootstrap local-identity vs. UE local mode, and the boundary contract as the governance layer. Include refs to canon/standards/user-engine-boundary-contract_v0.1.md and the assessment.

2026-06-03: Started T02. Using pragmatic tracking (this note + will POST /progress/ with task). Gathering deployed components from existing docs, code, and configs to produce specific-as-deployed doc (not idealized). Will cover all listed items + pragmatic audit paths, dry-run 0019 additions, UE integration points/gaps per assessment.

2026-06-03: T02 complete. Created docs/NetkingdomRuntimeArchitecture.md with comprehensive sections on:

  • the planes model;
  • identity stores and MFA/OIDC flows (lightweight key-cape: LLDAP at lldap.coulomb.social + Authelia + privacyIDEA + KeyCape issuer https://kc.coulomb.social with bootstrap clients);
  • Authelia handoff;
  • OpenBao OIDC admin + secrets/credential path (SOPS/age bootstrap -> runtime with K8s auth, ESO, leases);
  • bootstrap console/UI state (S6 Reopen, full gates incl. audit_core_posture, 0019 dry-run orchestrator/console subcmds/make targets/evidence/validators/runbook entry);
  • State Hub relation (progress/decisions for tracking);
  • k8s/DNS/routes/ingress/trust boundaries (sso/openbao ns, recursive rule, concrete hosts); and
  • operational assumptions + rebuild notes.

The doc explicitly includes the current pragmatic audit paths (local-identity/audit.py TSV, OpenBao PVC+mock, State Hub/console evidence) and the UE integration points + 7 gaps (from assessment + contract refs). It is specific and as-deployed, for rebuild guidance. This doc now feeds the T03 retrospective, T05 guide, T09 risk assessment, and T07/T08 validation targets.

T03 - Produce A Bootstrap Retrospective And Automation Gap Matrix

id: NET-WP-0018-T03
status: done
priority: high
state_hub_task_id: "1a3c4261-4133-4021-bd53-ea3dc77021a0"

Assess how the first bootstrap went. Capture each bump encountered, the root cause, how it was diagnosed, whether it is now automated, and what remains as a manual step or fragile assumption.

Recommended output: docs/security-bootstrap-retrospective.md with a gap matrix covering state persistence, privacyIDEA realm repair, KeyCape image delivery, OIDC callbacks, OpenBao claim mapping, token revocation, audit, escrow, and rebuild verification.

2026-06-03 (post 0017 close + 0019 polish): Retrospective should now incorporate: successful S6 reopen + platform_reopened flag + cleanup_complete in .local/security-bootstrap.json; T06 dry-run evidence discipline (12+ bools incl. effective_access_before_save, no_secret_material_recorded, lldap_identity_verified, keycape_oidc_claims_verified, actor_class != king, !net-kingdom-admins for non-root); safer secret handling via /tmp WORKSPACE + trap + k8s fallback (never write sso-mfa/bootstrap/secrets for dry-runs); console as non-secret control surface with runbooks + templates + validators; 0019 make targets and orchestrator as repeatable automation. Gaps remaining: UE adapter integration (see assessment). The first bootstrap's interactive repairs (realm drift, callbacks, claim shape, token expiry, operator-state) are now partially automated via console/evidence.
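
The dry-run evidence discipline described in this note lends itself to a small validator. The sketch below is illustrative only: the four boolean field names, actor_class, and the net-kingdom-admins group come from the note, while the evidence shape (a flat dict with a "groups" list) and the function name are assumptions. It covers only the named fields, not the full set of 12+ booleans.

```python
# Boolean fields named in the T03 note; the real template has more (12+).
REQUIRED_TRUE = (
    "effective_access_before_save",
    "no_secret_material_recorded",
    "lldap_identity_verified",
    "keycape_oidc_claims_verified",
)


def validate_dry_run_evidence(evidence):
    """Return a list of problems; an empty list means the evidence passes."""
    problems = []
    for field in REQUIRED_TRUE:
        # Require a literal True so that missing or "yes"-style values fail.
        if evidence.get(field) is not True:
            problems.append(f"{field} must be true")

    actor_class = evidence.get("actor_class")
    if not actor_class:
        problems.append("actor_class is missing")
    elif actor_class == "king":
        problems.append("actor_class must not be king for a dry-run user")

    # Assumed shape: "groups" is a list of LLDAP group names.
    if "net-kingdom-admins" in evidence.get("groups", []):
        problems.append("non-root dry-run user must not be in net-kingdom-admins")
    return problems
```

Returning a list of problems rather than a single boolean keeps the output actionable: the console can show every failed condition at once instead of stopping at the first.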

2026-06-03: Started T03 (after T02 arch doc complete). Using pragmatic (progress + file notes). Compiling bumps from 0015-0017/0019 history + T02 doc + console/metadata/evidence examples. Will produce docs/security-bootstrap-retrospective.md + gap matrix (state persistence, privacyIDEA repair, KeyCape delivery, OIDC callbacks, OpenBao claims, token revocation, audit, escrow, rebuild verification + new: 0019 dry-run hygiene/automation, console evidence, UE gaps). What is now automated vs. remaining manual/fragile.

2026-06-03: T03 initial substantial progress. Created docs/security-bootstrap-retrospective.md (exec summary, 9 detailed bumps with "now automated?" status, full gap matrix table covering audit + UE + 0019 items, recommendations for T05/T07/T08/T09, references to T02 doc + pragmatic records + evidence). Uses 0019 dry-run/evidence as model. Still in_progress (expand with any new from later T0x).

2026-06-03: T03 complete. Finalized retrospective draft with comprehensive bumps analysis, gap matrix (explicitly including audit, UE integration, 0019 polish as enablers), and actionable recs. No further expansion needed at this stage (will reference in later tasks). Used pragmatic tracking throughout (progress events with task_id, workplan notes, git). The doc + T02 now provide strong foundation for T05 (guide), T07/T08 (tests/validations), T09 (risk). Marked done in file and will sync via fix.

T04 - Review Repository Intent And Scope Boundaries

id: NET-WP-0018-T04
status: todo
priority: medium
state_hub_task_id: "9c286579-b7bc-46ae-9789-801b2b27b26d"

Review INTENT.md, SCOPE.md, and equivalent boundary documents across the associated repositories involved in the bootstrap. At minimum consider net-kingdom, key-cape, railiance-platform, state-hub/custodian, and any repo that owns OpenBao deployment, image delivery, identity runtime, or bootstrap automation.

Update the boundary documents or create follow-up workplans where ownership is unclear. The result should answer: where should a bug fix live, where should a runbook live, where should validation live, and which repo owns live deployment state.

2026-06-03: The user-engine/net-kingdom integration assessment (persisted in docs/user-engine-netkingdom-integration-assessment.md, cross-referenced from SCOPE.md Getting Oriented, canon/standards/user-engine-boundary-contract_v0.1.md, docs/responsibility-map.md, user-engine-interface-guidance.md, and this/0019 workplans) provides a comprehensive review of:

  • intent;
  • implemented scope (UE: headless domain models + in-mem MVP + ports/adapters for claims/audit/projections; NK: IAM orchestration + contracts + bootstrap);
  • architectural fit (no intent conflicts; UE owns user-domain facts/projections, NK orchestrates boundaries per ADR-0007/0010/contract); and
  • 7 specific gaps/risks.

The 7 gaps/risks are:

  1. Missing Platform Integration Adapters (the biggest gap);
  2. Bootstrap/Platform Users vs. Governed UE Lifecycle;
  3. App Onboarding "Application" concept overload;
  4. Membership/Group overlap;
  5. Governance/Workplan/Brief split (UE brief stale);
  6. Claims Enrichment Path drift (current direct LLDAP in NK/keycape paths); and
  7. Audit correlation.

NK bootstrap (0015-0017/0019) is allowed for local/non-prod per contract. This largely fulfills the UE + boundary review portion of T04. Recommend follow-up reviews or work items for key-cape (OIDC client vs UE Application Binding), railiance-platform (deployment refs), and explicit transition rules for seeding externally_provisioned memberships from IAM groups. The assessment recommends using 0018's T07/T08 to drive integration tests/dry-runs once adapters exist.

T05 - Create The Smooth Bootstrap Guide

id: NET-WP-0018-T05
status: done
priority: high
state_hub_task_id: "e7b45fc8-8ee7-4914-ac4b-d0c8a35fad13"

Create or update the NetKingdom bootstrap guide so an operator knows what to do, in what order, and what evidence proves each step is complete.

The guide should cover prerequisites, credential bundle creation, cluster foundation checks, privacyIDEA bootstrap, LLDAP/bootstrap user creation, KeyCape deployment and client registration, OpenBao init/unseal/configuration, OIDC admin binding, token cleanup, State Hub sync, and handoff to production readiness.

2026-06-03: Base material exists in piecemeal form: docs/security-bootstrap-*.md (user-lifecycle.md, operator-journey.md, king-credential-kit.md, openbao-ceremony-ux.md, handover-cleanup.md, etc.), console lifecycle-guide (T05/T06 flows with previews), and security-bootstrap-user-lifecycle.md (UX contract for show-effective-before-save, actor classes, blocked conditions). The 0019 polish extended the console lifecycle-guide with a T06 DRY-RUN EXECUTION section (though that section still lists pre-orchestrator manual secret steps; update it to prefer make security-bootstrap-onboarding-dry-run + dry-run-nonroot-user.sh + k8s fallback). T05 should consolidate these into one "NET-WP-0018 smooth bootstrap guide" (or update operator-journey) with explicit evidence per step (linking the validate-* make targets and templates). 0019's dry-run + evidence is the model for the user-lifecycle portion of the guide.

2026-06-03: Started T05 (after T03 complete). Per retrospective recs (T05 high priority now that T02 arch + T03 retrospective exist). Using pragmatic tracking. Will consolidate piecemeal materials (T02, T03 retrospective, console lifecycle-guide + 0019 extensions, security-bootstrap-operator-journey.md, user-lifecycle.md, other *-ux.md, evidence templates/validators from console/0019) into a single operator guide with clear sequence, prerequisites, evidence per step (links to validate-*, 0019 dry-run, etc.), and the "next safe action" / blocked gates model from the UX contract. Update console guide section as needed. Produce docs/smooth-bootstrap-guide.md or update main journey doc.

2026-06-03: T05 complete. Created docs/smooth-bootstrap-guide.md (the consolidated NET-WP-0018 smooth bootstrap guide): covers full sequence from prereqs to reopen + user lifecycle (using 0019 polish), per-step evidence + validator/make links, blocked conditions, next safe action / blocked gates from UX contracts (operator-journey + user-lifecycle), references to T02 arch, T03 retrospective, console, 0019 artifacts. Also notes to update console lifecycle-guide for 0019 polish. Pragmatic tracking used (progress, file notes). This fulfills T05 + feeds T06 alignment.

T06 - Align The Control Surface With The Bootstrap Guide

id: NET-WP-0018-T06
status: todo
priority: high
state_hub_task_id: "9bba26b3-b1be-4e58-a18b-a0533683d63b"

Review the local security bootstrap UI against the guide. Improve the automation grade where safe: replace passive checkboxes with safe validators, convert fragile copy-paste sequences into scripts, persist non-secret progress durably, expose repair routines for known drift cases, and keep manual steps clear when human custody or secret handling is required.

Done when the UI guides the same sequence as the bootstrap guide and makes wrong-order execution visibly hard.
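
One way to make wrong-order execution visibly hard is to derive both the next safe action and the list of unmet prerequisites from a single ordered sequence. This is a minimal sketch under stated assumptions: the step identifiers are hypothetical labels loosely following the areas the T05 guide is meant to cover, and the functions are not part of the existing console.

```python
# Hypothetical step identifiers, ordered as the guide would present them.
BOOTSTRAP_SEQUENCE = [
    "prerequisites",
    "credential_bundle",
    "privacyidea",
    "lldap_bootstrap_user",
    "keycape",
    "openbao",
    "oidc_admin_binding",
    "token_cleanup",
    "state_hub_sync",
    "handoff",
]


def next_safe_action(done):
    """Return the first step not yet completed, or None when all are done."""
    for step in BOOTSTRAP_SEQUENCE:
        if step not in done:
            return step
    return None


def missing_prerequisites(step, done):
    """Return the earlier steps that block `step`; an empty list means it may run."""
    index = BOOTSTRAP_SEQUENCE.index(step)  # raises ValueError for unknown steps
    return [earlier for earlier in BOOTSTRAP_SEQUENCE[:index] if earlier not in done]
```

For example, `missing_prerequisites("openbao", {"prerequisites", "credential_bundle"})` names the three skipped steps, which a UI could render as the blocked condition next to the disabled action.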

2026-06-03 (0019 polish delivered): Control surface now includes (in status, available actions, parser, dispatch, runbook_payloads, web-ui capable):

  • onboarding-dry-run-template / validate-onboarding-dry-run
  • onboarding-dry-run (delegates to sso-mfa/k8s/lldap/dry-run-nonroot-user.sh)
  • onboarding-dry-run-claims (uses print_dry_run_oidc_claims_verification, warns on platform-root/admins groups)
  • lifecycle-cleanup-dryrun-users (pattern offboard)
  • lifecycle-guide (with T06 section)
  • make targets: security-bootstrap-onboarding-dry-run (SUBJECT/EMAIL/DISPLAY), security-bootstrap-lifecycle-cleanup-dryrun-users PATTERN=..., security-bootstrap-validate-onboarding-dry-run, etc.

The orchestrator (dry-run-nonroot-user.sh) uses a /tmp workspace + EXIT trap, prefers env/k8s for LLDAP_ADMIN_PASS (k8s fallback added to create-user.sh), runs create --test and the verifications (check-user-mfa, verify-openbao-client) plus an optional GraphQL lock/offboard, populates /tmp/.../evidence.json from the template + live jq data, then runs validate. It records non-secret data only. This fulfills much of "convert fragile copy-paste into scripts", "persist non-secret progress", and "expose repair" for the user-lifecycle slice. Full alignment awaits the T05 guide + more validators in T08 (e.g. for OIDC client, OpenBao config). See the 0019 workplan for details; the lifecycle_guide T06 section needs a refresh to deprecate the old secret-mkdir path.
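
The orchestrator itself is a shell script; the Python sketch below only illustrates the two hygiene patterns it is described as using: a scratch workspace that is always cleaned up (the analogue of a /tmp directory plus EXIT trap) and secret resolution that prefers the environment before a fallback. The function names are assumptions, and the fallback is a caller-supplied callable standing in for the Kubernetes lookup, which is not reproduced here.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_workspace(prefix="nk-dry-run-"):
    """Yield a temporary directory that is removed on exit, even on error."""
    with tempfile.TemporaryDirectory(prefix=prefix) as path:
        yield path


def resolve_secret(name, fallback=None, environ=os.environ):
    """Prefer the environment; otherwise ask the fallback (e.g. a k8s lookup).

    The value is returned to the caller and never written to disk here.
    """
    value = environ.get(name)
    if value:
        return value
    if fallback is not None:
        value = fallback()
        if value:
            return value
    raise RuntimeError(f"{name} is not available from the environment or the fallback")
```

Keeping the fallback as an injected callable means the resolution order can be tested with fixtures, without a cluster and without any real credential.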

T07 - Add Automated Tests For Bootstrap UI Sections And Runbooks

id: NET-WP-0018-T07
status: todo
priority: high
state_hub_task_id: "c412d9e0-a2ca-4849-b6ee-bd4450b5a4a5"

For each task section and runbook exposed in the control surface, add automated tests that validate the implementation contract.

Use a layered approach:

  • static/unit tests for UI payload generation and command card presence;
  • shell/Python syntax tests for generated helper scripts;
  • dry-run or fixture tests for validators and state transitions; and
  • live-cluster checks gated behind explicit operator environment variables.

Done when every visible bootstrap section has at least one automated test that would fail if the section disappears, emits the wrong command, or reports an impossible state.
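
The layered approach can be sketched with the standard unittest module: a static test that fails if a command card emits the wrong make target, and a live-cluster class that is skipped unless the operator opts in. Everything named here is an assumption for illustration: NK_LIVE_CLUSTER_TESTS is a hypothetical variable name, and command_card is a stand-in for the console's real payload generation.

```python
import os
import unittest

# Hypothetical opt-in flag for live-cluster checks.
LIVE_FLAG = "NK_LIVE_CLUSTER_TESTS"


def command_card(subject):
    """Stand-in for UI payload generation; the real console code differs."""
    return {"command": f"make security-bootstrap-onboarding-dry-run SUBJECT={subject}"}


class CommandCardTest(unittest.TestCase):
    def test_card_emits_expected_make_target(self):
        card = command_card("dryrun-user")
        self.assertIn("security-bootstrap-onboarding-dry-run", card["command"])
        self.assertIn("SUBJECT=dryrun-user", card["command"])


@unittest.skipUnless(
    os.environ.get(LIVE_FLAG) == "1", f"set {LIVE_FLAG}=1 to run live checks"
)
class LiveClusterTest(unittest.TestCase):
    def test_live_check_placeholder(self):
        # Placeholder: a real check would query the cluster here.
        self.skipTest("replace with a real live-cluster check")
```

Because the skip condition is evaluated when the class is defined, CI runs without the flag report the live tests as skipped rather than silently omitting them, which keeps the gap visible in test output.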

Note (NET-WP-0019 polish): Include tests for the user-lifecycle dry-run (T06 from 0017/0019): the orchestrator script, onboarding-dry-run console command, claims verification (T05), cleanup helper, and evidence validators. See the NET-WP-0019 workplan and sso-mfa/k8s/lldap/dry-run-nonroot-user.sh. This cross-links the T06-adjacent polish into 0018's automation goals.

See also docs/user-engine-netkingdom-integration-assessment.md for the broader intent/scope fit, gaps (esp. adapters), and recommendations. (The 0019 artifacts -- script, console subcmds, make targets, runbook entry, templates/validators -- are now the concrete implementation to cover with the layered tests in T07.)

T08 - Integrate Validations Into The UI State Model

id: NET-WP-0018-T08
status: todo
priority: high
state_hub_task_id: "32f05fb1-269c-421c-ae34-57d2ceb7e47a"

Make the current setup prove itself through the same validations the UI shows. Where possible, compute ok, fail, err, or nil from validators rather than relying only on manual confirmation.

Important targets include KeyCape client config, privacyIDEA realm/resolver, LLDAP user/group membership, Authelia/KeyCape route health, OpenBao OIDC auth config, token policy proof, audit status, restore evidence, and State Hub sync.

Done when the UI can distinguish success, failure, error, and unknown states for the critical bootstrap gates and the live setup satisfies those checks.
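
The four states can be derived from a validator by separating "the check ran and said no" from "the check could not run" and from "no check exists". A minimal sketch, assuming validators are plain callables that return a truthy/falsy result or None; the function name is hypothetical and this is not the console's build_gates logic.

```python
def gate_state(validator):
    """Map a validator onto "ok", "fail", "err", or None (unknown).

    - None validator, or a validator returning None: state is unknown.
    - Validator raises: the check itself could not run ("err").
    - Otherwise: truthy result is "ok", falsy result is "fail".
    """
    if validator is None:
        return None
    try:
        result = validator()
    except Exception:
        return "err"
    if result is None:
        return None
    return "ok" if result else "fail"
```

Treating an exception as "err" rather than "fail" matters for operators: an unreachable endpoint should prompt a connectivity fix, not a configuration repair.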

2026-06-03 (0019 contribution): Dry-run specific validators now exist: onboarding_dry_run_template() + require_evidence_fields match + make security-bootstrap-validate-onboarding-dry-run (calls console which runs print_validate_onboarding_dry_run or equivalent, checks all *_true bools, actor_class, groups, no secret markers, effective_access_summary etc.). Console status/metadata shows many gates as "done" from prior evidence-driven flags (e.g. platform_reopened, cleanup_complete, oidc_login_verified). The evidence_validator_gate and build_gates logic support computing ok/fail from live evidence rather than manual. Extend this pattern to other T08 targets (KeyCape client, privacyIDEA realm, LLDAP membership, OpenBao OIDC, Authelia routes, State Hub sync). 0019 also added claims verification as a hook that can feed validation (infers from LLDAP groups + T01 role binding, surfaces warnings). Use the dry-run orchestrator + /tmp evidence as a repeatable fixture for these validators. See assessment for UE-side validation targets once adapters land (e.g. claims_enrichment projection).

T09 - Assess Scratch-Rebuild Risk And Define A Rehearsal Plan

id: NET-WP-0018-T09
status: todo
priority: high
state_hub_task_id: "a9e60fd5-fac6-46e9-bc63-b2979cca548e"

Review the resulting architecture, guide, automation, tests, and live validation coverage. Produce a risk assessment for restarting the NetKingdom infrastructure from scratch.

The assessment should classify each risk by likelihood, impact, detection method, mitigation, and remaining human interaction. It should also recommend whether the next rebuild should be a full teardown, an isolated parallel cluster rehearsal, a namespace-level rehearsal, or a scripted dry run.
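
The classification dimensions listed above map naturally onto a small record type that can be ranked. This is a sketch under assumptions: the three-level scale, the likelihood-times-impact score, and all names are illustrative choices, not a method prescribed by this workplan, and any ratings entered are the assessor's judgment rather than measured values.

```python
from dataclasses import dataclass

# Assumed ordinal scale; the workplan does not prescribe one.
LEVELS = {"low": 1, "medium": 2, "high": 3}


@dataclass(frozen=True)
class RebuildRisk:
    name: str
    likelihood: str          # "low" | "medium" | "high"
    impact: str              # "low" | "medium" | "high"
    detection: str           # how the problem would be noticed
    mitigation: str          # what reduces the risk
    human_interaction: str   # what a person must still do

    @property
    def score(self):
        """Simple likelihood x impact product used only for ordering."""
        return LEVELS[self.likelihood] * LEVELS[self.impact]


def rank(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)
```

A flat record like this can be rendered as the assessment's table and re-sorted as ratings change; the score is a sorting aid, and two risks with the same product may still deserve different treatment.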

2026-06-03 (post 0017/0019 + assessment): Rebuild risk assessment (T09) will be informed by: T02 arch (incl. UE integration points/gaps), T03 retrospective (capturing what was fragile vs now automated via console/evidence/orchestrator), T05 guide + evidence per step, T07 tests, T08 live validations (current metadata shows S6 reopen with many flags true, but adapter gaps remain). From assessment:

  • IAM-orchestration bootstrap (creds via creds-init skill, LLDAP/KeyCape direct, OpenBao via KeyCape OIDC) is repeatable and rehearsable today with 0019 tooling.
  • Full UE-backed user facts in rebuild: blocked until net-kingdom-specific adapters (IdentityClaimsAdapter from KeyCape claims, AuthorizationCheckPort to flex-auth, SecretProvider OpenBao, EventOutbox, AuditWriter, MembershipFactExporter) are implemented (primarily in user-engine per contract; NK orchestrates).
  • Other: direct LLDAP use in paths (create-user, keycape) must route via the claims_enrichment adapter once adapters exist, to avoid drift. Bootstrap users (platform-root etc.) stay IAM-side or seed externally_provisioned in UE. Recommend: T09 classify "UE integration" as a separate risk item with mitigation "implement adapters + NK wiring + update dry-run to exercise UE projection"; the current 0019 dry-run proves the IAM-lifecycle contract. The creds-init skill (in .claude/commands) provides an automated cred bootstrap entrypoint for rehearsal. A live destructive rebuild remains a non-goal.

Acceptance Criteria

  • NET-WP-0015 is closed, archived, or explicitly reconciled with remaining work owned elsewhere.
  • docs/NetkingdomRuntimeArchitecture.md documents the real deployed runtime.
  • A bootstrap retrospective and automation gap matrix exists.
  • Associated repository boundaries are reviewed and updated or tracked with follow-up work.
  • A smooth bootstrap guide describes the intended sequence and evidence.
  • The control surface follows the guide and uses safe automation wherever appropriate.
  • Every bootstrap UI section and runbook has automated coverage.
  • The live setup passes the integrated validations or reports actionable failures.
  • A scratch-rebuild risk assessment recommends the next rehearsal strategy.

Non-Goals

  • Do not perform a destructive live rebuild as part of this workplan.
  • Do not move secret material into Git, State Hub, or the bootstrap UI.
  • Do not hide remaining human custody decisions behind automation.
  • Do not collapse repository ownership boundaries merely for convenience.