net-kingdom/workplans/NET-WP-0018-bootstrap-automation-and-rebuild-readiness.md

---
id: NET-WP-0018
type: workplan
title: "Bootstrap Automation And Rebuild Readiness"
domain: netkingdom
repo: net-kingdom
status: active
owner: codex
topic_slug: netkingdom
created: "2026-06-01"
updated: "2026-06-03"
depends_on:
  - NET-WP-0015
  - NET-WP-0017
state_hub_workstream_id: "800f9f16-bc44-4bbf-a771-58a630a3b698"
---

# NET-WP-0018 - Bootstrap Automation And Rebuild Readiness

## Goal

Turn the first successful NetKingdom security bootstrap into a repeatable,
well-bounded, highly automated setup path that can survive an infrastructure
reset with minimal interactive diagnosis.

The first run proved that the stack can work: LLDAP, Authelia, privacyIDEA,
KeyCape, OpenBao, the local bootstrap control surface, and State Hub now form a
working identity and security bootstrap path. It also proved that the system is
still too easy to derail: realm drift, callback bridging, LLDAP lookup
assumptions, OpenBao claim shape, token expiry, and operator-state persistence
all required interactive repair. This workplan converts those lessons into
architecture documentation, bootstrap sequencing, validation coverage, UI
automation, and a clear scratch-rebuild risk assessment.

## Strategy

Proceed in layers:

1. close or explicitly hand off the remaining `NET-WP-0015` bootstrap gates;
2. document the runtime architecture that now actually exists;
3. write down the bootstrap retrospective and automation gaps;
4. clarify repository boundaries so future fixes land in the right place;
5. produce a sequence guide for a smooth rebuild;
6. improve the control-surface UI so it follows that guide;
7. add tests and validations for every guided bootstrap section; and
8. assess the residual risk of rebuilding NetKingdom from scratch.

This is not a request to immediately destroy and rebuild the live stack. A
scratch rebuild should come only after the guide, validations, and risk review
say which interactions remain genuinely unavoidable.

## Coordination Notes

- Avoid duplicating `NET-WP-0017`: audit durability, escrow, user onboarding,
  and hardening remain there unless this workplan explicitly turns them into
  bootstrap-guide or validation work.
- Keep the bootstrap UI a control surface, not a secret collector. It may run
  safe checks, generate commands, and store non-secret evidence, but it must not
  store passwords, OTP seeds, OpenBao tokens, unseal shares, or recovery codes.
- Prefer validation helpers that are usable both by the UI and by CI or
  operator command lines.
- Treat interactive prompts as an explicit design boundary: automate everything
  that can be automated safely, and document why each remaining human action is
  required.
- Track implementation of *this workplan* pragmatically: State Hub `/progress/`
  events (and `/decisions/` for key choices, e.g. during T02/T04), dated notes
  and task status in this file (the source of truth per ADR-001), descriptive
  git commits, console evidence and validators, `.local/security-bootstrap.json`
  when exercising paths, `/tmp` evidence, and runbooks. These artifacts, plus
  the bumps encountered during T02–T08, feed the T03 retrospective and gap
  matrix, which explicitly covers audit among other items. This allows
  post-implementation review for optimization potential without requiring a
  production Audit Core first. See the `audit_core_*` fields in the metadata:
  bootstrap risk accepted=true, production sink ready=false, and a temporary
  exception with owner/review 2026-07-02 per `.local` and the console gates.
  Proper cross-system audit correlation (UE + flex-auth + platform sinks, per
  contract and assessment gap 7) remains a follow-up. Document the current
  pragmatic paths (`local-identity/audit.py` TSV, OpenBao PVC + mock, State
  Hub/console evidence, separate bootstrap audit) in the T02 architecture doc
  and the T03 matrix. Do not block the start of 0018 on a full Audit Core.
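
The "control surface, not a secret collector" rule above can be checked mechanically before any evidence is persisted. The following is a minimal sketch, assuming evidence is a JSON-like dict; the key-name patterns and the `find_secret_like_keys` helper are hypothetical and are not taken from the console's code.

```python
import re
from typing import Any, List

# Hypothetical deny-list of key-name fragments; the real console may use a
# different rule set.
SECRET_KEY_PATTERN = re.compile(
    r"(password|passwd|otp_seed|token|unseal|recovery_code)", re.IGNORECASE
)


def find_secret_like_keys(evidence: Any, path: str = "") -> List[str]:
    """Return paths of keys that look like secrets AND carry a string value.

    Boolean flags such as "token_revoked": true are allowed, because they
    record that something happened without recording the secret itself.
    """
    hits: List[str] = []
    if isinstance(evidence, dict):
        for key, value in evidence.items():
            child = f"{path}.{key}" if path else str(key)
            looks_secret = SECRET_KEY_PATTERN.search(str(key)) is not None
            if looks_secret and isinstance(value, str) and value:
                hits.append(child)
            hits.extend(find_secret_like_keys(value, child))
    elif isinstance(evidence, list):
        for index, item in enumerate(evidence):
            hits.extend(find_secret_like_keys(item, f"{path}[{index}]"))
    return hits
```

A caller would refuse to write the evidence file whenever the returned list is non-empty. Matching on key names is only a heuristic; it does not detect a secret stored under an innocuous key.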

## Related (post-0019 + assessment)
- NET-WP-0019 (T06-adjacent user lifecycle dry-run polish; advanced control surface, evidence, claims for T06/T07/T08)
- docs/user-engine-netkingdom-integration-assessment.md (detailed T04 boundary/intent/scope review for user-engine integration + 7 gaps; cross-referenced from SCOPE etc.)

## Tasks

### T01 - Close Or Hand Off NET-WP-0015 Remaining Gates

```task
id: NET-WP-0018-T01
status: done
priority: high
state_hub_task_id: "7ff22629-838b-41df-9feb-bb36c5d57cc1"
```

Review `NET-WP-0015` now that `platform-root` can obtain OpenBao
`platform-admin` through KeyCape/MFA. Close any gates that are truly complete,
and explicitly move unfinished production-readiness work to `NET-WP-0017` or
this workplan when it no longer belongs in the bootstrap ceremony plan.

Done when `NET-WP-0015` is either finished and ready to archive, or its
remaining tasks have precise owners, target workplans, and non-duplicative
acceptance criteria.

**2026-06-01:** Completed. `NET-WP-0015` was scope-closed as finished after
the OpenBao admin bridge was proven through KeyCape/MFA. Its remaining
production-readiness concerns were reconciled into `NET-WP-0017`: T02 owns
audit, restore, emergency drill evidence, and escrow; T03/T04 own bootstrap
path retirement and credential reset/rotation; T07 owns final archive review.
`NET-WP-0018` now continues with architecture documentation, retrospective,
guide, UI automation, validations, and rebuild-risk assessment.

**2026-06-03:** The 0019 polish (dry-run orchestrator, console subcommands, make targets, claims, validators, runbook) and the user-engine/net-kingdom assessment (see T04) are cross-cutting enablers. See the per-task notes (T02–T09) for specifics: 0019 advances T06/T07/T08 for lifecycle automation, and the assessment fulfills the UE boundary review portion of T04. Related: NET-WP-0019, docs/user-engine-netkingdom-integration-assessment.md.

### T02 - Document The Runtime Architecture

```task
id: NET-WP-0018-T02
status: done
priority: high
state_hub_task_id: "121ee797-e3f5-4d3e-9baa-cfa8c92f8a66"
```

Create `docs/NetkingdomRuntimeArchitecture.md` documenting what now exists:
identity stores, MFA realms, KeyCape OIDC flow, Authelia handoff, OpenBao OIDC
admin path, bootstrap UI state, State Hub relation, live DNS/routes, trust
boundaries, token flows, and operational assumptions.

The document should explain the working system as deployed, not an idealized
future architecture. It should be specific enough to guide a scratch rebuild
without requiring the operator to rediscover the same integration details.

**2026-06-03 (post 0017/0019 + assessment):** The runtime now includes the
T06-adjacent dry-run tooling (orchestrator + console/make exposure + evidence
discipline) as part of the control surface. Per the persisted assessment, the
arch doc must capture: current direct LLDAP/KeyCape paths for bootstrap users
(vs. future UE claims_enrichment adapter), membership facts in LLDAP groups
vs. UE Membership (owning_system etc.), bootstrap local-identity vs. UE local
mode, and the boundary contract as the governance layer. Include refs to
canon/standards/user-engine-boundary-contract_v0.1.md and the assessment.

**2026-06-03:** Started T02. Using pragmatic tracking (this note, plus a POST to /progress/ with the task). Gathering deployed components from existing docs, code, and configs to produce a doc that is specific to the system as deployed (not idealized). It will cover all listed items plus the pragmatic audit paths, the 0019 dry-run additions, and the UE integration points/gaps per the assessment.

**2026-06-03:** T02 complete. Created docs/NetkingdomRuntimeArchitecture.md with comprehensive sections on:

- the planes model;
- identity stores, MFA, and OIDC flows (lightweight key-cape: LLDAP at lldap.coulomb.social + Authelia + privacyIDEA + KeyCape issuer https://kc.coulomb.social with bootstrap clients);
- the Authelia handoff;
- the OpenBao OIDC admin and secrets/credential path (SOPS/age bootstrap -> runtime with K8s auth, ESO, leases);
- bootstrap console/UI state (S6 Reopen, full gates incl. audit_core_posture, and the 0019 dry-run orchestrator, console subcmds, make targets, evidence, validators, and runbook entry);
- the State Hub relation (progress/decisions for tracking);
- k8s/DNS/routes/ingress/trust boundaries (sso/openbao ns, recursive rule, concrete hosts); and
- operational assumptions and rebuild notes.

The doc explicitly includes the current pragmatic audit paths (local-identity/audit.py TSV, OpenBao PVC+mock, State Hub/console evidence) and the UE integration points plus 7 gaps (from the assessment and contract refs). It is specific to the as-deployed system for rebuild guidance. This doc now feeds the T03 retrospective, T05 guide, T09 risk, and T02/T08 validation targets.

### T03 - Produce A Bootstrap Retrospective And Automation Gap Matrix

```task
id: NET-WP-0018-T03
status: done
priority: high
state_hub_task_id: "1a3c4261-4133-4021-bd53-ea3dc77021a0"
```

Assess how the first bootstrap went. Capture each bump encountered, the root
cause, how it was diagnosed, whether it is now automated, and what remains as a
manual step or fragile assumption.

Recommended output: `docs/security-bootstrap-retrospective.md` with a gap
matrix covering state persistence, privacyIDEA realm repair, KeyCape image
delivery, OIDC callbacks, OpenBao claim mapping, token revocation, audit,
escrow, and rebuild verification.

**2026-06-03 (post 0017 close + 0019 polish):** Retrospective should now
incorporate: successful S6 reopen + platform_reopened flag + cleanup_complete
in .local/security-bootstrap.json; T06 dry-run evidence discipline (12+ bools
incl. effective_access_before_save, no_secret_material_recorded, lldap_identity_verified,
keycape_oidc_claims_verified, actor_class != king, !net-kingdom-admins for non-root);
safer secret handling via /tmp WORKSPACE + trap + k8s fallback (never write
sso-mfa/bootstrap/secrets for dry-runs); console as non-secret control surface
with runbooks + templates + validators; 0019 make targets and orchestrator as
repeatable automation. Gaps remaining: UE adapter integration (see assessment).
The first bootstrap's interactive repairs (realm drift, callbacks, claim shape,
token expiry, operator-state) are now partially automated via console/evidence.

**2026-06-03:** Started T03 (after the T02 arch doc was completed). Using pragmatic tracking (progress events + file notes). Compiling bumps from the 0015-0017/0019 history, the T02 doc, and console/metadata/evidence examples. Will produce docs/security-bootstrap-retrospective.md with a gap matrix covering state persistence, privacyIDEA repair, KeyCape delivery, OIDC callbacks, OpenBao claims, token revocation, **audit**, escrow, and rebuild verification, plus new items: 0019 dry-run hygiene/automation, console evidence, and UE gaps. The matrix will state what is now automated and what remains manual or fragile.

**2026-06-03:** T03 initial substantial progress. Created docs/security-bootstrap-retrospective.md (exec summary, 9 detailed bumps with "now automated?" status, full gap matrix table covering audit + UE + 0019 items, recommendations for T05/T07/T08/T09, references to T02 doc + pragmatic records + evidence). Uses 0019 dry-run/evidence as model. Still in_progress (expand with any new from later T0x).

**2026-06-03:** T03 complete. Finalized retrospective draft with comprehensive bumps analysis, gap matrix (explicitly including audit, UE integration, 0019 polish as enablers), and actionable recs. No further expansion needed at this stage (will reference in later tasks). Used pragmatic tracking throughout (progress events with task_id, workplan notes, git). The doc + T02 now provide strong foundation for T05 (guide), T07/T08 (tests/validations), T09 (risk). Marked done in file and will sync via fix.

### T04 - Review Repository Intent And Scope Boundaries

```task
id: NET-WP-0018-T04
status: todo
priority: medium
state_hub_task_id: "9c286579-b7bc-46ae-9789-801b2b27b26d"
```

Review `INTENT.md`, `SCOPE.md`, and equivalent boundary documents across the
associated repositories involved in the bootstrap. At minimum consider
`net-kingdom`, `key-cape`, `railiance-platform`, `state-hub`/custodian, and any
repo that owns OpenBao deployment, image delivery, identity runtime, or
bootstrap automation.

Update the boundary documents or create follow-up workplans where ownership is
unclear. The result should answer: where should a bug fix live, where should a
runbook live, where should validation live, and which repo owns live
deployment state.

**2026-06-03:** The user-engine/net-kingdom integration assessment (persisted in
`docs/user-engine-netkingdom-integration-assessment.md`, cross-referenced from
SCOPE.md Getting Oriented, canon/standards/user-engine-boundary-contract_v0.1.md,
docs/responsibility-map.md, user-engine-interface-guidance.md, and this/0019
workplans) provides a comprehensive review of:

- intent;
- implemented scope (UE: headless domain models + in-mem MVP + ports/adapters
  for claims/audit/projections; NK: IAM orchestration + contracts + bootstrap);
- architectural fit (no intent conflicts; UE owns user-domain
  facts/projections, NK orchestrates boundaries per ADR-0007/0010/contract); and
- 7 specific gaps/risks:
  1. Missing Platform Integration Adapters (the biggest);
  2. Bootstrap/Platform Users vs. Governed UE Lifecycle;
  3. App Onboarding "Application" concept overload;
  4. Membership/Group overlap;
  5. Governance/Workplan/Brief split (UE brief stale);
  6. Claims Enrichment Path drift (current direct LLDAP in NK/keycape paths);
  7. Audit correlation.

NK bootstrap (0015-0017/0019) is allowed for local/non-prod per contract. This
largely fulfills the UE + boundary review portion of T04. Recommend follow-up
reviews or work items for key-cape (OIDC client vs UE Application Binding),
railiance-platform (deployment refs), and explicit transition rules for seeding
externally_provisioned memberships from IAM groups. The assessment recommends
using 0018's T07/T08 to drive integration tests/dry-runs once adapters exist.

### T05 - Create The Smooth Bootstrap Guide

```task
id: NET-WP-0018-T05
status: done
priority: high
state_hub_task_id: "e7b45fc8-8ee7-4914-ac4b-d0c8a35fad13"
```

Create or update the NetKingdom bootstrap guide so an operator knows what to
do, in what order, and what evidence proves each step is complete.

The guide should cover prerequisites, credential bundle creation, cluster
foundation checks, privacyIDEA bootstrap, LLDAP/bootstrap user creation,
KeyCape deployment and client registration, OpenBao init/unseal/configuration,
OIDC admin binding, token cleanup, State Hub sync, and handoff to production
readiness.

**2026-06-03:** Base material exists in piecemeal form: docs/security-bootstrap-*.md
(user-lifecycle.md, operator-journey.md, king-credential-kit.md, openbao-ceremony-ux.md,
handover-cleanup.md, etc.), console lifecycle-guide (T05/T06 flows with previews),
and security-bootstrap-user-lifecycle.md (UX contract for show-effective-before-save,
actor classes, blocked conditions). The 0019 polish extended the console lifecycle-guide
with T06 DRY-RUN EXECUTION section (though that section still lists pre-orchestrator
manual secret steps; update to prefer make security-bootstrap-onboarding-dry-run +
dry-run-nonroot-user.sh + k8s fallback). T05 should consolidate into one
"NET-WP-0018 smooth bootstrap guide" (or update operator-journey) with explicit
evidence per step (linking the validate-* make targets and templates). 0019's
dry-run + evidence is the model for user-lifecycle portion of the guide.

**2026-06-03:** Started T05 (after T03 complete). Per retrospective recs (T05 high priority now that T02 arch + T03 retrospective exist). Using pragmatic tracking. Will consolidate piecemeal materials (T02, T03 retrospective, console lifecycle-guide + 0019 extensions, security-bootstrap-operator-journey.md, user-lifecycle.md, other *-ux.md, evidence templates/validators from console/0019) into a single operator guide with clear sequence, prerequisites, evidence per step (links to validate-*, 0019 dry-run, etc.), and "next safe action" / blocked gates model from the UX contract. Update console guide section as needed. Produce docs/smooth-bootstrap-guide.md or update main journey doc.

**2026-06-03:** T05 complete. Created docs/smooth-bootstrap-guide.md (the consolidated NET-WP-0018 smooth bootstrap guide): covers full sequence from prereqs to reopen + user lifecycle (using 0019 polish), per-step evidence + validator/make links, blocked conditions, next safe action / blocked gates from UX contracts (operator-journey + user-lifecycle), references to T02 arch, T03 retrospective, console, 0019 artifacts. Also notes to update console lifecycle-guide for 0019 polish. Pragmatic tracking used (progress, file notes). This fulfills T05 + feeds T06 alignment.

### T06 - Align The Control Surface With The Bootstrap Guide

```task
id: NET-WP-0018-T06
status: todo
priority: high
state_hub_task_id: "9bba26b3-b1be-4e58-a18b-a0533683d63b"
```

Review the local security bootstrap UI against the guide. Improve the
automation grade where safe: replace passive checkboxes with safe validators,
convert fragile copy-paste sequences into scripts, persist non-secret progress
durably, expose repair routines for known drift cases, and keep manual steps
clear when human custody or secret handling is required.

Done when the UI guides the same sequence as the bootstrap guide and makes
wrong-order execution visibly hard.
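
One way to make wrong-order execution visibly hard is to derive the enabled and blocked steps from a single ordered list. The sketch below is illustrative only: the step names are a hypothetical abbreviation of the guide's sequence, and `next_safe_action` / `blocked_steps` are assumed helper names, not existing console functions.

```python
from typing import Dict, List, Optional

# Hypothetical, abbreviated step order; the bootstrap guide is the authority.
BOOTSTRAP_ORDER: List[str] = [
    "prerequisites",
    "privacyidea_bootstrap",
    "lldap_users",
    "keycape_clients",
    "openbao_init",
    "oidc_admin_binding",
    "token_cleanup",
]


def next_safe_action(done: Dict[str, bool]) -> Optional[str]:
    """Return the first step, in guide order, that is not yet proven done."""
    for step in BOOTSTRAP_ORDER:
        if not done.get(step, False):
            return step
    return None  # every step has evidence


def blocked_steps(done: Dict[str, bool]) -> List[str]:
    """Return steps the UI should keep disabled because an earlier one is open."""
    pending = next_safe_action(done)
    if pending is None:
        return []
    return BOOTSTRAP_ORDER[BOOTSTRAP_ORDER.index(pending) + 1:]
```

With this shape the UI never has to decide ordering per section: it renders one step as the next safe action and greys out everything returned by `blocked_steps`.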

**2026-06-03 (0019 polish delivered):** Control surface now includes (in status,
available actions, parser, dispatch, runbook_payloads, web-ui capable):
- onboarding-dry-run-template / validate-onboarding-dry-run
- onboarding-dry-run (delegates to sso-mfa/k8s/lldap/dry-run-nonroot-user.sh)
- onboarding-dry-run-claims (uses print_dry_run_oidc_claims_verification, warns on
  platform-root/admins groups)
- lifecycle-cleanup-dryrun-users (pattern offboard)
- lifecycle-guide (with T06 section)
- make targets: security-bootstrap-onboarding-dry-run (SUBJECT/EMAIL/DISPLAY),
  security-bootstrap-lifecycle-cleanup-dryrun-users PATTERN=...,
  security-bootstrap-*-validate-onboarding-dry-run etc.

The orchestrator (dry-run-nonroot-user.sh) uses a /tmp workspace with an EXIT
trap, prefers env/k8s for LLDAP_ADMIN_PASS (k8s fallback added to
create-user.sh), runs create --test and the verifications (check-user-mfa,
verify-openbao-client) plus an optional GraphQL lock/offboard, populates
/tmp/.../evidence.json from the template and live jq data, then runs validate.
Only non-secret data is recorded. This fulfills much of the "convert fragile
copy-paste into scripts", "persist non-secret progress", and "expose repair"
goals for the user-lifecycle slice. Full alignment awaits the T05 guide and
more validators in T08 (e.g. for OIDC client, OpenBao config). See the 0019
workplan for details; the lifecycle_guide T06 section needs a refresh to
deprecate the old secret-mkdir path.

### T07 - Add Automated Tests For Bootstrap UI Sections And Runbooks

```task
id: NET-WP-0018-T07
status: todo
priority: high
state_hub_task_id: "c412d9e0-a2ca-4849-b6ee-bd4450b5a4a5"
```

For each task section and runbook exposed in the control surface, add automated
tests that validate the implementation contract.

Use a layered approach:

- static/unit tests for UI payload generation and command card presence;
- shell/Python syntax tests for generated helper scripts;
- dry-run or fixture tests for validators and state transitions; and
- live-cluster checks gated behind explicit operator environment variables.

Done when every visible bootstrap section has at least one automated test that
would fail if the section disappears, emits the wrong command, or reports an
impossible state.
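
The layering above could look roughly like this with the standard `unittest` module. This is a sketch under stated assumptions: `build_runbook_payloads` is a hypothetical stand-in for the console's payload builder, and `NETKINGDOM_LIVE_CHECKS` is an invented opt-in variable name; only the two make targets are taken from this workplan.

```python
import os
import unittest

LIVE_ENV_VAR = "NETKINGDOM_LIVE_CHECKS"  # hypothetical opt-in variable


def build_runbook_payloads() -> dict:
    """Hypothetical stand-in; the real builder and its shape may differ."""
    return {
        "onboarding-dry-run": {
            "command": "make security-bootstrap-onboarding-dry-run",
        },
        "validate-onboarding-dry-run": {
            "command": "make security-bootstrap-validate-onboarding-dry-run",
        },
    }


class StaticPayloadTests(unittest.TestCase):
    """Fails if a section disappears or stops emitting a make command."""

    def test_sections_present_with_commands(self):
        payloads = build_runbook_payloads()
        for section in ("onboarding-dry-run", "validate-onboarding-dry-run"):
            self.assertIn(section, payloads)
            self.assertTrue(payloads[section]["command"].startswith("make "))


@unittest.skipUnless(os.environ.get(LIVE_ENV_VAR) == "1", "live checks not enabled")
class LiveClusterTests(unittest.TestCase):
    """Runs only when the operator explicitly opts in via the environment."""

    def test_placeholder(self):
        # A real check would query the cluster; left unimplemented here.
        self.skipTest("live check not implemented in this sketch")
```

Keeping the live layer behind an explicit variable means CI can run the static layer unconditionally while cluster access stays an operator decision.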

**Note (NET-WP-0019 polish):** Include tests for the user-lifecycle dry-run (T06 from 0017/0019): the orchestrator script, the onboarding-dry-run console command, claims verification (T05), the cleanup helper, and the evidence validators. See the NET-WP-0019 workplan and sso-mfa/k8s/lldap/dry-run-nonroot-user.sh. This cross-links the T06-adjacent polish into 0018's automation goals.

See also `docs/user-engine-netkingdom-integration-assessment.md` for the broader intent/scope fit, gaps (esp. adapters), and recommendations. (The 0019 artifacts -- script, console subcmds, make targets, runbook entry, templates/validators -- are now the concrete implementation to cover with the layered tests in T07.)

### T08 - Integrate Validations Into The UI State Model

```task
id: NET-WP-0018-T08
status: todo
priority: high
state_hub_task_id: "32f05fb1-269c-421c-ae34-57d2ceb7e47a"
```

Make the current setup prove itself through the same validations the UI shows.
Where possible, compute `ok`, `fail`, `err`, or `nil` from validators rather
than relying only on manual confirmation.

Important targets include KeyCape client config, privacyIDEA realm/resolver,
LLDAP user/group membership, Authelia/KeyCape route health, OpenBao OIDC auth
config, token policy proof, audit status, restore evidence, and State Hub sync.

Done when the UI can distinguish success, failure, error, and unknown states
for the critical bootstrap gates and the live setup satisfies those checks.
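
The four states can be computed uniformly by wrapping each validator. The following is a minimal sketch, assuming a validator is a zero-argument callable returning a bool; `GateState` and `evaluate_gate` are hypothetical names, and mapping `nil` to "no validator available" is an interpretation of "unknown" rather than a documented rule.

```python
from enum import Enum
from typing import Callable, Optional


class GateState(Enum):
    OK = "ok"      # validator ran and the check passed
    FAIL = "fail"  # validator ran and the check did not pass
    ERR = "err"    # validator could not complete (e.g. endpoint unreachable)
    NIL = "nil"    # no validator available, so the state is unknown


def evaluate_gate(validator: Optional[Callable[[], bool]]) -> GateState:
    """Map one validator run onto the four UI states."""
    if validator is None:
        return GateState.NIL
    try:
        return GateState.OK if validator() else GateState.FAIL
    except Exception:
        # An exception means "could not determine", which is distinct from
        # a check that ran and reported failure.
        return GateState.ERR
```

Separating `err` from `fail` matters for the operator: a failed check calls for repair of the bootstrap state, while an error calls for repair of the validator's access path.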

**2026-06-03 (0019 contribution):** Dry-run specific validators now exist:
onboarding_dry_run_template() + require_evidence_fields match + make
security-bootstrap-validate-onboarding-dry-run (calls console which runs
print_validate_onboarding_dry_run or equivalent, checks all *_true bools,
actor_class, groups, no secret markers, effective_access_summary etc.).
Console status/metadata shows many gates as "done" from prior evidence-driven
flags (e.g. platform_reopened, cleanup_complete, oidc_login_verified). The
evidence_validator_gate and build_gates logic support computing ok/fail from
live evidence rather than manual. Extend this pattern to other T08 targets
(KeyCape client, privacyIDEA realm, LLDAP membership, OpenBao OIDC, Authelia
routes, State Hub sync). 0019 also added claims verification as a hook that
can feed validation (infers from LLDAP groups + T01 role binding, surfaces
warnings). Use the dry-run orchestrator + /tmp evidence as a repeatable
fixture for these validators. See assessment for UE-side validation targets
once adapters land (e.g. claims_enrichment projection).
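
The dry-run evidence rules named in this workplan can be expressed as a small validator. This is a sketch, not the console's implementation: the boolean field names and the `actor_class` / `groups` fields come from the notes in this file, but the exact evidence schema and the `validate_dry_run_evidence` helper are assumptions.

```python
from typing import Any, Dict, List

# Boolean fields named in the T03 note; the real template lists more.
REQUIRED_TRUE = (
    "effective_access_before_save",
    "no_secret_material_recorded",
    "lldap_identity_verified",
    "keycape_oidc_claims_verified",
)


def validate_dry_run_evidence(evidence: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the evidence passes."""
    problems: List[str] = []
    for field in REQUIRED_TRUE:
        # Require the literal boolean True, not merely a truthy value.
        if evidence.get(field) is not True:
            problems.append(f"{field} must be true")
    if evidence.get("actor_class") in (None, "king"):
        problems.append("actor_class must be set and must not be king")
    if "net-kingdom-admins" in evidence.get("groups", []):
        problems.append("non-root dry-run user must not be in net-kingdom-admins")
    return problems
```

Returning the full problem list, rather than stopping at the first failure, lets the UI show every blocked condition in one pass.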

### T09 - Assess Scratch-Rebuild Risk And Define A Rehearsal Plan

```task
id: NET-WP-0018-T09
status: todo
priority: high
state_hub_task_id: "a9e60fd5-fac6-46e9-bc63-b2979cca548e"
```

Review the resulting architecture, guide, automation, tests, and live
validation coverage. Produce a risk assessment for restarting the NetKingdom
infrastructure from scratch.

The assessment should classify each risk by likelihood, impact, detection
method, mitigation, and remaining human interaction. It should also recommend
whether the next rebuild should be a full teardown, an isolated parallel
cluster rehearsal, a namespace-level rehearsal, or a scripted dry run.
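
The classification described above maps naturally onto a small risk-register record. The sketch below is illustrative: the `RebuildRisk` type, the three-level scale, and the likelihood-times-impact score are assumptions, and the example levels are placeholders rather than assessed values.

```python
from dataclasses import dataclass
from typing import List

LEVELS = {"low": 1, "medium": 2, "high": 3}  # hypothetical three-level scale


@dataclass
class RebuildRisk:
    name: str
    likelihood: str          # key of LEVELS
    impact: str              # key of LEVELS
    detection: str           # how the problem would be noticed
    mitigation: str
    human_interaction: str   # what still needs an operator

    def score(self) -> int:
        """Simple likelihood x impact product used only for ordering."""
        return LEVELS[self.likelihood] * LEVELS[self.impact]


def rank(risks: List[RebuildRisk]) -> List[RebuildRisk]:
    """Highest score first; ties keep their input order."""
    return sorted(risks, key=lambda risk: risk.score(), reverse=True)


# Placeholder entries: the levels are NOT assessed values.
EXAMPLE_RISKS = [
    RebuildRisk(
        name="privacyIDEA realm drift",
        likelihood="medium",
        impact="medium",
        detection="realm/resolver validator",
        mitigation="repair routine exposed in the control surface",
        human_interaction="operator confirms repair",
    ),
    RebuildRisk(
        name="UE integration",
        likelihood="high",
        impact="high",
        detection="dry-run exercising UE projection",
        mitigation="implement adapters + NK wiring",
        human_interaction="review of adapter rollout",
    ),
]
```

Calling `rank(EXAMPLE_RISKS)` orders the register for review; the score is a sorting aid and should not replace the written assessment of each risk.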

**2026-06-03 (post 0017/0019 + assessment):** Rebuild risk assessment (T09) will
be informed by: T02 arch (incl. UE integration points/gaps), T03 retrospective
(capturing what was fragile vs now automated via console/evidence/orchestrator),
T05 guide + evidence per step, T07 tests, T08 live validations (current metadata
shows S6 reopen with many flags true, but adapter gaps remain). From assessment:
- IAM-orchestration bootstrap (creds via creds-init skill, LLDAP/KeyCape direct,
  OpenBao via KeyCape OIDC) is repeatable and rehearsable today with 0019 tooling.
- Full UE-backed user facts in rebuild: blocked until net-kingdom-specific
  adapters (IdentityClaimsAdapter from KeyCape claims, AuthorizationCheckPort to
  flex-auth, SecretProvider OpenBao, EventOutbox, AuditWriter, MembershipFactExporter)
  are implemented (primarily in user-engine per contract; NK orchestrates).
- Other: direct LLDAP access in current paths (create-user, keycape) must route
  via the claims_enrichment adapter once adapters exist, to avoid drift.
  Bootstrap users (platform-root etc.) stay IAM-side or are seeded as
  externally_provisioned in UE. Recommendation: T09 should classify "UE
  integration" as a separate risk item with the mitigation "implement adapters +
  NK wiring + update dry-run to exercise UE projection"; the current 0019
  dry-run proves the IAM-lifecycle contract. The creds-init skill (in
  .claude/commands) provides an automated credential bootstrap entrypoint for
  rehearsal. A live destructive rebuild remains a non-goal.

## Acceptance Criteria

- `NET-WP-0015` is closed, archived, or explicitly reconciled with remaining
  work owned elsewhere.
- `docs/NetkingdomRuntimeArchitecture.md` documents the real deployed runtime.
- A bootstrap retrospective and automation gap matrix exists.
- Associated repository boundaries are reviewed and updated or tracked with
  follow-up work.
- A smooth bootstrap guide describes the intended sequence and evidence.
- The control surface follows the guide and uses safe automation wherever
  appropriate.
- Every bootstrap UI section and runbook has automated coverage.
- The live setup passes the integrated validations or reports actionable
  failures.
- A scratch-rebuild risk assessment recommends the next rehearsal strategy.

## Non-Goals

- Do not perform a destructive live rebuild as part of this workplan.
- Do not move secret material into Git, State Hub, or the bootstrap UI.
- Do not hide remaining human custody decisions behind automation.
- Do not collapse repository ownership boundaries merely for convenience.