NET-WP-0018 Smooth Bootstrap Guide

Status: draft (initial consolidation for T05)

Date: 2026-06-03

Purpose: The single operator guide for a smooth, repeatable NetKingdom security bootstrap. It tells an operator what to do, in what order, and what (non-secret) evidence proves each step complete. It covers the full sequence from the T05 spec, with inputs from the T02 runtime architecture, the T03 retrospective and gap matrix, the existing UX contracts (operator-journey, user-lifecycle), the console lifecycle-guide (incl. 0019 T06-adjacent polish), the evidence templates/validators, and the make targets.

This guide replaces piecemeal reliance on separate docs. It makes wrong-order execution visibly hard by surfacing the "next safe action" and blocked gates, and it links to the concrete commands, scripts, console subcommands, validate targets, and evidence for each step.

See also:

  • docs/NetkingdomRuntimeArchitecture.md (T02 what exists)
  • docs/security-bootstrap-retrospective.md (T03 what was bumpy, now automated, gaps)
  • tools/security-bootstrap-console/security_bootstrap_console.py + make security-bootstrap-* (control surface, evidence, validators)
  • sso-mfa/k8s/lldap/dry-run-nonroot-user.sh + related (0019 polish)
  • .local/security-bootstrap.json + console status (current gates)

Pragmatic note (per 0018 Coordination): Track your progress through this guide using State Hub /progress/ (with workstream/task), dated notes in the NET-WP-0018 workplan, git, console evidence/validators, and /tmp evidence. This record feeds future retrospectives.

Overall Model and Principles

From platform architecture and UX contracts:

  • Stages: S1 Low-trust assembly → ... → S6 Reopen under custody (see console status).
  • Shell / First screen always answers: Current stage, Next safe action, Blocked gates (why), Evidence (non-secret records).
  • UI posture (console/web): Calm field notebook; black/white + hi accents; panels; sentence case; no hype. Shows effective access before any save/action. Blocked conditions explicit (e.g., no platform-root for non-king, MFA required for privileged).
  • Evidence discipline: All steps produce or require non-secret evidence.json or metadata flags that match the exact templates/validators (no secret markers). The user lifecycle alone carries 12 or more boolean flags (effective preview, no root grant, actor checks, verified identity/claims, reversible, no secrets recorded, etc.).
  • Actor classes & previews: Always distinguish (setup operator, platform admin, tenant admin, reviewer, king). Show effective privileges before create/save. Never grant platform-root except via explicit king path.
  • Secret boundary: Console/UI never collects/stores secrets. Use password-safe, k8s secrets, or operator memory. Prefer k8s fallback for dry-runs (see 0019).
  • Reversible where possible; human custody gates explicit.
  • Handoff to production readiness: After S6, move to 0017 production items (audit durability, etc. not duplicated here).
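The "first screen" principle above can be sketched as a small data model. This is only an illustration under stated assumptions: the `Gate` and `FirstScreen` classes and the `build_first_screen` function are hypothetical names, not the console's actual schema or API. The sketch shows how an ordered list of gates can be reduced to the four answers the shell must always give.

```python
from dataclasses import dataclass, field


@dataclass
class Gate:
    """One bootstrap gate (hypothetical shape, not the console's schema)."""
    name: str
    satisfied: bool
    reason: str = ""       # why the gate is blocked (non-secret text only)
    next_action: str = ""  # what the operator does to clear the gate


@dataclass
class FirstScreen:
    current_stage: str
    next_safe_action: str
    blocked_gates: list = field(default_factory=list)
    evidence: list = field(default_factory=list)


def build_first_screen(stage, gates):
    """Reduce ordered gates to the four answers the first screen must give."""
    blocked = [g for g in gates if not g.satisfied]
    # The next safe action comes from the first blocked gate in sequence order.
    action = blocked[0].next_action if blocked else "Review related workplans"
    return FirstScreen(
        current_stage=stage,
        next_safe_action=action,
        blocked_gates=[f"{g.name}: {g.reason}" for g in blocked],
        evidence=[g.name for g in gates if g.satisfied],
    )
```

Taking the action from the first blocked gate, rather than from any blocked gate, is what keeps the operator on the intended order.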

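Sequence overview (high-level; details per section. The numbered Step sections below merge some of these items, so the numbers here do not map one-to-one onto the section numbers):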

  1. Prerequisites & cluster foundation.
  2. Credential bundle / king kit (SOPS/age, custody).
  3. PrivacyIDEA bootstrap + realm.
  4. LLDAP/bootstrap user (platform-root/king) + MFA self-enroll + verify.
  5. KeyCape deployment + client registration + OIDC.
  6. OpenBao init/unseal/config (OIDC admin binding via KeyCape).
  7. Token cleanup, root disposition, restore drill, escrow/custody.
  8. User lifecycle (onboard/lock/offboard/review; use the 0019 dry-run for tests).
  9. State Hub sync, audit posture, cleanup/rotation.
  10. Platform reopen (S6) + handoff.

Each step has: commands/scripts, evidence required, blocked conditions, links to validators/console.
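To make the ordering rule concrete, here is a minimal sketch of how wrong-order execution could be refused. The step identifiers in `SEQUENCE` are hypothetical labels for the ten overview items, and `next_safe_step` and `can_start` are illustrative helpers, not console subcommands.

```python
from typing import Optional

# Hypothetical identifiers for the ten overview items, in order.
SEQUENCE = [
    "prerequisites",
    "king_kit",
    "privacyidea_realm",
    "lldap_bootstrap_user",
    "keycape_oidc",
    "openbao_init_config",
    "token_cleanup_custody",
    "user_lifecycle",
    "state_hub_audit_cleanup",
    "platform_reopen",
]


def next_safe_step(completed: set) -> Optional[str]:
    """Return the first step that is not yet complete, or None when all are done."""
    for step in SEQUENCE:
        if step not in completed:
            return step
    return None


def can_start(step: str, completed: set) -> bool:
    """A step may start only when every earlier step is complete."""
    earlier = SEQUENCE[: SEQUENCE.index(step)]
    return all(s in completed for s in earlier)
```

For example, with only `prerequisites` complete, `can_start("openbao_init_config", ...)` is false and `next_safe_step` points at `king_kit`.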

Step 1: Prerequisites, Cluster Foundation, Credential Bundle

Prerequisites:

  • Live Railiance/k3s cluster with ingress, cert-manager, NetworkPolicies, operators as per T02.
  • Operator access: kubectl to sso/openbao ns; password safe entries (net-kingdom/LLDAP/admin, etc.); age keys/custodian public.
  • No live OpenBao init in console (attended only).

Credential bundle / king kit:

  • Use make security-bootstrap-king-kit or console king-kit.
  • Dedicated king credential (platform-root@lldap, separate from personal).
  • MFA: privacyIDEA self-service TOTP.
  • Storage: password-safe + offline packet (age encrypted).
  • Evidence: custodian_age_*_confirmed, king_credential_ready, mfa_enrolled_confirmed, password_safe_confirmed, storage_classes, custody_packet_prepared.
  • Validate: make security-bootstrap-validate-kit or console validate-king-kit.
  • Custody roster (for 2of3 target): make security-bootstrap-custody-roster-template, sign, validate.

Blocked if: No king kit or custody approval.

Next safe action after this step: approve custody mode (console approve-custody-mode, or the make target with ARGS for flags such as --mfa-enrolled-confirmed).
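As an illustration of what a king-kit check involves, the sketch below tests the evidence flags listed in this step. It is an assumption-laden sketch, not the logic of validate-king-kit: the function name `king_kit_problems` is hypothetical, the source only gives the wildcard form custodian_age_*_confirmed (so the concrete custodian flag names are unknown), and treating storage_classes as a non-empty value is a guess.

```python
import fnmatch

# Flags named in this step that are assumed to be plain booleans.
REQUIRED_TRUE = (
    "king_credential_ready",
    "mfa_enrolled_confirmed",
    "password_safe_confirmed",
    "custody_packet_prepared",
)
CUSTODIAN_PATTERN = "custodian_age_*_confirmed"


def king_kit_problems(evidence: dict) -> list:
    """Return a list of non-secret problem strings; empty means the kit looks complete."""
    problems = [f"{k} is not confirmed" for k in REQUIRED_TRUE if evidence.get(k) is not True]
    custodian_keys = sorted(fnmatch.filter(list(evidence), CUSTODIAN_PATTERN))
    if not custodian_keys:
        problems.append("no custodian_age_*_confirmed flag present")
    problems.extend(f"{k} is not confirmed" for k in custodian_keys if evidence[k] is not True)
    # Assumption: storage_classes must be a non-empty value (e.g. a list of names).
    if not evidence.get("storage_classes"):
        problems.append("storage_classes is empty")
    return problems
```

Returning reasons instead of a bare boolean matches the guide's rule that blocked conditions are shown explicitly.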

Step 2: PrivacyIDEA Bootstrap + Realm

  • Deploy/repair realm, LLDAP resolver, self-service policies.
  • Use repair script (sso-mfa/k8s/privacyidea/repair-realm-live.sh) or console runbook "privacyIDEA realm repair".
  • Enroll platform-root TOTP in repaired realm (pi-admin for setup).
  • Evidence: related to t02 validate (realm healthy).
  • Validate: make security-bootstrap-validate-t02 (covers audit/recovery gates incl. this).

Blocked if: Realm not correct for LLDAP users (MFA/self-enroll fails).

See T03 retrospective for past realm drift bumps (now partially automated via runbook + validate).

Step 3: LLDAP / Bootstrap User Creation (platform-root / king)

  • Create platform-root user in LLDAP (via create-user.sh or LLDAP admin UI at lldap.coulomb.social).
  • Command example (with KUBECTL fallback): cd sso-mfa/k8s/lldap && ./create-user.sh platform-root ... (no --admin for non-root tests; use --admin only for platform admins via king path).
  • Self-enroll TOTP in privacyIDEA.
  • Verify MFA state: cd ../privacyidea && ./check-user-mfa-state.sh platform-root.
  • Verify OIDC/KeyCape path: cd ../keycape && ./verify-openbao-client.sh.
  • Groups: net-kingdom-admins for platform-root.
  • Evidence: identity_account_created, identity_group_confirmed, mfa_*, oidc_login_verified.
  • For non-root tests: use the 0019 dry-run (see Step 7).

Blocked if: Actor requests net-kingdom-admins without king path; no MFA for privileged.
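The two blocked conditions above can be expressed as a small pre-creation check. This is a sketch only: `creation_blocks` and its parameters are hypothetical, and the real enforcement lives in the scripts and console gates referenced in this step.

```python
PRIVILEGED_GROUPS = frozenset({"net-kingdom-admins"})


def creation_blocks(requested_groups, via_king_path, mfa_enrolled):
    """Return human-readable reasons why a user creation request is blocked."""
    blocks = []
    privileged = sorted(PRIVILEGED_GROUPS.intersection(requested_groups))
    if privileged and not via_king_path:
        blocks.append(f"{', '.join(privileged)} requested without the king path")
    if privileged and not mfa_enrolled:
        blocks.append("MFA is required for privileged access")
    return blocks
```

A request for net-kingdom-users only returns an empty list, while a request for net-kingdom-admins outside the king path is refused with a reason the operator can read.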

Console: lifecycle-guide for full flow; onboarding-dry-run* for tests.

Step 4: KeyCape Deployment + Client Registration + OIDC

  • KeyCape as lightweight IAM (conforms to IAM Profile v0.2).
  • Deploy client config (sso-mfa/k8s/keycape/create-secrets.sh).
  • Apply keycape-config Secret, restart KeyCape.
  • Register bootstrap clients (netkingdom-bootstrap-console, openbao-admin).
  • Redirects: localhost:8250/oidc/callback etc.
  • Verify OIDC admin login: platform-root obtains OpenBao platform-admin via KeyCape/MFA.
  • Evidence: KeyCape client gates, openbao_oidc_*, oidc_login_verified.
  • Validate: the related t02 and console gates.

Issuer: https://kc.coulomb.social (see T02 for claims: tenant, groups, roles, assurance, etc.).
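A quick sanity check on already-decoded token claims might look like the sketch below. It assumes the claim names appear literally as tenant, groups, roles, and assurance (T02 defines the real shape), uses the standard OIDC `iss` claim for the issuer, and does not verify signatures. `claim_problems` is a hypothetical helper, not part of KeyCape or the console.

```python
EXPECTED_ISSUER = "https://kc.coulomb.social"
# Assumption: these claim names are used verbatim as keys; see T02 for the real shape.
EXPECTED_CLAIMS = ("tenant", "groups", "roles", "assurance")


def claim_problems(claims: dict) -> list:
    """Check decoded claims for the expected issuer and claim names (no signature check)."""
    problems = []
    if claims.get("iss") != EXPECTED_ISSUER:
        problems.append("unexpected issuer")
    problems.extend(f"missing claim: {name}" for name in EXPECTED_CLAIMS if name not in claims)
    return problems
```

Only claim names and the issuer are inspected, so the result is safe to record as non-secret evidence.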

See T03 for past callback/registration bumps (now gated).

Step 5: OpenBao Init / Unseal / Config + OIDC Admin Binding

Attended only (console refuses live init):

  • Preflight: make security-bootstrap-openbao-preflight with --run (passed via ARGS if the target follows the Step 1 convention), or the console equivalent.
  • Init ceremony (human-attended): produce init output, unseal shares, root token.
  • Post-unseal: apply initial config (auth, mounts, policies, audit).
  • OIDC auth config against KeyCape (maps claims/groups to policies e.g. net-kingdom-admins → platform-admin).
  • Key material handling: trial exposure taint, rotate unseal keys, emergency lockdown, restore drill (snapshot, isolate, verify, destroy).
  • Root token disposition: revoked (evidence: root_token_disposition).
  • Evidence: openbao_initialized, initial_config_applied, trial_exposed + response_complete, keys_rotated, post_unseal_verified, openbao_oidc_*, restore_drill_passed, etc.
  • Validate: multiple t02-related + console gates (taint logic in runbooks).

Blocked if: Not unsealed, no OIDC binding, no restore drill, trial material not responded to.
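The blocked conditions for this step, including the taint rule for trial-exposed key material, can be sketched as follows. This is an assumption about how the flags combine, not the console's actual taint logic: `openbao_blocks` is a hypothetical name, any key starting with openbao_oidc_ is treated as part of the OIDC binding, and a trial exposure is treated as resolved only when both response_complete and keys_rotated are true.

```python
def openbao_blocks(ev: dict) -> list:
    """Return reasons the OpenBao step is still blocked, from non-secret flags only."""
    blocks = []
    if not ev.get("openbao_initialized"):
        blocks.append("OpenBao is not initialized")
    if not ev.get("post_unseal_verified"):
        blocks.append("OpenBao is not verified as unsealed")
    # Assumption: every openbao_oidc_* flag must be true, and at least one must exist.
    oidc = [v for k, v in ev.items() if k.startswith("openbao_oidc_")]
    if not oidc or not all(v is True for v in oidc):
        blocks.append("OIDC admin binding is not verified")
    if not ev.get("restore_drill_passed"):
        blocks.append("restore drill has not passed")
    # Taint rule (assumed): exposure needs a completed response and rotated keys.
    if ev.get("trial_exposed") and not (ev.get("response_complete") and ev.get("keys_rotated")):
        blocks.append("trial-exposed key material has no completed response")
    return blocks
```

An empty evidence dict yields four blocks; the taint block only appears once trial_exposed has been recorded.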

Console runbooks: Key material compromised, generate new unseal, emergency lock-down, restore drill, token revocation.

See T02 for secret/credential path details; T03 for past claim shape / token issues (now partially automated via gates/runbooks).

Step 6: Token Cleanup, State Hub Sync, Audit Posture, Cleanup/Rotation

  • Revoke short-lived/bootstrap tokens (use console revoke helpers or runbooks; no plaintext on CLI).
  • State Hub: sync work (POST /progress/, decisions, etc.); ensure .custodian-brief reflects the current state.
  • Audit posture: record audit_core_bootstrap_risk_accepted (or production_sink_ready), owner, review_date (2026-07-02), note. Use console approve or metadata update. Validate: console audit_core_posture.
  • Cleanup/rotation: review bootstrap-era creds/databases/paths; rotate as needed. Evidence: cleanup_complete.
  • Custody packet / roster final.

Evidence: root_token_disposition, audit_core_*, cleanup_complete, etc.
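For the audit posture record, a minimal check could look like this. It assumes the metadata is a flat dict using the field names from this step and that review_date is an ISO date string; `audit_posture_problems` is a hypothetical helper and not the console's audit_core_posture validator.

```python
from datetime import date


def audit_posture_problems(meta: dict, today: date) -> list:
    """Check that an audit posture is recorded, owned, and not overdue for review."""
    problems = []
    if not (meta.get("audit_core_bootstrap_risk_accepted") or meta.get("production_sink_ready")):
        problems.append("neither risk acceptance nor a production sink is recorded")
    if not meta.get("owner"):
        problems.append("owner is missing")
    try:
        review = date.fromisoformat(str(meta.get("review_date", "")))
    except ValueError:
        problems.append("review_date is missing or not an ISO date")
    else:
        if review < today:
            problems.append("review_date has passed; posture needs re-review")
    return problems
```

Passing `today` in explicitly keeps the check deterministic, which is convenient when the same record is re-validated during a later drill.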

Console: cleanup-evidence-template, validate-cleanup, metadata updates.

See T03 for operator-state / token bumps (now in metadata + validators).

Step 7: User Lifecycle (Onboard / Lock / Offboard / Review) Using 0019 Polish

This implements the first practical flow per docs/security-bootstrap-user-lifecycle.md (UX contract) and T02/T03.

Always:

  • Preview effective access (actor_class, scope, groups, MFA, OpenBao policy, no root for non-king).
  • Record non-secret evidence/audit (State Hub /progress/, console evidence, k8s audit).
  • MFA required for privileged.
  • Reversible where possible.

Detailed (from console lifecycle-guide + 0019):

  • Onboard scoped non-root: use make security-bootstrap-onboarding-dry-run SUBJECT=... EMAIL=... DISPLAY="..." (or run ./dry-run-nonroot-user.sh directly). Internally, the script uses safe secret handling (k8s/env, /tmp), creates the user with --test (no --admin), verifies MFA/KeyCape, optionally runs lock/offboard via GraphQL, and populates and validates the evidence (lldap_identity_verified, keycape_oidc_claims_verified, effective_access_summary, lock_offboard_result, actor_class="user", groups=["net-kingdom-users"], no_secret_material_recorded, prevents_platform_root_grant, etc.).
  • Console: onboarding-dry-run, onboarding-dry-run-claims (infers claims from groups + T01 role; warns on admins/root), lifecycle-cleanup-dryrun-users --pattern t06-*.
  • Lock: remove from groups (GraphQL or LLDAP UI); reversible.
  • Offboard: delete user after resource transfer + reason/date; evidence.
  • Review: check-user-mfa-state, LLDAP groups, owned principals; rotate via creds tools.
  • Fabric/tenant admin: same but scoped groups, no platform-root, explicit preview "will NOT be in net-kingdom-admins".
  • For platform-root/king: king path only.

Evidence templates: make security-bootstrap-onboarding-dry-run-template, lifecycle-flow-template.

Validate: make security-bootstrap-validate-onboarding-dry-run, validate-lifecycle-flow.

Runbook: "User lifecycle dry-run (T06)" in console runbooks (refs 0019 + script).

Blocked if: Missing actor/scope, privileged without MFA, ordinary user gets root groups/policy.
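The dry-run evidence fields named in this step suggest a check along these lines. It is a sketch under assumptions: `dry_run_problems` is a hypothetical name, the real rules live in validate-onboarding-dry-run, and only the fields quoted in this guide are inspected.

```python
# Flags from the dry-run evidence that are assumed to be required booleans.
REQUIRED_FLAGS = (
    "lldap_identity_verified",
    "keycape_oidc_claims_verified",
    "no_secret_material_recorded",
    "prevents_platform_root_grant",
)
ROOT_GROUPS = frozenset({"net-kingdom-admins"})


def dry_run_problems(ev: dict) -> list:
    """Validate non-root onboarding dry-run evidence; empty list means no problems found."""
    problems = []
    if ev.get("actor_class") != "user":
        problems.append("actor_class is not 'user'")
    root = sorted(ROOT_GROUPS.intersection(ev.get("groups", [])))
    if root:
        problems.append(f"ordinary user holds root group(s): {', '.join(root)}")
    if not ev.get("effective_access_summary"):
        problems.append("effective_access_summary is missing")
    problems.extend(f"{f} is not true" for f in REQUIRED_FLAGS if ev.get(f) is not True)
    return problems
```

Checking the group list directly, in addition to the prevents_platform_root_grant flag, guards against evidence that claims safety while listing a root group.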

See 0019 workplan + dry-run script + T03 matrix for past taint/hygiene bumps (now largely automated via /tmp + evidence).

Update console lifecycle_guide T06 section if it still shows old manual secret steps (prefer orchestrator).

Step 8: Platform Reopen + Handoff

  • Final gates: all prior evidence + platform_reopened flag.
  • Approve custody if needed.
  • Console: status shows S6; "Review related workplans".
  • Handoff: produce handover checklist (console handover-checklist); transfer to production readiness (audit durability, escrow, etc. per 0017 not duplicated here).
  • Evidence: platform_reopened, review_date, notes.

Validate: related custody/kit + full status.

Step 9: Post-Reopen / Optimization

  • Use T03 retrospective + T02 arch + this guide for future drills.
  • Address gaps (UE adapters per assessment, full Audit Core correlation, more T08 validators).
  • Rehearse rebuild per T09 (complete). See docs/security-bootstrap-rebuild-risk-and-rehearsal.md for the risk classification and the non-destructive scripted/namespace/parallel plan, which uses 0019, creds, and the T08 validators. Start with the scripted/namespace variants and use the 0019 dry-run as the model.

Evidence Summary Table (Core Non-Secret)

  • King/custody: age keys, roster, packet, approval.
  • Identity/MFA: account/group created, mfa_enrolled, oidc_verified.
  • OpenBao: initialized, config_applied, oidc_bound, root_disposition, restore_drill, keys_rotated, taint/response.
  • Lifecycle (0019): dry_run_date, actor_class, groups, effective_access_summary, lock_offboard_result, verified, no_secret_material_recorded, prevents, shows_effective_before_save.
  • General: cleanup_complete, platform_reopened, audit_core_* (risk accepted or sink ready), metadata_updated_at.

All validated via console validate-* or make targets. Templates in console.
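Because every evidence record must stay free of secret material, a generic scan for secret markers is a useful complement to the per-step checks. The sketch below walks a parsed evidence document and reports where a marker appears. The marker list is an illustrative assumption (the age secret-key prefix and PEM headers), not the markers the console validators use, and `find_secret_markers` is a hypothetical helper.

```python
import json

# Illustrative markers only: age identity prefix and PEM block headers.
SECRET_MARKERS = ("AGE-SECRET-KEY-", "-----BEGIN")


def find_secret_markers(node, path="$"):
    """Return the paths of string values that contain a secret marker."""
    hits = []
    if isinstance(node, dict):
        for key, value in node.items():
            hits += find_secret_markers(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            hits += find_secret_markers(value, f"{path}[{index}]")
    elif isinstance(node, str) and any(m in node for m in SECRET_MARKERS):
        hits.append(path)
    return hits


if __name__ == "__main__":
    sample = json.loads('{"cleanup_complete": true, "notes": ["ok", "AGE-SECRET-KEY-1EXAMPLE"]}')
    print(find_secret_markers(sample))
```

The scan reports paths rather than values, so its own output never repeats the material it found.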

References and Updates

  • Full list in T02/T03 docs.
  • Console lifecycle-guide, status, web-ui.
  • Update this guide + console guide section as T06/T08 work proceeds (e.g., more validators, control surface alignment).
  • For web-ui exposure of this guide: see T06.

This guide + the runtime architecture + retrospective turn the first bootstrap into a repeatable, auditable (pragmatically), low-diagnosis path. Use it; record evidence; improve via T07+.

Next after this guide: Align control surface (T06), add tests (T07), integrate validations (T08), assess rebuild risk (T09).

See NET-WP-0018 workplan for full acceptance.