# NET-WP-0018 Smooth Bootstrap Guide
**Status:** draft (initial consolidation for T05)
**Date:** 2026-06-03
**Purpose:** The single operator guide for a smooth, repeatable NetKingdom security bootstrap. It tells an operator what to do, in what order, and what non-secret evidence proves each step complete. It covers the full sequence from the T05 spec, with inputs from the T02 runtime architecture, the T03 retrospective and gap matrix, the existing UX contracts (operator-journey, user-lifecycle), the console lifecycle-guide (including 0019 T06-adjacent polish), the evidence templates and validators, and the make targets.
This guide replaces piecemeal reliance on separate docs. It makes wrong-order execution visibly hard through "next safe action" prompts and blocked gates, and it links to the concrete commands, scripts, console subcommands, validate targets, and evidence for each step.
See also:
- docs/NetkingdomRuntimeArchitecture.md (T02 what exists)
- docs/security-bootstrap-retrospective.md (T03 what was bumpy, now automated, gaps)
- tools/security-bootstrap-console/security_bootstrap_console.py + make security-bootstrap-* (control surface, evidence, validators)
- sso-mfa/k8s/lldap/dry-run-nonroot-user.sh + related (0019 polish)
- .local/security-bootstrap.json + console status (current gates)
**Pragmatic note (per 0018 Coordination):** Track your progress through this guide using State Hub /progress/ (with workstream/task), dated notes in NET-WP-0018 workplan, git, console evidence/validators, /tmp evidence. This feeds future retrospectives.
## Overall Model and Principles
From platform architecture and UX contracts:
- **Stages:** S1 Low-trust assembly → ... → S6 Reopen under custody (see console status).
- **Shell / First screen always answers:** Current stage, Next safe action, Blocked gates (why), Evidence (non-secret records).
- **UI posture (console/web):** Calm field notebook; black/white + hi accents; panels; sentence case; no hype. Shows effective access before any save/action. Blocked conditions explicit (e.g., no platform-root for non-king, MFA required for privileged).
- **Evidence discipline:** Every step produces or requires non-secret evidence (an evidence.json file or metadata flags) that matches the exact templates and validators, with no secret markers. The user lifecycle uses 12 or more boolean flags (effective preview, no root grant, actor checks, verified identity/claims, reversible, no secrets recorded, etc.).
- **Actor classes & previews:** Always distinguish (setup operator, platform admin, tenant admin, reviewer, king). Show effective privileges before create/save. Never grant platform-root except via explicit king path.
- **Secret boundary:** Console/UI never collects/stores secrets. Use password-safe, k8s secrets, or operator memory. Prefer k8s fallback for dry-runs (see 0019).
- **Reversible where possible; human custody gates explicit.**
- **Handoff to production readiness:** After S6, move to the 0017 production items (audit durability, etc.; not duplicated here).
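The evidence discipline and secret boundary above can be sketched as a small check run before an evidence record is saved. This is a minimal illustration, not the console's validator: the marker list and the names `iter_string_values` and `secret_marker_hits` are assumptions made for the example. The two markers used are the standard prefix of an age secret key and the opening of a PEM block.

```python
from typing import Any, Iterator, List

# Illustrative markers only; the real validators define their own list.
SECRET_MARKERS = ("AGE-SECRET-KEY-", "-----BEGIN")


def iter_string_values(value: Any) -> Iterator[str]:
    """Yield every string value nested inside dicts and lists (keys are skipped)."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for nested in value.values():
            yield from iter_string_values(nested)
    elif isinstance(value, (list, tuple)):
        for nested in value:
            yield from iter_string_values(nested)


def secret_marker_hits(evidence: dict) -> List[str]:
    """Return the markers found in any string value, in first-seen order."""
    hits: List[str] = []
    for text in iter_string_values(evidence):
        for marker in SECRET_MARKERS:
            if marker in text and marker not in hits:
                hits.append(marker)
    return hits
```

Only values are scanned, so a flag name such as password_safe_confirmed does not trigger a false positive. An empty result means no marker was found, not that the record is proven secret-free.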
**Sequence overview (high-level; details per section).** The overview is finer-grained than the section headings, so each item names the section that covers it:
1. Prerequisites & cluster foundation (Step 1).
2. Credential bundle / king kit (SOPS/age, custody) (Step 1).
3. PrivacyIDEA bootstrap + realm (Step 2).
4. LLDAP/bootstrap user (platform-root/king) + MFA self-enroll + verify (Step 3).
5. KeyCape deployment + client registration + OIDC (Step 4).
6. OpenBao init/unseal/config (OIDC admin binding via KeyCape) (Step 5).
7. Token cleanup, root disposition, restore drill, escrow/custody (Steps 5 and 6).
8. User lifecycle (onboard/lock/offboard/review; use the 0019 dry-run for tests) (Step 7).
9. State Hub sync, audit posture, cleanup/rotation (Step 6).
10. Platform reopen (S6) + handoff (Step 8).
Each step has: commands/scripts, evidence required, blocked conditions, links to validators/console.
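The ordering rule behind "next safe action" can be sketched as follows. This is a minimal illustration under stated assumptions: the console derives its gates from evidence rather than from a set of completed step numbers, and `STEPS` and `next_safe_action` are hypothetical names.

```python
from typing import Optional, Set, Tuple

# The ten overview items, in order (titles abbreviated from the list above).
STEPS = (
    "Prerequisites & cluster foundation",
    "Credential bundle / king kit",
    "PrivacyIDEA bootstrap + realm",
    "LLDAP bootstrap user + MFA self-enroll + verify",
    "KeyCape deployment + client registration + OIDC",
    "OpenBao init/unseal/config",
    "Token cleanup, root disposition, restore drill, escrow/custody",
    "User lifecycle",
    "State Hub sync, audit posture, cleanup/rotation",
    "Platform reopen (S6) + handoff",
)


def next_safe_action(completed: Set[int]) -> Optional[Tuple[int, str]]:
    """Return the first incomplete step (1-based), or None when all are done.

    A later step marked complete never unblocks an earlier gap: the first
    missing step is always the next safe action.
    """
    for number, title in enumerate(STEPS, start=1):
        if number not in completed:
            return number, title
    return None
```

For example, with steps 1, 2, and 4 recorded as complete, the function still returns step 3, which is how wrong-order execution stays visible.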
## Step 1: Prerequisites, Cluster Foundation, Credential Bundle
**Prerequisites:**
- Live Railiance/k3s cluster with ingress, cert-manager, NetworkPolicies, operators as per T02.
- Operator access: kubectl to sso/openbao ns; password safe entries (net-kingdom/LLDAP/admin, etc.); age keys/custodian public.
- No live OpenBao init in console (attended only).
**Credential bundle / king kit:**
- Use `make security-bootstrap-king-kit` or console `king-kit`.
- Dedicated king credential (platform-root@lldap, separate from personal).
- MFA: privacyIDEA self-service TOTP.
- Storage: password-safe + offline packet (age encrypted).
- Evidence: custodian_age_*_confirmed, king_credential_ready, mfa_enrolled_confirmed, password_safe_confirmed, storage_classes, custody_packet_prepared.
- Validate: `make security-bootstrap-validate-kit` or console `validate-king-kit`.
- Custody roster (for 2of3 target): `make security-bootstrap-custody-roster-template`, sign, validate.
**Blocked if:** No king kit or custody approval.
**Next safe action after:** Approve custody mode (console `approve-custody-mode`, or the make target with ARGS carrying flags such as `--mfa-enrolled-confirmed`).
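A king-kit evidence record can be checked against the field names listed above with a few lines of Python. This is a sketch, not the behaviour of `validate-king-kit`: it assumes every listed field must be present and truthy, and it treats `custodian_age_*_confirmed` as a wildcard over one or more keys. The key `custodian_age_public_confirmed` in the usage note is a hypothetical expansion of that wildcard.

```python
from fnmatch import fnmatchcase
from typing import List

# Field names as listed in this step; the first one is a wildcard pattern.
KIT_FIELD_PATTERNS = (
    "custodian_age_*_confirmed",
    "king_credential_ready",
    "mfa_enrolled_confirmed",
    "password_safe_confirmed",
    "storage_classes",
    "custody_packet_prepared",
)


def missing_kit_fields(evidence: dict) -> List[str]:
    """Return each pattern that has no matching key, or a matching falsy value."""
    missing: List[str] = []
    for pattern in KIT_FIELD_PATTERNS:
        matches = [key for key in evidence if fnmatchcase(key, pattern)]
        if not matches or not all(evidence[key] for key in matches):
            missing.append(pattern)
    return missing
```

A record with `custodian_age_public_confirmed` set to true and all other fields truthy yields an empty list; any returned pattern corresponds to the "Blocked if: No king kit" condition.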
## Step 2: PrivacyIDEA Bootstrap + Realm
- Deploy/repair realm, LLDAP resolver, self-service policies.
- Use repair script (sso-mfa/k8s/privacyidea/repair-realm-live.sh) or console runbook "privacyIDEA realm repair".
- Enroll platform-root TOTP in repaired realm (pi-admin for setup).
- Evidence: realm health, as covered by the t02 validate target.
- Validate: `make security-bootstrap-validate-t02` (covers the audit/recovery gates, including this one).
**Blocked if:** Realm not correct for LLDAP users (MFA/self-enroll fails).
See T03 retrospective for past realm drift bumps (now partially automated via runbook + validate).
## Step 3: LLDAP / Bootstrap User Creation (platform-root / king)
- Create platform-root user in LLDAP (via create-user.sh or LLDAP admin UI at lldap.coulomb.social).
- Command example (with KUBECTL fallback): `cd sso-mfa/k8s/lldap && ./create-user.sh platform-root ...` (no --admin for non-root tests; use --admin only for platform admins via king path).
- Self-enroll TOTP in privacyIDEA.
- Verify MFA state: `cd ../privacyidea && ./check-user-mfa-state.sh platform-root`.
- Verify OIDC/KeyCape path: `cd ../keycape && ./verify-openbao-client.sh`.
- Groups: net-kingdom-admins for platform-root.
- Evidence: identity_account_created, identity_group_confirmed, mfa_*, oidc_login_verified.
- For non-root tests: use the 0019 dry-run (see Step 7).
**Blocked if:** Actor requests net-kingdom-admins without king path; no MFA for privileged.
**Console:** `lifecycle-guide` for full flow; `onboarding-dry-run*` for tests.
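The two blocked conditions for this step can be expressed as a pre-create check. This is a minimal sketch: it assumes "privileged" means membership in net-kingdom-admins, and the function name and reason strings are hypothetical rather than taken from the console.

```python
from typing import Iterable, List

# Assumption for the example: only this group counts as privileged.
PRIVILEGED_GROUPS = {"net-kingdom-admins"}


def blocked_reasons(requested_groups: Iterable[str],
                    via_king_path: bool,
                    mfa_enrolled: bool) -> List[str]:
    """Return the reasons a create/save must be blocked (empty list = allowed)."""
    reasons: List[str] = []
    privileged = bool(PRIVILEGED_GROUPS & set(requested_groups))
    if privileged and not via_king_path:
        reasons.append("net-kingdom-admins requested without king path")
    if privileged and not mfa_enrolled:
        reasons.append("MFA required for privileged access")
    return reasons
```

Running the check before the save, and showing its output to the operator, matches the principle of previewing effective access before any action.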
## Step 4: KeyCape Deployment + Client Registration + OIDC
- KeyCape as lightweight IAM (conforms to IAM Profile v0.2).
- Deploy client config (sso-mfa/k8s/keycape/create-secrets.sh).
- Apply keycape-config Secret, restart KeyCape.
- Register bootstrap clients (netkingdom-bootstrap-console, openbao-admin).
- Redirects: localhost:8250/oidc/callback etc.
- Verify OIDC admin login: platform-root obtains OpenBao platform-admin via KeyCape/MFA.
- Evidence: KeyCape client gates, openbao_oidc_*, oidc_login_verified.
- Validate: related gates in the t02 validate target and the console.
**Issuer:** https://kc.coulomb.social (see T02 for claims: tenant, groups, roles, assurance, etc.).
See T03 for past callback/registration bumps (now gated).
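When verifying the OIDC path, it can help to compare a decoded token's claims against the issuer and claim names given above. The sketch below assumes the token has already been signature-verified and decoded into a dict by an OIDC library. The exact claim keys are an assumption based on the names listed for T02, and `claim_problems` is a hypothetical helper.

```python
from typing import List

EXPECTED_ISSUER = "https://kc.coulomb.social"
# Claim names as listed above; the exact keys in real tokens are an assumption.
REQUIRED_CLAIMS = ("tenant", "groups", "roles", "assurance")


def claim_problems(claims: dict) -> List[str]:
    """Return human-readable problems with already-verified token claims."""
    problems: List[str] = []
    if claims.get("iss") != EXPECTED_ISSUER:
        problems.append("unexpected issuer")
    for name in REQUIRED_CLAIMS:
        if name not in claims:
            problems.append(f"missing claim: {name}")
    return problems
```

The result is non-secret and could be summarized in evidence such as oidc_login_verified, without recording the token itself.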
## Step 5: OpenBao Init / Unseal / Config + OIDC Admin Binding
**Attended only (console refuses live init):**
- Preflight: `make security-bootstrap-openbao-preflight ARGS="--run"` or console. (Pass `--run` through ARGS; GNU make rejects an unknown `--run` option given directly.)
- Init ceremony (human-attended): produce init output, unseal shares, root token.
- Post-unseal: apply initial config (auth, mounts, policies, audit).
- OIDC auth config against KeyCape (maps claims/groups to policies e.g. net-kingdom-admins → platform-admin).
- Key material handling: trial exposure taint, rotate unseal keys, emergency lockdown, restore drill (snapshot, isolate, verify, destroy).
- Root token disposition: revoked (evidence: root_token_disposition).
- Evidence: openbao_initialized, initial_config_applied, trial_exposed + response_complete, keys_rotated, post_unseal_verified, openbao_oidc_*, restore_drill_passed, etc.
- Validate: multiple t02-related + console gates (taint logic in runbooks).
**Blocked if:** Not unsealed, no OIDC binding, no restore drill, trial material not responded to.
**Console runbooks:** Key material compromised, generate new unseal, emergency lock-down, restore drill, token revocation.
See T02 for secret/credential path details; T03 for past claim shape / token issues (now partially automated via gates/runbooks).
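The group-to-policy mapping and the blocked conditions for this step can be sketched together. This is an illustration under assumptions: `openbao_oidc_bound` stands in for the `openbao_oidc_*` evidence family, the taint rule is reduced to "exposed trial material needs a completed response", and both function names are hypothetical.

```python
from typing import Iterable, List

# Mapping named in this step: net-kingdom-admins -> platform-admin.
GROUP_TO_POLICY = {"net-kingdom-admins": "platform-admin"}


def policies_for_groups(groups: Iterable[str]) -> List[str]:
    """Return the sorted OpenBao policies implied by a user's groups."""
    return sorted({GROUP_TO_POLICY[g] for g in groups if g in GROUP_TO_POLICY})


def openbao_blocked_gates(evidence: dict) -> List[str]:
    """Return the blocked gates for this step, in the order listed above."""
    gates: List[str] = []
    if not evidence.get("post_unseal_verified"):
        gates.append("not unsealed")
    # Hypothetical key standing in for the openbao_oidc_* evidence flags.
    if not evidence.get("openbao_oidc_bound"):
        gates.append("no OIDC binding")
    if not evidence.get("restore_drill_passed"):
        gates.append("no restore drill")
    # Taint: exposed trial material stays blocking until the response is done.
    if evidence.get("trial_exposed") and not evidence.get("response_complete"):
        gates.append("trial material not responded to")
    return gates
```

An ordinary user in net-kingdom-users maps to no policy at all, which is the intended default.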
## Step 6: Token Cleanup, State Hub Sync, Audit Posture, Cleanup/Rotation
- Revoke short-lived/bootstrap tokens (use console revoke helpers or runbooks; no plaintext on CLI).
- State Hub: sync work (POST /progress/, decisions, etc.); ensure .custodian-brief reflects.
- Audit posture: record audit_core_bootstrap_risk_accepted (or production_sink_ready), owner, review_date (2026-07-02), note. Use console approve or metadata update. Validate: console audit_core_posture.
- Cleanup/rotation: review bootstrap-era creds/databases/paths; rotate as needed. Evidence: cleanup_complete.
- Custody packet / roster final.
**Evidence:** root_token_disposition, audit_core_*, cleanup_complete, etc.
**Console:** `cleanup-evidence-template`, `validate-cleanup`, metadata updates.
See T03 for operator-state / token bumps (now in metadata + validators).
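The audit posture record described above (a posture flag, an owner, and a review date) lends itself to a simple consistency check. The sketch below is not the console's audit_core_posture validator. It assumes the flag keys are spelled as in this section, treats the note as optional, and adds a hypothetical rule that the review date must not be in the past.

```python
from datetime import date
from typing import List


def audit_posture_problems(meta: dict, today: date) -> List[str]:
    """Return problems with an audit posture record (empty list = consistent)."""
    problems: List[str] = []
    # One of the two postures must be recorded (key spelling is an assumption).
    if not (meta.get("audit_core_bootstrap_risk_accepted")
            or meta.get("production_sink_ready")):
        problems.append("no audit posture recorded")
    if not str(meta.get("owner", "")).strip():
        problems.append("owner missing")
    try:
        review = date.fromisoformat(str(meta.get("review_date", "")))
    except ValueError:
        problems.append("review_date missing or not ISO (YYYY-MM-DD)")
    else:
        if review < today:
            problems.append("review_date is in the past")
    return problems
```

Passing `today` explicitly keeps the check deterministic, which is convenient when the same record is re-validated during a later drill.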
## Step 7: User Lifecycle (Onboard / Lock / Offboard / Review): Use 0019 Polish
This implements the first practical flow per docs/security-bootstrap-user-lifecycle.md (UX contract) and T02/T03.
**Always:**
- Preview effective access (actor_class, scope, groups, MFA, OpenBao policy, no root for non-king).
- Record non-secret evidence/audit (State Hub /progress/, console evidence, k8s audit).
- MFA required for privileged.
- Reversible where possible.
**Detailed (from console lifecycle-guide + 0019):**
- Onboard scoped non-root: use `make security-bootstrap-onboarding-dry-run SUBJECT=... EMAIL=... DISPLAY="..."` (or run `./dry-run-nonroot-user.sh` directly). Internally, the script covers:
  - safe secret handling (k8s/env, /tmp);
  - create with `--test` (no `--admin`);
  - MFA and KeyCape verification;
  - optional lock/offboard via GraphQL;
  - evidence population and validation (lldap_identity_verified, keycape_oidc_claims_verified, effective_access_summary, lock_offboard_result, actor_class="user", groups=["net-kingdom-users"], no_secret_material_recorded, prevents_platform_root_grant, etc.).
- Console: `onboarding-dry-run`, `onboarding-dry-run-claims` (infers claims from groups + T01 role; warns on admins/root), `lifecycle-cleanup-dryrun-users --pattern t06-*`.
- Lock: remove from groups (GraphQL or LLDAP UI); reversible.
- Offboard: delete user after resource transfer + reason/date; evidence.
- Review: check-user-mfa-state, LLDAP groups, owned principals; rotate via creds tools.
- Fabric/tenant admin: same but scoped groups, no platform-root, explicit preview "will NOT be in net-kingdom-admins".
- For platform-root/king: king path only.
**Evidence templates:** `make security-bootstrap-onboarding-dry-run-template`, `lifecycle-flow-template`.
**Validate:** `make security-bootstrap-validate-onboarding-dry-run`, `validate-lifecycle-flow`.
**Runbook:** "User lifecycle dry-run (T06)" in console runbooks (refs 0019 + script).
**Blocked if:** Missing actor/scope, privileged without MFA, ordinary user gets root groups/policy.
See 0019 workplan + dry-run script + T03 matrix for past taint/hygiene bumps (now largely automated via /tmp + evidence).
Update console lifecycle_guide T06 section if it still shows old manual secret steps (prefer orchestrator).
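The dry-run evidence named above can be checked for the non-root case with a short function. This is a sketch under assumptions: it covers only a subset of the flags, the rule set is inferred from the "Blocked if" line, and `dry_run_evidence_problems` is a hypothetical name rather than the validator behind `validate-onboarding-dry-run`.

```python
from typing import List

# Subset of the boolean evidence flags named in this step.
REQUIRED_TRUE = (
    "lldap_identity_verified",
    "keycape_oidc_claims_verified",
    "no_secret_material_recorded",
    "prevents_platform_root_grant",
)


def dry_run_evidence_problems(evidence: dict) -> List[str]:
    """Return problems with non-root onboarding dry-run evidence."""
    problems = [f"{key} is not true" for key in REQUIRED_TRUE
                if evidence.get(key) is not True]
    if evidence.get("actor_class") != "user":
        problems.append("actor_class must be 'user' for a non-root dry-run")
    if "net-kingdom-admins" in evidence.get("groups", []):
        problems.append("ordinary user must not be in net-kingdom-admins")
    if not evidence.get("effective_access_summary"):
        problems.append("effective_access_summary missing")
    return problems
```

The flags are compared with `is not True` so that a string such as "yes" does not pass as a boolean confirmation.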
## Step 8: Platform Reopen + Handoff
- Final gates: all prior evidence + platform_reopened flag.
- Approve custody if needed.
- Console: status shows S6; "Review related workplans".
- Handoff: produce the handover checklist (console `handover-checklist`); transfer to production readiness (audit durability, escrow, etc., per 0017; not duplicated here).
- Evidence: platform_reopened, review_date, notes.
**Validate:** related custody/kit + full status.
## Step 9: Post-Reopen / Optimization
- Use T03 retrospective + T02 arch + this guide for future drills.
- Address gaps (UE adapters per assessment, full Audit Core correlation, more T08 validators).
- Rehearse the rebuild per T09 (complete). See docs/security-bootstrap-rebuild-risk-and-rehearsal.md for the risk classification and the non-destructive plan (scripted, namespace, parallel) built on 0019, the creds tools, and the T08 validators. Start with the scripted/namespace rehearsal and use the 0019 dry-run as the model.
## Evidence Summary Table (Core Non-Secret)
| Area | Evidence |
| --- | --- |
| King/custody | age keys, roster, packet, approval |
| Identity/MFA | account/group created, mfa_enrolled, oidc_verified |
| OpenBao | initialized, config_applied, oidc_bound, root_disposition, restore_drill, keys_rotated, taint/response |
| Lifecycle (0019) | dry_run_date, actor_class, groups, effective_access_summary, lock_offboard_result, `*_verified`, no_secret_material_recorded, `prevents_*`, shows_effective_before_save |
| General | cleanup_complete, platform_reopened, `audit_core_*` (risk accepted or sink ready), metadata_updated_at |

All validated via console `validate-*` or make targets. Templates in console.
## References and Updates
- Full list in T02/T03 docs.
- Console `lifecycle-guide`, `status`, `web-ui`.
- Update this guide + console guide section as T06/T08 work proceeds (e.g., more validators, control surface alignment).
- For web-ui exposure of this guide: see T06.
This guide, together with the runtime architecture and the retrospective, turns the first bootstrap into a repeatable, pragmatically auditable, low-diagnosis path. Use it, record evidence, and improve it via T07+.
**Next after this guide:** Align control surface (T06), add tests (T07), integrate validations (T08), assess rebuild risk (T09).
See NET-WP-0018 workplan for full acceptance.