# Platform Identity and Security Architecture

Status: implemented architecture baseline for NetKingdom/Railiance/Coulomb

Date: 2026-05-18

## Purpose
This document captures the production-oriented identity, authorization,
MFA, credential, and bootstrap architecture for the platform we are
building. It deliberately treats Coulomb as the first internal tenant and
reference workload, not as the platform itself.

The architecture must be recursive: the same platform that protects
future tenants also protects the services and repositories used to build
and operate the platform. That recursion is useful, but it is also where
many security designs accidentally collapse into self-administering root
power. This document exists to prevent that.
## Core Model
```text
Bootstrap plane
  establishes initial trust before normal platform services exist
Platform control plane
  operates identity, MFA, secrets, policy, audit, and authorization
Tenant planes
  run Coulomb and future customer/project/domain workloads
```
Coulomb is the first internal tenant. It is also the reference tenant that
helps validate the platform. It must not become the platform root of
trust merely because it is first.
## Planes
### Bootstrap Plane
The bootstrap plane exists before the full platform is alive. It owns the
minimal authority needed to create and recover the control plane.
Responsibilities:
- host provisioning and hardening
- root age/SOPS material and emergency bundles
- initial cluster access
- initial identity service deployment
- initial secret injection
- break-glass recovery
- transition to managed runtime authority
Owned primarily by `railiance-infra`, `railiance-cluster`, and the
credential bootstrap work in `net-kingdom`.
### Platform Control Plane
The platform control plane owns shared security services.
Responsibilities:
- NetKingdom IAM Profile
- lightweight identity mode through key-cape
- expanded identity mode through Keycloak
- MFA/token lifecycle through privacyIDEA where applicable
- canonical authorization through flex-auth
- delegated authorization runtime through Topaz first, with other PDPs as
adapters
- runtime secret authority through OpenBao
- audit and explanation records
- platform service secrets, dynamic credentials, leases, and rotation
Owned conceptually by `net-kingdom`; deployed through the Railiance stack.
### Tenant Plane
Tenant planes are where workloads live. Coulomb is tenant zero and the
reference tenant; later tenants may be projects, customers, domains,
sandboxes, or isolated deployments.
Responsibilities:
- protected services and repositories
- tenant-owned resources
- tenant-specific groups, policies, and service accounts
- local enforcement of authorization decisions
- workload audit events and diagnostics
Tenant administrators may manage their tenant resources. They must not be
able to alter platform root trust, global identity configuration,
platform break-glass material, or the policy pipeline that governs the
platform itself.
## Component Responsibilities
| Component | Primary role | Must not become |
| --- | --- | --- |
| `net-kingdom` | canonical security architecture, IAM Profile, SSO/MFA, credential bootstrap decisions | a deployment repo for every stack layer |
| `key-cape` | lightweight IAM implementation of the NetKingdom IAM Profile | a general-purpose IAM platform or authorization engine |
| Keycloak | expanded-mode IAM and optional Keycloak Authorization Services adapter | the canonical model for all platform authorization |
| privacyIDEA | MFA/token authority, especially in lightweight/key-cape mode | a policy decision point for application resources |
| OpenBao | runtime platform secrets service, dynamic credential broker, lease/revocation point, and audit source for secret access | the bootstrap root of trust or an application-specific configuration store |
| `flex-auth` | authorization control plane, CARING descriptors, policy packages, decision envelopes, audit/explain | an identity provider or backend-specific wrapper |
| Topaz | first delegated authorization runtime/PDP for flex-auth | the platform control plane or identity provider |
| Railiance repos | converged infrastructure, cluster, platform services, enablement, and app deployment | the source of security policy semantics |
## Identity Path
```text
Human/service/agent principal
        |
        v
NetKingdom IAM Profile
        |
        +-- lightweight mode: key-cape
        |     Authelia + LLDAP + privacyIDEA
        |
        +-- expanded mode: Keycloak
              Keycloak + LDAP/Entra federation + MFA integration
```
Applications depend on the IAM Profile, not on the concrete provider.
key-cape is the lightweight profile implementation. Keycloak is the
expanded-mode profile implementation. privacyIDEA provides MFA/token
capabilities where the deployment mode uses it.
The canonical profile is NetKingdom IAM Profile v0.2
(`canon/standards/iam-profile_v0.2.md`). It requires explicit `tenant`,
`principal_type`, `groups`, `roles`, `scope`/`scp`, and `assurance`
claims so flex-auth receives normalized identity input regardless of
whether key-cape or Keycloak issued the token.
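
The required-claim rule can be sketched as a small check over a decoded token. This is a minimal illustration, assuming the token has already been verified and decoded into a plain mapping; the function name `missing_profile_claims` is hypothetical, and the v0.2 standard may impose type or value constraints that are not modelled here.

```python
from typing import Any, Mapping

# Claims the IAM Profile v0.2 description lists as explicit requirements.
REQUIRED_CLAIMS = ("tenant", "principal_type", "groups", "roles", "assurance")
# The scope claim may appear under either name.
SCOPE_CLAIM_ALIASES = ("scope", "scp")


def missing_profile_claims(claims: Mapping[str, Any]) -> list[str]:
    """Return the names of required claims absent from a decoded token."""
    missing = [name for name in REQUIRED_CLAIMS if name not in claims]
    if not any(alias in claims for alias in SCOPE_CLAIM_ALIASES):
        missing.append("scope/scp")
    return missing
```

A caller would treat a non-empty result as a reason to reject the token before it reaches flex-auth, regardless of which provider issued it.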

The choice between lightweight and expanded mode is **capability-driven,
not scale-driven**. key-cape comfortably serves large internal user
populations; expanded-mode Keycloak is introduced when a *capability* is
required that the lightweight stack does not provide, chiefly inbound
enterprise federation and SAML brokering (Entra ID, Active Directory,
generic SAML IdPs), complex multi-realm topologies, or delegated admin.
A deployment climbs to expanded mode because it needs that capability,
not because it has more users. The lower resource and operational
footprint of the lightweight stack is a consequence of this rule, not the
trigger for it. See **Capability Progression** below.

Identity answers: who is this actor, how was the actor authenticated,
what coarse claims are asserted, and what assurance evidence exists?
Identity does not answer final resource-specific authorization.
## Authorization Path
```text
Identity claims from IAM Profile
        |
        v
flex-auth
  resource registry
  policy packages
  CARING descriptors
  decision/audit/explain envelope
        |
        +-- standalone evaluator
        +-- Topaz delegated PDP
        +-- optional Keycloak AuthZ adapter
        +-- future OpenFGA/SpiceDB/OPA/Cedar adapters
        |
        v
Protected service enforcement
```
Authorization answers: may this actor perform this action on this
resource in this context, and what explanation/audit/CARING metadata
supports that answer?
Protected services enforce decisions locally. flex-auth is the canonical
policy and decision boundary; delegated PDPs are runtime implementations
behind it.
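
One way to picture delegated PDPs sitting behind a single decision boundary is an adapter callable wrapped by a fail-closed dispatcher. The sketch below is an assumption-laden illustration, not the flex-auth or Topaz API: `PdpAdapter`, `Decision`, and `decide` are hypothetical names, and a real adapter would take far richer input than three strings.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical adapter shape: (subject, action, resource) -> allow?
PdpAdapter = Callable[[str, str, str], bool]


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str
    evaluator: str  # which runtime produced the answer


def decide(evaluator: str, adapter: PdpAdapter,
           subject: str, action: str, resource: str) -> Decision:
    """Ask one delegated PDP and wrap its answer in a uniform decision."""
    try:
        allowed = bool(adapter(subject, action, resource))
    except Exception as exc:
        # Fail closed: an unreachable or broken PDP never yields an allow.
        return Decision(False, f"evaluator error: {exc}", evaluator)
    reason = "permitted by policy" if allowed else "denied by policy"
    return Decision(allowed, reason, evaluator)
```

Because the protected service only ever sees a `Decision`, swapping the standalone evaluator for Topaz or another adapter does not change the enforcement code.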
## Secret And Credential Path
```text
Bootstrap SOPS/age material
        |
        v
OpenBao platform secrets service
  KV v2 platform configuration
  dynamic database credentials
  Kubernetes auth / workload identity
  future object-storage credential brokering
  audit devices and lease/revocation records
        |
        +-- direct OpenBao clients
        +-- External Secrets Operator / synced Kubernetes Secrets
        +-- CSI-mounted secrets where appropriate
        |
        v
Platform and tenant workloads
```
SOPS/age remains the bootstrap and Git-at-rest protection mechanism. It
can create the initial cluster secrets and emergency recovery bundles, but
it should not become the long-lived runtime authority for every workload
secret.

OpenBao is the runtime platform secrets service once the control plane is
alive. It owns secret leases, revocation, audit, dynamic credentials, and
workload-facing secret delivery patterns. Workloads should receive scoped
secrets or short-lived credentials, not platform-root material. Tenant
administrators may manage tenant-scoped secrets through approved policy
paths; they must not gain access to OpenBao root tokens, unseal keys,
platform mounts, or global secret engine configuration.

OpenBao does not replace identity or authorization. NetKingdom IAM
identifies actors and workloads; flex-auth decides whether a credential
or secret request is allowed; OpenBao stores, issues, audits, and revokes
the resulting secret material.
## Recursive Trust Rule
Normal tenant administration must never be sufficient to alter the
platform root of trust.

This applies even when the tenant is Coulomb. Coulomb can be a tenant and
a reference workload, but platform-root actions require platform control
plane authority and appropriate bootstrap/break-glass safeguards.

Examples of platform-root actions:
- changing IAM Profile semantics
- rotating root bootstrap keys
- changing break-glass access
- changing global MFA requirements
- activating authorization policy that governs platform administration
- changing flex-auth/Topaz policy import pipelines
- changing OpenBao root tokens, unseal policy, platform mounts, or global
auth methods
- changing audit retention or tamper-evidence settings
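
The rule above can be expressed as a guard that runs before any finer-grained policy. The following is a minimal sketch under stated assumptions: the action identifiers and the two-value `Authority` enum are hypothetical labels invented for illustration, and passing the guard is only a necessary condition, since platform-root actions also need the bootstrap/break-glass safeguards described in this section.

```python
from enum import Enum


class Authority(Enum):
    TENANT_ADMIN = "tenant_admin"
    PLATFORM_OPERATOR = "platform_operator"


# Hypothetical identifiers, one per example of a platform-root action.
PLATFORM_ROOT_ACTIONS = frozenset({
    "iam_profile.change_semantics",
    "bootstrap.rotate_root_keys",
    "break_glass.change_access",
    "mfa.change_global_requirements",
    "authz.activate_platform_admin_policy",
    "authz.change_policy_import_pipeline",
    "openbao.change_root_configuration",
    "audit.change_retention",
})


def passes_root_guard(action: str, authority: Authority) -> bool:
    """Tenant administration alone never passes for a platform-root action."""
    if action in PLATFORM_ROOT_ACTIONS:
        return authority is Authority.PLATFORM_OPERATOR
    # Not a platform-root action: the guard does not object, and ordinary
    # tenant policy decides the request.
    return True
```

The guard deliberately takes no tenant argument: being an administrator of `tenant:coulomb` is treated exactly like being an administrator of any other tenant.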
## Tenant Model
Every protected resource should belong to a tenant or to the platform
control plane.
Suggested identifiers:
```text
tenant:platform # platform control plane resources
tenant:coulomb # first internal/reference tenant
tenant:sandbox:<name> # sandbox tenants
tenant:customer:<name> # future customer tenants
```
Tenant membership and platform membership are distinct. A subject may be
an administrator in `tenant:coulomb` without being a platform operator.
CARING descriptors should explicitly identify scope and tenant when the
access is tenant-scoped. Platform-scoped descriptors should be rare,
audited, and usually condition-bound.
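
A parser for the suggested identifiers makes the platform/tenant distinction explicit in code. This is a sketch that assumes the identifier grammar is exactly what the examples show (a `tenant:` prefix followed by one or two non-empty segments); `TenantId` and `parse_tenant` are hypothetical names.

```python
from dataclasses import dataclass

# Category prefixes that require a trailing <name> segment.
NAMED_CATEGORIES = ("sandbox", "customer")


@dataclass(frozen=True)
class TenantId:
    segments: tuple[str, ...]  # ("platform",), ("coulomb",), ("sandbox", "demo")

    @property
    def is_platform(self) -> bool:
        return self.segments == ("platform",)


def parse_tenant(identifier: str) -> TenantId:
    """Parse identifiers such as tenant:coulomb or tenant:sandbox:<name>."""
    prefix, *segments = identifier.split(":")
    if prefix != "tenant" or not segments or not all(segments):
        raise ValueError(f"not a tenant identifier: {identifier!r}")
    if len(segments) == 1 and segments[0] not in NAMED_CATEGORIES:
        return TenantId(tuple(segments))
    if len(segments) == 2 and segments[0] in NAMED_CATEGORIES:
        return TenantId(tuple(segments))
    raise ValueError(f"unsupported tenant identifier: {identifier!r}")
```

With this shape, `parse_tenant("tenant:coulomb").is_platform` is false, which mirrors the statement that a Coulomb administrator is not thereby a platform operator.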
## Bootstrap To Runtime Transition
Production setup should move through explicit trust states:
1. **Bare host trust** - provisioned and verified by Railiance infra.
2. **Cluster trust** - Kubernetes runtime exists and is verified.
3. **Bootstrap secret trust** - age/SOPS and emergency bundles are
established.
4. **Bootstrap identity trust** - local/bootstrap identity can operate
enough to install full identity services.
5. **Runtime secret trust** - OpenBao is deployed, initialized, unsealed,
audited, backed up, and ready to issue scoped secrets.
6. **Runtime identity trust** - key-cape or Keycloak becomes the normal
IAM Profile issuer.
7. **Runtime authorization trust** - flex-auth and Topaz are initialized
with platform and tenant policies.
8. **Tenant onboarding trust** - Coulomb and later tenants register
resources and receive scoped authority.
Each transition needs a verification check and a rollback/recovery path.
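
The ordered trust states lend themselves to a simple gate: a state counts as reached only if its own verification check and every earlier check pass. The sketch below assumes each check is a zero-argument callable returning a boolean; `TrustState` and `highest_verified_state` are hypothetical names, and rollback/recovery handling is out of scope here.

```python
from enum import IntEnum
from typing import Callable, Mapping, Optional


class TrustState(IntEnum):
    BARE_HOST = 1
    CLUSTER = 2
    BOOTSTRAP_SECRET = 3
    BOOTSTRAP_IDENTITY = 4
    RUNTIME_SECRET = 5
    RUNTIME_IDENTITY = 6
    RUNTIME_AUTHORIZATION = 7
    TENANT_ONBOARDING = 8


Check = Callable[[], bool]


def highest_verified_state(checks: Mapping[TrustState, Check]) -> Optional[TrustState]:
    """Walk the states in order and stop at the first missing or failing check."""
    reached: Optional[TrustState] = None
    for state in TrustState:  # iterates in definition order
        check = checks.get(state)
        if check is None or not check():
            break
        reached = state
    return reached
```

A missing check is treated the same as a failing one, so a later state can never be claimed by skipping verification of an earlier one.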
## Production Topology
For an initial production-capable Coulomb deployment:
```text
railiance-infra
  host baseline, SSH, age keys, emergency material
railiance-cluster
  Kubernetes, ingress, cert-manager, network policy
railiance-platform
  OpenBao, PostgreSQL, object storage, platform service secret delivery
  key-cape or Keycloak
  privacyIDEA where used
  flex-auth
  Topaz
railiance-apps
  Coulomb services as tenant:coulomb workloads
```
`net-kingdom` owns the architecture and standards. Railiance owns the
converged deployment layers. Component repos own their implementation
contracts.
## Orchestration Implication
A future orchestration repo may be justified, but only after the state
machine is clear. It should not own resources directly. It should own
safe sequencing across repos.
Possible responsibilities:
- verify Railiance preconditions
- initialize credential bootstrap
- deploy or validate identity services
- deploy or validate flex-auth and Topaz
- run IAM Profile conformance checks
- run authorization conformance checks
- produce a platform security readiness report

This orchestration layer should build on Railiance capabilities rather
than bypassing the Railiance stack boundaries.

ADR-0007 records the current decision: keep orchestration in Railiance
playbooks for now, with NetKingdom defining the trust-state model,
readiness checks, OpenBao boundaries, and security semantics.

The playbook interface for that split is the NetKingdom Playbook
Capability Contract (`canon/standards/playbook-capability-contract_v0.1.md`).
Railiance playbooks publish declarations beside the playbooks; NetKingdom
validates and consumes those declarations to select capabilities,
parametrize allowed inputs, and assemble responsibility/trust-state
views without taking over execution.
## flex-auth And Topaz Implications
flex-auth work must preserve the recursive boundary between platform
control-plane resources and tenant resources.
Required implications:
- CARING descriptors must include scope and tenant metadata for
tenant-scoped access, and must mark rare platform-scoped access
explicitly.
- Policy packages must distinguish `tenant:platform` policy from
tenant-local packages such as `tenant:coulomb`.
- Decision envelopes must carry subject, issuer, audience, tenant,
principal type, groups, roles, scopes, protected-system id, resource,
action, requested TTL where relevant, assurance evidence, obligations,
deny reasons, and audit correlation ids. Subject, issuer, audience,
tenant, principal type, groups, roles, scopes, and assurance come from
the IAM Profile v0.2 token contract rather than provider-specific
session state.
- Topaz is a delegated PDP runtime behind flex-auth. It must not become
the canonical policy model, identity provider, or platform control
plane.
- Audit and explain records must be durable enough to reconstruct why a
platform-root, secret, credential, or tenant-administration decision was
allowed or denied.
- Platform-root guardrails must deny tenant administrators the ability to
alter IAM Profile semantics, OpenBao platform mounts/auth methods,
flex-auth policy import pipelines, Topaz runtime configuration, or
platform audit retention.
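
The decision envelope fields listed above can be sketched as an immutable record. This is an illustrative shape only, not the flex-auth schema: the class name, field names, and types are assumptions, the `allowed` flag is added here to carry the outcome, and a single generated `correlation_id` stands in for the audit correlation ids.

```python
import uuid
from dataclasses import dataclass, field
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class DecisionEnvelope:
    # Identity fields, taken from the IAM Profile token, not session state.
    subject: str
    issuer: str
    audience: str
    tenant: str
    principal_type: str
    groups: tuple[str, ...]
    roles: tuple[str, ...]
    scopes: tuple[str, ...]
    assurance: Mapping[str, Any]
    # Request fields.
    protected_system_id: str
    resource: str
    action: str
    requested_ttl_seconds: Optional[int] = None  # only where relevant
    # Outcome fields; the default is a denial.
    allowed: bool = False
    obligations: tuple[str, ...] = ()
    deny_reasons: tuple[str, ...] = ()
    correlation_id: str = field(default_factory=lambda: str(uuid.uuid4()))
```

Defaulting `allowed` to false means an envelope that is constructed but never filled in by an evaluator reads as a denial in the audit trail.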
OpenBao secret access and dynamic credential requests follow the same
authorization rule: identity proves the actor or workload, flex-auth
decides whether the request is permitted, and OpenBao stores, issues,
leases, audits, and revokes the secret material.
## Coulomb Tenant Onboarding Path
The first Coulomb tenant onboarding path should be repeatable before it
becomes automated:
1. Register `tenant:coulomb` as a tenant distinct from
`tenant:platform`.
2. Map Coulomb human, service, and agent principals to IAM Profile claims
with issuer, audience, subject, group, tenant, and assurance evidence.
3. Register Coulomb protected systems and resources in flex-auth with
stable protected-system ids.
4. Import tenant-scoped policy packages and CARING descriptors for
Coulomb resources.
5. Initialize the delegated PDP runtime, starting with Topaz, using only
the policy packages approved for the tenant and platform boundary.
6. Provision Coulomb workload secret paths, Kubernetes auth roles, or
delivery mechanisms in OpenBao without granting access to platform
mounts, unseal/recovery material, or global auth configuration.
7. Run audit readiness checks before admitting production traffic:
identity issuance, flex-auth decision envelope, Topaz health,
OpenBao audit event, workload enforcement event, and correlation id.
The onboarding path is complete when a Coulomb workload can authenticate,
receive a scoped authorization decision, obtain only the allowed secret or
short-lived credential, enforce the decision locally, and produce an
auditable record without receiving platform-root authority.
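
The audit readiness check in step 7 amounts to asking whether one correlation id can be followed across every component. The sketch below assumes audit events are available as mappings with `source` and `correlation_id` keys; those keys and the source labels are hypothetical. Topaz health is a probe rather than an event, so it is not modelled here.

```python
from typing import Iterable, Mapping

# Hypothetical source labels for the event-producing evidence in step 7.
REQUIRED_SOURCES = ("identity", "flex-auth", "openbao", "workload")


def missing_evidence(events: Iterable[Mapping[str, str]],
                     correlation_id: str) -> list[str]:
    """List required sources with no event carrying the correlation id."""
    seen = {
        event.get("source")
        for event in events
        if event.get("correlation_id") == correlation_id
    }
    return [source for source in REQUIRED_SOURCES if source not in seen]
```

An empty result for a test request is one signal that the tenant can be admitted to production traffic; any listed source points at the hop where the audit trail breaks.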
## Capability Progression (Start Small → Enterprise)
NetKingdom is designed so an IT landscape can be brought up from nothing
and hardened **one capability at a time**, with no structural rework when
the next capability is added. Every tier is usable on its own and every
tier issues or consumes the same NetKingdom IAM Profile, so adding a
capability extends the system rather than replacing it.
The progression is capability-keyed: you climb a tier when you need the
capability it adds, never because of user count.
| Tier | Capability added | Components added | You move here when… |
| --- | --- | --- | --- |
| **C0 — Bootstrap identity** | A local OIDC issuer + secret bootstrap so things can start safely before the platform exists | local-identity (NK-WP-0002), SOPS/age + agent bootstrap (NK-WP-0004/0005) | you have nothing yet and need dev/test/sandbox identity |
| **C1 — Lightweight SSO** | Single-factor OIDC SSO over an internal directory | key-cape: Authelia + LLDAP | you want real SSO for internal users/services |
| **C2a — 2FA (light)** | Second factor without a new heavy service | Authelia built-in TOTP / WebAuthn | you need 2FA but not enterprise token lifecycle |
| **C2b — Token authority** | Hardware tokens, many token types, self-service enrollment, token lifecycle | privacyIDEA | you need an enterprise-grade MFA/token authority |
| **C3 — Runtime secrets** | Dynamic, scoped, leased, audited secrets beyond bootstrap | OpenBao (NK-WP-0006) | workloads need runtime secrets, not just bootstrap material |
| **C4 — Fine-grained authZ** | Policy-as-code decisions beyond coarse SSO claims | flex-auth + Topaz PDP (ADR-0006) | identity alone can no longer answer "may this actor do this?" |
| **C5 — Enterprise federation** | Inbound Entra ID / AD / SAML brokering, multi-tenant realms | expanded-mode Keycloak (NK-WP-0011) | identities originate in an external enterprise IdP |
| **C6 — Self-optimizing** | Audit feedback loops, drift surfacing, continuous adaptation | central audit sink + kaizen loops | the platform should improve and verify itself continuously |
Two properties make this safe rather than just sequential:
- **Usable at every tier.** C1 is a working SSO platform; you are never
forced to reach C5 to get value.
- **No structural breaks.** Because every tier targets the IAM Profile
contract, 2FA (C2), runtime secrets (C3), fine-grained authorization
(C4), and federation (C5) are *additive*. Applications keep targeting
the same Profile; the implementation behind it grows.
2FA illustrates the principle precisely: if you do not need a second
factor, C2 is simply absent — the C1 stack runs without privacyIDEA. When
you do, C2a (Authelia's built-in TOTP/WebAuthn) is the light option and
C2b (privacyIDEA) is the enterprise token-authority option. Neither
requires re-architecting C1.
The intent is turn-key: NetKingdom selects, places, and orchestrates the
components for the chosen tier set so the landscape reaches ready-to-run
state — like building a house to handover condition — and can be extended
to the next tier later without demolition.
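
Because the progression is capability-keyed, component selection can be written as a pure function of the chosen tiers, with no user-count input at all. The sketch below copies the tier-to-component mapping from the table above; the function name `components_for` is hypothetical, and how NetKingdom actually selects and places components is not specified here.

```python
from typing import Iterable

# Tier -> components added, in the order the table lists them.
TIER_COMPONENTS = {
    "C0": ("local-identity", "SOPS/age + agent bootstrap"),
    "C1": ("key-cape (Authelia + LLDAP)",),
    "C2a": ("Authelia built-in TOTP / WebAuthn",),
    "C2b": ("privacyIDEA",),
    "C3": ("OpenBao",),
    "C4": ("flex-auth", "Topaz PDP"),
    "C5": ("expanded-mode Keycloak",),
    "C6": ("central audit sink", "kaizen loops"),
}


def components_for(tiers: Iterable[str]) -> list[str]:
    """Return the components for a tier set, in table order."""
    wanted = set(tiers)
    unknown = wanted - set(TIER_COMPONENTS)
    if unknown:
        raise ValueError(f"unknown tiers: {sorted(unknown)}")
    return [
        component
        for tier, components in TIER_COMPONENTS.items()
        if tier in wanted
        for component in components
    ]
```

Adding a tier to the input only appends components to the result, which is the additive property the section describes.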
## Production Readiness Checks
Before the security platform is production-ready, each trust state needs
an explicit check:
| Area | Readiness check |
| --- | --- |
| MFA and identity | key-cape or Keycloak issues IAM Profile v0.2-compatible tokens and passes `tools/iam-profile-conformance/`; privacyIDEA or the selected MFA provider enforces required assurance for privileged actions |
| Bootstrap and recovery | age/SOPS material, emergency bundle, and break-glass credentials are present, tested, and separated from tenant administration |
| OpenBao runtime secrets | OpenBao is initialized, unsealed or auto-unsealed by the approved mechanism, backed up, audited, and using scoped auth methods and mounts |
| Secret rotation | service, database, OpenBao-issued, and break-glass rotation paths have documented blast radius and verification steps |
| flex-auth policy state | platform and tenant policy packages are versioned, reviewable, imported, and explainable |
| Topaz runtime | delegated PDP health, data freshness, policy load status, and fail-closed behavior are verified |
| Tenant onboarding | `tenant:coulomb` resources, claims, policies, OpenBao paths, and audit correlation are registered and tested |
| Audit sink | identity, flex-auth, Topaz, OpenBao, Kubernetes, and workload audit records land in durable storage with restore/drill coverage |
| Break-glass | emergency access works when normal identity is unavailable and produces a post-event review record |
## Open Questions
- Where is the durable audit log stored for platform-root decisions?
- Where are OpenBao audit logs durably shipped, and how are they included
in tamper-evidence and restore drills?
- Which actions require dual control or human confirmation?
- How is break-glass use recorded when normal identity is unavailable?
- Which workloads consume OpenBao directly, via External Secrets Operator,
or via CSI-mounted secrets?
- Which tenant metadata is required before a service can register
resources with flex-auth?
- What precise per-tenant trigger and dual-issuer coexistence rule should
NK-WP-0011-T1 use for Keycloak expanded mode?
- Does Topaz run centrally for the platform, per tenant, or per service
for the first production deployment?