# Platform Identity and Security Architecture

Status: implemented architecture baseline for NetKingdom/Railiance/Coulomb
Date: 2026-05-24

## Purpose

This document captures the production-oriented identity, authorization,
MFA, credential, and bootstrap architecture for the platform we are
building. It deliberately treats Coulomb as the first internal tenant and
reference workload, not as the platform itself.

The architecture must be recursive: the same platform that protects
future tenants also protects the services and repositories used to build
and operate the platform. That recursion is useful, but it is also where
many security designs accidentally collapse into self-administering root
power. This document exists to prevent that.

## Core Model

```text
Bootstrap plane
  establishes initial trust before normal platform services exist

Platform control plane
  operates identity, MFA, secrets, policy, audit, and authorization

Tenant planes
  run Coulomb and future customer/project/domain workloads
```

Coulomb is the first internal tenant. It is also the reference tenant that
helps validate the platform. It must not become the platform root of
trust merely because it is first.

## Planes

### Bootstrap Plane

The bootstrap plane exists before the full platform is alive. It owns the
minimal authority needed to create and recover the control plane.

Responsibilities:

- host provisioning and hardening
- root age/SOPS material and emergency bundles
- initial cluster access
- initial identity service deployment
- initial secret injection
- break-glass recovery
- transition to managed runtime authority

Owned primarily by `railiance-infra`, `railiance-cluster`, and the
credential bootstrap work in `net-kingdom`.

### Platform Control Plane

The platform control plane owns shared security services.

Responsibilities:

- NetKingdom IAM Profile
- lightweight identity mode through key-cape
- expanded identity mode through Keycloak
- MFA/token lifecycle through privacyIDEA where applicable
- canonical authorization through flex-auth
- delegated authorization runtime through Topaz first, with other PDPs as
  adapters
- runtime secret authority through OpenBao
- audit and explanation records
- platform service secrets, dynamic credentials, leases, and rotation

Owned conceptually by `net-kingdom`; deployed through the Railiance stack.

### Tenant Plane

Tenant planes are where workloads live. Coulomb is tenant zero, the
reference tenant; later tenants may be projects, customers, domains,
sandboxes, or isolated deployments.

Responsibilities:

- protected services and repositories
- tenant-owned resources
- tenant-specific groups, policies, and service accounts
- local enforcement of authorization decisions
- workload audit events and diagnostics

Tenant administrators may manage their tenant resources. They must not be
able to alter platform root trust, global identity configuration,
platform break-glass material, or the policy pipeline that governs the
platform itself.

## Component Responsibilities

| Component | Primary role | Must not become |
| --- | --- | --- |
| `net-kingdom` | canonical security architecture, IAM Profile, SSO/MFA, credential bootstrap decisions | a deployment repo for every stack layer |
| `key-cape` | lightweight IAM implementation of the NetKingdom IAM Profile | a general-purpose IAM platform or authorization engine |
| Keycloak | expanded-mode IAM and optional Keycloak Authorization Services adapter | the canonical model for all platform authorization |
| privacyIDEA | MFA/token authority, especially in lightweight/key-cape mode | a policy decision point for application resources |
| OpenBao | runtime platform secrets service, dynamic credential broker, lease/revocation point, and audit source for secret access | the bootstrap root of trust or an application-specific configuration store |
| `flex-auth` | authorization control plane, CARING descriptors, policy packages, decision envelopes, audit/explain | an identity provider or backend-specific wrapper |
| Topaz | first delegated authorization runtime/PDP for flex-auth | the platform control plane or identity provider |
| Railiance repos | converged infrastructure, cluster, platform services, enablement, and app deployment | the source of security policy semantics |

## Identity Path

```text
Human/service/agent principal
  |
  v
NetKingdom IAM Profile
  |
  +-- lightweight mode: key-cape
  |     Authelia + LLDAP + privacyIDEA
  |
  +-- expanded mode: Keycloak
        Keycloak + LDAP/Entra federation + MFA integration
```

Applications depend on the IAM Profile, not on the concrete provider.
key-cape is the lightweight profile implementation. Keycloak is the
expanded-mode profile implementation. privacyIDEA provides MFA/token
capabilities where the deployment mode uses it.

The canonical profile is NetKingdom IAM Profile v0.2
(`canon/standards/iam-profile_v0.2.md`). It requires explicit `tenant`,
`principal_type`, `groups`, `roles`, `scope`/`scp`, and `assurance`
claims so flex-auth receives normalized identity input regardless of
whether key-cape or Keycloak issued the token.

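The required-claim rule can be illustrated with a presence check on an already decoded token payload. This is a minimal sketch, not the conformance tool: the claim names come from the paragraph above, while the function name, the "either `scope` or `scp`" reading, and the example values are assumptions. Signature verification and value validation are out of scope here.

```python
# Claim names are taken from the IAM Profile v0.2 paragraph above.
# Everything else in this sketch (names, rules) is hypothetical.
REQUIRED_CLAIMS = ("tenant", "principal_type", "groups", "roles", "assurance")
SCOPE_CLAIMS = ("scope", "scp")  # assumed: either spelling satisfies the profile


def missing_profile_claims(claims: dict) -> list[str]:
    """Return the required claim names absent from a decoded token payload."""
    missing = [name for name in REQUIRED_CLAIMS if name not in claims]
    if not any(name in claims for name in SCOPE_CLAIMS):
        missing.append("scope/scp")
    return missing


# Example payload; the value formats are illustrative only.
example = {
    "tenant": "tenant:coulomb",
    "principal_type": "service",
    "groups": [],
    "roles": ["reader"],
    "scp": ["repo:read"],
    "assurance": {"mfa": False},
}
print(missing_profile_claims(example))
```

An empty result means the payload carries every claim that flex-auth expects as normalized input, whichever provider issued the token.
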
The choice between lightweight and expanded mode is **capability-driven,
not scale-driven**. key-cape comfortably serves large internal user
populations; expanded-mode Keycloak is introduced when a *capability* is
required that the lightweight stack does not provide, chiefly inbound
enterprise federation and SAML brokering (Entra ID, Active Directory,
generic SAML IdPs), complex multi-realm topologies, or delegated admin.
A deployment climbs to expanded mode because it needs that capability,
not because it has more users. The lower resource and operational
footprint of the lightweight stack is a consequence of this rule, not the
trigger for it. See **Capability Progression** below.

Identity answers: who is this actor, how was the actor authenticated,
what coarse claims are asserted, and what assurance evidence exists?

Identity does not answer final resource-specific authorization.

## Authorization Path

```text
Identity claims from IAM Profile
  |
  v
flex-auth
  resource registry
  policy packages
  CARING descriptors
  decision/audit/explain envelope
  |
  +-- standalone evaluator
  +-- Topaz delegated PDP
  +-- optional Keycloak AuthZ adapter
  +-- future OpenFGA/SpiceDB/OPA/Cedar adapters
  |
  v
Protected service enforcement
```

Authorization answers: may this actor perform this action on this
resource in this context, and what explanation/audit/CARING metadata
supports that answer?

Protected services enforce decisions locally. flex-auth is the canonical
policy and decision boundary; delegated PDPs are runtime implementations
behind it.

## Secret And Credential Path

```text
Bootstrap SOPS/age material
  |
  v
OpenBao platform secrets service
  KV v2 platform configuration
  dynamic database credentials
  Kubernetes auth / workload identity
  future object-storage credential brokering
  audit devices and lease/revocation records
  |
  +-- direct OpenBao clients
  +-- External Secrets Operator / synced Kubernetes Secrets
  +-- CSI-mounted secrets where appropriate
  |
  v
Platform and tenant workloads
```

SOPS/age remains the bootstrap and Git-at-rest protection mechanism. It
can create the initial cluster secrets and emergency recovery bundles, but
it should not become the long-lived runtime authority for every workload
secret.

OpenBao is the runtime platform secrets service once the control plane is
alive. It owns secret leases, revocation, audit, dynamic credentials, and
workload-facing secret delivery patterns. Workloads should receive scoped
secrets or short-lived credentials, not platform-root material. Tenant
administrators may manage tenant-scoped secrets through approved policy
paths; they must not gain access to OpenBao root tokens, unseal keys,
platform mounts, or global secret engine configuration.

OpenBao does not replace identity or authorization. NetKingdom IAM
identifies actors and workloads; flex-auth decides whether a credential
or secret request is allowed; OpenBao stores, issues, audits, and revokes
the resulting secret material.

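The division of labour in the last paragraph (identity identifies, flex-auth decides, OpenBao issues) can be sketched as an ordering constraint. The code below deliberately uses plain callables as stand-ins; it does not call any flex-auth or OpenBao API, and the type names, field names, and path layout are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SecretRequest:
    subject: str      # principal identified by an IAM Profile token
    tenant: str       # for example "tenant:coulomb"
    secret_path: str  # hypothetical path layout, not an OpenBao convention


def request_secret(
    req: SecretRequest,
    authorize: Callable[[SecretRequest], bool],  # stand-in for a flex-auth decision
    issue: Callable[[SecretRequest], str],       # stand-in for OpenBao issuance
) -> str:
    """Issue secret material only after the authorization step allows it."""
    if not authorize(req):
        raise PermissionError(f"denied: {req.subject} -> {req.secret_path}")
    return issue(req)
```

The point of the shape is that the issuing step is unreachable without a positive decision, so the secrets service never has to act as its own policy authority.
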
## Platform Root Custody

Platform root authority is an accountable custody role, not a tenant admin role
and not a Git account secret. `docs/platform-root-custody.md` records
`tegwick` / `bernd.worsch@gmail.com` as the initial setup operator and contact,
not as the long-term platform root of trust.

The actual root-of-trust target is a separate king credential: a dedicated,
rarely used platform-root identity independent from day-to-day Gitea and email
accounts. Email may receive notifications, but Git, Gitea, State Hub, chat,
tickets, shell history, and email must never store or transfer unseal keys,
root tokens, private keys, OTP seeds, recovery codes, or screenshots of secret
output.

Production-ready custody should move toward independent escrow, preferably
two-of-three human or institutional recovery control. Temporary single-operator
king custody is allowed only as a pre-production bootstrap posture with
second-factor protection, encrypted offline storage, and a low-friction upgrade
path to additional custodians.

The normal admin path should become NetKingdom IAM claims mapped to scoped
OpenBao policies. The initial OpenBao root token remains a bootstrap or
break-glass artifact and must not become the standing operator credential. The
platform must also reset or rotate bootstrap-era credentials and access paths
before live workloads rely on it.

## Recursive Trust Rule

Normal tenant administration must never be sufficient to alter the
platform root of trust.

This applies even when the tenant is Coulomb. Coulomb can be a tenant and
a reference workload, but platform-root actions require platform control
plane authority and appropriate bootstrap/break-glass safeguards.

Examples of platform-root actions:

- changing IAM Profile semantics
- rotating root bootstrap keys
- changing break-glass access
- changing global MFA requirements
- activating authorization policy that governs platform administration
- changing flex-auth/Topaz policy import pipelines
- changing OpenBao root tokens, unseal policy, platform mounts, or global
  auth methods
- changing audit retention or tamper-evidence settings

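The rule above can be expressed as a guard that looks only at platform-level roles. This is a sketch under stated assumptions: the action identifiers, the `operator` role name, and the roles-by-tenant shape are all hypothetical, and the bootstrap/break-glass safeguards the rule also requires are not modelled.

```python
# Hypothetical identifiers for the platform-root actions listed above.
PLATFORM_ROOT_ACTIONS = frozenset({
    "iam-profile.change-semantics",
    "bootstrap.rotate-root-keys",
    "break-glass.change-access",
    "mfa.change-global-requirements",
    "authz.activate-platform-admin-policy",
    "authz.change-policy-import-pipeline",
    "openbao.change-root-config",
    "audit.change-retention",
})


def may_perform_platform_root_action(
    action: str, roles_by_tenant: dict[str, set[str]]
) -> bool:
    """Allow a platform-root action only for a platform operator.

    Roles held in any other tenant, including tenant:coulomb, are ignored.
    """
    if action not in PLATFORM_ROOT_ACTIONS:
        raise ValueError(f"not a platform-root action: {action!r}")
    return "operator" in roles_by_tenant.get("tenant:platform", set())
```

A Coulomb administrator with `{"tenant:coulomb": {"admin"}}` is denied; only a role held in `tenant:platform` is considered at all.
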
## Tenant Model

Every protected resource should belong to a tenant or to the platform
control plane.

Suggested identifiers:

```text
tenant:platform          # platform control plane resources
tenant:coulomb           # first internal/reference tenant
tenant:sandbox:<name>    # sandbox tenants
tenant:customer:<name>   # future customer tenants
```

Tenant membership and platform membership are distinct. A subject may be
an administrator in `tenant:coulomb` without being a platform operator.

CARING descriptors should explicitly identify scope and tenant when the
access is tenant-scoped. Platform-scoped descriptors should be rare,
audited, and usually condition-bound.

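A small parser makes the suggested identifier shapes concrete. It is a sketch only: the four shapes come from the list above, while the allowed characters for `<name>` (lowercase letters, digits, hyphens) and the type and function names are assumptions.

```python
import re
from typing import NamedTuple, Optional


class TenantId(NamedTuple):
    kind: str            # "platform", "coulomb", "sandbox", or "customer"
    name: Optional[str]  # set only for sandbox and customer tenants


# The <name> character set is an assumption; the document does not define it.
_TENANT_RE = re.compile(
    r"^tenant:(?:(platform|coulomb)|(sandbox|customer):([a-z0-9][a-z0-9-]*))$"
)


def parse_tenant(identifier: str) -> TenantId:
    """Parse one of the suggested tenant identifier shapes."""
    match = _TENANT_RE.match(identifier)
    if match is None:
        raise ValueError(f"not a recognized tenant identifier: {identifier!r}")
    fixed, scoped, name = match.groups()
    return TenantId(fixed or scoped, name)
```

Keeping `tenant:platform` as its own kind gives policy code one unambiguous value to compare against when separating platform membership from tenant membership.
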
## Bootstrap To Runtime Transition

Production setup should move through explicit trust states:

1. **Bare host trust** - provisioned and verified by Railiance infra.
2. **Cluster trust** - Kubernetes runtime exists and is verified.
3. **Bootstrap secret trust** - age/SOPS and emergency bundles are
   established.
4. **Bootstrap identity trust** - local/bootstrap identity can operate
   enough to install full identity services.
5. **Runtime secret trust** - OpenBao is deployed, initialized, unsealed,
   audited, backed up, and ready to issue scoped secrets.
6. **Runtime identity trust** - key-cape or Keycloak becomes the normal
   IAM Profile issuer.
7. **Runtime authorization trust** - flex-auth and Topaz are initialized
   with platform and tenant policies.
8. **Tenant onboarding trust** - Coulomb and later tenants register
   resources and receive scoped authority.

Each transition needs a verification check and a rollback/recovery path.

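The ordered trust states lend themselves to a simple gate: a state counts as reached only if its own check and every earlier check pass. The sketch below assumes that strict ordering and uses hypothetical names throughout; rollback and recovery paths are not modelled.

```python
from enum import IntEnum
from typing import Callable, Mapping, Optional


class TrustState(IntEnum):
    BARE_HOST = 1
    CLUSTER = 2
    BOOTSTRAP_SECRET = 3
    BOOTSTRAP_IDENTITY = 4
    RUNTIME_SECRET = 5
    RUNTIME_IDENTITY = 6
    RUNTIME_AUTHORIZATION = 7
    TENANT_ONBOARDING = 8


def highest_verified_state(
    checks: Mapping[TrustState, Callable[[], bool]]
) -> Optional[TrustState]:
    """Walk the states in order; stop at the first failed or missing check."""
    reached = None
    for state in TrustState:
        check = checks.get(state)
        if check is None or not check():
            break
        reached = state
    return reached
```

Treating a missing check the same as a failed one keeps the gate fail-closed: a later state cannot be claimed while an earlier one is unverified.
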
## Production Topology

For an initial production-capable Coulomb deployment:

```text
railiance-infra
  host baseline, SSH, age keys, emergency material

railiance-cluster
  Kubernetes, ingress, cert-manager, network policy

railiance-platform
  OpenBao, PostgreSQL, object storage, platform service secret delivery
  key-cape or Keycloak
  privacyIDEA where used
  flex-auth
  Topaz

railiance-apps
  Coulomb services as tenant:coulomb workloads
```

`net-kingdom` owns the architecture and standards. Railiance owns the
converged deployment layers. Component repos own their implementation
contracts.

## Orchestration Implication

A future orchestration repo may be justified, but only after the state
machine is clear. It should not own resources directly. It should own
safe sequencing across repos.

Possible responsibilities:

- verify Railiance preconditions
- initialize credential bootstrap
- deploy or validate identity services
- deploy or validate flex-auth and Topaz
- run IAM Profile conformance checks
- run authorization conformance checks
- produce a platform security readiness report

This orchestration layer should build on Railiance capabilities rather
than bypassing the Railiance stack boundaries.

ADR-0007 records the current decision: keep orchestration in Railiance
playbooks for now, with NetKingdom defining the trust-state model,
readiness checks, OpenBao boundaries, and security semantics.

The playbook interface for that split is the NetKingdom Playbook
Capability Contract (`canon/standards/playbook-capability-contract_v0.1.md`).
Railiance playbooks publish declarations beside the playbooks; NetKingdom
validates and consumes those declarations to select capabilities,
parametrize allowed inputs, and assemble responsibility/trust-state
views without taking over execution.

## flex-auth And Topaz Implications

flex-auth work must preserve the recursive boundary between platform
control-plane resources and tenant resources.

Required implications:

- CARING descriptors must include scope and tenant metadata for
  tenant-scoped access, and must mark rare platform-scoped access
  explicitly.
- Policy packages must distinguish `tenant:platform` policy from
  tenant-local packages such as `tenant:coulomb`.
- Decision envelopes must carry subject, issuer, audience, tenant,
  principal type, groups, roles, scopes, protected-system id, resource,
  action, requested TTL where relevant, assurance evidence, obligations,
  deny reasons, and audit correlation ids. Subject, issuer, audience,
  tenant, principal type, groups, roles, scopes, and assurance come from
  the IAM Profile v0.2 token contract rather than provider-specific
  session state.
- Topaz is a delegated PDP runtime behind flex-auth. It must not become
  the canonical policy model, identity provider, or platform control
  plane.
- Audit and explain records must be durable enough to reconstruct why a
  platform-root, secret, credential, or tenant-administration decision was
  allowed or denied.
- Platform-root guardrails must deny tenant administrators the ability to
  alter IAM Profile semantics, OpenBao platform mounts/auth methods,
  flex-auth policy import pipelines, Topaz runtime configuration, or
  platform audit retention.

OpenBao secret access and dynamic credential requests follow the same
authorization rule: identity proves the actor or workload, flex-auth
decides whether the request is permitted, and OpenBao stores, issues,
leases, audits, and revokes the secret material.

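The decision-envelope field list above maps naturally onto a record type. The following is a sketch, not the flex-auth schema: the field names are Python spellings of the listed items, while the `allowed` flag and the two consistency rules in `__post_init__` are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecisionEnvelope:
    # Identity fields, sourced from the IAM Profile v0.2 token contract.
    subject: str
    issuer: str
    audience: str
    tenant: str
    principal_type: str
    groups: tuple[str, ...]
    roles: tuple[str, ...]
    scopes: tuple[str, ...]
    assurance: dict
    # Request fields.
    protected_system_id: str
    resource: str
    action: str
    # Decision fields; "allowed" is a hypothetical name for the outcome.
    allowed: bool
    requested_ttl_seconds: Optional[int] = None  # only where relevant
    obligations: tuple[str, ...] = ()
    deny_reasons: tuple[str, ...] = ()
    audit_correlation_ids: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        # Assumed rules: a deny must be explainable, and every decision auditable.
        if not self.allowed and not self.deny_reasons:
            raise ValueError("a deny decision must carry at least one deny reason")
        if not self.audit_correlation_ids:
            raise ValueError("every decision needs an audit correlation id")
```

Rejecting envelopes that lack a deny reason or a correlation id at construction time is one way to keep audit and explain records reconstructable later.
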
## Coulomb Tenant Onboarding Path

The first Coulomb tenant onboarding path should be repeatable before it
becomes automated:

1. Register `tenant:coulomb` as a tenant distinct from
   `tenant:platform`.
2. Map Coulomb human, service, and agent principals to IAM Profile claims
   with issuer, audience, subject, group, tenant, and assurance evidence.
3. Register Coulomb protected systems and resources in flex-auth with
   stable protected-system ids.
4. Import tenant-scoped policy packages and CARING descriptors for
   Coulomb resources.
5. Initialize the delegated PDP runtime, starting with Topaz, using only
   the policy packages approved for the tenant and platform boundary.
6. Provision Coulomb workload secret paths, Kubernetes auth roles, or
   delivery mechanisms in OpenBao without granting access to platform
   mounts, unseal/recovery material, or global auth configuration.
7. Run audit readiness checks before admitting production traffic:
   identity issuance, flex-auth decision envelope, Topaz health,
   OpenBao audit event, workload enforcement event, and correlation id.

The onboarding path is complete when a Coulomb workload can authenticate,
receive a scoped authorization decision, obtain only the allowed secret or
short-lived credential, enforce the decision locally, and produce an
auditable record without receiving platform-root authority.

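The audit readiness check in step 7 amounts to confirming that one request leaves a trace in every layer under a shared correlation id. The sketch below assumes each layer's evidence has already been collected into a list of dictionaries; the source labels, the event shape, and the function name are hypothetical, and Topaz health is simplified to "an event was recorded".

```python
# Hypothetical source labels for the five evidence points named in step 7.
REQUIRED_SOURCES = ("identity", "flex-auth", "topaz", "openbao", "workload")


def audit_gaps(events: list[dict], correlation_id: str) -> list[str]:
    """Return the sources with no event for the given correlation id."""
    seen = {
        event.get("source")
        for event in events
        if event.get("correlation_id") == correlation_id
    }
    return [source for source in REQUIRED_SOURCES if source not in seen]
```

An empty result for a test request is one plausible signal that the tenant's audit trail is joined up end to end before production traffic is admitted.
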
## Capability Progression (Start Small → Enterprise)

NetKingdom is designed so an IT landscape can be brought up from nothing
and hardened **one capability at a time**, with no structural rework when
the next capability is added. Every tier is usable on its own and every
tier issues or consumes the same NetKingdom IAM Profile, so adding a
capability extends the system rather than replacing it.

The progression is capability-keyed: you climb a tier when you need the
capability it adds, never because of user count.

| Tier | Capability added | Components added | You move here when… |
| --- | --- | --- | --- |
| **C0: Bootstrap identity** | A local OIDC issuer + secret bootstrap so things can start safely before the platform exists | local-identity (NK-WP-0002), SOPS/age + agent bootstrap (NK-WP-0004/0005) | you have nothing yet and need dev/test/sandbox identity |
| **C1: Lightweight SSO** | Single-factor OIDC SSO over an internal directory | key-cape: Authelia + LLDAP | you want real SSO for internal users/services |
| **C2a: 2FA (light)** | Second factor without a new heavy service | Authelia built-in TOTP / WebAuthn | you need 2FA but not enterprise token lifecycle |
| **C2b: Token authority** | Hardware tokens, many token types, self-service enrollment, token lifecycle | privacyIDEA | you need an enterprise-grade MFA/token authority |
| **C3: Runtime secrets** | Dynamic, scoped, leased, audited secrets beyond bootstrap | OpenBao (NK-WP-0006) | workloads need runtime secrets, not just bootstrap material |
| **C4: Fine-grained authZ** | Policy-as-code decisions beyond coarse SSO claims | flex-auth + Topaz PDP (ADR-0006) | identity alone can no longer answer "may this actor do this?" |
| **C5: Enterprise federation** | Inbound Entra ID / AD / SAML brokering, multi-tenant realms | expanded-mode Keycloak (NK-WP-0011) | identities originate in an external enterprise IdP |
| **C6: Self-optimizing** | Audit feedback loops, drift surfacing, continuous adaptation | central audit sink + kaizen loops | the platform should improve and verify itself continuously |

Two properties make this safe rather than just sequential:

- **Usable at every tier.** C1 is a working SSO platform; you are never
  forced to reach C5 to get value.
- **No structural breaks.** Because every tier targets the IAM Profile
  contract, 2FA (C2), runtime secrets (C3), fine-grained authorization
  (C4), and federation (C5) are *additive*. Applications keep targeting
  the same Profile; the implementation behind it grows.

2FA illustrates the principle precisely: if you do not need a second
factor, C2 is simply absent: the C1 stack runs without privacyIDEA. When
you do, C2a (Authelia's built-in TOTP/WebAuthn) is the light option and
C2b (privacyIDEA) is the enterprise token-authority option. Neither
requires re-architecting C1.

The intent is turn-key: NetKingdom selects, places, and orchestrates the
components for the chosen tier set so the landscape reaches ready-to-run
state, like building a house to handover condition, and can be extended
to the next tier later without demolition.

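The capability-keyed rule can be shown as a lookup from required capabilities to tiers, with user count deliberately absent from the inputs. The tier labels come from the table above; the capability keys and the function name are hypothetical, and the sketch does not model any dependencies between tiers.

```python
# Tier labels are from the table above; the capability keys are hypothetical.
TIER_FOR_CAPABILITY = {
    "bootstrap-identity": "C0",
    "sso": "C1",
    "2fa-light": "C2a",
    "token-authority": "C2b",
    "runtime-secrets": "C3",
    "fine-grained-authz": "C4",
    "enterprise-federation": "C5",
    "self-optimizing": "C6",
}


def tiers_for(capabilities: set[str]) -> list[str]:
    """Map required capabilities to the tier set that provides them."""
    unknown = capabilities - set(TIER_FOR_CAPABILITY)
    if unknown:
        raise ValueError(f"unknown capabilities: {sorted(unknown)}")
    return sorted(TIER_FOR_CAPABILITY[c] for c in capabilities)


print(tiers_for({"sso", "runtime-secrets"}))
```

A deployment that needs SSO and runtime secrets selects C1 and C3 and nothing else; federation (C5) enters the set only when that capability is requested.
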
## Production Readiness Checks

Before the security platform is production-ready, each trust state needs
an explicit check:

| Area | Readiness check |
| --- | --- |
| Platform root custody | setup operator, dedicated king credential, second factor, recovery storage, escrow posture, and root-token disposition are recorded without storing secret values |
| MFA and identity | key-cape or Keycloak issues IAM Profile v0.2-compatible tokens and passes `tools/iam-profile-conformance/`; privacyIDEA or the selected MFA provider enforces required assurance for privileged actions |
| Bootstrap and recovery | age/SOPS material, emergency bundle, and break-glass credentials are present, tested, and separated from tenant administration |
| OpenBao runtime secrets | OpenBao is initialized, unsealed or auto-unsealed by the approved mechanism, backed up, audited, and using scoped auth methods and mounts |
| Secret rotation | service, database, OpenBao-issued, and break-glass rotation paths have documented blast radius and verification steps |
| flex-auth policy state | platform and tenant policy packages are versioned, reviewable, imported, and explainable |
| Topaz runtime | delegated PDP health, data freshness, policy load status, and fail-closed behavior are verified |
| Tenant onboarding | `tenant:coulomb` resources, claims, policies, OpenBao paths, and audit correlation are registered and tested |
| Audit sink | identity, flex-auth, Topaz, OpenBao, Kubernetes, and workload audit records land in durable storage with restore/drill coverage |
| Break-glass | emergency access works when normal identity is unavailable and produces a post-event review record |

## Open Questions

- Where is the durable audit log stored for platform-root decisions?
- Where are OpenBao audit logs durably shipped, and how are they included
  in tamper-evidence and restore drills?
- Which actions require dual control or human confirmation?
- How is break-glass use recorded when normal identity is unavailable?
- Which workloads consume OpenBao directly, via External Secrets Operator,
  or via CSI-mounted secrets?
- Which tenant metadata is required before a service can register
  resources with flex-auth?
- What precise per-tenant trigger and dual-issuer coexistence rule should
  NK-WP-0011-T1 use for Keycloak expanded mode?
- Does Topaz run centrally for the platform, per tenant, or per service
  for the first production deployment?