# Configuration Layering and the Configuration Control Plane — Research Digest

> Compiled 2026-06-26. Numbered references resolve in [`sources.md`](sources.md).
> This digest deepens the repo's own [ConfigLayering primer](../wiki/ConfigLayering.md)
> and [CompetitiveLandscape](../wiki/CompetitiveLandscape.md) with primary sources
> and the surrounding technical context.

---

## 1. The thesis in one paragraph

Configuration stopped being static data a long time ago. It is now *distributed
control information*: the live mechanism that changes how production systems
behave, in real time, often faster and with less ceremony than a code deploy. As
cloud-native scale grew, the industry independently converged on treating
configuration as a **control plane** — something that needs staged rollout,
blast-radius containment, dependency-aware validation, and automated rollback,
exactly like the deployment systems it sits beside [1]. **ConfigAtlas** bets that
before companies can *control* that surface safely, they first need to *see* it:
discover where configuration lives, classify it by kind and scope, resolve the
effective value, and attach ownership and evidence. Map the territory, then govern
it.

---

## 2. Why this matters now: configuration is the dominant failure mode

The strongest argument for a configuration control plane is the outage record. A
disproportionate share of large 2024–2026 incidents trace to a configuration
change rather than a code defect [4][5]:

- **CrowdStrike (Jul 2024)** — a faulty Falcon *sensor configuration* update
  blue-screened Windows hosts worldwide; estimated ~$5.4B impact to Fortune 500
  firms alone. A content/config push, not a binary release [5].
- **AT&T Mobility (Feb 2024)** — an equipment *configuration error* took down
  ~125M devices for 12+ hours, blocking ~92M calls including 25,000 to 911 [5].
- **Cloudflare (Nov 2025)** — a global outage taking down X, ChatGPT, Spotify and
  others, triggered by a software bug *exposed by a configuration change* [5].
- **Azure Front Door (Nov 2025) / Azure networking (2025)** — a control-plane
  defect and a networking *configuration change* produced multi-hour to ~50-hour
  degradations across services [4][7].

ThousandEyes' 2024 internet-outage analysis names configuration change as a
leading, recurring cause [4]. The lesson the hyperscalers drew is not "stop
changing config" — it is "make unsafe configuration changes progressively harder
to express, deploy, or overlook" [1]. That sentence is essentially the ConfigAtlas
mission restated as a safety property.

---

## 3. Configuration layering — the resolution model

Layering is the practice of composing one **effective configuration** from
multiple ordered scopes. The repo's primer [internal] gives the canonical stack;
the research backs *why* each design choice is non-negotiable.

### 3.1 The scope stack

```
L0  vendor/product defaults
L1  company baseline
L2  platform/domain baseline
L3  environment overlay (dev/test/stage/prod)
L4  region/zone/cluster overlay
L5  installation/deployment overlay
L6  tenant/customer/community overlay
L7  group/role overlay
L8  user/agent/workload overlay
L9  emergency/runtime override
```

"More specific wins" is the default, but **higher layers may declare
non-overridable guardrails** (a security baseline a tenant cannot loosen). Here
"higher" means broader in scope, which is lower-numbered in the stack above. This is
the same base+overlay pattern behind Kubernetes Kustomize, Helm value precedence,
and NixOS modules [8][9] — the industry already agrees on the shape; what is
missing is a cross-tool *view* of it.

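
The precedence rule and the guardrail exception can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not ConfigAtlas's resolver: the `Layer` type, its `locked` field, and the example keys are all hypothetical, and the sketch does not decide whether an L9 emergency override may bypass a guardrail.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Layer:
    level: int                            # position in the L0..L9 stack
    name: str
    values: dict[str, Any] = field(default_factory=dict)
    locked: frozenset[str] = frozenset()  # keys narrower layers may not override


def resolve(layers: list[Layer]) -> dict[str, Any]:
    """Apply layers from broadest to most specific, honouring guardrails."""
    effective: dict[str, Any] = {}
    locked: set[str] = set()
    for layer in sorted(layers, key=lambda l: l.level):
        for key, value in layer.values.items():
            if key not in locked:         # no broader layer has pinned this key
                effective[key] = value
        locked |= layer.locked            # guardrails bind every later layer
    return effective


layers = [
    Layer(0, "vendor", {"tls.min_version": "1.0", "ui.theme": "light"}),
    Layer(1, "company", {"tls.min_version": "1.2"},
          locked=frozenset({"tls.min_version"})),
    Layer(6, "tenant", {"tls.min_version": "1.0", "ui.theme": "dark"}),
]
effective = resolve(layers)
```

In this example the tenant's attempt to lower `tls.min_version` is ignored because the company baseline locked it, while `ui.theme` follows the ordinary more-specific-wins rule. A production resolver would probably record or reject the blocked override rather than drop it silently.
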
### 3.2 The effective configuration is the only thing that's real

A file or a flag is partial evidence. The value that actually applies to a given
system/tenant/request is the resolved result of every relevant layer. The central
product capability — and the line between a config *database* and a config
*control plane* — is answering: **what value applies here, which layer won, what
did it override, which policy constrained it, and who is affected** [internal,
CompetitiveLandscape §"Effective configuration resolution"].

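
As a rough sketch of the first three of those questions (what applies, which layer won, what it overrode), the resolver can keep the full history for a key instead of only the final value. The `Explanation` type, the `explain` function, and the layer names below are hypothetical; policy constraints and the affected-party lookup are left out.

```python
from typing import Any, NamedTuple


class Explanation(NamedTuple):
    key: str
    value: Any                           # the effective value
    winner: str                          # the layer that supplied it
    overridden: list[tuple[str, Any]]    # (layer, value) pairs that lost


def explain(key: str, layers: list[tuple[str, dict[str, Any]]]) -> Explanation:
    """`layers` is ordered from least to most specific."""
    history = [(name, values[key]) for name, values in layers if key in values]
    if not history:
        raise KeyError(key)
    winner, value = history[-1]          # the most specific layer wins
    return Explanation(key, value, winner, history[:-1])


layers = [
    ("L0 vendor", {"timeout_s": 30}),
    ("L3 prod", {"timeout_s": 10}),
    ("L6 tenant acme", {"timeout_s": 5}),
]
result = explain("timeout_s", layers)
```

Returning the losing values alongside the winner is what lets a single lookup also serve as change evidence.
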
### 3.3 Merge semantics are where layering quietly fails

Vague merge behavior is the most dangerous part of layering. Define it explicitly:

```
scalar       more specific layer replaces earlier value
object/map   deep merge by key
array/list   replace by default; keyed merge only if declared
null         not deletion unless tombstone semantics are defined
secret       never merged into normal config
policy       restrictive rule wins unless explicitly delegated
```

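
The first four rows of that table can be sketched as one recursive function. This is an illustration under assumptions, not a reference implementation: `TOMBSTONE` is a hypothetical explicit deletion marker, and `null` is interpreted here as "keep the inherited value", which is only one of the readings the table allows. Keyed list merge, secrets, and policy are not modelled.

```python
from typing import Any

TOMBSTONE = object()  # hypothetical marker meaning "delete this key"


def merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Merge a more specific overlay onto a base without mutating either."""
    result = dict(base)
    for key, value in overlay.items():
        if value is TOMBSTONE:
            result.pop(key, None)                    # explicit deletion
        elif value is None:
            continue                                 # null is not deletion
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)  # object/map: deep merge
        else:
            result[key] = value                      # scalar or list: replace
    return result


base = {"limits": {"cpu": 2, "mem": "4Gi"}, "hosts": ["a", "b"],
        "debug": False, "legacy": True}
overlay = {"limits": {"cpu": 4}, "hosts": ["c"],
           "debug": None, "legacy": TOMBSTONE}
merged = merge(base, overlay)
```

Only `cpu` changes inside `limits`, the `hosts` list is replaced wholesale, `debug` keeps its inherited value, and `legacy` is removed because the overlay asked for that explicitly.
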
The schema/validation choice matters here. **JSON Schema** validates structure and
constraints but keeps schema and data separate. **CUE** unifies types and values
in a single lattice where merge (`&`) is commutative, associative, and idempotent
— so the resolved result is *order-independent*, and the same definition both
validates data and reduces boilerplate [2][3]. By contrast Jsonnet's `+` mixin
composition is order-dependent (right-hand side wins on scalar conflicts) [2].
For a control plane whose whole value proposition is a *deterministic, explainable*
effective value, order-independent merge is a meaningful property, not a detail.
Notably, CUE itself now ships **CUE Hub**, explicitly branded "the Configuration
Control Plane" — independent validation that the category name is forming [6].

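
The difference between the two merge styles shows up even on flat dictionaries. The Python below is a toy analogue only: `override` mimics right-hand-side-wins composition, and `unify` mimics the order-independent behaviour by treating disagreement as an error. It is neither CUE nor Jsonnet, and it does not model CUE's unification of types with values.

```python
class Conflict(Exception):
    """Raised when two inputs disagree on a value."""


def override(a: dict, b: dict) -> dict:
    """Order-dependent: the right-hand side wins on conflicts."""
    return {**a, **b}


def unify(a: dict, b: dict) -> dict:
    """Order-independent: equal values agree, different values conflict."""
    out = dict(a)
    for key, value in b.items():
        if key in out and out[key] != value:
            raise Conflict(f"{key}: {out[key]!r} vs {value!r}")
        out[key] = value
    return out


a = {"replicas": 3}
b = {"replicas": 5}
c = {"region": "eu"}

swapped_differs = override(a, b) != override(b, a)  # argument order changes the result
commutes = unify(a, c) == unify(c, a)               # argument order does not matter
idempotent = unify(a, a) == a
```

With `unify`, a disagreement such as `a` versus `b` surfaces as a `Conflict` rather than being resolved by whichever input happened to come last, which is the property that makes the result explainable.
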
### 3.4 Mutability classes prevent the worst failure mode

Every key should declare how it can change: `build-time`, `deploy-time`,
`startup-time`, `hot-reloadable`, `per-request`, `emergency`. The recurring
failure is treating dangerous structural config like a harmless flag — exactly the
CrowdStrike-shaped risk where a "content update" had deploy-grade blast radius [5].

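
One way to enforce that declaration is to reject any change that arrives through a faster path than the key allows. The sketch below assumes the six classes can be ordered from slowest to fastest in the order listed above, which the primer does not state; the `registry` contents and key names are hypothetical.

```python
from enum import IntEnum


class Mutability(IntEnum):
    # Assumed ordering: slowest, most ceremonious change path first.
    BUILD_TIME = 0
    DEPLOY_TIME = 1
    STARTUP_TIME = 2
    HOT_RELOADABLE = 3
    PER_REQUEST = 4
    EMERGENCY = 5


def check_change(key: str, channel: Mutability,
                 registry: dict[str, Mutability]) -> None:
    """Raise if `channel` is a faster change path than the key declares."""
    declared = registry[key]
    if channel > declared:
        raise PermissionError(
            f"{key} is {declared.name}; refusing change via {channel.name}"
        )


registry = {
    "sensor.rules": Mutability.DEPLOY_TIME,    # structural: needs a rollout
    "ui.banner": Mutability.HOT_RELOADABLE,    # harmless flag
}

# A slower path than declared is always acceptable.
check_change("ui.banner", Mutability.DEPLOY_TIME, registry)

# A structural key pushed through the hot-reload path is refused.
try:
    check_change("sensor.rules", Mutability.HOT_RELOADABLE, registry)
    blocked = False
except PermissionError:
    blocked = True
```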

---

## 4. The adjacent topics (the converging market)

The control plane is not one product; it is a convergence of tool families.
ConfigAtlas's stance is **integrate and map, don't replace** [internal,
CompetitiveLandscape]. Summary of each adjacency and the research behind it:

### 4.1 Configuration-as-Data (the closest intellectual neighbor)
Brian Grant — creator of the Kubernetes Resource Model (KRM), now CTO of ConfigHub
— argues configuration should be *data*, authoritative and stored like data, with
code that operates on it kept separate [10][11]. ConfigHub stores each variant in
fully-rendered "WET" form (no templates/variables/generators), versioned with
metadata, and — because KRM *is* the API representation — can update config *from*
live state, mitigating drift bidirectionally [10][12]. This is the strongest
direct competitor and the sharpest articulation of "config is graph-shaped
operational data, not files." **ConfigAtlas differentiation:** discovery-first and
cross-tool — map config that already lives in many systems, rather than asking
everyone to move into one store.

### 4.2 GitOps / IaC — desired state and drift
Argo CD and Flux continuously reconcile live cluster state against Git-declared
desired state; any divergence is *drift*, flagged or auto-corrected on a sync loop
[13]. Terraform/OpenTofu do the same for infrastructure lifecycle. This camp owns
the "desired state" narrative. **ConfigAtlas complements it with the "effective
state" narrative:** GitOps tells you what you *intended* to deploy; ConfigAtlas
tells you which scopes contributed, what actually applies, who owns it, and what's
risky to change [internal].

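
Drift in this sense is just the difference between two views of the same keys. The following sketch compares a flat desired-state dictionary with a flat live-state dictionary; it is not how Argo CD or Flux compute their diffs, and the `drift` function, the `MISSING` sentinel, and the sample keys are hypothetical.

```python
from typing import Any

MISSING = object()  # marks a key that is absent on one side


def drift(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, tuple]:
    """Return {key: (desired_value, live_value)} for every key that differs."""
    report = {}
    for key in desired.keys() | live.keys():
        want = desired.get(key, MISSING)
        have = live.get(key, MISSING)
        if want != have:
            report[key] = (want, have)
    return report


desired = {"replicas": 3, "image": "app:1.4"}
live = {"replicas": 5, "image": "app:1.4", "debug": "true"}
report = drift(desired, live)
```

The report flags `replicas` as changed and `debug` as present only in the live state. A reconciler would act on that report; an effective-state view would also record which scope introduced each live value.
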
### 4.3 Feature flags / runtime control — and the AI-era expansion
Feature management (LaunchDarkly, Unleash, Flagsmith, OpenFeature as the
vendor-neutral standard) owns live behavior change and **progressive delivery**:
ring-based rollout (internal → 1–5% canary → 10–25% beta → 100%), deterministic
cohorts for blast-radius containment, and kill switches / circuit breakers that
auto-deactivate on SLO breach [14][15]. The frontier is **AI configuration**:
LaunchDarkly's AI Configs / AgentControl move prompts, model selection, and tool
access out of code into runtime config that propagates in <200ms, with guarded
rollouts that auto-revert when eval metrics (accuracy, toxicity) drop [16][17].
This validates the core ConfigAtlas claim — the *kinds* of configuration keep
multiplying (now: agent behavior), so a map that spans kinds is increasingly
valuable. **ConfigAtlas treats flags as one scope class among many**, not the
whole plane [internal].

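
Deterministic cohorts are commonly built by hashing a stable identifier into a bucket, so the same unit always lands in the same ring and widening the percentage only ever adds units. The sketch below shows that general technique; it is not the bucketing algorithm of any vendor named above, and the function names and flag key are hypothetical.

```python
import hashlib


def bucket(flag: str, unit_id: str) -> float:
    """Map (flag, unit) to a stable value in [0, 100) with 0.01 steps."""
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode()).digest()
    return (int.from_bytes(digest[:4], "big") % 10_000) / 100


def in_rollout(flag: str, unit_id: str, percent: float,
               kill_switch: bool = False) -> bool:
    """True if the unit falls inside the current rollout ring."""
    if kill_switch:                 # a tripped kill switch overrides everything
        return False
    return bucket(flag, unit_id) < percent


users = [f"user-{i}" for i in range(200)]
canary = {u for u in users if in_rollout("new-checkout", u, 5)}
beta = {u for u in users if in_rollout("new-checkout", u, 25)}
```

Because the bucket depends only on the flag and the unit, `canary` is always a subset of `beta`, which is what keeps blast radius predictable as a rollout moves from ring to ring. Salting the hash with the flag name keeps different flags from always picking the same early cohort.
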
### 4.4 Secrets management — adjacent but kept separate
This family includes Vault, OpenBao, Infisical, and Doppler, plus SOPS and External
Secrets for the GitOps path. Secrets differ in sensitivity, lifecycle, and blast
radius and must never be merged into ordinary config [internal]. **ConfigAtlas
stores references and dependencies, never values** — which config depends on which
secret, where it's injected, what's affected if it rotates.

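
A reference-only index is enough to answer the rotation question without ever holding a secret value. The sketch below is one possible shape for it; the `SecretRefIndex` class, the `vault://` reference strings, and the config keys are hypothetical and do not correspond to any real tool's URI scheme.

```python
from collections import defaultdict


class SecretRefIndex:
    """Tracks which config keys depend on which secret, by reference only."""

    def __init__(self) -> None:
        self._dependents: dict[str, set[tuple[str, str]]] = defaultdict(set)

    def add(self, config_key: str, secret_ref: str, injected_as: str) -> None:
        # Only the reference and the injection point are stored, never a value.
        self._dependents[secret_ref].add((config_key, injected_as))

    def affected_by_rotation(self, secret_ref: str) -> list[tuple[str, str]]:
        return sorted(self._dependents.get(secret_ref, ()))


index = SecretRefIndex()
index.add("billing.db.password", "vault://payments/db", "env:DB_PASSWORD")
index.add("reports.db.password", "vault://payments/db", "file:/run/secrets/db")
index.add("mail.api_key", "vault://mail/api", "env:MAIL_KEY")

affected = index.affected_by_rotation("vault://payments/db")
```

Rotating the payments database secret touches two consumers, each with a different injection path, and the index can report that while remaining safe to store alongside ordinary config.
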
### 4.5 Policy-as-code — the guardrail backend
OPA, Kyverno, Checkov answer "is this change allowed?" across K8s, CI/CD, IaC, and
more [internal]. They are ideal *validation backends* for a control plane but
don't model provenance, ownership, or effective behavior. **ConfigAtlas is the
context and evidence layer around them** — which policy applies, at which scope,
and why.

### 4.6 CMDB / developer portals / SSPM — the enterprise gravity wells
CMDBs (ServiceNow et al.) model assets and services; developer portals (Backstage,
Port, Cortex, OpsLevel) model ownership; SSPM tools (CoreView, AppOmni) model SaaS
posture drift [internal]. None model the layered behavioral config surface with
effective-value resolution. **ConfigAtlas integrates** — enriching catalogs and
portals rather than displacing them; a Backstage/Port plugin is a plausible
adoption path.

---

## 5. Reference architecture for a configuration control plane

Synthesizing the layering primer with the control-plane framing [1][internal]:

```
Config Canon      vocabulary + schema (what a key means)
Config Registry   every key: owner, type, allowed scopes, lifecycle, mutability, security class
Config Resolver   deterministic layer ordering -> effective value (the "explain" engine)
Config Policy     allowed values + allowed overrides (OPA/Kyverno/CUE backends)
Config Delivery   env vars / ConfigMaps / sidecar / SDK / API lookup
Config Evidence   snapshots, who/what/why/when, drift, rollout, rollback
```

The InfoQ framing adds three forward-looking elements that map directly onto this:
**reconciler-first control planes** (resolution as a continuous loop, à la GitOps),
**configuration knowledge graphs** (the `key → service → deployment → tenant →
feature → policy → secret → owner → incident` graph), and **AI-assisted decision
support** (surfacing blast radius and risk before a human approves a change) [1].
The knowledge-graph element is precisely ConfigAtlas's differentiator.

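
Once configuration is held as a graph, blast radius becomes a reachability query. The sketch below runs a breadth-first traversal over directed "affects" edges; the edge list, the node naming scheme, and the `blast_radius` function are hypothetical, and a real graph would carry typed edges rather than bare pairs.

```python
from collections import deque


def blast_radius(edges: list[tuple[str, str]], start: str) -> set[str]:
    """Return every node reachable from `start` along directed edges."""
    graph: dict[str, set[str]] = {}
    for source, target in edges:
        graph.setdefault(source, set()).add(target)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    seen.discard(start)             # the changed node itself is not "affected"
    return seen


edges = [
    ("key:db.pool_size", "service:billing"),
    ("service:billing", "deployment:billing-prod-eu"),
    ("deployment:billing-prod-eu", "tenant:acme"),
    ("secret:db-password", "service:billing"),
    ("service:search", "deployment:search-prod"),
]
affected = blast_radius(edges, "key:db.pool_size")
```

Changing `key:db.pool_size` reaches the billing service, its production deployment, and the tenant served by it, but not the unrelated search service. This is the kind of answer a reviewer would want surfaced before approving a change.
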
Guiding rule from the primer: **put config as close as possible to its owner, but
as high as necessary for consistency** — defaults with the product, guardrails
high and central, tenant prefs low, secrets outside, flags in the runtime plane,
infra state in GitOps.

---

## 6. The wedge and the white space

The defensible opening is **read-first configuration intelligence**, not
write-first control [internal, CompetitiveLandscape]. The category name
("Configuration Control Plane") is emerging and not yet owned — InfoQ frames it as
a pattern [1], CUE markets a product under the exact phrase [6], ConfigHub attacks
the same instinct from the data angle [10]. None yet own the **companywide living
configuration surface**: cross-tool discovery, effective-value resolution,
organizational scope/ownership governance, blast-radius/dependency intelligence,
and change evidence.

Sharpest positioning [internal]:

> **ConfigAtlas is not where all configuration must live. It is where
> configuration becomes visible, explainable, governable, and safe to change.**

---

## 7. Open questions to drive the next research pass

1. **Discovery connectors** — what is the minimum viable set of ingestion sources
   (Git, K8s, Terraform state, a feature-flag platform, a secret manager) to
   prove cross-tool effective-config resolution end to end?
2. **Effective-value provenance schema** — can the registry's entry schema carry
   enough to render a full `config explain` (source layer, overrides, validating
   schema, owner) without becoming a second source of truth for values?
3. **Graph model** — what is the canonical edge set for the configuration
   knowledge graph, and does it reuse the State Hub's existing relationship model?
4. **CUE vs JSON Schema** for atlas entry validation — does order-independent
   merge buy enough to justify the toolchain cost over JSON Schema? [2][3]
5. **AI-config as a first-class scope** — given the LaunchDarkly trajectory [16],
   should "agent/model configuration" be a named scope class in the L-stack now?