config-atlas/research/configuration-control-plane.md

# Configuration Layering and the Configuration Control Plane — Research Digest

> Compiled 2026-06-26. Numbered references resolve in [`sources.md`](sources.md).
> This digest deepens the repo's own [ConfigLayering primer](../wiki/ConfigLayering.md)
> and [CompetitiveLandscape](../wiki/CompetitiveLandscape.md) with primary sources
> and the surrounding technical context.

---

## 1. The thesis in one paragraph

Configuration stopped being static data a long time ago. It is now *distributed
control information*: the live mechanism that changes how production systems
behave, in real time, often faster and with less ceremony than a code deploy. As
cloud-native scale grew, the industry independently converged on treating
configuration as a **control plane** — something that needs staged rollout,
blast-radius containment, dependency-aware validation, and automated rollback,
exactly like the deployment systems it sits beside [1]. **ConfigAtlas** bets that
before companies can *control* that surface safely, they first need to *see* it:
discover where configuration lives, classify it by kind and scope, resolve the
effective value, and attach ownership and evidence. Map the territory, then govern
it.

---

## 2. Why this matters now: configuration is the dominant failure mode

The strongest argument for a configuration control plane is the outage record. A
disproportionate share of large 2024–2026 incidents trace to a configuration
change rather than a code defect [4][5]:

- **CrowdStrike (Jul 2024)** — a faulty Falcon *sensor configuration* update
  blue-screened Windows hosts worldwide; estimated ~$5.4B impact to Fortune 500
  firms alone. A content/config push, not a binary release [5].
- **AT&T Mobility (Feb 2024)** — an equipment *configuration error* took down
  ~125M devices for 12+ hours, blocking ~92M calls including 25,000 to 911 [5].
- **Cloudflare (Nov 2025)** — a global outage taking down X, ChatGPT, Spotify and
  others, triggered by a software bug *exposed by a configuration change* [5].
- **Azure Front Door (Nov 2025) / Azure networking (2025)** — a control-plane
  defect and a networking *configuration change* produced multi-hour to ~50-hour
  degradations across services [4][7].

ThousandEyes' 2024 internet-outage analysis names configuration change as a
leading, recurring cause [4]. The lesson the hyperscalers drew is not "stop
changing config" — it is "make unsafe configuration changes progressively harder
to express, deploy, or overlook" [1]. That sentence is essentially the ConfigAtlas
mission restated as a safety property.

---

## 3. Configuration layering — the resolution model

Layering is the practice of composing one **effective configuration** from
multiple ordered scopes. The repo's primer [internal] gives the canonical stack;
the research shows *why* each design choice is non-negotiable.

### 3.1 The scope stack

```
L0 vendor/product defaults
L1 company baseline
L2 platform/domain baseline
L3 environment overlay (dev/test/stage/prod)
L4 region/zone/cluster overlay
L5 installation/deployment overlay
L6 tenant/customer/community overlay
L7 group/role overlay
L8 user/agent/workload overlay
L9 emergency/runtime override
```

"More specific wins" is the default, but **higher layers may declare
non-overridable guardrails** (a security baseline a tenant cannot loosen), where
"higher" means broader in scope, i.e. a lower L-number. This is the same
base+overlay pattern behind Kubernetes Kustomize, Helm value precedence, and NixOS
modules [8][9]. The industry already agrees on the shape; what is missing is a
cross-tool *view* of it.
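
As a minimal sketch of that precedence rule, assuming flat keys and a hypothetical `Layer` record with a `locked` set (neither name comes from the primer), a resolver could apply layers from broadest to most specific and skip any override of a key that a broader layer has locked:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Layer:
    level: int                            # 0 = vendor defaults ... 9 = emergency override
    name: str
    values: dict[str, Any] = field(default_factory=dict)
    locked: frozenset[str] = frozenset()  # keys that more specific layers may not override


def resolve(layers: list[Layer]) -> dict[str, Any]:
    """More specific wins, unless a broader layer locked the key."""
    effective: dict[str, Any] = {}
    locked_by: dict[str, str] = {}
    for layer in sorted(layers, key=lambda l: l.level):
        for key, value in layer.values.items():
            if key in locked_by:
                continue                  # guardrail: the override attempt is ignored
            effective[key] = value
        for key in layer.locked:
            locked_by.setdefault(key, layer.name)
    return effective


company = Layer(1, "company", {"tls.min_version": "1.2", "theme": "light"},
                locked=frozenset({"tls.min_version"}))
tenant = Layer(6, "tenant-a", {"tls.min_version": "1.0", "theme": "dark"})
print(resolve([tenant, company]))
# {'tls.min_version': '1.2', 'theme': 'dark'}
```

The sketch drops the blocked override silently. A production resolver would more plausibly record or reject it, so the tenant can see that the value was constrained.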

### 3.2 The effective configuration is the only thing that's real

A file or a flag is partial evidence. The value that actually applies to a given
system/tenant/request is the resolved result of every relevant layer. The central
product capability — and the line between a config *database* and a config
*control plane* — is answering: **what value applies here, which layer won, what
did it override, which policy constrained it, and who is affected** [internal,
CompetitiveLandscape §"Effective configuration resolution"].
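
The first three of those questions can be sketched as a small `explain` function. This is an illustration under stated assumptions, not the product's API: `Explanation` and `explain` are hypothetical names, layers are passed broadest first, and the policy and affected-party parts of the answer are left out.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Explanation:
    key: str
    value: Any                                # the effective value
    winner: str                               # layer that supplied it
    overridden: tuple[tuple[str, Any], ...]   # (layer, value) pairs that lost


def explain(key: str, layers: list[tuple[str, dict[str, Any]]]) -> Optional[Explanation]:
    """`layers` is ordered broadest first; later entries are more specific."""
    hits = [(name, values[key]) for name, values in layers if key in values]
    if not hits:
        return None                           # no layer defines this key
    winner, value = hits[-1]
    return Explanation(key, value, winner, tuple(hits[:-1]))


layers = [
    ("L1 company", {"timeout_s": 30}),
    ("L3 prod", {"timeout_s": 10}),
    ("L6 tenant-a", {"theme": "dark"}),
]
print(explain("timeout_s", layers))
```

Returning the losing values alongside the winner is the point of the design: the answer carries its own provenance instead of only the final number.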

### 3.3 Merge semantics are where layering quietly fails

Vague merge behavior is the most dangerous part of layering. Define it explicitly:

```
scalar     more specific layer replaces earlier value
object/map deep merge by key
array/list replace by default; keyed merge only if declared
null       not deletion unless tombstone semantics are defined
secret     never merged into normal config
policy     restrictive rule wins unless explicitly delegated
```
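
A minimal sketch of the first four rows, assuming plain Python dicts and a hypothetical `TOMBSTONE` marker for explicit deletion. The secret and policy rows are deliberately not modelled here, and keyed list merge is omitted.

```python
from typing import Any

TOMBSTONE = object()   # hypothetical marker: "delete this key", distinct from null


def merge(base: dict[str, Any], overlay: dict[str, Any]) -> dict[str, Any]:
    """Merge a more specific overlay onto a base without mutating either input."""
    result = dict(base)
    for key, value in overlay.items():
        if value is TOMBSTONE:
            result.pop(key, None)                     # explicit deletion only
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)   # object/map: deep merge by key
        else:
            result[key] = value                       # scalar, list, null: replace
    return result


base = {"db": {"host": "a", "port": 5432}, "tags": ["x"], "ttl": 30, "debug": True}
overlay = {"db": {"host": "b"}, "tags": ["y"], "ttl": None, "debug": TOMBSTONE}
print(merge(base, overlay))
# {'db': {'host': 'b', 'port': 5432}, 'tags': ['y'], 'ttl': None}
```

Note that `ttl` survives with a null value while `debug` disappears: null and deletion stay distinct, which is the behavior the table asks for.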

The schema/validation choice matters here. **JSON Schema** validates structure and
constraints but keeps schema and data separate. **CUE** unifies types and values
in a single lattice where merge (`&`) is commutative, associative, and idempotent
— so the resolved result is *order-independent*, and the same definition both
validates data and reduces boilerplate [2][3]. By contrast Jsonnet's `+` mixin
composition is order-dependent (right-hand side wins on scalar conflicts) [2].
For a control plane whose whole value proposition is a *deterministic, explainable*
effective value, order-independent merge is a meaningful property, not a detail.
Notably, CUE itself now ships **CUE Hub**, explicitly branded "the Configuration
Control Plane" — independent validation that the category name is forming [6].
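
The difference between the two merge styles can be illustrated with a toy model in Python. It implements neither CUE nor Jsonnet: `override` stands in for any right-biased merge, and `unify` is a flat, hypothetical approximation that handles only concrete values, not CUE's types and constraints.

```python
class Conflict(ValueError):
    """Raised when two inputs disagree on a concrete value."""


def override(a: dict, b: dict) -> dict:
    """Right-biased merge: b wins on conflict, so argument order matters."""
    return {**a, **b}


def unify(a: dict, b: dict) -> dict:
    """Toy unification: equal values combine, differing values are an error."""
    out = dict(a)
    for key, value in b.items():
        if key in out and out[key] != value:
            raise Conflict(f"{key}: {out[key]!r} conflicts with {value!r}")
        out[key] = value
    return out


a, b, c = {"replicas": 3}, {"replicas": 5}, {"region": "eu"}
print(override(a, b), override(b, a))   # {'replicas': 5} {'replicas': 3}
print(unify(a, c) == unify(c, a))       # True
```

Under `override`, swapping the inputs silently changes the result. Under `unify`, the inputs either agree in any order or fail loudly, so a disagreement between layers surfaces as an error rather than as an ordering accident.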

### 3.4 Mutability classes prevent the worst failure mode

Every key should declare how it can change: `build-time`, `deploy-time`,
`startup-time`, `hot-reloadable`, `per-request`, `emergency`. The recurring
failure is treating dangerous structural config like a harmless flag — exactly the
CrowdStrike-shaped risk where a "content update" had deploy-grade blast radius [5].
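
One way to make the declaration enforceable is a gate on the live-push path. The sketch below is an assumption-laden illustration: the class names mirror the list above, but the `LIVE` grouping, the `REGISTRY` contents, and `can_push_live` are hypothetical.

```python
from enum import Enum


class Mutability(Enum):
    BUILD_TIME = "build-time"
    DEPLOY_TIME = "deploy-time"
    STARTUP_TIME = "startup-time"
    HOT_RELOADABLE = "hot-reloadable"
    PER_REQUEST = "per-request"
    EMERGENCY = "emergency"


# Assumption: only these classes may change on a running system without a deploy.
LIVE = frozenset({Mutability.HOT_RELOADABLE, Mutability.PER_REQUEST, Mutability.EMERGENCY})

# Hypothetical registry entries: key -> declared mutability class.
REGISTRY = {
    "ui.banner_text": Mutability.HOT_RELOADABLE,
    "parser.rule_schema": Mutability.DEPLOY_TIME,
}


def can_push_live(key: str) -> bool:
    """Allow a live push only for keys that declared a live-changeable class."""
    return REGISTRY.get(key) in LIVE   # undeclared keys are refused


print(can_push_live("ui.banner_text"))      # True
print(can_push_live("parser.rule_schema"))  # False
```

Refusing undeclared keys by default is the conservative choice: a key nobody classified is treated as structural until someone says otherwise.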

---

## 4. The adjacent topics (the converging market)

The control plane is not one product; it is a convergence of tool families.
ConfigAtlas's stance is **integrate and map, don't replace** [internal,
CompetitiveLandscape]. Summary of each adjacency and the research behind it:

### 4.1 Configuration-as-Data (the closest intellectual neighbor)
Brian Grant — creator of the Kubernetes Resource Model (KRM), now CTO of ConfigHub
— argues configuration should be *data*, authoritative and stored like data, with
code that operates on it kept separate [10][11]. ConfigHub stores each variant in
fully-rendered "WET" form (no templates/variables/generators), versioned with
metadata, and — because KRM *is* the API representation — can update config *from*
live state, mitigating drift bidirectionally [10][12]. This is the strongest
direct competitor and the sharpest articulation of "config is graph-shaped
operational data, not files." **ConfigAtlas differentiation:** discovery-first and
cross-tool — map config that already lives in many systems, rather than asking
everyone to move into one store.

### 4.2 GitOps / IaC — desired state and drift
Argo CD and Flux continuously reconcile live cluster state against Git-declared
desired state; any divergence is *drift*, flagged or auto-corrected on a sync loop
[13]. Terraform/OpenTofu apply the same desired-state model to infrastructure, but
detect drift at plan/apply time rather than on a continuous loop. This camp owns
the "desired state" narrative. **ConfigAtlas complements it with the "effective
state" narrative:** GitOps tells you what you *intended* to deploy; ConfigAtlas
tells you which scopes contributed, what actually applies, who owns it, and what's
risky to change [internal].
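
Drift in this sense reduces to comparing two snapshots. The sketch below is not how Argo CD or Flux implement reconciliation; `detect_drift` and the `MISSING` marker are hypothetical, and both snapshots are assumed to be flat key/value maps.

```python
from typing import Any

MISSING = "<missing>"   # hypothetical marker for a key absent on one side


def detect_drift(desired: dict[str, Any], live: dict[str, Any]) -> dict[str, tuple]:
    """Return {key: (desired_value, live_value)} for every key that differs."""
    drift = {}
    for key in sorted(desired.keys() | live.keys()):
        want, have = desired.get(key, MISSING), live.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


desired = {"replicas": 3, "image": "app:1.4"}
live = {"replicas": 5, "image": "app:1.4", "debug": True}
print(detect_drift(desired, live))
# {'debug': ('<missing>', True), 'replicas': (3, 5)}
```

The output says what differs, not which scope caused it or who owns it. That missing context is the gap the effective-state view is meant to fill.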

### 4.3 Feature flags / runtime control — and the AI-era expansion
Feature management (LaunchDarkly, Unleash, Flagsmith, OpenFeature as the
vendor-neutral standard) owns live behavior change and **progressive delivery**:
ring-based rollout (internal → 1–5% canary → 10–25% beta → 100%), deterministic
cohorts for blast-radius containment, and kill switches / circuit breakers that
auto-deactivate on SLO breach [14][15]. The frontier is **AI configuration**:
LaunchDarkly's AI Configs / AgentControl move prompts, model selection, and tool
access out of code into runtime config that propagates in <200ms, with guarded
rollouts that auto-revert when eval metrics (accuracy, toxicity) drop [16][17].
This validates the core ConfigAtlas claim — the *kinds* of configuration keep
multiplying (now: agent behavior), so a map that spans kinds is increasingly
valuable. **ConfigAtlas treats flags as one scope class among many**, not the
whole plane [internal].
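
Deterministic cohorts are commonly built by hashing a stable identifier into a fixed bucket. The following is a generic sketch of that idea, not any vendor's algorithm; `bucket`, `in_rollout`, and the flag and user names are hypothetical.

```python
import hashlib


def bucket(flag: str, unit_id: str) -> int:
    """Map (flag, unit) to a stable bucket in 0..9999 (basis points)."""
    digest = hashlib.sha256(f"{flag}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 10_000


def in_rollout(flag: str, unit_id: str, percent: float) -> bool:
    """True if the unit falls inside the first `percent` of buckets."""
    return bucket(flag, unit_id) < percent * 100


# Widening the ring from 5% to 25% keeps every unit that was already included.
users = [f"user-{i}" for i in range(1000)]
canary = {u for u in users if in_rollout("new-checkout", u, 5)}
beta = {u for u in users if in_rollout("new-checkout", u, 25)}
print(canary <= beta)   # True
```

Because the bucket depends only on the flag name and the unit id, the same user always lands in the same cohort. Salting the hash with the flag name keeps cohorts for different flags uncorrelated.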

### 4.4 Secrets management — adjacent but kept separate
The main tools are Vault, OpenBao, Infisical, and Doppler, plus SOPS and External
Secrets for the GitOps path. Secrets differ in sensitivity, lifecycle, and blast radius and must
never be merged into ordinary config [internal]. **ConfigAtlas stores references
and dependencies, never values** — which config depends on which secret, where
it's injected, what's affected if it rotates.
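
A reference-only model can be sketched as follows. `SecretRef`, `affected_by_rotation`, and the example paths are hypothetical; the point is that the record carries a location and never the secret value itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecretRef:
    store: str   # which secret manager holds it, e.g. "vault"
    path: str    # location inside that store; the value is never copied here


def affected_by_rotation(ref: SecretRef, bindings: dict[str, SecretRef]) -> list[str]:
    """List the config keys that depend on the given secret reference."""
    return sorted(key for key, bound in bindings.items() if bound == ref)


db_password = SecretRef("vault", "kv/payments/db-password")
bindings = {
    "payments.db.password": db_password,
    "reports.db.password": db_password,
    "mail.api_key": SecretRef("vault", "kv/mail/api-key"),
}
print(affected_by_rotation(db_password, bindings))
# ['payments.db.password', 'reports.db.password']
```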

### 4.5 Policy-as-code — the guardrail backend
OPA, Kyverno, and Checkov answer "is this change allowed?" across K8s, CI/CD, IaC, and
more [internal]. They are ideal *validation backends* for a control plane but
don't model provenance, ownership, or effective behavior. **ConfigAtlas is the
context and evidence layer around them** — which policy applies, at which scope,
and why.

### 4.6 CMDB / developer portals / SSPM — the enterprise gravity wells
CMDBs (ServiceNow et al.) model assets and services; developer portals (Backstage,
Port, Cortex, OpsLevel) model ownership; SSPM tools (CoreView, AppOmni) model SaaS
posture drift [internal]. None model the layered behavioral config surface with
effective-value resolution. **ConfigAtlas integrates** — enriching catalogs and
portals rather than displacing them; a Backstage/Port plugin is a plausible
adoption path.

---

## 5. Reference architecture for a configuration control plane

Synthesizing the layering primer with the control-plane framing [1][internal]:

```
Config Canon       vocabulary + schema (what a key means)
Config Registry    every key: owner, type, allowed scopes, lifecycle, mutability, security class
Config Resolver    deterministic layer ordering -> effective value (the "explain" engine)
Config Policy      allowed values + allowed overrides (OPA/Kyverno/CUE backends)
Config Delivery    env vars / ConfigMaps / sidecar / SDK / API lookup
Config Evidence    snapshots, who/what/why/when, drift, rollout, rollback
```

The InfoQ framing adds three forward-looking elements that map directly onto this:
**reconciler-first control planes** (resolution as a continuous loop, à la GitOps),
**configuration knowledge graphs** (the `key → service → deployment → tenant →
feature → policy → secret → owner → incident` graph), and **AI-assisted decision
support** (surfacing blast radius and risk before a human approves a change) [1].
The knowledge-graph element is precisely ConfigAtlas's differentiator.
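
A blast-radius query over such a graph is a reachability search. The sketch below assumes a plain adjacency map whose edges mean "affects"; the node names and the `blast_radius` function are hypothetical, and the canonical edge set is still an open question.

```python
from collections import deque


def blast_radius(graph: dict[str, list[str]], start: str) -> set[str]:
    """Breadth-first search: everything reachable from `start`, excluding itself."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor != start and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


graph = {
    "key:db.pool_size": ["service:payments", "service:reports"],
    "service:payments": ["deployment:payments-prod"],
    "deployment:payments-prod": ["tenant:acme"],
    "service:reports": [],
}
print(sorted(blast_radius(graph, "key:db.pool_size")))
# ['deployment:payments-prod', 'service:payments', 'service:reports', 'tenant:acme']
```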

Guiding rule from the primer: **put config as close as possible to its owner, but
as high as necessary for consistency** — defaults with the product, guardrails
high and central, tenant prefs low, secrets outside, flags in the runtime plane,
infra state in GitOps.

---

## 6. The wedge and the white space

The defensible opening is **read-first configuration intelligence**, not
write-first control [internal, CompetitiveLandscape]. The category name
("Configuration Control Plane") is emerging and not yet owned — InfoQ frames it as
a pattern [1], CUE markets a product under the exact phrase [6], ConfigHub attacks
the same instinct from the data angle [10]. None yet own the **companywide living
configuration surface**: cross-tool discovery, effective-value resolution,
organizational scope/ownership governance, blast-radius/dependency intelligence,
and change evidence.

Sharpest positioning [internal]:

> **ConfigAtlas is not where all configuration must live. It is where
> configuration becomes visible, explainable, governable, and safe to change.**

---

## 7. Open questions to drive the next research pass

1. **Discovery connectors** — what is the minimum viable set of ingestion sources
   (Git, K8s, Terraform state, a feature-flag platform, a secret manager) to
   prove cross-tool effective-config resolution end to end?
2. **Effective-value provenance schema** — can the registry's entry schema carry
   enough to render a full `config explain` (source layer, overrides, validating
   schema, owner) without becoming a second source of truth for values?
3. **Graph model** — what is the canonical edge set for the configuration
   knowledge graph, and does it reuse the State Hub's existing relationship model?
4. **CUE vs JSON Schema** for atlas entry validation — does order-independent
   merge buy enough to justify the toolchain cost over JSON Schema? [2][3]
5. **AI-config as a first-class scope** — given the LaunchDarkly trajectory [16],
   should "agent/model configuration" be a named scope class in the L-stack now?