docs(adr): ADR-003 — Materialized Derived State with Fingerprint Invalidation

Formalises the caching pattern introduced in doi_cache: pre-compute and store repo-sourced derived data, invalidate by fingerprint (composite of DB timestamps + file mtimes), force-refresh on demand. Names the pattern against the literature (Materialized View, Derived Data Store, CQRS Read Model, ETag-style invalidation) and mandates its use for all future repo-sourced derived data with an implementation checklist.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2026-03-20 01:54:21 +01:00
parent d63e7310d5
commit 51e95ec21a
1 changed files with 174 additions and 0 deletions
--- a/canon/architecture/adr-003-materialized-derived-state.md
+++ b/canon/architecture/adr-003-materialized-derived-state.md
@@ -0,0 +1,174 @@
+---
+id: ADR-003
+type: architecture-decision-record
+title: "Materialized Derived State with Fingerprint Invalidation for Repo-Sourced Data"
+status: accepted
+decided_by: Bernd Worsch
+date: "2026-03-20"
+tags: ["architecture", "state-hub", "caching", "read-model", "materialized-view", "derived-state"]
+---
+
+# ADR-003: Materialized Derived State with Fingerprint Invalidation
+
+## Status
+
+Accepted.
+
+## Context
+
+The Custodian State Hub is a **read model** (CQRS terminology) — its data is
+fully derivable from canonical sources that live in repositories and the
+filesystem. No state-hub data is authoritative; it is always a derived view
+of what the repos contain.
+
+Several categories of data fit this description:
+
+| Data | Canonical source | State-hub table |
+|---|---|---|
+| SBOM dependencies | `uv.lock`, `package-lock.json`, etc. | `sbom_entries` |
+| Third-party service declarations | `tpsc.yaml` | `tpsc_entries` |
+| Provided capabilities | `SCOPE.md` `capability` blocks | `capability_catalog` |
+| DoI compliance tier | 14 criteria across repo files + DB | `doi_cache` |
+| Workplan task status | `workplans/*.md` | `tasks` |
+
+Early implementations either recomputed this data on every request (too slow)
+or ingested it once without invalidation (stale data goes undetected). Neither
+is acceptable for a system designed to give accurate, fast orientation.
+
+The `doi_cache` table, introduced in CUST-WP-0024, demonstrated a pattern that
+solves both problems. This ADR formalises that pattern and mandates its use
+for all repo-sourced derived data.
+
+## Pattern Name
+
+**Materialized Derived State with Fingerprint Invalidation.**
+
+This pattern is known under several names in the literature:
+
+- **Materialized View** (PostgreSQL, Oracle, and other relational databases) —
+  the stored result of a query or computation, refreshed on demand when source
+  data changes.
+- **Derived Data Store** (Kleppmann, *Designing Data-Intensive Applications*,
+  Ch. 3 & 11) — a system whose entire dataset can be rebuilt from upstream
+  sources; it is never the source of truth.
+- **Read Model / Projection** (CQRS / Event Sourcing) — a pre-computed view
+  maintained alongside a write model, rebuilt when relevant events occur.
+- **Fingerprint-based / Content-addressed invalidation** — analogous to HTTP
+  ETags: a cache entry is valid as long as a composite hash/timestamp of its
+  inputs matches the stored value.
+
+The State Hub already documents itself as a read model. This ADR extends that
+principle to specify *how* the read model stays fresh.
+
+## Decision
+
+### 1. All repo-sourced derived data MUST be materialised in the DB
+
+Data computed from repository files or repo records must be stored in a
+dedicated table rather than recomputed per request. Direct computation on
+every API call is only permissible for development tooling or when explicitly
+forced by the caller.
+
+### 2. Each materialised table MUST carry a `fingerprint` column
+
+The fingerprint is a deterministic string encoding all inputs that affect the
+computed result. On each read, the current fingerprint is computed from the
+live inputs and compared with the stored one; if they match, the stored result
+is returned without recomputation. If they differ, the result is recomputed
+and both the stored result and the stored fingerprint are updated.
+
+**Fingerprint composition rules:**
+- Include the `updated_at` timestamp of every DB record that feeds the
+  computation (repo record, related domain, goals, snapshots).
+- Include the `mtime` (filesystem modification time) of every file that feeds
+  the computation (`SCOPE.md`, `CLAUDE.md`, lockfiles, `tpsc.yaml`, etc.).
+- Join all components with `|` into a single pipe-separated string. No hashing
+  is needed, since the string is only compared for equality and is never
+  transmitted to clients.
+- If a file is absent, encode `filename:absent` rather than omitting it, so
+  file creation also triggers invalidation.
+
+**Reference implementation:** `state-hub/api/doi_engine.py::compute_fingerprint()`
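
The composition rules above can be sketched as follows. This is a minimal illustration and not the code in `doi_engine.py`: the function name `sketch_fingerprint`, its parameters, and the `label:value` component format are assumptions made for the example.

```python
from datetime import datetime
from pathlib import Path


def sketch_fingerprint(db_timestamps, repo_root, filenames):
    """Build a pipe-separated fingerprint from DB timestamps and file mtimes.

    db_timestamps: mapping of label -> datetime (labels are hypothetical,
    e.g. "repo.updated_at"); filenames: paths relative to repo_root.
    """
    parts = []
    # Sort so the result does not depend on dict or list ordering.
    for label in sorted(db_timestamps):
        parts.append(f"{label}:{db_timestamps[label].isoformat()}")
    for name in sorted(filenames):
        path = Path(repo_root) / name
        if path.exists():
            # Integer nanoseconds avoid float rounding in the string form.
            parts.append(f"{name}:{path.stat().st_mtime_ns}")
        else:
            # Encode absence explicitly so creating the file later
            # changes the fingerprint.
            parts.append(f"{name}:absent")
    return "|".join(parts)
```

Sorting the components is a design choice of this sketch: it keeps the string stable when callers pass the same inputs in a different order.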
+
+### 3. Every materialised endpoint MUST support `?force_refresh=true`
+
+Callers must always be able to bypass the cache and trigger a fresh
+computation. This is the escape hatch for debugging, post-ingest verification,
+and scheduled background refresh jobs.
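
The read path implied by decisions 2 and 3 might look like the sketch below. A plain dict stands in for the materialised table, and `read_through`, `compute`, and the row keys are hypothetical names chosen for illustration; a real endpoint would map `?force_refresh=true` onto the `force_refresh` argument.

```python
from datetime import datetime, timezone


def read_through(cache, key, current_fingerprint, compute, force_refresh=False):
    """Return the cached result for key, recomputing when the fingerprint differs.

    cache is a dict standing in for the materialised table (an assumption of
    this sketch); compute is a zero-argument callable doing the expensive work.
    """
    row = cache.get(key)
    if (
        row is not None
        and not force_refresh
        and row["fingerprint"] == current_fingerprint
    ):
        # Fingerprint unchanged: serve the stored value without recomputing.
        return row["result"]
    result = compute()
    cache[key] = {
        "fingerprint": current_fingerprint,
        "result": result,
        "checked_at": datetime.now(timezone.utc),
    }
    return result
```

Note that a forced refresh still stores the new fingerprint and `checked_at`, so later unforced reads benefit from it.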
+
+### 4. Writes to source data SHOULD update the repo record's `updated_at`
+
+Operations that change source data (SBOM ingest, TPSC ingest, capability
+ingest) should refresh `managed_repos.updated_at` so that the fingerprint
+detects the change on the next read. Where the data lives in a related table
+(e.g. `tpsc_snapshots`), the fingerprint must instead include that table's
+`max(snapshot_at)` directly rather than relying on the repo record.
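
As an illustration of the related-table case, the sketch below derives a fingerprint component from `max(snapshot_at)`. It uses an in-memory SQLite database; the `repo_id` column, the two-column schema, and the helper name `latest_snapshot_component` are assumptions, not the actual state-hub schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; only `snapshot_at` is named in this ADR.
conn.execute("CREATE TABLE tpsc_snapshots (repo_id INTEGER, snapshot_at TEXT)")
conn.executemany(
    "INSERT INTO tpsc_snapshots VALUES (?, ?)",
    [
        (1, "2026-03-18T09:00:00"),
        (1, "2026-03-19T12:30:00"),
        (2, "2026-03-01T00:00:00"),
    ],
)


def latest_snapshot_component(conn, repo_id):
    """Return a fingerprint component for the newest snapshot of one repo."""
    (latest,) = conn.execute(
        "SELECT max(snapshot_at) FROM tpsc_snapshots WHERE repo_id = ?",
        (repo_id,),
    ).fetchone()
    # max() over zero rows yields NULL; encode that explicitly, in the same
    # spirit as `filename:absent` for missing files.
    value = latest if latest is not None else "none"
    return f"tpsc_snapshots.max_snapshot_at:{value}"
```

Because the component is read from the snapshot table itself, a new snapshot changes the fingerprint even if the ingest never touched the repo record.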
+
+### 5. The DB is never the source of truth — the rebuild principle holds
+
+Per ADR-001, the state-hub must be rebuildable from scratch by re-ingesting
+all canonical sources. Materialised tables are **caches**, not records of
+authority. They may be wiped and repopulated at any time without data loss.
+This means:
+- No materialised table may be the only copy of any information.
+- Schema migrations that wipe a materialised table are safe and expected.
+- Background jobs that periodically re-ingest all repos are valid and
+  encouraged.
+
+## Consequences
+
+### Positive
+
+- **Fast reads in steady state** — after the first computation, subsequent
+  reads hit the DB with no filesystem or subprocess overhead.
+- **Accurate on change** — as long as the fingerprint covers every input, a
+  change to any source invalidates the cached value on the next read, so stale
+  data is not silently served.
+- **Debuggable** — `force_refresh=true` and `checked_at` timestamps make it
+  easy to see when a value was last computed and to trigger a recheck.
+- **Consistent with the read model principle** — the pattern makes explicit
+  what was always implied: state-hub data is derived, not authoritative.
+
+### Negative / Trade-offs
+
+- **First-call latency** — cache misses are expensive (filesystem reads,
+  subprocess calls, HTTP self-calls). Mitigated by pre-warming caches at
+  startup or after ingest.
+- **Fingerprint completeness** — if a new input is added to a computation and
+  not added to the fingerprint, stale results will be silently returned. The
+  fingerprint must be kept in sync with the computation.
+- **Filesystem dependency** — file mtimes are volatile (e.g. `git checkout`
+  resets the mtime of every file it rewrites, and a fresh clone resets all of
+  them). In practice this causes spurious cache misses after a checkout or
+  clone, not a correctness problem.
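
The mtime trade-off can be reproduced in a few lines. The sketch below rewrites a file's mtime without changing its content, which is roughly what a checkout does to a file it rewrites; `mtime_component` is a hypothetical helper and the fixed timestamp is arbitrary.

```python
import os
import tempfile
from pathlib import Path


def mtime_component(path):
    """Fingerprint component for one file, based only on its mtime."""
    return f"{path.name}:{path.stat().st_mtime_ns}"


with tempfile.TemporaryDirectory() as tmp:
    scope = Path(tmp) / "SCOPE.md"
    scope.write_text("unchanged content")
    before = mtime_component(scope)
    # Simulate a rewrite with identical bytes: only the timestamps change.
    fixed_ns = 1_700_000_000_000_000_000
    os.utime(scope, ns=(fixed_ns, fixed_ns))
    after = mtime_component(scope)
    content_same = scope.read_text() == "unchanged content"

# The component differs although the content is identical, so the next read
# recomputes a result that will turn out to be the same.
fingerprint_changed = before != after
```

The cost is one unnecessary recomputation per affected file set; the cached result is never wrong because of it.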
+
+## Implementation Checklist
+
+When adding a new category of repo-sourced derived data:
+
+- [ ] Create a table named with a `_cache` or `_snapshots` suffix that has
+      `fingerprint` and `checked_at` columns.
+- [ ] Implement `compute_fingerprint(repo, ...)` in the relevant module.
+- [ ] Add `?force_refresh=true` query parameter to the read endpoint.
+- [ ] Ensure the ingest script (or write path) touches `managed_repos.updated_at`
+      or includes a related table's `max(timestamp)` in the fingerprint.
+- [ ] Verify the cache can be wiped and repopulated without data loss.
+- [ ] Document which inputs are included in the fingerprint in a comment
+      alongside `compute_fingerprint`.
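
A table satisfying the first and fifth checklist items could be shaped like the sketch below. It uses in-memory SQLite for brevity; the table name `example_cache`, the `result_json` column, and the helpers `store` and `wipe` are hypothetical and do not mirror `doi_cache`.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
# Hypothetical table following the checklist: a `_cache` suffix plus
# `fingerprint` and `checked_at` columns next to the derived payload.
conn.execute(
    """
    CREATE TABLE example_cache (
        repo_id     INTEGER PRIMARY KEY,
        result_json TEXT NOT NULL,
        fingerprint TEXT NOT NULL,
        checked_at  TEXT NOT NULL
    )
    """
)


def store(conn, repo_id, result_json, fingerprint):
    """Insert or overwrite the cached row for one repo."""
    checked_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT OR REPLACE INTO example_cache VALUES (?, ?, ?, ?)",
        (repo_id, result_json, fingerprint, checked_at),
    )


def wipe(conn):
    """Delete all cached rows; safe because every row can be recomputed."""
    conn.execute("DELETE FROM example_cache")
```

Wiping and then calling `store` again for each repo is the whole recovery procedure, which is what makes the table a cache rather than a record of authority.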
+
+## Current Implementations
+
+| Derived data | Table | Fingerprint inputs | Force-refresh |
+|---|---|---|---|
+| DoI compliance tier | `doi_cache` | `repo.updated_at`, `max(tpsc_snapshots.snapshot_at)`, `max(repo_goals.updated_at)`, `mtime(SCOPE.md)`, `mtime(CLAUDE.md)`, `mtime(tpsc.yaml)` | `?force_refresh=true` |
+
+## Planned Applications
+
+| Derived data | Table (proposed) | Notes |
+|---|---|---|
+| SBOM summary stats | `sbom_cache` | Fingerprint: `max(sbom_snapshots.snapshot_at)` |
+| Capability declarations | `capability_cache` | Fingerprint: `mtime(SCOPE.md)`, `repo.updated_at` |
+| Workplan status summary | Already handled by consistency checker | Fingerprint: workplan file mtimes |
+
+## Related
+
+- ADR-001: Workplans and Work Items Are Repository Artefacts
+- ADR-002: Custodian Agent Runtime Design
+- `state-hub/api/doi_engine.py` — reference implementation
+- `state-hub/api/models/doi_cache.py` — reference schema
+- `state-hub/migrations/versions/k8f9a0b1c2d3_doi_cache.py` — reference migration