From 51e95ec21ab1c340f7123f24bcfe5def2cb19ed1 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Fri, 20 Mar 2026 01:54:21 +0100
Subject: [PATCH] =?UTF-8?q?docs(adr):=20ADR-003=20=E2=80=94=20Materialized?=
 =?UTF-8?q?=20Derived=20State=20with=20Fingerprint=20Invalidation?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Formalises the caching pattern introduced in doi_cache: pre-compute and
store repo-sourced derived data, invalidate by fingerprint (composite of
DB timestamps + file mtimes), force-refresh on demand. Names the pattern
against the literature (Materialized View, Derived Data Store, CQRS Read
Model, ETag-style invalidation) and mandates its use for all future
repo-sourced derived data with an implementation checklist.

Co-Authored-By: Claude Sonnet 4.6 (1M context)
---
 .../adr-003-materialized-derived-state.md | 174 ++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 canon/architecture/adr-003-materialized-derived-state.md

diff --git a/canon/architecture/adr-003-materialized-derived-state.md b/canon/architecture/adr-003-materialized-derived-state.md
new file mode 100644
index 0000000..83e3f86
--- /dev/null
+++ b/canon/architecture/adr-003-materialized-derived-state.md
@@ -0,0 +1,174 @@
+---
+id: ADR-003
+type: architecture-decision-record
+title: "Materialized Derived State with Fingerprint Invalidation for Repo-Sourced Data"
+status: accepted
+decided_by: Bernd Worsch
+date: "2026-03-20"
+tags: ["architecture", "state-hub", "caching", "read-model", "materialized-view", "derived-state"]
+---
+
+# ADR-003: Materialized Derived State with Fingerprint Invalidation
+
+## Status
+
+Accepted.
+
+## Context
+
+The Custodian State Hub is a **read model** (CQRS terminology) — its data is
+fully derivable from canonical sources that live in repositories and the
+filesystem. No state-hub data is authoritative; it is always a derived view
+of what the repos contain.
+
+Several categories of data fit this description:
+
+| Data | Canonical source | State-hub table |
+|---|---|---|
+| SBOM dependencies | `uv.lock`, `package-lock.json`, etc. | `sbom_entries` |
+| Third-party service declarations | `tpsc.yaml` | `tpsc_entries` |
+| Provided capabilities | `SCOPE.md` `capability` blocks | `capability_catalog` |
+| DoI compliance tier | 14 criteria across repo files + DB | `doi_cache` |
+| Workplan task status | `workplans/*.md` | `tasks` |
+
+Early implementations either recomputed this data on every request (too slow)
+or ingested it once without invalidation (stale data goes undetected). Neither
+is acceptable for a system designed to give accurate, fast orientation.
+
+The `doi_cache` table, introduced in CUST-WP-0024, demonstrated a pattern that
+solves both problems. This ADR formalises that pattern and mandates its use
+for all repo-sourced derived data.
+
+## Pattern Name
+
+**Materialized Derived State with Fingerprint Invalidation.**
+
+This pattern is known under several names in the literature:
+
+- **Materialized View** (PostgreSQL and other SQL databases) — the stored result
+  of a query or computation, refreshed on demand when source data changes.
+- **Derived Data Store** (Kleppmann, *Designing Data-Intensive Applications*,
+  Ch. 3 & 11) — a system whose entire dataset can be rebuilt from upstream
+  sources; it is never the source of truth.
+- **Read Model / Projection** (CQRS / Event Sourcing) — a pre-computed view
+  maintained alongside a write model, rebuilt when relevant events occur.
+- **Fingerprint-based / Content-addressed invalidation** — analogous to HTTP
+  ETags: a cache entry is valid as long as a composite hash/timestamp of its
+  inputs matches the stored value.
+
+The State Hub already documents itself as a read model. This ADR extends that
+principle to specify *how* the read model stays fresh.
+
+## Decision
+
+### 1. All repo-sourced derived data MUST be materialised in the DB
+
+Data computed from repository files or repo records must be stored in a
+dedicated table rather than recomputed per request. Direct computation on
+every API call is only permissible for development tooling or when explicitly
+forced by the caller.
+
+### 2. Each materialised table MUST carry a `fingerprint` column
+
+The fingerprint is a deterministic string encoding all inputs that affect the
+computed result. On each read it is recomputed and compared with the stored
+value: if unchanged, the stored result is returned without recomputation; if
+changed, the result is recomputed and the stored row is updated.
+
+**Fingerprint composition rules:**
+- Include the `updated_at` timestamp of every DB record that feeds the
+  computation (repo record, related domain, goals, snapshots).
+- Include the `mtime` (filesystem modification time) of every file that feeds
+  the computation (`SCOPE.md`, `CLAUDE.md`, lockfiles, `tpsc.yaml`, etc.).
+- Join all components with `|` as a pipe-separated string — no hashing needed
+  since the string is compared by equality, not transmitted to clients.
+- If a file is absent, encode `filename:absent` rather than omitting it, so
+  file creation also triggers invalidation.
+
+**Reference implementation:** `state-hub/api/doi_engine.py::compute_fingerprint()`
+
+### 3. Every materialised endpoint MUST support `?force_refresh=true`
+
+Callers must always be able to bypass the cache and trigger a fresh
+computation. This is the escape hatch for debugging, post-ingest verification,
+and scheduled background refresh jobs.
+
+### 4. Writes to source data SHOULD update the repo record's `updated_at`
+
+Operations that change source data (SBOM ingest, TPSC ingest, capability
+ingest) should refresh `managed_repos.updated_at` so that the fingerprint
+detects the change on the next read. Where the data lives in a
+related table (e.g. `tpsc_snapshots`), the fingerprint must include that
+table's `max(snapshot_at)` directly rather than relying on the repo record.
+
+### 5. The DB is never the source of truth — the rebuild principle holds
+
+Per ADR-001, the state-hub must be rebuildable from scratch by re-ingesting
+all canonical sources. Materialised tables are **caches**, not records of
+authority. They may be wiped and repopulated at any time without data loss.
+This means:
+- No materialised table may be the only copy of any information.
+- Schema migrations that wipe a materialised table are safe and expected.
+- Background jobs that periodically re-ingest all repos are valid and
+  encouraged.
+
+## Consequences
+
+### Positive
+
+- **Fast reads in steady state** — after the first computation, subsequent
+  reads hit the DB with no filesystem or subprocess overhead.
+- **Accurate on change** — as long as the fingerprint covers every input,
+  stale data is never silently served; the cache refreshes exactly when needed.
+- **Debuggable** — `force_refresh=true` and `checked_at` timestamps make it
+  easy to see when a value was last computed and to trigger a recheck.
+- **Consistent with the read model principle** — the pattern makes explicit
+  what was always implied: state-hub data is derived, not authoritative.
+
+### Negative / Trade-offs
+
+- **First-call latency** — cache misses are expensive (filesystem reads,
+  subprocess calls, HTTP self-calls). This can be mitigated by pre-warming
+  caches at startup or after ingest.
+- **Fingerprint completeness** — if a new input is added to a computation and
+  not added to the fingerprint, stale results will be silently returned. The
+  fingerprint must be kept in sync with the computation.
+- **Filesystem dependency** — file mtimes are volatile (e.g. `git checkout`
+  updates the mtime of every file it rewrites). In practice this means extra
+  cache misses after a checkout, not a correctness problem.
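The fingerprint composition rules in Decision 2 can be sketched roughly as follows. This is a minimal illustration, not the code in `state-hub/api/doi_engine.py`: the signature of `compute_fingerprint`, the helper `file_component`, the label names, and the `none` marker for a missing DB timestamp are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Mapping, Optional


def file_component(label: str, path: Path) -> str:
    """Encode one file input as `label:mtime_ns`, or `label:absent` if missing."""
    try:
        return f"{label}:{path.stat().st_mtime_ns}"
    except FileNotFoundError:
        # Absent files are encoded explicitly so that creating the file later
        # changes the fingerprint.
        return f"{label}:absent"


def compute_fingerprint(
    db_timestamps: Mapping[str, Optional[datetime]],
    files: Mapping[str, Path],
) -> str:
    """Join every input that affects the result into one pipe-separated string.

    Hypothetical signature: `db_timestamps` maps a label such as
    "repo.updated_at" to the timestamp read from the DB, and `files` maps a
    label such as "SCOPE.md" to the file's location on disk.
    """
    parts = []
    # Sort by label so the same inputs always yield the same string.
    for label in sorted(db_timestamps):
        ts = db_timestamps[label]
        parts.append(f"{label}:{ts.isoformat() if ts is not None else 'none'}")
    for label in sorted(files):
        parts.append(file_component(label, files[label]))
    # No hashing: the string is only compared for equality.
    return "|".join(parts)


if __name__ == "__main__":
    repo_root = Path(".")  # assumed checkout location, for illustration only
    print(
        compute_fingerprint(
            {"repo.updated_at": datetime.now(timezone.utc)},
            {"SCOPE.md": repo_root / "SCOPE.md", "tpsc.yaml": repo_root / "tpsc.yaml"},
        )
    )
```

Sorting by label is a choice made for this sketch. Any fixed component order gives a deterministic fingerprint, as long as every input to the computation appears in it.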
+
+## Implementation Checklist
+
+When adding a new category of repo-sourced derived data:
+
+- [ ] Create a `_cache` or `_snapshots` table with `fingerprint` and
+      `checked_at` columns.
+- [ ] Implement `compute_fingerprint(repo, ...)` in the relevant module.
+- [ ] Add `?force_refresh=true` query parameter to the read endpoint.
+- [ ] Ensure the ingest script (or write path) touches `managed_repos.updated_at`
+      or includes a related table's `max(timestamp)` in the fingerprint.
+- [ ] Verify the cache can be wiped and repopulated without data loss.
+- [ ] Document which inputs are included in the fingerprint in a comment
+      alongside `compute_fingerprint`.
+
+## Current Implementations
+
+| Derived data | Table | Fingerprint inputs | Force-refresh |
+|---|---|---|---|
+| DoI compliance tier | `doi_cache` | `repo.updated_at`, `max(tpsc_snapshots.snapshot_at)`, `max(repo_goals.updated_at)`, `mtime(SCOPE.md)`, `mtime(CLAUDE.md)`, `mtime(tpsc.yaml)` | `?force_refresh=true` |
+
+## Planned Applications
+
+| Derived data | Table (proposed) | Notes |
+|---|---|---|
+| SBOM summary stats | `sbom_cache` | Fingerprint: `max(sbom_snapshots.snapshot_at)` |
+| Capability declarations | `capability_cache` | Fingerprint: `mtime(SCOPE.md)`, `repo.updated_at` |
+| Workplan status summary | Already handled by consistency checker | Fingerprint: workplan file mtimes |
+
+## Related
+
+- ADR-001: Workplans and Work Items Are Repository Artefacts
+- ADR-002: Custodian Agent Runtime Design
+- `state-hub/api/doi_engine.py` — reference implementation
+- `state-hub/api/models/doi_cache.py` — reference schema
+- `state-hub/migrations/versions/k8f9a0b1c2d3_doi_cache.py` — reference migration
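The read path implied by Decisions 2, 3 and 5 and by the implementation checklist can be sketched as a small read-through cache. This is a hedged illustration only: `MaterializedCache`, `CacheRow`, `read` and `wipe` are hypothetical names, and an in-memory dict stands in for a real `*_cache` table with `fingerprint` and `checked_at` columns.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable, Dict


@dataclass
class CacheRow:
    """Stand-in for one row of a materialised table (hypothetical shape)."""
    fingerprint: str
    result: Any
    checked_at: datetime


class MaterializedCache:
    """Read-through cache keyed by repo, invalidated by fingerprint mismatch."""

    def __init__(
        self,
        fingerprint_fn: Callable[[str], str],
        compute_fn: Callable[[str], Any],
    ) -> None:
        self._fingerprint_fn = fingerprint_fn
        self._compute_fn = compute_fn
        self._rows: Dict[str, CacheRow] = {}  # in-memory stand-in for the DB table

    def read(self, repo: str, force_refresh: bool = False) -> Any:
        """Return the stored result unless the fingerprint changed or a refresh is forced."""
        current = self._fingerprint_fn(repo)
        row = self._rows.get(repo)
        if row is not None and not force_refresh and row.fingerprint == current:
            return row.result  # steady state: no recomputation
        # Cache miss, changed inputs, or force_refresh: recompute and store.
        result = self._compute_fn(repo)
        self._rows[repo] = CacheRow(current, result, datetime.now(timezone.utc))
        return result

    def wipe(self) -> None:
        """Drop all rows; safe because every row can be rebuilt from its sources."""
        self._rows.clear()


if __name__ == "__main__":
    inputs = {"fingerprint": "repo.updated_at:t1|SCOPE.md:100"}
    cache = MaterializedCache(
        fingerprint_fn=lambda repo: inputs["fingerprint"],
        compute_fn=lambda repo: {"repo": repo, "tier": "computed"},
    )
    cache.read("example-repo")                      # first call computes
    cache.read("example-repo")                      # served from the stored row
    cache.read("example-repo", force_refresh=True)  # bypasses the stored row
```

In a real endpoint, the `force_refresh` argument would be bound to the `?force_refresh=true` query parameter. The `wipe` method reflects the rebuild principle: clearing the table only costs recomputation on the next read.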