---
id: ADR-003
type: architecture-decision-record
title: "Materialized Derived State with Fingerprint Invalidation for Repo-Sourced Data"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-20"
tags: ["architecture", "state-hub", "caching", "read-model", "materialized-view", "derived-state"]
---

# ADR-003: Materialized Derived State with Fingerprint Invalidation

## Status

Accepted.

## Context

The Custodian State Hub is a **read model** (CQRS terminology): its data is fully derivable from canonical sources that live in repositories and the filesystem. No state-hub data is authoritative; it is always a derived view of what the repos contain.

Several categories of data fit this description:

| Data | Canonical source | State-hub table |
|---|---|---|
| SBOM dependencies | `uv.lock`, `package-lock.json`, etc. | `sbom_entries` |
| Third-party service declarations | `tpsc.yaml` | `tpsc_entries` |
| Provided capabilities | `SCOPE.md` `capability` blocks | `capability_catalog` |
| DoI compliance tier | 14 criteria across repo files + DB | `doi_cache` |
| Workplan task status | `workplans/*.md` | `tasks` |

Early implementations either recomputed this data on every request (too slow) or ingested it once without invalidation (so stale data went undetected). Neither is acceptable for a system designed to give accurate, fast orientation.

The `doi_cache` table, introduced in CUST-WP-0024, demonstrated a pattern that solves both problems. This ADR formalises that pattern and mandates its use for all repo-sourced derived data.

## Pattern Name

**Materialized Derived State with Fingerprint Invalidation.**

This pattern is known under several names in the literature:

- **Materialized View** (SQL databases such as PostgreSQL): the stored result of a query or computation, refreshed on demand when source data changes.
- **Derived Data Store** (Kleppmann, *Designing Data-Intensive Applications*, Ch.
  3 & 11): a system whose entire dataset can be rebuilt from upstream sources; it is never the source of truth.
- **Read Model / Projection** (CQRS / Event Sourcing): a pre-computed view maintained alongside a write model, rebuilt when relevant events occur.
- **Fingerprint-based / Content-addressed invalidation**: analogous to HTTP ETags, where a cache entry is valid as long as a composite hash/timestamp of its inputs matches the stored value.

The State Hub already documents itself as a read model. This ADR extends that principle to specify *how* the read model stays fresh.

## Decision

### 1. All repo-sourced derived data MUST be materialised in the DB

Data computed from repository files or repo records must be stored in a dedicated table rather than recomputed per request. Direct computation on every API call is only permissible for development tooling or when explicitly forced by the caller.

### 2. Each materialised table MUST carry a `fingerprint` column

The fingerprint is a deterministic string encoding all inputs that affect the computed result. On each read it is recomputed from the current inputs and compared with the stored value. If the two match, the stored result is returned without recomputation; if they differ, the result is recomputed and stored together with the new fingerprint.

**Fingerprint composition rules:**

- Include the `updated_at` timestamp of every DB record that feeds the computation (repo record, related domain, goals, snapshots).
- Include the `mtime` (filesystem modification time) of every file that feeds the computation (`SCOPE.md`, `CLAUDE.md`, lockfiles, `tpsc.yaml`, etc.).
- Join all components with `|` into a single pipe-separated string. No hashing is needed, since the string is only compared for equality and is never transmitted to clients.
- If a file is absent, encode `filename:absent` rather than omitting it, so file creation also triggers invalidation.

**Reference implementation:** `state-hub/api/doi_engine.py::compute_fingerprint()`

### 3. Every materialised endpoint MUST support `?force_refresh=true`

Callers must always be able to bypass the cache and trigger a fresh computation. This is the escape hatch for debugging, post-ingest verification, and scheduled background refresh jobs.

### 4. Writes to source data SHOULD update the repo record's `updated_at`

Operations that change source data (SBOM ingest, TPSC ingest, capability ingest) must make the change visible to the fingerprint on the next read, normally by refreshing `managed_repos.updated_at`. Where data lives in a related table (e.g. `tpsc_snapshots`), the fingerprint must include that table's `max(snapshot_at)` directly rather than relying on the repo record.

### 5. The DB is never the source of truth: the rebuild principle holds

Per ADR-001, the state-hub must be rebuildable from scratch by re-ingesting all canonical sources. Materialised tables are **caches**, not records of authority. They may be wiped and repopulated at any time without data loss.

This means:

- No materialised table may be the only copy of any information.
- Schema migrations that wipe a materialised table are safe and expected.
- Background jobs that periodically re-ingest all repos are valid and encouraged.

## Consequences

### Positive

- **Fast reads in steady state**: after the first computation, subsequent reads hit the DB with no filesystem or subprocess overhead.
- **Accurate on change**: as long as the fingerprint covers every input, stale data is not silently served; the cache refreshes exactly when an input changes.
- **Debuggable**: `force_refresh=true` and `checked_at` timestamps make it easy to see when a value was last computed and to trigger a recheck.
- **Consistent with the read model principle**: the pattern makes explicit what was always implied, namely that state-hub data is derived, not authoritative.

### Negative / Trade-offs

- **First-call latency**: cache misses are expensive (filesystem reads, subprocess calls, HTTP self-calls).
  Mitigated by pre-warming caches at startup or after ingest.
- **Fingerprint completeness**: if a new input is added to a computation and not added to the fingerprint, stale results will be silently returned. The fingerprint must be kept in sync with the computation.
- **Filesystem dependency**: file mtimes are volatile (e.g. `git checkout` rewrites mtimes). In practice this means a cache miss after every checkout, not a correctness problem.

## Implementation Checklist

When adding a new category of repo-sourced derived data:

- [ ] Create a `_cache` or `_snapshots` table with `fingerprint` and `checked_at` columns.
- [ ] Implement `compute_fingerprint(repo, ...)` in the relevant module.
- [ ] Add `?force_refresh=true` query parameter to the read endpoint.
- [ ] Ensure the ingest script (or write path) touches `managed_repos.updated_at` or includes a related table's `max(timestamp)` in the fingerprint.
- [ ] Verify the cache can be wiped and repopulated without data loss.
- [ ] Document which inputs are included in the fingerprint in a comment alongside `compute_fingerprint`.
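
The fingerprint composition rules and the `force_refresh` read path from the Decision section can be sketched as follows. This is a minimal illustration, not the code in `state-hub/api/doi_engine.py`: the function signatures, the `CacheRow` class, the `read_cached` helper, and the in-memory dict standing in for a `_cache` table are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable


def compute_fingerprint(
    db_timestamps: dict[str, datetime | None],
    repo_root: Path,
    filenames: Iterable[str],
) -> str:
    """Encode every input as `name:value` and join with `|`.

    Hypothetical signature: `db_timestamps` maps a label such as
    "repo.updated_at" to the timestamp read from the DB.
    """
    parts: list[str] = []
    # Sort so the string does not depend on the order inputs were passed in.
    for name, ts in sorted(db_timestamps.items()):
        parts.append(f"{name}:{ts.isoformat() if ts else 'none'}")
    for filename in sorted(filenames):
        path = repo_root / filename
        if path.is_file():
            parts.append(f"{filename}:{path.stat().st_mtime_ns}")
        else:
            # Encode absence explicitly so file creation changes the fingerprint.
            parts.append(f"{filename}:absent")
    return "|".join(parts)


@dataclass
class CacheRow:
    """Stand-in for one row of a materialised table (assumed shape)."""
    fingerprint: str
    result: dict
    checked_at: datetime


def read_cached(
    cache: dict[str, CacheRow],
    key: str,
    fingerprint: str,
    compute: Callable[[], dict],
    force_refresh: bool = False,
) -> dict:
    """Return the stored result unless the fingerprint changed or a refresh is forced."""
    row = cache.get(key)
    if row is not None and not force_refresh and row.fingerprint == fingerprint:
        return row.result
    # Cache miss, changed inputs, or forced refresh: recompute and store.
    result = compute()
    cache[key] = CacheRow(fingerprint, result, datetime.now(timezone.utc))
    return result
```

A read endpoint would call `compute_fingerprint` first, then pass the resulting string and the value of the `force_refresh` query parameter to `read_cached`. Plain string equality is all the comparison needs, which is why the rules above skip hashing.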

## Current Implementations

| Derived data | Table | Fingerprint inputs | Force-refresh |
|---|---|---|---|
| DoI compliance tier | `doi_cache` | `repo.updated_at`, `max(tpsc_snapshots.snapshot_at)`, `max(repo_goals.updated_at)`, `mtime(SCOPE.md)`, `mtime(CLAUDE.md)`, `mtime(tpsc.yaml)` | `?force_refresh=true` |

## Planned Applications

| Derived data | Table (proposed) | Notes |
|---|---|---|
| SBOM summary stats | `sbom_cache` | Fingerprint: `max(sbom_snapshots.snapshot_at)` |
| Capability declarations | `capability_cache` | Fingerprint: `mtime(SCOPE.md)`, `repo.updated_at` |
| Workplan status summary | Already handled by consistency checker | Fingerprint: workplan file mtimes |

## Related

- ADR-001: Workplans and Work Items Are Repository Artefacts
- ADR-002: Custodian Agent Runtime Design
- `state-hub/api/doi_engine.py`: reference implementation
- `state-hub/api/models/doi_cache.py`: reference schema
- `state-hub/migrations/versions/k8f9a0b1c2d3_doi_cache.py`: reference migration