docs(adr): ADR-003 — Materialized Derived State with Fingerprint Invalidation

Formalises the caching pattern introduced in doi_cache: pre-compute and store repo-sourced derived data, invalidate by fingerprint (composite of DB timestamps + file mtimes), force-refresh on demand.

Names the pattern against the literature (Materialized View, Derived Data Store, CQRS Read Model, ETag-style invalidation) and mandates its use for all future repo-sourced derived data with an implementation checklist.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
canon/architecture/adr-003-materialized-derived-state.md
@@ -0,0 +1,174 @@
---
id: ADR-003
type: architecture-decision-record
title: "Materialized Derived State with Fingerprint Invalidation for Repo-Sourced Data"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-20"
tags: ["architecture", "state-hub", "caching", "read-model", "materialized-view", "derived-state"]
---

# ADR-003: Materialized Derived State with Fingerprint Invalidation

## Status

Accepted.

## Context

The Custodian State Hub is a **read model** (CQRS terminology): its data is
fully derivable from canonical sources that live in repositories and the
filesystem. No state-hub data is authoritative; it is always a derived view
of what the repos contain.

Several categories of data fit this description:

| Data | Canonical source | State-hub table |
|---|---|---|
| SBOM dependencies | `uv.lock`, `package-lock.json`, etc. | `sbom_entries` |
| Third-party service declarations | `tpsc.yaml` | `tpsc_entries` |
| Provided capabilities | `SCOPE.md` `capability` blocks | `capability_catalog` |
| DoI compliance tier | 14 criteria across repo files + DB | `doi_cache` |
| Workplan task status | `workplans/*.md` | `tasks` |

Early implementations either recomputed this data on every request (too slow)
or ingested it once without invalidation (so stale data went undetected). Neither
is acceptable for a system designed to give accurate, fast orientation.

The `doi_cache` table, introduced in CUST-WP-0024, demonstrated a pattern that
solves both problems. This ADR formalises that pattern and mandates its use
for all repo-sourced derived data.

## Pattern Name

**Materialized Derived State with Fingerprint Invalidation.**

This pattern is known under several names in the literature:

- **Materialized View** (PostgreSQL, Oracle, and other relational databases):
  the stored result of a query or computation, refreshed on demand when
  source data changes.
- **Derived Data Store** (Kleppmann, *Designing Data-Intensive Applications*,
  Ch. 3 & 11): a system whose entire dataset can be rebuilt from upstream
  sources; it is never the source of truth.
- **Read Model / Projection** (CQRS / Event Sourcing): a pre-computed view
  maintained alongside a write model, rebuilt when relevant events occur.
- **Fingerprint-based / Content-addressed invalidation**: analogous to HTTP
  ETags. A cache entry is valid as long as a composite hash or timestamp of
  its inputs matches the stored value.

The State Hub already documents itself as a read model. This ADR extends that
principle to specify *how* the read model stays fresh.

## Decision

### 1. All repo-sourced derived data MUST be materialised in the DB

Data computed from repository files or repo records must be stored in a
dedicated table rather than recomputed per request. Direct computation on
every API call is only permissible for development tooling or when explicitly
forced by the caller.

### 2. Each materialised table MUST carry a `fingerprint` column

The fingerprint is a deterministic string encoding all inputs that affect the
computed result. On each read, the current fingerprint is recomputed and
compared with the stored one. If they match, the stored result is returned
without recomputation. If they differ, the result is recomputed and both the
stored result and the stored fingerprint are updated.

**Fingerprint composition rules:**

- Include the `updated_at` timestamp of every DB record that feeds the
  computation (repo record, related domain, goals, snapshots).
- Include the `mtime` (filesystem modification time) of every file that feeds
  the computation (`SCOPE.md`, `CLAUDE.md`, lockfiles, `tpsc.yaml`, etc.).
- Join all components with `|` into a single pipe-separated string. No hashing
  is needed, since the string is only compared for equality and never
  transmitted to clients.
- If a file is absent, encode `filename:absent` rather than omitting it, so
  that file creation also triggers invalidation.

**Reference implementation:** `state-hub/api/doi_engine.py::compute_fingerprint()`
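
The composition rules above can be sketched as follows. This is a minimal illustration, not the code in `doi_engine.py`: the signature, the `label:value` component layout, and the use of nanosecond mtimes are assumptions made for the example.

```python
from datetime import datetime
from pathlib import Path
from typing import List, Mapping, Optional, Sequence


def compute_fingerprint(
    db_timestamps: Mapping[str, Optional[datetime]],
    repo_root: Path,
    input_files: Sequence[str],
) -> str:
    """Build a pipe-separated fingerprint from DB timestamps and file mtimes.

    Hypothetical signature: the caller passes the already-fetched `updated_at`
    values keyed by a label, plus the relative paths of all input files.
    """
    parts: List[str] = []

    # DB inputs: one component per record, in sorted order so the result
    # does not depend on dict insertion order.
    for label in sorted(db_timestamps):
        ts = db_timestamps[label]
        parts.append(f"{label}:{ts.isoformat() if ts is not None else 'absent'}")

    # File inputs: mtime if the file exists, an explicit marker if it does not,
    # so that creating the file later changes the fingerprint.
    for name in sorted(input_files):
        try:
            mtime_ns = (repo_root / name).stat().st_mtime_ns
            parts.append(f"{name}:{mtime_ns}")
        except FileNotFoundError:
            parts.append(f"{name}:absent")

    # Plain string join; the value is only compared for equality.
    return "|".join(parts)
```

Sorting the components is a design choice of this sketch: it keeps the string stable when the same inputs are supplied in a different order.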

### 3. Every materialised endpoint MUST support `?force_refresh=true`

Callers must always be able to bypass the cache and trigger a fresh
computation. This is the escape hatch for debugging, post-ingest verification,
and scheduled background refresh jobs.
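
A read path that honours both the fingerprint comparison and the force-refresh escape hatch might look like the sketch below. The in-memory dict stands in for a materialised table, and `CacheRow`, `read_materialized`, and the `compute` callback are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


@dataclass
class CacheRow:
    fingerprint: str
    result: dict
    checked_at: datetime


def read_materialized(
    cache: Dict[str, CacheRow],
    key: str,
    current_fingerprint: str,
    compute: Callable[[], dict],
    force_refresh: bool = False,
) -> dict:
    """Return the stored result if its fingerprint still matches, else recompute."""
    row: Optional[CacheRow] = cache.get(key)

    # Cache hit: a row exists, its fingerprint matches, and no refresh was forced.
    if not force_refresh and row is not None and row.fingerprint == current_fingerprint:
        return row.result

    # Cache miss, stale fingerprint, or forced refresh: recompute and store.
    result = compute()
    cache[key] = CacheRow(current_fingerprint, result, datetime.now(timezone.utc))
    return result
```

An endpoint would map its `?force_refresh=true` query parameter onto the `force_refresh` argument and pass the freshly computed fingerprint as `current_fingerprint`.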

### 4. Writes to source data SHOULD update the repo record's `updated_at`

Operations that change source data (SBOM ingest, TPSC ingest, capability
ingest) should refresh `managed_repos.updated_at` so that the
fingerprint detects the change on the next read. Where data lives in a
related table (e.g. `tpsc_snapshots`), the fingerprint must instead include
that table's `max(snapshot_at)` directly rather than relying on the repo
record.
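
To illustrate why a related table's `max(snapshot_at)` belongs in the fingerprint, the sketch below collects the DB-side components with SQLite. The table and timestamp column names come from this ADR; the `id` and `repo_id` columns, the text timestamps, and the function name are assumptions for the example.

```python
import sqlite3
from typing import List


def db_fingerprint_parts(conn: sqlite3.Connection, repo_id: int) -> List[str]:
    """Collect DB-side fingerprint components for one repo (assumed to exist)."""
    (updated_at,) = conn.execute(
        "SELECT updated_at FROM managed_repos WHERE id = ?", (repo_id,)
    ).fetchone()

    # MAX() over zero rows yields a single NULL, which is encoded as 'absent'.
    (latest_snapshot,) = conn.execute(
        "SELECT MAX(snapshot_at) FROM tpsc_snapshots WHERE repo_id = ?", (repo_id,)
    ).fetchone()

    return [
        f"repo.updated_at:{updated_at}",
        f"tpsc_snapshots.max_snapshot_at:{latest_snapshot or 'absent'}",
    ]
```

With this shape, inserting a new snapshot changes the second component even when the ingest path does not touch `managed_repos.updated_at`.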

### 5. The DB is never the source of truth: the rebuild principle holds

Per ADR-001, the state-hub must be rebuildable from scratch by re-ingesting
all canonical sources. Materialised tables are **caches**, not records of
authority. They may be wiped and repopulated at any time without data loss.
This means:

- No materialised table may be the only copy of any information.
- Schema migrations that wipe a materialised table are safe and expected.
- Background jobs that periodically re-ingest all repos are valid and
  encouraged.

## Consequences

### Positive

- **Fast reads in steady state**: after the first computation, subsequent
  reads cost only the fingerprint check and a DB lookup, with no file parsing
  or subprocess overhead.
- **Accurate on change**: as long as the fingerprint covers every input,
  stale data is never silently served; the cache refreshes exactly when an
  input changes.
- **Debuggable**: `force_refresh=true` and `checked_at` timestamps make it
  easy to see when a value was last computed and to trigger a recheck.
- **Consistent with the read model principle**: the pattern makes explicit
  what was always implied, namely that state-hub data is derived, not
  authoritative.

### Negative / Trade-offs

- **First-call latency**: cache misses are expensive (filesystem reads,
  subprocess calls, HTTP self-calls). Mitigated by pre-warming caches at
  startup or after ingest.
- **Fingerprint completeness**: if a new input is added to a computation but
  not to the fingerprint, stale results will be silently returned. The
  fingerprint must be kept in sync with the computation.
- **Filesystem dependency**: file mtimes are volatile (e.g. `git checkout`
  updates the mtime of every file it rewrites). In practice this means a
  cache miss after every checkout, not a correctness problem.
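
The filesystem trade-off can be reproduced in isolation. The sketch below bumps a file's mtime without touching its content, which is roughly what a checkout does to a file it rewrites; `mtime_component` and `bump_mtime` are hypothetical helpers, and the ten-second offset is an arbitrary choice to stay clear of coarse filesystem timestamp granularity.

```python
import os
from pathlib import Path


def mtime_component(path: Path) -> str:
    """One fingerprint component: file name plus mtime in ns, or 'absent'."""
    try:
        return f"{path.name}:{path.stat().st_mtime_ns}"
    except FileNotFoundError:
        return f"{path.name}:absent"


def bump_mtime(path: Path, seconds: int = 10) -> None:
    """Move the mtime forward without changing the file content."""
    st = path.stat()
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + seconds * 1_000_000_000))
```

After `bump_mtime`, the component differs although the bytes on disk are identical, so the next read recomputes a result that comes out the same as before: wasted work, but never a wrong answer.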

## Implementation Checklist

When adding a new category of repo-sourced derived data:

- [ ] Create a table named with a `_cache` or `_snapshots` suffix, with
  `fingerprint` and `checked_at` columns.
- [ ] Implement `compute_fingerprint(repo, ...)` in the relevant module.
- [ ] Add a `?force_refresh=true` query parameter to the read endpoint.
- [ ] Ensure the ingest script (or write path) touches `managed_repos.updated_at`
  or includes a related table's `max(timestamp)` in the fingerprint.
- [ ] Verify the cache can be wiped and repopulated without data loss.
- [ ] Document which inputs are included in the fingerprint in a comment
  alongside `compute_fingerprint`.
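
As a rough companion to the checklist, the sketch below shows a cache table with the two required columns and a wipe-and-repopulate routine that can back the "no data loss" check. It uses SQLite for self-containment; `example_cache`, `result_json`, `rebuild_cache`, and the `compute` callback are hypothetical and do not reflect the actual `doi_cache` schema.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# Hypothetical schema: only `fingerprint` and `checked_at` are required by the ADR.
SCHEMA = """
CREATE TABLE IF NOT EXISTS example_cache (
    repo_id     INTEGER PRIMARY KEY,
    fingerprint TEXT NOT NULL,
    result_json TEXT NOT NULL,
    checked_at  TEXT NOT NULL
);
"""


def rebuild_cache(
    conn: sqlite3.Connection,
    repo_ids: Iterable[int],
    compute: Callable[[int], Tuple[str, str]],
) -> None:
    """Wipe the cache table and repopulate it from canonical sources.

    `compute` returns (fingerprint, result_json) for one repo.
    """
    conn.execute("DELETE FROM example_cache")
    for repo_id in repo_ids:
        fingerprint, result_json = compute(repo_id)
        conn.execute(
            "INSERT INTO example_cache (repo_id, fingerprint, result_json, checked_at) "
            "VALUES (?, ?, ?, datetime('now'))",
            (repo_id, fingerprint, result_json),
        )
    conn.commit()
```

Running `rebuild_cache` twice against unchanged sources and comparing everything except `checked_at` is one way to verify that the table holds nothing that cannot be recomputed.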

## Current Implementations

| Derived data | Table | Fingerprint inputs | Force-refresh |
|---|---|---|---|
| DoI compliance tier | `doi_cache` | `repo.updated_at`, `max(tpsc_snapshots.snapshot_at)`, `max(repo_goals.updated_at)`, `mtime(SCOPE.md)`, `mtime(CLAUDE.md)`, `mtime(tpsc.yaml)` | `?force_refresh=true` |

## Planned Applications

| Derived data | Table (proposed) | Notes |
|---|---|---|
| SBOM summary stats | `sbom_cache` | Fingerprint: `max(sbom_snapshots.snapshot_at)` |
| Capability declarations | `capability_cache` | Fingerprint: `mtime(SCOPE.md)`, `repo.updated_at` |
| Workplan status summary | None (already handled by the consistency checker) | Fingerprint: workplan file mtimes |

## Related

- ADR-001: Workplans and Work Items Are Repository Artefacts
- ADR-002: Custodian Agent Runtime Design
- `state-hub/api/doi_engine.py`: reference implementation
- `state-hub/api/models/doi_cache.py`: reference schema
- `state-hub/migrations/versions/k8f9a0b1c2d3_doi_cache.py`: reference migration