
tegwick 51e95ec21a docs(adr): ADR-003 — Materialized Derived State with Fingerprint Invalidation

Formalises the caching pattern introduced in doi_cache: pre-compute and store
repo-sourced derived data, invalidate by fingerprint (composite of DB timestamps
+ file mtimes), force-refresh on demand.

Names the pattern against the literature (Materialized View, Derived Data Store,
CQRS Read Model, ETag-style invalidation) and mandates its use for all future
repo-sourced derived data with an implementation checklist.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

2026-03-20 01:54:21 +01:00

7.8 KiB


ADR-003: Materialized Derived State with Fingerprint Invalidation

Status

Accepted.

Context

The Custodian State Hub is a read model (CQRS terminology) — its data is fully derivable from canonical sources that live in repositories and the filesystem. No state-hub data is authoritative; it is always a derived view of what the repos contain.

Several categories of data fit this description:

Data	Canonical source	State-hub table
SBOM dependencies	`uv.lock`, `package-lock.json`, etc.	`sbom_entries`
Third-party service declarations	`tpsc.yaml`	`tpsc_entries`
Provided capabilities	`SCOPE.md` `capability` blocks	`capability_catalog`
DoI compliance tier	14 criteria across repo files + DB	`doi_cache`
Workplan task status	`workplans/*.md`	`tasks`

Early implementations either recomputed this data on every request (too slow) or ingested it once without invalidation (stale data goes undetected). Neither is acceptable for a system designed to give accurate, fast orientation.

The doi_cache table, introduced in CUST-WP-0024, demonstrated a pattern that solves both problems. This ADR formalises that pattern and mandates its use for all repo-sourced derived data.

Pattern Name

Materialized Derived State with Fingerprint Invalidation.

This pattern is known under several names in the literature:

Materialized View (SQL standard, PostgreSQL) — the stored result of a query or computation, refreshed on demand when source data changes.
Derived Data Store (Kleppmann, Designing Data-Intensive Applications, Ch. 3 & 11) — a system whose entire dataset can be rebuilt from upstream sources; it is never the source of truth.
Read Model / Projection (CQRS / Event Sourcing) — a pre-computed view maintained alongside a write model, rebuilt when relevant events occur.
Fingerprint-based / Content-addressed invalidation — analogous to HTTP ETags: a cache entry is valid as long as a composite hash/timestamp of its inputs matches the stored value.

The State Hub already documents itself as a read model. This ADR extends that principle to specify how the read model stays fresh.

Decision

1. All repo-sourced derived data MUST be materialised in the DB

Data computed from repository files or repo records must be stored in a dedicated table rather than recomputed per request. Direct computation on every API call is only permissible for development tooling or when explicitly forced by the caller.

2. Each materialised table MUST carry a `fingerprint` column

The fingerprint is a deterministic string encoding all inputs that affect the computed result. On each read, the current fingerprint is recomputed from those inputs and compared with the stored one. If they match, the stored result is returned without recomputation; if they differ, the result is recomputed and both the stored result and the stored fingerprint are updated.

Fingerprint composition rules:

Include the updated_at timestamp of every DB record that feeds the computation (repo record, related domain, goals, snapshots).
Include the mtime (filesystem modification time) of every file that feeds the computation (SCOPE.md, CLAUDE.md, lockfiles, tpsc.yaml, etc.).
Join all components with | as a pipe-separated string — no hashing needed since the string is compared by equality, not transmitted to clients.
If a file is absent, encode filename:absent rather than omitting it, so file creation also triggers invalidation.

Reference implementation: state-hub/api/doi_engine.py::compute_fingerprint()
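The composition rules above can be sketched as follows. This is a minimal illustration, not the code in doi_engine.py: the signature, the `name:value` component layout, the sorted ordering, and the `none` marker for a missing DB timestamp are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Iterable, Mapping, Optional


def compute_fingerprint(
    db_timestamps: Mapping[str, Optional[datetime]],
    repo_root: Path,
    tracked_files: Iterable[str],
) -> str:
    """Build a pipe-separated fingerprint from DB timestamps and file mtimes."""
    parts = []
    # DB inputs: one component per record, in sorted order so the result
    # does not depend on the order in which the caller lists them.
    for name in sorted(db_timestamps):
        ts = db_timestamps[name]
        parts.append(f"{name}:{ts.isoformat() if ts is not None else 'none'}")
    # File inputs: nanosecond mtime, or an explicit marker when the file is
    # missing, so that creating the file later changes the fingerprint.
    for rel in sorted(tracked_files):
        try:
            parts.append(f"{rel}:{(repo_root / rel).stat().st_mtime_ns}")
        except FileNotFoundError:
            parts.append(f"{rel}:absent")
    return "|".join(parts)


# Example call with hypothetical inputs.
fp = compute_fingerprint(
    {"repo.updated_at": datetime(2026, 3, 20, tzinfo=timezone.utc)},
    Path("."),
    ["SCOPE.md", "tpsc.yaml"],
)
```

Because the components are joined with | and split internally by :, this sketch assumes that neither character appears in a tracked file name.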

3. Every materialised endpoint MUST support `?force_refresh=true`

Callers must always be able to bypass the cache and trigger a fresh computation. This is the escape hatch for debugging, post-ingest verification, and scheduled background refresh jobs.
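A read path that honours both the fingerprint check and the force-refresh flag might look like the sketch below. The in-memory dict stands in for the materialised table, and `CacheRow` and `read_materialised` are hypothetical names introduced for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict


@dataclass
class CacheRow:
    fingerprint: str
    result: dict
    checked_at: datetime


def read_materialised(
    cache: Dict[str, CacheRow],
    key: str,
    fingerprint: str,
    compute: Callable[[], dict],
    force_refresh: bool = False,
) -> dict:
    """Return the stored result unless it is missing, stale, or a refresh is forced."""
    row = cache.get(key)
    if row is not None and not force_refresh and row.fingerprint == fingerprint:
        return row.result  # cache hit: inputs unchanged since the last computation
    # Cache miss, fingerprint mismatch, or forced refresh: recompute and store.
    result = compute()
    cache[key] = CacheRow(fingerprint, result, datetime.now(timezone.utc))
    return result
```

In an HTTP handler, the `force_refresh` argument would be fed from the `?force_refresh=true` query parameter.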

4. Writes to source data SHOULD update the repo record's `updated_at`

Operations that change source data (SBOM ingest, TPSC ingest, capability ingest) should refresh managed_repos.updated_at so that the fingerprint detects the change on the next read. Where the data lives in a related table (e.g. tpsc_snapshots), the fingerprint must instead include that table's max(snapshot_at) directly rather than rely on the repo record.
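The two mechanisms in this rule can be shown side by side with an in-memory SQLite database. The table names come from this ADR; the column layout (`id`, `repo_id`), the function names, and the component labels are assumptions for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE managed_repos (id INTEGER PRIMARY KEY, updated_at TEXT NOT NULL);
    CREATE TABLE tpsc_snapshots (repo_id INTEGER NOT NULL, snapshot_at TEXT NOT NULL);
""")
conn.execute("INSERT INTO managed_repos VALUES (1, '2026-03-01T00:00:00Z')")
conn.commit()


def ingest_tpsc(conn, repo_id, now):
    """Write path: record the snapshot and touch the repo record in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO tpsc_snapshots VALUES (?, ?)", (repo_id, now))
        conn.execute(
            "UPDATE managed_repos SET updated_at = ? WHERE id = ?", (now, repo_id)
        )


def db_fingerprint_parts(conn, repo_id):
    """Read path: collect the DB-side fingerprint components for one repo."""
    updated_at = conn.execute(
        "SELECT updated_at FROM managed_repos WHERE id = ?", (repo_id,)
    ).fetchone()[0]
    latest = conn.execute(
        "SELECT max(snapshot_at) FROM tpsc_snapshots WHERE repo_id = ?", (repo_id,)
    ).fetchone()[0]  # NULL (None) when the repo has no snapshots yet
    return [f"repo.updated_at:{updated_at}", f"tpsc.snapshot_at:{latest or 'none'}"]
```

Including max(snapshot_at) directly means the fingerprint still changes even if some write path forgets to touch the repo record.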

5. The DB is never the source of truth — the rebuild principle holds

Per ADR-001, the state-hub must be rebuildable from scratch by re-ingesting all canonical sources. Materialised tables are caches, not records of authority. They may be wiped and repopulated at any time without data loss. This means:

No materialised table may be the only copy of any information.
Schema migrations that wipe a materialised table are safe and expected.
Background jobs that periodically re-ingest all repos are valid and encouraged.

Consequences

Positive

Fast reads in steady state — after the first computation, subsequent reads hit the DB with no filesystem or subprocess overhead.
Accurate on change — fingerprint invalidation ensures stale data is never silently served; the cache refreshes exactly when needed.
Debuggable — force_refresh=true and checked_at timestamps make it easy to see when a value was last computed and to trigger a recheck.
Consistent with the read model principle — the pattern makes explicit what was always implied: state-hub data is derived, not authoritative.

Negative / Trade-offs

First-call latency — cache misses are expensive (filesystem reads, subprocess calls, HTTP self-calls). Mitigated by pre-warming caches at startup or after ingest.
Fingerprint completeness — if a new input is added to a computation and not added to the fingerprint, stale results will be silently returned. The fingerprint must be kept in sync with the computation.
Filesystem dependency — file mtimes are volatile (e.g. git checkout rewrites mtimes). In practice this means a cache miss after every checkout, not a correctness problem.
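The mtime trade-off can be reproduced in isolation. The sketch below uses `os.utime` to stand in for a checkout that rewrites a file with identical bytes; `mtime_component` is a hypothetical helper that mirrors the `filename:absent` rule.

```python
import os
import tempfile
from pathlib import Path


def mtime_component(path: Path) -> str:
    """One fingerprint component for a file: its mtime, or an absence marker."""
    try:
        return f"{path.name}:{path.stat().st_mtime_ns}"
    except FileNotFoundError:
        return f"{path.name}:absent"


with tempfile.TemporaryDirectory() as d:
    scope = Path(d) / "SCOPE.md"
    scope.write_text("capability: example\n")
    before = mtime_component(scope)

    # Simulate a checkout that rewrites the file: same bytes, later mtime.
    st = scope.stat()
    os.utime(scope, ns=(st.st_atime_ns, st.st_mtime_ns + 10_000_000_000))
    after = mtime_component(scope)

    content_unchanged = scope.read_text() == "capability: example\n"
    spurious_miss = before != after  # the fingerprint changed although the content did not
```

The result of such a miss is one unnecessary recomputation that yields the same stored value, which is why the cost is latency rather than correctness.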

Implementation Checklist

When adding a new category of repo-sourced derived data:

Create a _cache or _snapshots table with fingerprint and checked_at columns.
Implement compute_fingerprint(repo, ...) in the relevant module.
Add ?force_refresh=true query parameter to the read endpoint.
Ensure the ingest script (or write path) touches managed_repos.updated_at or includes a related table's max(timestamp) in the fingerprint.
Verify the cache can be wiped and repopulated without data loss.
Document which inputs are included in the fingerprint in a comment alongside compute_fingerprint.
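The first and fifth checklist items can be exercised together with a small SQLite sketch. The table name `example_cache`, the `result_json` column, and the helper names are placeholders; only the `fingerprint` and `checked_at` columns are prescribed by this ADR.

```python
import sqlite3

SCHEMA = """
CREATE TABLE example_cache (
    repo_id     INTEGER PRIMARY KEY,
    fingerprint TEXT NOT NULL,   -- pipe-separated input fingerprint
    result_json TEXT NOT NULL,   -- the materialised derived value
    checked_at  TEXT NOT NULL    -- when the value was last computed
);
"""


def rebuild(conn, derived, checked_at):
    """Wipe the cache and repopulate it purely from freshly derived values."""
    with conn:  # single transaction: readers never see a half-built cache
        conn.execute("DELETE FROM example_cache")
        conn.executemany(
            "INSERT INTO example_cache VALUES (?, ?, ?, ?)",
            [(rid, fp, res, checked_at) for rid, (fp, res) in derived.items()],
        )


def contents(conn):
    """Cache contents without checked_at, which legitimately differs per rebuild."""
    return conn.execute(
        "SELECT repo_id, fingerprint, result_json FROM example_cache ORDER BY repo_id"
    ).fetchall()
```

A wipe-and-repopulate check then amounts to calling `rebuild` twice with the same derived inputs and asserting that `contents` is identical both times.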

Current Implementations

Derived data	Table	Fingerprint inputs	Force-refresh
DoI compliance tier	`doi_cache`	`repo.updated_at`, `max(tpsc_snapshots.snapshot_at)`, `max(repo_goals.updated_at)`, `mtime(SCOPE.md)`, `mtime(CLAUDE.md)`, `mtime(tpsc.yaml)`	`?force_refresh=true`

Planned Applications

Derived data	Table (proposed)	Notes
SBOM summary stats	`sbom_cache`	Fingerprint: `max(sbom_snapshots.snapshot_at)`
Capability declarations	`capability_cache`	Fingerprint: `mtime(SCOPE.md)`, `repo.updated_at`
Workplan status summary	Already handled by consistency checker	Fingerprint: workplan file mtimes

References

ADR-001: Workplans and Work Items Are Repository Artefacts
ADR-002: Custodian Agent Runtime Design
state-hub/api/doi_engine.py — reference implementation
state-hub/api/models/doi_cache.py — reference schema
state-hub/migrations/versions/k8f9a0b1c2d3_doi_cache.py — reference migration
