Formalises the caching pattern introduced in doi_cache: pre-compute and store repo-sourced derived data, invalidate by fingerprint (composite of DB timestamps + file mtimes), force-refresh on demand. Names the pattern against the literature (Materialized View, Derived Data Store, CQRS Read Model, ETag-style invalidation) and mandates its use for all future repo-sourced derived data with an implementation checklist. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
---
id: ADR-003
type: architecture-decision-record
title: "Materialized Derived State with Fingerprint Invalidation for Repo-Sourced Data"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-20"
tags: ["architecture", "state-hub", "caching", "read-model", "materialized-view", "derived-state"]
---

# ADR-003: Materialized Derived State with Fingerprint Invalidation

## Status

Accepted.

## Context

The Custodian State Hub is a **read model** (CQRS terminology): its data is
fully derivable from canonical sources that live in repositories and the
filesystem. No state-hub data is authoritative; it is always a derived view
of what the repos contain.

Several categories of data fit this description:

| Data | Canonical source | State-hub table |
|---|---|---|
| SBOM dependencies | `uv.lock`, `package-lock.json`, etc. | `sbom_entries` |
| Third-party service declarations | `tpsc.yaml` | `tpsc_entries` |
| Provided capabilities | `SCOPE.md` `capability` blocks | `capability_catalog` |
| DoI compliance tier | 14 criteria across repo files + DB | `doi_cache` |
| Workplan task status | `workplans/*.md` | `tasks` |

Early implementations either recomputed this data on every request (too slow)
or ingested it once without invalidation (stale data goes undetected). Neither
is acceptable for a system designed to give accurate, fast orientation.

The `doi_cache` table, introduced in CUST-WP-0024, demonstrated a pattern that
solves both problems. This ADR formalises that pattern and mandates its use
for all repo-sourced derived data.

## Pattern Name

**Materialized Derived State with Fingerprint Invalidation.**

This pattern is known under several names in the literature:

- **Materialized View** (PostgreSQL, Oracle and other SQL databases): the
  stored result of a query or computation, refreshed on demand when source
  data changes.
- **Derived Data Store** (Kleppmann, *Designing Data-Intensive Applications*,
  Ch. 3 & 11): a system whose entire dataset can be rebuilt from upstream
  sources; it is never the source of truth.
- **Read Model / Projection** (CQRS / Event Sourcing): a pre-computed view
  maintained alongside a write model, rebuilt when relevant events occur.
- **Fingerprint-based / Content-addressed invalidation**: analogous to HTTP
  ETags. A cache entry is valid as long as a composite hash/timestamp of its
  inputs matches the stored value.

The State Hub already documents itself as a read model. This ADR extends that
principle to specify *how* the read model stays fresh.

## Decision

### 1. All repo-sourced derived data MUST be materialised in the DB

Data computed from repository files or repo records must be stored in a
dedicated table rather than recomputed per request. Direct computation on
every API call is only permissible for development tooling or when explicitly
forced by the caller.

### 2. Each materialised table MUST carry a `fingerprint` column

The fingerprint is a deterministic string encoding all inputs that affect the
computed result. On each read it is recomputed from the current inputs and
compared with the stored value. If the two match, the stored result is
returned without recomputation. If they differ, the result is recomputed and
both the stored result and the stored fingerprint are updated.

**Fingerprint composition rules:**
- Include the `updated_at` timestamp of every DB record that feeds the
  computation (repo record, related domain, goals, snapshots).
- Include the `mtime` (filesystem modification time) of every file that feeds
  the computation (`SCOPE.md`, `CLAUDE.md`, lockfiles, `tpsc.yaml`, etc.).
- Join all components with `|` into a single pipe-separated string. No hashing
  is needed, since the string is only compared for equality and is never
  transmitted to clients.
- If a file is absent, encode `filename:absent` rather than omitting it, so
  file creation also triggers invalidation.

**Reference implementation:** `state-hub/api/doi_engine.py::compute_fingerprint()`

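The composition rules above can be sketched as follows. This is a minimal illustration, not the code in `doi_engine.py`: the signature, the `name:value` component layout, the `none` marker for a missing timestamp, and the use of `st_mtime_ns` are all assumptions made for the example.

```python
from __future__ import annotations

from datetime import datetime
from pathlib import Path
from typing import Mapping, Sequence


def compute_fingerprint(
    db_timestamps: Mapping[str, datetime | None],
    repo_root: Path,
    tracked_files: Sequence[str],
) -> str:
    """Build a pipe-separated fingerprint from DB timestamps and file mtimes.

    Hypothetical signature: the real function may take different arguments.
    """
    parts: list[str] = []

    # DB inputs: one component per record, in a stable (sorted) order so the
    # same inputs always produce the same string.
    for name in sorted(db_timestamps):
        ts = db_timestamps[name]
        parts.append(f"{name}:{ts.isoformat() if ts else 'none'}")

    # File inputs: mtime in nanoseconds, or an explicit marker when the file
    # is missing, so that creating the file later changes the fingerprint.
    for rel in sorted(tracked_files):
        path = repo_root / rel
        if path.is_file():
            parts.append(f"{rel}:{path.stat().st_mtime_ns}")
        else:
            parts.append(f"{rel}:absent")

    return "|".join(parts)
```

Sorting the components is a choice made for this sketch. Any fixed order works, as long as it does not depend on dictionary or directory iteration order.
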
### 3. Every materialised endpoint MUST support `?force_refresh=true`

Callers must always be able to bypass the cache and trigger a fresh
computation. This is the escape hatch for debugging, post-ingest verification,
and scheduled background refresh jobs.

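The read path implied by rules 2 and 3 might look like the sketch below. It uses an in-memory SQLite table so that it runs standalone. The table name `example_cache`, the `payload` column, and the `read_cached` helper are hypothetical, and the actual state-hub schema and database engine may differ.

```python
import json
import sqlite3
from datetime import datetime, timezone
from typing import Callable

# Hypothetical schema: only `fingerprint` and `checked_at` come from the ADR.
SCHEMA = """
CREATE TABLE IF NOT EXISTS example_cache (
    repo_id     TEXT PRIMARY KEY,
    fingerprint TEXT NOT NULL,
    checked_at  TEXT NOT NULL,
    payload     TEXT NOT NULL
)
"""


def read_cached(
    conn: sqlite3.Connection,
    repo_id: str,
    fingerprint: str,
    compute: Callable[[], dict],
    force_refresh: bool = False,
) -> dict:
    """Return the stored result unless the fingerprint changed or a refresh is forced."""
    row = conn.execute(
        "SELECT fingerprint, payload FROM example_cache WHERE repo_id = ?",
        (repo_id,),
    ).fetchone()

    if row is not None and row[0] == fingerprint and not force_refresh:
        return json.loads(row[1])  # cache hit: no recomputation

    result = compute()  # cache miss, stale fingerprint, or forced refresh
    conn.execute(
        "INSERT OR REPLACE INTO example_cache VALUES (?, ?, ?, ?)",
        (repo_id, fingerprint, datetime.now(timezone.utc).isoformat(), json.dumps(result)),
    )
    conn.commit()
    return result
```

An endpoint would compute the current fingerprint, pass the `force_refresh` query parameter through unchanged, and hand the expensive computation in as `compute`, so it only runs when the stored row cannot be reused.
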
### 4. Writes to source data SHOULD update the repo record's `updated_at`

Operations that change source data (SBOM ingest, TPSC ingest, capability
ingest) should refresh `managed_repos.updated_at` so that the fingerprint
detects the change on the next read. Where data lives in a related table
(e.g. `tpsc_snapshots`), the fingerprint must instead include that table's
`max(snapshot_at)` directly rather than relying on the repo record.

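For the related-table case, a fingerprint component could be derived as in this sketch. The `tpsc_snapshots` table and its `snapshot_at` column are named in this ADR. The `repo_id` column, the helper name, and the component label are assumptions, and SQLite stands in for the real database.

```python
import sqlite3


def snapshot_component(conn: sqlite3.Connection, repo_id: str) -> str:
    """Fingerprint component for the newest snapshot of one repo."""
    (latest,) = conn.execute(
        "SELECT max(snapshot_at) FROM tpsc_snapshots WHERE repo_id = ?",
        (repo_id,),
    ).fetchone()
    # max() over zero rows yields NULL; encode that explicitly so the first
    # ingest for a repo also changes the fingerprint.
    value = latest if latest is not None else "none"
    return f"tpsc_snapshots.max_snapshot_at:{value}"
```

Because the component is read from the snapshot table itself, an ingest that forgets to touch `managed_repos.updated_at` still invalidates the cached result.
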
### 5. The DB is never the source of truth: the rebuild principle holds

Per ADR-001, the state-hub must be rebuildable from scratch by re-ingesting
all canonical sources. Materialised tables are **caches**, not records of
authority. They may be wiped and repopulated at any time without data loss.
This means:
- No materialised table may be the only copy of any information.
- Schema migrations that wipe a materialised table are safe and expected.
- Background jobs that periodically re-ingest all repos are valid and
  encouraged.

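One way to check the rebuild principle is a wipe-and-rebuild round trip, sketched here with a plain dictionary standing in for a materialised table. The `rebuild` and `survives_wipe` helpers and the `ingest` callback are hypothetical names introduced for this example.

```python
from typing import Callable, Dict, Iterable

Cache = Dict[str, dict]


def rebuild(cache: Cache, repo_ids: Iterable[str], ingest: Callable[[str], dict]) -> None:
    """Drop every cached row and repopulate it from canonical sources."""
    cache.clear()
    for repo_id in repo_ids:
        cache[repo_id] = ingest(repo_id)


def survives_wipe(cache: Cache, repo_ids: Iterable[str], ingest: Callable[[str], dict]) -> bool:
    """True if a wipe-and-rebuild reproduces the cache contents exactly."""
    repo_ids = list(repo_ids)
    # Copy the rows first, because rebuild() mutates the cache in place.
    before = {key: dict(value) for key, value in cache.items()}
    rebuild(cache, repo_ids, ingest)
    return cache == before
```

If `survives_wipe` returns `False`, some row held information that the canonical sources do not contain. Against a real table, volatile columns such as `checked_at` would need to be excluded from the comparison.
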
## Consequences

### Positive

- **Fast reads in steady state**: after the first computation, subsequent
  reads hit the DB with no filesystem or subprocess overhead.
- **Accurate on change**: as long as the fingerprint covers every input,
  stale data is never silently served; the cache refreshes exactly when
  needed.
- **Debuggable**: `force_refresh=true` and `checked_at` timestamps make it
  easy to see when a value was last computed and to trigger a recheck.
- **Consistent with the read model principle**: the pattern makes explicit
  what was always implied: state-hub data is derived, not authoritative.

### Negative / Trade-offs

- **First-call latency**: cache misses are expensive (filesystem reads,
  subprocess calls, HTTP self-calls). Mitigated by pre-warming caches at
  startup or after ingest.
- **Fingerprint completeness**: if a new input is added to a computation and
  not added to the fingerprint, stale results will be silently returned. The
  fingerprint must be kept in sync with the computation.
- **Filesystem dependency**: file mtimes are volatile (e.g. `git checkout`
  resets the mtime of the files it rewrites). In practice this means a cache
  miss after a checkout, not a correctness problem.

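The pre-warming mitigation mentioned under first-call latency could be as small as the loop below. It is a sketch under assumptions: `prewarm` and the `read` callback are hypothetical, and `read` is expected to be whatever cached read function the endpoint already uses.

```python
from typing import Callable, Iterable, List


def prewarm(repo_ids: Iterable[str], read: Callable[[str], object]) -> List[str]:
    """Call the cached read path once per repo; return the ids that failed."""
    failed: List[str] = []
    for repo_id in repo_ids:
        try:
            read(repo_id)  # a miss recomputes and stores; a hit is cheap
        except Exception:
            # One broken repo should not prevent warming the others.
            failed.append(repo_id)
    return failed
```

Running this at startup or at the end of an ingest job moves the expensive first computation out of the request path. The returned list gives the caller something to log or retry.
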
## Implementation Checklist

When adding a new category of repo-sourced derived data:

- [ ] Create a `_cache` or `_snapshots` table with `fingerprint` and
      `checked_at` columns.
- [ ] Implement `compute_fingerprint(repo, ...)` in the relevant module.
- [ ] Add `?force_refresh=true` query parameter to the read endpoint.
- [ ] Ensure the ingest script (or write path) touches `managed_repos.updated_at`
      or includes a related table's `max(timestamp)` in the fingerprint.
- [ ] Verify the cache can be wiped and repopulated without data loss.
- [ ] Document which inputs are included in the fingerprint in a comment
      alongside `compute_fingerprint`.

## Current Implementations

| Derived data | Table | Fingerprint inputs | Force-refresh |
|---|---|---|---|
| DoI compliance tier | `doi_cache` | `repo.updated_at`, `max(tpsc_snapshots.snapshot_at)`, `max(repo_goals.updated_at)`, `mtime(SCOPE.md)`, `mtime(CLAUDE.md)`, `mtime(tpsc.yaml)` | `?force_refresh=true` |

## Planned Applications

| Derived data | Table (proposed) | Notes |
|---|---|---|
| SBOM summary stats | `sbom_cache` | Fingerprint: `max(sbom_snapshots.snapshot_at)` |
| Capability declarations | `capability_cache` | Fingerprint: `mtime(SCOPE.md)`, `repo.updated_at` |
| Workplan status summary | None (already handled by consistency checker) | Fingerprint: workplan file mtimes |

## Related

- ADR-001: Workplans and Work Items Are Repository Artefacts
- ADR-002: Custodian Agent Runtime Design
- `state-hub/api/doi_engine.py`: reference implementation
- `state-hub/api/models/doi_cache.py`: reference schema
- `state-hub/migrations/versions/k8f9a0b1c2d3_doi_cache.py`: reference migration