the-custodian/canon/architecture/adr-003-materialized-derived-state.md

---
id: ADR-003
type: architecture-decision-record
title: "Materialized Derived State with Fingerprint Invalidation for Repo-Sourced Data"
status: accepted
decided_by: Bernd Worsch
date: "2026-03-20"
tags: ["architecture", "state-hub", "caching", "read-model", "materialized-view", "derived-state"]
---

# ADR-003: Materialized Derived State with Fingerprint Invalidation

## Status

Accepted.

## Context

The Custodian State Hub is a **read model** (CQRS terminology) — its data is
fully derivable from canonical sources that live in repositories and the
filesystem. No state-hub data is authoritative; it is always a derived view
of what the repos contain.

Several categories of data fit this description:

| Data | Canonical source | State-hub table |
|---|---|---|
| SBOM dependencies | `uv.lock`, `package-lock.json`, etc. | `sbom_entries` |
| Third-party service declarations | `tpsc.yaml` | `tpsc_entries` |
| Provided capabilities | `SCOPE.md` `capability` blocks | `capability_catalog` |
| DoI compliance tier | 14 criteria across repo files + DB | `doi_cache` |
| Workplan task status | `workplans/*.md` | `tasks` |

Early implementations either recomputed this data on every request (too slow)
or ingested it once without invalidation (stale data goes undetected). Neither
is acceptable for a system designed to give accurate, fast orientation.

The `doi_cache` table, introduced in CUST-WP-0024, demonstrated a pattern that
solves both problems. This ADR formalises that pattern and mandates its use
for all repo-sourced derived data.

## Pattern Name

**Materialized Derived State with Fingerprint Invalidation.**

This pattern is known under several names in the literature:

- **Materialized View** (PostgreSQL, Oracle, and other SQL databases): the
  stored result of a query or computation, refreshed explicitly or on a
  schedule after the source data changes.
- **Derived Data Store** (Kleppmann, *Designing Data-Intensive Applications*,
  Ch. 3 & 11): a system whose entire dataset can be rebuilt from upstream
  sources; it is never the source of truth.
- **Read Model / Projection** (CQRS / Event Sourcing): a pre-computed view
  maintained alongside a write model, rebuilt when relevant events occur.
- **Fingerprint-based / Content-addressed invalidation**: analogous to HTTP
  ETags, where a cache entry is valid as long as a composite hash or
  timestamp of its inputs matches the stored value.

The State Hub already documents itself as a read model. This ADR extends that
principle to specify *how* the read model stays fresh.

## Decision

### 1. All repo-sourced derived data MUST be materialised in the DB

Data computed from repository files or repo records must be stored in a
dedicated table rather than recomputed per request. Direct computation on
every API call is only permissible for development tooling or when explicitly
forced by the caller.

### 2. Each materialised table MUST carry a `fingerprint` column

The fingerprint is a deterministic string encoding all inputs that affect the
computed result. On each read, the current fingerprint is recomputed from
those inputs and compared with the stored one. If they match, the stored
result is returned without recomputation. If they differ, the result is
recomputed and both the stored result and the stored fingerprint are updated.

**Fingerprint composition rules:**
- Include the `updated_at` timestamp of every DB record that feeds the
  computation (repo record, related domain, goals, snapshots).
- Include the `mtime` (filesystem modification time) of every file that feeds
  the computation (`SCOPE.md`, `CLAUDE.md`, lockfiles, `tpsc.yaml`, etc.).
- Join all components with `|` as a pipe-separated string — no hashing needed
  since the string is compared by equality, not transmitted to clients.
- If a file is absent, encode `filename:absent` rather than omitting it, so
  file creation also triggers invalidation.

**Reference implementation:** `state-hub/api/doi_engine.py::compute_fingerprint()`
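
The composition rules above can be sketched roughly as follows. This is a minimal illustration, not the code in `doi_engine.py`: the function signature, the `label:value` encoding of each component, the `none` placeholder for a missing timestamp, and the sorting of components are all assumptions made for the example.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Mapping, Optional


def file_component(repo_root: Path, filename: str) -> str:
    """Encode one file input as `name:<mtime_ns>`, or `name:absent` if missing."""
    try:
        mtime_ns = (repo_root / filename).stat().st_mtime_ns
    except FileNotFoundError:
        # Encoding absence explicitly means file creation changes the fingerprint.
        return f"{filename}:absent"
    return f"{filename}:{mtime_ns}"


def compute_fingerprint(
    repo_root: Path,
    db_timestamps: Mapping[str, Optional[datetime]],
    filenames: List[str],
) -> str:
    """Join DB timestamps and file mtimes into one pipe-separated string."""
    parts = []
    # Sort so the result does not depend on the caller's ordering (assumption).
    for label in sorted(db_timestamps):
        ts = db_timestamps[label]
        parts.append(f"{label}:{ts.isoformat() if ts else 'none'}")
    for name in sorted(filenames):
        parts.append(file_component(repo_root, name))
    return "|".join(parts)
```

A call might look like `compute_fingerprint(Path("/repos/example"), {"repo.updated_at": repo.updated_at}, ["SCOPE.md", "tpsc.yaml"])`, where the path and the `repo` object are hypothetical.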

### 3. Every materialised endpoint MUST support `?force_refresh=true`

Callers must always be able to bypass the cache and trigger a fresh
computation. This is the escape hatch for debugging, post-ingest verification,
and scheduled background refresh jobs.
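
The read path implied by sections 2 and 3 might look like the sketch below. It assumes an in-memory dict in place of the real table, and the names `CacheRow` and `read_materialised` are hypothetical; only the `fingerprint` and `checked_at` fields and the `force_refresh` flag come from this ADR.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict


@dataclass
class CacheRow:
    fingerprint: str
    result: str
    checked_at: datetime


def read_materialised(
    cache: Dict[str, CacheRow],
    repo_id: str,
    current_fingerprint: str,
    compute: Callable[[], str],
    force_refresh: bool = False,
) -> str:
    """Return the stored result on a fingerprint match, otherwise recompute."""
    row = cache.get(repo_id)
    if (
        row is not None
        and not force_refresh
        and row.fingerprint == current_fingerprint
    ):
        return row.result  # cache hit: no recomputation

    # Cache miss, changed inputs, or caller-forced refresh.
    result = compute()
    cache[repo_id] = CacheRow(
        fingerprint=current_fingerprint,
        result=result,
        checked_at=datetime.now(timezone.utc),
    )
    return result
```

In an HTTP handler, `force_refresh` would be populated from the `?force_refresh=true` query parameter; how that is wired depends on the web framework in use.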

### 4. Writes to source data SHOULD update the repo record's `updated_at`

Operations that change source data (SBOM ingest, TPSC ingest, capability
ingest) should refresh `managed_repos.updated_at` so that the fingerprint
detects the change on the next read. Where the data lives in a related table
(e.g. `tpsc_snapshots`), the fingerprint must instead include that table's
`max(snapshot_at)` directly rather than relying on the repo record.
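
As an illustration of the second case, a fingerprint component could be read straight from the related table. The sketch uses `sqlite3` for self-containment; the `repo_id` column, the text-typed timestamps, and the `snapshot_component` helper are assumptions, not the actual schema.

```python
import sqlite3


def snapshot_component(conn: sqlite3.Connection, repo_id: int) -> str:
    """Encode the newest snapshot timestamp for one repo as a fingerprint part."""
    row = conn.execute(
        "SELECT max(snapshot_at) FROM tpsc_snapshots WHERE repo_id = ?",
        (repo_id,),
    ).fetchone()
    latest = row[0]  # NULL (None) when the repo has no snapshots yet
    return f"tpsc_snapshots:{latest if latest is not None else 'none'}"
```

Because a new snapshot row raises `max(snapshot_at)`, the ingest path does not also need to touch `managed_repos.updated_at` for this input to be detected.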

### 5. The DB is never the source of truth — the rebuild principle holds

Per ADR-001, the state-hub must be rebuildable from scratch by re-ingesting
all canonical sources. Materialised tables are **caches**, not records of
authority. They may be wiped and repopulated at any time without data loss.
This means:
- No materialised table may be the only copy of any information.
- Schema migrations that wipe a materialised table are safe and expected.
- Background jobs that periodically re-ingest all repos are valid and
  encouraged.
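
One way to check the rebuild principle is to wipe a materialised table, repopulate it from the canonical sources, and confirm that nothing was lost. The sketch below assumes a hypothetical `demo_cache` table and a `derive` callable standing in for the real ingest logic.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple

DEMO_SCHEMA = "CREATE TABLE demo_cache (repo_id INTEGER PRIMARY KEY, result TEXT)"

Rows = Iterable[Tuple[int, str]]


def table_contents(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Return all rows in a stable order so two states can be compared."""
    return conn.execute(
        "SELECT repo_id, result FROM demo_cache ORDER BY repo_id"
    ).fetchall()


def wipe_and_repopulate(conn: sqlite3.Connection, derive: Callable[[], Rows]) -> None:
    """Delete every row, then re-derive the table from canonical sources."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM demo_cache")
        conn.executemany(
            "INSERT INTO demo_cache (repo_id, result) VALUES (?, ?)", derive()
        )


def survives_rebuild(conn: sqlite3.Connection, derive: Callable[[], Rows]) -> bool:
    """True if the table holds nothing that the sources cannot reproduce."""
    before = table_contents(conn)
    wipe_and_repopulate(conn, derive)
    return table_contents(conn) == before
```

If `survives_rebuild` returns `False`, the table contained a row that was either stale or the only copy of some information, which is exactly what this section rules out.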

## Consequences

### Positive

- **Fast reads in steady state**: after the first computation, subsequent
  reads hit the DB with no filesystem or subprocess overhead.
- **Accurate on change**: provided the fingerprint covers every input, a
  change to any source triggers recomputation on the next read, so stale data
  is not silently served.
- **Debuggable**: `force_refresh=true` and `checked_at` timestamps make it
  easy to see when a value was last computed and to trigger a recheck.
- **Consistent with the read model principle**: the pattern makes explicit
  what was always implied: state-hub data is derived, not authoritative.

### Negative / Trade-offs

- **First-call latency**: cache misses are expensive (filesystem reads,
  subprocess calls, HTTP self-calls). Mitigated by pre-warming caches at
  startup or after ingest.
- **Fingerprint completeness**: if a new input is added to a computation and
  not added to the fingerprint, stale results will be silently returned. The
  fingerprint must be kept in sync with the computation.
- **Filesystem dependency**: file mtimes are volatile (e.g. `git checkout`
  resets the mtime of every file it rewrites). In practice this causes extra
  cache misses after a checkout, not a correctness problem.
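
The mtime trade-off can be demonstrated with a short experiment: changing a file's mtime without changing its content alters an mtime-based fingerprint component, which costs a recomputation but yields the same result. The helper names below are hypothetical, and the content digest is used only to show that the content is unchanged, not as a proposed replacement for mtimes.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def mtime_fingerprint(path: Path) -> str:
    """Fingerprint component based on the file's modification time."""
    return f"{path.name}:{path.stat().st_mtime_ns}"


def content_digest(path: Path) -> str:
    """SHA-256 of the file content, independent of mtime."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def bump_mtime(path: Path, seconds: int = 10) -> None:
    """Move the mtime forward without touching the content."""
    st = path.stat()
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + seconds * 1_000_000_000))


with tempfile.TemporaryDirectory() as tmp:
    scope = Path(tmp) / "SCOPE.md"
    scope.write_text("capability: example\n")
    fp_before, digest_before = mtime_fingerprint(scope), content_digest(scope)

    bump_mtime(scope)  # stands in for a checkout rewriting the file

    fingerprint_changed = mtime_fingerprint(scope) != fp_before
    content_changed = content_digest(scope) != digest_before
```

Here `fingerprint_changed` ends up true while `content_changed` stays false: the cache is invalidated unnecessarily, but the recomputed value is the same as the one it replaces.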

## Implementation Checklist

When adding a new category of repo-sourced derived data:

- [ ] Create a table with a `_cache` or `_snapshots` suffix that has
      `fingerprint` and `checked_at` columns.
- [ ] Implement `compute_fingerprint(repo, ...)` in the relevant module.
- [ ] Add `?force_refresh=true` query parameter to the read endpoint.
- [ ] Ensure the ingest script (or write path) touches `managed_repos.updated_at`
      or includes a related table's `max(timestamp)` in the fingerprint.
- [ ] Verify the cache can be wiped and repopulated without data loss.
- [ ] Document which inputs are included in the fingerprint in a comment
      alongside `compute_fingerprint`.
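
For the first checklist item, a minimal table shape might look like the following. This is a sketch using `sqlite3`; the table name `example_cache`, the `repo_id` and `result` columns, and the `store_result` helper are assumptions, and the reference schema in `state-hub/api/models/doi_cache.py` remains the authority for real tables.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per repo, carrying the two mandated columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS example_cache (
    repo_id     INTEGER PRIMARY KEY,
    result      TEXT NOT NULL,
    fingerprint TEXT NOT NULL,  -- inputs encoded as a pipe-separated string
    checked_at  TEXT NOT NULL   -- when the result was last computed (UTC, ISO 8601)
)
"""


def store_result(
    conn: sqlite3.Connection, repo_id: int, result: str, fingerprint: str
) -> None:
    """Insert or overwrite the cached row for one repo."""
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO example_cache "
            "(repo_id, result, fingerprint, checked_at) VALUES (?, ?, ?, ?)",
            (repo_id, result, fingerprint, datetime.now(timezone.utc).isoformat()),
        )
```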

## Current Implementations

| Derived data | Table | Fingerprint inputs | Force-refresh |
|---|---|---|---|
| DoI compliance tier | `doi_cache` | `repo.updated_at`, `max(tpsc_snapshots.snapshot_at)`, `max(repo_goals.updated_at)`, `mtime(SCOPE.md)`, `mtime(CLAUDE.md)`, `mtime(tpsc.yaml)` | `?force_refresh=true` |

## Planned Applications

| Derived data | Table (proposed) | Notes |
|---|---|---|
| SBOM summary stats | `sbom_cache` | Fingerprint: `max(sbom_snapshots.snapshot_at)` |
| Capability declarations | `capability_cache` | Fingerprint: `mtime(SCOPE.md)`, `repo.updated_at` |
| Workplan status summary | n/a | Already handled by the consistency checker. Fingerprint: workplan file mtimes |

## Related

- ADR-001: Workplans and Work Items Are Repository Artefacts
- ADR-002: Custodian Agent Runtime Design
- `state-hub/api/doi_engine.py` — reference implementation
- `state-hub/api/models/doi_cache.py` — reference schema
- `state-hub/migrations/versions/k8f9a0b1c2d3_doi_cache.py` — reference migration