Replace the fixed 15s TTL on GET /state/summary with per-table revision watermarks, stale-while-revalidate background refresh, and a progress-tail section split. SQLAlchemy write hooks invalidate core or progress sections on mutation. Adds tests, benchmark script, and operator docs.
| id | type | title | domain | repo | status | owner | topic_slug | created | updated | finished | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|
| STATE-WP-0066 | workplan | State summary revision cache and stale-while-revalidate | custodian | state-hub | finished | codex | custodian | 2026-06-22 | 2026-06-22 | 2026-06-22 | f738cd77-6b8b-40e5-b348-dc304c7821f1 |
STATE-WP-0066 — State summary revision cache and stale-while-revalidate
Summary
Upgrade /state/summary caching from a fixed 15-second TTL to a
revision-gated cache: serve the cached snapshot when underlying hub data
has not changed, and use stale-while-revalidate when it has. This extends
STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel
latency spikes that made /state/summary a poor readiness-probe target on
Railiance01.
/state/health remains the correct probe endpoint for infrastructure checks.
This workplan optimizes summary for dashboard, MCP, and activity-core consumers
that legitimately need the full snapshot.
Problem
Current behaviour (api/routers/state.py):
- In-process TTL cache (_SUMMARY_TTL = 15s): expires on wall clock, not data change.
- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and Python-side SBOM licence scanning (~200ms+ locally; worse through ops-bridge → WSL).
- ETag middleware helps repeat identical responses (304) but still pays full rebuild cost on miss.
- progress_events is append-only; hourly sweeps and agent notes invalidate freshness expectations for recent_progress even when totals and workplan rows are unchanged.
Observed impact:
- Bridge readiness probe timeouts when summary was used (fixed in activity-core 40fa851; do not regress).
- Dashboard workplans page and MCP get_state_summary poll summary on 15–60s intervals, causing unnecessary rebuilds during quiet periods.
- Concurrent cache misses contend on a single AsyncSession (no gather).
Goal
- Revision watermark: one cheap query (or small query set) determines whether cached summary is still valid.
- Stale-while-revalidate: return last good snapshot immediately on revision change; rebuild in background.
- Section-level freshness: split recent_progress from the stable core so sweep traffic does not force full rebuilds.
- Eager invalidation: mutation routes clear or bump revision so writes are visible without waiting for the next poll.
- Observable cache: response headers document hit/miss/stale/revision.
Design sketch
Revision fingerprint
Before a full rebuild, compute a SummaryRevision from indexed
MAX(updated_at) / MAX(created_at) across tables that feed summary:
| Signal | Tables / columns |
|---|---|
| Core entities | topics, workplans, tasks, decisions, workplan_dependencies → updated_at |
| Append-only events | progress_events → created_at |
| Portfolio | managed_repos, contributions, capability_requests → updated_at |
| Domain rollup | domains, extension_points, technical_debt → updated_at |
| SBOM slice | sbom_entries (or latest sbom_snapshots) → relevant timestamp |
Store revision (ISO timestamp or hash of per-table maxima) alongside cached
StateSummary. If incoming revision matches cached revision → return cache
(X-StateHub-Cache: hit-revision).
Reuse the fingerprint pattern from api/doi_engine.py (compute_fingerprint).
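A minimal sketch of the "hash of per-table maxima" option, assuming a frozen dataclass and a truncated SHA-256 digest. It does not reproduce compute_fingerprint from api/doi_engine.py; the field name maxima and the 16-character digest length are illustrative assumptions.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Mapping, Optional


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint string (hypothetical shape)."""

    maxima: Mapping[str, Optional[datetime]]

    @property
    def fingerprint(self) -> str:
        # Sort by table name so the hash does not depend on dict ordering.
        parts = [
            f"{table}={stamp.isoformat() if stamp else 'none'}"
            for table, stamp in sorted(self.maxima.items())
        ]
        digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
        return digest[:16]  # assumed length, short enough for a response header


quiet = SummaryRevision({"tasks": datetime(2026, 6, 22, 10, 0), "progress_events": None})
bumped = SummaryRevision({"tasks": datetime(2026, 6, 22, 10, 5), "progress_events": None})
print(quiet.fingerprint != bumped.fingerprint)
```

Treating an empty table (None) as the literal "none" keeps the fingerprint stable on a fresh database, and the first inserted row still changes it.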
Stale-while-revalidate
Mirror api/routers/workstreams.py _workplan_index behaviour:
- Revision unchanged → return cache (fast path).
- Revision changed, cache present → return stale cache, start background rebuild task (X-StateHub-Cache: stale).
- No cache → await rebuild (X-StateHub-Cache: miss).
- Respect Cache-Control: no-cache and a ?refresh=true query param to force synchronous rebuild (for tests and operators).
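The four cases above can be sketched with plain asyncio. This is an illustration under stated assumptions, not the workstreams router code: SwrCache, its rebuild callable, and last_error are hypothetical names, and the snapshot is a plain dict rather than StateSummary.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


class SwrCache:
    """Hypothetical holder for one snapshot and the revision it was built at."""

    def __init__(self, rebuild: Callable[[], Awaitable[dict]]) -> None:
        self._rebuild = rebuild
        self._snapshot: Optional[dict] = None
        self._revision: Optional[str] = None
        self._refresh_task: Optional[asyncio.Task] = None
        self.last_error: Optional[BaseException] = None

    async def _refresh(self, revision: str) -> None:
        snapshot = await self._rebuild()
        # Swap snapshot and revision together, with no await in between.
        self._snapshot, self._revision = snapshot, revision
        self.last_error = None

    async def _refresh_quietly(self, revision: str) -> None:
        try:
            await self._refresh(revision)
        except Exception as exc:
            # Keep serving the stale snapshot; record the failure for diagnostics.
            self.last_error = exc

    async def get(self, revision: str, force: bool = False) -> Tuple[dict, str]:
        if self._snapshot is None or force:
            await self._refresh(revision)  # cold start or forced: block on rebuild
            return self._snapshot, "miss"
        if revision == self._revision:
            return self._snapshot, "hit-revision"
        # Revision moved: serve the old snapshot and start at most one refresh.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh_quietly(revision))
        return self._snapshot, "stale"
```

Holding the task on the instance both deduplicates concurrent refreshes and keeps a strong reference to it, which asyncio needs so the task is not garbage-collected mid-run.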
Section split (phase 2 within this workplan)
| Section key | Invalidation |
|---|---|
| core | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
| progress_tail | MAX(progress_events.created_at) only |
Serve merged StateSummary; rebuild only the section whose revision changed.
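One way to picture the merge is a per-section cache keyed by its own revision. The sketch below is synchronous and uses plain dicts for brevity; Section, SectionedSummary, and the builder callables are hypothetical names, and the real rebuilds would be async database work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Section:
    revision: Optional[str] = None
    payload: dict = field(default_factory=dict)


class SectionedSummary:
    """Rebuild only the sections whose revision changed, then merge."""

    def __init__(self, builders: Dict[str, Callable[[], dict]]) -> None:
        self._builders = builders
        self._sections = {key: Section() for key in builders}

    def serve(self, revisions: Dict[str, str]) -> dict:
        merged: dict = {}
        for key, builder in self._builders.items():
            section = self._sections[key]
            if section.revision != revisions[key]:
                # Only this section's inputs moved, so only it is rebuilt.
                section.payload = builder()
                section.revision = revisions[key]
            merged.update(section.payload)
        return merged


calls = {"core": 0, "progress_tail": 0}


def build_core() -> dict:
    calls["core"] += 1
    return {"totals": {"tasks": 3}}


def build_tail() -> dict:
    calls["progress_tail"] += 1
    return {"recent_progress": [f"event-{calls['progress_tail']}"]}


summary = SectionedSummary({"core": build_core, "progress_tail": build_tail})
summary.serve({"core": "c1", "progress_tail": "p1"})
latest = summary.serve({"core": "c1", "progress_tail": "p2"})
print(calls)  # the core builder ran once, the progress builder twice
```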
Eager invalidation
Add invalidate_summary_cache() (or revision bump) called from:
- POST /progress/
- PATCH /tasks/{id}, POST /tasks/
- PATCH /decisions/{id}, POST /decisions/
- PATCH /workplans/{id}, POST /workplans/ (and legacy /workstreams/)
- POST /workplan-dependencies/ (and legacy alias)
Keep invalidation in-process (single uvicorn worker assumption). Document multi-worker limitation; defer shared revision store unless needed.
Response metadata
Extend headers (and optionally _meta in schema — only if consumers need it):
X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision: <iso-or-hash>
X-StateHub-Elapsed-Ms: <existing>
Preserve truthful generated_at on cache hits (when the snapshot was built).
Out of scope
- Redis or external cache layer.
- Multi-worker shared cache (document only).
- Replacing /state/overview (already the dashboard fast path).
- Changing summary response shape for existing consumers (additive _meta only if needed).
- PostgreSQL materialized views or NOTIFY/LISTEN.
Dependencies
- STATE-WP-0056 T07 (done): baseline cache-miss profiling.
- activity-core bridge readiness fix (40fa851): health probe; independent of this workplan but motivates summary optimization for other callers.
T01 — Revision fingerprint module
id: STATE-WP-0066-T01
status: done
priority: high
state_hub_task_id: "8ee836ec-048c-44f6-b16e-e7454d07371a"
Extract summary cache logic from api/routers/state.py into a dedicated
module (e.g. api/services/summary_cache.py).
Deliverables:
- SummaryRevision dataclass: per-table maxima + combined fingerprint string.
- async def fetch_summary_revision(session) -> SummaryRevision, issuing one or a few SQL queries over existing indexes; target <20ms on current data volume.
- Unit tests with mocked session rows proving fingerprint changes when any contributing table changes.
- Document covered tables and why each is included (flow-engine inputs, next_steps, domain rollups, licence scan).
Done when revision fetch is tested in isolation and profiled under local DB.
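To illustrate the "one or few SQL queries" deliverable, the sketch below builds a single UNION ALL statement that returns one watermark row per table. The table subset and the helper name build_watermark_sql are assumptions; the names are interpolated from a fixed constant, never from request input.

```python
from typing import Dict

# Assumed subset of the tables listed in the revision fingerprint design.
WATERMARK_COLUMNS: Dict[str, str] = {
    "topics": "updated_at",
    "workplans": "updated_at",
    "tasks": "updated_at",
    "decisions": "updated_at",
    "progress_events": "created_at",
}


def build_watermark_sql(columns: Dict[str, str]) -> str:
    """Return one statement yielding (source, watermark) rows, one per table."""
    selects = [
        f"SELECT '{table}' AS source, MAX({column}) AS watermark FROM {table}"
        for table, column in columns.items()
    ]
    return "\nUNION ALL\n".join(selects)


print(build_watermark_sql(WATERMARK_COLUMNS))
```

Each branch is an aggregate over a single timestamp column, which a database can typically answer from an index on that column, so the whole revision costs one round trip.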
T02 — Revision-gated cache (replace TTL-only path)
id: STATE-WP-0066-T02
status: done
priority: high
state_hub_task_id: "c5fc8ab8-5fc8-463b-9ae7-c304a7e0383e"
Wire revision check into GET /state/summary:
- If revision matches cached revision → return cache regardless of age.
- Remove or demote _SUMMARY_TTL to a safety cap only (e.g. 5 min max stale age if background refresh fails).
- Add X-StateHub-Cache: hit-revision and X-StateHub-Revision headers.
- Honour Cache-Control: no-cache and ?refresh=true for forced rebuild.
- Update tests/conftest.py cache reset helper for new module globals.
Done when repeated summary requests with unchanged DB show hit-revision and
skip the heavy query path (assert via mock or elapsed-ms header threshold).
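The decision logic for this task can be expressed as two small pure functions, which makes it easy to unit test without a database. This is one possible reading of the safety cap (a stale snapshot older than the cap forces a blocking rebuild); the function names and the assumption that header keys arrive lower-cased are illustrative.

```python
from typing import Mapping, Optional

MAX_STALE_SECONDS = 300.0  # the "5 min" safety cap, applied to stale serving only


def wants_forced_rebuild(headers: Mapping[str, str], refresh: Optional[str]) -> bool:
    """True when the caller sent Cache-Control: no-cache or ?refresh=true."""
    cache_control = headers.get("cache-control", "")  # assumes lower-cased keys
    directives = {part.strip().lower() for part in cache_control.split(",")}
    return "no-cache" in directives or (refresh or "").lower() == "true"


def classify(
    cached_revision: Optional[str],
    current_revision: str,
    stale_age_seconds: float,
    forced: bool,
) -> str:
    """Map cache state to the X-StateHub-Cache value that would be served."""
    if forced or cached_revision is None:
        return "miss"
    if cached_revision == current_revision:
        return "hit-revision"  # valid regardless of how old the snapshot is
    if stale_age_seconds > MAX_STALE_SECONDS:
        return "miss"  # stale for too long, so block on a rebuild instead
    return "stale"
```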
T03 — Stale-while-revalidate background refresh
id: STATE-WP-0066-T03
status: done
priority: high
state_hub_task_id: "aa079e17-6103-4539-878d-b451035e5f8a"
When revision differs but a cached snapshot exists:
- Return stale snapshot immediately (X-StateHub-Cache: stale).
- Start a single background asyncio task to rebuild (dedupe concurrent refresh, same pattern as _INDEX_REFRESH_TASK in workstreams router).
- On refresh completion, atomically swap cache + revision.
- On refresh failure, retain stale cache; set _SUMMARY_LAST_ERROR for optional diagnostics header.
- Cold start (no cache) still blocks until first build completes.
Done when a revision bump serves stale data in <50ms while background rebuild
runs; second request after rebuild shows hit-revision.
T04 — Section-level cache for recent_progress
id: STATE-WP-0066-T04
status: done
priority: medium
state_hub_task_id: "99b14a10-ff1b-4609-a62b-5da19b79be68"
Split cache into core and progress_tail sections:
- progress_tail rebuild: single query ORDER BY created_at DESC LIMIT 20.
- core rebuild: existing summary path minus recent-progress fetch.
- Merge at serve time into StateSummary.
- Revision mismatch on progress only → rebuild progress section only; reuse cached core.
Done when inserting a progress event refreshes recent_progress without
re-running domain/SBOM/flow-engine work (verify via query count or spy).
T05 — Eager invalidation on mutation routes
id: STATE-WP-0066-T05
status: done
priority: medium
state_hub_task_id: "0df9c1c2-6edf-4be3-bee8-b75e0d24fc02"
Call invalidate_summary_cache() (or equivalent revision bump) from write
paths listed in the design sketch.
- Invalidation must be synchronous before response returns (so the next GET sees fresh revision).
- Progress POST invalidates at least the progress_tail section.
- Task/workplan/decision PATCH invalidates core.
- Add focused tests: mutate → GET summary reflects change without ?refresh.
Done when mutation routes trigger invalidation and tests pass.
T06 — Benchmark and regression tests
id: STATE-WP-0066-T06
status: done
priority: high
state_hub_task_id: "5f68f999-d5ca-40df-a054-aaf762837342"
Prove cache effectiveness under realistic load:
- Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers /state/summary with unchanged revision; p95 < 50ms.
- Cache-miss path (forced ?refresh=true) stays < 500ms on current local data volume (STATE-WP-0056 T07 bar).
- MCP smoke tests still pass (tests/test_mcp_smoke.py).
- Router shape tests still pass (tests/test_routers_core.py).
- Add tests for stale-while-revalidate and section-split behaviour.
Done when benchmark script is documented in workplan notes or Makefile target and CI tests are green.
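For the latency bars above, a benchmark script needs an unambiguous p95 definition. The sketch below uses the nearest-rank method with integer arithmetic; the helper names and the sample values are illustrative and not measurements from this project.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of the collected latencies."""
    if not samples_ms:
        raise ValueError("no samples collected")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_budget(samples_ms: Sequence[float], budget_ms: float) -> bool:
    return p95(samples_ms) < budget_ms


# Illustrative samples only: nineteen fast responses and one slow outlier.
samples = [12.0] * 19 + [480.0]
print(p95(samples), meets_budget(samples, budget_ms=50.0))
```

With twenty samples the rank is 19, so a single outlier does not fail the budget while two would.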
T07 — Documentation and consumer guidance
id: STATE-WP-0066-T07
status: done
priority: low
state_hub_task_id: "aee0f349-761f-42b7-82b2-6a319936a68e"
Update:
- README.md: /state/summary caching section (revision + stale headers).
- dashboard/src/docs/live-data.md: which pages need summary vs overview.
- mcp_server/TOOLS.md: note cache behaviour for get_state_summary.
- SCOPE.md one-liner if summary caching is an operational concern.
Clarify for operators:
- Infrastructure probes → /state/health.
- Full snapshot → /state/summary (cached; ?refresh=true to force).
- Dashboard overview → /state/overview (already lighter).
Done when docs match implemented headers and invalidation semantics.
T08 — Verify through ops-bridge path
id: STATE-WP-0066-T08
status: done
priority: medium
state_hub_task_id: "b4e967ab-57ba-4f0f-9161-373b383248b5"
End-to-end check from Railiance01 (or documented manual runbook):
- bridge check state-hub-railiance01 healthy.
- Through node 127.0.0.1:18080: summary returns 200 with hit-revision or stale on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
- Confirm actcore-state-hub-bridge readiness stays 1/1 (health probe, not summary).
- Log findings as State Hub progress event when verified.
Done when bridge-path latency is recorded and no readiness regressions over a 30-minute observation window.
Acceptance criteria
- Quiet hub: consecutive /state/summary requests hit revision cache, not full rebuild.
- Data change: next request serves stale snapshot immediately, then fresh data after background refresh.
- Progress-only change: core section not rebuilt (T04).
- Mutation → visible on next GET without manual API restart.
- No breaking changes to StateSummary JSON for existing consumers.
- Bridge readiness remains on /state/health.
Notes for implementer
- AsyncSession does not support concurrent operations on one session, so background refresh must use async_session_factory() (same as workplan index).
- generated_at on cache hit should reflect build time, not request time.
- Consider extracting _build_state_summary(session) from the route handler to simplify testing (may already be partially inline).
- If revision query itself becomes hot, cache the revision for ~1s (micro-TTL) as a second-level optimization, only if profiling warrants it.