state-hub/workplans/STATE-WP-0066-state-summary-revision-cache.md
tegwick 94c7817339 feat(summary): revision-gated cache with stale-while-revalidate (STATE-WP-0066)
Replace the fixed 15s TTL on GET /state/summary with per-table revision
watermarks, stale-while-revalidate background refresh, and a progress-tail
section split. SQLAlchemy write hooks invalidate core or progress sections
on mutation. Adds tests, benchmark script, and operator docs.
2026-06-22 16:27:32 +02:00


id: STATE-WP-0066
type: workplan
title: State summary revision cache and stale-while-revalidate
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: 2026-06-22
updated: 2026-06-22
finished: 2026-06-22
state_hub_workstream_id: f738cd77-6b8b-40e5-b348-dc304c7821f1

STATE-WP-0066 — State summary revision cache and stale-while-revalidate

Summary

Upgrade /state/summary caching from a fixed 15-second TTL to a revision-gated cache: serve the cached snapshot when underlying hub data has not changed, and use stale-while-revalidate when it has. This extends STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel latency spikes that made /state/summary a poor readiness-probe target on Railiance01.

/state/health remains the correct probe endpoint for infrastructure checks. This workplan optimizes summary for dashboard, MCP, and activity-core consumers that legitimately need the full snapshot.

Problem

Current behaviour (api/routers/state.py):

  • In-process TTL cache (_SUMMARY_TTL = 15s) — expires on wall clock, not data change.
  • A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and Python-side SBOM licence scanning (~200ms+ locally; worse through ops-bridge → WSL).
  • ETag middleware helps repeat identical responses (304) but still pays full rebuild cost on miss.
  • progress_events is append-only; hourly sweeps and agent notes invalidate freshness expectations for recent_progress even when totals and workplan rows are unchanged.

Observed impact:

  • Bridge readiness probe timeouts when summary was used (fixed in activity-core 40fa851; do not regress).
  • Dashboard workplans page and MCP get_state_summary poll summary at 15 to 60 second intervals, causing unnecessary rebuilds during quiet periods.
  • Concurrent cache misses contend on a single AsyncSession (no gather).

Goal

  1. Revision watermark — one cheap query (or small query set) determines whether cached summary is still valid.
  2. Stale-while-revalidate — return last good snapshot immediately on revision change; rebuild in background.
  3. Section-level freshness — split recent_progress from the stable core so sweep traffic does not force full rebuilds.
  4. Eager invalidation — mutation routes clear or bump revision so writes are visible without waiting for the next poll.
  5. Observable cache — response headers document hit/miss/stale/revision.

Design sketch

Revision fingerprint

Before a full rebuild, compute a SummaryRevision from indexed MAX(updated_at) / MAX(created_at) across tables that feed summary:

Signal | Tables / columns
Core entities | topics, workplans, tasks, decisions, workplan_dependencies → updated_at
Append-only events | progress_events → created_at
Portfolio | managed_repos, contributions, capability_requests → updated_at
Domain rollup | domains, extension_points, technical_debt → updated_at
SBOM slice | sbom_entries (or latest sbom_snapshots) → relevant timestamp

Store revision (ISO timestamp or hash of per-table maxima) alongside cached StateSummary. If incoming revision matches cached revision → return cache (X-StateHub-Cache: hit-revision).

Reuse the fingerprint pattern from api/doi_engine.py (compute_fingerprint).
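
As an illustration of the "hash of per-table maxima" option, here is a minimal sketch of a revision value object. It does not reproduce compute_fingerprint from api/doi_engine.py, and the field names, the from_maxima constructor, and the 16-character truncated SHA-256 digest are all assumptions made for this example.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint (field names are assumptions)."""

    maxima: Mapping[str, Optional[datetime]]
    fingerprint: str

    @classmethod
    def from_maxima(cls, maxima: Mapping[str, Optional[datetime]]) -> "SummaryRevision":
        # Sort by table name so the fingerprint does not depend on dict order.
        # An empty table has no maximum; encode it with a fixed marker.
        parts = [
            f"{table}={ts.isoformat() if ts is not None else 'empty'}"
            for table, ts in sorted(maxima.items())
        ]
        digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
        # Truncation keeps the header value short; the length is an arbitrary choice.
        return cls(maxima=dict(maxima), fingerprint=digest[:16])


if __name__ == "__main__":
    now = datetime(2026, 6, 22, 14, 0, tzinfo=timezone.utc)
    revision = SummaryRevision.from_maxima({"tasks": now, "progress_events": None})
    print(revision.fingerprint)
```

Comparing two fingerprint strings is then enough to decide whether the cached StateSummary is still valid.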

Stale-while-revalidate

Mirror api/routers/workstreams.py _workplan_index behaviour:

  • Revision unchanged → return cache (fast path).
  • Revision changed, cache present → return stale cache, start background rebuild task (X-StateHub-Cache: stale).
  • No cache → await rebuild (X-StateHub-Cache: miss).
  • Respect Cache-Control: no-cache and a ?refresh=true query param to force synchronous rebuild (for tests and operators).
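
The four cases above reduce to a small pure function, sketched here under assumptions: CacheOutcome and decide_cache_outcome are hypothetical names, and reporting a forced rebuild as "miss" is a guess, since the workplan does not say which header value a forced refresh should carry.

```python
from enum import Enum
from typing import Optional


class CacheOutcome(str, Enum):
    """Values mirror the X-StateHub-Cache header values in this workplan."""

    HIT_REVISION = "hit-revision"
    STALE = "stale"
    MISS = "miss"


def decide_cache_outcome(
    cached_revision: Optional[str],
    current_revision: str,
    force_refresh: bool = False,
) -> CacheOutcome:
    # No cache yet, or the caller demanded a synchronous rebuild.
    # Labelling a forced rebuild as "miss" is an assumption.
    if force_refresh or cached_revision is None:
        return CacheOutcome.MISS
    # Revision unchanged: fast path, serve the cached snapshot.
    if cached_revision == current_revision:
        return CacheOutcome.HIT_REVISION
    # Revision changed but a snapshot exists: serve it and refresh in background.
    return CacheOutcome.STALE
```

Keeping the decision free of I/O makes it straightforward to unit-test separately from the route handler.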

Section split (phase 2 within this workplan)

Section key | Invalidation
core | topics, workplans, tasks, decisions, deps, domains, totals, next_steps
progress_tail | MAX(progress_events.created_at) only

Serve merged StateSummary; rebuild only the section whose revision changed.
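
A minimal sketch of per-section rebuilds follows. SectionCache and its builder callables are hypothetical, and the rebuild is synchronous for brevity, so this shows only the "rebuild the section whose revision changed" rule, not the background refresh.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class SectionCache:
    """Caches each section's payload next to the revision it was built from."""

    builders: Dict[str, Callable[[], Dict[str, Any]]]
    revisions: Dict[str, str] = field(default_factory=dict)
    payloads: Dict[str, Dict[str, Any]] = field(default_factory=dict)

    def serve(self, current: Dict[str, str]) -> Dict[str, Any]:
        merged: Dict[str, Any] = {}
        for section, builder in self.builders.items():
            # Rebuild only when this section's revision moved.
            if self.revisions.get(section) != current[section]:
                self.payloads[section] = builder()
                self.revisions[section] = current[section]
            # Merge section payloads into one summary-shaped dict.
            merged.update(self.payloads[section])
        return merged


if __name__ == "__main__":
    cache = SectionCache(
        builders={
            "core": lambda: {"totals": {"tasks": 3}},
            "progress_tail": lambda: {"recent_progress": ["sweep ok"]},
        }
    )
    print(cache.serve({"core": "r1", "progress_tail": "p1"}))
```

With this shape, a progress-only revision change calls only the progress_tail builder, which a test can verify with a call counter or spy.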

Eager invalidation

Add invalidate_summary_cache() (or revision bump) called from:

  • POST /progress/
  • PATCH /tasks/{id}, POST /tasks/
  • PATCH /decisions/{id}, POST /decisions/
  • PATCH /workplans/{id}, POST /workplans/ (and legacy /workstreams/)
  • POST /workplan-dependencies/ (and legacy alias)

Keep invalidation in-process (single uvicorn worker assumption). Document multi-worker limitation; defer shared revision store unless needed.

Response metadata

Extend headers (and optionally _meta in schema — only if consumers need it):

X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision: <iso-or-hash>
X-StateHub-Elapsed-Ms: <existing>

Preserve truthful generated_at on cache hits (when the snapshot was built).
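
A small helper can keep the three headers consistent across code paths. This is a sketch: summary_cache_headers is a hypothetical name, and the one-decimal formatting of the elapsed time is an assumption, since the existing X-StateHub-Elapsed-Ms format is not shown here.

```python
from typing import Dict

ALLOWED_OUTCOMES = ("hit-revision", "stale", "miss")


def summary_cache_headers(outcome: str, revision: str, elapsed_ms: float) -> Dict[str, str]:
    """Build the cache-related response headers for GET /state/summary."""
    if outcome not in ALLOWED_OUTCOMES:
        raise ValueError(f"unknown cache outcome: {outcome!r}")
    return {
        "X-StateHub-Cache": outcome,
        "X-StateHub-Revision": revision,
        # Formatting is an assumption; match whatever the existing header uses.
        "X-StateHub-Elapsed-Ms": f"{elapsed_ms:.1f}",
    }
```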

Out of scope

  • Redis or external cache layer.
  • Multi-worker shared cache (document only).
  • Replacing /state/overview (already the dashboard fast path).
  • Changing summary response shape for existing consumers (additive _meta only if needed).
  • PostgreSQL materialized views or NOTIFY/LISTEN.

Dependencies

  • STATE-WP-0056 T07 (done) — baseline cache-miss profiling.
  • activity-core bridge readiness fix (40fa851) — health probe; independent of this workplan but motivates summary optimization for other callers.

T01 — Revision fingerprint module

id: STATE-WP-0066-T01
status: done
priority: high
state_hub_task_id: "8ee836ec-048c-44f6-b16e-e7454d07371a"

Extract summary cache logic from api/routers/state.py into a dedicated module (e.g. api/services/summary_cache.py).

Deliverables:

  • SummaryRevision dataclass: per-table maxima + combined fingerprint string.
  • async def fetch_summary_revision(session) -> SummaryRevision — one or few SQL queries using existing indexes; target <20ms on current data volume.
  • Unit tests with mocked session rows proving fingerprint changes when any contributing table changes.
  • Document covered tables and why each is included (flow-engine inputs, next_steps, domain rollups, licence scan).

Done when revision fetch is tested in isolation and profiled under local DB.
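
The "one or few SQL queries" deliverable can be met by combining per-table MAX() lookups into a single UNION ALL statement. In the sketch below the standard-library sqlite3 module stands in for the real AsyncSession so the example runs on its own. WATERMARK_SOURCES covers only three of the tables named in the design sketch, and fetch_table_maxima is a hypothetical name.

```python
import sqlite3
from typing import Dict, Optional

# (table, timestamp column) pairs; a subset of the tables in the design sketch.
WATERMARK_SOURCES = (
    ("tasks", "updated_at"),
    ("workplans", "updated_at"),
    ("progress_events", "created_at"),
)


def fetch_table_maxima(conn: sqlite3.Connection) -> Dict[str, Optional[str]]:
    """Return the newest timestamp per table using one round trip."""
    # Identifiers come from the constant above, never from user input.
    query = " UNION ALL ".join(
        f"SELECT '{table}' AS source, MAX({column}) AS watermark FROM {table}"
        for table, column in WATERMARK_SOURCES
    )
    # MAX() over an empty table yields one row with NULL, which maps to None.
    return {source: watermark for source, watermark in conn.execute(query)}


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, updated_at TEXT)")
    conn.execute("CREATE TABLE workplans (id INTEGER PRIMARY KEY, updated_at TEXT)")
    conn.execute("CREATE TABLE progress_events (id INTEGER PRIMARY KEY, created_at TEXT)")
    conn.execute("INSERT INTO tasks (updated_at) VALUES ('2026-06-22T10:00:00')")
    print(fetch_table_maxima(conn))
```

Each MAX() can be answered from an index on the timestamp column when one exists, which is what keeps the check cheap relative to a full rebuild.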

T02 — Revision-gated cache (replace TTL-only path)

id: STATE-WP-0066-T02
status: done
priority: high
state_hub_task_id: "c5fc8ab8-5fc8-463b-9ae7-c304a7e0383e"

Wire revision check into GET /state/summary:

  • If revision matches cached revision → return cache regardless of age.
  • Remove or demote _SUMMARY_TTL to a safety cap only (e.g. 5 min max stale age if background refresh fails).
  • Add X-StateHub-Cache: hit-revision and X-StateHub-Revision headers.
  • Honour Cache-Control: no-cache and ?refresh=true for forced rebuild.
  • Update tests/conftest.py cache reset helper for new module globals.

Done when repeated summary requests with unchanged DB show hit-revision and skip the heavy query path (assert via mock or elapsed-ms header threshold).
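
The forced-rebuild and safety-cap rules can be expressed as two small predicates. This is a sketch with hypothetical names; the 300-second constant reuses the "5 min" example above and is not a confirmed setting.

```python
from typing import Mapping

MAX_STALE_AGE_SECONDS = 300.0  # the "5 min" example cap; not a confirmed value


def wants_forced_rebuild(headers: Mapping[str, str], query: Mapping[str, str]) -> bool:
    """True when the caller sent Cache-Control: no-cache or ?refresh=true."""
    cache_control = ""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "cache-control":
            cache_control = value
    directives = {part.strip().lower() for part in cache_control.split(",")}
    refresh = query.get("refresh", "").strip().lower()
    return "no-cache" in directives or refresh == "true"


def exceeds_safety_cap(built_at: float, now: float, cap: float = MAX_STALE_AGE_SECONDS) -> bool:
    """True when a snapshot is older than the cap and must not be served as-is."""
    return (now - built_at) > cap
```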

T03 — Stale-while-revalidate background refresh

id: STATE-WP-0066-T03
status: done
priority: high
state_hub_task_id: "aa079e17-6103-4539-878d-b451035e5f8a"

When revision differs but a cached snapshot exists:

  • Return stale snapshot immediately (X-StateHub-Cache: stale).
  • Start a single background asyncio task to rebuild (dedupe concurrent refresh — same pattern as _INDEX_REFRESH_TASK in workstreams router).
  • On refresh completion, atomically swap cache + revision.
  • On refresh failure, retain stale cache; set _SUMMARY_LAST_ERROR for optional diagnostics header.
  • Cold start (no cache) still blocks until first build completes.

Done when a revision bump serves stale data in <50ms while background rebuild runs; second request after rebuild shows hit-revision.
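
The refresh lifecycle described in this task can be sketched with plain asyncio. This does not reproduce the _INDEX_REFRESH_TASK code from the workstreams router. SummaryCache and its rebuild callable are hypothetical, and a real rebuild would open its own database session rather than share the request's session.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple

Entry = Tuple[str, Dict[str, Any]]  # (revision, snapshot)


class SummaryCache:
    def __init__(self, rebuild: Callable[[], Awaitable[Entry]]) -> None:
        self._rebuild = rebuild
        self._entry: Optional[Entry] = None
        self._refresh_task: Optional[asyncio.Task] = None
        self.last_error: Optional[str] = None

    async def _refresh(self) -> None:
        try:
            entry = await self._rebuild()
        except Exception as exc:
            # Keep serving the previous snapshot; record the failure for diagnostics.
            self.last_error = repr(exc)
            return
        # One assignment swaps snapshot and revision together.
        self._entry = entry
        self.last_error = None

    async def get(self, current_revision: str) -> Tuple[str, Dict[str, Any]]:
        if self._entry is None:
            # Cold start: block until the first build completes.
            await self._refresh()
            if self._entry is None:
                raise RuntimeError(f"summary rebuild failed: {self.last_error}")
            return "miss", self._entry[1]
        if self._entry[0] == current_revision:
            return "hit-revision", self._entry[1]
        # Start at most one background refresh at a time.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh())
        return "stale", self._entry[1]
```

Storing the task on the instance also keeps a strong reference to it, so it is not garbage-collected before it finishes.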

T04 — Section-level cache for recent_progress

id: STATE-WP-0066-T04
status: done
priority: medium
state_hub_task_id: "99b14a10-ff1b-4609-a62b-5da19b79be68"

Split cache into core and progress_tail sections:

  • progress_tail rebuild: single query ORDER BY created_at DESC LIMIT 20.
  • core rebuild: existing summary path minus recent-progress fetch.
  • Merge at serve time into StateSummary.
  • Revision mismatch on progress only → rebuild progress section only; reuse cached core.

Done when inserting a progress event refreshes recent_progress without re-running domain/SBOM/flow-engine work (verify via query count or spy).

T05 — Eager invalidation on mutation routes

id: STATE-WP-0066-T05
status: done
priority: medium
state_hub_task_id: "0df9c1c2-6edf-4be3-bee8-b75e0d24fc02"

Call invalidate_summary_cache() (or equivalent revision bump) from write paths listed in the design sketch.

  • Invalidation must be synchronous before response returns (so the next GET sees fresh revision).
  • Progress POST invalidates at least the progress_tail section.
  • Task/workplan/decision PATCH invalidates core.
  • Add focused tests: mutate → GET summary reflects change without ?refresh.

Done when mutation routes trigger invalidation and tests pass.

T06 — Benchmark and regression tests

id: STATE-WP-0066-T06
status: done
priority: high
state_hub_task_id: "5f68f999-d5ca-40df-a054-aaf762837342"

Prove cache effectiveness under realistic load:

  • Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers /state/summary with unchanged revision — p95 < 50ms.
  • Cache-miss path (forced ?refresh=true) stays < 500ms on current local data volume (STATE-WP-0056 T07 bar).
  • MCP smoke tests still pass (tests/test_mcp_smoke.py).
  • Router shape tests still pass (tests/test_routers_core.py).
  • Add tests for stale-while-revalidate and section-split behaviour.

Done when benchmark script is documented in workplan notes or Makefile target and CI tests are green.
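
The p95 thresholds above need a percentile calculation. The sketch below is not the benchmark script added by this workplan: measure_latencies_ms and p95 are hypothetical helpers, and the callable passed in would wrap an HTTP GET against /state/summary in a real run.

```python
import statistics
import time
from typing import Callable, List


def measure_latencies_ms(call: Callable[[], object], iterations: int = 200) -> List[float]:
    """Invoke `call` repeatedly and return each wall-clock duration in milliseconds."""
    samples: List[float] = []
    for _ in range(iterations):
        started = time.perf_counter()
        call()
        samples.append((time.perf_counter() - started) * 1000.0)
    return samples


def p95(samples: List[float]) -> float:
    """95th percentile of the samples; needs at least two data points."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples, n=100)[94]


if __name__ == "__main__":
    # Stand-in workload; replace with a request to the running API.
    latencies = measure_latencies_ms(lambda: sum(range(1000)))
    print(f"p95 = {p95(latencies):.3f} ms over {len(latencies)} calls")
```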

T07 — Documentation and consumer guidance

id: STATE-WP-0066-T07
status: done
priority: low
state_hub_task_id: "aee0f349-761f-42b7-82b2-6a319936a68e"

Update:

  • README.md: /state/summary caching section (revision + stale headers).
  • dashboard/src/docs/live-data.md — which pages need summary vs overview.
  • mcp_server/TOOLS.md — note cache behaviour for get_state_summary.
  • SCOPE.md one-liner if summary caching is an operational concern.

Clarify for operators:

  • Infrastructure probes → /state/health.
  • Full snapshot → /state/summary (cached; ?refresh=true to force).
  • Dashboard overview → /state/overview (already lighter).

Done when docs match implemented headers and invalidation semantics.

T08 — Verify through ops-bridge path

id: STATE-WP-0066-T08
status: done
priority: medium
state_hub_task_id: "b4e967ab-57ba-4f0f-9161-373b383248b5"

End-to-end check from Railiance01 (or documented manual runbook):

  • bridge check state-hub-railiance01 reports healthy.
  • Through node 127.0.0.1:18080: summary returns 200 with hit-revision or stale on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
  • Confirm actcore-state-hub-bridge readiness stays 1/1 (health probe, not summary).
  • Log findings as State Hub progress event when verified.

Done when bridge-path latency is recorded and no readiness regressions over a 30-minute observation window.

Acceptance criteria

  • Quiet hub: consecutive /state/summary requests hit revision cache, not full rebuild.
  • Data change: next request serves stale snapshot immediately, then fresh data after background refresh.
  • Progress-only change: core section not rebuilt (T04).
  • Mutation → visible on next GET without manual API restart.
  • No breaking changes to StateSummary JSON for existing consumers.
  • Bridge readiness remains on /state/health.

Notes for implementer

  • AsyncSession does not support concurrent operations on one session — background refresh must use async_session_factory() (same as workplan index).
  • generated_at on cache hit should reflect build time, not request time.
  • Consider extracting _build_state_summary(session) from the route handler to simplify testing (may already be partially inline).
  • If revision query itself becomes hot, cache the revision for ~1s (micro-TTL) as a second-level optimization — only if profiling warrants it.