---
id: STATE-WP-0066
type: workplan
title: "State summary revision cache and stale-while-revalidate"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
finished: "2026-06-22"
state_hub_workstream_id: "f738cd77-6b8b-40e5-b348-dc304c7821f1"
---

# STATE-WP-0066 — State summary revision cache and stale-while-revalidate

## Summary

Upgrade `/state/summary` caching from a fixed 15-second TTL to a **revision-gated** cache: serve the cached snapshot when underlying hub data has not changed, and use **stale-while-revalidate** when it has. This extends STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel latency spikes that made `/state/summary` a poor readiness-probe target on Railiance01.

`/state/health` remains the correct probe endpoint for infrastructure checks. This workplan optimizes summary for dashboard, MCP, and activity-core consumers that legitimately need the full snapshot.

## Problem

Current behaviour (`api/routers/state.py`):

- In-process TTL cache (`_SUMMARY_TTL = 15s`) — expires on wall clock, not data change.
- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and Python-side SBOM licence scanning (~200ms+ locally; worse through ops-bridge → WSL).
- ETag middleware helps repeat identical responses (304) but still pays full rebuild cost on miss.
- `progress_events` is append-only; hourly sweeps and agent notes invalidate freshness expectations for `recent_progress` even when totals and workplan rows are unchanged.

Observed impact:

- Bridge readiness probe timeouts when summary was used (fixed in activity-core `40fa851`; do not regress).
- Dashboard workplans page and MCP `get_state_summary` poll summary on 15–60s intervals — unnecessary rebuilds during quiet periods.
- Concurrent cache misses contend on a single `AsyncSession` (no gather).

## Goal

1. **Revision watermark** — one cheap query (or small query set) determines whether cached summary is still valid.
2. **Stale-while-revalidate** — return last good snapshot immediately on revision change; rebuild in background.
3. **Section-level freshness** — split `recent_progress` from the stable core so sweep traffic does not force full rebuilds.
4. **Eager invalidation** — mutation routes clear or bump revision so writes are visible without waiting for the next poll.
5. **Observable cache** — response headers document hit/miss/stale/revision.

## Design sketch

### Revision fingerprint

Before a full rebuild, compute a `SummaryRevision` from indexed `MAX(updated_at)` / `MAX(created_at)` across tables that feed summary:

| Signal | Tables / columns |
|--------|------------------|
| Core entities | `topics`, `workplans`, `tasks`, `decisions`, `workplan_dependencies` → `updated_at` |
| Append-only events | `progress_events` → `created_at` |
| Portfolio | `managed_repos`, `contributions`, `capability_requests` → `updated_at` |
| Domain rollup | `domains`, `extension_points`, `technical_debt` → `updated_at` |
| SBOM slice | `sbom_entries` (or latest `sbom_snapshots`) → relevant timestamp |

Store `revision` (ISO timestamp or hash of per-table maxima) alongside cached `StateSummary`. If incoming revision matches cached revision → return cache (`X-StateHub-Cache: hit-revision`).

Reuse the fingerprint pattern from `api/doi_engine.py` (`compute_fingerprint`).

### Stale-while-revalidate

Mirror `api/routers/workstreams.py` `_workplan_index` behaviour:

- Revision unchanged → return cache (fast path).
- Revision changed, cache present → return stale cache, start background rebuild task (`X-StateHub-Cache: stale`).
- No cache → await rebuild (`X-StateHub-Cache: miss`).
- Respect `Cache-Control: no-cache` and a `?refresh=true` query param to force synchronous rebuild (for tests and operators).
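
The revision fingerprint and the three cache outcomes described above can be sketched as follows. This is a minimal illustration, not the code in `api/doi_engine.py` or `api/routers/state.py`: the field layout of `SummaryRevision`, the helper names `make_revision` and `cache_status`, the truncated SHA-256 digest, and the choice to report a forced refresh as `miss` are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint (assumed shape)."""

    maxima: Mapping[str, str]  # table name -> ISO timestamp, "" if table is empty
    fingerprint: str


def make_revision(maxima: Mapping[str, Optional[str]]) -> SummaryRevision:
    """Hash per-table MAX(updated_at) / MAX(created_at) values (hypothetical helper)."""
    # Sort by table name so the fingerprint does not depend on dict ordering.
    normalized = {table: (ts or "") for table, ts in sorted(maxima.items())}
    payload = "|".join(f"{table}={ts}" for table, ts in normalized.items())
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return SummaryRevision(maxima=normalized, fingerprint=digest)


def cache_status(
    cached: Optional[SummaryRevision],
    current: SummaryRevision,
    force_refresh: bool = False,
) -> str:
    """Pick the X-StateHub-Cache value for one request."""
    if cached is None or force_refresh:
        # Cold start, or Cache-Control: no-cache / ?refresh=true.
        # Labelling a forced rebuild as "miss" is an assumption of this sketch.
        return "miss"
    if cached.fingerprint == current.fingerprint:
        return "hit-revision"  # fast path: serve cache regardless of age
    return "stale"  # serve cached snapshot, rebuild in background
```

Keeping the per-table maxima next to the combined fingerprint makes it possible to see which table moved when a revision changes, which the later section split relies on.
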
### Section split (phase 2 within this workplan)

| Section key | Invalidation |
|-------------|--------------|
| `core` | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
| `progress_tail` | `MAX(progress_events.created_at)` only |

Serve merged `StateSummary`; rebuild only the section whose revision changed.

### Eager invalidation

Add `invalidate_summary_cache()` (or revision bump) called from:

- `POST /progress/`
- `PATCH /tasks/{id}`, `POST /tasks/`
- `PATCH /decisions/{id}`, `POST /decisions/`
- `PATCH /workplans/{id}`, `POST /workplans/` (and legacy `/workstreams/`)
- `POST /workplan-dependencies/` (and legacy alias)

Keep invalidation in-process (single uvicorn worker assumption). Document multi-worker limitation; defer shared revision store unless needed.

### Response metadata

Extend headers (and optionally `_meta` in schema — only if consumers need it):

```
X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision:
X-StateHub-Elapsed-Ms:
```

Preserve truthful `generated_at` on cache hits (when the snapshot was built).

## Out of scope

- Redis or external cache layer.
- Multi-worker shared cache (document only).
- Replacing `/state/overview` (already the dashboard fast path).
- Changing summary response shape for existing consumers (additive `_meta` only if needed).
- PostgreSQL materialized views or NOTIFY/LISTEN.

## Dependencies

- STATE-WP-0056 T07 (done) — baseline cache-miss profiling.
- activity-core bridge readiness fix (`40fa851`) — health probe; independent of this workplan but motivates summary optimization for other callers.

## T01 — Revision fingerprint module

```task
id: STATE-WP-0066-T01
status: done
priority: high
state_hub_task_id: "8ee836ec-048c-44f6-b16e-e7454d07371a"
```

Extract summary cache logic from `api/routers/state.py` into a dedicated module (e.g. `api/services/summary_cache.py`).

Deliverables:

- `SummaryRevision` dataclass: per-table maxima + combined fingerprint string.
- `async def fetch_summary_revision(session) -> SummaryRevision` — one or few SQL queries using existing indexes; target <20ms on current data volume.
- Unit tests with mocked session rows proving fingerprint changes when any contributing table changes.
- Document covered tables and why each is included (flow-engine inputs, `next_steps`, domain rollups, licence scan).

Done when revision fetch is tested in isolation and profiled under local DB.

## T02 — Revision-gated cache (replace TTL-only path)

```task
id: STATE-WP-0066-T02
status: done
priority: high
state_hub_task_id: "c5fc8ab8-5fc8-463b-9ae7-c304a7e0383e"
```

Wire revision check into `GET /state/summary`:

- If revision matches cached revision → return cache regardless of age.
- Remove or demote `_SUMMARY_TTL` to a safety cap only (e.g. 5 min max stale age if background refresh fails).
- Add `X-StateHub-Cache: hit-revision` and `X-StateHub-Revision` headers.
- Honour `Cache-Control: no-cache` and `?refresh=true` for forced rebuild.
- Update `tests/conftest.py` cache reset helper for new module globals.

Done when repeated summary requests with unchanged DB show `hit-revision` and skip the heavy query path (assert via mock or elapsed-ms header threshold).

## T03 — Stale-while-revalidate background refresh

```task
id: STATE-WP-0066-T03
status: done
priority: high
state_hub_task_id: "aa079e17-6103-4539-878d-b451035e5f8a"
```

When revision differs but a cached snapshot exists:

- Return stale snapshot immediately (`X-StateHub-Cache: stale`).
- Start a single background `asyncio` task to rebuild (dedupe concurrent refresh — same pattern as `_INDEX_REFRESH_TASK` in workstreams router).
- On refresh completion, atomically swap cache + revision.
- On refresh failure, retain stale cache; set `_SUMMARY_LAST_ERROR` for optional diagnostics header.
- Cold start (no cache) still blocks until first build completes.

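
The refresh rules listed above can be sketched with a single deduplicated `asyncio` task. This is an illustration under assumptions, not the `_INDEX_REFRESH_TASK` code in the workstreams router: the `SummaryCache` class, its attribute names, and the `rebuild` callable (which returns a snapshot together with the revision it was built for) are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple


class SummaryCache:
    """Stale-while-revalidate holder for one snapshot (hypothetical class)."""

    def __init__(self, rebuild: Callable[[], Awaitable[Tuple[Any, str]]]) -> None:
        self._rebuild = rebuild  # assumed to open its own DB session
        self.snapshot: Optional[Any] = None
        self.revision: Optional[str] = None
        self.last_error: Optional[str] = None  # stands in for _SUMMARY_LAST_ERROR
        self._refresh_task: Optional[asyncio.Task] = None

    async def _refresh(self) -> None:
        try:
            snapshot, revision = await self._rebuild()
        except Exception as exc:
            # Keep serving the previous snapshot; record the failure only.
            self.last_error = repr(exc)
            return
        # No await between these assignments, so other coroutines on the
        # same event loop never observe a snapshot with the wrong revision.
        self.snapshot, self.revision = snapshot, revision
        self.last_error = None

    async def get(self, current_revision: str) -> Tuple[Any, str]:
        """Return (snapshot, X-StateHub-Cache value)."""
        if self.snapshot is None:
            # Cold start: block until the first build completes.
            await self._refresh()
            if self.snapshot is None:
                raise RuntimeError(f"initial summary build failed: {self.last_error}")
            return self.snapshot, "miss"
        if self.revision == current_revision:
            return self.snapshot, "hit-revision"
        # Revision changed: start at most one background rebuild.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh())
        return self.snapshot, "stale"
```

The sketch does not deduplicate concurrent cold-start builds; a lock or shared future around the first build would be one way to add that if it matters in practice.
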

Done when a revision bump serves stale data in <50ms while background rebuild runs; second request after rebuild shows `hit-revision`.

## T04 — Section-level cache for `recent_progress`

```task
id: STATE-WP-0066-T04
status: done
priority: medium
state_hub_task_id: "99b14a10-ff1b-4609-a62b-5da19b79be68"
```

Split cache into `core` and `progress_tail` sections:

- `progress_tail` rebuild: single query `ORDER BY created_at DESC LIMIT 20`.
- `core` rebuild: existing summary path minus recent-progress fetch.
- Merge at serve time into `StateSummary`.
- Revision mismatch on progress only → rebuild progress section only; reuse cached core.

Done when inserting a progress event refreshes `recent_progress` without re-running domain/SBOM/flow-engine work (verify via query count or spy).

## T05 — Eager invalidation on mutation routes

```task
id: STATE-WP-0066-T05
status: done
priority: medium
state_hub_task_id: "0df9c1c2-6edf-4be3-bee8-b75e0d24fc02"
```

Call `invalidate_summary_cache()` (or equivalent revision bump) from write paths listed in the design sketch.

- Invalidation must be synchronous before response returns (so the next GET sees fresh revision).
- Progress `POST` invalidates at least the `progress_tail` section.
- Task/workplan/decision `PATCH` invalidates `core`.
- Add focused tests: mutate → GET summary reflects change without `?refresh`.

Done when mutation routes trigger invalidation and tests pass.

## T06 — Benchmark and regression tests

```task
id: STATE-WP-0066-T06
status: done
priority: high
state_hub_task_id: "5f68f999-d5ca-40df-a054-aaf762837342"
```

Prove cache effectiveness under realistic load:

- Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers `/state/summary` with unchanged revision — p95 < 50ms.
- Cache-miss path (forced `?refresh=true`) stays < 500ms on current local data volume (STATE-WP-0056 T07 bar).
- MCP smoke tests still pass (`tests/test_mcp_smoke.py`).
- Router shape tests still pass (`tests/test_routers_core.py`).
- Add tests for stale-while-revalidate and section-split behaviour.

Done when benchmark script is documented in workplan notes or Makefile target and CI tests are green.

## T07 — Documentation and consumer guidance

```task
id: STATE-WP-0066-T07
status: done
priority: low
state_hub_task_id: "aee0f349-761f-42b7-82b2-6a319936a68e"
```

Update:

- `README.md` — `/state/summary` caching section (revision + stale headers).
- `dashboard/src/docs/live-data.md` — which pages need summary vs overview.
- `mcp_server/TOOLS.md` — note cache behaviour for `get_state_summary`.
- `SCOPE.md` one-liner if summary caching is an operational concern.

Clarify for operators:

- Infrastructure probes → `/state/health`.
- Full snapshot → `/state/summary` (cached; `?refresh=true` to force).
- Dashboard overview → `/state/overview` (already lighter).

Done when docs match implemented headers and invalidation semantics.

## T08 — Verify through ops-bridge path

```task
id: STATE-WP-0066-T08
status: done
priority: medium
state_hub_task_id: "b4e967ab-57ba-4f0f-9161-373b383248b5"
```

End-to-end check from Railiance01 (or documented manual runbook):

- `bridge check state-hub-railiance01` healthy.
- Through node `127.0.0.1:18080`: summary returns 200 with `hit-revision` or `stale` on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
- Confirm `actcore-state-hub-bridge` readiness stays 1/1 (health probe, not summary).
- Log findings as State Hub progress event when verified.

Done when bridge-path latency is recorded and no readiness regressions over a 30-minute observation window.

## Acceptance criteria

- Quiet hub: consecutive `/state/summary` requests hit revision cache, not full rebuild.
- Data change: next request serves stale snapshot immediately, then fresh data after background refresh.
- Progress-only change: core section not rebuilt (T04).
- Mutation → visible on next GET without manual API restart.
- No breaking changes to `StateSummary` JSON for existing consumers.
- Bridge readiness remains on `/state/health`.

## Notes for implementer

- `AsyncSession` does not support concurrent operations on one session — background refresh must use `async_session_factory()` (same as workplan index).
- `generated_at` on cache hit should reflect build time, not request time.
- Consider extracting `_build_state_summary(session)` from the route handler to simplify testing (may already be partially inline).
- If revision query itself becomes hot, cache the revision for ~1s (micro-TTL) as a second-level optimization — only if profiling warrants it.
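
The micro-TTL idea in the last note could look roughly like this. It is a sketch under assumptions: `RevisionMicroCache`, its `fetch` callable (standing in for something like `fetch_summary_revision` bound to a session), and the injectable `clock` are hypothetical names introduced for the example.

```python
import time
from typing import Awaitable, Callable, Optional


class RevisionMicroCache:
    """Remember the last revision for a short window (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], Awaitable[str]],
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests do not need to sleep
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    async def get(self) -> str:
        now = self._clock()
        if self._value is not None and now - self._fetched_at < self._ttl:
            return self._value  # still inside the micro-TTL window
        self._value = await self._fetch()
        self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Drop the remembered revision so the next get() hits the DB again."""
        self._value = None
```

If this layer is added, eager invalidation from mutation routes would presumably need to call `invalidate()` as well; otherwise a write could stay invisible for up to one TTL window, which would conflict with the "visible on next GET" acceptance criterion.
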