state-hub/workplans/STATE-WP-0066-state-summary-revision-cache.md
tegwick 94c7817339 feat(summary): revision-gated cache with stale-while-revalidate (STATE-WP-0066)
Replace the fixed 15s TTL on GET /state/summary with per-table revision
watermarks, stale-while-revalidate background refresh, and a progress-tail
section split. SQLAlchemy write hooks invalidate core or progress sections
on mutation. Adds tests, benchmark script, and operator docs.
2026-06-22 16:27:32 +02:00

---
id: STATE-WP-0066
type: workplan
title: "State summary revision cache and stale-while-revalidate"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
finished: "2026-06-22"
state_hub_workstream_id: "f738cd77-6b8b-40e5-b348-dc304c7821f1"
---
# STATE-WP-0066 — State summary revision cache and stale-while-revalidate
## Summary
Upgrade `/state/summary` caching from a fixed 15-second TTL to a
**revision-gated** cache: serve the cached snapshot when underlying hub data
has not changed, and use **stale-while-revalidate** when it has. This extends
STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel
latency spikes that made `/state/summary` a poor readiness-probe target on
Railiance01.
`/state/health` remains the correct probe endpoint for infrastructure checks.
This workplan optimizes summary for dashboard, MCP, and activity-core consumers
that legitimately need the full snapshot.
## Problem
Current behaviour (`api/routers/state.py`):
- In-process TTL cache (`_SUMMARY_TTL = 15s`) — expires on wall clock, not data
change.
- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and
Python-side SBOM licence scanning (~200ms+ locally; worse through
ops-bridge → WSL).
- ETag middleware lets repeated identical responses return 304, but a cache
miss still pays the full rebuild cost.
- `progress_events` is append-only; hourly sweeps and agent notes make
`recent_progress` stale even when totals and workplan rows are unchanged.

Observed impact:
- Bridge readiness probe timeouts when summary was used (fixed in
activity-core `40fa851`; do not regress).
- Dashboard workplans page and MCP `get_state_summary` poll summary on
15-60s intervals, causing unnecessary rebuilds during quiet periods.
- Concurrent cache misses contend on a single `AsyncSession`, which cannot run
queries concurrently (no `asyncio.gather`).
## Goal
1. **Revision watermark** — one cheap query (or small query set) determines
whether cached summary is still valid.
2. **Stale-while-revalidate** — return last good snapshot immediately on
revision change; rebuild in background.
3. **Section-level freshness** — split `recent_progress` from the stable core
so sweep traffic does not force full rebuilds.
4. **Eager invalidation** — mutation routes clear or bump revision so writes
are visible without waiting for the next poll.
5. **Observable cache** — response headers document hit/miss/stale/revision.
## Design sketch
### Revision fingerprint
Before a full rebuild, compute a `SummaryRevision` from indexed
`MAX(updated_at)` / `MAX(created_at)` across tables that feed summary:
| Signal | Tables / columns |
|--------|------------------|
| Core entities | `topics`, `workplans`, `tasks`, `decisions`, `workplan_dependencies` → `updated_at` |
| Append-only events | `progress_events` → `created_at` |
| Portfolio | `managed_repos`, `contributions`, `capability_requests` → `updated_at` |
| Domain rollup | `domains`, `extension_points`, `technical_debt` → `updated_at` |
| SBOM slice | `sbom_entries` (or latest `sbom_snapshots`) → relevant timestamp |
Store `revision` (ISO timestamp or hash of per-table maxima) alongside cached
`StateSummary`. If incoming revision matches cached revision → return cache
(`X-StateHub-Cache: hit-revision`).
Reuse the fingerprint pattern from `api/doi_engine.py` (`compute_fingerprint`).
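
A minimal sketch of the "hash of per-table maxima" variant, for illustration only. The field name `maxima`, the 16-character digest length, and the `'empty'` placeholder are assumptions, and this does not reproduce `compute_fingerprint` from `api/doi_engine.py`.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table timestamp maxima plus a combined fingerprint (sketch)."""

    # Hypothetical field: table name -> MAX(updated_at / created_at), or None
    # when the table has no rows.
    maxima: Dict[str, Optional[datetime]]

    @property
    def fingerprint(self) -> str:
        # Sort by table name so the digest does not depend on dict order.
        parts = [
            f"{table}={ts.isoformat() if ts else 'empty'}"
            for table, ts in sorted(self.maxima.items())
        ]
        digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
        return digest[:16]  # assumed length; short enough for a header value


stamp = datetime(2026, 6, 22, 16, 27, 32, tzinfo=timezone.utc)
revision = SummaryRevision({"tasks": stamp, "progress_events": None})
print(revision.fingerprint)
```

Hashing the sorted maxima rather than taking a single global maximum means a change in any one table alters the fingerprint, even if another table already holds a later timestamp.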
### Stale-while-revalidate
Mirror `api/routers/workstreams.py` `_workplan_index` behaviour:
- Revision unchanged → return cache (fast path).
- Revision changed, cache present → return stale cache, start background
rebuild task (`X-StateHub-Cache: stale`).
- No cache → await rebuild (`X-StateHub-Cache: miss`).
- Respect `Cache-Control: no-cache` and a `?refresh=true` query param to force
synchronous rebuild (for tests and operators).
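
The branches above can be read as one small decision function. This is a sketch: `classify_request` is a hypothetical name, and reporting a forced refresh as `miss` is an assumption based on the forced-rebuild wording above.

```python
from typing import Optional


def classify_request(
    cached_revision: Optional[str],
    current_revision: str,
    force_refresh: bool = False,
) -> str:
    """Return the X-StateHub-Cache value for one request (sketch)."""
    # No cache yet, or the caller asked for a synchronous rebuild.
    if force_refresh or cached_revision is None:
        return "miss"
    # Data unchanged since the snapshot was built: fast path.
    if cached_revision == current_revision:
        return "hit-revision"
    # Data changed: serve the old snapshot and refresh in the background.
    return "stale"


print(classify_request("rev-a", "rev-b"))
```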
### Section split (phase 2 within this workplan)
| Section key | Invalidated by changes to |
|-------------|---------------------------|
| `core` | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
| `progress_tail` | `MAX(progress_events.created_at)` only |
Serve merged `StateSummary`; rebuild only the section whose revision changed.
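
One way to decide which section to rebuild is to compare per-table maxima grouped by section. The grouping and function names below are assumptions for illustration; the real `StateSummary` is a schema object, not a plain dict.

```python
# Assumed grouping of tables into sections. Derived fields such as totals and
# next_steps belong to "core" but have no table of their own.
SECTION_TABLES = {
    "core": (
        "topics", "workplans", "tasks", "decisions",
        "workplan_dependencies", "domains",
    ),
    "progress_tail": ("progress_events",),
}


def sections_to_rebuild(cached: dict, current: dict) -> set:
    """Return section keys whose contributing table maxima changed."""
    stale = set()
    for section, tables in SECTION_TABLES.items():
        if any(cached.get(t) != current.get(t) for t in tables):
            stale.add(section)
    return stale


def merge_sections(core: dict, progress_tail: list) -> dict:
    """Combine cached sections into one summary-shaped dict at serve time."""
    return {**core, "recent_progress": progress_tail}


old = {"tasks": "2026-06-22T10:00", "progress_events": "2026-06-22T10:05"}
new = {**old, "progress_events": "2026-06-22T11:05"}
print(sections_to_rebuild(old, new))
```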
### Eager invalidation
Add `invalidate_summary_cache()` (or revision bump) called from:
- `POST /progress/`
- `PATCH /tasks/{id}`, `POST /tasks/`
- `PATCH /decisions/{id}`, `POST /decisions/`
- `PATCH /workplans/{id}`, `POST /workplans/` (and legacy `/workstreams/`)
- `POST /workplan-dependencies/` (and legacy alias)

Keep invalidation in-process (single uvicorn worker assumption). Document
multi-worker limitation; defer shared revision store unless needed.
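
A sketch of in-process invalidation using a dirty-section set. The `table_name` parameter and `take_dirty_sections` helper are hypothetical; the design above only names `invalidate_summary_cache()`. Module-level state like this is exactly what limits the approach to a single worker.

```python
# Sections flagged dirty by writes since the last rebuild (in-process only).
_DIRTY_SECTIONS = set()

# Assumption: only progress_events feeds the progress tail; all else is core.
_TABLE_SECTION = {"progress_events": "progress_tail"}


def invalidate_summary_cache(table_name: str) -> str:
    """Mark the section fed by table_name dirty; call before the write returns."""
    section = _TABLE_SECTION.get(table_name, "core")
    _DIRTY_SECTIONS.add(section)
    return section


def take_dirty_sections() -> set:
    """Return and clear the dirty set; the next GET rebuilds these sections."""
    dirty = set(_DIRTY_SECTIONS)
    _DIRTY_SECTIONS.clear()
    return dirty


invalidate_summary_cache("progress_events")
print(take_dirty_sections())
```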
### Response metadata
Extend headers (and optionally `_meta` in schema — only if consumers need it):
```
X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision: <iso-or-hash>
X-StateHub-Elapsed-Ms: <existing>
```
Preserve truthful `generated_at` on cache hits (when the snapshot was built).
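
A small helper could keep the three headers consistent across code paths. This is a sketch: `cache_headers` is a hypothetical name, and the one-decimal formatting of the elapsed value is an assumption, since the existing header format is not shown here.

```python
ALLOWED_CACHE_STATES = {"hit-revision", "stale", "miss"}


def cache_headers(status: str, revision: str, elapsed_ms: float) -> dict:
    """Build the cache-related response headers for /state/summary (sketch)."""
    if status not in ALLOWED_CACHE_STATES:
        raise ValueError(f"unknown cache status: {status!r}")
    return {
        "X-StateHub-Cache": status,
        "X-StateHub-Revision": revision,
        # Assumed format; match whatever the existing header already emits.
        "X-StateHub-Elapsed-Ms": f"{elapsed_ms:.1f}",
    }


print(cache_headers("stale", "a1b2c3d4", 3.42))
```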
## Out of scope
- Redis or external cache layer.
- Multi-worker shared cache (document only).
- Replacing `/state/overview` (already the dashboard fast path).
- Changing summary response shape for existing consumers (additive `_meta`
only if needed).
- PostgreSQL materialized views or NOTIFY/LISTEN.
## Dependencies
- STATE-WP-0056 T07 (done) — baseline cache-miss profiling.
- activity-core bridge readiness fix (`40fa851`) — health probe; independent of
this workplan but motivates summary optimization for other callers.
## T01 — Revision fingerprint module
```task
id: STATE-WP-0066-T01
status: done
priority: high
state_hub_task_id: "8ee836ec-048c-44f6-b16e-e7454d07371a"
```
Extract summary cache logic from `api/routers/state.py` into a dedicated
module (e.g. `api/services/summary_cache.py`).
Deliverables:
- `SummaryRevision` dataclass: per-table maxima + combined fingerprint string.
- `async def fetch_summary_revision(session) -> SummaryRevision`: one or a few
SQL queries using existing indexes; target <20ms on current data volume.
- Unit tests with mocked session rows proving fingerprint changes when any
contributing table changes.
- Document covered tables and why each is included (flow-engine inputs,
`next_steps`, domain rollups, licence scan).

Done when revision fetch is tested in isolation and profiled under local DB.
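
The "one or few SQL queries" idea can be sketched as a single `UNION ALL` of per-table `MAX(...)` selects. To stay self-contained this example uses the standard-library `sqlite3` module rather than the async SQLAlchemy session the hub uses, and the three tables shown are an illustrative subset, not the full covered list.

```python
import sqlite3

# Illustrative subset of table -> timestamp column. Names come from a fixed
# constant, never from user input, so interpolating them into SQL is safe here.
REVISION_COLUMNS = {
    "tasks": "updated_at",
    "workplans": "updated_at",
    "progress_events": "created_at",
}


def build_revision_sql(columns: dict) -> str:
    """One statement returning a (table_name, max_timestamp) row per table."""
    selects = [
        f"SELECT '{table}' AS table_name, MAX({column}) AS max_ts FROM {table}"
        for table, column in columns.items()
    ]
    return " UNION ALL ".join(selects)


def fetch_maxima(conn: sqlite3.Connection, columns: dict) -> dict:
    """Run the combined query; empty tables yield None for their maximum."""
    rows = conn.execute(build_revision_sql(columns)).fetchall()
    return {table: max_ts for table, max_ts in rows}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tasks (updated_at TEXT)")
conn.execute("CREATE TABLE workplans (updated_at TEXT)")
conn.execute("CREATE TABLE progress_events (created_at TEXT)")
conn.execute("INSERT INTO tasks VALUES ('2026-06-22T11:00:00')")
print(fetch_maxima(conn, REVISION_COLUMNS))
```

A single round trip matters most on the ops-bridge path, where per-query latency dominates; whether each `MAX(...)` is served from an index depends on the actual schema.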
## T02 — Revision-gated cache (replace TTL-only path)
```task
id: STATE-WP-0066-T02
status: done
priority: high
state_hub_task_id: "c5fc8ab8-5fc8-463b-9ae7-c304a7e0383e"
```
Wire revision check into `GET /state/summary`:
- If revision matches cached revision → return cache regardless of age.
- Remove `_SUMMARY_TTL`, or demote it to a safety cap only (e.g. 5 min max
stale age if background refresh fails).
- Add `X-StateHub-Cache: hit-revision` and `X-StateHub-Revision` headers.
- Honour `Cache-Control: no-cache` and `?refresh=true` for forced rebuild.
- Update `tests/conftest.py` cache reset helper for new module globals.

Done when repeated summary requests with unchanged DB show `hit-revision` and
skip the heavy query path (assert via mock or elapsed-ms header threshold).
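
Detecting a forced rebuild might look like the sketch below. `wants_forced_refresh` is a hypothetical helper taking the raw header and query values; accepting only the literal `true` (case-insensitive) for the query param is an assumption.

```python
from typing import Optional


def wants_forced_refresh(
    cache_control: Optional[str],
    refresh_param: Optional[str],
) -> bool:
    """True when the caller asked to bypass the revision cache (sketch)."""
    # Cache-Control is a comma-separated, case-insensitive directive list.
    directives = {d.strip().lower() for d in (cache_control or "").split(",")}
    if "no-cache" in directives:
        return True
    return (refresh_param or "").strip().lower() == "true"


print(wants_forced_refresh("max-age=0, no-cache", None))
```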
## T03 — Stale-while-revalidate background refresh
```task
id: STATE-WP-0066-T03
status: done
priority: high
state_hub_task_id: "aa079e17-6103-4539-878d-b451035e5f8a"
```
When revision differs but a cached snapshot exists:
- Return stale snapshot immediately (`X-StateHub-Cache: stale`).
- Start a single background `asyncio` task to rebuild (dedupe concurrent
refresh — same pattern as `_INDEX_REFRESH_TASK` in workstreams router).
- On refresh completion, atomically swap cache + revision.
- On refresh failure, retain stale cache; set `_SUMMARY_LAST_ERROR` for
optional diagnostics header.
- Cold start (no cache) still blocks until first build completes.

Done when a revision bump serves stale data in <50ms while background rebuild
runs; second request after rebuild shows `hit-revision`.
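
A compact sketch of the behaviour described in this task, assuming a single event loop. `SummaryCache` and its attributes are illustrative names; this does not reproduce the `_INDEX_REFRESH_TASK` code in the workstreams router, and the `build` callable stands in for the real rebuild, which would open its own session.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class SummaryCache:
    """Stale-while-revalidate holder with one deduplicated refresh task."""

    def __init__(self, build: Callable[[], Awaitable[dict]]) -> None:
        self._build = build
        self.snapshot: Optional[dict] = None
        self.revision: Optional[str] = None
        self.last_error: Optional[str] = None
        self._refresh_task: Optional[asyncio.Task] = None

    async def _refresh(self, revision: str) -> None:
        try:
            snapshot = await self._build()
        except Exception as exc:
            # Keep serving the stale snapshot; record the error for diagnostics.
            self.last_error = repr(exc)
            return
        # No await between these assignments, so readers on the same event
        # loop never observe a snapshot paired with the wrong revision.
        self.snapshot, self.revision = snapshot, revision
        self.last_error = None

    async def get(self, current_revision: str) -> tuple:
        if self.snapshot is None:
            # Cold start: block until the first build completes.
            self.snapshot = await self._build()
            self.revision = current_revision
            return self.snapshot, "miss"
        if self.revision == current_revision:
            return self.snapshot, "hit-revision"
        # Start at most one background rebuild at a time.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(
                self._refresh(current_revision)
            )
        return self.snapshot, "stale"
```

Concurrent cold starts are not deduplicated in this sketch; a lock or shared future around the first build would cover that case if it matters in practice.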
## T04 — Section-level cache for `recent_progress`
```task
id: STATE-WP-0066-T04
status: done
priority: medium
state_hub_task_id: "99b14a10-ff1b-4609-a62b-5da19b79be68"
```
Split cache into `core` and `progress_tail` sections:
- `progress_tail` rebuild: single query `ORDER BY created_at DESC LIMIT 20`.
- `core` rebuild: existing summary path minus recent-progress fetch.
- Merge at serve time into `StateSummary`.
- Revision mismatch on progress only → rebuild progress section only; reuse
cached core.

Done when inserting a progress event refreshes `recent_progress` without
re-running domain/SBOM/flow-engine work (verify via query count or spy).
## T05 — Eager invalidation on mutation routes
```task
id: STATE-WP-0066-T05
status: done
priority: medium
state_hub_task_id: "0df9c1c2-6edf-4be3-bee8-b75e0d24fc02"
```
Call `invalidate_summary_cache()` (or equivalent revision bump) from write
paths listed in the design sketch.
- Invalidation must be synchronous before response returns (so the next GET
sees fresh revision).
- Progress `POST` invalidates at least the `progress_tail` section.
- Task/workplan/decision `PATCH` invalidates `core`.
- Add focused tests: mutate → GET summary reflects change without `?refresh`.

Done when mutation routes trigger invalidation and tests pass.
## T06 — Benchmark and regression tests
```task
id: STATE-WP-0066-T06
status: done
priority: high
state_hub_task_id: "5f68f999-d5ca-40df-a054-aaf762837342"
```
Prove cache effectiveness under realistic load:
- Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers
`/state/summary` with unchanged revision — p95 < 50ms.
- Cache-miss path (forced `?refresh=true`) stays < 500ms on current local data
volume (STATE-WP-0056 T07 bar).
- MCP smoke tests still pass (`tests/test_mcp_smoke.py`).
- Router shape tests still pass (`tests/test_routers_core.py`).
- Add tests for stale-while-revalidate and section-split behaviour.

Done when benchmark script is documented in workplan notes or Makefile target
and CI tests are green.
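
The p95 thresholds above could be checked with a helper like this sketch. The function names are hypothetical, and the choice of the `inclusive` quantile method is an assumption; the actual benchmark script may compute percentiles differently.

```python
import statistics


def p95_ms(samples_ms: list) -> float:
    """95th percentile of request latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(samples_ms, n=100, method="inclusive")[94]


def within_budget(samples_ms: list, budget_ms: float) -> bool:
    """True when p95 latency is strictly below the budget."""
    return p95_ms(samples_ms) < budget_ms


hit_latencies = [4.0, 5.5, 6.1, 4.8, 7.9, 5.2, 6.6, 5.0]
print(within_budget(hit_latencies, 50.0))
```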
## T07 — Documentation and consumer guidance
```task
id: STATE-WP-0066-T07
status: done
priority: low
state_hub_task_id: "aee0f349-761f-42b7-82b2-6a319936a68e"
```
Update:
- `README.md`: `/state/summary` caching section (revision + stale headers).
- `dashboard/src/docs/live-data.md`: which pages need summary vs overview.
- `mcp_server/TOOLS.md`: note cache behaviour for `get_state_summary`.
- `SCOPE.md`: one-liner if summary caching is an operational concern.

Clarify for operators:
- Infrastructure probes → `/state/health`.
- Full snapshot → `/state/summary` (cached; `?refresh=true` to force).
- Dashboard overview → `/state/overview` (already lighter).

Done when docs match implemented headers and invalidation semantics.
## T08 — Verify through ops-bridge path
```task
id: STATE-WP-0066-T08
status: done
priority: medium
state_hub_task_id: "b4e967ab-57ba-4f0f-9161-373b383248b5"
```
End-to-end check from Railiance01 (or documented manual runbook):
- `bridge check state-hub-railiance01` healthy.
- Through node `127.0.0.1:18080`: summary returns 200 with `hit-revision` or
`stale` on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
- Confirm `actcore-state-hub-bridge` readiness stays 1/1 (health probe, not
summary).
- Log findings as a State Hub progress event when verified.

Done when bridge-path latency is recorded and no readiness regressions occur over a
30-minute observation window.
## Acceptance criteria
- Quiet hub: consecutive `/state/summary` requests hit revision cache, not full
rebuild.
- Data change: next request serves stale snapshot immediately, then fresh data
after background refresh.
- Progress-only change: core section not rebuilt (T04).
- Mutation → visible on next GET without manual API restart.
- No breaking changes to `StateSummary` JSON for existing consumers.
- Bridge readiness remains on `/state/health`.
## Notes for implementer
- `AsyncSession` does not support concurrent operations on one session —
background refresh must use `async_session_factory()` (same as workplan
index).
- `generated_at` on cache hit should reflect build time, not request time.
- Consider extracting `_build_state_summary(session)` from the route handler to
simplify testing (may already be partially inline).
- If revision query itself becomes hot, cache the revision for ~1s (micro-TTL)
as a second-level optimization — only if profiling warrants it.
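
If the micro-TTL turns out to be warranted, it could be as small as the sketch below. `RevisionMemo` is a hypothetical name, the synchronous `fetch` callable stands in for the async revision query, and the injectable clock exists only to make the behaviour testable.

```python
import time
from typing import Callable, Optional


class RevisionMemo:
    """Remember the last fetched revision for a short time window."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refetch when nothing is memoized or the window has elapsed.
        if self._value is None or now - self._fetched_at >= self._ttl_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def invalidate(self) -> None:
        """Drop the memo so the next get() refetches (e.g. after a write)."""
        self._value = None
```

Pairing `invalidate()` with the eager invalidation on write routes would keep a local mutation visible on the next GET even inside the TTL window.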