Add STATE-WP-0066 workplan for state summary revision cache

Defines revision-gated caching, stale-while-revalidate, section split for
recent_progress, mutation invalidation, and bridge-path verification.
2026-06-22 16:15:33 +02:00
parent 0949d4c0d8
commit ffaaf48fcb

@@ -0,0 +1,340 @@
---
id: STATE-WP-0066
type: workplan
title: "State summary revision cache and stale-while-revalidate"
domain: custodian
repo: state-hub
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
state_hub_workstream_id: ""
---
# STATE-WP-0066 — State summary revision cache and stale-while-revalidate
## Summary
Upgrade `/state/summary` caching from a fixed 15-second TTL to a
**revision-gated** cache: serve the cached snapshot when underlying hub data
has not changed, and use **stale-while-revalidate** when it has. This extends
STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel
latency spikes that made `/state/summary` a poor readiness-probe target on
Railiance01.
`/state/health` remains the correct probe endpoint for infrastructure checks.
This workplan optimizes summary for dashboard, MCP, and activity-core consumers
that legitimately need the full snapshot.
## Problem
Current behaviour (`api/routers/state.py`):
- In-process TTL cache (`_SUMMARY_TTL = 15s`) — expires on wall clock, not data
change.
- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and
Python-side SBOM licence scanning (~200ms+ locally; worse through
ops-bridge → WSL).
- ETag middleware helps repeat identical responses (304) but still pays full
rebuild cost on miss.
- `progress_events` is append-only; hourly sweeps and agent notes invalidate
freshness expectations for `recent_progress` even when totals and workplan
rows are unchanged.
Observed impact:
- Bridge readiness probe timeouts when summary was used (fixed in
activity-core `40fa851`; do not regress).
- Dashboard workplans page and MCP `get_state_summary` poll summary on
15-60s intervals, causing unnecessary rebuilds during quiet periods.
- Concurrent cache misses contend on a single `AsyncSession` (no gather).
## Goal
1. **Revision watermark** — one cheap query (or small query set) determines
whether cached summary is still valid.
2. **Stale-while-revalidate** — return last good snapshot immediately on
revision change; rebuild in background.
3. **Section-level freshness** — split `recent_progress` from the stable core
so sweep traffic does not force full rebuilds.
4. **Eager invalidation** — mutation routes clear or bump revision so writes
are visible without waiting for the next poll.
5. **Observable cache** — response headers document hit/miss/stale/revision.
## Design sketch
### Revision fingerprint
Before a full rebuild, compute a `SummaryRevision` from indexed
`MAX(updated_at)` / `MAX(created_at)` across tables that feed summary:
| Signal | Tables / columns |
|--------|------------------|
| Core entities | `topics`, `workplans`, `tasks`, `decisions`, `workplan_dependencies` → `updated_at` |
| Append-only events | `progress_events` → `created_at` |
| Portfolio | `managed_repos`, `contributions`, `capability_requests` → `updated_at` |
| Domain rollup | `domains`, `extension_points`, `technical_debt` → `updated_at` |
| SBOM slice | `sbom_entries` (or latest `sbom_snapshots`) → relevant timestamp |
Store `revision` (ISO timestamp or hash of per-table maxima) alongside cached
`StateSummary`. If incoming revision matches cached revision → return cache
(`X-StateHub-Cache: hit-revision`).
Reuse the fingerprint pattern from `api/doi_engine.py` (`compute_fingerprint`).
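A minimal sketch of how per-table maxima could be folded into one fingerprint string. It does not reproduce `compute_fingerprint` from `api/doi_engine.py`; the `maxima` field name, the SHA-256 choice, and the 16-character truncation are assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint (hypothetical layout)."""

    maxima: dict[str, datetime | None]

    @property
    def fingerprint(self) -> str:
        # Sort by table name so the hash does not depend on dict ordering.
        parts = [
            f"{table}={ts.isoformat() if ts else 'empty'}"
            for table, ts in sorted(self.maxima.items())
        ]
        digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
        return digest[:16]  # assumed length; short enough for a header value
```

Hashing the sorted `table=timestamp` pairs means a change in any single contributing table changes the fingerprint, while an empty table still contributes a stable token.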
### Stale-while-revalidate
Mirror `api/routers/workstreams.py` `_workplan_index` behaviour:
- Revision unchanged → return cache (fast path).
- Revision changed, cache present → return stale cache, start background
rebuild task (`X-StateHub-Cache: stale`).
- No cache → await rebuild (`X-StateHub-Cache: miss`).
- Respect `Cache-Control: no-cache` and a `?refresh=true` query param to force
synchronous rebuild (for tests and operators).
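The four branches above can be expressed as a pure decision function, which keeps the routing logic testable without a database. This is a sketch under assumptions: the names `CacheDecision`, `decide`, and `wants_refresh` are hypothetical, and reporting a forced refresh as `miss` is a guess, since the plan does not say which header value a forced rebuild should carry.

```python
from enum import Enum
from typing import Optional


class CacheDecision(str, Enum):
    HIT_REVISION = "hit-revision"
    STALE = "stale"
    MISS = "miss"


def wants_refresh(cache_control: Optional[str], refresh_param: Optional[str]) -> bool:
    """True when the caller asked for a synchronous rebuild."""
    directives = {d.strip().lower() for d in (cache_control or "").split(",")}
    return "no-cache" in directives or (refresh_param or "").lower() == "true"


def decide(
    cached_revision: Optional[str],
    current_revision: str,
    force_refresh: bool = False,
) -> CacheDecision:
    # No cache, or an explicit refresh: the caller must await a rebuild.
    if force_refresh or cached_revision is None:
        return CacheDecision.MISS
    # Revision unchanged: fast path.
    if cached_revision == current_revision:
        return CacheDecision.HIT_REVISION
    # Revision changed but a snapshot exists: serve it, rebuild in background.
    return CacheDecision.STALE
```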
### Section split (phase 2 within this workplan)
| Section key | Invalidation |
|-------------|--------------|
| `core` | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
| `progress_tail` | `MAX(progress_events.created_at)` only |
Serve merged `StateSummary`; rebuild only the section whose revision changed.
### Eager invalidation
Add `invalidate_summary_cache()` (or revision bump) called from:
- `POST /progress/`
- `PATCH /tasks/{id}`, `POST /tasks/`
- `PATCH /decisions/{id}`, `POST /decisions/`
- `PATCH /workplans/{id}`, `POST /workplans/` (and legacy `/workstreams/`)
- `POST /workplan-dependencies/` (and legacy alias)
Keep invalidation in-process (single uvicorn worker assumption). Document
multi-worker limitation; defer shared revision store unless needed.
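One possible shape for the in-process invalidation hook, assuming the cache is a module-level dict keyed by section. Only `invalidate_summary_cache` is named in this plan; `SECTIONS`, `store_section`, and `get_section` are hypothetical helpers added so the sketch is self-contained.

```python
from typing import Any, Dict, Iterable, Optional, Tuple

SECTIONS = ("core", "progress_tail")

# section name -> (revision, payload); None forces a rebuild on the next GET.
_cache: Dict[str, Optional[Tuple[str, Any]]] = {name: None for name in SECTIONS}


def store_section(section: str, revision: str, payload: Any) -> None:
    _cache[section] = (revision, payload)


def get_section(section: str) -> Optional[Tuple[str, Any]]:
    return _cache[section]


def invalidate_summary_cache(sections: Iterable[str] = SECTIONS) -> None:
    """Drop the named sections synchronously, before the write response returns."""
    for name in sections:
        if name not in _cache:
            raise KeyError(f"unknown summary section: {name}")
        _cache[name] = None
```

Clearing an entry (rather than bumping a revision) turns the next GET into a blocking miss for that section, which trades one slower request for guaranteed read-your-writes. Because the dict lives in process memory, a second uvicorn worker would not see the invalidation.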
### Response metadata
Extend headers (and optionally `_meta` in schema — only if consumers need it):
```
X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision: <iso-or-hash>
X-StateHub-Elapsed-Ms: <existing>
```
Preserve truthful `generated_at` on cache hits (when the snapshot was built).
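A small helper could keep the header names and allowed values in one place. The function name `cache_headers` and the one-decimal formatting of the elapsed value are assumptions; the existing `X-StateHub-Elapsed-Ms` format should take precedence if it differs.

```python
from typing import Dict

_ALLOWED_STATUSES = {"hit-revision", "stale", "miss"}


def cache_headers(status: str, revision: str, elapsed_ms: float) -> Dict[str, str]:
    """Build the cache-related response headers for /state/summary."""
    if status not in _ALLOWED_STATUSES:
        raise ValueError(f"unexpected cache status: {status!r}")
    return {
        "X-StateHub-Cache": status,
        "X-StateHub-Revision": revision,
        # Assumed formatting; match whatever the existing header emits.
        "X-StateHub-Elapsed-Ms": f"{elapsed_ms:.1f}",
    }
```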
## Out of scope
- Redis or external cache layer.
- Multi-worker shared cache (document only).
- Replacing `/state/overview` (already the dashboard fast path).
- Changing summary response shape for existing consumers (additive `_meta`
only if needed).
- PostgreSQL materialized views or NOTIFY/LISTEN.
## Dependencies
- STATE-WP-0056 T07 (done) — baseline cache-miss profiling.
- activity-core bridge readiness fix (`40fa851`) — health probe; independent of
this workplan but motivates summary optimization for other callers.
## T01 — Revision fingerprint module
```task
id: STATE-WP-0066-T01
status: todo
priority: high
state_hub_task_id: ""
```
Extract summary cache logic from `api/routers/state.py` into a dedicated
module (e.g. `api/services/summary_cache.py`).
Deliverables:
- `SummaryRevision` dataclass: per-table maxima + combined fingerprint string.
- `async def fetch_summary_revision(session) -> SummaryRevision` — one or few
SQL queries using existing indexes; target <20ms on current data volume.
- Unit tests with mocked session rows proving fingerprint changes when any
contributing table changes.
- Document covered tables and why each is included (flow-engine inputs,
`next_steps`, domain rollups, licence scan).
Done when revision fetch is tested in isolation and profiled under local DB.
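To illustrate the "one or few SQL queries" target, the sketch below gathers every maximum in a single round trip using scalar subqueries. It uses the standard-library `sqlite3` driver as a stand-in for the project's async session, and `TIMESTAMP_COLUMNS` is a hypothetical subset of the tables listed in the design sketch.

```python
import sqlite3
from typing import Dict, Optional

# Hypothetical subset of the tables that feed the summary.
TIMESTAMP_COLUMNS: Dict[str, str] = {
    "tasks": "updated_at",
    "workplans": "updated_at",
    "progress_events": "created_at",
}


def build_revision_sql(columns: Dict[str, str]) -> str:
    # Identifiers come from the fixed mapping above, never from user input.
    selects = [
        f"(SELECT MAX({column}) FROM {table}) AS {table}"
        for table, column in columns.items()
    ]
    return "SELECT " + ", ".join(selects)


def fetch_maxima(conn: sqlite3.Connection) -> Dict[str, Optional[str]]:
    """Return the newest timestamp per table; None for an empty table."""
    row = conn.execute(build_revision_sql(TIMESTAMP_COLUMNS)).fetchone()
    return dict(zip(TIMESTAMP_COLUMNS, row))
```

Each `MAX` over an indexed timestamp column can typically be answered from the index alone, which is what makes the sub-20ms target plausible; that should still be confirmed by profiling against the real database.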
## T02 — Revision-gated cache (replace TTL-only path)
```task
id: STATE-WP-0066-T02
status: todo
priority: high
state_hub_task_id: ""
```
Wire revision check into `GET /state/summary`:
- If revision matches cached revision → return cache regardless of age.
- Remove or demote `_SUMMARY_TTL` to a safety cap only (e.g. 5 min max stale
age if background refresh fails).
- Add `X-StateHub-Cache: hit-revision` and `X-StateHub-Revision` headers.
- Honour `Cache-Control: no-cache` and `?refresh=true` for forced rebuild.
- Update `tests/conftest.py` cache reset helper for new module globals.
Done when repeated summary requests with unchanged DB show `hit-revision` and
skip the heavy query path (assert via mock or elapsed-ms header threshold).
## T03 — Stale-while-revalidate background refresh
```task
id: STATE-WP-0066-T03
status: todo
priority: high
state_hub_task_id: ""
```
When revision differs but a cached snapshot exists:
- Return stale snapshot immediately (`X-StateHub-Cache: stale`).
- Start a single background `asyncio` task to rebuild (dedupe concurrent
refresh — same pattern as `_INDEX_REFRESH_TASK` in workstreams router).
- On refresh completion, atomically swap cache + revision.
- On refresh failure, retain stale cache; set `_SUMMARY_LAST_ERROR` for
optional diagnostics header.
- Cold start (no cache) still blocks until first build completes.
Done when a revision bump serves stale data in <50ms while background rebuild
runs; second request after rebuild shows `hit-revision`.
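The deduplicated background refresh might look like the sketch below. It does not copy `_INDEX_REFRESH_TASK` from the workstreams router; the `SummaryCache` class and its attribute names are hypothetical, and the rebuild callable is injected so the logic can be exercised without a database.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional, Tuple


class SummaryCache:
    """Stale-while-revalidate with at most one background rebuild in flight."""

    def __init__(self, rebuild: Callable[[], Awaitable[Any]]) -> None:
        self._rebuild = rebuild
        self.snapshot: Any = None
        self.revision: Optional[str] = None
        self.last_error: Optional[str] = None
        self._refresh_task: Optional[asyncio.Task] = None

    async def _refresh(self, revision: str) -> None:
        try:
            snapshot = await self._rebuild()
        except Exception as exc:  # keep serving the stale snapshot
            self.last_error = repr(exc)
            return
        # Swap snapshot and revision together once the build has succeeded.
        self.snapshot, self.revision = snapshot, revision
        self.last_error = None

    async def get(self, current_revision: str) -> Tuple[Any, str]:
        if self.revision is None:
            # Cold start: block until the first build completes.
            self.snapshot = await self._rebuild()
            self.revision = current_revision
            return self.snapshot, "miss"
        if self.revision == current_revision:
            return self.snapshot, "hit-revision"
        # Start a refresh only if none is already running.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh(current_revision))
        return self.snapshot, "stale"
```

The swap in `_refresh` happens without an intervening `await`, so other coroutines on the same event loop never observe a new snapshot paired with an old revision. Concurrent cold-start requests would each rebuild in this sketch; a lock could cover that case if it matters.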
## T04 — Section-level cache for `recent_progress`
```task
id: STATE-WP-0066-T04
status: todo
priority: medium
state_hub_task_id: ""
```
Split cache into `core` and `progress_tail` sections:
- `progress_tail` rebuild: single query `ORDER BY created_at DESC LIMIT 20`.
- `core` rebuild: existing summary path minus recent-progress fetch.
- Merge at serve time into `StateSummary`.
- Revision mismatch on progress only → rebuild progress section only; reuse
cached core.
Done when inserting a progress event refreshes `recent_progress` without
re-running domain/SBOM/flow-engine work (verify via query count or spy).
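A sketch of per-section rebuild and merge, kept synchronous for brevity (the real builders would be async and take a session). `SectionCache` and `get_merged` are hypothetical names; the builders return plain dicts standing in for the fields of `StateSummary`.

```python
from typing import Any, Callable, Dict, Tuple

Builder = Callable[[], Dict[str, Any]]


class SectionCache:
    """Rebuild only the sections whose revision changed, then merge."""

    def __init__(self, builders: Dict[str, Builder]) -> None:
        self._builders = builders
        # section name -> (revision, payload)
        self._entries: Dict[str, Tuple[str, Dict[str, Any]]] = {}

    def get_merged(self, revisions: Dict[str, str]) -> Dict[str, Any]:
        merged: Dict[str, Any] = {}
        for name, build in self._builders.items():
            cached = self._entries.get(name)
            if cached is None or cached[0] != revisions[name]:
                cached = (revisions[name], build())
                self._entries[name] = cached
            merged.update(cached[1])
        return merged
```

Counting builder invocations, as in the "query count or spy" check above, is enough to show that a progress-only revision change leaves the `core` builder untouched.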
## T05 — Eager invalidation on mutation routes
```task
id: STATE-WP-0066-T05
status: todo
priority: medium
state_hub_task_id: ""
```
Call `invalidate_summary_cache()` (or equivalent revision bump) from write
paths listed in the design sketch.
- Invalidation must be synchronous before response returns (so the next GET
sees fresh revision).
- Progress `POST` invalidates at least the `progress_tail` section.
- Task/workplan/decision `PATCH` invalidates `core`.
- Add focused tests: mutate → GET summary reflects change without `?refresh`.
Done when mutation routes trigger invalidation and tests pass.
## T06 — Benchmark and regression tests
```task
id: STATE-WP-0066-T06
status: todo
priority: high
state_hub_task_id: ""
```
Prove cache effectiveness under realistic load:
- Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers
`/state/summary` with unchanged revision — p95 < 50ms.
- Cache-miss path (forced `?refresh=true`) stays < 500ms on current local data
volume (STATE-WP-0056 T07 bar).
- MCP smoke tests still pass (`tests/test_mcp_smoke.py`).
- Router shape tests still pass (`tests/test_routers_core.py`).
- Add tests for stale-while-revalidate and section-split behaviour.
Done when benchmark script is documented in workplan notes or Makefile target
and CI tests are green.
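The latency thresholds above need a fixed p95 definition to be comparable between runs. The helpers below are a sketch, not the STATE-WP-0056 timing script: `p95` uses the nearest-rank method, and `time_calls` times any zero-argument callable, which in practice would wrap an HTTP GET against `/state/summary`.

```python
import time
from typing import Callable, List


def p95(samples_ms: List[float]) -> float:
    """95th percentile by nearest rank (smallest value covering 95% of samples)."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of 0.95 * n, avoiding float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def time_calls(call: Callable[[], object], n: int) -> List[float]:
    """Invoke `call` n times and return each duration in milliseconds."""
    samples: List[float] = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples
```

A benchmark run would then assert something like `p95(time_calls(fetch_summary, 200)) < 50`, where `fetch_summary` is whatever client call the script uses.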
## T07 — Documentation and consumer guidance
```task
id: STATE-WP-0066-T07
status: todo
priority: low
state_hub_task_id: ""
```
Update:
- `README.md`: `/state/summary` caching section (revision + stale headers).
- `dashboard/src/docs/live-data.md` — which pages need summary vs overview.
- `mcp_server/TOOLS.md` — note cache behaviour for `get_state_summary`.
- `SCOPE.md` one-liner if summary caching is an operational concern.
Clarify for operators:
- Infrastructure probes → `/state/health`.
- Full snapshot → `/state/summary` (cached; `?refresh=true` to force).
- Dashboard overview → `/state/overview` (already lighter).
Done when docs match implemented headers and invalidation semantics.
## T08 — Verify through ops-bridge path
```task
id: STATE-WP-0066-T08
status: todo
priority: medium
state_hub_task_id: ""
```
End-to-end check from Railiance01 (or documented manual runbook):
- `bridge check state-hub-railiance01` healthy.
- Through node `127.0.0.1:18080`: summary returns 200 with `hit-revision` or
`stale` on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
- Confirm `actcore-state-hub-bridge` readiness stays 1/1 (health probe, not
summary).
- Log findings as State Hub progress event when verified.
Done when bridge-path latency is recorded and no readiness regressions over a
30-minute observation window.
## Acceptance criteria
- Quiet hub: consecutive `/state/summary` requests hit revision cache, not full
rebuild.
- Data change: next request serves stale snapshot immediately, then fresh data
after background refresh.
- Progress-only change: core section not rebuilt (T04).
- Mutation → visible on next GET without manual API restart.
- No breaking changes to `StateSummary` JSON for existing consumers.
- Bridge readiness remains on `/state/health`.
## Notes for implementer
- `AsyncSession` does not support concurrent operations on one session —
background refresh must use `async_session_factory()` (same as workplan
index).
- `generated_at` on cache hit should reflect build time, not request time.
- Consider extracting `_build_state_summary(session)` from the route handler to
simplify testing (may already be partially inline).
- If revision query itself becomes hot, cache the revision for ~1s (micro-TTL)
as a second-level optimization — only if profiling warrants it.
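If profiling does justify the micro-TTL, it could be as small as the wrapper below. This is a synchronous sketch with hypothetical names (`MicroTTL`, `reset`); the clock is injectable so tests do not need to sleep.

```python
import time
from typing import Callable, Optional


class MicroTTL:
    """Reuse a fetched revision for a short window (default 1 second)."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_s: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl_s = ttl_s
        self._clock = clock
        self._value: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now - self._fetched_at >= self._ttl_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value

    def reset(self) -> None:
        # Call from eager invalidation so writes are not hidden by the TTL.
        self._value = None
```

Without the `reset` hook, a mutation followed by an immediate GET could still see the old revision for up to the TTL, which would conflict with the eager-invalidation goal.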