diff --git a/workplans/STATE-WP-0066-state-summary-revision-cache.md b/workplans/STATE-WP-0066-state-summary-revision-cache.md
new file mode 100644
index 0000000..576d77f
--- /dev/null
+++ b/workplans/STATE-WP-0066-state-summary-revision-cache.md
@@ -0,0 +1,340 @@
+---
+id: STATE-WP-0066
+type: workplan
+title: "State summary revision cache and stale-while-revalidate"
+domain: custodian
+repo: state-hub
+status: ready
+owner: codex
+topic_slug: custodian
+created: "2026-06-22"
+updated: "2026-06-22"
+state_hub_workstream_id: ""
+---
+
+# STATE-WP-0066 — State summary revision cache and stale-while-revalidate
+
+## Summary
+
+Upgrade `/state/summary` caching from a fixed 15-second TTL to a
+**revision-gated** cache: serve the cached snapshot when underlying hub data
+has not changed, and use **stale-while-revalidate** when it has. This extends
+STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel
+latency spikes that made `/state/summary` a poor readiness-probe target on
+Railiance01.
+
+`/state/health` remains the correct probe endpoint for infrastructure checks.
+This workplan optimizes summary for dashboard, MCP, and activity-core consumers
+that legitimately need the full snapshot.
+
+## Problem
+
+Current behaviour (`api/routers/state.py`):
+
+- In-process TTL cache (`_SUMMARY_TTL = 15s`) — expires on wall clock, not data
+  change.
+- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and
+  Python-side SBOM licence scanning (~200ms+ locally; worse through
+  ops-bridge → WSL).
+- ETag middleware helps repeat identical responses (304) but still pays full
+  rebuild cost on miss.
+- `progress_events` is append-only; hourly sweeps and agent notes invalidate
+  freshness expectations for `recent_progress` even when totals and workplan
+  rows are unchanged.
+
+Observed impact:
+
+- Bridge readiness probe timeouts when summary was used (fixed in
+  activity-core `40fa851`; do not regress).
+- Dashboard workplans page and MCP `get_state_summary` poll summary on
+  15–60s intervals — unnecessary rebuilds during quiet periods.
+- Concurrent cache misses contend on a single `AsyncSession` (no gather).
+
+## Goal
+
+1. **Revision watermark** — one cheap query (or small query set) determines
+   whether cached summary is still valid.
+2. **Stale-while-revalidate** — return last good snapshot immediately on
+   revision change; rebuild in background.
+3. **Section-level freshness** — split `recent_progress` from the stable core
+   so sweep traffic does not force full rebuilds.
+4. **Eager invalidation** — mutation routes clear or bump revision so writes
+   are visible without waiting for the next poll.
+5. **Observable cache** — response headers document hit/miss/stale/revision.
+
+## Design sketch
+
+### Revision fingerprint
+
+Before a full rebuild, compute a `SummaryRevision` from indexed
+`MAX(updated_at)` / `MAX(created_at)` across tables that feed summary:
+
+| Signal | Tables / columns |
+|--------|------------------|
+| Core entities | `topics`, `workplans`, `tasks`, `decisions`, `workplan_dependencies` → `updated_at` |
+| Append-only events | `progress_events` → `created_at` |
+| Portfolio | `managed_repos`, `contributions`, `capability_requests` → `updated_at` |
+| Domain rollup | `domains`, `extension_points`, `technical_debt` → `updated_at` |
+| SBOM slice | `sbom_entries` (or latest `sbom_snapshots`) → relevant timestamp |
+
+Store `revision` (ISO timestamp or hash of per-table maxima) alongside cached
+`StateSummary`. If incoming revision matches cached revision → return cache
+(`X-StateHub-Cache: hit-revision`).
+
+Reuse the fingerprint pattern from `api/doi_engine.py` (`compute_fingerprint`).
+
+### Stale-while-revalidate
+
+Mirror `api/routers/workstreams.py` `_workplan_index` behaviour:
+
+- Revision unchanged → return cache (fast path).
+- Revision changed, cache present → return stale cache, start background
+  rebuild task (`X-StateHub-Cache: stale`).
+- No cache → await rebuild (`X-StateHub-Cache: miss`).
+- Respect `Cache-Control: no-cache` and a `?refresh=true` query param to force
+  synchronous rebuild (for tests and operators).
+
+### Section split (phase 2 within this workplan)
+
+| Section key | Invalidation |
+|-------------|--------------|
+| `core` | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
+| `progress_tail` | `MAX(progress_events.created_at)` only |
+
+Serve merged `StateSummary`; rebuild only the section whose revision changed.
+
+### Eager invalidation
+
+Add `invalidate_summary_cache()` (or revision bump) called from:
+
+- `POST /progress/`
+- `PATCH /tasks/{id}`, `POST /tasks/`
+- `PATCH /decisions/{id}`, `POST /decisions/`
+- `PATCH /workplans/{id}`, `POST /workplans/` (and legacy `/workstreams/`)
+- `POST /workplan-dependencies/` (and legacy alias)
+
+Keep invalidation in-process (single uvicorn worker assumption). Document
+multi-worker limitation; defer shared revision store unless needed.
+
+### Response metadata
+
+Extend headers (and optionally `_meta` in schema — only if consumers need it):
+
+```
+X-StateHub-Cache: hit-revision | stale | miss
+X-StateHub-Revision:
+X-StateHub-Elapsed-Ms:
+```
+
+Preserve truthful `generated_at` on cache hits (when the snapshot was built).
+
+## Out of scope
+
+- Redis or external cache layer.
+- Multi-worker shared cache (document only).
+- Replacing `/state/overview` (already the dashboard fast path).
+- Changing summary response shape for existing consumers (additive `_meta`
+  only if needed).
+- PostgreSQL materialized views or NOTIFY/LISTEN.
+
+## Dependencies
+
+- STATE-WP-0056 T07 (done) — baseline cache-miss profiling.
+- activity-core bridge readiness fix (`40fa851`) — health probe; independent of
+  this workplan but motivates summary optimization for other callers.
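Before the task breakdown, here is a minimal sketch of the revision fingerprint from the design sketch: per-table maxima combined into one comparable string. It is an illustration under stated assumptions, not the existing `compute_fingerprint` in `api/doi_engine.py`, whose signature is not shown in this workplan. `build_revision`, the `maxima` field layout, and the 16-character truncation are hypothetical choices. The input mapping stands in for the result of the `MAX(...)` queries.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint string (hypothetical layout)."""

    maxima: Mapping[str, Optional[datetime]]
    fingerprint: str


def build_revision(maxima: Mapping[str, Optional[datetime]]) -> SummaryRevision:
    # Hypothetical helper: in the real module the mapping would come from
    # MAX(updated_at) / MAX(created_at) queries against the feeding tables.
    parts = []
    # Sort by table name so the fingerprint does not depend on dict order.
    for table in sorted(maxima):
        ts = maxima[table]
        # An empty table has no MAX(...); encode that case explicitly.
        parts.append(f"{table}={ts.isoformat() if ts else 'none'}")
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
    # Truncation is an arbitrary choice to keep a response header short.
    return SummaryRevision(maxima=dict(maxima), fingerprint=digest[:16])


t0 = datetime(2026, 6, 22, 12, 0, tzinfo=timezone.utc)
t1 = datetime(2026, 6, 22, 12, 5, tzinfo=timezone.utc)

before = build_revision({"tasks": t0, "progress_events": t0, "sbom_entries": None})
# Same maxima in a different key order must give the same fingerprint.
same = build_revision({"sbom_entries": None, "progress_events": t0, "tasks": t0})
# A newer progress event must change the fingerprint.
after = build_revision({"tasks": t0, "progress_events": t1, "sbom_entries": None})
```

Keeping the per-table maxima next to the combined string makes it possible to tell which table moved, which the section split relies on.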
+
+## T01 — Revision fingerprint module
+
+```task
+id: STATE-WP-0066-T01
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+Extract summary cache logic from `api/routers/state.py` into a dedicated
+module (e.g. `api/services/summary_cache.py`).
+
+Deliverables:
+
+- `SummaryRevision` dataclass: per-table maxima + combined fingerprint string.
+- `async def fetch_summary_revision(session) -> SummaryRevision` — one or few
+  SQL queries using existing indexes; target <20ms on current data volume.
+- Unit tests with mocked session rows proving fingerprint changes when any
+  contributing table changes.
+- Document covered tables and why each is included (flow-engine inputs,
+  `next_steps`, domain rollups, licence scan).
+
+Done when revision fetch is tested in isolation and profiled under local DB.
+
+## T02 — Revision-gated cache (replace TTL-only path)
+
+```task
+id: STATE-WP-0066-T02
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+Wire revision check into `GET /state/summary`:
+
+- If revision matches cached revision → return cache regardless of age.
+- Remove or demote `_SUMMARY_TTL` to a safety cap only (e.g. 5 min max stale
+  age if background refresh fails).
+- Add `X-StateHub-Cache: hit-revision` and `X-StateHub-Revision` headers.
+- Honour `Cache-Control: no-cache` and `?refresh=true` for forced rebuild.
+- Update `tests/conftest.py` cache reset helper for new module globals.
+
+Done when repeated summary requests with unchanged DB show `hit-revision` and
+skip the heavy query path (assert via mock or elapsed-ms header threshold).
+
+## T03 — Stale-while-revalidate background refresh
+
+```task
+id: STATE-WP-0066-T03
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+When revision differs but a cached snapshot exists:
+
+- Return stale snapshot immediately (`X-StateHub-Cache: stale`).
+- Start a single background `asyncio` task to rebuild (dedupe concurrent
+  refresh — same pattern as `_INDEX_REFRESH_TASK` in workstreams router).
+- On refresh completion, atomically swap cache + revision.
+- On refresh failure, retain stale cache; set `_SUMMARY_LAST_ERROR` for
+  optional diagnostics header.
+- Cold start (no cache) still blocks until first build completes.
+
+Done when a revision bump serves stale data in <50ms while background rebuild
+runs; second request after rebuild shows `hit-revision`.
+
+## T04 — Section-level cache for `recent_progress`
+
+```task
+id: STATE-WP-0066-T04
+status: todo
+priority: medium
+state_hub_task_id: ""
+```
+
+Split cache into `core` and `progress_tail` sections:
+
+- `progress_tail` rebuild: single query `ORDER BY created_at DESC LIMIT 20`.
+- `core` rebuild: existing summary path minus recent-progress fetch.
+- Merge at serve time into `StateSummary`.
+- Revision mismatch on progress only → rebuild progress section only; reuse
+  cached core.
+
+Done when inserting a progress event refreshes `recent_progress` without
+re-running domain/SBOM/flow-engine work (verify via query count or spy).
+
+## T05 — Eager invalidation on mutation routes
+
+```task
+id: STATE-WP-0066-T05
+status: todo
+priority: medium
+state_hub_task_id: ""
+```
+
+Call `invalidate_summary_cache()` (or equivalent revision bump) from write
+paths listed in the design sketch.
+
+- Invalidation must be synchronous before response returns (so the next GET
+  sees fresh revision).
+- Progress `POST` invalidates at least the `progress_tail` section.
+- Task/workplan/decision `PATCH` invalidates `core`.
+- Add focused tests: mutate → GET summary reflects change without `?refresh`.
+
+Done when mutation routes trigger invalidation and tests pass.
+
+## T06 — Benchmark and regression tests
+
+```task
+id: STATE-WP-0066-T06
+status: todo
+priority: high
+state_hub_task_id: ""
+```
+
+Prove cache effectiveness under realistic load:
+
+- Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers
+  `/state/summary` with unchanged revision — p95 < 50ms.
+- Cache-miss path (forced `?refresh=true`) stays < 500ms on current local data
+  volume (STATE-WP-0056 T07 bar).
+- MCP smoke tests still pass (`tests/test_mcp_smoke.py`).
+- Router shape tests still pass (`tests/test_routers_core.py`).
+- Add tests for stale-while-revalidate and section-split behaviour.
+
+Done when benchmark script is documented in workplan notes or Makefile target
+and CI tests are green.
+
+## T07 — Documentation and consumer guidance
+
+```task
+id: STATE-WP-0066-T07
+status: todo
+priority: low
+state_hub_task_id: ""
+```
+
+Update:
+
+- `README.md` — `/state/summary` caching section (revision + stale headers).
+- `dashboard/src/docs/live-data.md` — which pages need summary vs overview.
+- `mcp_server/TOOLS.md` — note cache behaviour for `get_state_summary`.
+- `SCOPE.md` one-liner if summary caching is an operational concern.
+
+Clarify for operators:
+
+- Infrastructure probes → `/state/health`.
+- Full snapshot → `/state/summary` (cached; `?refresh=true` to force).
+- Dashboard overview → `/state/overview` (already lighter).
+
+Done when docs match implemented headers and invalidation semantics.
+
+## T08 — Verify through ops-bridge path
+
+```task
+id: STATE-WP-0066-T08
+status: todo
+priority: medium
+state_hub_task_id: ""
+```
+
+End-to-end check from Railiance01 (or documented manual runbook):
+
+- `bridge check state-hub-railiance01` healthy.
+- Through node `127.0.0.1:18080`: summary returns 200 with `hit-revision` or
+  `stale` on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
+- Confirm `actcore-state-hub-bridge` readiness stays 1/1 (health probe, not
+  summary).
+- Log findings as State Hub progress event when verified.
+
+Done when bridge-path latency is recorded and no readiness regressions over a
+30-minute observation window.
+
+## Acceptance criteria
+
+- Quiet hub: consecutive `/state/summary` requests hit revision cache, not full
+  rebuild.
+- Data change: next request serves stale snapshot immediately, then fresh data
+  after background refresh.
+- Progress-only change: core section not rebuilt (T04).
+- Mutation → visible on next GET without manual API restart.
+- No breaking changes to `StateSummary` JSON for existing consumers.
+- Bridge readiness remains on `/state/health`.
+
+## Notes for implementer
+
+- `AsyncSession` does not support concurrent operations on one session —
+  background refresh must use `async_session_factory()` (same as workplan
+  index).
+- `generated_at` on cache hit should reflect build time, not request time.
+- Consider extracting `_build_state_summary(session)` from the route handler to
+  simplify testing (may already be partially inline).
+- If revision query itself becomes hot, cache the revision for ~1s (micro-TTL)
+  as a second-level optimization — only if profiling warrants it.
\ No newline at end of file
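The revision-gated, stale-while-revalidate flow from T02 and T03 might look roughly like the sketch below. It is an illustration under assumptions rather than the planned implementation: `SummaryCache`, its constructor arguments, and the `demo()` harness are hypothetical, and the two callables stand in for `fetch_summary_revision` and a rebuild function that opens its own session (per the note above about `async_session_factory()`). Only the three status strings come from the workplan.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


class SummaryCache:
    """Hypothetical in-process cache keyed by a revision string."""

    def __init__(
        self,
        fetch_revision: Callable[[], Awaitable[str]],
        rebuild: Callable[[], Awaitable[dict]],
    ) -> None:
        self._fetch_revision = fetch_revision
        self._rebuild = rebuild  # assumed to open its own DB session
        self._snapshot: Optional[dict] = None
        self._revision: Optional[str] = None
        self._refresh_task: Optional[asyncio.Task] = None
        self.last_error: Optional[BaseException] = None

    async def _refresh(self, revision: str) -> None:
        try:
            snapshot = await self._rebuild()
        except Exception as exc:
            # Keep serving the stale snapshot; record the error for diagnostics.
            self.last_error = exc
            return
        # No await between these assignments, so the swap is atomic with
        # respect to other coroutines on the same event loop.
        self._snapshot, self._revision = snapshot, revision
        self.last_error = None

    async def get(self, force: bool = False) -> Tuple[dict, str]:
        # The revision is read before any rebuild. If data changes mid-build,
        # the next request sees a mismatch and rebuilds again.
        revision = await self._fetch_revision()
        if force or self._snapshot is None:
            # Cold start or forced refresh: block until the build completes.
            self._snapshot = await self._rebuild()
            self._revision = revision
            return self._snapshot, "miss"
        if revision == self._revision:
            return self._snapshot, "hit-revision"
        # Revision changed: start at most one background rebuild, serve stale.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh(revision))
        return self._snapshot, "stale"


async def demo():
    state = {"revision": "r1", "builds": 0}

    async def fetch_revision() -> str:
        return state["revision"]

    async def rebuild() -> dict:
        state["builds"] += 1
        await asyncio.sleep(0.01)  # stand-in for the heavy query path
        return {"build": state["builds"]}

    cache = SummaryCache(fetch_revision, rebuild)
    statuses = [(await cache.get())[1], (await cache.get())[1]]
    state["revision"] = "r2"  # simulate a write to the hub
    statuses.append((await cache.get())[1])
    statuses.append((await cache.get())[1])  # refresh still pending: deduped
    await cache._refresh_task
    snapshot, status = await cache.get()
    statuses.append(status)
    return statuses, snapshot, state["builds"]


statuses, snapshot, builds = asyncio.run(demo())
```

The status string returned by `get()` maps directly onto the `X-StateHub-Cache` header value, so the route handler only has to copy it into the response.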