generated from coulomb/repo-seed
Add STATE-WP-0066 workplan for state summary revision cache
Defines revision-gated caching, stale-while-revalidate, section split for recent_progress, mutation invalidation, and bridge-path verification.
340
workplans/STATE-WP-0066-state-summary-revision-cache.md
Normal file
@@ -0,0 +1,340 @@
---
id: STATE-WP-0066
type: workplan
title: "State summary revision cache and stale-while-revalidate"
domain: custodian
repo: state-hub
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
state_hub_workstream_id: ""
---

# STATE-WP-0066 — State summary revision cache and stale-while-revalidate

## Summary

Upgrade `/state/summary` caching from a fixed 15-second TTL to a
**revision-gated** cache: serve the cached snapshot when underlying hub data
has not changed, and use **stale-while-revalidate** when it has. This extends
STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel
latency spikes that made `/state/summary` a poor readiness-probe target on
Railiance01.

`/state/health` remains the correct probe endpoint for infrastructure checks.
This workplan optimizes summary for dashboard, MCP, and activity-core consumers
that legitimately need the full snapshot.

## Problem

Current behaviour (`api/routers/state.py`):

- The in-process TTL cache (`_SUMMARY_TTL = 15s`) expires on wall-clock time,
  not on data change.
- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and
  Python-side SBOM licence scanning (~200ms+ locally; worse through
  ops-bridge → WSL).
- ETag middleware lets repeated identical responses return 304, but a cache
  miss still pays the full rebuild cost.
- `progress_events` is append-only; hourly sweeps and agent notes make
  `recent_progress` stale even when totals and workplan rows are unchanged.

Observed impact:

- Bridge readiness probe timeouts when summary was used (fixed in
  activity-core `40fa851`; do not regress).
- Dashboard workplans page and MCP `get_state_summary` poll summary at
  15–60s intervals, causing unnecessary rebuilds during quiet periods.
- Concurrent cache misses contend on a single `AsyncSession` (no gather).

## Goal

1. **Revision watermark** — one cheap query (or small query set) determines
   whether cached summary is still valid.
2. **Stale-while-revalidate** — return last good snapshot immediately on
   revision change; rebuild in background.
3. **Section-level freshness** — split `recent_progress` from the stable core
   so sweep traffic does not force full rebuilds.
4. **Eager invalidation** — mutation routes clear or bump revision so writes
   are visible without waiting for the next poll.
5. **Observable cache** — response headers document hit/miss/stale/revision.

## Design sketch

### Revision fingerprint

Before a full rebuild, compute a `SummaryRevision` from indexed
`MAX(updated_at)` / `MAX(created_at)` across tables that feed summary:

| Signal | Tables / columns |
|--------|------------------|
| Core entities | `topics`, `workplans`, `tasks`, `decisions`, `workplan_dependencies` → `updated_at` |
| Append-only events | `progress_events` → `created_at` |
| Portfolio | `managed_repos`, `contributions`, `capability_requests` → `updated_at` |
| Domain rollup | `domains`, `extension_points`, `technical_debt` → `updated_at` |
| SBOM slice | `sbom_entries` (or latest `sbom_snapshots`) → relevant timestamp |

Store `revision` (ISO timestamp or hash of per-table maxima) alongside cached
`StateSummary`. If incoming revision matches cached revision → return cache
(`X-StateHub-Cache: hit-revision`).

Reuse the fingerprint pattern from `api/doi_engine.py` (`compute_fingerprint`).

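As a rough illustration of the "hash of per-table maxima" option, the sketch below builds a combined fingerprint from a mapping of table name to latest timestamp. It is not the existing `compute_fingerprint` from `api/doi_engine.py`; the `make_revision` helper, the field names on `SummaryRevision`, and the 16-character truncation are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from hashlib import sha256
from typing import Mapping, Optional


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint string (hypothetical shape)."""

    maxima: Mapping[str, Optional[datetime]]
    fingerprint: str


def make_revision(maxima: Mapping[str, Optional[datetime]]) -> SummaryRevision:
    # Sort by table name so the hash does not depend on query or dict order.
    # An empty table has no maximum; encode it as "-" instead of skipping it,
    # so the first insert into that table still changes the fingerprint.
    parts = [
        f"{table}={ts.isoformat() if ts is not None else '-'}"
        for table, ts in sorted(maxima.items())
    ]
    digest = sha256("|".join(parts).encode("utf-8")).hexdigest()
    return SummaryRevision(maxima=dict(maxima), fingerprint=digest[:16])


if __name__ == "__main__":
    t0 = datetime(2026, 6, 22, 12, 0, tzinfo=timezone.utc)
    before = make_revision({"tasks": t0, "progress_events": None})
    after = make_revision({"tasks": t0, "progress_events": t0})
    print(before.fingerprint != after.fingerprint)
```

Hashing the sorted pairs rather than taking a single global maximum keeps the per-table values available, which the section split later in this plan can reuse.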
### Stale-while-revalidate

Mirror `api/routers/workstreams.py` `_workplan_index` behaviour:

- Revision unchanged → return cache (fast path).
- Revision changed, cache present → return stale cache, start background
  rebuild task (`X-StateHub-Cache: stale`).
- No cache → await rebuild (`X-StateHub-Cache: miss`).
- Respect `Cache-Control: no-cache` and a `?refresh=true` query param to force
  synchronous rebuild (for tests and operators).

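The four rules above reduce to a small pure function, which may be easier to unit-test than the route handler. This is a sketch only: the `CacheDecision` enum and `decide` function are hypothetical names, and reporting a forced refresh as `miss` is an assumption, since the plan does not say which header value a forced rebuild should carry.

```python
from enum import Enum
from typing import Optional


class CacheDecision(Enum):
    HIT_REVISION = "hit-revision"  # serve cache, no rebuild
    STALE = "stale"                # serve cache, rebuild in background
    MISS = "miss"                  # await a synchronous rebuild


def decide(
    cached_revision: Optional[str],
    current_revision: str,
    force_refresh: bool = False,
) -> CacheDecision:
    # force_refresh stands for either `Cache-Control: no-cache` or `?refresh=true`.
    if force_refresh or cached_revision is None:
        return CacheDecision.MISS
    if cached_revision == current_revision:
        return CacheDecision.HIT_REVISION
    return CacheDecision.STALE


if __name__ == "__main__":
    print(decide("r1", "r1").value)
    print(decide("r1", "r2").value)
    print(decide(None, "r2").value)
```

The enum values match the `X-StateHub-Cache` header values, so the handler can emit `decision.value` directly.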
### Section split (phase 2 within this workplan)

| Section key | Invalidation |
|-------------|--------------|
| `core` | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
| `progress_tail` | `MAX(progress_events.created_at)` only |

Serve merged `StateSummary`; rebuild only the section whose revision changed.

### Eager invalidation

Add `invalidate_summary_cache()` (or revision bump) called from:

- `POST /progress/`
- `PATCH /tasks/{id}`, `POST /tasks/`
- `PATCH /decisions/{id}`, `POST /decisions/`
- `PATCH /workplans/{id}`, `POST /workplans/` (and legacy `/workstreams/`)
- `POST /workplan-dependencies/` (and legacy alias)

Keep invalidation in-process (single uvicorn worker assumption). Document
multi-worker limitation; defer shared revision store unless needed.

### Response metadata

Extend headers (and optionally `_meta` in schema — only if consumers need it):

```
X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision: <iso-or-hash>
X-StateHub-Elapsed-Ms: <existing>
```

Preserve truthful `generated_at` on cache hits (when the snapshot was built).

## Out of scope

- Redis or external cache layer.
- Multi-worker shared cache (document only).
- Replacing `/state/overview` (already the dashboard fast path).
- Changing summary response shape for existing consumers (additive `_meta`
  only if needed).
- PostgreSQL materialized views or NOTIFY/LISTEN.

## Dependencies

- STATE-WP-0056 T07 (done) — baseline cache-miss profiling.
- activity-core bridge readiness fix (`40fa851`) — health probe; independent of
  this workplan but motivates summary optimization for other callers.

## T01 — Revision fingerprint module

```task
id: STATE-WP-0066-T01
status: todo
priority: high
state_hub_task_id: ""
```

Extract summary cache logic from `api/routers/state.py` into a dedicated
module (e.g. `api/services/summary_cache.py`).

Deliverables:

- `SummaryRevision` dataclass: per-table maxima + combined fingerprint string.
- `async def fetch_summary_revision(session) -> SummaryRevision` — one or few
  SQL queries using existing indexes; target <20ms on current data volume.
- Unit tests with mocked session rows proving fingerprint changes when any
  contributing table changes.
- Document covered tables and why each is included (flow-engine inputs,
  `next_steps`, domain rollups, licence scan).

Done when revision fetch is tested in isolation and profiled under local DB.

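One way to keep the revision fetch to a single round trip is a `UNION ALL` of per-table `MAX()` selects. The sketch below demonstrates the query shape against an in-memory SQLite database using only the standard library; the real implementation would run through the project's async session, and the three-table `REVISION_SOURCES` subset, `build_revision_sql`, and `fetch_maxima` are hypothetical names chosen for the example.

```python
import sqlite3
from typing import Dict, Optional

# Assumption: a subset of the table -> timestamp column pairs from the design sketch.
REVISION_SOURCES: Dict[str, str] = {
    "tasks": "updated_at",
    "workplans": "updated_at",
    "progress_events": "created_at",
}


def build_revision_sql(sources: Dict[str, str]) -> str:
    # Table and column names come from the constant above, never from user
    # input, so interpolating them into the SQL text is safe here.
    branches = [
        f"SELECT '{table}' AS source, MAX({column}) AS max_ts FROM {table}"
        for table, column in sorted(sources.items())
    ]
    return "\nUNION ALL\n".join(branches)


def fetch_maxima(
    conn: sqlite3.Connection, sources: Dict[str, str]
) -> Dict[str, Optional[str]]:
    # Each branch yields exactly one row, even for an empty table (MAX is NULL).
    return dict(conn.execute(build_revision_sql(sources)).fetchall())


def demo() -> Dict[str, Optional[str]]:
    conn = sqlite3.connect(":memory:")
    for table, column in REVISION_SOURCES.items():
        conn.execute(f"CREATE TABLE {table} (id INTEGER PRIMARY KEY, {column} TEXT)")
    conn.execute("INSERT INTO tasks (updated_at) VALUES ('2026-06-22T10:00:00Z')")
    conn.execute("INSERT INTO tasks (updated_at) VALUES ('2026-06-22T11:30:00Z')")
    return fetch_maxima(conn, REVISION_SOURCES)


if __name__ == "__main__":
    print(demo())
```

Whether each `MAX()` is served from an index depends on the actual schema, so the <20ms target still needs to be confirmed by profiling.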
## T02 — Revision-gated cache (replace TTL-only path)

```task
id: STATE-WP-0066-T02
status: todo
priority: high
state_hub_task_id: ""
```

Wire revision check into `GET /state/summary`:

- If revision matches cached revision → return cache regardless of age.
- Remove or demote `_SUMMARY_TTL` to a safety cap only (e.g. 5 min max stale
  age if background refresh fails).
- Add `X-StateHub-Cache: hit-revision` and `X-StateHub-Revision` headers.
- Honour `Cache-Control: no-cache` and `?refresh=true` for forced rebuild.
- Update `tests/conftest.py` cache reset helper for new module globals.

Done when repeated summary requests with unchanged DB show `hit-revision` and
skip the heavy query path (assert via mock or elapsed-ms header threshold).

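The interplay between "regardless of age" and the safety cap can be sketched as a single predicate. This assumes the cap applies only when the revision has moved on and the snapshot could not be refreshed; `MAX_STALE_SECONDS` and `can_serve_cached` are hypothetical names, and 300 seconds simply mirrors the "5 min" example above.

```python
import time
from typing import Optional

MAX_STALE_SECONDS = 300.0  # assumption: the "5 min" example from the task text


def can_serve_cached(
    cached_revision: Optional[str],
    current_revision: str,
    built_at: float,
    now: Optional[float] = None,
) -> bool:
    """Return True if the cached snapshot may be sent to the caller.

    `built_at` and `now` are time.monotonic() readings, so wall-clock
    adjustments cannot make a snapshot look younger or older than it is.
    """
    now = time.monotonic() if now is None else now
    if cached_revision is None:
        return False  # nothing cached yet
    if cached_revision == current_revision:
        return True  # revision match: age is irrelevant
    # Revision changed: a stale snapshot is acceptable only up to the cap.
    return (now - built_at) <= MAX_STALE_SECONDS


if __name__ == "__main__":
    print(can_serve_cached("r1", "r1", built_at=0.0, now=10_000.0))
    print(can_serve_cached("r1", "r2", built_at=0.0, now=301.0))
```

When the predicate returns `False` with a cache present, the handler would fall back to a synchronous rebuild instead of serving the over-age snapshot.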
## T03 — Stale-while-revalidate background refresh

```task
id: STATE-WP-0066-T03
status: todo
priority: high
state_hub_task_id: ""
```

When revision differs but a cached snapshot exists:

- Return stale snapshot immediately (`X-StateHub-Cache: stale`).
- Start a single background `asyncio` task to rebuild (dedupe concurrent
  refresh — same pattern as `_INDEX_REFRESH_TASK` in workstreams router).
- On refresh completion, atomically swap cache + revision.
- On refresh failure, retain stale cache; set `_SUMMARY_LAST_ERROR` for
  optional diagnostics header.
- Cold start (no cache) still blocks until first build completes.

Done when a revision bump serves stale data in <50ms while background rebuild
runs; second request after rebuild shows `hit-revision`.

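The behaviour listed above can be sketched with plain `asyncio`. This is a minimal illustration, not the workstreams router's code: the `SwrCache` class and its attributes are hypothetical, the builder is passed in as a callable, and concurrent cold starts are not deduplicated here.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


class SwrCache:
    """Single-value stale-while-revalidate cache with one refresh in flight."""

    def __init__(self, build: Callable[[], Awaitable[dict]]) -> None:
        self._build = build
        self.value: Optional[dict] = None
        self.revision: Optional[str] = None
        self.last_error: Optional[str] = None
        self._refresh_task: Optional[asyncio.Task] = None

    async def _refresh(self, revision: str) -> None:
        try:
            value = await self._build()
        except Exception as exc:
            self.last_error = repr(exc)  # keep serving the stale snapshot
            return
        # No await between these assignments, so other coroutines on the
        # same event loop never observe a half-updated cache.
        self.value, self.revision, self.last_error = value, revision, None

    async def get(self, current_revision: str) -> Tuple[dict, str]:
        if self.value is None:
            # Cold start: nothing to serve, so block and let errors propagate.
            self.value = await self._build()
            self.revision = current_revision
            return self.value, "miss"
        if self.revision == current_revision:
            return self.value, "hit-revision"
        # Dedupe: start a rebuild only if none is currently running.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh(current_revision))
        return self.value, "stale"


async def demo():
    calls = 0

    async def build() -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)
        return {"build": calls}

    cache = SwrCache(build)
    statuses = [(await cache.get("r1"))[1], (await cache.get("r1"))[1]]
    statuses.append((await cache.get("r2"))[1])  # stale, starts one rebuild
    statuses.append((await cache.get("r2"))[1])  # stale, rebuild already running
    await cache._refresh_task
    value, status = await cache.get("r2")
    return statuses + [status], value, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The revision stored after a refresh is the one observed before the rebuild started; if data changes during the rebuild, the next request sees a mismatch and triggers another refresh, which errs on the side of freshness.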
## T04 — Section-level cache for `recent_progress`

```task
id: STATE-WP-0066-T04
status: todo
priority: medium
state_hub_task_id: ""
```

Split cache into `core` and `progress_tail` sections:

- `progress_tail` rebuild: single query `ORDER BY created_at DESC LIMIT 20`.
- `core` rebuild: existing summary path minus recent-progress fetch.
- Merge at serve time into `StateSummary`.
- Revision mismatch on progress only → rebuild progress section only; reuse
  cached core.

Done when inserting a progress event refreshes `recent_progress` without
re-running domain/SBOM/flow-engine work (verify via query count or spy).

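A synchronous sketch of the section split, kept free of database and `asyncio` details so the rebuild accounting is easy to see. The `Section` dataclass, `serve_summary`, and the `builds` counter are hypothetical; the counter plays the role of the "query count or spy" mentioned in the done criterion.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class Section:
    build: Callable[[], dict]
    revision: Optional[str] = None
    payload: dict = field(default_factory=dict)
    builds: int = 0  # exposed so tests can assert which sections were rebuilt

    def ensure(self, current_revision: str) -> None:
        # Rebuild only when this section's own revision has changed.
        if self.revision != current_revision:
            self.payload = self.build()
            self.revision = current_revision
            self.builds += 1


def serve_summary(sections: Dict[str, Section], revisions: Dict[str, str]) -> dict:
    merged: dict = {}
    for name, section in sections.items():
        section.ensure(revisions[name])
        merged.update(section.payload)  # sections own disjoint top-level keys
    return merged


if __name__ == "__main__":
    sections = {
        "core": Section(build=lambda: {"totals": {"tasks": 3}}),
        "progress_tail": Section(build=lambda: {"recent_progress": ["event-1"]}),
    }
    serve_summary(sections, {"core": "c1", "progress_tail": "p1"})
    # A new progress event bumps only the progress_tail revision.
    serve_summary(sections, {"core": "c1", "progress_tail": "p2"})
    print(sections["core"].builds, sections["progress_tail"].builds)
```

Merging with `dict.update` assumes the two sections never write the same top-level key; a real merge into `StateSummary` would construct the model from both payloads.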
## T05 — Eager invalidation on mutation routes

```task
id: STATE-WP-0066-T05
status: todo
priority: medium
state_hub_task_id: ""
```

Call `invalidate_summary_cache()` (or equivalent revision bump) from write
paths listed in the design sketch.

- Invalidation must be synchronous before response returns (so the next GET
  sees fresh revision).
- Progress `POST` invalidates at least the `progress_tail` section.
- Task/workplan/decision `PATCH` invalidates `core`.
- Add focused tests: mutate → GET summary reflects change without `?refresh`.

Done when mutation routes trigger invalidation and tests pass.

## T06 — Benchmark and regression tests

```task
id: STATE-WP-0066-T06
status: todo
priority: high
state_hub_task_id: ""
```

Prove cache effectiveness under realistic load:

- Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers
  `/state/summary` with unchanged revision — p95 < 50ms.
- Cache-miss path (forced `?refresh=true`) stays < 500ms on current local data
  volume (STATE-WP-0056 T07 bar).
- MCP smoke tests still pass (`tests/test_mcp_smoke.py`).
- Router shape tests still pass (`tests/test_routers_core.py`).
- Add tests for stale-while-revalidate and section-split behaviour.

Done when benchmark script is documented in workplan notes or Makefile target
and CI tests are green.

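For the latency bars above, the benchmark script needs an unambiguous p95. The sketch below uses the nearest-rank definition with integer arithmetic; `p95` and `measure` are hypothetical helpers, and the callable passed to `measure` stands in for whatever HTTP client call the script ends up using.

```python
import time
from typing import Callable, List


def p95(samples_ms: List[float]) -> float:
    # Nearest-rank percentile: the smallest sample such that at least 95%
    # of all samples are less than or equal to it.
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) without floats
    return ordered[rank - 1]


def measure(call: Callable[[], object], n: int = 200) -> List[float]:
    """Time `call` n times and return per-call latencies in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


if __name__ == "__main__":
    # Placeholder workload; a real run would issue GET /state/summary here.
    latencies = measure(lambda: sum(range(1000)), n=50)
    print(f"p95 = {p95(latencies):.3f} ms over {len(latencies)} calls")
```

Timing from the client side includes connection overhead, so it is worth comparing against the `X-StateHub-Elapsed-Ms` header when a result sits close to a threshold.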
## T07 — Documentation and consumer guidance

```task
id: STATE-WP-0066-T07
status: todo
priority: low
state_hub_task_id: ""
```

Update:

- `README.md` — `/state/summary` caching section (revision + stale headers).
- `dashboard/src/docs/live-data.md` — which pages need summary vs overview.
- `mcp_server/TOOLS.md` — note cache behaviour for `get_state_summary`.
- `SCOPE.md` one-liner if summary caching is an operational concern.

Clarify for operators:

- Infrastructure probes → `/state/health`.
- Full snapshot → `/state/summary` (cached; `?refresh=true` to force).
- Dashboard overview → `/state/overview` (already lighter).

Done when docs match implemented headers and invalidation semantics.

## T08 — Verify through ops-bridge path

```task
id: STATE-WP-0066-T08
status: todo
priority: medium
state_hub_task_id: ""
```

End-to-end check from Railiance01 (or documented manual runbook):

- `bridge check state-hub-railiance01` healthy.
- Through node `127.0.0.1:18080`: summary returns 200 with `hit-revision` or
  `stale` on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
- Confirm `actcore-state-hub-bridge` readiness stays 1/1 (health probe, not
  summary).
- Log findings as State Hub progress event when verified.

Done when bridge-path latency is recorded and no readiness regressions over a
30-minute observation window.

## Acceptance criteria

- Quiet hub: consecutive `/state/summary` requests hit revision cache, not full
  rebuild.
- Data change: next request serves stale snapshot immediately, then fresh data
  after background refresh.
- Progress-only change: core section not rebuilt (T04).
- Mutation → visible on next GET without manual API restart.
- No breaking changes to `StateSummary` JSON for existing consumers.
- Bridge readiness remains on `/state/health`.

## Notes for implementer

- `AsyncSession` does not support concurrent operations on one session —
  background refresh must use `async_session_factory()` (same as workplan
  index).
- `generated_at` on cache hit should reflect build time, not request time.
- Consider extracting `_build_state_summary(session)` from the route handler to
  simplify testing (may already be partially inline).
- If revision query itself becomes hot, cache the revision for ~1s (micro-TTL)
  as a second-level optimization — only if profiling warrants it.