Add STATE-WP-0066 workplan for state summary revision cache
Defines revision-gated caching, stale-while-revalidate, section split for recent_progress, mutation invalidation, and bridge-path verification.
340
workplans/STATE-WP-0066-state-summary-revision-cache.md
Normal file
@@ -0,0 +1,340 @@
---
id: STATE-WP-0066
type: workplan
title: "State summary revision cache and stale-while-revalidate"
domain: custodian
repo: state-hub
status: ready
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
state_hub_workstream_id: ""
---

# STATE-WP-0066 — State summary revision cache and stale-while-revalidate
## Summary

Upgrade `/state/summary` caching from a fixed 15-second TTL to a
**revision-gated** cache: serve the cached snapshot when underlying hub data
has not changed, and use **stale-while-revalidate** when it has. This extends
STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel
latency spikes that made `/state/summary` a poor readiness-probe target on
Railiance01.

`/state/health` remains the correct probe endpoint for infrastructure checks.
This workplan optimizes summary for dashboard, MCP, and activity-core consumers
that legitimately need the full snapshot.

## Problem

Current behaviour (`api/routers/state.py`):

- In-process TTL cache (`_SUMMARY_TTL = 15s`): entries expire on wall clock,
  not on data change.
- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and
  Python-side SBOM licence scanning (~200ms+ locally; worse through
  ops-bridge → WSL).
- ETag middleware lets repeated identical responses return 304, but a cache
  miss still pays the full rebuild cost.
- `progress_events` is append-only; hourly sweeps and agent notes change what
  `recent_progress` should show even when totals and workplan rows are
  unchanged.

Observed impact:

- Bridge readiness probe timeouts when summary was used (fixed in
  activity-core `40fa851`; do not regress).
- Dashboard workplans page and MCP `get_state_summary` poll summary on
  15–60s intervals, which causes unnecessary rebuilds during quiet periods.
- Concurrent cache misses contend on a single `AsyncSession` (no gather).

## Goal

1. **Revision watermark**: one cheap query (or small query set) determines
   whether the cached summary is still valid.
2. **Stale-while-revalidate**: return the last good snapshot immediately on
   revision change; rebuild in the background.
3. **Section-level freshness**: split `recent_progress` from the stable core
   so sweep traffic does not force full rebuilds.
4. **Eager invalidation**: mutation routes clear or bump the revision so
   writes are visible without waiting for the next poll.
5. **Observable cache**: response headers report hit/miss/stale state and the
   revision.

## Design sketch
### Revision fingerprint

Before a full rebuild, compute a `SummaryRevision` from indexed
`MAX(updated_at)` / `MAX(created_at)` across tables that feed summary:

| Signal | Tables / columns |
|--------|------------------|
| Core entities | `topics`, `workplans`, `tasks`, `decisions`, `workplan_dependencies` → `updated_at` |
| Append-only events | `progress_events` → `created_at` |
| Portfolio | `managed_repos`, `contributions`, `capability_requests` → `updated_at` |
| Domain rollup | `domains`, `extension_points`, `technical_debt` → `updated_at` |
| SBOM slice | `sbom_entries` (or latest `sbom_snapshots`) → relevant timestamp |

Store `revision` (ISO timestamp or hash of per-table maxima) alongside cached
`StateSummary`. If incoming revision matches cached revision → return cache
(`X-StateHub-Cache: hit-revision`).

Reuse the fingerprint pattern from `api/doi_engine.py` (`compute_fingerprint`).

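A minimal sketch of the "hash of per-table maxima" option described above. It does not reproduce `compute_fingerprint` from `api/doi_engine.py`, whose signature is not shown in this workplan; the `maxima` field layout, the `make_revision` helper, and the 16-character SHA-256 prefix are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint (hypothetical shape)."""

    maxima: tuple[tuple[str, str], ...]  # (signal name, ISO timestamp or "")

    @property
    def fingerprint(self) -> str:
        # Hash "name=value" pairs in sorted order so the result does not
        # depend on the order in which the tables were queried.
        payload = "\n".join(f"{name}={value}" for name, value in sorted(self.maxima))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def make_revision(raw: dict[str, datetime | None]) -> SummaryRevision:
    """Normalise raw MAX(...) values; an empty table (None) becomes ""."""
    pairs = tuple(
        (name, ts.astimezone(timezone.utc).isoformat() if ts else "")
        for name, ts in raw.items()
    )
    return SummaryRevision(maxima=pairs)
```

Comparing `fingerprint` strings is then enough to decide whether the cached `StateSummary` is still valid; the per-table values stay available for diagnostics.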
### Stale-while-revalidate

Mirror `api/routers/workstreams.py` `_workplan_index` behaviour:

- Revision unchanged → return cache (fast path).
- Revision changed, cache present → return stale cache, start background
  rebuild task (`X-StateHub-Cache: stale`).
- No cache → await rebuild (`X-StateHub-Cache: miss`).
- Respect `Cache-Control: no-cache` and a `?refresh=true` query param to force
  synchronous rebuild (for tests and operators).

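The three cache states above can be sketched with plain `asyncio`. This is an illustration under stated assumptions, not the `_workplan_index` code it is meant to mirror: the `SummaryCache` class, its `build` callback, and the `last_error` attribute are hypothetical names, and forced refresh is left out to keep the sketch short.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Tuple


class SummaryCache:
    """In-process stale-while-revalidate cache (illustrative, single worker)."""

    def __init__(self, build: Callable[[], Awaitable[dict]]) -> None:
        self._build = build  # performs the expensive rebuild
        self._snapshot: Optional[dict] = None
        self._revision: Optional[str] = None
        self._refresh_task: Optional[asyncio.Task] = None
        self.last_error: Optional[str] = None

    async def _refresh(self, revision: str) -> None:
        try:
            snapshot = await self._build()
        except Exception as exc:  # keep serving the previous snapshot
            self.last_error = repr(exc)
            return
        # Swap snapshot and revision together, with no await in between.
        self._snapshot, self._revision = snapshot, revision
        self.last_error = None

    async def get(self, revision: str) -> Tuple[dict, str]:
        """Return (snapshot, cache_state) for the current revision."""
        if self._snapshot is None:
            await self._refresh(revision)  # cold start: block on first build
            if self._snapshot is None:
                raise RuntimeError(f"summary build failed: {self.last_error}")
            return self._snapshot, "miss"
        if revision == self._revision:
            return self._snapshot, "hit-revision"
        # Revision changed: start at most one background rebuild.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh(revision))
        return self._snapshot, "stale"
```

Because the stale branch returns without awaiting, a request that observes a new revision pays only for the revision check; the rebuild cost moves to the background task.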
### Section split (phase 2 within this workplan)

| Section key | Invalidated by |
|-------------|----------------|
| `core` | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
| `progress_tail` | `MAX(progress_events.created_at)` only |

Serve merged `StateSummary`; rebuild only the section whose revision changed.

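As a rough illustration of the section split, the sketch below keeps one revision per section and rebuilds only the sections whose revision moved. The `SectionedSummary` class and its synchronous builder callbacks are hypothetical; a real implementation would presumably use async builders and the actual `StateSummary` schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Section:
    revision: Optional[str] = None  # None means "never built"
    data: dict = field(default_factory=dict)


class SectionedSummary:
    """Rebuild only the sections whose revision moved (illustrative)."""

    def __init__(self, builders: Dict[str, Callable[[], dict]]) -> None:
        self._builders = builders
        self._sections: Dict[str, Section] = {name: Section() for name in builders}
        self.rebuilt: List[str] = []  # names rebuilt, kept for tests/diagnostics

    def serve(self, revisions: Dict[str, str]) -> dict:
        merged: dict = {}
        for name, builder in self._builders.items():
            section = self._sections[name]
            if section.revision != revisions[name]:
                section.data = builder()
                section.revision = revisions[name]
                self.rebuilt.append(name)
            merged.update(section.data)  # merge at serve time
        return merged
```

The `rebuilt` list gives tests a simple spy: after a progress-only change it should gain `progress_tail` but not `core`.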
### Eager invalidation

Add `invalidate_summary_cache()` (or revision bump) called from:

- `POST /progress/`
- `PATCH /tasks/{id}`, `POST /tasks/`
- `PATCH /decisions/{id}`, `POST /decisions/`
- `PATCH /workplans/{id}`, `POST /workplans/` (and legacy `/workstreams/`)
- `POST /workplan-dependencies/` (and legacy alias)

Keep invalidation in-process (single uvicorn worker assumption). Document
multi-worker limitation; defer shared revision store unless needed.

### Response metadata

Extend headers (and optionally `_meta` in the schema, only if consumers need
it):

```
X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision: <iso-or-hash>
X-StateHub-Elapsed-Ms: <existing>
```

Keep `generated_at` truthful on cache hits: it records when the snapshot was
built, not when it was served.

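A small helper can keep the header names and allowed states in one place. This is a sketch: the `cache_headers` function is hypothetical, and the one-decimal formatting of `X-StateHub-Elapsed-Ms` is an assumption, since the existing format is not shown here.

```python
from typing import Dict

CACHE_STATES = ("hit-revision", "stale", "miss")


def cache_headers(state: str, revision: str, elapsed_ms: float) -> Dict[str, str]:
    """Build the three cache-related response headers."""
    if state not in CACHE_STATES:
        raise ValueError(f"unknown cache state: {state!r}")
    return {
        "X-StateHub-Cache": state,
        "X-StateHub-Revision": revision,
        # Formatting is assumed; match whatever the existing header emits.
        "X-StateHub-Elapsed-Ms": f"{elapsed_ms:.1f}",
    }
```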
## Out of scope

- Redis or external cache layer.
- Multi-worker shared cache (document only).
- Replacing `/state/overview` (already the dashboard fast path).
- Changing summary response shape for existing consumers (additive `_meta`
  only if needed).
- PostgreSQL materialized views or NOTIFY/LISTEN.

## Dependencies

- STATE-WP-0056 T07 (done): baseline cache-miss profiling.
- activity-core bridge readiness fix (`40fa851`): health probe; independent of
  this workplan but motivates summary optimization for other callers.

## T01 — Revision fingerprint module

```task
id: STATE-WP-0066-T01
status: todo
priority: high
state_hub_task_id: ""
```

Extract summary cache logic from `api/routers/state.py` into a dedicated
module (e.g. `api/services/summary_cache.py`).

Deliverables:

- `SummaryRevision` dataclass: per-table maxima + combined fingerprint string.
- `async def fetch_summary_revision(session) -> SummaryRevision`: one or a few
  SQL queries using existing indexes; target <20ms on current data volume.
- Unit tests with mocked session rows proving the fingerprint changes when any
  contributing table changes.
- Document covered tables and why each is included (flow-engine inputs,
  `next_steps`, domain rollups, licence scan).

Done when revision fetch is tested in isolation and profiled under local DB.

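The "one or a few SQL queries" deliverable could take the form of a single `UNION ALL` over the contributing tables. The sketch below uses the standard-library `sqlite3` module purely as a stand-in so it runs anywhere; the real function would go through the async session, and the `WATERMARK_COLUMNS` mapping, `watermark_sql`, and `fetch_watermarks` are hypothetical names. Only a subset of tables is listed, and the SBOM timestamp is omitted because the workplan does not name its column.

```python
import sqlite3
from typing import Dict, Optional

# Table -> timestamp column, following the signal table in the design sketch.
# Identifiers cannot be bound as SQL parameters, so this must stay a trusted
# constant and never come from user input.
WATERMARK_COLUMNS: Dict[str, str] = {
    "topics": "updated_at",
    "workplans": "updated_at",
    "tasks": "updated_at",
    "decisions": "updated_at",
    "progress_events": "created_at",
}


def watermark_sql(columns: Dict[str, str]) -> str:
    """One UNION ALL statement returning a (source, watermark) row per table."""
    parts = [
        f"SELECT '{table}' AS source, MAX({column}) AS watermark FROM {table}"
        for table, column in columns.items()
    ]
    return "\nUNION ALL\n".join(parts)


def fetch_watermarks(
    conn: sqlite3.Connection, columns: Dict[str, str]
) -> Dict[str, Optional[str]]:
    """Run the query; empty tables yield None because MAX() returns NULL."""
    return dict(conn.execute(watermark_sql(columns)).fetchall())
```

Whether `MAX(...)` is cheap depends on the timestamp columns being indexed, which is why the <20ms target should be confirmed by profiling rather than assumed.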
## T02 — Revision-gated cache (replace TTL-only path)

```task
id: STATE-WP-0066-T02
status: todo
priority: high
state_hub_task_id: ""
```

Wire revision check into `GET /state/summary`:

- If revision matches cached revision → return cache regardless of age.
- Remove or demote `_SUMMARY_TTL` to a safety cap only (e.g. 5 min max stale
  age if background refresh fails).
- Add `X-StateHub-Cache: hit-revision` and `X-StateHub-Revision` headers.
- Honour `Cache-Control: no-cache` and `?refresh=true` for forced rebuild.
- Update `tests/conftest.py` cache reset helper for new module globals.

Done when repeated summary requests with unchanged DB show `hit-revision` and
skip the heavy query path (assert via mock or elapsed-ms header threshold).

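The per-request decision can be isolated as a pure function, which makes the rules above easy to unit test. The sketch below is one reading of those rules, not the final design: `wants_refresh` and `classify` are hypothetical names, header keys are assumed to be lower-cased already, and treating an over-age stale snapshot as a miss is an interpretation of the "safety cap".

```python
from typing import Mapping, Optional

MAX_STALE_SECONDS = 300.0  # the "5 min" safety cap suggested above


def wants_refresh(headers: Mapping[str, str], query: Mapping[str, str]) -> bool:
    """True when the caller asks for a synchronous rebuild."""
    cache_control = headers.get("cache-control", "").lower()
    directives = {part.strip() for part in cache_control.split(",")}
    return "no-cache" in directives or query.get("refresh", "").lower() == "true"


def classify(
    cached_revision: Optional[str],
    current_revision: str,
    age_seconds: float,
    force: bool,
) -> str:
    """Return 'miss', 'hit-revision', or 'stale' for one request."""
    if force or cached_revision is None:
        return "miss"
    if cached_revision == current_revision:
        return "hit-revision"  # valid regardless of age
    if age_seconds > MAX_STALE_SECONDS:
        return "miss"  # stale for too long: rebuild synchronously
    return "stale"
```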
## T03 — Stale-while-revalidate background refresh

```task
id: STATE-WP-0066-T03
status: todo
priority: high
state_hub_task_id: ""
```

When revision differs but a cached snapshot exists:

- Return stale snapshot immediately (`X-StateHub-Cache: stale`).
- Start a single background `asyncio` task to rebuild (dedupe concurrent
  refreshes, using the same pattern as `_INDEX_REFRESH_TASK` in the
  workstreams router).
- On refresh completion, atomically swap cache + revision.
- On refresh failure, retain stale cache; set `_SUMMARY_LAST_ERROR` for
  optional diagnostics header.
- Cold start (no cache) still blocks until first build completes.

Done when a revision bump serves stale data in <50ms while background rebuild
runs; second request after rebuild shows `hit-revision`.

## T04 — Section-level cache for `recent_progress`

```task
id: STATE-WP-0066-T04
status: todo
priority: medium
state_hub_task_id: ""
```

Split cache into `core` and `progress_tail` sections:

- `progress_tail` rebuild: single query `ORDER BY created_at DESC LIMIT 20`.
- `core` rebuild: existing summary path minus recent-progress fetch.
- Merge at serve time into `StateSummary`.
- Revision mismatch on progress only → rebuild progress section only; reuse
  cached core.

Done when inserting a progress event refreshes `recent_progress` without
re-running domain/SBOM/flow-engine work (verify via query count or spy).

## T05 — Eager invalidation on mutation routes

```task
id: STATE-WP-0066-T05
status: todo
priority: medium
state_hub_task_id: ""
```

Call `invalidate_summary_cache()` (or equivalent revision bump) from the write
paths listed in the design sketch.

- Invalidation must complete synchronously before the response returns (so
  the next GET sees the fresh revision).
- Progress `POST` invalidates at least the `progress_tail` section.
- Task/workplan/decision `PATCH` invalidates `core`.
- Add focused tests: mutate → GET summary reflects change without `?refresh`.

Done when mutation routes trigger invalidation and tests pass.

## T06 — Benchmark and regression tests

```task
id: STATE-WP-0066-T06
status: todo
priority: high
state_hub_task_id: ""
```

Prove cache effectiveness under realistic load:

- Extend or add a script (cf. STATE-WP-0056 dashboard timing script) that
  hammers `/state/summary` with unchanged revision; target p95 < 50ms.
- Cache-miss path (forced `?refresh=true`) stays < 500ms on current local data
  volume (STATE-WP-0056 T07 bar).
- MCP smoke tests still pass (`tests/test_mcp_smoke.py`).
- Router shape tests still pass (`tests/test_routers_core.py`).
- Add tests for stale-while-revalidate and section-split behaviour.

Done when benchmark script is documented in workplan notes or Makefile target
and CI tests are green.

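For the benchmark script, the p95 thresholds above need an explicit percentile definition. The sketch below uses the nearest-rank method, which is one common choice and an assumption here; `p95` and `meets_budget` are hypothetical helper names, and collecting the latency samples (for example from `X-StateHub-Elapsed-Ms`) is left to the script.

```python
from typing import Sequence


def p95(samples_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer form of ceil(0.95 * n), which avoids float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_budget(samples_ms: Sequence[float], budget_ms: float) -> bool:
    """True when p95 is strictly below the budget (e.g. 50 for the hit path)."""
    return p95(samples_ms) < budget_ms
```

With 100 samples this picks the 95th smallest value, so up to five slow outliers do not fail the budget.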
## T07 — Documentation and consumer guidance

```task
id: STATE-WP-0066-T07
status: todo
priority: low
state_hub_task_id: ""
```

Update:

- `README.md`: `/state/summary` caching section (revision + stale headers).
- `dashboard/src/docs/live-data.md`: which pages need summary vs overview.
- `mcp_server/TOOLS.md`: note cache behaviour for `get_state_summary`.
- `SCOPE.md`: one-liner if summary caching is an operational concern.

Clarify for operators:

- Infrastructure probes → `/state/health`.
- Full snapshot → `/state/summary` (cached; `?refresh=true` to force).
- Dashboard overview → `/state/overview` (already lighter).

Done when docs match implemented headers and invalidation semantics.

## T08 — Verify through ops-bridge path

```task
id: STATE-WP-0066-T08
status: todo
priority: medium
state_hub_task_id: ""
```

End-to-end check from Railiance01 (or documented manual runbook):

- `bridge check state-hub-railiance01` healthy.
- Through node `127.0.0.1:18080`: summary returns 200 with `hit-revision` or
  `stale` on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
- Confirm `actcore-state-hub-bridge` readiness stays 1/1 (health probe, not
  summary).
- Log findings as a State Hub progress event when verified.

Done when bridge-path latency is recorded and there are no readiness
regressions over a 30-minute observation window.

## Acceptance criteria

- Quiet hub: consecutive `/state/summary` requests hit revision cache, not full
  rebuild.
- Data change: next request serves stale snapshot immediately, then fresh data
  after background refresh.
- Progress-only change: core section not rebuilt (T04).
- Mutation → visible on next GET without manual API restart.
- No breaking changes to `StateSummary` JSON for existing consumers.
- Bridge readiness remains on `/state/health`.

## Notes for implementer

- `AsyncSession` does not support concurrent operations on one session, so
  background refresh must use `async_session_factory()` (same as workplan
  index).
- `generated_at` on cache hit should reflect build time, not request time.
- Consider extracting `_build_state_summary(session)` from the route handler to
  simplify testing (may already be partially inline).
- If the revision query itself becomes hot, cache the revision for ~1s
  (micro-TTL) as a second-level optimization, but only if profiling warrants
  it.
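
The micro-TTL idea in the last note can be sketched as a tiny memoiser with an injectable clock. This is illustrative only: `MicroTTL` is a hypothetical name, the fetch callback is synchronous here for brevity (the real revision fetch is async), and the `clear()` hook is an assumption about how it would coexist with eager invalidation.

```python
import time
from typing import Callable, Optional


class MicroTTL:
    """Memoise a zero-argument fetch for a short window (illustrative)."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests need no real sleeping
        self._value: Optional[str] = None
        self._expires_at = float("-inf")

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

    def clear(self) -> None:
        """Force the next get() to refetch, e.g. after a mutation route runs."""
        self._expires_at = float("-inf")
```

The trade-off is that a write made by another process can stay invisible for up to the TTL, which is one more reason to add this only if profiling shows the revision query is a bottleneck.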