Replace the fixed 15s TTL on GET /state/summary with per-table revision watermarks, stale-while-revalidate background refresh, and a progress-tail section split. SQLAlchemy write hooks invalidate core or progress sections on mutation. Adds tests, benchmark script, and operator docs.
---
id: STATE-WP-0066
type: workplan
title: "State summary revision cache and stale-while-revalidate"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-22"
updated: "2026-06-22"
finished: "2026-06-22"
state_hub_workstream_id: "f738cd77-6b8b-40e5-b348-dc304c7821f1"
---

# STATE-WP-0066 — State summary revision cache and stale-while-revalidate

## Summary

Upgrade `/state/summary` caching from a fixed 15-second TTL to a
**revision-gated** cache: serve the cached snapshot when underlying hub data
has not changed, and use **stale-while-revalidate** when it has. This extends
STATE-WP-0056 T07 (cache-miss cost reduction) and addresses the bridge/tunnel
latency spikes that made `/state/summary` a poor readiness-probe target on
Railiance01.

`/state/health` remains the correct probe endpoint for infrastructure checks.
This workplan optimizes summary for dashboard, MCP, and activity-core consumers
that legitimately need the full snapshot.

## Problem

Current behaviour (`api/routers/state.py`):

- In-process TTL cache (`_SUMMARY_TTL = 15s`) — expires on wall clock, not data
  change.
- A cache miss runs ~20 sequential DB queries plus flow-engine evaluation and
  Python-side SBOM licence scanning (~200ms+ locally; worse through
  ops-bridge → WSL).
- ETag middleware helps repeat identical responses (304) but still pays full
  rebuild cost on miss.
- `progress_events` is append-only; hourly sweeps and agent notes invalidate
  freshness expectations for `recent_progress` even when totals and workplan
  rows are unchanged.

Observed impact:

- Bridge readiness probe timeouts when summary was used (fixed in
  activity-core `40fa851`; do not regress).
- Dashboard workplans page and MCP `get_state_summary` poll summary on
  15–60s intervals — unnecessary rebuilds during quiet periods.
- Concurrent cache misses contend on a single `AsyncSession` (no gather).

## Goal

1. **Revision watermark** — one cheap query (or small query set) determines
   whether cached summary is still valid.
2. **Stale-while-revalidate** — return last good snapshot immediately on
   revision change; rebuild in background.
3. **Section-level freshness** — split `recent_progress` from the stable core
   so sweep traffic does not force full rebuilds.
4. **Eager invalidation** — mutation routes clear or bump revision so writes
   are visible without waiting for the next poll.
5. **Observable cache** — response headers document hit/miss/stale/revision.

## Design sketch

### Revision fingerprint

Before a full rebuild, compute a `SummaryRevision` from indexed
`MAX(updated_at)` / `MAX(created_at)` across tables that feed summary:

| Signal | Tables / columns |
|--------|------------------|
| Core entities | `topics`, `workplans`, `tasks`, `decisions`, `workplan_dependencies` → `updated_at` |
| Append-only events | `progress_events` → `created_at` |
| Portfolio | `managed_repos`, `contributions`, `capability_requests` → `updated_at` |
| Domain rollup | `domains`, `extension_points`, `technical_debt` → `updated_at` |
| SBOM slice | `sbom_entries` (or latest `sbom_snapshots`) → relevant timestamp |

Store `revision` (ISO timestamp or hash of per-table maxima) alongside cached
`StateSummary`. If incoming revision matches cached revision → return cache
(`X-StateHub-Cache: hit-revision`).

Reuse the fingerprint pattern from `api/doi_engine.py` (`compute_fingerprint`).

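As an illustration of the "hash of per-table maxima" option, here is a minimal sketch. It does not call `compute_fingerprint` from `api/doi_engine.py`, whose signature is not shown here; the field name `maxima`, the SHA-256 choice, and the 16-character truncation are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class SummaryRevision:
    """Per-table maxima plus a combined fingerprint string (illustrative)."""

    # Hypothetical field: table name -> latest timestamp, None for an empty table.
    maxima: dict[str, Optional[datetime]]

    @property
    def fingerprint(self) -> str:
        # Sort by table name so the hash does not depend on dict ordering.
        parts = [
            f"{table}={ts.isoformat() if ts else 'none'}"
            for table, ts in sorted(self.maxima.items())
        ]
        digest = hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()
        return digest[:16]  # assumed length, short enough for a response header
```

Two revisions built from the same maxima compare equal by fingerprint regardless of insertion order, and a change to any single table's maximum produces a different string.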
### Stale-while-revalidate

Mirror `api/routers/workstreams.py` `_workplan_index` behaviour:

- Revision unchanged → return cache (fast path).
- Revision changed, cache present → return stale cache, start background
  rebuild task (`X-StateHub-Cache: stale`).
- No cache → await rebuild (`X-StateHub-Cache: miss`).
- Respect `Cache-Control: no-cache` and a `?refresh=true` query param to force
  synchronous rebuild (for tests and operators).

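The four cases above reduce to a small pure function, sketched here. The name `cache_decision` is hypothetical, and reporting a forced refresh as `miss` is an assumption; the workplan does not say which header value a forced rebuild should carry.

```python
from typing import Optional


def cache_decision(
    cached_revision: Optional[str],
    current_revision: str,
    force_refresh: bool = False,
) -> tuple[str, bool]:
    """Return (X-StateHub-Cache value, whether the request must await a rebuild)."""
    if force_refresh or cached_revision is None:
        # Cold start or operator-forced refresh: block until the rebuild finishes.
        return "miss", True
    if cached_revision == current_revision:
        # Nothing changed underneath the snapshot: fast path.
        return "hit-revision", False
    # Revision moved: serve the old snapshot and refresh in the background.
    return "stale", False
```

Keeping the decision separate from the I/O makes each branch testable without a database or an event loop.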
### Section split (phase 2 within this workplan)

| Section key | Invalidation |
|-------------|--------------|
| `core` | topics, workplans, tasks, decisions, deps, domains, totals, next_steps |
| `progress_tail` | `MAX(progress_events.created_at)` only |

Serve merged `StateSummary`; rebuild only the section whose revision changed.

### Eager invalidation

Add `invalidate_summary_cache()` (or revision bump) called from:

- `POST /progress/`
- `PATCH /tasks/{id}`, `POST /tasks/`
- `PATCH /decisions/{id}`, `POST /decisions/`
- `PATCH /workplans/{id}`, `POST /workplans/` (and legacy `/workstreams/`)
- `POST /workplan-dependencies/` (and legacy alias)

Keep invalidation in-process (single uvicorn worker assumption). Document
multi-worker limitation; defer shared revision store unless needed.

### Response metadata

Extend headers (and optionally `_meta` in schema — only if consumers need it):

```
X-StateHub-Cache: hit-revision | stale | miss
X-StateHub-Revision: <iso-or-hash>
X-StateHub-Elapsed-Ms: <existing>
```

Preserve truthful `generated_at` on cache hits (when the snapshot was built).

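A small helper can keep the three headers consistent across code paths. The sketch below is illustrative only: the function name `cache_headers` and the one-decimal formatting of the elapsed time are assumptions, not the existing header format.

```python
ALLOWED_CACHE_STATES = {"hit-revision", "stale", "miss"}


def cache_headers(status: str, revision: str, elapsed_ms: float) -> dict:
    """Build the cache-related response headers for /state/summary."""
    if status not in ALLOWED_CACHE_STATES:
        # Fail loudly so a typo never ships an undocumented header value.
        raise ValueError(f"unknown cache status: {status!r}")
    return {
        "X-StateHub-Cache": status,
        "X-StateHub-Revision": revision,
        # Assumed formatting; the current header format may differ.
        "X-StateHub-Elapsed-Ms": f"{elapsed_ms:.1f}",
    }
```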
## Out of scope

- Redis or external cache layer.
- Multi-worker shared cache (document only).
- Replacing `/state/overview` (already the dashboard fast path).
- Changing summary response shape for existing consumers (additive `_meta`
  only if needed).
- PostgreSQL materialized views or NOTIFY/LISTEN.

## Dependencies

- STATE-WP-0056 T07 (done) — baseline cache-miss profiling.
- activity-core bridge readiness fix (`40fa851`) — health probe; independent of
  this workplan but motivates summary optimization for other callers.

## T01 — Revision fingerprint module

```task
id: STATE-WP-0066-T01
status: done
priority: high
state_hub_task_id: "8ee836ec-048c-44f6-b16e-e7454d07371a"
```

Extract summary cache logic from `api/routers/state.py` into a dedicated
module (e.g. `api/services/summary_cache.py`).

Deliverables:

- `SummaryRevision` dataclass: per-table maxima + combined fingerprint string.
- `async def fetch_summary_revision(session) -> SummaryRevision` — one or few
  SQL queries using existing indexes; target <20ms on current data volume.
- Unit tests with mocked session rows proving fingerprint changes when any
  contributing table changes.
- Document covered tables and why each is included (flow-engine inputs,
  `next_steps`, domain rollups, licence scan).

Done when revision fetch is tested in isolation and profiled under local DB.

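One way to keep the revision fetch to a single round trip is a `UNION ALL` of per-table `MAX(...)` selects. The sketch below uses the standard-library `sqlite3` module as a stand-in for the async SQLAlchemy session; `REVISION_SOURCES`, `revision_query`, and `fetch_maxima` are hypothetical names, and only a subset of the tables from the design sketch is listed.

```python
import sqlite3

# Assumed subset of the tables named in the design sketch: table -> timestamp column.
REVISION_SOURCES = {
    "tasks": "updated_at",
    "workplans": "updated_at",
    "progress_events": "created_at",
}


def revision_query(sources: dict) -> str:
    """Build one UNION ALL statement returning (table, max timestamp) rows."""
    # Table and column names come from the fixed constant above, never from user input.
    selects = [
        f"SELECT '{table}' AS source, MAX({column}) AS latest FROM {table}"
        for table, column in sorted(sources.items())
    ]
    return " UNION ALL ".join(selects)


def fetch_maxima(conn: sqlite3.Connection, sources: dict) -> dict:
    """Run the combined query and map each table to its latest timestamp."""
    return dict(conn.execute(revision_query(sources)).fetchall())
```

An empty table yields `None` for its maximum, which the fingerprint needs to treat as a distinct, stable value.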
## T02 — Revision-gated cache (replace TTL-only path)

```task
id: STATE-WP-0066-T02
status: done
priority: high
state_hub_task_id: "c5fc8ab8-5fc8-463b-9ae7-c304a7e0383e"
```

Wire revision check into `GET /state/summary`:

- If revision matches cached revision → return cache regardless of age.
- Remove or demote `_SUMMARY_TTL` to a safety cap only (e.g. 5 min max stale
  age if background refresh fails).
- Add `X-StateHub-Cache: hit-revision` and `X-StateHub-Revision` headers.
- Honour `Cache-Control: no-cache` and `?refresh=true` for forced rebuild.
- Update `tests/conftest.py` cache reset helper for new module globals.

Done when repeated summary requests with unchanged DB show `hit-revision` and
skip the heavy query path (assert via mock or elapsed-ms header threshold).

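The forced-rebuild check can be isolated from the framework, as in this sketch. `wants_forced_refresh` is a hypothetical helper that takes plain mappings rather than a request object; accepting only the literal value `true` for `refresh` is an assumption.

```python
from typing import Mapping


def wants_forced_refresh(headers: Mapping[str, str], query: Mapping[str, str]) -> bool:
    """True when the caller asked to bypass the cached summary."""
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    # Cache-Control may carry several comma-separated directives.
    directives = {
        part.strip().lower()
        for part in lowered.get("cache-control", "").split(",")
    }
    if "no-cache" in directives:
        return True
    return query.get("refresh", "").strip().lower() == "true"
```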
## T03 — Stale-while-revalidate background refresh

```task
id: STATE-WP-0066-T03
status: done
priority: high
state_hub_task_id: "aa079e17-6103-4539-878d-b451035e5f8a"
```

When revision differs but a cached snapshot exists:

- Return stale snapshot immediately (`X-StateHub-Cache: stale`).
- Start a single background `asyncio` task to rebuild (dedupe concurrent
  refresh — same pattern as `_INDEX_REFRESH_TASK` in workstreams router).
- On refresh completion, atomically swap cache + revision.
- On refresh failure, retain stale cache; set `_SUMMARY_LAST_ERROR` for
  optional diagnostics header.
- Cold start (no cache) still blocks until first build completes.

Done when a revision bump serves stale data in <50ms while background rebuild
runs; second request after rebuild shows `hit-revision`.

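The bullets above can be sketched as a small class around one `asyncio.Task`. This is not the code in the workstreams router; `SwrCache`, its attributes, and the `build` callable (assumed to return a `(snapshot, revision)` pair and to open its own DB session) are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class SwrCache:
    """Stale-while-revalidate holder with a single deduplicated refresh task."""

    def __init__(self, build: Callable[[], Awaitable[tuple]]) -> None:
        self._build = build  # assumed to return (snapshot, revision)
        self.snapshot: Any = None
        self.revision: Optional[str] = None
        self.last_error: Optional[str] = None
        self._refresh_task: Optional[asyncio.Task] = None

    async def _refresh(self) -> None:
        try:
            snapshot, revision = await self._build()
        except Exception as exc:
            # Keep serving the stale snapshot; record the failure for diagnostics.
            self.last_error = repr(exc)
            return
        # No await between these assignments, so no coroutine sees a mixed pair.
        self.snapshot, self.revision = snapshot, revision
        self.last_error = None

    async def get(self, current_revision: str) -> tuple:
        if self.revision is None:
            # Cold start: block on the first build and let errors propagate.
            self.snapshot, self.revision = await self._build()
            return self.snapshot, "miss"
        if self.revision == current_revision:
            return self.snapshot, "hit-revision"
        # Start a refresh only if none is already running.
        if self._refresh_task is None or self._refresh_task.done():
            self._refresh_task = asyncio.create_task(self._refresh())
        return self.snapshot, "stale"
```

Holding the task on the instance also keeps a strong reference to it, which `asyncio` requires so that a pending background task is not garbage-collected.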
## T04 — Section-level cache for `recent_progress`

```task
id: STATE-WP-0066-T04
status: done
priority: medium
state_hub_task_id: "99b14a10-ff1b-4609-a62b-5da19b79be68"
```

Split cache into `core` and `progress_tail` sections:

- `progress_tail` rebuild: single query `ORDER BY created_at DESC LIMIT 20`.
- `core` rebuild: existing summary path minus recent-progress fetch.
- Merge at serve time into `StateSummary`.
- Revision mismatch on progress only → rebuild progress section only; reuse
  cached core.

Done when inserting a progress event refreshes `recent_progress` without
re-running domain/SBOM/flow-engine work (verify via query count or spy).

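A minimal sketch of the per-section rebuild and merge follows. It is synchronous for brevity, whereas the real builders would be async and query the database; `Section` and `serve_summary` are hypothetical names, and plain dicts stand in for `StateSummary`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Section:
    """One independently cached slice of the summary."""

    build: Callable[[], dict]
    revision: Optional[str] = None
    data: dict = field(default_factory=dict)


def serve_summary(sections: dict, current: dict) -> dict:
    """Rebuild only the sections whose revision moved, then merge them."""
    merged: dict[str, Any] = {}
    for name, section in sections.items():
        if section.revision != current[name]:
            # Only this section pays the rebuild cost.
            section.data = section.build()
            section.revision = current[name]
        merged.update(section.data)
    return merged
```

With a spy on each builder, the "Done when" condition becomes a count assertion: after a progress-only revision change the `core` builder's call count stays the same.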
## T05 — Eager invalidation on mutation routes

```task
id: STATE-WP-0066-T05
status: done
priority: medium
state_hub_task_id: "0df9c1c2-6edf-4be3-bee8-b75e0d24fc02"
```

Call `invalidate_summary_cache()` (or equivalent revision bump) from write
paths listed in the design sketch.

- Invalidation must be synchronous before response returns (so the next GET
  sees fresh revision).
- Progress `POST` invalidates at least the `progress_tail` section.
- Task/workplan/decision `PATCH` invalidates `core`.
- Add focused tests: mutate → GET summary reflects change without `?refresh`.

Done when mutation routes trigger invalidation and tests pass.

## T06 — Benchmark and regression tests

```task
id: STATE-WP-0066-T06
status: done
priority: high
state_hub_task_id: "5f68f999-d5ca-40df-a054-aaf762837342"
```

Prove cache effectiveness under realistic load:

- Extend or add script (cf. STATE-WP-0056 dashboard timing script) that hammers
  `/state/summary` with unchanged revision — p95 < 50ms.
- Cache-miss path (forced `?refresh=true`) stays < 500ms on current local data
  volume (STATE-WP-0056 T07 bar).
- MCP smoke tests still pass (`tests/test_mcp_smoke.py`).
- Router shape tests still pass (`tests/test_routers_core.py`).
- Add tests for stale-while-revalidate and section-split behaviour.

Done when benchmark script is documented in workplan notes or Makefile target
and CI tests are green.

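The p95 thresholds above need a consistent percentile calculation. The following sketch is not the STATE-WP-0056 timing script; `p95_latency_ms` is a hypothetical helper that times any zero-argument callable, so the HTTP client used to hit `/state/summary` is left to the caller.

```python
import statistics
import time
from typing import Callable


def p95_latency_ms(call: Callable[[], object], samples: int = 200) -> float:
    """Time `call` repeatedly and return the 95th-percentile latency in ms."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        call()
        timings.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return statistics.quantiles(timings, n=20)[-1]
```

Passing a callable that performs one GET against `/state/summary` gives the warm-cache figure; passing one that adds `?refresh=true` gives the cache-miss figure.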
## T07 — Documentation and consumer guidance

```task
id: STATE-WP-0066-T07
status: done
priority: low
state_hub_task_id: "aee0f349-761f-42b7-82b2-6a319936a68e"
```

Update:

- `README.md` — `/state/summary` caching section (revision + stale headers).
- `dashboard/src/docs/live-data.md` — which pages need summary vs overview.
- `mcp_server/TOOLS.md` — note cache behaviour for `get_state_summary`.
- `SCOPE.md` one-liner if summary caching is an operational concern.

Clarify for operators:

- Infrastructure probes → `/state/health`.
- Full snapshot → `/state/summary` (cached; `?refresh=true` to force).
- Dashboard overview → `/state/overview` (already lighter).

Done when docs match implemented headers and invalidation semantics.

## T08 — Verify through ops-bridge path

```task
id: STATE-WP-0066-T08
status: done
priority: medium
state_hub_task_id: "b4e967ab-57ba-4f0f-9161-373b383248b5"
```

End-to-end check from Railiance01 (or documented manual runbook):

- `bridge check state-hub-railiance01` healthy.
- Through node `127.0.0.1:18080`: summary returns 200 with `hit-revision` or
  `stale` on repeated polls; p95 < 2s through tunnel (probe timeout is 5s).
- Confirm `actcore-state-hub-bridge` readiness stays 1/1 (health probe, not
  summary).
- Log findings as State Hub progress event when verified.

Done when bridge-path latency is recorded and no readiness regressions over a
30-minute observation window.

## Acceptance criteria

- Quiet hub: consecutive `/state/summary` requests hit revision cache, not full
  rebuild.
- Data change: next request serves stale snapshot immediately, then fresh data
  after background refresh.
- Progress-only change: core section not rebuilt (T04).
- Mutation → visible on next GET without manual API restart.
- No breaking changes to `StateSummary` JSON for existing consumers.
- Bridge readiness remains on `/state/health`.

## Notes for implementer

- `AsyncSession` does not support concurrent operations on one session —
  background refresh must use `async_session_factory()` (same as workplan
  index).
- `generated_at` on cache hit should reflect build time, not request time.
- Consider extracting `_build_state_summary(session)` from the route handler to
  simplify testing (may already be partially inline).
- If revision query itself becomes hot, cache the revision for ~1s (micro-TTL)
  as a second-level optimization — only if profiling warrants it.
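
If profiling does justify the micro-TTL, it could look like the sketch below. `MicroTtl` is a hypothetical helper; the injectable `clock` exists only so tests can control time, and `fetch` is synchronous here although the real revision fetch is async.

```python
import time
from typing import Callable, Optional


class MicroTtl:
    """Memoise one value for a very short time (about 1s by default)."""

    def __init__(
        self,
        fetch: Callable[[], str],
        ttl_seconds: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = float("-inf")

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Expired or never fetched: pay for one revision query.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value

    def clear(self) -> None:
        # Eager invalidation should also drop the memoised revision.
        self._value = None
        self._expires_at = float("-inf")
```

Calling `clear()` from the same place as `invalidate_summary_cache()` would keep writes visible on the next GET instead of up to one TTL later.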