diff --git a/workplans/CUST-WP-0041-api-performance.md b/workplans/CUST-WP-0041-api-performance.md new file mode 100644 index 0000000..314e3a4 --- /dev/null +++ b/workplans/CUST-WP-0041-api-performance.md @@ -0,0 +1,215 @@ +--- +id: CUST-WP-0041 +type: workplan +title: "API Performance Optimization" +domain: custodian +status: todo +owner: custodian +topic_slug: custodian +created: "2026-05-15" +updated: "2026-05-15" +state_hub_workstream_id: "36b97d4a-0144-479b-81fb-9b9072379143" +--- + +# API Performance Optimization + +## Problem + +Profiling the API under dashboard load revealed four distinct bottlenecks that +collectively cause 2–4 s response times for endpoints that should complete in +under 100 ms: + +| Endpoint | Observed latency | Root cause | +|----------|-----------------|------------| +| `/workstreams/workplan-index` | 4171 ms | Synchronous filesystem scan (all repos, all `.md` files, YAML parse) on every request — no cache | +| `/topics/` | 2382 ms | `lazy="selectin"` on `Topic` triggers full loads of `workstreams`, `decisions`, and `progress_events` per topic | +| `/domains/` | 2252 ms | `lazy="selectin"` on `Domain` cascades into topics → workstreams chain | +| `/repos/` | 2222 ms | `lazy="selectin"` on `ManagedRepo` loads `domain` + `goals` | +| `/tasks/?limit=500` | 2174 ms vs 850 ms for `limit=2000` | Query planner picks a bad plan without a `status` index | +| `/sbom/snapshots/` | 2704 ms | Missing composite index on `(repo_id, snapshot_at)` used by the latest-snapshot subquery | + +The list endpoints (`/topics/`, `/repos/`, `/domains/`) are called by 10+ dashboard +pages as reference data. Each call loads far more data than the callers need +(they only use `id`, `slug`, `domain_slug`, `title`). This is an N+1 at the +ORM relationship level, not the HTTP level. + +## Goals + +- Reduce all list endpoint response times to under 300 ms +- Eliminate the 4 s workplan-index scan +- Keep schemas backwards-compatible (no dashboard changes required) + +## Out of scope + +- Full query result caching (Redis/Memcached) +- Pagination on list endpoints +- Read-replica routing + +--- + +## Tasks + +### T1 — Add missing DB indexes + +```task +id: CUST-WP-0041-T1 +status: done +priority: high +state_hub_task_id: "160152e1-2286-46dc-9240-0d6a6db8abd9" +``` + +Add a new Alembic migration with the following indexes: + +```sql +-- Filters used by /tasks/?needs_human, /tasks/?status=, and query planner hints +CREATE INDEX ix_tasks_status ON tasks(status); +CREATE INDEX ix_tasks_workstream_status ON tasks(workstream_id, status); + +-- /workstreams/?status= and state/summary active-workstream filter +CREATE INDEX ix_workstreams_status ON workstreams(status); + +-- /sbom/snapshots/ latest-snapshot subquery: MAX(snapshot_at) GROUP BY repo_id +CREATE INDEX ix_sbom_snapshots_repo_at ON sbom_snapshots(repo_id, snapshot_at DESC); +``` + +**Implementation:** new migration file in `state-hub/migrations/versions/` +with `down_revision` pointing to the current Alembic head. + +--- + +### T2 — Add TTL cache to `/workstreams/workplan-index` + +```task +id: CUST-WP-0041-T2 +status: todo +priority: high +state_hub_task_id: "f74e88d3-2446-4fd5-8f2e-10ab37144f2a" +``` + +The endpoint scans the filesystem of every active repo on every request. +Add a module-level in-process cache with a 30 s TTL: + +```python +# in api/routers/workstreams.py +import asyncio, time + +_INDEX_CACHE: dict[str, Any] | None = None +_INDEX_CACHE_AT: float = 0.0 +_INDEX_TTL = 30.0 + +@router.get("/workplan-index") +async def workplan_index( + refresh: bool = False, + session: AsyncSession = Depends(get_session), +) -> dict[str, Any]: + global _INDEX_CACHE, _INDEX_CACHE_AT + if not refresh and _INDEX_CACHE is not None and (time.monotonic() - _INDEX_CACHE_AT) < _INDEX_TTL: + return _INDEX_CACHE + # ... existing scan logic ... + _INDEX_CACHE = {"workstreams": index} + _INDEX_CACHE_AT = time.monotonic() + return _INDEX_CACHE +``` + +The dashboard passes `?refresh=1` after `make fix-consistency` completes +(or just waits 30 s for the TTL to expire naturally). + +**Note:** module-level cache is safe here — uvicorn runs single-process in dev; +in production a shared cache (Redis) would be needed, but that is out of scope. + +--- + +### T3 — Replace selectin cascade on list endpoints with lean schemas + +```task +id: CUST-WP-0041-T3 +status: todo +priority: high +state_hub_task_id: "e15378a7-39c4-4596-b67c-1aff18674fc7" +``` + +`GET /topics/`, `GET /domains/`, and `GET /repos/` trigger deep `selectin` +relationship loads that the list callers never use. Fix with lean response +models that exclude the heavy relationships: + +**New schemas (add to existing schema files):** + +```python +# api/schemas/topic.py +class TopicListItem(BaseModel): + model_config = ConfigDict(from_attributes=True) + id: uuid.UUID + slug: str + title: str + domain_slug: str # already a @property on the model + status: str + created_at: datetime + updated_at: datetime +``` + +```python +# api/schemas/managed_repo.py +class RepoListItem(BaseModel): + model_config = ConfigDict(from_attributes=True) + id: uuid.UUID + slug: str + title: str | None + domain_slug: str | None + status: str + local_path: str | None +``` + +```python +# api/schemas/domain.py +class DomainListItem(BaseModel): + model_config = ConfigDict(from_attributes=True) + id: uuid.UUID + slug: str + title: str + status: str + created_at: datetime + updated_at: datetime +``` + +**Router changes:** switch list endpoint `response_model` to +`list[TopicListItem]` / `list[RepoListItem]` / `list[DomainListItem]`. +Add `options(load_only(...))` or remove `selectin` from the ORM load +path for list queries (use explicit `joinedload` only where needed). + +The detail endpoints (`GET /topics/{id}`) keep `TopicRead` (full schema). + +**Dashboard impact:** none — dashboards only use `id`, `slug`, +`domain_slug`, `title`, which all appear in the lean schema. + +--- + +### T4 — Add short server-side TTL cache for `/state/summary` + +```task +id: CUST-WP-0041-T4 +status: todo +priority: medium +depends_on: [CUST-WP-0041-T2] +state_hub_task_id: "20213c08-da5b-4ecf-9219-017969c2200b" +``` + +`/state/summary` is already fast (603 ms) after T1 reduces index scans, but +it aggregates 10+ tables and is called every 60 s by `index.md`. A 15 s +server-side cache means at most one full query per 15 s window regardless of +how many open tabs the user has. + +Use the same pattern as T2: module-level `_SUMMARY_CACHE` + `_SUMMARY_CACHE_AT` +in `api/routers/state.py`, TTL = 15 s. The ETag middleware (CUST-WP-0039-T2) +remains the outer layer and still returns 304 for unchanged data within the +TTL window. + +--- + +## Expected impact + +| Change | Estimated latency after | +|--------|------------------------| +| T1 (indexes) | `/tasks/` → ~50 ms; `/sbom/snapshots/` → ~100 ms | +| T2 (workplan-index cache) | 4171 ms → ~5 ms on cache hit | +| T3 (lean schemas) | `/topics/`, `/domains/`, `/repos/` → ~50–150 ms | +| T4 (summary cache) | `/state/summary` → ~5 ms on cache hit |