Files
the-custodian/workplans/CUST-WP-0041-api-performance.md
tegwick e0f8bbfbfc chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-15:
  - CUST-WP-0041-T4: todo → done
2026-05-15 11:10:23 +02:00

216 lines
6.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
id: CUST-WP-0041
type: workplan
title: "API Performance Optimization"
domain: custodian
status: todo
owner: custodian
topic_slug: custodian
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "36b97d4a-0144-479b-81fb-9b9072379143"
---
# API Performance Optimization
## Problem
Profiling the API under dashboard load revealed four distinct bottlenecks that
collectively cause 24 s response times for endpoints that should complete in
under 100 ms:
| Endpoint | Observed latency | Root cause |
|----------|-----------------|------------|
| `/workstreams/workplan-index` | 4171 ms | Synchronous filesystem scan (all repos, all `.md` files, YAML parse) on every request — no cache |
| `/topics/` | 2382 ms | `lazy="selectin"` on `Topic` triggers full loads of `workstreams`, `decisions`, and `progress_events` per topic |
| `/domains/` | 2252 ms | `lazy="selectin"` on `Domain` cascades into topics → workstreams chain |
| `/repos/` | 2222 ms | `lazy="selectin"` on `ManagedRepo` loads `domain` + `goals` |
| `/tasks/?limit=500` | 2174 ms vs 850 ms for `limit=2000` | Query planner picks a bad plan without a `status` index |
| `/sbom/snapshots/` | 2704 ms | Missing composite index on `(repo_id, snapshot_at)` used by the latest-snapshot subquery |
The list endpoints (`/topics/`, `/repos/`, `/domains/`) are called by 10+ dashboard
pages as reference data. Each call loads far more data than the callers need
(they only use `id`, `slug`, `domain_slug`, `title`). This is an N+1 at the
ORM relationship level, not the HTTP level.
## Goals
- Reduce all list endpoint response times to under 300 ms
- Eliminate the 4 s workplan-index scan
- Keep schemas backwards-compatible (no dashboard changes required)
## Out of scope
- Full query result caching (Redis/Memcached)
- Pagination on list endpoints
- Read-replica routing
---
## Tasks
### T1 — Add missing DB indexes
```task
id: CUST-WP-0041-T1
status: done
priority: high
state_hub_task_id: "160152e1-2286-46dc-9240-0d6a6db8abd9"
```
Add a new Alembic migration with the following indexes:
```sql
-- Filters used by /tasks/?needs_human, /tasks/?status=, and query planner hints
CREATE INDEX ix_tasks_status ON tasks(status);
CREATE INDEX ix_tasks_workstream_status ON tasks(workstream_id, status);
-- /workstreams/?status= and state/summary active-workstream filter
CREATE INDEX ix_workstreams_status ON workstreams(status);
-- /sbom/snapshots/ latest-snapshot subquery: MAX(snapshot_at) GROUP BY repo_id
CREATE INDEX ix_sbom_snapshots_repo_at ON sbom_snapshots(repo_id, snapshot_at DESC);
```
**Implementation:** new migration file in `state-hub/migrations/versions/`
with `down_revision` pointing to the current Alembic head.
---
### T2 — Add TTL cache to `/workstreams/workplan-index`
```task
id: CUST-WP-0041-T2
status: done
priority: high
state_hub_task_id: "f74e88d3-2446-4fd5-8f2e-10ab37144f2a"
```
The endpoint scans the filesystem of every active repo on every request.
Add a module-level in-process cache with a 30 s TTL:
```python
# in api/routers/workstreams.py
import asyncio, time
_INDEX_CACHE: dict[str, Any] | None = None
_INDEX_CACHE_AT: float = 0.0
_INDEX_TTL = 30.0
@router.get("/workplan-index")
async def workplan_index(
refresh: bool = False,
session: AsyncSession = Depends(get_session),
) -> dict[str, Any]:
global _INDEX_CACHE, _INDEX_CACHE_AT
if not refresh and _INDEX_CACHE is not None and (time.monotonic() - _INDEX_CACHE_AT) < _INDEX_TTL:
return _INDEX_CACHE
# ... existing scan logic ...
_INDEX_CACHE = {"workstreams": index}
_INDEX_CACHE_AT = time.monotonic()
return _INDEX_CACHE
```
The dashboard passes `?refresh=1` after `make fix-consistency` completes
(or just waits 30 s for the TTL to expire naturally).
**Note:** module-level cache is safe here — uvicorn runs single-process in dev;
in production a shared cache (Redis) would be needed, but that is out of scope.
---
### T3 — Replace selectin cascade on list endpoints with lean schemas
```task
id: CUST-WP-0041-T3
status: done
priority: high
state_hub_task_id: "e15378a7-39c4-4596-b67c-1aff18674fc7"
```
`GET /topics/`, `GET /domains/`, and `GET /repos/` trigger deep `selectin`
relationship loads that the list callers never use. Fix with lean response
models that exclude the heavy relationships:
**New schemas (add to existing schema files):**
```python
# api/schemas/topic.py
class TopicListItem(BaseModel):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
slug: str
title: str
domain_slug: str # already a @property on the model
status: str
created_at: datetime
updated_at: datetime
```
```python
# api/schemas/managed_repo.py
class RepoListItem(BaseModel):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
slug: str
title: str | None
domain_slug: str | None
status: str
local_path: str | None
```
```python
# api/schemas/domain.py
class DomainListItem(BaseModel):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
slug: str
title: str
status: str
created_at: datetime
updated_at: datetime
```
**Router changes:** switch list endpoint `response_model` to
`list[TopicListItem]` / `list[RepoListItem]` / `list[DomainListItem]`.
Add `options(load_only(...))` or remove `selectin` from the ORM load
path for list queries (use explicit `joinedload` only where needed).
The detail endpoints (`GET /topics/{id}`) keep `TopicRead` (full schema).
**Dashboard impact:** none — dashboards only use `id`, `slug`,
`domain_slug`, `title`, which all appear in the lean schema.
---
### T4 — Add short server-side TTL cache for `/state/summary`
```task
id: CUST-WP-0041-T4
status: done
priority: medium
depends_on: [CUST-WP-0041-T2]
state_hub_task_id: "20213c08-da5b-4ecf-9219-017969c2200b"
```
`/state/summary` is already fast (603 ms) after T1 reduces index scans, but
it aggregates 10+ tables and is called every 60 s by `index.md`. A 15 s
server-side cache means at most one full query per 15 s window regardless of
how many open tabs the user has.
Use the same pattern as T2: module-level `_SUMMARY_CACHE` + `_SUMMARY_CACHE_AT`
in `api/routers/state.py`, TTL = 15 s. The ETag middleware (CUST-WP-0039-T2)
remains the outer layer and still returns 304 for unchanged data within the
TTL window.
---
## Expected impact
| Change | Estimated latency after |
|--------|------------------------|
| T1 (indexes) | `/tasks/` → ~50 ms; `/sbom/snapshots/` → ~100 ms |
| T2 (workplan-index cache) | 4171 ms → ~5 ms on cache hit |
| T3 (lean schemas) | `/topics/`, `/domains/`, `/repos/` → ~50150 ms |
| T4 (summary cache) | `/state/summary` → ~5 ms on cache hit |