6.6 KiB
id, type, title, domain, status, owner, topic_slug, created, updated, state_hub_workstream_id
| id | type | title | domain | status | owner | topic_slug | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|
| CUST-WP-0041 | workplan | API Performance Optimization | custodian | todo | custodian | custodian | 2026-05-15 | 2026-05-15 | 36b97d4a-0144-479b-81fb-9b9072379143 |
API Performance Optimization
Problem
Profiling the API under dashboard load revealed four distinct bottlenecks that collectively cause 2–4 s response times for endpoints that should complete in under 100 ms:
| Endpoint | Observed latency | Root cause |
|---|---|---|
/workstreams/workplan-index |
4171 ms | Synchronous filesystem scan (all repos, all .md files, YAML parse) on every request — no cache |
/topics/ |
2382 ms | lazy="selectin" on Topic triggers full loads of workstreams, decisions, and progress_events per topic |
/domains/ |
2252 ms | lazy="selectin" on Domain cascades into topics → workstreams chain |
/repos/ |
2222 ms | lazy="selectin" on ManagedRepo loads domain + goals |
/tasks/?limit=500 |
2174 ms vs 850 ms for limit=2000 |
Query planner picks a bad plan without a status index |
/sbom/snapshots/ |
2704 ms | Missing composite index on (repo_id, snapshot_at) used by the latest-snapshot subquery |
The list endpoints (/topics/, /repos/, /domains/) are called by 10+ dashboard
pages as reference data. Each call loads far more data than the callers need
(they only use id, slug, domain_slug, title). This is an N+1 at the
ORM relationship level, not the HTTP level.
Goals
- Reduce all list endpoint response times to under 300 ms
- Eliminate the 4 s workplan-index scan
- Keep schemas backwards-compatible (no dashboard changes required)
Out of scope
- Full query result caching (Redis/Memcached)
- Pagination on list endpoints
- Read-replica routing
Tasks
T1 — Add missing DB indexes
id: CUST-WP-0041-T1
status: done
priority: high
state_hub_task_id: "160152e1-2286-46dc-9240-0d6a6db8abd9"
Add a new Alembic migration with the following indexes:
-- Filters used by /tasks/?needs_human, /tasks/?status=, and query planner hints
CREATE INDEX ix_tasks_status ON tasks(status);
CREATE INDEX ix_tasks_workstream_status ON tasks(workstream_id, status);
-- /workstreams/?status= and state/summary active-workstream filter
CREATE INDEX ix_workstreams_status ON workstreams(status);
-- /sbom/snapshots/ latest-snapshot subquery: MAX(snapshot_at) GROUP BY repo_id
CREATE INDEX ix_sbom_snapshots_repo_at ON sbom_snapshots(repo_id, snapshot_at DESC);
Implementation: new migration file in state-hub/migrations/versions/
with down_revision pointing to the current Alembic head.
T2 — Add TTL cache to /workstreams/workplan-index
id: CUST-WP-0041-T2
status: done
priority: high
state_hub_task_id: "f74e88d3-2446-4fd5-8f2e-10ab37144f2a"
The endpoint scans the filesystem of every active repo on every request. Add a module-level in-process cache with a 30 s TTL:
# in api/routers/workstreams.py
import asyncio, time
_INDEX_CACHE: dict[str, Any] | None = None
_INDEX_CACHE_AT: float = 0.0
_INDEX_TTL = 30.0
@router.get("/workplan-index")
async def workplan_index(
refresh: bool = False,
session: AsyncSession = Depends(get_session),
) -> dict[str, Any]:
global _INDEX_CACHE, _INDEX_CACHE_AT
if not refresh and _INDEX_CACHE is not None and (time.monotonic() - _INDEX_CACHE_AT) < _INDEX_TTL:
return _INDEX_CACHE
# ... existing scan logic ...
_INDEX_CACHE = {"workstreams": index}
_INDEX_CACHE_AT = time.monotonic()
return _INDEX_CACHE
The dashboard passes ?refresh=1 after make fix-consistency completes
(or just waits 30 s for the TTL to expire naturally).
Note: module-level cache is safe here — uvicorn runs single-process in dev; in production a shared cache (Redis) would be needed, but that is out of scope.
T3 — Replace selectin cascade on list endpoints with lean schemas
id: CUST-WP-0041-T3
status: done
priority: high
state_hub_task_id: "e15378a7-39c4-4596-b67c-1aff18674fc7"
GET /topics/, GET /domains/, and GET /repos/ trigger deep selectin
relationship loads that the list callers never use. Fix with lean response
models that exclude the heavy relationships:
New schemas (add to existing schema files):
# api/schemas/topic.py
class TopicListItem(BaseModel):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
slug: str
title: str
domain_slug: str # already a @property on the model
status: str
created_at: datetime
updated_at: datetime
# api/schemas/managed_repo.py
class RepoListItem(BaseModel):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
slug: str
title: str | None
domain_slug: str | None
status: str
local_path: str | None
# api/schemas/domain.py
class DomainListItem(BaseModel):
model_config = ConfigDict(from_attributes=True)
id: uuid.UUID
slug: str
title: str
status: str
created_at: datetime
updated_at: datetime
Router changes: switch list endpoint response_model to
list[TopicListItem] / list[RepoListItem] / list[DomainListItem].
Add options(load_only(...)) or remove selectin from the ORM load
path for list queries (use explicit joinedload only where needed).
The detail endpoints (GET /topics/{id}) keep TopicRead (full schema).
Dashboard impact: none — dashboards only use id, slug,
domain_slug, title, which all appear in the lean schema.
T4 — Add short server-side TTL cache for /state/summary
id: CUST-WP-0041-T4
status: done
priority: medium
depends_on: [CUST-WP-0041-T2]
state_hub_task_id: "20213c08-da5b-4ecf-9219-017969c2200b"
/state/summary is already fast (603 ms) after T1 reduces index scans, but
it aggregates 10+ tables and is called every 60 s by index.md. A 15 s
server-side cache means at most one full query per 15 s window regardless of
how many open tabs the user has.
Use the same pattern as T2: module-level _SUMMARY_CACHE + _SUMMARY_CACHE_AT
in api/routers/state.py, TTL = 15 s. The ETag middleware (CUST-WP-0039-T2)
remains the outer layer and still returns 304 for unchanged data within the
TTL window.
Expected impact
| Change | Estimated latency after |
|---|---|
| T1 (indexes) | /tasks/ → ~50 ms; /sbom/snapshots/ → ~100 ms |
| T2 (workplan-index cache) | 4171 ms → ~5 ms on cache hit |
| T3 (lean schemas) | /topics/, /domains/, /repos/ → ~50–150 ms |
| T4 (summary cache) | /state/summary → ~5 ms on cache hit |