chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-15:
- CUST-WP-0041-T1: todo → done
215
workplans/CUST-WP-0041-api-performance.md
Normal file
@@ -0,0 +1,215 @@
---
id: CUST-WP-0041
type: workplan
title: "API Performance Optimization"
domain: custodian
status: todo
owner: custodian
topic_slug: custodian
created: "2026-05-15"
updated: "2026-05-15"
state_hub_workstream_id: "36b97d4a-0144-479b-81fb-9b9072379143"
---

# API Performance Optimization

## Problem

Profiling the API under dashboard load revealed four distinct bottlenecks across
six endpoints. Together they cause 2–4 s response times for endpoints that should
complete in under 100 ms:

| Endpoint | Observed latency | Root cause |
|----------|------------------|------------|
| `/workstreams/workplan-index` | 4171 ms | Synchronous filesystem scan (all repos, all `.md` files, YAML parse) on every request, with no cache |
| `/topics/` | 2382 ms | `lazy="selectin"` on `Topic` triggers full loads of `workstreams`, `decisions`, and `progress_events` per topic |
| `/domains/` | 2252 ms | `lazy="selectin"` on `Domain` cascades into the topics → workstreams chain |
| `/repos/` | 2222 ms | `lazy="selectin"` on `ManagedRepo` loads `domain` + `goals` |
| `/tasks/?limit=500` | 2174 ms vs 850 ms for `limit=2000` | Query planner picks a bad plan without a `status` index |
| `/sbom/snapshots/` | 2704 ms | Missing composite index on `(repo_id, snapshot_at)` used by the latest-snapshot subquery |

The list endpoints (`/topics/`, `/repos/`, `/domains/`) are called by 10+ dashboard
pages as reference data. Each call loads far more data than the callers need
(they only use `id`, `slug`, `domain_slug`, `title`). This is an N+1-style problem at
the ORM relationship level (each `selectin` relationship issues its own follow-up
query), not at the HTTP level.

## Goals

- Reduce all list endpoint response times to under 300 ms
- Eliminate the 4 s workplan-index scan
- Keep schemas backwards-compatible (no dashboard changes required)

## Out of scope

- Full query result caching (Redis/Memcached)
- Pagination on list endpoints
- Read-replica routing

---

## Tasks

### T1 — Add missing DB indexes

```task
id: CUST-WP-0041-T1
status: done
priority: high
state_hub_task_id: "160152e1-2286-46dc-9240-0d6a6db8abd9"
```

Add a new Alembic migration with the following indexes:

```sql
-- Filters used by /tasks/?needs_human, /tasks/?status=, and query planner hints
CREATE INDEX ix_tasks_status ON tasks(status);
CREATE INDEX ix_tasks_workstream_status ON tasks(workstream_id, status);

-- /workstreams/?status= and state/summary active-workstream filter
CREATE INDEX ix_workstreams_status ON workstreams(status);

-- /sbom/snapshots/ latest-snapshot subquery: MAX(snapshot_at) GROUP BY repo_id
CREATE INDEX ix_sbom_snapshots_repo_at ON sbom_snapshots(repo_id, snapshot_at DESC);
```

**Implementation:** new migration file in `state-hub/migrations/versions/`
with `down_revision` pointing to the current Alembic head.
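
A quick way to confirm that a new index is actually picked up is to ask the query planner. The sketch below is only an illustration under stated assumptions: the workplan does not say which database backs state-hub, so it uses in-memory SQLite (because it ships with Python), a simplified `tasks` table with made-up columns and rows, and a hypothetical helper named `query_plan`. Planner output differs between engines, so the same check would have to be repeated with the production database's own `EXPLAIN`.

```python
import sqlite3

# Simplified stand-in for the real schema (assumption: only the indexed columns matter here).
DDL = """
CREATE TABLE tasks (id INTEGER PRIMARY KEY, workstream_id INTEGER, status TEXT);
CREATE INDEX ix_tasks_status ON tasks(status);
CREATE INDEX ix_tasks_workstream_status ON tasks(workstream_id, status);
"""


def query_plan(conn, sql, params=()):
    """Return the planner's step descriptions for `sql` as a list of strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); `detail` is the readable part.
    return [row[3] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
# Made-up rows: 20 workstreams, roughly one task in five still "todo".
conn.executemany(
    "INSERT INTO tasks (workstream_id, status) VALUES (?, ?)",
    [(i % 20, "done" if i % 5 else "todo") for i in range(1000)],
)

# Plan for the /tasks/?status= style filter.
status_plan = query_plan(conn, "SELECT id FROM tasks WHERE status = ?", ("todo",))
# Plan for a filter on both columns of the composite index.
composite_plan = query_plan(
    conn,
    "SELECT id FROM tasks WHERE workstream_id = ? AND status = ?",
    (3, "todo"),
)

print(status_plan)
print(composite_plan)
```

If an index name does not show up in the printed plan, the planner is still scanning the table or preferring a different index, and the migration is not yet having the intended effect for that query.
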

---

### T2 — Add TTL cache to `/workstreams/workplan-index`

```task
id: CUST-WP-0041-T2
status: todo
priority: high
state_hub_task_id: "f74e88d3-2446-4fd5-8f2e-10ab37144f2a"
```

The endpoint scans the filesystem of every active repo on every request.
Add a module-level in-process cache with a 30 s TTL:

```python
# in api/routers/workstreams.py
import asyncio
import time
from typing import Any

# Cached response body and the monotonic timestamp at which it was built.
_INDEX_CACHE: dict[str, Any] | None = None
_INDEX_CACHE_AT: float = 0.0
_INDEX_TTL = 30.0  # seconds


@router.get("/workplan-index")
async def workplan_index(
    refresh: bool = False,
    session: AsyncSession = Depends(get_session),
) -> dict[str, Any]:
    global _INDEX_CACHE, _INDEX_CACHE_AT
    # Serve the cached index unless the caller forces a refresh or the TTL has expired.
    fresh = (time.monotonic() - _INDEX_CACHE_AT) < _INDEX_TTL
    if not refresh and _INDEX_CACHE is not None and fresh:
        return _INDEX_CACHE
    # ... existing scan logic ...
    _INDEX_CACHE = {"workstreams": index}
    _INDEX_CACHE_AT = time.monotonic()
    return _INDEX_CACHE
```

The dashboard passes `?refresh=1` after `make fix-consistency` completes
(or just waits 30 s for the TTL to expire naturally).

**Note:** a module-level cache is safe here because uvicorn runs single-process in dev.
With several worker processes each worker would hold its own copy, so production
would need a shared cache (Redis), but that is out of scope.
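
One possible refinement of the pattern above, offered as a sketch rather than as the planned implementation: when the cache is cold, several dashboard requests can arrive at once and each would start its own scan, and a synchronous scan inside an `async def` handler blocks the event loop while it runs. The hypothetical `AsyncTTLCache` class below guards the rebuild with an `asyncio.Lock` and pushes the blocking work to a thread via `asyncio.to_thread`. The `slow_scan` function and the `stats` counter are stand-ins invented for the demonstration.

```python
import asyncio
import time
from typing import Any, Callable, Optional


class AsyncTTLCache:
    """Cache one value for `ttl` seconds; concurrent misses share a single rebuild."""

    def __init__(self, build: Callable[[], Any], ttl: float) -> None:
        self._build = build  # blocking function, e.g. the filesystem scan
        self._ttl = ttl
        self._value: Any = None
        self._built_at: Optional[float] = None
        self._lock = asyncio.Lock()

    def _fresh(self) -> bool:
        if self._built_at is None:
            return False
        return (time.monotonic() - self._built_at) < self._ttl

    async def get(self, refresh: bool = False) -> Any:
        if not refresh and self._fresh():
            return self._value
        async with self._lock:
            # Another request may have rebuilt the value while we waited for the lock.
            if not refresh and self._fresh():
                return self._value
            # Run the blocking scan in a worker thread so the event loop stays responsive.
            self._value = await asyncio.to_thread(self._build)
            self._built_at = time.monotonic()
            return self._value


stats = {"scans": 0}  # counts how often the expensive function really ran


def slow_scan() -> dict:
    """Stand-in for the real filesystem scan (hypothetical)."""
    stats["scans"] += 1
    time.sleep(0.05)
    return {"workstreams": [], "scan_number": stats["scans"]}


async def main() -> list:
    cache = AsyncTTLCache(slow_scan, ttl=30.0)
    # Ten concurrent requests arrive while the cache is cold.
    return await asyncio.gather(*(cache.get() for _ in range(10)))


results = asyncio.run(main())
print(stats["scans"], len(results))
```

In this run the ten concurrent callers trigger a single scan and all receive the same cached value. The trade-off is that a forced `refresh` always waits for the lock and rebuilds, which seems acceptable for an endpoint that is refreshed only after `make fix-consistency`.
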

---

### T3 — Replace selectin cascade on list endpoints with lean schemas

```task
id: CUST-WP-0041-T3
status: todo
priority: high
state_hub_task_id: "e15378a7-39c4-4596-b67c-1aff18674fc7"
```

`GET /topics/`, `GET /domains/`, and `GET /repos/` trigger deep `selectin`
relationship loads that the list callers never use. Fix with lean response
models that exclude the heavy relationships:

**New schemas (add to existing schema files):**

```python
# api/schemas/topic.py
import uuid
from datetime import datetime

from pydantic import BaseModel, ConfigDict


class TopicListItem(BaseModel):
    # from_attributes lets the schema be built directly from ORM objects.
    model_config = ConfigDict(from_attributes=True)

    id: uuid.UUID
    slug: str
    title: str
    domain_slug: str  # already a @property on the model
    status: str
    created_at: datetime
    updated_at: datetime
```

```python
# api/schemas/managed_repo.py
import uuid

from pydantic import BaseModel, ConfigDict


class RepoListItem(BaseModel):
    model_config = ConfigDict(from_attributes=True)

    id: uuid.UUID
    slug: str
    title: str | None
    domain_slug: str | None
    status: str
    local_path: str | None
```

```python
# api/schemas/domain.py
import uuid
from datetime import datetime

from pydantic import BaseModel, ConfigDict


class DomainListItem(BaseModel):
    model_config = ConfigDict(from_attributes=True)

    id: uuid.UUID
    slug: str
    title: str
    status: str
    created_at: datetime
    updated_at: datetime
```

**Router changes:** switch each list endpoint's `response_model` to
`list[TopicListItem]` / `list[RepoListItem]` / `list[DomainListItem]`.
For list queries, add `options(load_only(...))` or take `selectin` out of the ORM
load path, and use explicit `joinedload` only where a relationship is actually needed.
If `domain_slug` reads through a relationship, that relationship is one such case,
because the lean schemas still serialize it.

The detail endpoints (e.g. `GET /topics/{id}`) keep the full schema (`TopicRead`).

**Dashboard impact:** none. Dashboards only use `id`, `slug`, `domain_slug`, and
`title`, all of which appear in the lean schemas wherever they apply
(`DomainListItem` carries no `domain_slug`).
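
The "no dashboard changes" claim could be guarded with a small contract check. The sketch below is an illustration built on assumptions: plain dataclasses stand in for the Pydantic models so that it runs without dependencies, `DASHBOARD_FIELDS` is a hypothetical mapping derived from the field list given above, and the entry for domains assumes that a domain caller does not need `domain_slug`.

```python
import uuid
from dataclasses import dataclass, fields
from datetime import datetime
from typing import Optional


# Dataclass mirrors of the lean schemas (stand-ins for the Pydantic models).
@dataclass
class TopicListItem:
    id: uuid.UUID
    slug: str
    title: str
    domain_slug: str
    status: str
    created_at: datetime
    updated_at: datetime


@dataclass
class RepoListItem:
    id: uuid.UUID
    slug: str
    title: Optional[str]
    domain_slug: Optional[str]
    status: str
    local_path: Optional[str]


@dataclass
class DomainListItem:
    id: uuid.UUID
    slug: str
    title: str
    status: str
    created_at: datetime
    updated_at: datetime


# Hypothetical: the fields dashboard callers read from each list endpoint.
DASHBOARD_FIELDS = {
    TopicListItem: {"id", "slug", "domain_slug", "title"},
    RepoListItem: {"id", "slug", "domain_slug", "title"},
    DomainListItem: {"id", "slug", "title"},  # assumed: no domain_slug on a domain
}


def missing_fields(schema, required):
    """Return the required field names that `schema` does not declare, sorted."""
    declared = {f.name for f in fields(schema)}
    return sorted(required - declared)


# An empty list per schema means every field the dashboards read is still served.
report = {
    schema.__name__: missing_fields(schema, required)
    for schema, required in DASHBOARD_FIELDS.items()
}
print(report)
```

With the real Pydantic v2 models, the declared names are available as the keys of `Model.model_fields`, so the same comparison could run as a unit test against the actual schemas. Any non-empty entry in `report` would point to a field that a dashboard page still expects but the lean schema no longer returns.
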

---

### T4 — Add short server-side TTL cache for `/state/summary`

```task
id: CUST-WP-0041-T4
status: todo
priority: medium
depends_on: [CUST-WP-0041-T2]
state_hub_task_id: "20213c08-da5b-4ecf-9219-017969c2200b"
```

`/state/summary` is already reasonably fast (603 ms) after T1, whose indexes cut
down on table scans, but it aggregates 10+ tables and is called every 60 s by
`index.md`. A 15 s server-side cache means at most one full query per 15 s window,
regardless of how many tabs the user has open.

Use the same pattern as T2: module-level `_SUMMARY_CACHE` + `_SUMMARY_CACHE_AT`
in `api/routers/state.py`, TTL = 15 s. The ETag middleware (CUST-WP-0039-T2)
remains the outer layer and still returns 304 for unchanged data within the
TTL window.
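
To make the "one full query per 15 s window" arithmetic concrete, here is a small simulation. It is a sketch under assumptions, not the planned code: `TTLCache` and `simulate` are hypothetical names, the clock is injected so that no real time passes, and tabs are assumed to be opened one second apart and to poll every 60 s.

```python
from typing import Any, Callable, Optional


class TTLCache:
    """Single-value cache with an injectable clock so it can be simulated."""

    def __init__(self, build: Callable[[], Any], ttl: float,
                 clock: Callable[[], float]) -> None:
        self._build = build
        self._ttl = ttl
        self._clock = clock
        self._value: Any = None
        self._built_at: Optional[float] = None
        self.builds = 0  # number of full (uncached) computations

    def get(self) -> Any:
        now = self._clock()
        if self._built_at is None or (now - self._built_at) >= self._ttl:
            self._value = self._build()
            self._built_at = now
            self.builds += 1
        return self._value


def simulate(tabs: int, minutes: int, ttl: float = 15.0, poll: float = 60.0) -> int:
    """Count full summary queries when `tabs` tabs each poll every `poll` seconds."""
    now = 0.0
    cache = TTLCache(build=lambda: {"summary": "..."}, ttl=ttl, clock=lambda: now)
    # Tab number `tab` starts polling `tab` seconds after the first one.
    events = sorted(tab * 1.0 + k * poll for tab in range(tabs) for k in range(minutes))
    for t in events:
        now = t  # advance the simulated clock to the time of this request
        cache.get()
    return cache.builds


print(simulate(tabs=8, minutes=10))   # 10 full queries for 80 requests
print(simulate(tabs=20, minutes=10))  # 20 full queries for 200 requests
print(simulate(tabs=1, minutes=10))   # 10 full queries for 10 requests
```

The last line shows the limit of the approach: a single tab polling every 60 s never hits a 15 s cache, so the saving only appears when several tabs or pages request the summary within the same window.
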

---

## Expected impact

| Change | Estimated latency after |
|--------|-------------------------|
| T1 (indexes) | `/tasks/` → ~50 ms; `/sbom/snapshots/` → ~100 ms |
| T2 (workplan-index cache) | 4171 ms → ~5 ms on cache hit |
| T3 (lean schemas) | `/topics/`, `/domains/`, `/repos/` → ~50–150 ms |
| T4 (summary cache) | `/state/summary` → ~5 ms on cache hit |