tegwick 0eb2ef0650 perf(api): CUST-WP-0041 — DB indexes, TTL caches, noload on list endpoints

- Migration t7o8p9q0r1s2: indexes on tasks.status, tasks(workstream_id,status),
  workstreams.status, sbom_snapshots(repo_id,snapshot_at)
- workplan-index: 30 s TTL cache + ?refresh param (4171 ms → 16 ms on hit)
- /state/summary: 15 s TTL cache, bypassed on Cache-Control: no-cache
- /topics/: noload(workstreams, decisions, progress_events) (2382 ms → 115 ms)
- /domains/: noload(topics, repos, goals) (2252 ms → 39 ms)
- /repos/: noload(goals) (2222 ms → 599 ms first / fast on repeat)
- conftest: reset TTL caches between tests to prevent bleed-through

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-05-15 11:12:17 +02:00


id	type	title	domain	status	owner	topic_slug	created	updated	state_hub_workstream_id
CUST-WP-0041	workplan	API Performance Optimization	custodian	done	custodian	custodian	2026-05-15	2026-05-15	36b97d4a-0144-479b-81fb-9b9072379143

API Performance Optimization

Problem

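Profiling the API under dashboard load revealed four distinct bottlenecks (an uncached filesystem scan, selectin relationship cascades, a missing status index, and a missing composite index) spread across six endpoints. Together they cause 2–4 s response times for endpoints that should complete in under 100 ms: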

Endpoint	Observed latency	Root cause
`/workstreams/workplan-index`	4171 ms	Synchronous filesystem scan (all repos, all `.md` files, YAML parse) on every request — no cache
`/topics/`	2382 ms	`lazy="selectin"` on `Topic` triggers full loads of `workstreams`, `decisions`, and `progress_events` per topic
`/domains/`	2252 ms	`lazy="selectin"` on `Domain` cascades into topics → workstreams chain
`/repos/`	2222 ms	`lazy="selectin"` on `ManagedRepo` loads `domain` + `goals`
`/tasks/?limit=500`	2174 ms vs 850 ms for `limit=2000`	Query planner picks a bad plan without a `status` index
`/sbom/snapshots/`	2704 ms	Missing composite index on `(repo_id, snapshot_at)` used by the latest-snapshot subquery

The list endpoints (/topics/, /repos/, /domains/) are called by 10+ dashboard pages as reference data. Each call loads far more data than the callers need (they only use id, slug, domain_slug, title). This is an N+1 at the ORM relationship level, not the HTTP level.

Goals

- Reduce all list endpoint response times to under 300 ms
- Eliminate the 4 s workplan-index scan
- Keep schemas backwards-compatible (no dashboard changes required)

Out of scope

- Full query result caching (Redis/Memcached)
- Pagination on list endpoints
- Read-replica routing

Tasks

T1 — Add missing DB indexes

id: CUST-WP-0041-T1
status: done
priority: high
state_hub_task_id: "160152e1-2286-46dc-9240-0d6a6db8abd9"

Add a new Alembic migration with the following indexes:

-- Filters used by /tasks/?needs_human, /tasks/?status=, and query planner hints
CREATE INDEX ix_tasks_status ON tasks(status);
CREATE INDEX ix_tasks_workstream_status ON tasks(workstream_id, status);

-- /workstreams/?status= and state/summary active-workstream filter
CREATE INDEX ix_workstreams_status ON workstreams(status);

-- /sbom/snapshots/ latest-snapshot subquery: MAX(snapshot_at) GROUP BY repo_id
CREATE INDEX ix_sbom_snapshots_repo_at ON sbom_snapshots(repo_id, snapshot_at DESC);

Implementation: new migration file in state-hub/migrations/versions/ with down_revision pointing to the current Alembic head.
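Whether an index is actually picked up can be checked by inspecting the query plan before and after creating it. The sketch below is only an illustration: it uses an in-memory SQLite database as a stand-in (the workplan does not name the production database, and its planner may behave differently), and the three-column `tasks` table is a simplified assumption rather than the real schema.

```python
import sqlite3

# In-memory SQLite stand-in; the real database and its planner may behave differently.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, workstream_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (workstream_id, status) VALUES (?, ?)",
    [(f"ws-{i % 10}", "done" if i % 3 else "open") for i in range(1000)],
)

QUERY = "SELECT id FROM tasks WHERE status = ?"


def plan(sql: str) -> str:
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute(f"EXPLAIN QUERY PLAN {sql}", ("open",)).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[3] for row in rows)


plan_before = plan(QUERY)  # no index on status yet, so the table is scanned
conn.execute("CREATE INDEX ix_tasks_status ON tasks(status)")
plan_after = plan(QUERY)   # the planner can now search via ix_tasks_status

print(plan_before)
print(plan_after)
```

On the real database the equivalent check would be that engine's own EXPLAIN output for the `/tasks/` and `/sbom/snapshots/` queries.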

T2 — Add TTL cache to `/workstreams/workplan-index`

id: CUST-WP-0041-T2
status: done
priority: high
state_hub_task_id: "f74e88d3-2446-4fd5-8f2e-10ab37144f2a"

The endpoint scans the filesystem of every active repo on every request. Add a module-level in-process cache with a 30 s TTL:

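# in api/routers/workstreams.py
import time
from typing import Any

# Module-level cache: one copy per worker process.
_INDEX_CACHE: dict[str, Any] | None = None
_INDEX_CACHE_AT: float = 0.0  # time.monotonic() value at the last rebuild
_INDEX_TTL = 30.0  # seconds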

@router.get("/workplan-index")
async def workplan_index(
    refresh: bool = False,
    session: AsyncSession = Depends(get_session),
) -> dict[str, Any]:
    global _INDEX_CACHE, _INDEX_CACHE_AT
    if not refresh and _INDEX_CACHE is not None and (time.monotonic() - _INDEX_CACHE_AT) < _INDEX_TTL:
        return _INDEX_CACHE
    # ... existing scan logic ...
    _INDEX_CACHE = {"workstreams": index}
    _INDEX_CACHE_AT = time.monotonic()
    return _INDEX_CACHE

The dashboard passes ?refresh=1 after make fix-consistency completes (or just waits 30 s for the TTL to expire naturally).

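Note: a module-level cache is safe here because uvicorn runs as a single process in dev. With multiple workers in production, each process would hold its own copy and ?refresh=1 would only reset the worker that happens to handle the request; a shared cache (Redis) would fix that, but it is out of scope.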
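One thing the pattern above does not cover: if the scan yields to the event loop (for example while awaiting the database), several requests that arrive on a cold or expired cache could each start their own scan. A minimal sketch of one way to avoid that, assuming an `asyncio.Lock` around the rebuild; `AsyncTTLCache` and `slow_scan` are hypothetical names, not part of the codebase.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable


class AsyncTTLCache:
    """Cache one value for `ttl` seconds; concurrent misses share one rebuild."""

    def __init__(self, ttl: float) -> None:
        self._ttl = ttl
        self._value: Any = None
        self._stored_at: float | None = None
        self._lock = asyncio.Lock()

    def _fresh(self) -> bool:
        return (
            self._stored_at is not None
            and (time.monotonic() - self._stored_at) < self._ttl
        )

    async def get(
        self, build: Callable[[], Awaitable[Any]], refresh: bool = False
    ) -> Any:
        if not refresh and self._fresh():
            return self._value
        async with self._lock:
            # Re-check: another request may have rebuilt the value while we waited.
            if not refresh and self._fresh():
                return self._value
            self._value = await build()
            self._stored_at = time.monotonic()
            return self._value

    def clear(self) -> None:
        """Drop the cached value, e.g. from a test fixture between tests."""
        self._value = None
        self._stored_at = None


async def demo() -> tuple[int, list[Any]]:
    cache = AsyncTTLCache(ttl=30.0)
    scans = 0

    async def slow_scan() -> dict[str, Any]:
        nonlocal scans
        scans += 1
        await asyncio.sleep(0.01)  # placeholder for the real filesystem scan
        return {"workstreams": []}

    # Five requests hit the cold cache at the same time.
    results = await asyncio.gather(*(cache.get(slow_scan) for _ in range(5)))
    return scans, list(results)


scan_count, results = asyncio.run(demo())
print(scan_count)
```

The `clear()` method is the kind of hook a conftest fixture could call to reset the cache between tests.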

T3 — Replace selectin cascade on list endpoints with lean schemas

id: CUST-WP-0041-T3
status: done
priority: high
state_hub_task_id: "e15378a7-39c4-4596-b67c-1aff18674fc7"

GET /topics/, GET /domains/, and GET /repos/ trigger deep selectin relationship loads that the list callers never use. Fix with lean response models that exclude the heavy relationships:

New schemas (add to existing schema files):

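# Imports each schema file needs (if not already present)
import uuid
from datetime import datetime
from pydantic import BaseModel, ConfigDict

# api/schemas/topic.py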
class TopicListItem(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    id: uuid.UUID
    slug: str
    title: str
    domain_slug: str      # already a @property on the model
    status: str
    created_at: datetime
    updated_at: datetime

# api/schemas/managed_repo.py
class RepoListItem(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    id: uuid.UUID
    slug: str
    title: str | None
    domain_slug: str | None
    status: str
    local_path: str | None

# api/schemas/domain.py  
class DomainListItem(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    id: uuid.UUID
    slug: str
    title: str
    status: str
    created_at: datetime
    updated_at: datetime

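Router changes: switch the list endpoints' response_model to list[TopicListItem] / list[RepoListItem] / list[DomainListItem]. For list queries, suppress the selectin relationship loads with options(noload(...)) on the unused relationships (the approach the commit for this workplan takes), or remove lazy="selectin" from the models and add an explicit joinedload only where a relationship is actually needed. Note that load_only(...) by itself only restricts columns; it does not stop relationships configured with lazy="selectin" from loading.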

The detail endpoints (GET /topics/{id}) keep TopicRead (full schema).

Dashboard impact: none — dashboards only use id, slug, domain_slug, title, which all appear in the lean schema.
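The serialization side of this can be illustrated without the ORM. The sketch below is a stdlib-only stand-in, not the real models: `FakeTopic` and `TopicListItemSketch` are hypothetical, and a dataclass replaces the Pydantic model. It shows that a lean schema which copies only its declared fields never touches the heavy relationship attributes. It does not show the query side: with lazy="selectin" the relationships are loaded when the query runs, whether or not the schema reads them, which is why the router change is still required.

```python
from dataclasses import dataclass, fields


class FakeTopic:
    """Hypothetical stand-in for the ORM Topic; records heavy-attribute access."""

    def __init__(self) -> None:
        self.id = "t-1"
        self.slug = "custodian"
        self.title = "Custodian"
        self.domain_slug = "custodian"
        self.touched: list[str] = []

    @property
    def workstreams(self) -> list:
        # With a lazy relationship, a real ORM could emit a query at this point.
        self.touched.append("workstreams")
        return []


@dataclass(frozen=True)
class TopicListItemSketch:
    id: str
    slug: str
    title: str
    domain_slug: str


def to_list_item(obj: object) -> TopicListItemSketch:
    # Copy only the declared fields, reading each attribute by name.
    values = {f.name: getattr(obj, f.name) for f in fields(TopicListItemSketch)}
    return TopicListItemSketch(**values)


topic = FakeTopic()
item = to_list_item(topic)
print(item)
print(topic.touched)
```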

T4 — Add short server-side TTL cache for `/state/summary`

id: CUST-WP-0041-T4
status: done
priority: medium
depends_on: [CUST-WP-0041-T2]
state_hub_task_id: "20213c08-da5b-4ecf-9219-017969c2200b"

/state/summary is already reasonably fast (603 ms) once the T1 indexes are in place, but it aggregates 10+ tables and is called every 60 s by index.md. A 15 s server-side cache means at most one full query per 15 s window, regardless of how many tabs the user has open.

Use the same pattern as T2: module-level _SUMMARY_CACHE + _SUMMARY_CACHE_AT in api/routers/state.py, TTL = 15 s. The ETag middleware (CUST-WP-0039-T2) remains the outer layer and still returns 304 for unchanged data within the TTL window.

Expected impact

Change	Estimated latency after
T1 (indexes)	`/tasks/` → ~50 ms; `/sbom/snapshots/` → ~100 ms
T2 (workplan-index cache)	4171 ms → ~5 ms on cache hit
T3 (lean schemas)	`/topics/`, `/domains/`, `/repos/` → ~50–150 ms
T4 (summary cache)	`/state/summary` → ~5 ms on cache hit
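The estimates above can be set against the latencies recorded in the commit message. The short calculation below uses only those recorded numbers and the 300 ms target from the Goals section; it is a convenience check, not an additional benchmark.

```python
# Before/after latencies in ms, copied from the commit message for CUST-WP-0041.
measured = {
    "/workstreams/workplan-index": (4171, 16),  # after = cache hit
    "/topics/": (2382, 115),
    "/domains/": (2252, 39),
    "/repos/": (2222, 599),                     # after = first request
}
GOAL_MS = 300  # list-endpoint target from the Goals section

speedup = {
    path: round(before / after, 1) for path, (before, after) in measured.items()
}
over_goal = [path for path, (_, after) in measured.items() if after >= GOAL_MS]

for path, factor in speedup.items():
    print(f"{path}: {factor}x faster")
print("still at or above goal:", over_goal)
```

By these numbers, only the first (uncached) `/repos/` request remains above the 300 ms target; the commit message describes repeat requests as fast without giving a figure.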
