tegwick b340489d96 Optimize dashboard overview loading

2026-06-06 00:42:00 +02:00

id	type	title	domain	repo	status	owner	topic_slug	created	updated	state_hub_workstream_id
STATE-WP-0056	workplan	Dashboard Loading Robustness and Efficiency	custodian	state-hub	finished	codex	custodian	2026-06-05	2026-06-05	28f9569c-937b-4b79-b46c-f6b1f83c09c3

Dashboard Loading Robustness and Efficiency

Summary

Make the State Hub dashboard overview page faster and more resilient under normal polling. The current overview performs a broad concurrent fan-out of full-list API calls and treats most request failures as whole-page failures. As a result, a single call that crosses the frontend timeout can surface the warning "Dashboard data load failed: The operation was aborted." for the whole page, even if the API eventually returns successfully.

This work should reduce request count, payload size, and backend contention; preserve useful last-known data during partial failures; and give operators clearer diagnostics when a section is stale or unavailable.

Current Findings

Inspection on 2026-06-05 found:

dashboard/src/index.md loads overview data with one eight-request Promise.all batch.
dashboard/src/components/config.js aborts most apiFetch calls after 12_000 ms.
A dashboard-style concurrent timing run produced several calls at or above the default timeout: /sbom/snapshots/, /repos/, and /workplans/index.
The same endpoints can be much faster when called alone, which points to contention and over-fetching rather than one permanently slow endpoint.
The overview calls /tasks/?limit=2000, but the tasks API currently ignores limit and returns every task. In the observed run that response was roughly 2.1 MB just to compute per-workplan task counts.
/state/summary has a short in-process cache, but a cache miss still runs a large amount of sequential database and Python-side aggregation work.
/workplans/index scans active repository workplan files and parses frontmatter. It is cached, but concurrent dashboard loads can still wait on the same expensive rebuild pattern.
Several API routes set cache headers, but the shared dashboard fetch helper forces cache: "no-store" for every request.

Out of Scope

Replacing Observable Framework.
Redesigning the dashboard information architecture.
Adding authentication, authorization, or multi-user session handling.
Changing workplan file conventions.
Moving State Hub to a different database or deployment substrate.

T01 — Add Focused Dashboard Load Instrumentation

id: STATE-WP-0056-T01
status: done
priority: high
state_hub_task_id: "e5208053-0db1-4842-a221-c5289422677a"

Add enough timing and error visibility to confirm which overview calls are slow, aborted, or oversized during normal use.

Implementation notes:

Add lightweight server-side timing logs or response headers for overview-hot endpoints: /state/summary, /workplans/, /tasks/, /topics/, /repos/, /sbom/snapshots/, /progress/, and /workplans/index.
Include request path, status, elapsed time, response size when practical, and whether a cached result was used.
Keep instrumentation local and low-noise; avoid logging full payloads or secrets.
Add a small dashboard diagnostic surface or console logging that distinguishes timeout aborts from HTTP errors and network failures.
Capture before/after timing notes in this workplan or a progress event.

Done when a normal dashboard refresh can be diagnosed without manually timing each endpoint from a shell.
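
A minimal sketch of the timing record described in the notes above, written without any web framework so it can be adapted to whatever the API uses. `OverviewTiming`, `timed`, the logger name, and the metric names in the header value are hypothetical, not taken from the State Hub code; only the `Server-Timing` header itself is a standard.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass
from typing import Callable, Tuple

log = logging.getLogger("overview.timing")  # hypothetical logger name


@dataclass
class OverviewTiming:
    path: str
    status: int
    elapsed_ms: float
    response_bytes: int
    cache_hit: bool

    def server_timing(self) -> str:
        """Render as a Server-Timing header value (metric names are illustrative)."""
        cache = "hit" if self.cache_hit else "miss"
        return f'app;dur={self.elapsed_ms:.1f}, cache;desc="{cache}"'


def timed(
    path: str, handler: Callable[[], Tuple[int, bytes, bool]]
) -> Tuple[bytes, OverviewTiming]:
    """Run handler() -> (status, body, cache_hit) and record timing.

    Only metadata is logged; the payload itself never reaches the log.
    """
    start = time.perf_counter()
    status, body, cache_hit = handler()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    timing = OverviewTiming(path, status, elapsed_ms, len(body), cache_hit)
    log.info("overview_timing %s", json.dumps(asdict(timing)))
    return body, timing
```

Emitting one structured line per request keeps the log low-noise while still allowing before/after comparisons by path.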

T02 — Make Overview Polling Partially Resilient

id: STATE-WP-0056-T02
status: done
priority: high
state_hub_task_id: "2cdd960d-ba86-48d1-a7c6-e83671cd0e69"

Change the overview data loader so one slow or failed secondary request does not mark the whole dashboard as failed.

Implementation notes:

Replace fail-fast Promise.all behavior in dashboard/src/index.md with a per-resource result model, for example Promise.allSettled.
Keep last-known-good data for each section while a refresh is degraded.
Treat optional resources such as SBOM snapshots, registration milestones, and workplan file metadata independently from core summary/workplan status data.
Display section-level stale/error indicators instead of one global warning whenever possible.
Keep exponential backoff for repeated failures, but do not discard usable data just because one request timed out.
Make abort errors user-readable, for example "timed out after 12s" instead of only "The operation was aborted."

Done when an SBOM, repo-list, or workplan-index timeout leaves the rest of the overview usable and visibly stale rather than failed.
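
The per-resource result model can be sketched as follows. The dashboard loader itself is JavaScript, where `Promise.allSettled` would play the role that `asyncio.gather` plays here; this is a Python illustration of the same idea, and `SectionState`, `refresh`, and the section names used below are hypothetical.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple

TIMEOUT_S = 12.0  # mirrors the 12_000 ms frontend timeout noted in the findings


@dataclass
class SectionState:
    data: Optional[Any] = None         # last-known-good payload
    error: Optional[str] = None        # readable message for the latest failure
    stale: bool = False                # True when data predates a failed refresh
    loaded_at: Optional[float] = None  # time of the last successful load


async def _load_one(
    fetch: Callable[[], Awaitable[Any]], timeout_s: float
) -> Tuple[Optional[Any], Optional[str]]:
    """Return (data, None) on success or (None, message) on failure; never raises."""
    try:
        return await asyncio.wait_for(fetch(), timeout_s), None
    except asyncio.TimeoutError:
        return None, f"timed out after {timeout_s:g}s"
    except Exception as exc:
        return None, f"{type(exc).__name__}: {exc}"


async def refresh(
    sections: Dict[str, SectionState],
    fetchers: Dict[str, Callable[[], Awaitable[Any]]],
    timeout_s: float = TIMEOUT_S,
) -> Dict[str, SectionState]:
    """Refresh every section concurrently; a failure only affects its own section."""
    names = list(fetchers)
    results = await asyncio.gather(*(_load_one(fetchers[n], timeout_s) for n in names))
    for name, (data, error) in zip(names, results):
        state = sections.setdefault(name, SectionState())
        if error is None:
            state.data, state.error, state.stale = data, None, False
            state.loaded_at = time.time()
        else:
            # Keep the previous payload and mark it stale instead of discarding it.
            state.error = error
            state.stale = state.data is not None
    return sections
```

Because each fetch is wrapped individually, a timeout becomes a readable per-section message while the other sections keep their fresh or last-known-good data.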

T03 — Respect Pagination and Add Task Count Aggregates

id: STATE-WP-0056-T03
status: done
priority: high
state_hub_task_id: "78484226-9ccc-460c-a2b3-750b3204caa3"

Stop returning all tasks for overview count calculations.

Implementation notes:

Add limit and offset support to GET /tasks/, preserving existing filter behavior and sensible limits.
Add a lightweight aggregate endpoint for task counts by workplan and status, for example GET /tasks/counts?group_by=workstream,status, or add an overview-specific aggregate route.
Prefer SQL GROUP BY over transferring every task to the browser.
Update dashboard/src/index.md, dashboard/src/tasks.md, dashboard/src/interventions.md, and workplan detail pages as needed so list views still receive the rows they need.
Add tests for pagination compatibility and aggregate counts.

Done when the overview no longer fetches the full task table to draw the workplan chart.
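
To illustrate the difference between paginating rows and aggregating counts in SQL, here is a small sketch against an in-memory SQLite database. The `tasks` table layout, the column names, and the `MAX_LIMIT` cap are assumptions made for the example; the real schema and database may differ.

```python
import sqlite3
from typing import Dict, List, Tuple

MAX_LIMIT = 500  # hypothetical upper bound for a single page


def list_tasks(conn: sqlite3.Connection, limit: int = 100, offset: int = 0) -> List[Tuple]:
    """Return one page of tasks; limit is clamped so callers cannot request everything."""
    limit = max(1, min(int(limit), MAX_LIMIT))
    offset = max(0, int(offset))
    return conn.execute(
        "SELECT id, workstream_id, status FROM tasks ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


def task_counts(conn: sqlite3.Connection) -> Dict[str, Dict[str, int]]:
    """Count tasks per workstream and status in SQL instead of shipping every row."""
    rows = conn.execute(
        "SELECT workstream_id, status, COUNT(*) FROM tasks "
        "GROUP BY workstream_id, status ORDER BY workstream_id, status"
    ).fetchall()
    counts: Dict[str, Dict[str, int]] = {}
    for workstream_id, status, n in rows:
        counts.setdefault(workstream_id, {})[status] = n
    return counts


def demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory database with illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, workstream_id TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (workstream_id, status) VALUES (?, ?)",
        [("ws-a", "done"), ("ws-a", "done"), ("ws-a", "todo"), ("ws-b", "todo")],
    )
    return conn
```

The aggregate result grows with the number of workstream and status combinations rather than with the number of tasks, which is what the overview chart needs.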

T04 — Build a Lightweight Overview Read Endpoint

id: STATE-WP-0056-T04
status: done
priority: high
state_hub_task_id: "2cf47a12-e8aa-49ca-963c-1f0d2933c344"

Create a dashboard-specific read model that returns exactly the data needed by the overview page in one bounded response.

Implementation notes:

Add an endpoint such as GET /state/overview or GET /state/dashboard-overview.
Include summary totals, recent progress needed by the page, blocking decision counts, waiting-task counts, SBOM snapshot totals, registration milestones, and workplan chart rows with repo/domain labels and task counts.
Keep response fields stable and documented in dashboard reference docs.
Reuse existing summary helpers where they are efficient, but avoid serializing large full-list payloads that the overview does not display directly.
Add cache headers and a short in-process cache with explicit invalidation rules where appropriate.
Update dashboard/src/index.md to prefer this endpoint and remove redundant overview-only fetches.

Done when the overview's steady-state refresh is one bounded API call plus only truly interactive secondary calls.
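
One possible shape for the short in-process cache with explicit invalidation mentioned above. `TtlCache` is a hypothetical helper, not the existing `/state/summary` cache, and it is deliberately single-threaded; a real route handler would need locking or a per-worker instance.

```python
import time
from typing import Any, Callable, Optional, Tuple


class TtlCache:
    """Hold one value for ttl_s seconds, with an explicit invalidate() hook."""

    def __init__(self, ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock  # injectable so tests do not need to sleep
        self._entry: Optional[Tuple[float, Any]] = None

    def get_or_build(self, build: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (value, cache_hit); build() runs only on a miss or after expiry."""
        now = self._clock()
        if self._entry is not None and now - self._entry[0] < self._ttl_s:
            return self._entry[1], True
        value = build()
        self._entry = (self._clock(), value)
        return value, False

    def invalidate(self) -> None:
        """Drop the cached value, for example after a write that changes the overview."""
        self._entry = None
```

Returning the hit flag alongside the value makes it straightforward to report whether a cached result was used in timing logs.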

T05 — Add Stale-While-Refresh for File-Backed Workplan Index

id: STATE-WP-0056-T05
status: done
priority: medium
state_hub_task_id: "0c88c1a2-588b-41f8-bc1c-f94c8b4b0d1a"

Make /workplans/index resilient when repository filesystem scans are slow.

Implementation notes:

Add singleflight behavior so concurrent requests share one in-progress rebuild instead of starting or waiting on redundant scans.
Return stale cached data quickly while a background refresh runs when the cache is expired but still available.
Include metadata such as generated_at, stale, cache_age_seconds, and optionally refresh_in_progress.
Consider reading only frontmatter rather than whole markdown files if this can be done cleanly.
Keep refresh=true as an explicit operator escape hatch.
Add tests for cache hit, stale return, and forced refresh behavior.

Done when a slow filesystem scan cannot block normal dashboard refreshes for longer than the frontend timeout if cached data exists.
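
The singleflight and stale-while-refresh behavior can be sketched with a lock and a shared event. This is an illustration under stated assumptions, not the State Hub implementation: `StaleWhileRefresh` is a hypothetical class, `build` stands in for the repository scan, error handling for a failed rebuild is omitted, and a forced refresh joins a rebuild that is already running rather than starting a second one.

```python
import threading
import time
from typing import Any, Callable, Dict, Optional


class StaleWhileRefresh:
    """Serve cached data immediately; at most one rebuild runs at a time."""

    def __init__(self, build: Callable[[], Any], ttl_s: float):
        self._build = build
        self._ttl_s = ttl_s
        self._lock = threading.Lock()
        self._value: Optional[Any] = None
        self._generated_at: Optional[float] = None
        self._refresh_done: Optional[threading.Event] = None  # set while a rebuild runs

    def _envelope(self) -> Dict[str, Any]:
        # Caller must hold self._lock.
        age = None if self._generated_at is None else time.time() - self._generated_at
        return {
            "data": self._value,
            "generated_at": self._generated_at,
            "cache_age_seconds": age,
            "stale": age is None or age >= self._ttl_s,
            "refresh_in_progress": self._refresh_done is not None,
        }

    def _rebuild(self, done: threading.Event) -> None:
        try:
            value = self._build()  # the slow scan runs outside the lock
            with self._lock:
                self._value, self._generated_at = value, time.time()
        finally:
            with self._lock:
                self._refresh_done = None
            done.set()

    def get(self, refresh: bool = False) -> Dict[str, Any]:
        with self._lock:
            snapshot = self._envelope()
            if not snapshot["stale"] and not refresh:
                return snapshot  # fresh cache hit
            done = self._refresh_done
            if done is None:  # singleflight: only the first caller starts a rebuild
                done = self._refresh_done = threading.Event()
                threading.Thread(target=self._rebuild, args=(done,), daemon=True).start()
            if snapshot["generated_at"] is not None and not refresh:
                return self._envelope()  # stale data now, refresh in the background
        done.wait()  # cold cache or refresh=True: wait for the shared rebuild
        with self._lock:
            return self._envelope()
```

With this shape, only a cold cache or an explicit `refresh=True` waits on the scan; every other caller gets an answer as soon as it acquires the lock.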

T06 — Use Browser and HTTP Caching Selectively

id: STATE-WP-0056-T06
status: done
priority: medium
state_hub_task_id: "811f02ff-2e92-4c82-8b8a-e3d39a450b02"

Let stable lookup requests benefit from cache headers instead of forcing every dashboard request to bypass caches.

Implementation notes:

Extend apiFetch so callers can choose cache mode.
Keep no-store for volatile mutation-sensitive resources.
Use default browser caching or reload only where route cache headers are already intentional, such as repo/topic lookup data.
Review current route cache headers and align them with dashboard polling needs.
Avoid stale cached data for controls that immediately follow a mutation.

Done when stable overview lookup data no longer bypasses useful cache headers by default.

T07 — Optimize `/state/summary` Cache Misses

id: STATE-WP-0056-T07
status: done
priority: medium
state_hub_task_id: "633f4cc6-ffeb-4086-9858-d239f50a9686"

Reduce the cost of a cold or expired /state/summary request.

Implementation notes:

Profile the current sequential query groups in api/routers/state.py.
Move Python-side counts and scans into SQL where straightforward.
Remove unused work from the summary path, such as dead intermediate query results.
Cache derived sections independently when their freshness requirements differ.
Add indexes only after profiling shows a query plan needs them.
Keep summary response compatibility for existing consumers and MCP smoke tests.

Done when a summary cache miss stays comfortably below the frontend timeout under the current local data volume.

T08 — Verify Under Dashboard-Style Load

id: STATE-WP-0056-T08
status: done
priority: high
state_hub_task_id: "353fb25a-5306-416b-8d6d-9b201e6fac87"

Prove the dashboard no longer produces frequent abort warnings under realistic refresh behavior.

Implementation notes:

Add or document a repeatable script that performs dashboard-style concurrent endpoint timing before and after the changes.
Run API tests and dashboard component tests.
Open the dashboard locally and verify that initial load, refresh, hidden-tab pause/resume, and partial API failure states behave correctly.
Confirm payload sizes are lower than the baseline for the overview page.
Update dashboard/src/docs/overview.md and dashboard/src/docs/live-data.md with the new data-loading model.

Done when repeated dashboard refreshes do not show the global aborted-operation warning during normal local operation, and degraded sections recover cleanly.
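
A repeatable timing script along these lines could look like the sketch below. The path list is the set of overview-hot endpoints named in T01; the helper names, the injectable `fetch` parameter, and any base URL passed in are assumptions, and the real overview batch may use different paths or query parameters.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

OVERVIEW_PATHS = [
    "/state/summary", "/workplans/", "/tasks/", "/topics/",
    "/repos/", "/sbom/snapshots/", "/progress/", "/workplans/index",
]
TIMEOUT_S = 12.0  # same budget as the frontend timeout


def http_get(url: str) -> bytes:
    """Fetch a URL with the dashboard's timeout budget and return the raw body."""
    with urllib.request.urlopen(url, timeout=TIMEOUT_S) as resp:
        return resp.read()


def time_one(base_url: str, path: str, fetch: Callable[[str], bytes]) -> Dict[str, Any]:
    """Time a single request; failures are recorded by exception name, not raised."""
    start = time.perf_counter()
    try:
        body = fetch(base_url + path)
        outcome, size = "ok", len(body)
    except Exception as exc:
        outcome, size = type(exc).__name__, 0
    return {
        "path": path,
        "outcome": outcome,
        "bytes": size,
        "elapsed_s": time.perf_counter() - start,
    }


def time_overview(
    base_url: str,
    paths: List[str] = OVERVIEW_PATHS,
    fetch: Callable[[str], bytes] = http_get,
) -> List[Dict[str, Any]]:
    """Issue all requests concurrently, as the overview page does, and time each one."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return list(pool.map(lambda p: time_one(base_url, p, fetch), paths))
```

Calling `time_overview` with the local API base URL before and after the changes, and saving the rows, gives comparable elapsed times and payload sizes per endpoint.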
