---
id: STATE-WP-0056
type: workplan
title: "Dashboard Loading Robustness and Efficiency"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-05"
updated: "2026-06-05"
state_hub_workstream_id: "28f9569c-937b-4b79-b46c-f6b1f83c09c3"
---

# Dashboard Loading Robustness and Efficiency

## Summary

Make the State Hub dashboard overview page faster and more resilient under normal polling. The current overview performs a broad concurrent fan-out of full-list API calls and treats most request failures as whole-page failures. This can surface frequent `Dashboard data load failed: The operation was aborted.` warnings when one call crosses the frontend timeout, even if the API eventually returns successfully.

This work should reduce request count, payload size, and backend contention; preserve useful last-known data during partial failures; and give operators clearer diagnostics when a section is stale or unavailable.

## Current Findings

Inspection on 2026-06-05 found:

- `dashboard/src/index.md` loads overview data with one eight-request `Promise.all` batch.
- `dashboard/src/components/config.js` aborts most `apiFetch` calls after `12_000` ms.
- A dashboard-style concurrent timing run produced several calls at or above the default timeout: `/sbom/snapshots/`, `/repos/`, and `/workplans/index`.
- The same endpoints can be much faster when called alone, which points to contention and over-fetching rather than one permanently slow endpoint.
- The overview calls `/tasks/?limit=2000`, but the tasks API currently ignores `limit` and returns every task. In the observed run that response was roughly 2.1 MB just to compute per-workplan task counts.
- `/state/summary` has a short in-process cache, but a cache miss still runs a large amount of sequential database and Python-side aggregation work.
- `/workplans/index` scans active repository workplan files and parses frontmatter.
  It is cached, but concurrent dashboard loads can still wait on the same expensive rebuild pattern.
- Several API routes set cache headers, but the shared dashboard fetch helper forces `cache: "no-store"` for every request.

## Out of Scope

- Replacing Observable Framework.
- Redesigning the dashboard information architecture.
- Adding authentication, authorization, or multi-user session handling.
- Changing workplan file conventions.
- Moving State Hub to a different database or deployment substrate.

## T01 — Add Focused Dashboard Load Instrumentation

```task
id: STATE-WP-0056-T01
status: done
priority: high
state_hub_task_id: "e5208053-0db1-4842-a221-c5289422677a"
```

Add enough timing and error visibility to confirm which overview calls are slow, aborted, or oversized during normal use.

Implementation notes:

- Add lightweight server-side timing logs or response headers for overview-hot endpoints: `/state/summary`, `/workplans/`, `/tasks/`, `/topics/`, `/repos/`, `/sbom/snapshots/`, `/progress/`, and `/workplans/index`.
- Include request path, status, elapsed time, response size when practical, and whether a cached result was used.
- Keep instrumentation local and low-noise; avoid logging full payloads or secrets.
- Add a small dashboard diagnostic surface or console logging that distinguishes timeout aborts from HTTP errors and network failures.
- Capture before/after timing notes in this workplan or a progress event.

Done when a normal dashboard refresh can be diagnosed without manually timing each endpoint from a shell.

## T02 — Make Overview Polling Partially Resilient

```task
id: STATE-WP-0056-T02
status: done
priority: high
state_hub_task_id: "2cdd960d-ba86-48d1-a7c6-e83671cd0e69"
```

Change the overview data loader so one slow or failed secondary request does not mark the whole dashboard as failed.

Implementation notes:

- Replace fail-fast `Promise.all` behavior in `dashboard/src/index.md` with a per-resource result model, for example `Promise.allSettled`.
- Keep last-known-good data for each section while a refresh is degraded.
- Treat optional resources such as SBOM snapshots, registration milestones, and workplan file metadata independently from core summary/workplan status data.
- Display section-level stale/error indicators instead of one global warning whenever possible.
- Keep exponential backoff for repeated failures, but do not discard usable data just because one request timed out.
- Make abort errors user-readable, for example "timed out after 12s" instead of only "The operation was aborted."

Done when an SBOM, repo-list, or workplan-index timeout leaves the rest of the overview usable and visibly stale rather than failed.

## T03 — Respect Pagination and Add Task Count Aggregates

```task
id: STATE-WP-0056-T03
status: done
priority: high
state_hub_task_id: "78484226-9ccc-460c-a2b3-750b3204caa3"
```

Stop returning all tasks for overview count calculations.

Implementation notes:

- Add `limit` and `offset` support to `GET /tasks/`, preserving existing filter behavior and sensible limits.
- Add a lightweight aggregate endpoint for task counts by workplan and status, for example `GET /tasks/counts?group_by=workstream,status`, or add an overview-specific aggregate route.
- Prefer SQL `GROUP BY` over transferring every task to the browser.
- Update `dashboard/src/index.md`, `dashboard/src/tasks.md`, `dashboard/src/interventions.md`, and workplan detail pages as needed so list views still receive the rows they need.
- Add tests for pagination compatibility and aggregate counts.

Done when the overview no longer fetches the full task table to draw the workplan chart.

## T04 — Build a Lightweight Overview Read Endpoint

```task
id: STATE-WP-0056-T04
status: done
priority: high
state_hub_task_id: "2cf47a12-e8aa-49ca-963c-1f0d2933c344"
```

Create a dashboard-specific read model that returns exactly the data needed by the overview page in one bounded response.
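
As a rough illustration of what "bounded" can mean here, the sketch below builds workplan chart rows from a single SQL `GROUP BY` query instead of shipping every task row to the browser. It is only a sketch under stated assumptions: the `tasks` table, its `workstream_id` and `status` columns, the `workplan_rows` and `task_total` response keys, and the helper names are all hypothetical and are not taken from the State Hub schema.

```python
import sqlite3


def workplan_task_counts(conn):
    """Return {workstream_id: {status: count}} from one GROUP BY query.

    Table and column names are hypothetical placeholders.
    """
    rows = conn.execute(
        "SELECT workstream_id, status, COUNT(*) "
        "FROM tasks GROUP BY workstream_id, status"
    ).fetchall()
    counts = {}
    for workstream_id, status, n in rows:
        counts.setdefault(workstream_id, {})[status] = n
    return counts


def build_overview(conn):
    """Assemble a small overview payload; the field names are assumptions."""
    counts = workplan_task_counts(conn)
    return {
        # One row per workplan, sized by the number of workplans, not tasks.
        "workplan_rows": [
            {"workstream_id": ws, "task_counts": by_status}
            for ws, by_status in sorted(counts.items())
        ],
        "task_total": sum(
            n for by_status in counts.values() for n in by_status.values()
        ),
    }


def demo_connection():
    """In-memory database with a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks ("
        "id INTEGER PRIMARY KEY, workstream_id TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks (workstream_id, status) VALUES (?, ?)",
        [("ws-a", "done"), ("ws-a", "done"), ("ws-a", "todo"), ("ws-b", "blocked")],
    )
    return conn
```

Calling `build_overview(demo_connection())` yields one row per workstream with per-status counts, so the payload grows with the number of workplans and statuses rather than with the size of the task table.
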

Implementation notes:

- Add an endpoint such as `GET /state/overview` or `GET /state/dashboard-overview`.
- Include summary totals, recent progress needed by the page, blocking decision counts, waiting-task counts, SBOM snapshot totals, registration milestones, and workplan chart rows with repo/domain labels and task counts.
- Keep response fields stable and documented in dashboard reference docs.
- Reuse existing summary helpers where they are efficient, but avoid serializing large full-list payloads that the overview does not display directly.
- Add cache headers and a short in-process cache with explicit invalidation rules where appropriate.
- Update `dashboard/src/index.md` to prefer this endpoint and remove redundant overview-only fetches.

Done when the overview's steady-state refresh is one bounded API call plus only truly interactive secondary calls.

## T05 — Add Stale-While-Refresh for File-Backed Workplan Index

```task
id: STATE-WP-0056-T05
status: done
priority: medium
state_hub_task_id: "0c88c1a2-588b-41f8-bc1c-f94c8b4b0d1a"
```

Make `/workplans/index` resilient when repository filesystem scans are slow.

Implementation notes:

- Add singleflight behavior so concurrent requests share one in-progress rebuild instead of starting or waiting on redundant scans.
- Return stale cached data quickly while a background refresh runs when the cache is expired but still available.
- Include metadata such as `generated_at`, `stale`, `cache_age_seconds`, and optionally `refresh_in_progress`.
- Consider reading only frontmatter rather than whole markdown files if this can be done cleanly.
- Keep `refresh=true` as an explicit operator escape hatch.
- Add tests for cache hit, stale return, and forced refresh behavior.

Done when a slow filesystem scan cannot block normal dashboard refreshes for longer than the frontend timeout if cached data exists.
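
The singleflight and stale-while-refresh behavior described in T05 could be sketched roughly as follows. This is a minimal thread-based illustration, not the State Hub implementation: the `StaleWhileRefreshCache` class, its `rebuild` callable, and the `data` key are hypothetical, while the metadata keys mirror the names listed in the notes above. It assumes the expensive scan can run in a plain background thread.

```python
import threading
import time


class StaleWhileRefreshCache:
    """Hypothetical cache wrapper illustrating singleflight plus stale returns."""

    def __init__(self, rebuild, ttl_seconds, clock=time.time):
        self._rebuild = rebuild              # expensive callable, e.g. a file scan
        self._ttl = ttl_seconds
        self._clock = clock                  # injectable so tests can fake time
        self._lock = threading.Lock()        # guards the fields below
        self._build_lock = threading.Lock()  # serializes blocking rebuilds
        self._value = None
        self._built_at = None
        self._refreshing = False

    def _store(self, value):
        with self._lock:
            self._value = value
            self._built_at = self._clock()

    def _envelope(self):
        # Caller must hold self._lock.
        age = self._clock() - self._built_at
        return {
            "data": self._value,
            "generated_at": self._built_at,
            "cache_age_seconds": age,
            "stale": age >= self._ttl,
            "refresh_in_progress": self._refreshing,
        }

    def _background_refresh(self):
        try:
            self._store(self._rebuild())
        finally:
            with self._lock:
                self._refreshing = False

    def get(self, force_refresh=False):
        with self._lock:
            if self._built_at is not None and not force_refresh:
                result = self._envelope()
                # Expired but available: answer now, refresh at most once.
                if result["stale"] and not self._refreshing:
                    self._refreshing = True
                    result["refresh_in_progress"] = True
                    threading.Thread(
                        target=self._background_refresh, daemon=True
                    ).start()
                return result
        # Cold cache or explicit refresh: callers queue on one build lock.
        with self._build_lock:
            with self._lock:
                # Another caller may have filled the cache while we waited.
                if self._built_at is not None and not force_refresh:
                    return self._envelope()
            self._store(self._rebuild())
            with self._lock:
                return self._envelope()
```

In this sketch a `refresh=true` request would map to `get(force_refresh=True)`, which always rebuilds synchronously. A forced rebuild may overlap with a background refresh that is already running; a production version might want to join the in-flight rebuild instead.
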
## T06 — Use Browser and HTTP Caching Selectively

```task
id: STATE-WP-0056-T06
status: done
priority: medium
state_hub_task_id: "811f02ff-2e92-4c82-8b8a-e3d39a450b02"
```

Let stable lookup requests benefit from cache headers instead of forcing every dashboard request to bypass caches.

Implementation notes:

- Extend `apiFetch` so callers can choose cache mode.
- Keep `no-store` for volatile mutation-sensitive resources.
- Use default browser caching or `reload` only where route cache headers are already intentional, such as repo/topic lookup data.
- Review current route cache headers and align them with dashboard polling needs.
- Avoid stale cached data for controls that immediately follow a mutation.

Done when stable overview lookup data no longer bypasses useful cache headers by default.

## T07 — Optimize `/state/summary` Cache Misses

```task
id: STATE-WP-0056-T07
status: done
priority: medium
state_hub_task_id: "633f4cc6-ffeb-4086-9858-d239f50a9686"
```

Reduce the cost of a cold or expired `/state/summary` request.

Implementation notes:

- Profile the current sequential query groups in `api/routers/state.py`.
- Move Python-side counts and scans into SQL where straightforward.
- Remove unused work from the summary path, such as dead intermediate query results.
- Cache derived sections independently when their freshness requirements differ.
- Add indexes only after profiling shows a query plan needs them.
- Keep summary response compatibility for existing consumers and MCP smoke tests.

Done when a summary cache miss stays comfortably below the frontend timeout under the current local data volume.

## T08 — Verify Under Dashboard-Style Load

```task
id: STATE-WP-0056-T08
status: done
priority: high
state_hub_task_id: "353fb25a-5306-416b-8d6d-9b201e6fac87"
```

Prove the dashboard no longer produces frequent abort warnings under realistic refresh behavior.

Implementation notes:

- Add or document a repeatable script that performs dashboard-style concurrent endpoint timing before and after the changes.
- Run API tests and dashboard component tests.
- Open the dashboard locally and verify that initial load, refresh, hidden-tab pause/resume, and partial API failure states behave correctly.
- Confirm payload sizes are lower than the baseline for the overview page.
- Update `dashboard/src/docs/overview.md` and `dashboard/src/docs/live-data.md` with the new data-loading model.

Done when repeated dashboard refreshes do not show the global aborted-operation warning during normal local operation, and degraded sections recover cleanly.
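
A repeatable dashboard-style timing run could look something like the sketch below, which fires all overview-hot requests concurrently and records status, payload size, and elapsed time per path. The endpoint list comes from T01; everything else is an assumption made for illustration, including the helper names, the `http://localhost:8000` base URL, and the reuse of the 12-second frontend timeout as the client timeout.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Overview-hot endpoints as listed in T01.
OVERVIEW_PATHS = [
    "/state/summary", "/workplans/", "/tasks/", "/topics/",
    "/repos/", "/sbom/snapshots/", "/progress/", "/workplans/index",
]


def http_fetch(base_url, timeout_s):
    """Return a fetch(path) callable that yields (status, body_size_bytes)."""
    def fetch(path):
        with urllib.request.urlopen(base_url + path, timeout=timeout_s) as resp:
            return resp.status, len(resp.read())
    return fetch


def time_endpoints(paths, fetch):
    """Call fetch(path) for every path at once and time each call.

    Failures, including timeouts and HTTP error statuses raised by urllib,
    are recorded per path instead of aborting the whole run.
    """
    def timed(path):
        start = time.perf_counter()
        try:
            status, size = fetch(path)
            error = None
        except Exception as exc:
            status, size, error = None, 0, f"{type(exc).__name__}: {exc}"
        elapsed_ms = round((time.perf_counter() - start) * 1000, 1)
        return {"path": path, "status": status, "bytes": size,
                "elapsed_ms": elapsed_ms, "error": error}

    # One worker per path so the requests really overlap.
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        return list(pool.map(timed, paths))


def main(base_url="http://localhost:8000", timeout_s=12):
    # base_url is a placeholder; point it at the local API before running.
    for row in time_endpoints(OVERVIEW_PATHS, http_fetch(base_url, timeout_s)):
        print(row)
```

Running `main()` before and after a change gives comparable per-endpoint rows. Passing a fake `fetch` callable to `time_endpoints` keeps the harness testable without a running API.
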