generated from coulomb/repo-seed
Optimize dashboard overview loading
276
workplans/STATE-WP-0056-dashboard-loading-robustness.md
Normal file
@@ -0,0 +1,276 @@
---
id: STATE-WP-0056
type: workplan
title: "Dashboard Loading Robustness and Efficiency"
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: "2026-06-05"
updated: "2026-06-05"
state_hub_workstream_id: "28f9569c-937b-4b79-b46c-f6b1f83c09c3"
---

# Dashboard Loading Robustness and Efficiency

## Summary

Make the State Hub dashboard overview page faster and more resilient under
normal polling. The current overview performs a broad concurrent fan-out of
full-list API calls and treats most request failures as whole-page failures.
This can surface frequent `Dashboard data load failed: The operation was
aborted.` warnings when one call crosses the frontend timeout, even if the API
eventually returns successfully.

This work should reduce request count, payload size, and backend contention;
preserve useful last-known data during partial failures; and give operators
clearer diagnostics when a section is stale or unavailable.

## Current Findings

Inspection on 2026-06-05 found:

- `dashboard/src/index.md` loads overview data with one eight-request
  `Promise.all` batch.
- `dashboard/src/components/config.js` aborts most `apiFetch` calls after
  `12_000` ms.
- A dashboard-style concurrent timing run produced several calls at or above the
  default timeout: `/sbom/snapshots/`, `/repos/`, and `/workplans/index`.
- The same endpoints can be much faster when called alone, which points to
  contention and over-fetching rather than one permanently slow endpoint.
- The overview calls `/tasks/?limit=2000`, but the tasks API currently ignores
  `limit` and returns every task. In the observed run that response was roughly
  2.1 MB just to compute per-workplan task counts.
- `/state/summary` has a short in-process cache, but a cache miss still runs a
  large amount of sequential database and Python-side aggregation work.
- `/workplans/index` scans active repository workplan files and parses
  frontmatter. It is cached, but concurrent dashboard loads can still wait on
  the same expensive rebuild pattern.
- Several API routes set cache headers, but the shared dashboard fetch helper
  forces `cache: "no-store"` for every request.

## Out of Scope

- Replacing Observable Framework.
- Redesigning the dashboard information architecture.
- Adding authentication, authorization, or multi-user session handling.
- Changing workplan file conventions.
- Moving State Hub to a different database or deployment substrate.

## T01 — Add Focused Dashboard Load Instrumentation

```task
id: STATE-WP-0056-T01
status: done
priority: high
state_hub_task_id: "e5208053-0db1-4842-a221-c5289422677a"
```

Add enough timing and error visibility to confirm which overview calls are slow,
aborted, or oversized during normal use.

Implementation notes:

- Add lightweight server-side timing logs or response headers for overview-hot
  endpoints: `/state/summary`, `/workplans/`, `/tasks/`, `/topics/`, `/repos/`,
  `/sbom/snapshots/`, `/progress/`, and `/workplans/index`.
- Include request path, status, elapsed time, response size when practical, and
  whether a cached result was used.
- Keep instrumentation local and low-noise; avoid logging full payloads or
  secrets.
- Add a small dashboard diagnostic surface or console logging that distinguishes
  timeout aborts from HTTP errors and network failures.
- Capture before/after timing notes in this workplan or a progress event.

Done when a normal dashboard refresh can be diagnosed without manually timing
each endpoint from a shell.

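As an illustration of the server-side timing notes in T01, the sketch below wraps a handler and logs only metadata: path, status, elapsed time, response size, and cache use. It is framework-agnostic on purpose, since this workplan does not name the API framework. `timed_call`, `TimingRecord`, the logger name, and the handler's `(status, payload, cache_hit)` return shape are all assumptions made for the example.

```python
import json
import logging
import time
from dataclasses import asdict, dataclass
from typing import Callable, Tuple

logger = logging.getLogger("overview.timing")  # hypothetical logger name


@dataclass
class TimingRecord:
    path: str
    status: int
    elapsed_ms: float
    response_bytes: int
    cache_hit: bool


def timed_call(
    path: str, handler: Callable[[], Tuple[int, object, bool]]
) -> Tuple[object, TimingRecord]:
    """Run a handler, measure it, and log metadata only (never the payload)."""
    start = time.perf_counter()
    status, payload, cache_hit = handler()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    # Size of the JSON-encoded payload; a real route would measure the bytes
    # it actually sends.
    response_bytes = len(json.dumps(payload).encode("utf-8"))
    record = TimingRecord(path, status, round(elapsed_ms, 1), response_bytes, cache_hit)
    logger.info("overview timing %s", asdict(record))
    return payload, record


def fake_summary_handler():
    # Stand-in for a real route handler: (status, payload, cache_hit).
    return 200, {"workplans": 3, "tasks": 42}, False


payload, record = timed_call("/state/summary", fake_summary_handler)
```

The same fields could instead be emitted as response headers; logging is used here only because it needs no framework support.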
## T02 — Make Overview Polling Partially Resilient

```task
id: STATE-WP-0056-T02
status: done
priority: high
state_hub_task_id: "2cdd960d-ba86-48d1-a7c6-e83671cd0e69"
```

Change the overview data loader so one slow or failed secondary request does
not mark the whole dashboard as failed.

Implementation notes:

- Replace fail-fast `Promise.all` behavior in `dashboard/src/index.md` with a
  per-resource result model, for example `Promise.allSettled`.
- Keep last-known-good data for each section while a refresh is degraded.
- Treat optional resources such as SBOM snapshots, registration milestones, and
  workplan file metadata independently from core summary/workplan status data.
- Display section-level stale/error indicators instead of one global warning
  whenever possible.
- Keep exponential backoff for repeated failures, but do not discard usable
  data just because one request timed out.
- Make abort errors user-readable, for example "timed out after 12s" instead of
  only "The operation was aborted."

Done when an SBOM, repo-list, or workplan-index timeout leaves the rest of the
overview usable and visibly stale rather than failed.

## T03 — Respect Pagination and Add Task Count Aggregates

```task
id: STATE-WP-0056-T03
status: done
priority: high
state_hub_task_id: "78484226-9ccc-460c-a2b3-750b3204caa3"
```

Stop returning all tasks for overview count calculations.

Implementation notes:

- Add `limit` and `offset` support to `GET /tasks/`, preserving existing filter
  behavior and sensible limits.
- Add a lightweight aggregate endpoint for task counts by workplan and status,
  for example `GET /tasks/counts?group_by=workstream,status`, or add an
  overview-specific aggregate route.
- Prefer SQL `GROUP BY` over transferring every task to the browser.
- Update `dashboard/src/index.md`, `dashboard/src/tasks.md`,
  `dashboard/src/interventions.md`, and workplan detail pages as needed so list
  views still receive the rows they need.
- Add tests for pagination compatibility and aggregate counts.

Done when the overview no longer fetches the full task table to draw the
workplan chart.

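The pagination and aggregate notes in T03 can be sketched with an in-memory SQLite database. Everything about the schema is an assumption: the `tasks` table, its `workstream_id` and `status` columns, the `MAX_LIMIT` cap, and the helper names are invented for illustration, and the real State Hub database may look quite different.

```python
import sqlite3

MAX_LIMIT = 500  # hypothetical cap; the workplan only asks for "sensible limits"


def clamp_page(limit, offset):
    """Keep caller-supplied paging values inside safe bounds."""
    return max(1, min(int(limit), MAX_LIMIT)), max(0, int(offset))


def list_tasks(conn, limit=100, offset=0):
    """Return one page of task rows instead of the whole table."""
    limit, offset = clamp_page(limit, offset)
    return conn.execute(
        "SELECT id, workstream_id, status FROM tasks ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


def task_counts(conn):
    """Count tasks per workstream and status inside the database."""
    rows = conn.execute(
        "SELECT workstream_id, status, COUNT(*) FROM tasks "
        "GROUP BY workstream_id, status ORDER BY workstream_id, status"
    ).fetchall()
    return [{"workstream_id": w, "status": s, "count": c} for w, s, c in rows]


# Tiny in-memory dataset standing in for the real tasks table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, workstream_id TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO tasks (workstream_id, status) VALUES (?, ?)",
    [("ws-a", "done"), ("ws-a", "done"), ("ws-a", "todo"), ("ws-b", "todo")],
)

page = list_tasks(conn, limit=2, offset=1)
counts = task_counts(conn)
```

With counts computed this way, the overview would receive one small row per workstream and status pair rather than every task, which is the payload reduction this task is after.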
## T04 — Build a Lightweight Overview Read Endpoint

```task
id: STATE-WP-0056-T04
status: done
priority: high
state_hub_task_id: "2cf47a12-e8aa-49ca-963c-1f0d2933c344"
```

Create a dashboard-specific read model that returns exactly the data needed by
the overview page in one bounded response.

Implementation notes:

- Add an endpoint such as `GET /state/overview` or
  `GET /state/dashboard-overview`.
- Include summary totals, recent progress needed by the page, blocking decision
  counts, waiting-task counts, SBOM snapshot totals, registration milestones,
  and workplan chart rows with repo/domain labels and task counts.
- Keep response fields stable and documented in dashboard reference docs.
- Reuse existing summary helpers where they are efficient, but avoid serializing
  large full-list payloads that the overview does not display directly.
- Add cache headers and a short in-process cache with explicit invalidation
  rules where appropriate.
- Update `dashboard/src/index.md` to prefer this endpoint and remove redundant
  overview-only fetches.

Done when the overview's steady-state refresh is one bounded API call plus only
truly interactive secondary calls.

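The "short in-process cache with explicit invalidation" note in T04 could look roughly like the sketch below. `TtlCache`, its method names, and the five-second TTL are hypothetical; the clock is injectable only so expiry can be exercised without sleeping. The class is not thread-safe as written, so a real implementation would need a lock or a per-worker instance.

```python
import time


class TtlCache:
    """Single-value cache with a time-to-live and explicit invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means "nothing cached"

    def get_or_build(self, build):
        """Return (value, cache_hit), calling build() only on a miss."""
        if self._expires_at is not None and self._clock() < self._expires_at:
            return self._value, True
        value = build()
        self._value = value
        self._expires_at = self._clock() + self._ttl
        return value, False

    def invalidate(self):
        """Drop the cached value, e.g. after a mutation that changes totals."""
        self._value = None
        self._expires_at = None


# Fake clock so the example can step time forward deterministically.
now = [0.0]
cache = TtlCache(ttl_seconds=5.0, clock=lambda: now[0])
builds = []


def build_overview():
    # Stand-in for assembling the bounded overview payload.
    builds.append(1)
    return {"summary": {"workplans": len(builds)}}


first, hit1 = cache.get_or_build(build_overview)   # miss: builds the payload
second, hit2 = cache.get_or_build(build_overview)  # hit: served from cache
cache.invalidate()
third, hit3 = cache.get_or_build(build_overview)   # miss after invalidation
now[0] = 10.0
fourth, hit4 = cache.get_or_build(build_overview)  # miss after TTL expiry
```

Returning the hit flag alongside the value makes it straightforward to report whether a cached result was used, as the T01 instrumentation asks.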
## T05 — Add Stale-While-Refresh for File-Backed Workplan Index

```task
id: STATE-WP-0056-T05
status: done
priority: medium
state_hub_task_id: "0c88c1a2-588b-41f8-bc1c-f94c8b4b0d1a"
```

Make `/workplans/index` resilient when repository filesystem scans are slow.

Implementation notes:

- Add singleflight behavior so concurrent requests share one in-progress
  rebuild instead of starting or waiting on redundant scans.
- Return stale cached data quickly while a background refresh runs when the
  cache is expired but still available.
- Include metadata such as `generated_at`, `stale`, `cache_age_seconds`, and
  optionally `refresh_in_progress`.
- Consider reading only frontmatter rather than whole markdown files if this
  can be done cleanly.
- Keep `refresh=true` as an explicit operator escape hatch.
- Add tests for cache hit, stale return, and forced refresh behavior.

Done when a slow filesystem scan cannot block normal dashboard refreshes for
longer than the frontend timeout if cached data exists.

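One possible shape for the singleflight and stale-while-refresh behavior in T05 is sketched below with a background thread. `StaleWhileRefreshCache` and `slow_scan` are hypothetical names, the threading model is an assumption (an async API would use tasks instead), and error handling for a failed rebuild is deliberately left out. The `force_refresh` flag plays the role of `refresh=true`.

```python
import threading
import time


class StaleWhileRefreshCache:
    """Share one rebuild among callers and serve stale data while it runs."""

    def __init__(self, rebuild, ttl_seconds):
        self._rebuild = rebuild        # expensive scan, e.g. parsing frontmatter
        self._ttl = ttl_seconds
        self._lock = threading.Lock()
        self._value = None
        self._built_at = None          # monotonic time of last successful rebuild
        self._generated_at = None      # wall-clock time reported to clients
        self._refresh_thread = None    # non-None while a rebuild is in flight

    def _run_rebuild(self):
        try:
            value = self._rebuild()    # runs outside the lock
            with self._lock:
                self._value = value
                self._built_at = time.monotonic()
                self._generated_at = time.time()
        finally:
            with self._lock:
                self._refresh_thread = None

    def _envelope(self, stale):
        # Caller must hold self._lock.
        age = None if self._built_at is None else time.monotonic() - self._built_at
        return {
            "data": self._value,
            "generated_at": self._generated_at,
            "stale": stale,
            "cache_age_seconds": age,
            "refresh_in_progress": self._refresh_thread is not None,
        }

    def get(self, force_refresh=False):
        with self._lock:
            has_value = self._built_at is not None
            fresh = has_value and (time.monotonic() - self._built_at) < self._ttl
            if fresh and not force_refresh:
                return self._envelope(stale=False)
            thread = self._refresh_thread
            if thread is None:         # singleflight: at most one rebuild at a time
                thread = threading.Thread(target=self._run_rebuild, daemon=True)
                self._refresh_thread = thread
                thread.start()
            if has_value and not force_refresh:
                return self._envelope(stale=True)  # do not wait for the scan
        thread.join()                  # cold cache or forced refresh: wait for the shared rebuild
        with self._lock:
            return self._envelope(stale=False)


calls = []


def slow_scan():
    # Stand-in for the repository workplan scan.
    calls.append(1)
    time.sleep(0.2)
    return {"workplans": len(calls)}


cache = StaleWhileRefreshCache(slow_scan, ttl_seconds=0.05)
first = cache.get()                     # cold cache: waits for the first scan
time.sleep(0.1)                         # let the TTL expire
second = cache.get()                    # returns stale data, starts a background scan
forced = cache.get(force_refresh=True)  # joins the in-flight scan instead of starting another
```

In this sketch a forced refresh that arrives while a rebuild is already running waits for that rebuild rather than queuing a second one; whether that is fresh enough for the operator escape hatch is a design decision for the real implementation.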
## T06 — Use Browser and HTTP Caching Selectively

```task
id: STATE-WP-0056-T06
status: done
priority: medium
state_hub_task_id: "811f02ff-2e92-4c82-8b8a-e3d39a450b02"
```

Let stable lookup requests benefit from cache headers instead of forcing every
dashboard request to bypass caches.

Implementation notes:

- Extend `apiFetch` so callers can choose cache mode.
- Keep `no-store` for volatile mutation-sensitive resources.
- Use default browser caching or `reload` only where route cache headers are
  already intentional, such as repo/topic lookup data.
- Review current route cache headers and align them with dashboard polling
  needs.
- Avoid stale cached data for controls that immediately follow a mutation.

Done when stable overview lookup data no longer bypasses useful cache headers
by default.

## T07 — Optimize `/state/summary` Cache Misses

```task
id: STATE-WP-0056-T07
status: done
priority: medium
state_hub_task_id: "633f4cc6-ffeb-4086-9858-d239f50a9686"
```

Reduce the cost of a cold or expired `/state/summary` request.

Implementation notes:

- Profile the current sequential query groups in `api/routers/state.py`.
- Move Python-side counts and scans into SQL where straightforward.
- Remove unused work from the summary path, such as dead intermediate query
  results.
- Cache derived sections independently when their freshness requirements differ.
- Add indexes only after profiling shows a query plan needs them.
- Keep summary response compatibility for existing consumers and MCP smoke
  tests.

Done when a summary cache miss stays comfortably below the frontend timeout
under the current local data volume.

## T08 — Verify Under Dashboard-Style Load

```task
id: STATE-WP-0056-T08
status: done
priority: high
state_hub_task_id: "353fb25a-5306-416b-8d6d-9b201e6fac87"
```

Prove the dashboard no longer produces frequent abort warnings under realistic
refresh behavior.

Implementation notes:

- Add or document a repeatable script that performs dashboard-style concurrent
  endpoint timing before and after the changes.
- Run API tests and dashboard component tests.
- Open the dashboard locally and verify that initial load, refresh, hidden-tab
  pause/resume, and partial API failure states behave correctly.
- Confirm payload sizes are lower than the baseline for the overview page.
- Update `dashboard/src/docs/overview.md` and `dashboard/src/docs/live-data.md`
  with the new data-loading model.

Done when repeated dashboard refreshes do not show the global aborted-operation
warning during normal local operation, and degraded sections recover cleanly.

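A repeatable timing script of the kind T08 describes might start from the sketch below, which issues the requests concurrently and labels each result as ok, timeout, HTTP error, or network error. The endpoint list is copied from T01 and may not match the exact overview batch; the base URL, function names, and result shape are assumptions. The fetch function is injectable so the example runs without a live API; calling `run_batch` without the `fetch` argument would use real HTTP requests.

```python
import socket
import time
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Endpoint list taken from T01; the real overview batch may differ.
OVERVIEW_PATHS = [
    "/state/summary", "/workplans/", "/tasks/", "/topics/",
    "/repos/", "/sbom/snapshots/", "/progress/", "/workplans/index",
]
TIMEOUT_SECONDS = 12.0  # mirrors the 12_000 ms frontend abort


def http_fetch(url, timeout):
    """Return (status, body size in bytes) for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, len(response.read())


def classify_error(exc):
    """Separate HTTP errors, timeouts, and other network failures."""
    if isinstance(exc, urllib.error.HTTPError):
        return "http_error"
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, socket.timeout):
        return "timeout"
    return "network_error"


def time_one(base_url, path, fetch, timeout):
    """Time a single request and describe its outcome."""
    start = time.perf_counter()
    status, size = None, 0
    try:
        status, size = fetch(base_url + path, timeout)
        outcome = "ok"
    except OSError as exc:  # HTTPError, URLError, and socket.timeout all derive from OSError
        outcome = classify_error(exc)
        status = getattr(exc, "code", None)  # only HTTPError carries a status code
    elapsed_ms = round((time.perf_counter() - start) * 1000.0, 1)
    return {"path": path, "outcome": outcome, "status": status,
            "bytes": size, "elapsed_ms": elapsed_ms}


def run_batch(base_url, fetch=http_fetch, timeout=TIMEOUT_SECONDS):
    """Issue every overview request at once, like the dashboard fan-out."""
    with ThreadPoolExecutor(max_workers=len(OVERVIEW_PATHS)) as pool:
        return list(pool.map(
            lambda path: time_one(base_url, path, fetch, timeout), OVERVIEW_PATHS))


def fake_fetch(url, timeout):
    # Simulated API: one endpoint times out, the rest return small bodies.
    if url.endswith("/repos/"):
        raise socket.timeout("simulated slow endpoint")
    return 200, 128


results = run_batch("http://localhost:8000", fetch=fake_fetch)
```

Saving the `results` list from a run before and after the changes would give the baseline comparison for elapsed time and payload size that the notes above call for.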