state-hub/workplans/STATE-WP-0056-dashboard-loading-robustness.md


id: STATE-WP-0056
type: workplan
title: Dashboard Loading Robustness and Efficiency
domain: custodian
repo: state-hub
status: finished
owner: codex
topic_slug: custodian
created: 2026-06-05
updated: 2026-06-05
state_hub_workstream_id: 28f9569c-937b-4b79-b46c-f6b1f83c09c3

Dashboard Loading Robustness and Efficiency

Summary

Make the State Hub dashboard overview page faster and more resilient under normal polling. The current overview performs a broad concurrent fan-out of full-list API calls and treats most request failures as whole-page failures. As a result, a single call that crosses the frontend timeout can surface frequent "Dashboard data load failed: The operation was aborted." warnings, even if the API eventually returns successfully.

This work should reduce request count, payload size, and backend contention; preserve useful last-known data during partial failures; and give operators clearer diagnostics when a section is stale or unavailable.

Current Findings

Inspection on 2026-06-05 found:

  • dashboard/src/index.md loads overview data with one eight-request Promise.all batch.
  • dashboard/src/components/config.js aborts most apiFetch calls after 12_000 ms.
  • In a timing run that issued the overview requests concurrently, as the dashboard does, several calls took as long as or longer than the 12-second default timeout: /sbom/snapshots/, /repos/, and /workplans/index.
  • The same endpoints can be much faster when called alone, which points to contention and over-fetching rather than one permanently slow endpoint.
  • The overview calls /tasks/?limit=2000, but the tasks API currently ignores limit and returns every task. In the observed run that response was roughly 2.1 MB just to compute per-workplan task counts.
  • /state/summary has a short in-process cache, but a cache miss still runs a large amount of sequential database and Python-side aggregation work.
  • /workplans/index scans active repository workplan files and parses frontmatter. It is cached, but concurrent dashboard loads can still wait on the same expensive rebuild pattern.
  • Several API routes set cache headers, but the shared dashboard fetch helper forces cache: "no-store" for every request.

Out of Scope

  • Replacing Observable Framework.
  • Redesigning the dashboard information architecture.
  • Adding authentication, authorization, or multi-user session handling.
  • Changing workplan file conventions.
  • Moving State Hub to a different database or deployment substrate.

T01 — Add Focused Dashboard Load Instrumentation

id: STATE-WP-0056-T01
status: done
priority: high
state_hub_task_id: "e5208053-0db1-4842-a221-c5289422677a"

Add enough timing and error visibility to confirm which overview calls are slow, aborted, or oversized during normal use.

Implementation notes:

  • Add lightweight server-side timing logs or response headers for overview-hot endpoints: /state/summary, /workplans/, /tasks/, /topics/, /repos/, /sbom/snapshots/, /progress/, and /workplans/index.
  • Include request path, status, elapsed time, response size when practical, and whether a cached result was used.
  • Keep instrumentation local and low-noise; avoid logging full payloads or secrets.
  • Add a small dashboard diagnostic surface or console logging that distinguishes timeout aborts from HTTP errors and network failures.
  • Capture before/after timing notes in this workplan or a progress event.

Done when a normal dashboard refresh can be diagnosed without manually timing each endpoint from a shell.
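
The fields listed above (path, status, elapsed time, response size, cache use) could be captured with a small wrapper around each handler. The following is a minimal sketch, not the State Hub implementation: the Response class, the timed helper, and the logger name are hypothetical stand-ins for whatever the API framework provides.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical logger name; any low-noise local logger would do.
logger = logging.getLogger("overview.timing")


@dataclass
class Response:
    """Hypothetical stand-in for a framework response object."""
    status: int
    body: bytes
    cache_hit: bool = False


def timed(path: str, handler: Callable[[], Response]) -> Tuple[Response, dict]:
    """Run a handler and return its response plus a small timing record."""
    start = time.perf_counter()
    response = handler()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    record = {
        "path": path,
        "status": response.status,
        "elapsed_ms": round(elapsed_ms, 1),
        "bytes": len(response.body),  # size only, never the payload itself
        "cache_hit": response.cache_hit,
    }
    logger.info("overview_timing %s", record)
    return response, record
```

The same record could instead be emitted as response headers if log output proves too noisy.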

T02 — Make Overview Polling Partially Resilient

id: STATE-WP-0056-T02
status: done
priority: high
state_hub_task_id: "2cdd960d-ba86-48d1-a7c6-e83671cd0e69"

Change the overview data loader so one slow or failed secondary request does not mark the whole dashboard as failed.

Implementation notes:

  • Replace fail-fast Promise.all behavior in dashboard/src/index.md with a per-resource result model, for example Promise.allSettled.
  • Keep last-known-good data for each section while a refresh is degraded.
  • Treat optional resources such as SBOM snapshots, registration milestones, and workplan file metadata independently from core summary/workplan status data.
  • Display section-level stale/error indicators instead of one global warning whenever possible.
  • Keep exponential backoff for repeated failures, but do not discard usable data just because one request timed out.
  • Make abort errors user-readable, for example "timed out after 12s" instead of only "The operation was aborted."

Done when an SBOM, repo-list, or workplan-index timeout leaves the rest of the overview usable and visibly stale rather than failed.
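
The per-resource result model can be illustrated independently of the dashboard code. The dashboard itself is JavaScript and would use something like Promise.allSettled; the sketch below shows the same idea in Python with asyncio, only to make the intended behavior concrete. The names load_sections and last_good are hypothetical, and the 12-second default mirrors the frontend timeout noted in the findings.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

TIMEOUT_S = 12.0  # mirrors the 12_000 ms frontend timeout


async def load_sections(
    loaders: Dict[str, Callable[[], Awaitable[Any]]],
    last_good: Dict[str, Any],
    timeout_s: float = TIMEOUT_S,
) -> Dict[str, dict]:
    """Load every section independently; fall back to last-known-good data."""

    async def load_one(name: str, loader: Callable[[], Awaitable[Any]]):
        try:
            data = await asyncio.wait_for(loader(), timeout=timeout_s)
        except asyncio.TimeoutError:
            # Readable message instead of a bare "operation was aborted".
            error = f"timed out after {timeout_s:g}s"
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
        else:
            last_good[name] = data  # remember the fresh value for later refreshes
            return name, {"data": data, "stale": False, "error": None}
        # Degraded: keep whatever this section showed last time.
        return name, {"data": last_good.get(name), "stale": True, "error": error}

    pairs = await asyncio.gather(
        *(load_one(name, loader) for name, loader in loaders.items())
    )
    return dict(pairs)
```

Each section carries its own stale and error fields, so the page can render a section-level indicator rather than one global warning.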

T03 — Respect Pagination and Add Task Count Aggregates

id: STATE-WP-0056-T03
status: done
priority: high
state_hub_task_id: "78484226-9ccc-460c-a2b3-750b3204caa3"

Stop returning all tasks for overview count calculations.

Implementation notes:

  • Add limit and offset support to GET /tasks/, preserving existing filter behavior and sensible limits.
  • Add a lightweight aggregate endpoint for task counts by workplan and status, for example GET /tasks/counts?group_by=workstream,status, or add an overview-specific aggregate route.
  • Prefer SQL GROUP BY over transferring every task to the browser.
  • Update dashboard/src/index.md, dashboard/src/tasks.md, dashboard/src/interventions.md, and workplan detail pages as needed so list views still receive the rows they need.
  • Add tests for pagination compatibility and aggregate counts.

Done when the overview no longer fetches the full task table to draw the workplan chart.
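
A rough illustration of the two changes, bounded pagination and a SQL GROUP BY aggregate, using an in-memory SQLite table. The table layout, column names, and MAX_LIMIT value are assumptions made for the example and do not describe the real State Hub schema.

```python
import sqlite3

MAX_LIMIT = 500  # hypothetical upper bound on page size


def demo_connection() -> sqlite3.Connection:
    """Build a tiny in-memory tasks table (assumed schema, example data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tasks (id INTEGER PRIMARY KEY, workstream_id TEXT, status TEXT)"
    )
    conn.executemany(
        "INSERT INTO tasks VALUES (?, ?, ?)",
        [
            (1, "ws-a", "done"),
            (2, "ws-a", "done"),
            (3, "ws-a", "todo"),
            (4, "ws-b", "done"),
            (5, "ws-b", "todo"),
        ],
    )
    return conn


def list_tasks(conn: sqlite3.Connection, limit: int = 100, offset: int = 0):
    """Return one page of tasks, clamping limit and offset to sane bounds."""
    limit = max(1, min(limit, MAX_LIMIT))
    offset = max(0, offset)
    return conn.execute(
        "SELECT id, workstream_id, status FROM tasks ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()


def task_counts(conn: sqlite3.Connection):
    """Count tasks per workstream and status in SQL instead of in the browser."""
    rows = conn.execute(
        "SELECT workstream_id, status, COUNT(*) FROM tasks "
        "GROUP BY workstream_id, status ORDER BY workstream_id, status"
    ).fetchall()
    return [{"workstream_id": w, "status": s, "count": c} for w, s, c in rows]
```

With an aggregate like this, the overview receives one small row per workstream and status rather than one row per task.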

T04 — Build a Lightweight Overview Read Endpoint

id: STATE-WP-0056-T04
status: done
priority: high
state_hub_task_id: "2cf47a12-e8aa-49ca-963c-1f0d2933c344"

Create a dashboard-specific read model that returns exactly the data needed by the overview page in one bounded response.

Implementation notes:

  • Add an endpoint such as GET /state/overview or GET /state/dashboard-overview.
  • Include summary totals, recent progress needed by the page, blocking decision counts, waiting-task counts, SBOM snapshot totals, registration milestones, and workplan chart rows with repo/domain labels and task counts.
  • Keep response fields stable and documented in dashboard reference docs.
  • Reuse existing summary helpers where they are efficient, but avoid serializing large full-list payloads that the overview does not display directly.
  • Add cache headers and a short in-process cache with explicit invalidation rules where appropriate.
  • Update dashboard/src/index.md to prefer this endpoint and remove redundant overview-only fetches.

Done when the overview's steady-state refresh is one bounded API call plus only truly interactive secondary calls.

T05 — Add Stale-While-Refresh for File-Backed Workplan Index

id: STATE-WP-0056-T05
status: done
priority: medium
state_hub_task_id: "0c88c1a2-588b-41f8-bc1c-f94c8b4b0d1a"

Make /workplans/index resilient when repository filesystem scans are slow.

Implementation notes:

  • Add singleflight behavior so concurrent requests share one in-progress rebuild instead of starting or waiting on redundant scans.
  • Return stale cached data quickly while a background refresh runs when the cache is expired but still available.
  • Include metadata such as generated_at, stale, cache_age_seconds, and optionally refresh_in_progress.
  • Consider reading only frontmatter rather than whole markdown files if this can be done cleanly.
  • Keep refresh=true as an explicit operator escape hatch.
  • Add tests for cache hit, stale return, and forced refresh behavior.

Done when a slow filesystem scan cannot block normal dashboard refreshes for longer than the frontend timeout if cached data exists.
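
One way to combine singleflight with stale-while-refresh is sketched below using a background thread. This is an illustrative sketch under assumptions, not the shipped code: the class name, the rebuild callable, and the injectable clock are hypothetical, and the metadata keys follow the names suggested in the notes above.

```python
import threading
import time
from typing import Any, Callable, Optional


class StaleWhileRefreshCache:
    """Serve cached data immediately; rebuild at most once at a time."""

    def __init__(self, rebuild: Callable[[], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.time):
        self._rebuild = rebuild
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._value: Any = None
        self._generated_at: Optional[float] = None
        self._worker: Optional[threading.Thread] = None

    def _snapshot(self) -> dict:
        # Caller must hold self._lock.
        has_value = self._generated_at is not None
        age = self._clock() - self._generated_at if has_value else None
        return {
            "data": self._value,
            "generated_at": self._generated_at,
            "cache_age_seconds": age,
            "stale": (not has_value) or age >= self._ttl,
            "refresh_in_progress": self._worker is not None,
        }

    def _refresh(self) -> None:
        try:
            value = self._rebuild()  # the slow scan runs outside the lock
            with self._lock:
                self._value = value
                self._generated_at = self._clock()
        finally:
            with self._lock:
                self._worker = None  # allow a later refresh even after a failure

    def get(self, refresh: bool = False) -> dict:
        with self._lock:
            snapshot = self._snapshot()
            worker = self._worker
            if worker is None and (snapshot["stale"] or refresh):
                # Singleflight: only one rebuild thread exists at any time.
                worker = threading.Thread(target=self._refresh, daemon=True)
                self._worker = worker
                worker.start()
                snapshot["refresh_in_progress"] = True
            if snapshot["generated_at"] is not None and not refresh:
                return snapshot  # fresh hit, or stale data served right away
        # Cold cache or forced refresh: wait for the shared rebuild to finish.
        worker.join()
        with self._lock:
            return self._snapshot()
```

In this sketch a forced refresh that arrives while a rebuild is already running waits for that rebuild rather than starting a second scan. Whether refresh=true should instead queue a fresh scan is a design choice the sketch does not settle.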

T06 — Use Browser and HTTP Caching Selectively

id: STATE-WP-0056-T06
status: done
priority: medium
state_hub_task_id: "811f02ff-2e92-4c82-8b8a-e3d39a450b02"

Let stable lookup requests benefit from cache headers instead of forcing every dashboard request to bypass caches.

Implementation notes:

  • Extend apiFetch so callers can choose cache mode.
  • Keep no-store for volatile mutation-sensitive resources.
  • Use cache: "default" or cache: "reload" only where route cache headers are already intentional, such as repo/topic lookup data.
  • Review current route cache headers and align them with dashboard polling needs.
  • Avoid stale cached data for controls that immediately follow a mutation.

Done when stable overview lookup data no longer bypasses useful cache headers by default.

T07 — Optimize /state/summary Cache Misses

id: STATE-WP-0056-T07
status: done
priority: medium
state_hub_task_id: "633f4cc6-ffeb-4086-9858-d239f50a9686"

Reduce the cost of a cold or expired /state/summary request.

Implementation notes:

  • Profile the current sequential query groups in api/routers/state.py.
  • Move Python-side counts and scans into SQL where straightforward.
  • Remove unused work from the summary path, such as dead intermediate query results.
  • Cache derived sections independently when their freshness requirements differ.
  • Add indexes only after profiling shows a query plan needs them.
  • Keep summary response compatibility for existing consumers and MCP smoke tests.

Done when a summary cache miss stays comfortably below the frontend timeout under the current local data volume.

T08 — Verify Under Dashboard-Style Load

id: STATE-WP-0056-T08
status: done
priority: high
state_hub_task_id: "353fb25a-5306-416b-8d6d-9b201e6fac87"

Prove the dashboard no longer produces frequent abort warnings under realistic refresh behavior.

Implementation notes:

  • Add or document a repeatable script that performs dashboard-style concurrent endpoint timing before and after the changes.
  • Run API tests and dashboard component tests.
  • Open the dashboard locally and verify that initial load, refresh, hidden-tab pause/resume, and partial API failure states behave correctly.
  • Confirm payload sizes are lower than the baseline for the overview page.
  • Update dashboard/src/docs/overview.md and dashboard/src/docs/live-data.md with the new data-loading model.

Done when repeated dashboard refreshes do not show the global aborted-operation warning during normal local operation, and degraded sections recover cleanly.
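
A repeatable timing script along these lines could serve as the before/after check. It is a sketch only: the path list is taken from the overview-hot endpoints named in T01, while the function names and the injectable fetch parameter are hypothetical. The fetch parameter exists so the timing logic can be exercised without a running API.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

OVERVIEW_PATHS = [
    "/state/summary", "/workplans/", "/tasks/", "/topics/",
    "/repos/", "/sbom/snapshots/", "/progress/", "/workplans/index",
]
TIMEOUT_S = 12.0  # same budget as the frontend timeout


def http_fetch(url: str) -> bytes:
    """Fetch a URL and return the raw body."""
    with urllib.request.urlopen(url, timeout=TIMEOUT_S) as response:
        return response.read()


def time_one(base_url: str, path: str, fetch: Callable[[str], bytes]) -> dict:
    """Time a single request, recording size or a readable error."""
    start = time.perf_counter()
    try:
        size, error = len(fetch(base_url + path)), None
    except Exception as exc:
        size, error = 0, f"{type(exc).__name__}: {exc}"
    return {
        "path": path,
        "seconds": time.perf_counter() - start,
        "bytes": size,
        "error": error,
    }


def time_concurrently(
    base_url: str,
    paths: Sequence[str] = OVERVIEW_PATHS,
    fetch: Callable[[str], bytes] = http_fetch,
) -> List[dict]:
    """Issue all requests at once, as the overview page does, and time each."""
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        return list(pool.map(lambda p: time_one(base_url, p, fetch), paths))
```

Calling time_concurrently with the local API address (for example "http://localhost:8000", which is an assumption about the local setup) before and after the changes gives comparable per-endpoint timings and payload sizes.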