docs(dashboard): add technical reference page for Observable Framework dashboard

Documents the dashboard's architecture, framework choice rationale, data-fetching strategies (static loaders + live polling), component library, page inventory, and key features including the Workstream Health Index and entity modals.

Also registers the new page in the Reference nav and adds a runbook section for node overload / runaway agent process (INC-002) with a hardening checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@@ -2,7 +2,7 @@
title: Runbook — Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
-updated: 2026-03-25
+updated: 2026-03-26
---

# Runbook: Gitea on COULOMBCORE
@@ -143,6 +143,46 @@ When Gitea is down, work through this in order:

---

### 3. Node overload — runaway agent process (SSH dies, k3s unresponsive)

**Symptom:** SSH connections time out during banner exchange. k3s API returns TLS handshake
timeout. `top` (via console) shows load average >100, 99.8% `sy` CPU, many running tasks,
kswapd0 at high CPU. State-hub reverse tunnel may still be alive (it was established
before the overload and requires no new connections).

**Root cause:** A runaway process (typically a Claude Code agent spawning subprocesses)
exhausts the process/memory budget. With no swap, the kernel thrashes continuously.

**Triage (workstation):**
```bash
# Check if node is alive despite SSH being down
curl -s --max-time 5 http://127.0.0.1:8000/state/health   # via reverse tunnel

# k3s API — will time out if node is thrashing
kubectl get nodes   # expect TLS timeout
```

**Fix (requires console/VNC access):**
```bash
# 1. Identify runaway: look for high VIRT, many children, 99.8% sy in top
#    Runaway claude agents: massive VIRT (>50GB), user tegwick

# 2. Kill the offenders
kill -9 <runaway-pid>
# apport in D-state amplifies load; SIGKILL only takes effect once it leaves D-state
kill -9 <apport-pid-if-in-D-state>

# 3. Wait ~60s for load to drop; SSH will start accepting connections
# 4. Check PostgreSQL HA pods — may need 2-3 min to resync after OOM restarts
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
```

**Gitea does NOT need to be restarted** — it survives node overload. Once load drops
and PostgreSQL HA resyncs, Gitea serves requests again.

**Prevention:** See "Robustness" section below.

---

## Node Resource Budget (approximate)

| Component | CPU Request |
@@ -157,3 +197,71 @@ When Gitea is down, work through this in order:

Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.

---

## Robustness — Hardening Checklist

These changes reduce blast radius from process/memory overload (INC-002, 2026-03-26):

### 1. Add swap (not yet done — highest priority)

```bash
# Run as root
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```

Without swap, any memory spike pushes the kernel straight into thrashing. A 4 GB swapfile
buys time to react.

### 2. Cap tegwick user nproc (not yet done)

```bash
# /etc/security/limits.conf
tegwick hard nproc 512
tegwick soft nproc 256
```

Stops a single agent from spawning more than 512 processes. Claude Code agents run fine
within the 256 soft / 512 hard limits.

### 3. Cap tegwick systemd user session memory (not yet done)

```bash
# Create override for the tegwick user slice
mkdir -p /etc/systemd/system/user-$(id -u tegwick).slice.d/
cat > /etc/systemd/system/user-$(id -u tegwick).slice.d/limits.conf <<EOF
[Slice]
MemoryMax=1500M
MemorySwapMax=512M
EOF
systemctl daemon-reload
```

Prevents a rogue user process from consuming all 3.9 GB of node memory.

### 4. Always-on agent guardrails (process hygiene)

- **Never run `/ralph-loop` directly on COULOMBCORE** — use `/ralph-workplan` which
  self-terminates when the workplan is complete (HEUREKA stop condition).
- Set `--max-iterations` explicitly on any Ralph invocation.
- Avoid large parallel agent fans (e.g., spawning 20 sub-agents simultaneously) on
  this resource-constrained node.

### 5. Add cluster health alerting (not yet done)

A per-service tunnel adds passive visibility but no alerting. A single cron job covering the
whole cluster is more useful — it catches Gitea, PGPool, and any other crashlooping pod.

```bash
# /etc/cron.d/k3s-pod-health (on COULOMBCORE, run as tegwick)
*/5 * * * * tegwick kubectl get pods -A 2>/dev/null | awk '$4 ~ /CrashLoop|OOMKill|Error/ && $5+0 > 3 {print}' | grep . && curl -s -X POST <notify-webhook> -d "k3s pod unhealthy on COULOMBCORE" || true
```

Alternatively, emit a state-hub progress event so the alert surfaces in the dashboard.
Threshold: any pod with restart count > 3 and status not Running/Completed warrants a
notification.

This single check covers the failure mode from INC-001 (PGPool crashlooping 13 days
undetected) without adding tunnel infrastructure that can't help under node overload.

@@ -72,6 +72,7 @@ export default {
pages: [
  { name: "Capabilities", path: "/docs/capabilities" },
  { name: "Connecting to the Hub", path: "/docs/connecting" },
  { name: "Dashboard", path: "/docs/dashboard" },
  { name: "Contributions", path: "/docs/contributions" },
  { name: "Decision Health", path: "/docs/decisions-kpi" },
  { name: "Decisions", path: "/docs/decisions" },

338
state-hub/dashboard/src/docs/dashboard.md
Normal file
@@ -0,0 +1,338 @@
---
title: Dashboard — Technical Reference
---

# State Hub Dashboard — Technical Reference

The State Hub dashboard is the primary visual interface for the Custodian
ecosystem. It provides live, reactive views of all tracked domains,
workstreams, tasks, decisions, contributions, SBOM data, and agent activity —
all sourced from the local FastAPI state service.

---

## Framework: Observable Framework

The dashboard is built on **[Observable Framework](https://observablehq.com/framework/)**,
an open-source static-site framework from Observable, Inc. designed specifically
for data-driven pages.

### Why Observable Framework?

| Requirement | How Observable Framework satisfies it |
|---|---|
| **Local-first, no build-time cloud dependency** | Compiles to a static site (`npm run build`); the preview server and data loaders run entirely on localhost. |
| **Live data without a separate frontend service** | Pages poll the FastAPI backend directly from the browser via `fetch`. No BFF, no GraphQL, no WebSockets required. |
| **Reactive updates without React complexity** | Observable's cell-based execution model re-runs any code block whose inputs change. Async generators produce new values every poll cycle and trigger re-renders automatically. |
| **No JS bundler configuration** | `.md` files containing fenced JS code blocks are the entire source. No webpack, no Vite config, no `tsconfig.json`. |
| **Native data visualisation** | First-class integration with `@observablehq/plot` — a concise, grammar-of-graphics library — for all charts. |
| **Sovereignty-compatible** | The built output is a folder of static HTML/JS/CSS. It can be served by any web server, archived, or opened directly from disk. |
| **Offline-graceful** | Data loaders (Python scripts that run at build time) produce JSON snapshots. If the API is unreachable at build time, the loader emits an empty-structure JSON so the page still renders with a clear error state instead of crashing. |

Observable Framework was chosen over alternatives (Grafana, Metabase, Streamlit,
Next.js) because its design principles are uniquely aligned with the Custodian
philosophy: **local-first**, **no vendor lock-in**, **sovereignty-preserving**,
and **auditable** — the full data pipeline is visible in plain Markdown files.

---

## Architecture

```
src/
observablehq.config.js — site metadata, page registry, theme, global head
components/ — shared JS modules
data/ — Python data loaders (run at build time)
docs/ — reference pages (this file lives here)
*.md — one page per feature area
```

### Data flow

There are two complementary data-fetching strategies:

**1. Static data loaders** (`src/data/*.json.py`)

Python scripts executed by the Observable build toolchain during `npm run build`
or `npm run dev`. Each script calls the FastAPI backend via `urllib`, serialises
the response to JSON on stdout, and Observable Framework captures that output
as a static snapshot file that the page loads with `FileAttachment(...)`.
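
The loader contract described above (call the API, print JSON to stdout, fall back to an empty structure on failure) can be sketched as follows. This is an illustration in JavaScript only; the real loaders are Python scripts using `urllib`. The function name `loadSnapshot`, the `error` key, and the shape of the empty structure are assumptions made for the example.

```js
// Hypothetical sketch of the loader contract; not the project's actual loader code.
const API = "http://127.0.0.1:8000"; // FastAPI base URL, as in components/config.js

// Fetch one endpoint and return the JSON text a loader would print to stdout.
// `fetchFn` is injectable so the fallback path can be exercised without a server.
async function loadSnapshot(path, emptyStructure, fetchFn = fetch) {
  try {
    const res = await fetchFn(`${API}${path}`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return JSON.stringify(await res.json());
  } catch (err) {
    // API unreachable or returned an error: emit the empty structure plus an
    // (assumed) `error` field so the page can render an explicit error state.
    return JSON.stringify({ ...emptyStructure, error: String(err.message ?? err) });
  }
}
```

A loader built this way would end with something like `loadSnapshot("/state/summary", {}).then((json) => process.stdout.write(json))`, since the framework captures whatever the loader writes to stdout.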

Current loaders:

| File | API endpoint |
|---|---|
| `summary.json.py` | `/state/summary` |
| `workstreams.json.py` | `/workstreams/` |
| `contributions.json.py` | `/contributions/` |
| `decisions.json.py` | `/decisions/` |
| `domains.json.py` | `/domains/` |
| `messages.json.py` | `/messages/` |
| `progress.json.py` | `/progress/` |
| `repos.json.py` | `/repos/` |
| `sbom.json.py` | `/sbom/aggregated` |
| `gitea-inventory.json.py` | Gitea instance inventory |

**2. Live browser polling** (async generators in page `.md` files)

All interactive pages bypass the static snapshots for live data by using
Observable's async generator pattern directly in the browser:

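```js
import {API, POLL} from "./components/config.js"; // path as seen from a top-level page
const summaryState = (async function*() {
  while (true) {
    try {
      const r = await fetch(`${API}/state/summary`);
      yield { data: r.ok ? await r.json() : {error: `HTTP ${r.status}`}, ok: r.ok };
    } catch (err) {
      // Network failure (API unreachable): report it instead of ending the loop
      yield { data: {error: String(err)}, ok: false };
    }
    await new Promise(res => setTimeout(res, POLL));
  }
})();
```
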
`POLL` is set to **15 000 ms** (15 seconds) in `src/components/config.js`.
Observable's reactivity engine detects each new yield value and re-runs all
dependent code blocks, updating charts, tables, and KPI cards automatically.
A `●` live indicator in the top-left corner of each page shows the connection
status and the last-updated time.

### Global configuration — `observablehq.config.js`

| Setting | Value |
|---|---|
| Root directory | `src/` |
| Site title | "Custodian State Hub" |
| Theme | `["air", "near-midnight"]`: `air` in light mode, `near-midnight` in dark mode, following the reader's colour-scheme preference |
| Favicon | Inline SVG data URI (🗄️ emoji) |
| Global head | KPI infobox styles, filter-bar styles, improvement-modal script |

The `improvement-modal.js` component is injected at the config level rather
than imported per-page because Observable proxies `src/*.js` through its own
bundler, which prevents them from being loaded as raw `<script>` tags in
`<head>`. The config reads the file at build time, strips ES module export
keywords, and injects the result as a plain inline `<script>`.
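
The export-stripping step can be approximated with a few regular expressions. The sketch below is a minimal illustration, not the code in `observablehq.config.js`: the function name `stripExports` is hypothetical, and it only covers the common declaration forms.

```js
// Hypothetical helper: remove ES module `export` keywords so the source can be
// embedded as a classic (non-module) inline script.
// Handles `export function/const/let/var/class`, `export async function`,
// and `export default`; `export { a, b }` lists are dropped entirely.
function stripExports(source) {
  return source
    .replace(/^\s*export\s*\{[^}]*\};?\s*$/gm, "")
    .replace(/^(\s*)export\s+default\s+/gm, "$1")
    .replace(/^(\s*)export\s+(?=(async\s+)?function|const|let|var|class)/gm, "$1");
}
```

The stripped source could then be read with `readFileSync` in the config and wrapped in a `<script>` tag inside the `head` option. A regex approach like this is fragile for unusual syntax (for example re-exports), so it suits a single, known component file rather than arbitrary modules.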

---

## Page Inventory

The dashboard has 30+ pages organised in four navigation groups:

### Top-level pages

| Page | Route | Purpose |
|---|---|---|
| Overview | `/` | Cross-domain summary — workstream chart, status KPIs, blocking decisions, recent activity |
| Capabilities | `/capability-requests` | Capability request routing and fulfilment status |
| Contributions | `/contributions` | Upstream contribution Kanban (bug reports, feature requests, upstream PRs) |
| Domains | `/domains` | Per-domain health overview and management |
| Goals | `/goals` | Domain goals and repo-scoped goals |
| Inbox | `/inbox` | Agent message inbox and inter-repo communication |
| Progress | `/progress` | Session progress event log |
| Services (TPSC) | `/tpsc` | Third-party services catalog with GDPR maturity status |
| Todo | `/todo` | Consolidated todo list across all repos |
| Tools & Apps | `/tools` | Registered tools and applications |

### Repositories section

| Page | Route | Purpose |
|---|---|---|
| Repositories | `/repos` | All registered repos with DoI compliance tier |
| Debt | `/techdept` | Technical debt registry |
| Repo Sync | `/repo-sync` | Consistency checker results and sync status |
| SBOM | `/sbom` | Software bill of materials — packages, licences, copyleft risk |

### Workstreams section

| Page | Route | Purpose |
|---|---|---|
| Workstreams | `/workstreams` | All workstreams with Workstream Health Index |
| Decisions | `/decisions` | Decision log with resolve-in-place form |
| Dependencies | `/dependencies` | Dependency graph explorer |
| Extensions | `/extensions` | Extension point registry |
| Interventions | `/interventions` | Tasks flagged for human intervention |
| Tasks | `/tasks` | Task list with filters and status tracking |
| UI Feedback | `/ui-feedback` | UI improvement feedback and issue tracking |

### Reference section

22 reference pages covering every feature, data model, and integration in detail.

---

## Component Library

All shared components live in `src/components/` and are imported as ES modules:

### `config.js`
Exports two constants used by every live-polling page:
- `API = "http://127.0.0.1:8000"` — the FastAPI base URL
- `POLL = 15_000` — polling interval in milliseconds

### `entity-modal.js`
A lightweight detail overlay for entities. Any table row or card can call
`openEntityModal(entity, type)` to open a full-detail panel without navigating
away from the page. Supports four entity types: `workstream`, `task`, `ep`
(extension point), and `td` (technical debt).

Also exports `buildEntityTable()` — a function that constructs a consistent,
clickable HTML table for any list of entities, with proportional column widths,
overflow ellipsis, and native tooltip-on-hover for truncated values.

### `toc-sidebar.js`
Provides `injectTocTop(id, element)` — injects a DOM element into the
Observable Framework table-of-contents sidebar above the page's first section
heading. Used on the Overview and Workstreams pages to embed live KPI infoboxes
directly in the sidebar.

### `doc-overlay.js`
Provides `withDocHelp(element, docPath)` — attaches a small `?` icon to any
element that opens the linked reference page in a lightweight overlay panel
without leaving the current page.

### `help-tip.js`
A custom HTML element (`<help-tip>`) that renders an inline abbreviated label
with an expandable tooltip containing a longer description and a link to the
relevant reference page. Used in the Workstream Health Index card to annotate
each metric abbreviation.

### `multiselect.js`
A multi-value dropdown filter input compatible with Observable's `Inputs.form()`
reactive pattern. Used on the Workstreams and Tasks pages for domain and status
filtering.

### `improvement-modal.js`
A floating feedback button that opens a modal form for submitting UI improvement
suggestions. Injected globally via `observablehq.config.js` so it is available
on every page.

### `action-confirm.js`
A confirmation-dialog helper for destructive or irreversible actions triggered
from the dashboard.

---

## Key Features

### Live polling with connection status

Every interactive page runs one or more async generator loops that poll the
FastAPI backend every 15 seconds. A `●` indicator in the top-left corner
shows green when the API is reachable and red with a restart command when it
is not. This allows the dashboard to be used as a persistent, always-on monitor
without requiring a page refresh.

### Workstream Health Index (WHI)

The Workstreams page computes a **Workstream Health Index** — a single
composite score (0–100%) derived from five graph metrics:

| Metric | Abbrev. | Weight | Interpretation |
|---|---|---|---|
| Dependency Density | DD | 30% | Average deps per open workstream; high = tightly coupled |
| Blocked Ratio | BR | 25% | Share of workstreams in a blocked state |
| Single-Point Risk | SPR | 15% | Share of workstreams that others depend on but are not yet complete |
| Parallel Execution Potential | PEP | 20% | Share of workstreams that could start/continue immediately |
| Cross-Domain Dependency Ratio | CDDR | 10% | Share of edges crossing domain boundaries |

A **Cycle Presence Indicator** (CPI), computed with a depth-first search, halves
the total score when a dependency cycle is found, since cyclic dependencies cause
deadlock. The index is computed per-domain as well as globally and displayed in the
TOC sidebar as a persistent KPI card.
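
A rough sketch of how the weighted score and the cycle penalty might combine is shown below. It is not the dashboard's implementation: the function names are hypothetical, the edge shape `[from, to]` is assumed, and each metric is assumed to arrive already normalised to a 0..1 "health" value (1 = healthiest), because the mapping from raw metric to score is not specified here.

```js
// Weights taken from the WHI metric table (they sum to 1.0).
const WHI_WEIGHTS = { DD: 0.30, BR: 0.25, SPR: 0.15, PEP: 0.20, CDDR: 0.10 };

// Detect a dependency cycle with a recursive DFS (visiting / done marking).
// `edges` is a list of [from, to] pairs meaning "from depends on to" (assumed shape).
function hasCycle(edges) {
  const adj = new Map();
  for (const [from, to] of edges) {
    if (!adj.has(from)) adj.set(from, []);
    adj.get(from).push(to);
  }
  const state = new Map(); // missing = unvisited, 1 = on the DFS stack, 2 = finished
  const visit = (node) => {
    if (state.get(node) === 1) return true;  // back edge: cycle found
    if (state.get(node) === 2) return false; // already explored, no cycle through here
    state.set(node, 1);
    for (const next of adj.get(node) ?? []) {
      if (visit(next)) return true;
    }
    state.set(node, 2);
    return false;
  };
  return [...adj.keys()].some((node) => visit(node));
}

// `health` maps each abbreviation to a 0..1 value where 1 is healthiest (assumption).
function workstreamHealthIndex(health, edges) {
  let score = 0;
  for (const [key, weight] of Object.entries(WHI_WEIGHTS)) {
    score += weight * health[key];
  }
  if (hasCycle(edges)) score /= 2; // CPI: a detected cycle halves the total
  return Math.round(score * 100);  // percentage in the range 0..100
}
```

For example, `workstreamHealthIndex({DD: 1, BR: 1, SPR: 1, PEP: 1, CDDR: 1}, [["a", "b"], ["b", "a"]])` yields 50, because every metric is perfect but the two workstreams depend on each other.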

### Multi-mode workstream chart

The Overview page renders a horizontal stacked bar chart using `@observablehq/plot`
showing task counts (done / in progress / blocked / todo) per workstream.
A `<select>` dropdown switches between:

- **Status modes**: active, accepted, finished, blocked, stalled, oldies
- **Time modes**: last 1h, 24h, 7d, 30d, today, this week, this month

Domains are sorted by most recent workstream activity (most active domain at
the top). Title labels and done/total counters are overlaid directly on the bars.
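
The domain ordering and the rolling time modes could be implemented along these lines. This is a sketch under assumptions: the field names `domain` and `updated_at` (an ISO 8601 string) are guesses at the workstream payload, and both function names are hypothetical.

```js
// Order domain names by their most recently updated workstream, newest first.
// Field names `domain` and `updated_at` are assumptions about the API payload.
function domainsByRecentActivity(workstreams) {
  const latest = new Map();
  for (const ws of workstreams) {
    const t = Date.parse(ws.updated_at);
    if (!latest.has(ws.domain) || t > latest.get(ws.domain)) latest.set(ws.domain, t);
  }
  return [...latest.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([domain]) => domain);
}

// Keep workstreams touched within a rolling window, e.g. the "last 24h" mode.
// `now` is injectable so the filter is deterministic in tests.
function withinLast(workstreams, hours, now = Date.now()) {
  const cutoff = now - hours * 3_600_000;
  return workstreams.filter((ws) => Date.parse(ws.updated_at) >= cutoff);
}
```

The calendar-based modes (today, this week, this month) would need a start-of-period cutoff instead of a rolling one; they are left out of the sketch.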

### Resolve-in-place for blocking decisions

Blocking decisions on the Overview page render with an expandable form
(`<details>` element). The user can enter a rationale and click "Record & close"
to call `POST /decisions/{id}/resolve` without leaving the page. The decision
list refreshes after a successful resolve; other decisions remain unchanged and
retain any in-progress text the user was typing.
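
The resolve call itself is a plain `fetch` POST. The sketch below assumes the request body is `{rationale}`; the source names only the endpoint, so the body shape, the response format, and the function name `resolveDecision` are illustrative guesses.

```js
// POST /decisions/{id}/resolve with a rationale.
// The JSON body shape ({ rationale }) is an assumption; check the API schema.
// `fetchFn` is injectable so the call can be tested without a running API.
async function resolveDecision(api, id, rationale, fetchFn = fetch) {
  const res = await fetchFn(`${api}/decisions/${encodeURIComponent(id)}/resolve`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ rationale }),
  });
  if (!res.ok) throw new Error(`resolve failed: HTTP ${res.status}`);
  return res.json();
}
```

Throwing on a non-2xx response lets the form keep the typed rationale and show the error, and only refresh the decision list when the promise resolves.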

### SBOM and licence-risk tracking

The Overview page shows three SBOM/contribution health KPI cards. The SBOM
page renders a horizontal bar chart of package counts by licence, with
highlighted cards for any detected copyleft licences (GPL, AGPL, LGPL, etc.)
in direct production dependencies.
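
One way to derive both the per-licence counts and the copyleft highlights from a package list is sketched below. The package shape (`licence`, `direct`, `scope`) and the function name are assumptions, and the pattern covers only the three licence families named above, not the full "etc.".

```js
// Licence ids treated as copyleft: prefix match on SPDX-style ids such as
// "GPL-3.0", "AGPL-3.0", "LGPL-2.1". Deliberately limited to these families.
const COPYLEFT = /^(A?GPL|LGPL)/i;

// `packages` is assumed to look like [{ name, licence, direct, scope }, ...].
function licenceSummary(packages) {
  const counts = new Map();
  for (const p of packages) {
    counts.set(p.licence, (counts.get(p.licence) ?? 0) + 1);
  }
  // Only direct production dependencies are flagged.
  const copyleft = packages.filter(
    (p) => p.direct && p.scope === "production" && COPYLEFT.test(p.licence)
  );
  // Sorted descending so the result can feed a horizontal bar chart directly.
  const byLicence = [...counts]
    .map(([licence, count]) => ({ licence, count }))
    .sort((a, b) => b.count - a.count);
  return { byLicence, copyleft };
}
```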

### Dependency graph

The Dependencies page and the Workstreams page both surface inter-workstream
dependency data. Each workstream card shows the workstreams it depends on
(`↳ depends on`) and the workstreams it blocks (`⊳ blocks`), derived from
the `WorkstreamDependency` table.
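
Both lists on a card can be derived from the same set of dependency rows by indexing each row in two directions. In the sketch below the row shape `{workstream_id, depends_on_id}` is an assumption about the `WorkstreamDependency` table rather than its confirmed schema, and `dependencyIndex` is a hypothetical name.

```js
// Build per-workstream "depends on" and "blocks" lists from dependency rows.
// Row shape { workstream_id, depends_on_id } is assumed, not confirmed.
function dependencyIndex(rows) {
  const index = new Map();
  const entry = (id) => {
    if (!index.has(id)) index.set(id, { dependsOn: [], blocks: [] });
    return index.get(id);
  };
  for (const { workstream_id, depends_on_id } of rows) {
    entry(workstream_id).dependsOn.push(depends_on_id); // shown as "depends on"
    entry(depends_on_id).blocks.push(workstream_id);    // shown as "blocks"
  }
  return index;
}
```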

### Entity modals

Any table row on any list page (workstreams, tasks, extension points, tech debt)
can be clicked to open a detail modal with full field data, dependency lists,
task progress, and timestamps — without a page navigation or a separate detail
route.

### Graceful offline state

All async generator polls wrap API calls in `try/catch`. When the API is
unreachable, pages display an error banner and a `make api` restart command
rather than crashing or showing stale cached data without warning. Static
data loaders emit an empty-structure JSON fallback so build-time failures do
not block the dashboard from loading.

---

## Visualisation library: `@observablehq/plot`

All charts use **[@observablehq/plot](https://observablehq.com/plot/)** —
Observable's concise, composable grammar-of-graphics library. It is imported
on demand per page:

```js
import * as Plot from "npm:@observablehq/plot";
```

Observable Framework resolves `npm:` specifiers at build time (no `npm install`
needed in the source directory). Typical mark types used across the dashboard:

| Mark | Used for |
|---|---|
| `Plot.barX` | Horizontal stacked task-count bars, SBOM licence distribution |
| `Plot.text` | Workstream title labels and done/total counters overlaid on bars |
| `Plot.ruleX([0])` | Zero-axis rule on all bar charts |

Charts are rendered as inline SVG and inherit Observable Framework's theme
CSS variables, so they adapt correctly to both light (`air`) and dark
(`near-midnight`) themes.

---

## Running the dashboard

```bash
cd ~/the-custodian/state-hub/dashboard

npm run dev     # Preview server on :3000 with hot reload
npm run build   # Static build into dist/
```

The API must be running (`make api` in `state-hub/`) for the data loaders and
live polling to work. If the API is not running, the dashboard loads with empty
data and shows the offline error state on each page.

---

## Related

- [State Hub — Reference](/docs/state-hub) — overall architecture and design principles
- [Live Data](/docs/live-data) — polling mechanism and offline behaviour in detail
- [Connecting to the Hub](/docs/connecting) — MCP server registration
- [Overview](/docs/overview) — Overview page feature walkthrough
- [Workstreams](/docs/workstreams) — Workstreams page and WHI in depth