From b19896a9a9d9e947a208c659938d4b44bba11b42 Mon Sep 17 00:00:00 2001 From: tegwick Date: Fri, 27 Mar 2026 00:09:18 +0100 Subject: [PATCH] docs(dashboard): add technical reference page for Observable Framework dashboard Documents the dashboard's architecture, framework choice rationale, data-fetching strategies (static loaders + live polling), component library, page inventory, and key features including the Workstream Health Index and entity modals. Also registers the new page in the Reference nav and adds runbook section for node overload / runaway agent process (INC-002) with hardening checklist. Co-Authored-By: Claude Sonnet 4.6 --- ops/runbooks/gitea-coulombcore.md | 110 ++++++- state-hub/dashboard/observablehq.config.js | 1 + state-hub/dashboard/src/docs/dashboard.md | 338 +++++++++++++++++++++ 3 files changed, 448 insertions(+), 1 deletion(-) create mode 100644 state-hub/dashboard/src/docs/dashboard.md diff --git a/ops/runbooks/gitea-coulombcore.md b/ops/runbooks/gitea-coulombcore.md index 272dd04..f226d00 100644 --- a/ops/runbooks/gitea-coulombcore.md +++ b/ops/runbooks/gitea-coulombcore.md @@ -2,7 +2,7 @@ title: Runbook — Gitea on COULOMBCORE tags: [gitea, coulombcore, k3s, postgresql-ha] created: 2026-03-25 -updated: 2026-03-25 +updated: 2026-03-26 --- # Runbook: Gitea on COULOMBCORE @@ -143,6 +143,46 @@ When Gitea is down, work through this in order: --- +### 3. Node overload — runaway agent process (SSH dies, k3s unresponsive) + +**Symptom:** SSH connections time out during banner exchange. k3s API returns TLS handshake +timeout. `top` (via console) shows load average >100, 99.8% `sy` CPU, many running tasks, +kswapd0 at high CPU. State-hub reverse tunnel may still be alive (it was established +before the overload and requires no new connections). + +**Root cause:** A runaway process (typically a Claude Code agent spawning subprocesses) +exhausts the process/memory budget. With no swap, the kernel thrashes continuously. 
+ +**Triage (workstation):** +```bash +# Check if node is alive despite SSH being down +curl -s --max-time 5 http://127.0.0.1:8000/state/health # via reverse tunnel + +# k3s API — will timeout if node is thrashing +kubectl get nodes # expect TLS timeout +``` + +**Fix (requires console/VNC access):** +```bash +# 1. Identify runaway: look for high VIRT, many children, 99.8% sy in top +# Runaway claude agents: massive VIRT (>50GB), user tegwick + +# 2. Kill the offenders +kill -9 +kill -9 # apport in D-state amplifies load + +# 3. Wait ~60s for load to drop; SSH will start accepting connections +# 4. Check PostgreSQL HA pods — may need 2-3 min to resync after OOM restarts +kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha' +``` + +**Gitea does NOT need to be restarted** — it survives node overload. Once load drops +and PostgreSQL HA resyncs, Gitea serves requests again. + +**Prevention:** See "Robustness" section below. + +--- + ## Node Resource Budget (approximate) | Component | CPU Request | @@ -157,3 +197,71 @@ When Gitea is down, work through this in order: Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without reviewing resource requests first. + +--- + +## Robustness — Hardening Checklist + +These changes reduce blast radius from process/memory overload (INC-002, 2026-03-26): + +### 1. Add swap (not yet done — highest priority) + +```bash +fallocate -l 4G /swapfile +chmod 600 /swapfile +mkswap /swapfile +swapon /swapfile +echo '/swapfile none swap sw 0 0' >> /etc/fstab +``` + +Without swap, any memory spike causes immediate kernel thrash. 4GB swapfile = buffer time. + +### 2. Cap tegwick user nproc (not yet done) + +```bash +# /etc/security/limits.conf +tegwick hard nproc 512 +tegwick soft nproc 256 +``` + +Prevents a single agent from spawning 500+ processes. Claude Code agents survive fine +within 256 soft / 512 hard. + +### 3. 
Cap tegwick systemd user session memory (not yet done) + +```bash +# Create override for the tegwick user slice +mkdir -p /etc/systemd/system/user-$(id -u tegwick).slice.d/ +cat > /etc/systemd/system/user-$(id -u tegwick).slice.d/limits.conf </dev/null | awk '$4 ~ /CrashLoop|OOMKill|Error/ && $5+0 > 3 {print}' | grep . && curl -s -X POST -d "k3s pod unhealthy on COULOMBCORE" || true +``` + +Or via a state-hub progress event so it surfaces in the dashboard. Threshold: any pod +with restart count > 3 and status not Running/Completed warrants a notification. + +This single check covers the failure mode from INC-001 (PGPool crashlooping 13 days +undetected) without adding tunnel infrastructure that can't help under node overload. diff --git a/state-hub/dashboard/observablehq.config.js b/state-hub/dashboard/observablehq.config.js index d7fc3b2..9ffe7dd 100644 --- a/state-hub/dashboard/observablehq.config.js +++ b/state-hub/dashboard/observablehq.config.js @@ -72,6 +72,7 @@ export default { pages: [ { name: "Capabilities", path: "/docs/capabilities" }, { name: "Connecting to the Hub", path: "/docs/connecting" }, + { name: "Dashboard", path: "/docs/dashboard" }, { name: "Contributions", path: "/docs/contributions" }, { name: "Decision Health", path: "/docs/decisions-kpi" }, { name: "Decisions", path: "/docs/decisions" }, diff --git a/state-hub/dashboard/src/docs/dashboard.md b/state-hub/dashboard/src/docs/dashboard.md new file mode 100644 index 0000000..7b07b08 --- /dev/null +++ b/state-hub/dashboard/src/docs/dashboard.md @@ -0,0 +1,338 @@ +--- +title: Dashboard — Technical Reference +--- + +# State Hub Dashboard — Technical Reference + +The State Hub dashboard is the primary visual interface for the Custodian +ecosystem. It provides live, reactive views of all tracked domains, +workstreams, tasks, decisions, contributions, SBOM data, and agent activity — +all sourced from the local FastAPI state service. 
+
+---
+
+## Framework: Observable Framework
+
+The dashboard is built on **[Observable Framework](https://observablehq.com/framework/)**,
+an open-source static-site framework from Observable, Inc. designed specifically
+for data-driven pages.
+
+### Why Observable Framework?
+
+| Requirement | How Observable Framework satisfies it |
+|---|---|
+| **Local-first, no build-time cloud dependency** | Compiles to a static site (`npm run build`); the preview server and data loaders run entirely on localhost. |
+| **Live data without a separate frontend service** | Pages poll the FastAPI backend directly from the browser via `fetch`. No BFF, no GraphQL, no WebSockets required. |
+| **Reactive updates without React complexity** | Observable's cell-based execution model re-runs any code block whose inputs change. Async generators produce new values every poll cycle and trigger re-renders automatically. |
+| **No JS bundler configuration** | `.md` files containing fenced JS code blocks are the entire source. No webpack, no Vite config, no `tsconfig.json`. |
+| **Native data visualisation** | First-class integration with `@observablehq/plot` — a concise, grammar-of-graphics library — for all charts. |
+| **Sovereignty-compatible** | The built output is a folder of static HTML/JS/CSS. It can be served by any web server, archived, or opened directly from disk. |
+| **Offline-graceful** | Data loaders (Python scripts that run at build time) produce JSON snapshots. If the API is unreachable at build time, the loader emits an empty-structure JSON so the page still renders with a clear error state instead of crashing. |
+
+Observable Framework was chosen over alternatives (Grafana, Metabase, Streamlit,
+Next.js) because its design principles are uniquely aligned with the Custodian
+philosophy: **local-first**, **no vendor lock-in**, **sovereignty-preserving**,
+and **auditable** — the full data pipeline is visible in plain Markdown files.
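The "Offline-graceful" row above describes loaders that fall back to an empty-structure JSON when the API is unreachable. A minimal sketch of that behaviour follows. It is illustrative only and not the repository's actual loader: the `items`/`error` keys, the function names, and the `API_BASE` value are assumptions, and the real empty structure may differ per endpoint.

```python
#!/usr/bin/env python3
"""Hypothetical loader sketch; the file name and payload shape are illustrative."""
import json
import sys
import urllib.error
import urllib.request

# Assumption: the state service is reachable on localhost port 8000.
API_BASE = "http://127.0.0.1:8000"


def empty_snapshot(reason: str) -> dict:
    """Return an empty-structure payload so the page can show an error state."""
    return {"items": [], "error": reason}


def load_snapshot(url: str, timeout: float = 5.0):
    """Fetch JSON from the API; fall back to an empty structure on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # URLError/OSError cover connection problems and timeouts;
        # ValueError covers malformed URLs and invalid JSON bodies.
        return empty_snapshot(f"API unreachable: {exc}")


if __name__ == "__main__":
    # The build toolchain captures whatever the loader writes to stdout.
    json.dump(load_snapshot(f"{API_BASE}/state/summary"), sys.stdout)
```

Because every failure path returns a payload instead of raising, the loader always exits cleanly. The page can then check for the `error` key rather than handle a missing snapshot file.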
+
+---
+
+## Architecture
+
+```
+src/
+  observablehq.config.js — site metadata, page registry, theme, global head
+  components/ — shared JS modules
+  data/ — Python data loaders (run at build time)
+  docs/ — reference pages (this file lives here)
+  *.md — one page per feature area
+```
+
+### Data flow
+
+There are two complementary data-fetching strategies:
+
+**1. Static data loaders** (`src/data/*.json.py`)
+
+Python scripts executed by the Observable build toolchain at `npm run build`
+or `npm run dev`. Each script calls the FastAPI backend via `urllib`, serialises
+the response to JSON on stdout, and Observable Framework captures that output
+as a static snapshot file that the page imports with `FileAttachment(...)`.
+
+Current loaders:
+
+| File | API endpoint |
+|---|---|
+| `summary.json.py` | `/state/summary` |
+| `workstreams.json.py` | `/workstreams/` |
+| `contributions.json.py` | `/contributions/` |
+| `decisions.json.py` | `/decisions/` |
+| `domains.json.py` | `/domains/` |
+| `messages.json.py` | `/messages/` |
+| `progress.json.py` | `/progress/` |
+| `repos.json.py` | `/repos/` |
+| `sbom.json.py` | `/sbom/aggregated` |
+| `gitea-inventory.json.py` | Gitea instance inventory |
+
+**2. Live browser polling** (async generators in page `.md` files)
+
+All interactive pages bypass the static snapshots for live data by using
+Observable's async generator pattern directly in the browser:
+
+```js
+const summaryState = (async function*() {
+  while (true) {
+    const r = await fetch(`${API}/state/summary`);
+    yield { data: r.ok ? await r.json() : {error: `HTTP ${r.status}`}, ok: r.ok };
+    await new Promise(res => setTimeout(res, POLL));
+  }
+})();
+```
+
+`POLL` is set to **15 000 ms** (15 seconds) in `src/components/config.js`.
+Observable's reactivity engine detects each new yield value and re-runs all
+dependent code blocks, updating charts, tables, and KPI cards automatically.
+A `●` live indicator in the top-left corner of each page shows the connection
+status and the last-updated time.
+
+### Global configuration — `observablehq.config.js`
+
+| Setting | Value |
+|---|---|
+| Root directory | `src/` |
+| Site title | "Custodian State Hub" |
+| Theme | `["air", "near-midnight"]` — light body with dark sidebar |
+| Favicon | Inline SVG data URI (🗄️ emoji) |
+| Global head | KPI infobox styles, filter-bar styles, improvement-modal script |
+
+The `improvement-modal.js` component is injected at the config level rather
+than imported per-page because Observable proxies `src/*.js` through its own
+bundler, which prevents them from being loaded as raw `