diff --git a/ops/runbooks/gitea-coulombcore.md b/ops/runbooks/gitea-coulombcore.md
index 272dd04..f226d00 100644
--- a/ops/runbooks/gitea-coulombcore.md
+++ b/ops/runbooks/gitea-coulombcore.md
@@ -2,7 +2,7 @@
 title: Runbook — Gitea on COULOMBCORE
 tags: [gitea, coulombcore, k3s, postgresql-ha]
 created: 2026-03-25
-updated: 2026-03-25
+updated: 2026-03-26
 ---
 
 # Runbook: Gitea on COULOMBCORE
@@ -143,6 +143,46 @@ When Gitea is down, work through this in order:
 
 ---
 
+### 3. Node overload — runaway agent process (SSH dies, k3s unresponsive)
+
+**Symptom:** SSH connections time out during banner exchange. k3s API returns TLS handshake
+timeout. `top` (via console) shows load average >100, 99.8% `sy` CPU, many running tasks,
+kswapd0 at high CPU. State-hub reverse tunnel may still be alive (it was established
+before the overload and requires no new connections).
+
+**Root cause:** A runaway process (typically a Claude Code agent spawning subprocesses)
+exhausts the process/memory budget. With no swap, the kernel thrashes continuously.
+
+**Triage (workstation):**
+```bash
+# Check if node is alive despite SSH being down
+curl -s --max-time 5 http://127.0.0.1:8000/state/health  # via reverse tunnel
+
+# k3s API — will timeout if node is thrashing
+kubectl get nodes  # expect TLS timeout
+```
+
+**Fix (requires console/VNC access):**
+```bash
+# 1. Identify runaway: look for high VIRT, many children, 99.8% sy in top
+#    Runaway claude agents: massive VIRT (>50GB), user tegwick
+
+# 2. Kill the offenders
+kill -9 <runaway-pid>
+kill -9 <apport-pid>  # apport in D-state amplifies load
+
+# 3. Wait ~60s for load to drop; SSH will start accepting connections
+# 4. Check PostgreSQL HA pods — may need 2-3 min to resync after OOM restarts
+kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
+```
+
+**Gitea does NOT need to be restarted** — it survives node overload. Once load drops
+and PostgreSQL HA resyncs, Gitea serves requests again.
+
+**Prevention:** See "Robustness" section below.
+
+---
+
 ## Node Resource Budget (approximate)
 
 | Component | CPU Request |
@@ -157,3 +197,71 @@ When Gitea is down, work through this in order:
 
 Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
 reviewing resource requests first.
+
+---
+
+## Robustness — Hardening Checklist
+
+These changes reduce blast radius from process/memory overload (INC-002, 2026-03-26):
+
+### 1. Add swap (not yet done — highest priority)
+
+```bash
+fallocate -l 4G /swapfile
+chmod 600 /swapfile
+mkswap /swapfile
+swapon /swapfile
+echo '/swapfile none swap sw 0 0' >> /etc/fstab
+```
+
+Without swap, any memory spike causes immediate kernel thrash. 4GB swapfile = buffer time.
+
+### 2. Cap tegwick user nproc (not yet done)
+
+```bash
+# /etc/security/limits.conf
+tegwick hard nproc 512
+tegwick soft nproc 256
+```
+
+Prevents a single agent from spawning 500+ processes. Claude Code agents survive fine
+within 256 soft / 512 hard.
+
+### 3. Cap tegwick systemd user session memory (not yet done)
+
+```bash
+# Create override for the tegwick user slice
+mkdir -p /etc/systemd/system/user-$(id -u tegwick).slice.d/
+cat > /etc/systemd/system/user-$(id -u tegwick).slice.d/limits.conf </dev/null | awk '$4 ~ /CrashLoop|OOMKill|Error/ && $5+0 > 3 {print}' | grep . && curl -s -X POST -d "k3s pod unhealthy on COULOMBCORE" || true
+```
+
+Or via a state-hub progress event so it surfaces in the dashboard. Threshold: any pod
+with restart count > 3 and status not Running/Completed warrants a notification.
+
+This single check covers the failure mode from INC-001 (PGPool crashlooping 13 days
+undetected) without adding tunnel infrastructure that can't help under node overload.
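The notification threshold stated at the end of the runbook diff (restart count > 3 and a status other than Running/Completed) could be sketched as a small filter over pod listings. This is only an illustration: the function name `unhealthy_pods` and the sample lines are hypothetical, and the column positions assume the default `kubectl get pods -A --no-headers` layout (NAMESPACE NAME READY STATUS RESTARTS AGE). It follows the prose threshold rather than the status regex used in the runbook's one-liner.

```bash
# Hypothetical helper (not part of the runbook): reads pod lines on stdin in the
# default `kubectl get pods -A --no-headers` column order and prints pods with
# more than 3 restarts whose STATUS is neither Running nor Completed.
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" && $5 + 0 > 3 { print $1 "/" $2, $4, $5 }'
}

# Illustrative sample input only; this is not real cluster output.
sample='gitea gitea-0 1/1 Running 0 2d
gitea pgpool-7d9f 0/1 CrashLoopBackOff 214 (2m ago) 13d
kube-system coredns-abc 1/1 Running 5 (1d ago) 30d
gitea postgresql-1 0/1 Error 2 1h'

# Only the crashlooping pod exceeds the threshold.
printf '%s\n' "$sample" | unhealthy_pods
```

Against a live cluster the same filter would presumably be fed with `kubectl get pods -A --no-headers | unhealthy_pods`, and any non-empty output would trigger the notification step.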
diff --git a/state-hub/dashboard/observablehq.config.js b/state-hub/dashboard/observablehq.config.js
index d7fc3b2..9ffe7dd 100644
--- a/state-hub/dashboard/observablehq.config.js
+++ b/state-hub/dashboard/observablehq.config.js
@@ -72,6 +72,7 @@ export default {
     pages: [
       { name: "Capabilities", path: "/docs/capabilities" },
       { name: "Connecting to the Hub", path: "/docs/connecting" },
+      { name: "Dashboard", path: "/docs/dashboard" },
       { name: "Contributions", path: "/docs/contributions" },
       { name: "Decision Health", path: "/docs/decisions-kpi" },
       { name: "Decisions", path: "/docs/decisions" },
diff --git a/state-hub/dashboard/src/docs/dashboard.md b/state-hub/dashboard/src/docs/dashboard.md
new file mode 100644
index 0000000..7b07b08
--- /dev/null
+++ b/state-hub/dashboard/src/docs/dashboard.md
@@ -0,0 +1,338 @@
+---
+title: Dashboard — Technical Reference
+---
+
+# State Hub Dashboard — Technical Reference
+
+The State Hub dashboard is the primary visual interface for the Custodian
+ecosystem. It provides live, reactive views of all tracked domains,
+workstreams, tasks, decisions, contributions, SBOM data, and agent activity —
+all sourced from the local FastAPI state service.
+
+---
+
+## Framework: Observable Framework
+
+The dashboard is built on **[Observable Framework](https://observablehq.com/framework/)**,
+an open-source static-site framework from Observable, Inc. designed specifically
+for data-driven pages.
+
+### Why Observable Framework?
+
+| Requirement | How Observable Framework satisfies it |
+|---|---|
+| **Local-first, no build-time cloud dependency** | Compiles to a static site (`npm run build`); the preview server and data loaders run entirely on localhost. |
+| **Live data without a separate frontend service** | Pages poll the FastAPI backend directly from the browser via `fetch`. No BFF, no GraphQL, no WebSockets required. |
+| **Reactive updates without React complexity** | Observable's cell-based execution model re-runs any code block whose inputs change. Async generators produce new values every poll cycle and trigger re-renders automatically. |
+| **No JS bundler configuration** | `.md` files containing fenced JS code blocks are the entire source. No webpack, no Vite config, no `tsconfig.json`. |
+| **Native data visualisation** | First-class integration with `@observablehq/plot` — a concise, grammar-of-graphics library — for all charts. |
+| **Sovereignty-compatible** | The built output is a folder of static HTML/JS/CSS. It can be served by any web server, archived, or opened directly from disk. |
+| **Offline-graceful** | Data loaders (Python scripts that run at build time) produce JSON snapshots. If the API is unreachable at build time, the loader emits an empty-structure JSON so the page still renders with a clear error state instead of crashing. |
+
+Observable Framework was chosen over alternatives (Grafana, Metabase, Streamlit,
+Next.js) because its design principles are uniquely aligned with the Custodian
+philosophy: **local-first**, **no vendor lock-in**, **sovereignty-preserving**,
+and **auditable** — the full data pipeline is visible in plain Markdown files.
+
+---
+
+## Architecture
+
+```
+src/
+  observablehq.config.js — site metadata, page registry, theme, global head
+  components/ — shared JS modules
+  data/ — Python data loaders (run at build time)
+  docs/ — reference pages (this file lives here)
+  *.md — one page per feature area
+```
+
+### Data flow
+
+There are two complementary data-fetching strategies:
+
+**1. Static data loaders** (`src/data/*.json.py`)
+
+Python scripts executed by the Observable build toolchain at `npm run build`
+or `npm run dev`. Each script calls the FastAPI backend via `urllib`, serialises
+the response to JSON on stdout, and Observable Framework captures that output
+as a static snapshot file that the page imports with `FileAttachment(...)`.
+
+Current loaders:
+
+| File | API endpoint |
+|---|---|
+| `summary.json.py` | `/state/summary` |
+| `workstreams.json.py` | `/workstreams/` |
+| `contributions.json.py` | `/contributions/` |
+| `decisions.json.py` | `/decisions/` |
+| `domains.json.py` | `/domains/` |
+| `messages.json.py` | `/messages/` |
+| `progress.json.py` | `/progress/` |
+| `repos.json.py` | `/repos/` |
+| `sbom.json.py` | `/sbom/aggregated` |
+| `gitea-inventory.json.py` | Gitea instance inventory |
+
+**2. Live browser polling** (async generators in page `.md` files)
+
+All interactive pages bypass the static snapshots for live data by using
+Observable's async generator pattern directly in the browser:
+
+```js
+const summaryState = (async function*() {
+  while (true) {
+    const r = await fetch(`${API}/state/summary`);
+    yield { data: r.ok ? await r.json() : {error: `HTTP ${r.status}`}, ok: r.ok };
+    await new Promise(res => setTimeout(res, POLL));
+  }
+})();
+```
+
+`POLL` is set to **15 000 ms** (15 seconds) in `src/components/config.js`.
+Observable's reactivity engine detects each new yield value and re-runs all
+dependent code blocks, updating charts, tables, and KPI cards automatically.
+A `●` live indicator in the top-left corner of each page shows the connection
+status and the last-updated time.
+
+### Global configuration — `observablehq.config.js`
+
+| Setting | Value |
+|---|---|
+| Root directory | `src/` |
+| Site title | "Custodian State Hub" |
+| Theme | `["air", "near-midnight"]` — light body with dark sidebar |
+| Favicon | Inline SVG data URI (🗄️ emoji) |
+| Global head | KPI infobox styles, filter-bar styles, improvement-modal script |
+
+The `improvement-modal.js` component is injected at the config level rather
+than imported per-page because Observable proxies `src/*.js` through its own
+bundler, which prevents them from being loaded as raw `