docs(dashboard): add technical reference page for Observable Framework dashboard

Documents the dashboard's architecture, framework choice rationale, data-fetching
strategies (static loaders + live polling), component library, page inventory,
and key features including the Workstream Health Index and entity modals.
Also registers the new page in the Reference nav and adds a runbook section for
node overload / runaway agent process (INC-002) with a hardening checklist.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-27 00:09:18 +01:00
parent 6018df03cf
commit b19896a9a9
3 changed files with 448 additions and 1 deletions


@@ -2,7 +2,7 @@
title: Runbook — Gitea on COULOMBCORE
tags: [gitea, coulombcore, k3s, postgresql-ha]
created: 2026-03-25
updated: 2026-03-25
updated: 2026-03-26
---
# Runbook: Gitea on COULOMBCORE
@@ -143,6 +143,46 @@ When Gitea is down, work through this in order:
---
### 3. Node overload — runaway agent process (SSH dies, k3s unresponsive)
**Symptom:** SSH connections time out during banner exchange. k3s API returns TLS handshake
timeout. `top` (via console) shows load average >100, 99.8% `sy` CPU, many running tasks,
kswapd0 at high CPU. State-hub reverse tunnel may still be alive (it was established
before the overload and requires no new connections).
**Root cause:** A runaway process (typically a Claude Code agent spawning subprocesses)
exhausts the process/memory budget. With no swap, the kernel thrashes continuously.
**Triage (workstation):**
```bash
# Check if node is alive despite SSH being down
curl -s --max-time 5 http://127.0.0.1:8000/state/health # via reverse tunnel
# k3s API — will timeout if node is thrashing
kubectl get nodes # expect TLS timeout
```
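The two triage checks can be folded into a single verdict. The following is a minimal sketch, not part of the runbook: the function name `classify_node_state` and the verdict strings are hypothetical, and an "overloaded" result is only consistent with the symptom described above, not proof of it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the exit codes of the two triage checks to a verdict.
#   $1 = exit code of the state-hub health curl (via reverse tunnel)
#   $2 = exit code of `kubectl get nodes`
classify_node_state() {
  local health_rc=$1 kubectl_rc=$2
  if [ "$health_rc" -eq 0 ] && [ "$kubectl_rc" -eq 0 ]; then
    echo "healthy"
  elif [ "$health_rc" -eq 0 ]; then
    echo "overloaded"    # tunnel answers, API does not: go to console/VNC
  elif [ "$kubectl_rc" -eq 0 ]; then
    echo "tunnel-down"   # API answers, tunnel does not: not this failure mode
  else
    echo "unreachable"   # neither answers: node may be off or the network is out
  fi
}

# Usage (on the workstation):
#   curl -s --max-time 5 http://127.0.0.1:8000/state/health >/dev/null; h=$?
#   kubectl get nodes --request-timeout=10s >/dev/null 2>&1; k=$?
#   classify_node_state "$h" "$k"
```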
**Fix (requires console/VNC access):**
```bash
# 1. Identify runaway: look for high VIRT, many children, 99.8% sy in top
# Runaway claude agents: massive VIRT (>50GB), user tegwick
# 2. Kill the offenders
kill -9 <runaway-pid>
kill -9 <apport-pid-if-in-D-state> # apport in D-state amplifies load; SIGKILL only takes effect once it leaves D-state
# 3. Wait ~60s for load to drop; SSH will start accepting connections
# 4. Check PostgreSQL HA pods — may need 2-3 min to resync after OOM restarts
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
```
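When `top` is too sluggish to read, step 1 can be approximated with a `ps` filter. This is a sketch under assumptions: the function name `find_runaways` is hypothetical, and the threshold is expressed in GiB (`ps` reports `vsz` in KiB), which is slightly stricter than the ">50GB" rule of thumb above.

```bash
#!/usr/bin/env bash
# Hypothetical filter: print processes of one user whose virtual size exceeds a threshold.
# Input on stdin: lines of "PID VSZ_KiB USER COMMAND",
# which is what `ps -eo pid=,vsz=,user=,comm=` prints.
#   $1 = user name, $2 = threshold in GiB (default 50)
find_runaways() {
  local user=$1 min_gib=${2:-50}
  awk -v u="$user" -v min_kib=$((min_gib * 1024 * 1024)) \
    '$3 == u && $2 + 0 > min_kib { printf "%s %dGiB %s\n", $1, int($2 / 1048576), $4 }'
}

# Usage on the console:
#   ps -eo pid=,vsz=,user=,comm= | find_runaways tegwick 50
```

The output lists candidate PIDs only; confirm each one before running `kill -9`.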
**Gitea does NOT need to be restarted** — it survives node overload. Once load drops
and PostgreSQL HA resyncs, Gitea serves requests again.
**Prevention:** See "Robustness" section below.
---
## Node Resource Budget (approximate)
| Component | CPU Request |
@@ -157,3 +197,71 @@ When Gitea is down, work through this in order:
Node capacity: ~2000m. Headroom is tight (~325m). Avoid adding workloads without
reviewing resource requests first.
---
## Robustness — Hardening Checklist
These changes reduce blast radius from process/memory overload (INC-002, 2026-03-26):
### 1. Add swap (not yet done — highest priority)
```bash
fallocate -l 4G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab
```
Without swap, any memory spike causes immediate kernel thrash. A 4 GB swapfile buys time to intervene before the node becomes unresponsive.
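Once the swapfile is enabled, it is worth confirming that the kernel actually sees it. A minimal sketch, assuming a hypothetical helper name `swap_total_kib`; it sums the Size column of `/proc/swaps`, and the file argument exists only so the function can be exercised against a sample file.

```bash
#!/usr/bin/env bash
# Hypothetical check: sum the Size column (KiB) of a /proc/swaps-format file.
# Line 1 is the header (Filename Type Size Used Priority) and is skipped.
#   $1 = file to read (defaults to /proc/swaps)
swap_total_kib() {
  awk 'NR > 1 { total += $3 } END { print total + 0 }' "${1:-/proc/swaps}"
}

# Usage after `swapon /swapfile`:
#   if [ "$(swap_total_kib)" -gt 0 ]; then echo "swap active"; else echo "NO SWAP"; fi
```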
### 2. Cap tegwick user nproc (not yet done)
```bash
# /etc/security/limits.conf
tegwick hard nproc 512
tegwick soft nproc 256
```
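Caps the tegwick user at 512 processes in total (on Linux, threads count toward
`nproc`), so a single agent cannot keep spawning subprocesses without bound. Claude
Code agents survive fine within 256 soft / 512 hard.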
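To confirm what was configured, the entries can be read back from the file. A sketch only: the helper name `nproc_limits` is hypothetical, and it reads a single file, so entries under `/etc/security/limits.d/` would need a separate pass.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "<type> <value>" for each nproc entry of a user
# in a limits.conf-format file (columns: domain type item value).
#   $1 = user name, $2 = file (defaults to /etc/security/limits.conf)
nproc_limits() {
  local user=$1 file=${2:-/etc/security/limits.conf}
  awk -v u="$user" '$1 !~ /^#/ && $1 == u && $3 == "nproc" { print $2, $4 }' "$file"
}

# Usage:
#   nproc_limits tegwick
#   ps -L -u tegwick --no-headers | wc -l   # current task count for comparison
```

The limits are applied by PAM at session start, so they take effect for new logins only; `ulimit -Su` and `ulimit -Hu` in a fresh tegwick session show the effective values.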
### 3. Cap tegwick systemd user session memory (not yet done)
```bash
# Create override for the tegwick user slice
mkdir -p /etc/systemd/system/user-$(id -u tegwick).slice.d/
cat > /etc/systemd/system/user-$(id -u tegwick).slice.d/limits.conf <<EOF
[Slice]
MemoryMax=1500M
MemorySwapMax=512M
EOF
systemctl daemon-reload
```
Prevents a rogue user process from consuming all 3.9 GB of memory on the node.
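`systemctl show` reports memory limits in bytes, so a small conversion helps when checking that the override was picked up. A minimal sketch with a hypothetical helper name `to_bytes`; it handles only the K/M/G suffixes, which systemd interprets as base 1024.

```bash
#!/usr/bin/env bash
# Hypothetical helper: convert a systemd-style size (K/M/G suffix, base 1024) to bytes.
# Values without a suffix are passed through unchanged.
to_bytes() {
  local v=$1
  case $v in
    *K) echo $(( ${v%K} * 1024 )) ;;
    *M) echo $(( ${v%M} * 1024 * 1024 )) ;;
    *G) echo $(( ${v%G} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$v" ;;
  esac
}

# Usage on the node, after `systemctl daemon-reload`:
#   systemctl show -p MemoryMax "user-$(id -u tegwick).slice"
#   to_bytes 1500M    # compare with the value reported above
```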
### 4. Always-on agent guardrails (process hygiene)
- **Never run `/ralph-loop` directly on COULOMBCORE** — use `/ralph-workplan` which
self-terminates when the workplan is complete (HEUREKA stop condition).
- Set `--max-iterations` explicitly on any Ralph invocation.
- Avoid large parallel agent fans (e.g., spawning 20 sub-agents simultaneously) on
this resource-constrained node.
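As an extra backstop alongside `--max-iterations`, a long-running command can be given a wall-clock cap. This is a sketch of one possible approach, not something the runbook prescribes: the wrapper name `run_bounded` and the 10-second grace period are assumptions, and it relies on the coreutils `timeout` command.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command under a wall-clock cap using coreutils `timeout`.
#   $1 = limit in seconds; remaining arguments = the command to run
# Sends SIGTERM at the limit, then SIGKILL 10 s later if the command is still alive.
# Exit status is 124 when the limit was hit and the command died on SIGTERM.
run_bounded() {
  local limit=$1
  shift
  timeout --kill-after=10 "$limit" "$@"
}

# Usage:
#   run_bounded 3600 <agent-command>
```

A time cap does not replace the `nproc` and memory limits above; it only bounds how long a misbehaving run can last.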
### 5. Add cluster health alerting (not yet done)
A per-service tunnel adds passive visibility but no alerting. A single cron covering the
whole cluster is more useful — it catches Gitea, PGPool, and any other crashlooping pod.
```bash
# /etc/cron.d/k3s-pod-health (on CoulombCore, run as tegwick)
*/5 * * * * tegwick kubectl get pods -A 2>/dev/null | awk '$4 ~ /CrashLoop|OOMKill|Error/ && $5+0 > 3 {print}' | grep . && curl -s -X POST <notify-webhook> -d "k3s pod unhealthy on COULOMBCORE" || true
```
Alternatively, post a state-hub progress event so the alert surfaces in the dashboard.
Threshold: any pod with a restart count > 3 and a status other than Running/Completed
warrants a notification.
This single check covers the failure mode from INC-001 (PGPool crashlooping 13 days
undetected) without adding tunnel infrastructure that can't help under node overload.
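The filter in the cron line is easier to test and tune as a standalone function. The sketch below mirrors the same `awk` condition; the function name `unhealthy_pods` and the output format are hypothetical, and the column positions assume the default `kubectl get pods -A --no-headers` layout (NAMESPACE NAME READY STATUS RESTARTS AGE).

```bash
#!/usr/bin/env bash
# Hypothetical filter mirroring the awk expression in the cron line.
# Input on stdin: output of `kubectl get pods -A --no-headers`.
#   $1 = restart threshold (default 3); pods must exceed it to be reported
unhealthy_pods() {
  local min_restarts=${1:-3}
  awk -v min="$min_restarts" \
    '$4 ~ /CrashLoop|OOMKill|Error/ && $5 + 0 > min { print $1 "/" $2, $4, $5 }'
}

# Usage:
#   kubectl get pods -A --no-headers 2>/dev/null | unhealthy_pods 3
```

An empty result means nothing crossed the threshold, so the notification step can be gated on non-empty output, as the `grep .` in the cron line does.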