ADR and Runbook artefacts

2026-03-27 00:16:09 +01:00
parent b19896a9a9
commit 29b84de13c
2 changed files with 262 additions and 0 deletions

@@ -0,0 +1,119 @@
---
title: "INC-002: COULOMBCORE node overload — runaway Claude Code agent"
date: 2026-03-26
severity: high
status: resolved
affected: gitea (http://92.205.130.254:32166), k3s API, SSH access
environment: COULOMBCORE k3s cluster
duration: ~15 minutes (detected ~21:00, SSH restored ~21:08 UTC)
resolved_by: Bernd Worsch / Claude
---
# INC-002: COULOMBCORE node overload — runaway Claude Code agent
## Summary
The COULOMBCORE node (92.205.130.254) became completely unresponsive under extreme load
generated by a runaway Claude Code agent process. Load average peaked at **417.43** (1m).
99.8% of CPU time was spent in kernel mode (context switching). SSH connections timed out
during banner exchange. k3s API was unreachable (TLS handshake timeout). Gitea was
technically still running as a process but unable to serve requests. The node had no swap,
so memory exhaustion amplified the impact — kswapd0 was consuming ~22% CPU trying to
reclaim pages with nowhere to put them.
---
## Timeline
| Time (approx UTC) | Event |
|------|-------|
| ~20:45 | Runaway Claude Code agent (PID 2457456, user tegwick) begins spawning hundreds of subprocesses |
| ~21:00 | Load average passes 300; SSH banner exchange starts timing out |
| ~21:00 | User attempts git operations; git repo service unreachable |
| ~21:00 | Remote diagnosis begins via ops-bridge (state-hub reverse tunnel still alive) |
| ~21:00 | k3s API confirmed unresponsive (TLS handshake timeout via local tunnel :16443) |
| ~21:05 | User obtains console/VNC access via hosting provider |
| ~21:05 | `top` output shared: load 417, 530 tasks (104 running), 34 zombies, 99.8% sy |
| ~21:06 | `kill -9 2457456` + `kill -9 2579133` (stuck apport) executed via console |
| ~21:08 | SSH accepting connections again (load 85, still declining) |
| ~21:09 | kubectl connectivity restored; PostgreSQL HA nodes resyncing |
| ~21:10 | Gitea accessible; incident resolved |
---
## Root Cause
A Claude Code agent running on COULOMBCORE under the `tegwick` user (PID 2457456,
VIRT 71.1GB) spawned approximately 500 child processes. The likely cause is an unbounded
Ralph loop or parallel agent task expansion without a completion condition or iteration cap.
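
To make "completion condition or iteration cap" concrete, here is a minimal sketch of a loop with two hard stops: a maximum iteration count and a per-step timeout. It is a generic illustration, not a description of how `/ralph-loop` or `/ralph-workplan` work internally; `bounded_loop` and `run_agent_step` are hypothetical names, and the convention that the step exits 0 when the work is complete is an assumption.

```bash
#!/usr/bin/env bash
# Run a step repeatedly until it reports completion, with two hard stops:
# an iteration cap and a per-step wall-clock timeout.
# Assumed convention: the step command exits 0 once the work is complete.
bounded_loop() {
    local max_iterations=$1 step_timeout=$2
    shift 2
    local i
    for ((i = 1; i <= max_iterations; i++)); do
        # timeout(1) from coreutils ends the step after step_timeout seconds
        if timeout "$step_timeout" "$@"; then
            echo "completed after $i iteration(s)"
            return 0
        fi
    done
    echo "gave up after $max_iterations iteration(s)" >&2
    return 1
}

# Usage (run_agent_step is a hypothetical external command):
#   bounded_loop 20 300 run_agent_step
```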
With no swap configured on a 3.9GB machine, the kernel could not page out anonymous memory
and could only reclaim file-backed pages. kswapd0 ran at ~22% CPU continuously. systemd was
at ~17% CPU processing unit state changes for the constant process churn. The combination of
~500 tasks competing for 2 vCPUs with no memory headroom caused the runaway context switching
that buried the node.
**The ops-bridge reverse tunnels survived** because they were established before the
overload began and require no new SSH connections to stay alive. This was the only
out-of-band visibility channel available once SSH stopped accepting new connections.
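
For readers unfamiliar with the pattern, a reverse tunnel of this kind could be set up roughly as follows. This is a sketch under stated assumptions, not the actual ops-bridge configuration: the bridge host `tunnel@bridge.example.net` is a placeholder, and pairing port 16443 with the default k3s API port 6443 is assumed from the timeline. The keepalive values are illustrative; tighter values would drop a slow but still-alive tunnel sooner, which matters on a thrashing node.

```bash
#!/usr/bin/env bash
# Placeholder: the real bridge host is not given in this report.
BRIDGE_HOST="${BRIDGE_HOST:-tunnel@bridge.example.net}"

# Print the ssh arguments for one reverse forward, one argument per line.
# -R remote_port:localhost:local_port exposes a node-local port on the bridge.
tunnel_args() {
    local remote_port=$1 local_port=$2
    printf '%s\n' \
        -N \
        -o ServerAliveInterval=30 \
        -o ServerAliveCountMax=3 \
        -o ExitOnForwardFailure=yes \
        -R "${remote_port}:localhost:${local_port}" \
        "$BRIDGE_HOST"
}

# Keep the tunnel up: reconnect after a pause whenever ssh exits.
run_tunnel() {
    local args
    mapfile -t args < <(tunnel_args "$@")
    while true; do
        ssh "${args[@]}"
        sleep 10
    done
}

# Usage, run on the node so the connection is outbound (ports are assumed):
#   run_tunnel 16443 6443
```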
---
## Impact
- Gitea and git operations unavailable for ~15 minutes
- SSH access to COULOMBCORE unavailable (required console)
- k3s API unresponsive (no pod management possible)
- PostgreSQL HA nodes 0 and 2 restarted under load (recovered on their own)
- No data loss
---
## Resolution Steps
```bash
# Via console/VNC — SSH was not available
# 1. Identify the runaway process (top showed PID 2457456 at 71GB VIRT, 6.8% CPU)
# Indicator: massive VIRT, hundreds of children, 99.8% sy in top
# 2. Kill the runaway agent and stuck crash reporter
kill -9 2457456 # runaway claude process
kill -9 2579133 # apport in D-state, consuming CPU
# 3. Wait ~60s — load drops, SSH accepts connections
# 4. Verify PostgreSQL HA recovery (may take 2-3 min to resync)
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
# 5. Verify Gitea
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/
```
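
Step 1 relied on reading `top` by eye. As a complementary sketch, the "hundreds of children" indicator can be checked directly by counting direct children per parent PID from `/proc`. The function name `top_parents` is hypothetical. The scan uses only bash builtins, on the assumption that forking helpers is unreliable while the node is starved; only the final sort forks.

```bash
#!/usr/bin/env bash
# Count direct children per parent PID using bash builtins and /proc.
# Prints "<child count> <parent PID>" per line, largest count first.
top_parents() {
    local limit=${1:-5} status key value ppid
    local -A children=()
    for status in /proc/[0-9]*/status; do
        # A process can exit between the glob and the read; skip it.
        [[ -r $status ]] || continue
        while read -r key value; do
            if [[ $key == PPid: ]]; then
                children[$value]=$(( ${children[$value]:-0} + 1 ))
                break
            fi
        done 2>/dev/null < "$status"
    done
    # The final pipeline is the only part that forks.
    for ppid in "${!children[@]}"; do
        printf '%d %d\n' "${children[$ppid]}" "$ppid"
    done | sort -rn | head -n "$limit"
}

# Usage: top_parents 10
# Then inspect a suspicious parent with: ps -o pid,user,vsz,comm -p <PID>
```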
---
## Follow-up Actions
- [ ] Add swap to COULOMBCORE: `fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab`
- [ ] Set nproc ulimit for tegwick user: add `tegwick hard nproc 512` to `/etc/security/limits.conf`
- [ ] Set memory limit on tegwick systemd user session: `systemctl --user set-property "" MemoryMax=1G` or a dedicated slice
- [ ] Add cluster-wide pod health alerting (cron on COULOMBCORE) that catches any crashlooping pod, not just Gitea; see runbook "Robustness §5"
- [ ] Ensure all Ralph loops on remote agents use `/ralph-workplan` (bounded, HEUREKA stop), never raw `/ralph-loop`
- [ ] Consider adding a `bridge check` cron on workstation that alerts when node load > threshold via state-hub API
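
The load-threshold check in the last item could look roughly like the sketch below. It reads the load average locally from `/proc/loadavg`; the state-hub API is not described in this report, so delivery is reduced to an `echo`, and the function names and the default of 4 per CPU are assumptions. A check like this only helps if the threshold is far below the levels seen in this incident, since the node could barely run anything at load 417.

```bash
#!/usr/bin/env bash
# Exit 0 when load exceeds threshold_per_cpu x CPU count, else exit 1.
load_exceeds() {
    local load=$1 cpus=$2 threshold_per_cpu=${3:-4}
    # awk does the floating-point comparison; exit status 0 means "exceeded"
    awk -v l="$load" -v c="$cpus" -v t="$threshold_per_cpu" \
        'BEGIN { exit !(l > c * t) }'
}

# Read the 1-minute load average and print an alert line if it is too high.
check_node_load() {
    local load rest
    read -r load rest < /proc/loadavg   # first field is the 1-minute average
    if load_exceeds "$load" "$(nproc)"; then
        echo "ALERT: load $load on $(hostname)"
    fi
}

# Usage from cron, assuming this file is saved as /usr/local/bin/load-check.sh:
#   * * * * * bash -c 'source /usr/local/bin/load-check.sh; check_node_load'
```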
---
## Lessons Learned
1. **No swap = amplified blast radius.** A machine with no swap has zero buffer between
"memory pressure" and "complete kernel thrash". A 4GB swapfile costs only disk space and buys
significant time for intervention.
2. **Reverse tunnels are the last line of visibility.** SSH and the k3s API both died.
The state-hub reverse tunnel (established from COULOMBCORE outbound) survived and
confirmed the node was alive. This was critical for triage.
3. **Remote agents need hard resource ceilings.** A Claude Code agent that spawns
subprocesses has no built-in rate limit. nproc + systemd MemoryMax on the user session
are the right controls for this environment.
4. **Console access is non-negotiable.** Once SSH dies the only recovery path is OOB
console. Ensure hosting provider console credentials are always accessible.
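
As one way to apply the ceilings from lesson 3, the sketch below sets limits on the user's systemd slice. It deliberately uses `TasksMax` on `user-<UID>.slice` as a cgroup-level complement to the `nproc` ulimit named in the follow-up list, not as a replacement for it. The function names are hypothetical, `MemoryMax` assumes cgroup v2, and the call assumes the user has an active session so the slice is loaded.

```bash
#!/usr/bin/env bash
# Name of the systemd slice that holds all sessions of a numeric UID.
user_slice() {
    printf 'user-%s.slice\n' "$1"
}

# Apply persistent ceilings to everything the user runs (needs root).
# TasksMax caps processes plus threads in the cgroup; MemoryMax caps memory.
apply_user_ceilings() {
    local user=$1 tasks_max=$2 memory_max=$3 uid
    uid=$(id -u "$user") || return 1
    systemctl set-property "$(user_slice "$uid")" \
        "TasksMax=${tasks_max}" "MemoryMax=${memory_max}"
}

# Usage, with the values from the follow-up list:
#   apply_user_ceilings tegwick 512 1G
```

With roughly 500 child processes observed in this incident, a ceiling of 512 would only just have intervened; the right value depends on how many processes the agent legitimately needs.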