| title | date | severity | status | affected | environment | duration | resolved_by |
|---|---|---|---|---|---|---|---|
| INC-002: COULOMBCORE node overload — runaway Claude Code agent | 2026-03-26 | high | resolved | gitea (http://92.205.130.254:32166), k3s API, SSH access | COULOMBCORE k3s cluster | ~15 minutes (detected ~21:00, SSH restored ~21:08 UTC) | Bernd Worsch / Claude |
INC-002: COULOMBCORE node overload — runaway Claude Code agent
Summary
The COULOMBCORE node (92.205.130.254) became completely unresponsive under extreme load generated by a runaway Claude Code agent process. Load average peaked at 417.43 (1m). 99.8% of CPU time was spent in kernel mode (context switching). SSH connections timed out during banner exchange. k3s API was unreachable (TLS handshake timeout). Gitea was technically still running as a process but unable to serve requests. The node had no swap, so memory exhaustion amplified the impact — kswapd0 was consuming ~22% CPU trying to reclaim pages with nowhere to put them.
Timeline
| Time (approx UTC) | Event |
|---|---|
| ~20:45 | Runaway claude agent (PID 2457456, user tegwick) spawning hundreds of subprocesses |
| ~21:00 | Load average passes 300; SSH banner exchange starts timing out |
| ~21:00 | User attempts git operations; git repo service unreachable |
| ~21:00 | Remote diagnosis begins via ops-bridge (state-hub reverse tunnel still alive) |
| ~21:00 | k3s API confirmed unresponsive (TLS handshake timeout via local tunnel :16443) |
| ~21:05 | User obtains console/VNC access via hosting provider |
| ~21:05 | top output shared: load 417, 530 tasks (104 running), 34 zombies, 99.8% sy |
| ~21:06 | kill -9 2457456 + kill -9 2579133 (stuck apport) executed via console |
| ~21:08 | SSH accepting connections again (load 85, still declining) |
| ~21:09 | kubectl connectivity restored; PostgreSQL HA nodes resyncing |
| ~21:10 | Gitea accessible; incident resolved |
Root Cause
A Claude Code agent running on COULOMBCORE under the tegwick user (PID 2457456, VIRT 71.1GB) spawned approximately 500 child processes. The likely cause is an unbounded Ralph loop or parallel agent task expansion without a completion condition or iteration cap.
With no swap configured on a 3.9GB machine, the kernel had nowhere to page out anonymous memory. kswapd0 ran at ~22% CPU continuously. systemd was at ~17% CPU processing unit state changes for the constant process churn. The combination of ~500 tasks competing for 2 vCPUs with no memory headroom caused the runaway context switching that buried the node.
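To put the load figure in proportion, it helps to divide the load average by the CPU count. The following is a minimal sketch of that arithmetic; `load_per_cpu` is a hypothetical helper introduced here for illustration and is not part of the incident tooling.

```shell
# Sketch: express a load average as runnable/waiting tasks per CPU.
# load_per_cpu is a hypothetical helper, not part of the incident tooling.
load_per_cpu() {
    # $1 = 1-minute load average, $2 = number of CPUs
    awk -v la="$1" -v cpus="$2" 'BEGIN { printf "%.1f\n", la / cpus }'
}

# Figures from this incident: load 417.43 on 2 vCPUs
load_per_cpu 417.43 2    # prints 208.7

# On a live Linux host (assumes /proc/loadavg and nproc are available):
# load_per_cpu "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

A value near 1.0 means the CPUs are roughly saturated; a value above 200 means each vCPU had a queue of a couple of hundred tasks waiting on it.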
The ops-bridge reverse tunnels survived because they were established before the overload began and require no new SSH connections to stay alive. This was the only out-of-band visibility channel available once SSH stopped accepting new connections.
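This report does not document how ops-bridge sets up its tunnels. As a rough illustration of the general pattern only, an outbound reverse tunnel with plain OpenSSH might look like the sketch below. The helper name `reverse_tunnel_cmd`, the user and host, and the port mapping (k3s API on its default port 6443, exposed as 16443 on the receiving side) are all assumptions.

```shell
# Sketch: build the ssh command for an outbound reverse tunnel.
# reverse_tunnel_cmd, the user/host name and the port choices are illustrative assumptions.
reverse_tunnel_cmd() {
    # $1 = port opened on the receiving (remote) side
    # $2 = local port on the node to expose
    # $3 = user@host of the machine that receives the tunnel
    printf 'ssh -N -o ExitOnForwardFailure=yes -o ServerAliveInterval=30 -R %s:localhost:%s %s\n' "$1" "$2" "$3"
}

# Example: expose the node's k3s API (6443 by default) as port 16443 on the receiving host.
reverse_tunnel_cmd 16443 6443 ops@workstation.example
```

Because the connection is opened from the node outward and then held, an overload that prevents sshd from completing new banner exchanges does not by itself tear down a tunnel that already exists.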
Impact
- Gitea and git operations unavailable for ~15 minutes
- SSH access to COULOMBCORE unavailable (required console)
- k3s API unresponsive (no pod management possible)
- PostgreSQL HA nodes 0 and 2 restarted under load (recovered on their own)
- No data loss
Resolution Steps
# Via console/VNC — SSH was not available
# 1. Identify the runaway process (top showed PID 2457456 at 71GB VIRT, 6.8% CPU)
# Indicator: massive VIRT, hundreds of children, 99.8% sy in top
# 2. Kill the runaway agent and stuck crash reporter
kill -9 2457456 # runaway claude process
kill -9 2579133 # apport in D-state, consuming CPU
# 3. Wait ~60s — load drops, SSH accepts connections
# 4. Verify PostgreSQL HA recovery (may take 2-3 min to resync)
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'
# 5. Verify Gitea
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/
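Step 1 above relied on reading top by eye. One possible way to confirm which process owns the flood of children is to rank parent PIDs by child count. This is a sketch, not the procedure used during the incident; `top_parents` is a hypothetical helper, and it counts direct children only, so deeper descendants would need a process-tree view.

```shell
# Sketch: rank parent PIDs by number of direct children.
# top_parents is a hypothetical helper; it reads one PPID per line on stdin.
top_parents() {
    # $1 = how many parents to show (default 5)
    awk 'NF { n[$1]++ } END { for (p in n) print n[p], p }' \
        | sort -k1,1nr -k2,2n \
        | head -n "${1:-5}"
}

# On the console: ps -eo ppid= | top_parents 5
# Offline demonstration with made-up PPIDs:
printf '100\n100\n100\n200\n1\n1\n' | top_parents 2
```

Each output line is "child-count PPID", so the demonstration prints `3 100` followed by `2 1`.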
Follow-up Actions
- Add swap to COULOMBCORE:
  fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab
- Set nproc ulimit for tegwick user:
  /etc/security/limits.conf → tegwick hard nproc 512
- Set memory limit on tegwick systemd user session:
  systemctl --user set-property "" MemoryMax=1G
  or a dedicated slice
- Add cluster-wide pod health alerting (cron on CoulombCore): catches any crashlooping pod, not just Gitea; see runbook "Robustness §5"
- Ensure all Ralph loops on remote agents use /ralph-workplan (bounded, HEUREKA stop), never raw /ralph-loop
- Consider adding a bridge check cron on workstation that alerts when node load > threshold via state-hub API
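The load-threshold check in the last item could be as small as the sketch below. How state-hub exposes node load is not documented here, so the sketch reads the load from a file in /proc/loadavg format and leaves the alert transport out; `load_exceeds` and the threshold value are hypothetical.

```shell
# Sketch: succeed (exit 0) when the 1-minute load exceeds a threshold.
# load_exceeds is a hypothetical helper; the alert transport is left out.
load_exceeds() {
    # $1 = threshold, $2 = file in /proc/loadavg format (defaults to /proc/loadavg)
    awk -v t="$1" '{ exit !($1 > t) }' "${2:-/proc/loadavg}"
}

# Demonstration with the peak load recorded in this incident.
# Only the first field is read, so the sample file holds just that value.
sample=$(mktemp)
echo '417.43' > "$sample"
if load_exceeds 8 "$sample"; then
    echo "ALERT: load above threshold"
fi
rm -f "$sample"
```

Taking the load source as a parameter keeps the comparison testable and lets the cron job swap in whatever the state-hub API returns.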
Lessons Learned
- No swap = amplified blast radius. A machine with no swap has zero buffer between "memory pressure" and "complete kernel thrash". A 4GB swapfile costs nothing and buys significant time for intervention.
- Reverse tunnels are the last line of visibility. SSH and the k3s API both died. The state-hub reverse tunnel (established from COULOMBCORE outbound) survived and confirmed the node was alive. This was critical for triage.
- Remote agents need hard resource ceilings. A Claude Code agent that spawns subprocesses has no built-in rate limit. nproc + systemd MemoryMax on the user session are the right controls for this environment.
- Console access is non-negotiable. Once SSH dies the only recovery path is OOB console. Ensure hosting provider console credentials are always accessible.
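As one way to make the resource ceilings from the third lesson persistent, a systemd drop-in on the user's slice can carry the memory cap. The sketch below only prints such a drop-in. `user_slice_dropin`, UID 1000 and the file path are placeholders, and TasksMax is an addition beyond the nproc limit named above: it caps tasks (processes and threads) in the slice's cgroup.

```shell
# Sketch: print a systemd drop-in that caps memory and task count for one user's slice.
# user_slice_dropin and the chosen values are illustrative; UID 1000 is a placeholder.
user_slice_dropin() {
    # $1 = MemoryMax value, $2 = TasksMax value
    printf '[Slice]\nMemoryMax=%s\nTasksMax=%s\n' "$1" "$2"
}

# Intended location (assumption): /etc/systemd/system/user-1000.slice.d/50-limits.conf
# followed by: systemctl daemon-reload
user_slice_dropin 1G 512
```

Unlike a one-off set-property call in a running session, a drop-in under /etc/systemd/system applies on every login of that user.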