
tegwick 29b84de13c ADR and Runbook artefacts

2026-03-27 00:16:09 +01:00

5.3 KiB



title	date	severity	status	affected	environment	duration	resolved_by
INC-002: COULOMBCORE node overload — runaway Claude Code agent	2026-03-26	high	resolved	gitea (http://92.205.130.254:32166), k3s API, SSH access	COULOMBCORE k3s cluster	~15 minutes (detected ~21:00, SSH restored ~21:08 UTC)	Bernd Worsch / Claude

INC-002: COULOMBCORE node overload — runaway Claude Code agent

Summary

The COULOMBCORE node (92.205.130.254) became completely unresponsive under extreme load generated by a runaway Claude Code agent process. Load average peaked at 417.43 (1m). 99.8% of CPU time was spent in kernel mode (context switching). SSH connections timed out during banner exchange. k3s API was unreachable (TLS handshake timeout). Gitea was technically still running as a process but unable to serve requests. The node had no swap, so memory exhaustion amplified the impact — kswapd0 was consuming ~22% CPU trying to reclaim pages with nowhere to put them.

Timeline

Time (approx UTC)	Event
~20:45	Runaway claude agent (PID 2457456, user tegwick) spawning hundreds of subprocesses
~21:00	Load average passes 300; SSH banner exchange starts timing out
~21:00	User attempts git operations; git repo service unreachable
~21:00	Remote diagnosis begins via ops-bridge (state-hub reverse tunnel still alive)
~21:00	k3s API confirmed unresponsive (TLS handshake timeout via local tunnel :16443)
~21:05	User obtains console/VNC access via hosting provider
~21:05	`top` output shared: load 417, 530 tasks (104 running), 34 zombies, 99.8% sy
~21:06	`kill -9 2457456` + `kill -9 2579133` (stuck apport) executed via console
~21:08	SSH accepting connections again (load 85, still declining)
~21:09	kubectl connectivity restored; PostgreSQL HA nodes resyncing
~21:10	Gitea accessible; incident resolved

Root Cause

A Claude Code agent running on COULOMBCORE under the tegwick user (PID 2457456, VIRT 71.1GB) spawned approximately 500 child processes. The likely cause is an unbounded Ralph loop or parallel agent task expansion without a completion condition or iteration cap.
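A parent process like this can usually be spotted by counting direct children per parent PID. The following is a minimal sketch, not the procedure used during the incident: `top_parents` is a hypothetical helper name, and it only counts direct children, so a deeper process tree would need a recursive walk.

```shell
# top_parents N
# Reads one parent PID per line on stdin and prints the N parents with the
# most direct children as "<child_count> <ppid>", busiest first.
# (Hypothetical helper, not part of the original runbook.)
top_parents() {
    awk '{ n[$1]++ } END { for (p in n) print n[p], p }' |
        sort -k1,1nr -k2,2n |
        head -n "${1:-5}"
}

# Typical use on a live system (ps prints one PPID per process):
#   ps -e -o ppid= | top_parents 5
```

On a node that is already thrashing, `ps` itself may take a long time to return, so this is more useful as an early check than as a recovery tool.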

With no swap configured on a 3.9GB machine, the kernel had nowhere to page out anonymous memory, so kswapd0 ran at ~22% CPU continuously trying to reclaim pages. systemd was at ~17% CPU processing unit state changes for the constant process churn. The combination of ~500 tasks competing for 2 vCPUs with no memory headroom caused the runaway context switching that buried the node.
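Whether a node is in this no-swap state can be checked by reading `SwapTotal` from `/proc/meminfo`. This is a sketch under assumptions: the helper names `swap_total_kib` and `has_swap` are hypothetical, and the optional file argument exists only so the functions can be exercised against a sample file.

```shell
# swap_total_kib [FILE]
# Prints the SwapTotal value (in KiB) from a meminfo-style file.
# Defaults to /proc/meminfo. (Hypothetical helper.)
swap_total_kib() {
    awk '$1 == "SwapTotal:" { print $2 }' "${1:-/proc/meminfo}"
}

# has_swap [FILE]
# Succeeds only when SwapTotal is present and greater than zero.
has_swap() {
    [ "$(swap_total_kib "$@")" -gt 0 ] 2>/dev/null
}

# Typical use:
#   has_swap || echo "WARNING: no swap configured on $(hostname)"
```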

The ops-bridge reverse tunnels survived because they were established before the overload began and require no new SSH connections to stay alive. This was the only out-of-band visibility channel available once SSH stopped accepting new connections.

Impact

Gitea and git operations unavailable for ~15 minutes
SSH access to COULOMBCORE unavailable (required console)
k3s API unresponsive (no pod management possible)
PostgreSQL HA nodes 0 and 2 restarted under load (recovered on their own)
No data loss

Resolution Steps

# Via console/VNC — SSH was not available

# 1. Identify the runaway process (top showed PID 2457456 at 71GB VIRT, 6.8% CPU)
#    Indicator: massive VIRT, hundreds of children, 99.8% sy in top

# 2. Kill the runaway agent and stuck crash reporter
kill -9 2457456     # runaway claude process
kill -9 2579133     # apport in D-state, consuming CPU

# 3. Wait ~60s — load drops, SSH accepts connections
# 4. Verify PostgreSQL HA recovery (may take 2-3 min to resync)
kubectl get pods -l 'app.kubernetes.io/name=postgresql-ha'

# 5. Verify Gitea
curl -s -o /dev/null -w "%{http_code}" http://localhost:3000/
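The waiting in steps 3 to 5 can be wrapped in a small retry helper instead of re-running the checks by hand. This is a hedged sketch: `wait_until` and `gitea_ok` are hypothetical names, and treating HTTP 200 from the Gitea root URL as "healthy" is an assumption, not something the incident record states.

```shell
# wait_until TRIES DELAY CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 as soon as CMD succeeds, 1 if every attempt fails.
# Note: uses the plain variables "tries" and "delay" (POSIX sh has no local).
wait_until() {
    tries=$1
    delay=$2
    shift 2
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -gt 0 ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# gitea_ok: succeeds when the local Gitea endpoint answers with HTTP 200.
# (Assumption: 200 on the root URL means Gitea is serving requests.)
gitea_ok() {
    [ "$(curl -s -o /dev/null -w '%{http_code}' http://localhost:3000/)" = "200" ]
}

# Typical use: poll for up to ~3 minutes, every 10 seconds.
#   wait_until 18 10 gitea_ok && echo "Gitea is back"
```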

Follow-up Actions

Add swap to COULOMBCORE: fallocate -l 4G /swapfile && chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile && echo '/swapfile none swap sw 0 0' >> /etc/fstab
Set nproc ulimit for tegwick user: /etc/security/limits.conf → tegwick hard nproc 512
Set memory limit on tegwick systemd user session: systemctl --user set-property "" MemoryMax=1G or a dedicated slice
Add cluster-wide pod health alerting (cron on COULOMBCORE) to catch any crashlooping pod, not just Gitea; see runbook "Robustness §5"
Ensure all Ralph loops on remote agents use /ralph-workplan (bounded, HEUREKA stop), never raw /ralph-loop
Consider adding a bridge check cron on workstation that alerts when node load > threshold via state-hub API
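The load check in the last item could look roughly like the sketch below. It only covers the threshold comparison against a `/proc/loadavg`-style file. How the workstation would fetch the load value or send the alert through the state-hub API is not described here, so that part is left out. All function names are hypothetical.

```shell
# load_over LOAD THRESHOLD
# Succeeds when LOAD is strictly greater than THRESHOLD (floats allowed).
load_over() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l + 0 > t + 0) }'
}

# load_1m [FILE]
# Prints the 1-minute load average from a loadavg-style file.
load_1m() {
    cut -d ' ' -f 1 "${1:-/proc/loadavg}"
}

# check_load THRESHOLD [FILE]
# Prints an ALERT line and returns 1 when the 1-minute load exceeds
# THRESHOLD, otherwise prints an OK line and returns 0.
check_load() {
    threshold=$1
    load=$(load_1m "${2:-/proc/loadavg}")
    if load_over "$load" "$threshold"; then
        echo "ALERT load=$load threshold=$threshold"
        return 1
    fi
    echo "OK load=$load threshold=$threshold"
}

# Typical use from cron (threshold of 8 is an arbitrary example value):
#   check_load 8 || logger -t loadcheck "load above threshold"
```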

Lessons Learned

No swap = amplified blast radius. A machine with no swap has zero buffer between "memory pressure" and "complete kernel thrash". A 4GB swapfile costs nothing and buys significant time for intervention.
Reverse tunnels are the last line of visibility. SSH and the k3s API both died. The state-hub reverse tunnel (established from COULOMBCORE outbound) survived and confirmed the node was alive. This was critical for triage.
Remote agents need hard resource ceilings. A Claude Code agent that spawns subprocesses has no built-in rate limit. nproc + systemd MemoryMax on the user session are the right controls for this environment.
Console access is non-negotiable. Once SSH dies, the only recovery path is the out-of-band (OOB) console. Ensure hosting provider console credentials are always accessible.
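For the resource ceilings mentioned above, one possible shape is a systemd drop-in on the user's slice. This is a sketch under assumptions, not the configuration actually deployed: `slice_dropin` is a hypothetical helper, the drop-in file name is arbitrary, and `TasksMax` is used here as a cgroup-level complement to the `nproc` limit, which the follow-up actions do not mention.

```shell
# slice_dropin MEMORY_MAX TASKS_MAX
# Prints a systemd drop-in that caps memory and task count for a slice.
# (Hypothetical helper; values are passed through unchanged.)
slice_dropin() {
    printf '[Slice]\nMemoryMax=%s\nTasksMax=%s\n' "$1" "$2"
}

# Possible use as root, applying the 1G / 512 values from the follow-up
# actions to the tegwick user's slice (drop-in file name is an assumption):
#   uid=$(id -u tegwick)
#   mkdir -p "/etc/systemd/system/user-$uid.slice.d"
#   slice_dropin 1G 512 > "/etc/systemd/system/user-$uid.slice.d/50-limits.conf"
#   systemctl daemon-reload
```

A slice-level cap covers every process the user starts, including agent subprocesses, which is why it may be a better fit here than a per-process limit.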
