- ADR-003: protocols artifact convention (location, structure, lifecycle)
- agents/protocols/README.md: directory-level index and usage guide
- agents/protocols/sys-medic/k3s-node-health-assessment.md: full structured k3s node health assessment protocol (8 steps: OS baseline, process hygiene, memory, CPU, disk, network, k3s node state, runtime services)
- agent-sys-medic.md: add memory: enabled frontmatter, session-start/close protocols, node-profile memory template extensions, protocol reference in Default Task
- cli.py: add protocols command group (list, show); extend memory init to hint protocol commands for agents that have protocols

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| agent | slug | title | version | last_updated |
|---|---|---|---|---|
| sys-medic | k3s-node-health-assessment | k3s Node Health Assessment | 1.0.0 | 2026-03-18 |
k3s Node Health Assessment
Purpose
Structured health assessment for a Linux host running k3s (lightweight Kubernetes). Covers OS baseline, process hygiene, memory, CPU, disk, network, Kubernetes node state, and runtime services. Produces a prioritized findings report with safe next actions.
Scope
- Linux host (any distribution) running k3s
- k3s worker nodes and single-node clusters
- Hosts where `kubectl` and/or `k3s kubectl` are available
- Applies whether the host is healthy, degraded, or in an unknown state
Prerequisites
- Shell access to the target host (SSH or console)
- Ideally: sudo or root access (some checks require it)
- Available tools: `ps`, `top`, `free`, `vmstat`, `iostat`, `ss`, `journalctl`, `systemctl`, `dmesg`, `df`, `du`, `lsof`, and `kubectl` or `k3s kubectl`
- Note which tools are absent and record what could not be checked
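A minimal sketch for recording absent tools before starting. The helper name `missing_tools` is hypothetical and not part of the protocol; it relies only on the POSIX `command -v` builtin.

```shell
# Print each requested tool that is not found on PATH, one per line.
# (missing_tools is an illustrative helper name, not part of the protocol.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: check the tools named in the prerequisites.
# missing_tools ps top free vmstat iostat ss journalctl systemctl dmesg df du lsof kubectl k3s
```

Any name it prints can be copied into the report as a check that could not be performed.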
Procedure
Step 1 — OS and Node Baseline
Establish context before diagnosing anything.
hostname
uptime
uname -r
nproc
free -h
swapon --show
df -h
date
Record:
- Hostname and uptime
- Kernel version
- CPU core count
- Total/used/free memory and swap
- Overall disk usage per mount
- Current time (for correlating log timestamps)
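One possible way to capture the baseline in a single timestamped file is sketched below. The helper names `section` and `baseline_report` are assumptions for illustration; the commands are the ones listed in this step.

```shell
# Run one command under a header; note it as unavailable if the binary is absent.
# (section and baseline_report are illustrative names.)
section() {
  printf '== %s ==\n' "$*"
  if command -v "$1" >/dev/null 2>&1; then
    "$@" 2>&1 || printf '(exit status %s)\n' "$?"
  else
    printf '(not available: %s)\n' "$1"
  fi
}

# Emit every Step 1 command in order.
baseline_report() {
  section hostname
  section uptime
  section uname -r
  section nproc
  section free -h
  section swapon --show
  section df -h
  section date
}

# Usage: baseline_report > "baseline-$(date +%Y%m%dT%H%M%S).txt"
```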
Step 2 — Process Hygiene
# Zombie and D-state processes
ps aux | awk '$8 ~ /^[ZD]/ {print}'
# Top memory consumers
ps aux --sort=-%mem | head -20
# Top CPU consumers
ps aux --sort=-%cpu | head -20
# Processes with high FD counts (requires lsof)
sudo lsof 2>/dev/null | awk '{print $2}' | sort | uniq -c | sort -rn | head -20
# Longest-running processes first (review anything older than 7 days)
ps -eo pid,user,etime,comm --sort=-etime | head -30
Look for:
- Zombie count > 0
- D-state (uninterruptible sleep) tasks
- Unexpected high-memory or high-CPU processes
- Stale maintenance scripts, port-forwards, debug sessions, rsync, or backup jobs
- Orphaned shells or user sessions
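To turn the first two items into numbers that can go in the report, a small counter such as the following could be used. It is a sketch: `count_proc_states` is a hypothetical name, and it assumes input in the form produced by `ps -eo stat=` (one state string per line).

```shell
# Count zombie (Z) and uninterruptible (D) processes from state strings on stdin.
# Only the first character of each state is used (for example "Ss+" counts as S).
count_proc_states() {
  awk '{ n[substr($1, 1, 1)]++ }
       END { printf "zombie=%d dstate=%d\n", n["Z"], n["D"] }'
}

# Usage: ps -eo stat= | count_proc_states
```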
Step 3 — Memory Health
# Overall memory picture
free -h
cat /proc/meminfo | grep -E 'MemAvailable|SwapFree|Dirty|Slab|KReclaimable'
# OOM kill history
sudo dmesg | grep -i 'oom\|killed process' | tail -20
sudo journalctl -k --since "24 hours ago" | grep -i 'oom\|out of memory' | tail -20
# Slab usage
sudo slabtop -o | head -30
# cgroup memory pressure (cgroup v2): list cgroups whose 10-second "some" stall average is non-zero
find /sys/fs/cgroup -name memory.pressure -exec awk '/^some/ && $2 != "avg10=0.00" {print FILENAME ": " $0}' {} + 2>/dev/null | head -10
Look for:
- Available memory < 10% of total
- Swap actively churning (sustained swap-in/swap-out, visible as non-zero si/so in vmstat, matters more than the amount of swap in use)
- Recent OOM kills
- High slab growth
- cgroup memory pressure events
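The first threshold can be checked without mental arithmetic. The sketch below assumes the standard `MemTotal` and `MemAvailable` fields of `/proc/meminfo`; the function name `mem_available_pct` is illustrative.

```shell
# Print available memory as a whole-number percentage of total memory.
# Reads /proc/meminfo-formatted text on stdin; fails if MemTotal is missing.
mem_available_pct() {
  awk '$1 == "MemTotal:"     { total = $2 }
       $1 == "MemAvailable:" { avail = $2 }
       END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }'
}

# Usage: mem_available_pct < /proc/meminfo
```

The result is truncated, not rounded, so 12.5% prints as 12.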
Step 4 — CPU and Scheduler Health
# Load average vs core count
uptime
nproc
# CPU idle and steal
top -bn1 | grep '%Cpu'
vmstat 1 5
# Run queue pressure (r = runnable: running or waiting for CPU; b = blocked in uninterruptible sleep)
vmstat 1 5 | awk '{print $1, $2}'
Look for:
- Load average persistently > core count
- CPU idle < 10%
- High CPU steal (virtualised hosts)
- Run queue (r) > core count sustained
- Blocked processes (b) > 0 sustained
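Comparing load to core count is a single division. A sketch, with `load_per_core` as a hypothetical helper name:

```shell
# Print load average divided by core count, to two decimals.
# $1 = load average, $2 = core count. Values above 1.00 mean more runnable
# work than cores.
load_per_core() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) exit 1
    printf "%.2f\n", load / cores
  }'
}

# Usage: load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A single reading says little; repeat it a few times to judge whether the ratio is sustained.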
Step 5 — Disk and Filesystem Health
# Disk usage
df -h
df -i # inode usage
# Large log files
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head -20
sudo journalctl --disk-usage
# k3s data directory
sudo du -sh /var/lib/rancher/k3s/ 2>/dev/null
sudo du -sh /var/lib/rancher/k3s/agent/containerd/ 2>/dev/null
# Rapidly growing dirs (compare two snapshots 60s apart)
sudo du -sh /var/lib/rancher /var/log /tmp 2>/dev/null
Look for:
- Any mount > 75% full (warning) or > 90% (critical), matching the Interpretation table
- Any mount with inode usage > 75% (warning) or > 90% (critical)
- Container image accumulation in containerd storage
- Large or rapidly growing log files
- Abandoned temp files
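The two-snapshot comparison mentioned in the commands above could be scripted roughly as follows. This is a sketch under the assumption that `du -sk` output is tab-separated (size, then path); `dir_growth` is an illustrative name.

```shell
# Report growth in KiB for each directory over an interval.
# $1 = interval in seconds, remaining arguments = directories.
dir_growth() {
  interval="$1"; shift
  before=$(du -sk "$@" 2>/dev/null)
  sleep "$interval"
  after=$(du -sk "$@" 2>/dev/null)
  # The first time a path is seen its size is stored; the second time the delta is printed.
  printf '%s\n%s\n' "$before" "$after" |
    awk -F'\t' '{ if ($2 in first) printf "%d\t%s\n", $1 - first[$2], $2
                  else first[$2] = $1 }'
}

# Usage (run as root so every directory is readable):
# dir_growth 60 /var/lib/rancher /var/log /tmp
```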
Step 6 — Network and Connection State
# Connection state summary
ss -s
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn
# Unusual listeners
ss -tlnp
# CLOSE_WAIT accumulation (application socket leak); ss prints the state as CLOSE-WAIT
ss -tan | grep -c CLOSE-WAIT
# TIME_WAIT count (normal, but high counts may indicate connection thrash); needs -a, since ss omits TIME-WAIT by default
ss -tan | grep -c TIME-WAIT
Look for:
- CLOSE_WAIT count > 50 (application not closing sockets)
- SYN_RECV accumulation (connection flood or backlog issue)
- Unexpected listeners on unusual ports
- Long-lived unexpected tunnels or port-forwards
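Spotting unexpected listeners is easier against an explicit allowlist. The sketch below parses `ss -tln` output; the function name and the example port list (22, 6443, 10250) are assumptions to adapt per host.

```shell
# Print listening TCP ports that are not in the allowlist, one per line.
# Reads `ss -tln` output on stdin; $1 = space-separated list of expected ports.
unexpected_listeners() {
  awk -v allow="$1" '
    BEGIN { n = split(allow, a, " "); for (i = 1; i <= n; i++) ok[a[i]] = 1 }
    $1 == "LISTEN" {
      m = split($4, parts, ":"); port = parts[m]   # port follows the last colon
      if (!(port in ok) && !(port in seen)) { seen[port] = 1; print port }
    }'
}

# Usage: ss -tln | unexpected_listeners "22 6443 10250"
```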
Step 7 — Kubernetes Node Health
# Node status and conditions
kubectl get node $(hostname) -o wide 2>/dev/null || k3s kubectl get node $(hostname) -o wide
# Node conditions in detail
kubectl describe node $(hostname) 2>/dev/null | grep -A 10 'Conditions:'
# Resource pressure
kubectl top node $(hostname) 2>/dev/null
# Recent node events
kubectl get events --field-selector involvedObject.name=$(hostname) --sort-by='.lastTimestamp' 2>/dev/null | tail -20
# Top pods by resource use
kubectl top pods --all-namespaces --sort-by=memory 2>/dev/null | head -20
# Restarting pods on this node
kubectl get pods --all-namespaces --field-selector spec.nodeName=$(hostname) 2>/dev/null | awk '$5 > 5 {print}'
Look for:
- Node Ready=False or Unknown
- MemoryPressure, DiskPressure, PIDPressure, or NetworkUnavailable = True
- Pods with high restart counts (> 5)
- CrashLoopBackOff workloads
- Evicted pods (indicates past resource pressure)
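The first two items can be reduced to a filter over `Type=Status` pairs. A sketch: `bad_node_conditions` is a hypothetical name, and the jsonpath expression in the usage comment is one way to produce its input.

```shell
# Print node conditions that indicate a problem.
# Reads one Type=Status pair per line on stdin: Ready must be True,
# every other condition must not be True.
bad_node_conditions() {
  awk -F= '($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 == "True")'
}

# Usage:
# kubectl get node "$(hostname)" \
#   -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#   | bad_node_conditions
```

Empty output means no condition is flagged.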
Step 8 — k3s Runtime and Control Services
# k3s service status
sudo systemctl status k3s 2>/dev/null || sudo systemctl status k3s-agent
# k3s recent logs (last 100 lines; journalctl exits 0 for an unknown unit, so query both units at once)
sudo journalctl -u k3s -u k3s-agent --since "1 hour ago" -n 100
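# containerd status (k3s runs its embedded containerd as a child process, so a separate unit exists only for an external runtime)
sudo systemctl status containerd 2>/dev/null || pgrep -a containerd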
# CNI / flannel if applicable
sudo systemctl status flanneld 2>/dev/null
sudo ip addr show flannel.1 2>/dev/null
Look for:
- k3s service not running or in failed state
- Repeated restart entries in k3s logs
- PLEG errors, image GC failures, sandbox creation failures
- cgroup-related errors
- API server timeout messages (on worker nodes: etcd or API server unreachable)
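A rough keyword tally can show which of these categories dominates the recent log. This is only a sketch: the function name and the keyword list are assumptions loosely derived from the items above, not exact k3s log strings, so treat matches as pointers for manual reading.

```shell
# Count log lines on stdin that match each keyword, case-insensitively.
# The keywords are illustrative, not authoritative message patterns.
scan_k3s_log() {
  log=$(cat)
  for kw in pleg "garbage collection" sandbox cgroup timeout; do
    n=$(printf '%s\n' "$log" | grep -ci -- "$kw" || true)
    printf '%s=%s\n' "$kw" "$n"
  done
}

# Usage: sudo journalctl -u k3s -u k3s-agent --since "1 hour ago" | scan_k3s_log
```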
Interpretation
| Signal | Normal | Warning | Critical |
|---|---|---|---|
| Load average | ≤ core count | 1–2× core count | > 2× sustained |
| Memory available | > 20% | 10–20% | < 10% |
| Disk usage | < 75% | 75–90% | > 90% |
| Inode usage | < 75% | 75–90% | > 90% |
| Zombie count | 0 | 1–5 | > 5 or climbing |
| OOM kills (24h) | 0 | 1–2 | > 2 or recent |
| Pod restarts | < 3 | 3–10 | > 10 or CrashLoop |
| CLOSE_WAIT | < 10 | 10–50 | > 50 |
| Node Ready | True | — | False / Unknown |
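For the rows where a higher number is worse (disk, inodes, zombies, CLOSE_WAIT), the table can be applied with a small helper like the one below. It is a sketch; `classify` is a hypothetical name, and the thresholds are passed in from the table instead of being hard-coded.

```shell
# Classify a "higher is worse" value against warning and critical thresholds.
# $1 = value, $2 = warning threshold (inclusive), $3 = critical threshold (exclusive).
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v > c)       print "critical"
    else if (v >= w) print "warning"
    else             print "normal"
  }'
}

# Disk and inode rows: classify 82 75 90
# CLOSE_WAIT row:      classify 12 10 50
```

Memory available runs the other way (lower is worse), so it would need the comparison inverted or the value converted to a used percentage first.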
Confidence in findings:
- High — direct evidence (OOM kill log, node condition set, error in service log)
- Medium — indirect evidence (high memory use without OOM, rising load with no clear cause)
- Low — circumstantial (aging process without other indicators)
Remediation
High memory pressure
- Identify top consumers: `ps aux --sort=-%mem | head -20`
- Check for OOM history: `dmesg | grep -i oom`
- If a workload is leaking: restart the specific pod (not the node)
- If slab is high: check for inode-heavy workloads or NFS mounts
- Do not drop caches unless explicitly justified — Linux reclaims page cache automatically
Disk pressure
- Find largest directories: `du -sh /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* | sort -rh | head -20`
- Prune unused container images: `k3s crictl rmi --prune` (safe: only removes unused images)
- Clear old journal logs: `sudo journalctl --vacuum-size=500M`
- Identify log-bloating pods and fix their logging config
k3s service failing
- Check service status: `sudo systemctl status k3s`
- Check logs: `sudo journalctl -u k3s -n 200`
- Common causes: datastore corruption on server nodes (SQLite by default on a single server, or embedded etcd), API server unreachable (worker), disk full, cert expiry
- Do not restart k3s without understanding the cause — a restart may mask the issue
High pod restart count
- Check logs: `kubectl logs <pod> --previous`
- Check events: `kubectl describe pod <pod>`
- Distinguish OOMKilled (memory limit), CrashLoopBackOff (application error), and liveness probe failure
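The container's last termination reason helps with that distinction. The mapping below is a sketch: `explain_termination` is a hypothetical helper, and its messages are suggestions, not kubelet output. Liveness probe failures usually appear in the pod events, not as a distinct termination reason.

```shell
# Map a container's last termination reason to a likely next step.
explain_termination() {
  case "$1" in
    OOMKilled) echo "memory limit exceeded: review the container memory limit" ;;
    Error)     echo "non-zero exit: read the previous container logs and pod events" ;;
    Completed) echo "clean exit: check the restart policy and probes" ;;
    "")        echo "no previous termination recorded" ;;
    *)         echo "other reason ($1): see kubectl describe pod" ;;
  esac
}

# Usage (<pod> is a placeholder, first container only):
# explain_termination "$(kubectl get pod <pod> \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}')"
```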
Notes
- This protocol was adapted from the sys-medic agent's structured assessment areas and the sys-medic repo's companion protocol document.
- For single-node k3s clusters, the control plane (server) and data plane (agent) run on the same host, normally inside the single `k3s` service; the `k3s-agent` service exists on agent-only (worker) nodes, so check whichever unit is installed.
- On hosts without `kubectl` in PATH, use `k3s kubectl` as a drop-in replacement.
- Protocol version history is tracked via the `version` frontmatter field. Update on significant structural changes.
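The `kubectl` fallback can be wrapped once so that every Step 7 command gets it. A minimal sketch; the names `kubectl_cmd` and `kc` are hypothetical.

```shell
# Print the kubectl invocation available on this host, or fail if neither exists.
kubectl_cmd() {
  if command -v kubectl >/dev/null 2>&1; then
    echo "kubectl"
  elif command -v k3s >/dev/null 2>&1; then
    echo "k3s kubectl"
  else
    return 1
  fi
}

# Wrapper that forwards all arguments to whichever client was found.
kc() {
  cmd=$(kubectl_cmd) || { echo "kubectl not found" >&2; return 127; }
  $cmd "$@"   # intentionally unquoted so "k3s kubectl" splits into two words
}

# Usage: kc get node "$(hostname)" -o wide
```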