---
agent: sys-medic
slug: k3s-node-health-assessment
title: k3s Node Health Assessment
version: 1.0.0
last_updated: "2026-03-18"
---

# k3s Node Health Assessment

## Purpose

Structured health assessment for a Linux host running k3s (lightweight Kubernetes). Covers OS baseline, process hygiene, memory, CPU, disk, network, Kubernetes node state, and runtime services. Produces a prioritized findings report with safe next actions.

## Scope

- Linux host (any distribution) running k3s
- k3s worker nodes and single-node clusters
- Hosts where `kubectl` and/or `k3s kubectl` are available
- Applies whether the host is healthy, degraded, or in an unknown state

## Prerequisites

- Shell access to the target host (SSH or console)
- Ideally: sudo or root access (some checks require it)
- Available tools: `ps`, `top`, `free`, `vmstat`, `iostat`, `ss`, `journalctl`, `systemctl`, `dmesg`, `df`, `du`, `lsof`, `kubectl` or `k3s kubectl`
- Note which tools are absent and record what could not be checked

---

## Procedure

### Step 1 — OS and Node Baseline

Establish context before diagnosing anything.

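Because the prerequisites ask you to note which tools are absent, it can help to check for them before collecting the baseline. The following is a minimal sketch: the function name `missing_tools` is an illustrative assumption and not part of the original protocol, and it relies only on the standard `command -v` lookup.

```bash
# missing_tools TOOL...: print each named tool that is not found in PATH.
# Hypothetical helper name; it only wraps the shell's `command -v` lookup.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools listed under Prerequisites. kubectl and k3s are checked separately
# because either one can provide kubectl access.
missing_tools ps top free vmstat iostat ss journalctl systemctl dmesg df du lsof kubectl k3s
```

Any name printed should be recorded as a check that could not be performed. An empty result means all listed tools were found.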
```bash
hostname
uptime
uname -r
nproc
free -h
swapon --show
df -h
date
```

Record:

- Hostname and uptime
- Kernel version
- CPU core count
- Total/used/free memory and swap
- Overall disk usage per mount
- Current time (for correlating log timestamps)

---

### Step 2 — Process Hygiene

```bash
# Zombie and D-state processes
ps aux | awk '$8 ~ /^[ZD]/ {print}'

# Top memory consumers
ps aux --sort=-%mem | head -20

# Top CPU consumers
ps aux --sort=-%cpu | head -20

# Processes with high FD counts (requires lsof; approximate, counts lsof entries per PID)
sudo lsof 2>/dev/null | awk '{print $2}' | sort | uniq -c | sort -rn | head -20

# Longest-running processes (review anything older than about 7 days)
ps -eo pid,user,etime,comm --sort=-etime | head -30
```

Look for:

- Zombie count > 0
- D-state (uninterruptible sleep) tasks
- Unexpected high-memory or high-CPU processes
- Stale maintenance scripts, port-forwards, debug sessions, rsync, or backup jobs
- Orphaned shells or user sessions

---

### Step 3 — Memory Health

```bash
# Overall memory picture
free -h
grep -E 'MemAvailable|SwapFree|Dirty|Slab|KReclaimable' /proc/meminfo

# OOM kill history
sudo dmesg | grep -i 'oom\|killed process' | tail -20
sudo journalctl -k --since "24 hours ago" | grep -i 'oom\|out of memory' | tail -20

# Slab usage
sudo slabtop -o | head -30

# cgroup memory pressure (cgroups v2 only). Every memory.pressure file has a
# "some" line, so keep only those with a non-zero 10-second average.
find /sys/fs/cgroup -name "memory.pressure" 2>/dev/null | xargs -r grep -H '^some' 2>/dev/null | grep -v 'avg10=0.00' | head -10
```

Look for:

- Available memory < 10% of total
- Active swapping (sustained swap-in/swap-out churn is worse than static swap usage)
- Recent OOM kills
- High slab growth
- cgroup memory pressure events

---

### Step 4 — CPU and Scheduler Health

```bash
# Load average vs core count
uptime
nproc

# CPU idle and steal
top -bn1 | grep '%Cpu'
vmstat 1 5

# Run queue pressure
vmstat 1 5 | awk '{print $1, $2}'   # r=running, b=blocked
```

Look for:

- Load average persistently > core count
- CPU idle < 10%
- High CPU steal (virtualised hosts)
- Run queue (r) > core count sustained
- Blocked processes (b) > 0
sustained

---

### Step 5 — Disk and Filesystem Health

```bash
# Disk usage
df -h
df -i    # inode usage

# Large log files
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head -20
sudo journalctl --disk-usage

# k3s data directory
sudo du -sh /var/lib/rancher/k3s/ 2>/dev/null
sudo du -sh /var/lib/rancher/k3s/agent/containerd/ 2>/dev/null

# Rapidly growing dirs (run twice, 60s apart, and compare the two snapshots)
sudo du -sh /var/lib/rancher /var/log /tmp 2>/dev/null
```

Look for:

- Any mount at 75% full or more (warning) or above 90% (critical), matching the Interpretation table
- Any mount with inode usage at 75% or more
- Container image accumulation in containerd storage
- Large or rapidly growing log files
- Abandoned temp files

---

### Step 6 — Network and Connection State

```bash
# Connection state summary (-a includes TIME-WAIT and SYN-RECV, which ss hides by default)
ss -s
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn

# Unusual listeners
ss -tlnp

# CLOSE_WAIT accumulation (application socket leak); ss prints the state as CLOSE-WAIT
ss -tan | grep -c 'CLOSE-WAIT'

# TIME_WAIT count (normal, but high counts may indicate connection thrash); ss prints TIME-WAIT
ss -tan | grep -c 'TIME-WAIT'
```

Look for:

- CLOSE_WAIT count > 50 (application not closing sockets)
- SYN_RECV accumulation (connection flood or backlog issue)
- Unexpected listeners on unusual ports
- Long-lived unexpected tunnels or port-forwards

---

### Step 7 — Kubernetes Node Health

```bash
# Node status and conditions (assumes the node name matches the hostname)
kubectl get node "$(hostname)" -o wide 2>/dev/null || k3s kubectl get node "$(hostname)" -o wide

# Node conditions in detail
kubectl describe node "$(hostname)" 2>/dev/null | grep -A 10 'Conditions:'

# Resource pressure
kubectl top node "$(hostname)" 2>/dev/null

# Recent node events
kubectl get events --field-selector involvedObject.name="$(hostname)" --sort-by='.lastTimestamp' 2>/dev/null | tail -20

# Top pods by resource use
kubectl top pods --all-namespaces --sort-by=memory 2>/dev/null | head -20

# Restarting pods on this node (column 5 is RESTARTS with --all-namespaces)
kubectl get pods --all-namespaces --field-selector spec.nodeName="$(hostname)" 2>/dev/null | awk 'NR == 1 || $5 + 0 > 5'
```

Look for:

- Node Ready=False or
Unknown
- MemoryPressure, DiskPressure, PIDPressure, or NetworkUnavailable = True
- Pods with high restart counts (> 5)
- CrashLoopBackOff workloads
- Evicted pods (indicates past resource pressure)

---

### Step 8 — k3s Runtime and Control Services

```bash
# k3s service status
sudo systemctl status k3s 2>/dev/null || sudo systemctl status k3s-agent

# k3s recent logs (last 100 lines)
sudo journalctl -u k3s --since "1 hour ago" -n 100 2>/dev/null || \
    sudo journalctl -u k3s-agent --since "1 hour ago" -n 100

# containerd status (only if a separate containerd unit exists; k3s normally embeds its own)
sudo systemctl status containerd 2>/dev/null

# CNI / flannel if applicable
sudo systemctl status flanneld 2>/dev/null
sudo ip addr show flannel.1 2>/dev/null
```

Look for:

- k3s service not running or in failed state
- Repeated restart entries in k3s logs
- PLEG errors, image GC failures, sandbox creation failures
- cgroup-related errors
- API server timeout messages (on worker nodes: etcd or API server unreachable)

---

## Interpretation

| Signal | Normal | Warning | Critical |
|--------|--------|---------|----------|
| Load average | ≤ core count | 1–2× core count | > 2× core count, sustained |
| Memory available | > 20% | 10–20% | < 10% |
| Disk usage | < 75% | 75–90% | > 90% |
| Inode usage | < 75% | 75–90% | > 90% |
| Zombie count | 0 | 1–5 | > 5 or climbing |
| OOM kills (24h) | 0 | 1–2 | > 2 or recent |
| Pod restarts | < 3 | 3–10 | > 10 or CrashLoop |
| CLOSE_WAIT | < 10 | 10–50 | > 50 |
| Node Ready | True | — | False / Unknown |

Confidence in findings:

- **High**: direct evidence (OOM kill log, node condition set, error in service log)
- **Medium**: indirect evidence (high memory use without OOM, rising load with no clear cause)
- **Low**: circumstantial (aging process without other indicators)

---

## Remediation

### High memory pressure

1. Identify top consumers: `ps aux --sort=-%mem | head -20`
2. Check for OOM history: `sudo dmesg | grep -i oom`
3. If a workload is leaking: restart the specific pod (not the node)
4. If slab is high: check for inode-heavy workloads or NFS mounts
5. Do not drop caches unless explicitly justified; Linux reclaims page cache automatically

### Disk pressure

1. Find largest directories: `sudo du -sh /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* | sort -rh | head -20`
2. Prune unused container images: `sudo k3s crictl rmi --prune` (safe: only removes unused images)
3. Clear old journal logs: `sudo journalctl --vacuum-size=500M`
4. Identify log-bloating pods and fix their logging config

### k3s service failing

1. Check service status: `sudo systemctl status k3s`
2. Check logs: `sudo journalctl -u k3s -n 200`
3. Common causes: datastore corruption (single-node: SQLite by default, or embedded etcd if enabled), API server unreachable (worker), disk full, cert expiry
4. Do not restart k3s without understanding the cause; a restart may mask the issue

### High pod restart count

1. Check logs from the previous container instance: `kubectl logs POD_NAME --previous`
2. Check events: `kubectl describe pod POD_NAME`
3. Distinguish OOMKilled (memory limit exceeded) from CrashLoopBackOff (application error) and from liveness probe failure

---

## Notes

- This protocol was adapted from the sys-medic agent's structured assessment areas and the sys-medic repo's companion protocol document.
- For single-node k3s clusters, the control plane (server) and data plane (agent) run on the same host. With the standard install script both run inside the `k3s` service, and `k3s-agent` exists only on agent-only nodes, so check whichever unit is present.
- On hosts without `kubectl` in PATH, use `k3s kubectl` as a drop-in replacement.
- Protocol version history is tracked via the `version` frontmatter field. Update on significant structural changes.
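
The thresholds in the Interpretation table can be applied mechanically for the integer signals where higher is worse (disk %, inode %, zombie count, CLOSE_WAIT count, pod restarts). The following is a minimal sketch under stated assumptions: the helper name `classify` and its argument order are illustrative and not part of the original protocol, and the example applies only the disk usage row.

```bash
# classify VALUE WARN CRIT: print normal, warning, or critical for an integer
# signal where higher is worse. Hypothetical helper: WARN is the lowest value
# in the Warning column, CRIT the highest value still counted as a warning.
classify() {
    if [ "$1" -gt "$3" ]; then
        echo critical
    elif [ "$1" -ge "$2" ]; then
        echo warning
    else
        echo normal
    fi
}

# Apply the disk usage row (warning from 75%, critical above 90%) to every
# mount that reports a numeric usage percentage.
df -P | awk 'NR > 1 && $5 ~ /^[0-9]+%$/ { sub("%", "", $5); print $5, $6 }' |
while read -r pct mount; do
    printf '%-30s %3s%%  %s\n' "$mount" "$pct" "$(classify "$pct" 75 90)"
done
```

The same helper covers other rows by changing the bounds, for example `classify "$zombies" 1 5` or `classify "$close_wait" 10 50`. Available memory runs the other way (lower is worse) and would need an inverted comparison.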