kaizen-agentic/agents/protocols/sys-medic/k3s-node-health-assessment.md
Bernd Worsch 53dfd55916 feat(protocols): add protocols artifact convention, sys-medic protocol + CLI (WP-0002 T17-T24)
- ADR-003: protocols artifact convention (location, structure, lifecycle)
- agents/protocols/README.md: directory-level index and usage guide
- agents/protocols/sys-medic/k3s-node-health-assessment.md: full structured
  k3s node health assessment protocol (8 steps: OS baseline, process hygiene,
  memory, CPU, disk, network, k3s node state, runtime services)
- agent-sys-medic.md: add memory: enabled frontmatter, session-start/close
  protocols, node-profile memory template extensions, protocol reference in
  Default Task
- cli.py: add protocols command group (list, show); extend memory init to hint
  protocol commands for agents that have protocols

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 23:48:09 +00:00


agent: sys-medic
slug: k3s-node-health-assessment
title: k3s Node Health Assessment
version: 1.0.0
last_updated: 2026-03-18

k3s Node Health Assessment

Purpose

Structured health assessment for a Linux host running k3s (lightweight Kubernetes). Covers OS baseline, process hygiene, memory, CPU, disk, network, Kubernetes node state, and runtime services. Produces a prioritized findings report with safe next actions.

Scope

  • Linux host (any distribution) running k3s
  • k3s worker nodes and single-node clusters
  • Hosts where kubectl and/or k3s kubectl are available
  • Applies whether the host is healthy, degraded, or in an unknown state

Prerequisites

  • Shell access to the target host (SSH or console)
  • Ideally: sudo or root access (some checks require it)
  • Available tools: ps, top, free, vmstat, iostat, ss, journalctl, systemctl, dmesg, df, du, lsof, kubectl or k3s kubectl
  • Note which tools are absent — record what could not be checked
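
A minimal sketch for recording which prerequisite tools are absent before starting. The helper name check_tools is hypothetical and not part of the protocol; it relies only on the POSIX `command -v` builtin.

```shell
# Sketch: report which of the protocol's tools are present on this host.
# check_tools is a hypothetical helper name.
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "OK $tool"
    else
      echo "MISSING $tool"   # record this: the related check cannot be run
    fi
  done
}

# Usage:
#   check_tools ps top free vmstat iostat ss journalctl systemctl dmesg df du lsof kubectl k3s
```

Keeping the MISSING lines with the findings report makes it clear which steps were skipped and why.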

Procedure

Step 1 — OS and Node Baseline

Establish context before diagnosing anything.

hostname
uptime
uname -r
nproc
free -h
swapon --show
df -h
date

Record:

  • Hostname and uptime
  • Kernel version
  • CPU core count
  • Total/used/free memory and swap
  • Overall disk usage per mount
  • Current time (for correlating log timestamps)

Step 2 — Process Hygiene

# Zombie and D-state processes
ps aux | awk '$8 ~ /^[ZD]/ {print}'

# Top memory consumers
ps aux --sort=-%mem | head -20

# Top CPU consumers
ps aux --sort=-%cpu | head -20

# Processes with high FD counts (requires lsof)
sudo lsof 2>/dev/null | awk '{print $2}' | sort | uniq -c | sort -rn | head -20

# Longest-running processes first (review anything with an elapsed time > 7 days)
ps -eo pid,user,etime,comm --sort=start_time | head -30

Look for:

  • Zombie count > 0
  • D-state (uninterruptible sleep) tasks
  • Unexpected high-memory or high-CPU processes
  • Stale maintenance scripts, port-forwards, debug sessions, rsync, or backup jobs
  • Orphaned shells or user sessions
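
The zombie and D-state checks above can be reduced to two numbers for the findings report. This is a sketch only: count_states is a hypothetical helper, and it assumes one process state value per input line, as produced by `ps -eo stat=`.

```shell
# Sketch: count zombie (Z) and uninterruptible-sleep (D) processes.
# Reads one STAT value per line on stdin; only the first letter is the state.
count_states() {
  awk '
    { s = substr($1, 1, 1) }
    s == "Z" { z++ }
    s == "D" { d++ }
    END { printf "zombie=%d dstate=%d\n", z, d }
  '
}

# Usage:
#   ps -eo stat= | count_states
```

A single D-state sample is often transient I/O wait; repeating the count a few times shows whether it is sustained.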

Step 3 — Memory Health

# Overall memory picture
free -h
cat /proc/meminfo | grep -E 'MemAvailable|SwapFree|Dirty|Slab|KReclaimable'

# OOM kill history
sudo dmesg | grep -i 'oom\|killed process' | tail -20
sudo journalctl -k --since "24 hours ago" | grep -i 'oom\|out of memory' | tail -20

# Slab usage
sudo slabtop -o | head -30

# cgroup memory pressure (cgroups v2 with PSI): list cgroups whose 10s "some" average is non-zero
sudo find /sys/fs/cgroup -name "memory.pressure" -exec grep -H '^some' {} + 2>/dev/null | grep -v 'avg10=0\.00' | head -10

Look for:

  • Available memory < 10% of total
  • Swap actively churning (sustained swap-in/swap-out, the si/so columns in vmstat, is worse than swap merely being occupied)
  • Recent OOM kills
  • High slab growth
  • cgroup memory pressure events
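
To check the "available memory < 10% of total" signal directly, the percentage can be computed from the MemTotal and MemAvailable fields of /proc/meminfo. A minimal sketch; mem_available_pct is a hypothetical helper name.

```shell
# Sketch: print MemAvailable as a whole-number percentage of MemTotal.
# Reads /proc/meminfo-formatted text on stdin; the result is truncated, not rounded.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1   # no MemTotal line found
    }
  '
}

# Usage:
#   mem_available_pct < /proc/meminfo
```

Reading from stdin means the same helper works on a saved copy of /proc/meminfo collected earlier in the session.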

Step 4 — CPU and Scheduler Health

# Load average vs core count
uptime
nproc

# CPU idle and steal
top -bn1 | grep '%Cpu'
vmstat 1 5

# Run queue pressure
vmstat 1 5 | awk '{print $1, $2}'   # r=runnable (running or waiting for CPU), b=blocked (uninterruptible sleep)

Look for:

  • Load average persistently > core count
  • CPU idle < 10%
  • High CPU steal (virtualised hosts)
  • Run queue (r) > core count sustained
  • Blocked processes (b) > 0 sustained
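
The load-versus-cores comparison can be sketched as below. classify_load is a hypothetical helper, and the cut-offs (warning above 1× core count, critical above 2×) are an assumption taken from the Interpretation section of this protocol.

```shell
# Sketch: classify a load average against the core count.
#   $1 = load average, $2 = core count
classify_load() {
  awk -v load="$1" -v cores="$2" 'BEGIN {
    ratio = load / cores
    if (ratio > 2)      state = "critical"
    else if (ratio > 1) state = "warning"
    else                state = "normal"
    printf "%s ratio=%.2f\n", state, ratio
  }'
}

# Usage (third field of /proc/loadavg is the 15-minute average):
#   classify_load "$(cut -d' ' -f3 /proc/loadavg)" "$(nproc)"
```

Using the 15-minute average is one way to approximate "persistently"; the 1-minute figure reacts faster but is noisier.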

Step 5 — Disk and Filesystem Health

# Disk usage
df -h
df -i  # inode usage

# Large log files
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head -20
sudo journalctl --disk-usage

# k3s data directory
sudo du -sh /var/lib/rancher/k3s/ 2>/dev/null
sudo du -sh /var/lib/rancher/k3s/agent/containerd/ 2>/dev/null

# Rapidly growing dirs (compare two snapshots 60s apart)
sudo du -sh /var/lib/rancher /var/log /tmp 2>/dev/null

Look for:

  • Any mount > 85% full (warning) or > 95% (critical)
  • Any mount with inode usage > 85%
  • Container image accumulation in containerd storage
  • Large or rapidly growing log files
  • Abandoned temp files
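
Spotting rapidly growing directories needs two `du` snapshots and a comparison. A minimal sketch, assuming both snapshots were saved with `du -sk` (tab-separated size and path); du_delta and the snapshot file names are hypothetical.

```shell
# Sketch: show per-directory growth between two saved `du -sk` snapshots.
#   $1 = earlier snapshot file, $2 = later snapshot file
du_delta() {
  awk -F '\t' '
    NR == FNR { before[$2] = $1; next }                   # first file: remember sizes
    ($2 in before) { printf "%+d KB\t%s\n", $1 - before[$2], $2 }
  ' "$1" "$2"
}

# Usage:
#   sudo du -sk /var/lib/rancher /var/log /tmp > du.before 2>/dev/null
#   sleep 60
#   sudo du -sk /var/lib/rancher /var/log /tmp > du.after 2>/dev/null
#   du_delta du.before du.after
```

A longer interval gives a steadier growth rate; 60 seconds is only enough to catch runaway writers.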

Step 6 — Network and Connection State

# Connection state summary (-a is needed: by default ss omits TIME-WAIT and SYN-RECV sockets)
ss -s
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn

# Unusual listeners
ss -tlnp

# CLOSE_WAIT accumulation (application socket leak); ss prints the state as CLOSE-WAIT
ss -tan | grep -c CLOSE-WAIT

# TIME_WAIT count (normal, but high counts may indicate connection thrash); ss prints TIME-WAIT
ss -tan | grep -c TIME-WAIT

Look for:

  • CLOSE_WAIT count > 50 (application not closing sockets)
  • SYN_RECV accumulation (connection flood or backlog issue)
  • Unexpected listeners on unusual ports
  • Long-lived unexpected tunnels or port-forwards
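
For the per-state thresholds above, a small counter over `ss -tan` output avoids repeating grep pipelines. A sketch only: count_tcp_state is a hypothetical helper. Note that ss prints state names with hyphens (CLOSE-WAIT, TIME-WAIT, SYN-RECV), not underscores.

```shell
# Sketch: count TCP sockets in one state.
# Reads `ss -tan` output on stdin; the state is the first column.
#   $1 = state name exactly as ss prints it, e.g. CLOSE-WAIT
count_tcp_state() {
  awk -v want="$1" '$1 == want { n++ } END { print n + 0 }'
}

# Usage:
#   ss -tan | count_tcp_state CLOSE-WAIT
#   ss -tan | count_tcp_state SYN-RECV
```

Matching on the first column only avoids false hits when a state name happens to appear elsewhere in a line.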

Step 7 — Kubernetes Node Health

# Node status and conditions
kubectl get node $(hostname) -o wide 2>/dev/null || k3s kubectl get node $(hostname) -o wide

# Node conditions in detail
kubectl describe node $(hostname) 2>/dev/null | grep -A 10 'Conditions:'

# Resource pressure
kubectl top node $(hostname) 2>/dev/null

# Recent node events
kubectl get events --field-selector involvedObject.name=$(hostname) --sort-by='.lastTimestamp' 2>/dev/null | tail -20

# Top pods by resource use
kubectl top pods --all-namespaces --sort-by=memory 2>/dev/null | head -20

# Restarting pods on this node
kubectl get pods --all-namespaces --field-selector spec.nodeName=$(hostname) 2>/dev/null | awk '$5 > 5 {print}'

Look for:

  • Node Ready=False or Unknown
  • MemoryPressure, DiskPressure, PIDPressure, or NetworkUnavailable = True
  • Pods with high restart counts (> 5)
  • CrashLoopBackOff workloads
  • Evicted pods (indicates past resource pressure)
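
The restart-count filter can be made a little more explicit than a bare awk comparison. A sketch, assuming the table layout of `kubectl get pods --all-namespaces`, where RESTARTS is the fifth column and may carry a suffix such as "(2m ago)"; pods_over_restarts is a hypothetical helper.

```shell
# Sketch: list pods whose restart count exceeds a threshold.
# Reads `kubectl get pods --all-namespaces` output on stdin.
#   $1 = threshold (default 5)
pods_over_restarts() {
  awk -v limit="${1:-5}" '
    NR > 1 && ($5 + 0) > limit {          # NR > 1 skips the header row
      print $1 "/" $2, "restarts=" ($5 + 0), "status=" $4
    }
  '
}

# Usage:
#   kubectl get pods --all-namespaces --field-selector spec.nodeName="$(hostname)" | pods_over_restarts 5
```

Without --all-namespaces the NAMESPACE column is absent and every field shifts left by one, so the column numbers would need adjusting.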

Step 8 — k3s Runtime and Control Services

# k3s service status
sudo systemctl status k3s 2>/dev/null || sudo systemctl status k3s-agent

# k3s recent logs (last 100 lines from the past hour; covers both the server and agent units)
sudo journalctl -u k3s -u k3s-agent --since "1 hour ago" -n 100

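# containerd status (k3s normally runs its embedded containerd inside the k3s service; a separate unit exists only with an external runtime)
sudo systemctl status containerd 2>/dev/null

# CNI / flannel if applicable (flannel is built into k3s by default; a flanneld unit exists only for standalone installs)
sudo systemctl status flanneld 2>/dev/null
sudo ip addr show flannel.1 2>/dev/null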

Look for:

  • k3s service not running or in failed state
  • Repeated restart entries in k3s logs
  • PLEG errors, image GC failures, sandbox creation failures
  • cgroup-related errors
  • API server timeout messages (on worker nodes: etcd or API server unreachable)
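
Scanning the k3s journal for the error classes listed above can be sketched as a simple tally. The helper name scan_k3s_log and the match patterns are assumptions for illustration; real log wording varies between k3s versions, so treat a zero count as "not seen with these patterns", not as proof of absence.

```shell
# Sketch: tally log lines that mention common k3s/kubelet problem classes.
# Reads log text on stdin; matching is case-insensitive and per line.
scan_k3s_log() {
  awk '
    { line = tolower($0) }
    line ~ /pleg/                      { pleg++ }
    line ~ /sandbox/                   { sandbox++ }
    line ~ /garbage collect/           { gc++ }
    line ~ /cgroup/                    { cgroup++ }
    line ~ /timeout|deadline exceeded/ { timeout++ }
    END {
      printf "pleg=%d sandbox=%d image_gc=%d cgroup=%d timeout=%d\n",
             pleg, sandbox, gc, cgroup, timeout
    }
  '
}

# Usage:
#   sudo journalctl -u k3s -u k3s-agent --since "1 hour ago" | scan_k3s_log
```

Non-zero counts only point at where to read; the matching lines themselves still need to be inspected before drawing a conclusion.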

Interpretation

Signal           | Normal       | Warning         | Critical
Load average     | ≤ core count | 1–2× core count | > 2× sustained
Memory available | > 20%        | 10–20%          | < 10%
Disk usage       | < 75%        | 75–90%          | > 90%
Inode usage      | < 75%        | 75–90%          | > 90%
Zombie count     | 0            | 1–5             | > 5 or climbing
OOM kills (24h)  | 0            | 1–2             | > 2 or recent
Pod restarts     | < 3          | 3–10            | > 10 or CrashLoop
CLOSE_WAIT       | < 10         | 10–50           | > 50
Node Ready       | True         | -               | False / Unknown

Confidence in findings:

  • High — direct evidence (OOM kill log, node condition set, error in service log)
  • Medium — indirect evidence (high memory use without OOM, rising load with no clear cause)
  • Low — circumstantial (aging process without other indicators)
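
The disk and inode rows of the table above can be applied mechanically to `df` output. A minimal sketch, assuming the POSIX output format of `df -P` (percentage in column 5, mount point in column 6) and the 75% / 90% cut-offs from the table; classify_usage is a hypothetical helper.

```shell
# Sketch: label each mount as normal, warning, or critical by usage percentage.
# Reads `df -P` (disk) or `df -iP` (inodes) output on stdin.
classify_usage() {
  awk 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)
    pct += 0                               # non-numeric values such as "-" become 0
    if (pct > 90)       state = "critical"
    else if (pct >= 75) state = "warning"
    else                state = "normal"
    printf "%s %d%% %s\n", state, pct, $6
  }'
}

# Usage:
#   df -P | classify_usage
#   df -iP | classify_usage
```

Mount points containing spaces would be truncated at the first space, which is acceptable for a quick triage pass.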

Remediation

High memory pressure

  1. Identify top consumers: ps aux --sort=-%mem | head -20
  2. Check for OOM history: dmesg | grep -i oom
  3. If a workload is leaking: restart the specific pod (not the node)
  4. If slab is high: check for inode-heavy workloads or NFS mounts
  5. Do not drop caches unless explicitly justified — Linux reclaims page cache automatically

Disk pressure

  1. Find largest directories: du -sh /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* | sort -rh | head -20
  2. Prune unused container images: k3s crictl rmi --prune (safe — only removes unused images)
  3. Clear old journal logs: sudo journalctl --vacuum-size=500M
  4. Identify log-bloating pods and fix their logging config

k3s service failing

  1. Check service status: sudo systemctl status k3s
  2. Check logs: sudo journalctl -u k3s -n 200
  3. Common causes: datastore corruption (single-node; SQLite by default, or embedded etcd), API server unreachable (worker), disk full, cert expiry
  4. Do not restart k3s without understanding the cause — a restart may mask the issue

High pod restart count

  1. Check logs: kubectl logs <pod> --previous
  2. Check events: kubectl describe pod <pod>
  3. Distinguish OOMKilled (memory limit) from CrashLoop (application error) from Liveness probe failure
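
Distinguishing the three restart causes can be sketched as a first-pass text check over `kubectl describe pod` output. The helper name restart_cause is hypothetical; it looks for the strings OOMKilled, "Liveness probe failed", and CrashLoopBackOff, which Kubernetes uses in container state and event output.

```shell
# Sketch: guess the dominant restart cause from `kubectl describe pod` text on stdin.
# OOMKilled is checked first because OOM-killed and probe-killed containers
# also end up in CrashLoopBackOff.
restart_cause() {
  awk '
    /OOMKilled/             { oom = 1 }
    /Liveness probe failed/ { live = 1 }
    /CrashLoopBackOff/      { crash = 1 }
    END {
      if (oom)        print "oom-killed (memory limit)"
      else if (live)  print "liveness-probe-failure"
      else if (crash) print "crashloop (application error)"
      else            print "unknown"
    }
  '
}

# Usage:
#   kubectl describe pod <pod> | restart_cause
```

This is a triage hint with medium confidence at best; confirm with `kubectl logs <pod> --previous` before acting.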

Notes

  • This protocol was adapted from the sys-medic agent's structured assessment areas and the sys-medic repo's companion protocol document.
  • For single-node k3s clusters, the control plane (server) and data plane (agent) run on the same host inside the single k3s service; the k3s-agent service exists only on agent-only (worker) nodes, so check whichever unit is installed.
  • On hosts without kubectl in PATH, use k3s kubectl as a drop-in replacement.
  • Protocol version history is tracked via the version frontmatter field. Update on significant structural changes.