
Bernd Worsch 53dfd55916 feat(protocols): add protocols artifact convention, sys-medic protocol + CLI (WP-0002 T17-T24)

- ADR-003: protocols artifact convention (location, structure, lifecycle)
- agents/protocols/README.md: directory-level index and usage guide
- agents/protocols/sys-medic/k3s-node-health-assessment.md: full structured
  k3s node health assessment protocol (8 steps: OS baseline, process hygiene,
  memory, CPU, disk, network, k3s node state, runtime services)
- agent-sys-medic.md: add memory: enabled frontmatter, session-start/close
  protocols, node-profile memory template extensions, protocol reference in
  Default Task
- cli.py: add protocols command group (list, show); extend memory init to hint
  protocol commands for agents that have protocols

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-18 23:48:09 +00:00

agent	slug	title	version	last_updated
sys-medic	k3s-node-health-assessment	k3s Node Health Assessment	1.0.0	2026-03-18

k3s Node Health Assessment

Purpose

Structured health assessment for a Linux host running k3s (lightweight Kubernetes). Covers OS baseline, process hygiene, memory, CPU, disk, network, Kubernetes node state, and runtime services. Produces a prioritized findings report with safe next actions.

Scope

Linux host (any distribution) running k3s
k3s worker nodes and single-node clusters
Hosts where kubectl and/or k3s kubectl are available
Applies whether the host is healthy, degraded, or in an unknown state

Prerequisites

Shell access to the target host (SSH or console)
Ideally: sudo or root access (some checks require it)
Available tools: ps, top, free, vmstat, iostat, ss, journalctl, systemctl, dmesg, df, du, lsof, kubectl or k3s kubectl
Note which tools are absent — record what could not be checked
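
A minimal sketch for recording which tools are absent before starting. The helper name `missing_tools` is hypothetical and not part of this protocol; it relies only on the POSIX `command -v` builtin.

```shell
# Print each named tool that is not found on PATH, one per line.
# `missing_tools` is a hypothetical helper name.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage: check the tools this protocol relies on.
# missing_tools ps top free vmstat iostat ss journalctl systemctl dmesg df du lsof kubectl k3s
```

An empty result means every listed tool was found; anything printed belongs in the "could not be checked" notes.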

Procedure

Step 1 — OS and Node Baseline

Establish context before diagnosing anything.

hostname
uptime
uname -r
nproc
free -h
swapon --show
df -h
date

Record:

Hostname and uptime
Kernel version
CPU core count
Total/used/free memory and swap
Overall disk usage per mount
Current time (for correlating log timestamps)

Step 2 — Process Hygiene

# Zombie and D-state processes
ps aux | awk '$8 ~ /^[ZD]/ {print}'

# Top memory consumers
ps aux --sort=-%mem | head -20

# Top CPU consumers
ps aux --sort=-%cpu | head -20

# Processes with high FD counts (requires lsof)
sudo lsof 2>/dev/null | awk '{print $2}' | sort | uniq -c | sort -rn | head -20

# Longest-running processes, oldest first; review anything older than 7 days
ps -eo pid,user,etime,comm --sort=start_time | head -30

Look for:

Zombie count > 0
D-state (uninterruptible sleep) tasks
Unexpected high-memory or high-CPU processes
Stale maintenance scripts, port-forwards, debug sessions, rsync, or backup jobs
Orphaned shells or user sessions
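
The zombie and D-state checks above can be reduced to two counts, which are easier to compare between runs. This is a sketch only: `count_zd` is a hypothetical helper name, and it assumes the `ps aux` layout where column 8 is STAT.

```shell
# Read `ps aux` output on stdin and print zombie and D-state counts.
# `count_zd` is a hypothetical helper name; column 8 is STAT in `ps aux`.
count_zd() {
    awk 'NR > 1 && $8 ~ /^Z/ {z++}
         NR > 1 && $8 ~ /^D/ {d++}
         END {printf "zombies=%d dstate=%d\n", z, d}'
}

# Usage: ps aux | count_zd
```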

Step 3 — Memory Health

# Overall memory picture
free -h
grep -E 'MemAvailable|SwapFree|Dirty|Slab|KReclaimable' /proc/meminfo

# OOM kill history
sudo dmesg | grep -i 'oom\|killed process' | tail -20
sudo journalctl -k --since "24 hours ago" | grep -i 'oom\|out of memory' | tail -20

# Slab usage
sudo slabtop -o | head -30

# cgroup memory pressure (cgroups v2): list cgroups with a non-zero 10-second "some" average
find /sys/fs/cgroup -name memory.pressure -exec grep -H '^some' {} + 2>/dev/null | awk '$2 != "avg10=0.00"' | head -10

Look for:

Available memory < 10% of total
Swap being actively used (sustained swap-in/swap-out churn is worse than swap merely being occupied)
Recent OOM kills
High slab growth
cgroup memory pressure events
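
To turn the "available memory" check into a number, something like the following sketch could be used. The helper name `mem_available_status` is hypothetical; the 10% and 20% cut-offs are taken from this document's Interpretation table.

```shell
# Read /proc/meminfo-style text on stdin; print MemAvailable as a percentage
# of MemTotal plus a status. `mem_available_status` is a hypothetical name.
mem_available_status() {
    awk '$1 == "MemTotal:"     {total = $2}
         $1 == "MemAvailable:" {avail = $2}
         END {
             if (total <= 0) { print "unknown"; exit 1 }
             pct = 100 * avail / total
             if (pct < 10) { status = "critical" }
             else if (pct <= 20) { status = "warning" }
             else { status = "normal" }
             printf "%.1f%% %s\n", pct, status
         }'
}

# Usage: mem_available_status < /proc/meminfo
```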

Step 4 — CPU and Scheduler Health

# Load average vs core count
uptime
nproc

# CPU idle and steal
top -bn1 | grep '%Cpu'
vmstat 1 5

# Run queue pressure (r = runnable, b = blocked in uninterruptible sleep)
vmstat 1 5 | awk 'NR > 1 {print $1, $2}'

Look for:

Load average persistently > core count
CPU idle < 10%
High CPU steal (virtualised hosts)
Run queue (r) > core count sustained
Blocked processes (b) > 0 sustained
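
Judging whether run-queue pressure is "sustained" is easier with an average across samples. The sketch below assumes the default `vmstat` column order (r first, b second); `runqueue_summary` is a hypothetical helper name.

```shell
# Read `vmstat <interval> <count>` output on stdin and compare the average
# run queue (r) with a core count passed as $1; also report the average of b.
# `runqueue_summary` is a hypothetical helper name.
runqueue_summary() {
    awk -v cores="$1" '
        $1 ~ /^[0-9]+$/ {r += $1; b += $2; n++}   # data rows only; skips both header lines
        END {
            if (n == 0) { print "no samples"; exit 1 }
            over = (r / n > cores + 0) ? "yes" : "no"
            printf "r_avg=%.1f b_avg=%.1f r_over_cores=%s\n", r / n, b / n, over
        }'
}

# Usage: vmstat 1 5 | runqueue_summary "$(nproc)"
```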

Step 5 — Disk and Filesystem Health

# Disk usage
df -h
df -i  # inode usage

# Large log files
sudo du -sh /var/log/* 2>/dev/null | sort -rh | head -20
sudo journalctl --disk-usage

# k3s data directory
sudo du -sh /var/lib/rancher/k3s/ 2>/dev/null
sudo du -sh /var/lib/rancher/k3s/agent/containerd/ 2>/dev/null

# Rapidly growing dirs (compare two snapshots 60s apart)
sudo du -sh /var/lib/rancher /var/log /tmp 2>/dev/null

Look for:

Any mount ≥ 75% full (warning) or > 90% (critical), matching the Interpretation table
Any mount with inode usage ≥ 75%
Container image accumulation in containerd storage
Large or rapidly growing log files
Abandoned temp files
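
The "two snapshots 60s apart" comparison can be sketched as below. It assumes each snapshot is a file of `du -sk` output (size in KiB, a tab, then the path); `du_growth` and the `/tmp/du.before` and `/tmp/du.after` file names are hypothetical.

```shell
# Compare two `du -sk` snapshots (files with "<KiB><TAB><path>" lines) and
# print the paths that grew, with the growth in KiB.
# `du_growth` is a hypothetical helper name.
du_growth() {
    awk -F '\t' '
        NR == FNR {before[$2] = $1; next}                  # first file: remember sizes
        ($2 in before) && $1 + 0 > before[$2] + 0 {        # second file: report growth
            printf "%s +%d KiB\n", $2, $1 - before[$2]
        }' "$1" "$2"
}

# Usage (run as root, or prefix the du calls with sudo):
# du -sk /var/lib/rancher /var/log /tmp > /tmp/du.before 2>/dev/null
# sleep 60
# du -sk /var/lib/rancher /var/log /tmp > /tmp/du.after 2>/dev/null
# du_growth /tmp/du.before /tmp/du.after
```

Paths that shrank or stayed the same are omitted, so an empty result means nothing grew during the interval.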

Step 6 — Network and Connection State

# Connection state summary
ss -s
ss -tan | awk 'NR > 1 {print $1}' | sort | uniq -c | sort -rn

# Unusual listeners
ss -tlnp

# CLOSE_WAIT accumulation (application socket leak); ss prints the state as CLOSE-WAIT
ss -tn state close-wait | tail -n +2 | wc -l

# TIME_WAIT count (normal, but high counts may indicate connection thrash)
ss -tn state time-wait | tail -n +2 | wc -l

Look for:

CLOSE_WAIT count > 50 (application not closing sockets)
SYN_RECV accumulation (connection flood or backlog issue)
Unexpected listeners on unusual ports
Long-lived unexpected tunnels or port-forwards
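
A sketch for counting sockets in one TCP state from saved `ss -tan` output. Note that `ss` spells states with hyphens (for example `CLOSE-WAIT`), so the helper accepts either spelling. `tcp_state_count` is a hypothetical helper name.

```shell
# Read `ss -tan` output on stdin and print the number of sockets in the
# state named by $1. Accepts CLOSE_WAIT or CLOSE-WAIT style spelling.
# `tcp_state_count` is a hypothetical helper name.
tcp_state_count() {
    awk -v want="$1" '
        BEGIN {gsub(/_/, "-", want); want = toupper(want)}
        NR > 1 && $1 == want {n++}
        END {print n + 0}'
}

# Usage: ss -tan | tcp_state_count CLOSE_WAIT
```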

Step 7 — Kubernetes Node Health

# Node status and conditions
kubectl get node $(hostname) -o wide 2>/dev/null || k3s kubectl get node $(hostname) -o wide

# Node conditions in detail
kubectl describe node $(hostname) 2>/dev/null | grep -A 10 'Conditions:'

# Resource pressure
kubectl top node $(hostname) 2>/dev/null

# Recent node events
kubectl get events --field-selector involvedObject.name=$(hostname) --sort-by='.lastTimestamp' 2>/dev/null | tail -20

# Top pods by resource use
kubectl top pods --all-namespaces --sort-by=memory 2>/dev/null | head -20

# Restarting pods on this node
kubectl get pods --all-namespaces --field-selector spec.nodeName=$(hostname) 2>/dev/null | awk '$5 > 5 {print}'

Look for:

Node Ready=False or Unknown
MemoryPressure, DiskPressure, PIDPressure, or NetworkUnavailable = True
Pods with high restart counts (> 5)
CrashLoopBackOff workloads
Evicted pods (indicates past resource pressure)
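
The pod-level items in this list can be filtered from `kubectl get pods` output with a sketch like the one below. It assumes the default `--all-namespaces` column order (NAMESPACE NAME READY STATUS RESTARTS AGE); `flag_pods` is a hypothetical helper name and the default limit of 5 mirrors the threshold above.

```shell
# Read `kubectl get pods --all-namespaces` output on stdin and print pods
# whose restart count exceeds $1 (default 5) or whose status is
# CrashLoopBackOff or Evicted. `flag_pods` is a hypothetical helper name.
flag_pods() {
    awk -v limit="${1:-5}" '
        NR > 1 && ($5 + 0 > limit + 0 || $4 == "CrashLoopBackOff" || $4 == "Evicted") {
            printf "%s/%s status=%s restarts=%d\n", $1, $2, $4, $5
        }'
}

# Usage:
# kubectl get pods --all-namespaces --field-selector spec.nodeName="$(hostname)" | flag_pods 5
```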

Step 8 — k3s Runtime and Control Services

# k3s service status
sudo systemctl status k3s 2>/dev/null || sudo systemctl status k3s-agent

# k3s recent logs (last 100 lines from the past hour; covers server and agent units)
sudo journalctl -u k3s -u k3s-agent --since "1 hour ago" -n 100 --no-pager

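# containerd status (embedded in k3s, so it normally has no systemd unit of its own)
sudo k3s crictl info 2>/dev/null | head -20
sudo systemctl status containerd 2>/dev/null   # only if a standalone containerd is installed

# CNI / flannel if applicable (k3s embeds flannel; flanneld exists only if run separately)
sudo systemctl status flanneld 2>/dev/null
sudo ip addr show flannel.1 2>/dev/null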

Look for:

k3s service not running or in failed state
Repeated restart entries in k3s logs
PLEG errors, image GC failures, sandbox creation failures
cgroup-related errors
API server timeout messages (on worker nodes: etcd or API server unreachable)
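
For a quick first pass over the k3s journal, a keyword count along the following lines may help decide where to read closely. The keyword list is an assumption derived from the items above, not a set of exact log strings, and `scan_k3s_log` is a hypothetical helper name.

```shell
# Read k3s journal text on stdin and print case-insensitive match counts
# for a few keywords. The keyword list is an assumption; tune it as needed.
# `scan_k3s_log` is a hypothetical helper name.
scan_k3s_log() {
    awk '
        {line = tolower($0)}
        line ~ /pleg/              {c["pleg"]++}
        line ~ /sandbox/           {c["sandbox"]++}
        line ~ /cgroup/            {c["cgroup"]++}
        line ~ /timeout|timed out/ {c["timeout"]++}
        END {
            printf "pleg=%d sandbox=%d cgroup=%d timeout=%d\n", c["pleg"], c["sandbox"], c["cgroup"], c["timeout"]
        }'
}

# Usage:
# sudo journalctl -u k3s -u k3s-agent --since "1 hour ago" --no-pager | scan_k3s_log
```

Non-zero counts are only pointers; read the matching lines before drawing conclusions.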

Interpretation

Signal	Normal	Warning	Critical
Load average	≤ core count	1–2× core count	> 2× sustained
Memory available	> 20%	10–20%	< 10%
Disk usage	< 75%	75–90%	> 90%
Inode usage	< 75%	75–90%	> 90%
Zombie count	0	1–5	> 5 or climbing
OOM kills (24h)	0	1–2	> 2 or recent
Pod restarts	< 3	3–10	> 10 or CrashLoop
CLOSE_WAIT	< 10	10–50	> 50
Node Ready	True	—	False / Unknown
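
For the rows where a higher value is worse (disk, inodes, zombies, OOM kills, pod restarts, CLOSE_WAIT), the table can be applied mechanically with a sketch like this. `classify` is a hypothetical helper name; Memory available (lower is worse), Load average (relative to core count), and Node Ready (boolean) need their own handling.

```shell
# Classify a "higher is worse" signal against thresholds from the table.
# $1 = value, $2 = lowest warning value, $3 = highest warning value.
# `classify` is a hypothetical helper name.
classify() {
    awk -v v="$1" -v warn="$2" -v crit="$3" 'BEGIN {
        v += 0; warn += 0; crit += 0
        if (v < warn) { print "normal" }
        else if (v <= crit) { print "warning" }
        else { print "critical" }
    }'
}

# Examples using rows of the table:
# classify 82 75 90    # disk or inode usage in percent
# classify 3 1 5       # zombie count
# classify 60 10 50    # CLOSE_WAIT sockets
```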

Confidence in findings:

High — direct evidence (OOM kill log, node condition set, error in service log)
Medium — indirect evidence (high memory use without OOM, rising load with no clear cause)
Low — circumstantial (aging process without other indicators)

Remediation

High memory pressure

Identify top consumers: ps aux --sort=-%mem | head -20
Check for OOM history: sudo dmesg | grep -i oom
If a workload is leaking: restart the specific pod (not the node)
If slab is high: check for inode-heavy workloads or NFS mounts
Do not drop caches unless explicitly justified — Linux reclaims page cache automatically

Disk pressure

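Find largest directories: sudo sh -c 'du -sh /var/lib/rancher/k3s/agent/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/* | sort -rh | head -20' (run through sudo sh -c so the glob expands with root privileges)
Prune unused container images: sudo k3s crictl rmi --prune (removes only images not referenced by any container; pruned images are pulled again when next needed)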
Clear old journal logs: sudo journalctl --vacuum-size=500M
Identify log-bloating pods and fix their logging config

k3s service failing

Check service status: sudo systemctl status k3s
Check logs: sudo journalctl -u k3s -n 200
Common causes: datastore corruption (SQLite by default on a single server, or embedded etcd), API server unreachable (worker), disk full, cert expiry
Do not restart k3s without understanding the cause — a restart may mask the issue

High pod restart count

Check logs: kubectl logs <pod> --previous
Check events: kubectl describe pod <pod>
Distinguish OOMKilled (memory limit) from CrashLoop (application error) from Liveness probe failure
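
A rough first guess at the restart cause can be pulled from `kubectl describe pod` output, as sketched below. The matched strings are assumptions about typical output and the check order is a design choice (a pod failing its liveness probe can also show CrashLoopBackOff); `restart_cause` is a hypothetical helper name.

```shell
# Read `kubectl describe pod <pod>` output on stdin and print a first guess
# at the restart cause. The matched strings are assumptions about typical
# output. `restart_cause` is a hypothetical helper name.
restart_cause() {
    text=$(cat)
    case "$text" in
        *OOMKilled*)               echo "oom-killed (memory limit)" ;;
        *"Liveness probe failed"*) echo "liveness-probe-failure" ;;
        *CrashLoopBackOff*)        echo "crash-loop (application error)" ;;
        *)                         echo "undetermined" ;;
    esac
}

# Usage: kubectl describe pod <pod> -n <namespace> | restart_cause
```

Treat the result as a hint and confirm it against kubectl logs <pod> --previous.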

Notes

This protocol was adapted from the sys-medic agent's structured assessment areas and the sys-medic repo's companion protocol document.
For single-node k3s clusters, the control plane (server) and data plane (agent) both run inside the single k3s service on the same host; the k3s-agent service exists only on agent-only (worker) nodes.
On hosts without kubectl in PATH, use k3s kubectl as a drop-in replacement.
Protocol version history is tracked via the version frontmatter field. Update on significant structural changes.
