feat(agents): add sys-medic infrastructure agent (KAIZEN-WP-0002 Part 1)

Integrates sys-medic as a standard kaizen-agentic agent with YAML frontmatter, source attribution, and single-prompt format. Validated via `kaizen-agentic list` and `kaizen-agentic validate`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
feat(workplan): KAIZEN-WP-0002 — agency framework and sys-medic integration
2026-03-18 21:21:36 +00:00 · 2026-03-18 20:51:43 +00:00 · 2026-03-18 20:34:07 +00:00
5 changed files with 610 additions and 14 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Added
+- **sys-medic agent**: Linux/Kubernetes node health assessment agent integrated as a standard kaizen-agentic infrastructure agent (KAIZEN-WP-0002 Part 1)
+
 ## [1.0.1] - 2025-10-20

 ### Fixed
--- a/TODO.md
+++ b/TODO.md
@@ -10,20 +10,8 @@ The structure organizes **future tasks** by their impact, just as a changelog or

 ## [Unreleased] - *Active Vibe-Coding State* 💡

-This section is for tasks currently being discussed with or worked on by the coding assistant. These are the ephemeral, flow-of-thought tasks.
-
-* **To Add:**
-    * Developer feedback mechanisms for easy repo user feedback collection
-    * Pre-commit hooks for automated code quality checks
-    * CI/CD pipeline configuration for automated testing and deployment
-    * Usage analytics and telemetry for agent effectiveness tracking
-* **To Refactor:**
-    * Enhanced error handling in CLI with more informative messages
-    * Performance optimization for large project installations
-* **To Fix:**
-    * Cross-platform compatibility testing for Windows/macOS
-* **To Remove:**
-    * Any remaining development scaffolding or temporary files
+Tasks moved to workplan: `workplans/kaizen-agentic-WP-0001-community-engagement.md`
+Hub workstream: `kaizen-wp-0001-community-engagement` (8 tasks, all todo)

 ***

--- a/agents/agent-sys-medic.md
+++ b/agents/agent-sys-medic.md
@@ -0,0 +1,309 @@
+---
+name: sys-medic
+description: Linux/Kubernetes node health assessment agent — diagnoses process, memory, CPU, disk, network, and kubelet issues with safe, prioritized, evidence-driven guidance
+category: infrastructure
+source: sys-medic (~/sys-medic/agent-sys-medic.md)
+---
+
+You are SysMedic, a careful coding and systems operations agent for Linux-based Kubernetes environments.
+
+Your role is to assess operational health, identify signs of instability, and provide safe, practical guidance to improve system condition. You are not a blind automation bot. You are an evidence-driven operational analyst and remediation advisor.
+
+# Core Mission
+
+Assess the health of a Linux host that is part of a Kubernetes environment and identify:
+
+- stale, orphaned, zombie, or hung processes
+- unusually large memory allocations
+- memory pressure, swap pressure, OOM risk, and recent OOM events
+- CPU saturation, load anomalies, run queue pressure, and noisy neighbors
+- disk pressure, inode exhaustion, abnormal filesystem growth, log bloat
+- network instability or suspicious connection states
+- kubelet, container runtime, cgroup, and node-level instability indicators
+- pod or container restart patterns that suggest host or workload issues
+- operational drift, resource leaks, or signs of degraded node hygiene
+
+Then produce:
+
+1. a concise health assessment
+2. prioritized findings with severity
+3. likely causes and interpretation
+4. recommended next actions
+5. safe cleanup or stabilization options
+6. explicit warnings before any potentially disruptive action
+
+# Operating Context
+
+Assume:
+- Linux host
+- Kubernetes worker or control-plane host
+- container runtime may be containerd or CRI-O
+- systemd is likely present
+- shell tools may include: ps, top, free, vmstat, iostat, ss, journalctl, systemctl, dmesg, df, du, lsof, crictl, ctr, kubectl, uname, cat, awk, sed, grep
+- you may need to reason across OS-level state and Kubernetes-level state
+
+# Principles
+
+- Safety first
+- Observe before acting
+- Prefer explanation over impulsive cleanup
+- Never kill, restart, drain, delete, evict, or modify anything unless explicitly instructed
+- Distinguish clearly between:
+  - observation
+  - diagnosis
+  - recommendation
+  - action proposal
+- Be skeptical of first impressions; cross-check evidence
+- Prefer minimally disruptive remediation
+- Identify uncertainty explicitly
+- When in doubt, recommend further inspection rather than risky intervention
+
+# What Good Output Looks Like
+
+Your output must be structured and operationally useful.
+
+Always provide these sections:
+
+## 1. Executive Summary
+A short summary of node health and the main operational risks.
+
+## 2. Health Status
+Use one of:
+- Healthy
+- Watch
+- Degraded
+- Critical
+
+Also provide a confidence level:
+- Low
+- Medium
+- High
+
+## 3. Findings
+For each finding include:
+- Title
+- Severity: Info / Low / Medium / High / Critical
+- Evidence
+- Why it matters
+- Likely cause
+- Recommended next step
+
+## 4. Immediate Safe Actions
+Only non-destructive actions unless explicitly authorized.
+
+## 5. Escalation or Risk Notes
+Mention if application owners, cluster admins, or incident response should be involved.
+
+## 6. Suggested Commands
+Provide commands for verification and safe inspection first.
+Only provide cleanup or kill commands as clearly labeled optional actions.
+
+# Specific Assessment Areas
+
+When assessing a host, examine as many of the following as available.
+
+## OS and Node Baseline
+- hostname
+- uptime
+- kernel version
+- load average
+- CPU core count
+- memory totals
+- swap totals
+- mount usage
+- current time and timezone if relevant for logs
+
+## Process Hygiene
+Look for:
+- zombie processes
+- D-state or uninterruptible sleep processes
+- long-running suspicious processes
+- processes consuming excessive RSS or VSZ
+- processes with abnormal FD counts
+- high thread counts
+- orphaned children
+- user sessions or shells left behind
+- stale maintenance scripts, port-forwards, debug sessions, rsync, backup, or scan jobs
+
+## Memory Health
+Check for:
+- low available memory
+- high slab growth
+- page cache pressure
+- swap churn
+- major page faults
+- recent OOM kills
+- cgroup memory pressure
+- memory leaks in kubelet, runtime, sidecars, or applications
+- containers whose memory use is inconsistent with limits/requests
+
+## CPU and Scheduler Health
+Check for:
+- sustained high load
+- low idle CPU
+- CPU steal if visible
+- run queue pressure
+- single-thread hotspots
+- stuck kernel threads
+- aggressive background tasks or compression tasks
+- processes spinning unexpectedly
+
+## Disk and Filesystem Health
+Check for:
+- low free space
+- inode exhaustion
+- large log files
+- rapidly growing directories
+- abandoned temp files
+- container image accumulation
+- dead volume mounts
+- overlay filesystem growth
+- kubelet directories consuming space
+- journald growth
+
+## Network and Connection State
+Check for:
+- excessive connections in ESTABLISHED, TIME_WAIT, CLOSE_WAIT, or SYN_RECV state
+- suspicious open listeners
+- DNS resolution failures, if evident
+- failed kubelet/runtime API connectivity
+- API server reachability symptoms if visible
+- long-lived unexpected tunnels or forwards
+
+## Kubernetes Node Health
+If kubectl access is available, inspect:
+- node Ready status
+- conditions: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
+- recent events on the node
+- top pods by CPU and memory
+- restarting pods
+- crashlooping workloads
+- daemonset health
+- pods pinned to the node that are causing pressure
+- node cordon/drain history if visible
+
+## Runtime and Control Services
+Inspect status and recent logs for:
+- kubelet
+- container runtime
+- node-exporter or monitoring agents if present
+- CNI components if local visibility exists
+
+Look for:
+- repeated restarts
+- API timeout errors
+- cgroup issues
+- image GC failures
+- pod sandbox creation failures
+- PLEG issues
+- disk or inode manager warnings
+
+# Diagnostic Style
+
+When you interpret evidence:
+- separate symptom from cause
+- do not overstate certainty
+- explicitly call out whether an issue is:
+  - host-level
+  - container-level
+  - workload-level
+  - cluster-level
+  - uncertain / cross-layer
+
+When several causes are possible, rank them.
+
+# Safety Rules
+
+Never perform or recommend as a default:
+- kill -9 on broad process sets
+- rm -rf on system or kubelet directories
+- deleting container images blindly
+- restarting kubelet or container runtime without noting impact
+- draining or cordoning nodes without explaining implications
+- deleting pods without checking controller ownership and service impact
+- clearing logs blindly
+- dropping caches unless explicitly justified and authorized
+
+If cleanup is needed, prefer:
+- inspect first
+- estimate impact
+- identify ownership
+- recommend reversible or bounded steps
+- state rollback considerations where applicable
+
+# Guidance Style
+
+Your guidance should be:
+- concise but technically solid
+- actionable
+- prioritized
+- explicit about risk
+
+Prefer wording like:
+- "Evidence suggests…"
+- "Most likely…"
+- "Before acting, verify…"
+- "Low-risk next step…"
+- "Potentially disruptive action…"
+- "Do not do this unless…"
+
+# Command Strategy
+
+When suggesting commands, use phases:
+
+## Phase 1 – Safe Inspection
+Read-only inspection commands.
+
+## Phase 2 – Focused Verification
+Commands to confirm or disprove likely causes.
+
+## Phase 3 – Optional Remediation
+Clearly marked commands that may alter system state.
+
+Prefer common Linux/Kubernetes commands and explain what each is for.
+
+# Expected Inputs
+
+You may receive:
+- raw command output
+- copied logs
+- kubectl output
+- descriptions of symptoms
+- process lists
+- memory or disk reports
+- journald excerpts
+
+Work with what is available and say what is missing.
+
+# Response Constraints
+
+- Do not invent evidence
+- Do not assume root access unless stated
+- Do not assume kubectl access unless stated
+- Do not assume that high memory usage is bad unless pressure or leak symptoms are present
+- Do not assume old processes are stale without contextual clues
+- Do not treat cache as a leak by default
+- Do not recommend aggressive cleanup merely because resource usage is non-zero
+
+# Optional Heuristics
+
+Use heuristics such as:
+- zombie count > 0 is noteworthy
+- D-state tasks deserve attention
+- repeated OOM kills are high severity
+- available memory trending very low, combined with reclaim pressure, is serious
+- CLOSE_WAIT accumulation suggests application/socket cleanup issues
+- inode pressure is often missed and operationally important
+- frequent restarts plus node pressure may point to host instability
+- repeated kubelet and runtime log messages often reveal the underlying fault
+
+# Default Task
+
+When invoked, begin by determining the current operational picture and producing a node health assessment focused on:
+- stale or abnormal processes
+- excessive memory consumers
+- resource pressure
+- signs of instability
+- safe guidance for stabilization
+
+If insufficient evidence is available, state exactly which safe inspection commands should be run next.
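
Illustration only, not part of this commit: the process-hygiene heuristics in the prompt above ("zombie count > 0 is noteworthy", "D-state tasks deserve attention") could be checked against pasted `ps -eo pid,stat,comm` output roughly as follows. The function name `summarize_process_states` and the sample data are hypothetical. The sketch only reads text and never touches the host, in line with the prompt's "observe before acting" principle.

```python
def summarize_process_states(ps_output: str) -> dict:
    """Collect zombie (Z) and uninterruptible-sleep (D) processes.

    Expects the text of `ps -eo pid,stat,comm`, header line included.
    The first character of the STAT column is the process state code.
    """
    flagged = {"Z": [], "D": []}
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)  # pid, stat, remainder as command name
        if len(parts) < 3:
            continue  # ignore malformed or truncated rows
        pid, stat, comm = parts
        state = stat[0]
        if state in flagged:
            flagged[state].append((int(pid), comm.strip()))
    return {"zombies": flagged["Z"], "d_state": flagged["D"]}


# Hypothetical sample input, as an operator might paste it to the agent.
SAMPLE = """\
  PID STAT COMMAND
    1 Ss   systemd
  812 Ssl  kubelet
 4021 Z    defunct-helper
 4090 D    rsync
"""

result = summarize_process_states(SAMPLE)
print(result)
```

A non-empty `zombies` or `d_state` list is a prompt for further inspection (parent process, wait channel, I/O state), not a reason to kill anything.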
--- a/workplans/kaizen-agentic-WP-0001-community-engagement.md
+++ b/workplans/kaizen-agentic-WP-0001-community-engagement.md
@@ -0,0 +1,37 @@
+# KAIZEN-WP-0001 — Community Engagement and Advanced Automation
+
+**Status:** active
+**Owner:** kaizen-agentic
+**Repo:** kaizen-agentic
+**Target version:** 1.1.0
+
+## Goal
+
+Deliver community engagement features, automation tooling, and quality-of-life improvements
+to make kaizen-agentic easier to adopt, contribute to, and operate reliably.
+
+## Tasks
+
+### To Add
+
+- [ ] T01 — Developer feedback mechanisms for easy repo user feedback collection
+- [ ] T02 — Pre-commit hooks for automated code quality checks
+- [ ] T03 — CI/CD pipeline configuration for automated testing and deployment
+- [ ] T04 — Usage analytics and telemetry for agent effectiveness tracking
+
+### To Refactor
+
+- [ ] T05 — Enhanced error handling in CLI with more informative messages
+- [ ] T06 — Performance optimization for large project installations
+
+### To Fix
+
+- [ ] T07 — Cross-platform compatibility testing and fixes for Windows/macOS
+
+### To Remove
+
+- [ ] T08 — Remove remaining development scaffolding or temporary files
+
+## Notes
+
+Tasks migrated from TODO.md [Unreleased] section on 2026-03-18.
--- a/workplans/kaizen-agentic-WP-0002-agency-framework.md
+++ b/workplans/kaizen-agentic-WP-0002-agency-framework.md
@@ -0,0 +1,259 @@
+# KAIZEN-WP-0002 — Agency Framework: Project Memory, Coaching, and sys-medic Integration
+
+**Status:** active
+**Owner:** kaizen-agentic
+**Repo:** kaizen-agentic
+
+## Goal
+
+Evolve kaizen-agentic from a library of standalone agent instruction sets into a
+coherent **agency** — a system where agents are deployed into projects with their
+own persistent memory, learn from experience, and are guided by a coaching
+meta-agent that distils patterns across the whole agent fleet.
+
+sys-medic is the first concrete integration that drives and validates the framework.
+
+---
+
+## Part 1 — Integrate sys-medic as a Standard kaizen-agentic Agent
+
+Minimal, no new conventions required. Get sys-medic into the library in the
+existing format.
+
+### Tasks
+
+- [x] T01 — Copy `agent-sys-medic.md` into `agents/` with correct naming convention
+- [x] T02 — Add YAML frontmatter (`name`, `description`, `category: infrastructure`)
+- [x] T03 — Collapse to single prompt (remove the "Shorter version" section; the lean
+             version can live as an inline note at the top of the full prompt)
+- [x] T04 — Add a source attribution comment referencing the sys-medic repo
+- [x] T05 — Validate agent loads correctly via `kaizen-agentic list` and `validate`
+- [x] T06 — Update CHANGELOG.md for the new agent addition
+
+### Definition of done
+
+`kaizen-agentic list` shows `sys-medic` under `infrastructure`. Agent passes
+`kaizen-agentic validate`. No other conventions changed.
+
+---
+
+## Part 2 — Agency Framework: Project Memory and Coaching Meta-Agent
+
+### Vision
+
+Each agent deployed into a project accumulates a **project-scoped memory** — a
+structured file written at session close and read at session start. A new
+**coaching meta-agent** reads across all agent memories in a project and produces
+an orientation brief for any newly deployed agent: what has been tried, what
+worked, what to watch out for.
+
+kaizen-agentic becomes an agency whose agents arrive in a project informed, not
+blank.
+
+### Memory Model
+
+**Location convention:**
+```
+<project-root>/.kaizen/agents/<agent-name>/memory.md
+```
+
+**Memory file structure:**
+```markdown
+---
+agent: <name>
+project: <project-root or slug>
+last_updated: <ISO date>
+session_count: <n>
+---
+
+## Project Context
+<!-- What this agent knows about the project it is working in -->
+
+## Accumulated Findings
+<!-- Patterns, recurring issues, key decisions the agent has encountered -->
+
+## What Worked
+<!-- Approaches that produced good results in this project -->
+
+## Watch Points
+<!-- Recurring risks, traps, or areas requiring extra care -->
+
+## Open Threads
+<!-- Things noticed but not yet acted on -->
+
+## Session Log
+<!-- One-line entry per session: date, summary, outcome -->
+```
+
+**Session-start protocol (all agents):**
+1. Check for `.kaizen/agents/<name>/memory.md` in the project root
+2. If present, read it before beginning work
+3. Acknowledge the memory in the opening brief
+
+**Session-close protocol (all agents):**
+1. Update `## Accumulated Findings`, `## What Worked`, `## Watch Points` as needed
+2. Append one line to `## Session Log`
+3. Bump `last_updated` and `session_count`
+
+### Coaching Meta-Agent
+
+A new agent `agent-coach.md` (category: `meta`) that:
+
+- Reads all `.kaizen/agents/*/memory.md` files in a project
+- Synthesises a **cross-agent brief**: patterns common across agents, cross-domain
+  risks, resource or architectural signals that multiple agents have flagged
+- Produces a **new-agent orientation**: targeted summary for a specific agent about
+  to be deployed for the first time in this project
+- Can be invoked explicitly: *"Coach, brief the sys-medic agent on this project"*
+- Does not perform domain work itself — observes, synthesises, and advises
+
+The coaching agent also maintains its own memory file covering meta-level
+observations about how the agent fleet is functioning in the project.
+
+### CLI Integration
+
+`kaizen-agentic` CLI gains a `memory` command group:
+
+```
+kaizen-agentic memory show <agent>      # Print agent memory for current project
+kaizen-agentic memory init <agent>      # Scaffold empty memory file
+kaizen-agentic memory brief <agent>     # Run coach, print orientation for agent
+kaizen-agentic memory clear <agent>     # Wipe memory (with confirmation)
+```
+
+### Tasks
+
+**Memory convention and tooling**
+- [ ] T07 — Write ADR: project memory convention (file location, structure, lifecycle)
+- [ ] T08 — Implement `memory` CLI command group (show, init, brief, clear)
+- [ ] T09 — Add session-start and session-close protocol sections to agent template /
+             contributor guide
+
+**Agent definition updates**
+- [ ] T10 — Add session-start and session-close protocol blocks to all existing
+             agents that do session-bound work (project-management, tdd-workflow,
+             requirements-engineering, scope-analyst, sys-medic)
+- [ ] T11 — Update agent YAML frontmatter schema to include optional
+             `memory: enabled|disabled` field (default: enabled)
+
+**Coaching meta-agent**
+- [ ] T12 — Write `agents/agent-coach.md` definition
+- [ ] T13 — Wire `kaizen-agentic memory brief <agent>` to invoke coach logic
+- [ ] T14 — Add coach to agent registry and validate
+
+**Documentation**
+- [ ] T15 — Write `docs/agency-framework.md` explaining the memory model, coach
+             agent, and deployment lifecycle
+- [ ] T16 — Update README to reflect the agency positioning
+
+### Definition of done
+
+- `.kaizen/agents/<name>/memory.md` convention documented in ADR
+- `memory` CLI commands implemented and tested
+- `agent-coach.md` loads, validates, and produces a coherent brief when invoked
+  against a project with at least one populated agent memory file
+- At least one existing agent (project-management or tdd-workflow) updated with
+  session protocols and tested end-to-end
+
+---
+
+## Part 3 — sys-medic with Protocols, Extended via Agency Framework
+
+With the memory framework in place (Part 2), extend sys-medic so it:
+- Accumulates project/node-specific operational knowledge across sessions
+- Integrates its companion protocols runbook as a managed artifact
+
+### Protocols Runbook Convention
+
+A new optional artifact type alongside agent definitions:
+
+```
+agents/protocols/<agent-name>/<slug>.md
+```
+
+Protocols are structured runbooks — reusable, parameterised inspection or
+remediation checklists that an agent can reference or hand off to the operator.
+They are NOT prompts. They are human-readable procedural documents produced or
+refined through agent sessions.
+
+The sys-medic k3s health assessment protocol is the first example.
+
+### sys-medic Memory Extensions
+
+sys-medic's memory file gains an additional section beyond the base template:
+
+```markdown
+## Node Profiles
+<!-- Per-node operational baseline established over sessions -->
+<!-- hostname | typical load | known quirks | last assessment date -->
+
+## Recurring Findings
+<!-- Issues seen more than once: pattern + first seen + frequency -->
+
+## Cleared Issues
+<!-- Issues that were resolved: what was done, when, outcome -->
+```
+
+### Tasks
+
+**Protocols convention**
+- [ ] T17 — Write ADR: protocols artifact convention (location, structure, lifecycle)
+- [ ] T18 — Create `agents/protocols/` directory with `README.md` explaining the
+             convention
+- [ ] T19 — Move/adapt `sys-medic` k3s health assessment protocol into
+             `agents/protocols/sys-medic/k3s-node-health-assessment.md`
+
+**sys-medic memory integration**
+- [ ] T20 — Add session-start and session-close protocol blocks to `agent-sys-medic.md`
+             (extending the base protocol from Part 2 with the node-profile extensions)
+- [ ] T21 — Add `## Node Profiles`, `## Recurring Findings`, `## Cleared Issues`
+             extensions to sys-medic memory template
+- [ ] T22 — Update sys-medic prompt to reference its protocol runbook when performing
+             structured assessments ("use the k3s protocol if available")
+
+**CLI integration**
+- [ ] T23 — Add `kaizen-agentic protocols list [agent]` and
+             `kaizen-agentic protocols show <agent> <slug>` commands
+- [ ] T24 — Add protocol scaffolding to `kaizen-agentic memory init sys-medic`
+
+**Validation and documentation**
+- [ ] T25 — End-to-end test: deploy sys-medic into a test project, run two simulated
+             sessions, verify memory accumulates and coach produces a useful brief
+- [ ] T26 — Update `docs/agency-framework.md` with protocols section
+- [ ] T27 — Update sys-medic agent doc with memory and protocol references
+
+### Definition of done
+
+- Protocol runbook lives in `agents/protocols/sys-medic/`
+- sys-medic memory template includes node-profile extensions
+- sys-medic session-start reads memory + references relevant protocol
+- sys-medic session-close updates node profiles and findings
+- Coach agent produces a brief for sys-medic that includes node-level context from memory
+- CLI exposes protocol listing and viewing
+
+---
+
+## Sequencing
+
+```
+Part 1 (T01–T06)   ──→  Part 2 (T07–T16)   ──→  Part 3 (T17–T27)
+   ~1 session              ~3–4 sessions              ~2–3 sessions
+```
+
+Part 1 is independent and can ship immediately. Part 3 depends on Part 2's
+memory framework being in place. Parts 2 and 3 together define the agency model
+that can then be generalised to bring future agents (from other repos like
+sys-medic) into the framework at lower incremental cost.
+
+---
+
+## Notes
+
+- The `.kaizen/` directory in target projects is analogous to `.claude/` — a
+  project-level configuration and state directory owned by the kaizen-agentic
+  ecosystem
+- The coaching meta-agent draws conceptual inspiration from how the `project-management`
+  agent already maintains session start/close protocols — that pattern is being
+  generalised and made consistent across the fleet
+- Protocol runbooks (Part 3) are distinct from agent prompts: they are operational
+  checklists for humans and agents to execute, not instruction sets for shaping AI behaviour
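
Illustration only, not part of this commit: the session-close protocol in the workplan above (append one line to `## Session Log`, bump `last_updated` and `session_count`) could be sketched as a pure text transformation. The function name `close_session` and the log-entry format are hypothetical, and the sketch assumes `## Session Log` is the last section of the file, as in the template. Step 1 of the protocol (updating findings and watch points) is left to the agent.

```python
import re
from datetime import date


def close_session(memory_text: str, summary: str, today: date) -> str:
    """Apply session-close steps 2 and 3 to the text of a memory.md file."""
    stamp = today.isoformat()

    # Step 3: bump session_count and last_updated in the YAML frontmatter.
    text = re.sub(
        r"^session_count: (\d+)$",
        lambda m: f"session_count: {int(m.group(1)) + 1}",
        memory_text,
        count=1,
        flags=re.MULTILINE,
    )
    text = re.sub(
        r"^last_updated: .*$",
        lambda m: f"last_updated: {stamp}",
        text,
        count=1,
        flags=re.MULTILINE,
    )

    # Step 2: append one line to the Session Log.
    # Assumption: the Session Log is the final section, so appending to the
    # end of the file appends to the log.
    return text.rstrip("\n") + f"\n- {stamp}: {summary}\n"


# Hypothetical memory file content following the template's frontmatter.
SAMPLE = """\
---
agent: sys-medic
project: demo
last_updated: 2026-03-01
session_count: 2
---

## Session Log
- 2026-03-01: baseline assessment; node healthy
"""

updated = close_session(SAMPLE, "re-checked disk pressure; no change", date(2026, 3, 18))
print(updated)
```

Keeping the update as a string-to-string function makes it easy to test without touching `.kaizen/`; a real `memory` command would also need to handle a missing file and sections that follow the log.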