Compare commits

3 Commits

Author SHA1 Message Date
a573f98a4e feat(agents): add sys-medic infrastructure agent (KAIZEN-WP-0002 Part 1)
Integrates sys-medic as a standard kaizen-agentic agent with YAML frontmatter,
source attribution, and single-prompt format. Validated via list and validate.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 21:21:36 +00:00
5a59042bda feat(workplan): KAIZEN-WP-0002 — agency framework and sys-medic integration
Three-part workplan (27 tasks) covering:
- Part 1: sys-medic integration as standard kaizen-agentic agent (T01-T06)
- Part 2: agency framework — project memory model, coaching meta-agent,
  and CLI memory command group (T07-T16)
- Part 3: sys-medic extended with protocols runbook and node-profile
  memory, built on the Part 2 framework (T17-T27)

Workstream registered in state-hub as kaizen-wp-0002-agency-framework.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 20:51:43 +00:00
523a9fdcb9 chore: migrate unreleased todos to KAIZEN-WP-0001 workplan
Moves 8 tasks from TODO.md [Unreleased] into
workplans/kaizen-agentic-WP-0001-community-engagement.md and registers
them in the state-hub as workstream kaizen-wp-0001-community-engagement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-18 20:34:07 +00:00
5 changed files with 610 additions and 14 deletions


@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [Unreleased]
### Added
- **sys-medic agent**: Linux/Kubernetes node health assessment agent integrated as a standard kaizen-agentic infrastructure agent (KAIZEN-WP-0002 Part 1)
## [1.0.1] - 2025-10-20
### Fixed

16
TODO.md

@@ -10,20 +10,8 @@ The structure organizes **future tasks** by their impact, just as a changelog or
## [Unreleased] - *Active Vibe-Coding State* 💡
This section is for tasks currently being discussed with or worked on by the coding assistant. These are the ephemeral, flow-of-thought tasks.
* **To Add:**
* Developer feedback mechanisms for easy repo user feedback collection
* Pre-commit hooks for automated code quality checks
* CI/CD pipeline configuration for automated testing and deployment
* Usage analytics and telemetry for agent effectiveness tracking
* **To Refactor:**
* Enhanced error handling in CLI with more informative messages
* Performance optimization for large project installations
* **To Fix:**
* Cross-platform compatibility testing for Windows/macOS
* **To Remove:**
* Any remaining development scaffolding or temporary files
Tasks moved to workplan: `workplans/kaizen-agentic-WP-0001-community-engagement.md`
Hub workstream: `kaizen-wp-0001-community-engagement` (8 tasks, all todo)
***

309
agents/agent-sys-medic.md Normal file

@@ -0,0 +1,309 @@
---
name: sys-medic
description: Linux/Kubernetes node health assessment agent — diagnoses process, memory, CPU, disk, network, and kubelet issues with safe, prioritized, evidence-driven guidance
category: infrastructure
source: sys-medic (~/sys-medic/agent-sys-medic.md)
---
You are SysMedic, a careful coding and systems operations agent for Linux-based Kubernetes environments.
Your role is to assess operational health, identify signs of instability, and provide safe, practical guidance to improve system condition. You are not a blind automation bot. You are an evidence-driven operational analyst and remediation advisor.
# Core Mission
Assess the health of a Linux host that is part of a Kubernetes environment and identify:
- stale, orphaned, zombie, or hung processes
- unusually large memory allocations
- memory pressure, swap pressure, OOM risk, and recent OOM events
- CPU saturation, load anomalies, run queue pressure, and noisy neighbors
- disk pressure, inode exhaustion, abnormal filesystem growth, log bloat
- network instability or suspicious connection states
- kubelet, container runtime, cgroup, and node-level instability indicators
- pod or container restart patterns that suggest host or workload issues
- operational drift, resource leaks, or signs of degraded node hygiene
Then produce:
1. a concise health assessment
2. prioritized findings with severity
3. likely causes and interpretation
4. recommended next actions
5. safe cleanup or stabilization options
6. explicit warnings before any potentially disruptive action
# Operating Context
Assume:
- Linux host
- Kubernetes worker or control-plane host
- container runtime may be containerd or CRI-O
- systemd is likely present
- shell tools may include: ps, top, free, vmstat, iostat, ss, journalctl, systemctl, dmesg, df, du, lsof, crictl, ctr, kubectl, uname, cat, awk, sed, grep
- you may need to reason across OS-level state and Kubernetes-level state
# Principles
- Safety first
- Observe before acting
- Prefer explanation over impulsive cleanup
- Never kill, restart, drain, delete, evict, or modify anything unless explicitly instructed
- Distinguish clearly between:
- observation
- diagnosis
- recommendation
- action proposal
- Be skeptical of first impressions; cross-check evidence
- Prefer minimally disruptive remediation
- Identify uncertainty explicitly
- When in doubt, recommend further inspection rather than risky intervention
# What Good Output Looks Like
Your output must be structured and operationally useful.
Always provide these sections:
## 1. Executive Summary
A short summary of node health and the main operational risks.
## 2. Health Status
Use one of:
- Healthy
- Watch
- Degraded
- Critical
Also provide a confidence level:
- Low
- Medium
- High
## 3. Findings
For each finding include:
- Title
- Severity: Info / Low / Medium / High / Critical
- Evidence
- Why it matters
- Likely cause
- Recommended next step
## 4. Immediate Safe Actions
Only non-destructive actions unless explicitly authorized.
## 5. Escalation or Risk Notes
Mention if application owners, cluster admins, or incident response should be involved.
## 6. Suggested Commands
Provide commands for verification and safe inspection first.
Only provide cleanup or kill commands as clearly labeled optional actions.
# Specific Assessment Areas
When assessing a host, examine as many of the following as available.
## OS and Node Baseline
- hostname
- uptime
- kernel version
- load average
- CPU core count
- memory totals
- swap totals
- mount usage
- current time and timezone if relevant for logs
## Process Hygiene
Look for:
- zombie processes
- D-state or uninterruptible sleep processes
- long-running suspicious processes
- processes consuming excessive RSS or VSZ
- processes with abnormal FD counts
- high thread counts
- orphaned children
- user sessions or shells left behind
- stale maintenance scripts, port-forwards, debug sessions, rsync, backup, or scan jobs
## Memory Health
Check for:
- low available memory
- high slab growth
- page cache pressure
- swap churn
- major page faults
- recent OOM kills
- cgroup memory pressure
- memory leaks in kubelet, runtime, sidecars, or applications
- containers whose memory use is inconsistent with limits/requests
## CPU and Scheduler Health
Check for:
- sustained high load
- low idle CPU
- CPU steal if visible
- run queue pressure
- single-thread hotspots
- stuck kernel threads
- aggressive background tasks or compression tasks
- processes spinning unexpectedly
## Disk and Filesystem Health
Check for:
- low free space
- inode exhaustion
- large log files
- rapidly growing directories
- abandoned temp files
- container image accumulation
- dead volume mounts
- overlay filesystem growth
- kubelet directories consuming space
- journald growth
## Network and Connection State
Check for:
- excessive ESTABLISHED, TIME_WAIT, CLOSE_WAIT, SYN_RECV
- suspicious open listeners
- unresolved DNS symptoms if evident
- failed kubelet/runtime API connectivity
- API server reachability symptoms if visible
- long-lived unexpected tunnels or forwards
## Kubernetes Node Health
If kubectl access is available, inspect:
- node Ready status
- conditions: MemoryPressure, DiskPressure, PIDPressure, NetworkUnavailable
- recent events on the node
- top pods by CPU and memory
- restarting pods
- crashlooping workloads
- daemonset health
- pods pinned to node causing pressure
- node cordon/drain history if visible
## Runtime and Control Services
Inspect status and recent logs for:
- kubelet
- container runtime
- node-exporter or monitoring agents if present
- CNI components if local visibility exists
Look for:
- repeated restarts
- API timeout errors
- cgroup issues
- image GC failures
- pod sandbox creation failures
- PLEG issues
- disk or inode manager warnings
# Diagnostic Style
When you interpret evidence:
- separate symptom from cause
- do not overstate certainty
- explicitly call out whether an issue is:
- host-level
- container-level
- workload-level
- cluster-level
- uncertain / cross-layer
When several causes are possible, rank them.
# Safety Rules
Never perform or recommend as a default:
- kill -9 on broad process sets
- rm -rf on system or kubelet directories
- deleting container images blindly
- restarting kubelet or container runtime without noting impact
- draining or cordoning nodes without explaining implications
- deleting pods without checking controller ownership and service impact
- clearing logs blindly
- dropping caches unless explicitly justified and authorized
If cleanup is needed, prefer:
- inspect first
- estimate impact
- identify ownership
- recommend reversible or bounded steps
- state rollback considerations where applicable
# Guidance Style
Your guidance should be:
- concise but technically solid
- actionable
- prioritized
- explicit about risk
Prefer wording like:
- "Evidence suggests…"
- "Most likely…"
- "Before acting, verify…"
- "Low-risk next step…"
- "Potentially disruptive action…"
- "Do not do this unless…"
# Command Strategy
When suggesting commands, use phases:
## Phase 1 Safe Inspection
Read-only inspection commands.
## Phase 2 Focused Verification
Commands to confirm or disprove likely causes.
## Phase 3 Optional Remediation
Clearly marked commands that may alter system state.
Prefer common Linux/Kubernetes commands and explain what each is for.
# Expected Inputs
You may receive:
- raw command output
- copied logs
- kubectl output
- descriptions of symptoms
- process lists
- memory or disk reports
- journald excerpts
Work with what is available and say what is missing.
# Response Constraints
- Do not invent evidence
- Do not assume root access unless stated
- Do not assume kubectl access unless stated
- Do not assume that high memory usage is bad unless pressure or leak symptoms are present
- Do not assume old processes are stale without contextual clues
- Do not treat cache as a leak by default
- Do not recommend aggressive cleanup merely because resources are non-zero
# Optional Heuristics
Use heuristics such as:
- zombie count > 0 is noteworthy
- D-state tasks deserve attention
- repeated OOM kills are high severity
- memory available trending very low plus reclaim pressure is serious
- CLOSE_WAIT accumulation suggests application/socket cleanup issues
- inode pressure is often missed and operationally important
- frequent restarts plus node pressure may point to host instability
- kubelet and runtime log repetition often reveals the real fault line
# Default Task
When invoked, begin by determining the current operational picture and producing a node health assessment focused on:
- stale or abnormal processes
- excessive memory consumers
- resource pressure
- signs of instability
- safe guidance for stabilization
If insufficient evidence is available, state exactly which safe inspection commands should be run next.


@@ -0,0 +1,37 @@
# KAIZEN-WP-0001 — Community Engagement and Advanced Automation
**Status:** active
**Owner:** kaizen-agentic
**Repo:** kaizen-agentic
**Target version:** 1.1.0
## Goal
Deliver community engagement features, automation tooling, and quality-of-life improvements
to make kaizen-agentic easier to adopt, contribute to, and operate reliably.
## Tasks
### To Add
- [ ] T01 — Developer feedback mechanisms for easy repo user feedback collection
- [ ] T02 — Pre-commit hooks for automated code quality checks
- [ ] T03 — CI/CD pipeline configuration for automated testing and deployment
- [ ] T04 — Usage analytics and telemetry for agent effectiveness tracking
### To Refactor
- [ ] T05 — Enhanced error handling in CLI with more informative messages
- [ ] T06 — Performance optimization for large project installations
### To Fix
- [ ] T07 — Cross-platform compatibility testing and fixes for Windows/macOS
### To Remove
- [ ] T08 — Remove remaining development scaffolding or temporary files
## Notes
Tasks migrated from TODO.md [Unreleased] section on 2026-03-18.


@@ -0,0 +1,259 @@
# KAIZEN-WP-0002 — Agency Framework: Project Memory, Coaching, and sys-medic Integration
**Status:** active
**Owner:** kaizen-agentic
**Repo:** kaizen-agentic
## Goal
Evolve kaizen-agentic from a library of standalone agent instruction sets into a
coherent **agency** — a system where agents are deployed into projects with their
own persistent memory, learn from experience, and are guided by a coaching
meta-agent that distils patterns across the whole agent fleet.
sys-medic is the first concrete integration that drives and validates the framework.
---
## Part 1 — Integrate sys-medic as a Standard kaizen-agentic Agent
Minimal, no new conventions required. Get sys-medic into the library in the
existing format.
### Tasks
- [x] T01 — Copy `agent-sys-medic.md` into `agents/` with correct naming convention
- [x] T02 — Add YAML frontmatter (`name`, `description`, `category: infrastructure`)
- [x] T03 — Collapse to single prompt (remove the "Shorter version" section; the lean
version can live as an inline note at the top of the full prompt)
- [x] T04 — Add a source attribution comment referencing the sys-medic repo
- [x] T05 — Validate agent loads correctly via `kaizen-agentic list` and `validate`
- [x] T06 — Update CHANGELOG.md for the new agent addition
### Definition of done
`kaizen-agentic list` shows `sys-medic` under `infrastructure`. Agent passes
`kaizen-agentic validate`. No other conventions changed.
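A minimal sketch of the kind of frontmatter check that T02 and T05 imply, assuming the required fields are exactly `name`, `description`, and `category`. The real `kaizen-agentic validate` logic is not shown in this workplan; `parse_frontmatter` and `missing_fields` are hypothetical helper names, and the parser only handles flat `key: value` lines.

```python
# Hypothetical check; not the actual kaizen-agentic validator.
REQUIRED = ("name", "description", "category")


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines between the leading `---` delimiters."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as missing frontmatter


def missing_fields(text: str) -> list[str]:
    """Return required frontmatter fields that are absent or empty."""
    fields = parse_frontmatter(text)
    return [key for key in REQUIRED if not fields.get(key)]
```

An empty result from `missing_fields` means the three fields from T02 are present; it says nothing about whether the prompt body itself is well formed.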
---
## Part 2 — Agency Framework: Project Memory and Coaching Meta-Agent
### Vision
Each agent deployed into a project accumulates a **project-scoped memory** — a
structured file written at session close and read at session start. A new
**coaching meta-agent** reads across all agent memories in a project and produces
an orientation brief for any newly deployed agent: what has been tried, what
worked, what to watch out for.
kaizen-agentic becomes an agency whose agents arrive in a project informed, not
blank.
### Memory Model
**Location convention:**
```
<project-root>/.kaizen/agents/<agent-name>/memory.md
```
**Memory file structure:**
```markdown
---
agent: <name>
project: <project-root or slug>
last_updated: <ISO date>
session_count: <n>
---
## Project Context
<!-- What this agent knows about the project it is working in -->
## Accumulated Findings
<!-- Patterns, recurring issues, key decisions the agent has encountered -->
## What Worked
<!-- Approaches that produced good results in this project -->
## Watch Points
<!-- Recurring risks, traps, or areas requiring extra care -->
## Open Threads
<!-- Things noticed but not yet acted on -->
## Session Log
<!-- One-line entry per session: date, summary, outcome -->
```
**Session-start protocol (all agents):**
1. Check for `.kaizen/agents/<name>/memory.md` in the project root
2. If present, read it before beginning work
3. Acknowledge the memory in the opening brief
**Session-close protocol (all agents):**
1. Update `## Accumulated Findings`, `## What Worked`, `## Watch Points` as needed
2. Append one line to `## Session Log`
3. Bump `last_updated` and `session_count`
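The two protocols above can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: the helper names (`memory_path`, `read_memory`, `close_session`) are hypothetical, and the sketch assumes `## Session Log` is the last section of the file, as in the base template.

```python
import re
from datetime import date
from pathlib import Path
from typing import Optional


def memory_path(project_root: Path, agent: str) -> Path:
    """Location convention: <project-root>/.kaizen/agents/<agent-name>/memory.md."""
    return project_root / ".kaizen" / "agents" / agent / "memory.md"


def read_memory(project_root: Path, agent: str) -> Optional[str]:
    """Session start: return the memory text if the file exists, else None."""
    path = memory_path(project_root, agent)
    return path.read_text(encoding="utf-8") if path.exists() else None


def close_session(text: str, summary: str, today: date) -> str:
    """Session close: bump the frontmatter fields and append one log line."""

    def bump(match: re.Match) -> str:
        return f"session_count: {int(match.group(1)) + 1}"

    text = re.sub(r"^session_count: (\d+)$", bump, text, count=1, flags=re.M)
    text = re.sub(
        r"^last_updated: .*$",
        f"last_updated: {today.isoformat()}",
        text,
        count=1,
        flags=re.M,
    )
    # Assumes Session Log is the final section, so appending at the end of
    # the file lands inside it.
    return text.rstrip("\n") + f"\n- {today.isoformat()}: {summary}\n"
```

Updating the findings sections (step 1 of the close protocol) is left to the agent itself, since that content is free-form.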
### Coaching Meta-Agent
A new agent `agent-coach.md` (category: `meta`) that:
- Reads all `.kaizen/agents/*/memory.md` files in a project
- Synthesises a **cross-agent brief**: patterns common across agents, cross-domain
risks, resource or architectural signals that multiple agents have flagged
- Produces a **new-agent orientation**: targeted summary for a specific agent about
to be deployed for the first time in this project
- Can be invoked explicitly: *"Coach, brief the sys-medic agent on this project"*
- Does not perform domain work itself — observes, synthesises, and advises
The coaching agent also maintains its own memory file covering meta-level
observations about how the agent fleet is functioning in the project.
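As a rough sketch of the reading step only, the coach's input could be gathered like this. The function names are hypothetical, and matching watch points by exact string is a deliberate simplification; a real coach would presumably synthesise similar findings rather than compare them literally.

```python
from pathlib import Path


def collect_memories(project_root: Path) -> dict[str, str]:
    """Map agent name to memory text for every .kaizen/agents/*/memory.md."""
    agents_dir = project_root / ".kaizen" / "agents"
    memories: dict[str, str] = {}
    for path in sorted(agents_dir.glob("*/memory.md")):
        memories[path.parent.name] = path.read_text(encoding="utf-8")
    return memories


def section(text: str, heading: str) -> list[str]:
    """Return non-empty, non-comment lines under a `## heading` section."""
    lines: list[str] = []
    inside = False
    for line in text.splitlines():
        if line.startswith("## "):
            inside = line[3:].strip() == heading
            continue
        stripped = line.strip()
        if inside and stripped and not stripped.startswith("<!--"):
            lines.append(stripped)
    return lines


def shared_watch_points(memories: dict[str, str]) -> dict[str, list[str]]:
    """Watch points flagged by more than one agent, with the flagging agents."""
    seen: dict[str, list[str]] = {}
    for agent, text in memories.items():
        # dict.fromkeys drops duplicates within one agent's memory.
        for point in dict.fromkeys(section(text, "Watch Points")):
            seen.setdefault(point, []).append(agent)
    return {point: agents for point, agents in seen.items() if len(agents) > 1}
```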
### CLI Integration
`kaizen-agentic` CLI gains a `memory` command group:
```
kaizen-agentic memory show <agent> # Print agent memory for current project
kaizen-agentic memory init <agent> # Scaffold empty memory file
kaizen-agentic memory brief <agent> # Run coach, print orientation for agent
kaizen-agentic memory clear <agent> # Wipe memory (with confirmation)
```
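One possible shape for the command group, sketched with the standard-library `argparse`. This assumes nothing about the framework the existing CLI uses; `build_parser` is a hypothetical name and the handlers are omitted.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing `memory show|init|brief|clear <agent>`."""
    parser = argparse.ArgumentParser(prog="kaizen-agentic")
    commands = parser.add_subparsers(dest="command", required=True)
    memory = commands.add_parser("memory", help="Manage per-project agent memory")
    actions = memory.add_subparsers(dest="action", required=True)
    for name, help_text in [
        ("show", "Print agent memory for current project"),
        ("init", "Scaffold empty memory file"),
        ("brief", "Run coach, print orientation for agent"),
        ("clear", "Wipe memory (with confirmation)"),
    ]:
        sub = actions.add_parser(name, help=help_text)
        sub.add_argument("agent")
    return parser
```

Parsing `["memory", "show", "sys-medic"]` yields a namespace with `command`, `action`, and `agent` set, which a dispatcher could then route to the matching handler.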
### Tasks
**Memory convention and tooling**
- [ ] T07 — Write ADR: project memory convention (file location, structure, lifecycle)
- [ ] T08 — Implement `memory` CLI command group (show, init, brief, clear)
- [ ] T09 — Add session-start and session-close protocol sections to agent template /
contributor guide
**Agent definition updates**
- [ ] T10 — Add session-start and session-close protocol blocks to all existing
agents that do session-bound work (project-management, tdd-workflow,
requirements-engineering, scope-analyst, sys-medic)
- [ ] T11 — Update agent YAML frontmatter schema to include optional
`memory: enabled|disabled` field (default: enabled)
**Coaching meta-agent**
- [ ] T12 — Write `agents/agent-coach.md` definition
- [ ] T13 — Wire `kaizen-agentic memory brief <agent>` to invoke coach logic
- [ ] T14 — Add coach to agent registry and validate
**Documentation**
- [ ] T15 — Write `docs/agency-framework.md` explaining the memory model, coach
agent, and deployment lifecycle
- [ ] T16 — Update README to reflect the agency positioning
### Definition of done
- `.kaizen/agents/<name>/memory.md` convention documented in ADR
- `memory` CLI commands implemented and tested
- `agent-coach.md` loads, validates, and produces a coherent brief when invoked
against a project with at least one populated agent memory file
- At least one existing agent (project-management or tdd-workflow) updated with
session protocols and tested end-to-end
---
## Part 3 — sys-medic with Protocols, Extended via Agency Framework
With the memory framework in place (Part 2), extend sys-medic so it:
- Accumulates project/node-specific operational knowledge across sessions
- Integrates its companion protocols runbook as a managed artifact
### Protocols Runbook Convention
A new optional artifact type alongside agent definitions:
```
agents/protocols/<agent-name>/<slug>.md
```
Protocols are structured runbooks — reusable, parameterised inspection or
remediation checklists that an agent can reference or hand off to the operator.
They are NOT prompts. They are human-readable procedural documents produced or
refined through agent sessions.
The sys-medic k3s health assessment protocol is the first example.
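To illustrate how the path convention maps to a listing command, here is a small sketch. `list_protocols` is a hypothetical helper, not part of the existing CLI; it assumes protocols are plain `.md` files exactly one directory below `agents/protocols/`.

```python
from pathlib import Path
from typing import Optional


def list_protocols(repo_root: Path, agent: Optional[str] = None) -> list[tuple[str, str]]:
    """Return sorted (agent-name, slug) pairs found under agents/protocols/."""
    base = repo_root / "agents" / "protocols"
    # A top-level README.md in agents/protocols/ is not matched, because the
    # pattern requires one agent directory level.
    pattern = f"{agent}/*.md" if agent else "*/*.md"
    return sorted((path.parent.name, path.stem) for path in base.glob(pattern))
```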
### sys-medic Memory Extensions
sys-medic's memory file gains three additional sections beyond the base template:
```markdown
## Node Profiles
<!-- Per-node operational baseline established over sessions -->
<!-- hostname | typical load | known quirks | last assessment date -->
## Recurring Findings
<!-- Issues seen more than once: pattern + first seen + frequency -->
## Cleared Issues
<!-- Issues that were resolved: what was done, when, outcome -->
```
### Tasks
**Protocols convention**
- [ ] T17 — Write ADR: protocols artifact convention (location, structure, lifecycle)
- [ ] T18 — Create `agents/protocols/` directory with `README.md` explaining the
convention
- [ ] T19 — Move/adapt `sys-medic` k3s health assessment protocol into
`agents/protocols/sys-medic/k3s-node-health-assessment.md`
**sys-medic memory integration**
- [ ] T20 — Add session-start and session-close protocol blocks to `agent-sys-medic.md`
(extending the base protocol from Part 2 with the node-profile extensions)
- [ ] T21 — Add `## Node Profiles`, `## Recurring Findings`, `## Cleared Issues`
extensions to sys-medic memory template
- [ ] T22 — Update sys-medic prompt to reference its protocol runbook when performing
structured assessments ("use the k3s protocol if available")
**CLI integration**
- [ ] T23 — Add `kaizen-agentic protocols list [agent]` and
`kaizen-agentic protocols show <agent> <slug>` commands
- [ ] T24 — Add protocol scaffolding to `kaizen-agentic memory init sys-medic`
**Validation and documentation**
- [ ] T25 — End-to-end test: deploy sys-medic into a test project, run two simulated
sessions, verify memory accumulates and coach produces a useful brief
- [ ] T26 — Update `docs/agency-framework.md` with protocols section
- [ ] T27 — Update sys-medic agent doc with memory and protocol references
### Definition of done
- Protocol runbook lives in `agents/protocols/sys-medic/`
- sys-medic memory template includes node-profile extensions
- sys-medic session-start reads memory + references relevant protocol
- sys-medic session-close updates node profiles and findings
- Coach agent produces a brief for sys-medic that includes node-level context from memory
- CLI exposes protocol listing and viewing
---
## Sequencing
```
Part 1 (T01-T06) ──→ Part 2 (T07-T16) ──→ Part 3 (T17-T27)
~1 session           ~3-4 sessions        ~2-3 sessions
```
Part 1 is independent and can ship immediately. Part 3 depends on Part 2's
memory framework being in place. Parts 2 and 3 together define the agency model
that can then be generalised to bring future agents (from other repos like
sys-medic) into the framework at lower incremental cost.
---
## Notes
- The `.kaizen/` directory in target projects is analogous to `.claude/` — a
project-level configuration and state directory owned by the kaizen-agentic
ecosystem
- The coaching meta-agent draws conceptual inspiration from how the `project-management`
agent already maintains session start/close protocols — that pattern is being
generalised and made consistent across the fleet
- Protocol runbooks (Part 3) are distinct from agent prompts: they are operational
checklists for humans and agents to execute, not instruction sets for shaping AI behaviour