WP-0003 Part 5: tdd-workflow metrics pilot

Add metrics frontmatter and session-close recording to tdd-workflow,
document the reference implementation in wiki/AboutKaizenAgents.md,
and add an e2e test covering record → show → optimize → brief.
2026-06-16 01:48:43 +02:00
parent 04fdc249f5
commit fd2edfbe6c
4 changed files with 231 additions and 17 deletions


@@ -2,6 +2,21 @@
name: tdd-workflow
description: Expert guidance for the TDD8 workflow methodology, specializing in the comprehensive ISSUE-TEST-RED-GREEN-REFACTOR-DOCUMENT-REFINE-PUBLISH cycle with sophisticated sidequest management and proper test organization.
category: development-process
memory: enabled
metrics:
  primary:
    name: test_pass_rate
    description: Share of acceptance-criteria tests passing at PUBLISH
    measurement: passing_tests / total_tests for the active issue workspace
    target: 1.0
  secondary:
    - name: cycle_time_s
      description: Wall-clock time from ISSUE start to PUBLISH
      measurement: Session duration in seconds (execution_time_s in ADR-004)
  collection:
    frequency: per_execution
    storage: .kaizen/metrics/tdd-workflow/
    retention: 180d
---
# TDDAi Assistant Agent
@@ -372,3 +387,20 @@ The comprehensive 8-step development methodology that transforms requirements in
2. Update `## What Worked` and `## Watch Points` as needed.
3. Append one line to `## Session Log`: `YYYY-MM-DD · <issue or feature> · <outcome>`.
4. Bump `last_updated` to today and increment `session_count`.
5. Record session metrics (ADR-004; adjust values to match outcome):
   ```bash
   # Successful PUBLISH — all acceptance tests green:
   echo '{"success": true, "execution_time_s": <seconds>, "quality_score": 0.9, "primary_metric": {"name": "test_pass_rate", "value": 1.0, "target": 1.0}, "metadata": {"issue": "<NUM>", "phase": "PUBLISH"}}' \
     | kaizen-agentic metrics record tdd-workflow --json --idempotency-key <session-id>
   # Incomplete or failed cycle:
   echo '{"success": false, "execution_time_s": <seconds>, "quality_score": 0.4, "primary_metric": {"name": "test_pass_rate", "value": <rate>, "target": 1.0}, "metadata": {"issue": "<NUM>", "phase": "<last-phase>"}}' \
     | kaizen-agentic metrics record tdd-workflow --json --idempotency-key <session-id>
   ```
   Shorthand when only outcome and duration matter:
   ```bash
   kaizen-agentic metrics record tdd-workflow --success --time <seconds> --quality <0.0-1.0>
   ```


@@ -8,8 +8,10 @@ Tests the full workflow:
4. memory brief — verify orientation brief includes own memory and cross-agent context
5. protocols list / show — verify protocol discovery works
6. memory clear — verify wipe works
7. tdd-workflow pilot — record → show → optimize → brief (WP-0003 Part 5)
"""
import json
import textwrap
from pathlib import Path
@@ -17,6 +19,8 @@ import pytest
from click.testing import CliRunner
from kaizen_agentic.cli import cli
from kaizen_agentic.metrics import MetricsStore, OptimizerStore
from kaizen_agentic.optimization import MIN_SAMPLES_FOR_RECOMMENDATIONS
# ---------------------------------------------------------------------------
@@ -67,6 +71,34 @@ def _sys_medic_memory() -> str:
""")


def _tdd_workflow_memory() -> str:
    """Realistic tdd-workflow memory after two issue cycles."""
    return textwrap.dedent("""\
        ---
        agent: tdd-workflow
        project: demo-app
        last_updated: 2026-06-16
        session_count: 2
        ---
        ## Project Context
        Python service using TDD8 with Gitea issues and pytest.
        ## Accumulated Findings
        - Sidequests from REFINE often block PUBLISH when lint debt accumulates
        ## What Worked
        - `make tdd-start NUM=X` before writing tests keeps RED phase focused
        ## Watch Points
        - Flaky integration tests under parallel pytest (-n auto)
        ## Session Log
        2026-06-10 · issue 12 metrics store · PUBLISH complete · success
        2026-06-16 · issue 15 CLI flags · stalled at REFINE · partial
    """)


def _project_management_memory() -> str:
    """Minimal project-management agent memory."""
    return textwrap.dedent("""\
@@ -275,6 +307,104 @@ class TestMemoryClear:
        assert "nothing to clear" in result.output


class TestTddWorkflowMetricsPilot:
    """Full measure → analyse → orient loop for the tdd-workflow pilot agent."""

    def _populate_memory(self, project: Path) -> None:
        memory_dir = project / ".kaizen" / "agents" / "tdd-workflow"
        memory_dir.mkdir(parents=True, exist_ok=True)
        (memory_dir / "memory.md").write_text(_tdd_workflow_memory())

    def test_full_metrics_loop_record_show_optimize_brief(self, project):
        runner = CliRunner()
        self._populate_memory(project)

        sessions = [
            {
                "success": True,
                "execution_time_s": 4200.0,
                "quality_score": 0.92,
                "primary_metric": {
                    "name": "test_pass_rate",
                    "value": 1.0,
                    "target": 1.0,
                },
                "metadata": {"issue": "12", "phase": "PUBLISH"},
            },
            {
                "success": False,
                "execution_time_s": 5400.0,
                "quality_score": 0.45,
                "primary_metric": {
                    "name": "test_pass_rate",
                    "value": 0.78,
                    "target": 1.0,
                },
                "metadata": {"issue": "15", "phase": "REFINE"},
            },
        ]

        for index, payload in enumerate(sessions, start=1):
            result = runner.invoke(
                cli,
                [
                    "metrics",
                    "record",
                    "tdd-workflow",
                    "--target",
                    str(project),
                    "--json",
                    "--idempotency-key",
                    f"session-{index}",
                ],
                input=json.dumps(payload),
            )
            assert result.exit_code == 0, result.output
            assert "Recorded metrics" in result.output

        show_result = runner.invoke(
            cli,
            ["metrics", "show", "tdd-workflow", "--target", str(project)],
        )
        assert show_result.exit_code == 0
        assert "test_pass_rate" in show_result.output or "2 execution" in show_result.output.lower()

        store = MetricsStore(project, "tdd-workflow")
        for i in range(MIN_SAMPLES_FOR_RECOMMENDATIONS - len(sessions)):
            store.append(
                {
                    "success": False,
                    "execution_time_s": 90.0 + i,
                    "quality_score": 0.35,
                    "primary_metric": {
                        "name": "test_pass_rate",
                        "value": 0.6,
                        "target": 1.0,
                    },
                },
                idempotency_key=f"seed-{i}",
            )

        optimize_result = runner.invoke(
            cli,
            ["metrics", "optimize", "tdd-workflow", "--target", str(project)],
        )
        assert optimize_result.exit_code == 0, optimize_result.output

        optimizer = OptimizerStore(project)
        assert optimizer.analysis_path.exists()
        assert optimizer.recommendations_path.exists()

        brief_result = runner.invoke(
            cli,
            ["memory", "brief", "tdd-workflow", "--target", str(project)],
        )
        assert brief_result.exit_code == 0
        assert "## Performance Summary" in brief_result.output
        assert "Success rate:" in brief_result.output
        assert "issue 12" in brief_result.output or "TDD8" in brief_result.output
        assert "Your Memory" in brief_result.output


class TestProtocolsCommand:
    def test_protocols_list_finds_sys_medic(self):
        """Protocols list against the real agents dir should include sys-medic k3s protocol."""


@@ -1,24 +1,76 @@
AboutKaizenAgents
# About Kaizen Agents
*Basic concepts of Kaizen Agents*
Basic concepts of Kaizen Agents.
All Kaizen Agents follow the KaizenAgentTemplateDefinition
All Kaizen Agents follow the [KaizenAgentTemplate](KaizenAgentTemplate.md) definition.
That template provides a comprehensive structure for defining Kaizen Agent subagents.
This template provides a comprehensive structure for defining KaizenAgent subagents.
Key sections:
The key sections are:
- **Specification** — declarative outcomes rather than implementation steps
- **Idempotency design** — detect and handle already-completed work
- **Metrics** — measurable success criteria from day one
- **Testing** — scenarios that feed the optimization loop
- **Evolution tracking** — improvement history and performance trends
Specification: Focuses on declarative outcomes rather than implementation steps, making agents more maintainable and testable.
The template enforces separation of concerns, testability, and measurability while
keeping agent definitions consistent across the fleet.
Idempotency Design: Forces you to think upfront about how the agent will detect and handle already-completed work.
---
Metrics: Ensures every agent has measurable success criteria from day one.
## Metrics-enabled pilot: `tdd-workflow`
Testing: Built-in test scenarios that can be automated as part of the optimization loop.
`tdd-workflow` is the reference implementation for project-scoped metrics (WP-0003).
Use it as a template when adding metrics to other agents.
Evolution Tracking: Maintains a history of improvements and provides hooks for the KaizenAgent to analyze performance trends.
### What is measured
The template enforces our design principles - separation of concerns, testability, and measurability - while providing enough structure to ensure consistency across different coding subagents.
| Metric | Role | How |
|--------|------|-----|
| `test_pass_rate` | Primary | Passing tests ÷ total tests at PUBLISH (target: 1.0) |
| `cycle_time_s` | Secondary | Session duration (`execution_time_s` in ADR-004) |
Definitions live in the agent frontmatter (`agents/agent-tdd-workflow.md`).
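
As a rough illustration of how the two values in the table could be derived, the sketch below computes them from raw test counts and timestamps. The function names and the "no tests means 0.0" convention are assumptions made for this example; they are not part of `kaizen-agentic`.

```python
from datetime import datetime, timezone


def compute_pass_rate(passing_tests: int, total_tests: int) -> float:
    """Return passing_tests / total_tests for the active issue workspace.

    Assumption: a workspace with no tests scores 0.0 rather than 1.0.
    """
    if total_tests < 0 or not 0 <= passing_tests <= total_tests:
        raise ValueError("need 0 <= passing_tests <= total_tests")
    if total_tests == 0:
        return 0.0
    return passing_tests / total_tests


def compute_cycle_time_s(issue_started: datetime, published: datetime) -> float:
    """Wall-clock seconds from ISSUE start to PUBLISH."""
    return (published - issue_started).total_seconds()


# Example values only: 39 of 50 tests passing, a 70-minute session.
started = datetime(2026, 6, 10, 9, 0, tzinfo=timezone.utc)
finished = datetime(2026, 6, 10, 10, 10, tzinfo=timezone.utc)
rate = compute_pass_rate(39, 50)
cycle_time = compute_cycle_time_s(started, finished)
print(rate, cycle_time)  # 0.78 4200.0
```

The resulting numbers map onto `primary_metric.value` and `execution_time_s` in the session record.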
xxx
### Where data lives
```
<project>/.kaizen/metrics/tdd-workflow/
executions.jsonl # append-only per-session records
summary.json # rolling aggregates (auto-generated)
```
Scaffolded by `kaizen-agentic memory init tdd-workflow` alongside
`.kaizen/agents/tdd-workflow/memory.md`.
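
A minimal sketch of reading this layout, assuming `executions.jsonl` holds one JSON object per line with the `success` and `execution_time_s` fields used in the agent spec. The `summarize` helper and the keys it returns are hypothetical and do not describe the real `summary.json` schema.

```python
import json
import tempfile
from pathlib import Path


def summarize(project: Path, agent: str = "tdd-workflow") -> dict:
    """Compute simple aggregates from the append-only execution log."""
    empty = {"executions": 0, "success_rate": None, "mean_time_s": None}
    log = project / ".kaizen" / "metrics" / agent / "executions.jsonl"
    if not log.exists():
        return empty
    # One JSON record per non-blank line (assumed JSON Lines layout).
    records = [json.loads(line) for line in log.read_text().splitlines() if line.strip()]
    if not records:
        return empty
    successes = sum(1 for record in records if record.get("success"))
    times = [record["execution_time_s"] for record in records if "execution_time_s" in record]
    return {
        "executions": len(records),
        "success_rate": successes / len(records),
        "mean_time_s": sum(times) / len(times) if times else None,
    }


# Demo against a throwaway project directory with two sample sessions.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    metrics_dir = project / ".kaizen" / "metrics" / "tdd-workflow"
    metrics_dir.mkdir(parents=True)
    lines = [
        json.dumps({"success": True, "execution_time_s": 4200.0}),
        json.dumps({"success": False, "execution_time_s": 5400.0}),
    ]
    (metrics_dir / "executions.jsonl").write_text("\n".join(lines) + "\n")
    summary = summarize(project)

print(summary)
```

In practice `kaizen-agentic metrics show tdd-workflow` is the supported way to inspect these files; the sketch only shows what the raw log makes possible.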
### Session-close loop
At the end of each TDD8 session:
1. Update qualitative memory (`## Session Log`, findings, watch points).
2. Record quantitative outcome:
```bash
kaizen-agentic metrics record tdd-workflow --success --time <seconds> --quality <0.0-1.0>
```
Or pass a full ADR-004 record with `primary_metric` via `--json` (see agent spec).
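
For the `--json` path, the record can be assembled programmatically. The sketch below mirrors the field names shown in the agent spec (`success`, `execution_time_s`, `quality_score`, `primary_metric`, `metadata`); the `build_record` helper and its range check are assumptions for illustration, not part of the CLI.

```python
import json


def build_record(
    success: bool,
    execution_time_s: float,
    quality_score: float,
    pass_rate: float,
    issue: str,
    phase: str,
    target: float = 1.0,
) -> str:
    """Serialize one session outcome in the shape used by the agent spec."""
    # Assumed validation: both scores are fractions between 0.0 and 1.0.
    for label, value in (("quality_score", quality_score), ("pass_rate", pass_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{label} must be between 0.0 and 1.0")
    record = {
        "success": success,
        "execution_time_s": execution_time_s,
        "quality_score": quality_score,
        "primary_metric": {
            "name": "test_pass_rate",
            "value": pass_rate,
            "target": target,
        },
        "metadata": {"issue": issue, "phase": phase},
    }
    return json.dumps(record)


payload = build_record(True, 4200.0, 0.92, 1.0, issue="12", phase="PUBLISH")
print(payload)
```

The printed line can then be piped into `kaizen-agentic metrics record tdd-workflow --json --idempotency-key <session-id>`, as in the agent spec.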
### Analysis and orientation
| Command | Purpose |
|---------|---------|
| `kaizen-agentic metrics show tdd-workflow` | Summary + recent executions |
| `kaizen-agentic metrics optimize tdd-workflow` | Evidence-based recommendations (≥10 records) |
| `kaizen-agentic memory brief tdd-workflow` | Qualitative memory + `## Performance Summary` |
Fleet-level session analytics remain in **agentic-resources** (Helix Forge); project
metrics stay in `.kaizen/metrics/` per [ADR-004](../docs/adr/ADR-004-project-metrics-convention.md)
and [EcosystemIntegration](EcosystemIntegration.md).
### Adopting metrics on another agent
1. Add a `metrics:` block to frontmatter (primary + secondary + collection).
2. Copy the session-close `metrics record` step from `agent-tdd-workflow.md`.
3. Run `kaizen-agentic memory init <agent>` to scaffold storage.
4. Verify with `metrics show` after one session.
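
Before step 3, it can help to check that a new `metrics:` block has the same shape as the pilot's. The sketch below works on an already-parsed frontmatter dictionary; the required-key lists are inferred from the `tdd-workflow` frontmatter and are an assumption, not an official schema, and `missing_metrics_fields` is a hypothetical helper.

```python
# Keys inferred from the tdd-workflow frontmatter (assumed, not an official schema).
REQUIRED_SECTIONS = {
    "primary": ("name", "description", "measurement", "target"),
    "collection": ("frequency", "storage", "retention"),
}
SECONDARY_KEYS = ("name", "description", "measurement")


def missing_metrics_fields(metrics: dict) -> list[str]:
    """Return dotted paths of fields missing from a parsed metrics block."""
    missing = []
    for section, keys in REQUIRED_SECTIONS.items():
        block = metrics.get(section)
        if not isinstance(block, dict):
            missing.append(section)
            continue
        missing += [f"{section}.{key}" for key in keys if key not in block]
    secondary = metrics.get("secondary")
    if not isinstance(secondary, list) or not secondary:
        missing.append("secondary")
    else:
        for index, entry in enumerate(secondary):
            missing += [f"secondary[{index}].{key}" for key in SECONDARY_KEYS if key not in entry]
    return missing


pilot = {
    "primary": {
        "name": "test_pass_rate",
        "description": "Share of acceptance-criteria tests passing at PUBLISH",
        "measurement": "passing_tests / total_tests for the active issue workspace",
        "target": 1.0,
    },
    "secondary": [
        {
            "name": "cycle_time_s",
            "description": "Wall-clock time from ISSUE start to PUBLISH",
            "measurement": "Session duration in seconds",
        }
    ],
    "collection": {
        "frequency": "per_execution",
        "storage": ".kaizen/metrics/tdd-workflow/",
        "retention": "180d",
    },
}
incomplete = {
    "primary": {"name": "x", "description": "d", "measurement": "m"},
    "collection": pilot["collection"],
}
print(missing_metrics_fields(pilot))       # []
print(missing_metrics_fields(incomplete))  # ['primary.target', 'secondary']
```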


@@ -9,7 +9,7 @@ owner: kaizen-agentic
topic_slug: custodian
state_hub_workstream_id: 36252a45-f360-4496-bf77-17b5dfb02767
created: "2026-06-16"
updated: "2026-06-17"
updated: "2026-06-18"
---
# KAIZEN-WP-0003 — Measurement Loop: Metrics Convention, Collection, and Optimizer Integration
@@ -179,10 +179,10 @@ Prove the loop end-to-end on one agent before fleet-wide rollout.
### Tasks
- [ ] T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
- [ ] T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
- [ ] T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
- [ ] T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation
- [x] T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
- [x] T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
- [x] T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
- [x] T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation
### Definition of done