WP-0003 Part 5: tdd-workflow metrics pilot

Add metrics frontmatter and session-close recording to tdd-workflow, document the reference implementation in wiki/AboutKaizenAgents.md, and add an e2e test covering record → show → optimize → brief.
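
A minimal sketch of how a session-close payload for `metrics record --json` might be assembled. It assumes the field names used in this commit's examples (`success`, `execution_time_s`, `quality_score`, `primary_metric`, `metadata`). The helper `build_session_record` and its rule for `success` (phase is PUBLISH and the pass rate meets the target) are hypothetical illustrations, not part of `kaizen_agentic`.

```python
import json


def build_session_record(passing_tests, total_tests, duration_s,
                         quality_score, issue, phase, target=1.0):
    """Build an ADR-004-style record for one tdd-workflow session.

    Hypothetical helper: field names mirror the payloads in this commit;
    the success rule below is an assumption made for illustration.
    """
    if total_tests <= 0:
        raise ValueError("total_tests must be positive")
    # Primary metric as defined in the frontmatter: passing_tests / total_tests.
    rate = passing_tests / total_tests
    return {
        # Assumed rule: success only if the cycle reached PUBLISH on target.
        "success": phase == "PUBLISH" and rate >= target,
        "execution_time_s": float(duration_s),
        "quality_score": quality_score,
        "primary_metric": {
            "name": "test_pass_rate",
            "value": rate,
            "target": target,
        },
        "metadata": {"issue": str(issue), "phase": phase},
    }


if __name__ == "__main__":
    # Serialize one record as a single JSON line, ready to pipe to the CLI.
    record = build_session_record(18, 18, 4200, 0.92, issue=12, phase="PUBLISH")
    print(json.dumps(record))
```

The printed JSON could then be piped to `kaizen-agentic metrics record tdd-workflow --json --idempotency-key <session-id>`, as in the session-close step added to `agents/agent-tdd-workflow.md`.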
2026-06-16 01:48:43 +02:00
parent 04fdc249f5
commit fd2edfbe6c
4 changed files with 231 additions and 17 deletions
--- a/agents/agent-tdd-workflow.md
+++ b/agents/agent-tdd-workflow.md
@@ -2,6 +2,21 @@
 name: tdd-workflow
 description: Expert guidance for the TDD8 workflow methodology, specializing in the comprehensive ISSUE-TEST-RED-GREEN-REFACTOR-DOCUMENT-REFINE-PUBLISH cycle with sophisticated sidequest management and proper test organization.
 category: development-process
+memory: enabled
+metrics:
+  primary:
+    name: test_pass_rate
+    description: Share of acceptance-criteria tests passing at PUBLISH
+    measurement: passing_tests / total_tests for the active issue workspace
+    target: 1.0
+  secondary:
+    - name: cycle_time_s
+      description: Wall-clock time from ISSUE start to PUBLISH
+      measurement: Session duration in seconds (execution_time_s in ADR-004)
+  collection:
+    frequency: per_execution
+    storage: .kaizen/metrics/tdd-workflow/
+    retention: 180d
 ---

 # TDDAi Assistant Agent
@@ -372,3 +387,20 @@ The comprehensive 8-step development methodology that transforms requirements in
 2. Update `## What Worked` and `## Watch Points` as needed.
 3. Append one line to `## Session Log`: `YYYY-MM-DD · <issue or feature> · <outcome>`.
 4. Bump `last_updated` to today and increment `session_count`.
+5. Record session metrics (ADR-004; adjust values to match outcome):
+
+```bash
+# Successful PUBLISH — all acceptance tests green:
+echo '{"success": true, "execution_time_s": <seconds>, "quality_score": 0.9, "primary_metric": {"name": "test_pass_rate", "value": 1.0, "target": 1.0}, "metadata": {"issue": "<NUM>", "phase": "PUBLISH"}}' \
+  | kaizen-agentic metrics record tdd-workflow --json --idempotency-key <session-id>
+
+# Incomplete or failed cycle:
+echo '{"success": false, "execution_time_s": <seconds>, "quality_score": 0.4, "primary_metric": {"name": "test_pass_rate", "value": <rate>, "target": 1.0}, "metadata": {"issue": "<NUM>", "phase": "<last-phase>"}}' \
+  | kaizen-agentic metrics record tdd-workflow --json --idempotency-key <session-id>
+```
+
+Shorthand when only outcome and duration matter:
+
+```bash
+kaizen-agentic metrics record tdd-workflow --success --time <seconds> --quality <0.0-1.0>
+```
--- a/tests/test_e2e_agency_framework.py
+++ b/tests/test_e2e_agency_framework.py
@@ -8,8 +8,10 @@ Tests the full workflow:
  4. memory brief — verify orientation brief includes own memory and cross-agent context
  5. protocols list / show — verify protocol discovery works
  6. memory clear — verify wipe works
+  7. tdd-workflow pilot — record → show → optimize → brief (WP-0003 Part 5)
 """

+import json
 import textwrap
 from pathlib import Path

@@ -17,6 +19,8 @@ import pytest
 from click.testing import CliRunner

 from kaizen_agentic.cli import cli
+from kaizen_agentic.metrics import MetricsStore, OptimizerStore
+from kaizen_agentic.optimization import MIN_SAMPLES_FOR_RECOMMENDATIONS


 # ---------------------------------------------------------------------------
@@ -67,6 +71,34 @@ def _sys_medic_memory() -> str:
    """)


+def _tdd_workflow_memory() -> str:
+    """Realistic tdd-workflow memory after two issue cycles."""
+    return textwrap.dedent("""\
+        ---
+        agent: tdd-workflow
+        project: demo-app
+        last_updated: 2026-06-16
+        session_count: 2
+        ---
+
+        ## Project Context
+        Python service using TDD8 with Gitea issues and pytest.
+
+        ## Accumulated Findings
+        - Sidequests from REFINE often block PUBLISH when lint debt accumulates
+
+        ## What Worked
+        - `make tdd-start NUM=X` before writing tests keeps RED phase focused
+
+        ## Watch Points
+        - Flaky integration tests under parallel pytest (-n auto)
+
+        ## Session Log
+        2026-06-10 · issue 12 metrics store · PUBLISH complete · success
+        2026-06-16 · issue 15 CLI flags · stalled at REFINE · partial
+    """)
+
+
 def _project_management_memory() -> str:
    """Minimal project-management agent memory."""
    return textwrap.dedent("""\
@@ -275,6 +307,104 @@ class TestMemoryClear:
        assert "nothing to clear" in result.output


+class TestTddWorkflowMetricsPilot:
+    """Full measure → analyse → orient loop for the tdd-workflow pilot agent."""
+
+    def _populate_memory(self, project: Path) -> None:
+        memory_dir = project / ".kaizen" / "agents" / "tdd-workflow"
+        memory_dir.mkdir(parents=True, exist_ok=True)
+        (memory_dir / "memory.md").write_text(_tdd_workflow_memory())
+
+    def test_full_metrics_loop_record_show_optimize_brief(self, project):
+        runner = CliRunner()
+        self._populate_memory(project)
+
+        sessions = [
+            {
+                "success": True,
+                "execution_time_s": 4200.0,
+                "quality_score": 0.92,
+                "primary_metric": {
+                    "name": "test_pass_rate",
+                    "value": 1.0,
+                    "target": 1.0,
+                },
+                "metadata": {"issue": "12", "phase": "PUBLISH"},
+            },
+            {
+                "success": False,
+                "execution_time_s": 5400.0,
+                "quality_score": 0.45,
+                "primary_metric": {
+                    "name": "test_pass_rate",
+                    "value": 0.78,
+                    "target": 1.0,
+                },
+                "metadata": {"issue": "15", "phase": "REFINE"},
+            },
+        ]
+
+        for index, payload in enumerate(sessions, start=1):
+            result = runner.invoke(
+                cli,
+                [
+                    "metrics",
+                    "record",
+                    "tdd-workflow",
+                    "--target",
+                    str(project),
+                    "--json",
+                    "--idempotency-key",
+                    f"session-{index}",
+                ],
+                input=json.dumps(payload),
+            )
+            assert result.exit_code == 0, result.output
+            assert "Recorded metrics" in result.output
+
+        show_result = runner.invoke(
+            cli,
+            ["metrics", "show", "tdd-workflow", "--target", str(project)],
+        )
+        assert show_result.exit_code == 0
+        assert "test_pass_rate" in show_result.output or "2 execution" in show_result.output.lower()
+
+        store = MetricsStore(project, "tdd-workflow")
+        for i in range(MIN_SAMPLES_FOR_RECOMMENDATIONS - len(sessions)):
+            store.append(
+                {
+                    "success": False,
+                    "execution_time_s": 90.0 + i,
+                    "quality_score": 0.35,
+                    "primary_metric": {
+                        "name": "test_pass_rate",
+                        "value": 0.6,
+                        "target": 1.0,
+                    },
+                },
+                idempotency_key=f"seed-{i}",
+            )
+
+        optimize_result = runner.invoke(
+            cli,
+            ["metrics", "optimize", "tdd-workflow", "--target", str(project)],
+        )
+        assert optimize_result.exit_code == 0, optimize_result.output
+        optimizer = OptimizerStore(project)
+        assert optimizer.analysis_path.exists()
+        assert optimizer.recommendations_path.exists()
+
+        brief_result = runner.invoke(
+            cli,
+            ["memory", "brief", "tdd-workflow", "--target", str(project)],
+        )
+        assert brief_result.exit_code == 0
+        assert "## Performance Summary" in brief_result.output
+        assert "Success rate:" in brief_result.output
+        assert "issue 12" in brief_result.output or "TDD8" in brief_result.output
+        assert "Your Memory" in brief_result.output
+
+
 class TestProtocolsCommand:
    def test_protocols_list_finds_sys_medic(self):
        """Protocols list against the real agents dir should include sys-medic k3s protocol."""
--- a/wiki/AboutKaizenAgents.md
+++ b/wiki/AboutKaizenAgents.md
@@ -1,24 +1,76 @@
-AboutKaizenAgents
+# About Kaizen Agents

-*Basic concepts of Kaizen Agents*
+Basic concepts of Kaizen Agents.

-All Kaizen Agents follow the KaizenAgentTemplateDefinition 
+All Kaizen Agents follow the [KaizenAgentTemplate](KaizenAgentTemplate.md) definition.
+That template provides a comprehensive structure for defining Kaizen Agent subagents.

-This template provides a comprehensive structure for defining KaizenAgent subagents. 
+Key sections:

-The key sections are:
+- **Specification** — declarative outcomes rather than implementation steps
+- **Idempotency design** — detect and handle already-completed work
+- **Metrics** — measurable success criteria from day one
+- **Testing** — scenarios that feed the optimization loop
+- **Evolution tracking** — improvement history and performance trends

-Specification: Focuses on declarative outcomes rather than implementation steps, making agents more maintainable and testable.
+The template enforces separation of concerns, testability, and measurability while
+keeping agent definitions consistent across the fleet.

-Idempotency Design: Forces you to think upfront about how the agent will detect and handle already-completed work.
+---

-Metrics: Ensures every agent has measurable success criteria from day one.
+## Metrics-enabled pilot: `tdd-workflow`

-Testing: Built-in test scenarios that can be automated as part of the optimization loop.
+`tdd-workflow` is the reference implementation for project-scoped metrics (WP-0003).
+Use it as a template when adding metrics to other agents.

-Evolution Tracking: Maintains a history of improvements and provides hooks for the KaizenAgent to analyze performance trends.
+### What is measured

-The template enforces our design principles  - separation of concerns, testability, and measurability - while providing enough structure to ensure consistency across different coding subagents.
+| Metric | Role | How |
+|--------|------|-----|
+| `test_pass_rate` | Primary | Passing tests ÷ total tests at PUBLISH (target: 1.0) |
+| `cycle_time_s` | Secondary | Session duration (`execution_time_s` in ADR-004) |

+Definitions live in the agent frontmatter (`agents/agent-tdd-workflow.md`).

-xxx
+### Where data lives
+
+```
+<project>/.kaizen/metrics/tdd-workflow/
+  executions.jsonl    # append-only per-session records
+  summary.json        # rolling aggregates (auto-generated)
+```
+
+Scaffolded by `kaizen-agentic memory init tdd-workflow` alongside
+`.kaizen/agents/tdd-workflow/memory.md`.
+
+### Session-close loop
+
+At the end of each TDD8 session:
+
+1. Update qualitative memory (`## Session Log`, findings, watch points).
+2. Record quantitative outcome:
+
+```bash
+kaizen-agentic metrics record tdd-workflow --success --time <seconds> --quality <0.0-1.0>
+```
+
+Or pass a full ADR-004 record with `primary_metric` via `--json` (see agent spec).
+
+### Analysis and orientation
+
+| Command | Purpose |
+|---------|---------|
+| `kaizen-agentic metrics show tdd-workflow` | Summary + recent executions |
+| `kaizen-agentic metrics optimize tdd-workflow` | Evidence-based recommendations (≥10 records) |
+| `kaizen-agentic memory brief tdd-workflow` | Qualitative memory + `## Performance Summary` |
+
+Fleet-level session analytics remain in **agentic-resources** (Helix Forge); project
+metrics stay in `.kaizen/metrics/` per [ADR-004](../docs/adr/ADR-004-project-metrics-convention.md)
+and [EcosystemIntegration](EcosystemIntegration.md).
+
+### Adopting metrics on another agent
+
+1. Add a `metrics:` block to frontmatter (primary + secondary + collection).
+2. Copy the session-close `metrics record` step from `agent-tdd-workflow.md`.
+3. Run `kaizen-agentic memory init <agent>` to scaffold storage.
+4. Verify with `metrics show` after one session.
--- a/workplans/kaizen-agentic-WP-0003-measurement-loop.md
+++ b/workplans/kaizen-agentic-WP-0003-measurement-loop.md
@@ -9,7 +9,7 @@ owner: kaizen-agentic
 topic_slug: custodian
 state_hub_workstream_id: 36252a45-f360-4496-bf77-17b5dfb02767
 created: "2026-06-16"
-updated: "2026-06-17"
+updated: "2026-06-18"
 ---

 # KAIZEN-WP-0003 — Measurement Loop: Metrics Convention, Collection, and Optimizer Integration
@@ -179,10 +179,10 @@ Prove the loop end-to-end on one agent before fleet-wide rollout.

 ### Tasks

-- [ ] T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
-- [ ] T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
-- [ ] T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
-- [ ] T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation
+- [x] T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
+- [x] T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
+- [x] T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
+- [x] T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation

 ### Definition of done