kaizen-agentic/workplans/kaizen-agentic-WP-0003-measurement-loop.md
tegwick fe795ca750
docs: close WP-0003 workplan and bind State Hub tasks
Add frontmatter tasks list with state_hub_task_id links and completion
section for the measurement loop (ADR-004, metrics CLI, Coach bridge).
2026-06-17 01:04:39 +02:00

---
id: KAIZEN-WP-0003
type: workplan
title: "Measurement Loop: Metrics Convention, Collection, and Optimizer Integration"
domain: custodian
repo: kaizen-agentic
status: completed
owner: kaizen-agentic
topic_slug: custodian
state_hub_workstream_id: 36252a45-f360-4496-bf77-17b5dfb02767
created: "2026-06-16"
updated: "2026-06-17"
tasks:
  - id: T01
    title: Write ADR-004 project metrics convention
    status: done
    state_hub_task_id: 4e7b0fd2-38c0-46aa-84a7-bb18366b8c7c
  - id: T02
    title: Implement MetricsStore in metrics.py
    status: done
    state_hub_task_id: eeaa99c7-d7a7-403b-a013-364cba45a663
  - id: T03
    title: Add memory init hook to scaffold metrics directory
    status: done
    state_hub_task_id: 247c097f-de89-4383-930c-35ee66de9b36
  - id: T04
    title: Unit tests for MetricsStore
    status: done
    state_hub_task_id: 3aa14026-6ee3-4384-b409-11300c1302f0
  - id: T05
    title: Implement metrics CLI command group
    status: done
    state_hub_task_id: 6b505d29-7d2e-44a2-a4b7-1fe82884390c
  - id: T06
    title: Integrate metrics record into session-close template
    status: done
    state_hub_task_id: 84f2a357-f2dd-4fc7-96b6-a4e80d5467a7
  - id: T07
    title: CLI tests for metrics commands
    status: done
    state_hub_task_id: 8e9ee64b-b7c4-4dff-ac6e-988fd47ef95d
  - id: T08
    title: Update CLI cheat sheet and agency-framework with metrics
    status: done
    state_hub_task_id: 4c41e0db-d5d8-4a1b-b346-06ad004edf4a
  - id: T09
    title: Add OptimizationLoop.from_metrics_store factory
    status: done
    state_hub_task_id: 0b374439-6eca-4754-8e15-2a7eece0cd27
  - id: T10
    title: Implement kaizen-agentic metrics optimize command
    status: done
    state_hub_task_id: db87a09b-0252-495c-a771-a43b4b98f820
  - id: T11
    title: Consolidate optimization meta-agent definitions
    status: done
    state_hub_task_id: 73cb7d73-6fc6-42a9-97aa-d33cdf9ee363
  - id: T12
    title: Update optimization agent session protocol
    status: done
    state_hub_task_id: c127eca7-7394-42db-ba5e-721aef0ccb76
  - id: T13
    title: Unit and integration tests for optimizer recommendations
    status: done
    state_hub_task_id: f208dc9f-cdf7-47e3-9c03-09097e46eee9
  - id: T14
    title: Extend memory brief with metrics summary
    status: done
    state_hub_task_id: d01f969c-bbb1-4eca-a4f1-d79d5c867b35
  - id: T15
    title: Extend agent-coach.md for metrics context
    status: done
    state_hub_task_id: 67f791a4-fced-4986-a331-7eb4ea47fe6e
  - id: T16
    title: E2e test memory brief with metrics sections
    status: done
    state_hub_task_id: 1fb89b54-8bd2-40bf-9a71-04693cb9f695
  - id: T17
    title: Add metrics section to agent-tdd-workflow.md
    status: done
    state_hub_task_id: 1d471a7a-9a98-4805-903e-b4a2b8153717
  - id: T18
    title: Add session-close metrics record step to tdd-workflow
    status: done
    state_hub_task_id: abb387f1-86ce-4b9b-a516-2d4efb6aca4c
  - id: T19
    title: Document tdd-workflow pilot in wiki/AboutKaizenAgents.md
    status: done
    state_hub_task_id: 67fbc26e-a57d-4133-96e6-3d2cdbd10dc0
  - id: T20
    title: E2e test full tdd-workflow measure-analyse-orient loop
    status: done
    state_hub_task_id: fbdd7c8b-e122-48d9-8c8f-de9f82d025e3
  - id: T21
    title: Sync 4 missing agents into packaged data/
    status: done
    state_hub_task_id: 9662bcec-34fe-451b-b61f-5d11b9574576
  - id: T22
    title: Update README orientation links to INTENT and wiki
    status: done
    state_hub_task_id: 422aae43-5697-4a00-86e9-1569baf09422
  - id: T23
    title: Update architecture.md agent table
    status: done
    state_hub_task_id: ba6b3411-d330-4a58-8cd0-62b4fbef8c5f
  - id: T24
    title: CHANGELOG entry for metrics convention and CLI
    status: done
    state_hub_task_id: 748be9f3-f6ac-4f26-a844-6330268935b6
---
# KAIZEN-WP-0003 — Measurement Loop: Metrics Convention, Collection, and Optimizer Integration
**Status:** completed
**Owner:** kaizen-agentic
**Repo:** kaizen-agentic
**Target version:** 1.1.0 (partial; remainder in WP-0001)
## Goal
Close the kaizen feedback loop defined in `INTENT.md` and `wiki/AgentKaizenOptimizer.md`:
agents produce **measurable, per-execution performance records** stored in project-scoped
`.kaizen/metrics/`, the existing `OptimizationLoop` reads that data and generates
evidence-based recommendations, and the Coach/optimizer meta-agents share a single
improvement path.
This workplan addresses the P0 gap from the INTENT gap analysis: strategic vision
(memory + qualitative learning) exists; **quantitative measurement → refinement**
does not.
---
## Background
| Layer | State |
|-------|-------|
| `INTENT.md` | Requires measurable-by-default agents and evidence-based refinement |
| `wiki/KaizenAgentTemplate.md` | Defines `metrics`, `idempotency`, `optimization` sections per agent |
| `wiki/AgentKaizenOptimizer.md` | Specifies `.kaizen/metrics/` storage and optimizer behaviour |
| `src/kaizen_agentic/optimization.py` | `OptimizationLoop` + `PerformanceMetrics` implemented, unit-tested, unwired |
| Agency framework (WP-0002) | `.kaizen/agents/<name>/memory.md` + Coach brief — qualitative only |
| WP-0001 T04 | Telemetry — overlaps; WP-0003 defines the convention; WP-0001 can adopt it |
---
## Part 1 — Metrics Convention and Storage
Define the project-scoped metrics artifact alongside the existing memory convention
(ADR-002).
### Location convention
```
<project-root>/.kaizen/metrics/<agent-name>/
  executions.jsonl   # append-only per-execution records
  summary.json       # rolling aggregates (regenerated on write)
```
Optimizer-specific aggregates (per `wiki/AgentKaizenOptimizer.md`):
```
<project-root>/.kaizen/metrics/optimizer/
  analysis.json          # last run output + fingerprint
  recommendations.jsonl  # append-only recommendation history
```
### Execution record schema (minimum viable)
```json
{
  "timestamp": "ISO-8601",
  "agent": "tdd-workflow",
  "session_id": "optional-uuid-or-hash",
  "execution_time_s": 0.0,
  "success": true,
  "quality_score": 0.0,
  "primary_metric": { "name": "...", "value": 0.0, "target": 0.0 },
  "metadata": {}
}
```
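The location convention and record schema above can be exercised with a minimal sketch like the following. It is not the project's `MetricsStore`; the function name `append_execution` and the `summary.json` field names (`count`, `success_rate`, `avg_execution_time_s`, `avg_quality_score`) are assumptions made for illustration, and the summary is recomputed from the full log on every write rather than updated incrementally.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_execution(project_root: Path, agent: str, record: dict) -> dict:
    """Append one record to executions.jsonl and rewrite summary.json.

    Hypothetical helper: name and summary fields are illustrative only.
    """
    metrics_dir = project_root / ".kaizen" / "metrics" / agent
    metrics_dir.mkdir(parents=True, exist_ok=True)

    # Supply defaults for fields the schema always carries; caller values win.
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "metadata": {},
        **record,
    }

    log_path = metrics_dir / "executions.jsonl"
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

    # Regenerate the rolling aggregates from the whole log.
    with log_path.open(encoding="utf-8") as fh:
        rows = [json.loads(line) for line in fh if line.strip()]
    n = len(rows)
    summary = {
        "count": n,
        "success_rate": sum(1 for r in rows if r.get("success")) / n,
        "avg_execution_time_s": sum(r.get("execution_time_s", 0.0) for r in rows) / n,
        "avg_quality_score": sum(r.get("quality_score", 0.0) for r in rows) / n,
    }
    (metrics_dir / "summary.json").write_text(
        json.dumps(summary, indent=2), encoding="utf-8"
    )
    return summary
```

Recomputing from the full log keeps the sketch short; a real store would likely update aggregates incrementally once logs grow.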
### Tasks
- [x] T01 — Write ADR-004: project metrics convention (location, schema, lifecycle, retention, Helix Forge correlation)
- [x] T02 — Implement `MetricsStore` in `src/kaizen_agentic/metrics.py` (append, read, summarise, prune by retention)
- [x] T03 — Add `memory init` hook to scaffold `.kaizen/metrics/<agent>/` alongside memory (optional flag `--no-metrics`)
- [x] T04 — Unit tests for `MetricsStore` (append idempotency key, summary regeneration, retention prune)
### Definition of done
- ADR-004 accepted and referenced from `docs/agency-framework.md`
- `MetricsStore` passes unit tests
- `kaizen-agentic memory init <agent>` creates metrics scaffold by default
---
## Part 2 — Metrics CLI
Expose metrics collection and inspection without requiring Python imports in agent
sessions.
### Commands
```
kaizen-agentic metrics record <agent>   # Append one execution record (stdin JSON or flags)
kaizen-agentic metrics show <agent>     # Print summary + recent executions
kaizen-agentic metrics list             # List agents with metrics in current project
kaizen-agentic metrics export <agent>   # Dump executions.jsonl to stdout
```
### Options (record)
- `--target / -t` — project root (default: cwd)
- `--success / --failure` — boolean outcome shorthand
- `--time` — execution time in seconds
- `--quality` — quality score 0.0–1.0
- `--json` — full record on stdin
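
As a rough illustration of how the `record` options might map onto the execution record schema, here is a sketch using the standard library's `argparse`. The project's CLI tests reference `click.testing`, so the real command is presumably built with Click; `argparse` is swapped in here only to keep the example dependency-free. The helper names (`unit_interval`, `build_record_parser`, `record_from_args`) and the rule that flags override values read from stdin are assumptions.

```python
import argparse
import json
import sys


def unit_interval(text: str) -> float:
    """Parse a float and reject values outside 0.0-1.0."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("quality must be between 0.0 and 1.0")
    return value


def build_record_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="kaizen-agentic metrics record")
    parser.add_argument("agent")
    parser.add_argument("--target", "-t", default=".", help="project root (default: cwd)")
    # --success / --failure share one destination; None means "not given".
    outcome = parser.add_mutually_exclusive_group()
    outcome.add_argument("--success", dest="success", action="store_true", default=None)
    outcome.add_argument("--failure", dest="success", action="store_false", default=None)
    parser.add_argument("--time", dest="execution_time_s", type=float)
    parser.add_argument("--quality", dest="quality_score", type=unit_interval)
    parser.add_argument("--json", dest="from_stdin", action="store_true",
                        help="read the full record as JSON from stdin")
    return parser


def record_from_args(argv, stdin=sys.stdin) -> dict:
    """Build an execution record from CLI arguments (hypothetical helper)."""
    args = build_record_parser().parse_args(argv)
    record = json.load(stdin) if args.from_stdin else {}
    record["agent"] = args.agent
    # Assumption: explicit flags take precedence over fields from stdin.
    for key in ("success", "execution_time_s", "quality_score"):
        value = getattr(args, key)
        if value is not None:
            record[key] = value
    return record
```

For example, `record_from_args(["tdd-workflow", "--success", "--time", "12.5"])` yields a record with `agent`, `success`, and `execution_time_s` set and nothing else.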
### Tasks
- [x] T05 — Implement `metrics` CLI command group (record, show, list, export)
- [x] T06 — Integrate `metrics record` into session-close protocol template for pilot agents
- [x] T07 — CLI tests for metrics commands (click.testing, temp project dir)
- [x] T08 — Update `docs/CLI_CHEAT_SHEET.md` and `docs/agency-framework.md` with metrics section
### Definition of done
- All four metrics commands work against a test project with `.kaizen/metrics/`
- Session-close template documents the `metrics record` one-liner for pilot agents
- CLI cheat sheet updated
---
## Part 3 — Wire OptimizationLoop to Project Metrics
Connect the existing Python optimization infrastructure to real project data.
### Tasks
- [x] T09 — Add `OptimizationLoop.from_metrics_store(store)` factory that loads `PerformanceMetrics` from executions
- [x] T10 — Implement `kaizen-agentic metrics optimize [agent]` — run analysis, print recommendations, write `optimizer/analysis.json`
- [x] T11 — Consolidate `agent-optimization.md` and `agent-agent-optimization.md` into single canonical `optimization` agent; update registry
- [x] T12 — Update `agent-optimization.md` session protocol to invoke `metrics optimize` and reference ADR-004
- [x] T13 — Unit + integration tests: synthetic executions → recommendations → non-empty output
### Definition of done
- `kaizen-agentic metrics optimize` produces recommendations when ≥10 execution records exist (per wiki minimum sample size)
- Single canonical optimization meta-agent in registry
- Tests cover insufficient-data and sufficient-data paths
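
The insufficient-data and sufficient-data paths can be pictured with a small sketch. This is not the `OptimizationLoop` implementation; `recommend` is a hypothetical function, and it borrows the 10-record minimum from the definition of done plus the 30 s and 0.8 starting-point thresholds listed under Notes. Unlike the definition of done, it returns an empty list when enough data exists but no threshold is breached.

```python
from statistics import mean

MIN_SAMPLE_SIZE = 10         # minimum records before any analysis runs
MAX_EXECUTION_TIME_S = 30.0  # starting-point threshold (see Notes)
MIN_SUCCESS_RATE = 0.8       # starting-point threshold (see Notes)


def recommend(executions: list[dict]) -> list[str]:
    """Return plain-text recommendations from simple statistics (hypothetical)."""
    # Insufficient-data path: do not guess from a tiny sample.
    if len(executions) < MIN_SAMPLE_SIZE:
        return []

    success_rate = mean(1.0 if e.get("success") else 0.0 for e in executions)
    avg_time = mean(e.get("execution_time_s", 0.0) for e in executions)

    recommendations = []
    if success_rate < MIN_SUCCESS_RATE:
        recommendations.append(
            f"Success rate {success_rate:.0%} is below the {MIN_SUCCESS_RATE:.0%} target"
        )
    if avg_time > MAX_EXECUTION_TIME_S:
        recommendations.append(
            f"Average execution time {avg_time:.1f}s exceeds {MAX_EXECUTION_TIME_S:.0f}s"
        )
    return recommendations
```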
---
## Part 4 — Bridge Coach, Memory, and Metrics
Unify qualitative memory and quantitative metrics in the orientation path.
### Tasks
- [x] T14 — Extend `memory brief` to include metrics summary for target agent (recent success rate, avg quality, trend arrow)
- [x] T15 — Extend `agent-coach.md` to reference metrics context in synthesis instructions
- [x] T16 — E2e test: populate memory + metrics for two agents → `memory brief` includes both qualitative and quantitative sections
### Definition of done
- `memory brief tdd-workflow` output includes a `## Performance Summary` block when metrics exist
- E2e test passes
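
One way the `## Performance Summary` block could be rendered from execution records is sketched below. The function name `render_performance_summary`, the five-record comparison window, and the exact bullet wording are assumptions; only the heading and the three reported quantities (recent success rate, average quality, trend arrow) come from this workplan. The trend arrow compares the most recent window against the window before it.

```python
def render_performance_summary(executions: list[dict], window: int = 5) -> str:
    """Render a Markdown summary block, or "" when no metrics exist (hypothetical)."""
    if not executions:
        return ""

    recent = executions[-window:]
    previous = executions[-2 * window:-window]

    def success_rate(rows: list[dict]) -> float:
        return sum(1 for r in rows if r.get("success")) / len(rows)

    recent_rate = success_rate(recent)
    avg_quality = sum(r.get("quality_score", 0.0) for r in recent) / len(recent)

    # Without an earlier window there is nothing to compare against.
    if not previous:
        arrow = "→"
    else:
        previous_rate = success_rate(previous)
        if recent_rate > previous_rate:
            arrow = "↑"
        elif recent_rate < previous_rate:
            arrow = "↓"
        else:
            arrow = "→"

    return "\n".join([
        "## Performance Summary",
        f"- Recent success rate: {recent_rate:.0%} {arrow}",
        f"- Average quality: {avg_quality:.2f}",
    ])
```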
---
## Part 5 — Pilot Agent and Template Conformance
Prove the loop end-to-end on one agent before fleet-wide rollout.
**Pilot agent:** `tdd-workflow` (high usage, clear success criteria in existing prompt)
### Tasks
- [x] T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
- [x] T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
- [x] T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
- [x] T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation
### Definition of done
- tdd-workflow is the documented reference for metrics-enabled agents
- Full loop demonstrated in e2e test: record → show → optimize → brief
---
## Part 6 — Packaging and Orientation
Close distribution and documentation gaps surfaced in gap analysis.
### Tasks
- [x] T21 — Sync missing 4 agents into `src/kaizen_agentic/data/agents/` (coach, sys-medic, scope-analyst, optimization)
- [x] T22 — Update `README.md` Getting Oriented to link `INTENT.md` and `wiki/` (SCOPE.md already updated)
- [x] T23 — Update `.claude/rules/architecture.md` agent table (20 agents, meta category, sys-medic, coach)
- [x] T24 — CHANGELOG.md entry for metrics convention and CLI
### Definition of done
- `pip install` / packaged data includes all 21 agents
- README orientation path matches SCOPE.md
- architecture.md agent count accurate
---
## Sequencing
```
Part 1 (T01–T04) ──→ Part 2 (T05–T08) ──→ Part 3 (T09–T13)
                     Part 4 (T14–T16) ←────────────┘
                     Part 5 (T17–T20) ──→ Part 6 (T21–T24)
```
Parts 1–2 are blocking. Part 3 depends on storage + CLI. Parts 4–5 can overlap
once the Part 3 factory exists. Part 6 can run in parallel except T21 (needs final
agent consolidation from T11).
Estimated effort: 4–6 sessions.
---
## Out of Scope (this workplan)
- Full `wiki/KaizenAgentTemplate.md` conformance for all 21 agents (future workplan)
- KaizenGuidance codemod pipeline (`wiki/KaizenGuidance.md`)
- Scheduled/automated optimizer runs (cron, activity-core integration) — convention only
- WP-0001 CI/CD, PyPI publication, cross-platform testing
- ML-based pattern detection (pandas/sklearn in wiki spec) — simple statistics first
---
## Success Criteria
A reader of `INTENT.md` can point to this repo and say:
1. Agents **can** record measurable per-execution outcomes in a standard location.
2. The optimization loop **does** read real project data and produce recommendations.
3. Coach orientation **includes** performance context, not only qualitative memory.
4. At least one agent (tdd-workflow) demonstrates the full measure → analyse → orient cycle.
---
## Completion
**Shipped:** v1.1.0 (2026-06-18); measurement loop operational through v1.2.0.
| Milestone | Detail |
|-----------|--------|
| ADR-004 | Project metrics convention (`.kaizen/metrics/`) |
| CLI | `metrics record`, `show`, `list`, `export`, `optimize` |
| Coach bridge | `memory brief` includes `## Performance Summary` |
| Pilot | `tdd-workflow` reference in `wiki/AboutKaizenAgents.md` |
| Tests | `test_metrics*.py`, `test_optimization_metrics.py`, e2e agency tests pass |
All 24 tasks complete. Fleet-wide template conformance and scheduled optimizer runs
remain out of scope (future workplans).
---
## State Hub Task IDs
| Code | UUID |
|------|------|
| T01 | 4e7b0fd2-38c0-46aa-84a7-bb18366b8c7c |
| T02 | eeaa99c7-d7a7-403b-a013-364cba45a663 |
| T03 | 247c097f-de89-4383-930c-35ee66de9b36 |
| T04 | 3aa14026-6ee3-4384-b409-11300c1302f0 |
| T05 | 6b505d29-7d2e-44a2-a4b7-1fe82884390c |
| T06 | 84f2a357-f2dd-4fc7-96b6-a4e80d5467a7 |
| T07 | 8e9ee64b-b7c4-4dff-ac6e-988fd47ef95d |
| T08 | 4c41e0db-d5d8-4a1b-b346-06ad004edf4a |
| T09 | 0b374439-6eca-4754-8e15-2a7eece0cd27 |
| T10 | db87a09b-0252-495c-a771-a43b4b98f820 |
| T11 | 73cb7d73-6fc6-42a9-97aa-d33cdf9ee363 |
| T12 | c127eca7-7394-42db-ba5e-721aef0ccb76 |
| T13 | f208dc9f-cdf7-47e3-9c03-09097e46eee9 |
| T14 | d01f969c-bbb1-4eca-a4f1-d79d5c867b35 |
| T15 | 67f791a4-fced-4986-a331-7eb4ea47fe6e |
| T16 | 1fb89b54-8bd2-40bf-9a71-04693cb9f695 |
| T17 | 1d471a7a-9a98-4805-903e-b4a2b8153717 |
| T18 | abb387f1-86ce-4b9b-a516-2d4efb6aca4c |
| T19 | 67fbc26e-a57d-4133-96e6-3d2cdbd10dc0 |
| T20 | fbdd7c8b-e122-48d9-8c8f-de9f82d025e3 |
| T21 | 9662bcec-34fe-451b-b61f-5d11b9574576 |
| T22 | 422aae43-5697-4a00-86e9-1569baf09422 |
| T23 | ba6b3411-d330-4a58-8cd0-62b4fbef8c5f |
| T24 | 748be9f3-f6ac-4f26-a844-6330268935b6 |
**Hub workstream:** `kaizen-wp-0003-measurement-loop` (`36252a45-f360-4496-bf77-17b5dfb02767`)
---
## Notes
- Retention default: 180 days (per `wiki/AgentKaizenOptimizer.md`); override via project config in a later iteration
- WP-0001 T04 (telemetry) should consume ADR-004 schema rather than inventing a parallel format
- `OptimizationLoop` threshold constants (30s execution, 0.8 success rate) are starting points; expose in config later
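
The 180-day retention default could be applied with a prune step along these lines. This is a sketch, not the `MetricsStore` prune logic: `prune_executions` is a hypothetical name, it rewrites the log in place, and it assumes every record carries a timezone-aware ISO-8601 `timestamp` with an explicit offset such as `+00:00`.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path

RETENTION_DAYS = 180  # default from the Notes above


def prune_executions(log_path: Path, now: datetime,
                     retention_days: int = RETENTION_DAYS) -> int:
    """Drop records older than the retention window; return how many were removed."""
    cutoff = now - timedelta(days=retention_days)
    kept, dropped = [], 0
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumption: timestamps are offset-aware ISO-8601 strings.
        recorded_at = datetime.fromisoformat(json.loads(line)["timestamp"])
        if recorded_at >= cutoff:
            kept.append(line)
        else:
            dropped += 1
    log_path.write_text("".join(entry + "\n" for entry in kept), encoding="utf-8")
    return dropped
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the function deterministic under test.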