kaizen-agentic/workplans/kaizen-agentic-WP-0003-measurement-loop.md
tegwick 4a9c2d9bea WP-0003 Part 6: packaging sync and docs close-out
Sync coach, sys-medic, scope-analyst, optimization, and updated
tdd-workflow to packaged data (20 agents). Update architecture.md,
README orientation, and CHANGELOG for the metrics loop. Mark WP-0003
completed.
2026-06-16 01:49:27 +02:00

---
id: KAIZEN-WP-0003
type: workplan
title: "Measurement Loop: Metrics Convention, Collection, and Optimizer Integration"
domain: custodian
repo: kaizen-agentic
status: completed
owner: kaizen-agentic
topic_slug: custodian
state_hub_workstream_id: 36252a45-f360-4496-bf77-17b5dfb02767
created: "2026-06-16"
updated: "2026-06-18"
---
# KAIZEN-WP-0003 — Measurement Loop: Metrics Convention, Collection, and Optimizer Integration
**Status:** completed
**Owner:** kaizen-agentic
**Repo:** kaizen-agentic
**Target version:** 1.1.0 (partial; remainder in WP-0001)
## Goal
Close the kaizen feedback loop defined in `INTENT.md` and `wiki/AgentKaizenOptimizer.md`:
agents produce **measurable, per-execution performance records** stored in project-scoped
`.kaizen/metrics/`, the existing `OptimizationLoop` reads that data and generates
evidence-based recommendations, and the Coach/optimizer meta-agents share a single
improvement path.
This workplan addresses the P0 gap from the INTENT gap analysis: strategic vision
(memory + qualitative learning) exists; **quantitative measurement → refinement**
does not.
---
## Background
| Layer | State |
|-------|-------|
| `INTENT.md` | Requires measurable-by-default agents and evidence-based refinement |
| `wiki/KaizenAgentTemplate.md` | Defines `metrics`, `idempotency`, `optimization` sections per agent |
| `wiki/AgentKaizenOptimizer.md` | Specifies `.kaizen/metrics/` storage and optimizer behaviour |
| `src/kaizen_agentic/optimization.py` | `OptimizationLoop` + `PerformanceMetrics` implemented, unit-tested, unwired |
| Agency framework (WP-0002) | `.kaizen/agents/<name>/memory.md` + Coach brief — qualitative only |
| WP-0001 T04 | Telemetry — overlaps; WP-0003 defines the convention; WP-0001 can adopt it |
---
## Part 1 — Metrics Convention and Storage
Define the project-scoped metrics artifact alongside the existing memory convention
(ADR-002).
### Location convention
```
<project-root>/.kaizen/metrics/<agent-name>/
  executions.jsonl    # append-only per-execution records
  summary.json        # rolling aggregates (regenerated on write)
```
Optimizer-specific aggregates (per `wiki/AgentKaizenOptimizer.md`):
```
<project-root>/.kaizen/metrics/optimizer/
  analysis.json            # last run output + fingerprint
  recommendations.jsonl    # append-only recommendation history
```
### Execution record schema (minimum viable)
```json
{
  "timestamp": "ISO-8601",
  "agent": "tdd-workflow",
  "session_id": "optional-uuid-or-hash",
  "execution_time_s": 0.0,
  "success": true,
  "quality_score": 0.0,
  "primary_metric": { "name": "...", "value": 0.0, "target": 0.0 },
  "metadata": {}
}
```
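A record matching the schema above could be built and checked along these lines. This is a minimal sketch, not the project's code: `make_record`, `validate_record`, and `to_jsonl_line` are hypothetical helper names, and the type checks and the 0.0 to 1.0 bound on `quality_score` are assumptions drawn from the schema and the `--quality` option.

```python
import json
from datetime import datetime, timezone

# Field names come from the schema above; the expected types are assumptions.
REQUIRED_FIELDS = {
    "timestamp": str,
    "agent": str,
    "execution_time_s": (int, float),
    "success": bool,
    "quality_score": (int, float),
}


def make_record(agent, execution_time_s, success, quality_score,
                session_id=None, primary_metric=None, metadata=None):
    """Build one execution record with a UTC ISO-8601 timestamp."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "agent": agent,
        "execution_time_s": float(execution_time_s),
        "success": bool(success),
        "quality_score": float(quality_score),
        "metadata": metadata or {},
    }
    # Optional fields are only written when the caller supplies them.
    if session_id is not None:
        record["session_id"] = session_id
    if primary_metric is not None:
        record["primary_metric"] = primary_metric
    return record


def validate_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for {name}")
    score = record.get("quality_score")
    if isinstance(score, (int, float)) and not 0.0 <= score <= 1.0:
        problems.append("quality_score outside 0.0-1.0")
    return problems


def to_jsonl_line(record):
    """Serialise a record as one line suitable for executions.jsonl."""
    return json.dumps(record, sort_keys=True)
```

Returning a list of problems rather than raising lets a caller report every issue in a malformed record at once.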
### Tasks
- [x] T01 — Write ADR-004: project metrics convention (location, schema, lifecycle, retention, Helix Forge correlation)
- [x] T02 — Implement `MetricsStore` in `src/kaizen_agentic/metrics.py` (append, read, summarise, prune by retention)
- [x] T03 — Add `memory init` hook to scaffold `.kaizen/metrics/<agent>/` alongside memory (optional flag `--no-metrics`)
- [x] T04 — Unit tests for `MetricsStore` (append idempotency key, summary regeneration, retention prune)
### Definition of done
- ADR-004 accepted and referenced from `docs/agency-framework.md`
- `MetricsStore` passes unit tests
- `kaizen-agentic memory init <agent>` creates metrics scaffold by default
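The append and summary behaviour listed in T02 and T04 might look roughly like the sketch below. `MetricsStoreSketch` is a hypothetical stand-in, not the actual `MetricsStore` API; treating `(session_id, timestamp)` as the idempotency key and the choice of summary fields are both assumptions.

```python
import json
from pathlib import Path


class MetricsStoreSketch:
    """Hypothetical stand-in for MetricsStore; not the project's actual API."""

    def __init__(self, project_root, agent):
        self.dir = Path(project_root) / ".kaizen" / "metrics" / agent
        self.executions = self.dir / "executions.jsonl"
        self.summary = self.dir / "summary.json"

    def read(self):
        """Return all execution records, oldest first."""
        if not self.executions.exists():
            return []
        with self.executions.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def append(self, record):
        """Append a record; return False if its key was already stored.

        Assumption: (session_id, timestamp) serves as the idempotency key.
        """
        key = (record.get("session_id"), record.get("timestamp"))
        for existing in self.read():
            if (existing.get("session_id"), existing.get("timestamp")) == key:
                return False
        self.dir.mkdir(parents=True, exist_ok=True)
        with self.executions.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        self._write_summary()
        return True

    def _write_summary(self):
        """Regenerate summary.json from the full log (called after an append)."""
        records = self.read()
        count = len(records)
        summary = {
            "count": count,
            "success_rate": sum(1 for r in records if r.get("success")) / count,
            "avg_quality": sum(r.get("quality_score", 0.0) for r in records) / count,
            "avg_execution_time_s": sum(r.get("execution_time_s", 0.0) for r in records) / count,
        }
        self.summary.write_text(json.dumps(summary, indent=2), encoding="utf-8")
```

Regenerating the summary from the whole log on every write keeps `summary.json` derivable from `executions.jsonl` alone, at the cost of a full read per append.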
---
## Part 2 — Metrics CLI
Expose metrics collection and inspection without requiring Python imports in agent
sessions.
### Commands
```
kaizen-agentic metrics record <agent> # Append one execution record (stdin JSON or flags)
kaizen-agentic metrics show <agent> # Print summary + recent executions
kaizen-agentic metrics list # List agents with metrics in current project
kaizen-agentic metrics export <agent> # Dump executions.jsonl to stdout
```
### Options (record)
- `--target / -t` — project root (default: cwd)
- `--success / --failure` — boolean outcome shorthand
- `--time` — execution time in seconds
- `--quality` — quality score 0.0–1.0
- `--json` — full record on stdin
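How the `record` options could map onto an execution record is sketched below. T07 mentions `click.testing`, which suggests the real command group is built with click; this sketch swaps in the standard-library `argparse` so it runs without extra dependencies. `build_parser`, `record_from_args`, and `unit_interval` are hypothetical names, and letting flags override fields read from stdin is an assumption.

```python
import argparse
import json
import sys


def unit_interval(text):
    """argparse type: a float restricted to the range 0.0-1.0."""
    value = float(text)
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError("quality must be between 0.0 and 1.0")
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="kaizen-agentic metrics record")
    parser.add_argument("agent")
    parser.add_argument("--target", "-t", default=".", help="project root")
    # --success and --failure write to the same field and exclude each other.
    outcome = parser.add_mutually_exclusive_group()
    outcome.add_argument("--success", dest="success", action="store_true", default=None)
    outcome.add_argument("--failure", dest="success", action="store_false", default=None)
    parser.add_argument("--time", dest="execution_time_s", type=float)
    parser.add_argument("--quality", dest="quality_score", type=unit_interval)
    parser.add_argument("--json", dest="from_stdin", action="store_true")
    return parser


def record_from_args(args, stdin=sys.stdin):
    """Turn parsed options into a (possibly partial) execution record."""
    record = json.load(stdin) if args.from_stdin else {}
    record["agent"] = args.agent
    # Assumption: explicit flags override values supplied in the stdin JSON.
    for field in ("success", "execution_time_s", "quality_score"):
        value = getattr(args, field)
        if value is not None:
            record[field] = value
    return record
```

Under these assumptions, `build_parser().parse_args(["tdd-workflow", "--success", "--time", "12.5"])` yields a record with `success` set to `True` and `execution_time_s` set to `12.5`.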
### Tasks
- [x] T05 — Implement `metrics` CLI command group (record, show, list, export)
- [x] T06 — Integrate `metrics record` into session-close protocol template for pilot agents
- [x] T07 — CLI tests for metrics commands (click.testing, temp project dir)
- [x] T08 — Update `docs/CLI_CHEAT_SHEET.md` and `docs/agency-framework.md` with metrics section
### Definition of done
- All four metrics commands work against a test project with `.kaizen/metrics/`
- Session-close template documents the `metrics record` one-liner for pilot agents
- CLI cheat sheet updated
---
## Part 3 — Wire OptimizationLoop to Project Metrics
Connect the existing Python optimization infrastructure to real project data.
### Tasks
- [x] T09 — Add `OptimizationLoop.from_metrics_store(store)` factory that loads `PerformanceMetrics` from executions
- [x] T10 — Implement `kaizen-agentic metrics optimize [agent]` — run analysis, print recommendations, write `optimizer/analysis.json`
- [x] T11 — Consolidate `agent-optimization.md` and `agent-agent-optimization.md` into single canonical `optimization` agent; update registry
- [x] T12 — Update `agent-optimization.md` session protocol to invoke `metrics optimize` and reference ADR-004
- [x] T13 — Unit + integration tests: synthetic executions → recommendations → non-empty output
### Definition of done
- `kaizen-agentic metrics optimize` produces recommendations when ≥10 execution records exist (per wiki minimum sample size)
- Single canonical optimization meta-agent in registry
- Tests cover insufficient-data and sufficient-data paths
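The insufficient-data and sufficient-data paths could be exercised with a threshold check like the one below. It is a simplified sketch, not the `OptimizationLoop` implementation: `analyse` is a hypothetical function, the constants mirror the 10-record minimum above and the starting-point thresholds in the Notes (30 s execution, 0.8 success rate), and applying the 30 s limit to the average execution time is an assumption.

```python
from statistics import mean

MIN_SAMPLES = 10          # minimum sample size cited in the definition of done
MAX_AVG_TIME_S = 30.0     # starting-point threshold from the Notes
MIN_SUCCESS_RATE = 0.8    # starting-point threshold from the Notes


def analyse(records):
    """Return a status, the sample size, and a list of recommendation strings."""
    size = len(records)
    if size < MIN_SAMPLES:
        # Too few executions to support an evidence-based recommendation.
        return {"status": "insufficient_data", "sample_size": size,
                "recommendations": []}

    success_rate = sum(1 for r in records if r["success"]) / size
    avg_time = mean(r["execution_time_s"] for r in records)

    recommendations = []
    if success_rate < MIN_SUCCESS_RATE:
        recommendations.append(
            f"success rate {success_rate:.0%} is below {MIN_SUCCESS_RATE:.0%}"
        )
    if avg_time > MAX_AVG_TIME_S:
        recommendations.append(
            f"average execution time {avg_time:.1f}s exceeds {MAX_AVG_TIME_S:.0f}s"
        )
    return {"status": "ok", "sample_size": size,
            "recommendations": recommendations}
```

Keeping the thresholds as module-level constants makes them easy to move into project config later, as the Notes suggest.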
---
## Part 4 — Bridge Coach, Memory, and Metrics
Unify qualitative memory and quantitative metrics in the orientation path.
### Tasks
- [x] T14 — Extend `memory brief` to include metrics summary for target agent (recent success rate, avg quality, trend arrow)
- [x] T15 — Extend `agent-coach.md` to reference metrics context in synthesis instructions
- [x] T16 — E2e test: populate memory + metrics for two agents → `memory brief` includes both qualitative and quantitative sections
### Definition of done
- `memory brief tdd-workflow` output includes a `## Performance Summary` block when metrics exist
- E2e test passes
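One way the `## Performance Summary` block could be rendered from execution records is sketched here. `trend_arrow` and `render_performance_summary` are hypothetical names; the five-record comparison window, the 0.05 dead band, and the exact line layout are assumptions, since the workplan only specifies recent success rate, average quality, and a trend arrow.

```python
def _success_rate(records):
    return sum(1 for r in records if r["success"]) / len(records)


def trend_arrow(records, window=5, dead_band=0.05):
    """Compare the latest window's success rate with the window before it."""
    recent = records[-window:]
    previous = records[-2 * window:-window]
    if not recent or not previous:
        return "→"  # not enough history to call a trend
    delta = _success_rate(recent) - _success_rate(previous)
    if delta > dead_band:
        return "↑"
    if delta < -dead_band:
        return "↓"
    return "→"


def render_performance_summary(records):
    """Return the markdown block, or an empty string when no metrics exist."""
    if not records:
        return ""
    avg_quality = sum(r["quality_score"] for r in records) / len(records)
    lines = [
        "## Performance Summary",
        f"- Executions: {len(records)}",
        f"- Success rate: {_success_rate(records):.0%} {trend_arrow(records)}",
        f"- Avg quality: {avg_quality:.2f}",
    ]
    return "\n".join(lines)
```

Returning an empty string for agents without metrics lets the brief omit the block entirely, which matches the "when metrics exist" condition above.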
---
## Part 5 — Pilot Agent and Template Conformance
Prove the loop end-to-end on one agent before fleet-wide rollout.
**Pilot agent:** `tdd-workflow` (high usage, clear success criteria in existing prompt)
### Tasks
- [x] T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
- [x] T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
- [x] T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
- [x] T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation
### Definition of done
- tdd-workflow is the documented reference for metrics-enabled agents
- Full loop demonstrated in e2e test: record → show → optimize → brief
---
## Part 6 — Packaging and Orientation
Close distribution and documentation gaps surfaced in gap analysis.
### Tasks
- [x] T21 — Sync missing 4 agents into `src/kaizen_agentic/data/agents/` (coach, sys-medic, scope-analyst, optimization)
- [x] T22 — Update `README.md` Getting Oriented to link `INTENT.md` and `wiki/` (SCOPE.md already updated)
- [x] T23 — Update `.claude/rules/architecture.md` agent table (20 agents, meta category, sys-medic, coach)
- [x] T24 — CHANGELOG.md entry for metrics convention and CLI
### Definition of done
- `pip install` / packaged data includes all 20 agents
- README orientation path matches SCOPE.md
- architecture.md agent count accurate
---
## Sequencing
```
Part 1 (T01–T04) ──→ Part 2 (T05–T08) ──→ Part 3 (T09–T13)
                     Part 4 (T14–T16) ←────────────┘
                     Part 5 (T17–T20) ──→ Part 6 (T21–T24)
```
Parts 1–2 are blocking. Part 3 depends on storage + CLI. Parts 4–5 can overlap
once Part 3 factory exists. Part 6 can run in parallel except T21 (needs final
agent consolidation from T11).
Estimated effort: 4–6 sessions.
---
## Out of Scope (this workplan)
- Full `wiki/KaizenAgentTemplate.md` conformance for all 20 agents (future workplan)
- KaizenGuidance codemod pipeline (`wiki/KaizenGuidance.md`)
- Scheduled/automated optimizer runs (cron, activity-core integration) — convention only
- WP-0001 CI/CD, PyPI publication, cross-platform testing
- ML-based pattern detection (pandas/sklearn in wiki spec) — simple statistics first
---
## Success Criteria
A reader of `INTENT.md` can point to this repo and say:
1. Agents **can** record measurable per-execution outcomes in a standard location.
2. The optimization loop **does** read real project data and produce recommendations.
3. Coach orientation **includes** performance context, not only qualitative memory.
4. At least one agent (tdd-workflow) demonstrates the full measure → analyse → orient cycle.
---
## State Hub Task IDs
| Code | UUID |
|------|------|
| T01 | 4e7b0fd2-38c0-46aa-84a7-bb18366b8c7c |
| T02 | eeaa99c7-d7a7-403b-a013-364cba45a663 |
| T03 | 247c097f-de89-4383-930c-35ee66de9b36 |
| T04 | 3aa14026-6ee3-4384-b409-11300c1302f0 |
| T05 | 6b505d29-7d2e-44a2-a4b7-1fe82884390c |
| T06 | 84f2a357-f2dd-4fc7-96b6-a4e80d5467a7 |
| T07 | 8e9ee64b-b7c4-4dff-ac6e-988fd47ef95d |
| T08 | 4c41e0db-d5d8-4a1b-b346-06ad004edf4a |
| T09 | 0b374439-6eca-4754-8e15-2a7eece0cd27 |
| T10 | db87a09b-0252-495c-a771-a43b4b98f820 |
| T11 | 73cb7d73-6fc6-42a9-97aa-d33cdf9ee363 |
| T12 | c127eca7-7394-42db-ba5e-721aef0ccb76 |
| T13 | f208dc9f-cdf7-47e3-9c03-09097e46eee9 |
| T14 | d01f969c-bbb1-4eca-a4f1-d79d5c867b35 |
| T15 | 67f791a4-fced-4986-a331-7eb4ea47fe6e |
| T16 | 1fb89b54-8bd2-40bf-9a71-04693cb9f695 |
| T17 | 1d471a7a-9a98-4805-903e-b4a2b8153717 |
| T18 | abb387f1-86ce-4b9b-a516-2d4efb6aca4c |
| T19 | 67fbc26e-a57d-4133-96e6-3d2cdbd10dc0 |
| T20 | fbdd7c8b-e122-48d9-8c8f-de9f82d025e3 |
| T21 | 9662bcec-34fe-451b-b61f-5d11b9574576 |
| T22 | 422aae43-5697-4a00-86e9-1569baf09422 |
| T23 | ba6b3411-d330-4a58-8cd0-62b4fbef8c5f |
| T24 | 748be9f3-f6ac-4f26-a844-6330268935b6 |
**Hub workstream:** `kaizen-wp-0003-measurement-loop` (`36252a45-f360-4496-bf77-17b5dfb02767`)
---
## Notes
- Retention default: 180 days (per `wiki/AgentKaizenOptimizer.md`); override via project config in a later iteration
- WP-0001 T04 (telemetry) should consume ADR-004 schema rather than inventing a parallel format
- `OptimizationLoop` threshold constants (30s execution, 0.8 success rate) are starting points; expose in config later
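The 180-day retention default from the first note could be applied with a filter like the following. This is a sketch under stated assumptions: `prune` is a hypothetical function, timestamps are taken to be ISO-8601 strings that `datetime.fromisoformat` can parse, and naive timestamps are treated as UTC.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # default stated in the Notes


def prune(records, now=None, retention_days=RETENTION_DAYS):
    """Keep only records whose timestamp falls inside the retention window."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    kept = []
    for record in records:
        # Note: before Python 3.11, fromisoformat does not accept a "Z" suffix.
        stamp = datetime.fromisoformat(record["timestamp"])
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
        if stamp >= cutoff:
            kept.append(record)
    return kept
```

Passing `now` explicitly keeps the function deterministic in tests; a project-config override would only need to supply a different `retention_days`.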