---
id: KAIZEN-WP-0003
type: workplan
title: "Measurement Loop: Metrics Convention, Collection, and Optimizer Integration"
domain: custodian
repo: kaizen-agentic
status: completed
owner: kaizen-agentic
topic_slug: custodian
state_hub_workstream_id: 36252a45-f360-4496-bf77-17b5dfb02767
created: "2026-06-16"
updated: "2026-06-18"
---

# KAIZEN-WP-0003 — Measurement Loop: Metrics Convention, Collection, and Optimizer Integration

**Status:** completed  
**Owner:** kaizen-agentic  
**Repo:** kaizen-agentic  
**Target version:** 1.1.0 (partial; remainder in WP-0001)

## Goal

Close the kaizen feedback loop defined in `INTENT.md` and `wiki/AgentKaizenOptimizer.md`: agents produce **measurable, per-execution performance records** stored in project-scoped `.kaizen/metrics/`, the existing `OptimizationLoop` reads that data and generates evidence-based recommendations, and the Coach/optimizer meta-agents share a single improvement path.

This workplan addresses the P0 gap from the INTENT gap analysis: strategic vision (memory + qualitative learning) exists; **quantitative measurement → refinement** does not.

---

## Background

| Layer | State |
|-------|-------|
| `INTENT.md` | Requires measurable-by-default agents and evidence-based refinement |
| `wiki/KaizenAgentTemplate.md` | Defines `metrics`, `idempotency`, `optimization` sections per agent |
| `wiki/AgentKaizenOptimizer.md` | Specifies `.kaizen/metrics/` storage and optimizer behaviour |
| `src/kaizen_agentic/optimization.py` | `OptimizationLoop` + `PerformanceMetrics` implemented, unit-tested, unwired |
| Agency framework (WP-0002) | `.kaizen/agents//memory.md` + Coach brief — qualitative only |
| WP-0001 T04 | Telemetry — overlaps; WP-0003 defines the convention; WP-0001 can adopt it |

---

## Part 1 — Metrics Convention and Storage

Define the project-scoped metrics artifact alongside the existing memory convention (ADR-002).
### Location convention

```
/.kaizen/metrics//
  executions.jsonl      # append-only per-execution records
  summary.json          # rolling aggregates (regenerated on write)
```

Optimizer-specific aggregates (per `wiki/AgentKaizenOptimizer.md`):

```
/.kaizen/metrics/optimizer/
  analysis.json             # last run output + fingerprint
  recommendations.jsonl     # append-only recommendation history
```

### Execution record schema (minimum viable)

```json
{
  "timestamp": "ISO-8601",
  "agent": "tdd-workflow",
  "session_id": "optional-uuid-or-hash",
  "execution_time_s": 0.0,
  "success": true,
  "quality_score": 0.0,
  "primary_metric": { "name": "...", "value": 0.0, "target": 0.0 },
  "metadata": {}
}
```

### Tasks

- [x] T01 — Write ADR-004: project metrics convention (location, schema, lifecycle, retention, Helix Forge correlation)
- [x] T02 — Implement `MetricsStore` in `src/kaizen_agentic/metrics.py` (append, read, summarise, prune by retention)
- [x] T03 — Add `memory init` hook to scaffold `.kaizen/metrics//` alongside memory (optional flag `--no-metrics`)
- [x] T04 — Unit tests for `MetricsStore` (append idempotency key, summary regeneration, retention prune)

### Definition of done

- ADR-004 accepted and referenced from `docs/agency-framework.md`
- `MetricsStore` passes unit tests
- `kaizen-agentic memory init ` creates metrics scaffold by default

---

## Part 2 — Metrics CLI

Expose metrics collection and inspection without requiring Python imports in agent sessions.
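
The commands in this part sit on top of the Part 1 storage convention. As a minimal sketch of what appending a record and regenerating `summary.json` could involve, consider the following. It is not the actual `MetricsStore` implementation: the function names, the one-directory-per-agent layout under `.kaizen/metrics/`, and the summary field names are all assumptions made for illustration.

```python
import json
from pathlib import Path


def _agent_dir(project_root: Path, agent: str) -> Path:
    # Assumed layout: one directory per agent under .kaizen/metrics/.
    return project_root / ".kaizen" / "metrics" / agent


def append_execution(project_root: Path, record: dict) -> Path:
    """Append one execution record as a single JSON line (hypothetical helper)."""
    agent_dir = _agent_dir(project_root, record["agent"])
    agent_dir.mkdir(parents=True, exist_ok=True)
    path = agent_dir / "executions.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return path


def regenerate_summary(project_root: Path, agent: str) -> dict:
    """Recompute rolling aggregates from executions.jsonl and write summary.json."""
    agent_dir = _agent_dir(project_root, agent)
    text = (agent_dir / "executions.jsonl").read_text(encoding="utf-8")
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    count = len(records)
    # Summary keys below are illustrative, not the schema fixed by ADR-004.
    summary = {
        "agent": agent,
        "executions": count,
        "success_rate": sum(1 for r in records if r["success"]) / count if count else None,
        "avg_quality_score": sum(r["quality_score"] for r in records) / count if count else None,
        "avg_execution_time_s": sum(r["execution_time_s"] for r in records) / count if count else None,
    }
    (agent_dir / "summary.json").write_text(json.dumps(summary, indent=2), encoding="utf-8")
    return summary
```

Keeping `executions.jsonl` append-only and deriving `summary.json` from it means the summary can always be rebuilt from the raw records if it is lost or its shape changes.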
### Commands

```
kaizen-agentic metrics record   # Append one execution record (stdin JSON or flags)
kaizen-agentic metrics show     # Print summary + recent executions
kaizen-agentic metrics list     # List agents with metrics in current project
kaizen-agentic metrics export   # Dump executions.jsonl to stdout
```

### Options (record)

- `--target / -t` — project root (default: cwd)
- `--success / --failure` — boolean outcome shorthand
- `--time` — execution time in seconds
- `--quality` — quality score 0.0–1.0
- `--json` — full record on stdin

### Tasks

- [x] T05 — Implement `metrics` CLI command group (record, show, list, export)
- [x] T06 — Integrate `metrics record` into session-close protocol template for pilot agents
- [x] T07 — CLI tests for metrics commands (click.testing, temp project dir)
- [x] T08 — Update `docs/CLI_CHEAT_SHEET.md` and `docs/agency-framework.md` with metrics section

### Definition of done

- All four metrics commands work against a test project with `.kaizen/metrics/`
- Session-close template documents the `metrics record` one-liner for pilot agents
- CLI cheat sheet updated

---

## Part 3 — Wire OptimizationLoop to Project Metrics

Connect the existing Python optimization infrastructure to real project data.
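
As a rough illustration of this wiring, the sketch below turns execution records into a small analysis result. It does not use the real `OptimizationLoop` or `PerformanceMetrics` API, whose signatures are not shown in this workplan; `ExecutionSample`, `analyse`, and the result keys are hypothetical stand-ins. The minimum sample size of 10 and the 30 s / 0.8 thresholds are the values cited elsewhere in this workplan.

```python
from dataclasses import dataclass
from statistics import mean

MIN_SAMPLES = 10              # minimum sample size cited in this workplan
MAX_EXECUTION_TIME_S = 30.0   # starting-point threshold from the Notes
MIN_SUCCESS_RATE = 0.8        # starting-point threshold from the Notes


@dataclass(frozen=True)
class ExecutionSample:
    """Hypothetical stand-in for one loaded execution record."""
    execution_time_s: float
    success: bool
    quality_score: float


def samples_from_records(records: list[dict]) -> list[ExecutionSample]:
    # Only the fields needed for the analysis are pulled from each record.
    return [
        ExecutionSample(float(r["execution_time_s"]), bool(r["success"]), float(r["quality_score"]))
        for r in records
    ]


def analyse(records: list[dict]) -> dict:
    """Return a status plus plain-text recommendations (simple statistics only)."""
    samples = samples_from_records(records)
    if len(samples) < MIN_SAMPLES:
        # Insufficient-data path: no recommendations are attempted.
        return {"status": "insufficient_data", "samples": len(samples), "recommendations": []}

    success_rate = mean(1.0 if s.success else 0.0 for s in samples)
    avg_time = mean(s.execution_time_s for s in samples)

    recommendations = []
    if success_rate < MIN_SUCCESS_RATE:
        recommendations.append(
            f"Success rate {success_rate:.2f} is below the {MIN_SUCCESS_RATE} threshold."
        )
    if avg_time > MAX_EXECUTION_TIME_S:
        recommendations.append(
            f"Average execution time {avg_time:.1f}s exceeds the {MAX_EXECUTION_TIME_S}s threshold."
        )
    return {"status": "ok", "samples": len(samples), "recommendations": recommendations}
```

Separating the `insufficient_data` status from an `ok` result with an empty list lets a caller distinguish "not enough evidence yet" from "enough evidence, nothing to change".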
### Tasks

- [x] T09 — Add `OptimizationLoop.from_metrics_store(store)` factory that loads `PerformanceMetrics` from executions
- [x] T10 — Implement `kaizen-agentic metrics optimize [agent]` — run analysis, print recommendations, write `optimizer/analysis.json`
- [x] T11 — Consolidate `agent-optimization.md` and `agent-agent-optimization.md` into single canonical `optimization` agent; update registry
- [x] T12 — Update `agent-optimization.md` session protocol to invoke `metrics optimize` and reference ADR-004
- [x] T13 — Unit + integration tests: synthetic executions → recommendations → non-empty output

### Definition of done

- `kaizen-agentic metrics optimize` produces recommendations when ≥10 execution records exist (per wiki minimum sample size)
- Single canonical optimization meta-agent in registry
- Tests cover insufficient-data and sufficient-data paths

---

## Part 4 — Bridge Coach, Memory, and Metrics

Unify qualitative memory and quantitative metrics in the orientation path.

### Tasks

- [x] T14 — Extend `memory brief` to include metrics summary for target agent (recent success rate, avg quality, trend arrow)
- [x] T15 — Extend `agent-coach.md` to reference metrics context in synthesis instructions
- [x] T16 — E2e test: populate memory + metrics for two agents → `memory brief` includes both qualitative and quantitative sections

### Definition of done

- `memory brief tdd-workflow` output includes a `## Performance Summary` block when metrics exist
- E2e test passes

---

## Part 5 — Pilot Agent and Template Conformance

Prove the loop end-to-end on one agent before fleet-wide rollout.
**Pilot agent:** `tdd-workflow` (high usage, clear success criteria in existing prompt)

### Tasks

- [x] T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
- [x] T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
- [x] T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
- [x] T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation

### Definition of done

- tdd-workflow is the documented reference for metrics-enabled agents
- Full loop demonstrated in e2e test: record → show → optimize → brief

---

## Part 6 — Packaging and Orientation

Close distribution and documentation gaps surfaced in gap analysis.

### Tasks

- [x] T21 — Sync missing 4 agents into `src/kaizen_agentic/data/agents/` (coach, sys-medic, scope-analyst, optimization)
- [x] T22 — Update `README.md` Getting Oriented to link `INTENT.md` and `wiki/` (SCOPE.md already updated)
- [x] T23 — Update `.claude/rules/architecture.md` agent table (20 agents, meta category, sys-medic, coach)
- [x] T24 — CHANGELOG.md entry for metrics convention and CLI

### Definition of done

- `pip install` / packaged data includes all 21 agents
- README orientation path matches SCOPE.md
- architecture.md agent count accurate

---

## Sequencing

```
Part 1 (T01–T04) ──→ Part 2 (T05–T08) ──→ Part 3 (T09–T13)
                                                   │
                     Part 4 (T14–T16) ←────────────┘
                             │
                     Part 5 (T17–T20) ──→ Part 6 (T21–T24)
```

Parts 1–2 are blocking. Part 3 depends on storage + CLI. Parts 4–5 can overlap once the Part 3 factory (T09) exists. Part 6 can run in parallel with the other parts, except T21, which needs the final agent consolidation from T11.

Estimated effort: 4–6 sessions.
---

## Out of Scope (this workplan)

- Full `wiki/KaizenAgentTemplate.md` conformance for all 21 agents (future workplan)
- KaizenGuidance codemod pipeline (`wiki/KaizenGuidance.md`)
- Scheduled/automated optimizer runs (cron, activity-core integration) — convention only
- WP-0001 CI/CD, PyPI publication, cross-platform testing
- ML-based pattern detection (pandas/sklearn in wiki spec) — simple statistics first

---

## Success Criteria

A reader of `INTENT.md` can point to this repo and say:

1. Agents **can** record measurable per-execution outcomes in a standard location.
2. The optimization loop **does** read real project data and produce recommendations.
3. Coach orientation **includes** performance context, not only qualitative memory.
4. At least one agent (tdd-workflow) demonstrates the full measure → analyse → orient cycle.

---

## State Hub Task IDs

| Code | UUID |
|------|------|
| T01 | 4e7b0fd2-38c0-46aa-84a7-bb18366b8c7c |
| T02 | eeaa99c7-d7a7-403b-a013-364cba45a663 |
| T03 | 247c097f-de89-4383-930c-35ee66de9b36 |
| T04 | 3aa14026-6ee3-4384-b409-11300c1302f0 |
| T05 | 6b505d29-7d2e-44a2-a4b7-1fe82884390c |
| T06 | 84f2a357-f2dd-4fc7-96b6-a4e80d5467a7 |
| T07 | 8e9ee64b-b7c4-4dff-ac6e-988fd47ef95d |
| T08 | 4c41e0db-d5d8-4a1b-b346-06ad004edf4a |
| T09 | 0b374439-6eca-4754-8e15-2a7eece0cd27 |
| T10 | db87a09b-0252-495c-a771-a43b4b98f820 |
| T11 | 73cb7d73-6fc6-42a9-97aa-d33cdf9ee363 |
| T12 | c127eca7-7394-42db-ba5e-721aef0ccb76 |
| T13 | f208dc9f-cdf7-47e3-9c03-09097e46eee9 |
| T14 | d01f969c-bbb1-4eca-a4f1-d79d5c867b35 |
| T15 | 67f791a4-fced-4986-a331-7eb4ea47fe6e |
| T16 | 1fb89b54-8bd2-40bf-9a71-04693cb9f695 |
| T17 | 1d471a7a-9a98-4805-903e-b4a2b8153717 |
| T18 | abb387f1-86ce-4b9b-a516-2d4efb6aca4c |
| T19 | 67fbc26e-a57d-4133-96e6-3d2cdbd10dc0 |
| T20 | fbdd7c8b-e122-48d9-8c8f-de9f82d025e3 |
| T21 | 9662bcec-34fe-451b-b61f-5d11b9574576 |
| T22 | 422aae43-5697-4a00-86e9-1569baf09422 |
| T23 | ba6b3411-d330-4a58-8cd0-62b4fbef8c5f |
| T24 | 748be9f3-f6ac-4f26-a844-6330268935b6 |

**Hub workstream:** `kaizen-wp-0003-measurement-loop` (`36252a45-f360-4496-bf77-17b5dfb02767`)

---

## Notes

- Retention default: 180 days (per `wiki/AgentKaizenOptimizer.md`); override via project config in a later iteration
- WP-0001 T04 (telemetry) should consume ADR-004 schema rather than inventing a parallel format
- `OptimizationLoop` threshold constants (30s execution, 0.8 success rate) are starting points; expose in config later
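
The 180-day retention default in the first note could be applied roughly as follows. This is a sketch under stated assumptions, not the `MetricsStore` prune logic: `prune_expired` is a hypothetical name, and it assumes every record's `timestamp` is an ISO-8601 string with an explicit UTC offset (for example `+00:00`) so that it can be compared with a timezone-aware `now`.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 180  # default cited in the Notes


def prune_expired(records: list[dict], now: datetime,
                  retention_days: int = RETENTION_DAYS) -> list[dict]:
    """Keep only records whose timestamp falls inside the retention window."""
    cutoff = now - timedelta(days=retention_days)
    # Assumes offset-aware timestamps; naive ones would raise on comparison.
    return [r for r in records if datetime.fromisoformat(r["timestamp"]) >= cutoff]


# Usage: one recent record is kept, one old record is dropped.
now = datetime(2026, 6, 18, tzinfo=timezone.utc)
records = [
    {"agent": "tdd-workflow", "timestamp": "2026-06-01T09:00:00+00:00"},
    {"agent": "tdd-workflow", "timestamp": "2025-01-01T09:00:00+00:00"},
]
kept = prune_expired(records, now)
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the prune step deterministic in tests.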