Sync coach, sys-medic, scope-analyst, optimization, and updated tdd-workflow to packaged data (20 agents). Update architecture.md, README orientation, and CHANGELOG for the metrics loop. Mark WP-0003 completed.
| id | type | title | domain | repo | status | owner | topic_slug | state_hub_workstream_id | created | updated |
|---|---|---|---|---|---|---|---|---|---|---|
| KAIZEN-WP-0003 | workplan | Measurement Loop: Metrics Convention, Collection, and Optimizer Integration | custodian | kaizen-agentic | completed | kaizen-agentic | custodian | 36252a45-f360-4496-bf77-17b5dfb02767 | 2026-06-16 | 2026-06-18 |
KAIZEN-WP-0003 — Measurement Loop: Metrics Convention, Collection, and Optimizer Integration
Status: completed
Owner: kaizen-agentic
Repo: kaizen-agentic
Target version: 1.1.0 (partial; remainder in WP-0001)
Goal
Close the kaizen feedback loop defined in INTENT.md and wiki/AgentKaizenOptimizer.md:
agents produce measurable, per-execution performance records stored in project-scoped
.kaizen/metrics/, the existing OptimizationLoop reads that data and generates
evidence-based recommendations, and the Coach/optimizer meta-agents share a single
improvement path.
This workplan addresses the P0 gap from the INTENT gap analysis: strategic vision (memory + qualitative learning) exists; quantitative measurement → refinement does not.
Background
| Layer | State |
|---|---|
| `INTENT.md` | Requires measurable-by-default agents and evidence-based refinement |
| `wiki/KaizenAgentTemplate.md` | Defines metrics, idempotency, optimization sections per agent |
| `wiki/AgentKaizenOptimizer.md` | Specifies `.kaizen/metrics/` storage and optimizer behaviour |
| `src/kaizen_agentic/optimization.py` | `OptimizationLoop` + `PerformanceMetrics` implemented, unit-tested, unwired |
| Agency framework (WP-0002) | .kaizen/agents/<name>/memory.md + Coach brief — qualitative only |
| WP-0001 T04 | Telemetry — overlaps; WP-0003 defines the convention; WP-0001 can adopt it |
Part 1 — Metrics Convention and Storage
Define the project-scoped metrics artifact alongside the existing memory convention (ADR-002).
Location convention
<project-root>/.kaizen/metrics/<agent-name>/
  executions.jsonl   # append-only per-execution records
  summary.json       # rolling aggregates (regenerated on write)
Optimizer-specific aggregates (per wiki/AgentKaizenOptimizer.md):
<project-root>/.kaizen/metrics/optimizer/
  analysis.json          # last run output + fingerprint
  recommendations.jsonl  # append-only recommendation history
Execution record schema (minimum viable)
{
  "timestamp": "ISO-8601",
  "agent": "tdd-workflow",
  "session_id": "optional-uuid-or-hash",
  "execution_time_s": 0.0,
  "success": true,
  "quality_score": 0.0,
  "primary_metric": { "name": "...", "value": 0.0, "target": 0.0 },
  "metadata": {}
}
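A minimal sketch of how one record matching the schema above could be built and serialised as a single JSONL line. `ExecutionRecord` and `to_jsonl` are hypothetical names chosen for illustration, not confirmed project API; only the field names come from the schema, and the sample values are invented.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ExecutionRecord:
    """Hypothetical container mirroring the minimum viable schema."""

    agent: str
    success: bool
    execution_time_s: float = 0.0
    quality_score: float = 0.0
    session_id: Optional[str] = None
    primary_metric: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def __post_init__(self) -> None:
        # Assumption: quality is a 0.0-1.0 score and time is never negative.
        if not 0.0 <= self.quality_score <= 1.0:
            raise ValueError("quality_score must be within 0.0-1.0")
        if self.execution_time_s < 0:
            raise ValueError("execution_time_s must be non-negative")

    def to_jsonl(self) -> str:
        """Serialise to one line suitable for appending to executions.jsonl."""
        return json.dumps(asdict(self), sort_keys=True)


# Illustrative values only.
record = ExecutionRecord(
    agent="tdd-workflow",
    success=True,
    execution_time_s=12.5,
    quality_score=0.9,
    primary_metric={"name": "test_pass_rate", "value": 0.95, "target": 0.9},
)
line = record.to_jsonl()
```

Validating at construction time keeps malformed records out of the append-only log, where they would be awkward to correct later.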
Tasks
- T01 — Write ADR-004: project metrics convention (location, schema, lifecycle, retention, Helix Forge correlation)
- T02 — Implement `MetricsStore` in `src/kaizen_agentic/metrics.py` (append, read, summarise, prune by retention)
- T03 — Add `memory init` hook to scaffold `.kaizen/metrics/<agent>/` alongside memory (optional flag `--no-metrics`)
- T04 — Unit tests for `MetricsStore` (append idempotency key, summary regeneration, retention prune)
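The append, read, and summarise behaviour named in T02 and T04 might look roughly like the sketch below. This is an illustration under stated assumptions, not the shipped `MetricsStore`: the method names, the `(session_id, timestamp)` idempotency key, and the summary fields are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class MetricsStore:
    """Illustrative store for <project-root>/.kaizen/metrics/<agent>/."""

    def __init__(self, project_root: Path) -> None:
        self.root = Path(project_root) / ".kaizen" / "metrics"

    def _agent_dir(self, agent: str) -> Path:
        path = self.root / agent
        path.mkdir(parents=True, exist_ok=True)
        return path

    def read(self, agent: str) -> list:
        """Return all execution records for one agent, oldest first."""
        log = self._agent_dir(agent) / "executions.jsonl"
        if not log.exists():
            return []
        with log.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    def append(self, agent: str, record: dict) -> bool:
        """Append one record; return False when it duplicates an earlier one.

        Assumption: (session_id, timestamp) serves as the idempotency key.
        """
        key = (record.get("session_id"), record.get("timestamp"))
        for existing in self.read(agent):
            if (existing.get("session_id"), existing.get("timestamp")) == key:
                return False
        log = self._agent_dir(agent) / "executions.jsonl"
        with log.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        self.summarise(agent)  # summary.json is regenerated on every write
        return True

    def summarise(self, agent: str) -> dict:
        """Rebuild summary.json from the full execution log."""
        records = self.read(agent)
        count = len(records)
        successes = sum(1 for r in records if r.get("success"))
        quality = sum(float(r.get("quality_score", 0.0)) for r in records)
        summary = {
            "agent": agent,
            "executions": count,
            "success_rate": successes / count if count else None,
            "avg_quality": quality / count if count else None,
        }
        target = self._agent_dir(agent) / "summary.json"
        target.write_text(json.dumps(summary, indent=2), encoding="utf-8")
        return summary


with tempfile.TemporaryDirectory() as tmp:
    store = MetricsStore(Path(tmp))
    rec = {
        "timestamp": "2026-06-16T10:00:00+00:00",
        "agent": "tdd-workflow",
        "session_id": "s1",
        "success": True,
        "quality_score": 0.9,
    }
    first = store.append("tdd-workflow", rec)
    second = store.append("tdd-workflow", rec)  # same key, so it is skipped
    summary = store.summarise("tdd-workflow")
```

Scanning the whole log on each append is acceptable for small per-agent files; a larger log would want an index of seen keys instead.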
Definition of done
- ADR-004 accepted and referenced from `docs/agency-framework.md`
- `MetricsStore` passes unit tests
- `kaizen-agentic memory init <agent>` creates metrics scaffold by default
Part 2 — Metrics CLI
Expose metrics collection and inspection without requiring Python imports in agent sessions.
Commands
kaizen-agentic metrics record <agent> # Append one execution record (stdin JSON or flags)
kaizen-agentic metrics show <agent> # Print summary + recent executions
kaizen-agentic metrics list # List agents with metrics in current project
kaizen-agentic metrics export <agent> # Dump executions.jsonl to stdout
Options (record)
- `--target / -t` — project root (default: cwd)
- `--success / --failure` — boolean outcome shorthand
- `--time` — execution time in seconds
- `--quality` — quality score 0.0–1.0
- `--json` — full record on stdin
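To make the option semantics concrete, here is a sketch of how the `record` flags could map onto an execution record. It uses the standard-library `argparse` purely as a stand-in so the example runs without dependencies; the project's CLI tests reference `click.testing`, so the real command is presumably built on click. The helper names `build_parser` and `record_from_args` are hypothetical.

```python
import argparse
import io
import json
import sys


def build_parser() -> argparse.ArgumentParser:
    """Stdlib stand-in for the `metrics record` options listed above."""
    parser = argparse.ArgumentParser(prog="kaizen-agentic metrics record")
    parser.add_argument("agent")
    parser.add_argument("--target", "-t", default=".", help="project root")
    outcome = parser.add_mutually_exclusive_group()
    outcome.add_argument("--success", dest="success",
                         action="store_true", default=None)
    outcome.add_argument("--failure", dest="success",
                         action="store_false", default=None)
    parser.add_argument("--time", type=float, dest="execution_time_s")
    parser.add_argument("--quality", type=float, dest="quality_score")
    parser.add_argument("--json", action="store_true", dest="from_stdin")
    return parser


def record_from_args(argv, stdin=None) -> dict:
    """Turn CLI arguments (or a JSON document on stdin) into a record dict."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.from_stdin:
        record = json.load(stdin or sys.stdin)
    else:
        record = {}
        if args.success is not None:
            record["success"] = args.success
        if args.execution_time_s is not None:
            record["execution_time_s"] = args.execution_time_s
        if args.quality_score is not None:
            if not 0.0 <= args.quality_score <= 1.0:
                parser.error("--quality must be within 0.0-1.0")
            record["quality_score"] = args.quality_score
    record["agent"] = args.agent  # the positional argument always wins
    return record


from_flags = record_from_args(
    ["tdd-workflow", "--success", "--time", "42.5", "--quality", "0.9"]
)
from_json = record_from_args(
    ["tdd-workflow", "--json"],
    stdin=io.StringIO('{"success": false, "execution_time_s": 3.0}'),
)
```

Keeping `--success` and `--failure` in one mutually exclusive group leaves the outcome unset when neither flag is given, so a missing outcome is not silently recorded as a failure.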
Tasks
- T05 — Implement `metrics` CLI command group (record, show, list, export)
- T06 — Integrate `metrics record` into session-close protocol template for pilot agents
- T07 — CLI tests for metrics commands (click.testing, temp project dir)
- T08 — Update `docs/CLI_CHEAT_SHEET.md` and `docs/agency-framework.md` with metrics section
Definition of done
- All four metrics commands work against a test project with `.kaizen/metrics/`
- Session-close template documents the `metrics record` one-liner for pilot agents
- CLI cheat sheet updated
Part 3 — Wire OptimizationLoop to Project Metrics
Connect the existing Python optimization infrastructure to real project data.
Tasks
- T09 — Add `OptimizationLoop.from_metrics_store(store)` factory that loads `PerformanceMetrics` from executions
- T10 — Implement `kaizen-agentic metrics optimize [agent]` — run analysis, print recommendations, write `optimizer/analysis.json`
- T11 — Consolidate `agent-optimization.md` and `agent-agent-optimization.md` into single canonical `optimization` agent; update registry
- T12 — Update `agent-optimization.md` session protocol to invoke `metrics optimize` and reference ADR-004
- T13 — Unit + integration tests: synthetic executions → recommendations → non-empty output
Definition of done
- `kaizen-agentic metrics optimize` produces recommendations when ≥10 execution records exist (per wiki minimum sample size)
- Single canonical optimization meta-agent in registry
- Tests cover insufficient-data and sufficient-data paths
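The insufficient-data and sufficient-data paths can be sketched as follows. The sample-size minimum (10) and the two thresholds (30s execution, 0.8 success rate) are taken from this workplan; the `recommend` function, the mean-based comparison, and the output shape are assumptions for illustration and do not describe how `OptimizationLoop` is actually implemented.

```python
from statistics import mean

MIN_SAMPLE_SIZE = 10          # minimum sample size from the definition of done
MAX_EXECUTION_TIME_S = 30.0   # starting-point threshold from the Notes section
MIN_SUCCESS_RATE = 0.8        # starting-point threshold from the Notes section


def recommend(records: list) -> dict:
    """Return simple statistics-based recommendations for one agent."""
    if len(records) < MIN_SAMPLE_SIZE:
        # Insufficient-data path: report the gap instead of guessing.
        return {
            "status": "insufficient_data",
            "sample_size": len(records),
            "recommendations": [],
        }

    success_rate = mean(1.0 if r.get("success") else 0.0 for r in records)
    avg_time = mean(float(r.get("execution_time_s", 0.0)) for r in records)

    recommendations = []
    if success_rate < MIN_SUCCESS_RATE:
        recommendations.append(
            f"Success rate {success_rate:.0%} is below {MIN_SUCCESS_RATE:.0%}"
        )
    if avg_time > MAX_EXECUTION_TIME_S:
        recommendations.append(
            f"Mean execution time {avg_time:.1f}s exceeds {MAX_EXECUTION_TIME_S:.0f}s"
        )
    return {
        "status": "ok",
        "sample_size": len(records),
        "success_rate": success_rate,
        "avg_execution_time_s": avg_time,
        "recommendations": recommendations,
    }


# Synthetic executions, in the spirit of T13.
too_few = recommend([{"success": True, "execution_time_s": 1.0}] * 3)
slow_and_flaky = recommend(
    [{"success": i % 2 == 0, "execution_time_s": 40.0} for i in range(10)]
)
```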
Part 4 — Bridge Coach, Memory, and Metrics
Unify qualitative memory and quantitative metrics in the orientation path.
Tasks
- T14 — Extend `memory brief` to include metrics summary for target agent (recent success rate, avg quality, trend arrow)
- T15 — Extend `agent-coach.md` to reference metrics context in synthesis instructions
- T16 — E2e test: populate memory + metrics for two agents → `memory brief` includes both qualitative and quantitative sections
Definition of done
- `memory brief tdd-workflow` output includes a `## Performance Summary` block when metrics exist
- E2e test passes
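One way the `## Performance Summary` block could be rendered from execution records is sketched below. The heading text and the three reported values (success rate, average quality, trend arrow) come from this workplan; the function names, the half-versus-half trend rule, and the 0.05 tolerance are hypothetical choices.

```python
from statistics import mean


def trend_arrow(values: list, tolerance: float = 0.05) -> str:
    """Compare the mean of the later half against the earlier half."""
    if len(values) < 4:
        return "→"  # too few points to call a direction
    mid = len(values) // 2
    delta = mean(values[mid:]) - mean(values[:mid])
    if delta > tolerance:
        return "↑"
    if delta < -tolerance:
        return "↓"
    return "→"


def performance_summary(records: list) -> str:
    """Render a Markdown block, or an empty string when no metrics exist."""
    if not records:
        return ""
    success_rate = mean(1.0 if r.get("success") else 0.0 for r in records)
    qualities = [float(r.get("quality_score", 0.0)) for r in records]
    lines = [
        "## Performance Summary",
        "",
        f"- Executions: {len(records)}",
        f"- Success rate: {success_rate:.0%}",
        f"- Avg quality: {mean(qualities):.2f}",
        f"- Quality trend: {trend_arrow(qualities)}",
    ]
    return "\n".join(lines)


# Synthetic records ordered oldest to newest.
sample = [
    {"success": True, "quality_score": q} for q in (0.5, 0.5, 0.9, 0.9)
]
block = performance_summary(sample)
```

Returning an empty string when no records exist matches the definition of done, which only requires the block when metrics are present.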
Part 5 — Pilot Agent and Template Conformance
Prove the loop end-to-end on one agent before fleet-wide rollout.
Pilot agent: tdd-workflow (high usage, clear success criteria in existing prompt)
Tasks
- T17 — Add `metrics` section to `agent-tdd-workflow.md` frontmatter (primary: test-pass rate; secondary: cycle time)
- T18 — Add session-close step: invoke `kaizen-agentic metrics record tdd-workflow` with session outcome
- T19 — Document pilot in `wiki/AboutKaizenAgents.md` as reference implementation
- T20 — E2e test: two simulated tdd-workflow sessions → metrics accumulate → optimize produces recommendation
Definition of done
- tdd-workflow is the documented reference for metrics-enabled agents
- Full loop demonstrated in e2e test: record → show → optimize → brief
Part 6 — Packaging and Orientation
Close distribution and documentation gaps surfaced in gap analysis.
Tasks
- T21 — Sync missing 4 agents into `src/kaizen_agentic/data/agents/` (coach, sys-medic, scope-analyst, optimization)
- T22 — Update `README.md` Getting Oriented to link `INTENT.md` and `wiki/` (SCOPE.md already updated)
- T23 — Update `.claude/rules/architecture.md` agent table (20 agents, meta category, sys-medic, coach)
- T24 — CHANGELOG.md entry for metrics convention and CLI
Definition of done
- `pip install` / packaged data includes all 20 agents
- README orientation path matches SCOPE.md
- architecture.md agent count accurate
Sequencing
Part 1 (T01–T04) ──→ Part 2 (T05–T08) ──→ Part 3 (T09–T13)
                                                  │
                    Part 4 (T14–T16) ←────────────┘
                            │
                    Part 5 (T17–T20) ──→ Part 6 (T21–T24)
Parts 1–2 are blocking. Part 3 depends on storage + CLI. Parts 4–5 can overlap once Part 3 factory exists. Part 6 can run in parallel except T21 (needs final agent consolidation from T11).
Estimated effort: 4–6 sessions.
Out of Scope (this workplan)
- Full `wiki/KaizenAgentTemplate.md` conformance for all 21 agents (future workplan)
- KaizenGuidance codemod pipeline (`wiki/KaizenGuidance.md`)
- Scheduled/automated optimizer runs (cron, activity-core integration) — convention only
- WP-0001 CI/CD, PyPI publication, cross-platform testing
- ML-based pattern detection (pandas/sklearn in wiki spec) — simple statistics first
Success Criteria
A reader of INTENT.md can point to this repo and say:
- Agents can record measurable per-execution outcomes in a standard location.
- The optimization loop reads real project data and produces recommendations.
- Coach orientation includes performance context, not only qualitative memory.
- At least one agent (tdd-workflow) demonstrates the full measure → analyse → orient cycle.
State Hub Task IDs
| Code | UUID |
|---|---|
| T01 | 4e7b0fd2-38c0-46aa-84a7-bb18366b8c7c |
| T02 | eeaa99c7-d7a7-403b-a013-364cba45a663 |
| T03 | 247c097f-de89-4383-930c-35ee66de9b36 |
| T04 | 3aa14026-6ee3-4384-b409-11300c1302f0 |
| T05 | 6b505d29-7d2e-44a2-a4b7-1fe82884390c |
| T06 | 84f2a357-f2dd-4fc7-96b6-a4e80d5467a7 |
| T07 | 8e9ee64b-b7c4-4dff-ac6e-988fd47ef95d |
| T08 | 4c41e0db-d5d8-4a1b-b346-06ad004edf4a |
| T09 | 0b374439-6eca-4754-8e15-2a7eece0cd27 |
| T10 | db87a09b-0252-495c-a771-a43b4b98f820 |
| T11 | 73cb7d73-6fc6-42a9-97aa-d33cdf9ee363 |
| T12 | c127eca7-7394-42db-ba5e-721aef0ccb76 |
| T13 | f208dc9f-cdf7-47e3-9c03-09097e46eee9 |
| T14 | d01f969c-bbb1-4eca-a4f1-d79d5c867b35 |
| T15 | 67f791a4-fced-4986-a331-7eb4ea47fe6e |
| T16 | 1fb89b54-8bd2-40bf-9a71-04693cb9f695 |
| T17 | 1d471a7a-9a98-4805-903e-b4a2b8153717 |
| T18 | abb387f1-86ce-4b9b-a516-2d4efb6aca4c |
| T19 | 67fbc26e-a57d-4133-96e6-3d2cdbd10dc0 |
| T20 | fbdd7c8b-e122-48d9-8c8f-de9f82d025e3 |
| T21 | 9662bcec-34fe-451b-b61f-5d11b9574576 |
| T22 | 422aae43-5697-4a00-86e9-1569baf09422 |
| T23 | ba6b3411-d330-4a58-8cd0-62b4fbef8c5f |
| T24 | 748be9f3-f6ac-4f26-a844-6330268935b6 |
Hub workstream: kaizen-wp-0003-measurement-loop (36252a45-f360-4496-bf77-17b5dfb02767)
Notes
- Retention default: 180 days (per `wiki/AgentKaizenOptimizer.md`); override via project config in a later iteration
- WP-0001 T04 (telemetry) should consume ADR-004 schema rather than inventing a parallel format
- `OptimizationLoop` threshold constants (30s execution, 0.8 success rate) are starting points; expose in config later
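A sketch of the retention prune under the 180-day default noted above. The `prune` function and its signature are hypothetical, and treating timestamps without an offset as UTC is an assumption; timestamps here use an explicit `+00:00` offset because `datetime.fromisoformat` only accepts a trailing `Z` on newer Python versions.

```python
import json
import tempfile
from datetime import datetime, timedelta, timezone
from pathlib import Path

RETENTION_DAYS = 180  # default from the Notes section


def prune(log_path: Path, now: datetime,
          retention_days: int = RETENTION_DAYS) -> int:
    """Drop records older than the retention window; return how many went."""
    cutoff = now - timedelta(days=retention_days)
    kept, dropped = [], 0
    for line in log_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        stamp = datetime.fromisoformat(json.loads(line)["timestamp"])
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: naive = UTC
        if stamp >= cutoff:
            kept.append(line)
        else:
            dropped += 1
    log_path.write_text("".join(item + "\n" for item in kept), encoding="utf-8")
    return dropped


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "executions.jsonl"
    log.write_text(
        '{"timestamp": "2025-01-01T00:00:00+00:00", "success": true}\n'
        '{"timestamp": "2026-06-16T10:00:00+00:00", "success": true}\n',
        encoding="utf-8",
    )
    removed = prune(log, now=datetime(2026, 6, 18, tzinfo=timezone.utc))
    remaining = log.read_text(encoding="utf-8").splitlines()
```

Pruning rewrites a file that is otherwise append-only, so a production version would likely write to a temporary file and rename it into place to avoid losing records if interrupted.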