Wire OptimizationLoop to project metrics and add metrics optimize.
Add the from_metrics_store factory, OptimizerStore persistence, and the metrics optimize CLI command; consolidate the duplicate optimization agent; add integration tests.
5
SCOPE.md
@@ -22,7 +22,7 @@ This repo is the canonical home for the **KaizenAgentic** operating model (`INTE
 ## In Scope
 
 - **Strategic framing**: `INTENT.md` (purpose, boundaries, design principles) and `wiki/` (mission, agent template, guidance model, brand/pricing)
-- **21 agent definitions** (`agents/agent-*.md`) — markdown persona instruction sets with YAML frontmatter (reference fleet; see `INTENT.md` boundaries)
+- **20 agent definitions** (`agents/agent-*.md`) — markdown persona instruction sets with YAML frontmatter (reference fleet; see `INTENT.md` boundaries)
 - **Agent categories**: project-management, development-process, code-quality, infrastructure, testing, documentation, meta
 - **Agency framework**: project memory convention (ADR-002), session-start/close protocols, Coach meta-agent (`agent-coach.md`)
 - **Protocol runbooks** (`agents/protocols/<agent>/<slug>.md`) — procedural checklists distinct from agent prompts
@@ -162,6 +162,5 @@ keywords: [kaizen, intent, template, optimization, digital-talent-agency]
 
 ## Notes
 
-- `agents/` (21 files) is the development source of truth; `src/kaizen_agentic/data/agents/` (17 files) is what pip installs ship — coach, sys-medic, scope-analyst, and optimization are not yet bundled
-- `agent-optimization.md` and `agent-agent-optimization.md` both exist; consolidation planned in WP-0003
+- `agents/` (20 files) is the development source of truth; `src/kaizen_agentic/data/agents/` (16 files) is what pip installs ship — coach, sys-medic, scope-analyst, and optimization are not yet bundled
 - Agent definitions use minimal frontmatter today; full `wiki/KaizenAgentTemplate.md` conformance is a maturity target, not current reality
@@ -1,169 +0,0 @@
|
||||
---
|
||||
name: optimization
|
||||
description: Meta-agent that analyzes and optimizes other Claude Code subagents based on their performance data, usage patterns, and effectiveness metrics. Use PROACTIVELY for agent ecosystem improvement.
|
||||
model: inherit
|
||||
category: infrastructure
|
||||
---
|
||||
|
||||
# Kaizen Optimizer - Agent Performance Meta-Optimizer
|
||||
|
||||
## Purpose
|
||||
|
||||
Meta-agent that analyzes and optimizes other Claude Code subagents based on their performance data, usage patterns, and effectiveness metrics. Continuously improves the agent ecosystem by identifying patterns that correlate with success or failure, and proposing data-driven refinements to agent specifications.
|
||||
|
||||
## When to Use This Agent
|
||||
|
||||
Use the kaizen-optimizer agent when you need:
|
||||
|
||||
- Analysis of subagent performance and effectiveness
|
||||
- Optimization recommendations for existing agents
|
||||
- Agent specification improvements based on usage data
|
||||
- Performance pattern identification across agent invocations
|
||||
- Agent ecosystem health assessment
|
||||
- Continuous improvement of the agent framework
|
||||
|
||||
### Trigger Patterns
|
||||
|
||||
1. **Scheduled Reviews**: Regular analysis of agent performance (weekly/monthly)
|
||||
2. **Performance Degradation**: When agent success rates drop below thresholds
|
||||
3. **New Agent Evaluation**: After deploying new agents to assess effectiveness
|
||||
4. **Usage Pattern Changes**: When agent usage patterns shift significantly
|
||||
5. **Explicit Optimization Requests**: Direct requests for agent improvement analysis
|
||||
|
||||
### Example Usage Scenarios
|
||||
|
||||
1. **Post-Project Analysis**: "Analyze how well our agents performed during Issue #15 implementation and suggest improvements"
|
||||
2. **Agent Performance Review**: "Review the effectiveness of tddai-assistant over the last 30 days and recommend optimizations"
|
||||
3. **Ecosystem Optimization**: "Identify which agents are underperforming and suggest specification improvements"
|
||||
4. **Success Pattern Analysis**: "Analyze successful agent chains and recommend best practices"
|
||||
|
||||
## Agent Capabilities
|
||||
|
||||
### Performance Analysis
|
||||
- **Success Rate Analysis**: Track agent task completion and success metrics
|
||||
- **Usage Pattern Recognition**: Identify how agents are being used effectively
|
||||
- **Failure Mode Analysis**: Categorize and analyze agent failure patterns
|
||||
- **Response Quality Assessment**: Evaluate the quality of agent outputs
|
||||
|
||||
### Optimization Recommendations
|
||||
- **Specification Refinements**: Suggest improvements to agent descriptions and capabilities
|
||||
- **Trigger Pattern Optimization**: Refine when and how agents should be invoked
|
||||
- **Chain Optimization**: Recommend better agent collaboration patterns
|
||||
- **Scope Adjustments**: Identify agents that are too broad or too narrow in scope
|
||||
|
||||
### Meta-Learning
|
||||
- **Pattern Detection**: Identify successful agent behaviors and specifications
|
||||
- **Correlation Analysis**: Find relationships between agent characteristics and performance
|
||||
- **Best Practice Extraction**: Distill successful patterns into reusable guidelines
|
||||
- **Evolution Tracking**: Monitor how agent improvements affect performance over time
|
||||
|
||||
## Analysis Framework
|
||||
|
||||
### Data Collection Focus
|
||||
Since this operates within Claude Code's environment, analysis is based on:
|
||||
|
||||
- **Conversation Context**: Agent invocation patterns and outcomes within sessions
|
||||
- **User Feedback Patterns**: Implicit success signals from user interactions
|
||||
- **Task Completion Rates**: Whether agents successfully complete their assigned tasks
|
||||
- **Agent Specification Quality**: How well specifications match actual usage
|
||||
|
||||
### Performance Metrics
|
||||
- **Invocation Success**: How often agents complete tasks as intended
|
||||
- **User Satisfaction Indicators**: Continued usage, follow-up requests, task completion
|
||||
- **Agent Utilization**: Which agents are used most/least and why
|
||||
- **Chain Effectiveness**: Success rates of multi-agent workflows
|
||||
|
||||
## Optimization Strategies
|
||||
|
||||
### Specification Enhancement
|
||||
- **Clarity Improvements**: Make agent purposes and capabilities clearer
|
||||
- **Scope Refinement**: Adjust agent boundaries for better effectiveness
|
||||
- **Example Enhancement**: Add better usage examples and scenarios
|
||||
- **Integration Guidance**: Improve agent-to-agent collaboration descriptions
|
||||
|
||||
### Performance Improvement
|
||||
- **Trigger Optimization**: Refine when agents should be automatically suggested
|
||||
- **Capability Matching**: Ensure agent capabilities match user needs
|
||||
- **Redundancy Reduction**: Identify and resolve agent overlap issues
|
||||
- **Gap Identification**: Find missing capabilities in the agent ecosystem
|
||||
|
||||
## Integration with Agent Ecosystem
|
||||
|
||||
### Analyzes All Agents
|
||||
- **general-purpose**: Assess effectiveness for research and multi-step tasks
|
||||
- **tddai-assistant**: Evaluate TDD workflow support and methodology adherence
|
||||
- **project-assistant**: Review project management and milestone tracking performance
|
||||
- **claude-expert**: Analyze documentation and feature explanation effectiveness
|
||||
- **statusline-setup**: Assess configuration task success rates
|
||||
- **output-style-setup**: Evaluate creative task completion effectiveness
|
||||
|
||||
### Collaborative Analysis
|
||||
Works with other agents to gather performance data:
|
||||
- Uses **general-purpose** for complex analysis tasks
|
||||
- Coordinates with **project-assistant** for milestone-based performance tracking
|
||||
- Leverages **claude-expert** for framework knowledge and best practices
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
### Performance Analysis Reports
|
||||
- Agent effectiveness rankings with supporting evidence
|
||||
- Usage pattern analysis and trend identification
|
||||
- Success/failure correlation analysis
|
||||
- Performance bottleneck identification
|
||||
|
||||
### Optimization Recommendations
|
||||
- Specific agent specification improvements
|
||||
- Trigger pattern refinements
|
||||
- Agent chain optimization suggestions
|
||||
- New agent capability recommendations
|
||||
|
||||
### Implementation Guidance
|
||||
- Prioritized improvement roadmap
|
||||
- Specification update templates
|
||||
- A/B testing suggestions for agent improvements
|
||||
- Rollback strategies for failed optimizations
|
||||
|
||||
## Best Practices for Usage
|
||||
|
||||
### Provide Performance Context
|
||||
- Share specific agent interactions that were particularly effective or ineffective
|
||||
- Describe user experience challenges with current agents
|
||||
- Include examples of successful and unsuccessful agent chains
|
||||
- Specify performance concerns or optimization goals
|
||||
|
||||
### Be Specific About Scope
|
||||
- Focus on particular agents or agent categories for analysis
|
||||
- Define time windows for performance analysis
|
||||
- Specify success criteria for optimization efforts
|
||||
- Clarify whether analysis should be broad ecosystem or targeted
|
||||
|
||||
### Implementation Approach
|
||||
- Request prioritized recommendations based on impact vs. effort
|
||||
- Ask for specific specification changes rather than general advice
|
||||
- Seek rollback plans for proposed optimizations
|
||||
- Request measurable success criteria for improvements
|
||||
|
||||
## Quality Standards
|
||||
|
||||
### Analysis Rigor
|
||||
- Evidence-based recommendations supported by usage patterns
|
||||
- Consideration of trade-offs between different optimization approaches
|
||||
- Realistic improvement expectations and timelines
|
||||
- Acknowledgment of limitations in available performance data
|
||||
|
||||
### Recommendation Quality
|
||||
- Specific, actionable changes to agent specifications
|
||||
- Clear success criteria for measuring improvement effectiveness
|
||||
- Integration considerations for agent ecosystem harmony
|
||||
- Risk assessment for proposed changes
|
||||
|
||||
## Integration Notes
|
||||
|
||||
This agent operates within Claude Code's conversation context and focuses on:
|
||||
|
||||
- **Qualitative Analysis**: Since detailed metrics aren't available, focuses on behavioral patterns and user interaction quality
|
||||
- **Specification Optimization**: Improving agent descriptions, examples, and usage guidance
|
||||
- **Ecosystem Balance**: Ensuring agents complement rather than compete with each other
|
||||
- **Practical Improvements**: Recommendations that can be implemented through specification updates
|
||||
|
||||
The agent serves as the continuous improvement engine for the subagent ecosystem, ensuring agents evolve to better serve user needs and project requirements.
|
||||
@@ -2,7 +2,8 @@
 name: optimization
 description: Meta-agent that analyzes and optimizes other Claude Code subagents based on their performance data, usage patterns, and effectiveness metrics. Use PROACTIVELY for agent ecosystem improvement.
 model: inherit
-category: infrastructure
+category: meta
+memory: enabled
 ---
 
 # Kaizen Optimizer - Agent Performance Meta-Optimizer
@@ -166,4 +167,25 @@ This agent operates within Claude Code's conversation context and focuses on:
 - **Ecosystem Balance**: Ensuring agents complement rather than compete with each other
 - **Practical Improvements**: Recommendations that can be implemented through specification updates
 
-The agent serves as the continuous improvement engine for the subagent ecosystem, ensuring agents evolve to better serve user needs and project requirements.
+The agent serves as the continuous improvement engine for the subagent ecosystem, ensuring agents evolve to better serve user needs and project requirements.
+
+## Session Start
+
+1. Check for `.kaizen/agents/optimization/memory.md` in the project root.
+2. If present, read it before beginning analysis.
+3. Review `.kaizen/metrics/optimizer/analysis.json` if it exists for the latest fleet report.
+
+## Session Close
+
+1. When analysis completes, note key findings in `## Accumulated Findings`.
+2. Append one line to `## Session Log`: `YYYY-MM-DD · <agents reviewed> · <outcome>`.
+3. Bump `last_updated` and increment `session_count`.
+4. Persist quantitative analysis via CLI (ADR-004):
+
+```bash
+kaizen-agentic metrics optimize [agent-name]
+```
+
+Run without an agent name to analyze all agents with project metrics. Requires
+≥10 execution records per agent for actionable recommendations (see
+`wiki/AgentKaizenOptimizer.md`).
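The Session Start steps in this hunk amount to two optional reads. A minimal sketch, assuming only the two paths named in those steps; the helper name `load_session_context` and the keys of the returned dictionary are hypothetical and not part of the commit:

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_session_context(project_root: Path) -> Dict[str, Optional[Any]]:
    """Read the optimization agent's memory and latest fleet report, if present.

    Hypothetical helper: only the two relative paths come from the agent file.
    """
    memory_path = project_root / ".kaizen" / "agents" / "optimization" / "memory.md"
    analysis_path = project_root / ".kaizen" / "metrics" / "optimizer" / "analysis.json"

    # Both files are optional, so a missing file yields None rather than an error.
    memory = memory_path.read_text(encoding="utf-8") if memory_path.is_file() else None
    analysis = (
        json.loads(analysis_path.read_text(encoding="utf-8"))
        if analysis_path.is_file()
        else None
    )
    return {"memory": memory, "analysis": analysis}
```

Returning `None` for a missing file keeps a first session (no memory, no analysis yet) on the same code path as later ones.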
@@ -61,6 +61,8 @@ echo '{"success": true, "quality_score": 1.0}' | kaizen-agentic metrics record t
 kaizen-agentic metrics show tdd-workflow
 kaizen-agentic metrics list
 kaizen-agentic metrics export tdd-workflow
+kaizen-agentic metrics optimize tdd-workflow # analyze one agent (≥10 records)
+kaizen-agentic metrics optimize # analyze all agents with metrics
 
 # Scaffold memory + metrics together
 kaizen-agentic memory init tdd-workflow
@@ -259,7 +259,7 @@ kaizen-agentic metrics record <agent> # Append execution record at session clo
 kaizen-agentic metrics show <agent> # Summary + recent executions
 kaizen-agentic metrics list # Agents with metrics in project
 kaizen-agentic metrics export <agent> # Dump executions.jsonl
-kaizen-agentic metrics optimize [agent] # Run optimizer on project metrics
+kaizen-agentic metrics optimize [agent] # Run optimizer on project metrics (≥10 records)
 ```
 
 `memory brief` includes a `## Performance Summary` when metrics exist (WP-0003
@@ -11,7 +11,8 @@ from typing import List, Optional
 
 from .registry import AgentRegistry, AgentCategory
 from .installer import AgentInstaller, ProjectInitializer, InstallationConfig
-from .metrics import MetricsStore
+from .metrics import MetricsStore, OptimizerStore
+from .optimization import OptimizationLoop, MIN_SAMPLES_FOR_RECOMMENDATIONS
 
 
 def safe_cli_wrapper():
@@ -1039,6 +1040,63 @@ def metrics_list(target: str):
         click.echo(f" • {name} ({count} executions)")
 
 
+@metrics.command("optimize")
+@click.argument("agent_name", required=False)
+@click.option("--target", "-t", default=".", help="Project root (default: current)")
+@click.option(
+    "--min-samples",
+    default=MIN_SAMPLES_FOR_RECOMMENDATIONS,
+    show_default=True,
+    help="Minimum execution records required for recommendations",
+)
+def metrics_optimize(agent_name: Optional[str], target: str, min_samples: int):
+    """Run optimizer analysis on project metrics and write recommendations."""
+    project_root = _project_root(target)
+    agents = [agent_name] if agent_name else MetricsStore.list_agents(project_root)
+
+    if not agents:
+        click.echo("No agent metrics found to optimize.")
+        click.echo(" Record executions with: kaizen-agentic metrics record <agent> --success")
+        return
+
+    optimizer_store = OptimizerStore(project_root)
+    combined_reports = []
+
+    for name in agents:
+        store = MetricsStore(project_root, name)
+        records = store.read_executions()
+        loop = OptimizationLoop.from_metrics_store(store, min_samples=1)
+        report = loop.get_optimization_report_json()
+        report["sample_threshold"] = min_samples
+        report["meets_sample_threshold"] = len(records) >= min_samples
+        combined_reports.append(report)
+
+        click.echo(f"Agent: {name}")
+        click.echo("=" * 40)
+        click.echo(json.dumps(report, indent=2))
+
+        if len(records) >= min_samples:
+            optimizer_store.append_recommendations(
+                name,
+                report["recommendations"],
+                metrics_count=len(records),
+            )
+        else:
+            click.echo(
+                f" Note: {len(records)} record(s) — need {min_samples} for actionable recommendations"
+            )
+        click.echo()
+
+    analysis_payload = {
+        "project": project_root.name,
+        "optimized_at": _today(),
+        "min_samples": min_samples,
+        "agents": combined_reports,
+    }
+    analysis_path = optimizer_store.write_analysis(analysis_payload)
+    click.echo(f"Wrote optimizer analysis: {analysis_path}")
+
+
 @metrics.command("export")
 @click.argument("agent_name")
 @click.option("--target", "-t", default=".", help="Project root (default: current)")
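The command writes one combined payload whose `agents` list carries a `meets_sample_threshold` flag per report. The sketch below shows how a consumer might pick out the agents that still need more data, assuming the payload shape built in `metrics_optimize`. The function name `reports_below_threshold` and every sample value are made up for illustration.

```python
from typing import Any, Dict, List


def reports_below_threshold(payload: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return per-agent reports that did not reach the sample threshold."""
    return [
        report
        for report in payload.get("agents", [])
        if not report.get("meets_sample_threshold", False)
    ]


# Hand-written sample that mirrors the keys set in metrics_optimize.
# The values are invented; a real payload also carries the full report fields.
sample_payload = {
    "project": "demo-project",
    "min_samples": 10,
    "agents": [
        {"metrics_count": 12, "sample_threshold": 10, "meets_sample_threshold": True},
        {"metrics_count": 3, "sample_threshold": 10, "meets_sample_threshold": False},
    ],
}
pending = reports_below_threshold(sample_payload)
```

Because the threshold is recorded next to each report, a reader of `analysis.json` does not need to know which `--min-samples` value was passed on the command line.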
@@ -1,168 +0,0 @@
|
||||
---
|
||||
name: agent-optimizer
|
||||
description: Meta-agent that analyzes and optimizes other Claude Code subagents based on their performance data, usage patterns, and effectiveness metrics. Use PROACTIVELY for agent ecosystem improvement.
|
||||
model: inherit
|
||||
---
|
||||
|
||||
# Kaizen Optimizer - Agent Performance Meta-Optimizer
|
||||
|
||||
## Purpose
|
||||
|
||||
Meta-agent that analyzes and optimizes other Claude Code subagents based on their performance data, usage patterns, and effectiveness metrics. Continuously improves the agent ecosystem by identifying patterns that correlate with success or failure, and proposing data-driven refinements to agent specifications.
|
||||
|
||||
## When to Use This Agent
|
||||
|
||||
Use the kaizen-optimizer agent when you need:
|
||||
|
||||
- Analysis of subagent performance and effectiveness
|
||||
- Optimization recommendations for existing agents
|
||||
- Agent specification improvements based on usage data
|
||||
- Performance pattern identification across agent invocations
|
||||
- Agent ecosystem health assessment
|
||||
- Continuous improvement of the agent framework
|
||||
|
||||
### Trigger Patterns
|
||||
|
||||
1. **Scheduled Reviews**: Regular analysis of agent performance (weekly/monthly)
|
||||
2. **Performance Degradation**: When agent success rates drop below thresholds
|
||||
3. **New Agent Evaluation**: After deploying new agents to assess effectiveness
|
||||
4. **Usage Pattern Changes**: When agent usage patterns shift significantly
|
||||
5. **Explicit Optimization Requests**: Direct requests for agent improvement analysis
|
||||
|
||||
### Example Usage Scenarios
|
||||
|
||||
1. **Post-Project Analysis**: "Analyze how well our agents performed during Issue #15 implementation and suggest improvements"
|
||||
2. **Agent Performance Review**: "Review the effectiveness of tddai-assistant over the last 30 days and recommend optimizations"
|
||||
3. **Ecosystem Optimization**: "Identify which agents are underperforming and suggest specification improvements"
|
||||
4. **Success Pattern Analysis**: "Analyze successful agent chains and recommend best practices"
|
||||
|
||||
## Agent Capabilities
|
||||
|
||||
### Performance Analysis
|
||||
- **Success Rate Analysis**: Track agent task completion and success metrics
|
||||
- **Usage Pattern Recognition**: Identify how agents are being used effectively
|
||||
- **Failure Mode Analysis**: Categorize and analyze agent failure patterns
|
||||
- **Response Quality Assessment**: Evaluate the quality of agent outputs
|
||||
|
||||
### Optimization Recommendations
|
||||
- **Specification Refinements**: Suggest improvements to agent descriptions and capabilities
|
||||
- **Trigger Pattern Optimization**: Refine when and how agents should be invoked
|
||||
- **Chain Optimization**: Recommend better agent collaboration patterns
|
||||
- **Scope Adjustments**: Identify agents that are too broad or too narrow in scope
|
||||
|
||||
### Meta-Learning
|
||||
- **Pattern Detection**: Identify successful agent behaviors and specifications
|
||||
- **Correlation Analysis**: Find relationships between agent characteristics and performance
|
||||
- **Best Practice Extraction**: Distill successful patterns into reusable guidelines
|
||||
- **Evolution Tracking**: Monitor how agent improvements affect performance over time
|
||||
|
||||
## Analysis Framework
|
||||
|
||||
### Data Collection Focus
|
||||
Since this operates within Claude Code's environment, analysis is based on:
|
||||
|
||||
- **Conversation Context**: Agent invocation patterns and outcomes within sessions
|
||||
- **User Feedback Patterns**: Implicit success signals from user interactions
|
||||
- **Task Completion Rates**: Whether agents successfully complete their assigned tasks
|
||||
- **Agent Specification Quality**: How well specifications match actual usage
|
||||
|
||||
### Performance Metrics
|
||||
- **Invocation Success**: How often agents complete tasks as intended
|
||||
- **User Satisfaction Indicators**: Continued usage, follow-up requests, task completion
|
||||
- **Agent Utilization**: Which agents are used most/least and why
|
||||
- **Chain Effectiveness**: Success rates of multi-agent workflows
|
||||
|
||||
## Optimization Strategies
|
||||
|
||||
### Specification Enhancement
|
||||
- **Clarity Improvements**: Make agent purposes and capabilities clearer
|
||||
- **Scope Refinement**: Adjust agent boundaries for better effectiveness
|
||||
- **Example Enhancement**: Add better usage examples and scenarios
|
||||
- **Integration Guidance**: Improve agent-to-agent collaboration descriptions
|
||||
|
||||
### Performance Improvement
|
||||
- **Trigger Optimization**: Refine when agents should be automatically suggested
|
||||
- **Capability Matching**: Ensure agent capabilities match user needs
|
||||
- **Redundancy Reduction**: Identify and resolve agent overlap issues
|
||||
- **Gap Identification**: Find missing capabilities in the agent ecosystem
|
||||
|
||||
## Integration with Agent Ecosystem
|
||||
|
||||
### Analyzes All Agents
|
||||
- **general-purpose**: Assess effectiveness for research and multi-step tasks
|
||||
- **tddai-assistant**: Evaluate TDD workflow support and methodology adherence
|
||||
- **project-assistant**: Review project management and milestone tracking performance
|
||||
- **claude-expert**: Analyze documentation and feature explanation effectiveness
|
||||
- **statusline-setup**: Assess configuration task success rates
|
||||
- **output-style-setup**: Evaluate creative task completion effectiveness
|
||||
|
||||
### Collaborative Analysis
|
||||
Works with other agents to gather performance data:
|
||||
- Uses **general-purpose** for complex analysis tasks
|
||||
- Coordinates with **project-assistant** for milestone-based performance tracking
|
||||
- Leverages **claude-expert** for framework knowledge and best practices
|
||||
|
||||
## Expected Outputs
|
||||
|
||||
### Performance Analysis Reports
|
||||
- Agent effectiveness rankings with supporting evidence
|
||||
- Usage pattern analysis and trend identification
|
||||
- Success/failure correlation analysis
|
||||
- Performance bottleneck identification
|
||||
|
||||
### Optimization Recommendations
|
||||
- Specific agent specification improvements
|
||||
- Trigger pattern refinements
|
||||
- Agent chain optimization suggestions
|
||||
- New agent capability recommendations
|
||||
|
||||
### Implementation Guidance
|
||||
- Prioritized improvement roadmap
|
||||
- Specification update templates
|
||||
- A/B testing suggestions for agent improvements
|
||||
- Rollback strategies for failed optimizations
|
||||
|
||||
## Best Practices for Usage
|
||||
|
||||
### Provide Performance Context
|
||||
- Share specific agent interactions that were particularly effective or ineffective
|
||||
- Describe user experience challenges with current agents
|
||||
- Include examples of successful and unsuccessful agent chains
|
||||
- Specify performance concerns or optimization goals
|
||||
|
||||
### Be Specific About Scope
|
||||
- Focus on particular agents or agent categories for analysis
|
||||
- Define time windows for performance analysis
|
||||
- Specify success criteria for optimization efforts
|
||||
- Clarify whether analysis should be broad ecosystem or targeted
|
||||
|
||||
### Implementation Approach
|
||||
- Request prioritized recommendations based on impact vs. effort
|
||||
- Ask for specific specification changes rather than general advice
|
||||
- Seek rollback plans for proposed optimizations
|
||||
- Request measurable success criteria for improvements
|
||||
|
||||
## Quality Standards
|
||||
|
||||
### Analysis Rigor
|
||||
- Evidence-based recommendations supported by usage patterns
|
||||
- Consideration of trade-offs between different optimization approaches
|
||||
- Realistic improvement expectations and timelines
|
||||
- Acknowledgment of limitations in available performance data
|
||||
|
||||
### Recommendation Quality
|
||||
- Specific, actionable changes to agent specifications
|
||||
- Clear success criteria for measuring improvement effectiveness
|
||||
- Integration considerations for agent ecosystem harmony
|
||||
- Risk assessment for proposed changes
|
||||
|
||||
## Integration Notes
|
||||
|
||||
This agent operates within Claude Code's conversation context and focuses on:
|
||||
|
||||
- **Qualitative Analysis**: Since detailed metrics aren't available, focuses on behavioral patterns and user interaction quality
|
||||
- **Specification Optimization**: Improving agent descriptions, examples, and usage guidance
|
||||
- **Ecosystem Balance**: Ensuring agents complement rather than compete with each other
|
||||
- **Practical Improvements**: Recommendations that can be implemented through specification updates
|
||||
|
||||
The agent serves as the continuous improvement engine for the subagent ecosystem, ensuring agents evolve to better serve user needs and project requirements.
|
||||
@@ -205,4 +205,43 @@ class MetricsStore:
         return removed
 
     def _has_idempotency_key(self, key: str) -> bool:
-        return any(r.get("idempotency_key") == key for r in self.read_executions())
+        return any(r.get("idempotency_key") == key for r in self.read_executions())
+
+
+@dataclass
+class OptimizerStore:
+    """Persist optimizer analysis output under .kaizen/metrics/optimizer/."""
+
+    project_root: Path
+
+    def __post_init__(self) -> None:
+        self.project_root = Path(self.project_root).resolve()
+        self.optimizer_dir = self.project_root / ".kaizen" / "metrics" / "optimizer"
+        self.analysis_path = self.optimizer_dir / "analysis.json"
+        self.recommendations_path = self.optimizer_dir / "recommendations.jsonl"
+
+    def write_analysis(self, report: Dict[str, Any]) -> Path:
+        self.optimizer_dir.mkdir(parents=True, exist_ok=True)
+        self.analysis_path.write_text(
+            json.dumps(report, indent=2, sort_keys=True) + "\n",
+            encoding="utf-8",
+        )
+        return self.analysis_path
+
+    def append_recommendations(
+        self,
+        agent_name: str,
+        recommendations: List[Dict[str, Any]],
+        *,
+        metrics_count: int,
+    ) -> None:
+        self.optimizer_dir.mkdir(parents=True, exist_ok=True)
+        entry = {
+            "timestamp": _utc_now_iso(),
+            "agent": agent_name,
+            "metrics_count": metrics_count,
+            "recommendations": recommendations,
+        }
+        with self.recommendations_path.open("a", encoding="utf-8") as handle:
+            handle.write(json.dumps(entry, sort_keys=True))
+            handle.write("\n")
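`append_recommendations` only ever appends, so `recommendations.jsonl` accumulates one entry per agent per run. The commit adds no reader for that file. The sketch below is one possible way to recover the newest entry per agent, assuming the entry keys shown in the hunk; `latest_recommendations` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Any, Dict


def latest_recommendations(path: Path) -> Dict[str, Dict[str, Any]]:
    """Return the most recently appended entry per agent from a JSONL file."""
    latest: Dict[str, Dict[str, Any]] = {}
    if not path.is_file():
        return latest
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            entry = json.loads(line)
            # Lines are in append order, so a later line replaces an earlier one.
            latest[entry["agent"]] = entry
    return latest
```

This relies on file order rather than on the `timestamp` field, which is enough for a file that is only ever appended to.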
@@ -5,11 +5,16 @@ This module implements the kaizen loop for measuring, analyzing, and refining
 agent performance over time.
 """
 
-from typing import Dict, Any, List, Optional
+from typing import TYPE_CHECKING, Any, Dict, List, Optional
 from dataclasses import dataclass
 from datetime import datetime
 import statistics
 
+if TYPE_CHECKING:
+    from .metrics import MetricsStore
+
+MIN_SAMPLES_FOR_RECOMMENDATIONS = 10
+
 
 @dataclass
 class PerformanceMetrics:
@@ -35,6 +40,60 @@ class OptimizationLoop:
         self.metrics_history: List[PerformanceMetrics] = []
         self.optimization_history: List[Dict[str, Any]] = []
 
+    @classmethod
+    def from_metrics_store(
+        cls,
+        store: "MetricsStore",
+        *,
+        min_samples: int = 1,
+    ) -> "OptimizationLoop":
+        """Build an optimization loop from project-scoped execution records."""
+        loop = cls(store.agent_name)
+        records = store.read_executions()
+        if len(records) < min_samples:
+            return loop
+        for record in records:
+            loop.record_metrics(cls._metrics_from_record(record))
+        return loop
+
+    @staticmethod
+    def _metrics_from_record(record: Dict[str, Any]) -> PerformanceMetrics:
+        timestamp_raw = record.get("timestamp")
+        try:
+            timestamp = datetime.fromisoformat(
+                str(timestamp_raw).replace("Z", "+00:00")
+            )
+        except (TypeError, ValueError):
+            timestamp = datetime.now()
+
+        success = bool(record.get("success", False))
+        quality = record.get("quality_score")
+        if quality is None:
+            quality = 1.0 if success else 0.0
+
+        metadata = {
+            k: v
+            for k, v in record.items()
+            if k
+            not in {
+                "timestamp",
+                "agent",
+                "success",
+                "execution_time_s",
+                "quality_score",
+                "primary_metric",
+            }
+        }
+
+        return PerformanceMetrics(
+            timestamp=timestamp,
+            execution_time=float(record.get("execution_time_s") or 0.0),
+            success_rate=1.0 if success else 0.0,
+            quality_score=float(quality),
+            resource_usage={},
+            metadata=metadata or None,
+        )
+
     def record_metrics(self, metrics: PerformanceMetrics) -> None:
         """Record performance metrics for analysis."""
         self.metrics_history.append(metrics)
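The timestamp handling in `_metrics_from_record` is worth isolating, because it yields two different kinds of `datetime`. The sketch below is an illustrative copy of that step, not additional code from the commit; `parse_timestamp` is a hypothetical name.

```python
from datetime import datetime
from typing import Any


def parse_timestamp(raw: Any) -> datetime:
    """Illustrative copy of the parsing step in _metrics_from_record."""
    try:
        # Before Python 3.11, fromisoformat() rejects a trailing "Z",
        # so it is rewritten as an explicit UTC offset first.
        return datetime.fromisoformat(str(raw).replace("Z", "+00:00"))
    except (TypeError, ValueError):
        # Missing or malformed input falls back to the local clock,
        # which produces a naive datetime (tzinfo is None).
        return datetime.now()


aware = parse_timestamp("2024-05-01T12:00:00Z")  # offset-aware, UTC
fallback = parse_timestamp(None)                 # offset-naive
```

Python raises `TypeError` when an offset-aware and an offset-naive `datetime` are compared or subtracted. Whether `OptimizationLoop` ever compares timestamps is not visible in this diff, so this is a point to check, not a confirmed defect.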
@@ -160,3 +219,17 @@ class OptimizationLoop:
             "metrics_count": len(self.metrics_history),
             "optimization_cycles": len(self.optimization_history),
         }
+
+    def get_optimization_report_json(self) -> Dict[str, Any]:
+        """JSON-serializable optimization report."""
+        return _to_json_safe(self.get_optimization_report())
+
+
+def _to_json_safe(value: Any) -> Any:
+    if isinstance(value, datetime):
+        return value.isoformat()
+    if isinstance(value, dict):
+        return {k: _to_json_safe(v) for k, v in value.items()}
+    if isinstance(value, list):
+        return [_to_json_safe(item) for item in value]
+    return value
133
tests/test_optimization_metrics.py
Normal file
@@ -0,0 +1,133 @@
+"""Tests for OptimizationLoop integration with MetricsStore."""
+
+from __future__ import annotations
+
+from pathlib import Path
+
+import pytest
+from click.testing import CliRunner
+
+from kaizen_agentic.cli import cli
+from kaizen_agentic.metrics import MetricsStore, OptimizerStore
+from kaizen_agentic.optimization import (
+    MIN_SAMPLES_FOR_RECOMMENDATIONS,
+    OptimizationLoop,
+)
+
+
+def _seed_executions(
+    store: MetricsStore,
+    count: int,
+    *,
+    success: bool = True,
+    execution_time_s: float = 5.0,
+    quality_score: float = 0.9,
+) -> None:
+    for i in range(count):
+        store.append(
+            {
+                "success": success,
+                "execution_time_s": execution_time_s + i,
+                "quality_score": quality_score,
+            },
+            idempotency_key=f"run-{i}",
+        )
+
+
+@pytest.fixture
+def project_dir(tmp_path: Path) -> Path:
+    root = tmp_path / "demo-project"
+    root.mkdir()
+    return root
+
+
+class TestOptimizationFromMetricsStore:
+    def test_from_metrics_store_loads_execution_records(self, project_dir: Path):
+        store = MetricsStore(project_dir, "tdd-workflow")
+        _seed_executions(store, 3)
+
+        loop = OptimizationLoop.from_metrics_store(store)
+
+        assert len(loop.metrics_history) == 3
+        assert loop.metrics_history[0].success_rate == 1.0
+
+    def test_insufficient_data_recommendations(self, project_dir: Path):
+        store = MetricsStore(project_dir, "tdd-workflow")
+        loop = OptimizationLoop.from_metrics_store(store)
+
+        recommendations = loop.generate_improvement_recommendations()
+
+        assert recommendations[0]["type"] == "info"
+        assert "Insufficient data" in recommendations[0]["message"]
+
+    def test_sufficient_data_produces_performance_recommendations(
+        self, project_dir: Path
+    ):
+        store = MetricsStore(project_dir, "tdd-workflow")
+        _seed_executions(
+            store,
+            MIN_SAMPLES_FOR_RECOMMENDATIONS,
+            success=False,
+            execution_time_s=60.0,
+            quality_score=0.4,
+        )
+
+        loop = OptimizationLoop.from_metrics_store(store)
+        recommendations = loop.generate_improvement_recommendations()
+        types = {item["type"] for item in recommendations}
+
+        assert "info" not in types
+        assert "reliability" in types or "quality" in types or "performance" in types
+
+    def test_get_optimization_report_json_is_serializable(self, project_dir: Path):
+        import json
+
+        store = MetricsStore(project_dir, "coach")
+        _seed_executions(store, 4)
+
+        report = OptimizationLoop.from_metrics_store(store).get_optimization_report_json()
|
||||
json.dumps(report)
|
||||
|
||||
|
||||
class TestMetricsOptimizeCli:
|
||||
def test_optimize_insufficient_samples_writes_analysis_only(
|
||||
self, project_dir: Path
|
||||
):
|
||||
store = MetricsStore(project_dir, "tdd-workflow")
|
||||
_seed_executions(store, 2)
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(
|
||||
cli,
|
||||
["metrics", "optimize", "tdd-workflow", "--target", str(project_dir)],
|
||||
)
|
||||
|
||||
assert result.exit_code == 0
|
||||
assert "need 10" in result.output
|
||||
optimizer = OptimizerStore(project_dir)
|
||||
assert optimizer.analysis_path.exists()
|
||||
assert not optimizer.recommendations_path.exists()
|
||||
|
||||
def test_optimize_sufficient_samples_writes_recommendations(
|
||||
self, project_dir: Path
|
||||
):
|
||||
store = MetricsStore(project_dir, "tdd-workflow")
|
||||
_seed_executions(
|
||||
store,
|
||||
MIN_SAMPLES_FOR_RECOMMENDATIONS,
|
||||
success=False,
|
||||
execution_time_s=60.0,
|
||||
quality_score=0.4,
|
||||
)
|
||||
|
||||
runner = CliRunner()
|
||||
result = runner.invoke(
|
||||
cli,
|
||||
["metrics", "optimize", "tdd-workflow", "--target", str(project_dir)],
|
||||
)
|
||||
|
||||
assert result.exit_code == 0
|
||||
optimizer = OptimizerStore(project_dir)
|
||||
assert optimizer.analysis_path.exists()
|
||||
assert optimizer.recommendations_path.exists()
|
||||
assert '"type": "reliability"' in result.output or '"type": "quality"' in result.output
|
||||
@@ -140,11 +140,11 @@ Connect the existing Python optimization infrastructure to real project data.
|
||||
|
||||
### Tasks
|
||||
|
||||
- [ ] T09 — Add `OptimizationLoop.from_metrics_store(store)` factory that loads `PerformanceMetrics` from executions
|
||||
- [ ] T10 — Implement `kaizen-agentic metrics optimize [agent]` — run analysis, print recommendations, write `optimizer/analysis.json`
|
||||
- [ ] T11 — Consolidate `agent-optimization.md` and `agent-agent-optimization.md` into single canonical `optimization` agent; update registry
|
||||
- [ ] T12 — Update `agent-optimization.md` session protocol to invoke `metrics optimize` and reference ADR-004
|
||||
- [ ] T13 — Unit + integration tests: synthetic executions → recommendations → non-empty output
|
||||
- [x] T09 — Add `OptimizationLoop.from_metrics_store(store)` factory that loads `PerformanceMetrics` from executions
|
||||
- [x] T10 — Implement `kaizen-agentic metrics optimize [agent]` — run analysis, print recommendations, write `optimizer/analysis.json`
|
||||
- [x] T11 — Consolidate `agent-optimization.md` and `agent-agent-optimization.md` into single canonical `optimization` agent; update registry
|
||||
- [x] T12 — Update `agent-optimization.md` session protocol to invoke `metrics optimize` and reference ADR-004
|
||||
- [x] T13 — Unit + integration tests: synthetic executions → recommendations → non-empty output
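
As a rough illustration of T09 and T13, the sketch below maps raw execution records onto metrics objects in the way the diff suggests. It is not the project's implementation: `SketchMetrics`, `metrics_from_record`, and `load_history` are hypothetical names, the fallback used when `quality_score` is missing is an assumption, and so is the rule that any extra keys are collected into `metadata`. Only the field names and the `success` to `success_rate` conversion follow the code shown in this commit.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Keys that map directly onto metric fields; everything else is treated
# as metadata (an assumption made for this sketch).
_KNOWN_KEYS = {"success", "execution_time_s", "quality_score"}


@dataclass
class SketchMetrics:
    """Hypothetical stand-in for PerformanceMetrics."""

    execution_time: float
    success_rate: float
    quality_score: float
    resource_usage: Dict[str, Any] = field(default_factory=dict)
    metadata: Optional[Dict[str, Any]] = None


def metrics_from_record(record: Dict[str, Any]) -> SketchMetrics:
    """Convert one execution record into a metrics object."""
    success = bool(record.get("success"))
    quality = record.get("quality_score")
    if quality is None:
        # Assumed fallback: score a run by its outcome when no
        # explicit quality score was recorded.
        quality = 1.0 if success else 0.0
    metadata = {k: v for k, v in record.items() if k not in _KNOWN_KEYS}
    return SketchMetrics(
        execution_time=float(record.get("execution_time_s") or 0.0),
        success_rate=1.0 if success else 0.0,
        quality_score=float(quality),
        resource_usage={},
        metadata=metadata or None,
    )


def load_history(records: List[Dict[str, Any]]) -> List[SketchMetrics]:
    """Build a metrics history from a list of execution records."""
    return [metrics_from_record(record) for record in records]


history = load_history(
    [
        {"success": True, "execution_time_s": 5.0, "quality_score": 0.9},
        {"success": False, "agent": "tdd-workflow"},
    ]
)
```

A single boolean `success` becomes a per-run `success_rate` of `1.0` or `0.0`, so averaging over the history yields the overall success rate without a separate counter.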
|
||||
|
||||
### Definition of done
|
||||
|
||||
|
||||