session-memory: Phase 4 Measure — baseline, effectiveness, trend (WP-0009)

Closes the loop. metrics.py: fleet metrics (infra-overhead share, error rate, schema-thrash, token percentiles, success) + persisted baseline trend. effect.py: before/after per-pattern effectiveness with an improved verdict per metric. measure entrypoint with trend + --since effectiveness + JSON. Recorded pre-fix baseline: 27 sessions, overhead median 11.7%, error rate 0.96, schema-thrash 8. 13 new tests; suite 139/139. Capture->Detect->Curate->Distribute->Measure complete. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 15:49:22 +02:00
parent 035c7a20d3
commit 4f28cd67cf
11 changed files with 497 additions and 5 deletions
--- a/session_memory/config.toml
+++ b/session_memory/config.toml
@@ -39,6 +39,10 @@ min_substantive = 3    # require >= this many substantive (edit/read/shell) tool
 min_prompt_len  = 25   # first prompt shorter than this is treated as trivial

 # Curate phase (AGENTIC-WP-0004): catalog location + promotion evidence bar.
+# Measure phase (AGENTIC-WP-0009): persisted baseline/trend of fleet metrics.
+[measure]
+baselines = "session_memory/measure/baselines.jsonl"  # timestamped metric snapshots (committed)
+
 # Distribute phase (AGENTIC-WP-0007): where per-flavor proposals + the active
 # registry are written. Proposals are HITL — reviewed, never auto-applied.
 [distribute]