session-memory: Phase 4 Measure — baseline, effectiveness, trend (WP-0009)

Closes the loop.

metrics.py: fleet metrics (infra-overhead share, error rate, schema-thrash, token percentiles, success rate) + persisted baseline trend.
effect.py: before/after per-pattern effectiveness with an 'improved' verdict per metric.
measure entrypoint: trend + --since effectiveness + JSON output.

Recorded pre-fix baseline: 27 sessions, overhead median 11.7%, error rate 0.96, schema-thrash 8.
13 new tests; suite 139/139. Capture->Detect->Curate->Distribute->Measure complete.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 15:49:22 +02:00
parent 035c7a20d3
commit 4f28cd67cf
11 changed files with 497 additions and 5 deletions
--- a/session_memory/README.md
+++ b/session_memory/README.md
@@ -39,6 +39,9 @@ session_memory/
  distribute/grok.py   # native instruction renderer       }  different targets)
  distribute/proposals.py  # scoping + proposed-not-applied output + active registry
  distribute/__main__.py   # python -m session_memory.distribute
+  measure/metrics.py   # fleet metrics + persisted baseline snapshots
+  measure/effect.py    # before/after per-pattern effectiveness
+  measure/__main__.py  # python -m session_memory.measure
  config.toml          # store paths, retention caps, sources, repo->domain map, curate gate
 ```

@@ -141,6 +144,25 @@ python -m session_memory.distribute --json
  `distribute/active_patterns.json` records which pattern+version is proposed in
  which `(repo, flavor)` (FR-X4).

+## Measure effectiveness (closing the loop)
+
+Track whether the fleet is getting cheaper / more reliable, and whether a
+distributed pattern actually helped.
+
+```bash
+python -m session_memory.measure --label "baseline"      # snapshot + trend
+python -m session_memory.measure --since 2026-06-07      # before/after a change
+python -m session_memory.measure --no-save --json
+```
+
+- A **snapshot** (infra-overhead share, error rate, schema-thrash, token
+  percentiles, success rate) is appended to `measure/baselines.jsonl` to build a
+  trend (FR-M3).
+- `--since DATE` splits sessions before/after a change and diffs the metrics, with
+  an `improved` verdict per metric (FR-M1/FR-M2) — so ineffective patterns can be
+  retired. Recorded pre-fix baseline (2026-06-07): 27 sessions, infra-overhead
+  median 11.7 %, error rate 0.96, schema-thrash 8 sessions.
+
 ## Retention knobs (`[retention]` in config.toml)

 | Key | Meaning |
@@ -174,4 +196,6 @@ python -m pytest          # schema, adapters, store, digest, retention, ingest,
 - **Phase 3** (AGENTIC-WP-0007): Distribute — per-flavor distributor adapters
  render approved patterns into proposed (HITL) artifacts, scoped by repo/domain,
  with an active-pattern registry.
- **Next — Phase 4 (Measure)** closes the loop per the PRD.
+- **Phase 4** (AGENTIC-WP-0009): Measure — fleet baseline/trend + before/after
+  per-pattern effectiveness. The Capture → Detect → Curate → Distribute → Measure
+  loop is closed.
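
Illustration only, not part of the diff: the `--since` before/after comparison described in the README hunk could be sketched roughly as follows. This is not the code in `effect.py`. The `Session` fields, the `summarize` and `effectiveness` helpers, the sample data, and the rule that lower is better for both metrics are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass
class Session:
    # Hypothetical session record; the real store schema is not shown in this diff.
    day: date
    overhead_share: float  # fraction of the session spent on infra overhead
    errored: bool


def summarize(sessions):
    """Reduce a list of sessions to two metrics, or None if the list is empty."""
    if not sessions:
        return None
    return {
        "overhead_median": median(s.overhead_share for s in sessions),
        "error_rate": sum(s.errored for s in sessions) / len(sessions),
    }


def effectiveness(sessions, since):
    """Split sessions at `since` and diff the metrics of the two halves.

    Assumes lower is better for every metric, so `improved` means the
    value after the change is strictly smaller than the value before it.
    Returns an empty dict when either side has no sessions to compare.
    """
    before = summarize([s for s in sessions if s.day < since])
    after = summarize([s for s in sessions if s.day >= since])
    if before is None or after is None:
        return {}
    return {
        name: {
            "before": before[name],
            "after": after[name],
            "improved": after[name] < before[name],
        }
        for name in before
    }


# Made-up sample data, unrelated to the recorded baseline.
SESSIONS = [
    Session(date(2026, 6, 5), 0.25, True),
    Session(date(2026, 6, 6), 0.5, True),
    Session(date(2026, 6, 6), 0.125, False),
    Session(date(2026, 6, 7), 0.125, False),
    Session(date(2026, 6, 8), 0.25, False),
]

report = effectiveness(SESSIONS, date(2026, 6, 7))
for metric, row in report.items():
    print(metric, row)
```

Returning an empty result when one side has no sessions avoids reporting a verdict that no data supports, which matters right after a pattern is first distributed.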