generated from coulomb/repo-seed
session-memory: Phase 4 Measure — baseline, effectiveness, trend (WP-0009)
Closes the loop.

metrics.py: fleet metrics (infra-overhead share, error rate, schema-thrash, token percentiles, success) + persisted baseline trend.
effect.py: before/after per-pattern effectiveness with an improved verdict per metric.
measure entrypoint with trend + --since effectiveness + JSON.

Recorded pre-fix baseline: 27 sessions, overhead median 11.7%, error rate 0.96, schema-thrash 8.
13 new tests; suite 139/139. Capture->Detect->Curate->Distribute->Measure complete.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@@ -39,6 +39,9 @@ session_memory/
distribute/grok.py # native instruction renderer } different targets)
distribute/proposals.py # scoping + proposed-not-applied output + active registry
distribute/__main__.py # python -m session_memory.distribute
measure/metrics.py # fleet metrics + persisted baseline snapshots
measure/effect.py # before/after per-pattern effectiveness
measure/__main__.py # python -m session_memory.measure
config.toml # store paths, retention caps, sources, repo->domain map, curate gate
```
@@ -141,6 +144,25 @@ python -m session_memory.distribute --json
`distribute/active_patterns.json` records which pattern+version is proposed in
which `(repo, flavor)` (FR-X4).

## Measure effectiveness (closing the loop)

Track whether the fleet is getting cheaper / more reliable, and whether a
distributed pattern actually helped.

```bash
python -m session_memory.measure --label "baseline" # snapshot + trend
python -m session_memory.measure --since 2026-06-07 # before/after a change
python -m session_memory.measure --no-save --json
```

- A **snapshot** (infra-overhead share, error rate, schema-thrash, token
  percentiles, success rate) is appended to `measure/baselines.jsonl` to build a
  trend (FR-M3).
- `--since DATE` splits sessions before/after a change and diffs the metrics, with
  an `improved` verdict per metric (FR-M1/FR-M2), so ineffective patterns can be
  retired. Recorded pre-fix baseline (2026-06-07): 27 sessions, infra-overhead
  median 11.7%, error rate 0.96, schema-thrash 8 sessions.
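
The two bullets above can be illustrated with a minimal sketch. It is not the code in `metrics.py` or `effect.py`: the session fields (`day`, `overhead_share`, `had_error`, `success`), the function names, the `LOWER_IS_BETTER` table, and the sample values are all assumptions made up for illustration. It shows one way to append a snapshot as a JSON line and to derive a per-metric `improved` verdict from a before/after split.

```python
import json
from datetime import date
from pathlib import Path
from statistics import median

# Hypothetical direction table: True means a lower value counts as better.
LOWER_IS_BETTER = {"overhead_median": True, "error_rate": True, "success_rate": False}


def fleet_metrics(sessions):
    """Aggregate session dicts (assumed shape) into fleet-level numbers."""
    if not sessions:
        raise ValueError("need at least one session to compute metrics")
    n = len(sessions)
    return {
        "sessions": n,
        "overhead_median": median(s["overhead_share"] for s in sessions),
        "error_rate": sum(s["had_error"] for s in sessions) / n,
        "success_rate": sum(s["success"] for s in sessions) / n,
    }


def append_snapshot(path, label, metrics):
    """Append one labelled snapshot as a JSON line; earlier lines are kept."""
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps({"label": label, **metrics}) + "\n")


def compare_since(sessions, since):
    """Split sessions at `since` and report a verdict for each metric."""
    before = fleet_metrics([s for s in sessions if s["day"] < since])
    after = fleet_metrics([s for s in sessions if s["day"] >= since])
    report = {}
    for name, lower_is_better in LOWER_IS_BETTER.items():
        delta = after[name] - before[name]
        report[name] = {
            "before": before[name],
            "after": after[name],
            "delta": delta,
            # A drop is an improvement only for lower-is-better metrics.
            "improved": delta < 0 if lower_is_better else delta > 0,
        }
    return report


# Made-up sample sessions on either side of the 2026-06-07 cut-off.
sessions = [
    {"day": date(2026, 6, 5), "overhead_share": 0.12, "had_error": True, "success": False},
    {"day": date(2026, 6, 6), "overhead_share": 0.10, "had_error": True, "success": True},
    {"day": date(2026, 6, 8), "overhead_share": 0.06, "had_error": False, "success": True},
    {"day": date(2026, 6, 9), "overhead_share": 0.08, "had_error": True, "success": True},
]
report = compare_since(sessions, date(2026, 6, 7))
for name, row in report.items():
    print(name, row)
```

In this sketch, calling `append_snapshot("baselines.jsonl", "baseline", fleet_metrics(sessions))` once per run would build up the kind of append-only history a trend can be read from. Keeping the direction of each metric in one table means the verdict logic does not need a special case per metric.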

## Retention knobs (`[retention]` in config.toml)

| Key | Meaning |
@@ -174,4 +196,6 @@ python -m pytest # schema, adapters, store, digest, retention, ingest,
- **Phase 3** (AGENTIC-WP-0007): Distribute. Per-flavor distributor adapters
  render approved patterns into proposed (HITL) artifacts, scoped by repo/domain,
  with an active-pattern registry.
- **Next: Phase 4 (Measure)** closes the loop per the PRD.
- **Phase 4** (AGENTIC-WP-0009): Measure. Fleet baseline/trend + before/after
  per-pattern effectiveness. The Capture → Detect → Curate → Distribute → Measure
  loop is closed.