Files

tegwick 4f28cd67cf session-memory: Phase 4 Measure — baseline, effectiveness, trend (WP-0009)

Closes the loop. metrics.py: fleet metrics (infra-overhead share, error rate,
schema-thrash, token percentiles, success) + persisted baseline trend. effect.py:
before/after per-pattern effectiveness with an improved verdict per metric.
measure entrypoint with trend + --since effectiveness + JSON. Recorded pre-fix
baseline: 27 sessions, overhead median 11.7%, error rate 0.96, schema-thrash 8.
13 new tests; suite 139/139. Capture->Detect->Curate->Distribute->Measure complete.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-07 15:49:22 +02:00

2.2 KiB

Raw Blame History

id, type, title, domain, repo, status, owner, topic_slug, created, updated, state_hub_workstream_id

id	type	title	domain	repo	status	owner	topic_slug	created	updated	state_hub_workstream_id
AGENTIC-WP-0009	workplan	Coding Session Memory — Phase 4 (Measure: effectiveness + fleet trend)	helix_forge	agentic-resources	finished	codex	helix-forge	2026-06-07	2026-06-07	99f1d836-3be0-40e5-9f17-63d3ecc5fcca

Coding Session Memory — Phase 4 (Measure)

Implements Measure (PRD §6.5, FR-M1–FR-M3) — the loop-closer. After patterns are distributed (Phase 3) and changes land (e.g. the State Hub skill [STATE-WP-0058] and the Read-before-Edit reflex AGENTIC-WP-0008), Measure answers: did it actually help?

Reuses what is already captured — WP-0005 tool buckets, WP-0006 error mining — so this is computation over existing digests, not new capture.

Baseline Metrics Module + Persisted Baseline

id: AGENTIC-WP-0009-T01
status: done
priority: high
state_hub_task_id: "e5c2016a-2d51-4382-a013-7153e053e8ed"

session_memory/measure/metrics.py: compute fleet metrics over real sessions (infra-overhead share, error rate, recurring-error count, schema-thrash, cost percentiles) and persist a timestamped baseline snapshot. Reuses detect.signals.tool_bucket and the digest error_snippets. Unit-tested.

Before/After Per-Pattern Effectiveness

id: AGENTIC-WP-0009-T02
status: done
priority: high
state_hub_task_id: "aa097a00-3462-41da-a137-67e1d61d8d33"

Given a change/pattern with an applied-at date, compare sessions after it against the pre-change baseline (cost, error rate, infra-overhead, success) to surface per-pattern effectiveness so ineffective patterns can be revised or retired (FR-M1/FR-M2). Unit-tested.

Fleet-Trend Report + Entrypoint + Tests

id: AGENTIC-WP-0009-T03
status: done
priority: medium
state_hub_task_id: "f1147d59-2fb7-4d35-baec-b8f001bb9d62"

python -m session_memory.measure: fleet-level trend (is the median session getting cheaper / more reliable over time, FR-M3) plus per-pattern effectiveness; markdown + JSON. Document in session_memory/README.md. After updates, notify the operator to run make fix-consistency REPO=agentic-resources.

[STATE-WP-0058]: handed off to the state-hub repo worker

2.2 KiB Raw Blame History Unescape Escape

Coding Session Memory — Phase 4 (Measure)

Baseline Metrics Module + Persisted Baseline

Before/After Per-Pattern Effectiveness

Fleet-Trend Report + Entrypoint + Tests

2.2 KiB

Raw Blame History