Compare commits

...

3 Commits

Author SHA1 Message Date
1b6081cd88 session-memory: denoise error fingerprints (WP-0006 follow-up)
Tighten _is_failed: exclude successful hub JSON responses (top-level no-error
payloads) and file-read snapshots (numbered cat -n source lines) that were
polluting error_snippets. JSON verdict classifies error vs success payloads
directly. Cuts distinct fingerprints 444 -> 269 (~40%) over the real corpus with
the top errors unchanged. Assessment caveat updated. 5 new tests; suite 102/102.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 13:39:08 +02:00
7cce276d32 session-memory: error root-cause assessment + v2 re-ingest (WP-0006 T03)
Re-ingested under schema v2 (populates error_snippets) and re-ran detect over
27 real sessions. Added a 'content-level root causes' section to
docs/ASSESSMENT-infra-friction.md: top recurring error is Edit/Write-before-Read
(12/27 sessions, 8 repos), then stale-read conflicts, a cross-flavor (claude+grok)
make fix-consistency failure, and State Hub MCP instability. Documented a
fingerprint-noise caveat. WP-0006 finished; suite 98/98.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 13:09:29 +02:00
e022c0f9d6 session-memory: recurring-error signal + clustering (WP-0006 T02)
detect/signals.py sig_recurring_error emits one signal per distinct error
fingerprint per session (magnitude = in-session occurrences), so the same error
recurring across sessions/repos/flavors clusters into a candidate root-cause
problem pattern via the existing clusterer — cross-flavor flagged automatically.
3 new tests; suite 98/98 green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 13:01:29 +02:00
6 changed files with 201 additions and 10 deletions

View File

@@ -86,15 +86,59 @@ issue.** Two high-ROI moves:
share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %).
This is precisely what the Measure phase is for — the loop closes here.
## Content-level root causes (error-body mining)
*Added 2026-06-07 from [AGENTIC-WP-0006] — `build_digest` now mines normalized
error fingerprints into the durable digest, and `sig_recurring_error` clusters
them. This is the "why" the tool-mix view above could not see.*
**26 of 27 real sessions hit at least one error.** Top recurring error
fingerprints across the corpus (by # sessions affected):
| # sessions | occ | flavors | top sample |
|-----------:|----:|---------|------------|
| **12** | 32 | claude | `<tool_use_error>File has not been read yet. Read it first before writing to it.` |
| **6** | 13 | claude | `<tool_use_error>File has been modified since read …` |
| **4** | 9 | **claude + grok** | `make: *** [Makefile:227: fix-consistency] Error 1` |
| 3 | 21 | claude | `MCP error -32602: Invalid request parameters` |
| 3 | 6 | claude | `Error calling tool 'update_task_status': 'title'` |
| 2 | 6 | claude | `make: *** [Makefile:21: test] Error 1` |
Reading:
- **#1 — Edit/Write-before-Read (12/27 sessions, 8 repos).** The single most
common error is agents trying to edit a file they haven't read into context.
This is a *workflow* friction, highly addressable: a Read-before-Edit reflex in
the agent instructions / a skill, or a harness affordance. (Observed live: the
author hit this exact error twice while writing this workplan.)
- **#2 — stale-read conflicts (6 sessions):** "File has been modified since read"
— same family, a re-read-before-edit discipline fixes both.
- **#3 — cross-flavor `make fix-consistency` failures (claude + grok, 3 repos):**
the consistency tooling itself fails across flavors — a shared infra issue worth
a look on the state-hub side (cf. [STATE-WP-0058]).
- **State Hub MCP instability** (`-32602`, `update_task_status 'title'`) shows up
in 3 sessions each — corroborates the plumbing-overhead story and the live MCP
flakiness seen during this work (REST fallback used).
**Fingerprint noise — mostly handled.** `_is_failed` now excludes successful hub
JSON responses (top-level no-error payloads) and file-read snapshots (numbered
`cat -n` source lines), which cut distinct fingerprints **444 → 269 (~40 %)**
without touching the top entries. Residual low-value items remain in the long tail
(bare structural lines like `{`, linter "N errors" summaries); the *top*
fingerprints are real. Note several entries (`MCP error -32602`,
`update_task_status 'title'`) reflect the State Hub MCP instability hit live during
this work — genuine, if self-referential, friction.
## What this assessment still can't see
- **Why** a session was expensive at the *content* level (specific error
messages, repeated failed approaches) — the digest captures tool histograms and
prompt/response snippets but not error-body text. Mining tool-result bodies for
recurring failure messages is the natural next extension if root-cause depth is
needed.
- ~~**Why** a session was expensive at the content level.~~ **Now addressed**
(error-body mining, above), modulo the fingerprint-noise caveat.
- Repeated *failed approaches* (as opposed to surfaced errors) — e.g. an agent
silently retrying a wrong strategy without an error — are still invisible.
- Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
friction claims are Claude-weighted for now.
[AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
[AGENTIC-WP-0006]: ../workplans/AGENTIC-WP-0006-error-body-mining.md
[STATE-WP-0058]: handed off to the state-hub repo worker
[detect/quality.py]: ../session_memory/detect/quality.py

View File

@@ -12,6 +12,7 @@ belongs to the Detect phase (PRD §6.2).
from __future__ import annotations
import collections
import json
import re
from typing import Any
@@ -22,6 +23,12 @@ _FAIL_HINTS = ("error", "failed", "exception", "traceback", "fatal", "non-zero")
# Substrings suggesting a clean test pass.
_PASS_HINTS = ("passed", "0 failed", "ok", "success")
# A line that is numbered source content from a Read result (`cat -n` style),
# e.g. "229\t raise InfospaceError(" — code text, never a runtime error.
_NUMBERED_LINE_RE = re.compile(r"^\s*\d+\t")
# Top-level keys that mark a JSON tool-result as an actual error (vs. success).
_JSON_ERROR_KEYS = ("error", "errors", "detail")
# Normalization patterns so the same error collapses to one fingerprint
# regardless of paths / ids / counts (WP-0006 T01).
_UUID_RE = re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I)
@@ -195,12 +202,48 @@ def _error_body(event: SessionEvent, blobs: dict) -> str:
return event.summary or ""
def _looks_like_file_read(body: str) -> bool:
"""True if the body is mostly numbered source lines (a Read result), not an error."""
lines = [ln for ln in body.splitlines() if ln.strip()]
if not lines:
return False
numbered = sum(1 for ln in lines if _NUMBERED_LINE_RE.match(ln))
return numbered >= max(3, len(lines) // 2)
def _json_verdict(body: str):
"""Classify a JSON tool-result body: 'error', 'success', or None (not JSON).
Hub MCP successes look like ``{"result": "..."}`` and mention 'error' deep
inside summaries but are not failures ('success'). A payload with a top-level
error key (``{"detail": ...}`` / ``{"error": ...}``) is 'error'. Non-JSON text
returns None so the plain fail-hint heuristic still applies.
"""
s = body.strip()
if not s or s[0] not in "{[":
return None
try:
obj = json.loads(s)
except (ValueError, TypeError):
return None
if isinstance(obj, dict) and any(k in obj for k in _JSON_ERROR_KEYS):
return "error"
return "success"
def _is_failed(event: SessionEvent, blobs: dict) -> bool:
if event.kind == "error":
return True
if event.kind == "tool_result":
body = _error_body(event, blobs).lower()
return bool(body) and any(h in body for h in _FAIL_HINTS)
body = _error_body(event, blobs)
if not body.strip():
return False
if _looks_like_file_read(body):
return False
verdict = _json_verdict(body)
if verdict is not None:
return verdict == "error"
return any(h in body.lower() for h in _FAIL_HINTS)
return False

View File

@@ -156,10 +156,28 @@ def sig_tool_thrash(digest, ctx) -> list[Signal]:
return []
def sig_recurring_error(digest, ctx) -> list[Signal]:
"""Problem: a normalized error fingerprint (WP-0006) — one signal per distinct
error in the session, so the same error across sessions/repos/flavors clusters
into a candidate root-cause pattern (locus = fingerprint, magnitude = in-session
occurrences). This is the content-level 'why', not just a coarse error count.
"""
out: list[Signal] = []
for snip in digest.get("error_snippets", []) or []:
fp = snip.get("fingerprint")
if not fp:
continue
out.append(_base(digest, "recurring_error", PROBLEM, fp, float(snip.get("count", 1)),
sample=snip.get("sample", ""), tool=snip.get("tool"),
occurrences=snip.get("count", 1)))
return out
EXTRACTORS: list[Callable] = [
sig_retry_storm, sig_repeated_errors, sig_budget_overrun, sig_abandoned,
sig_clean_pass, sig_error_then_recovery,
sig_infra_overhead, sig_schema_thrash, sig_tool_thrash,
sig_recurring_error,
]

View File

@@ -0,0 +1,59 @@
"""Recurring-error signal + clustering (WP-0006 T02)."""
import os
import sys
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from session_memory.detect.cluster import cluster # noqa: E402
from session_memory.detect.signals import ( # noqa: E402
extract_signals,
sig_recurring_error,
)
def _digest(uid, repo, flavor="claude", snippets=None):
return {
"session_uid": uid, "flavor": flavor, "repo": repo, "outcome": "success",
"cost": {"input_tokens": 1, "output_tokens": 1},
"markers": {"errors": 0, "retries": 0, "test_runs": 0},
"tool_histogram": {}, "error_snippets": snippets or [],
}
_FP = "modulenotfounderror: no module named 'foo' at <path>:<n>"
def test_signal_per_distinct_fingerprint():
d = _digest("claude:a", "r1", snippets=[
{"fingerprint": _FP, "sample": "ModuleNotFoundError ...", "count": 3, "tool": "Bash"},
{"fingerprint": "keyerror: <str>", "sample": "KeyError", "count": 1, "tool": None},
])
sigs = sig_recurring_error(d, {})
assert len(sigs) == 2
top = [s for s in sigs if s.locus == _FP][0]
assert top.type == "recurring_error"
assert top.magnitude == 3.0
assert top.detail["sample"].startswith("ModuleNotFound")
def test_clusters_across_sessions_and_flavors():
# same fingerprint in a claude and a grok session -> cross-flavor candidate
digs = [
_digest("claude:a", "r1", "claude",
[{"fingerprint": _FP, "sample": "ModuleNotFoundError", "count": 2, "tool": "Bash"}]),
_digest("grok:b", "r2", "grok",
[{"fingerprint": _FP, "sample": "ModuleNotFoundError", "count": 1, "tool": None}]),
]
signals = extract_signals(digs)
pats = cluster([s for s in signals if s.type == "recurring_error"], min_frequency=2)
assert len(pats) == 1
p = pats[0]
assert p.signal_type == "recurring_error"
assert p.cross_flavor is True
assert sorted(p.flavors) == ["claude", "grok"]
assert p.frequency == 2
def test_no_snippets_no_signal():
assert sig_recurring_error(_digest("claude:a", "r1"), {}) == []

View File

@@ -59,6 +59,33 @@ def test_clean_tool_result_not_mined():
assert _error_snippets(events, blobs) == []
def test_success_json_not_mined():
# a hub MCP success payload mentioning 'error' deep inside is NOT a failure
blobs = {"b1": '{"result": "{\\"domain\\": \\"custodian\\", \\"note\\": \\"no errors\\"}"}'}
events = [_ev(0, "tool_result", tool="mcp__state-hub__get_domain_summary", payload_ref="b1")]
assert _error_snippets(events, blobs) == []
def test_error_json_still_mined():
blobs = {"b1": '{"detail": "Invalid request parameters"}'}
events = [_ev(0, "tool_result", tool="Bash", payload_ref="b1")]
snips = _error_snippets(events, blobs)
assert len(snips) == 1
def test_plain_mcp_error_still_mined():
blobs = {"b1": "MCP error -32602: Invalid request parameters"}
events = [_ev(0, "tool_result", tool="Bash", payload_ref="b1")]
assert len(_error_snippets(events, blobs)) == 1
def test_file_read_snapshot_not_mined():
# a Read result of source code containing 'raise ...Error' is not a runtime error
blobs = {"b1": "227\t def f():\n228\t x = 1\n229\t raise InfospaceError()\n"}
events = [_ev(0, "tool_result", tool="Read", payload_ref="b1")]
assert _error_snippets(events, blobs) == []
def test_build_digest_includes_error_snippets_and_v2():
s = Session(session_uid="claude:s", flavor="claude", native_session_id="s", repo="r")
events = [_ev(0, "user_msg"), _ev(1, "error", payload_ref="b1"), _ev(2, "assistant_msg")]

View File

@@ -4,7 +4,7 @@ type: workplan
title: "Coding Session Memory — Error-Body Mining (content-level root causes)"
domain: helix_forge
repo: agentic-resources
status: ready
status: finished
owner: codex
topic_slug: helix-forge
created: "2026-06-07"
@@ -48,7 +48,7 @@ sessions with repeated and varied errors.
```task
id: AGENTIC-WP-0006-T02
status: todo
status: done
priority: high
state_hub_task_id: "1a41b6f5-48bc-4080-bd18-94f2186ef566"
```
@@ -64,7 +64,7 @@ synthetic digests sharing a fingerprint.
```task
id: AGENTIC-WP-0006-T03
status: todo
status: done
priority: medium
state_hub_task_id: "bed16d23-3971-4257-b066-d1e639fef150"
```