session-memory: denoise error fingerprints (WP-0006 follow-up)

Tighten _is_failed: exclude successful hub JSON responses (top-level
no-error payloads) and file-read snapshots (numbered cat -n source lines)
that were polluting error_snippets. JSON verdict classifies error vs success
payloads directly. Cuts distinct fingerprints 444 -> 269 (~40%) over the real
corpus with the top errors unchanged. Assessment caveat updated. 5 new tests;
suite 102/102.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

session-memory: error root-cause assessment + v2 re-ingest (WP-0006 T03)
2026-06-07 13:39:08 +02:00 · 2026-06-07 13:09:29 +02:00 · 2026-06-07 13:01:29 +02:00
6 changed files with 201 additions and 10 deletions
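
Reviewer note: the filter order described in the first commit message can be sketched as a standalone function over raw result bodies. This is a minimal sketch, not the repository code. The constants and helper logic mirror the `session_memory/core/digest.py` hunk in this diff, while `classify_body` and `old_check` are hypothetical names introduced here to contrast the previous substring check with the tightened one.

```python
import json
import re

# Constants mirrored from the digest.py hunk; everything else is illustrative.
FAIL_HINTS = ("error", "failed", "exception", "traceback", "fatal", "non-zero")
NUMBERED_LINE_RE = re.compile(r"^\s*\d+\t")
JSON_ERROR_KEYS = ("error", "errors", "detail")


def old_check(body: str) -> bool:
    """Previous behaviour (removed lines): any fail hint anywhere in the body."""
    lowered = body.lower()
    return bool(lowered) and any(h in lowered for h in FAIL_HINTS)


def looks_like_file_read(body: str) -> bool:
    """True if at least 3 and at least half of the non-blank lines are numbered."""
    lines = [ln for ln in body.splitlines() if ln.strip()]
    if not lines:
        return False
    numbered = sum(1 for ln in lines if NUMBERED_LINE_RE.match(ln))
    return numbered >= max(3, len(lines) // 2)


def json_verdict(body: str):
    """Return 'error', 'success', or None when the body is not parseable JSON."""
    s = body.strip()
    if not s or s[0] not in "{[":
        return None
    try:
        obj = json.loads(s)
    except (ValueError, TypeError):
        return None
    if isinstance(obj, dict) and any(k in obj for k in JSON_ERROR_KEYS):
        return "error"
    return "success"


def classify_body(body: str) -> bool:
    """Hypothetical standalone version of the tightened tool_result branch."""
    if not body.strip():
        return False
    if looks_like_file_read(body):
        return False
    verdict = json_verdict(body)
    if verdict is not None:
        return verdict == "error"
    return any(h in body.lower() for h in FAIL_HINTS)


# Sample bodies adapted from the new tests in tests/test_digest_errors.py.
samples = {
    "hub_success": '{"result": "{\\"note\\": \\"no errors\\"}"}',
    "hub_error": '{"detail": "Invalid request parameters"}',
    "plain_error": "MCP error -32602: Invalid request parameters",
    "file_read": "227\t    def f():\n228\t        x = 1\n229\t        raise InfospaceError()\n",
}

for name, body in samples.items():
    print(f"{name}: old={old_check(body)} new={classify_body(body)}")
```

Under these assumptions, the hub success payload and the numbered file read stop counting as failures, while the `{"detail": ...}` payload is now caught even though it contains none of the fail hints. One edge case visible in the sketch: a numbered read with fewer than three non-blank lines is not recognised as a file read and falls through to the substring check.
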
--- a/docs/ASSESSMENT-infra-friction.md
+++ b/docs/ASSESSMENT-infra-friction.md
@@ -86,15 +86,59 @@ issue.** Two high-ROI moves:
  share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %).
  This is precisely what the Measure phase is for — the loop closes here.

+## Content-level root causes (error-body mining)
+
+*Added 2026-06-07 from [AGENTIC-WP-0006] — `build_digest` now mines normalized
+error fingerprints into the durable digest, and `sig_recurring_error` clusters
+them. This is the "why" the tool-mix view above could not see.*
+
+**26 of 27 real sessions hit at least one error.** Top recurring error
+fingerprints across the corpus (by # sessions affected):
+
+| # sessions | occ | flavors | top sample |
+|-----------:|----:|---------|------------|
+| **12** | 32 | claude | `<tool_use_error>File has not been read yet. Read it first before writing to it.` |
+| **6** | 13 | claude | `<tool_use_error>File has been modified since read …` |
+| **4** | 9 | **claude + grok** | `make: *** [Makefile:227: fix-consistency] Error 1` |
+| 3 | 21 | claude | `MCP error -32602: Invalid request parameters` |
+| 3 | 6 | claude | `Error calling tool 'update_task_status': 'title'` |
+| 2 | 6 | claude | `make: *** [Makefile:21: test] Error 1` |
+
+Reading:
+
+- **#1 — Edit/Write-before-Read (12/27 sessions, 8 repos).** The single most
+  common error is agents trying to edit a file they haven't read into context.
+  This is *workflow* friction and highly addressable: add a Read-before-Edit
+  reflex to the agent instructions or a skill, or provide a harness affordance.
+  (Observed live: the author hit this exact error twice while writing this workplan.)
+- **#2 — stale-read conflicts (6 sessions):** "File has been modified since read"
+  is the same family as #1; a re-read-before-edit discipline fixes both.
+- **#3 — cross-flavor `make fix-consistency` failures (claude + grok, 3 repos):**
+  the consistency tooling itself fails across flavors — a shared infra issue worth
+  a look on the state-hub side (cf. [STATE-WP-0058]).
+- **State Hub MCP instability** (`-32602`, `update_task_status 'title'`) shows up
+  in 3 sessions each — corroborates the plumbing-overhead story and the live MCP
+  flakiness seen during this work (REST fallback used).
+
+**Fingerprint noise — mostly handled.** `_is_failed` now excludes successful hub
+JSON responses (top-level no-error payloads) and file-read snapshots (numbered
+`cat -n` source lines), which cut distinct fingerprints **444 → 269 (~40 %)**
+without touching the top entries. Residual low-value items remain in the long tail
+(bare structural lines like `{`, linter "N errors" summaries); the *top*
+fingerprints are real. Note several entries (`MCP error -32602`,
+`update_task_status 'title'`) reflect the State Hub MCP instability hit live during
+this work — genuine, if self-referential, friction.
+
 ## What this assessment still can't see

-- **Why** a session was expensive at the *content* level (specific error
-  messages, repeated failed approaches) — the digest captures tool histograms and
-  prompt/response snippets but not error-body text. Mining tool-result bodies for
-  recurring failure messages is the natural next extension if root-cause depth is
-  needed.
+- ~~**Why** a session was expensive at the content level.~~ **Now addressed**
+  (error-body mining, above), modulo the fingerprint-noise caveat.
+- Repeated *failed approaches* (as opposed to surfaced errors) — e.g. an agent
+  silently retrying a wrong strategy without an error — are still invisible.
 - Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
  friction claims are Claude-weighted for now.

 [AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
+[AGENTIC-WP-0006]: ../workplans/AGENTIC-WP-0006-error-body-mining.md
+[STATE-WP-0058]: handed off to the state-hub repo worker
 [detect/quality.py]: ../session_memory/detect/quality.py
--- a/session_memory/core/digest.py
+++ b/session_memory/core/digest.py
@@ -12,6 +12,7 @@ belongs to the Detect phase (PRD §6.2).
 from __future__ import annotations

 import collections
+import json
 import re
 from typing import Any

@@ -22,6 +23,12 @@ _FAIL_HINTS = ("error", "failed", "exception", "traceback", "fatal", "non-zero")
 # Substrings suggesting a clean test pass.
 _PASS_HINTS = ("passed", "0 failed", "ok", "success")

+# A line that is numbered source content from a Read result (`cat -n` style),
+# e.g. "229\t    raise InfospaceError(" — code text, never a runtime error.
+_NUMBERED_LINE_RE = re.compile(r"^\s*\d+\t")
+# Top-level keys that mark a JSON tool-result as an actual error (vs. success).
+_JSON_ERROR_KEYS = ("error", "errors", "detail")
+
 # Normalization patterns so the same error collapses to one fingerprint
 # regardless of paths / ids / counts (WP-0006 T01).
 _UUID_RE = re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I)
@@ -195,12 +202,48 @@ def _error_body(event: SessionEvent, blobs: dict) -> str:
    return event.summary or ""


+def _looks_like_file_read(body: str) -> bool:
+    """True if the body is mostly numbered source lines (a Read result), not an error."""
+    lines = [ln for ln in body.splitlines() if ln.strip()]
+    if not lines:
+        return False
+    numbered = sum(1 for ln in lines if _NUMBERED_LINE_RE.match(ln))
+    return numbered >= max(3, len(lines) // 2)
+
+
+def _json_verdict(body: str) -> str | None:
+    """Classify a JSON tool-result body: 'error', 'success', or None (not JSON).
+
+    Hub MCP successes look like ``{"result": "..."}``; they may mention 'error'
+    deep inside a summary but are not failures, so they classify as 'success'.
+    A payload with a top-level error key (``{"detail": ...}`` / ``{"error": ...}``)
+    is 'error'. Non-JSON text returns None so the fail-hint heuristic still applies.
+    """
+    s = body.strip()
+    if not s or s[0] not in "{[":
+        return None
+    try:
+        obj = json.loads(s)
+    except (ValueError, TypeError):
+        return None
+    if isinstance(obj, dict) and any(k in obj for k in _JSON_ERROR_KEYS):
+        return "error"
+    return "success"
+
+
 def _is_failed(event: SessionEvent, blobs: dict) -> bool:
    if event.kind == "error":
        return True
    if event.kind == "tool_result":
-        body = _error_body(event, blobs).lower()
-        return bool(body) and any(h in body for h in _FAIL_HINTS)
+        body = _error_body(event, blobs)
+        if not body.strip():
+            return False
+        if _looks_like_file_read(body):
+            return False
+        verdict = _json_verdict(body)
+        if verdict is not None:
+            return verdict == "error"
+        return any(h in body.lower() for h in _FAIL_HINTS)
    return False


--- a/session_memory/detect/signals.py
+++ b/session_memory/detect/signals.py
@@ -156,10 +156,28 @@ def sig_tool_thrash(digest, ctx) -> list[Signal]:
    return []


+def sig_recurring_error(digest, ctx) -> list[Signal]:
+    """Problem: a normalized error fingerprint (WP-0006) — one signal per distinct
+    error in the session, so the same error across sessions/repos/flavors clusters
+    into a candidate root-cause pattern (locus = fingerprint, magnitude = in-session
+    occurrences). This is the content-level 'why', not just a coarse error count.
+    """
+    out: list[Signal] = []
+    for snip in digest.get("error_snippets", []) or []:
+        fp = snip.get("fingerprint")
+        if not fp:
+            continue
+        out.append(_base(digest, "recurring_error", PROBLEM, fp, float(snip.get("count", 1)),
+                         sample=snip.get("sample", ""), tool=snip.get("tool"),
+                         occurrences=snip.get("count", 1)))
+    return out
+
+
 EXTRACTORS: list[Callable] = [
    sig_retry_storm, sig_repeated_errors, sig_budget_overrun, sig_abandoned,
    sig_clean_pass, sig_error_then_recovery,
    sig_infra_overhead, sig_schema_thrash, sig_tool_thrash,
+    sig_recurring_error,
 ]


--- a/tests/test_detect_recurring_error.py
+++ b/tests/test_detect_recurring_error.py
@@ -0,0 +1,59 @@
+"""Recurring-error signal + clustering (WP-0006 T02)."""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from session_memory.detect.cluster import cluster  # noqa: E402
+from session_memory.detect.signals import (  # noqa: E402
+    extract_signals,
+    sig_recurring_error,
+)
+
+
+def _digest(uid, repo, flavor="claude", snippets=None):
+    return {
+        "session_uid": uid, "flavor": flavor, "repo": repo, "outcome": "success",
+        "cost": {"input_tokens": 1, "output_tokens": 1},
+        "markers": {"errors": 0, "retries": 0, "test_runs": 0},
+        "tool_histogram": {}, "error_snippets": snippets or [],
+    }
+
+
+_FP = "modulenotfounderror: no module named 'foo' at <path>:<n>"
+
+
+def test_signal_per_distinct_fingerprint():
+    d = _digest("claude:a", "r1", snippets=[
+        {"fingerprint": _FP, "sample": "ModuleNotFoundError ...", "count": 3, "tool": "Bash"},
+        {"fingerprint": "keyerror: <str>", "sample": "KeyError", "count": 1, "tool": None},
+    ])
+    sigs = sig_recurring_error(d, {})
+    assert len(sigs) == 2
+    top = [s for s in sigs if s.locus == _FP][0]
+    assert top.type == "recurring_error"
+    assert top.magnitude == 3.0
+    assert top.detail["sample"].startswith("ModuleNotFound")
+
+
+def test_clusters_across_sessions_and_flavors():
+    # same fingerprint in a claude and a grok session -> cross-flavor candidate
+    digs = [
+        _digest("claude:a", "r1", "claude",
+                [{"fingerprint": _FP, "sample": "ModuleNotFoundError", "count": 2, "tool": "Bash"}]),
+        _digest("grok:b", "r2", "grok",
+                [{"fingerprint": _FP, "sample": "ModuleNotFoundError", "count": 1, "tool": None}]),
+    ]
+    signals = extract_signals(digs)
+    pats = cluster([s for s in signals if s.type == "recurring_error"], min_frequency=2)
+    assert len(pats) == 1
+    p = pats[0]
+    assert p.signal_type == "recurring_error"
+    assert p.cross_flavor is True
+    assert sorted(p.flavors) == ["claude", "grok"]
+    assert p.frequency == 2
+
+
+def test_no_snippets_no_signal():
+    assert sig_recurring_error(_digest("claude:a", "r1"), {}) == []
--- a/tests/test_digest_errors.py
+++ b/tests/test_digest_errors.py
@@ -59,6 +59,33 @@ def test_clean_tool_result_not_mined():
    assert _error_snippets(events, blobs) == []


+def test_success_json_not_mined():
+    # a hub MCP success payload mentioning 'error' deep inside is NOT a failure
+    blobs = {"b1": '{"result": "{\\"domain\\": \\"custodian\\", \\"note\\": \\"no errors\\"}"}'}
+    events = [_ev(0, "tool_result", tool="mcp__state-hub__get_domain_summary", payload_ref="b1")]
+    assert _error_snippets(events, blobs) == []
+
+
+def test_error_json_still_mined():
+    blobs = {"b1": '{"detail": "Invalid request parameters"}'}
+    events = [_ev(0, "tool_result", tool="Bash", payload_ref="b1")]
+    snips = _error_snippets(events, blobs)
+    assert len(snips) == 1
+
+
+def test_plain_mcp_error_still_mined():
+    blobs = {"b1": "MCP error -32602: Invalid request parameters"}
+    events = [_ev(0, "tool_result", tool="Bash", payload_ref="b1")]
+    assert len(_error_snippets(events, blobs)) == 1
+
+
+def test_file_read_snapshot_not_mined():
+    # a Read result of source code containing 'raise ...Error' is not a runtime error
+    blobs = {"b1": "227\t    def f():\n228\t        x = 1\n229\t        raise InfospaceError()\n"}
+    events = [_ev(0, "tool_result", tool="Read", payload_ref="b1")]
+    assert _error_snippets(events, blobs) == []
+
+
 def test_build_digest_includes_error_snippets_and_v2():
    s = Session(session_uid="claude:s", flavor="claude", native_session_id="s", repo="r")
    events = [_ev(0, "user_msg"), _ev(1, "error", payload_ref="b1"), _ev(2, "assistant_msg")]
--- a/workplans/AGENTIC-WP-0006-error-body-mining.md
+++ b/workplans/AGENTIC-WP-0006-error-body-mining.md
@@ -4,7 +4,7 @@ type: workplan
 title: "Coding Session Memory — Error-Body Mining (content-level root causes)"
 domain: helix_forge
 repo: agentic-resources
-status: ready
+status: finished
 owner: codex
 topic_slug: helix-forge
 created: "2026-06-07"
@@ -48,7 +48,7 @@ sessions with repeated and varied errors.

 ```task
 id: AGENTIC-WP-0006-T02
-status: todo
+status: done
 priority: high
 state_hub_task_id: "1a41b6f5-48bc-4080-bd18-94f2186ef566"
 ```
@@ -64,7 +64,7 @@ synthetic digests sharing a fingerprint.

 ```task
 id: AGENTIC-WP-0006-T03
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "bed16d23-3971-4257-b066-d1e639fef150"
 ```
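
Reviewer note: the "# sessions", "occ" and "flavors" columns of the assessment table can be reproduced, in spirit, by rolling up `error_snippets` across digests. A minimal sketch follows, assuming digests shaped like the `_digest` helper in `tests/test_detect_recurring_error.py`. `rollup` is a hypothetical helper; it does not reproduce `session_memory.detect.cluster.cluster`, whose implementation is not shown in this diff.

```python
from collections import defaultdict


def rollup(digests):
    """Aggregate per-session error snippets into table rows.

    Returns a list of (fingerprint, n_sessions, occurrences, flavors) tuples,
    sorted by sessions affected, then occurrences, then fingerprint.
    """
    sessions = defaultdict(set)   # fingerprint -> session_uids that hit it
    flavors = defaultdict(set)    # fingerprint -> agent flavors that hit it
    occurrences = defaultdict(int)

    for digest in digests:
        for snip in digest.get("error_snippets") or []:
            fp = snip.get("fingerprint")
            if not fp:
                continue  # skip snippets without a fingerprint, as the signal does
            sessions[fp].add(digest["session_uid"])
            flavors[fp].add(digest["flavor"])
            occurrences[fp] += int(snip.get("count", 1))

    rows = [
        (fp, len(sessions[fp]), occurrences[fp], sorted(flavors[fp]))
        for fp in sessions
    ]
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows


# Synthetic digests, loosely modelled on the test fixtures (not real corpus data).
FP = "modulenotfounderror: no module named 'foo' at <path>:<n>"
digests = [
    {"session_uid": "claude:a", "flavor": "claude",
     "error_snippets": [{"fingerprint": FP, "count": 2}]},
    {"session_uid": "grok:b", "flavor": "grok",
     "error_snippets": [{"fingerprint": FP, "count": 1}]},
    {"session_uid": "claude:c", "flavor": "claude",
     "error_snippets": [{"fingerprint": "keyerror: <str>", "count": 5}]},
]

for fp, n_sessions, occ, flv in rollup(digests):
    print(f"{n_sessions:>3} | {occ:>3} | {' + '.join(flv)} | {fp}")
```

Sorting by sessions affected before raw occurrences matches the table's stated ordering ("by # sessions affected"): a fingerprint seen once in many sessions outranks one that repeats many times in a single session.
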