generated from coulomb/repo-seed
Compare commits: 2bd6aa3b41...1b6081cd88 (3 commits)

| SHA1 |
|---|
| 1b6081cd88 |
| 7cce276d32 |
| e022c0f9d6 |
@@ -86,15 +86,59 @@ issue.** Two high-ROI moves:
 share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %).
 This is precisely what the Measure phase is for — the loop closes here.

+## Content-level root causes (error-body mining)
+
+*Added 2026-06-07 from [AGENTIC-WP-0006] — `build_digest` now mines normalized
+error fingerprints into the durable digest, and `sig_recurring_error` clusters
+them. This is the "why" the tool-mix view above could not see.*
+
+**26 of 27 real sessions hit at least one error.** Top recurring error
+fingerprints across the corpus (by # sessions affected):
+
+| # sessions | occ | flavors | top sample |
+|-----------:|----:|---------|------------|
+| **12** | 32 | claude | `<tool_use_error>File has not been read yet. Read it first before writing to it.` |
+| **6** | 13 | claude | `<tool_use_error>File has been modified since read …` |
+| **4** | 9 | **claude + grok** | `make: *** [Makefile:227: fix-consistency] Error 1` |
+| 3 | 21 | claude | `MCP error -32602: Invalid request parameters` |
+| 3 | 6 | claude | `Error calling tool 'update_task_status': 'title'` |
+| 2 | 6 | claude | `make: *** [Makefile:21: test] Error 1` |
+
+Reading:
+
+- **#1 — Edit/Write-before-Read (12/27 sessions, 8 repos).** The single most
+  common error is agents trying to edit a file they haven't read into context.
+  This is a *workflow* friction, highly addressable: a Read-before-Edit reflex in
+  the agent instructions / a skill, or a harness affordance. (Observed live: the
+  author hit this exact error twice while writing this workplan.)
+- **#2 — stale-read conflicts (6 sessions):** "File has been modified since read"
+  — same family, a re-read-before-edit discipline fixes both.
+- **#3 — cross-flavor `make fix-consistency` failures (claude + grok, 3 repos):**
+  the consistency tooling itself fails across flavors — a shared infra issue worth
+  a look on the state-hub side (cf. [STATE-WP-0058]).
+- **State Hub MCP instability** (`-32602`, `update_task_status 'title'`) shows up
+  in 3 sessions each — corroborates the plumbing-overhead story and the live MCP
+  flakiness seen during this work (REST fallback used).
+
+**Fingerprint noise — mostly handled.** `_is_failed` now excludes successful hub
+JSON responses (top-level no-error payloads) and file-read snapshots (numbered
+`cat -n` source lines), which cut distinct fingerprints **444 → 269 (~40 %)**
+without touching the top entries. Residual low-value items remain in the long tail
+(bare structural lines like `{`, linter "N errors" summaries); the *top*
+fingerprints are real. Note several entries (`MCP error -32602`,
+`update_task_status 'title'`) reflect the State Hub MCP instability hit live during
+this work — genuine, if self-referential, friction.
+
 ## What this assessment still can't see

-- **Why** a session was expensive at the *content* level (specific error
-  messages, repeated failed approaches) — the digest captures tool histograms and
-  prompt/response snippets but not error-body text. Mining tool-result bodies for
-  recurring failure messages is the natural next extension if root-cause depth is
-  needed.
+- ~~**Why** a session was expensive at the content level.~~ **Now addressed**
+  (error-body mining, above), modulo the fingerprint-noise caveat.
+- Repeated *failed approaches* (as opposed to surfaced errors) — e.g. an agent
+  silently retrying a wrong strategy without an error — are still invisible.
 - Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
   friction claims are Claude-weighted for now.

 [AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
+[AGENTIC-WP-0006]: ../workplans/AGENTIC-WP-0006-error-body-mining.md
+[STATE-WP-0058]: handed off to the state-hub repo worker
 [detect/quality.py]: ../session_memory/detect/quality.py
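Review note: findings #1 and #2 above suggest a "harness affordance" for Read-before-Edit. The sketch below shows one way such a check could look. It is an illustration only: `ReadBeforeEditGuard`, its methods, and the `demo` helper are hypothetical names that do not exist in this repository, and using the file's modification time as the staleness test is an assumption, not how any particular agent harness works.

```python
import os
import tempfile


class ReadBeforeEditGuard:
    """Hypothetical harness-side check: track reads, vet edits before they run."""

    def __init__(self):
        # absolute path -> modification time (ns) observed at the last read
        self._seen = {}

    def record_read(self, path):
        """Remember that the agent has read `path` in its current state."""
        self._seen[os.path.abspath(path)] = os.stat(path).st_mtime_ns

    def check_edit(self, path):
        """Return 'ok', 'unread' (finding #1) or 'stale' (finding #2)."""
        key = os.path.abspath(path)
        if not os.path.exists(path):
            return "ok"  # creating a new file needs no prior read
        if key not in self._seen:
            return "unread"
        if os.stat(path).st_mtime_ns != self._seen[key]:
            return "stale"
        return "ok"


def demo():
    """Walk one file through the three states and return the verdicts."""
    guard = ReadBeforeEditGuard()
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notes.txt")
        with open(path, "w") as fh:
            fh.write("v1\n")
        before_read = guard.check_edit(path)
        guard.record_read(path)
        after_read = guard.check_edit(path)
        # Simulate an external modification by moving the mtime forward 1 s.
        stamp = os.stat(path).st_mtime_ns + 1_000_000_000
        os.utime(path, ns=(stamp, stamp))
        after_change = guard.check_edit(path)
    return [before_read, after_read, after_change]
```

A harness could call `check_edit` before dispatching an Edit or Write and trigger a re-read on `unread` or `stale` instead of surfacing the tool error to the agent.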
@@ -12,6 +12,7 @@ belongs to the Detect phase (PRD §6.2).
 from __future__ import annotations

 import collections
+import json
 import re
 from typing import Any

@@ -22,6 +23,12 @@ _FAIL_HINTS = ("error", "failed", "exception", "traceback", "fatal", "non-zero")
 # Substrings suggesting a clean test pass.
 _PASS_HINTS = ("passed", "0 failed", "ok", "success")

+# A line that is numbered source content from a Read result (`cat -n` style),
+# e.g. "229\t raise InfospaceError(" — code text, never a runtime error.
+_NUMBERED_LINE_RE = re.compile(r"^\s*\d+\t")
+# Top-level keys that mark a JSON tool-result as an actual error (vs. success).
+_JSON_ERROR_KEYS = ("error", "errors", "detail")
+
 # Normalization patterns so the same error collapses to one fingerprint
 # regardless of paths / ids / counts (WP-0006 T01).
 _UUID_RE = re.compile(r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I)
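Review note: the comment above says normalization makes the same error collapse to one fingerprint regardless of paths, ids, and counts. The sketch below illustrates that idea under stated assumptions. Only `_UUID_RE` is taken from the diff. `_PATH_RE`, `_NUM_RE`, the `<uuid>` placeholder, and the `fingerprint` function are hypothetical stand-ins, since the repository's other patterns are not shown in this hunk. The `<path>` and `<n>` placeholders follow the fingerprint string used in `tests/test_detect_recurring_error.py`.

```python
import re

# Taken verbatim from the diff.
_UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
)
# Assumptions for illustration: a path is two or more /segments,
# a count is any standalone run of digits.
_PATH_RE = re.compile(r"(?:/[\w.\-]+){2,}")
_NUM_RE = re.compile(r"\b\d+\b")


def fingerprint(message: str) -> str:
    """Collapse the first line of an error message to a normalized key."""
    stripped = message.strip()
    if not stripped:
        return ""
    s = stripped.splitlines()[0].lower()
    # Order matters: replace UUIDs before bare numbers so their digit
    # groups are not rewritten piecemeal.
    s = _UUID_RE.sub("<uuid>", s)
    s = _PATH_RE.sub("<path>", s)
    s = _NUM_RE.sub("<n>", s)
    return re.sub(r"\s+", " ", s).strip()
```

With these patterns, the same `ModuleNotFoundError` raised from two different files and line numbers maps to a single key, which is what lets a later step count it across sessions.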
@@ -195,12 +202,48 @@ def _error_body(event: SessionEvent, blobs: dict) -> str:
     return event.summary or ""


+def _looks_like_file_read(body: str) -> bool:
+    """True if the body is mostly numbered source lines (a Read result), not an error."""
+    lines = [ln for ln in body.splitlines() if ln.strip()]
+    if not lines:
+        return False
+    numbered = sum(1 for ln in lines if _NUMBERED_LINE_RE.match(ln))
+    return numbered >= max(3, len(lines) // 2)
+
+
+def _json_verdict(body: str):
+    """Classify a JSON tool-result body: 'error', 'success', or None (not JSON).
+
+    Hub MCP successes look like ``{"result": "..."}`` and mention 'error' deep
+    inside summaries but are not failures ('success'). A payload with a top-level
+    error key (``{"detail": ...}`` / ``{"error": ...}``) is 'error'. Non-JSON text
+    returns None so the plain fail-hint heuristic still applies.
+    """
+    s = body.strip()
+    if not s or s[0] not in "{[":
+        return None
+    try:
+        obj = json.loads(s)
+    except (ValueError, TypeError):
+        return None
+    if isinstance(obj, dict) and any(k in obj for k in _JSON_ERROR_KEYS):
+        return "error"
+    return "success"
+
+
 def _is_failed(event: SessionEvent, blobs: dict) -> bool:
     if event.kind == "error":
         return True
     if event.kind == "tool_result":
-        body = _error_body(event, blobs).lower()
-        return bool(body) and any(h in body for h in _FAIL_HINTS)
+        body = _error_body(event, blobs)
+        if not body.strip():
+            return False
+        if _looks_like_file_read(body):
+            return False
+        verdict = _json_verdict(body)
+        if verdict is not None:
+            return verdict == "error"
+        return any(h in body.lower() for h in _FAIL_HINTS)
     return False


@@ -156,10 +156,28 @@ def sig_tool_thrash(digest, ctx) -> list[Signal]:
     return []


+def sig_recurring_error(digest, ctx) -> list[Signal]:
+    """Problem: a normalized error fingerprint (WP-0006) — one signal per distinct
+    error in the session, so the same error across sessions/repos/flavors clusters
+    into a candidate root-cause pattern (locus = fingerprint, magnitude = in-session
+    occurrences). This is the content-level 'why', not just a coarse error count.
+    """
+    out: list[Signal] = []
+    for snip in digest.get("error_snippets", []) or []:
+        fp = snip.get("fingerprint")
+        if not fp:
+            continue
+        out.append(_base(digest, "recurring_error", PROBLEM, fp, float(snip.get("count", 1)),
+                         sample=snip.get("sample", ""), tool=snip.get("tool"),
+                         occurrences=snip.get("count", 1)))
+    return out
+
+
 EXTRACTORS: list[Callable] = [
     sig_retry_storm, sig_repeated_errors, sig_budget_overrun, sig_abandoned,
     sig_clean_pass, sig_error_then_recovery,
     sig_infra_overhead, sig_schema_thrash, sig_tool_thrash,
+    sig_recurring_error,
 ]


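Review note: the docstring above says per-session signals with the same fingerprint cluster into a candidate root-cause pattern. The sketch below shows one plausible shape for that grouping. It is not the repository's `cluster` implementation, which this diff does not include. `Sig`, `Pat`, and `cluster_by_locus` are hypothetical stand-ins, and counting `frequency` as distinct sessions is an assumption chosen to match the "# sessions" column of the report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Sig:
    """Hypothetical, minimal stand-in for the repository's Signal."""
    type: str
    locus: str          # the normalized error fingerprint
    session_uid: str
    flavor: str
    magnitude: float = 1.0  # in-session occurrences


@dataclass
class Pat:
    """Hypothetical cluster result."""
    signal_type: str
    locus: str
    frequency: int      # assumption: number of distinct sessions
    occurrences: float  # summed magnitudes
    flavors: list
    cross_flavor: bool


def cluster_by_locus(signals, min_frequency=2):
    """Group signals by (type, locus); keep groups seen in enough sessions."""
    groups = defaultdict(list)
    for sig in signals:
        groups[(sig.type, sig.locus)].append(sig)

    patterns = []
    for (sig_type, locus), members in groups.items():
        sessions = {m.session_uid for m in members}
        if len(sessions) < min_frequency:
            continue  # a one-off error is not a recurring pattern
        flavors = sorted({m.flavor for m in members})
        patterns.append(Pat(sig_type, locus, len(sessions),
                            sum(m.magnitude for m in members),
                            flavors, len(flavors) > 1))
    # Most widespread first, then most frequent overall.
    patterns.sort(key=lambda p: (-p.frequency, -p.occurrences))
    return patterns
```

Keying on the fingerprint rather than the raw message is what lets an error seen in a claude session and a grok session land in the same group and be flagged as cross-flavor.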
59
tests/test_detect_recurring_error.py
Normal file
@@ -0,0 +1,59 @@
+"""Recurring-error signal + clustering (WP-0006 T02)."""
+
+import os
+import sys
+
+sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from session_memory.detect.cluster import cluster  # noqa: E402
+from session_memory.detect.signals import (  # noqa: E402
+    extract_signals,
+    sig_recurring_error,
+)
+
+
+def _digest(uid, repo, flavor="claude", snippets=None):
+    return {
+        "session_uid": uid, "flavor": flavor, "repo": repo, "outcome": "success",
+        "cost": {"input_tokens": 1, "output_tokens": 1},
+        "markers": {"errors": 0, "retries": 0, "test_runs": 0},
+        "tool_histogram": {}, "error_snippets": snippets or [],
+    }
+
+
+_FP = "modulenotfounderror: no module named 'foo' at <path>:<n>"
+
+
+def test_signal_per_distinct_fingerprint():
+    d = _digest("claude:a", "r1", snippets=[
+        {"fingerprint": _FP, "sample": "ModuleNotFoundError ...", "count": 3, "tool": "Bash"},
+        {"fingerprint": "keyerror: <str>", "sample": "KeyError", "count": 1, "tool": None},
+    ])
+    sigs = sig_recurring_error(d, {})
+    assert len(sigs) == 2
+    top = [s for s in sigs if s.locus == _FP][0]
+    assert top.type == "recurring_error"
+    assert top.magnitude == 3.0
+    assert top.detail["sample"].startswith("ModuleNotFound")
+
+
+def test_clusters_across_sessions_and_flavors():
+    # same fingerprint in a claude and a grok session -> cross-flavor candidate
+    digs = [
+        _digest("claude:a", "r1", "claude",
+                [{"fingerprint": _FP, "sample": "ModuleNotFoundError", "count": 2, "tool": "Bash"}]),
+        _digest("grok:b", "r2", "grok",
+                [{"fingerprint": _FP, "sample": "ModuleNotFoundError", "count": 1, "tool": None}]),
+    ]
+    signals = extract_signals(digs)
+    pats = cluster([s for s in signals if s.type == "recurring_error"], min_frequency=2)
+    assert len(pats) == 1
+    p = pats[0]
+    assert p.signal_type == "recurring_error"
+    assert p.cross_flavor is True
+    assert sorted(p.flavors) == ["claude", "grok"]
+    assert p.frequency == 2
+
+
+def test_no_snippets_no_signal():
+    assert sig_recurring_error(_digest("claude:a", "r1"), {}) == []
@@ -59,6 +59,33 @@ def test_clean_tool_result_not_mined():
     assert _error_snippets(events, blobs) == []


+def test_success_json_not_mined():
+    # a hub MCP success payload mentioning 'error' deep inside is NOT a failure
+    blobs = {"b1": '{"result": "{\\"domain\\": \\"custodian\\", \\"note\\": \\"no errors\\"}"}'}
+    events = [_ev(0, "tool_result", tool="mcp__state-hub__get_domain_summary", payload_ref="b1")]
+    assert _error_snippets(events, blobs) == []
+
+
+def test_error_json_still_mined():
+    blobs = {"b1": '{"detail": "Invalid request parameters"}'}
+    events = [_ev(0, "tool_result", tool="Bash", payload_ref="b1")]
+    snips = _error_snippets(events, blobs)
+    assert len(snips) == 1
+
+
+def test_plain_mcp_error_still_mined():
+    blobs = {"b1": "MCP error -32602: Invalid request parameters"}
+    events = [_ev(0, "tool_result", tool="Bash", payload_ref="b1")]
+    assert len(_error_snippets(events, blobs)) == 1
+
+
+def test_file_read_snapshot_not_mined():
+    # a Read result of source code containing 'raise ...Error' is not a runtime error
+    blobs = {"b1": "227\t def f():\n228\t x = 1\n229\t raise InfospaceError()\n"}
+    events = [_ev(0, "tool_result", tool="Read", payload_ref="b1")]
+    assert _error_snippets(events, blobs) == []
+
+
 def test_build_digest_includes_error_snippets_and_v2():
     s = Session(session_uid="claude:s", flavor="claude", native_session_id="s", repo="r")
     events = [_ev(0, "user_msg"), _ev(1, "error", payload_ref="b1"), _ev(2, "assistant_msg")]
@@ -4,7 +4,7 @@ type: workplan
 title: "Coding Session Memory — Error-Body Mining (content-level root causes)"
 domain: helix_forge
 repo: agentic-resources
-status: ready
+status: finished
 owner: codex
 topic_slug: helix-forge
 created: "2026-06-07"
@@ -48,7 +48,7 @@ sessions with repeated and varied errors.

 ```task
 id: AGENTIC-WP-0006-T02
-status: todo
+status: done
 priority: high
 state_hub_task_id: "1a41b6f5-48bc-4080-bd18-94f2186ef566"
 ```
@@ -64,7 +64,7 @@ synthetic digests sharing a fingerprint.

 ```task
 id: AGENTIC-WP-0006-T03
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "bed16d23-3971-4257-b066-d1e639fef150"
 ```