From 7cce276d32df8cf646ed69862fd22d387f670db2 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Sun, 7 Jun 2026 13:09:29 +0200
Subject: [PATCH] session-memory: error root-cause assessment + v2 re-ingest (WP-0006 T03)

Re-ingested under schema v2 (populates error_snippets) and re-ran detect
over 27 real sessions. Added a 'content-level root causes' section to
docs/ASSESSMENT-infra-friction.md: top recurring error is
Edit/Write-before-Read (12/27 sessions, 8 repos), then stale-read
conflicts, a cross-flavor (claude+grok) make fix-consistency failure, and
State Hub MCP instability. Documented a fingerprint-noise caveat. WP-0006
finished; suite 98/98.

Co-Authored-By: Claude Opus 4.8
---
 docs/ASSESSMENT-infra-friction.md             | 51 +++++++++++++++++--
 .../AGENTIC-WP-0006-error-body-mining.md      |  4 +-
 2 files changed, 48 insertions(+), 7 deletions(-)

diff --git a/docs/ASSESSMENT-infra-friction.md b/docs/ASSESSMENT-infra-friction.md
index 10de702..9a054b9 100644
--- a/docs/ASSESSMENT-infra-friction.md
+++ b/docs/ASSESSMENT-infra-friction.md
@@ -86,15 +86,56 @@ issue.** Two high-ROI moves:
   share on subsequent sessions against this baseline (median 11.7 %, p90 26.1
   %). This is precisely what the Measure phase is for — the loop closes here.
 
+## Content-level root causes (error-body mining)
+
+*Added 2026-06-07 from [AGENTIC-WP-0006] — `build_digest` now mines normalized
+error fingerprints into the durable digest, and `sig_recurring_error` clusters
+them. This is the "why" the tool-mix view above could not see.*
+
+**26 of 27 real sessions hit at least one error.** Top recurring error
+fingerprints across the corpus (by # sessions affected):
+
+| # sessions | occ | flavors | top sample |
+|-----------:|----:|---------|------------|
+| **12** | 32 | claude | `File has not been read yet. Read it first before writing to it.` |
+| **6** | 13 | claude | `File has been modified since read …` |
+| **4** | 9 | **claude + grok** | `make: *** [Makefile:227: fix-consistency] Error 1` |
+| 3 | 21 | claude | `MCP error -32602: Invalid request parameters` |
+| 3 | 6 | claude | `Error calling tool 'update_task_status': 'title'` |
+| 2 | 6 | claude | `make: *** [Makefile:21: test] Error 1` |
+
+Reading:
+
+- **#1 — Edit/Write-before-Read (12/27 sessions, 8 repos).** The single most
+  common error is agents trying to edit a file they haven't read into context.
+  This is a *workflow* friction, highly addressable: a Read-before-Edit reflex in
+  the agent instructions / a skill, or a harness affordance. (Observed live: the
+  author hit this exact error twice while writing this workplan.)
+- **#2 — stale-read conflicts (6 sessions):** "File has been modified since read"
+  — same family, a re-read-before-edit discipline fixes both.
+- **#3 — cross-flavor `make fix-consistency` failures (claude + grok, 3 repos):**
+  the consistency tooling itself fails across flavors — a shared infra issue worth
+  a look on the state-hub side (cf. [STATE-WP-0058]).
+- **State Hub MCP instability** (`-32602`, `update_task_status 'title'`) shows up
+  in 3 sessions each — corroborates the plumbing-overhead story and the live MCP
+  flakiness seen during this work (REST fallback used).
+
+**Caveat — fingerprint noise:** the fail-hint heuristic also catches non-failures
+(successful hub JSON responses, source lines containing `raise …Error`, linter
+"N errors" summaries). The *top* fingerprints above are real; a future refinement
+should tighten `_is_failed` (e.g. skip valid-JSON success payloads and code-read
+snapshots) before trusting the long tail.
+
 ## What this assessment still can't see
 
-- **Why** a session was expensive at the *content* level (specific error
-  messages, repeated failed approaches) — the digest captures tool histograms and
-  prompt/response snippets but not error-body text. Mining tool-result bodies for
-  recurring failure messages is the natural next extension if root-cause depth is
-  needed.
+- ~~**Why** a session was expensive at the content level.~~ **Now addressed**
+  (error-body mining, above), modulo the fingerprint-noise caveat.
+- Repeated *failed approaches* (as opposed to surfaced errors) — e.g. an agent
+  silently retrying a wrong strategy without an error — are still invisible.
 - Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
   friction claims are Claude-weighted for now.
 
 [AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
+[AGENTIC-WP-0006]: ../workplans/AGENTIC-WP-0006-error-body-mining.md
+[STATE-WP-0058]: handed off to the state-hub repo worker
 [detect/quality.py]: ../session_memory/detect/quality.py
diff --git a/workplans/AGENTIC-WP-0006-error-body-mining.md b/workplans/AGENTIC-WP-0006-error-body-mining.md
index 9c34e7a..b7c5097 100644
--- a/workplans/AGENTIC-WP-0006-error-body-mining.md
+++ b/workplans/AGENTIC-WP-0006-error-body-mining.md
@@ -4,7 +4,7 @@ type: workplan
 title: "Coding Session Memory — Error-Body Mining (content-level root causes)"
 domain: helix_forge
 repo: agentic-resources
-status: ready
+status: finished
 owner: codex
 topic_slug: helix-forge
 created: "2026-06-07"
@@ -64,7 +64,7 @@ synthetic digests sharing a fingerprint.
 
 ```task
 id: AGENTIC-WP-0006-T03
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "bed16d23-3971-4257-b066-d1e639fef150"
 ```