
tegwick 1b6081cd88 session-memory: denoise error fingerprints (WP-0006 follow-up)

Tighten _is_failed: exclude successful hub JSON responses (top-level no-error
payloads) and file-read snapshots (numbered cat -n source lines) that were
polluting error_snippets. JSON verdict classifies error vs success payloads
directly. Cuts distinct fingerprints 444 -> 269 (~40%) over the real corpus with
the top errors unchanged. Assessment caveat updated. 5 new tests; suite 102/102.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-07 13:39:08 +02:00

7.2 KiB


Infrastructure Friction Assessment

Generated 2026-06-07 from captured coding-session data (Helix Forge session memory), after the Detect-hardening pass (AGENTIC-WP-0005). First data-driven assessment of where our agentic coding sessions spend effort on plumbing rather than work.

Method & data quality

Corpus: 72 sessions captured across Claude + Grok. A session-quality filter ([detect/quality.py]) drops health-checks, smoke-tests, and interrupted runs (mostly llm-connect "Say hello in one word"). 27 are real coding sessions.
Caveat: the 41 % that were filtered out had been mislabeled as abandoned by the outcome heuristic and produced a false-positive "cross-flavor abandoned" pattern in the first catalog, which has now been purged. Treat any pre-hardening finding with suspicion.
Key framing: all 27 real sessions ended in success. So the friction here is cost/efficiency, not failure — sessions get there, but pay an avoidable tax to do it.

The headline number

Across the 27 real sessions, tool-call activity breaks down as:

| Bucket | Share |
| --- | --- |
| shell (Bash / run_terminal) | 38.2 % |
| edit | 30.2 % |
| read | 12.9 % |
| State Hub MCP | 10.3 % |
| task-management plumbing | 5.8 % |
| schema-loading (`ToolSearch`) | 1.5 % |
| other | 1.1 % |

~17.6 % of all tool calls in real coding sessions are coordination plumbing (hub + task + schema-loading), not touching the repo. Per-session infra-overhead share: median 11.7 %, p90 26.1 %, max 43.3 % — it concentrates badly.
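As a rough illustration of how the overhead figure and the per-session distribution can be derived, here is a minimal Python sketch. The bucket keys (`state_hub`, `task_mgmt`, `schema_loading`) and the helper names are assumptions made for this example, not the labels used by the actual pipeline, and the percentile method (linear interpolation) may differ from the one behind the reported p90.

```python
# Hypothetical bucket keys; the real classifier's labels may differ.
PLUMBING_BUCKETS = {"state_hub", "task_mgmt", "schema_loading"}


def overhead_share(counts: dict[str, int]) -> float:
    """Fraction of a session's tool calls that fall into plumbing buckets."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    plumbing = sum(n for bucket, n in counts.items() if bucket in PLUMBING_BUCKETS)
    return plumbing / total


def percentile(values: list[float], q: float) -> float:
    """Linear-interpolation percentile for q in [0, 1] (an assumed method)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


# Corpus-level check using the shares from the table above (in percent).
corpus = {
    "shell": 38.2, "edit": 30.2, "read": 12.9, "state_hub": 10.3,
    "task_mgmt": 5.8, "schema_loading": 1.5, "other": 1.1,
}
plumbing_pct = sum(v for k, v in corpus.items() if k in PLUMBING_BUCKETS)
print(round(plumbing_pct, 1))  # hub + task + schema-loading
```

Applying `overhead_share` to each session and then `percentile(shares, 0.5)` and `percentile(shares, 0.9)` would give a median and p90 comparable in spirit to the baseline quoted above.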

Ranked friction

1. State Hub call volume — highest cost, addressable

State Hub MCP is 10.3 % of all tool calls and dominates the worst sessions:

| Repo (one session) | Total calls | State Hub calls | Overhead share |
| --- | --- | --- | --- |
| vergabe-teilnahme | 570 | 231 | 43 % |
| activity-core | 488 | 98 | 23 % |
| flex-auth | 236 | 35 (+27 task) | 29 % |
| net-kingdom | 129 | 25 | 22 % |

Root cause: many fine-grained calls — per-task status updates, per-event progress writes, repeated get_domain_summary. 231 hub calls in a single session is coordination overhead, not work.

2. Schema-loading thrash (`ToolSearch`) — low cost, near-zero-effort fix

106 ToolSearch calls across 22 of 27 sessions (81 %). The State Hub MCP tools are deferred, so nearly every session re-discovers and re-loads the same tool schemas before it can call them. This is pure overhead with no work value — and it is exactly the CLI/MCP-interface friction hypothesized.
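A count like the one above could be produced along these lines. This is a sketch only: it assumes each session is available as a flat list of tool names, and `schema_loading_stats` is a hypothetical helper, not part of the session-memory code.

```python
from collections import Counter


def schema_loading_stats(sessions: dict[str, list[str]], tool: str = "ToolSearch"):
    """Return (total calls to `tool`, sessions that used it, share of sessions)."""
    per_session = {sid: Counter(calls)[tool] for sid, calls in sessions.items()}
    total = sum(per_session.values())
    affected = sum(1 for n in per_session.values() if n > 0)
    share = affected / len(sessions) if sessions else 0.0
    return total, affected, share


# Toy data, not taken from the corpus.
demo = {
    "a": ["ToolSearch", "Bash", "ToolSearch"],
    "b": ["Edit"],
    "c": ["ToolSearch"],
}
total, affected, share = schema_loading_stats(demo)
```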

3. Task-management plumbing — 5.8 %

TaskUpdate / TaskCreate / todo_write / update_task_status. Overlaps with (1); much of it is redundant status churn within a session.

4. Tool thrash — session-shape, watch only

11 sessions hammer a single tool 80–230× (usually Bash or Edit). Less an infra problem than a sign of missing higher-level tooling; low priority.
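One simple way to flag this session shape is to check whether the most-used tool crosses a threshold. The sketch below assumes a threshold of 80 (the lower end of the range quoted above) and a flat list of tool names per session; the actual detector may use a different rule.

```python
from collections import Counter

THRASH_THRESHOLD = 80  # assumed cut-off, taken from the low end of the 80-230x range


def thrashing_tool(calls: list[str], threshold: int = THRASH_THRESHOLD):
    """Return (tool, count) if the most-used tool reaches the threshold, else None."""
    if not calls:
        return None
    tool, count = Counter(calls).most_common(1)[0]
    return (tool, count) if count >= threshold else None
```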

5. Budget overrun — 3 sessions

Token cost well above peers. Secondary; revisit once (1)–(2) are addressed.

Recommendations

The CLI/MCP-interface hypothesis is validated as a top-2 friction, not a minor issue. Two high-ROI moves (A, B), followed by a measurement step (C):

A. A State Hub skill (highest ROI). A skill (or a pre-loaded tool manifest) that (i) front-loads the common hub tool schemas so agents stop ToolSearch-ing for them — eliminates finding #2 almost entirely (81 % of sessions) — and (ii) teaches batched writes (sync N task statuses in one call, fewer progress events) to attack finding #1. Low effort, broad reach.
B. Coarser hub operations. Add bulk endpoints / a single "sync workplan statuses" op so a session doesn't make 200+ individual hub calls. This is the structural fix behind the skill's guidance.
C. Measure the effect (Phase 4). After A/B land, compare infra-overhead share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %). This is precisely what the Measure phase is for — the loop closes here.
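The batched-write idea in A and B can be sketched as a small client-side coalescer. Everything here is hypothetical: `StatusBatcher` and the `bulk_sync` callable stand in for a bulk "sync workplan statuses" operation that does not exist yet, and the real hub API may look quite different.

```python
from typing import Callable


class StatusBatcher:
    """Coalesce per-task status updates and send them in one bulk call.

    `bulk_sync` is a hypothetical callable standing in for a future
    bulk hub operation; no such endpoint is confirmed by this assessment.
    """

    def __init__(self, bulk_sync: Callable[[dict[str, str]], None]) -> None:
        self._bulk_sync = bulk_sync
        self._pending: dict[str, str] = {}

    def set_status(self, task_id: str, status: str) -> None:
        # A later update to the same task overwrites the earlier one,
        # so redundant status churn never reaches the hub.
        self._pending[task_id] = status

    def flush(self) -> int:
        """Send all pending updates in one call; return how many were sent."""
        if not self._pending:
            return 0
        batch = dict(self._pending)
        self._bulk_sync(batch)  # if this raises, pending updates are kept
        self._pending.clear()
        return len(batch)


# Usage: three status changes result in a single (fake) hub call.
sent: list[dict[str, str]] = []
batcher = StatusBatcher(sent.append)
batcher.set_status("T1", "in_progress")
batcher.set_status("T2", "done")
batcher.set_status("T1", "done")
flushed = batcher.flush()
```

The trade-off is freshness: the hub only sees state at flush points, so a session would still want to flush at natural checkpoints (end of a task, before handoff).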

Content-level root causes (error-body mining)

Added 2026-06-07 from AGENTIC-WP-0006 — build_digest now mines normalized error fingerprints into the durable digest, and sig_recurring_error clusters them. This is the "why" the tool-mix view above could not see.
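To make "normalized error fingerprints" concrete, here is a minimal sketch of the general technique. It is not the actual `build_digest` or `sig_recurring_error` code: the normalization rules (first line only, paths, hex values, and digit runs replaced by placeholders) and the function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def fingerprint(error_text: str) -> str:
    """Normalize an error body so variants of the same error collapse together."""
    stripped = error_text.strip()
    line = stripped.splitlines()[0] if stripped else ""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)   # memory addresses
    line = re.sub(r"(/[\w.\-]+)+", "<path>", line)    # absolute paths
    line = re.sub(r"\d+", "<n>", line)                # line numbers, counts
    return re.sub(r"\s+", " ", line).lower()


def recurring_errors(errors_by_session: dict[str, list[str]], min_sessions: int = 2):
    """Rank fingerprints by distinct sessions affected, then by occurrences."""
    sessions: dict[str, set[str]] = defaultdict(set)
    occurrences: dict[str, int] = defaultdict(int)
    for sid, errors in errors_by_session.items():
        for err in errors:
            fp = fingerprint(err)
            if not fp:
                continue
            sessions[fp].add(sid)
            occurrences[fp] += 1
    rows = [
        (fp, len(sids), occurrences[fp])
        for fp, sids in sessions.items()
        if len(sids) >= min_sessions
    ]
    return sorted(rows, key=lambda r: (-r[1], -r[2], r[0]))
```

Ranking by distinct sessions first, as in the table that follows, keeps one noisy session from dominating the list. Note that blanket digit replacement would also merge distinct numeric error codes, which a real implementation may want to preserve.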

26 of 27 real sessions hit at least one error. Top recurring error fingerprints across the corpus (by # sessions affected):

| Sessions | Occurrences | Flavors | Top sample |
| --- | --- | --- | --- |
| 12 | 32 | claude | `<tool_use_error>File has not been read yet. Read it first before writing to it.` |
| 6 | 13 | claude | `<tool_use_error>File has been modified since read …` |
| 4 | 9 | claude + grok | `make: *** [Makefile:227: fix-consistency] Error 1` |
| 3 | 21 | claude | `MCP error -32602: Invalid request parameters` |
| 3 | 6 | claude | `Error calling tool 'update_task_status': 'title'` |
| 2 | 6 | claude | `make: *** [Makefile:21: test] Error 1` |

Reading:

#1 — Edit/Write-before-Read (12/27 sessions, 8 repos). The single most common error is agents trying to edit a file they haven't read into context. This is a workflow friction, highly addressable: a Read-before-Edit reflex in the agent instructions / a skill, or a harness affordance. (Observed live: the author hit this exact error twice while writing this workplan.)
#2 — stale-read conflicts (6 sessions): "File has been modified since read" — same family, a re-read-before-edit discipline fixes both.
#3 — cross-flavor make fix-consistency failures (claude + grok, 3 repos): the consistency tooling itself fails across flavors — a shared infra issue worth a look on the state-hub side (cf. [STATE-WP-0058]).
State Hub MCP instability (-32602, update_task_status 'title') shows up in 3 sessions each — corroborates the plumbing-overhead story and the live MCP flakiness seen during this work (REST fallback used).
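The re-read-before-edit discipline behind #1 and #2 could be approximated by a small harness-side guard like the one below. This is a sketch under assumptions: `ReadGuard` is a hypothetical helper, and it uses file modification time as the staleness signal, which may not match how the real harness decides that a file "has been modified since read".

```python
import os


class ReadGuard:
    """Remember which files were read and their mtime at read time."""

    def __init__(self) -> None:
        self._seen: dict[str, int] = {}

    def note_read(self, path: str) -> None:
        """Record that `path` was just read into context."""
        self._seen[os.path.abspath(path)] = os.stat(path).st_mtime_ns

    def needs_read(self, path: str) -> bool:
        """True if the file was never read, or changed since the last read."""
        key = os.path.abspath(path)
        if key not in self._seen:
            return True
        return os.stat(path).st_mtime_ns != self._seen[key]
```

An agent wrapper would call `needs_read(path)` before every edit and issue a read first when it returns `True`, covering both the "not been read yet" and the "modified since read" cases with one check.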

Fingerprint noise — mostly handled. _is_failed now excludes successful hub JSON responses (top-level no-error payloads) and file-read snapshots (numbered cat -n source lines), which cut distinct fingerprints 444 → 269 (~40 %) without touching the top entries. Residual low-value items remain in the long tail (bare structural lines like {, linter "N errors" summaries); the top fingerprints are real. Note several entries (MCP error -32602, update_task_status 'title') reflect the State Hub MCP instability hit live during this work — genuine, if self-referential, friction.
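The two exclusions described above can be illustrated as follows. This is not the real `_is_failed`: the function names, the assumption that an error payload carries a truthy top-level `"error"` key, and the `heuristic_failed` fallback parameter are all hypothetical, chosen only to show the shape of the check.

```python
import json
import re

# `cat -n` style output: optional padding, a line number, then a tab.
_NUMBERED_LINE = re.compile(r"^\s*\d+\t")


def is_file_snapshot(text: str) -> bool:
    """True if every non-blank line looks like a numbered source line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(_NUMBERED_LINE.match(ln) for ln in lines)


def json_verdict(text: str):
    """True = error payload, False = success payload, None = not a JSON object."""
    try:
        payload = json.loads(text)
    except ValueError:
        return None
    if not isinstance(payload, dict):
        return None
    # Assumption: errors are signalled by a truthy top-level "error" key.
    return bool(payload.get("error"))


def looks_failed(text: str, heuristic_failed: bool) -> bool:
    """Apply the two exclusions before trusting a generic failure heuristic."""
    if is_file_snapshot(text):
        return False  # file-read snapshot, not an error
    verdict = json_verdict(text)
    if verdict is not None:
        return verdict  # JSON payload classifies itself
    return heuristic_failed
```

Letting the JSON verdict override the generic heuristic in both directions means a success payload that happens to contain the word "error" in a field value is no longer counted as a failure.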

What this assessment still can't see

Why a session was expensive at the content level: now addressed (error-body mining, above), modulo the fingerprint-noise caveat.
Repeated failed approaches (as opposed to surfaced errors) — e.g. an agent silently retrying a wrong strategy without an error — are still invisible.
Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor friction claims are Claude-weighted for now.

[STATE-WP-0058]: handed off to the state-hub repo worker
[detect/quality.py]: ../session_memory/detect/quality.py
