agentic-resources/docs/ASSESSMENT-infra-friction.md
tegwick 1b6081cd88 session-memory: denoise error fingerprints (WP-0006 follow-up)
Tighten _is_failed: exclude successful hub JSON responses (top-level no-error
payloads) and file-read snapshots (numbered cat -n source lines) that were
polluting error_snippets. JSON verdict classifies error vs success payloads
directly. Cuts distinct fingerprints 444 -> 269 (~40%) over the real corpus with
the top errors unchanged. Assessment caveat updated. 5 new tests; suite 102/102.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 13:39:08 +02:00

# Infrastructure Friction Assessment
*Generated 2026-06-07 from captured coding-session data (Helix Forge session
memory), after the Detect-hardening pass ([AGENTIC-WP-0005]). First data-driven
assessment of where our agentic coding sessions spend effort on plumbing rather
than work.*
## Method & data quality
- **Corpus:** 72 sessions captured across Claude + Grok. A session-quality filter
([detect/quality.py]) drops health-checks, smoke-tests, and interrupted runs
(mostly `llm-connect` *"Say hello in one word"*). **27 are real coding sessions.**
- **Caveat:** the 41 % that were filtered out had been mislabeled `abandoned` by
the outcome heuristic and produced a *false-positive* "cross-flavor abandoned"
pattern in the first catalog — now purged. Treat any pre-hardening finding with
suspicion.
- **Key framing:** all 27 real sessions ended in `success`. So the friction here
is **cost/efficiency, not failure** — sessions get there, but pay an avoidable
tax to do it.
## The headline number
Across the 27 real sessions, tool-call activity breaks down as:

| Bucket | Share |
|--------|------:|
| shell (Bash / run_terminal) | 38.2 % |
| edit | 30.2 % |
| read | 12.9 % |
| **State Hub MCP** | **10.3 %** |
| **task-management plumbing** | **5.8 %** |
| **schema-loading (`ToolSearch`)** | **1.5 %** |
| other | 1.1 % |

**~17.6 % of all tool calls in real coding sessions are coordination plumbing
(hub + task + schema-loading), not touching the repo.** Per-session infra-overhead
share: median **11.7 %**, p90 **26.1 %**, max **43.3 %** — it concentrates badly.
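The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not the assessment's actual code: the bucket keys, the function names, and the choice of the "inclusive" percentile method are assumptions, and only the percentages are taken from the table above.

```python
from statistics import median, quantiles

# Share of tool calls per bucket, in percent, copied from the table above.
# The dictionary keys are illustrative names, not identifiers from the codebase.
BUCKET_SHARE = {
    "shell": 38.2,
    "edit": 30.2,
    "read": 12.9,
    "state_hub_mcp": 10.3,
    "task_management": 5.8,
    "schema_loading": 1.5,
    "other": 1.1,
}

# The three buckets the text groups together as coordination plumbing.
INFRA_BUCKETS = ("state_hub_mcp", "task_management", "schema_loading")


def infra_share(shares: dict[str, float]) -> float:
    """Sum the infra buckets' shares, in percent, rounded to one decimal."""
    return round(sum(shares[b] for b in INFRA_BUCKETS), 1)


def session_overhead(calls_by_bucket: dict[str, int]) -> float:
    """Infra-overhead share (percent) for one session, from raw call counts."""
    total = sum(calls_by_bucket.values())
    infra = sum(calls_by_bucket.get(b, 0) for b in INFRA_BUCKETS)
    return 100.0 * infra / total if total else 0.0


def summarize(per_session: list[float]) -> dict[str, float]:
    """Median, p90 and max over per-session overhead shares (needs >= 2 values)."""
    # quantiles(n=10) returns 9 cut points; the last one is the 90th percentile.
    # Assumption: the "inclusive" method; the report does not say which was used.
    p90 = quantiles(per_session, n=10, method="inclusive")[-1]
    return {"median": median(per_session), "p90": p90, "max": max(per_session)}


print(infra_share(BUCKET_SHARE))  # 17.6
```

Feeding `summarize` the 27 per-session values from `session_overhead` would give a baseline in the same shape as the median / p90 / max figures quoted above, so later runs could be compared like for like.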
## Ranked friction
### 1. State Hub call volume — *highest cost, addressable*
State Hub MCP is 10.3 % of all tool calls and dominates the worst sessions:

| Repo (one session) | total calls | State Hub calls | overhead share |
|--------------------|------:|------:|------:|
| vergabe-teilnahme | 570 | **231** | 43 % |
| activity-core | 488 | 98 | 23 % |
| flex-auth | 236 | 35 (+27 task) | 29 % |
| net-kingdom | 129 | 25 | 22 % |

Root cause: many **fine-grained** calls (per-task status updates, per-event
progress writes, repeated `get_domain_summary`). 231 hub calls in a single session
is coordination overhead, not work.
### 2. Schema-loading thrash (`ToolSearch`) — *low cost, near-zero-effort fix*
**106 `ToolSearch` calls across 22 of 27 sessions (81 %).** The State Hub MCP
tools are *deferred*, so nearly every session re-discovers and re-loads the same
tool schemas before it can call them. This is pure overhead with no work value —
and it is **exactly the CLI/MCP-interface friction hypothesized.**
### 3. Task-management plumbing — 5.8 %
`TaskUpdate` / `TaskCreate` / `todo_write` / `update_task_status`. Overlaps with
(1); much of it is redundant status churn within a session.
### 4. Tool thrash — *session-shape, watch only*
11 sessions hammer a single tool 80–230× (usually Bash or Edit). Less an infra
problem than a sign of missing higher-level tooling; low priority.
### 5. Budget overrun — 3 sessions
Token cost well above peers. Secondary; revisit once (1) and (2) are addressed.
## Recommendations
**The CLI/MCP-interface hypothesis is validated as a top-2 friction, not a minor
issue.** Two high-ROI moves, plus a measurement step:
- **A. A State Hub skill (highest ROI).** A skill (or a pre-loaded tool manifest)
that (i) **front-loads the common hub tool schemas** so agents stop
`ToolSearch`-ing for them — eliminates finding #2 almost entirely (81 % of
sessions) — and (ii) **teaches batched writes** (sync N task statuses in one
call, fewer progress events) to attack finding #1. Low effort, broad reach.
- **B. Coarser hub operations.** Add bulk endpoints / a single "sync workplan
statuses" op so a session doesn't make 200+ individual hub calls. This is the
structural fix behind the skill's guidance.
- **C. Measure the effect (Phase 4).** After A/B land, compare infra-overhead
share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %).
This is precisely what the Measure phase is for — the loop closes here.
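The batched-write idea in A and B can be sketched as below. Everything here is hypothetical: `StatusBatcher`, `send_bulk`, and `max_pending` are illustrative names, and `send_bulk` stands in for the bulk "sync workplan statuses" operation that B proposes, which does not exist yet.

```python
from typing import Callable


class StatusBatcher:
    """Coalesce per-task status updates and send them in one bulk call.

    Hypothetical sketch: `send_bulk` stands in for the bulk State Hub
    operation proposed in recommendation B; it is not an existing API.
    """

    def __init__(
        self,
        send_bulk: Callable[[dict[str, str]], None],
        max_pending: int = 50,
    ) -> None:
        self._send_bulk = send_bulk
        self._max_pending = max_pending
        self._pending: dict[str, str] = {}  # task_id -> latest status

    def update(self, task_id: str, status: str) -> None:
        # Last write wins: intermediate statuses for the same task are dropped.
        self._pending[task_id] = status
        if len(self._pending) >= self._max_pending:
            self.flush()

    def flush(self) -> None:
        # Send everything collected so far as a single call, then reset.
        if not self._pending:
            return
        batch, self._pending = self._pending, {}
        self._send_bulk(batch)


# Four fine-grained updates collapse into one bulk call.
sent: list[dict[str, str]] = []
batcher = StatusBatcher(sent.append)
for task_id, status in [
    ("T1", "in_progress"),
    ("T2", "in_progress"),
    ("T1", "done"),
    ("T2", "done"),
]:
    batcher.update(task_id, status)
batcher.flush()
print(len(sent), sent[0])  # 1 {'T1': 'done', 'T2': 'done'}
```

Last-write-wins trades away the intermediate states. If the hub needs the full status history, the batch would have to carry an ordered list of events instead of a dictionary.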
## Content-level root causes (error-body mining)
*Added 2026-06-07 from [AGENTIC-WP-0006] — `build_digest` now mines normalized
error fingerprints into the durable digest, and `sig_recurring_error` clusters
them. This is the "why" the tool-mix view above could not see.*
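A rough sketch of what mining and clustering error fingerprints can look like. The normalization rules, function names, and toy input here are assumptions for illustration; the actual rules in `build_digest` and `sig_recurring_error` are not shown in this document.

```python
import re
from collections import defaultdict

# Assumed normalization rules, applied in order to the first line of an error.
_NORMALIZERS = [
    (re.compile(r"(?:/[\w.\-]+)+"), "<path>"),  # slash-separated file paths
    (re.compile(r"\d+"), "<n>"),                # line numbers, counts, codes
    (re.compile(r"\s+"), " "),                  # runs of whitespace
]


def fingerprint(error_text: str) -> str:
    """Reduce an error body to a normalized one-line fingerprint."""
    stripped = error_text.strip()
    line = stripped.splitlines()[0] if stripped else ""
    for pattern, replacement in _NORMALIZERS:
        line = pattern.sub(replacement, line)
    return line.strip()


def rank_recurring(errors):
    """Rank fingerprints by sessions affected, then by occurrences.

    `errors` is an iterable of (session_id, flavor, error_text) tuples.
    Returns rows of (fingerprint, n_sessions, occurrences, flavors).
    """
    sessions = defaultdict(set)
    flavors = defaultdict(set)
    occurrences = defaultdict(int)
    for session_id, flavor, text in errors:
        fp = fingerprint(text)
        sessions[fp].add(session_id)
        flavors[fp].add(flavor)
        occurrences[fp] += 1
    rows = [
        (fp, len(sessions[fp]), occurrences[fp], sorted(flavors[fp]))
        for fp in sessions
    ]
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows


# Toy input: the session ids are made up, the error strings are illustrative.
toy_errors = [
    ("s1", "claude", "make: *** [Makefile:227: fix-consistency] Error 1"),
    ("s2", "grok", "make: *** [Makefile:227: fix-consistency] Error 1"),
    ("s2", "grok", "make: *** [Makefile:227: fix-consistency] Error 1"),
    ("s3", "claude", "MCP error -32602: Invalid request parameters"),
]
for row in rank_recurring(toy_errors):
    print(row)
```

Ranking by distinct sessions before raw occurrences keeps one noisy session from dominating the list.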
**26 of 27 real sessions hit at least one error.** Top recurring error
fingerprints across the corpus (by # sessions affected):

| # sessions | occ | flavors | top sample |
|-----------:|----:|---------|------------|
| **12** | 32 | claude | `<tool_use_error>File has not been read yet. Read it first before writing to it.` |
| **6** | 13 | claude | `<tool_use_error>File has been modified since read …` |
| **4** | 9 | **claude + grok** | `make: *** [Makefile:227: fix-consistency] Error 1` |
| 3 | 21 | claude | `MCP error -32602: Invalid request parameters` |
| 3 | 6 | claude | `Error calling tool 'update_task_status': 'title'` |
| 2 | 6 | claude | `make: *** [Makefile:21: test] Error 1` |

Reading:
- **#1 — Edit/Write-before-Read (12/27 sessions, 8 repos).** The single most
common error is agents trying to edit a file they haven't read into context.
This is a *workflow* friction, highly addressable: a Read-before-Edit reflex in
the agent instructions / a skill, or a harness affordance. (Observed live: the
author hit this exact error twice while writing this workplan.)
- **#2 — stale-read conflicts (6 sessions):** "File has been modified since read"
— same family, a re-read-before-edit discipline fixes both.
- **#3 — cross-flavor `make fix-consistency` failures (claude + grok, 3 repos):**
the consistency tooling itself fails across flavors — a shared infra issue worth
a look on the state-hub side (cf. [STATE-WP-0058]).
- **State Hub MCP instability** (`-32602`, `update_task_status 'title'`) shows up
in 3 sessions each — corroborates the plumbing-overhead story and the live MCP
flakiness seen during this work (REST fallback used).

**Fingerprint noise (mostly handled).** `_is_failed` now excludes successful hub
JSON responses (top-level no-error payloads) and file-read snapshots (numbered
`cat -n` source lines), which cut distinct fingerprints **444 → 269 (~40 %)**
without touching the top entries. Residual low-value items remain in the long tail
(bare structural lines like `{`, linter "N errors" summaries); the *top*
fingerprints are real. Note several entries (`MCP error -32602`,
`update_task_status 'title'`) reflect the State Hub MCP instability hit live during
this work — genuine, if self-referential, friction.
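The two exclusions described above could look roughly like this. It is a sketch under stated assumptions, not the real `_is_failed`: the function names, the 80 % threshold, the tab-separated `cat -n` pattern, and the "truthy top-level `error` key" convention are all guesses made for illustration.

```python
import json
import re

# Assumption: `cat -n` style lines are a right-aligned number, a tab, then text.
_NUMBERED_LINE = re.compile(r"^\s*\d+\t")


def looks_like_file_snapshot(text: str, threshold: float = 0.8) -> bool:
    """True when most non-blank lines look like numbered source lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    numbered = sum(1 for ln in lines if _NUMBERED_LINE.match(ln))
    return numbered / len(lines) >= threshold


def json_verdict(text: str):
    """True = error payload, False = success payload, None = not a JSON object."""
    try:
        payload = json.loads(text)
    except (ValueError, TypeError):
        return None
    if not isinstance(payload, dict):
        return None
    # Assumption: an error payload carries a truthy top-level "error" key.
    return bool(payload.get("error"))


def is_failed_sketch(text: str, flagged_by_heuristic: bool) -> bool:
    """Apply the two exclusions on top of an existing failure heuristic."""
    verdict = json_verdict(text)
    if verdict is not None:
        return verdict  # JSON payloads are classified directly
    if looks_like_file_snapshot(text):
        return False  # a file read that merely contains error-like words
    return flagged_by_heuristic


print(is_failed_sketch('{"status": "ok", "errors_seen": 3}', True))       # False
print(is_failed_sketch('{"error": "Invalid request parameters"}', False))  # True
print(is_failed_sketch("make: *** [Makefile:21: test] Error 1", True))     # True
```

Checking the JSON verdict first means a well-formed hub response is never judged by keyword matching, which is where the false positives described above would come from.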
## What this assessment still can't see
- ~~**Why** a session was expensive at the content level.~~ **Now addressed**
(error-body mining, above), modulo the fingerprint-noise caveat.
- Repeated *failed approaches* (as opposed to surfaced errors) — e.g. an agent
silently retrying a wrong strategy without an error — are still invisible.
- Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
friction claims are Claude-weighted for now.

[AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
[AGENTIC-WP-0006]: ../workplans/AGENTIC-WP-0006-error-body-mining.md
[STATE-WP-0058]: handed off to the state-hub repo worker
[detect/quality.py]: ../session_memory/detect/quality.py