Re-ingested under schema v2 (populates error_snippets) and re-ran detect over 27 real sessions. Added a 'content-level root causes' section to docs/ASSESSMENT-infra-friction.md: top recurring error is Edit/Write-before-Read (12/27 sessions, 8 repos), then stale-read conflicts, a cross-flavor (claude+grok) make fix-consistency failure, and State Hub MCP instability. Documented a fingerprint-noise caveat. WP-0006 finished; suite 98/98. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Infrastructure Friction Assessment
Generated 2026-06-07 from captured coding-session data (Helix Forge session memory), after the Detect-hardening pass (AGENTIC-WP-0005). First data-driven assessment of where our agentic coding sessions spend effort on plumbing rather than work.
Method & data quality
- Corpus: 72 sessions captured across Claude + Grok. A session-quality filter
  ([detect/quality.py]) drops health-checks, smoke-tests, and interrupted runs
  (mostly llm-connect "Say hello in one word"). 27 are real coding sessions.
- Caveat: the 41 % that were filtered out had been mislabeled abandoned by the
  outcome heuristic and produced a false-positive "cross-flavor abandoned"
  pattern in the first catalog (now purged). Treat any pre-hardening finding
  with suspicion.
- Key framing: all 27 real sessions ended in success. So the friction here is
  cost/efficiency, not failure: sessions get there, but pay an avoidable tax to do it.
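The filtering step described above can be sketched roughly as follows. This is not the code in detect/quality.py, which is not shown here; the `Session` fields, the `PROBE_PROMPTS` set, and the `min_tool_calls` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical: prompts that mark a run as a probe rather than real work.
PROBE_PROMPTS = {"say hello in one word"}


@dataclass
class Session:
    # Hypothetical record shape; the real session schema is not shown.
    session_id: str
    first_prompt: str
    tool_calls: int
    interrupted: bool = False


def is_real_session(s: Session, min_tool_calls: int = 5) -> bool:
    """Return True if the session looks like real coding work."""
    if s.interrupted:
        return False
    if s.first_prompt.strip().lower() in PROBE_PROMPTS:
        return False
    # Assumed threshold: very short runs are treated as smoke tests.
    return s.tool_calls >= min_tool_calls


sessions = [
    Session("a", "Say hello in one word", 0),
    Session("b", "Fix the failing migration test", 142),
    Session("c", "Refactor the auth module", 3, interrupted=True),
]
real = [s.session_id for s in sessions if is_real_session(s)]
print(real)  # ['b']
```

Filtering before the outcome heuristic runs is what keeps probe sessions from being counted as abandoned work.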
The headline number
Across the 27 real sessions, tool-call activity breaks down as:
| Bucket | Share |
|---|---|
| shell (Bash / run_terminal) | 38.2 % |
| edit | 30.2 % |
| read | 12.9 % |
| State Hub MCP | 10.3 % |
| task-management plumbing | 5.8 % |
| schema-loading (ToolSearch) | 1.5 % |
| other | 1.1 % |
~17.6 % of all tool calls in real coding sessions are coordination plumbing (hub + task + schema-loading), not touching the repo. Per-session infra-overhead share: median 11.7 %, p90 26.1 %, max 43.3 % — it concentrates badly.
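The arithmetic behind the headline figure, as a minimal sketch. The corpus-level shares are copied from the table; the per-session helper and the `example` counts are illustrative assumptions, not captured data.

```python
# Tool-call shares (percent) copied from the table above.
bucket_share = {
    "shell": 38.2,
    "edit": 30.2,
    "read": 12.9,
    "State Hub MCP": 10.3,
    "task-management plumbing": 5.8,
    "schema-loading (ToolSearch)": 1.5,
    "other": 1.1,
}

# Buckets counted as coordination plumbing in this assessment.
PLUMBING = {
    "State Hub MCP",
    "task-management plumbing",
    "schema-loading (ToolSearch)",
}

corpus_overhead = sum(v for k, v in bucket_share.items() if k in PLUMBING)
print(f"{corpus_overhead:.1f} %")  # 17.6 %


def session_overhead(calls_by_bucket: dict) -> float:
    """Fraction of one session's tool calls that fall in plumbing buckets."""
    total = sum(calls_by_bucket.values())
    if total == 0:
        return 0.0
    plumbing = sum(n for b, n in calls_by_bucket.items() if b in PLUMBING)
    return plumbing / total


# Illustrative call counts for a single session (not real data).
example = {
    "shell": 50,
    "edit": 30,
    "read": 10,
    "State Hub MCP": 8,
    "task-management plumbing": 2,
}
print(session_overhead(example))  # 0.1
```

Applying `session_overhead` to each real session and taking the median and 90th percentile would yield the per-session distribution quoted above.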
Ranked friction
1. State Hub call volume — highest cost, addressable
State Hub MCP is 10.3 % of all tool calls and dominates the worst sessions:
| Repo (one session) | total calls | State Hub calls | overhead share |
|---|---|---|---|
| vergabe-teilnahme | 570 | 231 | 43 % |
| activity-core | 488 | 98 | 23 % |
| flex-auth | 236 | 35 (+27 task) | 29 % |
| net-kingdom | 129 | 25 | 22 % |
Root cause: many fine-grained calls — per-task status updates, per-event
progress writes, repeated get_domain_summary. 231 hub calls in a single session
is coordination overhead, not work.
2. Schema-loading thrash (ToolSearch) — low cost, near-zero-effort fix
106 ToolSearch calls across 22 of 27 sessions (81 %). The State Hub MCP
tools are deferred, so nearly every session re-discovers and re-loads the same
tool schemas before it can call them. This is pure overhead with no work value —
and it is exactly the CLI/MCP-interface friction hypothesized.
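One way to picture the fix is a manifest that persists schemas between sessions, so discovery happens once instead of once per session. The sketch below is an assumption-laden illustration: `discover_schema` stands in for a ToolSearch round trip, and the manifest file name and schema shape are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

discovery_calls: List[str] = []  # records every simulated discovery round trip


def discover_schema(tool_name: str) -> dict:
    """Hypothetical stand-in for one ToolSearch round trip."""
    discovery_calls.append(tool_name)
    return {"name": tool_name, "input_schema": {"type": "object"}}


def load_schemas(tool_names: List[str], manifest: Path) -> Dict[str, dict]:
    """Load schemas from the manifest, discovering only the missing ones."""
    cached = json.loads(manifest.read_text()) if manifest.exists() else {}
    for name in tool_names:
        if name not in cached:
            cached[name] = discover_schema(name)
    manifest.write_text(json.dumps(cached))
    return cached


HUB_TOOLS = ["get_domain_summary", "update_task_status"]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "hub_manifest.json"
    load_schemas(HUB_TOOLS, manifest_path)  # first session: discovers both tools
    first = len(discovery_calls)
    load_schemas(HUB_TOOLS, manifest_path)  # later session: served from the manifest
    second = len(discovery_calls) - first

print(first, second)  # 2 0
```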
3. Task-management plumbing — 5.8 %
TaskUpdate / TaskCreate / todo_write / update_task_status. Overlaps with
(1); much of it is redundant status churn within a session.
4. Tool thrash — session-shape, watch only
11 sessions hammer a single tool 80–230× (usually Bash or Edit). Less an infra problem than a sign of missing higher-level tooling; low priority.
5. Budget overrun — 3 sessions
Token cost well above peers. Secondary; revisit once (1)–(2) are addressed.
Recommendations
The CLI/MCP-interface hypothesis is validated as a top-2 friction, not a minor issue. Two high-ROI moves (A, B), plus a measurement step (C):
- A. A State Hub skill (highest ROI). A skill (or a pre-loaded tool manifest)
  that (i) front-loads the common hub tool schemas so agents stop
  ToolSearch-ing for them, which eliminates finding #2 almost entirely (81 % of sessions), and (ii) teaches batched writes (sync N task statuses in one call, fewer progress events) to attack finding #1. Low effort, broad reach.
- B. Coarser hub operations. Add bulk endpoints / a single "sync workplan statuses" op so a session doesn't make 200+ individual hub calls. This is the structural fix behind the skill's guidance.
- C. Measure the effect (Phase 4). After A/B land, compare infra-overhead share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %). This is precisely what the Measure phase is for — the loop closes here.
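The batched-write idea in A and B can be sketched as a small client-side buffer. No bulk endpoint exists in the source; `StatusBatcher` and the `send_bulk` callable are hypothetical names standing in for whatever "sync workplan statuses" operation is eventually added.

```python
from typing import Callable, Dict, List


class StatusBatcher:
    """Coalesce per-task status updates and send them in one bulk call."""

    def __init__(self, send_bulk: Callable[[Dict[str, str]], None]) -> None:
        # send_bulk is a hypothetical bulk operation on the hub.
        self._send_bulk = send_bulk
        self._pending: Dict[str, str] = {}

    def set_status(self, task_id: str, status: str) -> None:
        # A later update for the same task overwrites the earlier one.
        self._pending[task_id] = status

    def flush(self) -> int:
        """Send all pending statuses at once; return how many were sent."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, {}
        self._send_bulk(batch)
        return len(batch)


sent: List[Dict[str, str]] = []  # each element is one simulated hub call
batcher = StatusBatcher(sent.append)

updates = [
    ("T1", "in_progress"),
    ("T2", "in_progress"),
    ("T1", "done"),
    ("T2", "done"),
    ("T3", "done"),
]
for task_id, status in updates:
    batcher.set_status(task_id, status)

flushed = batcher.flush()
print(len(sent), flushed)  # 1 3
```

Five individual updates collapse into one call carrying three final statuses. The trade-off is that intermediate states such as in_progress are no longer visible to the hub unless the batcher is flushed more often.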
Content-level root causes (error-body mining)
Added 2026-06-07 from AGENTIC-WP-0006 — build_digest now mines normalized
error fingerprints into the durable digest, and sig_recurring_error clusters
them. This is the "why" the tool-mix view above could not see.
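A rough sketch of fingerprinting and clustering, for orientation only. The actual normalization rules in build_digest and the clustering logic in sig_recurring_error are not shown here; masking paths and digits, and the `min_sessions` cutoff, are assumptions.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Tuple


def fingerprint(error_text: str) -> str:
    """Reduce an error body to a stable key (assumed normalization rules)."""
    stripped = error_text.strip()
    first_line = stripped.splitlines()[0] if stripped else ""
    masked = re.sub(r"(/[\w.\-]+)+", "<path>", first_line)  # mask file paths
    masked = re.sub(r"\d+", "<n>", masked)  # mask numbers
    return masked.lower()


def recurring_errors(
    errors: Iterable[Tuple[str, str]], min_sessions: int = 2
) -> List[Tuple[str, int, int]]:
    """Return (fingerprint, sessions affected, occurrences), most widespread first."""
    sessions = defaultdict(set)
    occurrences = defaultdict(int)
    for session_id, text in errors:
        fp = fingerprint(text)
        sessions[fp].add(session_id)
        occurrences[fp] += 1
    rows = [
        (fp, len(ids), occurrences[fp])
        for fp, ids in sessions.items()
        if len(ids) >= min_sessions
    ]
    return sorted(rows, key=lambda r: (-r[1], -r[2]))


NOT_READ = "File has not been read yet. Read it first before writing to it."
errors = [
    ("s1", NOT_READ),
    ("s2", NOT_READ),
    ("s2", NOT_READ),
    ("s3", "make: *** [Makefile:227: fix-consistency] Error 1"),
]
rows = recurring_errors(errors)
print(rows[0][1], rows[0][2])  # 2 3
```

Counting distinct sessions separately from raw occurrences matters: one session retrying the same failing call many times inflates occurrences but not reach.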
26 of 27 real sessions hit at least one error. Top recurring error fingerprints across the corpus (by # sessions affected):
| # sessions | occ | flavors | top sample |
|---|---|---|---|
| 12 | 32 | claude | <tool_use_error>File has not been read yet. Read it first before writing to it. |
| 6 | 13 | claude | <tool_use_error>File has been modified since read … |
| 4 | 9 | claude + grok | make: *** [Makefile:227: fix-consistency] Error 1 |
| 3 | 21 | claude | MCP error -32602: Invalid request parameters |
| 3 | 6 | claude | Error calling tool 'update_task_status': 'title' |
| 2 | 6 | claude | make: *** [Makefile:21: test] Error 1 |
Reading:
- #1: Edit/Write-before-Read (12/27 sessions, 8 repos). The single most common error is agents trying to edit a file they haven't read into context. This is workflow friction and highly addressable: a Read-before-Edit reflex in the agent instructions / a skill, or a harness affordance. (Observed live: the author hit this exact error twice while writing this workplan.)
- #2: stale-read conflicts (6 sessions): "File has been modified since read". Same family; a re-read-before-edit discipline fixes both.
- #3: cross-flavor make fix-consistency failures (claude + grok, 3 repos): the consistency tooling itself fails across flavors, a shared infra issue worth a look on the state-hub side (cf. [STATE-WP-0058]).
- State Hub MCP instability (-32602, update_task_status 'title') shows up in 3 sessions each, which corroborates the plumbing-overhead story and the live MCP flakiness seen during this work (REST fallback used).
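The re-read-before-edit discipline behind the first two findings can be sketched as a small guard that a harness or agent wrapper might consult before editing. `ReadGuard` is a hypothetical helper, and comparing modification times is an assumed staleness check, not how any particular tool implements it.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict


class ReadGuard:
    """Track which files were read and whether they changed since."""

    def __init__(self) -> None:
        self._seen: Dict[str, int] = {}  # path -> mtime (ns) at last read

    def record_read(self, path: Path) -> None:
        self._seen[str(path)] = os.stat(path).st_mtime_ns

    def check_edit(self, path: Path) -> str:
        """Return 'unread', 'stale', or 'ok' for a planned edit."""
        key = str(path)
        if key not in self._seen:
            return "unread"  # would trigger "File has not been read yet"
        if os.stat(path).st_mtime_ns != self._seen[key]:
            return "stale"  # would trigger "File has been modified since read"
        return "ok"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "notes.txt"
    target.write_text("v1")
    guard = ReadGuard()

    before_read = guard.check_edit(target)
    guard.record_read(target)
    after_read = guard.check_edit(target)

    # Simulate an external modification by moving the mtime forward.
    stamp = os.stat(target).st_mtime_ns + 2_000_000_000
    os.utime(target, ns=(stamp, stamp))
    after_change = guard.check_edit(target)

print(before_read, after_read, after_change)  # unread ok stale
```

On either non-ok result the caller would re-read the file and record it again before retrying the edit, which covers both error families with one habit.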
Caveat — fingerprint noise: the fail-hint heuristic also catches non-failures
(successful hub JSON responses, source lines containing raise …Error, linter
"N errors" summaries). The top fingerprints above are real; a future refinement
should tighten _is_failed (e.g. skip valid-JSON success payloads and code-read
snapshots) before trusting the long tail.
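A possible shape for that refinement, as a sketch only. The real _is_failed is not shown here; `looks_like_failure`, the `FAIL_HINT` pattern, and the rule that a JSON object counts as a failure only when it has an "error" key are all assumptions.

```python
import json
import re

# Assumed broad fail hint, similar in spirit to the noisy heuristic described above.
FAIL_HINT = re.compile(r"error|failed", re.IGNORECASE)
# Lines that look like Python source raising an exception (code-read snapshots).
SOURCE_RAISE = re.compile(r"^\s*raise\s+\w+")
# Linter summaries reporting zero errors.
ZERO_ERRORS = re.compile(r"\b0 errors?\b", re.IGNORECASE)


def looks_like_failure(output: str) -> bool:
    """Hypothetical tightened check for whether tool output is a real failure."""
    text = output.strip()
    try:
        payload = json.loads(text)
    except ValueError:
        payload = None
    if isinstance(payload, dict):
        # Valid JSON object: a failure only if it carries an error field.
        return "error" in payload
    for line in text.splitlines():
        if SOURCE_RAISE.match(line) or ZERO_ERRORS.search(line):
            continue  # skip source snapshots and clean linter summaries
        if FAIL_HINT.search(line):
            return True
    return False


samples = [
    '{"status": "ok", "message": "no errors found"}',
    '{"error": "Invalid request parameters"}',
    "    raise ValueError('bad input')",
    "Found 0 errors in 12 files",
    "make: *** [Makefile:21: test] Error 1",
]
print([looks_like_failure(s) for s in samples])
# [False, True, False, False, True]
```

Linter summaries with a non-zero count still pass through as failures here. Whether those should count is a judgment call the source leaves open.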
What this assessment still can't see
- Why a session was expensive at the content level: now addressed (error-body mining, above), modulo the fingerprint-noise caveat.
- Repeated failed approaches (as opposed to surfaced errors), e.g. an agent silently retrying a wrong strategy without an error, are still invisible.
- Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor friction claims are Claude-weighted for now.
[STATE-WP-0058]: handed off to the state-hub repo worker
[detect/quality.py]: ../session_memory/detect/quality.py