# Infrastructure Friction Assessment

*Generated 2026-06-07 from captured coding-session data (Helix Forge session
memory), after the Detect-hardening pass ([AGENTIC-WP-0005]). First data-driven
assessment of where our agentic coding sessions spend effort on plumbing rather
than work.*
## Method & data quality

- **Corpus:** 72 sessions captured across Claude + Grok. A session-quality filter
  ([detect/quality.py]) drops health-checks, smoke-tests, and interrupted runs
  (mostly `llm-connect` *"Say hello in one word"*). **27 are real coding sessions.**
- **Caveat:** the 41 % that were filtered out had been mislabeled `abandoned` by
  the outcome heuristic and produced a *false-positive* "cross-flavor abandoned"
  pattern in the first catalog (now purged). Treat any pre-hardening finding with
  suspicion.
- **Key framing:** all 27 real sessions ended in `success`. So the friction here
  is **cost/efficiency, not failure**: sessions get there, but pay an avoidable
  tax to do it.
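The session-quality filter described above can be pictured roughly as follows. This is a minimal sketch, not the contents of `detect/quality.py`: the `Session` fields, the `HEALTH_CHECK` patterns, and the `min_tool_calls` threshold are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical health-check patterns; the real filter may use other signals.
HEALTH_CHECK = re.compile(
    r"say hello in one word|health[- ]?check|smoke[- ]?test", re.IGNORECASE
)


@dataclass
class Session:
    first_prompt: str          # assumed field: the opening user prompt
    tool_calls: int            # assumed field: number of tool calls made
    interrupted: bool = False  # assumed field: run was cut short


def is_real_session(s: Session, min_tool_calls: int = 1) -> bool:
    """Keep only sessions that look like actual coding work."""
    if s.interrupted:
        return False
    if HEALTH_CHECK.search(s.first_prompt):
        return False
    return s.tool_calls >= min_tool_calls


sessions = [
    Session("Say hello in one word", 0),
    Session("Fix the failing consistency check", 42),
    Session("Refactor the auth module", 17, interrupted=True),
]
real = [s for s in sessions if is_real_session(s)]
print(len(real), "of", len(sessions), "kept")  # prints "1 of 3 kept"
```

Filtering before outcome labelling matters here because a dropped health-check would otherwise be scored as an `abandoned` session.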
## The headline number

Across the 27 real sessions, tool-call activity breaks down as:

| Bucket | Share |
|--------|------:|
| shell (Bash / run_terminal) | 38.2 % |
| edit | 30.2 % |
| read | 12.9 % |
| **State Hub MCP** | **10.3 %** |
| **task-management plumbing** | **5.8 %** |
| **schema-loading (`ToolSearch`)** | **1.5 %** |
| other | 1.1 % |

**~17.6 % of all tool calls in real coding sessions are coordination plumbing
(hub + task + schema-loading), not touching the repo.** Per-session infra-overhead
share: median **11.7 %**, p90 **26.1 %**, max **43.3 %**, so the overhead
concentrates in a handful of heavy sessions.
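The arithmetic behind the headline figure can be sketched as below. The corpus-level shares are copied from the bucket table; the bucket keys, the `infra_share` helper, and the three example sessions are hypothetical and only illustrate how a per-session share and its median would be derived.

```python
from statistics import median

# Corpus-level shares (percent of all tool calls), copied from the bucket table.
BUCKET_SHARE = {
    "shell": 38.2, "edit": 30.2, "read": 12.9,
    "state_hub": 10.3, "task_mgmt": 5.8, "schema_loading": 1.5, "other": 1.1,
}
# Buckets counted as coordination plumbing.
INFRA_BUCKETS = ("state_hub", "task_mgmt", "schema_loading")

corpus_share = sum(BUCKET_SHARE[b] for b in INFRA_BUCKETS)
print(f"corpus infra share: {corpus_share:.1f} %")  # prints "corpus infra share: 17.6 %"


def infra_share(counts: dict) -> float:
    """Percent of one session's tool calls that fall in the infra buckets."""
    total = sum(counts.values())
    infra = sum(counts.get(b, 0) for b in INFRA_BUCKETS)
    return 100.0 * infra / total if total else 0.0


# Hypothetical per-session call counts, for illustration only.
example_sessions = {
    "session-a": {"shell": 60, "edit": 30, "state_hub": 10},
    "session-b": {"shell": 40, "edit": 20, "state_hub": 30, "task_mgmt": 10},
    "session-c": {"shell": 45, "read": 50, "schema_loading": 5},
}
shares = [infra_share(c) for c in example_sessions.values()]
print(f"median {median(shares):.1f} %, max {max(shares):.1f} %")
```

A p90 over a real corpus could be taken with `statistics.quantiles(shares, n=10)[-1]`; it is left out here because three example sessions are too few for the value to mean anything.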
## Ranked friction

### 1. State Hub call volume: *highest cost, addressable*
State Hub MCP is 10.3 % of all tool calls and dominates the worst sessions:

| Repo (one session) | total calls | State Hub calls | overhead share |
|--------------------|------:|------:|------:|
| vergabe-teilnahme | 570 | **231** | 43 % |
| activity-core | 488 | 98 | 23 % |
| flex-auth | 236 | 35 (+27 task) | 29 % |
| net-kingdom | 129 | 25 | 22 % |

Root cause: many **fine-grained** calls (per-task status updates, per-event
progress writes, repeated `get_domain_summary`). 231 hub calls in a single session
is coordination overhead, not work.
### 2. Schema-loading thrash (`ToolSearch`): *low cost, near-zero-effort fix*
**106 `ToolSearch` calls across 22 of 27 sessions (81 %).** The State Hub MCP
tools are *deferred*, so nearly every session re-discovers and re-loads the same
tool schemas before it can call them. This is pure overhead with no work value,
and it is **exactly the CLI/MCP-interface friction hypothesized.**
### 3. Task-management plumbing: 5.8 %
`TaskUpdate` / `TaskCreate` / `todo_write` / `update_task_status`. Overlaps with
(1); much of it is redundant status churn within a session.
### 4. Tool thrash: *session-shape, watch only*
11 sessions hammer a single tool 80–230× (usually Bash or Edit). Less an infra
problem than a sign of missing higher-level tooling; low priority.
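A tool-thrash check of this kind could be as small as the sketch below. The threshold of 80 is taken from the lower end of the range quoted above; the function name and the flat list-of-tool-names input are assumptions, not the detector's actual interface.

```python
from collections import Counter

THRASH_THRESHOLD = 80  # lower bound of the 80-230x range reported above


def thrashed_tool(tool_calls, threshold=THRASH_THRESHOLD):
    """Return (tool, count) for the most-used tool if it meets the threshold, else None."""
    if not tool_calls:
        return None
    tool, count = Counter(tool_calls).most_common(1)[0]
    return (tool, count) if count >= threshold else None


# Hypothetical session: one tool dominates.
calls = ["Bash"] * 120 + ["Edit"] * 30
print(thrashed_tool(calls))  # prints ('Bash', 120)
```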
### 5. Budget overrun: 3 sessions
Token cost well above peers. Secondary; revisit once (1)–(2) are addressed.
## Recommendations

**The CLI/MCP-interface hypothesis is validated as a top-2 friction, not a minor
issue.** Two high-ROI moves, plus a measurement step:

- **A. A State Hub skill (highest ROI).** A skill (or a pre-loaded tool manifest)
  that (i) **front-loads the common hub tool schemas** so agents stop
  `ToolSearch`-ing for them, which eliminates finding #2 almost entirely (81 % of
  sessions), and (ii) **teaches batched writes** (sync N task statuses in one
  call, fewer progress events) to attack finding #1. Low effort, broad reach.
- **B. Coarser hub operations.** Add bulk endpoints / a single "sync workplan
  statuses" op so a session doesn't make 200+ individual hub calls. This is the
  structural fix behind the skill's guidance.
- **C. Measure the effect (Phase 4).** After A/B land, compare infra-overhead
  share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %).
  This is precisely what the Measure phase is for; the loop closes here.
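To make the batched-write idea in A and B concrete, here is a minimal sketch of a client-side batcher. Everything in it is hypothetical: `StatusBatcher`, the `send_bulk` callable standing in for a not-yet-existing bulk hub operation, and the task IDs. It only shows the coalescing pattern, not the State Hub API.

```python
from typing import Callable, Dict


class StatusBatcher:
    """Coalesce per-task status updates and send them in one bulk call."""

    def __init__(self, send_bulk: Callable[[Dict[str, str]], None]):
        self._send_bulk = send_bulk  # hypothetical wrapper around a bulk hub op
        self._pending: Dict[str, str] = {}

    def update(self, task_id: str, status: str) -> None:
        # A later update for the same task overwrites the earlier one.
        self._pending[task_id] = status

    def flush(self) -> int:
        """Send pending updates as one call; return how many tasks were synced."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, {}
        self._send_bulk(batch)
        return len(batch)


hub_calls = []  # stands in for the hub: records each bulk payload it receives
batcher = StatusBatcher(hub_calls.append)
for task_id, status in [("T1", "in_progress"), ("T2", "in_progress"),
                        ("T1", "done"), ("T2", "done"), ("T3", "done")]:
    batcher.update(task_id, status)
synced = batcher.flush()
print(f"{synced} tasks synced in {len(hub_calls)} hub call(s)")  # 3 tasks, 1 call
```

Five fine-grained updates collapse into one call here; the trade-off is that intermediate states such as `in_progress` never reach the hub unless the batcher is flushed between them.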
## Content-level root causes (error-body mining)

*Added 2026-06-07 from [AGENTIC-WP-0006]: `build_digest` now mines normalized
error fingerprints into the durable digest, and `sig_recurring_error` clusters
them. This is the "why" the tool-mix view above could not see.*

**26 of 27 real sessions hit at least one error.** Top recurring error
fingerprints across the corpus (by # sessions affected):

| # sessions | occurrences | flavors | top sample |
|-----------:|----:|---------|------------|
| **12** | 32 | claude | `<tool_use_error>File has not been read yet. Read it first before writing to it.` |
| **6** | 13 | claude | `<tool_use_error>File has been modified since read …` |
| **4** | 9 | **claude + grok** | `make: *** [Makefile:227: fix-consistency] Error 1` |
| 3 | 21 | claude | `MCP error -32602: Invalid request parameters` |
| 3 | 6 | claude | `Error calling tool 'update_task_status': 'title'` |
| 2 | 6 | claude | `make: *** [Makefile:21: test] Error 1` |

Reading:

- **#1: Edit/Write-before-Read (12/27 sessions, 8 repos).** The single most
  common error is agents trying to edit a file they haven't read into context.
  This is a *workflow* friction, highly addressable: a Read-before-Edit reflex in
  the agent instructions / a skill, or a harness affordance. (Observed live: the
  author hit this exact error twice while writing this workplan.)
- **#2: stale-read conflicts (6 sessions).** "File has been modified since read"
  belongs to the same family; a re-read-before-edit discipline fixes both.
- **#3: cross-flavor `make fix-consistency` failures (claude + grok, 3 repos).**
  The consistency tooling itself fails across flavors, a shared infra issue worth
  a look on the state-hub side (cf. [STATE-WP-0058]).
- **State Hub MCP instability** (`-32602`, `update_task_status 'title'`) shows up
  in 3 sessions each. It corroborates the plumbing-overhead story and the live MCP
  flakiness seen during this work (REST fallback used).

**Fingerprint noise: mostly handled.** `_is_failed` now excludes successful hub
JSON responses (top-level no-error payloads) and file-read snapshots (numbered
`cat -n` source lines), which cut distinct fingerprints **444 → 269 (~40 %)**
without touching the top entries. Residual low-value items remain in the long tail
(bare structural lines like `{`, linter "N errors" summaries); the *top*
fingerprints are real. Note several entries (`MCP error -32602`,
`update_task_status 'title'`) reflect the State Hub MCP instability hit live during
this work: genuine, if self-referential, friction.
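The two exclusions described for `_is_failed` can be sketched as follows. This is an illustration of the idea under stated assumptions, not the project's implementation: `looks_failed`, `json_verdict`, `is_file_snapshot`, the tab-separated `cat -n` pattern, and the keyword fallback are all hypothetical.

```python
import json
import re

# A numbered source line as printed by `cat -n`: optional padding, digits, a tab.
CAT_N_LINE = re.compile(r"^\s*\d+\t")
# Fallback keyword heuristic for plain-text output (an assumption for this sketch).
ERROR_WORDS = re.compile(r"\b(error|failed|traceback)\b", re.IGNORECASE)


def is_file_snapshot(text: str) -> bool:
    """True if every non-blank line looks like a numbered file-read line."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    return bool(lines) and all(CAT_N_LINE.match(ln) for ln in lines)


def json_verdict(text: str):
    """True = error payload, False = success payload, None = not a JSON object."""
    try:
        payload = json.loads(text)
    except ValueError:
        return None
    if not isinstance(payload, dict):
        return None
    return bool(payload.get("error"))  # only a top-level, non-empty "error" counts


def looks_failed(text: str) -> bool:
    if is_file_snapshot(text):
        return False  # source code that merely mentions errors is not a failure
    verdict = json_verdict(text)
    if verdict is not None:
        return verdict  # JSON payloads are classified directly
    return bool(ERROR_WORDS.search(text))


print(looks_failed('{"status": "ok", "tasks": []}'))             # False
print(looks_failed('{"error": "Invalid request parameters"}'))   # True
print(looks_failed("     1\timport os\n     2\tprint('error')")) # False
print(looks_failed("make: *** [Makefile:21: test] Error 1"))     # True
```

Checking the snapshot and JSON cases before the keyword fallback is what keeps file contents and successful hub payloads out of the fingerprint set even when they contain the word "error".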
## What this assessment still can't see

- ~~**Why** a session was expensive at the content level.~~ **Now addressed**
  (error-body mining, above), modulo the fingerprint-noise caveat.
- Repeated *failed approaches* (as opposed to surfaced errors), e.g. an agent
  silently retrying a wrong strategy without an error, are still invisible.
- Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
  friction claims are Claude-weighted for now.

[AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
[AGENTIC-WP-0006]: ../workplans/AGENTIC-WP-0006-error-body-mining.md
[STATE-WP-0058]: handed off to the state-hub repo worker
[detect/quality.py]: ../session_memory/detect/quality.py