Files
agentic-resources/docs/ASSESSMENT-infra-friction.md
tegwick 48618293b0 session-memory: friction assessment + hardened catalog (WP-0005 T03)
Re-ran ingest->detect with the quality filter + infra signals over real local
sessions (72 captured -> 27 real). Purged the false-positive 'abandoned' catalog
entry and re-curated; catalog now carries tool_thrash/schema_thrash/infra_overhead
patterns. docs/ASSESSMENT-infra-friction.md ranks the friction: ~17.6% of real
tool activity is hub/task/schema plumbing (State Hub 10.3%, one session 231 calls;
ToolSearch in 81% of sessions). Validates the CLI/MCP-skill hypothesis as top-2;
recommends a State Hub skill (front-load schemas + batched writes) + bulk hub ops.
Workplan finished; suite 88/88.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 11:18:27 +02:00

101 lines
4.6 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Infrastructure Friction Assessment
*Generated 2026-06-07 from captured coding-session data (Helix Forge session
memory), after the Detect-hardening pass ([AGENTIC-WP-0005]). First data-driven
assessment of where our agentic coding sessions spend effort on plumbing rather
than work.*
## Method & data quality
- **Corpus:** 72 sessions captured across Claude + Grok. A session-quality filter
([detect/quality.py]) drops health-checks, smoke-tests, and interrupted runs
(mostly `llm-connect` *"Say hello in one word"*). **27 are real coding sessions.**
- **Caveat:** the 41 % that were filtered out had been mislabeled `abandoned` by
the outcome heuristic and produced a *false-positive* "cross-flavor abandoned"
pattern in the first catalog — now purged. Treat any pre-hardening finding with
suspicion.
- **Key framing:** all 27 real sessions ended in `success`. So the friction here
is **cost/efficiency, not failure** — sessions get there, but pay an avoidable
tax to do it.
## The headline number
Across the 27 real sessions, tool-call activity breaks down as:
| Bucket | Share |
|--------|------:|
| shell (Bash / run_terminal) | 38.2 % |
| edit | 30.2 % |
| read | 12.9 % |
| **State Hub MCP** | **10.3 %** |
| **task-management plumbing** | **5.8 %** |
| **schema-loading (`ToolSearch`)** | **1.5 %** |
| other | 1.1 % |
**~17.6 % of all tool calls in real coding sessions are coordination plumbing
(hub + task + schema-loading), not touching the repo.** Per-session infra-overhead
share: median **11.7 %**, p90 **26.1 %**, max **43.3 %** — it concentrates badly.
## Ranked friction
### 1. State Hub call volume — *highest cost, addressable*
State Hub MCP is 10.3 % of all tool calls and dominates the worst sessions:
| Repo (one session) | total calls | State Hub calls | overhead share |
|--------------------|------:|------:|------:|
| vergabe-teilnahme | 570 | **231** | 43 % |
| activity-core | 488 | 98 | 23 % |
| flex-auth | 236 | 35 (+27 task) | 29 % |
| net-kingdom | 129 | 25 | 22 % |
Root cause: many **fine-grained** calls — per-task status updates, per-event
progress writes, repeated `get_domain_summary`. 231 hub calls in a single session
is coordination overhead, not work.
### 2. Schema-loading thrash (`ToolSearch`) — *low cost, near-zero-effort fix*
**106 `ToolSearch` calls across 22 of 27 sessions (81 %).** The State Hub MCP
tools are *deferred*, so nearly every session re-discovers and re-loads the same
tool schemas before it can call them. This is pure overhead with no work value —
and it is **exactly the CLI/MCP-interface friction hypothesized.**
### 3. Task-management plumbing — 5.8 %
`TaskUpdate` / `TaskCreate` / `todo_write` / `update_task_status`. Overlaps with
(1); much of it is redundant status churn within a session.
### 4. Tool thrash — *session-shape, watch only*
11 sessions hammer a single tool 80230× (usually Bash or Edit). Less an infra
problem than a sign of missing higher-level tooling; low priority.
### 5. Budget overrun — 3 sessions
Token cost well above peers. Secondary; revisit once (1)(2) are addressed.
## Recommendations
**The CLI/MCP-interface hypothesis is validated as a top-2 friction, not a minor
issue.** Two high-ROI moves:
- **A. A State Hub skill (highest ROI).** A skill (or a pre-loaded tool manifest)
that (i) **front-loads the common hub tool schemas** so agents stop
`ToolSearch`-ing for them — eliminates finding #2 almost entirely (81 % of
sessions) — and (ii) **teaches batched writes** (sync N task statuses in one
call, fewer progress events) to attack finding #1. Low effort, broad reach.
- **B. Coarser hub operations.** Add bulk endpoints / a single "sync workplan
statuses" op so a session doesn't make 200+ individual hub calls. This is the
structural fix behind the skill's guidance.
- **C. Measure the effect (Phase 4).** After A/B land, compare infra-overhead
share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %).
This is precisely what the Measure phase is for — the loop closes here.
## What this assessment still can't see
- **Why** a session was expensive at the *content* level (specific error
messages, repeated failed approaches) — the digest captures tool histograms and
prompt/response snippets but not error-body text. Mining tool-result bodies for
recurring failure messages is the natural next extension if root-cause depth is
needed.
- Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
friction claims are Claude-weighted for now.
[AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
[detect/quality.py]: ../session_memory/detect/quality.py