session-memory: friction assessment + hardened catalog (WP-0005 T03)

Re-ran ingest->detect with the quality filter + infra signals over real local sessions (72 captured -> 27 real). Purged the false-positive 'abandoned' catalog entry and re-curated; catalog now carries tool_thrash/schema_thrash/infra_overhead patterns. docs/ASSESSMENT-infra-friction.md ranks the friction: ~17.6% of real tool activity is hub/task/schema plumbing (State Hub 10.3%, one session 231 calls; ToolSearch in 81% of sessions). Validates the CLI/MCP-skill hypothesis as top-2; recommends a State Hub skill (front-load schemas + batched writes) + bulk hub ops. Workplan finished; suite 88/88. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 11:18:27 +02:00
parent 21c714e286
commit 48618293b0
8 changed files with 338 additions and 112 deletions
--- a/docs/ASSESSMENT-infra-friction.md
+++ b/docs/ASSESSMENT-infra-friction.md
@@ -0,0 +1,100 @@
+# Infrastructure Friction Assessment
+
+*Generated 2026-06-07 from captured coding-session data (Helix Forge session
+memory), after the Detect-hardening pass ([AGENTIC-WP-0005]). First data-driven
+assessment of where our agentic coding sessions spend effort on plumbing rather
+than work.*
+
+## Method & data quality
+
+- **Corpus:** 72 sessions captured across Claude + Grok. A session-quality filter
+  ([detect/quality.py]) drops health-checks, smoke-tests, and interrupted runs
+  (mostly `llm-connect` *"Say hello in one word"*). **27 are real coding sessions.**
+- **Caveat:** the 41 % that were filtered out had been mislabeled `abandoned` by
+  the outcome heuristic and produced a *false-positive* "cross-flavor abandoned"
+  pattern in the first catalog — now purged. Treat any pre-hardening finding with
+  suspicion.
+- **Key framing:** all 27 real sessions ended in `success`. So the friction here
+  is **cost/efficiency, not failure** — sessions get there, but pay an avoidable
+  tax to do it.
+
+## The headline number
+
+Across the 27 real sessions, tool-call activity breaks down as:
+
+| Bucket | Share |
+|--------|------:|
+| shell (Bash / run_terminal) | 38.2 % |
+| edit | 30.2 % |
+| read | 12.9 % |
+| **State Hub MCP** | **10.3 %** |
+| **task-management plumbing** | **5.8 %** |
+| **schema-loading (`ToolSearch`)** | **1.5 %** |
+| other | 1.1 % |
+
+**~17.6 % of all tool calls in real coding sessions are coordination plumbing
+(hub + task + schema-loading), not touching the repo.** Per-session infra-overhead
+share: median **11.7 %**, p90 **26.1 %**, max **43.3 %** — it concentrates badly.
+
+## Ranked friction
+
+### 1. State Hub call volume — *highest cost, addressable*
+State Hub MCP is 10.3 % of all tool calls and dominates the worst sessions:
+
+| Repo (one session) | total calls | State Hub calls | overhead share |
+|--------------------|------:|------:|------:|
+| vergabe-teilnahme | 570 | **231** | 43 % |
+| activity-core | 488 | 98 | 23 % |
+| flex-auth | 236 | 35 (+27 task) | 29 % |
+| net-kingdom | 129 | 25 | 22 % |
+
+Root cause: many **fine-grained** calls — per-task status updates, per-event
+progress writes, repeated `get_domain_summary`. 231 hub calls in a single session
+is coordination overhead, not work.
+
+### 2. Schema-loading thrash (`ToolSearch`) — *low cost, near-zero-effort fix*
+**106 `ToolSearch` calls across 22 of 27 sessions (81 %).** The State Hub MCP
+tools are *deferred*, so nearly every session re-discovers and re-loads the same
+tool schemas before it can call them. This is pure overhead with no work value —
+and it is **exactly the CLI/MCP-interface friction hypothesized.**
+
+### 3. Task-management plumbing — 5.8 %
+`TaskUpdate` / `TaskCreate` / `todo_write` / `update_task_status`. Overlaps with
+(1); much of it is redundant status churn within a session.
+
+### 4. Tool thrash — *session-shape, watch only*
+11 sessions hammer a single tool 80–230× (usually Bash or Edit). Less an infra
+problem than a sign of missing higher-level tooling; low priority.
+
+### 5. Budget overrun — 3 sessions
+Token cost well above peers. Secondary; revisit once (1)–(2) are addressed.
+
+## Recommendations
+
+**The CLI/MCP-interface hypothesis is validated as a top-2 friction, not a minor
+issue.** Two high-ROI moves:
+
+- **A. A State Hub skill (highest ROI).** A skill (or a pre-loaded tool manifest)
+  that (i) **front-loads the common hub tool schemas** so agents stop
+  `ToolSearch`-ing for them — eliminates finding #2 almost entirely (81 % of
+  sessions) — and (ii) **teaches batched writes** (sync N task statuses in one
+  call, fewer progress events) to attack finding #1. Low effort, broad reach.
+- **B. Coarser hub operations.** Add bulk endpoints / a single "sync workplan
+  statuses" op so a session doesn't make 200+ individual hub calls. This is the
+  structural fix behind the skill's guidance.
+- **C. Measure the effect (Phase 4).** After A/B land, compare infra-overhead
+  share on subsequent sessions against this baseline (median 11.7 %, p90 26.1 %).
+  This is precisely what the Measure phase is for — the loop closes here.
+
+## What this assessment still can't see
+
+- **Why** a session was expensive at the *content* level (specific error
+  messages, repeated failed approaches) — the digest captures tool histograms and
+  prompt/response snippets but not error-body text. Mining tool-result bodies for
+  recurring failure messages is the natural next extension if root-cause depth is
+  needed.
+- Grok/Codex are thin in the corpus (4 Grok, 0 Codex sessions), so cross-flavor
+  friction claims are Claude-weighted for now.
+
+[AGENTIC-WP-0005]: ../workplans/AGENTIC-WP-0005-detect-hardening.md
+[detect/quality.py]: ../session_memory/detect/quality.py