chore(ACTIVITY-WP-0016-T01): record root-cause findings + partial failure fixture

Local analysis of the 2026-06-26 daily-triage validation failure: the unbounded ~1-recommendation-per-workstream list (16 active workstreams; JSON break at char 5268, ~rank 8-9) is the structural cause; both the first attempt and the retry failed.

The exact offending token and finish_reason are unrecoverable from activity-core data: complete() drops finish_reason/usage, the report sink caps raw output at 4000 chars (< 5268), and the log preview at 2000. Confirming the exact token needs llm-connect producer-side logs on railiance01 (operator-owned); mitigation (T02/T03) is identical regardless.

Partial fixture captured.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 15:04:27 +02:00
parent 5eb33bd3bb
commit 0e9e18a59a
2 changed files with 39 additions and 0 deletions
--- a/workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md
+++ b/workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md
@@ -110,6 +110,40 @@ Done when:
  whether the schema param is load-bearing);
 - the failing payload is captured as a regression fixture under `tests/`.

+2026-06-26 findings (local analysis on the workstation):
+
+- **Mechanism confirmed structurally.** There are **16 active workstreams**
+  org-wide, and the triage instruction emits ~one ranked recommendation per
+  candidate. The preserved preview holds 7 fully formed recommendations; the JSON
+  break is at char 5268 (~rank 8–9). The unbounded one-per-workstream list is the
+  structural cause: more items mean more tokens, and more tokens mean higher odds
+  of a mid-stream JSON slip and/or truncation. This directly justifies T02's
+  bounded top-N + per-item framing.
+- **Both attempts failed.** `executor._execute` retries once
+  (`src/activity_core/rules/executor.py:166-171`); the recorded error is from the
+  **retry** output, so the model produced invalid JSON twice — not a one-off.
+- **activity-core discards the diagnostics needed to root-cause this.** Three
+  retention gaps mean the exact char-5268 token cannot be recovered from
+  activity-core data at all:
+  1. `LLMConnectClient.complete()` returns only `data["content"]`
+     (`llm_client.py:57-60`) — it drops `finish_reason`/`usage` from the
+     llm-connect HTTP response, so truncation-vs-structural cannot be
+     distinguished locally.
+  2. the report sink caps raw output at **4000 chars** (`_invalid_output_report`,
+     `executor.py:259`) — below the 5268 break.
+  3. the worker log caps the preview at **2000 chars** (`executor.py:175`).
+- **Remaining (remote, operator-owned).** Confirming the exact offending token
+  and `finish_reason` requires llm-connect's producer-side logs on `railiance01`.
+  That needs cluster access, which is outside this repo's SCOPE for direct action.
+  Truncation is the leading hypothesis given the 16-item input, but the mitigation
+  (T02/T03) is identical either way, so T01 does not block the build work.
+- **Feeds T03/T04.** The retention gaps are themselves defects to fix: capture
+  `finish_reason`/`usage` and persist a larger bounded raw artifact on validation
+  failure so this class of failure is never un-debuggable again.
+- Partial fixture saved:
+  `tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`
+  (the 4000-char preview + validation error; full payload pending the remote pull).
+
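The retention gaps listed in the findings above could be closed along the following lines. This is a minimal sketch, not the project's implementation: `CompletionResult`, `parse_completion`, and `bounded_artifact` are hypothetical names, the response shape (a JSON object with `content` plus optional `finish_reason` and `usage` keys) is inferred from the findings rather than from llm-connect's documented contract, and the 16,000-character default is an illustrative value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CompletionResult:
    """Hypothetical return type that keeps diagnostics alongside the content."""

    content: str
    finish_reason: Optional[str]
    usage: Optional[dict]


def parse_completion(data: dict) -> CompletionResult:
    """Keep finish_reason/usage instead of returning only data["content"].

    Assumption: the response body carries these optional keys; missing keys
    become None rather than raising.
    """
    return CompletionResult(
        content=data["content"],
        finish_reason=data.get("finish_reason"),
        usage=data.get("usage"),
    )


def bounded_artifact(raw: str, limit: int = 16_000) -> dict:
    """Build a size-bounded record of raw output for a validation failure.

    The record stores the full length so a reader can tell whether the
    sink itself cut the payload short. The default limit is illustrative.
    """
    return {
        "raw": raw[:limit],
        "raw_length": len(raw),
        "truncated_by_sink": len(raw) > limit,
    }
```

With a 4000-character cap, a payload that breaks at char 5268 would be recorded with `truncated_by_sink` set to true, which makes the gap visible in the artifact itself. Recording `finish_reason` next to it would let truncation be told apart from a structural JSON slip locally, assuming llm-connect reports one.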
 ## Schema + Prompt Redesign For Error Locality

 ```task