activity-core/workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md
tegwick 9be4ddbdb7 feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004
Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM,
agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate
postures, error-locality and quarantine-with-provenance principles, and the concrete
activity-core mechanisms.

Implement producer-agnostic guardrails in executor.py, applied uniformly on the
happy path and the recovery path via _partition_items: structural-type -> schema ->
structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap.
Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred
from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads
context["known_candidates"] via _allow_list_from_context; inert until a resolver
populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift.

New tests: happy-path count cap, oversized-string guardrail, allow-list rejection.
Full suite: 218 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:10:17 +02:00

---
id: ACTIVITY-WP-0016
type: workplan
title: "LLM Output Robustness & The Producer Trust Boundary"
domain: custodian
repo: activity-core
status: active
owner: codex
topic_slug: custodian
created: "2026-06-26"
updated: "2026-06-26"
state_hub_workstream_id: "4ef0d53b-1777-41ae-80c6-1b69fdb34726"
---
# ACTIVITY-WP-0016 — LLM Output Robustness & The Producer Trust Boundary
## Context
On 2026-06-26 the scheduled `daily-statehub-wsjf-triage` instruction fired on
time (`daily_triage` event 05:20:57Z) but its output **failed schema
validation**: `Expecting ',' delimiter: line 136 column 22 (char 5268)`. The
model emitted a long ranked WSJF recommendation list (reached rank 7+ with
nested `wsjf` objects) and the JSON broke deep in that list. Because the report
is a single monolithic JSON document, one malformed delimiter discarded the
**entire** run. This reset the three-clean-consecutive-scheduled-runs streak in
`ACTIVITY-WP-0006-T03` (06-24 ✅, 06-25 ✅, 06-26 ✗-validation) and is the
LLM-output-quality surface deferred from `ACTIVITY-WP-0010`.
The scheduling/runtime layer is healthy — this is purely an output-robustness
and boundary-design problem. Today's code (`src/activity_core/rules/executor.py`)
already: passes the output schema to llm-connect as a `json_schema` model param
(`_llm_run_config`), retries once, runs a fenced/`raw_decode` tolerant parser
(`_parse_json_output`), and preserves a bounded 4000-char preview on hard
failure (`_invalid_output_report`). None of that helps when error locality is
zero: the failure unit is the whole document, not the offending item.
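
The zero-locality problem can be reproduced in a few lines. The sketch below is not the repo's `_parse_json_output`; it is a hypothetical stand-in (`parse_tolerant`) that assumes a tolerant parser of the same general kind, one that skips leading prose and decodes the first JSON value with `raw_decode`. It shows that this tolerance covers noise around the document but not a delimiter slip inside it.

```python
import json


def parse_tolerant(text: str):
    """Decode the first JSON object in `text`, ignoring prose around it.

    Hypothetical helper for illustration only.
    """
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found")
    # raw_decode stops after the first complete value, so trailing text is fine.
    value, _end = json.JSONDecoder().raw_decode(text, start)
    return value


# Noise around a well-formed document is tolerated.
good = 'Here is the report:\n{"recommendations": [{"rank": 1}, {"rank": 2}]}\nDone.'
recovered = parse_tolerant(good)

# One missing comma between two otherwise well-formed items is not.
bad = '{"recommendations": [{"rank": 1} {"rank": 2}]}'
try:
    parse_tolerant(bad)
    failure = None
except json.JSONDecodeError as exc:
    failure = exc.msg  # the whole document is lost, including the valid rank-1 item
```

Both items in `bad` are individually valid, yet neither survives, which is the behaviour the rest of this workplan sets out to remove.
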
## Design Frame — The Producer Trust Boundary
This workplan is anchored to a deliberate architectural stance, not just a bug
fix. Capture it in an ADR (T04) so future work inherits it.
**Premise.** activity-core has a *trust boundary* where free-form producer
output meets strict deterministic consumers (JSON Schema validators, the task
emitter, classic compute pipelines). The producers are **LLMs and humans (and
agents acting for either)**. Both are *untrusted producers*: their output may be
- **erroneous** — hallucination, truncation (token-limit cutoff), drift,
type slips, typos; or
- **malicious** — prompt injection, crafted payloads, oversized/deeply-nested
structures aimed at exhausting or confusing the consumer.
The architecture should treat the boundary as an adversarial frontier and place
**guardrails + error-correction tooling there**, rather than letting raw
producer output flow into deterministic consumers and fail (or worse, partially
succeed) downstream.
**Two non-fail-fast postures.** When we do *not* want to hard-fail on a problem,
there are two sensible strategies — and they compose:
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the
happy path. Blast radius depends entirely on how granular the catch is. Good
when failures are rare and locally recoverable. Risk: failures surface late,
possibly after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
and normalize the output to a known-good shape *before* it enters the pipeline
— drop bad items, coerce types, bound sizes/depth, allow-list references — so
the consumer only ever sees clean input. Higher upfront cost, smaller blast
radius, no partial side effects. Good when failures are common or
consequences are high.
**Governing principles for this repo:**
1. **Push verification to the boundary; keep the interior strict.** Apply
posture **B** at the producer→consumer boundary (verify+mitigate structure);
keep posture **A** for residual exceptions inside the verified core. Never
relax the interior schema to absorb producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
cost one recommendation, not the whole report. Framing the payload so each
item is independently parseable is the single highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
provenance-tagged artifacts (index, error, raw snippet) so they can be
debugged or replayed — degraded-but-usable is distinct from total loss.
4. **Both human and agent input get the same rigor.** Guardrails are
producer-agnostic: the same size/depth/count caps, reference allow-lists, and
truncation detection apply whether the producer is an LLM, an agent, or a
human form submission.
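
Principles 2 and 3 can be sketched together as a per-item verifier that never raises and never drops silently. This is a minimal illustration, not the executor's code: `verify_items`, `Quarantined`, and `SNIPPET_LIMIT` are hypothetical names, and the required-field tuple is borrowed from the strict item contract described under T02.

```python
from dataclasses import dataclass

REQUIRED = ("rank", "candidate", "action", "why")  # item contract from T02
SNIPPET_LIMIT = 200  # hypothetical bound on the preserved raw snippet


@dataclass
class Quarantined:
    index: int   # position of the item in the producer's list
    error: str   # why it was rejected
    raw: str     # bounded snippet for debugging or replay


def verify_items(items):
    """Split items into (kept, quarantined); one bad item costs one item."""
    kept, quarantined = [], []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            error = f"expected object, got {type(item).__name__}"
        else:
            missing = [key for key in REQUIRED if key not in item]
            error = f"missing required: {', '.join(missing)}" if missing else ""
        if error:
            quarantined.append(Quarantined(index, error, repr(item)[:SNIPPET_LIMIT]))
        else:
            kept.append(item)
    return kept, quarantined


items = [
    {"rank": 1, "candidate": "alpha", "action": "advance", "why": "unblocks two teams"},
    {"candidate": "beta"},  # schema slip: missing rank, action, why
    "oops",                 # type slip: not an object at all
]
kept, quarantined = verify_items(items)
```

The consumer only ever sees `kept`; everything else is retained with enough provenance to explain the gap.
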
## Reproduce & Root-Cause The Failure
```task
id: ACTIVITY-WP-0016-T01
status: wait
priority: high
state_hub_task_id: "74fd16a5-4ea5-4dfe-8526-dfa27cf76138"
```
Recover the **full** raw llm-connect response for the 06-26 failure (the State
Hub event keeps only a 4000-char preview; the break is at char 5268) and
establish the precise cause.
Done when:
- the full raw response is pulled from the runtime llm-connect log / response
store and the exact offending token at char 5268 is identified;
- `finish_reason` is captured to confirm or rule out token-limit **truncation**
vs a structural mid-stream glitch;
- it is confirmed whether llm-connect actually **enforced** the `json_schema`
constrained-decoding hint or merely accepted it as advisory (this determines
whether the schema param is load-bearing);
- the failing payload is captured as a regression fixture under `tests/`.
2026-06-26 findings (local analysis on the workstation):
- **Mechanism confirmed structurally.** There are **16 active workstreams**
org-wide and the triage instruction emits ~one ranked recommendation per
candidate. The preserved preview holds 7 fully-formed recommendations; the JSON
break is at char 5268 (~rank 8-9). The unbounded one-per-workstream list is the
structural cause — more items = more tokens = higher odds of a mid-stream JSON
slip and/or truncation. This directly justifies T02's bounded top-N + per-item
framing.
- **Both attempts failed.** `executor._execute` retries once
(`src/activity_core/rules/executor.py:166-171`); the recorded error is from the
**retry** output, so the model produced invalid JSON twice — not a one-off.
- **activity-core discards the diagnostics needed to root-cause this.** Three
retention gaps mean the exact char-5268 token cannot be recovered from
activity-core data at all:
1. `LLMConnectClient.complete()` returns only `data["content"]`
(`llm_client.py:57-60`) — it drops `finish_reason`/`usage` from the
llm-connect HTTP response, so truncation-vs-structural cannot be
distinguished locally.
2. the report sink caps raw output at **4000 chars** (`_invalid_output_report`,
`executor.py:259`) — below the 5268 break.
3. the worker log caps the preview at **2000 chars** (`executor.py:175`).
- **Remaining (remote, operator-owned).** Confirming the exact offending token
and `finish_reason` requires llm-connect's producer-side logs on `railiance01`
— cluster access, outside this repo's SCOPE for direct action. Truncation is
the leading hypothesis given the 16-item input, but the mitigation (T02/T03) is
identical either way, so T01 does not block the build work.
- **Feeds T03/T04.** The retention gaps are themselves defects to fix: capture
`finish_reason`/`usage` and persist a larger bounded raw artifact on validation
failure so this class of failure is never un-debuggable again.
- Partial fixture saved:
`tests/fixtures/wp0016/daily_triage_2026-06-26_validation_failure.partial.json`
(the 4000-char preview + validation error; full payload pending the remote pull).
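
The first retention gap could be closed by returning a small result object instead of a bare string. The sketch below is an assumption-laden illustration: `Completion` and `completion_from_response` are hypothetical names, the response is assumed to be a dict carrying `content`, `finish_reason`, and `usage` keys, and treating `"length"` as the token-limit marker follows the common OpenAI-style convention, which llm-connect may or may not share.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Completion:
    content: str
    finish_reason: Optional[str] = None
    usage: Dict[str, Any] = field(default_factory=dict)

    @property
    def truncated(self) -> bool:
        # Assumption: "length" marks a token-limit cutoff, as in OpenAI-style APIs.
        return self.finish_reason == "length"


def completion_from_response(data: Dict[str, Any]) -> Completion:
    """Keep the diagnostics that a content-only return value would discard."""
    return Completion(
        content=data["content"],
        finish_reason=data.get("finish_reason"),
        usage=dict(data.get("usage") or {}),
    )


cut_off = completion_from_response(
    {"content": '{"recommendations": [', "finish_reason": "length",
     "usage": {"completion_tokens": 1200}}
)
legacy = completion_from_response({"content": "{}"})  # older response without diagnostics
```

With `truncated` available at the call site, truncation and a structural mid-stream slip can be told apart locally rather than from producer-side logs.
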
## Schema + Prompt Redesign For Error Locality
```task
id: ACTIVITY-WP-0016-T02
status: progress
priority: high
state_hub_task_id: "ae67ca8c-ee01-4a8d-9e8a-a0a36c999758"
```
Redesign the daily-triage report contract so a single malformed item can no
longer discard the whole report (principle #2).
Done when:
- the recommendation list is **bounded** (configurable top-N, default 5-7) in
both the prompt and the output schema — long lists are where the model drifts;
- the report uses a **per-item-framed** shape (JSON Lines / NDJSON — one
recommendation object per line — or an equivalent delimited per-item form)
behind a minimal stable envelope (`summary` + framed items), so each item is
an independent parse unit;
- the prompt explicitly states the contract, the per-item framing, the cap, and
an "if uncertain, emit fewer well-formed items rather than more" instruction;
- `max_tokens` is set with headroom for the bounded list so truncation cannot
occur at the expected size;
- the output schema file (`_load_output_schema` target) is updated to match.
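
As a rough illustration of the per-item-framed contract, the sketch below parses an NDJSON-style payload: a summary object on the first line, then one recommendation object per line. `parse_framed` is a hypothetical helper, not the shipped parser, and it assumes the producer really does emit one object per line.

```python
import json


def parse_framed(text: str):
    """Parse a summary line followed by one recommendation object per line."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty output")
    summary = json.loads(lines[0])  # a broken envelope is still a hard failure
    items, rejected = [], []
    for index, line in enumerate(lines[1:]):
        try:
            items.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Each line is its own parse unit, so the damage stays local.
            rejected.append({"index": index, "error": exc.msg, "raw": line[:200]})
    return summary, items, rejected


payload = (
    '{"summary": "3 candidates reviewed"}\n'
    '{"rank": 1, "candidate": "alpha"}\n'
    '{"rank": 2, "candidate": "beta"\n'  # missing closing brace
    '{"rank": 3, "candidate": "gamma"}\n'
)
summary, items, rejected = parse_framed(payload)
```

Here the rank-2 line is rejected while ranks 1 and 3 survive. Line splitting only works if the model honours the framing, so a pretty-printed response would need a different recovery strategy.
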
2026-06-26 progress (in-repo portion):
- **Strict, bounded schema written** — `schemas/daily-triage-report.json` went
from `recommendations.items: {type: object}` (accept-anything) to a strict
per-item contract: `required [rank, candidate, action, why]` with typed
`wsjf` sub-fields, plus `maxItems: 7`. The strict item shape is what lets the
T03 boundary parser validate each recommendation independently.
- **`maxItems` is a hint, not a hard reject** — the in-repo validator
(`_validate_schema_node`) only enforces `type`/`required`/`properties`/`items`
and ignores `maxItems`/`enum`. That is deliberate: a hard `maxItems` reject
would discard a whole 16-item report — the exact blast-radius bug WP-0016
removes. The bound is enforced via the prompt + the llm-connect `json_schema`
constraint hint + T03 mitigation (keep top-N by rank, quarantine extras).
- **DEPLOY COUPLING (important):** this schema file is consumed *both* as the
llm-connect hint *and* by the current whole-document validator. Tightening
per-item `required` fields makes the existing whole-doc validation hard-fail
**more** until T03 replaces it with per-item quarantine. Therefore the schema
change MUST ship together with T03 — do not deploy the strict schema to the
runtime bundle ahead of the T03 parser. Four executor/instruction tests that
asserted the old loose contract were updated to the strict contract; the
forwarded-schema test now reads the live file instead of hard-coding it.
- **Truncation hypothesis corroborated** — the instruction config carries
`max_tokens` on the order of ~1200 (per the wiring test fixture). 5268 chars ≈
~1300-1500 tokens, so a ~1200-token cap would truncate a 16-item list right at
the observed break. This strengthens T01's leading hypothesis and makes the
`max_tokens` headroom change below concrete.
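
The arithmetic behind the truncation hypothesis can be checked directly. The sketch assumes a rough rule of thumb of 3.5 to 4 characters per token for English-like JSON; the real ratio depends on the model's tokenizer. The per-item size and headroom factor in the second helper are illustrative assumptions, not measured values.

```python
import math


def estimate_tokens(chars: int, chars_per_token: float) -> int:
    """Rough token estimate from a character count (tokenizer-dependent)."""
    return math.ceil(chars / chars_per_token)


BREAK_CHAR = 5268    # position of the JSON break in the 06-26 output
CURRENT_CAP = 1200   # approximate max_tokens from the wiring test fixture

low = estimate_tokens(BREAK_CHAR, 4.0)    # optimistic ratio
high = estimate_tokens(BREAK_CHAR, 3.5)   # pessimistic ratio


def max_tokens_with_headroom(items: int, chars_per_item: int,
                             chars_per_token: float = 3.5,
                             headroom: float = 1.5) -> int:
    """Size max_tokens for a bounded list plus a safety margin (all inputs assumed)."""
    return math.ceil(items * chars_per_item / chars_per_token * headroom)


# Hypothetical sizing: 7 items at ~650 chars each.
suggested = max_tokens_with_headroom(items=7, chars_per_item=650)
```

Under both ratios the break position sits above the ~1200-token cap, which is consistent with a cutoff rather than a random slip. It is not proof without `finish_reason`.
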
**Bundle handoff (NOT in this repo — runtime-projected definition).** The triage
prompt and `max_tokens` live in the Railiance runtime bundle, not in repo files.
Apply there:
1. Instruct a **bounded top-N** (≤ 7) ranked recommendations, "if uncertain emit
fewer well-formed items rather than more."
2. Specify the **per-item framing** the T03 parser will consume (NDJSON: a
leading summary object, then one recommendation JSON object per line).
3. Raise **`max_tokens`** to give clear headroom for 7 framed items (eliminate
truncation at the expected size).
4. State the value vocabularies (`action`, `confidence`) the T04 guardrails will
check.
## Boundary Parser — Verify & Mitigate (Posture B)
```task
id: ACTIVITY-WP-0016-T03
status: done
priority: high
state_hub_task_id: "d65a6281-f1f9-4a9b-a835-da065411b709"
```
Implement item-granular parsing with a quarantine lane in
`src/activity_core/rules/executor.py`, applying posture **B** at the boundary
(principles #1-#3).
Done when:
- the parser splits the envelope from the framed items, then parses **each item
independently**; a malformed item is routed to a bounded `quarantined_items`
artifact (index + validation error + raw snippet), not raised;
- a run with some valid and some invalid items emits a report over the surviving
valid items with `output_validated=true`, plus `partial=true` and
`quarantined_count` / `quarantined_items` markers — degraded-but-usable is
reported distinctly from total loss;
- a best-effort **repair** pass (close unterminated brackets/quotes, recover the
valid prefix) is attempted per item before quarantining it;
- truncation detected in T01 is handled as its own signal (recover whole items
emitted before the cutoff rather than failing the document);
- the existing monolithic-document path remains as the fallback when framing is
absent (backward compatible with task-only instructions).
2026-06-26 progress (implemented in `src/activity_core/rules/executor.py`):
- **Resilient recovery wired into `_execute`.** When the whole-document parse +
one retry still fail, report instructions (those with `report_sinks`) now run
`_resilient_report` *before* the total-loss `_invalid_output_report`. If it
recovers ≥1 valid item it returns a partial report; otherwise it returns None
and the prior total-loss path is preserved unchanged.
- **Brace/quote-aware object scanner, not line-splitting.** The real 06-26 output
was pretty-printed (multi-line objects), so naive NDJSON line recovery would
have failed. `_extract_object_spans` walks the `recommendations` array
brace-depth- and string-aware, so it recovers each recommendation object
whether pretty-printed across many lines *or* emitted one-per-line (NDJSON).
The truncated trailing object is returned with `complete=False`.
- **Layered mitigation per item:** `json.loads` → on failure for a truncated
tail, a best-effort `_try_repair` (balance open string/brackets/braces) →
then `_partition_items` validates each recovered object against the T02 item
schema. Valid items survive; malformed or over-`maxItems` items are
quarantined with provenance (`index`, `error`, `raw` snippet, `reason`).
- **Report shape on degradation:** `output_validated=True` over the survivors,
`review_required=True`, `partial=True`, `quarantined_count`, and a bounded
`quarantined_items` list (cap 20). Degraded-but-usable is now reported
distinctly from total loss.
- **Verified against the real failure shape.** New tests reconstruct a
pretty-printed report with 7 valid recommendations + a truncated tail (the
06-26 shape) and a one-bad-item-among-valid case. The 7-item run now recovers
all 7 and quarantines the broken tail (previously: whole run discarded);
log line `instruction_output_recovered: kept=7, quarantined=1`. The bad-item
run keeps 2 and quarantines the rank-less one.
- **Deferred to T04 (clean scope boundary):** enforcing `maxItems` top-N on the
*happy* path (valid JSON, all items schema-valid, but > N items) — the resilient
path only runs on failure, so over-limit-on-success is a guardrail/count-cap
concern, which is exactly T04's remit.
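
The scanner-plus-repair idea above can be sketched as follows. This is a simplified illustration under stated assumptions, not the repo's `_extract_object_spans` or `_try_repair`: the function names are hypothetical, the scan is assumed to start at the opening `[` of the `recommendations` array, and the repair only balances an open string and open brackets or braces.

```python
import json


def extract_object_spans(text: str, start: int = 0):
    """Yield (raw, complete) for each top-level {...} found from `start`."""
    depth, begin = 0, None
    in_string = escaped = False
    for pos in range(start, len(text)):
        ch = text[pos]
        if in_string:  # braces inside strings must not affect depth
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch == "{":
            if depth == 0:
                begin = pos
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                yield text[begin:pos + 1], True
                begin = None
    if begin is not None:
        yield text[begin:], False  # truncated trailing object


def try_repair(raw: str):
    """Close an open string and open containers; return the value or None."""
    closers, in_string, escaped = [], False, False
    for ch in raw:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"':
            in_string = True
        elif ch in "{[":
            closers.append("}" if ch == "{" else "]")
        elif ch in "}]" and closers:
            closers.pop()
    candidate = raw + ('"' if in_string else "") + "".join(reversed(closers))
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None  # not repairable by balancing alone


text = '''{
  "summary": "triage",
  "recommendations": [
    {"rank": 1, "candidate": "alpha", "wsjf": {"score": 8}},
    {
      "rank": 2,
      "candidate": "be}ta"
    },
    {"rank": 3, "candidate": "gam'''
spans = list(extract_object_spans(text, text.index("[")))
repaired_tail = try_repair(spans[-1][0])
```

A repaired object is syntactically valid but may still be semantically incomplete, so it should go through the same per-item schema check as any other item before it is kept.
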
## Producer Guardrails + ADR-004
```task
id: ACTIVITY-WP-0016-T04
status: todo
priority: medium
state_hub_task_id: "f5c3af5b-9e28-42b0-9af5-4c99284e99b9"
```
Write the architecture decision record and add the producer-agnostic guardrails
(principle #4).
Done when:
- `docs/adr/adr-004-producer-trust-boundary.md` documents the trust boundary,
the untrusted-producer premise (erroneous **and** malicious; human and agent),
the A vs B taxonomy and where each applies, the error-locality principle, and
the quarantine-with-provenance rule;
- boundary guardrails are enforced at the consumer edge: max item **count**, max
string length, max nesting **depth**, and a **reference allow-list** (e.g. a
recommendation `candidate` / a task `target_repo` must resolve to a known
workstream/repo before it is acted on);
- guardrail rejections are quarantined with provenance, consistent with T03;
- SCOPE.md / INTENT.md are checked for drift and updated if the boundary stance
changes the documented contract.
2026-06-26 progress:
- **ADR-004 written** — `docs/adr/adr-004-producer-trust-boundary.md` documents
the untrusted-producer premise (erroneous + malicious; LLM/agent/human), the
A-vs-B posture taxonomy, the four governing principles, the concrete
activity-core mechanisms, a posture-by-layer table, consequences, and
alternatives considered. Accepted, scope cross-repo.
- **Producer guardrails implemented** in `executor.py`, applied uniformly on the
happy path *and* the recovery path via `_partition_items`: per-item order is
structural-type → schema → structural caps (`_MAX_DEPTH=8`,
`_MAX_STRING_LEN=4000`) → reference allow-list → count cap (`maxItems`). Each
quarantine carries a `reason` (`malformed`/`schema`/`guardrail`/`allow_list`/
`over_limit`).
- **Happy-path count cap closed** (the item deferred from T03): a syntactically
valid 9-item report now keeps 7 and quarantines 2 as `over_limit`, emitting a
`partial` report — without a retry.
- **Reference allow-list wired but inert.** `_allow_list_from_context` reads
`context["known_candidates"]`; when present, recommendations with an unknown
`candidate` are quarantined (`reason: allow_list`). Absent today → check is
inert; activation is a one-line context-resolver change. Keeps the guardrail
producer-agnostic (principle #4) and ready.
- **SCOPE.md updated** — instruction-executor bullet now names the quarantine
lane + guardrails; ADR-004 added to the Architecture Decisions list. No INTENT
drift: this hardens the existing output contract, it does not extend scope.
- New tests: happy-path count cap, oversized-string guardrail, allow-list
rejection (all green).
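
The guardrail ordering described above can be sketched as a single partition pass. This is an illustrative reconstruction, not the code in `executor.py`: the helper names are hypothetical, the constants mirror the values quoted in the text, the schema check is reduced to required fields, and the count cap assumes items arrive already ordered by rank.

```python
MAX_DEPTH = 8           # mirrors _MAX_DEPTH as quoted above
MAX_STRING_LEN = 4000   # mirrors _MAX_STRING_LEN as quoted above
REQUIRED = ("rank", "candidate", "action", "why")


def nesting_depth(value) -> int:
    """Container nesting depth; scalars count as 0."""
    if isinstance(value, dict):
        children = list(value.values())
    elif isinstance(value, list):
        children = value
    else:
        return 0
    return 1 + max((nesting_depth(child) for child in children), default=0)


def longest_string(value) -> int:
    """Length of the longest string value anywhere in the structure."""
    if isinstance(value, str):
        return len(value)
    if isinstance(value, dict):
        return max((longest_string(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((longest_string(v) for v in value), default=0)
    return 0


def check_item(item, allow_list):
    """Return a quarantine reason, or None when the item passes every check."""
    if not isinstance(item, dict):
        return "malformed"
    if any(key not in item for key in REQUIRED) or not isinstance(item["candidate"], str):
        return "schema"
    if nesting_depth(item) > MAX_DEPTH or longest_string(item) > MAX_STRING_LEN:
        return "guardrail"
    if allow_list is not None and item["candidate"] not in allow_list:
        return "allow_list"  # inert while no allow-list is supplied
    return None


def partition_items(items, max_items=7, allow_list=None):
    kept, quarantined = [], []
    for index, item in enumerate(items):
        reason = check_item(item, allow_list)
        if reason is None and len(kept) >= max_items:
            reason = "over_limit"  # count cap applies last, to valid items only
        if reason is None:
            kept.append(item)
        else:
            quarantined.append({"index": index, "reason": reason})
    return kept, quarantined


def make_item(rank, why="ok"):
    return {"rank": rank, "candidate": f"ws-{rank}", "action": "advance", "why": why}


nine = [make_item(rank) for rank in range(1, 10)]
kept, quarantined = partition_items(nine)
```

Applying the count cap after the other checks means an invalid item never consumes one of the top-N slots.
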
## Tests + Calibration Re-Entry
```task
id: ACTIVITY-WP-0016-T05
status: todo
priority: high
state_hub_task_id: "c881500b-5459-4620-81c0-b176971e989f"
```
Prove the new posture and hand back to the calibration gates.
Done when:
- regression tests cover: the captured 06-26 payload, a truncated-mid-list
payload, a one-bad-item-among-good payload (asserts quarantine + partial), an
oversized/over-deep payload (asserts guardrail rejection), and an
injection-shaped reference (asserts allow-list rejection);
- the full suite passes and the result is recorded here with the count;
- a daily-triage smoke against the live runtime shows a previously-failing
payload now **degrades gracefully** (valid items delivered, bad items
quarantined) instead of discarding the run;
- a progress note hands back to `ACTIVITY-WP-0010-T04` and `ACTIVITY-WP-0006-T03`
that the output-robustness blocker is cleared so the three-clean-run gate can
resume on its own.
## Relationships
- **Blocks / feeds:** `ACTIVITY-WP-0006-T03` (three clean scheduled runs) and
`ACTIVITY-WP-0010-T04` (collect three clean scheduled runs) — both stalled on
the same output-quality failure this workplan removes.
- **References:** `ACTIVITY-WP-0009` (scheduled-run trust gap).
- **Boundary discipline:** keeps activity-core inside its SCOPE — this hardens
the instruction-executor output contract; it does not move provider
credentials, cluster reconciliation, or task lifecycle into this repo.