feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004

Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM, agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate postures, error-locality and quarantine-with-provenance principles, and the concrete activity-core mechanisms. Implement producer-agnostic guardrails in executor.py, applied uniformly on the happy path and the recovery path via _partition_items: structural-type -> schema -> structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap. Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads context["known_candidates"] via _allow_list_from_context; inert until a resolver populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift. New tests: happy-path count cap, oversized-string guardrail, allow-list rejection. Full suite: 218 passed, 1 skipped. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:10:17 +02:00
parent c5440e8429
commit 9be4ddbdb7
5 changed files with 373 additions and 12 deletions
--- a/docs/adr/adr-004-producer-trust-boundary.md
+++ b/docs/adr/adr-004-producer-trust-boundary.md
@@ -0,0 +1,156 @@
+---
+id: ACT-ADR-004
+type: architecture-decision-record
+title: "The Producer Trust Boundary — Guardrails and Error-Correction for Untrusted Output"
+status: accepted
+decided_by: Bernd Worsch
+date: "2026-06-26"
+scope: cross-repo
+affects:
+  - activity-core
+  - rules-core (future extraction)
+tags: ["architecture", "llm", "safety", "validation", "guardrails", "trust-boundary", "resilience"]
+---
+
+# ACT-ADR-004: The Producer Trust Boundary
+
+## Status
+
+Accepted.
+
+## Context
+
+On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called
+llm-connect successfully, and produced a long ranked recommendation list — but
+the JSON broke at char 5268 (~rank 8–9 of ~16), failing schema validation. Because
+the report was validated and consumed as a single monolithic JSON document, one
+malformed delimiter discarded the **entire** run, including the 7 perfectly good
+recommendations the model had already emitted. The scheduling and runtime layers
+were healthy; the failure was entirely at the seam where free-form model output
+meets a strict consumer.
+
+This is not a one-off bug, it is a recurring class. activity-core has a **trust
+boundary** wherever generative or human-authored output meets strict deterministic
+consumers: the JSON Schema validator, the task emitter, and any classic compute
+pipeline downstream. The producers on the other side of that boundary — **LLMs,
+agents, and humans** — are all *untrusted producers*. Their output may be:
+
+- **erroneous** — hallucination, truncation at a token limit, drift, type slips,
+  typos, a missing delimiter; or
+- **malicious** — prompt injection, crafted payloads, or oversized / deeply-nested
+  structures intended to exhaust or confuse the consumer.
+
+The pre-existing design treated producer output optimistically: parse the whole
+document, validate the whole document, and on any failure discard the whole
+document (preserving only a bounded diagnostic preview). That gives **zero error
+locality** — the blast radius of any single defect is the entire activation.
+
+## Decision
+
+Treat the producer→consumer seam as an explicit, adversarial **trust boundary**,
+and place guardrails plus error-correction tooling *at that boundary* rather than
+letting raw producer output flow into deterministic consumers.
+
+### Two non-fail-fast postures
+
+When hard-failing on a problem is undesirable, there are two sound strategies, and
+they **compose**:
+
+- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
+  as-is; on exception, catch → repair → retry → or quarantine. Cheap on the happy
+  path; blast radius depends entirely on how granular the catch is. Best when
+  failures are rare and locally recoverable. Risk: failures surface late, possibly
+  after partial side effects.
+- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
+  and normalize the output to a known-good shape *before* it enters the pipeline —
+  drop bad items, coerce types, bound sizes/depth, allow-list references — so the
+  consumer only ever sees clean input. Higher upfront cost, smaller blast radius,
+  no partial side effects. Best when failures are common or consequences are high.
+
+### Governing principles
+
+1. **Push verification to the boundary; keep the interior strict.** Apply posture
+   **B** at the producer→consumer boundary; keep posture **A** for residual
+   exceptions inside the verified core. Never relax the interior schema to absorb
+   producer sloppiness.
+2. **Make error locality match the unit of work.** One bad recommendation must
+   cost one recommendation, not the whole report. Structuring the payload so each
+   item is independently parseable and validatable is the highest-leverage change.
+3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
+   provenance-tagged artifacts (`index`, `error`, `raw` snippet, `reason`) so they
+   can be debugged or replayed. Degraded-but-usable is reported distinctly from
+   total loss.
+4. **Both human and agent input get the same rigor.** Guardrails are
+   producer-agnostic: the same count / length / depth caps and reference
+   allow-lists apply whether the producer is an LLM, an agent, or a human.
+
+### What this means concretely in activity-core
+
+Implemented in `src/activity_core/rules/executor.py`:
+
+- **Strict-structure-only schema.** The daily-triage output schema is strict on
+  per-item *structure* (`required [rank, candidate, action, why]`, typed `wsjf`)
+  and carries `maxItems` as a producer *hint* — never as a hard whole-document
+  reject, which would reproduce the very blast-radius failure (ACT-ADR-002 governs
+  the schema format; `schemas/daily-triage-report.json`).
+- **Item-granular recovery (posture B).** When whole-document parse + one retry
+  fail, `_resilient_report` recovers individually-parseable recommendation objects
+  via a brace/quote-aware scanner (`_extract_object_spans`) that works for both
+  pretty-printed and NDJSON output, attempts a best-effort `_try_repair` on a
+  truncated tail, validates each recovered object against the item schema, and
+  keeps the valid ones. Survivors are emitted with `output_validated=true`,
+  `partial=true`, and `review_required=true`.
+- **Producer guardrails (`_partition_items`, applied on both the recovery and the
+  happy path).** Per recommendation: structural type → schema → structural caps
+  (`_MAX_DEPTH`, `_MAX_STRING_LEN`) → reference allow-list → count cap (top-N by
+  `maxItems`). The first failing check quarantines the item with provenance and a
+  `reason` (`malformed` / `schema` / `guardrail` / `allow_list` / `over_limit`).
+- **Reference allow-list.** A recommendation whose `candidate` is not in the set of
+  known ids is quarantined. The set is sourced from resolved context
+  (`context["known_candidates"]`, via `_allow_list_from_context`); the check is
+  inert until a context resolver populates it, so the capability ships now and
+  activates with a one-line resolver change.
+
+### Where each posture sits
+
+| Layer | Posture | Mechanism |
+|-------|---------|-----------|
+| Schema / contract | B | strict per-item structure; `maxItems` as hint |
+| Whole-document parse | A | tolerant parse + single retry |
+| Failed parse | B | item-granular recovery + repair + quarantine |
+| Per-item screening | B | schema + depth/length caps + allow-list + count cap |
+| Emitted report | — | `partial` / `quarantined_*` provenance; never silent |
+
+## Consequences
+
+- A single malformed or oversized item no longer discards an entire activation;
+  the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid
+  recommendations and quarantine the broken tail.
+- Reports gain a `partial` / `quarantined_*` vocabulary; downstream report sinks
+  and reviewers can distinguish degraded-but-usable from total loss.
+- Guardrail thresholds (`_MAX_DEPTH`, `_MAX_STRING_LEN`, `maxItems`, the
+  allow-list) are policy knobs that will need tuning; they are intentionally
+  conservative defaults, not a finished calibration.
+- **Known retention gap (follow-on):** `LLMConnectClient.complete()` still returns
+  only `content`, discarding `finish_reason`/`usage`, and the total-loss artifact
+  caps raw output below realistic break points. Capturing those signals so
+  failures stay debuggable is tracked as a retention fix, not closed by this ADR.
+
+## Alternatives considered
+
+- **Hard-enforce `maxItems` in the validator.** Rejected: a hard reject of an
+  over-count document reproduces the whole-document blast radius. Mitigation (keep
+  top-N, quarantine the rest) is preferred.
+- **Relax the schema to accept anything.** Rejected: violates principle 1; pushes
+  malformed data into downstream consumers.
+- **Retry-until-valid only (pure posture A).** Rejected as the sole strategy: the
+  2026-06-26 failure recurred across both the initial attempt and the retry, so
+  retry alone does not bound the blast radius.
+
+## References
+
+- ACT-ADR-002 — markdown-as-definition format and output schema governance.
+- ACT-ADR-003 — Rule vs. Instruction model; the Instruction prompt-injection
+  surface this boundary complements on the output side.
+- `workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md` — the
+  implementing workplan.