activity-core/docs/adr/adr-004-producer-trust-boundary.md
tegwick 9be4ddbdb7 feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004
Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM,
agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate
postures, error-locality and quarantine-with-provenance principles, and the concrete
activity-core mechanisms.

Implement producer-agnostic guardrails in executor.py, applied uniformly on the
happy path and the recovery path via _partition_items: structural-type -> schema ->
structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap.
Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred
from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads
context["known_candidates"] via _allow_list_from_context; inert until a resolver
populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift.

New tests: happy-path count cap, oversized-string guardrail, allow-list rejection.
Full suite: 218 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:10:17 +02:00

---
id: ACT-ADR-004
type: architecture-decision-record
title: "The Producer Trust Boundary — Guardrails and Error-Correction for Untrusted Output"
status: accepted
decided_by: Bernd Worsch
date: "2026-06-26"
scope: cross-repo
affects:
- activity-core
- rules-core (future extraction)
tags: ["architecture", "llm", "safety", "validation", "guardrails", "trust-boundary", "resilience"]
---
# ACT-ADR-004: The Producer Trust Boundary
## Status
Accepted.
## Context
On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called
llm-connect successfully, and produced a long ranked recommendation list — but
the JSON broke at char 5268 (~rank 8-9 of ~16), failing schema validation. Because
the report was validated and consumed as a single monolithic JSON document, one
malformed delimiter discarded the **entire** run, including the 7 perfectly good
recommendations the model had already emitted. The scheduling and runtime layers
were healthy; the failure was entirely at the seam where free-form model output
meets a strict consumer.
This is not a one-off bug; it is a recurring class of failure. activity-core has a **trust
boundary** wherever generative or human-authored output meets strict deterministic
consumers: the JSON Schema validator, the task emitter, and any classic compute
pipeline downstream. The producers on the other side of that boundary — **LLMs,
agents, and humans** — are all *untrusted producers*. Their output may be:
- **erroneous** — hallucination, truncation at a token limit, drift, type slips,
typos, a missing delimiter; or
- **malicious** — prompt injection, crafted payloads, or oversized / deeply-nested
structures intended to exhaust or confuse the consumer.
The pre-existing design treated producer output optimistically: parse the whole
document, validate the whole document, and on any failure discard the whole
document (preserving only a bounded diagnostic preview). That gives **zero error
locality** — the blast radius of any single defect is the entire activation.
## Decision
Treat the producer→consumer seam as an explicit, adversarial **trust boundary**,
and place guardrails plus error-correction tooling *at that boundary* rather than
letting raw producer output flow into deterministic consumers.
### Two non-fail-fast postures
When hard-failing on a problem is undesirable, there are two sound strategies, and
they **compose**:
- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output
as-is; on exception, catch → repair → retry → or quarantine. Cheap on the happy
path; blast radius depends entirely on how granular the catch is. Best when
failures are rare and locally recoverable. Risk: failures surface late, possibly
after partial side effects.
- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp,
and normalize the output to a known-good shape *before* it enters the pipeline —
drop bad items, coerce types, bound sizes/depth, allow-list references — so the
consumer only ever sees clean input. Higher upfront cost, smaller blast radius,
no partial side effects. Best when failures are common or consequences are high.
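
The difference between the two postures can be illustrated with a minimal sketch. Everything here is hypothetical and not taken from `executor.py`: `consume_strict` stands in for any strict interior consumer, and the pre-check in `posture_b` is a simplified stand-in for real schema validation.

```python
def consume_strict(item):
    """Hypothetical strict consumer: raises on missing keys or wrong container type."""
    return f"{item['rank']}: {item['candidate']}"


def posture_a(raw_items):
    """Trust but handle: consume as-is, catch per item, quarantine on exception."""
    done, quarantined = [], []
    for index, item in enumerate(raw_items):
        try:
            done.append(consume_strict(item))
        except (KeyError, TypeError) as exc:
            quarantined.append({"index": index, "error": repr(exc)})
    return done, quarantined


def posture_b(raw_items):
    """Verify and mitigate: screen first so the consumer only sees clean input."""
    clean, quarantined = [], []
    for index, item in enumerate(raw_items):
        ok = (
            isinstance(item, dict)
            and isinstance(item.get("rank"), int)
            and isinstance(item.get("candidate"), str)
        )
        if ok:
            clean.append(item)
        else:
            quarantined.append({"index": index, "error": "failed pre-check"})
    return [consume_strict(item) for item in clean], quarantined


raw = [
    {"rank": 1, "candidate": "a"},
    {"rank": 2},                      # missing key
    "junk",                           # not an object at all
    {"rank": 3, "candidate": 5},      # type slip that raises nothing
]
done_a, quarantined_a = posture_a(raw)
done_b, quarantined_b = posture_b(raw)
```

In this sketch the type slip in the last item passes through posture A, because nothing raises, but is stopped by posture B. That is one reason the two compose rather than compete.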
### Governing principles
1. **Push verification to the boundary; keep the interior strict.** Apply posture
**B** at the producer→consumer boundary; keep posture **A** for residual
exceptions inside the verified core. Never relax the interior schema to absorb
producer sloppiness.
2. **Make error locality match the unit of work.** One bad recommendation must
cost one recommendation, not the whole report. Structuring the payload so each
item is independently parseable and validatable is the highest-leverage change.
3. **Quarantine, never silently drop.** Invalid units are preserved as bounded,
provenance-tagged artifacts (`index`, `error`, `raw` snippet, `reason`) so they
can be debugged or replayed. Degraded-but-usable is reported distinctly from
total loss.
4. **Human, agent, and LLM input all get the same rigor.** Guardrails are
producer-agnostic: the same count / length / depth caps and reference
allow-lists apply whether the producer is an LLM, an agent, or a human.
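
Principle 3 can be sketched as a small record type. This is an illustration under stated assumptions, not the structure used in activity-core: the class name `QuarantinedItem`, the `RAW_SNIPPET_LIMIT` bound, and the `ok` / `total_loss` status strings are hypothetical; only the field names (`index`, `error`, `raw`, `reason`) and the `partial` label come from this ADR.

```python
from dataclasses import asdict, dataclass

RAW_SNIPPET_LIMIT = 200  # assumed bound; the real cap is a policy knob


@dataclass(frozen=True)
class QuarantinedItem:
    index: int    # position of the unit in the producer's output
    reason: str   # coarse category, e.g. "malformed" or "schema"
    error: str    # human-readable detail for debugging
    raw: str      # bounded snippet of the offending text


def quarantine(index, reason, error, raw_text):
    """Preserve a bounded, provenance-tagged artifact instead of dropping the unit."""
    return QuarantinedItem(index, reason, error, raw_text[:RAW_SNIPPET_LIMIT])


def summarize(kept, quarantined):
    """Report degraded-but-usable distinctly from total loss."""
    if not kept:
        status = "total_loss"
    elif quarantined:
        status = "partial"
    else:
        status = "ok"
    return {
        "status": status,
        "kept": len(kept),
        "quarantined": [asdict(item) for item in quarantined],
    }
```

Bounding `raw` keeps an oversized or hostile payload from being copied wholesale into the report while still leaving enough context to debug or replay the unit.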
### What this means concretely in activity-core
Implemented in `src/activity_core/rules/executor.py`:
- **Strict-structure-only schema.** The daily-triage output schema is strict on
per-item *structure* (`required [rank, candidate, action, why]`, typed `wsjf`)
and carries `maxItems` as a producer *hint* — never as a hard whole-document
reject, which would reproduce the very blast-radius failure (ACT-ADR-002 governs
the schema format; `schemas/daily-triage-report.json`).
- **Item-granular recovery (posture B).** When the whole-document parse and its single retry
both fail, `_resilient_report` recovers individually parseable recommendation objects
via a brace/quote-aware scanner (`_extract_object_spans`) that works for both
pretty-printed and NDJSON output, attempts a best-effort `_try_repair` on a
truncated tail, validates each recovered object against the item schema, and
keeps the valid ones. Survivors are emitted with `output_validated=true`,
`partial=true`, and `review_required=true`.
- **Producer guardrails (`_partition_items`, applied on both the recovery and the
happy path).** Per recommendation: structural type → schema → structural caps
(`_MAX_DEPTH`, `_MAX_STRING_LEN`) → reference allow-list → count cap (top-N by
`maxItems`). The first failing check quarantines the item with provenance and a
`reason` (`malformed` / `schema` / `guardrail` / `allow_list` / `over_limit`).
- **Reference allow-list.** A recommendation whose `candidate` is not in the set of
known ids is quarantined. The set is sourced from resolved context
(`context["known_candidates"]`, via `_allow_list_from_context`); the check is
inert until a context resolver populates it, so the capability ships now and
activates with a one-line resolver change.
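
The check order described above can be sketched as follows. This is a simplified illustration, not the code in `executor.py`: the function names, the threshold values, and the key-presence test standing in for JSON Schema validation are all assumptions, as is treating an empty or missing allow-list as inert and applying the count cap in input order.

```python
MAX_DEPTH = 4          # illustrative value, not the project's default
MAX_STRING_LEN = 500   # illustrative value, not the project's default
REQUIRED = ("rank", "candidate", "action", "why")


def measure(value):
    """Return (container nesting depth, longest string length), iteratively."""
    deepest, longest, stack = 0, 0, [(value, 0)]
    while stack:
        node, level = stack.pop()
        if isinstance(node, str):
            longest = max(longest, len(node))
        elif isinstance(node, dict):
            deepest = max(deepest, level + 1)
            stack.extend((key, level + 1) for key in node)
            stack.extend((child, level + 1) for child in node.values())
        elif isinstance(node, list):
            deepest = max(deepest, level + 1)
            stack.extend((child, level + 1) for child in node)
    return deepest, longest


def first_failure(item, allow_list):
    """Return the reason for the first failing check, or None if the item passes."""
    if not isinstance(item, dict):
        return "malformed"
    # Stand-in for real schema validation: required keys plus one type check.
    if any(key not in item for key in REQUIRED) or not isinstance(item["candidate"], str):
        return "schema"
    depth, longest = measure(item)
    if depth > MAX_DEPTH or longest > MAX_STRING_LEN:
        return "guardrail"
    if allow_list and item["candidate"] not in allow_list:
        return "allow_list"
    return None


def partition_items(items, max_items, allow_list=None):
    """Split items into (kept, quarantined); every quarantine carries a reason."""
    kept, quarantined = [], []
    for index, item in enumerate(items):
        reason = first_failure(item, allow_list)
        if reason is None and len(kept) >= max_items:
            reason = "over_limit"
        if reason is None:
            kept.append(item)
        else:
            quarantined.append({"index": index, "reason": reason, "raw": repr(item)[:120]})
    return kept, quarantined
```

Walking the structure with an explicit stack rather than recursion means a deeply nested payload cannot exhaust the interpreter's call stack while it is being measured.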
### Where each posture sits
| Layer | Posture | Mechanism |
|-------|---------|-----------|
| Schema / contract | B | strict per-item structure; `maxItems` as hint |
| Whole-document parse | A | tolerant parse + single retry |
| Failed parse | B | item-granular recovery + repair + quarantine |
| Per-item screening | B | schema + depth/length caps + allow-list + count cap |
| Emitted report | — | `partial` / `quarantined_*` provenance; never silent |
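
For the "Failed parse" row, a brace/quote-aware scan might look like the sketch below. It is an assumption-laden illustration rather than the actual `_extract_object_spans`: it records every balanced `{...}` span at any depth, tracks JSON string and escape state so braces inside strings are ignored, and uses a `"rank"` key check as a stand-in for item-schema validation. Spans that fail to parse are simply skipped here, whereas the design in this ADR quarantines them.

```python
import json


def extract_object_spans(text):
    """Return sorted (start, end) spans of balanced {...} objects, at any depth."""
    spans = []
    open_positions = []   # stack of indices of unmatched "{"
    in_string = False
    escaped = False
    for pos, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False        # character after a backslash is literal
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            open_positions.append(pos)
        elif ch == "}" and open_positions:
            spans.append((open_positions.pop(), pos + 1))
    return sorted(spans)


def recover_objects(text):
    """Parse each balanced span and keep the ones that look like items."""
    recovered = []
    for start, end in extract_object_spans(text):
        try:
            value = json.loads(text[start:end])
        except json.JSONDecodeError:
            continue  # a fuller version would quarantine this span
        if isinstance(value, dict) and "rank" in value:
            recovered.append(value)
    return recovered


truncated = (
    '{"recommendations": [\n'
    '  {"rank": 1, "candidate": "T-1", "why": "has } brace"},\n'
    '  {"rank": 2, "candidate": "T-2", "why": "quote \\" inside"},\n'
    '  {"rank": 3, "candi'
)
ndjson = '{"rank": 1}\n{"rank": 2}\n{"rank": 3'
```

Because a truncated wrapper object never closes, only the complete inner objects are recorded, and the same scan handles one-object-per-line output. One limitation of this sketch is that an unbalanced quote in the middle of the text inverts the string state for everything after it, so later objects can be missed.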
## Consequences
- A single malformed or oversized item no longer discards an entire activation;
the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid
recommendations and quarantine the broken tail.
- Reports gain a `partial` / `quarantined_*` vocabulary; downstream report sinks
and reviewers can distinguish degraded-but-usable from total loss.
- Guardrail thresholds (`_MAX_DEPTH`, `_MAX_STRING_LEN`, `maxItems`, the
allow-list) are policy knobs that will need tuning; they are intentionally
conservative defaults, not a finished calibration.
- **Known retention gap (follow-on):** `LLMConnectClient.complete()` still returns
only `content`, discarding `finish_reason`/`usage`, and the total-loss artifact
caps raw output below realistic break points. Capturing those signals so
failures stay debuggable is tracked as a retention fix, not closed by this ADR.
## Alternatives considered
- **Hard-enforce `maxItems` in the validator.** Rejected: a hard reject of an
over-count document reproduces the whole-document blast radius. Mitigation (keep
top-N, quarantine the rest) is preferred.
- **Relax the schema to accept anything.** Rejected: violates principle 1; pushes
malformed data into downstream consumers.
- **Retry-until-valid only (pure posture A).** Rejected as the sole strategy: the
2026-06-26 failure recurred across both the initial attempt and the retry, so
retry alone does not bound the blast radius.
## References
- ACT-ADR-002 — markdown-as-definition format and output schema governance.
- ACT-ADR-003 — Rule vs. Instruction model; the Instruction prompt-injection
surface this boundary complements on the output side.
- `workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md` — the
implementing workplan.