activity-core/docs/adr/adr-004-producer-trust-boundary.md
tegwick 9be4ddbdb7 feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004
Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM,
agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate
postures, error-locality and quarantine-with-provenance principles, and the concrete
activity-core mechanisms.

Implement producer-agnostic guardrails in executor.py, applied uniformly on the
happy path and the recovery path via _partition_items: structural-type -> schema ->
structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap.
Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred
from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads
context["known_candidates"] via _allow_list_from_context; inert until a resolver
populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift.

New tests: happy-path count cap, oversized-string guardrail, allow-list rejection.
Full suite: 218 passed, 1 skipped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-26 18:10:17 +02:00


id: ACT-ADR-004
type: architecture-decision-record
title: "The Producer Trust Boundary: Guardrails and Error-Correction for Untrusted Output"
status: accepted
decided_by: Bernd Worsch
date: 2026-06-26
scope: cross-repo
affects: activity-core, rules-core (future extraction)
tags: architecture, llm, safety, validation, guardrails, trust-boundary, resilience

ACT-ADR-004: The Producer Trust Boundary

Status

Accepted.

Context

On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called llm-connect successfully, and produced a long ranked recommendation list, but the JSON broke at char 5268 (~rank 8-9 of ~16) and the report failed schema validation. Because the report was validated and consumed as a single monolithic JSON document, one malformed delimiter discarded the entire run, including the 7 perfectly good recommendations the model had already emitted. The scheduling and runtime layers were healthy; the failure was entirely at the seam where free-form model output meets a strict consumer.

This is not a one-off bug; it is a recurring class of failure. activity-core has a trust boundary wherever generative or human-authored output meets strict deterministic consumers: the JSON Schema validator, the task emitter, and any classic compute pipeline downstream. The producers on the other side of that boundary (LLMs, agents, and humans) are all untrusted. Their output may be:

  • erroneous — hallucination, truncation at a token limit, drift, type slips, typos, a missing delimiter; or
  • malicious — prompt injection, crafted payloads, or oversized / deeply-nested structures intended to exhaust or confuse the consumer.

The pre-existing design treated producer output optimistically: parse the whole document, validate the whole document, and on any failure discard the whole document (preserving only a bounded diagnostic preview). That gives zero error locality — the blast radius of any single defect is the entire activation.

Decision

Treat the producer→consumer seam as an explicit, adversarial trust boundary, and place guardrails plus error-correction tooling at that boundary rather than letting raw producer output flow into deterministic consumers.

Two non-fail-fast postures

When hard-failing on a problem is undesirable, there are two sound strategies, and they compose:

  • A) Trust but handle exceptions (optimistic / reactive). Consume the output as-is; on exception, catch → repair → retry, or quarantine if repair and retry fail. Cheap on the happy path; blast radius depends entirely on how granular the catch is. Best when failures are rare and locally recoverable. Risk: failures surface late, possibly after partial side effects.
  • B) Verify and mitigate (defensive / proactive). Validate, sanitize, clamp, and normalize the output to a known-good shape before it enters the pipeline — drop bad items, coerce types, bound sizes/depth, allow-list references — so the consumer only ever sees clean input. Higher upfront cost, smaller blast radius, no partial side effects. Best when failures are common or consequences are high.
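How the two postures compose can be shown in a minimal sketch. Nothing here is taken from executor.py: the function names, the `why` field check, and the `max_len` default are assumptions chosen for illustration only.

```python
import json
from typing import Any


def verify_and_mitigate(
    items: list[Any], max_len: int = 200
) -> tuple[list[dict], list[Any]]:
    """Posture B: split items into known-good ones and rejects before use."""
    clean: list[dict] = []
    rejected: list[Any] = []
    for item in items:
        why = item.get("why") if isinstance(item, dict) else None
        if isinstance(why, str) and len(why) <= max_len:
            clean.append(item)
        else:
            rejected.append(item)  # kept for inspection, not silently dropped
    return clean, rejected


def trust_but_handle(raw: str) -> tuple[list[dict], list[Any]]:
    """Posture A around the parse, posture B on whatever parsed."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        # Coarse catch: one bad delimiter costs the whole document.
        # A real handler would retry or quarantine the raw text here.
        return [], []
    return verify_and_mitigate(doc if isinstance(doc, list) else [doc])
```

The coarse `except` is what makes posture A's blast radius the whole document, while the per-item loop in posture B loses only the offending items.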

Governing principles

  1. Push verification to the boundary; keep the interior strict. Apply posture B at the producer→consumer boundary; keep posture A for residual exceptions inside the verified core. Never relax the interior schema to absorb producer sloppiness.
  2. Make error locality match the unit of work. One bad recommendation must cost one recommendation, not the whole report. Structuring the payload so each item is independently parseable and validatable is the highest-leverage change.
  3. Quarantine, never silently drop. Invalid units are preserved as bounded, provenance-tagged artifacts (index, error, raw snippet, reason) so they can be debugged or replayed. Degraded-but-usable is reported distinctly from total loss.
  4. Both human and agent input get the same rigor. Guardrails are producer-agnostic: the same count / length / depth caps and reference allow-lists apply whether the producer is an LLM, an agent, or a human.
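Principle 3 can be made concrete with a small record type. The following is a sketch under stated assumptions: the class name, the 200-character bound, and the three status strings are hypothetical and are not taken from the implementation; only the four provenance fields (index, error, raw snippet, reason) come from the text above.

```python
from dataclasses import dataclass

_SNIPPET_LIMIT = 200  # hypothetical bound on preserved text


@dataclass(frozen=True)
class QuarantinedItem:
    index: int        # position of the unit in the producer output
    reason: str       # which check rejected it
    error: str        # bounded error message
    raw_snippet: str  # bounded copy of the offending text


def quarantine(index: int, raw: str, reason: str, error: str) -> QuarantinedItem:
    """Preserve a rejected unit as a bounded, provenance-tagged artifact."""
    return QuarantinedItem(
        index=index,
        reason=reason,
        error=error[:_SNIPPET_LIMIT],
        raw_snippet=raw[:_SNIPPET_LIMIT],
    )


def summarize(kept: int, quarantined: int) -> str:
    """Report degraded-but-usable distinctly from total loss."""
    if quarantined == 0:
        return "complete"
    if kept == 0:
        return "total_loss"
    return "partial"
```

Bounding both the snippet and the error message keeps a hostile or oversized item from inflating the quarantine artifact itself.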

What this means concretely in activity-core

Implemented in src/activity_core/rules/executor.py:

  • Strict-structure-only schema. The daily-triage output schema is strict on per-item structure (required [rank, candidate, action, why], typed wsjf) and carries maxItems as a producer hint — never as a hard whole-document reject, which would reproduce the very blast-radius failure (ACT-ADR-002 governs the schema format; schemas/daily-triage-report.json).
  • Item-granular recovery (posture B). When the whole-document parse and its single retry both fail, _resilient_report recovers individually parseable recommendation objects via a brace/quote-aware scanner (_extract_object_spans) that works for both pretty-printed and NDJSON output, attempts a best-effort _try_repair on a truncated tail, validates each recovered object against the item schema, and keeps the valid ones. Survivors are emitted with output_validated=true, partial=true, and review_required=true.
  • Producer guardrails (_partition_items, applied on both the recovery and the happy path). Per recommendation: structural type → schema → structural caps (_MAX_DEPTH, _MAX_STRING_LEN) → reference allow-list → count cap (top-N by maxItems). The first failing check quarantines the item with provenance and a reason (malformed / schema / guardrail / allow_list / over_limit).
  • Reference allow-list. A recommendation whose candidate is not in the set of known ids is quarantined. The set is sourced from resolved context (context["known_candidates"], via _allow_list_from_context); the check is inert until a context resolver populates it, so the capability ships now and activates with a one-line resolver change.
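The check order described for _partition_items can be sketched as follows. This is an illustration, not the code in executor.py: the cap values are invented (the ADR names _MAX_DEPTH and _MAX_STRING_LEN but does not state them), the schema step is reduced to a required-key check, and "top-N" is assumed to mean the first N valid items in document order.

```python
from typing import Any, Optional

MAX_DEPTH = 4         # hypothetical value
MAX_STRING_LEN = 500  # hypothetical value
REQUIRED = ("rank", "candidate", "action", "why")


def exceeds_caps(item: Any) -> bool:
    """True if a string is too long or nesting is too deep (iterative walk)."""
    stack = [(item, 1)]
    while stack:
        value, level = stack.pop()
        if isinstance(value, str):
            if len(value) > MAX_STRING_LEN:
                return True
        elif isinstance(value, (dict, list)):
            if level > MAX_DEPTH:
                return True
            children = value.values() if isinstance(value, dict) else value
            stack.extend((child, level + 1) for child in children)
    return False


def first_failure(item: Any, allow_list: Optional[set]) -> Optional[str]:
    """Return the reason for the first failing check, or None if clean."""
    if not isinstance(item, dict):
        return "malformed"
    if any(key not in item for key in REQUIRED):
        return "schema"
    if not isinstance(item["candidate"], str):
        return "schema"
    if exceeds_caps(item):
        return "guardrail"
    # An empty or missing allow-list leaves the check inert.
    if allow_list and item["candidate"] not in allow_list:
        return "allow_list"
    return None


def partition_items(
    items: list, max_items: int, allow_list: Optional[set] = None
) -> tuple[list, list]:
    """Split items into kept and quarantined, recording index and reason."""
    kept, quarantined = [], []
    for index, item in enumerate(items):
        reason = first_failure(item, allow_list)
        if reason is None and len(kept) >= max_items:
            reason = "over_limit"
        if reason is None:
            kept.append(item)
        else:
            quarantined.append({"index": index, "reason": reason})
    return kept, quarantined
```

With `max_items=7`, nine valid items yield seven kept and two quarantined as `over_limit`, matching the happy-path count cap behavior described in the commit message.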

Where each posture sits

| Layer | Posture | Mechanism |
|---|---|---|
| Schema / contract | B | strict per-item structure; maxItems as hint |
| Whole-document parse | A | tolerant parse + single retry |
| Failed parse | B | item-granular recovery + repair + quarantine |
| Per-item screening | B | schema + depth/length caps + allow-list + count cap |
| Emitted report | - | partial / quarantined_* provenance; never silent |
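The "Failed parse" layer depends on finding individually parseable objects inside a document that does not parse as a whole. A minimal sketch of a brace/quote-aware scan is given here; it borrows only the idea behind _extract_object_spans, and the function names and the outermost-first recovery strategy are assumptions rather than the actual implementation.

```python
import json
from typing import Any


def extract_object_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) for every balanced {...}, ignoring braces in strings."""
    spans: list[tuple[int, int]] = []
    open_starts: list[int] = []
    in_string = False
    escaped = False
    for pos, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            open_starts.append(pos)
        elif ch == "}" and open_starts:
            spans.append((open_starts.pop(), pos + 1))
    return spans


def recover_objects(text: str) -> list[Any]:
    """Parse the outermost spans that are valid JSON; descend when one fails."""
    recovered: list[Any] = []
    covered_until = 0
    for start, end in sorted(extract_object_spans(text)):
        if start < covered_until:
            continue  # nested inside an object that already parsed
        try:
            value = json.loads(text[start:end])
        except json.JSONDecodeError:
            continue  # leave covered_until alone so children are tried
        recovered.append(value)
        covered_until = end
    return recovered
```

Because the scan only tracks braces and string state, it treats pretty-printed and newline-delimited output the same way. Objects recovered this way are still unvalidated and would need per-item screening before use.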

Consequences

  • A single malformed or oversized item no longer discards an entire activation; the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid recommendations and quarantine the broken tail.
  • Reports gain a partial / quarantined_* vocabulary; downstream report sinks and reviewers can distinguish degraded-but-usable from total loss.
  • Guardrail thresholds (_MAX_DEPTH, _MAX_STRING_LEN, maxItems, the allow-list) are policy knobs that will need tuning; they are intentionally conservative defaults, not a finished calibration.
  • Known retention gap (follow-on): LLMConnectClient.complete() still returns only content, discarding finish_reason/usage, and the total-loss artifact caps raw output below realistic break points. Capturing those signals so failures stay debuggable is tracked as a retention fix, not closed by this ADR.

Alternatives considered

  • Hard-enforce maxItems in the validator. Rejected: a hard reject of an over-count document reproduces the whole-document blast radius. Mitigation (keep top-N, quarantine the rest) is preferred.
  • Relax the schema to accept anything. Rejected: violates principle 1; pushes malformed data into downstream consumers.
  • Retry-until-valid only (pure posture A). Rejected as the sole strategy: the 2026-06-26 failure recurred across both the initial attempt and the retry, so retry alone does not bound the blast radius.

References

  • ACT-ADR-002 — markdown-as-definition format and output schema governance.
  • ACT-ADR-003 — Rule vs. Instruction model; the Instruction prompt-injection surface this boundary complements on the output side.
  • workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md — the implementing workplan.