From 9be4ddbdb746143de0386148825e45cdc95b6e21 Mon Sep 17 00:00:00 2001 From: tegwick Date: Fri, 26 Jun 2026 18:10:17 +0200 Subject: [PATCH] feat(ACTIVITY-WP-0016-T04): producer trust-boundary guardrails + ADR-004 Add ADR-004 documenting the producer trust boundary: untrusted producers (LLM, agent, human; erroneous and malicious), the trust-but-handle vs verify-and-mitigate postures, error-locality and quarantine-with-provenance principles, and the concrete activity-core mechanisms. Implement producer-agnostic guardrails in executor.py, applied uniformly on the happy path and the recovery path via _partition_items: structural-type -> schema -> structural caps (_MAX_DEPTH, _MAX_STRING_LEN) -> reference allow-list -> count cap. Each quarantine carries a reason. Closes the happy-path maxItems count cap deferred from T03 (valid 9-item report keeps 7, quarantines 2). Reference allow-list reads context["known_candidates"] via _allow_list_from_context; inert until a resolver populates it. SCOPE.md updated (executor bullet + ADR list); no INTENT drift. New tests: happy-path count cap, oversized-string guardrail, allow-list rejection. Full suite: 218 passed, 1 skipped. Co-Authored-By: Claude Opus 4.8 --- SCOPE.md | 7 +- docs/adr/adr-004-producer-trust-boundary.md | 156 ++++++++++++++++++ src/activity_core/rules/executor.py | 139 ++++++++++++++-- tests/rules/test_executor.py | 56 +++++++ ...16-llm-output-robustness-trust-boundary.md | 27 +++ 5 files changed, 373 insertions(+), 12 deletions(-) create mode 100644 docs/adr/adr-004-producer-trust-boundary.md diff --git a/SCOPE.md b/SCOPE.md index 4c3850f..0b6660b 100644 --- a/SCOPE.md +++ b/SCOPE.md @@ -64,7 +64,9 @@ The two evaluation modes: `context.*` / `event.*` interpolation and explicit `for_each` per-item binding. No `exec()`. - **Instruction executor**: trusted-field prompt rendering, LLM call via - llm-connect, structured output validation, bounded validation-failure + llm-connect, structured output validation, item-granular recovery with a + quarantine lane and producer guardrails (count/length/depth caps, reference + allow-list) at the producer trust boundary, bounded validation-failure artifacts for report instructions, review-required audit metadata, and deterministic report sinks. A real downstream review queue is not implemented in this repo. @@ -320,6 +322,9 @@ new one-off control paths. governance model, event type schema, ActivityDefinition structure. - `docs/adr/adr-003-rule-instruction-model.md` — Rule DSL, Instruction safety model, evaluation semantics, audit trail, testing strategy. +- `docs/adr/adr-004-producer-trust-boundary.md` — untrusted-producer premise, + trust-but-handle vs verify-and-mitigate postures, error-locality and + quarantine-with-provenance, producer guardrails for LLM/agent/human output. --- diff --git a/docs/adr/adr-004-producer-trust-boundary.md b/docs/adr/adr-004-producer-trust-boundary.md new file mode 100644 index 0000000..6a58b87 --- /dev/null +++ b/docs/adr/adr-004-producer-trust-boundary.md @@ -0,0 +1,156 @@ +--- +id: ACT-ADR-004 +type: architecture-decision-record +title: "The Producer Trust Boundary — Guardrails and Error-Correction for Untrusted Output" +status: accepted +decided_by: Bernd Worsch +date: "2026-06-26" +scope: cross-repo +affects: + - activity-core + - rules-core (future extraction) +tags: ["architecture", "llm", "safety", "validation", "guardrails", "trust-boundary", "resilience"] +--- + +# ACT-ADR-004: The Producer Trust Boundary + +## Status + +Accepted. + +## Context + +On 2026-06-26 the scheduled daily WSJF triage instruction fired on time, called +llm-connect successfully, and produced a long ranked recommendation list — but +the JSON broke at char 5268 (~rank 8–9 of ~16), failing schema validation. Because +the report was validated and consumed as a single monolithic JSON document, one +malformed delimiter discarded the **entire** run, including the 7 perfectly good +recommendations the model had already emitted. The scheduling and runtime layers +were healthy; the failure was entirely at the seam where free-form model output +meets a strict consumer. + +This is not a one-off bug, it is a recurring class. activity-core has a **trust +boundary** wherever generative or human-authored output meets strict deterministic +consumers: the JSON Schema validator, the task emitter, and any classic compute +pipeline downstream. The producers on the other side of that boundary — **LLMs, +agents, and humans** — are all *untrusted producers*. Their output may be: + +- **erroneous** — hallucination, truncation at a token limit, drift, type slips, + typos, a missing delimiter; or +- **malicious** — prompt injection, crafted payloads, or oversized / deeply-nested + structures intended to exhaust or confuse the consumer. + +The pre-existing design treated producer output optimistically: parse the whole +document, validate the whole document, and on any failure discard the whole +document (preserving only a bounded diagnostic preview). That gives **zero error +locality** — the blast radius of any single defect is the entire activation. + +## Decision + +Treat the producer→consumer seam as an explicit, adversarial **trust boundary**, +and place guardrails plus error-correction tooling *at that boundary* rather than +letting raw producer output flow into deterministic consumers. + +### Two non-fail-fast postures + +When hard-failing on a problem is undesirable, there are two sound strategies, and +they **compose**: + +- **A) Trust but handle exceptions** (optimistic / reactive). Consume the output + as-is; on exception, catch → repair → retry → or quarantine. Cheap on the happy + path; blast radius depends entirely on how granular the catch is. Best when + failures are rare and locally recoverable. Risk: failures surface late, possibly + after partial side effects. +- **B) Verify and mitigate** (defensive / proactive). Validate, sanitize, clamp, + and normalize the output to a known-good shape *before* it enters the pipeline — + drop bad items, coerce types, bound sizes/depth, allow-list references — so the + consumer only ever sees clean input. Higher upfront cost, smaller blast radius, + no partial side effects. Best when failures are common or consequences are high. + +### Governing principles + +1. **Push verification to the boundary; keep the interior strict.** Apply posture + **B** at the producer→consumer boundary; keep posture **A** for residual + exceptions inside the verified core. Never relax the interior schema to absorb + producer sloppiness. +2. **Make error locality match the unit of work.** One bad recommendation must + cost one recommendation, not the whole report. Structuring the payload so each + item is independently parseable and validatable is the highest-leverage change. +3. **Quarantine, never silently drop.** Invalid units are preserved as bounded, + provenance-tagged artifacts (`index`, `error`, `raw` snippet, `reason`) so they + can be debugged or replayed. Degraded-but-usable is reported distinctly from + total loss. +4. **Both human and agent input get the same rigor.** Guardrails are + producer-agnostic: the same count / length / depth caps and reference + allow-lists apply whether the producer is an LLM, an agent, or a human. + +### What this means concretely in activity-core + +Implemented in `src/activity_core/rules/executor.py`: + +- **Strict-structure-only schema.** The daily-triage output schema is strict on + per-item *structure* (`required [rank, candidate, action, why]`, typed `wsjf`) + and carries `maxItems` as a producer *hint* — never as a hard whole-document + reject, which would reproduce the very blast-radius failure (ACT-ADR-002 governs + the schema format; `schemas/daily-triage-report.json`). +- **Item-granular recovery (posture B).** When whole-document parse + one retry + fail, `_resilient_report` recovers individually-parseable recommendation objects + via a brace/quote-aware scanner (`_extract_object_spans`) that works for both + pretty-printed and NDJSON output, attempts a best-effort `_try_repair` on a + truncated tail, validates each recovered object against the item schema, and + keeps the valid ones. Survivors are emitted with `output_validated=true`, + `partial=true`, and `review_required=true`. +- **Producer guardrails (`_partition_items`, applied on both the recovery and the + happy path).** Per recommendation: structural type → schema → structural caps + (`_MAX_DEPTH`, `_MAX_STRING_LEN`) → reference allow-list → count cap (top-N by + `maxItems`). The first failing check quarantines the item with provenance and a + `reason` (`malformed` / `schema` / `guardrail` / `allow_list` / `over_limit`). +- **Reference allow-list.** A recommendation whose `candidate` is not in the set of + known ids is quarantined. The set is sourced from resolved context + (`context["known_candidates"]`, via `_allow_list_from_context`); the check is + inert until a context resolver populates it, so the capability ships now and + activates with a one-line resolver change. + +### Where each posture sits + +| Layer | Posture | Mechanism | +|-------|---------|-----------| +| Schema / contract | B | strict per-item structure; `maxItems` as hint | +| Whole-document parse | A | tolerant parse + single retry | +| Failed parse | B | item-granular recovery + repair + quarantine | +| Per-item screening | B | schema + depth/length caps + allow-list + count cap | +| Emitted report | — | `partial` / `quarantined_*` provenance; never silent | + +## Consequences + +- A single malformed or oversized item no longer discards an entire activation; + the daily-triage run that failed on 2026-06-26 would now deliver its 7 valid + recommendations and quarantine the broken tail. +- Reports gain a `partial` / `quarantined_*` vocabulary; downstream report sinks + and reviewers can distinguish degraded-but-usable from total loss. +- Guardrail thresholds (`_MAX_DEPTH`, `_MAX_STRING_LEN`, `maxItems`, the + allow-list) are policy knobs that will need tuning; they are intentionally + conservative defaults, not a finished calibration. +- **Known retention gap (follow-on):** `LLMConnectClient.complete()` still returns + only `content`, discarding `finish_reason`/`usage`, and the total-loss artifact + caps raw output below realistic break points. Capturing those signals so + failures stay debuggable is tracked as a retention fix, not closed by this ADR. + +## Alternatives considered + +- **Hard-enforce `maxItems` in the validator.** Rejected: a hard reject of an + over-count document reproduces the whole-document blast radius. Mitigation (keep + top-N, quarantine the rest) is preferred. +- **Relax the schema to accept anything.** Rejected: violates principle 1; pushes + malformed data into downstream consumers. +- **Retry-until-valid only (pure posture A).** Rejected as the sole strategy: the + 2026-06-26 failure recurred across both the initial attempt and the retry, so + retry alone does not bound the blast radius. + +## References + +- ACT-ADR-002 — markdown-as-definition format and output schema governance. +- ACT-ADR-003 — Rule vs. Instruction model; the Instruction prompt-injection + surface this boundary complements on the output side. +- `workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md` — the + implementing workplan. diff --git a/src/activity_core/rules/executor.py b/src/activity_core/rules/executor.py index 20f9f5d..8d04f0b 100644 --- a/src/activity_core/rules/executor.py +++ b/src/activity_core/rules/executor.py @@ -160,15 +160,20 @@ def _execute( prompt_hash = hashlib.sha256(rendered.encode()).hexdigest() llm_config = _llm_run_config(instr) + # Reference allow-list (WP-0016-T04): if a context resolver supplied the set + # of known candidate ids, recommendations pointing at anything else are + # quarantined. Absent (None) today → the check is inert until wired. + allow_list = _allow_list_from_context(context) + # Step 3 — call LLM raw_output = llm_client.complete(rendered, model=instr.model, config=llm_config) # Step 4 — validate and optionally retry - task_specs, report, error = _validate_output(raw_output, instr) + task_specs, report, error = _validate_output(raw_output, instr, allow_list) if error: retry_prompt = rendered + f"\n\nPrevious output was invalid: {error}\nPlease fix." raw_output = llm_client.complete(retry_prompt, model=instr.model, config=llm_config) - task_specs, report, error = _validate_output(raw_output, instr) + task_specs, report, error = _validate_output(raw_output, instr, allow_list) if error: # Truncate to keep log volume bounded but long enough to see the # actual JSON shape mismatch (typical reports are <2KB). @@ -181,7 +186,9 @@ def _execute( # Posture B (WP-0016-T03): try to recover a partial-but-usable # report from individually-parseable items before declaring total # loss. One bad item should cost one item, not the whole report. - recovered = _resilient_report(instr, raw_output, error, prompt_hash) + recovered = _resilient_report( + instr, raw_output, error, prompt_hash, allow_list, + ) if recovered is not None: return recovered failure_report = _invalid_output_report(instr, error, raw_output) @@ -297,6 +304,12 @@ def _invalid_output_report( _QUARANTINE_LIMIT = 20 _SNIPPET_LIMIT = 200 +# Producer guardrails (ACTIVITY-WP-0016-T04): structural bounds applied to every +# recommendation regardless of producer (LLM, agent, or human). These are +# verify-and-mitigate limits — an offending item is quarantined, never allowed to +# fail the whole report or flow unbounded into a downstream consumer. +_MAX_STRING_LEN = 4000 +_MAX_DEPTH = 8 _SUMMARY_RE = re.compile(r'"summary"\s*:\s*"((?:[^"\\]|\\.)*)"') @@ -305,6 +318,51 @@ def _snippet(value: Any) -> str: return text[:_SNIPPET_LIMIT] +def _json_depth(value: Any, depth: int = 1) -> int: + if depth > _MAX_DEPTH: + return depth + if isinstance(value, dict): + return max((_json_depth(v, depth + 1) for v in value.values()), default=depth) + if isinstance(value, list): + return max((_json_depth(v, depth + 1) for v in value), default=depth) + return depth + + +def _has_oversized_string(value: Any) -> bool: + if isinstance(value, str): + return len(value) > _MAX_STRING_LEN + if isinstance(value, dict): + return any(_has_oversized_string(v) for v in value.values()) + if isinstance(value, list): + return any(_has_oversized_string(v) for v in value) + return False + + +def _item_structure_error(item: Any) -> str | None: + """Producer-agnostic structural guardrail: depth and string-length caps.""" + if _json_depth(item) > _MAX_DEPTH: + return f"exceeds max nesting depth {_MAX_DEPTH}" + if _has_oversized_string(item): + return f"contains a string longer than {_MAX_STRING_LEN} chars" + return None + + +def _allow_list_from_context(context: dict | None) -> set[str] | None: + """Build the recommendation-candidate allow-list from resolved context. + + Looks for `context["known_candidates"]` (a list/set of valid candidate ids). + Returns None when absent so the allow-list check stays inert until a context + resolver populates it — the guardrail capability ships now; activation is a + one-line resolver change. + """ + if not isinstance(context, dict): + return None + known = context.get("known_candidates") + if isinstance(known, (list, set, tuple)): + return {str(item) for item in known} + return None + + def _report_contract(instr: Any) -> tuple[dict[str, Any] | None, int | None]: """Extract (item_schema, max_items) for the recommendations list, if any.""" try: @@ -440,20 +498,53 @@ def _partition_items( items: list[dict[str, Any]], item_schema: dict[str, Any] | None, max_items: int | None, + *, + run_schema: bool = True, + allow_list: set[str] | None = None, ) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]: - """Split items into (valid, quarantined): schema-invalid then over-limit.""" + """Screen items into (valid, quarantined). + + Applied uniformly to recovered items (run_schema=True) and to already + schema-valid happy-path items (run_schema=False). Order of checks: structural + type → schema → producer guardrails (depth/length) → reference allow-list → + count cap. The first failing check quarantines the item with provenance. + """ valid: list[dict[str, Any]] = [] quarantined: list[dict[str, Any]] = [] for index, item in enumerate(items): - error = ( + if not isinstance(item, dict): + quarantined.append( + {"index": index, "error": "item is not a JSON object", + "raw": _snippet(item), "reason": "malformed"} + ) + continue + schema_error = ( _validate_schema_node(item, item_schema, f"recommendations[{index}]") - if item_schema + if (run_schema and item_schema) else None ) - if error: - quarantined.append({"index": index, "error": error, "raw": _snippet(item)}) - else: - valid.append(item) + if schema_error: + quarantined.append( + {"index": index, "error": schema_error, "raw": _snippet(item), + "reason": "schema"} + ) + continue + structure_error = _item_structure_error(item) + if structure_error: + quarantined.append( + {"index": index, "error": structure_error, "raw": _snippet(item), + "reason": "guardrail"} + ) + continue + if allow_list is not None: + candidate = item.get("candidate") + if not isinstance(candidate, str) or candidate not in allow_list: + quarantined.append( + {"index": index, "error": f"candidate {candidate!r} not in allow-list", + "raw": _snippet(item), "reason": "allow_list"} + ) + continue + valid.append(item) if max_items is not None and len(valid) > max_items: for item in valid[max_items:]: quarantined.append( @@ -469,6 +560,7 @@ def _resilient_report( raw_output: Any, original_error: str, prompt_hash: str | None, + allow_list: set[str] | None = None, ) -> InstructionResult | None: """Recover a partial-but-usable report from output that failed validation. @@ -481,7 +573,9 @@ def _resilient_report( summary, items, quarantined = _recover_recommendations(raw_output) if not items: return None - valid, item_quarantine = _partition_items(items, item_schema, max_items) + valid, item_quarantine = _partition_items( + items, item_schema, max_items, allow_list=allow_list, + ) quarantined.extend(item_quarantine) if not valid: return None @@ -528,6 +622,7 @@ def _execution_failure_report(instr: Any, error: str) -> dict[str, Any] | None: def _validate_output( raw_output: Any, instr: Any, + allow_list: set[str] | None = None, ) -> tuple[list[TaskSpec], dict[str, Any] | None, str | None]: """Parse raw LLM output into TaskSpecs and optional report payload. @@ -582,6 +677,28 @@ def _validate_output( source_type="instruction", source_id=instr.id, )) + + # Happy-path producer guardrails (WP-0016-T04): the whole document already + # passed schema validation, so recommendations are schema-valid; still apply + # the count cap, structural caps, and reference allow-list, quarantining any + # offenders rather than emitting them. Report shape only changes when an item + # is actually quarantined. + if isinstance(report, dict) and isinstance(report.get("recommendations"), list): + item_schema, max_items = _report_contract(instr) + kept, quarantined = _partition_items( + report["recommendations"], item_schema, max_items, + run_schema=False, allow_list=allow_list, + ) + if quarantined: + report = { + **report, + "recommendations": kept, + "status": "partial", + "partial": True, + "quarantined_count": len(quarantined), + "quarantined_items": quarantined[:_QUARANTINE_LIMIT], + } + return specs, report, None except (json.JSONDecodeError, AttributeError, KeyError, TypeError) as exc: return [], None, str(exc) diff --git a/tests/rules/test_executor.py b/tests/rules/test_executor.py index 81d361d..d7c4837 100644 --- a/tests/rules/test_executor.py +++ b/tests/rules/test_executor.py @@ -475,6 +475,62 @@ def test_resilient_report_quarantines_one_bad_item_among_valid(): assert "rank" in result.report["quarantined_items"][0]["error"] +# ── WP-0016-T04 producer guardrails ─────────────────────────────────────────── + +def _triage_instr() -> SimpleNamespace: + return _instr( + id="daily-triage-report", + prompt="Report.", + trusted_fields=[], + output_schema="schemas/daily-triage-report.json", + report_sinks=[{"type": "working-memory"}], + ) + + +def test_guardrail_count_cap_on_valid_happy_path(): + # 9 fully-valid recommendations in a syntactically valid document: schema + # validation passes, but the maxItems=7 count cap must keep 7 and quarantine 2. + recs = [_valid_rec(i) for i in range(1, 10)] + raw = json.dumps({"summary": "Triage.", "recommendations": recs}) + llm = _CountingLLM([raw]) + + result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm) + + assert llm.call_count == 1 # no retry — the document was valid + assert result.report["partial"] is True + assert len(result.report["recommendations"]) == 7 + assert result.report["quarantined_count"] == 2 + assert all(q["reason"] == "over_limit" for q in result.report["quarantined_items"]) + + +def test_guardrail_oversized_string_quarantined(): + big = _valid_rec(2) + big["why"] = "x" * 5000 # exceeds _MAX_STRING_LEN + raw = json.dumps({"summary": "Triage.", "recommendations": [_valid_rec(1), big]}) + llm = _CountingLLM([raw]) + + result = execute_instruction_with_audit(_triage_instr(), _Event(), {}, llm) + + assert len(result.report["recommendations"]) == 1 + assert result.report["quarantined_count"] == 1 + assert result.report["quarantined_items"][0]["reason"] == "guardrail" + + +def test_guardrail_allow_list_rejects_unknown_candidate(): + raw = json.dumps({ + "summary": "Triage.", + "recommendations": [_valid_rec(1), _valid_rec(2)], # candidates WS-1, WS-2 + }) + llm = _CountingLLM([raw]) + context = {"known_candidates": ["WS-1"]} + + result = execute_instruction_with_audit(_triage_instr(), _Event(), context, llm) + + assert len(result.report["recommendations"]) == 1 + assert result.report["recommendations"][0]["candidate"] == "WS-1" + assert result.report["quarantined_items"][0]["reason"] == "allow_list" + + def test_execute_instruction_with_audit_preserves_invalid_report_with_sinks( tmp_path, monkeypatch, diff --git a/workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md b/workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md index 90a860f..cda146d 100644 --- a/workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md +++ b/workplans/ACTIVITY-WP-0016-llm-output-robustness-trust-boundary.md @@ -297,6 +297,33 @@ Done when: - SCOPE.md / INTENT.md are checked for drift and updated if the boundary stance changes the documented contract. +2026-06-26 progress: + +- **ADR-004 written** — `docs/adr/adr-004-producer-trust-boundary.md` documents + the untrusted-producer premise (erroneous + malicious; LLM/agent/human), the + A-vs-B posture taxonomy, the four governing principles, the concrete + activity-core mechanisms, a posture-by-layer table, consequences, and + alternatives considered. Accepted, scope cross-repo. +- **Producer guardrails implemented** in `executor.py`, applied uniformly on the + happy path *and* the recovery path via `_partition_items`: per-item order is + structural-type → schema → structural caps (`_MAX_DEPTH=8`, + `_MAX_STRING_LEN=4000`) → reference allow-list → count cap (`maxItems`). Each + quarantine carries a `reason` (`malformed`/`schema`/`guardrail`/`allow_list`/ + `over_limit`). +- **Happy-path count cap closed** (the item deferred from T03): a syntactically + valid 9-item report now keeps 7 and quarantines 2 as `over_limit`, emitting a + `partial` report — without a retry. +- **Reference allow-list wired but inert.** `_allow_list_from_context` reads + `context["known_candidates"]`; when present, recommendations with an unknown + `candidate` are quarantined (`reason: allow_list`). Absent today → check is + inert; activation is a one-line context-resolver change. Keeps the guardrail + producer-agnostic (principle #4) and ready. +- **SCOPE.md updated** — instruction-executor bullet now names the quarantine + lane + guardrails; ADR-004 added to the Architecture Decisions list. No INTENT + drift: this hardens the existing output contract, it does not extend scope. +- New tests: happy-path count cap, oversized-string guardrail, allow-list + rejection (all green). + ## Tests + Calibration Re-Entry ```task