Implement llm-connect ADHOC diagnostics

2026-06-03 11:56:21 +02:00
parent 79c899b694
commit 24f4c09d42
17 changed files with 1618 additions and 611 deletions
--- a/ARCHITECTURE-LAYERS.md
+++ b/ARCHITECTURE-LAYERS.md
@@ -32,6 +32,9 @@ Maturity states: **Experimental → Beta → Stable → Deprecated**
 | `gemini.py` | `GeminiAdapter` — Google Generative Language API | Beta |
 | `openrouter.py` | `OpenRouterAdapter` — OpenAI-compatible multi-model routing | Beta |
 | `claude_code.py` | `ClaudeCodeAdapter` — `claude --print` subprocess | Beta |
+| `_payload.py` | Shared adapter payload translation for `RunConfig.model_params` | Beta |
+| `_diagnostics.py` | Opt-in per-call diagnostics capture for server debug and audit modes | Beta |
+| `replay.py` | Audit replay parser CLI (`python -m llm_connect.replay`) | Beta |
 | `embedding_adapter.py` | `EmbeddingAdapter` ABC | Beta |
 | `embedding_openai.py` | `OpenAICompatibleEmbeddingAdapter` | Beta |
 | `embedding_cache.py` | `EmbeddingCache` — disk-backed embedding cache | Beta |
--- a/README.md
+++ b/README.md
@@ -78,7 +78,7 @@ config = RunConfig(
 | `model_name` | `"gpt-4"` | Model identifier (adapter may override) |
 | `temperature` | `0.7` | Sampling temperature |
 | `max_tokens` | `2000` | Maximum output tokens |
-| `model_params` | `{}` | Extra provider-specific parameters |
+| `model_params` | `{}` | Portable extras translated by each adapter; see `docs/adapter-model-params.md` |
 | `max_depth` | `3` | Max nesting depth for recursive calls |
 | `skip_if_exists` | `True` | Skip if identical input hash already processed |
 | `timeout_seconds` | `300` | Request timeout |
@@ -95,6 +95,22 @@ print(response.usage)         # {"prompt_tokens": …, "completion_tokens": …,
 print(response.finish_reason) # "stop", "length", etc.
 ```

+## Server diagnostics
+
+Server mode can include a debug envelope without changing normal responses:
+
+```bash
+LLM_CONNECT_DEBUG=1 python -m llm_connect.server --provider openrouter
+curl 'http://127.0.0.1:8080/execute?debug=1' -d '{"prompt":"hi"}'
+```
+
+Set `LLM_CONNECT_AUDIT_DIR=/path/to/audit` to write per-call replay records,
+then parse one without another provider call:
+
+```bash
+python -m llm_connect.replay /path/to/audit/record.json --json
+```
+
 ## Writing your own adapter

 ```python
--- a/docs/adapter-model-params.md
+++ b/docs/adapter-model-params.md
@@ -0,0 +1,102 @@
+# Adapter `model_params` contract
+
+`RunConfig.model_params` is a portability layer, not a blind provider payload
+escape hatch. Adapters must translate the shared keys they understand, pass
+through only provider-valid keys, and drop provider-specific keys that would
+make another provider reject the request.
+
+## Shared structured output
+
+Callers may request structured output with:
+
+```python
+RunConfig(
+    model_params={
+        "json_schema": {
+            "type": "object",
+            "properties": {
+                "summary": {"type": "string"},
+                "recommendations": {"type": "array", "items": {"type": "string"}},
+            },
+            "required": ["summary", "recommendations"],
+        }
+    }
+)
+```
+
+Adapters translate that key into the provider's native shape:
+
+| Adapter | Translation |
+|---|---|
+| OpenAI | `response_format = {"type": "json_schema", "json_schema": ...}` |
+| OpenRouter | Same OpenAI-compatible `response_format` wrapper |
+| Gemini | `generationConfig.responseMimeType = "application/json"` and `generationConfig.responseSchema = ...` |
+| Claude Code CLI | `--json-schema <schema>` plus `--output-format json`, then envelope unwrap |
+
+OpenAI-compatible adapters default `json_schema.strict` to `False`. Strict mode
+requires schemas to meet provider-specific constraints such as
+`additionalProperties: false` on object nodes and complete `required` lists.
+Callers that need strict behavior can pass an explicit provider-native
+`response_format` in `model_params`.
+
+## Pass-through keys
+
+OpenAI and OpenRouter pass through known Chat Completions fields:
+
+`top_p`, `n`, `stream`, `stop`, `presence_penalty`, `frequency_penalty`,
+`logit_bias`, `user`, `seed`, `tools`, `tool_choice`, `response_format`,
+`logprobs`, `top_logprobs`, and `parallel_tool_calls`.
+
+Gemini passes through valid `generateContent` top-level fields:
+
+`safetySettings`, `tools`, `toolConfig`, `systemInstruction`, and
+`cachedContent`.
+
+Gemini also accepts generation config fields directly or via snake-case aliases:
+
+`candidateCount`, `candidate_count`, `stopSequences`, `stop_sequences`,
+`maxOutputTokens`, `max_output_tokens`, `temperature`, `topP`, `top_p`, `topK`,
+`top_k`, `responseMimeType`, `response_mime_type`, `responseSchema`, and
+`response_schema`.
+
+## Dropped keys
+
+Adapters must drop keys that are meaningful to another adapter or to
+llm-connect itself but invalid for the target provider. The current shared drop
+set includes:
+
+`reasoning_effort`, `max_depth`, `claude_cli_path`, and raw `json_schema` after
+translation.
+
+Unknown keys are ignored by default. This keeps activity-specific configs from
+causing provider HTTP 400 errors when a caller switches providers.
+
+## Diagnostics and replay
+
+Server mode supports opt-in diagnostics for `/execute`:
+
+```bash
+LLM_CONNECT_DEBUG=1 python -m llm_connect.server --provider openrouter
+curl 'http://127.0.0.1:8080/execute?debug=1' -d '{"prompt":"hi"}'
+```
+
+Debug responses include a `debug` field with the redacted provider request, the
+provider response body, and adapter transformations such as
+`merge_model_params.openai_chat` or `unwrap_cli_envelope`. Normal responses omit `debug`.
+
+Set `LLM_CONNECT_AUDIT_DIR=/path/to/audit` to write one JSON audit record per
+`/execute` call. Audit records include the prompt, config, redacted provider
+request, provider response, parsed content, and latency. Re-run parsing without
+another provider call with:
+
+```bash
+python -m llm_connect.replay /path/to/audit/record.json --json
+```
+
+## Server concurrency
+
+`llm_connect.server.LLMServer` uses `ThreadingHTTPServer`. Adapter instances
+used in server mode must be safe to call concurrently. The bundled HTTP and
+subprocess adapters keep per-call state local; custom adapters should avoid
+mutating shared instance attributes during `execute_prompt` unless they use
+their own locks.
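
Reviewer sketch for the server concurrency note that closes `docs/adapter-model-params.md`: a minimal illustration of a lock guarding shared instance state when several server threads call the same adapter instance. `CountingAdapter` and its `calls` attribute are hypothetical and not part of llm-connect; the class only mimics the `execute_prompt(prompt, config)` call shape and does not subclass `LLMAdapter`, so that the snippet runs on its own.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class CountingAdapter:
    """Hypothetical adapter-like class; not part of llm-connect."""

    def __init__(self) -> None:
        self.calls = 0                 # shared across all calling threads
        self._lock = threading.Lock()  # guards self.calls

    def execute_prompt(self, prompt: str, config: object = None) -> str:
        # Per-call state stays in local variables.
        reply = prompt.upper()
        # Shared state is mutated only while holding the lock.
        with self._lock:
            self.calls += 1
        return reply


adapter = CountingAdapter()
with ThreadPoolExecutor(max_workers=8) as pool:
    replies = list(pool.map(adapter.execute_prompt, ["hi"] * 200))
```

Keeping the lock scope to the single shared mutation means concurrent calls only serialize on the counter update, not on the (simulated) provider call.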
--- a/llm_connect/_diagnostics.py
+++ b/llm_connect/_diagnostics.py
@@ -0,0 +1,153 @@
+"""Per-call diagnostics capture for server debug and audit modes."""
+
+from __future__ import annotations
+
+import copy
+import json
+from contextlib import contextmanager
+from contextvars import ContextVar
+from dataclasses import dataclass, field
+from typing import Any, Iterator, Mapping
+from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
+
+
+_SECRET_QUERY_KEYS = {"key", "api_key", "apikey", "access_token", "token"}
+_SECRET_HEADER_TOKENS = ("authorization", "api-key", "apikey", "token", "secret", "key")
+
+
+@dataclass
+class Diagnostics:
+    """Captured provider request/response details for one logical LLM call."""
+
+    provider_request: dict[str, Any] | None = None
+    provider_response: dict[str, Any] | None = None
+    adapter_transformations: list[dict[str, Any]] = field(default_factory=list)
+
+    def to_dict(self) -> dict[str, Any]:
+        return {
+            "provider_request": self.provider_request,
+            "provider_response": self.provider_response,
+            "adapter_transformations": self.adapter_transformations,
+        }
+
+
+_CURRENT: ContextVar[Diagnostics | None] = ContextVar(
+    "llm_connect_diagnostics",
+    default=None,
+)
+
+
+@contextmanager
+def capture_diagnostics(enabled: bool = True) -> Iterator[Diagnostics | None]:
+    """Capture diagnostics within this context when *enabled* is true."""
+
+    if not enabled:
+        yield None
+        return
+
+    diagnostics = Diagnostics()
+    token = _CURRENT.set(diagnostics)
+    try:
+        yield diagnostics
+    finally:
+        _CURRENT.reset(token)
+
+
+def diagnostics_enabled() -> bool:
+    return _CURRENT.get() is not None
+
+
+def current_diagnostics() -> Diagnostics | None:
+    return _CURRENT.get()
+
+
+def record_provider_request(
+    *,
+    url: str | None = None,
+    payload: Any | None = None,
+    headers: Mapping[str, Any] | None = None,
+    command: list[str] | None = None,
+) -> None:
+    diagnostics = _CURRENT.get()
+    if diagnostics is None:
+        return
+
+    request: dict[str, Any] = {}
+    if url is not None:
+        request["url"] = redact_url(url)
+    if payload is not None:
+        request["payload"] = json_safe(payload)
+    if headers is not None:
+        request["headers_redacted"] = redact_headers(headers)
+    if command is not None:
+        request["command"] = list(command)
+    diagnostics.provider_request = request
+
+
+def record_provider_response(*, status: int | None = None, body: Any | None = None) -> None:
+    diagnostics = _CURRENT.get()
+    if diagnostics is None:
+        return
+
+    response: dict[str, Any] = {}
+    if status is not None:
+        response["status"] = status
+    if body is not None:
+        response["body"] = json_safe(body)
+    diagnostics.provider_response = response
+
+
+def record_adapter_transformation(step: str, before: Any, after: Any) -> None:
+    diagnostics = _CURRENT.get()
+    if diagnostics is None:
+        return
+
+    diagnostics.adapter_transformations.append(
+        {
+            "step": step,
+            "before": json_safe(before),
+            "after": json_safe(after),
+        }
+    )
+
+
+def json_safe(value: Any) -> Any:
+    """Return a JSON-serializable snapshot of *value* without mutating it."""
+
+    try:
+        return json.loads(json.dumps(value))
+    except (TypeError, ValueError):
+        try:
+            return json.loads(json.dumps(value, default=repr))
+        except Exception:
+            return repr(value)
+
+
+def redact_headers(headers: Mapping[str, Any]) -> dict[str, Any]:
+    redacted: dict[str, Any] = {}
+    for key, value in headers.items():
+        lowered = str(key).lower()
+        if any(token in lowered for token in _SECRET_HEADER_TOKENS):
+            redacted[str(key)] = _redact_header_value(value)
+        else:
+            redacted[str(key)] = json_safe(value)
+    return redacted
+
+
+def redact_url(url: str) -> str:
+    parts = urlsplit(url)
+    query = []
+    for key, value in parse_qsl(parts.query, keep_blank_values=True):
+        if key.lower() in _SECRET_QUERY_KEYS:
+            query.append((key, "<redacted>"))
+        else:
+            query.append((key, value))
+    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), parts.fragment))
+
+
+def _redact_header_value(value: Any) -> str:
+    text = str(value)
+    if " " in text:
+        scheme = text.split(" ", 1)[0]
+        return f"{scheme} <redacted>"
+    return "<redacted>"
--- a/llm_connect/_http.py
+++ b/llm_connect/_http.py
@@ -5,10 +5,11 @@ Translates HTTP errors into typed :mod:`markitect.llm.exceptions`.
 """

 import json
-import urllib.request
 import urllib.error
-from typing import Dict, Any, Optional
+import urllib.request
+from typing import Any, Dict, Optional

+from llm_connect._diagnostics import record_provider_request, record_provider_response
 from llm_connect.exceptions import (
    LLMAPIError,
    LLMRateLimitError,
@@ -29,6 +30,7 @@ def post_json(
        LLMAPIError: on other non-2xx responses
        LLMTimeoutError: on socket / read timeout
    """
+    record_provider_request(url=url, payload=payload, headers=headers or {})
    data = json.dumps(payload).encode()
    req = urllib.request.Request(
        url,
@@ -41,11 +43,14 @@ def post_json(
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read().decode()
            try:
-                return json.loads(body)
+                parsed = json.loads(body)
+                record_provider_response(status=resp.status, body=parsed)
+                return parsed
            except json.JSONDecodeError as exc:
+                record_provider_response(status=resp.status, body=body)
                preview = body[:300].replace("\n", "\\n")
                raise LLMAPIError(
-                    f"Invalid JSON response from {url}: {exc} — body preview: {preview!r}",
+                    f"Invalid JSON response from {url}: {exc} - body preview: {preview!r}",
                    cause=exc,
                ) from exc
    except urllib.error.HTTPError as exc:
@@ -54,6 +59,7 @@ def post_json(
            body = exc.read().decode()
        except Exception:
            pass
+        record_provider_response(status=exc.code, body=_json_or_text(body))

        if exc.code == 429:
            raise LLMRateLimitError(
@@ -70,6 +76,7 @@ def post_json(
            cause=exc,
        ) from exc
    except urllib.error.URLError as exc:
+        record_provider_response(body={"error": str(exc.reason)})
        if "timed out" in str(exc.reason):
            raise LLMTimeoutError(
                f"Request to {url} timed out after {timeout}s",
@@ -80,7 +87,15 @@ def post_json(
            cause=exc,
        ) from exc
    except TimeoutError as exc:
+        record_provider_response(body={"error": "timeout"})
        raise LLMTimeoutError(
            f"Request to {url} timed out after {timeout}s",
            cause=exc,
        ) from exc
+
+
+def _json_or_text(body: str) -> Any:
+    try:
+        return json.loads(body)
+    except (TypeError, ValueError):
+        return body
--- a/llm_connect/_payload.py
+++ b/llm_connect/_payload.py
@@ -0,0 +1,154 @@
+"""Provider payload helpers for translating ``RunConfig.model_params``."""
+
+from __future__ import annotations
+
+import json
+from typing import Any
+
+from llm_connect._diagnostics import (
+    diagnostics_enabled,
+    json_safe,
+    record_adapter_transformation,
+)
+
+
+# OpenAI Chat Completions fields that map straight through from model_params.
+# Anything not in this set is provider-specific and must be either translated
+# or dropped. Blind merges are deliberately avoided because OpenAI-compatible
+# providers commonly reject unknown top-level fields with HTTP 400.
+OPENAI_CHAT_PASSTHROUGH_FIELDS = frozenset(
+    {
+        "top_p",
+        "n",
+        "stream",
+        "stop",
+        "presence_penalty",
+        "frequency_penalty",
+        "logit_bias",
+        "user",
+        "seed",
+        "tools",
+        "tool_choice",
+        "response_format",
+        "logprobs",
+        "top_logprobs",
+        "parallel_tool_calls",
+    }
+)
+
+
+DROPPED_NON_OPENAI_FIELDS = frozenset(
+    {
+        "reasoning_effort",
+        "max_depth",
+        "claude_cli_path",
+        "json_schema",
+    }
+)
+
+
+GEMINI_TOP_LEVEL_FIELDS = frozenset(
+    {
+        "safetySettings",
+        "tools",
+        "toolConfig",
+        "systemInstruction",
+        "cachedContent",
+    }
+)
+
+
+GEMINI_GENERATION_CONFIG_FIELDS = frozenset(
+    {
+        "candidateCount",
+        "stopSequences",
+        "maxOutputTokens",
+        "temperature",
+        "topP",
+        "topK",
+        "responseMimeType",
+        "responseSchema",
+    }
+)
+
+
+GEMINI_GENERATION_CONFIG_ALIASES = {
+    "candidate_count": "candidateCount",
+    "stop_sequences": "stopSequences",
+    "max_output_tokens": "maxOutputTokens",
+    "top_p": "topP",
+    "top_k": "topK",
+    "response_mime_type": "responseMimeType",
+    "response_schema": "responseSchema",
+}
+
+
+def merge_openai_chat_model_params(payload: dict[str, Any], model_params: dict[str, Any]) -> None:
+    """Merge model_params into an OpenAI Chat Completions-style payload.
+
+    Translates ``json_schema`` to ``response_format``, passes known OpenAI
+    fields through, and drops Claude/llm-connect-only knobs.
+    """
+
+    before = json_safe(payload) if diagnostics_enabled() else None
+
+    schema = _coerce_json_schema(model_params.get("json_schema"))
+    caller_response_format = model_params.get("response_format")
+    if schema is not None and caller_response_format is None and "response_format" not in payload:
+        payload["response_format"] = {
+            "type": "json_schema",
+            "json_schema": {
+                "name": "structured_output",
+                "schema": schema,
+                "strict": False,
+            },
+        }
+
+    for key, value in model_params.items():
+        if key in DROPPED_NON_OPENAI_FIELDS:
+            continue
+        if key in OPENAI_CHAT_PASSTHROUGH_FIELDS:
+            payload[key] = value
+
+    if before is not None:
+        record_adapter_transformation("merge_model_params.openai_chat", before, payload)
+
+
+def merge_gemini_model_params(payload: dict[str, Any], model_params: dict[str, Any]) -> None:
+    """Merge model_params into a Gemini ``generateContent`` payload."""
+
+    before = json_safe(payload) if diagnostics_enabled() else None
+    generation_config = payload.setdefault("generationConfig", {})
+
+    schema = _coerce_json_schema(model_params.get("json_schema"))
+    if schema is not None and "responseSchema" not in generation_config:
+        generation_config["responseMimeType"] = "application/json"
+        generation_config["responseSchema"] = schema
+
+    explicit_generation_config = model_params.get("generationConfig")
+    if isinstance(explicit_generation_config, dict):
+        generation_config.update(explicit_generation_config)
+
+    for key, value in model_params.items():
+        if key in {"json_schema", "generationConfig", "reasoning_effort", "max_depth"}:
+            continue
+        if key in GEMINI_TOP_LEVEL_FIELDS:
+            payload[key] = value
+            continue
+        gemini_key = GEMINI_GENERATION_CONFIG_ALIASES.get(key, key)
+        if gemini_key in GEMINI_GENERATION_CONFIG_FIELDS:
+            generation_config[gemini_key] = value
+
+    if before is not None:
+        record_adapter_transformation("merge_model_params.gemini", before, payload)
+
+
+def _coerce_json_schema(schema: Any) -> dict[str, Any] | None:
+    if isinstance(schema, str):
+        try:
+            schema = json.loads(schema)
+        except (TypeError, ValueError):
+            return None
+    if isinstance(schema, dict):
+        return schema
+    return None
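
Reviewer sketch of the allow-list merge in `_payload.py`, condensed so it runs standalone. `merge_sketch` and the shortened `PASSTHROUGH` set are illustrative restatements, not the module's functions, and `"example-model"` and `"made_up_key"` are placeholder values. Unlike `merge_openai_chat_model_params`, the sketch returns a new dict instead of mutating the payload in place.

```python
import json

# Condensed stand-in for the pass-through allow-list in this module.
PASSTHROUGH = frozenset({"top_p", "seed", "stop", "response_format"})


def merge_sketch(payload: dict, model_params: dict) -> dict:
    """Illustrative restatement of the OpenAI-style merge; returns a new dict."""
    merged = dict(payload)
    schema = model_params.get("json_schema")
    if isinstance(schema, str):
        try:
            schema = json.loads(schema)
        except ValueError:
            schema = None
    # Translate the portable key unless a response_format is already present.
    if (
        isinstance(schema, dict)
        and model_params.get("response_format") is None
        and "response_format" not in merged
    ):
        merged["response_format"] = {
            "type": "json_schema",
            "json_schema": {"name": "structured_output", "schema": schema, "strict": False},
        }
    # Only allow-listed keys pass through; everything else is ignored.
    for key, value in model_params.items():
        if key in PASSTHROUGH:
            merged[key] = value
    return merged


result = merge_sketch(
    {"model": "example-model", "temperature": 0.7},
    {
        "json_schema": '{"type": "object"}',
        "reasoning_effort": "high",  # dropped: not a Chat Completions field
        "seed": 7,                   # passed through
        "made_up_key": True,         # unknown: ignored
    },
)
print(json.dumps(result, indent=2))
```

The resulting payload carries `response_format` and `seed` but none of `json_schema`, `reasoning_effort`, or `made_up_key`, which is the behavior the module comment describes as avoiding HTTP 400 responses from OpenAI-compatible providers.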
--- a/llm_connect/claude_code.py
+++ b/llm_connect/claude_code.py
@@ -1,5 +1,5 @@
 """
-Claude Code CLI adapter — runs the ``claude`` CLI as a subprocess.
+Claude Code CLI adapter - runs the ``claude`` CLI as a subprocess.
 """

 import asyncio
@@ -9,21 +9,23 @@ import subprocess
 from pathlib import Path
 from typing import Optional

-from llm_connect.adapter import LLMAdapter
-from llm_connect.models import RunConfig, LLMResponse
-from llm_connect.config import LLMConfig
-from llm_connect._token_estimator import estimate_tokens
-from llm_connect.exceptions import (
-    LLMSubprocessError,
-    LLMTimeoutError,
+from llm_connect._diagnostics import (
+    record_adapter_transformation,
+    record_provider_request,
+    record_provider_response,
 )
+from llm_connect._token_estimator import estimate_tokens
+from llm_connect.adapter import LLMAdapter
+from llm_connect.config import LLMConfig
+from llm_connect.exceptions import LLMSubprocessError, LLMTimeoutError
+from llm_connect.models import LLMResponse, RunConfig


 class ClaudeCodeAdapter(LLMAdapter):
    """LLM adapter that shells out to the ``claude`` CLI with ``--print``.

-    The compiled prompt is piped via **stdin** to avoid shell argument
-    length limits (compiled prompts can exceed 30 KB).
+    The compiled prompt is piped via stdin to avoid shell argument length
+    limits. Compiled prompts can exceed 30 KB.
    """

    def __init__(
@@ -36,13 +38,14 @@ class ClaudeCodeAdapter(LLMAdapter):
        self._cli_path = cli_path or self._resolve_cli_path()
        self._model = model

-    # ── LLMAdapter interface ────────────────────────────────────────
+    # LLMAdapter interface

    def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse:
        self._preflight_budget(config)
        cmd = self._build_command(config)

        timeout = config.timeout_seconds or self._config.timeout_seconds
+        record_provider_request(command=cmd, payload={"stdin": prompt})

        try:
            result = subprocess.run(
@@ -58,6 +61,10 @@ class ClaudeCodeAdapter(LLMAdapter):
                cause=exc,
            ) from exc

+        record_provider_response(
+            status=result.returncode,
+            body={"stdout": result.stdout, "stderr": result.stderr},
+        )
        if result.returncode != 0:
            raise LLMSubprocessError(
                f"claude CLI exited with code {result.returncode}",
@@ -92,6 +99,7 @@ class ClaudeCodeAdapter(LLMAdapter):
        cmd = self._build_command(config)

        timeout = config.timeout_seconds or self._config.timeout_seconds
+        record_provider_request(command=cmd, payload={"stdin": prompt})

        try:
            proc = await asyncio.create_subprocess_exec(
@@ -110,14 +118,20 @@ class ClaudeCodeAdapter(LLMAdapter):
                cause=exc,
            ) from exc

+        stdout = stdout_bytes.decode()
+        stderr = stderr_bytes.decode()
+        record_provider_response(
+            status=proc.returncode,
+            body={"stdout": stdout, "stderr": stderr},
+        )
        if proc.returncode != 0:
            raise LLMSubprocessError(
                f"claude CLI exited with code {proc.returncode}",
                return_code=proc.returncode,
-                stderr=stderr_bytes.decode(),
+                stderr=stderr,
            )

-        content = _unwrap_cli_json_envelope(stdout_bytes.decode(), config)
+        content = _unwrap_cli_json_envelope(stdout, config)
        prompt_tokens = estimate_tokens(prompt)
        completion_tokens = estimate_tokens(content)

@@ -192,33 +206,17 @@ def _json_schema_arg(config: RunConfig) -> str | None:
    return None


-# Envelope field names Claude Code's `--output-format json` is known to use
-# for the model's primary textual response. Used as a fall-back when no field
-# carries a JSON-parseable payload (e.g. plain prose generation).
+# Envelope field names Claude Code's --output-format json is known to use for
+# the model's primary textual response. Used as a fallback when no field carries
+# a JSON-parseable payload, such as plain prose generation.
 _ENVELOPE_TEXT_FIELDS = ("result", "result_text", "content", "text", "output")


 def _unwrap_cli_json_envelope(stdout: str, config: RunConfig) -> str:
    """Extract the model's payload from Claude CLI's --output-format json envelope.

-    Only runs when --json-schema was set (the only code path that adds
-    --output-format json to the CLI invocation). Other callers keep the raw
-    stdout behavior unchanged.
-
-    Strategy: when --json-schema is set the caller wants JSON back, so prefer
-    any envelope field whose value is itself valid JSON (dict, list, or a
-    string that parses as JSON). This handles two observed envelope shapes:
-
-    1. Short prompts where the model emits the structured payload directly
-       in the `result` field as a JSON-encoded string.
-    2. Longer prompts where the model emits a conversational preamble in
-       `result` and the schema-enforced JSON in a separate field (the exact
-       field name varies across CLI versions).
-
-    Fall back to the first text field only when no JSON-bearing field exists,
-    so non-schema callers via this code path still see the model's prose.
-    Surface the raw envelope as a last resort so the operator can see what
-    shape arrived and extend the strategy.
+    Only runs when --json-schema was set. Other callers keep the raw stdout
+    behavior unchanged.
    """
    if not _json_schema_arg(config):
        return stdout
@@ -234,25 +232,20 @@ def _unwrap_cli_json_envelope(stdout: str, config: RunConfig) -> str:

    json_payload = _find_json_payload(envelope)
    if json_payload is not None:
-        return json_payload
+        return _record_unwrap(stdout, json_payload)

    for key in _ENVELOPE_TEXT_FIELDS:
        value = envelope.get(key)
        if isinstance(value, str):
-            return value
+            return _record_unwrap(stdout, value)
        if isinstance(value, (dict, list)):
-            return json.dumps(value)
+            return _record_unwrap(stdout, json.dumps(value))

    return stdout


 def _find_json_payload(envelope: dict) -> str | None:
-    """Return the first envelope value that represents valid JSON.
-
-    Insertion order is preserved by Python dicts, so this prefers fields the
-    CLI lists earliest in its envelope. Skips obvious metadata keys (cost,
-    usage, timing) so we never accidentally pick a numeric or telemetry value.
-    """
+    """Return the first envelope value that represents valid JSON."""
    for key, value in envelope.items():
        if key in _ENVELOPE_METADATA_KEYS:
            continue
@@ -270,8 +263,27 @@ def _find_json_payload(envelope: dict) -> str | None:


 # Envelope keys that carry telemetry, never the model payload.
-_ENVELOPE_METADATA_KEYS = frozenset({
-    "type", "subtype", "model", "usage", "total_cost_usd", "cost_usd",
-    "duration_ms", "duration_api_ms", "num_turns", "session_id",
-    "is_error", "stop_reason", "permission_denials", "uuid",
-})
+_ENVELOPE_METADATA_KEYS = frozenset(
+    {
+        "type",
+        "subtype",
+        "model",
+        "usage",
+        "total_cost_usd",
+        "cost_usd",
+        "duration_ms",
+        "duration_api_ms",
+        "num_turns",
+        "session_id",
+        "is_error",
+        "stop_reason",
+        "permission_denials",
+        "uuid",
+    }
+)
+
+
+def _record_unwrap(stdout: str, content: str) -> str:
+    if content != stdout:
+        record_adapter_transformation("unwrap_cli_envelope", stdout, content)
+    return content
--- a/llm_connect/gemini.py
+++ b/llm_connect/gemini.py
@@ -9,6 +9,7 @@ from llm_connect.adapter import LLMAdapter
 from llm_connect.models import RunConfig, LLMResponse
 from llm_connect.config import resolve_api_key, find_project_root
 from llm_connect._http import post_json
+from llm_connect._payload import merge_gemini_model_params
 from llm_connect.exceptions import LLMConfigurationError

 _DEFAULT_MODEL = "gemini-2.5-flash"
@@ -74,6 +75,8 @@ class GeminiAdapter(LLMAdapter):
                "maxOutputTokens": config.max_tokens,
            },
        }
+        if config.model_params:
+            merge_gemini_model_params(payload, config.model_params)

        url = f"{_API_BASE}/models/{model}:generateContent?key={self._api_key}"

--- a/llm_connect/openai.py
+++ b/llm_connect/openai.py
@@ -9,6 +9,7 @@ from llm_connect.adapter import LLMAdapter
 from llm_connect.models import RunConfig, LLMResponse
 from llm_connect.config import resolve_api_key, find_project_root
 from llm_connect._http import post_json
+from llm_connect._payload import merge_openai_chat_model_params
 from llm_connect.exceptions import (
    LLMConfigurationError,
    LLMAPIError,
@@ -65,6 +66,8 @@ class OpenAIAdapter(LLMAdapter):
            "temperature": config.temperature,
            "max_tokens": config.max_tokens,
        }
+        if config.model_params:
+            merge_openai_chat_model_params(payload, config.model_params)

        headers = {
            "Authorization": f"Bearer {self._api_key}",
--- a/llm_connect/openrouter.py
+++ b/llm_connect/openrouter.py
@@ -1,19 +1,16 @@
 """
-OpenRouter adapter — calls the OpenAI-compatible chat completions API.
+OpenRouter adapter - calls the OpenAI-compatible chat completions API.
 """

 import time
-from typing import Optional, Dict, Any
+from typing import Any, Dict, Optional

-from llm_connect.adapter import LLMAdapter
-from llm_connect.models import RunConfig, LLMResponse
-from llm_connect.config import LLMConfig, resolve_api_key, find_project_root
 from llm_connect._http import post_json
-from llm_connect.exceptions import (
-    LLMConfigurationError,
-    LLMAPIError,
-    LLMRateLimitError,
-)
+from llm_connect._payload import merge_openai_chat_model_params
+from llm_connect.adapter import LLMAdapter
+from llm_connect.config import LLMConfig, find_project_root, resolve_api_key
+from llm_connect.exceptions import LLMAPIError, LLMRateLimitError
+from llm_connect.models import LLMResponse, RunConfig

 _DEFAULT_MODEL = "anthropic/claude-sonnet-4"

@@ -38,10 +35,10 @@ class OpenRouterAdapter(LLMAdapter):
    ):
        self._config = config or LLMConfig()
        # Track whether the model was explicitly supplied (constructor or
-        # LLMConfig). Comparing self._model to _DEFAULT_MODEL is not enough —
+        # LLMConfig). Comparing self._model to _DEFAULT_MODEL is not enough:
        # callers who pass --model anthropic/claude-sonnet-4 happen to match
        # the default and would otherwise be misrouted to RunConfig.model_name
-        # (which defaults to "gpt-4" — quietly sending every call to OpenAI's
+        # (which defaults to "gpt-4", quietly sending every call to OpenAI's
        # gpt-4 model, which is what broke the activity-core CUST-WP-0045
        # canary on 2026-06-02).
        self._explicit_model = model is not None or self._config.model is not None
@@ -51,7 +48,6 @@ class OpenRouterAdapter(LLMAdapter):
        self._extra_headers = extra_headers or {}
        self._max_retries = max_retries if max_retries is not None else self._config.max_retries

-        # Resolve API key
        root = find_project_root()
        key_file_paths = [root / "apikey-openrouter.txt"] if root else []
        self._api_key = resolve_api_key(
@@ -60,12 +56,12 @@ class OpenRouterAdapter(LLMAdapter):
            key_file_paths=key_file_paths,
        )

-    # ── LLMAdapter interface ────────────────────────────────────────
+    # LLMAdapter interface

    def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse:
        self._preflight_budget(config)
        # Explicit constructor/LLMConfig model wins; only fall back to the
-        # per-call RunConfig.model_name when the adapter wasn't told what to
+        # per-call RunConfig.model_name when the adapter was not told what to
        # use. RunConfig.model_name defaults to "gpt-4", so falling back
        # unconditionally would silently misroute callers.
        if self._explicit_model:
@@ -85,7 +81,7 @@ class OpenRouterAdapter(LLMAdapter):
            "max_tokens": config.max_tokens,
        }
        if config.model_params:
-            _merge_model_params(payload, config.model_params)
+            merge_openai_chat_model_params(payload, config.model_params)

        headers = {
            "Authorization": f"Bearer {self._api_key}",
@@ -97,7 +93,6 @@ class OpenRouterAdapter(LLMAdapter):
        data = self._post_with_retries(url, payload, headers, config.timeout_seconds)
        latency = time.time() - start

-        # Parse response
        choice = data.get("choices", [{}])[0]
        content = choice.get("message", {}).get("content", "")
        finish_reason = choice.get("finish_reason", "stop")
@@ -130,7 +125,7 @@ class OpenRouterAdapter(LLMAdapter):
            return False
        return True

-    # ── Internals ───────────────────────────────────────────────────
+    # Internals

    def _post_with_retries(
        self,
@@ -154,68 +149,3 @@ class OpenRouterAdapter(LLMAdapter):
                else:
                    raise
        raise last_exc  # type: ignore[misc]
-
-
-# OpenAI Chat Completions fields that map straight through from model_params.
-# Anything not in this set is provider-specific and must be either translated
-# or dropped — we never blind-merge into the payload, because OpenRouter
-# rejects unknown top-level fields with HTTP 400.
-_OPENAI_PASSTHROUGH_FIELDS = frozenset({
-    "top_p", "n", "stream", "stop", "presence_penalty",
-    "frequency_penalty", "logit_bias", "user", "seed",
-    "tools", "tool_choice", "response_format",
-    "logprobs", "top_logprobs", "parallel_tool_calls",
-})
-
-# Provider-specific model_params keys that have no OpenAI Chat Completions
-# equivalent and must be silently dropped to keep payloads valid.
-_DROPPED_NON_OPENAI_FIELDS = frozenset({
-    "reasoning_effort",  # Claude CLI / Anthropic-specific
-    "max_depth",         # llm-connect's own depth knob
-    "claude_cli_path",   # adapter wiring leak
-    "json_schema",       # translated below into response_format
-})
-
-
-def _merge_model_params(payload: Dict[str, Any], model_params: Dict[str, Any]) -> None:
-    """Merge RunConfig.model_params into an OpenAI Chat Completions payload.
-
-    Pass-through whitelisted OpenAI keys, translate json_schema into the
-    proper response_format wrapper, drop known provider-specific fields,
-    and ignore anything else rather than letting it through and triggering
-    a 400 from OpenRouter (the failure mode that hit CUST-WP-0045 on
-    2026-06-02 — reasoning_effort and a top-level json_schema were merged
-    into the body and the API rejected both).
-    """
-    schema = model_params.get("json_schema")
-    if schema is not None and "response_format" not in payload:
-        if isinstance(schema, str):
-            try:
-                import json as _json
-                schema = _json.loads(schema)
-            except (ValueError, TypeError):
-                schema = None
-        if isinstance(schema, dict):
-            # strict=False: OpenAI's strict mode requires additionalProperties
-            # to be false on every object and every property in the required
-            # list. Most application-supplied schemas are not written that
-            # way (the activity-core daily-triage schema, for example, has
-            # neither). With strict=False, OpenRouter still honours the
-            # schema as a soft constraint and the model's output remains
-            # structured. Callers can opt back into strict by including
-            # `strict: true` themselves in a custom `response_format`.
-            payload["response_format"] = {
-                "type": "json_schema",
-                "json_schema": {
-                    "name": "structured_output",
-                    "schema": schema,
-                    "strict": False,
-                },
-            }
-
-    for key, value in model_params.items():
-        if key in _DROPPED_NON_OPENAI_FIELDS:
-            continue
-        if key in _OPENAI_PASSTHROUGH_FIELDS:
-            payload[key] = value
-        # else: silently drop unknown keys rather than risk a 400.
--- a/llm_connect/replay.py
+++ b/llm_connect/replay.py
@@ -0,0 +1,121 @@
+"""Replay llm-connect audit records without making provider calls."""
+
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+from typing import Any
+
+from llm_connect.claude_code import _unwrap_cli_json_envelope
+from llm_connect.models import RunConfig
+
+
+def parse_audit_record(record: dict[str, Any]) -> dict[str, Any]:
+    """Parse the recorded provider response and compare it to saved content."""
+
+    config = RunConfig.from_dict(record.get("config", {}))
+    provider = record.get("provider") or _infer_provider(record)
+    provider_response = record.get("provider_response") or {}
+    body = provider_response.get("body")
+    parsed_content = _parse_provider_response(provider, body, config)
+    recorded_content = record.get("parsed_content")
+    schema_check = _check_structured_output(parsed_content, config.model_params.get("json_schema"))
+
+    return {
+        "provider": provider,
+        "parsed_content": parsed_content,
+        "matches_recorded_content": parsed_content == recorded_content,
+        "structured_output": schema_check,
+    }
+
+
+def main(argv: list[str] | None = None) -> None:
+    parser = argparse.ArgumentParser(
+        prog="python -m llm_connect.replay",
+        description="Replay parsing for an llm-connect audit JSON file.",
+    )
+    parser.add_argument("audit_file", help="Path to an audit JSON file")
+    parser.add_argument("--json", action="store_true", help="Print the full replay report")
+    args = parser.parse_args(argv)
+
+    record = json.loads(Path(args.audit_file).read_text(encoding="utf-8"))
+    report = parse_audit_record(record)
+    if args.json:
+        print(json.dumps(report, indent=2, sort_keys=True))
+    else:
+        print(report["parsed_content"])
+
+
+def _parse_provider_response(provider: str | None, body: Any, config: RunConfig) -> str:
+    if provider in {"openai", "openrouter"}:
+        if isinstance(body, dict):
+            choice = (body.get("choices") or [{}])[0]
+            # A provider may return "content": null (e.g. tool calls); keep the str contract.
+            return choice.get("message", {}).get("content") or ""
+        return ""
+
+    if provider == "gemini":
+        if isinstance(body, dict):
+            candidates = body.get("candidates") or []
+            if not candidates:
+                return ""
+            parts = candidates[0].get("content", {}).get("parts", [])
+            return "".join(part.get("text", "") for part in parts)
+        return ""
+
+    if provider == "claude-code":
+        if isinstance(body, dict):
+            return _unwrap_cli_json_envelope(body.get("stdout", ""), config)
+        return ""
+
+    if isinstance(body, str):
+        return body
+    if body is None:
+        return ""
+    return json.dumps(body)
+
+
+def _infer_provider(record: dict[str, Any]) -> str | None:
+    request = record.get("provider_request") or {}
+    url = request.get("url", "")
+    if "openrouter.ai" in url:
+        return "openrouter"
+    if "api.openai.com" in url:
+        return "openai"
+    if "generativelanguage.googleapis.com" in url:
+        return "gemini"
+    if request.get("command"):
+        return "claude-code"
+    return None
+
+
+def _check_structured_output(content: str, schema: Any) -> dict[str, Any]:
+    if not schema:
+        return {"checked": False}
+    if isinstance(schema, str):
+        try:
+            schema = json.loads(schema)
+        except ValueError as exc:
+            return {"checked": True, "valid": False, "error": f"invalid schema JSON: {exc}"}
+    if not isinstance(schema, dict):
+        return {"checked": True, "valid": False, "error": "schema must be an object"}
+
+    try:
+        parsed = json.loads(content)
+    except ValueError as exc:
+        return {"checked": True, "valid": False, "error": f"invalid output JSON: {exc}"}
+
+    missing = []
+    if schema.get("type") == "object":
+        if not isinstance(parsed, dict):
+            return {"checked": True, "valid": False, "error": "output is not an object"}
+        for key in schema.get("required", []):
+            if key not in parsed:
+                missing.append(key)
+    if missing:
+        return {"checked": True, "valid": False, "missing_required": missing}
+    return {"checked": True, "valid": True}
+
+
+if __name__ == "__main__":
+    main()
--- a/llm_connect/server.py
+++ b/llm_connect/server.py
@@ -21,13 +21,21 @@ Usage (CLI)::
 """

 import argparse
+import datetime as _dt
 import json
+import os
+import re
 import threading
-from http.server import BaseHTTPRequestHandler, HTTPServer
+import time
+import uuid
+from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
+from pathlib import Path
 from typing import Optional
+from urllib.parse import parse_qs, urlsplit

+from llm_connect._diagnostics import capture_diagnostics
 from llm_connect.adapter import LLMAdapter
-from llm_connect.models import RunConfig
+from llm_connect.models import LLMResponse, RunConfig


 class _Handler(BaseHTTPRequestHandler):
@@ -39,7 +47,8 @@ class _Handler(BaseHTTPRequestHandler):
    # ── GET ────────────────────────────────────────────────────────

    def do_GET(self):
-        if self.path == "/health":
+        parsed = urlsplit(self.path)
+        if parsed.path == "/health":
            self._respond(200, {"status": "ok"})
        else:
            self._respond(404, {"error": "not found"})
@@ -47,10 +56,13 @@ class _Handler(BaseHTTPRequestHandler):
    # ── POST ───────────────────────────────────────────────────────

    def do_POST(self):
-        if self.path != "/execute":
+        parsed = urlsplit(self.path)
+        if parsed.path != "/execute":
            self._respond(404, {"error": "not found"})
            return

+        debug_enabled = _debug_requested(parsed.query)
+        audit_dir = os.environ.get("LLM_CONNECT_AUDIT_DIR")
        length = int(self.headers.get("Content-Length", 0))
        raw = self.rfile.read(length)
        try:
@@ -70,9 +82,19 @@ class _Handler(BaseHTTPRequestHandler):
            return
        config = RunConfig.from_dict(cfg)

+        start = time.time()
+        diagnostics_enabled = debug_enabled or bool(audit_dir)
        try:
+            with capture_diagnostics(diagnostics_enabled) as diagnostics:
                response = self.server.adapter.execute_prompt(prompt, config)  # type: ignore[attr-defined]
-            self._respond(200, response.to_dict())
+            latency = time.time() - start
+            body = response.to_dict()
+            debug = diagnostics.to_dict() if diagnostics is not None else None
+            if debug_enabled and debug is not None:
+                body["debug"] = debug
+            if audit_dir:
+                _write_audit_record(audit_dir, prompt, config, response, debug, latency)
+            self._respond(200, body)
        except Exception as exc:
            self._respond(500, {"error": str(exc)})

@@ -102,7 +124,7 @@ class LLMServer:
        host: str = "127.0.0.1",
        port: int = 8080,
    ) -> None:
-        self._httpd = HTTPServer((host, port), _Handler)
+        self._httpd = ThreadingHTTPServer((host, port), _Handler)
        self._httpd.adapter = adapter  # type: ignore[attr-defined]
        self._thread: Optional[threading.Thread] = None

@@ -138,6 +160,55 @@ def _build_adapter(provider: str, model: Optional[str]) -> LLMAdapter:
    return create_adapter(provider, model=model)


+def _debug_requested(query: str) -> bool:
+    env = os.environ.get("LLM_CONNECT_DEBUG", "")
+    if _truthy(env):
+        return True
+    values = parse_qs(query).get("debug", [])
+    return any(_truthy(value) for value in values)
+
+
+def _truthy(value: str) -> bool:
+    return value.strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _write_audit_record(
+    audit_dir: str,
+    prompt: str,
+    config: RunConfig,
+    response: LLMResponse,
+    debug: Optional[dict],
+    latency_seconds: float,
+) -> None:
+    target_dir = Path(audit_dir)
+    target_dir.mkdir(parents=True, exist_ok=True)
+
+    now = _dt.datetime.now(_dt.timezone.utc)
+    response_id = str(response.metadata.get("response_id") or uuid.uuid4().hex)
+    filename = f"{now.strftime('%Y%m%dT%H%M%S%fZ')}-{_safe_filename(response_id)}.json"
+    diagnostics = debug or {}
+    record = {
+        "timestamp": now.isoformat().replace("+00:00", "Z"),
+        "prompt": prompt,
+        "config": config.to_dict(),
+        "provider": response.metadata.get("provider"),
+        "provider_request": diagnostics.get("provider_request"),
+        "provider_response": diagnostics.get("provider_response"),
+        "adapter_transformations": diagnostics.get("adapter_transformations", []),
+        "parsed_content": response.content,
+        "latency_seconds": round(latency_seconds, 3),
+        "response": response.to_dict(),
+    }
+    (target_dir / filename).write_text(
+        json.dumps(record, indent=2, sort_keys=True),
+        encoding="utf-8",
+    )
+
+
+def _safe_filename(value: str) -> str:
+    return re.sub(r"[^A-Za-z0-9_.-]+", "-", value).strip("-") or "response"
+
+
 def main(argv=None) -> None:
    parser = argparse.ArgumentParser(
        prog="python -m llm_connect.server",
--- a/tests/test_payload.py
+++ b/tests/test_payload.py
@@ -0,0 +1,81 @@
+from llm_connect._payload import merge_gemini_model_params, merge_openai_chat_model_params
+
+
+STRUCTURED_SCHEMA = {
+    "type": "object",
+    "properties": {
+        "summary": {"type": "string"},
+        "recommendations": {"type": "array", "items": {"type": "string"}},
+    },
+    "required": ["summary", "recommendations"],
+}
+
+
+ACTIVITY_CORE_MODEL_PARAMS = {
+    "reasoning_effort": "medium",
+    "max_depth": 4,
+    "json_schema": STRUCTURED_SCHEMA,
+    "top_p": 0.8,
+}
+
+
+def test_openai_chat_model_params_translate_activity_core_shape():
+    payload = {
+        "model": "gpt-4.1-mini",
+        "messages": [{"role": "user", "content": "triage"}],
+        "temperature": 0.2,
+        "max_tokens": 200,
+    }
+
+    merge_openai_chat_model_params(payload, ACTIVITY_CORE_MODEL_PARAMS)
+
+    assert payload["response_format"] == {
+        "type": "json_schema",
+        "json_schema": {
+            "name": "structured_output",
+            "schema": STRUCTURED_SCHEMA,
+            "strict": False,
+        },
+    }
+    assert payload["top_p"] == 0.8
+    assert "reasoning_effort" not in payload
+    assert "max_depth" not in payload
+    assert "json_schema" not in payload
+
+
+def test_openai_chat_model_params_preserve_explicit_response_format():
+    explicit = {
+        "type": "json_schema",
+        "json_schema": {
+            "name": "custom",
+            "schema": STRUCTURED_SCHEMA,
+            "strict": True,
+        },
+    }
+    payload = {"model": "gpt-4.1-mini", "messages": []}
+
+    merge_openai_chat_model_params(
+        payload,
+        {"json_schema": STRUCTURED_SCHEMA, "response_format": explicit},
+    )
+
+    assert payload["response_format"] == explicit
+
+
+def test_gemini_model_params_translate_activity_core_shape():
+    payload = {
+        "contents": [{"role": "user", "parts": [{"text": "triage"}]}],
+        "generationConfig": {
+            "temperature": 0.2,
+            "maxOutputTokens": 200,
+        },
+    }
+
+    merge_gemini_model_params(payload, ACTIVITY_CORE_MODEL_PARAMS)
+
+    assert payload["generationConfig"]["responseMimeType"] == "application/json"
+    assert payload["generationConfig"]["responseSchema"] == STRUCTURED_SCHEMA
+    assert payload["generationConfig"]["topP"] == 0.8
+    assert "reasoning_effort" not in payload
+    assert "max_depth" not in payload
+    assert "json_schema" not in payload
--- a/tests/test_replay.py
+++ b/tests/test_replay.py
@@ -0,0 +1,62 @@
+from llm_connect.replay import parse_audit_record
+
+
+STRUCTURED_SCHEMA = {
+    "type": "object",
+    "properties": {
+        "summary": {"type": "string"},
+        "recommendations": {"type": "array", "items": {"type": "string"}},
+    },
+    "required": ["summary", "recommendations"],
+}
+
+
+def test_replay_parses_openai_style_provider_response():
+    record = {
+        "provider": "openrouter",
+        "config": {"model_params": {"json_schema": STRUCTURED_SCHEMA}},
+        "provider_response": {
+            "status": 200,
+            "body": {
+                "choices": [
+                    {
+                        "message": {
+                            "content": '{"summary":"ok","recommendations":[]}'
+                        }
+                    }
+                ]
+            },
+        },
+        "parsed_content": '{"summary":"ok","recommendations":[]}',
+    }
+
+    report = parse_audit_record(record)
+
+    assert report["parsed_content"] == '{"summary":"ok","recommendations":[]}'
+    assert report["matches_recorded_content"] is True
+    assert report["structured_output"] == {"checked": True, "valid": True}
+
+
+def test_replay_reuses_claude_code_envelope_unwrapper():
+    record = {
+        "provider": "claude-code",
+        "config": {"model_params": {"json_schema": STRUCTURED_SCHEMA}},
+        "provider_response": {
+            "status": 0,
+            "body": {
+                "stdout": (
+                    '{"type":"result","result":"prose",'
+                    '"structured_result":"{\\"summary\\":\\"ok\\",'
+                    '\\"recommendations\\":[]}"}'
+                ),
+                "stderr": "",
+            },
+        },
+        "parsed_content": '{"summary":"ok","recommendations":[]}',
+    }
+
+    report = parse_audit_record(record)
+
+    assert report["parsed_content"] == '{"summary":"ok","recommendations":[]}'
+    assert report["matches_recorded_content"] is True
+    assert report["structured_output"] == {"checked": True, "valid": True}
--- a/tests/test_server.py
+++ b/tests/test_server.py
@@ -2,14 +2,22 @@
 Tests for LLMServer HTTP serve mode (FR-1).
 """

+import threading
+import time
+from concurrent.futures import ThreadPoolExecutor
 import json
 import urllib.error
 import urllib.request

 import pytest

+from llm_connect._diagnostics import (
+    record_adapter_transformation,
+    record_provider_request,
+    record_provider_response,
+)
 from llm_connect.adapter import MockLLMAdapter, ErrorLLMAdapter
-from llm_connect.models import RunConfig
+from llm_connect.models import LLMResponse, RunConfig
 from llm_connect.server import LLMServer


@@ -45,6 +53,35 @@ def _post(url: str, body: dict) -> tuple[int, dict]:
        return exc.code, json.loads(exc.read())


+class DiagnosticLLMAdapter(MockLLMAdapter):
+    def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse:
+        record_provider_request(
+            url="https://provider.example/v1/chat",
+            payload={"prompt": prompt, "model": config.model_name},
+            headers={"Authorization": "Bearer secret-token"},
+        )
+        response = super().execute_prompt(prompt, config)
+        response.metadata["provider"] = "diagnostic"
+        response.metadata["response_id"] = "diag-response"
+        record_provider_response(status=200, body={"id": "diag-response", "content": response.content})
+        record_adapter_transformation(
+            "diagnostic_transform",
+            {"before": prompt},
+            {"after": response.content},
+        )
+        return response
+
+
+class BarrierLLMAdapter(MockLLMAdapter):
+    def __init__(self):
+        super().__init__(mock_response="parallel")
+        self._barrier = threading.Barrier(2)
+
+    def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse:
+        self._barrier.wait(timeout=2.0)
+        return super().execute_prompt(prompt, config)
+
+
 class TestHealth:
    def test_health_returns_200(self, server):
        status, body = _get(f"http://127.0.0.1:{server.port}/health")
@@ -65,6 +102,7 @@ class TestExecute:
        assert status == 200
        assert body["content"] == "hello world"
        assert body["finish_reason"] == "stop"
+        assert "debug" not in body

    def test_response_includes_usage(self, server):
        status, body = _post(
@@ -150,3 +188,86 @@ class TestExecute:
        )
        assert status == 400
        assert "config" in body["error"]
+
+    def test_debug_query_returns_diagnostics(self):
+        s = LLMServer(adapter=DiagnosticLLMAdapter(mock_response="debug body"), port=0)
+        s.start()
+        try:
+            status, body = _post(
+                f"http://127.0.0.1:{s.port}/execute?debug=1",
+                {"prompt": "inspect", "config": {"model_name": "diagnostic-model"}},
+            )
+        finally:
+            s.stop()
+
+        assert status == 200
+        assert body["content"] == "debug body"
+        debug = body["debug"]
+        assert debug["provider_request"]["payload"] == {
+            "prompt": "inspect",
+            "model": "diagnostic-model",
+        }
+        assert debug["provider_request"]["headers_redacted"]["Authorization"] == "Bearer <redacted>"
+        assert debug["provider_response"]["status"] == 200
+        assert debug["adapter_transformations"][0]["step"] == "diagnostic_transform"
+
+    def test_debug_env_returns_diagnostics(self, monkeypatch):
+        monkeypatch.setenv("LLM_CONNECT_DEBUG", "1")
+        s = LLMServer(adapter=DiagnosticLLMAdapter(mock_response="debug body"), port=0)
+        s.start()
+        try:
+            status, body = _post(
+                f"http://127.0.0.1:{s.port}/execute",
+                {"prompt": "inspect"},
+            )
+        finally:
+            s.stop()
+
+        assert status == 200
+        assert "debug" in body
+
+    def test_audit_dir_records_replayable_call(self, monkeypatch, tmp_path):
+        monkeypatch.setenv("LLM_CONNECT_AUDIT_DIR", str(tmp_path))
+        s = LLMServer(adapter=DiagnosticLLMAdapter(mock_response="audit body"), port=0)
+        s.start()
+        try:
+            status, body = _post(
+                f"http://127.0.0.1:{s.port}/execute",
+                {"prompt": "audit me", "config": {"model_name": "audit-model"}},
+            )
+        finally:
+            s.stop()
+
+        assert status == 200
+        assert "debug" not in body
+        files = list(tmp_path.glob("*.json"))
+        assert len(files) == 1
+        record = json.loads(files[0].read_text(encoding="utf-8"))
+        assert record["prompt"] == "audit me"
+        assert record["config"]["model_name"] == "audit-model"
+        assert record["parsed_content"] == "audit body"
+        assert record["provider_request"]["headers_redacted"]["Authorization"] == "Bearer <redacted>"
+        assert record["provider_response"]["body"]["id"] == "diag-response"
+        assert record["latency_seconds"] >= 0
+
+    def test_execute_requests_run_concurrently(self):
+        s = LLMServer(adapter=BarrierLLMAdapter(), port=0)
+        s.start()
+        try:
+            start = time.monotonic()
+            with ThreadPoolExecutor(max_workers=2) as pool:
+                futures = [
+                    pool.submit(
+                        _post,
+                        f"http://127.0.0.1:{s.port}/execute",
+                        {"prompt": f"request {idx}"},
+                    )
+                    for idx in range(2)
+                ]
+                results = [future.result(timeout=3.0) for future in futures]
+            elapsed = time.monotonic() - start
+        finally:
+            s.stop()
+
+        assert [status for status, _body in results] == [200, 200]
+        assert elapsed < 1.5
--- a/tests/test_structured_output_smoke.py
+++ b/tests/test_structured_output_smoke.py
@@ -0,0 +1,142 @@
+import json
+
+from llm_connect.gemini import GeminiAdapter
+from llm_connect.models import RunConfig
+from llm_connect.openai import OpenAIAdapter
+from llm_connect.openrouter import OpenRouterAdapter
+
+
+STRUCTURED_SCHEMA = {
+    "type": "object",
+    "properties": {
+        "summary": {"type": "string"},
+        "recommendations": {"type": "array", "items": {"type": "string"}},
+    },
+    "required": ["summary", "recommendations"],
+}
+
+
+SMOKE_CONFIG = RunConfig(
+    model_name="gpt-4",
+    temperature=0.1,
+    max_tokens=300,
+    model_params={
+        "reasoning_effort": "medium",
+        "max_depth": 3,
+        "json_schema": STRUCTURED_SCHEMA,
+    },
+)
+
+
+def test_openrouter_structured_output_payload_and_model_routing(monkeypatch):
+    captured: dict[str, object] = {}
+
+    def fake_post_json(url, payload, headers=None, timeout=300):  # noqa: ANN001
+        captured["url"] = url
+        captured["payload"] = payload
+        captured["headers"] = headers
+        captured["timeout"] = timeout
+        return {
+            "id": "or-response",
+            "model": payload["model"],
+            "choices": [
+                {
+                    "message": {
+                        "content": json.dumps(
+                            {"summary": "ok", "recommendations": ["keep payload clean"]}
+                        )
+                    },
+                    "finish_reason": "stop",
+                }
+            ],
+            "usage": {"prompt_tokens": 1, "completion_tokens": 2, "total_tokens": 3},
+        }
+
+    monkeypatch.setattr("llm_connect.openrouter.post_json", fake_post_json)
+    adapter = OpenRouterAdapter(
+        model="anthropic/claude-sonnet-4",
+        api_key="or-test",
+        api_base="https://openrouter.example/api/v1",
+    )
+
+    response = adapter.execute_prompt("Return JSON.", SMOKE_CONFIG)
+    payload = captured["payload"]
+
+    assert response.model == "anthropic/claude-sonnet-4"
+    assert payload["model"] == "anthropic/claude-sonnet-4"
+    assert payload["response_format"]["json_schema"]["schema"] == STRUCTURED_SCHEMA
+    assert payload["response_format"]["json_schema"]["strict"] is False
+    assert "reasoning_effort" not in payload
+    assert "max_depth" not in payload
+    assert "json_schema" not in payload
+
+
+def test_openai_structured_output_payload(monkeypatch):
+    captured: dict[str, object] = {}
+
+    def fake_post_json(url, payload, headers=None, timeout=300):  # noqa: ANN001
+        captured["payload"] = payload
+        return {
+            "id": "oa-response",
+            "model": payload["model"],
+            "choices": [
+                {
+                    "message": {
+                        "content": json.dumps({"summary": "ok", "recommendations": []})
+                    },
+                    "finish_reason": "stop",
+                }
+            ],
+            "usage": {"prompt_tokens": 1, "completion_tokens": 2, "total_tokens": 3},
+        }
+
+    monkeypatch.setattr("llm_connect.openai.post_json", fake_post_json)
+    adapter = OpenAIAdapter(model="gpt-4.1-mini", api_key="sk-test")
+
+    response = adapter.execute_prompt("Return JSON.", SMOKE_CONFIG)
+    payload = captured["payload"]
+
+    assert response.model == "gpt-4.1-mini"
+    assert payload["model"] == "gpt-4.1-mini"
+    assert payload["response_format"]["json_schema"]["schema"] == STRUCTURED_SCHEMA
+    assert "reasoning_effort" not in payload
+    assert "max_depth" not in payload
+    assert "json_schema" not in payload
+
+
+def test_gemini_structured_output_payload(monkeypatch):
+    captured: dict[str, object] = {}
+
+    def fake_post_json(url, payload, headers=None, timeout=300):  # noqa: ANN001
+        captured["url"] = url
+        captured["payload"] = payload
+        return {
+            "candidates": [
+                {
+                    "content": {
+                        "parts": [
+                            {"text": json.dumps({"summary": "ok", "recommendations": []})}
+                        ]
+                    },
+                    "finishReason": "STOP",
+                }
+            ],
+            "usageMetadata": {
+                "promptTokenCount": 1,
+                "candidatesTokenCount": 2,
+                "totalTokenCount": 3,
+            },
+        }
+
+    monkeypatch.setattr("llm_connect.gemini.post_json", fake_post_json)
+    adapter = GeminiAdapter(model="gemini-2.5-flash", api_key="gemini-test")
+
+    response = adapter.execute_prompt("Return JSON.", SMOKE_CONFIG)
+    payload = captured["payload"]
+
+    assert response.model == "gemini-2.5-flash"
+    assert payload["generationConfig"]["responseMimeType"] == "application/json"
+    assert payload["generationConfig"]["responseSchema"] == STRUCTURED_SCHEMA
+    assert "reasoning_effort" not in payload
+    assert "max_depth" not in payload
+    assert "json_schema" not in payload
--- a/workplans/ADHOC-2026-06-02.md
+++ b/workplans/ADHOC-2026-06-02.md
@@ -4,11 +4,11 @@ type: workplan
 title: "Ad hoc — llm-connect lessons from CUST-WP-0045 canary"
 domain: custodian
 repo: llm-connect
-status: ready
+status: finished
 owner: custodian
 topic_slug: custodian
 created: "2026-06-02"
-updated: "2026-06-02"
+updated: "2026-06-03"
 state_hub_workstream_id: "1c936c91-79c7-427d-ab37-9052e8a61cda"
 ---

@@ -38,7 +38,7 @@ workplan.

 ```task
 id: ADHOC-2026-06-02-T01
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "69626e9e-29f1-40f6-8cd2-d38a7e802293"
 ```
@@ -78,7 +78,7 @@ debug field is omitted in normal mode.

 ```task
 id: ADHOC-2026-06-02-T02
-status: todo
+status: done
 priority: low
 state_hub_task_id: "e2b1be30-71f7-4497-9b10-b0f24d37beba"
 ```
@@ -101,7 +101,7 @@ max of their individual latencies, not the sum.

 ```task
 id: ADHOC-2026-06-02-T03
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "da4821f0-a876-44ce-9dc3-f3fc67732d0f"
 ```
@@ -127,7 +127,7 @@ ergonomics.

 ```task
 id: ADHOC-2026-06-02-T04
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "f8a033e6-22ac-4700-b8d2-43a5d76a3751"
 ```
@@ -155,7 +155,7 @@ forbidden top-level fields, schema in the right wrapper).

 ```task
 id: ADHOC-2026-06-02-T05
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "5d53dbb4-b374-45fe-b81c-ff0b222ca74f"
 ```
@@ -188,7 +188,7 @@ bug) before either was merged.

 ```task
 id: ADHOC-2026-06-02-T06
-status: todo
+status: done
 priority: low
 state_hub_task_id: "33fcb951-d7ab-4d3c-8d67-9eebd986c711"
 ```
@@ -210,3 +210,21 @@ would only send OpenAI-valid fields. Codify the contract in

 Done when a new adapter author can read the doc and know what their
 `_merge_model_params` implementation must support.
+
+## Implementation Notes
+
+Completed on 2026-06-03:
+
+- Added opt-in `/execute` debug envelopes via `LLM_CONNECT_DEBUG=1` or
+  `?debug=1`, with redacted provider request/response capture and adapter
+  transformation records.
+- Switched serve mode to `ThreadingHTTPServer` and added a concurrency
+  regression test.
+- Added `LLM_CONNECT_AUDIT_DIR` per-call audit records plus
+  `python -m llm_connect.replay` for parser/unwrapper replay.
+- Extracted shared OpenAI-compatible and Gemini payload translation helpers
+  and wired OpenRouter, OpenAI, and Gemini through them.
+- Added CI-safe structured-output smoke tests that mock provider HTTP calls
+  and assert model routing plus payload shape.
+- Documented the adapter `model_params` contract in
+  `docs/adapter-model-params.md`.
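
The audit records described in the notes above are plain JSON files, so they can be inspected without going through `python -m llm_connect.replay`. The following is a minimal sketch, not part of this commit: `summarize_audit_dir` is a hypothetical helper name, and it assumes only the record fields written by `_write_audit_record` (`provider`, `latency_seconds`, `parsed_content`) and the timestamp-prefixed filenames.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def summarize_audit_dir(audit_dir: str) -> list[dict[str, Any]]:
    """Hypothetical helper: one summary row per audit record, oldest first."""
    summaries = []
    # Audit filenames begin with a fixed-width UTC timestamp, so sorting by
    # name also sorts chronologically.
    for path in sorted(Path(audit_dir).glob("*.json")):
        record = json.loads(path.read_text(encoding="utf-8"))
        summaries.append(
            {
                "file": path.name,
                "provider": record.get("provider"),
                "latency_seconds": record.get("latency_seconds"),
                # parsed_content may be missing or null; treat that as empty.
                "content_chars": len(record.get("parsed_content") or ""),
            }
        )
    return summaries
```

Pointing this at the directory named by `LLM_CONNECT_AUDIT_DIR` gives a quick overview of which providers were called and how long each call took, before replaying any single record in detail.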