Add adaptive cost-quality routing primitives

2026-05-17 21:32:27 +02:00
parent bf86a03c5d
commit c4ad4bb9f2
17 changed files with 2480 additions and 25 deletions
--- a/contracts/functional/baseline-grading.md
+++ b/contracts/functional/baseline-grading.md
@@ -0,0 +1,85 @@
+# Contract: Baseline Grading
+
+**layer:** Functional
+**maturity:** Beta
+**module:** `llm_connect.grading`
+**since:** WP-0004
+
+## Purpose
+
+Compare a candidate adapter response against a caller-chosen baseline response
+and return a normalised quality score suitable for storage in
+`QualityLedger`.
+
+## Public surface
+
+```python
+@dataclass(frozen=True)
+class GradingResult:
+    quality_score: float
+    notes: str
+    grader_id: str
+    baseline_response: LLMResponse
+    candidate_response: LLMResponse
+
+class Judge(Protocol):
+    grader_id: str
+    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...
+
+class BaselineGrader(Protocol):
+    def grade(
+        self,
+        baseline_adapter: LLMAdapter,
+        candidate_adapter: LLMAdapter,
+        prompt: str,
+        run_config: RunConfig,
+    ) -> GradingResult: ...
+
+@dataclass
+class ExactMatchJudge: ...
+
+@dataclass
+class EmbeddingSimilarityJudge: ...
+
+@dataclass
+class LLMJudge: ...
+
+@dataclass
+class PairedGrader: ...
+```
+
+## Invariants
+
+1. `quality_score` is always validated to lie in the closed range `0.0..1.0`.
+2. `GradingResult` always preserves both baseline and candidate responses.
+3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
+   same prompt and run config, then delegates comparison to its `Judge`.
+4. `ExactMatchJudge` returns `1.0` for matched content and `0.0` otherwise.
+5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
+   single batch and clamps cosine similarity into `0.0..1.0`.
+6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
+   `quality_score` and optional `notes`.
+7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
+   and adds a deterministic `seed` model parameter when configured.
+
+## Error contract
+
+| Condition | Exception |
+|-----------|-----------|
+| Invalid `quality_score` | `ValueError` |
+| Empty `grader_id` | `ValueError` |
+| Embedding adapter returns a number of vectors other than two | `ValueError` |
+| LLM judge response is missing parseable JSON | `ValueError` |
+
+## Bias caveats
+
+LLM-as-judge scoring is heuristic and may exhibit:
+
+- Length bias: longer answers can be preferred even when not better.
+- Format bias: familiar formatting can be rewarded independent of correctness.
+- Position bias: prompt order can affect judgement.
+- Self-preference: a judge may favour outputs from its own model family.
+
+Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
+as exact match or embedding similarity before using its observations to drive
+adaptive routing.
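
The score-related invariants and the error contract above can be illustrated with a minimal sketch. It is not the `llm_connect.grading` implementation: the helper names `validate_score`, `clamped_cosine`, and `parse_judge_json` are hypothetical, and the JSON extraction strategy and the handling of zero-length vectors are assumptions made for illustration only.

```python
import json
import math
from typing import Sequence, Tuple


def validate_score(score: object) -> float:
    """Reject anything that is not a number in the closed range 0.0..1.0."""
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"quality_score must be a number, got {score!r}")
    if math.isnan(score) or not 0.0 <= score <= 1.0:
        raise ValueError(f"quality_score must be in 0.0..1.0, got {score!r}")
    return float(score)


def clamped_cosine(vectors: Sequence[Sequence[float]]) -> float:
    """Cosine similarity of exactly two vectors, clamped into 0.0..1.0."""
    if len(vectors) != 2:
        raise ValueError(f"expected two vectors, got {len(vectors)}")
    a, b = vectors
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # Assumption: a zero vector is treated as "no similarity".
        return 0.0
    # Negative similarity is clamped up to 0.0; rounding overshoot down to 1.0.
    return min(1.0, max(0.0, dot / norm))


def parse_judge_json(text: str) -> Tuple[float, str]:
    """Extract `quality_score` and optional `notes` from a judge reply."""
    # Assumption: the JSON object spans the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge response is missing parseable JSON")
    try:
        payload = json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError("judge response is missing parseable JSON") from exc
    if not isinstance(payload, dict) or "quality_score" not in payload:
        raise ValueError("judge JSON lacks quality_score")
    return validate_score(payload["quality_score"]), str(payload.get("notes", ""))
```

Every failure path in the sketch raises `ValueError`, matching the error contract table. Clamping in `clamped_cosine` means anti-correlated embeddings score `0.0` rather than a negative value, so the result always passes `validate_score`.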