# Contract: Baseline Grading

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.grading`
**since:** WP-0004

## Purpose

Compare a candidate adapter response against a caller-chosen baseline response and return a normalised quality score suitable for storage in `QualityLedger`.

## Public surface

```python
@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse


class Judge(Protocol):
    grader_id: str

    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...


class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...


@dataclass
class ExactMatchJudge: ...


@dataclass
class EmbeddingSimilarityJudge: ...


@dataclass
class LLMJudge: ...


@dataclass
class PairedGrader: ...
```

## Invariants

1. `quality_score` is always validated as `0.0..1.0`.
2. `GradingResult` always preserves both baseline and candidate responses.
3. `PairedGrader` runs the baseline adapter and the candidate adapter with the same prompt and run config, then delegates comparison to its `Judge`.
4. `ExactMatchJudge` returns `1.0` for matched content and `0.0` otherwise.
5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a single batch and clamps cosine similarity into `0.0..1.0`.
6. `LLMJudge` uses a fixed rubric prompt and expects JSON with `quality_score` and optional `notes`.
7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker, and adds a deterministic `seed` model parameter when configured.

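
Invariants 1, 5 and 6 can be illustrated with a minimal, standalone sketch. The helper names `validate_quality_score`, `clamped_cosine` and `parse_judge_json` are hypothetical and are not part of `llm_connect.grading`. The sketch also assumes that the judge reply is a bare JSON object and that a zero-length vector counts as no similarity; the real module may handle either case differently.

```python
import json
import math


def validate_quality_score(score: float) -> float:
    """Return the score as a float if it lies in 0.0..1.0, else raise ValueError."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError(f"quality_score must be a number, got {score!r}")
    # The range check also rejects NaN, because comparisons with NaN are False.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"quality_score must be within 0.0..1.0, got {score!r}")
    return float(score)


def clamped_cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two vectors, clamped into 0.0..1.0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # assumption: a zero vector is treated as no similarity
    # Raw cosine lies in -1.0..1.0; negative values are clamped to 0.0.
    return min(1.0, max(0.0, dot / norm))


def parse_judge_json(text: str) -> tuple[float, str]:
    """Extract (quality_score, notes) from a judge reply that should be JSON."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("judge response is missing parseable JSON") from exc
    if not isinstance(payload, dict) or "quality_score" not in payload:
        raise ValueError("judge response lacks a quality_score field")
    score = validate_quality_score(payload["quality_score"])
    notes = str(payload.get("notes", ""))  # notes is optional
    return score, notes
```

Routing every score through one validator keeps the `0.0..1.0` guarantee in a single place, whichever judge produced the value.
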
## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid `quality_score` | `ValueError` |
| Empty `grader_id` | `ValueError` |
| Embedding adapter returns other than two vectors | `ValueError` |
| LLM judge response is missing parseable JSON | `ValueError` |

## Bias caveats

LLM-as-judge scoring is heuristic and may exhibit:

- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.

Consumers should calibrate `LLMJudge` against at least one non-LLM judge such as exact match or embedding similarity before using its observations to drive adaptive routing.
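
One way to approach that calibration is to score the same set of prompts with both judges and compare the two score lists. The sketch below is illustrative only: `calibration_report`, its `threshold` parameter and the two metrics it reports are assumptions, not part of `llm_connect.grading`, and the scores in the usage lines are made-up inputs rather than measured results.

```python
from statistics import mean


def calibration_report(
    llm_scores: list[float],
    reference_scores: list[float],
    threshold: float = 0.5,
) -> dict[str, float]:
    """Compare LLM-judge scores with scores from a non-LLM reference judge.

    Both lists hold quality scores in 0.0..1.0 for the same prompts, in order.
    """
    if not llm_scores or len(llm_scores) != len(reference_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(llm_scores, reference_scores))
    # Average absolute distance between the two judges' scores.
    mean_abs_gap = mean(abs(a - b) for a, b in pairs)
    # Share of prompts where both judges land on the same side of the threshold.
    agreement = sum((a >= threshold) == (b >= threshold) for a, b in pairs) / len(pairs)
    return {"mean_abs_gap": mean_abs_gap, "agreement": agreement}


# Hypothetical scores for four prompts, for illustration only.
report = calibration_report([1.0, 0.0, 1.0, 0.5], [1.0, 0.0, 0.0, 1.0])
print(report)  # {'mean_abs_gap': 0.375, 'agreement': 0.75}
```

A large gap or low agreement on prompts where the reference judge is trustworthy suggests the LLM judge's scores should not yet drive routing decisions.
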