llm-connect/contracts/functional/baseline-grading.md

# Contract: Baseline Grading

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.grading`
**since:** WP-0004

## Purpose

Compare a candidate adapter response against a caller-chosen baseline response
and return a normalised quality score suitable for storage in
`QualityLedger`.

## Public surface

```python
@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse

class Judge(Protocol):
    grader_id: str
    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...

class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...

@dataclass
class ExactMatchJudge: ...

@dataclass
class EmbeddingSimilarityJudge: ...

@dataclass
class LLMJudge: ...

@dataclass
class PairedGrader: ...
```
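
As a rough illustration of how these pieces might fit together, the sketch below wires a paired grader to an exact-match judge. It is not the module's implementation. The names `LLMResponse.content`, the adapter method `run`, the `EchoAdapter` stub, the `*Sketch` classes, and the two positional response parameters of `judge` are all assumptions, because this contract does not specify them.

```python
from dataclasses import dataclass


# Hypothetical stand-ins: the real LLMResponse, RunConfig and LLMAdapter
# are defined elsewhere and are not specified by this contract.
@dataclass(frozen=True)
class LLMResponse:
    content: str  # assumed field name


@dataclass(frozen=True)
class RunConfig:
    temperature: float = 0.0  # assumed field name


@dataclass
class EchoAdapter:
    """Stub adapter that always answers with a fixed reply."""
    reply: str

    def run(self, prompt: str, run_config: RunConfig) -> LLMResponse:  # assumed method
        return LLMResponse(content=self.reply)


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse

    def __post_init__(self) -> None:
        # Reject scores outside 0.0..1.0 (this comparison also rejects NaN).
        if not 0.0 <= self.quality_score <= 1.0:
            raise ValueError(f"quality_score out of range: {self.quality_score}")
        if not self.grader_id:
            raise ValueError("grader_id must not be empty")


@dataclass
class ExactMatchJudgeSketch:
    grader_id: str = "exact-match"

    def judge(self, baseline_response, candidate_response, *, prompt, run_config):
        # 1.0 when the two contents are equal, 0.0 otherwise.
        matched = baseline_response.content == candidate_response.content
        return GradingResult(
            1.0 if matched else 0.0, "", self.grader_id,
            baseline_response, candidate_response,
        )


@dataclass
class PairedGraderSketch:
    judge: ExactMatchJudgeSketch

    def grade(self, baseline_adapter, candidate_adapter, prompt, run_config):
        # Both adapters receive the same prompt and run config.
        baseline = baseline_adapter.run(prompt, run_config)
        candidate = candidate_adapter.run(prompt, run_config)
        return self.judge.judge(
            baseline, candidate, prompt=prompt, run_config=run_config
        )
```

Under these assumptions, `PairedGraderSketch(ExactMatchJudgeSketch()).grade(EchoAdapter("4"), EchoAdapter("4"), "2+2?", RunConfig())` yields a result whose `quality_score` is `1.0` and which carries both responses.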

## Invariants

1. `quality_score` is always validated to lie in the inclusive range `0.0..1.0`.
2. `GradingResult` always preserves both baseline and candidate responses.
3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
   same prompt and run config, then delegates comparison to its `Judge`.
4. `ExactMatchJudge` returns `1.0` when the baseline and candidate content
   match and `0.0` otherwise.
5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
   single batch and clamps cosine similarity into `0.0..1.0`.
6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
   `quality_score` and optional `notes`.
7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
   and adds a deterministic `seed` model parameter when configured.
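
Invariant 7 can be pictured as deriving a separate run config for the judge call. The sketch below is one possible shape under stated assumptions: the `RunConfig` fields `temperature`, `budget_tracker`, and `model_params`, and the helper `judge_run_config`, are hypothetical names chosen for illustration and are not defined by this contract.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Optional


# Hypothetical RunConfig: every field name here is an assumption.
@dataclass(frozen=True)
class RunConfig:
    temperature: float = 0.7
    budget_tracker: Optional[Any] = None
    model_params: dict = field(default_factory=dict)


def judge_run_config(caller: RunConfig, seed: Optional[int] = None) -> RunConfig:
    """Derive the config used for the judge call from the caller's config."""
    # Copy so the caller's parameter dict is never mutated.
    params = dict(caller.model_params)
    if seed is not None:
        # Only add a seed when one is configured.
        params["seed"] = seed
    # Pin temperature to 0.0 and drop the caller's budget tracker.
    return replace(
        caller, temperature=0.0, budget_tracker=None, model_params=params
    )
```

Building a new config rather than mutating the caller's keeps the candidate and baseline runs unaffected by the judge's settings.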

## Error contract

| Condition | Exception |
|-----------|-----------|
| `quality_score` outside `0.0..1.0` | `ValueError` |
| Empty `grader_id` | `ValueError` |
| Embedding adapter returns a number of vectors other than two | `ValueError` |
| LLM judge response contains no parseable JSON | `ValueError` |
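
The embedding and JSON conditions in this table could be enforced along the following lines. This is a sketch only: `clamped_cosine` and `parse_judge_reply` are hypothetical helper names, and the handling of zero-length vectors and of a missing or non-numeric `quality_score` goes beyond what the contract states.

```python
import json
import math
from typing import Sequence, Tuple


def clamped_cosine(vectors: Sequence[Sequence[float]]) -> float:
    """Cosine similarity of exactly two vectors, clamped into 0.0..1.0."""
    if len(vectors) != 2:
        raise ValueError(f"expected 2 embedding vectors, got {len(vectors)}")
    a, b = vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # Assumption: a zero vector scores 0.0; the contract does not say.
        return 0.0
    return min(1.0, max(0.0, dot / norm))


def parse_judge_reply(text: str) -> Tuple[float, str]:
    """Extract quality_score and optional notes from a judge reply."""
    # Take the outermost brace-delimited span so surrounding prose is ignored.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge response contains no JSON object")
    try:
        payload = json.loads(text[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError("judge response contains no parseable JSON") from exc
    if not isinstance(payload, dict) or "quality_score" not in payload:
        raise ValueError("judge response lacks quality_score")
    try:
        score = float(payload["quality_score"])
    except (TypeError, ValueError) as exc:
        raise ValueError("quality_score is not numeric") from exc
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"quality_score out of range: {score}")
    return score, str(payload.get("notes", ""))
```

Raising `ValueError` in every branch keeps callers to a single exception type, which matches the table above.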

## Bias caveats

LLM-as-judge scoring is heuristic and may exhibit:

- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.

Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
as exact match or embedding similarity before using its observations to drive
adaptive routing.
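
One simple way to approach this calibration is to score the same prompts with both judges and measure how far the scores diverge. The sketch below uses a mean absolute gap. The helper names and the `max_gap` threshold of `0.2` are illustrative assumptions, not values prescribed by this contract.

```python
from statistics import mean
from typing import Sequence


def mean_abs_gap(
    llm_scores: Sequence[float], reference_scores: Sequence[float]
) -> float:
    """Mean absolute difference between paired judge scores."""
    if not llm_scores or len(llm_scores) != len(reference_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(a - b) for a, b in zip(llm_scores, reference_scores))


def is_calibrated(
    llm_scores: Sequence[float],
    reference_scores: Sequence[float],
    max_gap: float = 0.2,  # hypothetical threshold; tune per workload
) -> bool:
    """True when the LLM judge stays within max_gap of the reference judge."""
    return mean_abs_gap(llm_scores, reference_scores) <= max_gap
```

A gap above the chosen threshold would suggest holding `LLMJudge` observations out of routing decisions until the rubric or judge model has been revisited.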