llm-connect/contracts/functional/baseline-grading.md
tegwick c4ad4bb9f2
Add adaptive cost-quality routing primitives
2026-05-17 21:32:27 +02:00


# Contract: Baseline Grading
**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.grading`
**since:** WP-0004
## Purpose
Compare a candidate adapter response against a caller-chosen baseline response
and return a normalised quality score suitable for storage in
`QualityLedger`.
## Public surface
```python
@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse


class Judge(Protocol):
    grader_id: str

    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...


class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...


@dataclass
class ExactMatchJudge: ...


@dataclass
class EmbeddingSimilarityJudge: ...


@dataclass
class LLMJudge: ...


@dataclass
class PairedGrader: ...
```
## Invariants
1. `quality_score` is always validated to lie within `0.0..1.0` inclusive; a value outside that range raises `ValueError`.
2. `GradingResult` always preserves both baseline and candidate responses.
3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
same prompt and run config, then delegates comparison to its `Judge`.
4. `ExactMatchJudge` returns `1.0` when baseline and candidate content match and `0.0` otherwise.
5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
single batch and clamps cosine similarity into `0.0..1.0`.
6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
`quality_score` and optional `notes`.
7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
and adds a deterministic `seed` model parameter when configured.
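Invariants 3 to 5 can be illustrated with a minimal sketch. It is not the `llm_connect.grading` implementation: `SketchResult`, `paired_grade`, `exact_match_score`, and `clamped_cosine` are hypothetical names, adapters are modelled as plain callables from prompt to text, `run_config` is omitted, and "matching" is assumed to mean plain string equality.

```python
import math
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SketchResult:
    """Hypothetical stand-in for GradingResult; responses are plain strings."""
    quality_score: float
    baseline_response: str
    candidate_response: str


def exact_match_score(baseline_text: str, candidate_text: str) -> float:
    # Invariant 4. Assumption: "match" means plain string equality.
    return 1.0 if baseline_text == candidate_text else 0.0


def clamped_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Invariant 5: cosine similarity clamped into 0.0..1.0.
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        # Assumption: a zero vector counts as "no similarity".
        return 0.0
    return min(1.0, max(0.0, dot / norm))


def paired_grade(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    prompt: str,
    score: Callable[[str, str], float],
) -> SketchResult:
    # Invariant 3: both adapters receive the same prompt, then the
    # comparison is delegated to the scoring function.
    baseline_text = baseline(prompt)
    candidate_text = candidate(prompt)
    # Invariant 2: both responses are kept on the result.
    return SketchResult(score(baseline_text, candidate_text), baseline_text, candidate_text)
```

Clamping means that anti-correlated embeddings (negative cosine) score `0.0` rather than a negative value, which keeps every judge inside the range required by invariant 1.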
## Error contract
| Condition | Exception |
|-----------|-----------|
| Invalid `quality_score` | `ValueError` |
| Empty `grader_id` | `ValueError` |
| Embedding adapter returns a number of vectors other than two | `ValueError` |
| LLM judge response contains no parseable JSON | `ValueError` |
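The sketch below shows one way the conditions in this table could surface as `ValueError`. It rests on assumptions: `CheckedResult` and `parse_judge_reply` are hypothetical names, validation is placed in `__post_init__`, and the judge reply is assumed to be text that may wrap a single JSON object. The actual module may validate and parse differently.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckedResult:
    """Hypothetical stand-in that validates the way the contract describes."""
    quality_score: float
    notes: str
    grader_id: str

    def __post_init__(self) -> None:
        # Written as "not (0 <= x <= 1)" so that NaN is rejected as well.
        if not 0.0 <= self.quality_score <= 1.0:
            raise ValueError(f"quality_score out of range: {self.quality_score}")
        if not self.grader_id:
            raise ValueError("grader_id must not be empty")


def parse_judge_reply(reply_text: str, grader_id: str) -> CheckedResult:
    """Decode the first JSON object in a judge reply, or raise ValueError."""
    start = reply_text.find("{")
    if start == -1:
        raise ValueError("no JSON object in judge reply")
    try:
        payload, _ = json.JSONDecoder().raw_decode(reply_text[start:])
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply is not parseable JSON") from exc
    if not isinstance(payload, dict) or "quality_score" not in payload:
        raise ValueError("judge reply lacks quality_score")
    try:
        score = float(payload["quality_score"])
    except (TypeError, ValueError) as exc:
        raise ValueError("quality_score is not numeric") from exc
    # `notes` is optional, so fall back to an empty string.
    return CheckedResult(score, str(payload.get("notes", "")), grader_id)
```

Mapping every failure to `ValueError` lets callers handle a malformed judge reply and an out-of-range score with a single `except` clause.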
## Bias caveats
LLM-as-judge scoring is heuristic and may exhibit:
- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.

Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
as exact match or embedding similarity before using its observations to drive
adaptive routing.
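
A calibration check could be as simple as the sketch below, which compares LLM-judge scores with reference-judge scores collected on the same prompts. The helper names and the `0.15` threshold are illustrative assumptions, not part of this contract; a real calibration would pick a metric and threshold suited to the workload.

```python
from statistics import fmean
from typing import Sequence


def mean_abs_gap(llm_scores: Sequence[float], reference_scores: Sequence[float]) -> float:
    """Mean absolute difference between paired scores for the same prompts."""
    if len(llm_scores) != len(reference_scores) or not llm_scores:
        raise ValueError("need two non-empty score lists of equal length")
    return fmean(abs(a - b) for a, b in zip(llm_scores, reference_scores))


def looks_calibrated(
    llm_scores: Sequence[float],
    reference_scores: Sequence[float],
    max_gap: float = 0.15,  # hypothetical threshold, chosen only for illustration
) -> bool:
    """True when the LLM judge stays close to the reference judge on average."""
    return mean_abs_gap(llm_scores, reference_scores) <= max_gap
```

A large gap does not show which judge is wrong, only that the two disagree, so the disagreeing prompts are worth inspecting by hand before the LLM judge's scores feed routing decisions.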