Contract: Baseline Grading
layer: Functional
maturity: Beta
module: llm_connect.grading
since: WP-0004
Purpose
Compare a candidate adapter response against a caller-chosen baseline response
and return a normalised quality score suitable for storage in
QualityLedger.
Public surface
@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse

class Judge(Protocol):
    grader_id: str

    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...

class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...

@dataclass
class ExactMatchJudge: ...

@dataclass
class EmbeddingSimilarityJudge: ...

@dataclass
class LLMJudge: ...

@dataclass
class PairedGrader: ...
Invariants
- quality_score is always validated as 0.0..1.0.
- GradingResult always preserves both baseline and candidate responses.
- PairedGrader runs the baseline adapter and the candidate adapter with the same prompt and run config, then delegates comparison to its Judge.
- ExactMatchJudge returns 1.0 for matched content and 0.0 otherwise.
- EmbeddingSimilarityJudge embeds baseline and candidate response text in a single batch and clamps cosine similarity into 0.0..1.0.
- LLMJudge uses a fixed rubric prompt and expects JSON with quality_score and optional notes.
- LLMJudge runs with temperature=0.0, drops the caller's budget tracker, and adds a deterministic seed model parameter when configured.
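The first four invariants can be illustrated with a minimal sketch. It does not reproduce the module's code: `Response`, `Result`, `ExactMatch`, `Adapter`, and `paired_grade` are hypothetical stand-ins for `LLMResponse`, `GradingResult`, `ExactMatchJudge`, `LLMAdapter`, and `PairedGrader.grade`. The run config is omitted for brevity, and "matched content" is assumed to mean strict string equality.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Response:
    """Hypothetical stand-in for LLMResponse; only the text is modelled."""
    content: str


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for GradingResult with its validation rules."""
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: Response
    candidate_response: Response

    def __post_init__(self) -> None:
        # Reject scores outside 0.0..1.0 (NaN fails this comparison too).
        if not 0.0 <= self.quality_score <= 1.0:
            raise ValueError(f"quality_score out of range: {self.quality_score}")
        if not self.grader_id:
            raise ValueError("grader_id must not be empty")


class ExactMatch:
    """Hypothetical stand-in for ExactMatchJudge."""
    grader_id = "exact-match"

    def judge(self, baseline: Response, candidate: Response) -> Result:
        # Assumption: "matched" means strict string equality.
        score = 1.0 if baseline.content == candidate.content else 0.0
        # Both responses are carried through into the result.
        return Result(score, "", self.grader_id, baseline, candidate)


# Assumption: an adapter is reduced to a callable from prompt to Response.
Adapter = Callable[[str], Response]


def paired_grade(judge: ExactMatch, baseline: Adapter,
                 candidate: Adapter, prompt: str) -> Result:
    """Run both adapters on the same prompt, then delegate to the judge."""
    return judge.judge(baseline(prompt), candidate(prompt))


def echo(prompt: str) -> Response:
    return Response(prompt)


def shout(prompt: str) -> Response:
    return Response(prompt.upper())


same = paired_grade(ExactMatch(), echo, echo, "hello")
diff = paired_grade(ExactMatch(), echo, shout, "hello")
print(same.quality_score, diff.quality_score)
```

Validating in `__post_init__` means an out-of-range score cannot exist as a constructed value, which is one way to make the "always validated" invariant hold for every judge.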
Error contract
| Condition | Exception |
|---|---|
| Invalid quality_score | ValueError |
| Empty grader_id | ValueError |
| Embedding adapter returns other than two vectors | ValueError |
| LLM judge response is missing parseable JSON | ValueError |
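The last two rows of the table can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: `clamped_cosine` and `parse_judge_reply` are hypothetical helper names, the embedding vectors are plain lists of floats, the whole reply string is treated as the JSON document, and the zero-vector and dimension-mismatch behaviours are choices made for this sketch.

```python
import json
import math
from typing import Sequence, Tuple


def clamped_cosine(vectors: Sequence[Sequence[float]]) -> float:
    """Cosine similarity of exactly two vectors, clamped into 0.0..1.0."""
    if len(vectors) != 2:
        raise ValueError(f"expected two vectors, got {len(vectors)}")
    a, b = vectors
    if len(a) != len(b):
        # Assumption: mismatched dimensions are treated as an error.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # Assumption: a zero vector scores as no similarity.
    # Negative cosine values are clamped up to 0.0.
    return min(1.0, max(0.0, dot / norm))


def parse_judge_reply(text: str) -> Tuple[float, str]:
    """Extract (quality_score, notes) from a JSON reply or raise ValueError."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply is not parseable JSON") from exc
    if not isinstance(payload, dict) or "quality_score" not in payload:
        raise ValueError("judge reply lacks quality_score")
    try:
        score = float(payload["quality_score"])
    except (TypeError, ValueError) as exc:
        raise ValueError("quality_score is not numeric") from exc
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"quality_score out of range: {score}")
    # notes is optional and defaults to an empty string here.
    return score, str(payload.get("notes", ""))


print(clamped_cosine([[1.0, 0.0], [-1.0, 0.0]]))
print(parse_judge_reply('{"quality_score": 0.75, "notes": "close match"}'))
```

Mapping every failure to `ValueError` keeps the caller's handling uniform, which matches the single exception type listed in the table.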
Bias caveats
LLM-as-judge scoring is heuristic and may exhibit:
- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.
Consumers should calibrate LLMJudge against at least one non-LLM judge such
as exact match or embedding similarity before using its observations to drive
adaptive routing.
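One possible calibration check is to grade the same prompts with both judges and measure how far the LLM scores drift from the reference scores. The sketch below is an assumption about how a consumer might do this. `mean_absolute_gap`, the sample scores, and the `MAX_GAP` threshold are all hypothetical and are not part of llm_connect.grading.

```python
from typing import Sequence


def mean_absolute_gap(llm_scores: Sequence[float],
                      reference_scores: Sequence[float]) -> float:
    """Average absolute difference between paired quality scores."""
    if not llm_scores or len(llm_scores) != len(reference_scores):
        raise ValueError("score lists must be non-empty and equal in length")
    total = sum(abs(a - b) for a, b in zip(llm_scores, reference_scores))
    return total / len(llm_scores)


# Hypothetical scores for four prompts, graded by both judges.
llm_scores = [1.0, 0.5, 0.0, 1.0]
exact_match_scores = [1.0, 0.0, 0.0, 1.0]

# Hypothetical tolerance; a real value would depend on the workload.
MAX_GAP = 0.2

gap = mean_absolute_gap(llm_scores, exact_match_scores)
calibrated = gap <= MAX_GAP
print(gap, calibrated)
```

A small gap only shows agreement on the sampled prompts. It does not rule out the biases listed above, so the sample should include answers of varying length and format.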