Contract: Baseline Grading

layer: Functional
maturity: Beta
module: llm_connect.grading
since: WP-0004

Purpose

Compare a candidate adapter response against a caller-chosen baseline response and return a normalised quality score suitable for storage in QualityLedger.

Public surface

@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse

class Judge(Protocol):
    grader_id: str
    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...

class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...

@dataclass
class ExactMatchJudge: ...

@dataclass
class EmbeddingSimilarityJudge: ...

@dataclass
class LLMJudge: ...

@dataclass
class PairedGrader: ...

Invariants

quality_score is always validated to lie in the closed range 0.0..1.0; out-of-range values are rejected rather than clamped.
GradingResult always preserves both baseline and candidate responses.
PairedGrader runs the baseline adapter and the candidate adapter with the same prompt and run config, then delegates comparison to its Judge.
ExactMatchJudge returns 1.0 for matched content and 0.0 otherwise.
EmbeddingSimilarityJudge embeds baseline and candidate response text in a single batch and clamps cosine similarity into 0.0..1.0.
LLMJudge uses a fixed rubric prompt and expects JSON with quality_score and optional notes.
LLMJudge runs with temperature=0.0, drops the caller's budget tracker, and adds a deterministic seed model parameter when configured.
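
The two non-LLM scoring rules above can be sketched in a few lines. This is an illustration only, not the module's implementation: the function names exact_match_score and clamped_cosine are hypothetical, strict string equality is an assumed definition of "matched content", and treating a zero-length vector as similarity 0.0 is an assumption the contract does not state.

```python
import math
from typing import Sequence


def exact_match_score(baseline_text: str, candidate_text: str) -> float:
    """Return 1.0 for identical text and 0.0 otherwise.

    Assumption: "matched" means strict string equality; the real judge
    may normalise whitespace or case before comparing.
    """
    return 1.0 if baseline_text == candidate_text else 0.0


def clamped_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors, clamped into 0.0..1.0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero embedding carries no signal, so score 0.0.
        return 0.0
    # Raw cosine lies in -1.0..1.0; negative similarity is clamped to 0.0.
    return max(0.0, min(1.0, dot / (norm_a * norm_b)))
```

Clamping rather than rescaling means that any two responses whose embeddings point in opposing directions score 0.0, which keeps the result inside the range that quality_score accepts.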

Error contract

Condition	Exception
Invalid `quality_score`	`ValueError`
Empty `grader_id`	`ValueError`
Embedding adapter returns a number of vectors other than two	`ValueError`
LLM judge response is missing parseable JSON	`ValueError`
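
As a rough sketch of how the score and JSON rows of this table could be enforced, the helpers below raise ValueError for an out-of-range score and for a judge reply without parseable JSON. The names validate_quality_score and parse_judge_reply are hypothetical, and tolerating prose around the JSON object is an assumption; the actual LLMJudge may be stricter.

```python
import json


def validate_quality_score(value: object) -> float:
    """Return the score as a float, or raise ValueError if it is invalid."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"quality_score must be a number, got {value!r}")
    score = float(value)
    # NaN fails this comparison too, so it is rejected as well.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"quality_score must be within 0.0..1.0, got {score}")
    return score


def parse_judge_reply(text: str) -> tuple[float, str]:
    """Extract (quality_score, notes) from a judge reply.

    Assumption: the reply may wrap one JSON object in surrounding prose,
    so the outermost braces are located before parsing.
    """
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge reply contains no JSON object")
    try:
        payload = json.loads(text[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply contains unparseable JSON") from exc
    if not isinstance(payload, dict) or "quality_score" not in payload:
        raise ValueError("judge reply JSON lacks quality_score")
    # notes is optional in the contract, so default to an empty string.
    return validate_quality_score(payload["quality_score"]), str(payload.get("notes", ""))
```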

Bias caveats

LLM-as-judge scoring is heuristic and may exhibit:

Length bias: longer answers can be preferred even when not better.
Format bias: familiar formatting can be rewarded independent of correctness.
Position bias: prompt order can affect judgement.
Self-preference: a judge may favour outputs from its own model family.

Consumers should calibrate LLMJudge against at least one non-LLM judge such as exact match or embedding similarity before using its observations to drive adaptive routing.
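
One possible shape for such a calibration check, offered as a sketch rather than a prescribed procedure: score the same prompts with the LLM judge and with a non-LLM judge, then compare the two score lists. The helper names and the 0.15 threshold are hypothetical; the contract does not define a calibration metric or cut-off.

```python
from statistics import mean
from typing import Sequence


def mean_absolute_gap(
    llm_scores: Sequence[float], reference_scores: Sequence[float]
) -> float:
    """Average absolute difference between paired scores for the same prompts."""
    if not llm_scores or len(llm_scores) != len(reference_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    return mean(abs(a - b) for a, b in zip(llm_scores, reference_scores))


def is_calibrated(
    llm_scores: Sequence[float],
    reference_scores: Sequence[float],
    max_gap: float = 0.15,  # hypothetical threshold, tune per deployment
) -> bool:
    """True when the LLM judge stays close to the reference judge on average."""
    return mean_absolute_gap(llm_scores, reference_scores) <= max_gap
```

A large gap does not by itself show which judge is wrong; it only signals that the LLM judge's observations should be inspected before they influence routing.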
