# Contract: Baseline Grading

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.grading`
**since:** WP-0004

## Purpose

Compare a candidate adapter response against a caller-chosen baseline response
and return a normalised quality score suitable for storage in
`QualityLedger`.

## Public surface

```python
@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse

class Judge(Protocol):
    grader_id: str
    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...

class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...

@dataclass
class ExactMatchJudge: ...

@dataclass
class EmbeddingSimilarityJudge: ...

@dataclass
class LLMJudge: ...

@dataclass
class PairedGrader: ...
```

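As a rough illustration of how this surface might compose, the sketch below wires a paired grader to an exact-match judge. It is not the module's implementation. `LLMResponse` and `RunConfig` are local stand-ins, and the adapter method name `complete`, the `content` field, the `Sketch*` class names, and the two positional arguments passed to `judge` (which the contract elides) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class LLMResponse:
    """Stand-in; the real type is defined elsewhere in llm_connect."""
    content: str  # hypothetical field name


@dataclass(frozen=True)
class RunConfig:
    """Stand-in with a single hypothetical field."""
    temperature: float = 0.0


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse


class Judge(Protocol):
    grader_id: str

    # The two positional parameters are assumed; the contract elides them.
    def judge(self, baseline: LLMResponse, candidate: LLMResponse, *,
              prompt: str, run_config: RunConfig) -> GradingResult: ...


@dataclass
class EchoAdapter:
    """Hypothetical adapter that always returns a canned reply."""
    reply: str

    def complete(self, prompt: str, run_config: RunConfig) -> LLMResponse:
        return LLMResponse(content=self.reply)


@dataclass
class SketchExactMatchJudge:
    grader_id: str = "exact-match"

    def judge(self, baseline, candidate, *, prompt, run_config):
        # Plain string equality is assumed to be what "matched" means.
        score = 1.0 if baseline.content == candidate.content else 0.0
        return GradingResult(score, "", self.grader_id, baseline, candidate)


@dataclass
class SketchPairedGrader:
    judge: Judge

    def grade(self, baseline_adapter, candidate_adapter, prompt, run_config):
        # Both adapters see the same prompt and run config.
        baseline = baseline_adapter.complete(prompt, run_config)
        candidate = candidate_adapter.complete(prompt, run_config)
        # Comparison is delegated to the judge; both responses are preserved.
        return self.judge.judge(baseline, candidate,
                                prompt=prompt, run_config=run_config)
```

Under these assumptions, `SketchPairedGrader(judge=SketchExactMatchJudge()).grade(EchoAdapter("4"), EchoAdapter("4"), "2+2?", RunConfig())` would yield a result whose `quality_score` is `1.0` and which carries both responses.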
## Invariants

1. `quality_score` is always validated to fall within `0.0..1.0`.
2. `GradingResult` always preserves both baseline and candidate responses.
3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
   same prompt and run config, then delegates comparison to its `Judge`.
4. `ExactMatchJudge` returns `1.0` when baseline and candidate content match
   and `0.0` otherwise.
5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
   single batch and clamps cosine similarity into `0.0..1.0`.
6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
   `quality_score` and optional `notes`.
7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
   and adds a deterministic `seed` model parameter when configured.

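Invariants 1, 5, and 7 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the module's code: `SketchGradingResult` omits the two response fields for brevity, `SketchRunConfig` and its field names (`budget_tracker`, `model_params`) are hypothetical, and treating a zero-length vector as similarity `0.0` is a choice made only for this example.

```python
import math
from dataclasses import dataclass, field, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class SketchGradingResult:
    """Trimmed stand-in: the two response fields are omitted for brevity."""
    quality_score: float
    notes: str
    grader_id: str

    def __post_init__(self) -> None:
        # Invariant 1: only scores in the closed interval [0.0, 1.0] pass.
        # NaN fails the chained comparison, so it is rejected as well.
        if not (0.0 <= self.quality_score <= 1.0):
            raise ValueError(f"quality_score out of range: {self.quality_score!r}")
        if not self.grader_id:
            raise ValueError("grader_id must be non-empty")


def clamped_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Invariant 5: cosine similarity clamped into [0.0, 1.0]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # assumption: a zero vector counts as no similarity
    return max(0.0, min(1.0, dot / norm))


@dataclass(frozen=True)
class SketchRunConfig:
    """Hypothetical config; the real RunConfig fields may differ."""
    temperature: float = 0.7
    budget_tracker: Optional[object] = None
    model_params: dict = field(default_factory=dict)


def judge_run_config(cfg: SketchRunConfig,
                     seed: Optional[int] = None) -> SketchRunConfig:
    """Invariant 7: temperature 0.0, no budget tracker, optional seed."""
    params = dict(cfg.model_params)  # copy so the caller's dict is untouched
    if seed is not None:
        params["seed"] = seed
    return replace(cfg, temperature=0.0, budget_tracker=None, model_params=params)
```

Clamping means that anti-correlated embeddings (negative cosine) score `0.0` rather than being rescaled, so a score of `0.0` here covers both "orthogonal" and "opposite".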
## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid `quality_score` | `ValueError` |
| Empty `grader_id` | `ValueError` |
| Embedding adapter returns a number of vectors other than two | `ValueError` |
| LLM judge response contains no parseable JSON | `ValueError` |

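The last two rows of the table might be enforced along the lines of the sketch below. The helper names `require_two_vectors` and `parse_judge_reply` are hypothetical, and tolerating prose around the JSON object is an assumption of this example; the contract only states that a response without parseable JSON raises `ValueError`.

```python
import json
import re
from typing import Sequence, Tuple


def require_two_vectors(vectors: Sequence[Sequence[float]]):
    """Reject an embedding batch that is not exactly [baseline, candidate]."""
    if len(vectors) != 2:
        raise ValueError(f"expected 2 embedding vectors, got {len(vectors)}")
    return vectors[0], vectors[1]


def parse_judge_reply(text: str) -> Tuple[float, str]:
    """Extract quality_score and optional notes from a judge reply."""
    # Assumption: grab the outermost {...} span so surrounding prose is ignored.
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("judge response contains no JSON object")
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError as exc:
        raise ValueError("judge response JSON is not parseable") from exc
    if "quality_score" not in payload:
        raise ValueError("judge response JSON lacks quality_score")
    try:
        score = float(payload["quality_score"])
    except (TypeError, ValueError) as exc:
        raise ValueError("quality_score is not numeric") from exc
    if not (0.0 <= score <= 1.0):
        raise ValueError(f"quality_score out of range: {score!r}")
    # notes is optional and defaults to an empty string here.
    return score, str(payload.get("notes", ""))
```

Every failure path surfaces as `ValueError`, so a caller needs only one `except` clause to decide whether to skip or retry a grading run.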
## Bias caveats

LLM-as-judge scoring is heuristic and may exhibit:

- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.

Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
as exact match or embedding similarity before using its observations to drive
adaptive routing.
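
One possible shape for such a calibration check is sketched below. It compares `LLMJudge` scores with scores from a non-LLM judge over the same prompts. The function name `calibration_report`, the two metrics, and the `0.5` pass threshold are assumptions chosen for illustration; the contract does not prescribe a calibration method.

```python
from statistics import mean
from typing import Dict, Sequence


def calibration_report(llm_scores: Sequence[float],
                       reference_scores: Sequence[float],
                       threshold: float = 0.5) -> Dict[str, float]:
    """Summarise how closely LLM-judge scores track a reference judge."""
    if not llm_scores or len(llm_scores) != len(reference_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(llm_scores, reference_scores))
    # Average absolute difference between the two judges' scores.
    gaps = [abs(a - b) for a, b in pairs]
    # Fraction of prompts where both judges land on the same side of threshold.
    agree = [(a >= threshold) == (b >= threshold) for a, b in pairs]
    return {"mean_abs_gap": mean(gaps), "agreement": sum(agree) / len(agree)}
```

A large gap or low agreement on a held-out prompt set would suggest that the LLM judge's scores should not yet feed routing decisions.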