generated from coulomb/repo-seed
Add adaptive cost-quality routing primitives
contracts/functional/baseline-grading.md
# Contract: Baseline Grading

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.grading`
**since:** WP-0004

## Purpose

Compare a candidate adapter response against a caller-chosen baseline response
and return a normalised quality score suitable for storage in
`QualityLedger`.

## Public surface

```python
@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse

class Judge(Protocol):
    grader_id: str
    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...

class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...

@dataclass
class ExactMatchJudge: ...

@dataclass
class EmbeddingSimilarityJudge: ...

@dataclass
class LLMJudge: ...

@dataclass
class PairedGrader: ...
```

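As a rough illustration of how these pieces might fit together, the sketch below wires a paired grader to an exact-match judge. It is not the `llm_connect` implementation: `LLMResponse`, `RunConfig`, and `LLMAdapter` are minimal stand-ins, the adapter method name `complete`, the `content` field, the `EchoAdapter` helper, and the positional parameters of `judge` (elided as `...` in the contract) are all assumptions, and strict string equality is assumed for "matched content".

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class LLMResponse:  # stand-in; the real type lives in llm_connect
    content: str    # assumed field name


@dataclass(frozen=True)
class RunConfig:    # stand-in with one hypothetical field
    temperature: float = 0.0


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse


class LLMAdapter(Protocol):  # `complete` is a hypothetical method name
    def complete(self, prompt: str, run_config: RunConfig) -> LLMResponse: ...


class Judge(Protocol):
    grader_id: str

    # Positional parameters are assumed; the contract elides them.
    def judge(self, baseline: LLMResponse, candidate: LLMResponse, *,
              prompt: str, run_config: RunConfig) -> GradingResult: ...


@dataclass
class EchoAdapter:
    """Hypothetical test adapter that always returns a fixed reply."""
    reply: str

    def complete(self, prompt: str, run_config: RunConfig) -> LLMResponse:
        return LLMResponse(self.reply)


@dataclass
class ExactMatchJudge:
    grader_id: str = "exact-match"

    def judge(self, baseline, candidate, *, prompt, run_config):
        # 1.0 when the two response texts are identical, 0.0 otherwise.
        score = 1.0 if baseline.content == candidate.content else 0.0
        return GradingResult(score, "", self.grader_id, baseline, candidate)


@dataclass
class PairedGrader:
    judge: Judge

    def grade(self, baseline_adapter, candidate_adapter, prompt, run_config):
        # Same prompt and run config for both adapters, then delegate.
        baseline = baseline_adapter.complete(prompt, run_config)
        candidate = candidate_adapter.complete(prompt, run_config)
        return self.judge.judge(
            baseline, candidate, prompt=prompt, run_config=run_config
        )
```

With these stand-ins, `PairedGrader(judge=ExactMatchJudge()).grade(EchoAdapter("4"), EchoAdapter("4"), "2+2?", RunConfig())` yields a result whose `quality_score` is `1.0` and which carries both responses.
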
## Invariants

1. `quality_score` is always validated as `0.0..1.0`.
2. `GradingResult` always preserves both baseline and candidate responses.
3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
   same prompt and run config, then delegates comparison to its `Judge`.
4. `ExactMatchJudge` returns `1.0` for matched content and `0.0` otherwise.
5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
   single batch and clamps cosine similarity into `0.0..1.0`.
6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
   `quality_score` and optional `notes`.
7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
   and adds a deterministic `seed` model parameter when configured.

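The clamping described for `EmbeddingSimilarityJudge` can be sketched in plain Python. This is an illustrative sketch only: the function names `clamped_cosine` and `score_from_batch` are hypothetical, and returning `0.0` for a zero-length vector is an assumption the contract does not state.

```python
import math
from typing import Sequence


def clamped_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors, clamped into 0.0..1.0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat a zero vector as "no similarity"
    cosine = dot / (norm_a * norm_b)
    # Raw cosine lies in -1.0..1.0; negative values are clamped to 0.0.
    return min(1.0, max(0.0, cosine))


def score_from_batch(vectors: Sequence[Sequence[float]]) -> float:
    """Score one batch holding the baseline and candidate embeddings."""
    if len(vectors) != 2:
        raise ValueError("expected exactly two embedding vectors")
    baseline_vec, candidate_vec = vectors
    return clamped_cosine(baseline_vec, candidate_vec)
```

Clamping means that opposed embeddings (cosine below zero) score the same as orthogonal ones, which keeps the result inside the range `GradingResult` validates.
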
## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid `quality_score` | `ValueError` |
| Empty `grader_id` | `ValueError` |
| Embedding adapter returns a number of vectors other than two | `ValueError` |
| LLM judge response contains no parseable JSON | `ValueError` |

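One way the score and JSON rows of this table might be enforced is sketched below. The helper names `validate_quality_score` and `parse_judge_reply` are hypothetical, and locating the JSON object by its outermost braces is an assumption about how a judge reply could be parsed, not the documented behaviour of `LLMJudge`.

```python
import json


def validate_quality_score(value: object) -> float:
    """Return the score as a float, or raise ValueError if it is invalid."""
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError(f"quality_score must be a number, got {value!r}")
    score = float(value)
    # The chained comparison is False for NaN, so NaN is rejected too.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"quality_score must be in 0.0..1.0, got {score!r}")
    return score


def parse_judge_reply(text: str) -> tuple[float, str]:
    """Extract (quality_score, notes) from a judge reply containing JSON."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge reply contains no JSON object")
    try:
        payload = json.loads(text[start:end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply contains no parseable JSON") from exc
    if not isinstance(payload, dict) or "quality_score" not in payload:
        raise ValueError("judge reply is missing quality_score")
    score = validate_quality_score(payload["quality_score"])
    notes = str(payload.get("notes", ""))  # notes are optional
    return score, notes
```

Funnelling every failure into `ValueError` keeps callers to a single `except` clause, in line with the table above.
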
## Bias caveats

LLM-as-judge scoring is heuristic and may exhibit:

- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.

Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
as exact match or embedding similarity before using its observations to drive
adaptive routing.

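A calibration pass of the kind recommended above could look roughly like the following. It assumes the caller has already collected two parallel lists of scores for the same prompts, one from `LLMJudge` and one from a non-LLM judge; the function name `calibration_report`, the `0.5` pass threshold, and the two summary statistics are illustrative choices, not part of the contract.

```python
from statistics import mean
from typing import Dict, Sequence


def calibration_report(
    llm_scores: Sequence[float],
    reference_scores: Sequence[float],
    threshold: float = 0.5,  # hypothetical pass/fail cut-off
) -> Dict[str, float]:
    """Summarise how closely LLM-judge scores track a reference judge."""
    if not llm_scores or len(llm_scores) != len(reference_scores):
        raise ValueError("score lists must be non-empty and of equal length")
    pairs = list(zip(llm_scores, reference_scores))
    # Average absolute gap between the two judges on the same items.
    mean_abs_error = mean(abs(llm - ref) for llm, ref in pairs)
    # Fraction of items where both judges land on the same side of the
    # threshold, i.e. would make the same pass/fail call.
    agreement = mean(
        1.0 if (llm >= threshold) == (ref >= threshold) else 0.0
        for llm, ref in pairs
    )
    return {"mean_abs_error": mean_abs_error, "agreement": agreement}
```

A low agreement rate on a sample where the reference judge is trustworthy would suggest holding `LLMJudge` observations back from routing decisions until the rubric or judge model is revisited.
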