Add adaptive cost-quality routing primitives
Some checks failed
CI / test (3.10) (push) Has been cancelled
CI / test (3.11) (push) Has been cancelled
CI / test (3.12) (push) Has been cancelled

This commit is contained in:
2026-05-17 21:32:27 +02:00
parent bf86a03c5d
commit c4ad4bb9f2
17 changed files with 2480 additions and 25 deletions

View File

@@ -0,0 +1,87 @@
# Contract: AdaptiveRoutingPolicy
**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.routing`
**since:** WP-0004
## Purpose
Select the cheapest adapter whose observed mean quality for a task type clears
a caller-supplied quality floor. The policy builds on `RoutingPolicy`: static
rules remain the cold-start and failure fallback, while adaptive selection is
used only when the ledger has enough qualifying observations.
## Public surface
```python
@dataclass
class AdaptiveRoutingPolicy(RoutingPolicy):
ledger: Optional[QualityLedger] = None
adapters_by_id: Mapping[str, LLMAdapter] = field(default_factory=dict)
window_size: int = 20
min_observations: int = 1
max_age: Optional[timedelta] = None
def resolve(
self,
task_type: str,
estimated_cost_per_1k: Optional[float] = None,
*,
quality_floor: Optional[float] = None,
) -> LLMAdapter: ...
```
## Candidate identity
Observations are keyed by `(task_type, adapter_id)`. Callers should pass
`adapters_by_id` so the policy can map ledger observations back to concrete
`LLMAdapter` instances. If a static rule adapter is not present in
`adapters_by_id`, the policy also checks common string attributes
`adapter_id`, `id`, and `name`.
## Invariants
1. If `quality_floor is None` or `ledger is None`, resolution is exactly the
same as `RoutingPolicy.resolve()`.
2. `quality_floor` must be between `0` and `1`, inclusive.
3. Each candidate is evaluated over the newest `window_size` observations for
the requested `task_type` and adapter id.
4. `max_age`, when provided, filters out observations older than that age.
5. A candidate is considered only when it has at least `min_observations` after
filtering.
6. A candidate qualifies when its mean `quality_score` is greater than or equal
to `quality_floor`.
7. Among qualifying candidates, the policy chooses the lowest mean observed
`cost_usd`.
8. If mean observed cost ties exactly, the policy prefers the matching static
rule's explicit `prefer` adapter.
9. If there are still ties, stable candidate order is used.
10. If no candidate qualifies, resolution falls through to
`RoutingPolicy.resolve(task_type, estimated_cost_per_1k)`.
## Sample-size and freshness trade-off
Small `window_size` values react quickly to model or prompt changes but can be
noisy. Larger windows are more stable but may preserve stale behavior after a
provider update or prompt template change. `min_observations` lets callers avoid
acting on a single lucky sample, while `max_age` bounds how long old observations
can influence routing. Callers that change prompts materially should also filter
by a prompt fingerprint in observation tags before writing comparable samples to
the same ledger regime.
## Error contract
| Condition | Exception |
|-----------|-----------|
| `quality_floor` outside `0..1` | `ValueError` |
| `window_size <= 0` | `ValueError` |
| `min_observations <= 0` | `ValueError` |
| `max_age < 0` | `ValueError` |
| No qualifying adaptive candidate and no static fallback | `LookupError` |
## Non-goals
The policy does not define a task taxonomy, set task quality floors, decide
which baseline is authoritative, or perform billing-grade accounting. Those are
consumer policy choices.

View File

@@ -0,0 +1,85 @@
# Contract: Baseline Grading
**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.grading`
**since:** WP-0004
## Purpose
Compare a candidate adapter response against a caller-chosen baseline response
and return a normalised quality score suitable for storage in
`QualityLedger`.
## Public surface
```python
@dataclass(frozen=True)
class GradingResult:
quality_score: float
notes: str
grader_id: str
baseline_response: LLMResponse
candidate_response: LLMResponse
class Judge(Protocol):
grader_id: str
def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...
class BaselineGrader(Protocol):
def grade(
self,
baseline_adapter: LLMAdapter,
candidate_adapter: LLMAdapter,
prompt: str,
run_config: RunConfig,
) -> GradingResult: ...
@dataclass
class ExactMatchJudge: ...
@dataclass
class EmbeddingSimilarityJudge: ...
@dataclass
class LLMJudge: ...
@dataclass
class PairedGrader: ...
```
## Invariants
1. `quality_score` is always validated as `0.0..1.0`.
2. `GradingResult` always preserves both baseline and candidate responses.
3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
same prompt and run config, then delegates comparison to its `Judge`.
4. `ExactMatchJudge` returns `1.0` for matched content and `0.0` otherwise.
5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
single batch and clamps cosine similarity into `0.0..1.0`.
6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
`quality_score` and optional `notes`.
7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
and adds a deterministic `seed` model parameter when configured.
## Error contract
| Condition | Exception |
|-----------|-----------|
| Invalid `quality_score` | `ValueError` |
| Empty `grader_id` | `ValueError` |
| Embedding adapter returns other than two vectors | `ValueError` |
| LLM judge response is missing parseable JSON | `ValueError` |
## Bias caveats
LLM-as-judge scoring is heuristic and may exhibit:
- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.
Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
as exact match or embedding similarity before using its observations to drive
adaptive routing.

View File

@@ -0,0 +1,87 @@
# Contract: QualityObservation and QualityLedger
**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.quality`
**since:** WP-0004
## Purpose
Record observed quality, cost, latency, and token outcomes for a logical task
type so consumers can build adaptive routing policy without putting
consumer-specific thresholds into llm-connect.
## Public surface
```python
@dataclass(frozen=True)
class QualityObservation:
task_type: str
adapter_id: str
model_id: str
cost_usd: float
quality_score: float
latency_ms: float
tokens_in: int
tokens_out: int
baseline_adapter_id: str | None = None
recorded_at: datetime = field(default_factory=...)
tags: dict[str, Any] = field(default_factory=dict)
@property
def total_tokens(self) -> int: ...
def to_dict(self) -> dict[str, Any]: ...
@classmethod
def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...
class QualityLedger:
def __init__(self, path: str | Path): ...
@property
def path(self) -> Path: ...
def append(self, observation: QualityObservation) -> None: ...
def read_all(self) -> list[QualityObservation]: ...
def malformed_count(self) -> int: ...
def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
def recent(...) -> list[QualityObservation]: ...
def mean_quality(...) -> float | None: ...
def prune_before(self, timestamp: datetime) -> int: ...
def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
```
## Invariants
1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the
candidate fully meets the grader's quality bar and `0.0` means complete
failure for that grader.
2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
5. Ledger records are JSON Lines. Each line is one `QualityObservation.to_dict()`.
6. `QualityLedger.append()` performs a process-local lock plus an advisory file
lock around each write.
7. Read/query helpers skip malformed lines instead of failing the whole ledger.
`malformed_count()` exposes how many lines were skipped.
8. `prune_before()` removes only valid observations older than the cutoff.
Malformed lines are preserved.
## Error contract
| Condition | Exception |
|-----------|-----------|
| Invalid observation field | `ValueError` |
| Invalid datetime field | `TypeError` or `ValueError` |
| Negative recent limit | `ValueError` |
| `mean_quality(min_observations <= 0)` | `ValueError` |
| `is_stale(max_age < 0)` | `ValueError` |
## Known consumers
- `infospace-bench` is the first intended consumer. It is expected to provide
task taxonomy, thresholds, and baseline choice.
## Notes
The ledger intentionally stores only observation metadata in this slice. Callers
that need prompt or response digests can place those in `tags`, for example
`prompt_fingerprint`.

View File

@@ -0,0 +1,84 @@
# Contract: ShadowingAdapter
**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.shadowing`
**since:** WP-0004
## Purpose
Collect quality observations without changing caller-visible model behavior.
`ShadowingAdapter` wraps a candidate adapter, returns the candidate response to
the caller, and samples extra baseline/grading work that appends
`QualityObservation` records to a `QualityLedger`.
## Public surface
```python
@dataclass
class ShadowingAdapter(LLMAdapter):
candidate_adapter: LLMAdapter
baseline_adapter: LLMAdapter
grader: BaselineGrader
ledger: QualityLedger
task_type: str
adapter_id: str
model_id: Optional[str] = None
baseline_adapter_id: Optional[str] = None
shadow_rate: float = 1.0
async_shadow: bool = False
tags: Mapping[str, Any] = field(default_factory=dict)
on_shadow_error: Optional[Callable[[Exception], None]] = None
def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
async def async_execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
def flush(self, timeout: Optional[float] = None) -> None: ...
def shutdown(self, wait: bool = True) -> None: ...
```
## Invariants
1. The candidate adapter is always called first.
2. The response returned by `execute_prompt()` and `async_execute_prompt()` is
always the candidate response.
3. Shadow failures from the baseline adapter, grader, or ledger writer are
isolated from the caller. They are sent to `on_shadow_error` when configured.
4. `shadow_rate=0.0` records no observations. `shadow_rate=1.0` shadows every
successful candidate call. Intermediate values sample with `random_source`.
5. Shadow grading reuses the candidate response already returned by the wrapped
candidate adapter; it does not make a second candidate model call.
6. Shadow calls use a copy of `RunConfig` with `budget_tracker=None`, so
observation collection cannot consume the caller's foreground token budget.
7. `async_shadow=True` schedules shadow work on a background thread. `flush()`
waits for currently queued work, and `shutdown()` releases the executor.
## Observation mapping
The appended observation uses:
- `task_type` from the wrapper configuration
- `adapter_id` from the wrapper configuration
- `model_id` from the wrapper configuration, then candidate response model, then
`RunConfig.model_name`
- `quality_score` from the `GradingResult`
- `cost_usd` from response metadata keys `cost_usd`, `estimated_cost_usd`, or
`cost`, falling back to `0.0`
- token counts from candidate response usage keys `prompt_tokens` and
`completion_tokens`
- `baseline_adapter_id` and `tags` from wrapper configuration
## Error contract
| Condition | Exception |
|-----------|-----------|
| Empty `task_type` | `ValueError` |
| Empty `adapter_id` | `ValueError` |
| `shadow_rate` outside `0..1` | `ValueError` |
| Candidate adapter failure | Original exception propagates |
| Shadow baseline/grading/ledger failure | Suppressed; optional callback |
## Privacy note
The wrapper does not store prompt or response text in the ledger by default.
Callers that need regime tracking should store non-sensitive fingerprints in
`tags`, for example `prompt_fingerprint` or `template_version`.