Add adaptive cost-quality routing primitives

2026-05-17 21:32:27 +02:00
parent bf86a03c5d
commit c4ad4bb9f2
17 changed files with 2480 additions and 25 deletions
--- a/contracts/functional/adaptive-routing-policy.md
+++ b/contracts/functional/adaptive-routing-policy.md
@@ -0,0 +1,87 @@
+# Contract: AdaptiveRoutingPolicy
+
+**layer:** Functional
+**maturity:** Beta
+**module:** `llm_connect.routing`
+**since:** WP-0004
+
+## Purpose
+
+Select the cheapest adapter whose observed mean quality for a task type clears
+a caller-supplied quality floor. The policy builds on `RoutingPolicy`: static
+rules remain the cold-start and failure fallback, while adaptive selection is
+used only when the ledger has enough qualifying observations.
+
+## Public surface
+
+```python
+@dataclass
+class AdaptiveRoutingPolicy(RoutingPolicy):
+    ledger: Optional[QualityLedger] = None
+    adapters_by_id: Mapping[str, LLMAdapter] = field(default_factory=dict)
+    window_size: int = 20
+    min_observations: int = 1
+    max_age: Optional[timedelta] = None
+
+    def resolve(
+        self,
+        task_type: str,
+        estimated_cost_per_1k: Optional[float] = None,
+        *,
+        quality_floor: Optional[float] = None,
+    ) -> LLMAdapter: ...
+```
+
+## Candidate identity
+
+Observations are keyed by `(task_type, adapter_id)`. Callers should pass
+`adapters_by_id` so the policy can map ledger observations back to concrete
+`LLMAdapter` instances. If a static rule adapter is not present in
+`adapters_by_id`, the policy also checks common string attributes
+`adapter_id`, `id`, and `name`.
+
+## Invariants
+
+1. If `quality_floor is None` or `ledger is None`, resolution is exactly the
+   same as `RoutingPolicy.resolve()`.
+2. `quality_floor` must be between `0` and `1`, inclusive.
+3. Each candidate is evaluated over the newest `window_size` observations for
+   the requested `task_type` and adapter id.
+4. `max_age`, when provided, filters out observations older than that age.
+5. A candidate is considered only when it has at least `min_observations` after
+   filtering.
+6. A candidate qualifies when its mean `quality_score` is greater than or equal
+   to `quality_floor`.
+7. Among qualifying candidates, the policy chooses the lowest mean observed
+   `cost_usd`.
+8. If mean observed cost ties exactly, the policy prefers the matching static
+   rule's explicit `prefer` adapter.
+9. If there are still ties, stable candidate order is used.
+10. If no candidate qualifies, resolution falls through to
+    `RoutingPolicy.resolve(task_type, estimated_cost_per_1k)`.
+
+## Sample-size and freshness trade-off
+
+Small `window_size` values react quickly to model or prompt changes but can be
+noisy. Larger windows are more stable but may preserve stale behavior after a
+provider update or prompt template change. `min_observations` lets callers avoid
+acting on a single lucky sample, while `max_age` bounds how long old observations
+can influence routing. Callers that change prompts materially should also record
+a prompt fingerprint in observation tags and filter on it, so that samples from
+different prompt regimes are not averaged together.
+
+## Error contract
+
+| Condition | Exception |
+|-----------|-----------|
+| `quality_floor` outside `0..1` | `ValueError` |
+| `window_size <= 0` | `ValueError` |
+| `min_observations <= 0` | `ValueError` |
+| `max_age < 0` | `ValueError` |
+| No qualifying adaptive candidate and no static fallback | `LookupError` |
+
+## Non-goals
+
+The policy does not define a task taxonomy, set task quality floors, decide
+which baseline is authoritative, or perform billing-grade accounting. Those are
+consumer policy choices.
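Note on `adaptive-routing-policy.md`: the selection rule in invariants 3 to 10 can be sketched as below. This is a minimal illustration, not the code in `llm_connect.routing`. `Obs` and `pick_adapter_id` are hypothetical names, the sketch works on adapter ids rather than `LLMAdapter` instances, and it assumes the `max_age` filter is applied before the `window_size` cut, which the contract does not state explicitly. Returning `None` stands in for the fall-through to `RoutingPolicy.resolve()`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import fmean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Obs:
    # Hypothetical stand-in for QualityObservation, reduced to the fields used here.
    adapter_id: str
    quality_score: float
    cost_usd: float
    recorded_at: datetime


def pick_adapter_id(
    observations: Sequence[Obs],
    candidate_ids: Sequence[str],
    quality_floor: float,
    *,
    window_size: int = 20,
    min_observations: int = 1,
    max_age: Optional[timedelta] = None,
    prefer: Optional[str] = None,
    now: Optional[datetime] = None,
) -> Optional[str]:
    """Return the cheapest qualifying adapter id, or None to signal static fallback."""
    if not 0.0 <= quality_floor <= 1.0:
        raise ValueError("quality_floor must be within 0..1")
    if window_size <= 0 or min_observations <= 0:
        raise ValueError("window_size and min_observations must be positive")
    now = now or datetime.now(timezone.utc)

    qualifying = []
    for order, adapter_id in enumerate(candidate_ids):
        rows = [o for o in observations if o.adapter_id == adapter_id]
        if max_age is not None:
            # Assumption: stale observations are dropped before windowing.
            rows = [o for o in rows if now - o.recorded_at <= max_age]
        # Keep only the newest window_size observations.
        rows = sorted(rows, key=lambda o: o.recorded_at)[-window_size:]
        if len(rows) < min_observations:
            continue
        if fmean([o.quality_score for o in rows]) < quality_floor:
            continue
        mean_cost = fmean([o.cost_usd for o in rows])
        # Sort key: lowest cost, then the preferred adapter, then candidate order.
        qualifying.append((mean_cost, adapter_id != prefer, order, adapter_id))

    if not qualifying:
        return None
    return min(qualifying)[3]
```

Encoding the tie-breakers as a tuple keeps the ordering rules (cost, then `prefer`, then stable order) in one place instead of spreading them across nested conditionals.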
--- a/contracts/functional/baseline-grading.md
+++ b/contracts/functional/baseline-grading.md
@@ -0,0 +1,85 @@
+# Contract: Baseline Grading
+
+**layer:** Functional
+**maturity:** Beta
+**module:** `llm_connect.grading`
+**since:** WP-0004
+
+## Purpose
+
+Compare a candidate adapter response against a caller-chosen baseline response
+and return a normalised quality score suitable for storage in
+`QualityLedger`.
+
+## Public surface
+
+```python
+@dataclass(frozen=True)
+class GradingResult:
+    quality_score: float
+    notes: str
+    grader_id: str
+    baseline_response: LLMResponse
+    candidate_response: LLMResponse
+
+class Judge(Protocol):
+    grader_id: str
+    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...
+
+class BaselineGrader(Protocol):
+    def grade(
+        self,
+        baseline_adapter: LLMAdapter,
+        candidate_adapter: LLMAdapter,
+        prompt: str,
+        run_config: RunConfig,
+    ) -> GradingResult: ...
+
+@dataclass
+class ExactMatchJudge: ...
+
+@dataclass
+class EmbeddingSimilarityJudge: ...
+
+@dataclass
+class LLMJudge: ...
+
+@dataclass
+class PairedGrader: ...
+```
+
+## Invariants
+
+1. `quality_score` is always validated as `0.0..1.0`.
+2. `GradingResult` always preserves both baseline and candidate responses.
+3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
+   same prompt and run config, then delegates comparison to its `Judge`.
+4. `ExactMatchJudge` returns `1.0` when baseline and candidate content match and `0.0` otherwise.
+5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
+   single batch and clamps cosine similarity into `0.0..1.0`.
+6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
+   `quality_score` and optional `notes`.
+7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
+   and adds a deterministic `seed` model parameter when configured.
+
+## Error contract
+
+| Condition | Exception |
+|-----------|-----------|
+| Invalid `quality_score` | `ValueError` |
+| Empty `grader_id` | `ValueError` |
+| Embedding adapter returns a number of vectors other than two | `ValueError` |
+| LLM judge response contains no parseable JSON | `ValueError` |
+
+## Bias caveats
+
+LLM-as-judge scoring is heuristic and may exhibit:
+
+- Length bias: longer answers can be preferred even when not better.
+- Format bias: familiar formatting can be rewarded independent of correctness.
+- Position bias: prompt order can affect judgement.
+- Self-preference: a judge may favour outputs from its own model family.
+
+Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
+as exact match or embedding similarity before using its observations to drive
+adaptive routing.
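Note on `baseline-grading.md`: the two non-LLM scoring rules in invariants 4 and 5 can be sketched as plain functions. These are illustrative only; `exact_match_score` and `clamped_cosine` are hypothetical names, not part of `llm_connect.grading`. The sketch assumes exact match means plain string equality with no normalisation, and that a zero-length vector scores `0.0`; the contract specifies neither.

```python
import math
from typing import Sequence


def exact_match_score(baseline_text: str, candidate_text: str) -> float:
    # Assumption: no whitespace or case normalisation before comparing.
    return 1.0 if baseline_text == candidate_text else 0.0


def clamped_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors, clamped into 0.0..1.0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: an all-zero embedding carries no similarity signal.
        return 0.0
    similarity = dot / (norm_a * norm_b)
    # Raw cosine lies in -1..1; negative values are clamped to 0.0.
    return max(0.0, min(1.0, similarity))
```

Clamping matters because `quality_score` is validated as `0.0..1.0`, while raw cosine similarity can be negative for opposed vectors.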
--- a/contracts/functional/quality-ledger.md
+++ b/contracts/functional/quality-ledger.md
@@ -0,0 +1,87 @@
+# Contract: QualityObservation and QualityLedger
+
+**layer:** Functional
+**maturity:** Beta
+**module:** `llm_connect.quality`
+**since:** WP-0004
+
+## Purpose
+
+Record observed quality, cost, latency, and token outcomes for a logical task
+type so consumers can build adaptive routing policy without putting
+consumer-specific thresholds into llm-connect.
+
+## Public surface
+
+```python
+@dataclass(frozen=True)
+class QualityObservation:
+    task_type: str
+    adapter_id: str
+    model_id: str
+    cost_usd: float
+    quality_score: float
+    latency_ms: float
+    tokens_in: int
+    tokens_out: int
+    baseline_adapter_id: str | None = None
+    recorded_at: datetime = field(default_factory=...)
+    tags: dict[str, Any] = field(default_factory=dict)
+
+    @property
+    def total_tokens(self) -> int: ...
+    def to_dict(self) -> dict[str, Any]: ...
+    @classmethod
+    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...
+
+class QualityLedger:
+    def __init__(self, path: str | Path): ...
+    @property
+    def path(self) -> Path: ...
+    def append(self, observation: QualityObservation) -> None: ...
+    def read_all(self) -> list[QualityObservation]: ...
+    def malformed_count(self) -> int: ...
+    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
+    def recent(...) -> list[QualityObservation]: ...
+    def mean_quality(...) -> float | None: ...
+    def prune_before(self, timestamp: datetime) -> int: ...
+
+def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
+```
+
+## Invariants
+
+1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the
+   candidate fully meets the grader's quality bar and `0.0` means complete
+   failure for that grader.
+2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
+3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
+4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
+5. Ledger records are JSON Lines. Each line is one `QualityObservation.to_dict()`.
+6. `QualityLedger.append()` acquires a process-local lock and an advisory file
+   lock around each write.
+7. Read/query helpers skip malformed lines instead of failing the whole ledger.
+   `malformed_count()` exposes how many lines were skipped.
+8. `prune_before()` removes only valid observations older than the cutoff.
+   Malformed lines are preserved.
+
+## Error contract
+
+| Condition | Exception |
+|-----------|-----------|
+| Invalid observation field | `ValueError` |
+| Invalid datetime field | `TypeError` or `ValueError` |
+| Negative recent limit | `ValueError` |
+| `mean_quality(min_observations <= 0)` | `ValueError` |
+| `is_stale(max_age < 0)` | `ValueError` |
+
+## Known consumers
+
+- `infospace-bench` is the first intended consumer. It is expected to provide
+  task taxonomy, thresholds, and baseline choice.
+
+## Notes
+
+The ledger intentionally stores only observation metadata in this slice. Callers
+that need prompt or response digests can place those in `tags`, for example
+`prompt_fingerprint`.
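Note on `quality-ledger.md`: invariants 4, 5, and 7 (UTC normalisation, JSON Lines storage, and tolerant reads) can be sketched as follows. This is a simplified illustration rather than the `QualityLedger` implementation. The function names are hypothetical, records are plain dicts instead of `QualityObservation` instances, and the locking described in invariant 6 is deliberately left out. The sketch also assumes a line that parses as JSON but is not an object counts as malformed.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Any


def normalise_utc(value: datetime) -> datetime:
    # Naive datetimes are interpreted as UTC; aware ones are converted to UTC.
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def append_record(path: Path, record: dict[str, Any]) -> None:
    # One JSON object per line. Locking is omitted in this sketch.
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def read_records(path: Path) -> tuple[list[dict[str, Any]], int]:
    """Return (valid records, count of malformed lines that were skipped)."""
    records: list[dict[str, Any]] = []
    malformed = 0
    if not path.exists():
        return records, malformed
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # blank lines are ignored, not counted
            try:
                parsed = json.loads(line)
            except json.JSONDecodeError:
                malformed += 1
                continue
            if isinstance(parsed, dict):
                records.append(parsed)
            else:
                # Assumption: non-object JSON values are treated as malformed.
                malformed += 1
    return records, malformed
```

Skipping bad lines instead of raising means a single corrupted write does not make the whole observation history unreadable.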
--- a/contracts/functional/shadowing-adapter.md
+++ b/contracts/functional/shadowing-adapter.md
@@ -0,0 +1,84 @@
+# Contract: ShadowingAdapter
+
+**layer:** Functional
+**maturity:** Beta
+**module:** `llm_connect.shadowing`
+**since:** WP-0004
+
+## Purpose
+
+Collect quality observations without changing caller-visible model behavior.
+`ShadowingAdapter` wraps a candidate adapter, returns the candidate response to
+the caller, and samples extra baseline/grading work that appends
+`QualityObservation` records to a `QualityLedger`.
+
+## Public surface
+
+```python
+@dataclass
+class ShadowingAdapter(LLMAdapter):
+    candidate_adapter: LLMAdapter
+    baseline_adapter: LLMAdapter
+    grader: BaselineGrader
+    ledger: QualityLedger
+    task_type: str
+    adapter_id: str
+    model_id: Optional[str] = None
+    baseline_adapter_id: Optional[str] = None
+    shadow_rate: float = 1.0
+    async_shadow: bool = False
+    tags: Mapping[str, Any] = field(default_factory=dict)
+    on_shadow_error: Optional[Callable[[Exception], None]] = None
+
+    def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
+    async def async_execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
+    def flush(self, timeout: Optional[float] = None) -> None: ...
+    def shutdown(self, wait: bool = True) -> None: ...
+```
+
+## Invariants
+
+1. The candidate adapter is always called first.
+2. The response returned by `execute_prompt()` and `async_execute_prompt()` is
+   always the candidate response.
+3. Shadow failures from the baseline adapter, grader, or ledger writer are
+   isolated from the caller. They are sent to `on_shadow_error` when configured.
+4. `shadow_rate=0.0` records no observations. `shadow_rate=1.0` shadows every
+   successful candidate call. Intermediate values sample with `random_source`.
+5. Shadow grading reuses the candidate response already returned by the wrapped
+   candidate adapter; it does not make a second candidate model call.
+6. Shadow calls use a copy of `RunConfig` with `budget_tracker=None`, so
+   observation collection cannot consume the caller's foreground token budget.
+7. `async_shadow=True` schedules shadow work on a background thread. `flush()`
+   waits for currently queued work, and `shutdown()` releases the executor.
+
+## Observation mapping
+
+The appended observation uses:
+
+- `task_type` from the wrapper configuration
+- `adapter_id` from the wrapper configuration
+- `model_id` from the wrapper configuration, then candidate response model, then
+  `RunConfig.model_name`
+- `quality_score` from the `GradingResult`
+- `cost_usd` from response metadata keys `cost_usd`, `estimated_cost_usd`, or
+  `cost`, falling back to `0.0`
+- token counts from candidate response usage keys `prompt_tokens` and
+  `completion_tokens`
+- `baseline_adapter_id` and `tags` from wrapper configuration
+
+## Error contract
+
+| Condition | Exception |
+|-----------|-----------|
+| Empty `task_type` | `ValueError` |
+| Empty `adapter_id` | `ValueError` |
+| `shadow_rate` outside `0..1` | `ValueError` |
+| Candidate adapter failure | Original exception propagates |
+| Shadow baseline/grading/ledger failure | Suppressed; optional callback |
+
+## Privacy note
+
+The wrapper does not store prompt or response text in the ledger by default.
+Callers that need regime tracking should store non-sensitive fingerprints in
+`tags`, for example `prompt_fingerprint` or `template_version`.
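Note on `shadowing-adapter.md`: invariants 1 to 4 (candidate first, candidate response always returned, shadow failures isolated, rate-based sampling) can be sketched with plain callables. This is a synchronous illustration only, not the `ShadowingAdapter` implementation. `call_with_shadow`, `candidate`, `shadow_work`, and `rng` are hypothetical names, and the comparison `rng.random() < shadow_rate` is an assumed sampling rule.

```python
import random
from typing import Callable, Optional


def call_with_shadow(
    candidate: Callable[[str], str],
    shadow_work: Callable[[str, str], None],
    prompt: str,
    *,
    shadow_rate: float = 1.0,
    rng: Optional[random.Random] = None,
    on_shadow_error: Optional[Callable[[Exception], None]] = None,
) -> str:
    """Run the candidate, then optionally run shadow work without affecting the caller."""
    if not 0.0 <= shadow_rate <= 1.0:
        raise ValueError("shadow_rate must be within 0..1")
    rng = rng or random.Random()

    # Candidate runs first; its exceptions propagate to the caller unchanged.
    response = candidate(prompt)

    # random() is in [0.0, 1.0), so rate 0.0 never samples and rate 1.0 always does.
    if rng.random() < shadow_rate:
        try:
            shadow_work(prompt, response)
        except Exception as exc:
            # Shadow failures are suppressed and reported only through the callback.
            if on_shadow_error is not None:
                on_shadow_error(exc)
    return response
```

Passing the already-obtained `response` into `shadow_work` mirrors invariant 5: grading reuses the candidate output and does not trigger a second candidate call.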