generated from coulomb/repo-seed
Add adaptive cost-quality routing primitives
87
contracts/functional/adaptive-routing-policy.md
Normal file
@@ -0,0 +1,87 @@
# Contract: AdaptiveRoutingPolicy

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.routing`
**since:** WP-0004

## Purpose

Select the cheapest adapter whose observed mean quality for a task type clears
a caller-supplied quality floor. The policy builds on `RoutingPolicy`: static
rules remain the cold-start and failure fallback, while adaptive selection is
used only when the ledger has enough qualifying observations.

## Public surface

```python
@dataclass
class AdaptiveRoutingPolicy(RoutingPolicy):
    ledger: Optional[QualityLedger] = None
    adapters_by_id: Mapping[str, LLMAdapter] = field(default_factory=dict)
    window_size: int = 20
    min_observations: int = 1
    max_age: Optional[timedelta] = None

    def resolve(
        self,
        task_type: str,
        estimated_cost_per_1k: Optional[float] = None,
        *,
        quality_floor: Optional[float] = None,
    ) -> LLMAdapter: ...
```

## Candidate identity

Observations are keyed by `(task_type, adapter_id)`. Callers should pass
`adapters_by_id` so the policy can map ledger observations back to concrete
`LLMAdapter` instances. If a static rule adapter is not present in
`adapters_by_id`, the policy also checks common string attributes
`adapter_id`, `id`, and `name`.

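The lookup order described above can be sketched as follows. This is an illustration only, not the module's implementation: `resolve_adapter_id` and `FakeAdapter` are hypothetical names, and matching entries in `adapters_by_id` by object identity is an assumption.

```python
from typing import Any, Mapping, Optional


def resolve_adapter_id(adapter: Any, adapters_by_id: Mapping[str, Any]) -> Optional[str]:
    """Return the ledger id for `adapter`, or None when none can be found."""
    # 1. Prefer the explicit mapping supplied by the caller
    #    (assumption: entries are matched by object identity).
    for adapter_id, known in adapters_by_id.items():
        if known is adapter:
            return adapter_id
    # 2. Fall back to the common string attributes, in the documented order.
    for attr in ("adapter_id", "id", "name"):
        value = getattr(adapter, attr, None)
        if isinstance(value, str) and value:
            return value
    return None


class FakeAdapter:
    """Hypothetical stand-in for an LLMAdapter that only exposes `name`."""

    def __init__(self, name: str) -> None:
        self.name = name


mapped = FakeAdapter("ignored-name")
unmapped = FakeAdapter("small-model")
registry = {"cheap": mapped}
```

Here `mapped` resolves through the mapping, while `unmapped` falls back to its `name` attribute.
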
## Invariants

1. If `quality_floor is None` or `ledger is None`, resolution is exactly the
   same as `RoutingPolicy.resolve()`.
2. `quality_floor` must be between `0` and `1`, inclusive.
3. Each candidate is evaluated over the newest `window_size` observations for
   the requested `task_type` and adapter id.
4. `max_age`, when provided, filters out observations older than that age.
5. A candidate is considered only when it has at least `min_observations` after
   filtering.
6. A candidate qualifies when its mean `quality_score` is greater than or equal
   to `quality_floor`.
7. Among qualifying candidates, the policy chooses the lowest mean observed
   `cost_usd`.
8. If mean observed cost ties exactly, the policy prefers the matching static
   rule's explicit `prefer` adapter.
9. If there are still ties, stable candidate order is used.
10. If no candidate qualifies, resolution falls through to
    `RoutingPolicy.resolve(task_type, estimated_cost_per_1k)`.

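The selection rules above can be sketched in a few lines. This is a simplified model under stated assumptions, not the actual `resolve()` code: `Obs` is a hypothetical stand-in for `QualityObservation`, `pick_adapter_id` is an invented helper, and applying the window before the age filter is an assumption the contract does not spell out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from statistics import fmean
from typing import Optional, Sequence


@dataclass(frozen=True)
class Obs:  # hypothetical stand-in for QualityObservation
    adapter_id: str
    quality_score: float
    cost_usd: float
    recorded_at: datetime


def pick_adapter_id(
    observations: Sequence[Obs],   # observations for one task type, oldest first
    candidates: Sequence[str],     # adapter ids in stable candidate order
    quality_floor: float,
    *,
    window_size: int = 20,
    min_observations: int = 1,
    max_age: Optional[timedelta] = None,
    prefer: Optional[str] = None,  # the static rule's explicit `prefer` adapter
    now: Optional[datetime] = None,
) -> Optional[str]:
    if not 0.0 <= quality_floor <= 1.0:
        raise ValueError("quality_floor must be within 0..1")
    if window_size <= 0 or min_observations <= 0:
        raise ValueError("window_size and min_observations must be positive")
    now = now or datetime.now(timezone.utc)
    qualifying = []
    for order, adapter_id in enumerate(candidates):
        # Newest `window_size` observations, then the age filter
        # (assumption: this ordering is not stated in the contract).
        window = [o for o in observations if o.adapter_id == adapter_id][-window_size:]
        if max_age is not None:
            window = [o for o in window if now - o.recorded_at <= max_age]
        if len(window) < min_observations:
            continue
        if fmean(o.quality_score for o in window) < quality_floor:
            continue
        mean_cost = fmean(o.cost_usd for o in window)
        # Sort key: lowest cost, then the preferred adapter, then stable order.
        qualifying.append((mean_cost, adapter_id != prefer, order, adapter_id))
    if not qualifying:
        return None  # the caller would fall through to the static rules
    return min(qualifying)[3]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
obs = [
    Obs("large", 0.95, 0.010, now),
    Obs("small", 0.82, 0.002, now),
    Obs("tiny", 0.40, 0.001, now),
]
order = ["large", "small", "tiny"]
```

With a floor of `0.8` the sketch picks `small`, the cheapest adapter that clears the floor; raising the floor to `0.9` leaves only `large`, and a floor no candidate meets returns `None` to signal the static fallback.
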
## Sample-size and freshness trade-off

Small `window_size` values react quickly to model or prompt changes but can be
noisy. Larger windows are more stable but may preserve stale behavior after a
provider update or prompt template change. `min_observations` lets callers avoid
acting on a single lucky sample, while `max_age` bounds how long old observations
can influence routing. Callers that change prompts materially should also tag
observations with a prompt fingerprint and filter on it, so that only comparable
samples are written to the same ledger regime.

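A small numeric illustration of the window trade-off, using made-up scores. `windowed_mean` is a hypothetical helper, not part of `llm_connect`.

```python
from statistics import fmean
from typing import Sequence


def windowed_mean(scores: Sequence[float], window_size: int) -> float:
    """Mean of the newest `window_size` scores (scores are ordered oldest first)."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return fmean(scores[-window_size:])


# Twelve good samples followed by three poor ones, e.g. after a provider update.
scores = [0.9] * 12 + [0.3] * 3

small = windowed_mean(scores, 3)   # sees only the regression
large = windowed_mean(scores, 15)  # still dominated by the older samples
```

Against a quality floor of `0.75`, the three-sample window already disqualifies the adapter while the fifteen-sample window still lets it through, which is the staleness risk described above.
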
## Error contract

| Condition | Exception |
|-----------|-----------|
| `quality_floor` outside `0..1` | `ValueError` |
| `window_size <= 0` | `ValueError` |
| `min_observations <= 0` | `ValueError` |
| `max_age < 0` | `ValueError` |
| No qualifying adaptive candidate and no static fallback | `LookupError` |

## Non-goals

The policy does not define a task taxonomy, set task quality floors, decide
which baseline is authoritative, or perform billing-grade accounting. Those are
consumer policy choices.
85
contracts/functional/baseline-grading.md
Normal file
@@ -0,0 +1,85 @@
# Contract: Baseline Grading

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.grading`
**since:** WP-0004

## Purpose

Compare a candidate adapter response against a caller-chosen baseline response
and return a normalised quality score suitable for storage in
`QualityLedger`.

## Public surface

```python
@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: LLMResponse
    candidate_response: LLMResponse

class Judge(Protocol):
    grader_id: str
    def judge(..., *, prompt: str, run_config: RunConfig) -> GradingResult: ...

class BaselineGrader(Protocol):
    def grade(
        self,
        baseline_adapter: LLMAdapter,
        candidate_adapter: LLMAdapter,
        prompt: str,
        run_config: RunConfig,
    ) -> GradingResult: ...

@dataclass
class ExactMatchJudge: ...

@dataclass
class EmbeddingSimilarityJudge: ...

@dataclass
class LLMJudge: ...

@dataclass
class PairedGrader: ...
```

## Invariants

1. `quality_score` is always validated to lie within `0.0..1.0`.
2. `GradingResult` always preserves both baseline and candidate responses.
3. `PairedGrader` runs the baseline adapter and the candidate adapter with the
   same prompt and run config, then delegates comparison to its `Judge`.
4. `ExactMatchJudge` returns `1.0` when the candidate content matches the
   baseline content and `0.0` otherwise.
5. `EmbeddingSimilarityJudge` embeds baseline and candidate response text in a
   single batch and clamps cosine similarity into `0.0..1.0`.
6. `LLMJudge` uses a fixed rubric prompt and expects JSON with
   `quality_score` and optional `notes`.
7. `LLMJudge` runs with `temperature=0.0`, drops the caller's budget tracker,
   and adds a deterministic `seed` model parameter when configured.

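The two non-LLM scoring rules in the list above are simple enough to sketch directly. The helpers below are hypothetical and only illustrate the arithmetic. Treating "matched content" as strict string equality, and scoring a zero-length vector as `0.0`, are assumptions rather than documented behavior.

```python
import math
from typing import Sequence


def exact_match_score(baseline: str, candidate: str) -> float:
    """1.0 when the texts are identical, 0.0 otherwise (assumed strict equality)."""
    return 1.0 if baseline == candidate else 0.0


def clamped_cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two embedding vectors, clamped into 0.0..1.0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0  # assumption: a zero vector carries no similarity signal
    # Raw cosine lies in -1..1; negative similarity is clamped to 0.0.
    return min(1.0, max(0.0, dot / norm))
```

Clamping means that opposite-pointing embeddings score `0.0` rather than `-1.0`, which keeps the result inside the range that `quality_score` validation accepts.
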
## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid `quality_score` | `ValueError` |
| Empty `grader_id` | `ValueError` |
| Embedding adapter returns a number of vectors other than two | `ValueError` |
| LLM judge response is missing parseable JSON | `ValueError` |

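One way the JSON-related rows of this table could be enforced is sketched below. `parse_judge_reply` is a hypothetical helper. Locating the JSON object by its outermost braces is an assumption made for the sketch, not a description of how `LLMJudge` parses replies.

```python
import json
from typing import Tuple


def parse_judge_reply(text: str) -> Tuple[float, str]:
    """Extract (quality_score, notes) from a judge reply or raise ValueError."""
    # Assumption: the reply contains one JSON object, possibly wrapped in prose.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("judge reply contains no JSON object")
    try:
        payload = json.loads(text[start : end + 1])
    except json.JSONDecodeError as exc:
        raise ValueError("judge reply is not parseable JSON") from exc
    score = payload.get("quality_score")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise ValueError("quality_score must be a number")
    if not 0.0 <= score <= 1.0:
        raise ValueError("quality_score must be within 0.0..1.0")
    return float(score), str(payload.get("notes", ""))
```

Every failure path raises `ValueError`, matching the table, so a caller only needs one `except` clause around judge parsing.
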
## Bias caveats

LLM-as-judge scoring is heuristic and may exhibit:

- Length bias: longer answers can be preferred even when not better.
- Format bias: familiar formatting can be rewarded independent of correctness.
- Position bias: prompt order can affect judgement.
- Self-preference: a judge may favour outputs from its own model family.

Consumers should calibrate `LLMJudge` against at least one non-LLM judge such
as exact match or embedding similarity before using its observations to drive
adaptive routing.
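A rough calibration check might compare the two judges' scores on the same prompts. The helper and the numbers below are illustrative only: `mean_absolute_gap` is a hypothetical name, the scores are invented, and the contract does not prescribe any particular calibration metric.

```python
from statistics import fmean
from typing import Sequence


def mean_absolute_gap(llm_scores: Sequence[float], reference_scores: Sequence[float]) -> float:
    """Average absolute difference between paired judge scores."""
    if len(llm_scores) != len(reference_scores) or not llm_scores:
        raise ValueError("need two non-empty score lists of equal length")
    return fmean(abs(a - b) for a, b in zip(llm_scores, reference_scores))


# Invented scores for four prompts graded by both judges.
llm = [0.9, 0.8, 0.95, 0.7]
embedding = [0.85, 0.6, 0.9, 0.65]
gap = mean_absolute_gap(llm, embedding)
```

A consumer could refuse to route on `LLMJudge` observations until this gap stays below a threshold of its own choosing.
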
87
contracts/functional/quality-ledger.md
Normal file
@@ -0,0 +1,87 @@
# Contract: QualityObservation and QualityLedger

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.quality`
**since:** WP-0004

## Purpose

Record observed quality, cost, latency, and token outcomes for a logical task
type so consumers can build adaptive routing policy without putting
consumer-specific thresholds into llm-connect.

## Public surface

```python
@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str | None = None
    recorded_at: datetime = field(default_factory=...)
    tags: dict[str, Any] = field(default_factory=dict)

    @property
    def total_tokens(self) -> int: ...
    def to_dict(self) -> dict[str, Any]: ...
    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "QualityObservation": ...

class QualityLedger:
    def __init__(self, path: str | Path): ...
    @property
    def path(self) -> Path: ...
    def append(self, observation: QualityObservation) -> None: ...
    def read_all(self) -> list[QualityObservation]: ...
    def malformed_count(self) -> int: ...
    def by_task_type(self, task_type: str) -> list[QualityObservation]: ...
    def recent(...) -> list[QualityObservation]: ...
    def mean_quality(...) -> float | None: ...
    def prune_before(self, timestamp: datetime) -> int: ...

def is_stale(observation: QualityObservation, max_age: timedelta, *, now: datetime | None = None) -> bool: ...
```

## Invariants

1. `quality_score` is a normalised `0.0..1.0` score where `1.0` means the
   candidate fully meets the grader's quality bar and `0.0` means complete
   failure for that grader.
2. `task_type`, `adapter_id`, and `model_id` must be non-empty strings.
3. `cost_usd`, `latency_ms`, `tokens_in`, and `tokens_out` are non-negative.
4. `recorded_at` is normalised to UTC. Naive datetimes are interpreted as UTC.
5. Ledger records are JSON Lines. Each line is one serialised
   `QualityObservation.to_dict()`.
6. `QualityLedger.append()` takes a process-local lock plus an advisory file
   lock around each write.
7. Read/query helpers skip malformed lines instead of failing the whole ledger.
   `malformed_count()` exposes how many lines were skipped.
8. `prune_before()` removes only valid observations older than the cutoff.
   Malformed lines are preserved.

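The storage invariants above (JSON Lines, locked appends, tolerant reads) can be sketched with the standard library. This is a simplified model, not the `QualityLedger` implementation: `append_record` and `read_records` are hypothetical names, plain dicts stand in for `QualityObservation.to_dict()`, and using `fcntl.flock` for the advisory lock is an assumption that only applies on POSIX systems.

```python
import json
import threading
from pathlib import Path
from typing import Any, Dict, List, Tuple

try:
    import fcntl  # POSIX only; the sketch skips the file lock elsewhere
except ImportError:
    fcntl = None

_process_lock = threading.Lock()


def append_record(path: Path, record: Dict[str, Any]) -> None:
    """Append one JSON line under a process-local lock and an advisory file lock."""
    line = json.dumps(record, sort_keys=True)
    with _process_lock:
        with path.open("a", encoding="utf-8") as handle:
            if fcntl is not None:
                fcntl.flock(handle.fileno(), fcntl.LOCK_EX)
            try:
                handle.write(line + "\n")
                handle.flush()
            finally:
                if fcntl is not None:
                    fcntl.flock(handle.fileno(), fcntl.LOCK_UN)


def read_records(path: Path) -> Tuple[List[Dict[str, Any]], int]:
    """Return (valid records, number of malformed lines skipped)."""
    records: List[Dict[str, Any]] = []
    malformed = 0
    if not path.exists():
        return records, malformed
    with path.open("r", encoding="utf-8") as handle:
        for raw in handle:
            if not raw.strip():
                continue  # ignore blank lines
            try:
                value = json.loads(raw)
            except json.JSONDecodeError:
                malformed += 1
                continue
            if isinstance(value, dict):
                records.append(value)
            else:
                malformed += 1  # valid JSON, but not an observation object
    return records, malformed
```

Skipping a bad line while counting it keeps one corrupted write from hiding the rest of the ledger, and the count gives callers a signal that something needs attention.
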
## Error contract

| Condition | Exception |
|-----------|-----------|
| Invalid observation field | `ValueError` |
| Invalid datetime field | `TypeError` or `ValueError` |
| Negative recent limit | `ValueError` |
| `mean_quality(min_observations <= 0)` | `ValueError` |
| `is_stale(max_age < 0)` | `ValueError` |

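The datetime rules in this contract (UTC normalisation, naive values read as UTC, and rejection of a negative `max_age`) can be sketched as below. `normalise_utc` and `is_stale_sketch` are hypothetical names that take a bare timestamp instead of a `QualityObservation`. Treating an observation as stale only when it is strictly older than `max_age` is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def normalise_utc(value: datetime) -> datetime:
    """Return `value` in UTC; naive datetimes are interpreted as UTC."""
    if not isinstance(value, datetime):
        raise TypeError("expected a datetime")
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def is_stale_sketch(
    recorded_at: datetime,
    max_age: timedelta,
    *,
    now: Optional[datetime] = None,
) -> bool:
    """True when the timestamp is older than `max_age` relative to `now`."""
    if max_age < timedelta(0):
        raise ValueError("max_age must be non-negative")
    current = normalise_utc(now) if now is not None else datetime.now(timezone.utc)
    # Assumption: an observation exactly `max_age` old is still fresh.
    return current - normalise_utc(recorded_at) > max_age
```

Passing `now` explicitly, as the real `is_stale` signature also allows, keeps staleness checks deterministic in tests.
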
## Known consumers

- `infospace-bench` is the first intended consumer. It is expected to provide
  task taxonomy, thresholds, and baseline choice.

## Notes

The ledger intentionally stores only observation metadata in this slice. Callers
that need prompt or response digests can place those in `tags`, for example
`prompt_fingerprint`.
84
contracts/functional/shadowing-adapter.md
Normal file
@@ -0,0 +1,84 @@
# Contract: ShadowingAdapter

**layer:** Functional
**maturity:** Beta
**module:** `llm_connect.shadowing`
**since:** WP-0004

## Purpose

Collect quality observations without changing caller-visible model behavior.
`ShadowingAdapter` wraps a candidate adapter, returns the candidate response to
the caller, and samples extra baseline/grading work that appends
`QualityObservation` records to a `QualityLedger`.

## Public surface

```python
@dataclass
class ShadowingAdapter(LLMAdapter):
    candidate_adapter: LLMAdapter
    baseline_adapter: LLMAdapter
    grader: BaselineGrader
    ledger: QualityLedger
    task_type: str
    adapter_id: str
    model_id: Optional[str] = None
    baseline_adapter_id: Optional[str] = None
    shadow_rate: float = 1.0
    async_shadow: bool = False
    tags: Mapping[str, Any] = field(default_factory=dict)
    on_shadow_error: Optional[Callable[[Exception], None]] = None

    def execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
    async def async_execute_prompt(self, prompt: str, config: RunConfig) -> LLMResponse: ...
    def flush(self, timeout: Optional[float] = None) -> None: ...
    def shutdown(self, wait: bool = True) -> None: ...
```

## Invariants

1. The candidate adapter is always called first.
2. The response returned by `execute_prompt()` and `async_execute_prompt()` is
   always the candidate response.
3. Shadow failures from the baseline adapter, grader, or ledger writer are
   isolated from the caller. They are sent to `on_shadow_error` when configured.
4. `shadow_rate=0.0` records no observations. `shadow_rate=1.0` shadows every
   successful candidate call. Intermediate values sample with `random_source`.
5. Shadow grading reuses the candidate response already returned by the wrapped
   candidate adapter; it does not make a second candidate model call.
6. Shadow calls use a copy of `RunConfig` with `budget_tracker=None`, so
   observation collection cannot consume the caller's foreground token budget.
7. `async_shadow=True` schedules shadow work on a background thread. `flush()`
   waits for currently queued work, and `shutdown()` releases the executor.

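The synchronous control flow implied by invariants 1 to 5 can be sketched with plain callables. This is a reduced model, not the adapter itself: `shadowed_call` and its parameters are hypothetical, the `shadow` callable stands in for the combined baseline call, grading, and ledger write, and the background-thread path is left out.

```python
import random
from typing import Callable, Optional


def shadowed_call(
    prompt: str,
    candidate: Callable[[str], str],
    shadow: Callable[[str, str], None],
    *,
    shadow_rate: float = 1.0,
    rng: Optional[random.Random] = None,
    on_shadow_error: Optional[Callable[[Exception], None]] = None,
) -> str:
    if not 0.0 <= shadow_rate <= 1.0:
        raise ValueError("shadow_rate must be within 0..1")
    # The candidate runs first; its exceptions propagate unchanged.
    response = candidate(prompt)
    rng = rng or random.Random()
    # random() returns a value in [0.0, 1.0), so a rate of 1.0 always
    # shadows and a rate of 0.0 never does.
    if rng.random() < shadow_rate:
        try:
            # Reuses the response already obtained; no second candidate call.
            shadow(prompt, response)
        except Exception as exc:
            # Shadow failures never reach the caller.
            if on_shadow_error is not None:
                on_shadow_error(exc)
    return response
```

Because the return statement sits outside the `try`, the caller receives the candidate response whether or not the shadow work succeeds.
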
## Observation mapping

The appended observation uses:

- `task_type` from the wrapper configuration
- `adapter_id` from the wrapper configuration
- `model_id` from the wrapper configuration, then candidate response model, then
  `RunConfig.model_name`
- `quality_score` from the `GradingResult`
- `cost_usd` from response metadata keys `cost_usd`, `estimated_cost_usd`, or
  `cost`, falling back to `0.0`
- token counts from candidate response usage keys `prompt_tokens` and
  `completion_tokens`
- `baseline_adapter_id` and `tags` from wrapper configuration

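The fallback chains in the list above can be written as small helpers. These are illustrative only: the function names are hypothetical, plain mappings stand in for response metadata and usage, and defaulting missing token counts to `0` is an assumption the contract does not state.

```python
from typing import Any, Mapping, Optional, Tuple


def extract_cost_usd(metadata: Mapping[str, Any]) -> float:
    """First numeric value among the documented cost keys, else 0.0."""
    for key in ("cost_usd", "estimated_cost_usd", "cost"):
        value = metadata.get(key)
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
    return 0.0


def pick_model_id(
    configured: Optional[str],
    response_model: Optional[str],
    run_config_model: Optional[str],
) -> Optional[str]:
    """Wrapper configuration wins, then the response model, then RunConfig.model_name."""
    for value in (configured, response_model, run_config_model):
        if value:
            return value
    return None


def extract_tokens(usage: Mapping[str, Any]) -> Tuple[int, int]:
    """(tokens_in, tokens_out); missing keys default to 0 (assumption)."""
    return int(usage.get("prompt_tokens", 0)), int(usage.get("completion_tokens", 0))
```

Since cost falls back to `0.0`, an adapter that never reports cost metadata will look free to any consumer that ranks adapters by observed cost, so it is worth confirming that each adapter populates one of these keys.
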
## Error contract

| Condition | Exception |
|-----------|-----------|
| Empty `task_type` | `ValueError` |
| Empty `adapter_id` | `ValueError` |
| `shadow_rate` outside `0..1` | `ValueError` |
| Candidate adapter failure | Original exception propagates |
| Shadow baseline/grading/ledger failure | Suppressed; optional callback |

## Privacy note

The wrapper does not store prompt or response text in the ledger by default.
Callers that need regime tracking should store non-sensitive fingerprints in
`tags`, for example `prompt_fingerprint` or `template_version`.
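One possible way to build such a fingerprint is to hash the prompt template rather than the filled-in prompt. The helper name, the truncation length, and the tag values below are illustrative choices, not something the contract prescribes.

```python
import hashlib


def prompt_fingerprint(template: str, *, length: int = 16) -> str:
    """Short, stable hex digest of a prompt template."""
    digest = hashlib.sha256(template.encode("utf-8")).hexdigest()
    return digest[:length]


# Hypothetical tag values for a wrapper's `tags` mapping.
tags = {
    "prompt_fingerprint": prompt_fingerprint("Summarise the following text: {document}"),
    "template_version": "v2",
}
```

Hashing the template instead of the rendered prompt keeps user-supplied text out of the ledger while still changing the fingerprint whenever the template itself is edited.
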