---
id: LLM-WP-0004
type: workplan
title: Adaptive Cost-Quality Routing
domain: custodian
status: completed
owner: llm-connect
created: 2026-05-17
repo: llm-connect
planning_priority: high
planning_order: 4
state_hub_workstream_id: e1807fab-e29e-4517-b362-95737a96582d
---
# LLM-WP-0004 — Adaptive Cost-Quality Routing
**status:** completed
**owner:** llm-connect
**repo:** llm-connect
**created:** 2026-05-17
**depends-on:** LLM-WP-0003 (RoutingPolicy primitive)
## Purpose
Provide reusable primitives that let a consumer route each task to the
cheapest model whose observed output quality clears a per-task bar, with
the local Claude Code session (`ClaudeCodeAdapter`) available as the
baseline-quality oracle. The current `RoutingPolicy` (LLM-WP-0003) is
static: rules and cost caps are hand-authored. This workplan adds the
observation, grading, and adaptive-selection layer that learns *which*
model is good enough for which task type.

Demand signal: `infospace-bench` is about to scale from one-chapter
smoke runs to multi-chapter and full-book infospace generation across
multiple workflow stages (summarise, extract entities, extract
relations, evaluate). Each stage has very different quality / cost
trade-offs, and the consumer needs llm-connect to pick the right model
per stage instead of hard-coding one model for the whole run.
## Scope guardrails (read this before adding tasks)
llm-connect ships *primitives*, not consumer policy.

In scope here:
- Data models for quality observations and grading results
- A reusable observation ledger (persistent, append-only, file-backed)
- A `BaselineGrader` that pairs a baseline adapter with a candidate
adapter and emits a structured quality delta
- A small built-in grader catalogue (exact-match, embedding similarity,
LLM-as-judge wrapper)
- An `AdaptiveRoutingPolicy` that extends `RoutingPolicy` by consulting
the ledger to pick the cheapest adapter whose observed quality for a
task type still clears a configured threshold
- A shadow-mode wrapper adapter for collecting observations in
production without changing caller behaviour

Out of scope (belongs in consumer repos):
- Task-type taxonomy (callers name their tasks)
- Quality thresholds per task type (callers set their own bars)
- Choice of baseline (callers wire whichever adapter they trust)
- When to re-grade (callers decide; this repo just exposes ledger TTL
and refresh helpers)
- Cost accounting for billing or budgets beyond a per-call estimate
## GAAF notes
All additions are Functional-layer per GAAF-2026. Core stays untouched.
Each new module gets a functional contract doc under
`contracts/functional/`. Maturity on release: Beta — `infospace-bench`
is the first known consumer; the API may shift before any second
consumer (`inter-hub`, `markitect`) adopts it.
## Tasks
The fenced `task` blocks below are the State Hub registration index. Keep them
in sync with the detailed task tables that follow.
```task
id: T01
title: 'QualityObservation dataclass: task_type, adapter_id, model_id, cost_usd, quality_score (0..1), latency_ms, tokens_in, tokens_out, baseline_adapter_id, recorded_at, tags'
priority: high
status: done
state_hub_task_id: "1c285bec-c30b-45a8-a408-3f91d810a078"
```
```task
id: T02
title: 'QualityLedger append-only JSONL store with file-locked writes, configurable path, simple query helpers (by_task_type, recent, mean_quality)'
priority: high
status: done
state_hub_task_id: "5249f171-a047-499f-9ec4-cb50e1477765"
```
```task
id: T03
title: 'TTL helpers: prune_before(timestamp) and is_stale(observation, max_age)'
priority: medium
status: done
state_hub_task_id: "adb255cf-7e89-4fea-b822-6be437d99789"
```
```task
id: T04
title: 'Functional contract doc for the ledger schema and quality_score semantics'
priority: medium
status: done
state_hub_task_id: "51a33180-a99d-4aa4-96be-2fcee15bfbc3"
```
```task
id: T05
title: 'Tests: ledger round-trip, concurrent writes, query helpers, TTL, malformed-line resilience'
priority: high
status: done
state_hub_task_id: "458610c5-c903-4b42-9602-cd511999c9ba"
```
```task
id: T06
title: 'GradingResult dataclass: quality_score, notes, grader_id, baseline_response, candidate_response'
priority: high
status: done
state_hub_task_id: "c12a595b-90fc-4a80-8394-549edbda2031"
```
```task
id: T07
title: 'BaselineGrader protocol plus PairedGrader that runs baseline and candidate calls and delegates to a Judge'
priority: high
status: done
state_hub_task_id: "80b98e31-06fc-4462-b030-a12881095f93"
```
```task
id: T08
title: 'Judge protocol and built-ins: ExactMatchJudge, EmbeddingSimilarityJudge, LLMJudge'
priority: high
status: done
state_hub_task_id: "c2887fe3-bae6-4298-8c26-f9a519264dcf"
```
```task
id: T09
title: 'Functional contract doc covering judge bias caveats'
priority: medium
status: done
state_hub_task_id: "7a4fd87a-b0ba-41b0-8e1a-a60fdaded905"
```
```task
id: T10
title: 'Tests: judges with canned inputs, stable grader result, deterministic LLMJudge rubric seed'
priority: high
status: done
state_hub_task_id: "8415a11d-d508-4d17-8082-10f93e9d16c5"
```
```task
id: T11
title: 'AdaptiveRoutingPolicy extends RoutingPolicy and selects the cheapest adapter whose observed mean quality clears the floor'
priority: high
status: done
state_hub_task_id: "0e9f9f8e-5066-4257-913b-a19f5b3fc47d"
```
```task
id: T12
title: 'Tie-breaking: prefer lower observed cost, then explicit preferred adapter from static rules'
priority: medium
status: done
state_hub_task_id: "59d44712-1088-41ac-bad8-5d95db6f3a4f"
```
```task
id: T13
title: 'Cold-start behaviour falls through to static RoutingPolicy.resolve when observations are missing'
priority: high
status: done
state_hub_task_id: "1927d369-f5f6-48d3-8f53-7e4f1cae370e"
```
```task
id: T14
title: 'Functional contract doc for adaptive policy and sample-size/freshness trade-off'
priority: medium
status: done
state_hub_task_id: "4d4717c1-8849-4fed-8f8d-515901ecafe0"
```
```task
id: T15
title: 'Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain'
priority: high
status: done
state_hub_task_id: "304bd782-db15-4b7a-8d05-49e064a926c3"
```
```task
id: T16
title: 'ShadowingAdapter wraps a candidate adapter, also invokes the baseline adapter, grades, and appends to QualityLedger'
priority: medium
status: done
state_hub_task_id: "62dd507f-536a-4623-8cbd-fa9f78e85ca6"
```
```task
id: T17
title: 'Sampling: caller-configurable shadow_rate so production load is not doubled'
priority: medium
status: done
state_hub_task_id: "ccb73e92-1fca-42f9-8437-9b2b50e6424c"
```
```task
id: T18
title: 'Failure isolation: shadow errors never affect the candidate response returned to the caller'
priority: high
status: done
state_hub_task_id: "b879d232-d6ce-4ff6-b534-616729ea5ad7"
```
```task
id: T19
title: 'Functional contract doc for ShadowingAdapter'
priority: low
status: done
state_hub_task_id: "99d2c1bc-f1d8-42b3-9e04-6eea49460943"
```
```task
id: T20
title: 'Tests: candidate response survives baseline failure, ledger sampling rate, sync vs async modes'
priority: high
status: done
state_hub_task_id: "f533fbf4-484f-4408-8260-7e84e23bdc46"
```
```task
id: T21
title: 'Example script: route fixture batch through three candidate adapters and populate the ledger'
priority: medium
status: done
state_hub_task_id: "7ef0c143-74b0-4740-81fa-819a826cf8f3"
```
```task
id: T22
title: 'Integration test: cold-start, static fallback, first observations, convergence to cheapest qualifying adapter'
priority: high
status: done
state_hub_task_id: "c4c6743f-157b-4445-8576-9caa6421d463"
```
```task
id: T23
title: 'Consumer integration guide showing how infospace-bench wires task types into adaptive policy'
priority: medium
status: done
state_hub_task_id: "3a073ff7-0170-4a95-9c2a-a5daa84964e6"
```
### T01 — Quality observation data model + ledger
| ID | Title | Priority | Status |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
| T01 | `QualityObservation` dataclass: `task_type`, `adapter_id`, `model_id`, `cost_usd`, `quality_score` (0..1), `latency_ms`, `tokens_in`, `tokens_out`, `baseline_adapter_id`, `recorded_at`, `tags` | high | done |
| T02 | `QualityLedger` append-only JSONL store with file-locked writes, configurable path, simple query helpers (`by_task_type`, `recent`, `mean_quality`) | high | done |
| T03 | TTL helpers: `prune_before(timestamp)` and `is_stale(observation, max_age)` so callers can refresh observations without re-reading the whole ledger | medium | done |
| T04 | Functional contract doc for the ledger schema and the field semantics of `quality_score` | medium | done |
| T05 | Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience | high | done |
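
The ledger rows above can be pictured with a minimal sketch. The field names come from T01; everything else is an assumption made for illustration: `tags` as a string-to-string dict, `recorded_at` as an ISO-8601 string, the `append` and `read_all` method names, and `is_stale` as a free function. The file locking required by T02 is deliberately omitted here, so this is not safe for concurrent writers.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from statistics import fmean


@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float  # expected to lie in 0..1
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str
    recorded_at: str  # assumption: ISO-8601 timestamp with UTC offset
    tags: dict[str, str] = field(default_factory=dict)  # assumption: str -> str


class QualityLedger:
    """Append-only JSONL store (sketch: no file locking)."""

    def __init__(self, path: str | Path) -> None:
        self.path = Path(path)

    def append(self, obs: QualityObservation) -> None:
        # One JSON object per line; existing lines are never rewritten.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(obs), sort_keys=True) + "\n")

    def read_all(self) -> list[QualityObservation]:
        if not self.path.exists():
            return []
        rows = []
        with self.path.open(encoding="utf-8") as fh:
            for raw in fh:
                try:
                    rows.append(QualityObservation(**json.loads(raw)))
                except (json.JSONDecodeError, TypeError):
                    continue  # skip malformed or incomplete lines
        return rows

    def by_task_type(self, task_type: str) -> list[QualityObservation]:
        return [o for o in self.read_all() if o.task_type == task_type]

    def mean_quality(self, task_type: str, adapter_id: str) -> float | None:
        scores = [o.quality_score for o in self.by_task_type(task_type)
                  if o.adapter_id == adapter_id]
        return fmean(scores) if scores else None


def is_stale(obs: QualityObservation, max_age: timedelta,
             now: datetime | None = None) -> bool:
    """True when the observation is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - datetime.fromisoformat(obs.recorded_at) > max_age
```

Returning `None` from `mean_quality` when nothing has been recorded keeps "no data" distinguishable from "quality of zero", which matters for the cold-start path in the adaptive policy.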
### T02 — Baseline grader
| ID | Title | Priority | Status |
|-----|----------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
| T06 | `GradingResult` dataclass: `quality_score`, `notes`, `grader_id`, `baseline_response`, `candidate_response` | high | done |
| T07 | `BaselineGrader` protocol: `.grade(baseline_adapter, candidate_adapter, prompt, run_config)` → `GradingResult`; built-in concrete `PairedGrader` runs both calls and delegates to a `Judge` | high | done |
| T08 | `Judge` protocol + three built-ins: `ExactMatchJudge`, `EmbeddingSimilarityJudge` (uses an embedding adapter), `LLMJudge` (uses a third adapter with a fixed rubric prompt) | high | done |
| T09 | Functional contract doc covering judge bias caveats (length bias, format bias, position bias for `LLMJudge`) | medium | done |
| T10 | Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for `LLMJudge` rubric | high | done |
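
A rough sketch of how the grader pieces in this table could fit together. The `GradingResult` fields and the `grade` argument order follow T06 and T07. The rest is assumed for illustration: adapters are reduced to plain `prompt -> response` callables, the `Judge` method is named `score` and returns a `(score, notes)` pair, and `ExactMatchJudge` ignores leading and trailing whitespace.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Protocol

# Stand-in for a real adapter: takes a prompt, returns response text.
Adapter = Callable[[str], str]


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: str
    candidate_response: str


class Judge(Protocol):
    judge_id: str

    def score(self, baseline: str, candidate: str) -> tuple[float, str]:
        """Return (quality_score in 0..1, human-readable notes)."""
        ...


class ExactMatchJudge:
    judge_id = "exact-match"

    def score(self, baseline: str, candidate: str) -> tuple[float, str]:
        # Assumption: surrounding whitespace is not significant.
        same = baseline.strip() == candidate.strip()
        return (1.0, "exact match") if same else (0.0, "responses differ")


class PairedGrader:
    """Runs both adapters on one prompt and delegates scoring to a Judge."""

    def __init__(self, judge: Judge) -> None:
        self.judge = judge

    def grade(self, baseline_adapter: Adapter, candidate_adapter: Adapter,
              prompt: str, run_config: dict | None = None) -> GradingResult:
        # run_config is accepted for signature parity but unused in this sketch.
        baseline = baseline_adapter(prompt)
        candidate = candidate_adapter(prompt)
        score, notes = self.judge.score(baseline, candidate)
        return GradingResult(
            quality_score=score,
            notes=notes,
            grader_id=self.judge.judge_id,
            baseline_response=baseline,    # both raw responses are preserved
            candidate_response=candidate,
        )
```

Keeping the judge behind a protocol means an embedding or LLM-based judge can be swapped in without touching `PairedGrader`.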
### T03 — Adaptive routing policy
| ID | Title | Priority | Status |
|-----|--------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
| T11 | `AdaptiveRoutingPolicy` extends `RoutingPolicy`: given `task_type` + `quality_floor` + `ledger`, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window | high | done |
| T12 | Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules | medium | done |
| T13 | Cold-start behaviour: when no observations exist for a `(task_type, adapter)` pair, fall through to the static `RoutingPolicy.resolve` result so the system stays usable on day zero | high | done |
| T14 | Functional contract doc; document the trade-off between sample size and freshness | medium | done |
| T15 | Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain | high | done |
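
The selection rule described in T11 to T13 can be sketched as a single function. This is an illustration under stated assumptions, not the shipped `AdaptiveRoutingPolicy`: the `Obs` tuple and `select_adapter` name are hypothetical, "window" is taken to mean the last `window` observations per adapter, "clears the floor" is read as `>=`, cost is compared as mean observed cost, and the static choice is also used when observations exist but no adapter qualifies.

```python
from __future__ import annotations

from collections import defaultdict
from statistics import fmean
from typing import NamedTuple, Sequence


class Obs(NamedTuple):
    task_type: str
    adapter_id: str
    cost_usd: float
    quality_score: float


def select_adapter(observations: Sequence[Obs], task_type: str,
                   quality_floor: float, static_choice: str,
                   window: int = 20) -> str:
    """Pick the cheapest adapter whose windowed mean quality clears the floor.

    observations are assumed to be in chronological order.
    static_choice is what the static policy would return for this task.
    """
    per_adapter: dict[str, list[Obs]] = defaultdict(list)
    for obs in observations:
        if obs.task_type == task_type:
            per_adapter[obs.adapter_id].append(obs)

    qualifying = []
    for adapter_id, rows in per_adapter.items():
        recent = rows[-window:]
        if fmean(o.quality_score for o in recent) >= quality_floor:
            mean_cost = fmean(o.cost_usd for o in recent)
            # Sort key: cost first, then prefer the static choice (False < True),
            # then adapter id so the result is deterministic.
            qualifying.append((mean_cost, adapter_id != static_choice, adapter_id))

    if not qualifying:
        return static_choice  # cold start, or nothing clears the floor
    return min(qualifying)[2]
```

A smaller `window` reacts faster to provider changes but makes the mean noisier, which is the sample-size versus freshness trade-off that T14 asks the contract to document.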
### T04 — Shadow-mode observation wrapper
| ID | Title | Priority | Status |
|-----|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
| T16 | `ShadowingAdapter` wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a `QualityLedger` | medium | done |
| T17 | Sampling: caller-configurable fraction (`shadow_rate=0.1` means grade one call in ten) so production load is not doubled | medium | done |
| T18 | Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised | high | done |
| T19 | Functional contract doc | low | done |
| T20 | Tests: candidate response always returned even when baseline raises, ledger gets approximately `shadow_rate × calls` entries (within statistical tolerance), sync vs async modes | high | done |
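
The sampling and failure-isolation rules in T16 to T18 might look roughly like this in synchronous form. The constructor parameters other than `shadow_rate` are assumptions for illustration: adapters are plain `prompt -> response` callables, `grade` takes the two responses, and `record` stands in for a ledger append. The thread-pool mode mentioned in T16 is not shown.

```python
from __future__ import annotations

import logging
import random
from typing import Any, Callable

logger = logging.getLogger("shadowing")


class ShadowingAdapter:
    """Returns the candidate response; grades a sampled fraction in the shadow."""

    def __init__(self,
                 candidate: Callable[[str], str],
                 baseline: Callable[[str], str],
                 grade: Callable[[str, str], Any],   # (baseline, candidate) -> result
                 record: Callable[[Any], None],      # e.g. a ledger append
                 shadow_rate: float = 0.1,
                 rng: random.Random | None = None) -> None:
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be between 0 and 1")
        self.candidate = candidate
        self.baseline = baseline
        self.grade = grade
        self.record = record
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()

    def __call__(self, prompt: str) -> str:
        # Candidate errors propagate exactly as they would without the wrapper.
        response = self.candidate(prompt)
        if self.rng.random() < self.shadow_rate:
            try:
                baseline_response = self.baseline(prompt)
                self.record(self.grade(baseline_response, response))
            except Exception:
                # Shadow failures are logged, never raised to the caller.
                logger.exception("shadow grading failed; candidate response unaffected")
        return response
```

Because the sampling decision is an independent draw per call, the number of ledger entries over many calls is only approximately `shadow_rate × calls`; injecting a seeded `random.Random` makes tests of that rate reproducible.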
### T05 — End-to-end example + integration test
| ID | Title | Priority | Status |
|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
| T21 | Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, `ClaudeCodeAdapter` as baseline), grade each, populate ledger | medium | done |
| T22 | Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter | high | done |
| T23 | Brief consumer-integration guide in `docs/` showing how `infospace-bench` (or any caller) wires task-type-per-stage into the adaptive policy | medium | done |
## Risks and open questions
- **Judge bias.** `LLMJudge` has known biases — length, position, format,
self-preference when the judge model is the same family as a
candidate. The contract must document these and recommend pairing
with at least one non-LLM judge for calibration.
- **Baseline cost in shadow mode.** `ClaudeCodeAdapter` is not billed per
call (it shells out to a local subscription session), but every
shadow call still consumes wall-clock time and rate-limit budget.
Sampling is load-bearing, not optional.
- **Non-stationarity.** Provider model updates, prompt changes, and
template edits all silently invalidate prior observations. Plan for
a `prompt_fingerprint` tag on observations so the ledger can be
filtered to a coherent regime.
- **Scope creep.** "Pick a model" is one decision; "decide whether the
task is worth doing at all" is another. The latter is consumer
policy. Keep this workplan firmly on the former.
- **Privacy.** Observations contain prompt and response text by
default. Add a `redact: Callable[[str], str]` hook on the ledger
writer so sensitive callers can store hashes / digests instead.
- **API-vs-CLI baseline parity.** A consumer that grades against
`ClaudeCodeAdapter` (CLI) but later switches to a Claude API adapter
may see quality drift that's actually transport drift. Document this.
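
Two of the mitigations above, the `redact` hook and the `prompt_fingerprint` tag, are small enough to sketch. Only the `redact: Callable[[str], str]` signature and the tag name come from this workplan. The helper names, the `prompt` and `response` record keys, and the choice of SHA-256 over model id plus template text are assumptions for illustration.

```python
from __future__ import annotations

import hashlib
from typing import Any, Callable


def sha256_digest(text: str) -> str:
    """A redact hook that stores a digest instead of the raw text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_fingerprint(template: str, model_id: str) -> str:
    """Short stable tag identifying one (model, prompt template) regime."""
    payload = f"{model_id}\n{template}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def redact_record(record: dict[str, Any],
                  redact: Callable[[str], str],
                  text_fields: tuple[str, ...] = ("prompt", "response"),
                  ) -> dict[str, Any]:
    """Return a copy of record with the free-text fields passed through redact."""
    return {
        key: redact(value) if key in text_fields and isinstance(value, str) else value
        for key, value in record.items()
    }
```

Filtering the ledger to rows whose `prompt_fingerprint` tag matches the current template then discards observations recorded under an older prompt or model version without deleting them.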
## Exit criteria
- `QualityLedger` round-trips observations and exposes the documented
query helpers
- `BaselineGrader` produces deterministic `GradingResult` objects for at
least one non-LLM judge and one LLM judge given canned inputs
- `AdaptiveRoutingPolicy.resolve(task_type, quality_floor=0.8)` returns
the cheapest adapter whose mean quality over the configured window
clears the floor, with a documented cold-start fallback
- `ShadowingAdapter` never alters the candidate response and respects
the sampling rate within statistical tolerance
- End-to-end example produces a ledger with at least three adapters per
task type and the integration test shows convergence to the cheapest
qualifying adapter
- Functional contracts published for the new data models, the grader
protocols, and the adaptive policy
## Consumer-side follow-up
`infospace-bench` will need a small companion workplan (`IB-WP-NNNN`) to:
- Replace direct `OpenRouterAssistedGenerationAdapter` use with a
task-type-tagged route through `AdaptiveRoutingPolicy`
- Define its task-type taxonomy (`summarize-source`, `extract-entities`,
`extract-relations`, `evaluate-entity`, `synthesize-report`)
- Pick a baseline adapter (most likely `ClaudeCodeAdapter`) and a
quality threshold per stage
- Wire the shadow-mode wrapper for the first multi-chapter run so the
ledger fills up while real generation proceeds

That workplan should be drafted after T01–T03 of this workplan land, so
that the consumer-side wiring is anchored in a stable llm-connect API.
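
On the consumer side, the per-stage wiring could start as a small validated table. The task-type names are the ones listed above; `StageRoute`, `build_routes`, the `"claude-code"` baseline id, and the 0.8 default floor (borrowed from the exit-criteria example) are hypothetical placeholders, and the real thresholds remain the consumer's decision.

```python
from __future__ import annotations

from dataclasses import dataclass

TASK_TYPES = (
    "summarize-source",
    "extract-entities",
    "extract-relations",
    "evaluate-entity",
    "synthesize-report",
)


@dataclass(frozen=True)
class StageRoute:
    task_type: str
    quality_floor: float
    baseline_adapter_id: str = "claude-code"  # hypothetical adapter id


def build_routes(floors: dict[str, float],
                 default_floor: float = 0.8) -> dict[str, StageRoute]:
    """Build one route per known task type, rejecting typos and bad floors."""
    unknown = set(floors) - set(TASK_TYPES)
    if unknown:
        raise ValueError(f"unknown task types: {sorted(unknown)}")
    routes = {}
    for task_type in TASK_TYPES:
        floor = floors.get(task_type, default_floor)
        if not 0.0 <= floor <= 1.0:
            raise ValueError(f"quality floor out of range for {task_type}: {floor}")
        routes[task_type] = StageRoute(task_type, floor)
    return routes
```

Rejecting unknown task types up front catches a misspelled stage name before it silently creates a separate, permanently cold-start bucket in the ledger.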