Draft the workplan that extends the static RoutingPolicy (WP-0003) with a quality observation ledger, a BaselineGrader (ClaudeCodeAdapter as the default oracle), an AdaptiveRoutingPolicy that picks the cheapest adapter clearing a per-task quality floor, and a sampled ShadowingAdapter for production observation collection. Scope is explicit: ship primitives only. Task-type taxonomy, quality thresholds, baseline choice, and re-grading cadence stay with the consumer. infospace-bench is the named first consumer; consumer wiring deferred until T01-T03 land. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LLM-WP-0004 — Adaptive Cost-Quality Routing
status: todo
owner: llm-connect
repo: llm-connect
created: 2026-05-17
depends-on: LLM-WP-0003 (RoutingPolicy primitive)
Purpose
Provide reusable primitives that let a consumer route each task to the
cheapest model whose observed output quality clears a per-task bar, with
the local Claude Code session (ClaudeCodeAdapter) available as the
baseline-quality oracle. The current RoutingPolicy (LLM-WP-0003) is
static: rules and cost caps are hand-authored. This workplan adds the
observation, grading, and adaptive-selection layer that learns which
model is good enough for which task type.
Demand signal: infospace-bench is about to scale from one-chapter
smoke runs to multi-chapter and full-book infospace generation across
multiple workflow stages (summarise, extract entities, extract
relations, evaluate). Each stage has very different quality / cost
trade-offs, and the consumer needs llm-connect to pick the right model
per stage instead of hard-coding one model for the whole run.
Scope guardrails (read this before adding tasks)
llm-connect ships primitives, not consumer policy.
In scope here:
- Data models for quality observations and grading results
- A reusable observation ledger (persistent, append-only, file-backed)
- A BaselineGrader that pairs a baseline adapter with a candidate adapter and emits a structured quality delta
- A small built-in grader catalogue (exact-match, embedding similarity, LLM-as-judge wrapper)
- An AdaptiveRoutingPolicy that extends RoutingPolicy by consulting the ledger to pick the cheapest adapter whose observed quality for a task type still clears a configured threshold
- A shadow-mode wrapper adapter for collecting observations in production without changing caller behaviour
Out of scope (belongs in consumer repos):
- Task-type taxonomy (callers name their tasks)
- Quality thresholds per task type (callers set their own bars)
- Choice of baseline (callers wire whichever adapter they trust)
- When to re-grade (callers decide; this repo just exposes ledger TTL and refresh helpers)
- Cost accounting for billing or budgets beyond a per-call estimate
GAAF notes
All additions are Functional-layer per GAAF-2026. Core stays untouched.
Each new module gets a functional contract doc under
contracts/functional/. Maturity on release: Beta — infospace-bench
is the first known consumer; the API may shift before any second
consumer (inter-hub, markitect) adopts it.
Tasks
T01 — Quality observation data model + ledger
| ID | Title | Priority | Status |
|---|---|---|---|
| T01 | QualityObservation dataclass: task_type, adapter_id, model_id, cost_usd, quality_score (0..1), latency_ms, tokens_in, tokens_out, baseline_adapter_id, recorded_at, tags | high | todo |
| T02 | QualityLedger append-only JSONL store with file-locked writes, configurable path, simple query helpers (by_task_type, recent, mean_quality) | high | todo |
| T03 | TTL helpers: prune_before(timestamp) and is_stale(observation, max_age) so callers can refresh observations without re-reading the whole ledger | medium | todo |
| T04 | Functional contract doc for the ledger schema and the field semantics of quality_score | medium | todo |
| T05 | Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience | high | todo |
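The ledger tasks above can be sketched roughly as follows. This is a minimal illustration, not the planned implementation: the field names come from the task table, but the field types, the dict shape of tags, the ISO 8601 encoding of recorded_at, and the read_all helper are assumptions. File locking and prune_before are left out to keep the sketch short.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from statistics import fmean


@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float  # 0..1, higher is better
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str
    recorded_at: str  # assumption: ISO 8601 timestamp with a UTC offset
    tags: dict[str, str] = field(default_factory=dict)  # assumption: string map


class QualityLedger:
    """Append-only JSONL store; file locking is deliberately omitted here."""

    def __init__(self, path: str | Path) -> None:
        self.path = Path(path)

    def append(self, obs: QualityObservation) -> None:
        # One JSON object per line keeps appends cheap and the file greppable.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(obs)) + "\n")

    def read_all(self) -> list[QualityObservation]:
        # Hypothetical helper: loads every well-formed line in file order.
        if not self.path.exists():
            return []
        rows: list[QualityObservation] = []
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    rows.append(QualityObservation(**json.loads(line)))
                except (json.JSONDecodeError, TypeError):
                    continue  # malformed or schema-mismatched line: skip it
        return rows

    def by_task_type(self, task_type: str) -> list[QualityObservation]:
        return [o for o in self.read_all() if o.task_type == task_type]

    def mean_quality(self, task_type: str, adapter_id: str) -> float | None:
        scores = [
            o.quality_score
            for o in self.by_task_type(task_type)
            if o.adapter_id == adapter_id
        ]
        return fmean(scores) if scores else None


def is_stale(obs: QualityObservation, max_age: timedelta) -> bool:
    # Assumes recorded_at carries a UTC offset, so the subtraction is valid.
    recorded = datetime.fromisoformat(obs.recorded_at)
    return datetime.now(timezone.utc) - recorded > max_age
```

Skipping unparseable lines instead of raising is one way to meet the malformed-line resilience test in T05; a real implementation might also want to count or log the skipped lines.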
T02 — Baseline grader
| ID | Title | Priority | Status |
|---|---|---|---|
| T06 | GradingResult dataclass: quality_score, notes, grader_id, baseline_response, candidate_response | high | todo |
| T07 | BaselineGrader protocol: .grade(baseline_adapter, candidate_adapter, prompt, run_config) → GradingResult; built-in concrete PairedGrader runs both calls and delegates to a Judge | high | todo |
| T08 | Judge protocol + three built-ins: ExactMatchJudge, EmbeddingSimilarityJudge (uses an embedding adapter), LLMJudge (uses a third adapter with a fixed rubric prompt) | high | todo |
| T09 | Functional contract doc covering judge bias caveats (length bias, format bias, position bias for LLMJudge) | medium | todo |
| T10 | Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for LLMJudge rubric | high | todo |
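One possible shape for the grader pieces listed above, shown with the simplest judge only. The adapter surface is not specified in this workplan, so the Adapter protocol and its complete(prompt) method are hypothetical stand-ins, as are the judge_id attribute, the score method name, and the whitespace stripping in ExactMatchJudge. The run_config parameter is accepted to match the planned signature but is not used here.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


class Adapter(Protocol):
    # Hypothetical minimal adapter surface, for illustration only.
    def complete(self, prompt: str) -> str: ...


class Judge(Protocol):
    judge_id: str

    def score(self, baseline: str, candidate: str) -> float: ...


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: str
    candidate_response: str


class ExactMatchJudge:
    judge_id = "exact-match"

    def score(self, baseline: str, candidate: str) -> float:
        # Assumption: surrounding whitespace is not a meaningful difference.
        return 1.0 if baseline.strip() == candidate.strip() else 0.0


class PairedGrader:
    """Runs both adapters on the same prompt and delegates scoring to a Judge."""

    def __init__(self, judge: Judge) -> None:
        self.judge = judge

    def grade(
        self,
        baseline_adapter: Adapter,
        candidate_adapter: Adapter,
        prompt: str,
        run_config: Any = None,  # unused in this sketch
    ) -> GradingResult:
        baseline = baseline_adapter.complete(prompt)
        candidate = candidate_adapter.complete(prompt)
        return GradingResult(
            quality_score=self.judge.score(baseline, candidate),
            notes="",
            grader_id=self.judge.judge_id,
            baseline_response=baseline,
            candidate_response=candidate,
        )
```

Keeping both responses on the result is what lets a later, stricter judge re-score the same pair without re-running either adapter.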
T03 — Adaptive routing policy
| ID | Title | Priority | Status |
|---|---|---|---|
| T11 | AdaptiveRoutingPolicy extends RoutingPolicy: given task_type + quality_floor + ledger, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window | high | todo |
| T12 | Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules | medium | todo |
| T13 | Cold-start behaviour: when no observations exist for a (task_type, adapter) pair, fall through to the static RoutingPolicy.resolve result so the system stays usable on day zero | high | todo |
| T14 | Functional contract doc; document the trade-off between sample size and freshness | medium | todo |
| T15 | Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain | high | todo |
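The selection rule in T11 to T13 might look like the function below. It is a sketch under several assumptions: the function name pick_adapter, the trimmed Obs record, and the window default are all illustrative; observations are assumed to arrive in chronological order; and cold start is read as "no adapter has observations that clear the floor", in which case the static choice is returned unchanged. The RoutingPolicy API itself is not modelled here.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from statistics import fmean
from typing import Iterable


@dataclass(frozen=True)
class Obs:
    # Trimmed-down observation carrying only the fields the rule needs.
    task_type: str
    adapter_id: str
    cost_usd: float
    quality_score: float


def pick_adapter(
    observations: Iterable[Obs],
    task_type: str,
    quality_floor: float,
    static_choice: str,
    window: int = 20,
) -> str:
    """Return the cheapest adapter whose windowed mean quality clears the floor."""
    per_adapter: dict[str, list[Obs]] = defaultdict(list)
    for obs in observations:
        if obs.task_type == task_type:
            per_adapter[obs.adapter_id].append(obs)

    qualifying = []
    for adapter_id, rows in per_adapter.items():
        recent = rows[-window:]  # most recent observations only
        if fmean([r.quality_score for r in recent]) >= quality_floor:
            mean_cost = fmean([r.cost_usd for r in recent])
            # Sort key: cost first, then prefer the static choice on a cost tie,
            # then adapter_id so the result is deterministic.
            qualifying.append((mean_cost, adapter_id != static_choice, adapter_id))

    if not qualifying:
        return static_choice  # cold start or nothing clears the floor
    return min(qualifying)[2]
```

A small window reacts quickly to provider changes but gives a noisy mean; a large one is steadier but slower to notice a regression. That is the sample-size versus freshness trade-off that T14 asks the contract to document.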
T04 — Shadow-mode observation wrapper
| ID | Title | Priority | Status |
|---|---|---|---|
| T16 | ShadowingAdapter wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a QualityLedger | medium | todo |
| T17 | Sampling: caller-configurable fraction (shadow_rate=0.1 means grade one call in ten) so production load is not doubled | medium | todo |
| T18 | Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised | high | todo |
| T19 | Functional contract doc | low | todo |
| T20 | Tests: candidate response always returned even when baseline raises, ledger gets exactly shadow_rate × calls entries (within tolerance), sync vs async modes | high | todo |
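The sampling and failure-isolation requirements above can be illustrated with a synchronous wrapper. This is a sketch only: the complete(prompt) method is a hypothetical adapter surface, and the shadow callback stands in for "call the baseline, grade, append to the ledger", which would be wired from the grader and ledger in practice. The thread-pool mode is not shown.

```python
from __future__ import annotations

import logging
import random
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ShadowingAdapter:
    """Wraps a candidate adapter and grades a sampled fraction of its calls."""

    def __init__(
        self,
        candidate,  # any object with a complete(prompt) -> str method (assumed)
        shadow: Callable[[str, str], None],  # hypothetical: (prompt, response)
        shadow_rate: float = 0.1,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be between 0 and 1")
        self.candidate = candidate
        self.shadow = shadow
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()

    def complete(self, prompt: str) -> str:
        # The candidate call happens first and its result is always returned.
        response = self.candidate.complete(prompt)
        # random() is in [0.0, 1.0), so rate 0 never shadows and rate 1 always does.
        if self.rng.random() < self.shadow_rate:
            try:
                self.shadow(prompt, response)
            except Exception:
                # Shadow failures are logged, never raised to the caller.
                logger.exception("shadow grading failed; candidate response unaffected")
        return response
```

Injecting the random generator makes the sampling rate testable with a fixed seed, which is one way to approach the "within tolerance" check in T20.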
T05 — End-to-end example + integration test
| ID | Title | Priority | Status |
|---|---|---|---|
| T21 | Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, ClaudeCodeAdapter as baseline), grade each, populate ledger | medium | todo |
| T22 | Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter | high | todo |
| T23 | Brief consumer-integration guide in docs/ showing how infospace-bench (or any caller) wires task-type-per-stage into the adaptive policy | medium | todo |
Risks and open questions
- Judge bias. LLMJudge has known biases: length, position, format, and self-preference when the judge model is in the same family as a candidate. The contract must document these and recommend pairing with at least one non-LLM judge for calibration.
- Baseline cost in shadow mode. ClaudeCodeAdapter is not per-call billed (it shells out to a local subscription session), but every shadow call still consumes wall-clock and rate-budget. Sampling is load-bearing, not optional.
- Non-stationarity. Provider model updates, prompt changes, and template edits all silently invalidate prior observations. Plan for a prompt_fingerprint tag on observations so the ledger can be filtered to a coherent regime.
- Scope creep. "Pick a model" is one decision; "decide whether the task is worth doing at all" is another. The latter is consumer policy. Keep this workplan firmly on the former.
- Privacy. Observations contain prompt and response text by default. Add a redact: Callable[[str], str] hook on the ledger writer so sensitive callers can store hashes / digests instead.
- API-vs-CLI baseline parity. A consumer that grades against ClaudeCodeAdapter (CLI) but later switches to a Claude API adapter may see quality drift that's actually transport drift. Document this.
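Two of the mitigations above, the prompt_fingerprint tag and the redact hook, could be as small as the helpers below. Both are illustrative assumptions: the workplan does not say what goes into the fingerprint or how redaction should work, so hashing the model id together with the prompt template, truncating to 16 hex characters, and the sha256_redact name are choices made for this sketch.

```python
import hashlib


def prompt_fingerprint(template: str, model_id: str) -> str:
    """Short stable tag that changes whenever the template or model changes."""
    payload = f"{model_id}\n{template}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def sha256_redact(text: str) -> str:
    """A redact hook matching Callable[[str], str]: stores a digest, not the text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
```

With a tag like this on each observation, a consumer can filter the ledger to a single fingerprint before computing mean quality, so scores from an older template do not leak into the current regime.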
Exit criteria
- QualityLedger round-trips observations and exposes the documented query helpers
- BaselineGrader produces deterministic GradingResult objects for at least one non-LLM judge and one LLM judge given canned inputs
- AdaptiveRoutingPolicy.resolve(task_type, quality_floor=0.8) returns the cheapest adapter whose mean quality over the configured window clears the floor, with a documented cold-start fallback
- ShadowingAdapter never alters the candidate response and respects the sampling rate within statistical tolerance
- End-to-end example produces a ledger with at least three adapters per task type and the integration test shows convergence to the cheapest qualifying adapter
- Functional contracts published for the new data models, the grader protocols, and the adaptive policy
Consumer-side follow-up
infospace-bench will need a small companion workplan (IB-WP-NNNN) to:
- Replace direct OpenRouterAssistedGenerationAdapter use with a task-type-tagged route through AdaptiveRoutingPolicy
- Define its task-type taxonomy (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report)
- Pick a baseline adapter (most likely ClaudeCodeAdapter) and a quality threshold per stage
- Wire the shadow-mode wrapper for the first multi-chapter run so the ledger fills up while real generation proceeds
That workplan should be drafted after T01–T03 of this workplan land, so that the consumer-side wiring is anchored in a stable llm-connect API.