llm-connect/workplans/llm-connect-WP-0004-adaptive-cost-quality-routing.md
tegwick c4ad4bb9f2
Add adaptive cost-quality routing primitives
2026-05-17 21:32:27 +02:00

id: LLM-WP-0004
type: workplan
title: Adaptive Cost-Quality Routing
domain: custodian
status: completed
owner: llm-connect
created: 2026-05-17
repo: llm-connect
planning_priority: high
planning_order: 4
state_hub_workstream_id: e1807fab-e29e-4517-b362-95737a96582d

LLM-WP-0004 — Adaptive Cost-Quality Routing

status: completed
owner: llm-connect
repo: llm-connect
created: 2026-05-17
depends-on: LLM-WP-0003 (RoutingPolicy primitive)

Purpose

Provide reusable primitives that let a consumer route each task to the cheapest model whose observed output quality clears a per-task bar, with the local Claude Code session (ClaudeCodeAdapter) available as the baseline-quality oracle. The current RoutingPolicy (LLM-WP-0003) is static: rules and cost caps are hand-authored. This workplan adds the observation, grading, and adaptive-selection layer that learns which model is good enough for which task type.

Demand signal: infospace-bench is about to scale from one-chapter smoke runs to multi-chapter and full-book infospace generation across multiple workflow stages (summarise, extract entities, extract relations, evaluate). Each stage has very different quality / cost trade-offs, and the consumer needs llm-connect to pick the right model per stage instead of hard-coding one model for the whole run.

Scope guardrails (read this before adding tasks)

llm-connect ships primitives, not consumer policy.

In scope here:

  • Data models for quality observations and grading results
  • A reusable observation ledger (persistent, append-only, file-backed)
  • A BaselineGrader that pairs a baseline adapter with a candidate adapter and emits a structured quality delta
  • A small built-in grader catalogue (exact-match, embedding similarity, LLM-as-judge wrapper)
  • An AdaptiveRoutingPolicy that extends RoutingPolicy by consulting the ledger to pick the cheapest adapter whose observed quality for a task type still clears a configured threshold
  • A shadow-mode wrapper adapter for collecting observations in production without changing caller behaviour

Out of scope (belongs in consumer repos):

  • Task-type taxonomy (callers name their tasks)
  • Quality thresholds per task type (callers set their own bars)
  • Choice of baseline (callers wire whichever adapter they trust)
  • When to re-grade (callers decide; this repo just exposes ledger TTL and refresh helpers)
  • Cost accounting for billing or budgets beyond a per-call estimate

GAAF notes

All additions are Functional-layer per GAAF-2026. Core stays untouched. Each new module gets a functional contract doc under contracts/functional/. Maturity on release: Beta — infospace-bench is the first known consumer; the API may shift before any second consumer (inter-hub, markitect) adopts it.

Tasks

The fenced task blocks below are the State Hub registration index. Keep them in sync with the detailed task tables that follow.

id: T01
title: 'QualityObservation dataclass: task_type, adapter_id, model_id, cost_usd, quality_score (0..1), latency_ms, tokens_in, tokens_out, baseline_adapter_id, recorded_at, tags'
priority: high
status: done
state_hub_task_id: "1c285bec-c30b-45a8-a408-3f91d810a078"
id: T02
title: 'QualityLedger append-only JSONL store with file-locked writes, configurable path, simple query helpers (by_task_type, recent, mean_quality)'
priority: high
status: done
state_hub_task_id: "5249f171-a047-499f-9ec4-cb50e1477765"
id: T03
title: 'TTL helpers: prune_before(timestamp) and is_stale(observation, max_age)'
priority: medium
status: done
state_hub_task_id: "adb255cf-7e89-4fea-b822-6be437d99789"
id: T04
title: 'Functional contract doc for the ledger schema and quality_score semantics'
priority: medium
status: done
state_hub_task_id: "51a33180-a99d-4aa4-96be-2fcee15bfbc3"
id: T05
title: 'Tests: ledger round-trip, concurrent writes, query helpers, TTL, malformed-line resilience'
priority: high
status: done
state_hub_task_id: "458610c5-c903-4b42-9602-cd511999c9ba"
id: T06
title: 'GradingResult dataclass: quality_score, notes, grader_id, baseline_response, candidate_response'
priority: high
status: done
state_hub_task_id: "c12a595b-90fc-4a80-8394-549edbda2031"
id: T07
title: 'BaselineGrader protocol plus PairedGrader that runs baseline and candidate calls and delegates to a Judge'
priority: high
status: done
state_hub_task_id: "80b98e31-06fc-4462-b030-a12881095f93"
id: T08
title: 'Judge protocol and built-ins: ExactMatchJudge, EmbeddingSimilarityJudge, LLMJudge'
priority: high
status: done
state_hub_task_id: "c2887fe3-bae6-4298-8c26-f9a519264dcf"
id: T09
title: 'Functional contract doc covering judge bias caveats'
priority: medium
status: done
state_hub_task_id: "7a4fd87a-b0ba-41b0-8e1a-a60fdaded905"
id: T10
title: 'Tests: judges with canned inputs, stable grader result, deterministic LLMJudge rubric seed'
priority: high
status: done
state_hub_task_id: "8415a11d-d508-4d17-8082-10f93e9d16c5"
id: T11
title: 'AdaptiveRoutingPolicy extends RoutingPolicy and selects the cheapest adapter whose observed mean quality clears the floor'
priority: high
status: done
state_hub_task_id: "0e9f9f8e-5066-4257-913b-a19f5b3fc47d"
id: T12
title: 'Tie-breaking: prefer lower observed cost, then explicit preferred adapter from static rules'
priority: medium
status: done
state_hub_task_id: "59d44712-1088-41ac-bad8-5d95db6f3a4f"
id: T13
title: 'Cold-start behaviour falls through to static RoutingPolicy.resolve when observations are missing'
priority: high
status: done
state_hub_task_id: "1927d369-f5f6-48d3-8f53-7e4f1cae370e"
id: T14
title: 'Functional contract doc for adaptive policy and sample-size/freshness trade-off'
priority: medium
status: done
state_hub_task_id: "4d4717c1-8849-4fed-8f8d-515901ecafe0"
id: T15
title: 'Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain'
priority: high
status: done
state_hub_task_id: "304bd782-db15-4b7a-8d05-49e064a926c3"
id: T16
title: 'ShadowingAdapter wraps a candidate adapter, also invokes the baseline adapter, grades, and appends to QualityLedger'
priority: medium
status: done
state_hub_task_id: "62dd507f-536a-4623-8cbd-fa9f78e85ca6"
id: T17
title: 'Sampling: caller-configurable shadow_rate so production load is not doubled'
priority: medium
status: done
state_hub_task_id: "ccb73e92-1fca-42f9-8437-9b2b50e6424c"
id: T18
title: 'Failure isolation: shadow errors never affect the candidate response returned to the caller'
priority: high
status: done
state_hub_task_id: "b879d232-d6ce-4ff6-b534-616729ea5ad7"
id: T19
title: 'Functional contract doc for ShadowingAdapter'
priority: low
status: done
state_hub_task_id: "99d2c1bc-f1d8-42b3-9e04-6eea49460943"
id: T20
title: 'Tests: candidate response survives baseline failure, ledger sampling rate, sync vs async modes'
priority: high
status: done
state_hub_task_id: "f533fbf4-484f-4408-8260-7e84e23bdc46"
id: T21
title: 'Example script: route fixture batch through three candidate adapters and populate the ledger'
priority: medium
status: done
state_hub_task_id: "7ef0c143-74b0-4740-81fa-819a826cf8f3"
id: T22
title: 'Integration test: cold-start, static fallback, first observations, convergence to cheapest qualifying adapter'
priority: high
status: done
state_hub_task_id: "c4c6743f-157b-4445-8576-9caa6421d463"
id: T23
title: 'Consumer integration guide showing how infospace-bench wires task types into adaptive policy'
priority: medium
status: done
state_hub_task_id: "3a073ff7-0170-4a95-9c2a-a5daa84964e6"

T01 — Quality observation data model + ledger

ID | Title | Priority | Status
T01 | QualityObservation dataclass: task_type, adapter_id, model_id, cost_usd, quality_score (0..1), latency_ms, tokens_in, tokens_out, baseline_adapter_id, recorded_at, tags | high | done
T02 | QualityLedger append-only JSONL store with file-locked writes, configurable path, simple query helpers (by_task_type, recent, mean_quality) | high | done
T03 | TTL helpers: prune_before(timestamp) and is_stale(observation, max_age) so callers can refresh observations without re-reading the whole ledger | medium | done
T04 | Functional contract doc for the ledger schema and the field semantics of quality_score | medium | done
T05 | Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience | high | done
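
The ledger tasks above can be illustrated with a minimal sketch. The field names follow the T01 title; everything else is an assumption made for illustration: the `tags` shape (a string-to-string dict), the ISO 8601 encoding of `recorded_at`, and the `append` / `read_all` method names. File locking and `prune_before` are left out to keep the sketch short, so this is not the shipped implementation.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from statistics import mean


@dataclass(frozen=True)
class QualityObservation:
    """One graded call; field names follow the T01 task title."""

    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float  # 0..1, higher is better
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str
    recorded_at: str  # ISO 8601 with UTC offset (assumed encoding)
    tags: dict[str, str] = field(default_factory=dict)  # assumed shape


class QualityLedger:
    """Append-only JSONL store. File locking is deliberately omitted here."""

    def __init__(self, path: str | Path) -> None:
        self.path = Path(path)

    def append(self, obs: QualityObservation) -> None:
        # One JSON object per line; existing lines are never rewritten.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(obs), sort_keys=True) + "\n")

    def read_all(self) -> list[QualityObservation]:
        if not self.path.exists():
            return []
        rows = []
        with self.path.open(encoding="utf-8") as fh:
            for raw in fh:
                if not raw.strip():
                    continue
                try:
                    rows.append(QualityObservation(**json.loads(raw)))
                except (ValueError, TypeError):
                    continue  # skip malformed lines instead of failing the read
        return rows

    def by_task_type(self, task_type: str) -> list[QualityObservation]:
        return [o for o in self.read_all() if o.task_type == task_type]

    def mean_quality(self, task_type: str, adapter_id: str) -> float | None:
        scores = [
            o.quality_score
            for o in self.by_task_type(task_type)
            if o.adapter_id == adapter_id
        ]
        return mean(scores) if scores else None


def is_stale(
    obs: QualityObservation, max_age: timedelta, now: datetime | None = None
) -> bool:
    """True when the observation is older than max_age."""
    now = now or datetime.now(timezone.utc)
    return now - datetime.fromisoformat(obs.recorded_at) > max_age
```

Re-reading the file on every query keeps the sketch simple; a real ledger would likely cache or index, which is why T03 mentions refreshing without re-reading the whole ledger.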

T02 — Baseline grader

ID | Title | Priority | Status
T06 | GradingResult dataclass: quality_score, notes, grader_id, baseline_response, candidate_response | high | done
T07 | BaselineGrader protocol: .grade(baseline_adapter, candidate_adapter, prompt, run_config) -> GradingResult; built-in concrete PairedGrader runs both calls and delegates to a Judge | high | done
T08 | Judge protocol + three built-ins: ExactMatchJudge, EmbeddingSimilarityJudge (uses an embedding adapter), LLMJudge (uses a third adapter with a fixed rubric prompt) | high | done
T09 | Functional contract doc covering judge bias caveats (length bias, format bias, position bias for LLMJudge) | medium | done
T10 | Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for LLMJudge rubric | high | done
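
The grader tasks above can be sketched as follows. Several details are assumptions, not the library's API: adapters are modelled as plain `prompt -> str` callables, the `Judge.score` method name and its `(score, notes)` return shape are hypothetical, and `ExactMatchJudge` here ignores surrounding whitespace. `run_config` is accepted to mirror the T07 signature but is unused.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Protocol

# Assumption: an adapter is reduced to a callable taking a prompt and
# returning response text. Real adapters will carry more state.
Adapter = Callable[[str], str]


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: str
    candidate_response: str


class Judge(Protocol):
    judge_id: str

    def score(self, baseline: str, candidate: str) -> tuple[float, str]:
        """Return (quality_score in 0..1, free-text notes)."""
        ...


class ExactMatchJudge:
    judge_id = "exact-match"

    def score(self, baseline: str, candidate: str) -> tuple[float, str]:
        # Design choice in this sketch: leading/trailing whitespace is ignored.
        same = baseline.strip() == candidate.strip()
        return (1.0 if same else 0.0, "match" if same else "mismatch")


class PairedGrader:
    """Runs baseline and candidate on the same prompt, then asks the judge."""

    def __init__(self, judge: Judge) -> None:
        self.judge = judge

    def grade(
        self,
        baseline_adapter: Adapter,
        candidate_adapter: Adapter,
        prompt: str,
        run_config: Any = None,  # unused in this sketch
    ) -> GradingResult:
        baseline = baseline_adapter(prompt)
        candidate = candidate_adapter(prompt)
        score, notes = self.judge.score(baseline, candidate)
        # Both raw responses are preserved so results can be audited later.
        return GradingResult(score, notes, self.judge.judge_id, baseline, candidate)
```

Keeping the judge separate from the grader means an embedding-based or LLM-based judge can be swapped in without touching the code that calls the two adapters.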

T03 — Adaptive routing policy

ID | Title | Priority | Status
T11 | AdaptiveRoutingPolicy extends RoutingPolicy: given task_type + quality_floor + ledger, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window | high | done
T12 | Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules | medium | done
T13 | Cold-start behaviour: when no observations exist for a (task_type, adapter) pair, fall through to the static RoutingPolicy.resolve result so the system stays usable on day zero | high | done
T14 | Functional contract doc; document the trade-off between sample size and freshness | medium | done
T15 | Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain | high | done
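
The selection rule described in T11 to T13 can be sketched as a single function. This is an illustration under stated assumptions, not the actual `AdaptiveRoutingPolicy`: `Obs` is a trimmed-down stand-in for `QualityObservation`, `static_resolve` stands in for `RoutingPolicy.resolve`, observations are assumed to arrive oldest first, and falling back to the static result when observed adapters all miss the floor is this sketch's own choice.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean
from typing import Callable, Iterable


@dataclass(frozen=True)
class Obs:
    """Hypothetical, reduced observation: only the fields the rule needs."""

    task_type: str
    adapter_id: str
    cost_usd: float
    quality_score: float


def resolve_adaptive(
    observations: Iterable[Obs],
    task_type: str,
    quality_floor: float,
    static_resolve: Callable[[str], str],
    window: int = 20,
) -> str:
    """Return the cheapest adapter whose windowed mean quality clears the floor."""
    preferred = static_resolve(task_type)  # static rule: fallback and tie-breaker

    per_adapter: dict[str, list[Obs]] = {}
    for obs in observations:  # assumed oldest-first
        if obs.task_type == task_type:
            per_adapter.setdefault(obs.adapter_id, []).append(obs)

    qualifying = []
    for adapter_id, rows in per_adapter.items():
        recent = rows[-window:]  # only the most recent `window` observations
        if mean(o.quality_score for o in recent) >= quality_floor:
            avg_cost = mean(o.cost_usd for o in recent)
            # Sort key: lower cost first, then the statically preferred
            # adapter (False sorts before True), then id for determinism.
            qualifying.append((avg_cost, adapter_id != preferred, adapter_id))

    if not qualifying:
        return preferred  # cold start, or nothing clears the floor
    return min(qualifying)[2]
```

A small `window` reacts faster to provider changes but averages over fewer samples, which is the sample-size versus freshness trade-off that T14 asks the contract to document.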

T04 — Shadow-mode observation wrapper

ID | Title | Priority | Status
T16 | ShadowingAdapter wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a QualityLedger | medium | done
T17 | Sampling: caller-configurable fraction (shadow_rate=0.1 means grade one call in ten) so production load is not doubled | medium | done
T18 | Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised | high | done
T19 | Functional contract doc | low | done
T20 | Tests: candidate response always returned even when baseline raises, ledger gets exactly shadow_rate × calls entries (within tolerance), sync vs async modes | high | done
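
The sampling and failure-isolation behaviour in T16 to T18 might look roughly like this synchronous sketch. The constructor parameters `grade` and `record` are hypothetical stand-ins for a grader and a ledger append, adapters are again modelled as `prompt -> str` callables, and the thread-pool mode is omitted.

```python
from __future__ import annotations

import logging
import random
from typing import Any, Callable

logger = logging.getLogger("shadowing")


class ShadowingAdapter:
    """Returns the candidate's response; grades a sampled fraction of calls."""

    def __init__(
        self,
        candidate: Callable[[str], str],
        baseline: Callable[[str], str],
        grade: Callable[[str, str], Any],  # (baseline_text, candidate_text) -> result
        record: Callable[[Any], None],  # e.g. a ledger append
        shadow_rate: float = 0.1,
        rng: random.Random | None = None,
    ) -> None:
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be within [0, 1]")
        self.candidate = candidate
        self.baseline = baseline
        self.grade = grade
        self.record = record
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()

    def __call__(self, prompt: str) -> str:
        # Candidate errors propagate exactly as they would without the wrapper.
        response = self.candidate(prompt)

        # random() is in [0, 1), so rate 0.0 never shadows and 1.0 always does.
        if self.rng.random() < self.shadow_rate:
            try:
                baseline_response = self.baseline(prompt)
                self.record(self.grade(baseline_response, response))
            except Exception:
                # Failure isolation: log and continue; never raise to the caller.
                logger.exception("shadow grading failed; candidate response unaffected")

        return response
```

Because the shadow path runs after the candidate call and inside a broad `try`, a failing baseline, grader, or ledger write costs one observation but leaves the caller's response untouched.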

T05 — End-to-end example + integration test

ID Title Priority Status
T21 Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, ClaudeCodeAdapter as baseline), grade each, populate ledger medium done
T22 Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter high done
T23 Brief consumer-integration guide in docs/ showing how infospace-bench (or any caller) wires task-type-per-stage into the adaptive policy medium done

Risks and open questions

  • Judge bias. LLMJudge has known biases: length, position, format, and self-preference when the judge model belongs to the same family as a candidate. The contract must document these and recommend pairing with at least one non-LLM judge for calibration.
  • Baseline cost in shadow mode. ClaudeCodeAdapter is not billed per call (it shells out to a local subscription session), but every shadow call still consumes wall-clock time and rate-limit budget. Sampling is load-bearing, not optional.
  • Non-stationarity. Provider model updates, prompt changes, and template edits all silently invalidate prior observations. Plan for a prompt_fingerprint tag on observations so the ledger can be filtered to a coherent regime.
  • Scope creep. "Pick a model" is one decision; "decide whether the task is worth doing at all" is another. The latter is consumer policy. Keep this workplan firmly on the former.
  • Privacy. Observations contain prompt and response text by default. Add a redact: Callable[[str], str] hook on the ledger writer so sensitive callers can store hashes / digests instead.
  • API-vs-CLI baseline parity. A consumer that grades against ClaudeCodeAdapter (CLI) but later switches to a Claude API adapter may see quality drift that's actually transport drift. Document this.
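
The `prompt_fingerprint` tag and the `redact` hook mentioned in the risks above could be approached along these lines. This is a sketch only: the helper names `digest_redactor`, `prompt_fingerprint`, and `build_record`, the choice of SHA-256, the 16-character truncation, and the inputs to the fingerprint (template text plus model id) are all assumptions.

```python
import hashlib
from typing import Callable


def digest_redactor(text: str) -> str:
    """A redact hook that stores a digest instead of the raw text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_fingerprint(template: str, model_id: str) -> str:
    """Short, stable id for a (template, model) regime."""
    h = hashlib.sha256()
    for part in (template, model_id):
        h.update(part.encode("utf-8"))
        h.update(b"\x00")  # separator so ("ab", "c") differs from ("a", "bc")
    return h.hexdigest()[:16]


def build_record(
    prompt: str,
    response: str,
    template: str,
    model_id: str,
    redact: Callable[[str], str] = lambda s: s,  # identity: store text as-is
) -> dict:
    """Assemble a ledger-ready dict with redacted text and a fingerprint tag."""
    return {
        "prompt": redact(prompt),
        "response": redact(response),
        "tags": {"prompt_fingerprint": prompt_fingerprint(template, model_id)},
    }
```

Filtering the ledger on the fingerprint tag would restrict means to observations taken under the same template and model, so a template edit starts a fresh regime without deleting older history.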

Exit criteria

  • QualityLedger round-trips observations and exposes the documented query helpers
  • BaselineGrader produces deterministic GradingResult objects for at least one non-LLM judge and one LLM judge given canned inputs
  • AdaptiveRoutingPolicy.resolve(task_type, quality_floor=0.8) returns the cheapest adapter whose mean quality over the configured window clears the floor, with a documented cold-start fallback
  • ShadowingAdapter never alters the candidate response and respects the sampling rate within statistical tolerance
  • End-to-end example produces a ledger with at least three adapters per task type and the integration test shows convergence to the cheapest qualifying adapter
  • Functional contracts published for the new data models, the grader protocols, and the adaptive policy
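
One way to make "within statistical tolerance" concrete for the sampling criterion above: if each call is shadowed independently with probability shadow_rate, the number of shadowed calls follows a binomial distribution with mean n·p and standard deviation sqrt(n·p·(1−p)). The helper below, with the hypothetical name `shadow_count_bounds` and an assumed default of four standard deviations, turns that into an acceptance band a test could assert against.

```python
import math


def shadow_count_bounds(
    calls: int, shadow_rate: float, sigmas: float = 4.0
) -> tuple[int, int]:
    """Inclusive (low, high) band for the expected number of shadowed calls."""
    expected = calls * shadow_rate
    std = math.sqrt(calls * shadow_rate * (1.0 - shadow_rate))
    low = max(0, math.floor(expected - sigmas * std))
    high = min(calls, math.ceil(expected + sigmas * std))
    return low, high


if __name__ == "__main__":
    # 1000 calls at shadow_rate=0.1: expect about 100 ledger entries.
    print(shadow_count_bounds(1000, 0.1))  # (62, 138)
```

A wider band (more sigmas) makes the test less flaky at the price of catching only larger deviations from the configured rate.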

Consumer-side follow-up

infospace-bench will need a small companion workplan (IB-WP-NNNN) to:

  • Replace direct OpenRouterAssistedGenerationAdapter use with a task-type-tagged route through AdaptiveRoutingPolicy
  • Define its task-type taxonomy (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report)
  • Pick a baseline adapter (most likely ClaudeCodeAdapter) and a quality threshold per stage
  • Wire the shadow-mode wrapper for the first multi-chapter run so the ledger fills up while real generation proceeds

That workplan should be drafted after T01–T03 of this workplan land, so that the consumer-side wiring is anchored in a stable llm-connect API.