---
id: LLM-WP-0004
type: workplan
title: Adaptive Cost-Quality Routing
domain: custodian
status: completed
owner: llm-connect
created: 2026-05-17
repo: llm-connect
planning_priority: high
planning_order: 4
state_hub_workstream_id: e1807fab-e29e-4517-b362-95737a96582d
---

# LLM-WP-0004 — Adaptive Cost-Quality Routing

**status:** completed
**owner:** llm-connect
**repo:** llm-connect
**created:** 2026-05-17
**depends-on:** LLM-WP-0003 (RoutingPolicy primitive)

## Purpose

Provide reusable primitives that let a consumer route each task to the
cheapest model whose observed output quality clears a per-task bar, with
the local Claude Code session (`ClaudeCodeAdapter`) available as the
baseline-quality oracle. The current `RoutingPolicy` (LLM-WP-0003) is
static: rules and cost caps are hand-authored. This workplan adds the
observation, grading, and adaptive-selection layer that learns *which*
model is good enough for which task type.

Demand signal: `infospace-bench` is about to scale from one-chapter
smoke runs to multi-chapter and full-book infospace generation across
multiple workflow stages (summarise, extract entities, extract
relations, evaluate). Each stage has very different quality / cost
trade-offs, and the consumer needs llm-connect to pick the right model
per stage instead of hard-coding one model for the whole run.

## Scope guardrails (read this before adding tasks)

llm-connect ships *primitives*, not consumer policy.

In scope here:

- Data models for quality observations and grading results
- A reusable observation ledger (persistent, append-only, file-backed)
- A `BaselineGrader` that pairs a baseline adapter with a candidate
  adapter and emits a structured quality delta
- A small built-in grader catalogue (exact-match, embedding similarity,
  LLM-as-judge wrapper)
- An `AdaptiveRoutingPolicy` that extends `RoutingPolicy` by consulting
  the ledger to pick the cheapest adapter whose observed quality for a
  task type still clears a configured threshold
- A shadow-mode wrapper adapter for collecting observations in
  production without changing caller behaviour

Out of scope (belongs in consumer repos):

- Task-type taxonomy (callers name their tasks)
- Quality thresholds per task type (callers set their own bars)
- Choice of baseline (callers wire whichever adapter they trust)
- When to re-grade (callers decide; this repo just exposes ledger TTL
  and refresh helpers)
- Cost accounting for billing or budgets beyond a per-call estimate

## GAAF notes

All additions are Functional-layer per GAAF-2026. Core stays untouched.
Each new module gets a functional contract doc under
`contracts/functional/`. Maturity on release: Beta. `infospace-bench`
is the first known consumer, and the API may shift before a second
consumer (`inter-hub`, `markitect`) adopts it.

## Tasks

The fenced `task` blocks below are the State Hub registration index. Keep them
in sync with the detailed task tables that follow.

```task
id: T01
title: 'QualityObservation dataclass: task_type, adapter_id, model_id, cost_usd, quality_score (0..1), latency_ms, tokens_in, tokens_out, baseline_adapter_id, recorded_at, tags'
priority: high
status: done
state_hub_task_id: "1c285bec-c30b-45a8-a408-3f91d810a078"
```

```task
id: T02
title: 'QualityLedger append-only JSONL store with file-locked writes, configurable path, simple query helpers (by_task_type, recent, mean_quality)'
priority: high
status: done
state_hub_task_id: "5249f171-a047-499f-9ec4-cb50e1477765"
```

```task
id: T03
title: 'TTL helpers: prune_before(timestamp) and is_stale(observation, max_age)'
priority: medium
status: done
state_hub_task_id: "adb255cf-7e89-4fea-b822-6be437d99789"
```

```task
id: T04
title: 'Functional contract doc for the ledger schema and quality_score semantics'
priority: medium
status: done
state_hub_task_id: "51a33180-a99d-4aa4-96be-2fcee15bfbc3"
```

```task
id: T05
title: 'Tests: ledger round-trip, concurrent writes, query helpers, TTL, malformed-line resilience'
priority: high
status: done
state_hub_task_id: "458610c5-c903-4b42-9602-cd511999c9ba"
```

```task
id: T06
title: 'GradingResult dataclass: quality_score, notes, grader_id, baseline_response, candidate_response'
priority: high
status: done
state_hub_task_id: "c12a595b-90fc-4a80-8394-549edbda2031"
```

```task
id: T07
title: 'BaselineGrader protocol plus PairedGrader that runs baseline and candidate calls and delegates to a Judge'
priority: high
status: done
state_hub_task_id: "80b98e31-06fc-4462-b030-a12881095f93"
```

```task
id: T08
title: 'Judge protocol and built-ins: ExactMatchJudge, EmbeddingSimilarityJudge, LLMJudge'
priority: high
status: done
state_hub_task_id: "c2887fe3-bae6-4298-8c26-f9a519264dcf"
```

```task
id: T09
title: 'Functional contract doc covering judge bias caveats'
priority: medium
status: done
state_hub_task_id: "7a4fd87a-b0ba-41b0-8e1a-a60fdaded905"
```

```task
id: T10
title: 'Tests: judges with canned inputs, stable grader result, deterministic LLMJudge rubric seed'
priority: high
status: done
state_hub_task_id: "8415a11d-d508-4d17-8082-10f93e9d16c5"
```

```task
id: T11
title: 'AdaptiveRoutingPolicy extends RoutingPolicy and selects the cheapest adapter whose observed mean quality clears the floor'
priority: high
status: done
state_hub_task_id: "0e9f9f8e-5066-4257-913b-a19f5b3fc47d"
```

```task
id: T12
title: 'Tie-breaking: prefer lower observed cost, then explicit preferred adapter from static rules'
priority: medium
status: done
state_hub_task_id: "59d44712-1088-41ac-bad8-5d95db6f3a4f"
```

```task
id: T13
title: 'Cold-start behaviour falls through to static RoutingPolicy.resolve when observations are missing'
priority: high
status: done
state_hub_task_id: "1927d369-f5f6-48d3-8f53-7e4f1cae370e"
```

```task
id: T14
title: 'Functional contract doc for adaptive policy and sample-size/freshness trade-off'
priority: medium
status: done
state_hub_task_id: "4d4717c1-8849-4fed-8f8d-515901ecafe0"
```

```task
id: T15
title: 'Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain'
priority: high
status: done
state_hub_task_id: "304bd782-db15-4b7a-8d05-49e064a926c3"
```

```task
id: T16
title: 'ShadowingAdapter wraps a candidate adapter, also invokes the baseline adapter, grades, and appends to QualityLedger'
priority: medium
status: done
state_hub_task_id: "62dd507f-536a-4623-8cbd-fa9f78e85ca6"
```

```task
id: T17
title: 'Sampling: caller-configurable shadow_rate so production load is not doubled'
priority: medium
status: done
state_hub_task_id: "ccb73e92-1fca-42f9-8437-9b2b50e6424c"
```

```task
id: T18
title: 'Failure isolation: shadow errors never affect the candidate response returned to the caller'
priority: high
status: done
state_hub_task_id: "b879d232-d6ce-4ff6-b534-616729ea5ad7"
```

```task
id: T19
title: 'Functional contract doc for ShadowingAdapter'
priority: low
status: done
state_hub_task_id: "99d2c1bc-f1d8-42b3-9e04-6eea49460943"
```

```task
id: T20
title: 'Tests: candidate response survives baseline failure, ledger sampling rate, sync vs async modes'
priority: high
status: done
state_hub_task_id: "f533fbf4-484f-4408-8260-7e84e23bdc46"
```

```task
id: T21
title: 'Example script: route fixture batch through three candidate adapters and populate the ledger'
priority: medium
status: done
state_hub_task_id: "7ef0c143-74b0-4740-81fa-819a826cf8f3"
```

```task
id: T22
title: 'Integration test: cold-start, static fallback, first observations, convergence to cheapest qualifying adapter'
priority: high
status: done
state_hub_task_id: "c4c6743f-157b-4445-8576-9caa6421d463"
```

```task
id: T23
title: 'Consumer integration guide showing how infospace-bench wires task types into adaptive policy'
priority: medium
status: done
state_hub_task_id: "3a073ff7-0170-4a95-9c2a-a5daa84964e6"
```

### T01 — Quality observation data model + ledger

| ID  | Title | Priority | Status |
|-----|-------|----------|--------|
| T01 | `QualityObservation` dataclass: `task_type`, `adapter_id`, `model_id`, `cost_usd`, `quality_score` (0..1), `latency_ms`, `tokens_in`, `tokens_out`, `baseline_adapter_id`, `recorded_at`, `tags` | high | done |
| T02 | `QualityLedger` append-only JSONL store with file-locked writes, configurable path, simple query helpers (`by_task_type`, `recent`, `mean_quality`) | high | done |
| T03 | TTL helpers: `prune_before(timestamp)` and `is_stale(observation, max_age)` so callers can refresh observations without re-reading the whole ledger | medium | done |
| T04 | Functional contract doc for the ledger schema and the field semantics of `quality_score` | medium | done |
| T05 | Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience | high | done |

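The rows above can be sketched roughly as follows. This is a minimal illustration, not the shipped module: the field names come from T01, but the field types, the `append` and `read_all` method names, and the `now` parameter on `is_stale` are assumptions, and the file locking required by T02 is left out for brevity.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from statistics import mean


@dataclass(frozen=True)
class QualityObservation:
    # Field names follow T01; the types and the default are assumptions.
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float  # expected to lie in 0..1
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str
    recorded_at: str  # ISO-8601 timestamp with UTC offset
    tags: dict[str, str] = field(default_factory=dict)


class QualityLedger:
    """Append-only JSONL store (sketch; no file locking)."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def append(self, obs: QualityObservation) -> None:
        # One JSON object per line, always opened in append mode.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(obs)) + "\n")

    def read_all(self) -> list[QualityObservation]:
        if not self.path.exists():
            return []
        rows = []
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    rows.append(QualityObservation(**json.loads(line)))
                except (json.JSONDecodeError, TypeError):
                    continue  # skip malformed lines instead of failing
        return rows

    def by_task_type(self, task_type: str) -> list[QualityObservation]:
        return [o for o in self.read_all() if o.task_type == task_type]

    def mean_quality(self, task_type: str, adapter_id: str) -> float | None:
        scores = [
            o.quality_score
            for o in self.by_task_type(task_type)
            if o.adapter_id == adapter_id
        ]
        return mean(scores) if scores else None


def is_stale(obs: QualityObservation, max_age: timedelta,
             now: datetime | None = None) -> bool:
    # An observation is stale once it is older than max_age.
    now = now or datetime.now(timezone.utc)
    return now - datetime.fromisoformat(obs.recorded_at) > max_age
```

Re-reading the whole file on every query keeps the sketch short; a real ledger would likely cache or index, which is the kind of detail the T04 contract doc would pin down.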
### T02 — Baseline grader

| ID  | Title | Priority | Status |
|-----|-------|----------|--------|
| T06 | `GradingResult` dataclass: `quality_score`, `notes`, `grader_id`, `baseline_response`, `candidate_response` | high | done |
| T07 | `BaselineGrader` protocol: `.grade(baseline_adapter, candidate_adapter, prompt, run_config)` → `GradingResult`; built-in concrete `PairedGrader` runs both calls and delegates to a `Judge` | high | done |
| T08 | `Judge` protocol + three built-ins: `ExactMatchJudge`, `EmbeddingSimilarityJudge` (uses an embedding adapter), `LLMJudge` (uses a third adapter with a fixed rubric prompt) | high | done |
| T09 | Functional contract doc covering judge bias caveats (length bias, format bias, position bias for `LLMJudge`) | medium | done |
| T10 | Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for `LLMJudge` rubric | high | done |

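To make the grader/judge split concrete, here is a small sketch of how `PairedGrader` might delegate to a `Judge`. Several things are assumptions for illustration only: adapters are modelled as plain `prompt -> text` callables, the judge method is named `score` and returns a `(score, notes)` pair, `judge_id` is a hypothetical attribute, and `run_config` is accepted but unused.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Protocol

# Assumption: an adapter is simplified to a callable from prompt to text.
Adapter = Callable[[str], str]


@dataclass(frozen=True)
class GradingResult:
    # Field names follow T06.
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: str
    candidate_response: str


class Judge(Protocol):
    judge_id: str

    def score(self, baseline: str, candidate: str) -> tuple[float, str]:
        """Return (quality_score in 0..1, free-text notes)."""
        ...


class ExactMatchJudge:
    judge_id = "exact-match"

    def score(self, baseline: str, candidate: str) -> tuple[float, str]:
        # Assumption: surrounding whitespace is ignored when comparing.
        same = baseline.strip() == candidate.strip()
        return (1.0 if same else 0.0, "match" if same else "mismatch")


class PairedGrader:
    """Runs baseline and candidate on the same prompt, then asks the judge."""

    def __init__(self, judge: Judge) -> None:
        self.judge = judge

    def grade(self, baseline_adapter: Adapter, candidate_adapter: Adapter,
              prompt: str, run_config: Any = None) -> GradingResult:
        baseline = baseline_adapter(prompt)
        candidate = candidate_adapter(prompt)
        score, notes = self.judge.score(baseline, candidate)
        # Both responses are preserved on the result, as T10 requires.
        return GradingResult(score, notes, self.judge.judge_id,
                             baseline, candidate)
```

Keeping the judge behind a protocol means an embedding or LLM judge can be swapped in without touching the grader, which is presumably why T07 and T08 are separate tasks.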
### T03 — Adaptive routing policy

| ID  | Title | Priority | Status |
|-----|-------|----------|--------|
| T11 | `AdaptiveRoutingPolicy` extends `RoutingPolicy`: given `task_type` + `quality_floor` + `ledger`, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window | high | done |
| T12 | Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules | medium | done |
| T13 | Cold-start behaviour: when no observations exist for a `(task_type, adapter)` pair, fall through to the static `RoutingPolicy.resolve` result so the system stays usable on day zero | high | done |
| T14 | Functional contract doc; document the trade-off between sample size and freshness | medium | done |
| T15 | Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain | high | done |

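The selection rule in T11 to T13 can be sketched as a single function over observations for one task type. This is an illustration under stated assumptions, not the actual `AdaptiveRoutingPolicy`: the `Obs` record, the `pick_adapter` name, the `window` default, treating "clears the floor" as `>=`, and falling back to the static choice whenever no adapter qualifies are all choices made here for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Obs:
    # Hypothetical, trimmed-down observation for a single task type.
    adapter_id: str
    cost_usd: float
    quality_score: float


def pick_adapter(observations: list[Obs], quality_floor: float,
                 static_choice: str, window: int = 20,
                 preferred: str | None = None) -> str:
    """Return the cheapest adapter whose recent mean quality meets the floor.

    `observations` is assumed to be in chronological order. `static_choice`
    stands in for the result of the static policy and is returned when
    nothing qualifies, including the cold-start case with no observations.
    """
    by_adapter: dict[str, list[Obs]] = {}
    for obs in observations:
        by_adapter.setdefault(obs.adapter_id, []).append(obs)

    qualifying = []
    for adapter_id, rows in by_adapter.items():
        recent = rows[-window:]  # only the newest `window` observations
        if mean(r.quality_score for r in recent) >= quality_floor:
            mean_cost = mean(r.cost_usd for r in recent)
            # Sort key: lower cost first, then the preferred adapter,
            # then adapter id so the result is deterministic.
            qualifying.append((mean_cost, adapter_id != preferred, adapter_id))

    if not qualifying:
        return static_choice
    return min(qualifying)[2]
```

A small `window` reacts quickly to provider changes but makes the mean noisy; a large one is stable but slow to notice regressions. That is the sample-size versus freshness trade-off T14 asks the contract doc to cover.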
### T04 — Shadow-mode observation wrapper

| ID  | Title | Priority | Status |
|-----|-------|----------|--------|
| T16 | `ShadowingAdapter` wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a `QualityLedger` | medium | done |
| T17 | Sampling: caller-configurable fraction (`shadow_rate=0.1` means roughly one call in ten is graded) so production load is not doubled | medium | done |
| T18 | Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised | high | done |
| T19 | Functional contract doc | low | done |
| T20 | Tests: candidate response always returned even when baseline raises, ledger gets approximately `shadow_rate × calls` entries (within tolerance), sync vs async modes | high | done |

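A synchronous sketch of the wrapper shows how sampling and failure isolation fit together. It is only an illustration: adapters are again simplified to `prompt -> text` callables, `grade` and `record` are hypothetical callables standing in for the grader and the ledger writer, the injectable `rng` is an assumption added for testability, and the thread-pool mode from T16 is omitted.

```python
from __future__ import annotations

import logging
import random
from typing import Callable

Adapter = Callable[[str], str]  # assumption: prompt in, response text out
logger = logging.getLogger("shadowing_adapter_sketch")


class ShadowingAdapter:
    def __init__(self, candidate: Adapter, baseline: Adapter,
                 grade: Callable[[str, str], float],
                 record: Callable[[float], None],
                 shadow_rate: float = 0.1,
                 rng: random.Random | None = None) -> None:
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be within [0, 1]")
        self.candidate = candidate
        self.baseline = baseline
        self.grade = grade      # (baseline_response, candidate_response) -> score
        self.record = record    # stands in for appending to the ledger
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()

    def __call__(self, prompt: str) -> str:
        # The candidate call is never wrapped: its errors belong to the caller.
        response = self.candidate(prompt)
        # Sample a fraction of calls so baseline load stays bounded.
        if self.rng.random() < self.shadow_rate:
            try:
                baseline_response = self.baseline(prompt)
                self.record(self.grade(baseline_response, response))
            except Exception:
                # Shadow failures are logged and swallowed (T18).
                logger.exception("shadow grading failed")
        return response
```

Because the candidate response is captured before any shadow work starts and returned unchanged afterwards, nothing in the `try` block can alter what the caller sees.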
### T05 — End-to-end example + integration test

| ID  | Title | Priority | Status |
|-----|-------|----------|--------|
| T21 | Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, `ClaudeCodeAdapter` as baseline), grade each, populate ledger | medium | done |
| T22 | Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter | high | done |
| T23 | Brief consumer-integration guide in `docs/` showing how `infospace-bench` (or any caller) wires task-type-per-stage into the adaptive policy | medium | done |

## Risks and open questions

- **Judge bias.** `LLMJudge` has known biases: length, position, format,
  and self-preference when the judge model is from the same family as a
  candidate. The contract must document these and recommend pairing
  with at least one non-LLM judge for calibration.
- **Baseline cost in shadow mode.** `ClaudeCodeAdapter` is not billed
  per call (it shells out to a local subscription session), but every
  shadow call still consumes wall-clock time and rate budget. Sampling
  is load-bearing, not optional.
- **Non-stationarity.** Provider model updates, prompt changes, and
  template edits all silently invalidate prior observations. Plan for
  a `prompt_fingerprint` tag on observations so the ledger can be
  filtered to a coherent regime.
- **Scope creep.** "Pick a model" is one decision; "decide whether the
  task is worth doing at all" is another. The latter is consumer
  policy. Keep this workplan firmly on the former.
- **Privacy.** Observations contain prompt and response text by
  default. Add a `redact: Callable[[str], str]` hook on the ledger
  writer so sensitive callers can store hashes or digests instead.
- **API-vs-CLI baseline parity.** A consumer that grades against
  `ClaudeCodeAdapter` (CLI) but later switches to a Claude API adapter
  may see quality drift that is actually transport drift. Document this.

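The non-stationarity and privacy items above could look roughly like this in code. Everything beyond the `redact: Callable[[str], str]` signature and the `prompt_fingerprint` tag name is an assumption: the record layout, the helper names, the choice of SHA-256, and the 12-character fingerprint length are illustrative only.

```python
from __future__ import annotations

import hashlib
from typing import Callable


def sha256_digest(text: str) -> str:
    # One possible redact hook: store a digest instead of the raw text.
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_fingerprint(template: str) -> str:
    # Short, stable identifier for a prompt template; changes when it is edited.
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]


def to_ledger_record(prompt: str, response: str, template: str,
                     redact: Callable[[str], str] = lambda s: s) -> dict:
    # Hypothetical record layout; the default hook keeps text unchanged.
    return {
        "prompt": redact(prompt),
        "response": redact(response),
        "tags": {"prompt_fingerprint": prompt_fingerprint(template)},
    }


def same_regime(records: list[dict], fingerprint: str) -> list[dict]:
    # Keep only observations made under one prompt template.
    return [
        r for r in records
        if r.get("tags", {}).get("prompt_fingerprint") == fingerprint
    ]
```

A sensitive caller would pass `redact=sha256_digest`, and a policy that wants a coherent regime would filter with `same_regime` before computing mean quality.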
## Exit criteria

- `QualityLedger` round-trips observations and exposes the documented
  query helpers
- `BaselineGrader` produces deterministic `GradingResult` objects for at
  least one non-LLM judge and one LLM judge given canned inputs
- `AdaptiveRoutingPolicy.resolve(task_type, quality_floor=0.8)` returns
  the cheapest adapter whose mean quality over the configured window
  clears the floor, with a documented cold-start fallback
- `ShadowingAdapter` never alters the candidate response and respects
  the sampling rate within statistical tolerance
- End-to-end example produces a ledger with at least three adapters per
  task type and the integration test shows convergence to the cheapest
  qualifying adapter
- Functional contracts published for the new data models, the grader
  protocols, and the adaptive policy

## Consumer-side follow-up

`infospace-bench` will need a small companion workplan (`IB-WP-NNNN`) to:

- Replace direct `OpenRouterAssistedGenerationAdapter` use with a
  task-type-tagged route through `AdaptiveRoutingPolicy`
- Define its task-type taxonomy (`summarize-source`, `extract-entities`,
  `extract-relations`, `evaluate-entity`, `synthesize-report`)
- Pick a baseline adapter (most likely `ClaudeCodeAdapter`) and a
  quality threshold per stage
- Wire the shadow-mode wrapper for the first multi-chapter run so the
  ledger fills up while real generation proceeds

That workplan should be drafted after T01–T03 of this workplan land, so
that the consumer-side wiring is anchored in a stable llm-connect API.
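As a rough illustration of the consumer-side taxonomy and per-stage thresholds listed above, the mapping could be as simple as a dictionary owned by `infospace-bench`. The task-type names come from this workplan; the floor values and the `floor_for` helper are placeholders invented for the example, not recommendations.

```python
from __future__ import annotations

# Placeholder floors: real values are consumer policy and would be chosen
# by infospace-bench, not by llm-connect.
QUALITY_FLOORS: dict[str, float] = {
    "summarize-source": 0.80,
    "extract-entities": 0.85,
    "extract-relations": 0.85,
    "evaluate-entity": 0.90,
    "synthesize-report": 0.90,
}


def floor_for(task_type: str) -> float:
    """Look up the quality floor for a task type, failing loudly on typos."""
    try:
        return QUALITY_FLOORS[task_type]
    except KeyError:
        raise ValueError(f"unknown task type: {task_type!r}") from None
```

The consumer would then pass `floor_for(task_type)` as the `quality_floor` argument when resolving a route, keeping thresholds out of llm-connect entirely.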