plan: WP-0004 — adaptive cost-quality routing (todo)

Draft the workplan that extends the static RoutingPolicy (WP-0003) with a quality observation ledger, a BaselineGrader (ClaudeCodeAdapter as the default oracle), an AdaptiveRoutingPolicy that picks the cheapest adapter clearing a per-task quality floor, and a sampled ShadowingAdapter for production observation collection. Scope is explicit: ship primitives only. Task-type taxonomy, quality thresholds, baseline choice, and re-grading cadence stay with the consumer. infospace-bench is the named first consumer; consumer wiring deferred until T01-T03 land.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 17:17:07 +02:00
parent 66dfc7cf06
commit deade6ad76
1 changed file with 166 additions and 0 deletions
--- a/workplans/llm-connect-WP-0004-adaptive-cost-quality-routing.md
+++ b/workplans/llm-connect-WP-0004-adaptive-cost-quality-routing.md
@@ -0,0 +1,166 @@
+# LLM-WP-0004 — Adaptive Cost-Quality Routing
+
+**status:** todo
+**owner:** llm-connect
+**repo:** llm-connect
+**created:** 2026-05-17
+**depends-on:** LLM-WP-0003 (RoutingPolicy primitive)
+
+## Purpose
+
+Provide reusable primitives that let a consumer route each task to the
+cheapest model whose observed output quality clears a per-task bar, with
+the local Claude Code session (`ClaudeCodeAdapter`) available as the
+baseline-quality oracle. The current `RoutingPolicy` (LLM-WP-0003) is
+static: rules and cost caps are hand-authored. This workplan adds the
+observation, grading, and adaptive-selection layer that learns *which*
+model is good enough for which task type.
+
+Demand signal: `infospace-bench` is about to scale from one-chapter
+smoke runs to multi-chapter and full-book infospace generation across
+multiple workflow stages (summarise, extract entities, extract
+relations, evaluate). Each stage has very different quality / cost
+trade-offs, and the consumer needs llm-connect to pick the right model
+per stage instead of hard-coding one model for the whole run.
+
+## Scope guardrails (read this before adding tasks)
+
+llm-connect ships *primitives*, not consumer policy.
+
+In scope here:
+
+- Data models for quality observations and grading results
+- A reusable observation ledger (persistent, append-only, file-backed)
+- A `BaselineGrader` that pairs a baseline adapter with a candidate
+  adapter and emits a structured quality delta
+- A small built-in grader catalogue (exact-match, embedding similarity,
+  LLM-as-judge wrapper)
+- An `AdaptiveRoutingPolicy` that extends `RoutingPolicy` by consulting
+  the ledger to pick the cheapest adapter whose observed quality for a
+  task type still clears a configured threshold
+- A shadow-mode wrapper adapter for collecting observations in
+  production without changing caller behaviour
+
+Out of scope (belongs in consumer repos):
+
+- Task-type taxonomy (callers name their tasks)
+- Quality thresholds per task type (callers set their own bars)
+- Choice of baseline (callers wire whichever adapter they trust)
+- When to re-grade (callers decide; this repo just exposes ledger TTL
+  and refresh helpers)
+- Cost accounting for billing or budgets beyond a per-call estimate
+
+## GAAF notes
+
+All additions are Functional-layer per GAAF-2026. Core stays untouched.
+Each new module gets a functional contract doc under
+`contracts/functional/`. Maturity on release: Beta — `infospace-bench`
+is the first known consumer; the API may shift before any second
+consumer (`inter-hub`, `markitect`) adopts it.
+
+## Tasks
+
+### T01 — Quality observation data model + ledger
+
+| ID  | Title                                                                                                                                                                                       | Priority | Status |
+|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
+| T01 | `QualityObservation` dataclass: `task_type`, `adapter_id`, `model_id`, `cost_usd`, `quality_score` (0..1), `latency_ms`, `tokens_in`, `tokens_out`, `baseline_adapter_id`, `recorded_at`, `tags` | high     | todo   |
+| T02 | `QualityLedger` append-only JSONL store with file-locked writes, configurable path, simple query helpers (`by_task_type`, `recent`, `mean_quality`)                                          | high     | todo   |
+| T03 | TTL helpers: `prune_before(timestamp)` and `is_stale(observation, max_age)` so callers can refresh observations without re-reading the whole ledger                                          | medium   | todo   |
+| T04 | Functional contract doc for the ledger schema and the field semantics of `quality_score`                                                                                                    | medium   | todo   |
+| T05 | Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience                                                                                                          | high     | todo   |
+
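A minimal sketch of how the observation record and the append-only ledger from the table above could fit together. The field names come from the T01 row; everything else is an assumption made for illustration: `recorded_at` as epoch seconds, `tags` as a plain dict, `fcntl.flock` as the locking mechanism (POSIX only), and the exact helper signatures.

```python
import json
import tempfile
import time
from dataclasses import asdict, dataclass, field
from pathlib import Path
from statistics import fmean

try:
    import fcntl  # POSIX only; elsewhere this sketch writes without a lock
except ImportError:
    fcntl = None


@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float  # expected range 0..1
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str
    recorded_at: float = field(default_factory=time.time)  # assumed: epoch seconds
    tags: dict = field(default_factory=dict)  # assumed: free-form mapping


class QualityLedger:
    """Append-only JSONL store: one observation per line."""

    def __init__(self, path):
        self.path = Path(path)

    def append(self, obs):
        line = json.dumps(asdict(obs), sort_keys=True) + "\n"
        with open(self.path, "a", encoding="utf-8") as fh:
            if fcntl is not None:
                fcntl.flock(fh, fcntl.LOCK_EX)  # serialise concurrent writers
            try:
                fh.write(line)
                fh.flush()
            finally:
                if fcntl is not None:
                    fcntl.flock(fh, fcntl.LOCK_UN)

    def _read(self):
        if not self.path.exists():
            return
        with open(self.path, encoding="utf-8") as fh:
            for raw in fh:
                try:
                    yield QualityObservation(**json.loads(raw))
                except (ValueError, TypeError):
                    continue  # malformed or partial line: skip, keep reading

    def by_task_type(self, task_type):
        return [o for o in self._read() if o.task_type == task_type]

    def mean_quality(self, task_type, adapter_id):
        scores = [o.quality_score for o in self.by_task_type(task_type)
                  if o.adapter_id == adapter_id]
        return fmean(scores) if scores else None


def is_stale(obs, max_age_s, now=None):
    """True when the observation is older than max_age_s seconds."""
    now = time.time() if now is None else now
    return now - obs.recorded_at > max_age_s


# Usage: two observations for one (task_type, adapter) pair.
ledger = QualityLedger(Path(tempfile.mkdtemp()) / "ledger.jsonl")
for score in (0.5, 1.0):
    ledger.append(QualityObservation("summarise", "cheap", "small-model",
                                     0.001, score, 120.0, 100, 40, "baseline"))
print(ledger.mean_quality("summarise", "cheap"))  # 0.75
```

Skipping unreadable lines instead of raising is one way to meet the malformed-line resilience test in T05. A real implementation might also count or log the skipped lines.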
+### T02 — Baseline grader
+
+| ID  | Title                                                                                                                                              | Priority | Status |
+|-----|----------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
+| T06 | `GradingResult` dataclass: `quality_score`, `notes`, `grader_id`, `baseline_response`, `candidate_response`                                          | high     | todo   |
+| T07 | `BaselineGrader` protocol: `.grade(baseline_adapter, candidate_adapter, prompt, run_config)` → `GradingResult`; built-in concrete `PairedGrader` runs both calls and delegates to a `Judge` | high     | todo   |
+| T08 | `Judge` protocol + three built-ins: `ExactMatchJudge`, `EmbeddingSimilarityJudge` (uses an embedding adapter), `LLMJudge` (uses a third adapter with a fixed rubric prompt)                  | high     | todo   |
+| T09 | Functional contract doc covering judge bias caveats (length bias, format bias, position bias for `LLMJudge`)                                       | medium   | todo   |
+| T10 | Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for `LLMJudge` rubric         | high     | todo   |
+
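The grader pieces in the table above could be sketched as follows. `GradingResult`, `PairedGrader`, `Judge`, and `ExactMatchJudge` are named in the plan. The `Adapter` protocol with a `complete(prompt, run_config)` method, the `judge_id` attribute, the `score` method name, and the whitespace stripping in the exact-match judge are hypothetical choices for this sketch, not the llm-connect API.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class Adapter(Protocol):  # hypothetical stand-in for the real adapter interface
    adapter_id: str

    def complete(self, prompt: str, run_config: Any) -> str: ...


class Judge(Protocol):
    judge_id: str

    def score(self, baseline: str, candidate: str) -> float: ...


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: str
    candidate_response: str


class ExactMatchJudge:
    judge_id = "exact-match"

    def score(self, baseline, candidate):
        # Assumption: surrounding whitespace is not significant.
        return 1.0 if baseline.strip() == candidate.strip() else 0.0


class PairedGrader:
    """Runs both adapters on the same prompt and delegates scoring to a judge."""

    def __init__(self, judge):
        self.judge = judge

    def grade(self, baseline_adapter, candidate_adapter, prompt, run_config=None):
        baseline = baseline_adapter.complete(prompt, run_config)
        candidate = candidate_adapter.complete(prompt, run_config)
        return GradingResult(
            quality_score=self.judge.score(baseline, candidate),
            notes=f"baseline={baseline_adapter.adapter_id} "
                  f"candidate={candidate_adapter.adapter_id}",
            grader_id=f"paired:{self.judge.judge_id}",
            baseline_response=baseline,   # both responses are preserved verbatim
            candidate_response=candidate,
        )


class CannedAdapter:
    """Test double that always returns the same reply."""

    def __init__(self, adapter_id, reply):
        self.adapter_id = adapter_id
        self.reply = reply

    def complete(self, prompt, run_config=None):
        return self.reply


grader = PairedGrader(ExactMatchJudge())
result = grader.grade(CannedAdapter("baseline", "Paris"),
                      CannedAdapter("cheap", "Paris\n"),
                      "Capital of France?")
print(result.quality_score, result.grader_id)
```

Keeping the judge behind a small protocol means an embedding-based or LLM-based judge can be swapped in without touching `PairedGrader`.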
+### T03 — Adaptive routing policy
+
+| ID  | Title                                                                                                                                            | Priority | Status |
+|-----|--------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
+| T11 | `AdaptiveRoutingPolicy` extends `RoutingPolicy`: given `task_type` + `quality_floor` + `ledger`, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window | high     | todo   |
+| T12 | Tie-breaking: when several adapters clear the floor, prefer lower observed cost; if costs tie, prefer the explicitly preferred adapter from the underlying static rules | medium   | todo   |
+| T13 | Cold-start behaviour: when no observations exist for a `(task_type, adapter)` pair, fall through to the static `RoutingPolicy.resolve` result so the system stays usable on day zero | high     | todo   |
+| T14 | Functional contract doc; document the trade-off between sample size and freshness                                                                | medium   | todo   |
+| T15 | Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain                                                              | high     | todo   |
+
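The selection rule described in T11 to T13 can be illustrated with a small pure function. This is a sketch under stated assumptions, not the planned `AdaptiveRoutingPolicy` API: observations are plain `(task_type, adapter_id, cost_usd, quality_score)` tuples ordered oldest first, "clears the floor" means `>=`, the window is a count of most recent observations, and the static fallback is also used when observations exist but no adapter qualifies. The function name and parameters are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def pick_adapter(observations, task_type, quality_floor, static_choice,
                 preferred=(), window=20):
    """Return the cheapest adapter whose recent mean quality clears the floor.

    preferred: adapter ids in static-rule preference order, used on cost ties.
    static_choice: result of the static policy, used on cold start.
    """
    per_adapter = defaultdict(list)
    for obs_task, adapter_id, cost_usd, quality in observations:
        if obs_task == task_type:
            per_adapter[adapter_id].append((cost_usd, quality))

    qualifying = []
    for adapter_id, rows in per_adapter.items():
        recent = rows[-window:]  # only the newest `window` observations count
        mean_quality = fmean([q for _, q in recent])
        if mean_quality >= quality_floor:
            mean_cost = fmean([c for c, _ in recent])
            rank = (preferred.index(adapter_id)
                    if adapter_id in preferred else len(preferred))
            # Sort key: cost first, then static preference, then id for determinism.
            qualifying.append((mean_cost, rank, adapter_id))

    if not qualifying:
        return static_choice  # cold start, or nothing clears the floor (assumption)
    return min(qualifying)[2]


observations = [
    ("summarise", "cheap", 0.001, 0.60),
    ("summarise", "cheap", 0.001, 0.70),
    ("summarise", "mid", 0.010, 0.90),
    ("summarise", "mid", 0.010, 0.85),
    ("summarise", "baseline", 0.050, 1.00),
]
print(pick_adapter(observations, "summarise", 0.8, static_choice="baseline"))  # mid
print(pick_adapter(observations, "extract", 0.8, static_choice="baseline"))    # baseline
```

A small window reacts quickly to provider changes but is noisy, while a large window is stable but slow to notice drift. That is the sample-size versus freshness trade-off T14 asks the contract to document.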
+### T04 — Shadow-mode observation wrapper
+
+| ID  | Title                                                                                                                                                                       | Priority | Status |
+|-----|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
+| T16 | `ShadowingAdapter` wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a `QualityLedger`         | medium   | todo   |
+| T17 | Sampling: caller-configurable fraction (`shadow_rate=0.1` means grade one call in ten) so production load is not doubled                                                     | medium   | todo   |
+| T18 | Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised                                              | high     | todo   |
+| T19 | Functional contract doc                                                                                                                                                     | low      | todo   |
+| T20 | Tests: candidate response always returned even when baseline raises, ledger gets roughly `shadow_rate × calls` entries (within tolerance), sync vs thread-pool modes        | high     | todo   |
+
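A synchronous sketch of the sampling and failure-isolation behaviour from T16 to T18. `ShadowingAdapter` and `shadow_rate` come from the table. The constructor shape, the `complete(prompt, run_config)` method, the `grade(baseline_response, candidate_response)` callable, and the `record` callback are hypothetical. The sketch also reuses the candidate response already produced instead of calling the candidate a second time, which is a design assumption.

```python
import logging
import random

logger = logging.getLogger("shadowing")


class ShadowingAdapter:
    """Wraps a candidate adapter and grades a sampled fraction of calls."""

    def __init__(self, candidate, baseline, grade, record,
                 shadow_rate=0.1, rng=None):
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be between 0 and 1")
        self.candidate = candidate
        self.baseline = baseline
        self.grade = grade      # hypothetical: (baseline_text, candidate_text) -> score
        self.record = record    # hypothetical: callable that stores the score
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()

    def complete(self, prompt, run_config=None):
        # The candidate call is never wrapped: its errors reach the caller as usual.
        response = self.candidate.complete(prompt, run_config)
        if self.rng.random() < self.shadow_rate:
            try:
                baseline_response = self.baseline.complete(prompt, run_config)
                self.record(self.grade(baseline_response, response))
            except Exception:
                # Shadow failures are logged and swallowed, never raised.
                logger.exception("shadow grading failed")
        return response


class FixedAdapter:
    """Test double that returns a fixed reply and counts its calls."""

    def __init__(self, reply):
        self.reply = reply
        self.calls = 0

    def complete(self, prompt, run_config=None):
        self.calls += 1
        return self.reply


class FailingAdapter:
    """Test double that always raises."""

    def complete(self, prompt, run_config=None):
        raise RuntimeError("baseline unavailable")


def exact_match(baseline_text, candidate_text):
    return 1.0 if baseline_text == candidate_text else 0.0


scores = []
adapter = ShadowingAdapter(FixedAdapter("ok"), FailingAdapter(),
                           exact_match, scores.append, shadow_rate=1.0)
print(adapter.complete("hello"), scores)  # candidate reply survives; nothing recorded
```

Passing a seeded `random.Random` makes the sampling reproducible in tests. A thread-pool variant would submit the body of the `if` block to an executor and keep the same exception handling inside the submitted function.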
+### T05 — End-to-end example + integration test
+
+| ID  | Title                                                                                                                                                                                 | Priority | Status |
+|-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
+| T21 | Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, `ClaudeCodeAdapter` as baseline), grade each, populate ledger | medium   | todo   |
+| T22 | Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter                   | high     | todo   |
+| T23 | Brief consumer-integration guide in `docs/` showing how `infospace-bench` (or any caller) wires task-type-per-stage into the adaptive policy                                            | medium   | todo   |
+
+## Risks and open questions
+
+- **Judge bias.** `LLMJudge` has known biases — length, position, format,
+  self-preference when the judge model is the same family as a
+  candidate. The contract must document these and recommend pairing
+  with at least one non-LLM judge for calibration.
+- **Baseline cost in shadow mode.** `ClaudeCodeAdapter` is not billed per
+  call (it shells out to a local subscription session), but every
+  shadow call still consumes wall-clock time and rate-limit budget.
+  Sampling is load-bearing, not optional.
+- **Non-stationarity.** Provider model updates, prompt changes, and
+  template edits all silently invalidate prior observations. Plan for
+  a `prompt_fingerprint` tag on observations so the ledger can be
+  filtered to a coherent regime.
+- **Scope creep.** "Pick a model" is one decision; "decide whether the
+  task is worth doing at all" is another. The latter is consumer
+  policy. Keep this workplan firmly on the former.
+- **Privacy.** Observations contain prompt and response text by
+  default. Add a `redact: Callable[[str], str]` hook on the ledger
+  writer so sensitive callers can store hashes / digests instead.
+- **API-vs-CLI baseline parity.** A consumer that grades against
+  `ClaudeCodeAdapter` (CLI) but later switches to a Claude API adapter
+  may see quality drift that's actually transport drift. Document this.
+
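Two of the mitigations above, the `prompt_fingerprint` tag and the `redact` hook, could look roughly like this. The `redact: Callable[[str], str]` shape is taken from the privacy bullet. The helper names, the use of SHA-256, the fingerprint inputs (model id plus prompt template), and the `text_fields` parameter are all assumptions for illustration.

```python
import hashlib
from typing import Callable


def digest_redactor(text: str) -> str:
    """Replace sensitive text with a stable digest so equal inputs stay comparable."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_fingerprint(template: str, model_id: str) -> str:
    """Short identifier that changes whenever the template or model changes."""
    payload = f"{model_id}\n{template}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def prepare_record(record: dict,
                   redact: Callable[[str], str] = lambda s: s,
                   text_fields=("prompt", "response")) -> dict:
    """Apply the redact hook to free-text fields before the record is written."""
    return {
        key: redact(value) if key in text_fields and isinstance(value, str) else value
        for key, value in record.items()
    }


record = {
    "task_type": "extract-entities",
    "prompt": "List the people named in chapter 3.",
    "response": "Alice, Bob",
    "tags": {"prompt_fingerprint": prompt_fingerprint("List the people named in {chapter}.",
                                                       "small-model")},
}
safe = prepare_record(record, redact=digest_redactor)
print(safe["prompt"][:7], safe["task_type"])  # sha256: extract-entities
```

Filtering the ledger on the fingerprint tag keeps observations from different prompt or model regimes apart, so a template edit does not silently mix old and new quality scores.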
+## Exit criteria
+
+- `QualityLedger` round-trips observations and exposes the documented
+  query helpers
+- `BaselineGrader` produces deterministic `GradingResult` objects for at
+  least one non-LLM judge and one LLM judge given canned inputs
+- `AdaptiveRoutingPolicy.resolve(task_type, quality_floor=0.8)` returns
+  the cheapest adapter whose mean quality over the configured window
+  clears the floor, with a documented cold-start fallback
+- `ShadowingAdapter` never alters the candidate response and respects
+  the sampling rate within statistical tolerance
+- End-to-end example produces a ledger with observations for at least
+  three adapters per task type, and the integration test shows
+  convergence to the cheapest qualifying adapter
+- Functional contracts published for the new data models, the grader
+  protocols, and the adaptive policy
+
+## Consumer-side follow-up
+
+`infospace-bench` will need a small companion workplan (`IB-WP-NNNN`) to:
+
+- Replace direct `OpenRouterAssistedGenerationAdapter` use with a
+  task-type-tagged route through `AdaptiveRoutingPolicy`
+- Define its task-type taxonomy (`summarize-source`, `extract-entities`,
+  `extract-relations`, `evaluate-entity`, `synthesize-report`)
+- Pick a baseline adapter (most likely `ClaudeCodeAdapter`) and a
+  quality threshold per stage
+- Wire the shadow-mode wrapper for the first multi-chapter run so the
+  ledger fills up while real generation proceeds
+
+That workplan should be drafted after task groups T01–T03 of this
+workplan (ledger, grader, adaptive policy) land, so that the
+consumer-side wiring is anchored in a stable llm-connect API.