coulomb/llm-connect

generated from coulomb/repo-seed

tegwick deade6ad76

plan: WP-0004 — adaptive cost-quality routing (todo)

Draft the workplan that extends the static RoutingPolicy (WP-0003) with
a quality observation ledger, a BaselineGrader (ClaudeCodeAdapter as the
default oracle), an AdaptiveRoutingPolicy that picks the cheapest
adapter clearing a per-task quality floor, and a sampled
ShadowingAdapter for production observation collection.

Scope is explicit: ship primitives only. Task-type taxonomy, quality
thresholds, baseline choice, and re-grading cadence stay with the
consumer. infospace-bench is the named first consumer; consumer wiring
deferred until T01-T03 land.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-17 17:17:07 +02:00

LLM-WP-0004 — Adaptive Cost-Quality Routing

status: todo
owner: llm-connect
repo: llm-connect
created: 2026-05-17
depends-on: LLM-WP-0003 (RoutingPolicy primitive)

Purpose

Provide reusable primitives that let a consumer route each task to the cheapest model whose observed output quality clears a per-task bar, with the local Claude Code session (ClaudeCodeAdapter) available as the baseline-quality oracle. The current RoutingPolicy (LLM-WP-0003) is static: rules and cost caps are hand-authored. This workplan adds the observation, grading, and adaptive-selection layer that learns which model is good enough for which task type.

Demand signal: infospace-bench is about to scale from one-chapter smoke runs to multi-chapter and full-book infospace generation across multiple workflow stages (summarise, extract entities, extract relations, evaluate). Each stage has very different quality / cost trade-offs, and the consumer needs llm-connect to pick the right model per stage instead of hard-coding one model for the whole run.

Scope guardrails (read this before adding tasks)

llm-connect ships primitives, not consumer policy.

In scope here:

Data models for quality observations and grading results
A reusable observation ledger (persistent, append-only, file-backed)
A BaselineGrader that pairs a baseline adapter with a candidate adapter and emits a structured quality delta
A small built-in grader catalogue (exact-match, embedding similarity, LLM-as-judge wrapper)
An AdaptiveRoutingPolicy that extends RoutingPolicy by consulting the ledger to pick the cheapest adapter whose observed quality for a task type still clears a configured threshold
A shadow-mode wrapper adapter for collecting observations in production without changing caller behaviour

Out of scope (belongs in consumer repos):

Task-type taxonomy (callers name their tasks)
Quality thresholds per task type (callers set their own bars)
Choice of baseline (callers wire whichever adapter they trust)
When to re-grade (callers decide; this repo just exposes ledger TTL and refresh helpers)
Cost accounting for billing or budgets beyond a per-call estimate

GAAF notes

All additions are Functional-layer per GAAF-2026. Core stays untouched. Each new module gets a functional contract doc under contracts/functional/. Maturity on release: Beta — infospace-bench is the first known consumer; the API may shift before any second consumer (inter-hub, markitect) adopts it.

Tasks

T01 — Quality observation data model + ledger

ID	Title	Priority	Status
T01	`QualityObservation` dataclass: `task_type`, `adapter_id`, `model_id`, `cost_usd`, `quality_score` (0..1), `latency_ms`, `tokens_in`, `tokens_out`, `baseline_adapter_id`, `recorded_at`, `tags`	high	todo
T02	`QualityLedger` append-only JSONL store with file-locked writes, configurable path, simple query helpers (`by_task_type`, `recent`, `mean_quality`)	high	todo
T03	TTL helpers: `prune_before(timestamp)` and `is_stale(observation, max_age)` so callers can refresh observations without re-reading the whole ledger	medium	todo
T04	Functional contract doc for the ledger schema and the field semantics of `quality_score`	medium	todo
T05	Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience	high	todo
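The data model and ledger described in this table could look roughly like the following. This is a minimal sketch, not the planned implementation: the field types, the ISO-8601 string encoding of `recorded_at`, the shape of `tags`, and the `read_all` helper are assumptions, and the file locking that T02 calls for is left out to keep the example short.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float  # expected range 0..1
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str
    recorded_at: str  # assumption: ISO-8601 string carrying a UTC offset
    tags: dict = field(default_factory=dict)  # assumption: free-form string map


class QualityLedger:
    """Append-only JSONL store. File locking is omitted in this sketch."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def append(self, observation: QualityObservation) -> None:
        # One JSON object per line; existing lines are never rewritten.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(asdict(observation)) + "\n")

    def read_all(self) -> list:
        if not self.path.exists():
            return []
        rows = []
        with self.path.open(encoding="utf-8") as fh:
            for line in fh:
                try:
                    rows.append(QualityObservation(**json.loads(line)))
                except (json.JSONDecodeError, TypeError):
                    continue  # malformed or schema-mismatched line: skip it
        return rows

    def by_task_type(self, task_type: str) -> list:
        return [o for o in self.read_all() if o.task_type == task_type]

    def mean_quality(self, task_type: str, adapter_id: str) -> Optional[float]:
        scores = [o.quality_score for o in self.by_task_type(task_type)
                  if o.adapter_id == adapter_id]
        return mean(scores) if scores else None


def is_stale(observation: QualityObservation, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    # True when the observation is older than max_age relative to `now`.
    now = now or datetime.now(timezone.utc)
    return now - datetime.fromisoformat(observation.recorded_at) > max_age
```

Skipping unparseable lines rather than raising is one way to get the malformed-line resilience listed under T05; a real implementation might also count or log the skipped lines.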

T02 — Baseline grader

ID	Title	Priority	Status
T06	`GradingResult` dataclass: `quality_score`, `notes`, `grader_id`, `baseline_response`, `candidate_response`	high	todo
T07	`BaselineGrader` protocol: `.grade(baseline_adapter, candidate_adapter, prompt, run_config)` → `GradingResult`; built-in concrete `PairedGrader` runs both calls and delegates to a `Judge`	high	todo
T08	`Judge` protocol + three built-ins: `ExactMatchJudge`, `EmbeddingSimilarityJudge` (uses an embedding adapter), `LLMJudge` (uses a third adapter with a fixed rubric prompt)	high	todo
T09	Functional contract doc covering judge bias caveats (length bias, format bias, position bias for `LLMJudge`)	medium	todo
T10	Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for `LLMJudge` rubric	high	todo
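The split between grader and judge in this table can be sketched as below. Several things here are assumptions made for illustration: adapters are modelled as plain `prompt -> text` callables, the `Judge.score` method name and its `(score, notes)` return shape are hypothetical, and `run_config` is accepted but unused. Only `ExactMatchJudge` is shown because it needs no external service.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Protocol

# Assumption: an adapter is any callable that maps a prompt to response text.
Adapter = Callable[[str], str]


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: str
    candidate_response: str


class Judge(Protocol):
    judge_id: str

    def score(self, baseline: str, candidate: str) -> tuple:
        """Return (quality_score in 0..1, human-readable notes)."""
        ...


class ExactMatchJudge:
    judge_id = "exact-match"

    def score(self, baseline: str, candidate: str) -> tuple:
        # Whitespace at the edges is ignored; everything else must match.
        match = baseline.strip() == candidate.strip()
        return (1.0 if match else 0.0,
                "exact match" if match else "responses differ")


class PairedGrader:
    """Runs baseline and candidate on the same prompt, then asks a judge."""

    def __init__(self, judge: Judge) -> None:
        self.judge = judge

    def grade(self, baseline_adapter: Adapter, candidate_adapter: Adapter,
              prompt: str, run_config: Optional[Any] = None) -> GradingResult:
        baseline_response = baseline_adapter(prompt)
        candidate_response = candidate_adapter(prompt)
        quality_score, notes = self.judge.score(baseline_response,
                                                candidate_response)
        # Both raw responses are preserved so results can be audited later.
        return GradingResult(quality_score, notes, self.judge.judge_id,
                             baseline_response, candidate_response)
```

Keeping the judge behind a protocol means the embedding and LLM judges can be swapped in without touching `PairedGrader`.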

T03 — Adaptive routing policy

ID	Title	Priority	Status
T11	`AdaptiveRoutingPolicy` extends `RoutingPolicy`: given `task_type` + `quality_floor` + `ledger`, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window	high	todo
T12	Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules	medium	todo
T13	Cold-start behaviour: when no observations exist for a `(task_type, adapter)` pair, fall through to the static `RoutingPolicy.resolve` result so the system stays usable on day zero	high	todo
T14	Functional contract doc; document the trade-off between sample size and freshness	medium	todo
T15	Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain	high	todo
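The selection rule in T11 to T13 can be illustrated with a standalone function. This is a sketch under stated assumptions: `Obs` is a trimmed-down stand-in for `QualityObservation`, `static_resolve` stands in for `RoutingPolicy.resolve`, the window is "the last N observations per adapter", and falling back to the static result when no adapter clears the floor is one possible reading of the cold-start rule rather than something the plan specifies.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Iterable, NamedTuple


class Obs(NamedTuple):
    """Hypothetical, reduced observation record for this example."""
    task_type: str
    adapter_id: str
    cost_usd: float
    quality_score: float


def resolve_adaptive(observations: Iterable[Obs], task_type: str,
                     quality_floor: float,
                     static_resolve: Callable[[str], str],
                     window: int = 50) -> str:
    static_choice = static_resolve(task_type)

    # Group this task type's observations by adapter, oldest first.
    by_adapter = defaultdict(list)
    for obs in observations:
        if obs.task_type == task_type:
            by_adapter[obs.adapter_id].append(obs)

    qualifying = []
    for adapter_id, rows in by_adapter.items():
        recent = rows[-window:]
        if mean(r.quality_score for r in recent) >= quality_floor:
            cost = mean(r.cost_usd for r in recent)
            # Sort key: observed cost, then static preference, then name.
            qualifying.append((cost, adapter_id != static_choice, adapter_id))

    if not qualifying:
        # Cold start (or nothing clears the floor): use the static rules.
        return static_choice
    return min(qualifying)[2]
```

A short usage example with made-up numbers:

```python
history = [
    Obs("summarise", "cheap", 0.001, 0.90),
    Obs("summarise", "mid", 0.010, 0.95),
    Obs("extract", "cheap", 0.001, 0.50),
    Obs("extract", "mid", 0.010, 0.90),
]
print(resolve_adaptive(history, "summarise", 0.8, lambda t: "mid"))
print(resolve_adaptive(history, "extract", 0.8, lambda t: "mid"))
```

The first call selects `cheap` because it clears the floor at lower cost; the second selects `mid` because `cheap` falls below the floor for that task type.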

T04 — Shadow-mode observation wrapper

ID	Title	Priority	Status
T16	`ShadowingAdapter` wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a `QualityLedger`	medium	todo
T17	Sampling: caller-configurable fraction (`shadow_rate=0.1` means grading roughly one call in ten) so production load is not doubled	medium	todo
T18	Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised	high	todo
T19	Functional contract doc	low	todo
T20	Tests: candidate response always returned even when baseline raises, ledger gets approximately `shadow_rate × calls` entries (within statistical tolerance), sync vs async modes	high	todo
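The wrapper behaviour in T16 to T18 might be sketched as follows. The constructor parameters are hypothetical: adapters are plain `prompt -> text` callables, `grade` takes the two responses and returns whatever should be stored, and `record` stands in for a ledger append. Only the synchronous path is shown; the thread-pool variant is not.

```python
import logging
import random
from typing import Any, Callable, Optional

logger = logging.getLogger("llm_connect.shadow")  # hypothetical logger name

# Assumption: an adapter is any callable that maps a prompt to response text.
Adapter = Callable[[str], str]


class ShadowingAdapter:
    def __init__(self, candidate: Adapter, baseline: Adapter,
                 grade: Callable[[str, str], Any],
                 record: Callable[[Any], None],
                 shadow_rate: float = 0.1,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be between 0 and 1")
        self.candidate = candidate
        self.baseline = baseline
        self.grade = grade
        self.record = record
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()

    def __call__(self, prompt: str) -> str:
        # The candidate call is never wrapped: its errors reach the caller.
        response = self.candidate(prompt)

        # Independent sampling: each call is shadowed with probability
        # shadow_rate, so the long-run fraction approaches shadow_rate.
        if self.rng.random() < self.shadow_rate:
            try:
                baseline_response = self.baseline(prompt)
                self.record(self.grade(baseline_response, response))
            except Exception:
                # Failure isolation: log and carry on.
                logger.exception("shadow grading failed")
        return response
```

Injecting `rng` lets tests pass a seeded `random.Random` so sampling is reproducible.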

T05 — End-to-end example + integration test

ID	Title	Priority	Status
T21	Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, `ClaudeCodeAdapter` as baseline), grade each, populate ledger	medium	todo
T22	Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter	high	todo
T23	Brief consumer-integration guide in `docs/` showing how `infospace-bench` (or any caller) wires task-type-per-stage into the adaptive policy	medium	todo

Risks and open questions

Judge bias. LLMJudge has known biases — length, position, format, self-preference when the judge model is the same family as a candidate. The contract must document these and recommend pairing with at least one non-LLM judge for calibration.
Baseline cost in shadow mode. ClaudeCodeAdapter is not billed per call (it shells out to a local subscription session), but every shadow call still consumes wall-clock time and rate-limit budget. Sampling is therefore load-bearing, not optional.
Non-stationarity. Provider model updates, prompt changes, and template edits all silently invalidate prior observations. Plan for a prompt_fingerprint tag on observations so the ledger can be filtered to a coherent regime.
Scope creep. "Pick a model" is one decision; "decide whether the task is worth doing at all" is another. The latter is consumer policy. Keep this workplan firmly on the former.
Privacy. Observations contain prompt and response text by default. Add a redact: Callable[[str], str] hook on the ledger writer so sensitive callers can store hashes / digests instead.
API-vs-CLI baseline parity. A consumer that grades against ClaudeCodeAdapter (CLI) but later switches to a Claude API adapter may see quality drift that's actually transport drift. Document this.
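Two of the mitigations above, the `redact` hook and the `prompt_fingerprint` tag, can both be built on a standard hash. The function names and the choice of inputs to the fingerprint (template text plus model identifier) are assumptions for illustration; the plan only names the hook signature and the tag.

```python
import hashlib


def digest_redactor(text: str) -> str:
    """A redact hook (str -> str) that stores a digest instead of the text."""
    return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()


def prompt_fingerprint(template: str, model_id: str) -> str:
    """Short, stable identifier for one (template, model) regime."""
    h = hashlib.sha256()
    h.update(template.encode("utf-8"))
    h.update(b"\x00")  # separator so ("ab", "c") and ("a", "bc") differ
    h.update(model_id.encode("utf-8"))
    return h.hexdigest()[:16]
```

Filtering the ledger on the fingerprint keeps observations from before and after a template edit from being averaged together. One caveat on the redactor: an unkeyed hash of short or guessable text can be recovered by enumeration, so callers with stricter needs may prefer a keyed digest such as the standard library's `hmac`.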

Exit criteria

QualityLedger round-trips observations and exposes the documented query helpers
BaselineGrader produces deterministic GradingResult objects for at least one non-LLM judge and one LLM judge given canned inputs
AdaptiveRoutingPolicy.resolve(task_type, quality_floor=0.8) returns the cheapest adapter whose mean quality over the configured window clears the floor, with a documented cold-start fallback
ShadowingAdapter never alters the candidate response and respects the sampling rate within statistical tolerance
End-to-end example produces a ledger with observations from at least three adapters per task type, and the integration test shows convergence to the cheapest qualifying adapter
Functional contracts published for the new data models, the grader protocols, and the adaptive policy
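The "within statistical tolerance" criterion for ShadowingAdapter can be made concrete. Assuming each call is sampled independently with probability `shadow_rate`, the number of ledger entries after `calls` invocations is binomial with mean `calls × shadow_rate` and standard deviation `sqrt(calls × shadow_rate × (1 − shadow_rate))`. The helper below is a hypothetical test utility, and the four-sigma default is an illustrative choice rather than a value taken from this plan.

```python
import math


def shadow_count_bounds(calls: int, shadow_rate: float,
                        sigmas: float = 4.0) -> tuple:
    """Inclusive (low, high) range for the expected number of shadow entries."""
    expected = calls * shadow_rate
    spread = sigmas * math.sqrt(calls * shadow_rate * (1.0 - shadow_rate))
    low = max(0, math.floor(expected - spread))
    high = min(calls, math.ceil(expected + spread))
    return low, high
```

For 1000 calls at `shadow_rate=0.1` the mean is 100 and the standard deviation is about 9.5, so a four-sigma band spans roughly 62 to 138 entries. A test that seeds the sampler makes the count fully reproducible and avoids relying on the band at all.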

Consumer-side follow-up

infospace-bench will need a small companion workplan (IB-WP-NNNN) to:

Replace direct OpenRouterAssistedGenerationAdapter use with a task-type-tagged route through AdaptiveRoutingPolicy
Define its task-type taxonomy (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report)
Pick a baseline adapter (most likely ClaudeCodeAdapter) and a quality threshold per stage
Wire the shadow-mode wrapper for the first multi-chapter run so the ledger fills up while real generation proceeds

That workplan should be drafted after task groups T01–T03 of this workplan (ledger, grader, adaptive policy) land, so that the consumer-side wiring is anchored in a stable llm-connect API.
