coulomb/llm-connect

Fork 0

generated from coulomb/repo-seed

Files

tegwick c4ad4bb9f2

CI / test (3.10) (push) Has been cancelled

Details

CI / test (3.11) (push) Has been cancelled

Details

CI / test (3.12) (push) Has been cancelled

Details

Add adaptive cost-quality routing primitives

2026-05-17 21:32:27 +02:00

17 KiB

Raw Blame History

id, type, title, domain, status, owner, created, repo, planning_priority, planning_order, state_hub_workstream_id

id	type	title	domain	status	owner	created	repo	planning_priority	planning_order	state_hub_workstream_id
LLM-WP-0004	workplan	Adaptive Cost-Quality Routing	custodian	completed	llm-connect	2026-05-17	llm-connect	high	4	e1807fab-e29e-4517-b362-95737a96582d

LLM-WP-0004 — Adaptive Cost-Quality Routing

status: completed owner: llm-connect repo: llm-connect created: 2026-05-17 depends-on: LLM-WP-0003 (RoutingPolicy primitive)

Purpose

Provide reusable primitives that let a consumer route each task to the cheapest model whose observed output quality clears a per-task bar, with the local Claude Code session (ClaudeCodeAdapter) available as the baseline-quality oracle. The current RoutingPolicy (LLM-WP-0003) is static: rules and cost caps are hand-authored. This workplan adds the observation, grading, and adaptive-selection layer that learns which model is good enough for which task type.

Demand signal: infospace-bench is about to scale from one-chapter smoke runs to multi-chapter and full-book infospace generation across multiple workflow stages (summarise, extract entities, extract relations, evaluate). Each stage has very different quality / cost trade-offs, and the consumer needs llm-connect to pick the right model per stage instead of hard-coding one model for the whole run.

Scope guardrails (read this before adding tasks)

llm-connect ships primitives, not consumer policy.

In scope here:

Data models for quality observations and grading results
A reusable observation ledger (persistent, append-only, file-backed)
A BaselineGrader that pairs a baseline adapter with a candidate adapter and emits a structured quality delta
A small built-in grader catalogue (exact-match, embedding similarity, LLM-as-judge wrapper)
An AdaptiveRoutingPolicy that extends RoutingPolicy by consulting the ledger to pick the cheapest adapter whose observed quality for a task type still clears a configured threshold
A shadow-mode wrapper adapter for collecting observations in production without changing caller behaviour

Out of scope (belongs in consumer repos):

Task-type taxonomy (callers name their tasks)
Quality thresholds per task type (callers set their own bars)
Choice of baseline (callers wire whichever adapter they trust)
When to re-grade (callers decide; this repo just exposes ledger TTL and refresh helpers)
Cost accounting for billing or budgets beyond a per-call estimate

GAAF notes

All additions are Functional-layer per GAAF-2026. Core stays untouched. Each new module gets a functional contract doc under contracts/functional/. Maturity on release: Beta — infospace-bench is the first known consumer; the API may shift before any second consumer (inter-hub, markitect) adopts it.

Tasks

The fenced task blocks below are the State Hub registration index. Keep them in sync with the detailed task tables that follow.

id: T01
title: 'QualityObservation dataclass: task_type, adapter_id, model_id, cost_usd, quality_score (0..1), latency_ms, tokens_in, tokens_out, baseline_adapter_id, recorded_at, tags'
priority: high
status: done
state_hub_task_id: "1c285bec-c30b-45a8-a408-3f91d810a078"

id: T02
title: 'QualityLedger append-only JSONL store with file-locked writes, configurable path, simple query helpers (by_task_type, recent, mean_quality)'
priority: high
status: done
state_hub_task_id: "5249f171-a047-499f-9ec4-cb50e1477765"

id: T03
title: 'TTL helpers: prune_before(timestamp) and is_stale(observation, max_age)'
priority: medium
status: done
state_hub_task_id: "adb255cf-7e89-4fea-b822-6be437d99789"

id: T04
title: 'Functional contract doc for the ledger schema and quality_score semantics'
priority: medium
status: done
state_hub_task_id: "51a33180-a99d-4aa4-96be-2fcee15bfbc3"

id: T05
title: 'Tests: ledger round-trip, concurrent writes, query helpers, TTL, malformed-line resilience'
priority: high
status: done
state_hub_task_id: "458610c5-c903-4b42-9602-cd511999c9ba"

id: T06
title: 'GradingResult dataclass: quality_score, notes, grader_id, baseline_response, candidate_response'
priority: high
status: done
state_hub_task_id: "c12a595b-90fc-4a80-8394-549edbda2031"

id: T07
title: 'BaselineGrader protocol plus PairedGrader that runs baseline and candidate calls and delegates to a Judge'
priority: high
status: done
state_hub_task_id: "80b98e31-06fc-4462-b030-a12881095f93"

id: T08
title: 'Judge protocol and built-ins: ExactMatchJudge, EmbeddingSimilarityJudge, LLMJudge'
priority: high
status: done
state_hub_task_id: "c2887fe3-bae6-4298-8c26-f9a519264dcf"

id: T09
title: 'Functional contract doc covering judge bias caveats'
priority: medium
status: done
state_hub_task_id: "7a4fd87a-b0ba-41b0-8e1a-a60fdaded905"

id: T10
title: 'Tests: judges with canned inputs, stable grader result, deterministic LLMJudge rubric seed'
priority: high
status: done
state_hub_task_id: "8415a11d-d508-4d17-8082-10f93e9d16c5"

id: T11
title: 'AdaptiveRoutingPolicy extends RoutingPolicy and selects the cheapest adapter whose observed mean quality clears the floor'
priority: high
status: done
state_hub_task_id: "0e9f9f8e-5066-4257-913b-a19f5b3fc47d"

id: T12
title: 'Tie-breaking: prefer lower observed cost, then explicit preferred adapter from static rules'
priority: medium
status: done
state_hub_task_id: "59d44712-1088-41ac-bad8-5d95db6f3a4f"

id: T13
title: 'Cold-start behaviour falls through to static RoutingPolicy.resolve when observations are missing'
priority: high
status: done
state_hub_task_id: "1927d369-f5f6-48d3-8f53-7e4f1cae370e"

id: T14
title: 'Functional contract doc for adaptive policy and sample-size/freshness trade-off'
priority: medium
status: done
state_hub_task_id: "4d4717c1-8849-4fed-8f8d-515901ecafe0"

id: T15
title: 'Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain'
priority: high
status: done
state_hub_task_id: "304bd782-db15-4b7a-8d05-49e064a926c3"

id: T16
title: 'ShadowingAdapter wraps a candidate adapter, also invokes the baseline adapter, grades, and appends to QualityLedger'
priority: medium
status: done
state_hub_task_id: "62dd507f-536a-4623-8cbd-fa9f78e85ca6"

id: T17
title: 'Sampling: caller-configurable shadow_rate so production load is not doubled'
priority: medium
status: done
state_hub_task_id: "ccb73e92-1fca-42f9-8437-9b2b50e6424c"

id: T18
title: 'Failure isolation: shadow errors never affect the candidate response returned to the caller'
priority: high
status: done
state_hub_task_id: "b879d232-d6ce-4ff6-b534-616729ea5ad7"

id: T19
title: 'Functional contract doc for ShadowingAdapter'
priority: low
status: done
state_hub_task_id: "99d2c1bc-f1d8-42b3-9e04-6eea49460943"

id: T20
title: 'Tests: candidate response survives baseline failure, ledger sampling rate, sync vs async modes'
priority: high
status: done
state_hub_task_id: "f533fbf4-484f-4408-8260-7e84e23bdc46"

id: T21
title: 'Example script: route fixture batch through three candidate adapters and populate the ledger'
priority: medium
status: done
state_hub_task_id: "7ef0c143-74b0-4740-81fa-819a826cf8f3"

id: T22
title: 'Integration test: cold-start, static fallback, first observations, convergence to cheapest qualifying adapter'
priority: high
status: done
state_hub_task_id: "c4c6743f-157b-4445-8576-9caa6421d463"

id: T23
title: 'Consumer integration guide showing how infospace-bench wires task types into adaptive policy'
priority: medium
status: done
state_hub_task_id: "3a073ff7-0170-4a95-9c2a-a5daa84964e6"

T01 — Quality observation data model + ledger

ID	Title	Priority	Status
T01	`QualityObservation` dataclass: `task_type`, `adapter_id`, `model_id`, `cost_usd`, `quality_score` (0..1), `latency_ms`, `tokens_in`, `tokens_out`, `baseline_adapter_id`, `recorded_at`, `tags`	high	done
T02	`QualityLedger` append-only JSONL store with file-locked writes, configurable path, simple query helpers (`by_task_type`, `recent`, `mean_quality`)	high	done
T03	TTL helpers: `prune_before(timestamp)` and `is_stale(observation, max_age)` so callers can refresh observations without re-reading the whole ledger	medium	done
T04	Functional contract doc for the ledger schema and the field semantics of `quality_score`	medium	done
T05	Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience	high	done

T02 — Baseline grader

ID	Title	Priority	Status
T06	`GradingResult` dataclass: `quality_score`, `notes`, `grader_id`, `baseline_response`, `candidate_response`	high	done
T07	`BaselineGrader` protocol: `.grade(baseline_adapter, candidate_adapter, prompt, run_config)` → `GradingResult`; built-in concrete `PairedGrader` runs both calls and delegates to a `Judge`	high	done
T08	`Judge` protocol + three built-ins: `ExactMatchJudge`, `EmbeddingSimilarityJudge` (uses an embedding adapter), `LLMJudge` (uses a third adapter with a fixed rubric prompt)	high	done
T09	Functional contract doc covering judge bias caveats (length bias, format bias, position bias for `LLMJudge`)	medium	done
T10	Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for `LLMJudge` rubric	high	done

T03 — Adaptive routing policy

ID	Title	Priority	Status
T11	`AdaptiveRoutingPolicy` extends `RoutingPolicy`: given `task_type` + `quality_floor` + `ledger`, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window	high	done
T12	Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules	medium	done
T13	Cold-start behaviour: when no observations exist for a `(task_type, adapter)` pair, fall through to the static `RoutingPolicy.resolve` result so the system stays usable on day zero	high	done
T14	Functional contract doc; document the trade-off between sample size and freshness	medium	done
T15	Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain	high	done

T04 — Shadow-mode observation wrapper

ID	Title	Priority	Status
T16	`ShadowingAdapter` wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a `QualityLedger`	medium	done
T17	Sampling: caller-configurable fraction (`shadow_rate=0.1` means grade one call in ten) so production load is not doubled	medium	done
T18	Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised	high	done
T19	Functional contract doc	low	done
T20	Tests: candidate response always returned even when baseline raises, ledger gets exactly `shadow_rate × calls` entries (within tolerance), sync vs async modes	high	done

T05 — End-to-end example + integration test

ID	Title	Priority	Status
T21	Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, `ClaudeCodeAdapter` as baseline), grade each, populate ledger	medium	done
T22	Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter	high	done
T23	Brief consumer-integration guide in `docs/` showing how `infospace-bench` (or any caller) wires task-type-per-stage into the adaptive policy	medium	done

Risks and open questions

Judge bias. LLMJudge has known biases — length, position, format, self-preference when the judge model is the same family as a candidate. The contract must document these and recommend pairing with at least one non-LLM judge for calibration.
Baseline cost in shadow mode. ClaudeCodeAdapter is not per-call billed (it shells out to a local subscription session), but every shadow call still consumes wall-clock and rate-budget. Sampling is load-bearing, not optional.
Non-stationarity. Provider model updates, prompt changes, and template edits all silently invalidate prior observations. Plan for a prompt_fingerprint tag on observations so the ledger can be filtered to a coherent regime.
Scope creep. "Pick a model" is one decision; "decide whether the task is worth doing at all" is another. The latter is consumer policy. Keep this workplan firmly on the former.
Privacy. Observations contain prompt and response text by default. Add a redact: Callable[[str], str] hook on the ledger writer so sensitive callers can store hashes / digests instead.
API-vs-CLI baseline parity. A consumer that grades against ClaudeCodeAdapter (CLI) but later switches to a Claude API adapter may see quality drift that's actually transport drift. Document this.

Exit criteria

QualityLedger round-trips observations and exposes the documented query helpers
BaselineGrader produces deterministic GradingResult objects for at least one non-LLM judge and one LLM judge given canned inputs
AdaptiveRoutingPolicy.resolve(task_type, quality_floor=0.8) returns the cheapest adapter whose mean quality over the configured window clears the floor, with a documented cold-start fallback
ShadowingAdapter never alters the candidate response and respects the sampling rate within statistical tolerance
End-to-end example produces a ledger with at least three adapters per task type and the integration test shows convergence to the cheapest qualifying adapter
Functional contracts published for the new data models, the grader protocols, and the adaptive policy

Consumer-side follow-up

infospace-bench will need a small companion workplan (IB-WP-NNNN) to:

Replace direct OpenRouterAssistedGenerationAdapter use with a task-type-tagged route through AdaptiveRoutingPolicy
Define its task-type taxonomy (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report)
Pick a baseline adapter (most likely ClaudeCodeAdapter) and a quality threshold per stage
Wire the shadow-mode wrapper for the first multi-chapter run so the ledger fills up while real generation proceeds

That workplan should be drafted after T01–T03 of this workplan land, so that the consumer-side wiring is anchored in a stable llm-connect API.

17 KiB Raw Blame History Unescape Escape

LLM-WP-0004 — Adaptive Cost-Quality Routing

Purpose

Scope guardrails (read this before adding tasks)

GAAF notes

Tasks

T01 — Quality observation data model + ledger

T02 — Baseline grader

T03 — Adaptive routing policy

T04 — Shadow-mode observation wrapper

T05 — End-to-end example + integration test

Risks and open questions

Exit criteria

Consumer-side follow-up

17 KiB

Raw Blame History