Add adaptive cost-quality routing primitives

2026-05-17 21:32:27 +02:00
parent bf86a03c5d
commit c4ad4bb9f2
17 changed files with 2480 additions and 25 deletions
--- a/workplans/llm-connect-WP-0004-adaptive-cost-quality-routing.md
+++ b/workplans/llm-connect-WP-0004-adaptive-cost-quality-routing.md
@@ -1,6 +1,20 @@
+---
+id: LLM-WP-0004
+type: workplan
+title: Adaptive Cost-Quality Routing
+domain: custodian
+status: completed
+owner: llm-connect
+created: 2026-05-17
+repo: llm-connect
+planning_priority: high
+planning_order: 4
+state_hub_workstream_id: e1807fab-e29e-4517-b362-95737a96582d
+---
+
 # LLM-WP-0004 — Adaptive Cost-Quality Routing

-**status:** todo
+**status:** completed
 **owner:** llm-connect
 **repo:** llm-connect
 **created:** 2026-05-17
@@ -60,53 +74,240 @@ consumer (`inter-hub`, `markitect`) adopts it.

 ## Tasks

+The fenced `task` blocks below are the State Hub registration index. Keep them
+in sync with the detailed task tables that follow.
+
+```task
+id: T01
+title: 'QualityObservation dataclass: task_type, adapter_id, model_id, cost_usd, quality_score (0..1), latency_ms, tokens_in, tokens_out, baseline_adapter_id, recorded_at, tags'
+priority: high
+status: done
+state_hub_task_id: "1c285bec-c30b-45a8-a408-3f91d810a078"
+```
+
+```task
+id: T02
+title: 'QualityLedger append-only JSONL store with file-locked writes, configurable path, simple query helpers (by_task_type, recent, mean_quality)'
+priority: high
+status: done
+state_hub_task_id: "5249f171-a047-499f-9ec4-cb50e1477765"
+```
+
+```task
+id: T03
+title: 'TTL helpers: prune_before(timestamp) and is_stale(observation, max_age)'
+priority: medium
+status: done
+state_hub_task_id: "adb255cf-7e89-4fea-b822-6be437d99789"
+```
+
+```task
+id: T04
+title: 'Functional contract doc for the ledger schema and quality_score semantics'
+priority: medium
+status: done
+state_hub_task_id: "51a33180-a99d-4aa4-96be-2fcee15bfbc3"
+```
+
+```task
+id: T05
+title: 'Tests: ledger round-trip, concurrent writes, query helpers, TTL, malformed-line resilience'
+priority: high
+status: done
+state_hub_task_id: "458610c5-c903-4b42-9602-cd511999c9ba"
+```
+
+```task
+id: T06
+title: 'GradingResult dataclass: quality_score, notes, grader_id, baseline_response, candidate_response'
+priority: high
+status: done
+state_hub_task_id: "c12a595b-90fc-4a80-8394-549edbda2031"
+```
+
+```task
+id: T07
+title: 'BaselineGrader protocol plus PairedGrader that runs baseline and candidate calls and delegates to a Judge'
+priority: high
+status: done
+state_hub_task_id: "80b98e31-06fc-4462-b030-a12881095f93"
+```
+
+```task
+id: T08
+title: 'Judge protocol and built-ins: ExactMatchJudge, EmbeddingSimilarityJudge, LLMJudge'
+priority: high
+status: done
+state_hub_task_id: "c2887fe3-bae6-4298-8c26-f9a519264dcf"
+```
+
+```task
+id: T09
+title: 'Functional contract doc covering judge bias caveats'
+priority: medium
+status: done
+state_hub_task_id: "7a4fd87a-b0ba-41b0-8e1a-a60fdaded905"
+```
+
+```task
+id: T10
+title: 'Tests: judges with canned inputs, stable grader result, deterministic LLMJudge rubric seed'
+priority: high
+status: done
+state_hub_task_id: "8415a11d-d508-4d17-8082-10f93e9d16c5"
+```
+
+```task
+id: T11
+title: 'AdaptiveRoutingPolicy extends RoutingPolicy and selects the cheapest adapter whose observed mean quality clears the floor'
+priority: high
+status: done
+state_hub_task_id: "0e9f9f8e-5066-4257-913b-a19f5b3fc47d"
+```
+
+```task
+id: T12
+title: 'Tie-breaking: prefer lower observed cost, then explicit preferred adapter from static rules'
+priority: medium
+status: done
+state_hub_task_id: "59d44712-1088-41ac-bad8-5d95db6f3a4f"
+```
+
+```task
+id: T13
+title: 'Cold-start behaviour falls through to static RoutingPolicy.resolve when observations are missing'
+priority: high
+status: done
+state_hub_task_id: "1927d369-f5f6-48d3-8f53-7e4f1cae370e"
+```
+
+```task
+id: T14
+title: 'Functional contract doc for adaptive policy and sample-size/freshness trade-off'
+priority: medium
+status: done
+state_hub_task_id: "4d4717c1-8849-4fed-8f8d-515901ecafe0"
+```
+
+```task
+id: T15
+title: 'Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain'
+priority: high
+status: done
+state_hub_task_id: "304bd782-db15-4b7a-8d05-49e064a926c3"
+```
+
+```task
+id: T16
+title: 'ShadowingAdapter wraps a candidate adapter, also invokes the baseline adapter, grades, and appends to QualityLedger'
+priority: medium
+status: done
+state_hub_task_id: "62dd507f-536a-4623-8cbd-fa9f78e85ca6"
+```
+
+```task
+id: T17
+title: 'Sampling: caller-configurable shadow_rate so production load is not doubled'
+priority: medium
+status: done
+state_hub_task_id: "ccb73e92-1fca-42f9-8437-9b2b50e6424c"
+```
+
+```task
+id: T18
+title: 'Failure isolation: shadow errors never affect the candidate response returned to the caller'
+priority: high
+status: done
+state_hub_task_id: "b879d232-d6ce-4ff6-b534-616729ea5ad7"
+```
+
+```task
+id: T19
+title: 'Functional contract doc for ShadowingAdapter'
+priority: low
+status: done
+state_hub_task_id: "99d2c1bc-f1d8-42b3-9e04-6eea49460943"
+```
+
+```task
+id: T20
+title: 'Tests: candidate response survives baseline failure, ledger sampling rate, sync vs async modes'
+priority: high
+status: done
+state_hub_task_id: "f533fbf4-484f-4408-8260-7e84e23bdc46"
+```
+
+```task
+id: T21
+title: 'Example script: route fixture batch through three candidate adapters and populate the ledger'
+priority: medium
+status: done
+state_hub_task_id: "7ef0c143-74b0-4740-81fa-819a826cf8f3"
+```
+
+```task
+id: T22
+title: 'Integration test: cold-start, static fallback, first observations, convergence to cheapest qualifying adapter'
+priority: high
+status: done
+state_hub_task_id: "c4c6743f-157b-4445-8576-9caa6421d463"
+```
+
+```task
+id: T23
+title: 'Consumer integration guide showing how infospace-bench wires task types into adaptive policy'
+priority: medium
+status: done
+state_hub_task_id: "3a073ff7-0170-4a95-9c2a-a5daa84964e6"
+```
+
 ### T01 — Quality observation data model + ledger

 | ID  | Title                                                                                                                                                                                       | Priority | Status |
 |-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
-| T01 | `QualityObservation` dataclass: `task_type`, `adapter_id`, `model_id`, `cost_usd`, `quality_score` (0..1), `latency_ms`, `tokens_in`, `tokens_out`, `baseline_adapter_id`, `recorded_at`, `tags` | high     | todo   |
-| T02 | `QualityLedger` append-only JSONL store with file-locked writes, configurable path, simple query helpers (`by_task_type`, `recent`, `mean_quality`)                                          | high     | todo   |
-| T03 | TTL helpers: `prune_before(timestamp)` and `is_stale(observation, max_age)` so callers can refresh observations without re-reading the whole ledger                                          | medium   | todo   |
-| T04 | Functional contract doc for the ledger schema and the field semantics of `quality_score`                                                                                                    | medium   | todo   |
-| T05 | Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience                                                                                                          | high     | todo   |
+| T01 | `QualityObservation` dataclass: `task_type`, `adapter_id`, `model_id`, `cost_usd`, `quality_score` (0..1), `latency_ms`, `tokens_in`, `tokens_out`, `baseline_adapter_id`, `recorded_at`, `tags` | high     | done   |
+| T02 | `QualityLedger` append-only JSONL store with file-locked writes, configurable path, simple query helpers (`by_task_type`, `recent`, `mean_quality`)                                          | high     | done   |
+| T03 | TTL helpers: `prune_before(timestamp)` and `is_stale(observation, max_age)` so callers can refresh observations without re-reading the whole ledger                                          | medium   | done   |
+| T04 | Functional contract doc for the ledger schema and the field semantics of `quality_score`                                                                                                    | medium   | done   |
+| T05 | Tests: round-trip, concurrent writes, query helpers, TTL, malformed-line resilience                                                                                                          | high     | done   |

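The ledger tasks above (T01 to T03, T05) can be illustrated with a minimal sketch. It assumes a POSIX host (`fcntl.flock` for the file lock), an ISO-8601 string for `recorded_at`, and a `read_all` helper that is not named in the workplan; the actual llm-connect implementation may differ on all three points.

```python
from __future__ import annotations

import fcntl  # POSIX-only advisory locking; an assumption of this sketch
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from pathlib import Path


@dataclass(frozen=True)
class QualityObservation:
    task_type: str
    adapter_id: str
    model_id: str
    cost_usd: float
    quality_score: float  # expected to lie in 0..1
    latency_ms: float
    tokens_in: int
    tokens_out: int
    baseline_adapter_id: str
    recorded_at: str  # ISO-8601 timestamp with offset (assumed representation)
    tags: list[str] = field(default_factory=list)


class QualityLedger:
    """Append-only JSONL store, one observation per line."""

    def __init__(self, path: str | Path) -> None:
        self.path = Path(path)

    def append(self, obs: QualityObservation) -> None:
        line = json.dumps(asdict(obs)) + "\n"
        with open(self.path, "a", encoding="utf-8") as fh:
            fcntl.flock(fh, fcntl.LOCK_EX)  # serialize concurrent writers
            try:
                fh.write(line)
                fh.flush()
            finally:
                fcntl.flock(fh, fcntl.LOCK_UN)

    def read_all(self) -> list[QualityObservation]:
        # Hypothetical helper: skips malformed lines instead of raising.
        if not self.path.exists():
            return []
        rows = []
        for raw in self.path.read_text(encoding="utf-8").splitlines():
            try:
                rows.append(QualityObservation(**json.loads(raw)))
            except (json.JSONDecodeError, TypeError):
                continue
        return rows

    def by_task_type(self, task_type: str) -> list[QualityObservation]:
        return [o for o in self.read_all() if o.task_type == task_type]

    def mean_quality(self, task_type: str, adapter_id: str) -> float | None:
        scores = [o.quality_score for o in self.by_task_type(task_type)
                  if o.adapter_id == adapter_id]
        return sum(scores) / len(scores) if scores else None


def is_stale(obs: QualityObservation, max_age: timedelta,
             now: datetime | None = None) -> bool:
    now = now or datetime.now(timezone.utc)
    return now - datetime.fromisoformat(obs.recorded_at) > max_age


if __name__ == "__main__":
    ledger = QualityLedger(Path(tempfile.mkdtemp()) / "ledger.jsonl")
    stamp = datetime.now(timezone.utc).isoformat()
    ledger.append(QualityObservation("summarize", "cheap", "model-a", 0.001,
                                     0.8, 120.0, 10, 20, "baseline", stamp))
    print(ledger.mean_quality("summarize", "cheap"))
```

Reading the whole file on every query keeps the sketch short; a real ledger would probably cache or index, which is presumably what the TTL helpers in T03 are meant to support.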
 ### T02 — Baseline grader

 | ID  | Title                                                                                                                                              | Priority | Status |
 |-----|----------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
-| T06 | `GradingResult` dataclass: `quality_score`, `notes`, `grader_id`, `baseline_response`, `candidate_response`                                          | high     | todo   |
-| T07 | `BaselineGrader` protocol: `.grade(baseline_adapter, candidate_adapter, prompt, run_config)` → `GradingResult`; built-in concrete `PairedGrader` runs both calls and delegates to a `Judge` | high     | todo   |
-| T08 | `Judge` protocol + three built-ins: `ExactMatchJudge`, `EmbeddingSimilarityJudge` (uses an embedding adapter), `LLMJudge` (uses a third adapter with a fixed rubric prompt)                  | high     | todo   |
-| T09 | Functional contract doc covering judge bias caveats (length bias, format bias, position bias for `LLMJudge`)                                       | medium   | todo   |
-| T10 | Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for `LLMJudge` rubric         | high     | todo   |
+| T06 | `GradingResult` dataclass: `quality_score`, `notes`, `grader_id`, `baseline_response`, `candidate_response`                                          | high     | done   |
+| T07 | `BaselineGrader` protocol: `.grade(baseline_adapter, candidate_adapter, prompt, run_config)` → `GradingResult`; built-in concrete `PairedGrader` runs both calls and delegates to a `Judge` | high     | done   |
+| T08 | `Judge` protocol + three built-ins: `ExactMatchJudge`, `EmbeddingSimilarityJudge` (uses an embedding adapter), `LLMJudge` (uses a third adapter with a fixed rubric prompt)                  | high     | done   |
+| T09 | Functional contract doc covering judge bias caveats (length bias, format bias, position bias for `LLMJudge`)                                       | medium   | done   |
+| T10 | Tests: each judge against canned inputs, grader emits stable result with both responses preserved, deterministic seed for `LLMJudge` rubric         | high     | done   |

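A rough sketch of how the grader pieces in T06 to T08 might fit together follows. Adapters are modelled as plain `str -> str` callables, the `Judge.score` method name is a guess, and `TokenOverlapJudge` is a hypothetical stand-in for a graded judge such as `EmbeddingSimilarityJudge`; none of these details are confirmed by the workplan.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional, Protocol

# Hypothetical stand-in for the real adapter interface: prompt in, text out.
Adapter = Callable[[str], str]


@dataclass(frozen=True)
class GradingResult:
    quality_score: float
    notes: str
    grader_id: str
    baseline_response: str
    candidate_response: str


class Judge(Protocol):
    def score(self, baseline: str, candidate: str) -> float:
        """Return a similarity score in 0..1 (method name is an assumption)."""
        ...


class ExactMatchJudge:
    def score(self, baseline: str, candidate: str) -> float:
        return 1.0 if baseline.strip() == candidate.strip() else 0.0


class TokenOverlapJudge:
    """Hypothetical judge: Jaccard overlap of whitespace-separated tokens."""

    def score(self, baseline: str, candidate: str) -> float:
        a, b = set(baseline.split()), set(candidate.split())
        if not a and not b:
            return 1.0
        return len(a & b) / len(a | b)


class PairedGrader:
    """Runs both adapters on the same prompt and delegates to a Judge."""

    def __init__(self, judge: Judge, grader_id: str = "paired") -> None:
        self.judge = judge
        self.grader_id = grader_id

    def grade(self, baseline_adapter: Adapter, candidate_adapter: Adapter,
              prompt: str, run_config: Optional[Any] = None) -> GradingResult:
        # run_config is accepted for signature parity but unused in this sketch.
        baseline = baseline_adapter(prompt)
        candidate = candidate_adapter(prompt)
        raw = self.judge.score(baseline, candidate)
        score = min(1.0, max(0.0, raw))  # clamp to the documented 0..1 range
        return GradingResult(
            quality_score=score,
            notes=f"judge={type(self.judge).__name__}",
            grader_id=self.grader_id,
            baseline_response=baseline,
            candidate_response=candidate,
        )


if __name__ == "__main__":
    grader = PairedGrader(TokenOverlapJudge())
    result = grader.grade(lambda p: "a b c", lambda p: "a b d", "prompt")
    print(result.quality_score, result.notes)
```

Keeping both responses on the result, as T06 specifies, lets a later reviewer re-judge an observation with a different `Judge` without calling either adapter again.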
 ### T03 — Adaptive routing policy

 | ID  | Title                                                                                                                                            | Priority | Status |
 |-----|--------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
-| T11 | `AdaptiveRoutingPolicy` extends `RoutingPolicy`: given `task_type` + `quality_floor` + `ledger`, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window | high     | todo   |
-| T12 | Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules | medium   | todo   |
-| T13 | Cold-start behaviour: when no observations exist for a `(task_type, adapter)` pair, fall through to the static `RoutingPolicy.resolve` result so the system stays usable on day zero | high     | todo   |
-| T14 | Functional contract doc; document the trade-off between sample size and freshness                                                                | medium   | todo   |
-| T15 | Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain                                                              | high     | todo   |
+| T11 | `AdaptiveRoutingPolicy` extends `RoutingPolicy`: given `task_type` + `quality_floor` + `ledger`, returns the cheapest adapter whose observed mean quality clears the floor over a configurable window | high     | done   |
+| T12 | Tie-breaking: when two adapters meet the floor, prefer lower observed cost; if still tied, prefer the explicitly-preferred adapter from the underlying static rules | medium   | done   |
+| T13 | Cold-start behaviour: when no observations exist for a `(task_type, adapter)` pair, fall through to the static `RoutingPolicy.resolve` result so the system stays usable on day zero | high     | done   |
+| T14 | Functional contract doc; document the trade-off between sample size and freshness                                                                | medium   | done   |
+| T15 | Tests: floor enforcement, tie-break, cold-start, window-size effect, fallback chain                                                              | high     | done   |

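The selection rule described in T11 to T13 can be sketched as a single function. This is an illustration under stated assumptions rather than the actual `AdaptiveRoutingPolicy`: `Obs` and `select_adapter` are hypothetical names, "clears the floor" is read as `>=`, observations are assumed to arrive oldest first, and the static choice is also returned when observations exist but no adapter qualifies (the workplan only specifies the no-observation case).

```python
from __future__ import annotations

from collections import defaultdict
from statistics import mean
from typing import Iterable, NamedTuple


class Obs(NamedTuple):
    """Hypothetical, trimmed-down observation row."""
    task_type: str
    adapter_id: str
    cost_usd: float
    quality_score: float


def select_adapter(observations: Iterable[Obs], task_type: str,
                   quality_floor: float, static_choice: str,
                   window: int = 20) -> str:
    """Pick the cheapest adapter whose recent mean quality clears the floor.

    `static_choice` stands in for the result of the static
    RoutingPolicy.resolve call and doubles as the preferred adapter
    when observed costs tie.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    per_adapter: dict[str, list[Obs]] = defaultdict(list)
    for obs in observations:  # assumed to be ordered oldest first
        if obs.task_type == task_type:
            per_adapter[obs.adapter_id].append(obs)

    qualifying = []
    for adapter_id, rows in per_adapter.items():
        recent = rows[-window:]  # only the newest `window` observations count
        if mean(r.quality_score for r in recent) >= quality_floor:
            # Sort key: observed cost, then "is not the preferred adapter",
            # then the id itself so the result is deterministic.
            qualifying.append((mean(r.cost_usd for r in recent),
                               adapter_id != static_choice,
                               adapter_id))

    if not qualifying:
        return static_choice  # cold start, or nothing clears the floor
    return min(qualifying)[2]


if __name__ == "__main__":
    history = [
        Obs("summarize", "cheap", 0.001, 0.90),
        Obs("summarize", "mid", 0.010, 0.95),
    ]
    print(select_adapter(history, "summarize", 0.8, "baseline"))
```

The `window` parameter is where the sample-size versus freshness trade-off from T14 shows up: a small window reacts quickly to a regression but lets a single bad observation disqualify an adapter.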
 ### T04 — Shadow-mode observation wrapper

 | ID  | Title                                                                                                                                                                       | Priority | Status |
 |-----|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
-| T16 | `ShadowingAdapter` wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a `QualityLedger`         | medium   | todo   |
-| T17 | Sampling: caller-configurable fraction (`shadow_rate=0.1` means grade one call in ten) so production load is not doubled                                                     | medium   | todo   |
-| T18 | Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised                                              | high     | todo   |
-| T19 | Functional contract doc                                                                                                                                                     | low      | todo   |
-| T20 | Tests: candidate response always returned even when baseline raises, ledger gets exactly `shadow_rate × calls` entries (within tolerance), sync vs async modes              | high     | todo   |
+| T16 | `ShadowingAdapter` wraps a candidate adapter; on each call, also invokes the baseline adapter (sync or via a thread pool), grades, and appends to a `QualityLedger`         | medium   | done   |
+| T17 | Sampling: caller-configurable fraction (`shadow_rate=0.1` means grade one call in ten) so production load is not doubled                                                     | medium   | done   |
+| T18 | Failure isolation: shadow errors never affect the candidate response returned to the caller; failures are logged but not raised                                              | high     | done   |
+| T19 | Functional contract doc                                                                                                                                                     | low      | done   |
+| T20 | Tests: candidate response always returned even when baseline raises, ledger gets roughly `shadow_rate × calls` entries (within tolerance), sync vs async modes              | high     | done   |

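The sampling and failure-isolation behaviour in T16 to T18 might look roughly like the synchronous sketch below. The constructor parameters (`grade`, `record`, `rng`) and the callable adapter shape are assumptions made for illustration; the thread-pool mode mentioned in T16 is left out.

```python
from __future__ import annotations

import logging
import random
from typing import Callable, Optional

logger = logging.getLogger("shadowing_sketch")

# Hypothetical stand-in for the real adapter interface: prompt in, text out.
Adapter = Callable[[str], str]


class ShadowingAdapter:
    """Wraps a candidate adapter and grades a sampled fraction of calls."""

    def __init__(self, candidate: Adapter, baseline: Adapter,
                 grade: Callable[[str, str], float],
                 record: Callable[[str, float], None],
                 shadow_rate: float = 0.1,
                 rng: Optional[random.Random] = None) -> None:
        if not 0.0 <= shadow_rate <= 1.0:
            raise ValueError("shadow_rate must be within 0..1")
        self.candidate = candidate
        self.baseline = baseline
        self.grade = grade      # (baseline_response, candidate_response) -> score
        self.record = record    # e.g. a closure that appends to a ledger
        self.shadow_rate = shadow_rate
        self.rng = rng or random.Random()

    def __call__(self, prompt: str) -> str:
        # Candidate errors propagate as usual; only the shadow path is isolated.
        response = self.candidate(prompt)
        if self.rng.random() < self.shadow_rate:
            try:
                baseline_response = self.baseline(prompt)
                self.record(prompt, self.grade(baseline_response, response))
            except Exception:
                # Logged but never raised, so the caller still gets `response`.
                logger.exception("shadow grading failed")
        return response


if __name__ == "__main__":
    scores: list[float] = []
    adapter = ShadowingAdapter(
        candidate=lambda p: p.upper(),
        baseline=lambda p: p.upper(),
        grade=lambda b, c: 1.0 if b == c else 0.0,
        record=lambda prompt, score: scores.append(score),
        shadow_rate=1.0,
    )
    print(adapter("hello"), scores)
```

Because this version runs the baseline call inline, every sampled request pays the baseline latency; offloading the `try` block to a worker thread would remove that cost at the price of ordering guarantees in the ledger.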
 ### T05 — End-to-end example + integration test

 | ID  | Title                                                                                                                                                                                 | Priority | Status |
 |-----|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------|--------|
-| T21 | Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, `ClaudeCodeAdapter` as baseline), grade each, populate ledger | medium   | todo   |
-| T22 | Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter                   | high     | todo   |
-| T23 | Brief consumer-integration guide in `docs/` showing how `infospace-bench` (or any caller) wires task-type-per-stage into the adaptive policy                                            | medium   | todo   |
+| T21 | Example script: route a small fixture batch through three candidate adapters (one OpenRouter cheap, one OpenRouter mid, `ClaudeCodeAdapter` as baseline), grade each, populate ledger | medium   | done   |
+| T22 | Integration test with mocked adapters covering: cold-start → static fallback → first observations → adaptive selection converges to the cheapest qualifying adapter                   | high     | done   |
+| T23 | Brief consumer-integration guide in `docs/` showing how `infospace-bench` (or any caller) wires task-type-per-stage into the adaptive policy                                            | medium   | done   |

 ## Risks and open questions