infospace-bench/workplans/IB-WP-0018-adaptive-llm-routing-consumer.md
tegwick f818acfc62 IB-WP-0018-T03+T04: shadow sampling + report/CLI surfacing; close IB-WP-0018
T03 — wrap_with_shadow_sampling() helper in routing.py: builds a
llm-connect ShadowingAdapter around any candidate LLMAdapter with a
caller-supplied baseline, grader, and QualityLedger. async_shadow=True
by default so production load is not doubled; on_shadow_error escape
hatch keeps caller logs informed when a baseline outage swallows the
shadow path. The returned adapter is still an LLMAdapter so it slots
into a RoutingPolicy rule without further code change.

T04 — generation report enrichment plus a small CLI helper:

- _collect_adapter_choices walks artifact provenance, groups by
  (stage_id, adapter_id), and surfaces calls + prompt/completion tokens
  per (stage, adapter) pair in a new ## Per-stage adapter choices
  section. Runs that did not go through the bridge have no
  provider_metadata.adapter_id and emit an empty list, so fixture-only
  reports stay terse.
- summarise_quality_ledger() rolls a llm-connect QualityLedger up by
  (task_type, adapter_id) with mean quality, mean cost, observations,
  and cumulative tokens.
- infospace-bench routing ledger <path> CLI prints the rollup as JSON.

Five new tests cover shadow happy-path, shadow failure isolation,
ledger rollup, the routing CLI, and the report's adapter-choice
aggregation. Closes IB-WP-0018: T01-T05 are all done and the workplan
status flips from blocked to done now that LLM-WP-0004's primitives
have shipped.

144 tests pass, 1 skipped (the OpenRouter live smoke, gated as before).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 11:52:05 +02:00


id: IB-WP-0018
type: workplan
title: Adaptive LLM Routing — infospace-bench Consumer Wiring
domain: markitect
repo: infospace-bench
status: done
owner: markitect
topic_slug: markitect
created: 2026-05-17
updated: 2026-05-18
depends_on_workplans: LLM-WP-0004
related_workplans: IB-WP-0016
state_hub_workstream_id: 3d38642e-9d6d-4c7f-869f-b185a00bd0e6

IB-WP-0018 — Adaptive LLM Routing — infospace-bench Consumer Wiring

Goal

Wire infospace-bench workflow stages to llm-connect's adaptive cost-quality routing once LLM-WP-0004 ships the primitives. The goal is to let an infospace generation run pick the cheapest model that clears a per-stage quality bar — for example, a small/cheap model for chunk summarisation and a larger model for entity/relation extraction — without hardcoding any specific model in infospace-bench itself.
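As a minimal sketch of the selection rule described above (cheapest adapter that clears a per-stage quality bar), the following assumes a hypothetical `AdapterStats` summary and invented adapter ids, quality scores, and costs. It is not llm-connect's AdaptiveRoutingPolicy, only an illustration of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterStats:
    """Hypothetical per-adapter summary for one task type (not an llm-connect type)."""
    adapter_id: str
    mean_quality: float  # assumed scale: 0.0 to 1.0, higher is better
    mean_cost: float     # arbitrary cost units per call


def pick_cheapest_above_floor(stats, quality_floor, fallback_id):
    """Return the cheapest adapter whose mean quality clears the floor.

    Falls back to `fallback_id` when no adapter qualifies.
    """
    eligible = [s for s in stats if s.mean_quality >= quality_floor]
    if not eligible:
        return fallback_id
    return min(eligible, key=lambda s: s.mean_cost).adapter_id


# Invented numbers, purely for illustration.
stats = [
    AdapterStats("small-model", mean_quality=0.82, mean_cost=1.0),
    AdapterStats("large-model", mean_quality=0.95, mean_cost=8.0),
]
summary_choice = pick_cheapest_above_floor(stats, 0.80, "large-model")
extraction_choice = pick_cheapest_above_floor(stats, 0.90, "large-model")
```

With a low floor the small model wins on cost; raising the floor pushes the choice to the larger model, which is the behaviour the goal describes for summarisation versus extraction.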

This workplan is a stub until LLM-WP-0004 tasks T01..T03 (ledger, grader, adaptive policy) are done in llm-connect. The exact task list will be refined once that API is stable.

Status

Done. LLM-WP-0004 landed QualityLedger, QualityObservation, BaselineGrader/PairedGrader/ExactMatchJudge/EmbeddingSimilarityJudge/LLMJudge, AdaptiveRoutingPolicy, and ShadowingAdapter in llm-connect; the five tasks below are all complete.

  • T01 — task-type taxonomy (docs/routing-task-types.md)
  • T02 — RoutingAssistedGenerationAdapter bridge in src/infospace_bench/routing.py
  • T03 — opt-in wrap_with_shadow_sampling() helper that installs llm-connect's ShadowingAdapter around any candidate adapter
  • T04 — a "## Per-stage adapter choices" section in reports/generation-summary.md (driven by artifact provenance.provider_metadata) and the infospace-bench routing ledger CLI subcommand
  • T05 — tests/test_routing_adapter.py (13 tests, including a CLI smoke and the adapter-choices unit test)

Why this is a separate workplan

IB-WP-0016 brings the Lefevre EPUB pipeline to a state where a chapter-by-chapter live OpenRouter run is feasible. That work uses OpenRouterAssistedGenerationAdapter directly. Replacing that direct adapter with a task-typed adaptive route is a meaningful architectural shift that deserves its own scope, baseline, and tests, rather than being absorbed into IB-WP-0016.

Provisional Tasks (refined when LLM-WP-0004 lands)

T01 — Task-type taxonomy

  • Name the generation stages as task types for routing (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report)
  • Document quality expectations for each task type so a per-stage quality floor can be set
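One way to express the taxonomy above in code is an enum plus a floor table. The five task-type names come from this workplan; the `TaskType` class, the `quality_floor_for` helper, and every numeric floor are placeholders assumed for illustration. They are not values from docs/routing-task-types.md.

```python
from enum import Enum


class TaskType(str, Enum):
    """Generation stages named as routing task types."""
    SUMMARIZE_SOURCE = "summarize-source"
    EXTRACT_ENTITIES = "extract-entities"
    EXTRACT_RELATIONS = "extract-relations"
    EVALUATE_ENTITY = "evaluate-entity"
    SYNTHESIZE_REPORT = "synthesize-report"


# Placeholder floors on an assumed 0.0 to 1.0 quality scale.
QUALITY_FLOORS = {
    TaskType.SUMMARIZE_SOURCE: 0.70,
    TaskType.EXTRACT_ENTITIES: 0.90,
    TaskType.EXTRACT_RELATIONS: 0.90,
    TaskType.EVALUATE_ENTITY: 0.85,
    TaskType.SYNTHESIZE_REPORT: 0.80,
}


def quality_floor_for(task_type: str) -> float:
    """Look up the floor for a task type; unknown names raise ValueError."""
    return QUALITY_FLOORS[TaskType(task_type)]
```

Rejecting unknown names early keeps a typo in a stage id from silently routing with no floor at all.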

T02 — Adapter swap

  • Introduce a small router-aware adapter that wraps AdaptiveRoutingPolicy.resolve(task_type) and exposes the existing AssistedGenerationAdapter protocol used by workflow.py
  • Keep OpenRouterAssistedGenerationAdapter available as the static baseline so deterministic test runs and fixture mode continue to work
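The adapter swap can be pictured roughly as below. Only the `resolve(task_type)` call shape is taken from this workplan; the `generate` method name, `StaticPolicy`, `EchoAdapter`, and `RouterAwareAdapter` are hypothetical stand-ins, since the real AssistedGenerationAdapter protocol and AdaptiveRoutingPolicy signatures are not shown here.

```python
from typing import Protocol


class GenerationAdapter(Protocol):
    """Stand-in for the protocol workflow.py consumes (method name assumed)."""

    def generate(self, prompt: str) -> str: ...


class StaticPolicy:
    """Hypothetical policy exposing resolve(task_type) over a fixed table."""

    def __init__(self, routes, default):
        self._routes = routes
        self._default = default

    def resolve(self, task_type):
        return self._routes.get(task_type, self._default)


class EchoAdapter:
    """Deterministic fake adapter that tags output with its id."""

    def __init__(self, adapter_id):
        self.adapter_id = adapter_id

    def generate(self, prompt):
        return f"[{self.adapter_id}] {prompt}"


class RouterAwareAdapter:
    """Looks like a plain adapter to the workflow, but resolves per call."""

    def __init__(self, policy, task_type):
        self._policy = policy
        self._task_type = task_type

    def generate(self, prompt):
        adapter = self._policy.resolve(self._task_type)
        return adapter.generate(prompt)


policy = StaticPolicy(
    {"summarize-source": EchoAdapter("small-model")},
    default=EchoAdapter("large-model"),
)
summariser = RouterAwareAdapter(policy, "summarize-source")
extractor = RouterAwareAdapter(policy, "extract-entities")
```

Because the wrapper exposes the same method as a static adapter, the workflow can receive either one, which is what lets the static baseline stay available for fixture runs.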

T03 — Baseline + shadow integration

  • Use ClaudeCodeAdapter as the default baseline grader (subject to availability)
  • Enable ShadowingAdapter for the first multi-chapter run so the quality ledger fills up while real generation proceeds
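The shadow pattern itself can be sketched without llm-connect. The parameter names `async_shadow` and `on_shadow_error` mirror the commit message for this workplan; the `ShadowingSketch` class, the use of plain callables for candidate, baseline, and grader, the list-based ledger, and the `join` helper are all assumptions made for illustration, not the ShadowingAdapter API.

```python
import threading


class ShadowingSketch:
    """Serve the candidate's answer; grade it against a baseline on the side."""

    def __init__(self, candidate, baseline, grader, ledger,
                 async_shadow=True, on_shadow_error=None):
        self._candidate = candidate
        self._baseline = baseline
        self._grader = grader
        self._ledger = ledger
        self._async_shadow = async_shadow
        self._on_shadow_error = on_shadow_error
        self._threads = []

    def generate(self, prompt):
        answer = self._candidate(prompt)
        if self._async_shadow:
            # Run the baseline off the caller's path so latency is unaffected.
            thread = threading.Thread(
                target=self._shadow, args=(prompt, answer), daemon=True
            )
            thread.start()
            self._threads.append(thread)
        else:
            self._shadow(prompt, answer)
        return answer

    def _shadow(self, prompt, answer):
        try:
            reference = self._baseline(prompt)
            quality = self._grader(answer, reference)
            self._ledger.append({"prompt": prompt, "quality": quality})
        except Exception as exc:  # a baseline outage must not break generation
            if self._on_shadow_error is not None:
                self._on_shadow_error(exc)

    def join(self):
        """Wait for pending shadow work (useful in tests)."""
        for thread in self._threads:
            thread.join()


ledger, errors = [], []
shadowed = ShadowingSketch(
    candidate=lambda p: p.upper(),
    baseline=lambda p: p.upper(),
    grader=lambda answer, reference: 1.0 if answer == reference else 0.0,
    ledger=ledger,
    on_shadow_error=errors.append,
)
result = shadowed.generate("hello")
shadowed.join()
```

The candidate's answer is returned before the baseline runs, and a failing baseline only reaches the error callback, so real generation proceeds while the ledger fills.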

T04 — Cost/quality reporting

  • Surface per-stage chosen adapter, observed quality, and cumulative cost in reports/generation-summary.md
  • Add a small CLI helper to print the ledger summary for an infospace
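The two reporting pieces reduce to group-by aggregations. The sketch below groups artifacts by (stage_id, adapter_id) and ledger observations by (task_type, adapter_id), as the commit message for this workplan describes. The dictionary shapes, the token and cost field names, and both function names are assumptions, not the actual _collect_adapter_choices or summarise_quality_ledger() implementations.

```python
import json
from collections import defaultdict


def collect_adapter_choices_sketch(artifacts):
    """Count calls and tokens per (stage_id, adapter_id) pair."""
    totals = defaultdict(
        lambda: {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0}
    )
    for artifact in artifacts:
        meta = artifact.get("provenance", {}).get("provider_metadata", {})
        adapter_id = meta.get("adapter_id")
        if adapter_id is None:
            continue  # run did not go through the routing bridge
        row = totals[(artifact["stage_id"], adapter_id)]
        row["calls"] += 1
        row["prompt_tokens"] += meta.get("prompt_tokens", 0)
        row["completion_tokens"] += meta.get("completion_tokens", 0)
    return [
        {"stage_id": stage, "adapter_id": adapter, **row}
        for (stage, adapter), row in sorted(totals.items())
    ]


def summarise_observations_sketch(observations):
    """Roll observations up by (task_type, adapter_id)."""
    groups = defaultdict(list)
    for obs in observations:
        groups[(obs["task_type"], obs["adapter_id"])].append(obs)
    return [
        {
            "task_type": task_type,
            "adapter_id": adapter_id,
            "observations": len(rows),
            "mean_quality": sum(r["quality"] for r in rows) / len(rows),
            "mean_cost": sum(r["cost"] for r in rows) / len(rows),
            "tokens": sum(r["tokens"] for r in rows),
        }
        for (task_type, adapter_id), rows in sorted(groups.items())
    ]


# Invented sample data, purely for illustration.
artifacts = [
    {"stage_id": "summarize-source", "provenance": {"provider_metadata": {
        "adapter_id": "small-model", "prompt_tokens": 100, "completion_tokens": 20}}},
    {"stage_id": "summarize-source", "provenance": {"provider_metadata": {
        "adapter_id": "small-model", "prompt_tokens": 50, "completion_tokens": 10}}},
    {"stage_id": "extract-entities", "provenance": {}},  # fixture-only artifact
]
observations = [
    {"task_type": "summarize-source", "adapter_id": "small-model",
     "quality": 0.5, "cost": 2.0, "tokens": 120},
    {"task_type": "summarize-source", "adapter_id": "small-model",
     "quality": 1.0, "cost": 4.0, "tokens": 60},
]
choices = collect_adapter_choices_sketch(artifacts)
rollup = summarise_observations_sketch(observations)
rollup_json = json.dumps(rollup, indent=2)  # what a CLI helper might print
```

Artifacts without an adapter_id are skipped, so a fixture-only run yields an empty list and the report section stays short.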

T05 — Tests

  • Fixture-backed test that routes through a deterministic adaptive policy with mocked observations
  • Regression test that demonstrates the static path still works when the router is bypassed

Acceptance

  • An infospace generation run can be configured to use the adaptive router without any code change inside workflow.py
  • A multi-chapter Lefevre run completes with per-stage adapter choices recorded in the generation summary
  • The fixture-mode test suite continues to pass with no live calls
  • The static OpenRouterAssistedGenerationAdapter path remains usable for callers that opt out of the router

Non-Goals

  • Authoring the routing primitives themselves (that is LLM-WP-0004's job)
  • Owning a task-type taxonomy beyond infospace-bench workflow stages
  • Embedding cost or quality observations inside infospace-bench beyond what the llm-connect ledger already records