tegwick f818acfc62 IB-WP-0018-T03+T04: shadow sampling + report/CLI surfacing; close IB-WP-0018

T03 — wrap_with_shadow_sampling() helper in routing.py: builds a
llm-connect ShadowingAdapter around any candidate LLMAdapter with a
caller-supplied baseline, grader, and QualityLedger. async_shadow=True
by default so production load is not doubled; on_shadow_error escape
hatch keeps caller logs informed when a baseline outage swallows the
shadow path. The returned adapter is still an LLMAdapter so it slots
into a RoutingPolicy rule without further code change.
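
A minimal sketch of the shadow-sampling idea described for T03, under stated
assumptions: `Adapter`, `ShadowSketch`, and the single `complete()` method are
hypothetical stand-ins, not the llm-connect `LLMAdapter` or `ShadowingAdapter`
API, and a plain list stands in for the `QualityLedger`. It only illustrates
the control flow: the candidate's answer is returned to the caller, the
baseline is consulted on the side, and a baseline failure is reported through
`on_shadow_error` rather than raised.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Protocol


class Adapter(Protocol):
    """Hypothetical stand-in for llm-connect's LLMAdapter interface."""

    def complete(self, prompt: str) -> str: ...


class ShadowSketch:
    """Serve the candidate's answer and grade it against a baseline on the side."""

    def __init__(
        self,
        candidate: Adapter,
        baseline: Adapter,
        grader: Callable[[str, str], float],
        ledger: List[float],
        async_shadow: bool = True,
        on_shadow_error: Optional[Callable[[Exception], None]] = None,
    ) -> None:
        self.candidate = candidate
        self.baseline = baseline
        self.grader = grader  # (candidate_output, baseline_output) -> score
        self.ledger = ledger  # plain list standing in for a QualityLedger
        self.async_shadow = async_shadow
        self.on_shadow_error = on_shadow_error
        self._pool = ThreadPoolExecutor(max_workers=1)

    def complete(self, prompt: str) -> str:
        answer = self.candidate.complete(prompt)
        if self.async_shadow:
            # Grade in the background so the caller does not wait for the baseline.
            self._pool.submit(self._shadow, prompt, answer)
        else:
            self._shadow(prompt, answer)
        return answer

    def _shadow(self, prompt: str, answer: str) -> None:
        try:
            reference = self.baseline.complete(prompt)
            self.ledger.append(self.grader(answer, reference))
        except Exception as exc:  # a shadow failure must never reach the caller
            if self.on_shadow_error is not None:
                self.on_shadow_error(exc)
```

Because the wrapper exposes the same `complete()` method as the adapters it
wraps, it can be dropped in wherever a candidate adapter is expected, which is
the property the commit relies on for RoutingPolicy rules.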

T04 — generation report enrichment plus a small CLI helper:

- _collect_adapter_choices walks artifact provenance, groups by
  (stage_id, adapter_id), and surfaces calls + prompt/completion tokens
  per (stage, adapter) pair in a new ## Per-stage adapter choices
  section. Runs that did not go through the bridge have no
  provider_metadata.adapter_id and emit an empty list, so fixture-only
  reports stay terse.
- summarise_quality_ledger() rolls a llm-connect QualityLedger up by
  (task_type, adapter_id) with mean quality, mean cost, observations,
  and cumulative tokens.
- infospace-bench routing ledger <path> CLI prints the rollup as JSON.
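
The rollup described for summarise_quality_ledger() can be sketched as a plain
group-by. This is an illustration only: the `Observation` record and its field
names are assumptions, and the real llm-connect `QualityObservation` may carry
different fields.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Observation:
    """Hypothetical record shape; the real QualityObservation may differ."""

    task_type: str
    adapter_id: str
    quality: float
    cost: float
    tokens: int


def rollup(observations: Iterable[Observation]) -> List[dict]:
    """Group by (task_type, adapter_id) and compute the per-group summary."""
    groups: Dict[Tuple[str, str], List[Observation]] = defaultdict(list)
    for obs in observations:
        groups[(obs.task_type, obs.adapter_id)].append(obs)

    rows = []
    for (task_type, adapter_id), items in sorted(groups.items()):
        count = len(items)
        rows.append(
            {
                "task_type": task_type,
                "adapter_id": adapter_id,
                "observations": count,
                "mean_quality": sum(o.quality for o in items) / count,
                "mean_cost": sum(o.cost for o in items) / count,
                "tokens": sum(o.tokens for o in items),  # cumulative, not averaged
            }
        )
    return rows
```

Passing the returned rows to `json.dumps` yields the kind of JSON output the
CLI bullet describes.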

Five new tests cover shadow happy-path, shadow failure isolation,
ledger rollup, the routing CLI, and the report's adapter-choice
aggregation. Closes IB-WP-0018: T01-T05 are all done and the workplan
status flips from blocked to done now that LLM-WP-0004's primitives
have shipped.

144 tests pass, 1 skipped (the OpenRouter live smoke, gated as before).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-18 11:52:05 +02:00

4.3 KiB

id, type, title, domain, repo, status, owner, topic_slug, created, updated, depends_on_workplans, related_workplans, state_hub_workstream_id

IB-WP-0018

workplan

Adaptive LLM Routing — infospace-bench Consumer Wiring

markitect

infospace-bench

done

markitect

2026-05-17

2026-05-18

LLM-WP-0004

IB-WP-0016

3d38642e-9d6d-4c7f-869f-b185a00bd0e6

IB-WP-0018 — Adaptive LLM Routing — infospace-bench Consumer Wiring

Goal

Wire infospace-bench workflow stages to llm-connect's adaptive cost-quality routing once LLM-WP-0004 ships the primitives. The goal is to let an infospace generation run pick the cheapest model that clears a per-stage quality bar — for example, a small/cheap model for chunk summarisation and a larger model for entity/relation extraction — without hardcoding any specific model in infospace-bench itself.
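
To make the selection rule concrete, here is a small sketch of "cheapest model
that clears a quality bar". The `Candidate` record, its fields, and the
function name are hypothetical; llm-connect's AdaptiveRoutingPolicy may weigh
cost and quality differently.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of one adapter's observed behaviour for a task type."""

    adapter_id: str
    mean_quality: float
    cost_per_call: float


def cheapest_clearing(
    candidates: Iterable[Candidate], quality_floor: float
) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose quality meets the floor, if any."""
    eligible = [c for c in candidates if c.mean_quality >= quality_floor]
    return min(eligible, key=lambda c: c.cost_per_call, default=None)
```

Returning `None` when nothing clears the floor leaves the fallback decision
(for example, using a fixed baseline model) to the caller.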

This workplan was a stub until LLM-WP-0004 tasks T01..T03 (ledger, grader, adaptive policy) were done in llm-connect; the exact task list was to be refined once that API was stable.

Status

Done. LLM-WP-0004 landed QualityLedger, QualityObservation, BaselineGrader/PairedGrader/ExactMatchJudge/EmbeddingSimilarityJudge/LLMJudge, AdaptiveRoutingPolicy, and ShadowingAdapter in llm-connect; the five tasks below are all complete.

T01 — task-type taxonomy (docs/routing-task-types.md)
T02 — RoutingAssistedGenerationAdapter bridge in src/infospace_bench/routing.py
T03 — wrap_with_shadow_sampling() helper that, when a caller opts in, installs llm-connect's ShadowingAdapter around any candidate adapter
T04 — ## Per-stage adapter choices section in reports/generation-summary.md (driven from artifact provenance.provider_metadata) and infospace-bench routing ledger CLI subcommand
T05 — tests/test_routing_adapter.py (13 tests, including a CLI smoke and the adapter-choices unit test)
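
The T04 aggregation behind the "Per-stage adapter choices" section could look
roughly like the following. This is a sketch under assumptions: artifacts are
modelled as plain dicts, and the locations of `stage_id`, `prompt_tokens`, and
`completion_tokens` inside the provenance are guesses; only
`provider_metadata.adapter_id` is named in the source.

```python
from collections import defaultdict
from typing import Iterable, List, Mapping


def collect_adapter_choices(artifacts: Iterable[Mapping]) -> List[dict]:
    """Sum calls and token counts per (stage_id, adapter_id) pair."""
    totals = defaultdict(
        lambda: {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0}
    )
    for artifact in artifacts:
        provenance = artifact.get("provenance") or {}
        metadata = provenance.get("provider_metadata") or {}
        adapter_id = metadata.get("adapter_id")
        if adapter_id is None:
            continue  # not produced through the routing bridge, e.g. fixture runs
        row = totals[(str(provenance.get("stage_id", "")), adapter_id)]
        row["calls"] += 1
        row["prompt_tokens"] += metadata.get("prompt_tokens", 0)
        row["completion_tokens"] += metadata.get("completion_tokens", 0)

    return [
        {"stage_id": stage_id, "adapter_id": adapter_id, **row}
        for (stage_id, adapter_id), row in sorted(totals.items())
    ]
```

An empty result is the signal for the report writer to omit the section, which
matches the behaviour described for fixture-only runs.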

Why this is a separate workplan

IB-WP-0016 brings the Lefevre EPUB pipeline to a state where a chapter-by-chapter live OpenRouter run is feasible. That work uses OpenRouterAssistedGenerationAdapter directly. Replacing that direct adapter with a task-typed adaptive route is a meaningful architectural shift that deserves its own scope, baseline, and tests, rather than being absorbed into IB-WP-0016.

Provisional Tasks (refined when LLM-WP-0004 lands)

T01 — Task-type taxonomy

Name the generation stages as task types for routing (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report)
Document quality expectations for each task type so a per-stage quality floor can be set
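
One way to pin the taxonomy down in code is an enum plus a floor per task type.
The task-type names come from the list above; the floor values are placeholders
chosen for illustration, not figures from docs/routing-task-types.md, and
`quality_floor` is a hypothetical helper.

```python
from enum import Enum
from typing import Dict


class TaskType(str, Enum):
    SUMMARIZE_SOURCE = "summarize-source"
    EXTRACT_ENTITIES = "extract-entities"
    EXTRACT_RELATIONS = "extract-relations"
    EVALUATE_ENTITY = "evaluate-entity"
    SYNTHESIZE_REPORT = "synthesize-report"


# Placeholder floors for illustration only; real values belong in the taxonomy doc.
QUALITY_FLOORS: Dict[TaskType, float] = {
    TaskType.SUMMARIZE_SOURCE: 0.6,
    TaskType.EXTRACT_ENTITIES: 0.8,
    TaskType.EXTRACT_RELATIONS: 0.8,
    TaskType.EVALUATE_ENTITY: 0.7,
    TaskType.SYNTHESIZE_REPORT: 0.7,
}


def quality_floor(task_type: str) -> float:
    """Look up the floor for a task-type name; unknown names raise ValueError."""
    return QUALITY_FLOORS[TaskType(task_type)]
```

Routing unknown names through `TaskType(...)` means a typo in a stage
configuration fails loudly instead of silently routing without a floor.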

T02 — Adapter swap

Introduce a small router-aware adapter that wraps AdaptiveRoutingPolicy.resolve(task_type) and exposes the existing AssistedGenerationAdapter protocol used by workflow.py
Keep OpenRouterAssistedGenerationAdapter available as the static baseline so deterministic test runs and fixture mode continue to work
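
The adapter swap can be sketched as a thin wrapper that resolves an adapter per
task type and falls back to a static adapter otherwise. Everything here is an
assumption made for illustration: the `generate(stage_id, prompt)` method is
not the actual AssistedGenerationAdapter protocol, and the sketch assumes
`resolve(task_type)` returns an object with a `complete()` method.

```python
from typing import Mapping, Protocol


class TextAdapter(Protocol):
    """Hypothetical minimal adapter interface."""

    def complete(self, prompt: str) -> str: ...


class RoutingPolicy(Protocol):
    """Assumed shape of a policy: task type in, adapter out."""

    def resolve(self, task_type: str) -> TextAdapter: ...


class RouterAwareAdapter:
    """Route stages with a known task type; send everything else to the static path."""

    def __init__(
        self,
        policy: RoutingPolicy,
        stage_task_types: Mapping[str, str],
        static_adapter: TextAdapter,
    ) -> None:
        self.policy = policy
        self.stage_task_types = stage_task_types
        self.static_adapter = static_adapter

    def generate(self, stage_id: str, prompt: str) -> str:
        task_type = self.stage_task_types.get(stage_id)
        if task_type is None:
            # No task type configured: behave exactly like the static baseline.
            return self.static_adapter.complete(prompt)
        return self.policy.resolve(task_type).complete(prompt)
```

Keeping the static adapter as the fallback is what lets fixture-mode runs stay
deterministic while routed stages pick their model at call time.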

T03 — Baseline + shadow integration

Use ClaudeCodeAdapter as the default baseline grader (subject to availability)
Enable ShadowingAdapter for the first multi-chapter run so the quality ledger fills up while real generation proceeds

T04 — Cost/quality reporting

Surface per-stage chosen adapter, observed quality, and cumulative cost in reports/generation-summary.md
Add a small CLI helper to print the ledger summary for an infospace
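
A CLI helper of this kind might be wired with argparse as below. The
`routing ledger <path>` command shape follows the subcommand named in the
Status section; the on-disk ledger format (one JSON object per line with an
`adapter_id` key) and the function names are assumptions for the sake of a
runnable sketch.

```python
import argparse
import json
from collections import Counter
from pathlib import Path
from typing import Iterable, List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="infospace-bench")
    commands = parser.add_subparsers(dest="command", required=True)
    routing = commands.add_parser("routing", help="routing utilities")
    routing_commands = routing.add_subparsers(dest="routing_command", required=True)
    ledger = routing_commands.add_parser("ledger", help="print a ledger summary as JSON")
    ledger.add_argument("path", type=Path)
    return parser


def summarise_lines(lines: Iterable[str]) -> str:
    """Count observations per adapter_id and render the result as JSON."""
    counts = Counter(json.loads(line)["adapter_id"] for line in lines if line.strip())
    return json.dumps({"observations_by_adapter": dict(sorted(counts.items()))}, indent=2)


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)
    with args.path.open(encoding="utf-8") as handle:
        print(summarise_lines(handle))
    return 0
```

Separating `summarise_lines` from `main` keeps the summary logic testable
without touching the filesystem.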

T05 — Tests

Fixture-backed test that routes through a deterministic adaptive policy with mocked observations
Regression test that demonstrates the static path still works when the router is bypassed

Acceptance

An infospace generation run can be configured to use the adaptive router without any code change inside workflow.py
A multi-chapter Lefevre run completes with per-stage adapter choices recorded in the generation summary
The fixture-mode test suite continues to pass with no live calls
The static OpenRouterAssistedGenerationAdapter path remains usable for callers that opt out of the router

Non-Goals

Authoring the routing primitives themselves (that is LLM-WP-0004's job)
Owning a task-type taxonomy beyond infospace-bench workflow stages
Embedding cost or quality observations inside infospace-bench beyond what the llm-connect ledger already records