tegwick b0d67ae79e IB-WP-0020-T05: shadow-mode CLI flags; close IB-WP-0020

Add --shadow-baseline <id> and --shadow-rate <float> opt-in flags to
generate run, generate resume, and generate from-source. When
--shadow-baseline names a candidate id from the routing config,
build_routing_policy_from_config wraps every other candidate in an
llm-connect ShadowingAdapter using that baseline plus a
PairedGrader(ExactMatchJudge()) and the workspace-resolved
QualityLedger. The baseline candidate itself is never wrapped — that
would shadow it against itself. --shadow-rate defaults to 0.1 when
--shadow-baseline is set; passing --shadow-rate without
--shadow-baseline fails fast with shadow_rate_without_baseline.
Setting --shadow-baseline without a ledger_path in the config fails
with missing_routing_ledger_for_shadow so observations have a place to
land before any call goes out.

run_generation grew shadow_baseline + shadow_rate kwargs and
_adapter_for("routing", ...) plumbs them into
build_routing_policy_from_config. The wrapped ShadowingAdapter slots
into the policy's prefer/fallback per task type via a
(candidate_id, task_type) reverse lookup, and adapters_by_id on the
adaptive policy gets the string-keyed entries.

Five new tests cover: shadow_rate without baseline fails fast, shadow
mode without a ledger fails fast, unknown shadow baseline id fails
fast, structural assertion that ShadowingAdapter wraps non-baseline
candidates and leaves the baseline raw, and a behavioural check that
shadow_rate=1.0 calls the baseline on every call while shadow_rate=0.0
skips entirely. Test forces async_shadow=False so the call counter is
deterministic.

Closes IB-WP-0020: T01-T05 all done. Workplan status flips from active
to finished. 179 tests pass, 2 skipped (both live OpenRouter smokes).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-18 23:30:36 +02:00

id, type, title, domain, repo, status, owner, topic_slug, created, updated, depends_on_workplans, related_workplans, state_hub_workstream_slug, state_hub_workstream_id

IB-WP-0020

workplan

Provider Routing CLI Integration

markitect

infospace-bench

finished

markitect

2026-05-18

IB-WP-0018

LLM-WP-0004

IB-WP-0016

IB-WP-0019

ib-wp-0020-provider-routing-cli

172bb082-610a-477b-b5e0-26c9f4bdfd95

IB-WP-0020 — Provider Routing CLI Integration

Goal

Expose RoutingAssistedGenerationAdapter (IB-WP-0018) as a first-class CLI option so a real multi-chapter or full-book run can use the adaptive router without writing any Python. Today --provider accepts fixture and openrouter; this workplan adds routing, plus a small config file that names the rules, the ledger, the quality floors, and the per-stage task-type overrides.

The end state is a single command that does cost-aware adaptive routing across multiple OpenRouter models and writes back the per-stage adapter choices, the budget log, and (optionally) sampled shadow grades:

infospace-bench generate from-source ./LEFEVRE.epub \
  --workspace ./infospaces \
  --slug reminiscences-routed \
  --name "Reminiscences (Routed)" \
  --profile trading-literature \
  --provider routing \
  --routing-config ./routing.yaml \
  --chapter I \
  --apply

Why this is a separate workplan

IB-WP-0018 shipped the bridge module and its programmatic API. CLI wiring needs its own config-file schema, its own loader, its own error surfaces, and its own end-to-end smoke test — and that is enough scope to justify a separate review surface rather than absorbing it into the already-closed IB-WP-0018.

Non-Goals

Owning the routing policy primitives (those live in llm-connect LLM-WP-0004).
Replacing the static openrouter provider — that path stays usable for callers who do not want the router.
Embedding model selection logic inside the CLI; the config file is declarative and routing decisions stay with AdaptiveRoutingPolicy.

Tasks

T01 — Routing config file schema

id: IB-WP-0020-T01
status: done
priority: medium
state_hub_task_id: "39597441-22ab-4dcf-b68d-b045823a9374"

Define a small YAML schema for a routing config:
- quality_floor: <float | null> (global default)
- ledger_path: <str | null> (relative to workspace by default)
- task_types: map of task_type to a list of candidate adapters, each with id, provider (openrouter, claude_code, openai, …), model, api_key_env, optional max_cost_per_1k, optional quality_floor override
- stage_to_task_type: optional override map
Document the schema in docs/routing-config.md with two annotated examples (one OpenRouter-only, one ClaudeCode-as-baseline + OpenRouter candidates).
Tests: schema parses; missing fields default cleanly; unknown providers raise a focused error.
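The schema above can be illustrated with a minimal Python sketch that takes the dict a YAML parser would yield and applies the defaults. The field names come from the list above. Everything else is an assumption made for illustration and not the shipped implementation: the function name `normalise_routing_config`, the `RoutingConfigError` class, the `summarise` task type, the model ids, and the ledger path.

```python
from typing import Any

# Provider names mentioned in this workplan; the real allowlist may differ.
KNOWN_PROVIDERS = {"openrouter", "claude_code", "openai", "gemini"}


class RoutingConfigError(ValueError):
    """Hypothetical error type for a malformed routing config."""


def normalise_routing_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Fill in defaults for missing fields and reject unknown providers."""
    config: dict[str, Any] = {
        "quality_floor": raw.get("quality_floor"),  # None means no global floor
        "ledger_path": raw.get("ledger_path"),      # None means no ledger
        "stage_to_task_type": dict(raw.get("stage_to_task_type") or {}),
        "task_types": {},
    }
    for task_type, candidates in (raw.get("task_types") or {}).items():
        normalised = []
        for cand in candidates:
            provider = cand.get("provider")
            if provider not in KNOWN_PROVIDERS:
                raise RoutingConfigError(
                    f"unknown provider {provider!r} for candidate {cand.get('id')!r}"
                )
            normalised.append({
                "id": cand["id"],
                "provider": provider,
                "model": cand["model"],
                "api_key_env": cand.get("api_key_env"),
                "max_cost_per_1k": cand.get("max_cost_per_1k"),
                # A per-candidate floor overrides the global default.
                "quality_floor": cand.get("quality_floor", config["quality_floor"]),
            })
        config["task_types"][task_type] = normalised
    return config


# Illustrative OpenRouter-only config (all values are placeholders).
example = {
    "quality_floor": 0.7,
    "ledger_path": "routing/ledger.jsonl",
    "task_types": {
        "summarise": [
            {"id": "cheap", "provider": "openrouter", "model": "example/cheap-model"},
            {"id": "mid", "provider": "openrouter", "model": "example/mid-model",
             "quality_floor": 0.8},
        ],
    },
}
config = normalise_routing_config(example)
```

Resolving the per-candidate floor at load time is one possible design; the real loader may leave that resolution to the policy.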

T02 — Routing config loader

id: IB-WP-0020-T02
status: done
priority: high
state_hub_task_id: "5e38514b-ad6a-4d39-8716-f812f241d9fd"

Add src/infospace_bench/routing_config.py (or extend routing.py) with load_routing_config(path, *, workspace) that returns a RoutingPolicy (or AdaptiveRoutingPolicy when the config sets quality_floor or names a ledger) ready to hand to RoutingAssistedGenerationAdapter.
Provider construction:
- openrouter → llm-connect OpenRouterAdapter with API key from api_key_env (default OPENROUTER_API_KEY)
- claude_code → llm-connect ClaudeCodeAdapter
- others (openai, gemini) supported but explicitly documented as untested for production use
Tests: builds a static policy from a minimal config; builds an adaptive policy with a ledger; missing API key raises before any network call.
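The provider construction step can be sketched as an explicit allowlist of builder functions, with the API key resolved from the environment before any adapter exists. This is a sketch under stated assumptions: `StubAdapter` stands in for llm-connect's `OpenRouterAdapter` and `ClaudeCodeAdapter`, whose constructor signatures are not shown in this document, and `MissingApiKeyError`, `BUILDERS`, and `build_adapters` are hypothetical names.

```python
import os
from dataclasses import dataclass
from typing import Callable, Mapping, Optional


class MissingApiKeyError(RuntimeError):
    """Hypothetical error raised before any network call is possible."""


@dataclass(frozen=True)
class StubAdapter:
    """Stand-in for an llm-connect adapter; the real constructors differ."""
    provider: str
    model: str
    api_key: Optional[str] = None


def _build_openrouter(cand: Mapping, env: Mapping[str, str]) -> StubAdapter:
    # The config names the env var; the key itself never lives in the file.
    env_name = cand.get("api_key_env") or "OPENROUTER_API_KEY"
    key = env.get(env_name)
    if not key:
        raise MissingApiKeyError(f"{env_name} is not set (candidate {cand['id']!r})")
    return StubAdapter("openrouter", cand["model"], key)


def _build_claude_code(cand: Mapping, env: Mapping[str, str]) -> StubAdapter:
    # Assumed here to need no API key from the config.
    return StubAdapter("claude_code", cand["model"])


# Explicit allowlist rather than reflective lookup of adapter classes.
BUILDERS: dict[str, Callable[[Mapping, Mapping[str, str]], StubAdapter]] = {
    "openrouter": _build_openrouter,
    "claude_code": _build_claude_code,
}


def build_adapters(candidates, env: Optional[Mapping[str, str]] = None):
    """Return {candidate_id: adapter}, failing fast on bad providers or keys."""
    env = os.environ if env is None else env
    adapters = {}
    for cand in candidates:
        builder = BUILDERS.get(cand["provider"])
        if builder is None:
            raise ValueError(f"unsupported provider {cand['provider']!r}")
        adapters[cand["id"]] = builder(cand, env)
    return adapters
```

Passing `env` explicitly keeps the missing-key test free of monkeypatching; `build_adapters(candidates, env={})` raises `MissingApiKeyError` for any `openrouter` candidate.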

T03 — `--provider routing` and `--routing-config` CLI flags

id: IB-WP-0020-T03
status: done
priority: high
state_hub_task_id: "fe5888e0-da33-413a-b026-71ed811b8c73"

Add routing to the --provider choices on generate run, generate resume, and generate from-source.
Add --routing-config <path> (required when --provider routing).
Add --quality-floor <float> to override the config-level floor at the call site (handy for tightening or loosening for a single run without editing the file).
Wire the loader into _adapter_for/run_generation so a RoutingAssistedGenerationAdapter is constructed and passed to the workflow engine.
Tests: CLI smoke that builds a routing config pointing at mocked adapter ids and confirms the run goes through the bridge.
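The flag rules above can be sketched with `argparse`. Whether the real CLI uses `argparse` is not stated in this workplan, so treat the parser layout, the `fixture` default, and the helper names `parse_args` and `effective_quality_floor` as assumptions; only the flag names and the precedence rule come from the task text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="generate-sketch")
    parser.add_argument(
        "--provider",
        choices=["fixture", "openrouter", "routing"],
        default="fixture",  # assumed default, for the sketch only
    )
    parser.add_argument("--routing-config", metavar="PATH")
    parser.add_argument("--quality-floor", type=float, metavar="FLOAT")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # --routing-config is only mandatory on the routing path.
    if args.provider == "routing" and not args.routing_config:
        parser.error("--routing-config is required when --provider routing")
    return args


def effective_quality_floor(args, config):
    """A call-site --quality-floor wins over the config-level floor."""
    if args.quality_floor is not None:
        return args.quality_floor
    return config.get("quality_floor")
```

`parser.error` exits with status 2, so a missing `--routing-config` fails before any config loading or adapter construction starts.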

T04 — Example config and live-smoke wiring

id: IB-WP-0020-T04
status: done
priority: medium
state_hub_task_id: "69288131-f265-4db5-a4b0-b0c8a6f55dd8"

Add examples/routing/trading-literature.yaml with a realistic Lefevre-aimed config: cheap model for summaries, mid model for entities/relations, ClaudeCode baseline behind a shadow sampler.
Update the optional live-OpenRouter smoke test (tests/test_openrouter_live.py) with a parallel skipped test that exercises --provider routing end-to-end when both OPENROUTER_API_KEY and INFOSPACE_BENCH_ENABLE_LIVE_OPENROUTER=1 are set.
Document how to run the live routing smoke in docs/generic-source-generator.md.

T05 — Shadow-mode opt-in flag

id: IB-WP-0020-T05
status: done
priority: medium
state_hub_task_id: "02658420-056c-4d73-8055-e6a7ab51876b"

Add --shadow-rate <float> and --shadow-baseline <id> flags so a caller can enable wrap_with_shadow_sampling() for an entire run without editing the config file. When set, the loader wraps each non-baseline candidate adapter in ShadowingAdapter with the named baseline and the chosen rate; the baseline itself stays unwrapped, since wrapping it would shadow it against itself.
Tests: monkeypatched baseline asserts the shadow path fires at shadow_rate=1.0 and skips at shadow_rate=0.0.
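The flag resolution and the sampling behaviour can be sketched as follows. `ShadowSketch` is a synchronous stand-in, not llm-connect's `ShadowingAdapter`, and `resolve_shadow_options` is a hypothetical helper. The two error codes and the 0.1 default are taken from the commit message for this task; the use of `ValueError` and the wording of the unknown-baseline message are assumptions.

```python
import random

DEFAULT_SHADOW_RATE = 0.1  # default when only --shadow-baseline is given


def resolve_shadow_options(baseline, rate, candidate_ids, ledger_path):
    """Return (baseline, rate), or raise ValueError carrying an error code."""
    if baseline is None:
        if rate is not None:
            raise ValueError("shadow_rate_without_baseline")
        return None, None  # shadow mode stays off
    if baseline not in candidate_ids:
        raise ValueError(f"unknown shadow baseline id: {baseline}")
    if not ledger_path:
        # Observations need somewhere to land before any call goes out.
        raise ValueError("missing_routing_ledger_for_shadow")
    return baseline, DEFAULT_SHADOW_RATE if rate is None else rate


class ShadowSketch:
    """Wraps a candidate callable and samples a baseline at a given rate."""

    def __init__(self, primary, baseline, rate, rng=None):
        self.primary = primary
        self.baseline = baseline
        self.rate = rate
        self.rng = rng or random.Random()
        self.observations = []  # stands in for ledger writes

    def __call__(self, prompt):
        answer = self.primary(prompt)
        # random() is in [0.0, 1.0): rate=1.0 always samples, rate=0.0 never does.
        if self.rng.random() < self.rate:
            reference = self.baseline(prompt)
            self.observations.append(answer == reference)  # exact-match grade
        return answer


# Counting baseline calls, in the spirit of the behavioural test.
calls = {"baseline": 0}


def baseline_adapter(prompt):
    calls["baseline"] += 1
    return prompt.upper()


def candidate_adapter(prompt):
    return prompt.upper()


always = ShadowSketch(candidate_adapter, baseline_adapter, rate=1.0)
never = ShadowSketch(candidate_adapter, baseline_adapter, rate=0.0)
for _ in range(5):
    always("chapter i")
    never("chapter i")
```

After the loop the baseline has been called five times, all through `always`; `never` records no observations.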

Acceptance

infospace-bench generate from-source ... --provider routing --routing-config <path> succeeds against the deterministic Lefevre fixture with a hand-crafted routing config and mocked adapters.
The generation report's ## Per-stage adapter choices section reflects the routed choices, and output/budget/usage.yaml buckets reflect the actual model that ran each call.
The static openrouter and fixture provider paths remain unchanged.
An optional live smoke test exists and is gated identically to the IB-WP-0016 OpenRouter smoke.
Documentation explains the config shape, the API-key resolution, and the difference between adaptive routing and shadow-mode sampling.

Risks and open questions

Adapter constructor surface. llm-connect's adapter constructors vary slightly per provider; the loader needs to keep a small but explicit allowlist of provider names rather than reflective magic.
API key plumbing. Today openrouter reads OPENROUTER_API_KEY directly. The config will name the env var explicitly to make multi-key setups workable; no key material belongs in the config file itself.
Schema versioning. Include a schema_version field from day one so the loader can refuse mismatched configs once the shape stabilises.
Shadow grader choice. v1 will default the shadow grader to ExactMatchJudge because it has no extra cost. LLMJudge and EmbeddingSimilarityJudge configuration belongs in a follow-up.
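The schema-versioning item above could be enforced with a check along these lines. This is a minimal sketch: the supported version value, the function name `check_schema_version`, and the choice of `ValueError` are all assumptions, since the workplan names only the `schema_version` field.

```python
SUPPORTED_SCHEMA_VERSION = 1  # hypothetical value


def check_schema_version(raw: dict) -> None:
    """Refuse configs whose schema_version is missing or unexpected."""
    version = raw.get("schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(
            f"routing config schema_version {version!r} is not supported "
            f"(expected {SUPPORTED_SCHEMA_VERSION})"
        )
```

Treating a missing field as a mismatch is the stricter of two reasonable options; a loader could instead default a missing version to the first one.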

Downstream effects

infospace-bench routing ledger <path> (already shipped via IB-WP-0018) becomes the natural companion CLI for inspecting the observations the routed runs accumulate.
A successful T03 + T04 lets us run a multi-chapter Lefevre live build using the adaptive router and validate the IB-WP-0016 reviewer checklist on real output without single-model lock-in.
