
tegwick 0a83e908ce IB-WP-0018-T01+T02+T05: routing bridge to llm-connect

T01 — task-type taxonomy. docs/routing-task-types.md names the five
generation stages as the default identity-mapped task types
(summarize-source, extract-entities, extract-relations,
evaluate-entity, synthesize-report) and records the recommended quality
floors per stage. The taxonomy explicitly does not decide which adapter
ships per task type, where the ledger lives, or what a quality score
means — those stay with the caller per the LLM-WP-0004 scope guardrail.

T02 — RoutingAssistedGenerationAdapter bridge in
src/infospace_bench/routing.py. Wraps any llm-connect RoutingPolicy or
AdaptiveRoutingPolicy as an infospace-bench AssistedGenerationAdapter:
maps stage_id -> task_type (overridable), resolves an LLMAdapter,
delegates execute_prompt with a configurable RunConfig, and surfaces
the resolved adapter id, task type, model, usage, and finish_reason
back on AssistedGenerationResult.metadata. Provider tag stays
back-compatible with the strings already used in run records and the
budget rollup (openrouter / claude_code / openai / gemini / mock /
routing).

T05 — eight tests in tests/test_routing_adapter.py cover: static-policy
per-stage resolution, stage_to_task_type overrides, default-mapping
completeness, fall-through for unmapped stage ids, the adaptive path
selecting the cheaper qualifying adapter when a quality_floor is set,
adaptive policy falling back to static when no floor is set, response
metadata round-trip with provider tagging, and estimated_cost_per_1k
pass-through.

Adds llm-connect as a path dependency in pyproject.toml and to the
pytest pythonpath. Static OpenRouter and fixture paths are unchanged;
this commit only adds the option of routing.

139 tests pass, 1 skipped (the OpenRouter live smoke, gated as before).

T03 (shadow-mode integration) and T04 (CLI + per-stage chosen-adapter
in the generation report) follow next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-18 11:33:58 +02:00

Task-Type Taxonomy for Routing

Workplan: IB-WP-0018 (T01)
Depends on: llm-connect LLM-WP-0004 (RoutingPolicy, AdaptiveRoutingPolicy)

This file names the task types that infospace-bench emits when it routes each generation stage through llm-connect. The names are the consumer side of LLM-WP-0004's scope guardrail: llm-connect ships the routing primitives, infospace-bench owns the taxonomy.

Default identity mapping

RoutingAssistedGenerationAdapter (see src/infospace_bench/routing.py) maps stage ids to task types using the identity mapping below by default. Callers override individual entries via RoutingAssistedGenerationAdapter(..., stage_to_task_type={...}).

Stage id	Task type	Notes
`summarize-source`	`summarize-source`	One call per source chunk. Cheap, high-volume; small models usually clear the bar.
`extract-entities`	`extract-entities`	One call per source chunk. Quality matters most here — bad extractions cascade.
`extract-relations`	`extract-relations`	One call per source chunk. Quality close to extraction; relations rely on entity titles being stable.
`evaluate-entity`	`evaluate-entity`	One call per generated entity. Cheap, often a different model than extraction to avoid self-grading.
`synthesize-report`	`synthesize-report`	One call at the end. Volume-of-one; quality matters; cost negligible.
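The call patterns in the table translate into a rough per-run call count. The sketch below is illustrative only: `estimate_calls` is a hypothetical helper, not part of infospace-bench, and the chunk and entity counts are made-up inputs.

```python
# Hypothetical helper: estimates calls per task type from the per-stage
# call patterns listed in the table above. Not part of infospace-bench.
def estimate_calls(num_chunks: int, num_entities: int) -> dict:
    return {
        "summarize-source": num_chunks,    # one call per source chunk
        "extract-entities": num_chunks,    # one call per source chunk
        "extract-relations": num_chunks,   # one call per source chunk
        "evaluate-entity": num_entities,   # one call per generated entity
        "synthesize-report": 1,            # one call at the end
    }


# Made-up workload: 200 source chunks yielding 450 entities.
calls = estimate_calls(num_chunks=200, num_entities=450)
total = sum(calls.values())
print(total)  # 1051
```

With these inputs the single `synthesize-report` call is about 0.1% of all calls, which is why its cost is treated as negligible even when routed to a stronger model.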

Quality expectations

AdaptiveRoutingPolicy.resolve(task_type, quality_floor=...) picks the cheapest adapter whose ledger-observed mean quality clears the floor. The recommended starting floors:

Task type	Quality floor	Rationale
`summarize-source`	0.70	Summaries are intermediate. Slight quality loss is recoverable downstream.
`extract-entities`	0.85	Entities are the durable output. Be strict.
`extract-relations`	0.80	Relations depend on entities; slightly looser is OK as long as evidence is intact.
`evaluate-entity`	0.80	Judge-level reliability. Self-grading bias is more of a concern than absolute score.
`synthesize-report`	0.70	The report is a review surface; tolerate looser language for cheaper models.

These are starting points, and this taxonomy does not enforce them. Bind them at the calling site, for example RoutingAssistedGenerationAdapter(..., quality_floor=0.85) for the extract-entities stage.
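To make the selection rule concrete, here is a minimal sketch of "cheapest adapter whose mean quality clears the floor". It is not the llm-connect implementation: `Candidate`, `pick_cheapest_qualifying`, the adapter ids, costs, and quality numbers are all hypothetical, and two behaviours are assumptions: a score equal to the floor counts as clearing it, and `None` is returned when nothing qualifies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Recommended starting floors, copied from the table above.
RECOMMENDED_FLOORS = {
    "summarize-source": 0.70,
    "extract-entities": 0.85,
    "extract-relations": 0.80,
    "evaluate-entity": 0.80,
    "synthesize-report": 0.70,
}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical view of one adapter as seen through a quality ledger."""
    adapter_id: str
    cost_per_1k: float    # estimated cost per 1k tokens
    mean_quality: float   # ledger-observed mean quality in [0, 1]


def pick_cheapest_qualifying(
    candidates: Sequence[Candidate], quality_floor: float
) -> Optional[Candidate]:
    # Assumption: "clears the floor" means mean_quality >= quality_floor.
    qualifying = [c for c in candidates if c.mean_quality >= quality_floor]
    if not qualifying:
        return None  # assumption: the caller decides how to fall back
    return min(qualifying, key=lambda c: c.cost_per_1k)


# Made-up candidates for illustration.
candidates = [
    Candidate("small", cost_per_1k=0.02, mean_quality=0.78),
    Candidate("medium", cost_per_1k=0.10, mean_quality=0.86),
    Candidate("large", cost_per_1k=0.60, mean_quality=0.93),
]

summary_pick = pick_cheapest_qualifying(
    candidates, RECOMMENDED_FLOORS["summarize-source"]
)
entity_pick = pick_cheapest_qualifying(
    candidates, RECOMMENDED_FLOORS["extract-entities"]
)
print(summary_pick.adapter_id, entity_pick.adapter_id)  # small medium
```

The same candidate pool yields different adapters per task type purely because the floors differ, which is the reason to bind a floor per stage rather than one global value.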

Common overrides

Callers may want to collapse task types to share observations across related stages, or split a task type to pin a specific model to a narrow workload. Two illustrative overrides:

# Collapse extraction stages so a single ledger drives both
stage_to_task_type = {
    "extract-entities": "extraction",
    "extract-relations": "extraction",
}

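# Rename entity evaluation to a dedicated "judge" task type. Note that a
# map keyed by stage id renames the whole stage; splitting by entity
# category (useful when a profile has very different quality bars, e.g.
# trading-literature's `evidence_bearing_claim` is harder to judge than
# `instrument`) would need one task type per category, which this map
# alone does not express.
stage_to_task_type = {
    "evaluate-entity": "judge",
}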

Anything not in the override map falls through to the identity mapping.
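The fall-through rule can be sketched as a lookup with an identity default. `resolve_task_type` is a hypothetical stand-in for whatever RoutingAssistedGenerationAdapter does internally. Applying the identity mapping to stage ids outside the five defaults is an assumption here.

```python
from typing import Mapping, Optional


def resolve_task_type(
    stage_id: str, overrides: Optional[Mapping[str, str]] = None
) -> str:
    """Return the override for stage_id if present, else stage_id itself."""
    if overrides and stage_id in overrides:
        return overrides[stage_id]
    # Fall through to the identity mapping: task type == stage id.
    return stage_id


# Collapse both extraction stages onto one shared task type.
overrides = {
    "extract-entities": "extraction",
    "extract-relations": "extraction",
}

print(resolve_task_type("extract-entities", overrides))   # extraction
print(resolve_task_type("summarize-source", overrides))   # summarize-source
```

Because the default is computed rather than stored, an override map only needs to list the stages it changes.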

What this taxonomy does NOT decide

Which adapter ships per task type. That belongs to the caller's RoutingPolicy rule list.
Where the quality ledger lives. Caller-supplied path on the AdaptiveRoutingPolicy.
When to refresh observations. Caller decides via the ledger's TTL helpers in llm-connect.
What a quality score means. Each judge defines its own.
