infospace-bench/docs/routing-task-types.md
tegwick 0a83e908ce IB-WP-0018-T01+T02+T05: routing bridge to llm-connect
T01 — task-type taxonomy. docs/routing-task-types.md names the five
generation stages as the default identity-mapped task types
(summarize-source, extract-entities, extract-relations,
evaluate-entity, synthesize-report) and records the recommended quality
floors per stage. The taxonomy explicitly does not decide which adapter
ships per task type, where the ledger lives, or what a quality score
means — those stay with the caller per the LLM-WP-0004 scope guardrail.

T02 — RoutingAssistedGenerationAdapter bridge in
src/infospace_bench/routing.py. Wraps any llm-connect RoutingPolicy or
AdaptiveRoutingPolicy as an infospace-bench AssistedGenerationAdapter:
maps stage_id -> task_type (overridable), resolves an LLMAdapter,
delegates execute_prompt with a configurable RunConfig, and surfaces
the resolved adapter id, task type, model, usage, and finish_reason
back on AssistedGenerationResult.metadata. Provider tag stays
back-compatible with the strings already used in run records and the
budget rollup (openrouter / claude_code / openai / gemini / mock /
routing).

T05 — eight tests in tests/test_routing_adapter.py cover: static-policy
per-stage resolution, stage_to_task_type overrides, default-mapping
completeness, fall-through for unmapped stage ids, the adaptive path
selecting the cheaper qualifying adapter when a quality_floor is set,
adaptive policy falling back to static when no floor is set, response
metadata round-trip with provider tagging, and estimated_cost_per_1k
pass-through.

Adds llm-connect as a path dependency in pyproject.toml and to the
pytest pythonpath. Static OpenRouter and fixture paths are unchanged;
this commit only adds the option of routing.

139 tests pass, 1 skipped (the OpenRouter live smoke, gated as before).

T03 (shadow-mode integration) and T04 (CLI + per-stage chosen-adapter
in the generation report) follow next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 11:33:58 +02:00


Task-Type Taxonomy for Routing

Workplan: IB-WP-0018 (T01)
Depends on: llm-connect LLM-WP-0004 (RoutingPolicy, AdaptiveRoutingPolicy)

This file names the task types that infospace-bench emits when it routes each generation stage through llm-connect. The names are the consumer side of LLM-WP-0004's scope guardrail: llm-connect ships the routing primitives, infospace-bench owns the taxonomy.

Default identity mapping

RoutingAssistedGenerationAdapter (see src/infospace_bench/routing.py) maps stage ids to task types using the identity mapping below by default. Callers override individual entries via RoutingAssistedGenerationAdapter(..., stage_to_task_type={...}).

| Stage id | Task type | Notes |
| --- | --- | --- |
| summarize-source | summarize-source | One call per source chunk. Cheap, high-volume; small models usually clear the bar. |
| extract-entities | extract-entities | One call per source chunk. Quality matters most here: bad extractions cascade. |
| extract-relations | extract-relations | One call per source chunk. Quality close to extraction; relations rely on entity titles being stable. |
| evaluate-entity | evaluate-entity | One call per generated entity. Cheap, often a different model than extraction to avoid self-grading. |
| synthesize-report | synthesize-report | One call at the end. Volume-of-one; quality matters; cost negligible. |
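
As a minimal sketch, the identity mapping in the table can be written as a dictionary built from the five stage ids. The names `STAGE_IDS`, `DEFAULT_STAGE_TO_TASK_TYPE`, and `missing_stages` are hypothetical and chosen for illustration; the actual constants in src/infospace_bench/routing.py may be named and structured differently.

```python
# Hypothetical names for illustration; not taken from routing.py.
STAGE_IDS = (
    "summarize-source",
    "extract-entities",
    "extract-relations",
    "evaluate-entity",
    "synthesize-report",
)

# Identity mapping: every stage id is its own task type.
DEFAULT_STAGE_TO_TASK_TYPE = {stage_id: stage_id for stage_id in STAGE_IDS}


def missing_stages(mapping):
    """Return the stage ids that have no entry in the given mapping."""
    return [stage_id for stage_id in STAGE_IDS if stage_id not in mapping]
```

A completeness check along these lines (`missing_stages(DEFAULT_STAGE_TO_TASK_TYPE)` returning an empty list) is one way to catch a stage that was added without a task type.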

Quality expectations

AdaptiveRoutingPolicy.resolve(task_type, quality_floor=...) picks the cheapest adapter whose ledger-observed mean quality clears the floor. The recommended starting floors:

| Task type | Quality floor | Rationale |
| --- | --- | --- |
| summarize-source | 0.70 | Summaries are intermediate. Slight quality loss is recoverable downstream. |
| extract-entities | 0.85 | Entities are the durable output. Be strict. |
| extract-relations | 0.80 | Relations depend on entities; slightly looser is OK as long as evidence is intact. |
| evaluate-entity | 0.80 | Judge-level reliability. Self-grading bias is more of a concern than absolute score. |
| synthesize-report | 0.70 | The report is a review surface; tolerate looser language for cheaper models. |

These are starting points, and this taxonomy does not enforce them. Bind them at the calling site, for example RoutingAssistedGenerationAdapter(..., quality_floor=0.85) for the extract-entities stage.
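
The selection rule described above (cheapest adapter whose observed mean quality clears the floor) can be illustrated with a small standalone sketch. It does not call llm-connect: `Candidate`, `STARTING_FLOORS`, and `cheapest_qualifying` are hypothetical names, the cost and quality figures in the usage lines are made up, and both the `>=` comparison and the `None` result when nothing qualifies are assumptions of this sketch, not documented AdaptiveRoutingPolicy behavior.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one adapter's ledger summary."""
    adapter_id: str
    mean_quality: float   # ledger-observed mean quality, 0.0 to 1.0
    cost_per_1k: float    # estimated cost per 1k tokens


# Recommended starting floors, copied from the table above.
STARTING_FLOORS = {
    "summarize-source": 0.70,
    "extract-entities": 0.85,
    "extract-relations": 0.80,
    "evaluate-entity": 0.80,
    "synthesize-report": 0.70,
}


def cheapest_qualifying(
    candidates: Sequence[Candidate], quality_floor: float
) -> Optional[Candidate]:
    """Return the cheapest candidate at or above the floor, or None."""
    # Assumption: "clears the floor" means mean_quality >= quality_floor.
    qualifying = [c for c in candidates if c.mean_quality >= quality_floor]
    return min(qualifying, key=lambda c: c.cost_per_1k, default=None)


# Illustrative, made-up observations.
candidates = [
    Candidate("small-model", mean_quality=0.78, cost_per_1k=0.10),
    Candidate("large-model", mean_quality=0.91, cost_per_1k=1.50),
]
summary_pick = cheapest_qualifying(candidates, STARTING_FLOORS["summarize-source"])
entity_pick = cheapest_qualifying(candidates, STARTING_FLOORS["extract-entities"])
```

With these made-up numbers the summary stage resolves to the small model and entity extraction to the large one, which is the cost split the floors are meant to produce.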

Common overrides

Callers may want to collapse task types to share observations across related stages, or split a task type to pin a specific model to a narrow workload. Two illustrative overrides:

# Two independent examples; pass one of them as stage_to_task_type=...

# Collapse extraction stages so a single ledger drives both
stage_to_task_type = {
    "extract-entities": "extraction",
    "extract-relations": "extraction",
}

# Split entity evaluation by category. Useful when a profile has very
# different quality bars for different entity categories (e.g.
# trading-literature's `evidence_bearing_claim` is harder to judge than
# `instrument`).
stage_to_task_type = {
    "evaluate-entity": "judge",
}

Anything not in the override map falls through to the identity mapping.
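
The fall-through rule amounts to a dictionary lookup with the stage id itself as the default. A minimal sketch, assuming a hypothetical helper `resolve_task_type` (the real adapter may implement this differently):

```python
from typing import Mapping, Optional


def resolve_task_type(
    stage_id: str, overrides: Optional[Mapping[str, str]] = None
) -> str:
    """Return the overridden task type, else the stage id unchanged."""
    if overrides and stage_id in overrides:
        return overrides[stage_id]
    # Identity fall-through for anything the override map does not name.
    return stage_id


collapse = {
    "extract-entities": "extraction",
    "extract-relations": "extraction",
}
```

Here `resolve_task_type("extract-relations", collapse)` yields `"extraction"`, while `resolve_task_type("summarize-source", collapse)` stays `"summarize-source"`.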

What this taxonomy does NOT decide

  • Which adapter ships per task type. That belongs to the caller's RoutingPolicy rule list.
  • Where the quality ledger lives. Caller-supplied path on the AdaptiveRoutingPolicy.
  • When to refresh observations. Caller decides via the ledger's TTL helpers in llm-connect.
  • What a quality score means. Each judge defines its own.