IB-WP-0018-T01+T02+T05: routing bridge to llm-connect
T01: task-type taxonomy. docs/routing-task-types.md names the five generation stages as the default identity-mapped task types (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report) and records the recommended quality floors per stage. The taxonomy explicitly does not decide which adapter ships per task type, where the ledger lives, or what a quality score means; those stay with the caller per the LLM-WP-0004 scope guardrail.

T02: RoutingAssistedGenerationAdapter bridge in src/infospace_bench/routing.py. Wraps any llm-connect RoutingPolicy or AdaptiveRoutingPolicy as an infospace-bench AssistedGenerationAdapter: maps stage_id -> task_type (overridable), resolves an LLMAdapter, delegates execute_prompt with a configurable RunConfig, and surfaces the resolved adapter id, task type, model, usage, and finish_reason back on AssistedGenerationResult.metadata. Provider tag stays back-compatible with the strings already used in run records and the budget rollup (openrouter / claude_code / openai / gemini / mock / routing).

T05: eight tests in tests/test_routing_adapter.py cover:
- static-policy per-stage resolution
- stage_to_task_type overrides
- default-mapping completeness
- fall-through for unmapped stage ids
- the adaptive path selecting the cheaper qualifying adapter when a quality_floor is set
- adaptive policy falling back to static when no floor is set
- response metadata round-trip with provider tagging
- estimated_cost_per_1k pass-through

Adds llm-connect as a path dependency in pyproject.toml and to the pytest pythonpath. Static OpenRouter and fixture paths are unchanged; this commit only adds the option of routing.

139 tests pass, 1 skipped (the OpenRouter live smoke, gated as before). T03 (shadow-mode integration) and T04 (CLI + per-stage chosen-adapter in the generation report) follow next.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
78
docs/routing-task-types.md
Normal file
@@ -0,0 +1,78 @@
# Task-Type Taxonomy for Routing

Workplan: IB-WP-0018 (T01)
Depends on: llm-connect LLM-WP-0004 (RoutingPolicy, AdaptiveRoutingPolicy)

This file names the task types that infospace-bench emits when it routes
each generation stage through llm-connect. The names are the consumer
side of LLM-WP-0004's scope guardrail: llm-connect ships the routing
primitives, infospace-bench owns the taxonomy.

## Default identity mapping

`RoutingAssistedGenerationAdapter` (see `src/infospace_bench/routing.py`)
maps stage ids to task types using the identity mapping below by
default. Callers override individual entries via
`RoutingAssistedGenerationAdapter(..., stage_to_task_type={...})`.

| Stage id | Task type | Notes |
|---|---|---|
| `summarize-source` | `summarize-source` | One call per source chunk. Cheap, high-volume; small models usually clear the bar. |
| `extract-entities` | `extract-entities` | One call per source chunk. Quality matters most here: bad extractions cascade. |
| `extract-relations` | `extract-relations` | One call per source chunk. Quality close to extraction; relations rely on entity titles being stable. |
| `evaluate-entity` | `evaluate-entity` | One call per generated entity. Cheap, often a different model than extraction to avoid self-grading. |
| `synthesize-report` | `synthesize-report` | One call at the end. Volume-of-one; quality matters; cost negligible. |
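
The mapping and override behaviour described in this section can be sketched in a few lines. This is an illustration only, not the code in `src/infospace_bench/routing.py`: `DEFAULT_STAGE_TO_TASK_TYPE` and `resolve_task_type` are hypothetical names, and the handling of a stage id that appears in neither map (it is used as its own task type) is an assumption.

```python
from typing import Mapping, Optional

# The five generation stages from the table, each mapped to itself.
DEFAULT_STAGE_TO_TASK_TYPE = {
    stage: stage
    for stage in (
        "summarize-source",
        "extract-entities",
        "extract-relations",
        "evaluate-entity",
        "synthesize-report",
    )
}


def resolve_task_type(
    stage_id: str, overrides: Optional[Mapping[str, str]] = None
) -> str:
    """Return the task type for a stage: caller override first, then the default map."""
    if overrides and stage_id in overrides:
        return overrides[stage_id]
    # Assumption: a stage id missing from both maps is used as its own task type.
    return DEFAULT_STAGE_TO_TASK_TYPE.get(stage_id, stage_id)


overrides = {"extract-entities": "extraction"}
print(resolve_task_type("extract-entities", overrides))  # overridden entry
print(resolve_task_type("summarize-source", overrides))  # falls through to identity
```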

## Quality expectations

`AdaptiveRoutingPolicy.resolve(task_type, quality_floor=...)` picks the
cheapest adapter whose ledger-observed mean quality clears the floor.
The recommended starting floors:

| Task type | Quality floor | Rationale |
|---|---|---|
| `summarize-source` | 0.70 | Summaries are intermediate. Slight quality loss is recoverable downstream. |
| `extract-entities` | 0.85 | Entities are the durable output. Be strict. |
| `extract-relations` | 0.80 | Relations depend on entities; slightly looser is OK as long as evidence is intact. |
| `evaluate-entity` | 0.80 | Judge-level reliability. Self-grading bias is more of a concern than absolute score. |
| `synthesize-report` | 0.70 | The report is a review surface; tolerate looser language for cheaper models. |

These are starting points. Bind them at the calling site
(`RoutingAssistedGenerationAdapter(..., quality_floor=0.85)` for
extraction stages); they are not enforced by this taxonomy.
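
To make the selection rule concrete, here is a toy version of "cheapest adapter whose mean quality clears the floor". It does not use llm-connect: `Candidate` and `cheapest_clearing_floor` are hypothetical stand-ins, the scores and costs are invented for illustration, and both the `>=` comparison and the `None` result when nothing qualifies are assumptions about behaviour this document does not specify.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence, Tuple

# Starting floors from the table, keyed by task type.
RECOMMENDED_FLOORS = {
    "summarize-source": 0.70,
    "extract-entities": 0.85,
    "extract-relations": 0.80,
    "evaluate-entity": 0.80,
    "synthesize-report": 0.70,
}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for an adapter plus its observed scores for one task type."""
    adapter_id: str
    cost_per_1k: float
    quality_scores: Tuple[float, ...]


def cheapest_clearing_floor(
    candidates: Sequence[Candidate], quality_floor: float
) -> Optional[Candidate]:
    """Pick the lowest-cost candidate whose mean observed quality meets the floor."""
    qualifying = [
        c for c in candidates
        if c.quality_scores and mean(c.quality_scores) >= quality_floor
    ]
    # Assumption: return None when nothing qualifies; the real policy may differ.
    return min(qualifying, key=lambda c: c.cost_per_1k, default=None)


# Invented observations: the small model averages about 0.80, the large one about 0.91.
candidates = [
    Candidate("small-model", cost_per_1k=0.2, quality_scores=(0.78, 0.82)),
    Candidate("large-model", cost_per_1k=1.5, quality_scores=(0.90, 0.92)),
]

for task_type in ("summarize-source", "extract-entities"):
    choice = cheapest_clearing_floor(candidates, RECOMMENDED_FLOORS[task_type])
    print(task_type, "->", choice.adapter_id if choice else None)
```

With these invented numbers the small model qualifies for `summarize-source` but not for `extract-entities`, so the stricter floor pushes extraction onto the more expensive adapter.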

## Common overrides

Callers may want to **collapse** task types to share observations across
related stages, or **split** a task type to pin a specific model to a
narrow workload. Two illustrative overrides:

```python
# Collapse extraction stages so a single ledger drives both
stage_to_task_type = {
    "extract-entities": "extraction",
    "extract-relations": "extraction",
}
```

```python
# Split entity evaluation by category: useful when a profile has very
# different quality bars for different entity categories (e.g.
# trading-literature's `evidence_bearing_claim` is harder to judge than
# `instrument`).
stage_to_task_type = {
    "evaluate-entity": "judge",
}
```

Anything not in the override map falls through to the identity mapping.
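
One consequence of collapsing is that observations from both stages feed a single mean. The sketch below illustrates this with plain Python rather than the llm-connect ledger: `mean_quality_by_task_type` is a hypothetical helper, and the scores are invented.

```python
from collections import defaultdict
from statistics import mean

# Invented (stage id, quality score) observations.
observations = [
    ("extract-entities", 0.90),
    ("extract-entities", 0.86),
    ("extract-relations", 0.78),
]


def mean_quality_by_task_type(observations, stage_to_task_type):
    """Group scores by resolved task type and average each group."""
    pooled = defaultdict(list)
    for stage_id, score in observations:
        # Stages absent from the override map keep their own id as task type.
        task_type = stage_to_task_type.get(stage_id, stage_id)
        pooled[task_type].append(score)
    return {task_type: mean(scores) for task_type, scores in pooled.items()}


identity = mean_quality_by_task_type(observations, {})
collapsed = mean_quality_by_task_type(
    observations,
    {"extract-entities": "extraction", "extract-relations": "extraction"},
)
print(identity)   # one mean per stage
print(collapsed)  # a single pooled mean under "extraction"
```

In this example the pooled mean sits between the two per-stage means, so a floor that entity extraction clears on its own may not be cleared once relation scores are mixed in. That is worth weighing when choosing one floor for a collapsed task type.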

## What this taxonomy does NOT decide

- **Which adapter ships per task type.** That belongs to the caller's
  `RoutingPolicy` rule list.
- **Where the quality ledger lives.** Caller-supplied path on the
  `AdaptiveRoutingPolicy`.
- **When to refresh observations.** Caller decides via the ledger's TTL
  helpers in llm-connect.
- **What a quality score means.** Each judge defines its own.