# Task-Type Taxonomy for Routing

Workplan: IB-WP-0018 (T01)
Depends on: llm-connect LLM-WP-0004 (RoutingPolicy, AdaptiveRoutingPolicy)

This file names the task types that infospace-bench emits when it routes
each generation stage through llm-connect. The names are the consumer
side of LLM-WP-0004's scope guardrail: llm-connect ships the routing
primitives, infospace-bench owns the taxonomy.

## Default identity mapping

`RoutingAssistedGenerationAdapter` (see `src/infospace_bench/routing.py`)
maps stage ids to task types using the identity mapping below by
default. Callers override individual entries via
`RoutingAssistedGenerationAdapter(..., stage_to_task_type={...})`.

| Stage id | Task type | Notes |
|---|---|---|
| `summarize-source` | `summarize-source` | One call per source chunk. Cheap, high-volume; small models usually clear the bar. |
| `extract-entities` | `extract-entities` | One call per source chunk. Quality matters most here, because bad extractions cascade into later stages. |
| `extract-relations` | `extract-relations` | One call per source chunk. Quality bar is close to entity extraction's; relations rely on entity titles being stable. |
| `evaluate-entity` | `evaluate-entity` | One call per generated entity. Cheap per call; often routed to a different model than extraction to avoid self-grading. |
| `synthesize-report` | `synthesize-report` | One call at the end. Volume-of-one; quality matters; cost negligible. |
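
The default-plus-override behaviour described above can be sketched in a few lines. This is a minimal illustration, not the code in `src/infospace_bench/routing.py`: the names `DEFAULT_STAGE_TO_TASK_TYPE` and `resolve_task_type` are hypothetical, and the handling of unknown stage ids is an assumption.

```python
from typing import Mapping, Optional

# The five stage ids from the table above; each maps to itself by default.
# Hypothetical constant, named here for illustration only.
DEFAULT_STAGE_TO_TASK_TYPE = {
    stage_id: stage_id
    for stage_id in (
        "summarize-source",
        "extract-entities",
        "extract-relations",
        "evaluate-entity",
        "synthesize-report",
    )
}


def resolve_task_type(
    stage_id: str,
    overrides: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the caller's override for stage_id, else the identity default."""
    if overrides and stage_id in overrides:
        return overrides[stage_id]
    # Assumption: a stage id outside the table also maps to itself.
    return DEFAULT_STAGE_TO_TASK_TYPE.get(stage_id, stage_id)


# Without overrides, every stage keeps its own name as the task type.
default_type = resolve_task_type("summarize-source")

# With an override, only the listed stage changes.
overridden = resolve_task_type(
    "extract-entities", {"extract-entities": "extraction"}
)
```

Here `default_type` is `"summarize-source"` and `overridden` is `"extraction"`; stages absent from the override map are unaffected.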

## Quality expectations

`AdaptiveRoutingPolicy.resolve(task_type, quality_floor=...)` picks the
cheapest adapter whose ledger-observed mean quality clears the floor.
The recommended starting floors:

| Task type | Quality floor | Rationale |
|---|---|---|
| `summarize-source` | 0.70 | Summaries are intermediate. Slight quality loss is recoverable downstream. |
| `extract-entities` | 0.85 | Entities are the durable output. Be strict. |
| `extract-relations` | 0.80 | Relations depend on entities; slightly looser is OK as long as evidence is intact. |
| `evaluate-entity` | 0.80 | Judge-level reliability. Self-grading bias is more of a concern than absolute score. |
| `synthesize-report` | 0.70 | The report is a review surface; tolerate looser language for cheaper models. |

These are starting points, and this taxonomy does not enforce them.
Bind a floor at the calling site, for example
`RoutingAssistedGenerationAdapter(..., quality_floor=0.85)` for entity
extraction.
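
The selection rule at the top of this section (cheapest adapter whose observed mean quality clears the floor) can be sketched as follows. This is not llm-connect's implementation: `Candidate`, `pick_cheapest_above_floor`, and the cost and score figures are hypothetical. Three choices are assumptions: "clears" is read as greater than or equal, adapters with no observations are skipped, and `None` is returned when nothing qualifies.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Optional, Sequence

# Recommended starting floors from the table above.
STARTING_FLOORS = {
    "summarize-source": 0.70,
    "extract-entities": 0.85,
    "extract-relations": 0.80,
    "evaluate-entity": 0.80,
    "synthesize-report": 0.70,
}


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for an adapter plus its ledger observations."""

    name: str
    cost_per_call: float      # relative cost, illustrative units
    scores: Sequence[float]   # quality observations for one task type


def pick_cheapest_above_floor(
    candidates: Sequence[Candidate],
    quality_floor: float,
) -> Optional[Candidate]:
    """Return the cheapest candidate whose mean score meets the floor."""
    eligible = [
        c for c in candidates
        # Assumption: no observations means not eligible.
        if c.scores and fmean(c.scores) >= quality_floor
    ]
    return min(eligible, key=lambda c: c.cost_per_call, default=None)


# Made-up candidates: a cheap model averaging 0.74, a pricier one at 0.89.
candidates = [
    Candidate("small", cost_per_call=0.01, scores=[0.72, 0.76]),
    Candidate("large", cost_per_call=0.10, scores=[0.88, 0.90]),
]

for_summaries = pick_cheapest_above_floor(
    candidates, STARTING_FLOORS["summarize-source"]
)
for_entities = pick_cheapest_above_floor(
    candidates, STARTING_FLOORS["extract-entities"]
)
```

With these made-up numbers the 0.70 floor admits the cheaper `small` candidate, while the 0.85 floor leaves only `large`. Raising a floor therefore trades cost for quality one task type at a time.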

## Common overrides

Callers may want to **collapse** task types to share observations across
related stages, or **split** a task type to pin a specific model to a
narrow workload. Two illustrative overrides:

```python
# Collapse extraction stages so a single ledger drives both
stage_to_task_type = {
    "extract-entities": "extraction",
    "extract-relations": "extraction",
}
```

```python
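# Split entity evaluation off into its own `judge` task type so a model can
# be pinned to it. This helps when a profile has very different quality
# bars for different entity categories (e.g. trading-literature's
# `evidence_bearing_claim` is harder to judge than `instrument`). The map is
# keyed by stage id, so it assigns one task type per stage, not per category.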
stage_to_task_type = {
    "evaluate-entity": "judge",
}
```

Anything not in the override map falls through to the identity mapping.
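
To see what collapsing does to observations, the sketch below records scores under the resolved task type. It assumes the ledger is keyed by task type. The dict-of-lists ledger and the `record` helper are hypothetical stand-ins, and the scores are made up.

```python
from collections import defaultdict
from statistics import fmean
from typing import DefaultDict, List, Mapping

# The collapse override from this section.
overrides = {
    "extract-entities": "extraction",
    "extract-relations": "extraction",
}


def task_type_for(stage_id: str, overrides: Mapping[str, str]) -> str:
    """Use the override if present; otherwise fall through to identity."""
    return overrides.get(stage_id, stage_id)


def record(
    ledger: DefaultDict[str, List[float]],
    stage_id: str,
    score: float,
    overrides: Mapping[str, str],
) -> None:
    """Append a quality observation under the stage's resolved task type."""
    ledger[task_type_for(stage_id, overrides)].append(score)


# Hypothetical in-memory ledger: task type -> list of quality scores.
ledger: DefaultDict[str, List[float]] = defaultdict(list)
record(ledger, "extract-entities", 0.90, overrides)
record(ledger, "extract-relations", 0.80, overrides)
record(ledger, "summarize-source", 0.75, overrides)  # not overridden

# Both extraction stages now contribute to one pooled mean (about 0.85).
pooled_mean = fmean(ledger["extraction"])
task_types_seen = sorted(ledger)
```

After these calls the ledger holds two task types, `extraction` and `summarize-source`. Under this assumption the pooled mean blends both extraction stages, so a caller who collapses them is effectively choosing one quality floor for both.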

## What this taxonomy does NOT decide

- **Which adapter serves each task type.** That belongs to the caller's
  `RoutingPolicy` rule list.
- **Where the quality ledger lives.** Caller-supplied path on the
  `AdaptiveRoutingPolicy`.
- **When to refresh observations.** Caller decides via the ledger's TTL
  helpers in llm-connect.
- **What a quality score means.** Each judge defines its own.