llm-connect/workplans/llm-connect-WP-0005-cost-model-and-problem-class-estimators.md
tegwick c11c6afa3f
Implement-LLM-WP-0005-cost-model-estimators
2026-05-19 05:02:20 +02:00

---
id: LLM-WP-0005
type: workplan
title: "Cost Model and Problem-Class Token Estimators"
domain: custodian
repo: llm-connect
status: finished
owner: llm-connect
planning_priority: high
planning_order: 5
created: "2026-05-19"
updated: "2026-05-19"
depends_on_workplans:
- LLM-WP-0003
- LLM-WP-0004
related_workplans:
- IB-WP-0019
- IB-WP-0020
state_hub_workstream_id: "869196c5-551b-4eef-b8d8-cca6f770a9b0"
---
# LLM-WP-0005 — Cost Model and Problem-Class Token Estimators
**status:** finished
**owner:** llm-connect
## Purpose
Move two consumer-side concerns into llm-connect as first-class
primitives:
1. **Model rate registry.** Provider-specific USD-per-1k prompt and
completion rates plus capture provenance — a fact about the base
model itself, not the application using it.
2. **Problem-class token estimators.** Generic shapes ("summarise a
chunk of N words", "extract entities from M paragraphs", "judge an
artifact against K criteria") with a small base-variable surface and
a few tunable parameters. Consumers select the class, supply problem
dimensions, and get a predicted (prompt_tokens, completion_tokens)
estimate before any call goes out. The cost model then converts the
estimate into USD using the rate registry.
Today the cost-rate table and a coarse word-count estimator live in
consumer code (`infospace-bench/src/infospace_bench/model_rates.yaml`
and `plan_generation_summary` in `generator.py`). The Lefevre live-run
smoke surfaced two consequences:
- A user supplying `--cost-per-1k 0.30` as a blended rate got a plan
estimate roughly 1000× larger than the actual gpt-4o-mini bill ($8.40
planned vs $0.0088 actual), because the consumer's estimator has no
notion of per-model rates and the user has no easy way to pick the
right number.
- The token estimator multiplies word counts by a constant and ignores
problem shape — actual prompts ran ~3× larger than the word-count
estimate predicted because templates, profile content, and entity
context dominate the prompt body, not the chunk text.
Both gaps recur in every llm-connect consumer (infospace-bench today;
inter-hub, markitect, and future repos tomorrow). Owning the primitives
here means each consumer wires structure-specific dimensions and
parameters but never re-implements rates or generic shapes.
## Demand signal
`infospace-bench` is the first concrete consumer. The Lefevre Chapter-I
smoke (2026-05-18) ran 32 calls and 28k prompt tokens for an actual cost
of $0.0088 against a planned $8.40. That roughly 1000× gap
(8.40 / 0.0088 ≈ 955) comes entirely from the consumer side of the
estimator. `IB-WP-0019` (budget registry) already records the shape
llm-connect would need to learn from (`output/budget/usage.yaml` plus
the llm-connect `QualityLedger`).
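The size of that gap can be reproduced from the figures quoted above. The snippet below is only a back-of-the-envelope check; it assumes the planned figure came from applying the blended `0.30` rate to the prompt tokens alone, which happens to reproduce $8.40 but is an inference rather than something the smoke report states.

```python
# Figures quoted above from the Lefevre Chapter-I smoke.
prompt_tokens = 28_000
blended_rate_per_1k = 0.30  # the --cost-per-1k value the user supplied
actual_cost_usd = 0.0088

# Assumption: the plan applied the blended rate to prompt tokens only.
planned_cost_usd = prompt_tokens / 1000 * blended_rate_per_1k
overestimate_factor = planned_cost_usd / actual_cost_usd

print(f"planned ${planned_cost_usd:.2f} vs actual ${actual_cost_usd:.4f}")
print(f"overestimate factor: {overestimate_factor:.0f}x")
```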
## Architecture sketch (read before writing tasks)
Three new modules in llm-connect:
```
llm_connect/
    rates.py            # ModelRate, ModelRateRegistry, default registry
    costs.py            # CostModel: tokens × rate → USD
    problem_classes.py  # ProblemClass protocol, built-in classes, registry
```
### `rates.py`
```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float      # USD per 1k prompt tokens
    completion_per_1k: float  # USD per 1k completion tokens
    currency: str = "USD"
    source_url: str = ""      # capture provenance
    captured_at: str = ""


class ModelRateRegistry:
    def get(self, model_id: str) -> ModelRate | None: ...
    def all(self) -> dict[str, ModelRate]: ...
    @classmethod
    def default(cls) -> "ModelRateRegistry": ...
    @classmethod
    def from_yaml(cls, path: Path | str) -> "ModelRateRegistry": ...
    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry": ...
```
The default registry ships the nine well-known OpenRouter models that
infospace-bench bundles today. Consumers can load overrides from their
own YAML.
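The override behaviour can be illustrated with a small in-memory sketch. It is not the shipped implementation: the constructor taking a dict, the rule that an override entry replaces a default with the same `model_id`, and the `example/...` model ids with their placeholder rates are all assumptions made for illustration. YAML loading is left out to keep the block dependency-free.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float
    currency: str = "USD"
    source_url: str = ""
    captured_at: str = ""


class ModelRateRegistry:
    """Illustrative in-memory registry (hypothetical constructor)."""

    def __init__(self, rates: dict[str, ModelRate] | None = None) -> None:
        self._rates = dict(rates or {})

    def get(self, model_id: str) -> ModelRate | None:
        return self._rates.get(model_id)

    def all(self) -> dict[str, ModelRate]:
        return dict(self._rates)

    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry":
        # Assumption: override entries win over defaults with the same model_id.
        return ModelRateRegistry({**self._rates, **override._rates})


# Placeholder rates for illustration only; these are not real prices.
defaults = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.001, 0.002, captured_at="2026-05-19"),
    "example/model-b": ModelRate("example/model-b", 0.010, 0.030, captured_at="2026-05-19"),
})
workspace = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.0005, 0.0015, captured_at="2026-06-01"),
})
merged = defaults.merged_with(workspace)
```

Returning a new registry from `merged_with` keeps the bundled defaults untouched, which fits the frozen `ModelRate` dataclass.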
### `costs.py`
```python
from dataclasses import dataclass

from llm_connect.rates import ModelRateRegistry


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str  # "rate_table:<model>" | "unknown" | "override"
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    registry: ModelRateRegistry | None = None,
) -> CostEstimate: ...
```
`estimate_cost` is a pure function of the rate registry and the token
counts, so it serves both for preview (plan time) and for post-hoc
verification (comparing adapter-reported cost against the rate-table
estimate).
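A minimal sketch of the arithmetic behind `estimate_cost` follows. To stay self-contained it swaps the registry for a plain dict passed as a hypothetical `rates` argument, and it assumes gpt-4o-mini list prices of $0.15 and $0.60 per 1M prompt and completion tokens; both are illustration choices, not the module's actual contents.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


# Assumed rates in USD per 1k tokens as (prompt, completion) pairs.
# The gpt-4o-mini pair assumes $0.15 / $0.60 per 1M tokens.
ASSUMED_RATES = {"openai/gpt-4o-mini": (0.00015, 0.0006)}


def estimate_cost(model_id, prompt_tokens, completion_tokens=0, *, rates=None):
    """Convert token counts to USD; a plain dict stands in for the registry."""
    table = ASSUMED_RATES if rates is None else rates
    rate = table.get(model_id)
    if rate is None:
        # Unknown model: report no cost rather than guessing a rate.
        return CostEstimate(cost_usd=None, cost_source="unknown")
    prompt_cost = prompt_tokens / 1000 * rate[0]
    completion_cost = completion_tokens / 1000 * rate[1]
    return CostEstimate(
        cost_usd=prompt_cost + completion_cost,
        cost_source=f"rate_table:{model_id}",
        prompt_cost_usd=prompt_cost,
        completion_cost_usd=completion_cost,
    )


known = estimate_cost("openai/gpt-4o-mini", 28_000, 7_500)
unknown = estimate_cost("example/not-in-table", 1_000)
```

Under those assumed rates the 28,000 / 7,500 token case works out to 0.0042 + 0.0045 = 0.0087 USD.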
### `problem_classes.py`
```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5  # 0..1, learned over time


class ProblemClass(Protocol):
    name: str
    base_dimensions: tuple[str, ...]  # e.g. ("chunk_words",)
    tunable_params: tuple[str, ...]   # e.g. ("template_tokens", "completion_ratio")

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate: ...
```
Built-in classes (one per common workflow shape):
| Class | Dimensions | Tunable params |
|---|---|---|
| `chunk-summarization` | `chunk_words`, `template_words` | `completion_ratio` (default 0.25) |
| `entity-extraction` | `chunk_words`, `template_words`, `expected_entities` | `tokens_per_entity` (default 70) |
| `relation-extraction` | `chunk_words`, `template_words`, `expected_relations` | `tokens_per_relation` (default 80) |
| `judge-eval` | `artifact_words`, `template_words`, `n_criteria` | `tokens_per_criterion` (default 35) |
| `report-synthesis` | `n_chunks`, `n_entities`, `n_relations`, `template_words` | `base_completion_tokens` (default 400) |
Each class registers under a name; a `ProblemClassRegistry` keeps the
mapping. Consumers reference classes by name; advanced consumers can
register their own.
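To make the protocol concrete, here is one possible shape for the `chunk-summarization` class and a name-keyed registry. Several details are assumptions made for the sketch: the flat `TOKENS_PER_WORD` factor standing in for a tokeniser, the choice to scale completion tokens from the chunk rather than the whole prompt, and the `register` / `get` method names on the registry.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5


class ChunkSummarization:
    name = "chunk-summarization"
    base_dimensions = ("chunk_words", "template_words")
    tunable_params = ("completion_ratio",)

    # Assumption: a flat tokens-per-word factor stands in for a real tokeniser.
    TOKENS_PER_WORD = 1.3
    DEFAULTS = {"completion_ratio": 0.25}

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate:
        p = {**self.DEFAULTS, **(params or {})}
        chunk_tokens = dimensions["chunk_words"] * self.TOKENS_PER_WORD
        template_tokens = dimensions["template_words"] * self.TOKENS_PER_WORD
        # The prompt carries both the template and the chunk text.
        prompt_tokens = round(chunk_tokens + template_tokens)
        # Assumption: summary length scales with the chunk, not the template.
        completion_tokens = round(chunk_tokens * p["completion_ratio"])
        return TokenEstimate(prompt_tokens, completion_tokens)


class ProblemClassRegistry:
    """Name-to-class mapping; method names are hypothetical."""

    def __init__(self) -> None:
        self._classes: dict[str, Any] = {}

    def register(self, problem_class: Any) -> None:
        self._classes[problem_class.name] = problem_class

    def get(self, name: str) -> Any:
        return self._classes[name]


registry = ProblemClassRegistry()
registry.register(ChunkSummarization())
estimate = registry.get("chunk-summarization").estimate(
    {"chunk_words": 400, "template_words": 800}
)
```

With an 800-word template around a 400-word chunk, two thirds of the estimated prompt comes from the template, which is the effect a chunk-only word count misses.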
### Defaults and learning
Each built-in class ships seed parameters chosen from common practice
(roughly aligned with what infospace-bench observed on the Lefevre
smoke: ~1.0 prompt-words-per-chunk-word, ~0.2 completion ratio, ~70
tokens per emitted entity). Two adaptation hooks:
- `ProblemClass.fit(observations: list[Observation])` adjusts params to
best fit observed (dimensions → actual_tokens) pairs.
- A small CLI `llm-connect rates show` / `llm-connect classes show`
inspects the current registry and learned parameters.
`Observation` reuses the `QualityLedger` row shape so existing infra
keeps working (LLM-WP-0004 owns the ledger).
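As an illustration of what fitting a single tunable parameter could involve, the sketch below fits `completion_ratio` by least squares through the origin. The `Observation` fields, the `fit_completion_ratio` name, and the `min_samples` threshold are hypothetical; the real `Observation` follows the `QualityLedger` row shape, which is not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    # Hypothetical fields; the real shape follows the QualityLedger row.
    chunk_tokens: int
    completion_tokens: int


def fit_completion_ratio(observations, seed=0.25, min_samples=5):
    """Least-squares fit of completion_tokens = ratio * chunk_tokens."""
    if len(observations) < min_samples:
        return seed  # too few observations: keep the seeded default
    sum_xy = sum(o.chunk_tokens * o.completion_tokens for o in observations)
    sum_xx = sum(o.chunk_tokens ** 2 for o in observations)
    if sum_xx == 0:
        return seed  # degenerate data: nothing to fit against
    return sum_xy / sum_xx


# Synthetic observations generated with a true ratio of 0.2.
synthetic = [
    Observation(x, round(x * 0.2)) for x in (500, 800, 1000, 1200, 1500, 2000)
]
fitted = fit_completion_ratio(synthetic)
too_few = fit_completion_ratio(synthetic[:2])
```

Falling back to the seed below the threshold keeps a handful of unusual calls from overwriting a reasonable default.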
## Tasks
```task
id: T01
title: 'ModelRate + ModelRateRegistry data model, YAML loader, default-registry seed of nine OpenRouter models'
priority: high
status: done
state_hub_task_id: "535d3f12-911e-4b6a-87c3-b539c5986671"
```
```task
id: T02
title: 'CostModel.estimate_cost() pure function; tests for known model, unknown model, registry override, zero-token edge'
priority: high
status: done
state_hub_task_id: "691dd985-6a97-432d-8bf0-6cb99a9fbdcc"
```
```task
id: T03
title: 'ProblemClass protocol + TokenEstimate + ProblemClassRegistry'
priority: high
status: done
state_hub_task_id: "ecf263d2-f40a-460e-9195-4e01135ef727"
```
```task
id: T04
title: 'Built-in classes: chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis'
priority: high
status: done
state_hub_task_id: "f1860b10-7467-4ce3-9775-ab293cef3ed0"
```
```task
id: T05
title: 'ProblemClass.fit() adapts tunable params from QualityLedger observations'
priority: medium
status: done
state_hub_task_id: "950b74e9-ede8-477a-b6b7-c7af423d4ebb"
```
```task
id: T06
title: 'CLI helpers: llm-connect rates show, llm-connect classes show, llm-connect classes fit <ledger>'
priority: medium
status: done
state_hub_task_id: "c47eca5f-4cb3-4f88-ac1b-38a9ae18e7e6"
```
```task
id: T07
title: 'Functional contract docs under contracts/functional/ for rates, costs, and problem classes'
priority: medium
status: done
state_hub_task_id: "c15fd1dc-48c3-40e9-abca-ba3ffe3684f9"
```
```task
id: T08
title: 'Consumer migration note for infospace-bench: replace plan_generation_summary cost+token math with llm-connect calls'
priority: medium
status: done
state_hub_task_id: "2993932a-334c-49f9-bb74-6ef4d3cbffcb"
```
## Scope guardrails
In scope:
- Data models, the registries, the built-in classes, and the
fit-from-observations helper.
- A default rate registry seeded with publicly known OpenRouter list
prices (refresh policy is on the consumer).
- CLI helpers for inspection.
Out of scope:
- Per-call billing or accounting infrastructure (consumer's job — see
`IB-WP-0019` for infospace-bench's per-infospace budget log).
- Provider-specific tokenisers (we already use a coarse
chars-per-token estimator; tokeniser parity is a separate piece of
work that should land before this one's accuracy bar tightens).
- A rates *update* mechanism. The registry is a static snapshot;
refreshing rates is a consumer chore (or a separate CronCreate-fed
workflow against provider price pages, deliberately out of scope
here).
## Acceptance
- `ModelRateRegistry.default()` returns a registry with at least the
nine OpenRouter models infospace-bench bundles today, each carrying
a `captured_at` timestamp.
- `estimate_cost("openai/gpt-4o-mini", 28000, 7500)` returns a
`CostEstimate(cost_usd≈0.009, cost_source="rate_table:openai/gpt-4o-mini")`
matching the Lefevre Chapter-I smoke within 20%.
- Each built-in `ProblemClass.estimate(dimensions, params)` produces a
`TokenEstimate` and its `fit()` recovers seeded params from a small
synthetic observation set within 10%.
- The consumer guide in the workplan's notes shows what
`infospace-bench plan_generation_summary` would look like once it
delegates to these primitives.
## Risks and open questions
- **Rate drift.** OpenRouter / Anthropic / OpenAI prices change. Each
rate has a `captured_at`; consumers must decide their freshness
policy. The default registry's date is the source of truth; a stale
registry will under- or over-estimate cost but the structure is
unchanged.
- **Class taxonomy lock-in.** The built-in class names will appear in
consumer code and ledger tags. Bump a `schema_version` on the
ProblemClassRegistry from day one; breaking changes need migration.
- **Tokeniser parity.** Current consumer estimators use a coarse
chars-per-token heuristic; the real tokenisers diverge by ±20%
across providers. This workplan accepts that for v1; tightening the
per-class accuracy bar belongs to a follow-on workplan once usage
observations expose the gap.
- **Fitting on small samples.** Adaptation needs enough observations
to be statistically meaningful. The fit function should refuse (or
fall back to seeds) when sample sizes are below a configurable
threshold.
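For the rate-drift risk, a consumer-side freshness policy could be as small as the check below. It is a sketch under two assumptions: that `captured_at` holds an ISO `YYYY-MM-DD` date, and that a 90-day window is an acceptable default. The `is_stale` helper is hypothetical and not part of the planned modules.

```python
from datetime import date


def is_stale(captured_at: str, today: date, max_age_days: int = 90) -> bool:
    """Return True when a rate's capture date is older than the policy allows."""
    if not captured_at:
        return True  # no provenance recorded: treat the rate as stale
    captured = date.fromisoformat(captured_at)
    return (today - captured).days > max_age_days


fresh = is_stale("2026-05-19", today=date(2026, 6, 1))
expired = is_stale("2026-05-19", today=date(2026, 9, 1))
```

A consumer might warn on a stale rate rather than fail, since the estimate is still structurally valid.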
## Downstream effects
- `infospace-bench` `plan_generation_summary` becomes a thin caller
over llm-connect's `estimate_cost` and the relevant `ProblemClass`.
The follow-on consumer workplan (`INFOSPACE-WP-NNNN`) describes
exactly which class maps to each workflow stage.
- `IB-WP-0019`'s budget registry stops needing its own
`src/infospace_bench/model_rates.yaml`; the workspace override path
stays but the bundled default lives upstream.
- `IB-WP-0020`'s routing CLI can ask the model rate registry whether a
`--routing-config` cost cap is realistic for the candidate set —
enables defensible `--cost-cap` defaults later.
## Consumer-side follow-up
Once T01-T04 land in llm-connect, `infospace-bench` opens a thin
companion workplan to:
- Replace `_CALLS_PER_CHUNK_BY_WORKFLOW` + `_profile_template_words` +
`WORDS_PER_TOKEN_DEFAULT` math in `plan_generation_summary` with
problem-class lookups.
- Map each workflow stage to a problem class
(`summarize-source` → `chunk-summarization`, etc.) and surface the
mapping in `docs/routing-task-types.md`.
- Drop `src/infospace_bench/model_rates.yaml` and read from
`ModelRateRegistry.default()`; keep the workspace override path
pointed at the same registry.
- Carry forward the existing `--cost-per-1k` override flag (with
documented semantics: blended single-rate override that wins over
rate-table lookup when set).
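The override precedence in the last item can be sketched as follows. The `plan_cost_usd` helper and its tuple-based `rate` argument are hypothetical, and applying the blended rate to prompt plus completion tokens is an assumption about what "blended" means here.

```python
def plan_cost_usd(prompt_tokens, completion_tokens, rate, cost_per_1k=None):
    """Return (cost_usd, cost_source) for a planned run.

    rate is a (prompt_per_1k, completion_per_1k) tuple, or None when the
    model is not in the rate table; cost_per_1k is the blended override.
    """
    if cost_per_1k is not None:
        # The override wins over any rate-table entry when it is set.
        total_tokens = prompt_tokens + completion_tokens
        return total_tokens / 1000 * cost_per_1k, "override"
    if rate is None:
        return None, "unknown"
    prompt_rate, completion_rate = rate
    cost = (
        prompt_tokens / 1000 * prompt_rate
        + completion_tokens / 1000 * completion_rate
    )
    return cost, "rate_table"


from_table = plan_cost_usd(1000, 1000, rate=(0.001, 0.002))
overridden = plan_cost_usd(1000, 1000, rate=(0.001, 0.002), cost_per_1k=0.01)
no_rate = plan_cost_usd(1000, 1000, rate=None)
```

Reporting the source alongside the figure lets a plan summary show when a user-supplied blended rate has replaced the per-model lookup.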