---
id: LLM-WP-0005
type: workplan
title: "Cost Model and Problem-Class Token Estimators"
domain: custodian
repo: llm-connect
status: finished
owner: llm-connect
planning_priority: high
planning_order: 5
created: "2026-05-19"
updated: "2026-05-19"
depends_on_workplans:
  - LLM-WP-0003
  - LLM-WP-0004
related_workplans:
  - IB-WP-0019
  - IB-WP-0020
state_hub_workstream_id: "869196c5-551b-4eef-b8d8-cca6f770a9b0"
---

# LLM-WP-0005 — Cost Model and Problem-Class Token Estimators

**status:** finished
**owner:** llm-connect

## Purpose

Move two consumer-side concerns into llm-connect as first-class
primitives:

1. **Model rate registry.** Provider-specific USD-per-1k prompt and
   completion rates plus capture provenance: a fact about the base
   model itself, not the application using it.
2. **Problem-class token estimators.** Generic shapes ("summarise a
   chunk of N words", "extract entities from M paragraphs", "judge an
   artifact against K criteria") with a small base-variable surface and
   a few tunable parameters. Consumers select the class, supply problem
   dimensions, and get a predicted (prompt_tokens, completion_tokens)
   estimate before any call goes out. The cost model then converts the
   estimate into USD using the rate registry.

Today the cost-rate table and a coarse word-count estimator live in
consumer code (`infospace-bench/src/infospace_bench/model_rates.yaml`
and `plan_generation_summary` in `generator.py`). The Lefevre live-run
smoke surfaced two consequences:

- A user supplying `--cost-per-1k 0.30` as a blended rate produced a
  plan estimate roughly 1000× larger than the actual gpt-4o-mini bill,
  because the consumer's estimator has no notion of per-model rates and
  the user has no easy way to pick the right number.
- The token estimator multiplies word counts by a constant and ignores
  problem shape: actual prompts ran ~3× larger than the word-count
  estimate predicted because templates, profile content, and entity
  context dominate the prompt body, not the chunk text.

Both gaps recur in every llm-connect consumer (infospace-bench today;
inter-hub, markitect, and future repos tomorrow). Owning the primitives
here means each consumer wires structure-specific dimensions and
parameters but never re-implements rates or generic shapes.

## Demand signal

`infospace-bench` is the first concrete consumer. The Lefevre Chapter-I
smoke (2026-05-18) ran 32 calls / 28k prompt tokens / $0.0088 actual vs
a planned $8.40. The roughly 1000× gap comes entirely from the
consumer-side estimator. `IB-WP-0019` (budget registry) already records
the shape llm-connect would need to learn from
(`output/budget/usage.yaml` plus the llm-connect `QualityLedger`).

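The gap in the smoke figures can be reproduced with a few lines of arithmetic. This is only a sketch: the two gpt-4o-mini per-1k rates are assumed list prices that this workplan does not quote, the 7,500 completion tokens are borrowed from the acceptance example, and applying the flat rate to prompt tokens alone is a guess at how the planned figure arose.

```python
# Token counts: prompt tokens from the Lefevre Chapter-I smoke; completion
# tokens from the acceptance example, since the smoke summary omits them.
PROMPT_TOKENS = 28_000
COMPLETION_TOKENS = 7_500

BLENDED_RATE_PER_1K = 0.30       # the --cost-per-1k value the user supplied
PROMPT_RATE_PER_1K = 0.00015     # ASSUMED gpt-4o-mini list price (USD per 1k)
COMPLETION_RATE_PER_1K = 0.0006  # ASSUMED gpt-4o-mini list price (USD per 1k)

# One flat rate applied to the prompt tokens reproduces the planned figure.
planned_usd = PROMPT_TOKENS / 1000 * BLENDED_RATE_PER_1K

# Separate prompt and completion rates land close to the actual bill.
per_model_usd = (
    PROMPT_TOKENS / 1000 * PROMPT_RATE_PER_1K
    + COMPLETION_TOKENS / 1000 * COMPLETION_RATE_PER_1K
)

ratio = planned_usd / per_model_usd
print(f"planned ${planned_usd:.2f}, per-model ${per_model_usd:.4f}, ratio {ratio:.0f}x")
```

Under these assumptions the per-model figure comes out near $0.0087, about 1% below the $0.0088 bill, while the flat-rate figure matches the planned $8.40.
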
## Architecture sketch (read before writing tasks)

Three new modules in llm-connect:

```
llm_connect/
  rates.py            # ModelRate, ModelRateRegistry, default registry
  costs.py            # CostModel: tokens × rate → USD
  problem_classes.py  # ProblemClass protocol, built-in classes, registry
```

### `rates.py`

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float       # price per 1k prompt tokens, in `currency`
    completion_per_1k: float   # price per 1k completion tokens, in `currency`
    currency: str = "USD"
    source_url: str = ""       # provenance: where the rate was read
    captured_at: str = ""      # provenance: when the rate was captured


class ModelRateRegistry:
    def get(self, model_id: str) -> ModelRate | None: ...
    def all(self) -> dict[str, ModelRate]: ...
    @classmethod
    def default(cls) -> "ModelRateRegistry": ...
    @classmethod
    def from_yaml(cls, path: Path | str) -> "ModelRateRegistry": ...
    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry": ...
```

The default registry ships the nine well-known OpenRouter models that
infospace-bench bundles today. Consumers can load overrides from their
own YAML.

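One way the registry surface could be backed is a plain dictionary keyed by `model_id`. The sketch below is illustrative only: the constructor signature, the `example/...` model ids, and every rate value are made up, and `default()` and `from_yaml()` are left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float
    currency: str = "USD"
    source_url: str = ""
    captured_at: str = ""


class ModelRateRegistry:
    # Hypothetical constructor: the workplan does not specify one.
    def __init__(self, rates: dict[str, ModelRate] | None = None) -> None:
        self._rates = dict(rates or {})

    def get(self, model_id: str) -> ModelRate | None:
        return self._rates.get(model_id)

    def all(self) -> dict[str, ModelRate]:
        # Return a copy so callers cannot mutate the registry.
        return dict(self._rates)

    def merged_with(self, override: ModelRateRegistry) -> ModelRateRegistry:
        # Entries in `override` replace same-id entries; neither input changes.
        merged = self.all()
        merged.update(override.all())
        return ModelRateRegistry(merged)


# Made-up rates, standing in for a bundled default and a workspace override.
bundled = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.001, 0.002, captured_at="2026-05-19"),
    "example/model-b": ModelRate("example/model-b", 0.003, 0.006, captured_at="2026-05-19"),
})
workspace = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.0005, 0.001, captured_at="2026-06-01"),
})
effective = bundled.merged_with(workspace)
print(effective.get("example/model-a"))
```

Returning a new registry from `merged_with` keeps the bundled default immutable, which suits a design where the consumer owns the override file.
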
### `costs.py`

```python
from __future__ import annotations

from dataclasses import dataclass

from .rates import ModelRateRegistry


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str  # "rate_table:<model>" | "unknown" | "override"
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    registry: ModelRateRegistry | None = None,
) -> CostEstimate: ...
```

A pure function over the rate registry and token counts, useful both
for preview (plan time) and post-hoc verification (compare
adapter-reported cost against the rate-table estimate).

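A minimal sketch of how `estimate_cost` might behave, assuming the registry is simplified to a dictionary of `(prompt_per_1k, completion_per_1k)` pairs. The `ASSUMED_RATES` name and the gpt-4o-mini figures in it are assumptions for illustration, not values taken from this workplan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


# Stand-in for ModelRateRegistry: model_id -> (prompt_per_1k, completion_per_1k).
# ASSUMED list prices in USD, for illustration only.
ASSUMED_RATES = {"openai/gpt-4o-mini": (0.00015, 0.0006)}


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    registry: dict[str, tuple[float, float]] | None = None,
) -> CostEstimate:
    rates = ASSUMED_RATES if registry is None else registry
    rate = rates.get(model_id)
    if rate is None:
        # Unknown model: report no cost rather than guessing a rate.
        return CostEstimate(cost_usd=None, cost_source="unknown")
    prompt_rate, completion_rate = rate
    prompt_cost = prompt_tokens / 1000 * prompt_rate
    completion_cost = completion_tokens / 1000 * completion_rate
    return CostEstimate(
        cost_usd=prompt_cost + completion_cost,
        cost_source=f"rate_table:{model_id}",
        prompt_cost_usd=prompt_cost,
        completion_cost_usd=completion_cost,
    )


known = estimate_cost("openai/gpt-4o-mini", 28000, 7500)
unknown = estimate_cost("example/not-registered", 1000)
print(known)
print(unknown)
```

Returning `cost_usd=None` with `cost_source="unknown"` lets a caller tell "no rate available" apart from a genuine zero-cost estimate.
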
### `problem_classes.py`

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5  # 0..1, learned over time


class ProblemClass(Protocol):
    name: str
    base_dimensions: tuple[str, ...]  # e.g. ("chunk_words",)
    tunable_params: tuple[str, ...]   # e.g. ("template_tokens", "completion_ratio")

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate: ...
```

Built-in classes (one per common workflow shape):

| Class | Dimensions | Tunable params |
|---|---|---|
| `chunk-summarization` | `chunk_words`, `template_words` | `completion_ratio` (default 0.25) |
| `entity-extraction` | `chunk_words`, `template_words`, `expected_entities` | `tokens_per_entity` (default 70) |
| `relation-extraction` | `chunk_words`, `template_words`, `expected_relations` | `tokens_per_relation` (default 80) |
| `judge-eval` | `artifact_words`, `template_words`, `n_criteria` | `tokens_per_criterion` (default 35) |
| `report-synthesis` | `n_chunks`, `n_entities`, `n_relations`, `template_words` | `base_completion_tokens` (default 400) |

Each class registers under a name; a `ProblemClassRegistry` keeps the
mapping. Consumers reference classes by name; advanced consumers can
register their own.

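To make the protocol concrete, here is a rough sketch of one built-in class and a registry. Several pieces are assumptions rather than anything this workplan specifies: the `WORDS_PER_TOKEN` value of 0.75, treating `completion_ratio` as a fraction of prompt tokens, the `defaults` attribute, and the `register` / `get` method names.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol

WORDS_PER_TOKEN = 0.75  # ASSUMED conversion factor, not taken from the workplan


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5


class ProblemClass(Protocol):
    name: str

    def estimate(
        self, dimensions: dict[str, Any], params: dict[str, Any] | None = None
    ) -> TokenEstimate: ...


class ChunkSummarization:
    name = "chunk-summarization"
    base_dimensions = ("chunk_words", "template_words")
    tunable_params = ("completion_ratio",)
    defaults = {"completion_ratio": 0.25}  # hypothetical attribute name

    def estimate(
        self, dimensions: dict[str, Any], params: dict[str, Any] | None = None
    ) -> TokenEstimate:
        merged = {**self.defaults, **(params or {})}
        words = dimensions["chunk_words"] + dimensions["template_words"]
        prompt_tokens = round(words / WORDS_PER_TOKEN)
        # Assumption: completion size scales with the prompt token count.
        completion_tokens = round(prompt_tokens * merged["completion_ratio"])
        return TokenEstimate(prompt_tokens, completion_tokens)


class ProblemClassRegistry:
    def __init__(self) -> None:
        self._classes: dict[str, ProblemClass] = {}

    def register(self, problem_class: ProblemClass) -> None:
        self._classes[problem_class.name] = problem_class

    def get(self, name: str) -> ProblemClass:
        return self._classes[name]


registry = ProblemClassRegistry()
registry.register(ChunkSummarization())
estimate = registry.get("chunk-summarization").estimate(
    {"chunk_words": 600, "template_words": 300}
)
print(estimate)
```

Passing `params={"completion_ratio": 0.5}` overrides the seed for a single call without touching the registered class.
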
### Defaults and learning

Each built-in class ships seed parameters chosen from common practice
(roughly aligned with what infospace-bench observed on the Lefevre
smoke: ~1.0 prompt-words-per-chunk-word, ~0.2 completion ratio, ~70
tokens per emitted entity). Two adaptation hooks:

- `ProblemClass.fit(observations: list[Observation])` adjusts params to
  best fit observed (dimensions → actual_tokens) pairs.
- A small CLI `llm-connect rates show` / `llm-connect classes show`
  inspects the current registry and learned parameters.

`Observation` reuses the `QualityLedger` row shape so existing infra
keeps working (LLM-WP-0004 owns the ledger).

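As an illustration of what fitting could look like for a single parameter, the sketch below estimates `completion_ratio` by least squares through the origin and falls back to the seed when there are too few observations. The `Observation` fields, the `fit_completion_ratio` name, and the `min_samples` threshold are hypothetical; the real `QualityLedger` row shape is defined in LLM-WP-0004 and is not reproduced here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    # Hypothetical fields; the real shape follows the QualityLedger row.
    prompt_tokens: int
    completion_tokens: int


def fit_completion_ratio(
    observations: list[Observation],
    seed: float = 0.25,
    min_samples: int = 5,
) -> float:
    """Least-squares slope of completion vs prompt tokens, through the origin."""
    if len(observations) < min_samples:
        # Too little data to trust: keep the seeded parameter.
        return seed
    numerator = sum(o.prompt_tokens * o.completion_tokens for o in observations)
    denominator = sum(o.prompt_tokens ** 2 for o in observations)
    return numerator / denominator if denominator else seed


# Synthetic observations generated with a true ratio of 0.2.
synthetic = [Observation(p, p // 5) for p in (1000, 1500, 2000, 2500, 3000, 3500)]
fitted = fit_completion_ratio(synthetic)
too_few = fit_completion_ratio(synthetic[:3])
print(fitted, too_few)
```

A slope through the origin is the simplest choice for a pure ratio parameter; classes with a fixed offset such as `base_completion_tokens` would need an intercept term as well.
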
## Tasks

```task
id: T01
title: 'ModelRate + ModelRateRegistry data model, YAML loader, default-registry seed of nine OpenRouter models'
priority: high
status: done
state_hub_task_id: "535d3f12-911e-4b6a-87c3-b539c5986671"
```

```task
id: T02
title: 'CostModel.estimate_cost() pure function; tests for known model, unknown model, registry override, zero-token edge'
priority: high
status: done
state_hub_task_id: "691dd985-6a97-432d-8bf0-6cb99a9fbdcc"
```

```task
id: T03
title: 'ProblemClass protocol + TokenEstimate + ProblemClassRegistry'
priority: high
status: done
state_hub_task_id: "ecf263d2-f40a-460e-9195-4e01135ef727"
```

```task
id: T04
title: 'Built-in classes: chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis'
priority: high
status: done
state_hub_task_id: "f1860b10-7467-4ce3-9775-ab293cef3ed0"
```

```task
id: T05
title: 'ProblemClass.fit() adapts tunable params from QualityLedger observations'
priority: medium
status: done
state_hub_task_id: "950b74e9-ede8-477a-b6b7-c7af423d4ebb"
```

```task
id: T06
title: 'CLI helpers: llm-connect rates show, llm-connect classes show, llm-connect classes fit <ledger>'
priority: medium
status: done
state_hub_task_id: "c47eca5f-4cb3-4f88-ac1b-38a9ae18e7e6"
```

```task
id: T07
title: 'Functional contract docs under contracts/functional/ for rates, costs, and problem classes'
priority: medium
status: done
state_hub_task_id: "c15fd1dc-48c3-40e9-abca-ba3ffe3684f9"
```

```task
id: T08
title: 'Consumer migration note for infospace-bench: replace plan_generation_summary cost+token math with llm-connect calls'
priority: medium
status: done
state_hub_task_id: "2993932a-334c-49f9-bb74-6ef4d3cbffcb"
```

## Scope guardrails

In scope:

- Data models, the registries, the built-in classes, and the
  fit-from-observations helper.
- A default rate registry seeded with publicly known OpenRouter list
  prices (refresh policy is on the consumer).
- CLI helpers for inspection.

Out of scope:

- Per-call billing or accounting infrastructure (consumer's job; see
  `IB-WP-0019` for infospace-bench's per-infospace budget log).
- Provider-specific tokenisers (we already use a coarse
  chars-per-token estimator; tokeniser parity is a separate piece of
  work that should land before this one's accuracy bar tightens).
- A rates *update* mechanism. The registry is a static snapshot;
  refreshing rates is a consumer chore (or a separate CronCreate-fed
  workflow against provider price pages, deliberately out of scope
  here).

## Acceptance

- `ModelRateRegistry.default()` returns a registry with at least the
  nine OpenRouter models infospace-bench bundles today, each carrying
  a `captured_at` timestamp.
- `estimate_cost("openai/gpt-4o-mini", 28000, 7500)` returns a
  `CostEstimate(cost_usd≈0.009, cost_source="rate_table:openai/gpt-4o-mini")`
  matching the Lefevre Chapter-I smoke within 20%.
- Each built-in `ProblemClass.estimate(dimensions, params)` produces a
  `TokenEstimate` and its `fit()` recovers seeded params from a small
  synthetic observation set within 10%.
- The consumer guide in the workplan's notes shows what
  `infospace-bench plan_generation_summary` would look like once it
  delegates to these primitives.

## Risks and open questions

- **Rate drift.** OpenRouter / Anthropic / OpenAI prices change. Each
  rate has a `captured_at`; consumers must decide their freshness
  policy. The default registry's date is the source of truth; a stale
  registry will under- or over-estimate cost but the structure is
  unchanged.
- **Class taxonomy lock-in.** The built-in class names will appear in
  consumer code and ledger tags. Carry a `schema_version` on the
  ProblemClassRegistry from day one; breaking changes need migration.
- **Tokeniser parity.** Current consumer estimators use a coarse
  chars-per-token heuristic; the real tokenisers diverge by ±20%
  across providers. This workplan accepts that for v1; tightening the
  per-class accuracy bar belongs to a follow-on workplan once usage
  observations expose the gap.
- **Fitting on small samples.** Adaptation needs enough observations
  to be statistically meaningful. The fit function should refuse (or
  fall back to seeds) when sample sizes are below a configurable
  threshold.

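A consumer-side freshness policy for the rate-drift risk could be as small as the check below. It is a sketch under two assumptions: that `captured_at` holds an ISO 8601 date or timestamp, and that a fixed age limit is an acceptable policy. The `is_stale` name and the 90-day default are hypothetical.

```python
from datetime import date


def is_stale(captured_at: str, today: date, max_age_days: int = 90) -> bool:
    """Treat a rate as stale when its capture date is missing or too old."""
    if not captured_at:
        # No provenance recorded: assume the rate cannot be trusted.
        return True
    # Keep only the YYYY-MM-DD prefix so full timestamps are accepted too.
    captured = date.fromisoformat(captured_at[:10])
    return (today - captured).days > max_age_days


print(is_stale("2026-05-19", today=date(2026, 6, 1)))
print(is_stale("2026-05-19", today=date(2026, 12, 1)))
```

Passing `today` explicitly keeps the check deterministic in tests; a caller would normally supply `date.today()`.
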
## Downstream effects

- `infospace-bench` `plan_generation_summary` becomes a thin caller
  over llm-connect's `estimate_cost` and the relevant `ProblemClass`.
  The follow-on consumer workplan (`INFOSPACE-WP-NNNN`) describes
  exactly which class maps to each workflow stage.
- `IB-WP-0019`'s budget registry stops needing its own
  `src/infospace_bench/model_rates.yaml`; the workspace override path
  stays but the bundled default lives upstream.
- `IB-WP-0020`'s routing CLI can ask the model rate registry whether a
  `--routing-config` cost cap is realistic for the candidate set, which
  enables defensible `--cost-cap` defaults later.

## Consumer-side follow-up

Once T01-T04 land in llm-connect, `infospace-bench` opens a thin
companion workplan to:

- Replace `_CALLS_PER_CHUNK_BY_WORKFLOW` + `_profile_template_words` +
  `WORDS_PER_TOKEN_DEFAULT` math in `plan_generation_summary` with
  problem-class lookups.
- Map each workflow stage to a problem class
  (`summarize-source` → `chunk-summarization`, etc.) and surface the
  mapping in `docs/routing-task-types.md`.
- Drop `src/infospace_bench/model_rates.yaml` and read from
  `ModelRateRegistry.default()`; keep the workspace override path
  pointed at the same registry.
- Carry forward the existing `--cost-per-1k` override flag (with
  documented semantics: blended single-rate override that wins over
  rate-table lookup when set).

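The override semantics in the last item can be sketched as a precedence rule. The `resolve_cost` helper and the `ASSUMED_RATES` table are hypothetical, the rate values are assumed rather than quoted from this workplan, and reading "blended" as one rate over prompt plus completion tokens is only one possible interpretation.

```python
from __future__ import annotations

# ASSUMED rates (USD per 1k prompt / completion tokens), for illustration only.
ASSUMED_RATES = {"openai/gpt-4o-mini": (0.00015, 0.0006)}


def resolve_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    cost_per_1k: float | None = None,
) -> tuple[float | None, str]:
    """Return (cost_usd, cost_source); a set override beats the rate table."""
    if cost_per_1k is not None:
        # Assumption: the blended rate covers prompt and completion tokens alike.
        total_tokens = prompt_tokens + completion_tokens
        return total_tokens / 1000 * cost_per_1k, "override"
    rate = ASSUMED_RATES.get(model_id)
    if rate is None:
        return None, "unknown"
    prompt_rate, completion_rate = rate
    cost = (
        prompt_tokens / 1000 * prompt_rate
        + completion_tokens / 1000 * completion_rate
    )
    return cost, f"rate_table:{model_id}"


with_flag = resolve_cost("openai/gpt-4o-mini", 28000, 7500, cost_per_1k=0.30)
without_flag = resolve_cost("openai/gpt-4o-mini", 28000, 7500)
print(with_flag, without_flag)
```

Testing `cost_per_1k is not None` rather than truthiness means an explicit `0.0` still counts as an override, and the `cost_source` value tells the reader of a plan which path produced the number.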