| id | type | title | domain | repo | status | owner | planning_priority | planning_order | created | updated | depends_on_workplans | related_workplans | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM-WP-0005 | workplan | Cost Model and Problem-Class Token Estimators | custodian | llm-connect | finished | llm-connect | high | 5 | 2026-05-19 | 2026-05-19 | | | 869196c5-551b-4eef-b8d8-cca6f770a9b0 |
LLM-WP-0005 — Cost Model and Problem-Class Token Estimators
status: finished owner: llm-connect
Purpose
Move two consumer-side concerns into llm-connect as first-class primitives:
- Model rate registry. Provider-specific USD-per-1k prompt and completion rates plus capture provenance — a fact about the base model itself, not the application using it.
- Problem-class token estimators. Generic shapes ("summarise a chunk of N words", "extract entities from M paragraphs", "judge an artifact against K criteria") with a small base-variable surface and a few tunable parameters. Consumers select the class, supply problem dimensions, and get a predicted (prompt_tokens, completion_tokens) estimate before any call goes out. The cost model then converts the estimate into USD using the rate registry.
Today the cost-rate table and a coarse word-count estimator live in
consumer code (infospace-bench/src/infospace_bench/model_rates.yaml
and plan_generation_summary in generator.py). The Lefevre live-run
smoke surfaced two consequences:
- A user supplying --cost-per-1k 0.30 as a blended rate produced a plan estimate 1000× larger than the actual gpt-4o-mini bill, because the consumer's estimator has no notion of per-model rates and the user has no easy way to pick the right number.
- The token estimator multiplies word counts by a constant and ignores problem shape: actual prompts ran ~3× larger than the word-count estimate predicted because templates, profile content, and entity context dominate the prompt body, not the chunk text.
Both gaps recur in every llm-connect consumer (infospace-bench today; inter-hub, markitect, and future repos tomorrow). Owning the primitives here means each consumer wires structure-specific dimensions and parameters but never re-implements rates or generic shapes.
Demand signal
infospace-bench is the first concrete consumer. The Lefevre Chapter-I
smoke (2026-05-18) ran 32 calls / 28k prompt tokens / $0.0088 actual vs
a planned $8.40 — the 1000× variance is entirely on the consumer side
of the estimator. IB-WP-0019 (budget registry) already records the
shape llm-connect would need to learn from (output/budget/usage.yaml
plus the llm-connect QualityLedger).
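The gap in the smoke run can be reproduced with back-of-envelope arithmetic. The sketch below assumes the planned figure came from applying the blended 0.30 rate to the prompt tokens alone, and it assumes per-1k rates of $0.00015 (prompt) and $0.0006 (completion) for gpt-4o-mini. Neither rate is stated in this workplan, so treat both as illustrative. The 7,500 completion tokens are the figure used in the Acceptance section.

```python
# Back-of-envelope check of the Lefevre Chapter-I numbers.
# ASSUMPTION: the per-1k rates below are illustrative, not read from the registry.
PROMPT_TOKENS = 28_000
COMPLETION_TOKENS = 7_500            # figure used in the Acceptance section

BLENDED_PER_1K = 0.30                # what the user passed as --cost-per-1k
ASSUMED_PROMPT_PER_1K = 0.00015
ASSUMED_COMPLETION_PER_1K = 0.0006

# Plan-time estimate: one blended rate applied to the prompt tokens.
planned_usd = PROMPT_TOKENS / 1000 * BLENDED_PER_1K

# Per-model estimate: separate prompt and completion rates.
per_model_usd = (
    PROMPT_TOKENS / 1000 * ASSUMED_PROMPT_PER_1K
    + COMPLETION_TOKENS / 1000 * ASSUMED_COMPLETION_PER_1K
)
overshoot = planned_usd / per_model_usd

print(f"planned ${planned_usd:.2f}, per-model ${per_model_usd:.4f}, overshoot {overshoot:.0f}x")
```

Under these assumptions the blended rate lands on the planned $8.40, while the per-model rates land within a few percent of the $0.0088 actual, which is consistent with the variance sitting on the consumer side.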
Architecture sketch (read before writing tasks)
Three new modules in llm-connect:

    llm_connect/
        rates.py             # ModelRate, ModelRateRegistry, default registry
        costs.py             # CostModel: tokens × rate → USD
        problem_classes.py   # ProblemClass protocol, built-in classes, registry

rates.py

    from __future__ import annotations

    from dataclasses import dataclass
    from pathlib import Path


    @dataclass(frozen=True)
    class ModelRate:
        model_id: str
        prompt_per_1k: float
        completion_per_1k: float
        currency: str = "USD"
        source_url: str = ""
        captured_at: str = ""


    class ModelRateRegistry:
        def get(self, model_id: str) -> ModelRate | None: ...
        def all(self) -> dict[str, ModelRate]: ...
        @classmethod
        def default(cls) -> "ModelRateRegistry": ...
        @classmethod
        def from_yaml(cls, path: Path | str) -> "ModelRateRegistry": ...
        def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry": ...

The default registry ships a small handful of well-known OpenRouter models (the same set infospace-bench has today). Consumers can load overrides from their own YAML.
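A minimal in-memory sketch of how a consumer override could layer over the bundled defaults. It assumes that entries in the override registry win on a model_id collision, which the interface above suggests but does not spell out. YAML loading is left out, and the model ids and rates are placeholders rather than real prices.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float
    currency: str = "USD"
    source_url: str = ""
    captured_at: str = ""


class ModelRateRegistry:
    """In-memory registry only; from_yaml() and default() are omitted here."""

    def __init__(self, rates: dict[str, ModelRate] | None = None) -> None:
        self._rates = dict(rates or {})

    def get(self, model_id: str) -> ModelRate | None:
        return self._rates.get(model_id)

    def all(self) -> dict[str, ModelRate]:
        return dict(self._rates)  # copy, so callers cannot mutate the registry

    def merged_with(self, override: ModelRateRegistry) -> ModelRateRegistry:
        # ASSUMPTION: entries in `override` replace same-named entries in `self`.
        return ModelRateRegistry({**self._rates, **override.all()})


# Placeholder model ids and rates, for illustration only.
bundled = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.001, 0.002, captured_at="2026-05-19"),
    "example/model-b": ModelRate("example/model-b", 0.010, 0.030, captured_at="2026-05-19"),
})
workspace = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.0005, 0.0015, captured_at="2026-05-19"),
})
effective = bundled.merged_with(workspace)
```

Returning a new registry from merged_with keeps the bundled defaults untouched, so a consumer can hold both the upstream snapshot and its own effective view.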
costs.py

    from __future__ import annotations

    from dataclasses import dataclass

    from llm_connect.rates import ModelRateRegistry


    @dataclass(frozen=True)
    class CostEstimate:
        cost_usd: float | None
        cost_source: str                 # "rate_table:<model>" | "unknown" | "override"
        prompt_cost_usd: float | None = None
        completion_cost_usd: float | None = None


    def estimate_cost(
        model_id: str,
        prompt_tokens: int,
        completion_tokens: int = 0,
        *,
        registry: ModelRateRegistry | None = None,
    ) -> CostEstimate: ...

Pure function over rate registry and token counts. Useful both for preview (plan time) and post-hoc verification (compare adapter-reported cost against rate-table estimate).
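One way the function body could look, as a sketch rather than the shipped implementation. To stay self-contained it takes a plain mapping (the hypothetical `rates` argument) in place of ModelRateRegistry, and the gpt-4o-mini rates are assumed values for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    rates: Mapping[str, ModelRate],  # stand-in for ModelRateRegistry in this sketch
) -> CostEstimate:
    rate = rates.get(model_id)
    if rate is None:
        # Unknown model: report "no estimate" rather than guessing a price.
        return CostEstimate(cost_usd=None, cost_source="unknown")
    prompt_cost = prompt_tokens / 1000 * rate.prompt_per_1k
    completion_cost = completion_tokens / 1000 * rate.completion_per_1k
    return CostEstimate(
        cost_usd=prompt_cost + completion_cost,
        cost_source=f"rate_table:{model_id}",
        prompt_cost_usd=prompt_cost,
        completion_cost_usd=completion_cost,
    )


# ASSUMED rates for illustration; the real values live in the default registry.
rates = {"openai/gpt-4o-mini": ModelRate("openai/gpt-4o-mini", 0.00015, 0.0006)}
known = estimate_cost("openai/gpt-4o-mini", 28_000, 7_500, rates=rates)
unknown = estimate_cost("example/not-registered", 1_000, rates=rates)
zero = estimate_cost("openai/gpt-4o-mini", 0, rates=rates)
```

Returning cost_usd=None for an unregistered model, instead of raising, lets plan-time previews keep going and surface the gap through cost_source.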
problem_classes.py

    from __future__ import annotations

    from dataclasses import dataclass
    from typing import Any, Protocol


    @dataclass(frozen=True)
    class TokenEstimate:
        prompt_tokens: int
        completion_tokens: int
        confidence: float = 0.5          # 0..1, learned over time


    class ProblemClass(Protocol):
        name: str
        base_dimensions: tuple[str, ...]   # e.g. ("chunk_words",)
        tunable_params: tuple[str, ...]    # e.g. ("template_tokens", "completion_ratio")

        def estimate(
            self,
            dimensions: dict[str, Any],
            params: dict[str, Any] | None = None,
        ) -> TokenEstimate: ...

Built-in classes (one per common workflow shape):
| Class | Dimensions | Tunable params |
|---|---|---|
| chunk-summarization | chunk_words, template_words | completion_ratio (default 0.25) |
| entity-extraction | chunk_words, template_words, expected_entities | tokens_per_entity (default 70) |
| relation-extraction | chunk_words, template_words, expected_relations | tokens_per_relation (default 80) |
| judge-eval | artifact_words, template_words, n_criteria | tokens_per_criterion (default 35) |
| report-synthesis | n_chunks, n_entities, n_relations, template_words | base_completion_tokens (default 400) |
Each class registers under a name; a ProblemClassRegistry keeps the
mapping. Consumers reference classes by name; advanced consumers can
register their own.
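To make the class-plus-registry idea concrete, here is a sketch of chunk-summarization and a name-keyed registry. The estimation formula, the 0.75 words-per-token conversion, and the register/get method names are all assumptions made for illustration; the workplan only fixes the dimensions and the completion_ratio default.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

WORDS_PER_TOKEN = 0.75  # ASSUMED coarse conversion, not a value from this workplan


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5


class ChunkSummarization:
    name = "chunk-summarization"
    base_dimensions = ("chunk_words", "template_words")
    tunable_params = ("completion_ratio",)
    defaults = {"completion_ratio": 0.25}

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate:
        missing = [d for d in self.base_dimensions if d not in dimensions]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        merged = {**self.defaults, **(params or {})}  # caller params win over seeds
        chunk_tokens = dimensions["chunk_words"] / WORDS_PER_TOKEN
        template_tokens = dimensions["template_words"] / WORDS_PER_TOKEN
        return TokenEstimate(
            prompt_tokens=round(chunk_tokens + template_tokens),
            # Hypothetical formula: the summary scales with the chunk, not the template.
            completion_tokens=round(chunk_tokens * merged["completion_ratio"]),
        )


class ProblemClassRegistry:
    def __init__(self) -> None:
        self._classes: dict[str, Any] = {}

    def register(self, problem_class: Any) -> None:
        self._classes[problem_class.name] = problem_class

    def get(self, name: str) -> Any:
        return self._classes[name]  # raises KeyError for unknown names


registry = ProblemClassRegistry()
registry.register(ChunkSummarization())
summarizer = registry.get("chunk-summarization")
estimate = summarizer.estimate({"chunk_words": 600, "template_words": 300})
tuned = summarizer.estimate(
    {"chunk_words": 600, "template_words": 300}, {"completion_ratio": 0.5}
)
```

Counting template_words in the prompt is the point of the shape: it is the part the word-count estimator in consumer code leaves out.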
Defaults and learning
Each built-in class ships seed parameters chosen from common practice (roughly aligned with what infospace-bench observed on the Lefevre smoke: ~1.0 prompt-words-per-chunk-word, ~0.2 completion ratio, ~70 tokens per emitted entity). Two adaptation hooks:
- ProblemClass.fit(observations: list[Observation]) adjusts params to best fit observed (dimensions → actual_tokens) pairs.
- A small CLI, llm-connect rates show / llm-connect classes show, inspects the current registry and learned parameters.
Observation reuses the QualityLedger row shape so existing infra
keeps working (LLM-WP-0004 owns the ledger).
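As an illustration of what fitting a single parameter could involve, the sketch below estimates completion_ratio with a least-squares slope through the origin. The Observation fields, the function name, and the min_samples threshold of 5 are hypothetical; the real Observation reuses the QualityLedger row shape, and the fitting method is not fixed by this workplan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    # Hypothetical shape; the real one reuses the QualityLedger row.
    chunk_tokens: int
    actual_completion_tokens: int


def fit_completion_ratio(
    observations: list[Observation],
    *,
    seed: float = 0.25,
    min_samples: int = 5,
) -> float:
    """Least-squares slope through the origin: completion ≈ ratio × chunk tokens."""
    if len(observations) < min_samples:
        return seed  # too few points to trust; keep the seeded default
    sum_xy = sum(o.chunk_tokens * o.actual_completion_tokens for o in observations)
    sum_xx = sum(o.chunk_tokens ** 2 for o in observations)
    if sum_xx == 0:
        return seed  # degenerate input: every chunk was empty
    return sum_xy / sum_xx


# Synthetic observations generated from a true ratio of 0.2.
synthetic = [Observation(n, round(n * 0.2)) for n in (400, 650, 800, 1000, 1200, 1500)]
fitted = fit_completion_ratio(synthetic)
too_few = fit_completion_ratio(synthetic[:2])
```

With the six synthetic points the fit recovers the generating ratio, and with only two points it falls back to the seed.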
Tasks
id: T01
title: 'ModelRate + ModelRateRegistry data model, YAML loader, default-registry seed of nine OpenRouter models'
priority: high
status: done
state_hub_task_id: "535d3f12-911e-4b6a-87c3-b539c5986671"
id: T02
title: 'CostModel.estimate_cost() pure function; tests for known model, unknown model, registry override, zero-token edge'
priority: high
status: done
state_hub_task_id: "691dd985-6a97-432d-8bf0-6cb99a9fbdcc"
id: T03
title: 'ProblemClass protocol + TokenEstimate + ProblemClassRegistry'
priority: high
status: done
state_hub_task_id: "ecf263d2-f40a-460e-9195-4e01135ef727"
id: T04
title: 'Built-in classes: chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis'
priority: high
status: done
state_hub_task_id: "f1860b10-7467-4ce3-9775-ab293cef3ed0"
id: T05
title: 'ProblemClass.fit() adapts tunable params from QualityLedger observations'
priority: medium
status: done
state_hub_task_id: "950b74e9-ede8-477a-b6b7-c7af423d4ebb"
id: T06
title: 'CLI helpers: llm-connect rates show, llm-connect classes show, llm-connect classes fit <ledger>'
priority: medium
status: done
state_hub_task_id: "c47eca5f-4cb3-4f88-ac1b-38a9ae18e7e6"
id: T07
title: 'Functional contract docs under contracts/functional/ for rates, costs, and problem classes'
priority: medium
status: done
state_hub_task_id: "c15fd1dc-48c3-40e9-abca-ba3ffe3684f9"
id: T08
title: 'Consumer migration note for infospace-bench: replace plan_generation_summary cost+token math with llm-connect calls'
priority: medium
status: done
state_hub_task_id: "2993932a-334c-49f9-bb74-6ef4d3cbffcb"
Scope guardrails
In scope:
- Data models, the registries, the built-in classes, and the fit-from-observations helper.
- A default rate registry seeded with publicly known OpenRouter list prices (refresh policy is on the consumer).
- CLI helpers for inspection.
Out of scope:
- Per-call billing or accounting infrastructure (consumer's job; see IB-WP-0019 for infospace-bench's per-infospace budget log).
- Provider-specific tokenisers (we already use a coarse chars-per-token estimator; tokeniser parity is a separate piece of work that should land before this one's accuracy bar tightens).
- A rates update mechanism. The registry is a static snapshot; refreshing rates is a consumer chore (or a separate CronCreate-fed workflow against provider price pages, deliberately out of scope here).
Acceptance
- ModelRateRegistry.default() returns a registry with at least the nine OpenRouter models infospace-bench bundles today, each carrying a captured_at timestamp.
- estimate_cost("openai/gpt-4o-mini", 28000, 7500) returns a CostEstimate(cost_usd≈0.009, cost_source="rate_table:openai/gpt-4o-mini") matching the Lefevre Chapter-I smoke within 20%.
- Each built-in ProblemClass.estimate(dimensions, params) produces a TokenEstimate and its fit() recovers seeded params from a small synthetic observation set within 10%.
- The consumer guide in the workplan's notes shows what infospace-bench plan_generation_summary would look like once it delegates to these primitives.
Risks and open questions
- Rate drift. OpenRouter / Anthropic / OpenAI prices change. Each rate has a captured_at; consumers must decide their freshness policy. The default registry's date is the source of truth; a stale registry will under- or over-estimate cost but the structure is unchanged.
- Class taxonomy lock-in. The built-in class names will appear in consumer code and ledger tags. Bump a schema_version on the ProblemClassRegistry from day one; breaking changes need migration.
- Tokeniser parity. Current consumer estimators use a coarse chars-per-token heuristic; the real tokenisers diverge by ±20% across providers. This workplan accepts that for v1; tightening the per-class accuracy bar belongs to a follow-on workplan once usage observations expose the gap.
- Fitting on small samples. Adaptation needs enough observations to be statistically meaningful. The fit function should refuse (or fall back to seeds) when sample sizes are below a configurable threshold.
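For the rate-drift risk, a consumer-side freshness policy could be as small as the check below. This is a sketch: it assumes captured_at holds an ISO date (YYYY-MM-DD), and the is_stale helper and its 90-day default are hypothetical, not part of llm-connect.

```python
from datetime import date


def is_stale(captured_at: str, *, today: date, max_age_days: int = 90) -> bool:
    """True when a rate's capture date is missing or older than the allowed window."""
    if not captured_at:
        return True  # no provenance recorded: treat the rate as stale
    age = today - date.fromisoformat(captured_at)
    return age.days > max_age_days


recent = is_stale("2026-05-19", today=date(2026, 6, 1))    # 13 days old
expired = is_stale("2026-05-19", today=date(2026, 12, 1))  # 196 days old
```

Passing today explicitly keeps the check a pure function, in line with the plan-time preview use of the cost model.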
Downstream effects
- infospace-bench plan_generation_summary becomes a thin caller over llm-connect's estimate_cost and the relevant ProblemClass. The follow-on consumer workplan (INFOSPACE-WP-NNNN) describes exactly which class maps to each workflow stage.
- IB-WP-0019's budget registry stops needing its own src/infospace_bench/model_rates.yaml; the workspace override path stays but the bundled default lives upstream.
- IB-WP-0020's routing CLI can ask the model rate registry whether a --routing-config cost cap is realistic for the candidate set, which enables defensible --cost-cap defaults later.
Consumer-side follow-up
Once T01-T04 land in llm-connect, infospace-bench opens a thin
companion workplan to:
- Replace _CALLS_PER_CHUNK_BY_WORKFLOW + _profile_template_words + WORDS_PER_TOKEN_DEFAULT math in plan_generation_summary with problem-class lookups.
- Map each workflow stage to a problem class (summarize-source → chunk-summarization, etc.) and surface the mapping in docs/routing-task-types.md.
- Drop src/infospace_bench/model_rates.yaml and read from ModelRateRegistry.default(); keep the workspace override path pointed at the same registry.
- Carry forward the existing --cost-per-1k override flag (with documented semantics: blended single-rate override that wins over rate-table lookup when set).
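The override precedence in the last item can be sketched as a small resolver. The resolve_cost name and signature are hypothetical, and applying the blended rate to prompt plus completion tokens combined is an assumption about what "blended" means here.

```python
from __future__ import annotations


def resolve_cost(
    model_id: str,
    total_tokens: int,
    *,
    table_cost_usd: float | None,
    cost_per_1k_override: float | None = None,
) -> tuple[float | None, str]:
    """Return (cost_usd, cost_source); a set override beats the rate table."""
    if cost_per_1k_override is not None:
        # ASSUMPTION: the blended rate applies to prompt + completion tokens together.
        return total_tokens / 1000 * cost_per_1k_override, "override"
    if table_cost_usd is not None:
        return table_cost_usd, f"rate_table:{model_id}"
    return None, "unknown"


from_table = resolve_cost("openai/gpt-4o-mini", 35_500, table_cost_usd=0.0087)
forced = resolve_cost(
    "openai/gpt-4o-mini", 35_500, table_cost_usd=0.0087, cost_per_1k_override=0.30
)
```

Tagging the result with "override" keeps the precedence visible downstream, so a plan built from a hand-picked blended rate is distinguishable from a rate-table estimate.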