llm-connect/workplans/llm-connect-WP-0005-cost-model-and-problem-class-estimators.md
tegwick c11c6afa3f
Implement-LLM-WP-0005-cost-model-estimators
2026-05-19 05:02:20 +02:00

id: LLM-WP-0005
type: workplan
title: Cost Model and Problem-Class Token Estimators
domain: custodian
repo: llm-connect
status: finished
owner: llm-connect
planning_priority: high
planning_order: 5
created: 2026-05-19
updated: 2026-05-19
depends_on_workplans: LLM-WP-0003, LLM-WP-0004
related_workplans: IB-WP-0019, IB-WP-0020
state_hub_workstream_id: 869196c5-551b-4eef-b8d8-cca6f770a9b0

LLM-WP-0005 — Cost Model and Problem-Class Token Estimators

status: finished owner: llm-connect

Purpose

Move two consumer-side concerns into llm-connect as first-class primitives:

  1. Model rate registry. Provider-specific USD-per-1k prompt and completion rates plus capture provenance — a fact about the base model itself, not the application using it.
  2. Problem-class token estimators. Generic shapes ("summarise a chunk of N words", "extract entities from M paragraphs", "judge an artifact against K criteria") with a small base-variable surface and a few tunable parameters. Consumers select the class, supply problem dimensions, and get a predicted (prompt_tokens, completion_tokens) estimate before any call goes out. The cost model then converts the estimate into USD using the rate registry.

Today the cost-rate table and a coarse word-count estimator live in consumer code (infospace-bench/src/infospace_bench/model_rates.yaml and plan_generation_summary in generator.py). The Lefevre live-run smoke surfaced two consequences:

  • A user supplying --cost-per-1k 0.30 as a blended rate got a plan estimate roughly 1000× larger than the actual gpt-4o-mini bill, because the consumer's estimator has no notion of per-model rates and the user has no easy way to pick the right number.
  • The token estimator multiplies word counts by a constant and ignores problem shape. Actual prompts ran ~3× larger than the word-count estimate because templates, profile content, and entity context dominate the prompt body, not the chunk text.

Both gaps recur in every llm-connect consumer (infospace-bench today; inter-hub, markitect, and future repos tomorrow). Owning the primitives here means each consumer wires structure-specific dimensions and parameters but never re-implements rates or generic shapes.

Demand signal

infospace-bench is the first concrete consumer. The Lefevre Chapter-I smoke (2026-05-18) ran 32 calls / 28k prompt tokens / $0.0088 actual vs a planned $8.40; the roughly 1000× variance comes entirely from the consumer-side estimator. IB-WP-0019 (budget registry) already records the shape llm-connect would need to learn from (output/budget/usage.yaml + the llm-connect QualityLedger).
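
The size of that gap can be reproduced from the figures quoted above. This is a minimal sketch, not a reconstruction of the consumer's actual code: it assumes the plan priced roughly 28k prompt tokens at the blended 0.30 rate, which is an inference from the numbers rather than something the smoke log states.

```python
# Figures quoted in this workplan (Lefevre Chapter-I smoke, 2026-05-18).
prompt_tokens = 28_000
actual_usd = 0.0088
blended_rate_per_1k = 0.30  # the user-supplied --cost-per-1k value

# Assumption: every prompt token is priced at the blended rate.
planned_usd = prompt_tokens / 1000 * blended_rate_per_1k

# How many times larger the plan was than the real bill.
overestimate = planned_usd / actual_usd

print(f"planned ${planned_usd:.2f}, actual ${actual_usd:.4f}, ratio {overestimate:.0f}x")
```

Under that assumption the planned figure lands on $8.40 and the ratio is a little under 1000×, consistent with the variance reported above.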

Architecture sketch (read before writing tasks)

Three new modules in llm-connect:

llm_connect/
  rates.py            # ModelRate, ModelRateRegistry, default registry
  costs.py            # CostModel: tokens × rate → USD
  problem_classes.py  # ProblemClass protocol, built-in classes, registry

rates.py

from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float       # USD per 1k prompt tokens
    completion_per_1k: float   # USD per 1k completion tokens
    currency: str = "USD"
    source_url: str = ""       # capture provenance: where the rate was read
    captured_at: str = ""      # capture provenance: when the rate was read

class ModelRateRegistry:
    def get(self, model_id: str) -> ModelRate | None: ...
    def all(self) -> dict[str, ModelRate]: ...
    @classmethod
    def default(cls) -> "ModelRateRegistry": ...
    @classmethod
    def from_yaml(cls, path: Path | str) -> "ModelRateRegistry": ...
    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry": ...

The default registry ships a small handful of well-known OpenRouter models (the same set infospace-bench has today). Consumers can load overrides from their own YAML.
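
The override path can be sketched in memory without touching YAML. This is a hedged illustration only: the dict-taking constructor, the "override wins" merge rule, and the example/model-* ids with their rates are assumptions made for the sketch, not the shipped implementation or real list prices.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float
    currency: str = "USD"
    source_url: str = ""
    captured_at: str = ""


class ModelRateRegistry:
    # Hypothetical constructor: the sketch above does not specify one.
    def __init__(self, rates: dict[str, ModelRate] | None = None) -> None:
        self._rates = dict(rates or {})

    def get(self, model_id: str) -> ModelRate | None:
        return self._rates.get(model_id)

    def all(self) -> dict[str, ModelRate]:
        return dict(self._rates)  # copy, so callers cannot mutate the registry

    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry":
        # Assumed semantics: entries in `override` win; neither input is mutated.
        return ModelRateRegistry({**self._rates, **override._rates})


# Placeholder ids and numbers for illustration only, not real list prices.
bundled = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.001, 0.002, captured_at="2026-05-19"),
    "example/model-b": ModelRate("example/model-b", 0.010, 0.030, captured_at="2026-05-19"),
})
workspace = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.0005, 0.001, captured_at="2026-06-01"),
})
effective = bundled.merged_with(workspace)
```

Returning a new registry from merged_with keeps the bundled default immutable, so a consumer's workspace override cannot leak into other callers of the default.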

costs.py

from __future__ import annotations

from dataclasses import dataclass

from llm_connect.rates import ModelRateRegistry

@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None   # None when the model has no known rate
    cost_source: str  # "rate_table:<model>" | "unknown" | "override"
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None

def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    registry: ModelRateRegistry | None = None,
) -> CostEstimate: ...

Pure function over rate registry and token counts. Useful both for preview (plan time) and post-hoc verification (compare adapter-reported cost against rate-table estimate).
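
A minimal sketch of how such a function might be filled in. To stay self-contained it uses a plain dict (the hypothetical RATES) as a stand-in for ModelRateRegistry, and the gpt-4o-mini rates in it (USD 0.15 and 0.60 per 1M tokens) are an assumption that should be checked against the provider's price page before use.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


# Stand-in for ModelRateRegistry: model_id -> (prompt_per_1k, completion_per_1k).
# Assumed rates, not taken from the bundled registry.
RATES: dict[str, tuple[float, float]] = {
    "openai/gpt-4o-mini": (0.00015, 0.0006),
}


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    rates: dict[str, tuple[float, float]] = RATES,
) -> CostEstimate:
    rate = rates.get(model_id)
    if rate is None:
        # Unknown model: report that explicitly instead of guessing a price.
        return CostEstimate(cost_usd=None, cost_source="unknown")
    prompt_per_1k, completion_per_1k = rate
    prompt_cost = prompt_tokens / 1000 * prompt_per_1k
    completion_cost = completion_tokens / 1000 * completion_per_1k
    return CostEstimate(
        cost_usd=prompt_cost + completion_cost,
        cost_source=f"rate_table:{model_id}",
        prompt_cost_usd=prompt_cost,
        completion_cost_usd=completion_cost,
    )


estimate = estimate_cost("openai/gpt-4o-mini", 28_000, 7_500)
print(estimate)
```

With the assumed rates, 28k prompt tokens and 7.5k completion tokens come to about $0.0087, close to the $0.0088 the Lefevre smoke actually cost.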

problem_classes.py

from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol

@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5   # 0..1, learned over time

class ProblemClass(Protocol):
    name: str
    base_dimensions: tuple[str, ...]   # e.g. ("chunk_words",)
    tunable_params: tuple[str, ...]    # e.g. ("template_tokens", "completion_ratio")

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate: ...

Built-in classes (one per common workflow shape):

| Class | Dimensions | Tunable params |
|---|---|---|
| chunk-summarization | chunk_words, template_words | completion_ratio (default 0.25) |
| entity-extraction | chunk_words, template_words, expected_entities | tokens_per_entity (default 70) |
| relation-extraction | chunk_words, template_words, expected_relations | tokens_per_relation (default 80) |
| judge-eval | artifact_words, template_words, n_criteria | tokens_per_criterion (default 35) |
| report-synthesis | n_chunks, n_entities, n_relations, template_words | base_completion_tokens (default 400) |

Each class registers under a name; a ProblemClassRegistry keeps the mapping. Consumers reference classes by name; advanced consumers can register their own.
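
As an illustration of the shape, here is a hedged sketch of what a chunk-summarization class and a name-keyed registry could look like. The formula, the WORDS_PER_TOKEN value, the defaults attribute, and the plain-dict REGISTRY are all assumptions made for the example; only the class name, its dimensions, and the 0.25 default come from the table above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

WORDS_PER_TOKEN = 0.75  # assumed coarse conversion, not a real tokeniser


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5


class ChunkSummarization:
    name = "chunk-summarization"
    base_dimensions = ("chunk_words", "template_words")
    tunable_params = ("completion_ratio",)
    defaults = {"completion_ratio": 0.25}  # hypothetical attribute holding seeds

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate:
        missing = [d for d in self.base_dimensions if d not in dimensions]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        merged = {**self.defaults, **(params or {})}  # caller params win over seeds
        chunk_tokens = dimensions["chunk_words"] / WORDS_PER_TOKEN
        template_tokens = dimensions["template_words"] / WORDS_PER_TOKEN
        # Assumed model: prompt = chunk + template; completion scales with the chunk.
        return TokenEstimate(
            prompt_tokens=round(chunk_tokens + template_tokens),
            completion_tokens=round(chunk_tokens * merged["completion_ratio"]),
        )


# Plain dict standing in for ProblemClassRegistry.
REGISTRY = {ChunkSummarization.name: ChunkSummarization()}

estimate = REGISTRY["chunk-summarization"].estimate(
    {"chunk_words": 600, "template_words": 900}
)
tuned = REGISTRY["chunk-summarization"].estimate(
    {"chunk_words": 600, "template_words": 900}, {"completion_ratio": 0.5}
)
```

Keeping template_words as its own dimension lets the estimate reflect prompts where the template outweighs the chunk, which is the case the Lefevre smoke exposed.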

Defaults and learning

Each built-in class ships seed parameters chosen from common practice (roughly aligned with what infospace-bench observed on the Lefevre smoke: ~1.0 prompt-words-per-chunk-word, ~0.2 completion ratio, ~70 tokens per emitted entity). Two adaptation hooks:

  • ProblemClass.fit(observations: list[Observation]) adjusts params to best fit observed (dimensions → actual_tokens) pairs.
  • A small CLI llm-connect rates show / llm-connect classes show inspects the current registry and learned parameters.

Observation reuses the QualityLedger row shape so existing infra keeps working (LLM-WP-0004 owns the ledger).
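
One way fit() could adjust a single ratio parameter is least squares through the origin. The sketch below is an assumption about the method, not the implemented one: observations are simplified to (chunk_tokens, actual_completion_tokens) tuples instead of QualityLedger rows, and fit_completion_ratio and MIN_OBSERVATIONS are hypothetical names.

```python
from __future__ import annotations

MIN_OBSERVATIONS = 5  # assumed threshold; the real value would be configurable


def fit_completion_ratio(
    observations: list[tuple[float, float]],
    seed: float = 0.25,
    min_observations: int = MIN_OBSERVATIONS,
) -> float:
    """Fit completion_tokens ~= ratio * chunk_tokens by least squares.

    Falls back to the seed value when too few usable observations exist.
    """
    usable = [(x, y) for x, y in observations if x > 0]
    if len(usable) < min_observations:
        return seed
    # Closed form for a line through the origin: ratio = sum(x*y) / sum(x*x).
    numerator = sum(x * y for x, y in usable)
    denominator = sum(x * x for x, _ in usable)
    return numerator / denominator


# Synthetic observations generated with a true ratio of 0.2.
observations = [(x, 0.2 * x) for x in (400, 800, 1200, 1600, 2000)]
fitted = fit_completion_ratio(observations)
fallback = fit_completion_ratio(observations[:2])
```

On the synthetic set the fit recovers the generating ratio, and with only two observations it returns the seed unchanged.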

Tasks

- id: T01
  title: 'ModelRate + ModelRateRegistry data model, YAML loader, default-registry seed of nine OpenRouter models'
  priority: high
  status: done
  state_hub_task_id: "535d3f12-911e-4b6a-87c3-b539c5986671"
- id: T02
  title: 'CostModel.estimate_cost() pure function; tests for known model, unknown model, registry override, zero-token edge'
  priority: high
  status: done
  state_hub_task_id: "691dd985-6a97-432d-8bf0-6cb99a9fbdcc"
- id: T03
  title: 'ProblemClass protocol + TokenEstimate + ProblemClassRegistry'
  priority: high
  status: done
  state_hub_task_id: "ecf263d2-f40a-460e-9195-4e01135ef727"
- id: T04
  title: 'Built-in classes: chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis'
  priority: high
  status: done
  state_hub_task_id: "f1860b10-7467-4ce3-9775-ab293cef3ed0"
- id: T05
  title: 'ProblemClass.fit() adapts tunable params from QualityLedger observations'
  priority: medium
  status: done
  state_hub_task_id: "950b74e9-ede8-477a-b6b7-c7af423d4ebb"
- id: T06
  title: 'CLI helpers: llm-connect rates show, llm-connect classes show, llm-connect classes fit <ledger>'
  priority: medium
  status: done
  state_hub_task_id: "c47eca5f-4cb3-4f88-ac1b-38a9ae18e7e6"
- id: T07
  title: 'Functional contract docs under contracts/functional/ for rates, costs, and problem classes'
  priority: medium
  status: done
  state_hub_task_id: "c15fd1dc-48c3-40e9-abca-ba3ffe3684f9"
- id: T08
  title: 'Consumer migration note for infospace-bench: replace plan_generation_summary cost+token math with llm-connect calls'
  priority: medium
  status: done
  state_hub_task_id: "2993932a-334c-49f9-bb74-6ef4d3cbffcb"

Scope guardrails

In scope:

  • Data models, the registries, the built-in classes, and the fit-from-observations helper.
  • A default rate registry seeded with publicly known OpenRouter list prices (refresh policy is on the consumer).
  • CLI helpers for inspection.

Out of scope:

  • Per-call billing or accounting infrastructure (consumer's job — see IB-WP-0019 for infospace-bench's per-infospace budget log).
  • Provider-specific tokenisers (we already use a coarse chars-per-token estimator; tokeniser parity is a separate piece of work that should land before this one's accuracy bar tightens).
  • A rates update mechanism. The registry is a static snapshot; refreshing rates is a consumer chore (or a separate CronCreate-fed workflow against provider price pages, deliberately out of scope here).

Acceptance

  • ModelRateRegistry.default() returns a registry with at least the nine OpenRouter models infospace-bench bundles today, each carrying a captured_at timestamp.
  • estimate_cost("openai/gpt-4o-mini", 28000, 7500) returns a CostEstimate(cost_usd≈0.009, cost_source="rate_table:openai/gpt-4o-mini") matching the Lefevre Chapter-I smoke within 20%.
  • Each built-in ProblemClass.estimate(dimensions, params) produces a TokenEstimate and its fit() recovers seeded params from a small synthetic observation set within 10%.
  • The consumer guide in the workplan's notes shows what infospace-bench plan_generation_summary would look like once it delegates to these primitives.

Risks and open questions

  • Rate drift. OpenRouter / Anthropic / OpenAI prices change. Each rate has a captured_at; consumers must decide their freshness policy. The default registry's date is the source of truth; a stale registry will under- or over-estimate cost but the structure is unchanged.
  • Class taxonomy lock-in. The built-in class names will appear in consumer code and ledger tags. Bump a schema_version on the ProblemClassRegistry from day one; breaking changes need migration.
  • Tokeniser parity. Current consumer estimators use a coarse chars-per-token heuristic; the real tokenisers diverge by ±20% across providers. This workplan accepts that for v1; tightening the per-class accuracy bar belongs to a follow-on workplan once usage observations expose the gap.
  • Fitting on small samples. Adaptation needs enough observations to be statistically meaningful. The fit function should refuse (or fall back to seeds) when sample sizes are below a configurable threshold.
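
A consumer's freshness policy for rate drift could be as small as the check below. This is a sketch under assumptions: is_stale is a hypothetical helper, captured_at is assumed to be an ISO date string (YYYY-MM-DD), and the 90-day default is arbitrary.

```python
from datetime import date


def is_stale(captured_at: str, today: date, max_age_days: int = 90) -> bool:
    """Return True when a rate is older than the consumer's freshness window."""
    if not captured_at:
        # No provenance recorded: treat the rate as stale rather than trust it.
        return True
    age = today - date.fromisoformat(captured_at)
    return age.days > max_age_days


fresh = is_stale("2026-05-19", today=date(2026, 6, 1))
old = is_stale("2026-05-19", today=date(2026, 12, 1))
unknown = is_stale("", today=date(2026, 6, 1))
```

Passing today explicitly keeps the check a pure function, which makes the policy easy to test without patching the clock.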

Downstream effects

  • infospace-bench plan_generation_summary becomes a thin caller over llm-connect's estimate_cost and the relevant ProblemClass. The follow-on consumer workplan (INFOSPACE-WP-NNNN) describes exactly which class maps to each workflow stage.
  • IB-WP-0019's budget registry stops needing its own src/infospace_bench/model_rates.yaml; the workspace override path stays but the bundled default lives upstream.
  • IB-WP-0020's routing CLI can ask the model rate registry whether a --routing-config cost cap is realistic for the candidate set, which would enable defensible --cost-cap defaults later.

Consumer-side follow-up

Once T01-T04 land in llm-connect, infospace-bench opens a thin companion workplan to:

  • Replace _CALLS_PER_CHUNK_BY_WORKFLOW + _profile_template_words + WORDS_PER_TOKEN_DEFAULT math in plan_generation_summary with problem-class lookups.
  • Map each workflow stage to a problem class (summarize-source → chunk-summarization, etc.) and surface the mapping in docs/routing-task-types.md.
  • Drop src/infospace_bench/model_rates.yaml and read from ModelRateRegistry.default(); keep the workspace override path pointed at the same registry.
  • Carry forward the existing --cost-per-1k override flag (with documented semantics: blended single-rate override that wins over rate-table lookup when set).
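
The precedence rule for the override flag can be sketched as follows. The helper name resolve_cost and its parameters are hypothetical, and applying the blended rate to prompt plus completion tokens is an assumption about what "blended" means here; the source labels follow the cost_source values listed for CostEstimate.

```python
from __future__ import annotations


def resolve_cost(
    prompt_tokens: int,
    completion_tokens: int,
    *,
    table_rate: tuple[float, float] | None = None,  # (prompt_per_1k, completion_per_1k)
    cost_per_1k: float | None = None,               # value of --cost-per-1k, if given
) -> tuple[float | None, str]:
    """Return (cost_usd, cost_source), letting the blended override win."""
    if cost_per_1k is not None:
        # Override set: ignore the rate table and price all tokens at one rate.
        total_tokens = prompt_tokens + completion_tokens
        return total_tokens / 1000 * cost_per_1k, "override"
    if table_rate is None:
        return None, "unknown"
    prompt_per_1k, completion_per_1k = table_rate
    cost = (
        prompt_tokens / 1000 * prompt_per_1k
        + completion_tokens / 1000 * completion_per_1k
    )
    return cost, "rate_table"


# Illustrative placeholder rates, not real list prices.
with_override = resolve_cost(1000, 1000, table_rate=(0.001, 0.002), cost_per_1k=0.5)
from_table = resolve_cost(1000, 1000, table_rate=(0.001, 0.002))
no_rate = resolve_cost(1000, 1000)
```

Returning the source alongside the cost lets the plan output show when a user-supplied blended rate, rather than the rate table, produced the number.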