coulomb/llm-connect

generated from coulomb/repo-seed

tegwick c11c6afa3f

Implement-LLM-WP-0005-cost-model-estimators

2026-05-19 05:02:20 +02:00

id, type, title, domain, repo, status, owner, planning_priority, planning_order, created, updated, depends_on_workplans, related_workplans, state_hub_workstream_id

LLM-WP-0005

workplan

Cost Model and Problem-Class Token Estimators

custodian

llm-connect

finished

llm-connect

high

2026-05-19

LLM-WP-0003

LLM-WP-0004

IB-WP-0019

IB-WP-0020

869196c5-551b-4eef-b8d8-cca6f770a9b0

LLM-WP-0005 — Cost Model and Problem-Class Token Estimators

status: finished owner: llm-connect

Purpose

Move two consumer-side concerns into llm-connect as first-class primitives:

Model rate registry. Provider-specific USD-per-1k prompt and completion rates plus capture provenance — a fact about the base model itself, not the application using it.
Problem-class token estimators. Generic shapes ("summarise a chunk of N words", "extract entities from M paragraphs", "judge an artifact against K criteria") with a small base-variable surface and a few tunable parameters. Consumers select the class, supply problem dimensions, and get a predicted (prompt_tokens, completion_tokens) estimate before any call goes out. The cost model then converts the estimate into USD using the rate registry.

Today the cost-rate table and a coarse word-count estimator live in consumer code (infospace-bench/src/infospace_bench/model_rates.yaml and plan_generation_summary in generator.py). The Lefevre live-run smoke surfaced two consequences:

A user supplying --cost-per-1k 0.30 as a blended rate produced a plan estimate roughly 1000× larger than the actual gpt-4o-mini bill, because the consumer's estimator has no notion of per-model rates and the user has no easy way to pick the right number.
The token estimator multiplies word counts by a constant and ignores problem shape — actual prompts ran ~3× larger than the word-count estimate predicted because templates, profile content, and entity context dominate the prompt body, not the chunk text.

Both gaps recur in every llm-connect consumer (infospace-bench today; inter-hub, markitect, and future repos tomorrow). Owning the primitives here means each consumer wires structure-specific dimensions and parameters but never re-implements rates or generic shapes.

Demand signal

infospace-bench is the first concrete consumer. The Lefevre Chapter-I smoke (2026-05-18) ran 32 calls / 28k prompt tokens / $0.0088 actual against a planned $8.40; the roughly 1000× discrepancy is entirely on the consumer side of the estimator. IB-WP-0019 (budget registry) already records the shape llm-connect would need to learn from (output/budget/usage.yaml, the llm-connect QualityLedger).
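As a quick check on the size of that gap, the figures quoted above can be put side by side. This is only arithmetic on the numbers in this section. The flat-rate formula is an assumption about how the consumer costed the plan; it is included because costing 28k tokens at the blended 0.30 rate happens to land on the planned $8.40.

```python
# Figures quoted from the Lefevre Chapter-I smoke in this section.
PROMPT_TOKENS = 28_000
BLENDED_RATE_PER_1K = 0.30   # the user-supplied --cost-per-1k value
ACTUAL_COST_USD = 0.0088


def blended_plan_cost(tokens: int, rate_per_1k: float) -> float:
    """Cost a token count at one flat per-1k rate (assumed consumer formula)."""
    return tokens / 1000 * rate_per_1k


planned = blended_plan_cost(PROMPT_TOKENS, BLENDED_RATE_PER_1K)
overshoot = planned / ACTUAL_COST_USD

print(f"planned ${planned:.2f} vs actual ${ACTUAL_COST_USD:.4f}")
print(f"overshoot factor: {overshoot:.0f}x")
```

The ratio comes out a little under 1000, which is why the text calls it a roughly 1000× gap.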

Architecture sketch (read before writing tasks)

Three new modules in llm-connect:

llm_connect/
  rates.py            # ModelRate, ModelRateRegistry, default registry
  costs.py            # CostModel: tokens × rate → USD
  problem_classes.py  # ProblemClass protocol, built-in classes, registry

`rates.py`

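from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float        # price per 1k prompt tokens
    completion_per_1k: float    # price per 1k completion tokens
    currency: str = "USD"
    source_url: str = ""        # provenance: where the rate was read
    captured_at: str = ""       # provenance: when the rate was captured

class ModelRateRegistry:
    def get(self, model_id: str) -> ModelRate | None: ...
    def all(self) -> dict[str, ModelRate]: ...
    @classmethod
    def default(cls) -> "ModelRateRegistry": ...
    @classmethod
    def from_yaml(cls, path: Path | str) -> "ModelRateRegistry": ...
    # Returns a new registry in which entries from `override` win.
    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry": ...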

The default registry ships the nine well-known OpenRouter models that infospace-bench bundles today. Consumers can load overrides from their own YAML with from_yaml and layer them over the default with merged_with.
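A minimal sketch of the override layering, assuming merged_with returns a new registry in which the override's entries win. The model ids, prices, and dates below are placeholders, not real rates, and the constructor taking a dict is a hypothetical convenience (YAML loading is left out to keep the sketch dependency-free).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float
    currency: str = "USD"
    source_url: str = ""
    captured_at: str = ""


class ModelRateRegistry:
    def __init__(self, rates: dict[str, ModelRate] | None = None) -> None:
        # Copy so the caller's dict cannot mutate the registry later.
        self._rates = dict(rates or {})

    def get(self, model_id: str) -> ModelRate | None:
        return self._rates.get(model_id)

    def all(self) -> dict[str, ModelRate]:
        return dict(self._rates)

    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry":
        # Later keys win, so override entries replace bundled ones.
        return ModelRateRegistry({**self._rates, **override._rates})


# Placeholder data: hypothetical model ids and prices, for illustration only.
bundled = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.001, 0.002, captured_at="2026-05-19"),
    "example/model-b": ModelRate("example/model-b", 0.010, 0.030, captured_at="2026-05-19"),
})
workspace = ModelRateRegistry({
    "example/model-a": ModelRate("example/model-a", 0.0005, 0.001, captured_at="2026-06-01"),
})

effective = bundled.merged_with(workspace)
print(effective.get("example/model-a").prompt_per_1k)  # override value
print(effective.get("example/model-b").prompt_per_1k)  # bundled value
```

Returning a new registry rather than mutating in place keeps the bundled default reusable across consumers in the same process.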

`costs.py`

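from __future__ import annotations

from dataclasses import dataclass

from llm_connect.rates import ModelRateRegistry


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None  # None when the model has no known rate
    cost_source: str  # "rate_table:<model>" | "unknown" | "override"
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None

def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    registry: ModelRateRegistry | None = None,  # None means the default registry
) -> CostEstimate: ...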

estimate_cost is a pure function of the rate registry and the token counts. It serves both preview (plan time) and post-hoc verification (comparing adapter-reported cost against the rate-table estimate).
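One way the function body might look, as a sketch rather than the shipped implementation. A plain dict stands in for ModelRateRegistry to keep the block self-contained, and the gpt-4o-mini rates (0.00015 and 0.0006 USD per 1k tokens) are an assumption about list prices that should be verified against the actual registry.

```python
from __future__ import annotations

from dataclasses import dataclass

# ASSUMPTION: illustrative USD-per-1k rates (prompt, completion); verify before use.
_RATES: dict[str, tuple[float, float]] = {
    "openai/gpt-4o-mini": (0.00015, 0.0006),
}


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    rates: dict[str, tuple[float, float]] | None = None,
) -> CostEstimate:
    table = _RATES if rates is None else rates
    rate = table.get(model_id)
    if rate is None:
        # Unknown model: report that explicitly instead of guessing a price.
        return CostEstimate(cost_usd=None, cost_source="unknown")
    prompt_per_1k, completion_per_1k = rate
    prompt_cost = prompt_tokens / 1000 * prompt_per_1k
    completion_cost = completion_tokens / 1000 * completion_per_1k
    return CostEstimate(
        cost_usd=prompt_cost + completion_cost,
        cost_source=f"rate_table:{model_id}",
        prompt_cost_usd=prompt_cost,
        completion_cost_usd=completion_cost,
    )


smoke = estimate_cost("openai/gpt-4o-mini", 28_000, 7_500)
print(smoke.cost_source, round(smoke.cost_usd, 4))
```

Under the assumed rates, the smoke-sized call lands close to the $0.0088 the Lefevre run actually cost.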

`problem_classes.py`

from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5   # 0..1, learned over time

class ProblemClass(Protocol):
    name: str
    base_dimensions: tuple[str, ...]   # e.g. ("chunk_words",)
    tunable_params: tuple[str, ...]    # e.g. ("template_tokens", "completion_ratio")

    # `params` overrides the class defaults; None means use the defaults.
    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate: ...

Built-in classes (one per common workflow shape):

| Class | Dimensions | Tunable params |
| --- | --- | --- |
| `chunk-summarization` | `chunk_words`, `template_words` | `completion_ratio` (default 0.25) |
| `entity-extraction` | `chunk_words`, `template_words`, `expected_entities` | `tokens_per_entity` (default 70) |
| `relation-extraction` | `chunk_words`, `template_words`, `expected_relations` | `tokens_per_relation` (default 80) |
| `judge-eval` | `artifact_words`, `template_words`, `n_criteria` | `tokens_per_criterion` (default 35) |
| `report-synthesis` | `n_chunks`, `n_entities`, `n_relations`, `template_words` | `base_completion_tokens` (default 400) |

Each class registers under a name; a ProblemClassRegistry keeps the mapping. Consumers reference classes by name; advanced consumers can register their own.
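To make the shape concrete, here is a sketch of one built-in class and a name-keyed registry. The dimensions and the tokens_per_entity default come from the table; the formulas are an assumption (prompt scales with chunk plus template words, completion with expected entities), and TOKENS_PER_WORD is a hypothetical coarse conversion factor, not a value from this workplan.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# ASSUMPTION: coarse words-to-tokens factor, chosen only for illustration.
TOKENS_PER_WORD = 1.3


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5


class EntityExtraction:
    name = "entity-extraction"
    base_dimensions = ("chunk_words", "template_words", "expected_entities")
    tunable_params = ("tokens_per_entity",)
    defaults = {"tokens_per_entity": 70}

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate:
        missing = [d for d in self.base_dimensions if d not in dimensions]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        # Caller-supplied params override the seed defaults.
        p = {**self.defaults, **(params or {})}
        prompt_words = dimensions["chunk_words"] + dimensions["template_words"]
        prompt_tokens = round(prompt_words * TOKENS_PER_WORD)
        completion_tokens = round(dimensions["expected_entities"] * p["tokens_per_entity"])
        return TokenEstimate(prompt_tokens, completion_tokens)


# Hypothetical registry: a plain name -> instance mapping.
REGISTRY = {cls.name: cls() for cls in (EntityExtraction,)}

dims = {"chunk_words": 800, "template_words": 1200, "expected_entities": 12}
estimate = REGISTRY["entity-extraction"].estimate(dims)
print(estimate)
```

Note that template_words is larger than chunk_words in the example, mirroring the observation that templates and context dominate the prompt body.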

Defaults and learning

Each built-in class ships seed parameters chosen from common practice (roughly aligned with what infospace-bench observed on the Lefevre smoke: ~1.0 prompt-words-per-chunk-word, ~0.2 completion ratio, ~70 tokens per emitted entity). Two adaptation hooks:

ProblemClass.fit(observations: list[Observation]) adjusts params to best fit observed (dimensions → actual_tokens) pairs.
A small CLI (llm-connect rates show / llm-connect classes show) inspects the current registry and the learned parameters.

Observation reuses the QualityLedger row shape so existing infra keeps working (LLM-WP-0004 owns the ledger).
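A sketch of what fitting could mean for a single parameter, assuming a least-squares fit through the origin of completion tokens against expected entities. The Observation type here is a simplified stand-in for the ledger row shape, and MIN_OBSERVATIONS is a hypothetical threshold below which the seed value is kept.

```python
from __future__ import annotations

from dataclasses import dataclass

SEED_TOKENS_PER_ENTITY = 70.0  # seed default for entity-extraction
MIN_OBSERVATIONS = 5           # hypothetical threshold, not from the workplan


@dataclass(frozen=True)
class Observation:
    expected_entities: int
    actual_completion_tokens: int


def fit_tokens_per_entity(
    observations: list[Observation],
    *,
    seed: float = SEED_TOKENS_PER_ENTITY,
    min_observations: int = MIN_OBSERVATIONS,
) -> float:
    """Least-squares slope through the origin; falls back to the seed on small samples."""
    usable = [o for o in observations if o.expected_entities > 0]
    if len(usable) < min_observations:
        return seed
    numerator = sum(o.expected_entities * o.actual_completion_tokens for o in usable)
    denominator = sum(o.expected_entities ** 2 for o in usable)
    return numerator / denominator


# Synthetic observations: 70 tokens per entity plus a little noise.
synthetic = [
    Observation(4, 290),
    Observation(8, 545),
    Observation(10, 705),
    Observation(12, 832),
    Observation(20, 1420),
    Observation(6, 420),
]
fitted = fit_tokens_per_entity(synthetic)
print(round(fitted, 2))
```

With only two or three observations the function returns the seed unchanged, which keeps a handful of atypical calls from skewing the estimator.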

Tasks

id: T01
title: 'ModelRate + ModelRateRegistry data model, YAML loader, default-registry seed of nine OpenRouter models'
priority: high
status: done
state_hub_task_id: "535d3f12-911e-4b6a-87c3-b539c5986671"

id: T02
title: 'CostModel.estimate_cost() pure function; tests for known model, unknown model, registry override, zero-token edge'
priority: high
status: done
state_hub_task_id: "691dd985-6a97-432d-8bf0-6cb99a9fbdcc"

id: T03
title: 'ProblemClass protocol + TokenEstimate + ProblemClassRegistry'
priority: high
status: done
state_hub_task_id: "ecf263d2-f40a-460e-9195-4e01135ef727"

id: T04
title: 'Built-in classes: chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis'
priority: high
status: done
state_hub_task_id: "f1860b10-7467-4ce3-9775-ab293cef3ed0"

id: T05
title: 'ProblemClass.fit() adapts tunable params from QualityLedger observations'
priority: medium
status: done
state_hub_task_id: "950b74e9-ede8-477a-b6b7-c7af423d4ebb"

id: T06
title: 'CLI helpers: llm-connect rates show, llm-connect classes show, llm-connect classes fit <ledger>'
priority: medium
status: done
state_hub_task_id: "c47eca5f-4cb3-4f88-ac1b-38a9ae18e7e6"

id: T07
title: 'Functional contract docs under contracts/functional/ for rates, costs, and problem classes'
priority: medium
status: done
state_hub_task_id: "c15fd1dc-48c3-40e9-abca-ba3ffe3684f9"

id: T08
title: 'Consumer migration note for infospace-bench: replace plan_generation_summary cost+token math with llm-connect calls'
priority: medium
status: done
state_hub_task_id: "2993932a-334c-49f9-bb74-6ef4d3cbffcb"

Scope guardrails

In scope:

Data models, the registries, the built-in classes, and the fit-from-observations helper.
A default rate registry seeded with publicly known OpenRouter list prices (keeping it fresh is the consumer's responsibility).
CLI helpers for inspection.

Out of scope:

Per-call billing or accounting infrastructure (consumer's job — see IB-WP-0019 for infospace-bench's per-infospace budget log).
Provider-specific tokenisers (we already use a coarse chars-per-token estimator; tokeniser parity is a separate piece of work that should land before this one's accuracy bar tightens).
A rates update mechanism. The registry is a static snapshot; refreshing rates is a consumer chore (or a separate CronCreate-fed workflow against provider price pages, deliberately out of scope here).

Acceptance

ModelRateRegistry.default() returns a registry with at least the nine OpenRouter models infospace-bench bundles today, each carrying a captured_at timestamp.
estimate_cost("openai/gpt-4o-mini", 28000, 7500) returns a CostEstimate(cost_usd≈0.009, cost_source="rate_table:openai/gpt-4o-mini") matching the Lefevre Chapter-I smoke within 20%.
Each built-in ProblemClass.estimate(dimensions, params) produces a TokenEstimate and its fit() recovers seeded params from a small synthetic observation set within 10%.
The consumer guide in the workplan's notes shows what infospace-bench plan_generation_summary would look like once it delegates to these primitives.

Risks and open questions

Rate drift. OpenRouter / Anthropic / OpenAI prices change. Each rate has a captured_at; consumers must decide their freshness policy. The default registry's date is the source of truth; a stale registry will under- or over-estimate cost but the structure is unchanged.
Class taxonomy lock-in. The built-in class names will appear in consumer code and ledger tags. Bump a schema_version on the ProblemClassRegistry from day one; breaking changes need migration.
Tokeniser parity. Current consumer estimators use a coarse chars-per-token heuristic; the real tokenisers diverge by ±20% across providers. This workplan accepts that for v1; tightening the per-class accuracy bar belongs to a follow-on workplan once usage observations expose the gap.
Fitting on small samples. Adaptation needs enough observations to be statistically meaningful. The fit function should refuse (or fall back to seeds) when sample sizes are below a configurable threshold.
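For the rate-drift item, a consumer-side freshness policy could be as small as the check below. It assumes captured_at holds an ISO 8601 date string (the workplan does not fix the format), and the 90-day default is a hypothetical policy value.

```python
from datetime import date, timedelta


def is_stale(captured_at: str, *, today: date, max_age_days: int = 90) -> bool:
    """Return True when a rate's capture date is missing or older than the policy allows."""
    if not captured_at:
        # No provenance recorded: treat as stale rather than trusting it.
        return True
    captured = date.fromisoformat(captured_at)
    return today - captured > timedelta(days=max_age_days)


reference_day = date(2026, 6, 1)
print(is_stale("2026-05-19", today=reference_day))
print(is_stale("2026-01-01", today=reference_day))
```

Passing today explicitly keeps the check deterministic in tests.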

Downstream effects

infospace-bench plan_generation_summary becomes a thin caller over llm-connect's estimate_cost and the relevant ProblemClass. The follow-on consumer workplan (INFOSPACE-WP-NNNN) describes exactly which class maps to each workflow stage.
IB-WP-0019's budget registry stops needing its own src/infospace_bench/model_rates.yaml; the workspace override path stays but the bundled default lives upstream.
IB-WP-0020's routing CLI can ask the model rate registry whether a --routing-config cost cap is realistic for the candidate set — enables defensible --cost-cap defaults later.
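The cost-cap question in the last item could be answered along these lines. This is a sketch: the function name, the dict-based rate table, and the candidate model ids and prices are all hypothetical.

```python
from __future__ import annotations


def models_within_cap(
    rates: dict[str, tuple[float, float]],
    prompt_tokens: int,
    completion_tokens: int,
    cost_cap_usd: float,
) -> dict[str, float]:
    """Map each candidate whose estimated cost fits under the cap to that cost.

    `rates` maps model_id -> (prompt_per_1k, completion_per_1k) in USD.
    """
    affordable: dict[str, float] = {}
    for model_id, (prompt_per_1k, completion_per_1k) in rates.items():
        cost = (
            prompt_tokens / 1000 * prompt_per_1k
            + completion_tokens / 1000 * completion_per_1k
        )
        if cost <= cost_cap_usd:
            affordable[model_id] = cost
    return affordable


# Placeholder candidates with made-up prices.
candidates = {
    "example/small": (0.0002, 0.0006),
    "example/large": (0.005, 0.015),
}
fits = models_within_cap(candidates, 28_000, 7_500, cost_cap_usd=0.05)
print(sorted(fits))
```

An empty result would signal that the cap is unrealistic for the whole candidate set.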

Consumer-side follow-up

Once T01-T04 land in llm-connect, infospace-bench opens a thin companion workplan to:

Replace _CALLS_PER_CHUNK_BY_WORKFLOW + _profile_template_words + WORDS_PER_TOKEN_DEFAULT math in plan_generation_summary with problem-class lookups.
Map each workflow stage to a problem class (summarize-source → chunk-summarization, etc.) and surface the mapping in docs/routing-task-types.md.
Drop src/infospace_bench/model_rates.yaml and read from ModelRateRegistry.default(); keep the workspace override path pointed at the same registry.
Carry forward the existing --cost-per-1k override flag (with documented semantics: blended single-rate override that wins over rate-table lookup when set).
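The override precedence in the last item might be expressed as follows. This is a sketch under assumptions: plan_cost is a hypothetical helper name, the blended rate is applied to prompt plus completion tokens, and the gpt-4o-mini rates are illustrative values to be checked against the registry.

```python
from __future__ import annotations

# ASSUMPTION: illustrative USD-per-1k rates (prompt, completion).
RATES: dict[str, tuple[float, float]] = {
    "openai/gpt-4o-mini": (0.00015, 0.0006),
}


def plan_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int,
    *,
    cost_per_1k: float | None = None,
) -> tuple[float | None, str]:
    """Return (cost_usd, cost_source); a set --cost-per-1k wins over the rate table."""
    if cost_per_1k is not None:
        total_tokens = prompt_tokens + completion_tokens
        return total_tokens / 1000 * cost_per_1k, "override"
    rate = RATES.get(model_id)
    if rate is None:
        return None, "unknown"
    prompt_per_1k, completion_per_1k = rate
    cost = (
        prompt_tokens / 1000 * prompt_per_1k
        + completion_tokens / 1000 * completion_per_1k
    )
    return cost, f"rate_table:{model_id}"


print(plan_cost("openai/gpt-4o-mini", 28_000, 7_500))
print(plan_cost("openai/gpt-4o-mini", 28_000, 7_500, cost_per_1k=0.30))
```

Surfacing cost_source alongside the number lets a plan summary show when a blended override, not the rate table, produced the estimate.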
