plan: WP-0005 — cost model and problem-class token estimators

Drafted workplan to move two consumer-side concerns into llm-connect:

- ModelRateRegistry: per-model USD-per-1k rates with provenance, a property of the base model, not the application.
- ProblemClass token estimators: generic shapes (chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis) with base dimensions + tunable params; consumer supplies the shape of its problem and gets a TokenEstimate before any call.

Demand signal: the 2026-05-18 infospace-bench Lefevre Chapter-I smoke ran 32 calls / 28k tokens / 0.009 USD actual against a planned 8.40 USD. The 1000x variance was entirely consumer-side because there is no rate table in llm-connect to delegate to.

Three new modules (rates.py, costs.py, problem_classes.py), eight tasks, registered as workstream 869196c5-551b-4eef-b8d8-cca6f770a9b0 under the custodian topic. A follow-on consumer workplan in infospace-bench will migrate plan_generation_summary to delegate once T01-T04 land here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 04:30:52 +02:00
parent 4b685e849c
commit 0054afe689
1 changed files with 333 additions and 0 deletions
--- a/workplans/llm-connect-WP-0005-cost-model-and-problem-class-estimators.md
+++ b/workplans/llm-connect-WP-0005-cost-model-and-problem-class-estimators.md
@@ -0,0 +1,333 @@
+---
+id: LLM-WP-0005
+type: workplan
+title: "Cost Model and Problem-Class Token Estimators"
+domain: custodian
+repo: llm-connect
+status: proposed
+owner: llm-connect
+planning_priority: high
+planning_order: 5
+created: "2026-05-19"
+updated: "2026-05-19"
+depends_on_workplans:
+  - LLM-WP-0003
+  - LLM-WP-0004
+related_workplans:
+  - IB-WP-0019
+  - IB-WP-0020
+state_hub_workstream_id: "869196c5-551b-4eef-b8d8-cca6f770a9b0"
+---
+
+# LLM-WP-0005 — Cost Model and Problem-Class Token Estimators
+
+**status:** proposed
+**owner:** llm-connect
+
+## Purpose
+
+Move two consumer-side concerns into llm-connect as first-class
+primitives:
+
+1. **Model rate registry.** Provider-specific USD-per-1k prompt and
+   completion rates plus capture provenance — a fact about the base
+   model itself, not the application using it.
+2. **Problem-class token estimators.** Generic shapes ("summarise a
+   chunk of N words", "extract entities from M paragraphs", "judge an
+   artifact against K criteria") with a small base-variable surface and
+   a few tunable parameters. Consumers select the class, supply problem
+   dimensions, and get a predicted (prompt_tokens, completion_tokens)
+   estimate before any call goes out. The cost model then converts the
+   estimate into USD using the rate registry.
+
+Today the cost-rate table and a coarse word-count estimator live in
+consumer code (`infospace-bench/src/infospace_bench/model_rates.yaml`
+and `plan_generation_summary` in `generator.py`). The Lefevre live-run
+smoke surfaced two consequences:
+
+- A user supplying `--cost-per-1k 0.30` as a blended rate produced a
+  plan estimate roughly 1000× larger than the actual gpt-4o-mini bill,
+  because the consumer's estimator has no notion of per-model rates and
+  the user has no easy way to pick the right number.
+- The token estimator multiplies word counts by a constant and ignores
+  problem shape — actual prompts ran ~3× larger than the word-count
+  estimate predicted because templates, profile content, and entity
+  context dominate the prompt body, not the chunk text.
+
+Both gaps recur in every llm-connect consumer (infospace-bench today;
+inter-hub, markitect, and future repos tomorrow). Owning the primitives
+here means each consumer wires structure-specific dimensions and
+parameters but never re-implements rates or generic shapes.
+
+## Demand signal
+
+`infospace-bench` is the first concrete consumer. The Lefevre Chapter-I
+smoke (2026-05-18) ran 32 calls / 28k prompt tokens / $0.0088 actual vs
+a planned $8.40 — the 1000× variance is entirely on the consumer side
+of the estimator. `IB-WP-0019` (budget registry) already records the
+shape llm-connect would need to learn from (`output/budget/usage.yaml`
+plus the llm-connect `QualityLedger`).
+
+## Architecture sketch (read before writing tasks)
+
+Three new modules in llm-connect:
+
+```
+llm_connect/
+  rates.py            # ModelRate, ModelRateRegistry, default registry
+  costs.py            # CostEstimate, estimate_cost(): tokens × rate → USD
+  problem_classes.py  # ProblemClass protocol, built-in classes, registry
+```
+
+### `rates.py`
+
+```python
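+from __future__ import annotations
+
+from dataclasses import dataclass
+from pathlib import Path
+
+
+@dataclass(frozen=True)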
+class ModelRate:
+    model_id: str
+    prompt_per_1k: float
+    completion_per_1k: float
+    currency: str = "USD"
+    source_url: str = ""
+    captured_at: str = ""
+
+class ModelRateRegistry:
+    def get(self, model_id: str) -> ModelRate | None: ...
+    def all(self) -> dict[str, ModelRate]: ...
+    @classmethod
+    def default(cls) -> "ModelRateRegistry": ...
+    @classmethod
+    def from_yaml(cls, path: Path | str) -> "ModelRateRegistry": ...
+    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry": ...
+```
+
+The default registry ships a small handful of well-known OpenRouter
+models (the same set infospace-bench has today). Consumers can load
+overrides from their own YAML.
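The override behaviour described above can be sketched as follows. This is a minimal illustration, assuming the registry is backed by a plain dict keyed by `model_id` and that an override entry replaces the default entry wholesale. The constructor shape, the internal `_rates` field, and the sample model ids and rates are hypothetical placeholders, not part of the plan.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float
    currency: str = "USD"
    source_url: str = ""
    captured_at: str = ""


class ModelRateRegistry:
    """Dict-backed registry; the constructor shape is an assumption."""

    def __init__(self, rates: dict[str, ModelRate] | None = None) -> None:
        self._rates: dict[str, ModelRate] = dict(rates or {})

    def get(self, model_id: str) -> ModelRate | None:
        return self._rates.get(model_id)

    def all(self) -> dict[str, ModelRate]:
        # Return a copy so callers cannot mutate the registry in place.
        return dict(self._rates)

    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry":
        # Entries in `override` win; neither input registry is modified.
        return ModelRateRegistry({**self._rates, **override._rates})


# Placeholder rates for illustration only, not real list prices.
base = ModelRateRegistry({
    "vendor/model-a": ModelRate("vendor/model-a", 0.001, 0.002),
    "vendor/model-b": ModelRate("vendor/model-b", 0.010, 0.030),
})
workspace = ModelRateRegistry({
    "vendor/model-a": ModelRate("vendor/model-a", 0.0005, 0.001),
})
merged = base.merged_with(workspace)
```

Returning a new registry from `merged_with` keeps the bundled defaults immutable, so a workspace override cannot leak into other callers that share the default instance.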
+
+### `costs.py`
+
+```python
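+from __future__ import annotations
+
+from dataclasses import dataclass
+
+from llm_connect.rates import ModelRateRegistry
+
+
+@dataclass(frozen=True)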
+class CostEstimate:
+    cost_usd: float | None
+    cost_source: str  # "rate_table:<model>" | "unknown" | "override"
+    prompt_cost_usd: float | None = None
+    completion_cost_usd: float | None = None
+
+def estimate_cost(
+    model_id: str,
+    prompt_tokens: int,
+    completion_tokens: int = 0,
+    *,
+    registry: ModelRateRegistry | None = None,
+) -> CostEstimate: ...
+```
+
+Pure function over rate registry and token counts. Useful both for
+preview (plan time) and post-hoc verification (compare adapter-reported
+cost against rate-table estimate).
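A rough sketch of the pure function, checked against the Lefevre token counts quoted in the acceptance criteria. It assumes gpt-4o-mini rates of 0.00015 USD per 1k prompt tokens and 0.0006 USD per 1k completion tokens; these figures are an assumption to verify against the provider's price page, not values taken from this plan. For brevity the registry is a plain mapping rather than a `ModelRateRegistry`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: float | None
    cost_source: str
    prompt_cost_usd: float | None = None
    completion_cost_usd: float | None = None


# Assumed (prompt_per_1k, completion_per_1k) in USD; verify before use.
ASSUMED_RATES: dict[str, tuple[float, float]] = {
    "openai/gpt-4o-mini": (0.00015, 0.0006),
}


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    registry: dict[str, tuple[float, float]] | None = None,
) -> CostEstimate:
    table = ASSUMED_RATES if registry is None else registry
    rate = table.get(model_id)
    if rate is None:
        # Unknown model: report "no estimate" rather than guessing a price.
        return CostEstimate(cost_usd=None, cost_source="unknown")
    prompt_per_1k, completion_per_1k = rate
    prompt_cost = prompt_tokens / 1000 * prompt_per_1k
    completion_cost = completion_tokens / 1000 * completion_per_1k
    return CostEstimate(
        cost_usd=prompt_cost + completion_cost,
        cost_source=f"rate_table:{model_id}",
        prompt_cost_usd=prompt_cost,
        completion_cost_usd=completion_cost,
    )


lefevre = estimate_cost("openai/gpt-4o-mini", 28_000, 7_500)
unlisted = estimate_cost("vendor/not-listed", 1_000)
```

Under the assumed rates, the Lefevre counts come to about 0.0042 USD for prompts plus 0.0045 USD for completions, roughly 0.0087 USD in total, which sits inside the 20% band around the 0.0088 USD actual.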
+
+### `problem_classes.py`
+
+```python
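+from __future__ import annotations
+
+from dataclasses import dataclass
+from typing import Any, Protocol
+
+
+@dataclass(frozen=True)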
+class TokenEstimate:
+    prompt_tokens: int
+    completion_tokens: int
+    confidence: float = 0.5   # 0..1, learned over time
+
+class ProblemClass(Protocol):
+    name: str
+    base_dimensions: tuple[str, ...]   # e.g. ("chunk_words", "template_words")
+    tunable_params: tuple[str, ...]    # e.g. ("completion_ratio",)
+
+    def estimate(
+        self,
+        dimensions: dict[str, Any],
+        params: dict[str, Any] | None = None,
+    ) -> TokenEstimate: ...
+```
+
+Built-in classes (one per common workflow shape):
+
+| Class | Dimensions | Tunable params |
+|---|---|---|
+| `chunk-summarization` | `chunk_words`, `template_words` | `completion_ratio` (default 0.25) |
+| `entity-extraction` | `chunk_words`, `template_words`, `expected_entities` | `tokens_per_entity` (default 70) |
+| `relation-extraction` | `chunk_words`, `template_words`, `expected_relations` | `tokens_per_relation` (default 80) |
+| `judge-eval` | `artifact_words`, `template_words`, `n_criteria` | `tokens_per_criterion` (default 35) |
+| `report-synthesis` | `n_chunks`, `n_entities`, `n_relations`, `template_words` | `base_completion_tokens` (default 400) |
+
+Each class registers under a name; a `ProblemClassRegistry` keeps the
+mapping. Consumers reference classes by name; advanced consumers can
+register their own.
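One way the `chunk-summarization` row and the name-based registry could fit together is sketched below. Two assumptions are baked in and labelled in the code: a words-to-tokens factor of 0.75 (the plan does not state a value) and the reading that `completion_ratio` scales with the chunk rather than the whole prompt. The `register` and `get` method names are also hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from typing import Any

# Assumed conversion factor; the plan does not state a value.
ASSUMED_WORDS_PER_TOKEN = 0.75


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5


class ChunkSummarization:
    name = "chunk-summarization"
    base_dimensions = ("chunk_words", "template_words")
    tunable_params = ("completion_ratio",)
    default_params = {"completion_ratio": 0.25}

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: dict[str, Any] | None = None,
    ) -> TokenEstimate:
        missing = [d for d in self.base_dimensions if d not in dimensions]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        merged = {**self.default_params, **(params or {})}
        chunk_tokens = dimensions["chunk_words"] / ASSUMED_WORDS_PER_TOKEN
        template_tokens = dimensions["template_words"] / ASSUMED_WORDS_PER_TOKEN
        return TokenEstimate(
            prompt_tokens=math.ceil(chunk_tokens + template_tokens),
            # Assumption: completion length scales with the chunk only.
            completion_tokens=math.ceil(chunk_tokens * merged["completion_ratio"]),
        )


class ProblemClassRegistry:
    def __init__(self) -> None:
        self._classes: dict[str, Any] = {}

    def register(self, problem_class: Any) -> None:
        self._classes[problem_class.name] = problem_class

    def get(self, name: str) -> Any:
        return self._classes[name]


registry = ProblemClassRegistry()
registry.register(ChunkSummarization())
summarizer = registry.get("chunk-summarization")
estimate = summarizer.estimate({"chunk_words": 600, "template_words": 300})
tuned = summarizer.estimate(
    {"chunk_words": 600, "template_words": 300}, {"completion_ratio": 0.5}
)
```

Keeping `template_words` as a dimension rather than a constant lets the estimate reflect prompts where the template outweighs the chunk text.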
+
+### Defaults and learning
+
+Each built-in class ships seed parameters chosen from common practice,
+roughly aligned with what infospace-bench observed on the Lefevre
+smoke: ~1.0 prompt words per chunk word, a ~0.2 completion ratio
+(seeded slightly higher, at 0.25), and ~70 tokens per emitted entity.
+Two hooks support adapting and inspecting them:
+
+- `ProblemClass.fit(observations: list[Observation])` adjusts params to
+  best fit observed (dimensions → actual_tokens) pairs.
+- A small CLI `llm-connect rates show` / `llm-connect classes show`
+  inspects the current registry and learned parameters.
+
+`Observation` reuses the `QualityLedger` row shape so existing infra
+keeps working (LLM-WP-0004 owns the ledger).
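As an illustration of what `fit()` might do for a single tunable parameter, the sketch below fits `completion_ratio` by least squares through the origin and falls back to the seed when there are too few observations. The `Observation` fields, the `min_samples` threshold, and the choice of a no-intercept least-squares fit are all assumptions; the real row shape would follow the `QualityLedger`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """Hypothetical row shape; the real one would follow the QualityLedger."""

    chunk_tokens: int
    actual_completion_tokens: int


def fit_completion_ratio(
    observations: list[Observation],
    *,
    seed: float = 0.25,
    min_samples: int = 5,
) -> float:
    """Least-squares fit of completion = ratio * chunk_tokens (no intercept)."""
    if len(observations) < min_samples:
        # Too few points to trust; keep the seeded default.
        return seed
    sum_xy = sum(o.chunk_tokens * o.actual_completion_tokens for o in observations)
    sum_xx = sum(o.chunk_tokens ** 2 for o in observations)
    if sum_xx == 0:
        # All-zero inputs carry no information about the ratio.
        return seed
    return sum_xy / sum_xx


# Synthetic observations generated with a true ratio of 0.2.
synthetic = [Observation(x, round(x * 0.2)) for x in (400, 600, 800, 1000, 1200)]
fitted = fit_completion_ratio(synthetic)
too_few = fit_completion_ratio(synthetic[:2])
```

On the noise-free synthetic set the fit recovers the generating ratio, and the two-row call returns the seed unchanged.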
+
+## Tasks
+
+```task
+id: T01
+title: 'ModelRate + ModelRateRegistry data model, YAML loader, default-registry seed of nine OpenRouter models'
+priority: high
+status: todo
+state_hub_task_id: "535d3f12-911e-4b6a-87c3-b539c5986671"
+```
+
+```task
+id: T02
+title: 'CostModel.estimate_cost() pure function; tests for known model, unknown model, registry override, zero-token edge'
+priority: high
+status: todo
+state_hub_task_id: "691dd985-6a97-432d-8bf0-6cb99a9fbdcc"
+```
+
+```task
+id: T03
+title: 'ProblemClass protocol + TokenEstimate + ProblemClassRegistry'
+priority: high
+status: todo
+state_hub_task_id: "ecf263d2-f40a-460e-9195-4e01135ef727"
+```
+
+```task
+id: T04
+title: 'Built-in classes: chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis'
+priority: high
+status: todo
+state_hub_task_id: "f1860b10-7467-4ce3-9775-ab293cef3ed0"
+```
+
+```task
+id: T05
+title: 'ProblemClass.fit() adapts tunable params from QualityLedger observations'
+priority: medium
+status: todo
+state_hub_task_id: "950b74e9-ede8-477a-b6b7-c7af423d4ebb"
+```
+
+```task
+id: T06
+title: 'CLI helpers: llm-connect rates show, llm-connect classes show, llm-connect classes fit <ledger>'
+priority: medium
+status: todo
+state_hub_task_id: "c47eca5f-4cb3-4f88-ac1b-38a9ae18e7e6"
+```
+
+```task
+id: T07
+title: 'Functional contract docs under contracts/functional/ for rates, costs, and problem classes'
+priority: medium
+status: todo
+state_hub_task_id: "c15fd1dc-48c3-40e9-abca-ba3ffe3684f9"
+```
+
+```task
+id: T08
+title: 'Consumer migration note for infospace-bench: replace plan_generation_summary cost+token math with llm-connect calls'
+priority: medium
+status: todo
+state_hub_task_id: "2993932a-334c-49f9-bb74-6ef4d3cbffcb"
+```
+
+## Scope guardrails
+
+In scope:
+
+- Data models, the registries, the built-in classes, and the
+  fit-from-observations helper.
+- A default rate registry seeded with publicly known OpenRouter list
+  prices (the refresh policy is the consumer's responsibility).
+- CLI helpers for inspection.
+
+Out of scope:
+
+- Per-call billing or accounting infrastructure (consumer's job — see
+  `IB-WP-0019` for infospace-bench's per-infospace budget log).
+- Provider-specific tokenisers (we already use a coarse
+  chars-per-token estimator; tokeniser parity is a separate piece of
+  work that should land before this one's accuracy bar tightens).
+- A rates *update* mechanism. The registry is a static snapshot;
+  refreshing rates is a consumer chore (or a separate CronCreate-fed
+  workflow against provider price pages, deliberately out of scope
+  here).
+
+## Acceptance
+
+- `ModelRateRegistry.default()` returns a registry with at least the
+  nine OpenRouter models infospace-bench bundles today, each carrying
+  a `captured_at` timestamp.
+- `estimate_cost("openai/gpt-4o-mini", 28000, 7500)` returns a
+  ``CostEstimate(cost_usd≈0.009, cost_source="rate_table:openai/gpt-4o-mini")``
+  matching the Lefevre Chapter-I smoke within 20%.
+- Each built-in `ProblemClass.estimate(dimensions, params)` produces a
+  `TokenEstimate` and its `fit()` recovers seeded params from a small
+  synthetic observation set within 10%.
+- The consumer guide in the workplan's notes shows what
+  `infospace-bench plan_generation_summary` would look like once it
+  delegates to these primitives.
+
+## Risks and open questions
+
+- **Rate drift.** OpenRouter / Anthropic / OpenAI prices change. Each
+  rate has a `captured_at`; consumers must decide their freshness
+  policy. The `captured_at` date on each default entry is the source of
+  truth for its age; a stale registry will under- or over-estimate
+  cost, but the estimate's structure is unchanged.
+- **Class taxonomy lock-in.** The built-in class names will appear in
+  consumer code and ledger tags. Carry a `schema_version` on the
+  ProblemClassRegistry from day one; breaking changes bump it and need
+  a migration.
+- **Tokeniser parity.** Current consumer estimators use a coarse
+  chars-per-token heuristic; the real tokenisers diverge by ±20%
+  across providers. This workplan accepts that for v1; tightening the
+  per-class accuracy bar belongs to a follow-on workplan once usage
+  observations expose the gap.
+- **Fitting on small samples.** Adaptation needs enough observations
+  to be statistically meaningful. The fit function should refuse (or
+  fall back to seeds) when sample sizes are below a configurable
+  threshold.
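For the rate-drift risk above, a consumer-side freshness policy could be as small as the check sketched here. It assumes `captured_at` holds an ISO-8601 date or datetime string and treats an empty value as stale. The `is_stale` helper and the 90-day default are hypothetical, not something the plan prescribes.

```python
from __future__ import annotations

from datetime import date


def is_stale(captured_at: str, *, today: date, max_age_days: int = 90) -> bool:
    """Return True when a rate snapshot is older than the allowed age.

    Assumes `captured_at` starts with an ISO-8601 date such as
    "2026-05-19"; an empty string is treated as stale.
    """
    if not captured_at:
        return True
    # Keep only the date part so datetime strings are accepted too.
    captured = date.fromisoformat(captured_at[:10])
    return (today - captured).days > max_age_days


recent_is_stale = is_stale("2026-05-19", today=date(2026, 6, 1))
old_is_stale = is_stale("2026-05-19T04:30:52", today=date(2026, 12, 1))
blank_is_stale = is_stale("", today=date(2026, 6, 1))
```

Passing `today` explicitly keeps the check deterministic in tests and leaves the clock choice to the caller.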
+
+## Downstream effects
+
+- `infospace-bench` `plan_generation_summary` becomes a thin caller
+  over llm-connect's `estimate_cost` and the relevant `ProblemClass`.
+  The follow-on consumer workplan (`INFOSPACE-WP-NNNN`) describes
+  exactly which class maps to each workflow stage.
+- `IB-WP-0019`'s budget registry stops needing its own
+  `src/infospace_bench/model_rates.yaml`; the workspace override path
+  stays but the bundled default lives upstream.
+- `IB-WP-0020`'s routing CLI can ask the model rate registry whether a
+  `--routing-config` cost cap is realistic for the candidate set —
+  enables defensible `--cost-cap` defaults later.
+
+## Consumer-side follow-up
+
+Once T01-T04 land in llm-connect, `infospace-bench` opens a thin
+companion workplan to:
+
+- Replace `_CALLS_PER_CHUNK_BY_WORKFLOW` + `_profile_template_words` +
+  `WORDS_PER_TOKEN_DEFAULT` math in `plan_generation_summary` with
+  problem-class lookups.
+- Map each workflow stage to a problem class
+  (`summarize-source` → `chunk-summarization`, etc.) and surface the
+  mapping in `docs/routing-task-types.md`.
+- Drop `src/infospace_bench/model_rates.yaml` and read from
+  `ModelRateRegistry.default()`; keep the workspace override path
+  pointed at the same registry.
+- Carry forward the existing `--cost-per-1k` override flag (with
+  documented semantics: blended single-rate override that wins over
+  rate-table lookup when set).
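The override precedence in the last bullet can be sketched as a small resolver. This assumes the blended rate applies to prompt and completion tokens alike and that the rate-table figure has already been computed upstream. The function name, its parameters, and the tuple return shape are illustrative only.

```python
from __future__ import annotations


def resolve_cost_usd(
    model_id: str,
    total_tokens: int,
    rate_table_cost_usd: float | None,
    cost_per_1k_override: float | None = None,
) -> tuple[float | None, str]:
    """Return (cost_usd, cost_source), letting an explicit override win."""
    if cost_per_1k_override is not None:
        # Blended single rate applied to all tokens, ignoring the rate table.
        return total_tokens / 1000 * cost_per_1k_override, "override"
    if rate_table_cost_usd is not None:
        return rate_table_cost_usd, f"rate_table:{model_id}"
    return None, "unknown"


with_override = resolve_cost_usd("openai/gpt-4o-mini", 28_000, 0.0087, 0.30)
from_table = resolve_cost_usd("openai/gpt-4o-mini", 28_000, 0.0087)
no_rate = resolve_cost_usd("vendor/not-listed", 28_000, None)
```

With 28,000 tokens and a 0.30 override this yields about 8.40 USD tagged `override`, while the same call without the flag keeps the rate-table figure. Surfacing `cost_source` next to the amount makes it visible when a blended override has replaced the per-model lookup.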