diff --git a/workplans/llm-connect-WP-0005-cost-model-and-problem-class-estimators.md b/workplans/llm-connect-WP-0005-cost-model-and-problem-class-estimators.md
new file mode 100644
index 0000000..57ee49f
--- /dev/null
+++ b/workplans/llm-connect-WP-0005-cost-model-and-problem-class-estimators.md
@@ -0,0 +1,333 @@
+---
+id: LLM-WP-0005
+type: workplan
+title: "Cost Model and Problem-Class Token Estimators"
+domain: custodian
+repo: llm-connect
+status: proposed
+owner: llm-connect
+planning_priority: high
+planning_order: 5
+created: "2026-05-19"
+updated: "2026-05-19"
+depends_on_workplans:
+  - LLM-WP-0003
+  - LLM-WP-0004
+related_workplans:
+  - IB-WP-0019
+  - IB-WP-0020
+state_hub_workstream_id: "869196c5-551b-4eef-b8d8-cca6f770a9b0"
+---
+
+# LLM-WP-0005: Cost Model and Problem-Class Token Estimators
+
+**status:** proposed
+**owner:** llm-connect
+
+## Purpose
+
+Move two consumer-side concerns into llm-connect as first-class
+primitives:
+
+1. **Model rate registry.** Provider-specific USD-per-1k prompt and
+   completion rates plus capture provenance, a fact about the base
+   model itself, not the application using it.
+2. **Problem-class token estimators.** Generic shapes ("summarise a
+   chunk of N words", "extract entities from M paragraphs", "judge an
+   artifact against K criteria") with a small base-variable surface and
+   a few tunable parameters. Consumers select the class, supply problem
+   dimensions, and get a predicted (prompt_tokens, completion_tokens)
+   estimate before any call goes out. The cost model then converts the
+   estimate into USD using the rate registry.
+
+Today the cost-rate table and a coarse word-count estimator live in
+consumer code (`infospace-bench/src/infospace_bench/model_rates.yaml`
+and `plan_generation_summary` in `generator.py`).
+The Lefevre live-run
+smoke surfaced two consequences:
+
+- A user supplying `--cost-per-1k 0.30` as a blended rate produced a
+  plan estimate about 1000× larger than the actual gpt-4o-mini bill,
+  because the consumer's estimator has no notion of per-model rates and
+  the user has no easy way to pick the right number.
+- The token estimator multiplies word counts by a constant and ignores
+  problem shape: actual prompts ran ~3× larger than the word-count
+  estimate predicted because templates, profile content, and entity
+  context dominate the prompt body, not the chunk text.
+
+Both gaps recur in every llm-connect consumer (infospace-bench today;
+inter-hub, markitect, and future repos tomorrow). Owning the primitives
+here means each consumer wires structure-specific dimensions and
+parameters but never re-implements rates or generic shapes.
+
+## Demand signal
+
+`infospace-bench` is the first concrete consumer. The Lefevre Chapter-I
+smoke (2026-05-18) ran 32 calls / 28k prompt tokens / $0.0088 actual vs
+a planned $8.40; the ~1000× variance is entirely on the consumer side
+of the estimator. `IB-WP-0019` (budget registry) already records the
+shape llm-connect would need to learn from (`output/budget/usage.yaml`
+plus the llm-connect `QualityLedger`).
+
+## Architecture sketch (read before writing tasks)
+
+Three new modules in llm-connect:
+
+```
+llm_connect/
+  rates.py            # ModelRate, ModelRateRegistry, default registry
+  costs.py            # CostModel: tokens × rate → USD
+  problem_classes.py  # ProblemClass protocol, built-in classes, registry
+```
+
+### `rates.py`
+
+```python
+from dataclasses import dataclass
+from pathlib import Path
+
+@dataclass(frozen=True)
+class ModelRate:
+    model_id: str
+    prompt_per_1k: float
+    completion_per_1k: float
+    currency: str = "USD"
+    source_url: str = ""
+    captured_at: str = ""
+
+class ModelRateRegistry:
+    def get(self, model_id: str) -> ModelRate | None: ...
+    def all(self) -> dict[str, ModelRate]: ...
+    @classmethod
+    def default(cls) -> "ModelRateRegistry": ...
+    @classmethod
+    def from_yaml(cls, path: Path | str) -> "ModelRateRegistry": ...
+    def merged_with(self, override: "ModelRateRegistry") -> "ModelRateRegistry": ...
+```
+
+The default registry ships a handful of well-known OpenRouter
+models (the same set infospace-bench has today). Consumers can load
+overrides from their own YAML.
+
+### `costs.py`
+
+```python
+from dataclasses import dataclass
+
+from llm_connect.rates import ModelRateRegistry
+
+@dataclass(frozen=True)
+class CostEstimate:
+    cost_usd: float | None
+    cost_source: str  # "rate_table:<model_id>" | "unknown" | "override"
+    prompt_cost_usd: float | None = None
+    completion_cost_usd: float | None = None
+
+def estimate_cost(
+    model_id: str,
+    prompt_tokens: int,
+    completion_tokens: int = 0,
+    *,
+    registry: ModelRateRegistry | None = None,
+) -> CostEstimate: ...
+```
+
+Pure function over rate registry and token counts. Useful both for
+preview (plan time) and post-hoc verification (compare adapter-reported
+cost against rate-table estimate).
+
+### `problem_classes.py`
+
+```python
+from dataclasses import dataclass
+from typing import Any, Protocol
+
+@dataclass(frozen=True)
+class TokenEstimate:
+    prompt_tokens: int
+    completion_tokens: int
+    confidence: float = 0.5  # 0..1, learned over time
+
+class ProblemClass(Protocol):
+    name: str
+    base_dimensions: tuple[str, ...]  # e.g. ("chunk_words",)
+    tunable_params: tuple[str, ...]   # e.g. ("template_tokens", "completion_ratio")
+
+    def estimate(
+        self,
+        dimensions: dict[str, Any],
+        params: dict[str, Any] | None = None,
+    ) -> TokenEstimate: ...
+```
+
+Built-in classes (one per common workflow shape):
+
+| Class | Dimensions | Tunable params |
+|---|---|---|
+| `chunk-summarization` | `chunk_words`, `template_words` | `completion_ratio` (default 0.25) |
+| `entity-extraction` | `chunk_words`, `template_words`, `expected_entities` | `tokens_per_entity` (default 70) |
+| `relation-extraction` | `chunk_words`, `template_words`, `expected_relations` | `tokens_per_relation` (default 80) |
+| `judge-eval` | `artifact_words`, `template_words`, `n_criteria` | `tokens_per_criterion` (default 35) |
+| `report-synthesis` | `n_chunks`, `n_entities`, `n_relations`, `template_words` | `base_completion_tokens` (default 400) |
+
+Each class registers under a name; a `ProblemClassRegistry` keeps the
+mapping. Consumers reference classes by name; advanced consumers can
+register their own.
+
+### Defaults and learning
+
+Each built-in class ships seed parameters chosen from common practice
+(roughly aligned with what infospace-bench observed on the Lefevre
+smoke: ~1.0 prompt-words-per-chunk-word, ~0.2 completion ratio, ~70
+tokens per emitted entity). Two adaptation hooks:
+
+- `ProblemClass.fit(observations: list[Observation])` adjusts params to
+  best fit observed (dimensions → actual_tokens) pairs.
+- A small CLI `llm-connect rates show` / `llm-connect classes show`
+  inspects the current registry and learned parameters.
+
+`Observation` reuses the `QualityLedger` row shape so existing infra
+keeps working (LLM-WP-0004 owns the ledger).
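
A minimal sketch of how one built-in class could implement `estimate` and `fit`, offered as an illustration rather than the planned implementation. Several pieces are assumptions that the workplan does not specify: the `TOKENS_PER_WORD` factor, the reduced `Observation` shape, reading `completion_ratio` as completion tokens divided by prompt tokens, and the `min_samples` threshold used to fall back to seed parameters.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional, Sequence

# Assumed coarse conversion factor; the workplan does not fix one.
TOKENS_PER_WORD = 1.3


@dataclass(frozen=True)
class TokenEstimate:
    prompt_tokens: int
    completion_tokens: int
    confidence: float = 0.5


@dataclass(frozen=True)
class Observation:
    """Hypothetical minimal row; the real shape would follow QualityLedger."""
    prompt_tokens: int
    completion_tokens: int


@dataclass(frozen=True)
class ChunkSummarization:
    name: str = "chunk-summarization"
    completion_ratio: float = 0.25  # seed default from the class table

    def estimate(
        self,
        dimensions: dict[str, Any],
        params: Optional[dict[str, Any]] = None,
    ) -> TokenEstimate:
        # Per-call params win over the instance's seed or fitted value.
        ratio = (params or {}).get("completion_ratio", self.completion_ratio)
        words = dimensions["chunk_words"] + dimensions["template_words"]
        prompt_tokens = round(words * TOKENS_PER_WORD)
        return TokenEstimate(prompt_tokens, round(prompt_tokens * ratio))

    def fit(
        self,
        observations: Sequence[Observation],
        min_samples: int = 5,  # hypothetical threshold
    ) -> "ChunkSummarization":
        total_prompt = sum(o.prompt_tokens for o in observations)
        if len(observations) < min_samples or total_prompt == 0:
            return self  # too little data: keep the current parameters
        total_completion = sum(o.completion_tokens for o in observations)
        return replace(self, completion_ratio=total_completion / total_prompt)


seed = ChunkSummarization()
print(seed.estimate({"chunk_words": 800, "template_words": 200}))
fitted = seed.fit([Observation(1000, 200)] * 5)
print(fitted.completion_ratio)
```

Returning a new frozen instance from `fit` keeps the seed parameters available for comparison, which is one possible way to support the "refuse or fall back to seeds" behaviour raised under the small-sample risk.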
+
+## Tasks
+
+```task
+id: T01
+title: 'ModelRate + ModelRateRegistry data model, YAML loader, default-registry seed of nine OpenRouter models'
+priority: high
+status: todo
+state_hub_task_id: "535d3f12-911e-4b6a-87c3-b539c5986671"
+```
+
+```task
+id: T02
+title: 'CostModel.estimate_cost() pure function; tests for known model, unknown model, registry override, zero-token edge'
+priority: high
+status: todo
+state_hub_task_id: "691dd985-6a97-432d-8bf0-6cb99a9fbdcc"
+```
+
+```task
+id: T03
+title: 'ProblemClass protocol + TokenEstimate + ProblemClassRegistry'
+priority: high
+status: todo
+state_hub_task_id: "ecf263d2-f40a-460e-9195-4e01135ef727"
+```
+
+```task
+id: T04
+title: 'Built-in classes: chunk-summarization, entity-extraction, relation-extraction, judge-eval, report-synthesis'
+priority: high
+status: todo
+state_hub_task_id: "f1860b10-7467-4ce3-9775-ab293cef3ed0"
+```
+
+```task
+id: T05
+title: 'ProblemClass.fit() adapts tunable params from QualityLedger observations'
+priority: medium
+status: todo
+state_hub_task_id: "950b74e9-ede8-477a-b6b7-c7af423d4ebb"
+```
+
+```task
+id: T06
+title: 'CLI helpers: llm-connect rates show, llm-connect classes show, llm-connect classes fit '
+priority: medium
+status: todo
+state_hub_task_id: "c47eca5f-4cb3-4f88-ac1b-38a9ae18e7e6"
+```
+
+```task
+id: T07
+title: 'Functional contract docs under contracts/functional/ for rates, costs, and problem classes'
+priority: medium
+status: todo
+state_hub_task_id: "c15fd1dc-48c3-40e9-abca-ba3ffe3684f9"
+```
+
+```task
+id: T08
+title: 'Consumer migration note for infospace-bench: replace plan_generation_summary cost+token math with llm-connect calls'
+priority: medium
+status: todo
+state_hub_task_id: "2993932a-334c-49f9-bb74-6ef4d3cbffcb"
+```
+
+## Scope guardrails
+
+In scope:
+
+- Data models, the registries, the built-in classes, and the
+  fit-from-observations helper.
+- A default rate registry seeded with publicly known OpenRouter list
+  prices (refresh policy is on the consumer).
+- CLI helpers for inspection.
+
+Out of scope:
+
+- Per-call billing or accounting infrastructure (consumer's job; see
+  `IB-WP-0019` for infospace-bench's per-infospace budget log).
+- Provider-specific tokenisers (we already use a coarse
+  chars-per-token estimator; tokeniser parity is a separate piece of
+  work that should land before this one's accuracy bar tightens).
+- A rates *update* mechanism. The registry is a static snapshot;
+  refreshing rates is a consumer chore (or a separate CronCreate-fed
+  workflow against provider price pages, deliberately out of scope
+  here).
+
+## Acceptance
+
+- `ModelRateRegistry.default()` returns a registry with at least the
+  nine OpenRouter models infospace-bench bundles today, each carrying
+  a `captured_at` timestamp.
+- `estimate_cost("openai/gpt-4o-mini", 28000, 7500)` returns a
+  ``CostEstimate(cost_usd≈0.009, cost_source="rate_table:openai/gpt-4o-mini")``
+  matching the Lefevre Chapter-I smoke within 20%.
+- Each built-in `ProblemClass.estimate(dimensions, params)` produces a
+  `TokenEstimate` and its `fit()` recovers seeded params from a small
+  synthetic observation set within 10%.
+- The consumer guide in the workplan's notes shows what
+  `infospace-bench plan_generation_summary` would look like once it
+  delegates to these primitives.
+
+## Risks and open questions
+
+- **Rate drift.** OpenRouter / Anthropic / OpenAI prices change. Each
+  rate has a `captured_at`; consumers must decide their freshness
+  policy. The default registry's date is the source of truth; a stale
+  registry will under- or over-estimate cost but the structure is
+  unchanged.
+- **Class taxonomy lock-in.** The built-in class names will appear in
+  consumer code and ledger tags. Version the `ProblemClassRegistry`
+  with a `schema_version` from day one; breaking changes need migration.
+- **Tokeniser parity.** Current consumer estimators use a coarse
+  chars-per-token heuristic; the real tokenisers diverge by ±20%
+  across providers. This workplan accepts that for v1; tightening the
+  per-class accuracy bar belongs to a follow-on workplan once usage
+  observations expose the gap.
+- **Fitting on small samples.** Adaptation needs enough observations
+  to be statistically meaningful. The fit function should refuse (or
+  fall back to seeds) when sample sizes are below a configurable
+  threshold.
+
+## Downstream effects
+
+- `infospace-bench` `plan_generation_summary` becomes a thin caller
+  over llm-connect's `estimate_cost` and the relevant `ProblemClass`.
+  The follow-on consumer workplan (`INFOSPACE-WP-NNNN`) describes
+  exactly which class maps to each workflow stage.
+- `IB-WP-0019`'s budget registry stops needing its own
+  `src/infospace_bench/model_rates.yaml`; the workspace override path
+  stays but the bundled default lives upstream.
+- `IB-WP-0020`'s routing CLI can ask the model rate registry whether a
+  `--routing-config` cost cap is realistic for the candidate set,
+  which enables defensible `--cost-cap` defaults later.
+
+## Consumer-side follow-up
+
+Once T01-T04 land in llm-connect, `infospace-bench` opens a thin
+companion workplan to:
+
+- Replace `_CALLS_PER_CHUNK_BY_WORKFLOW` + `_profile_template_words` +
+  `WORDS_PER_TOKEN_DEFAULT` math in `plan_generation_summary` with
+  problem-class lookups.
+- Map each workflow stage to a problem class
+  (`summarize-source` → `chunk-summarization`, etc.) and surface the
+  mapping in `docs/routing-task-types.md`.
+- Drop `src/infospace_bench/model_rates.yaml` and read from
+  `ModelRateRegistry.default()`; keep the workspace override path
+  pointed at the same registry.
+- Carry forward the existing `--cost-per-1k` override flag (with
+  documented semantics: blended single-rate override that wins over
+  rate-table lookup when set).
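
The override semantics in the last bullet can be sketched as follows. This is a hedged illustration, not the llm-connect API: the `cost_per_1k` keyword is a hypothetical stand-in for the `--cost-per-1k` flag, the plain-dict `ASSUMED_RATES` replaces the real `ModelRateRegistry`, the gpt-4o-mini rates are assumed illustrative values that the workplan does not list, and applying the blended rate to prompt plus completion tokens is one possible reading of "blended".

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class ModelRate:
    model_id: str
    prompt_per_1k: float
    completion_per_1k: float


@dataclass(frozen=True)
class CostEstimate:
    cost_usd: Optional[float]
    cost_source: str


# Assumed illustrative rates in USD per 1k tokens; not taken from the workplan.
ASSUMED_RATES = {
    "openai/gpt-4o-mini": ModelRate("openai/gpt-4o-mini", 0.00015, 0.0006),
}


def estimate_cost(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int = 0,
    *,
    rates: Mapping[str, ModelRate] = ASSUMED_RATES,
    cost_per_1k: Optional[float] = None,  # hypothetical stand-in for --cost-per-1k
) -> CostEstimate:
    if cost_per_1k is not None:
        # A blended override wins over any rate-table lookup.
        total_tokens = prompt_tokens + completion_tokens
        return CostEstimate(total_tokens / 1000 * cost_per_1k, "override")
    rate = rates.get(model_id)
    if rate is None:
        return CostEstimate(None, "unknown")
    cost = (
        prompt_tokens / 1000 * rate.prompt_per_1k
        + completion_tokens / 1000 * rate.completion_per_1k
    )
    return CostEstimate(cost, f"rate_table:{model_id}")


table = estimate_cost("openai/gpt-4o-mini", 28000, 7500)
blended = estimate_cost("openai/gpt-4o-mini", 28000, cost_per_1k=0.30)
print(table, blended)
```

Under these assumed rates the table lookup comes to about $0.0087 for 28,000 prompt and 7,500 completion tokens, while a blended 0.30 per 1k over 28,000 tokens comes to $8.40, which is the size of gap the rate registry is meant to close.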