diff --git a/workplans/IB-WP-0019-budget-and-usage-registry.md b/workplans/IB-WP-0019-budget-and-usage-registry.md
new file mode 100644
index 0000000..600215d
--- /dev/null
+++ b/workplans/IB-WP-0019-budget-and-usage-registry.md
@@ -0,0 +1,256 @@
+---
+id: IB-WP-0019
+type: workplan
+title: "Budget and Usage Registry for Infospaces"
+domain: markitect
+repo: infospace-bench
+status: todo
+owner: markitect
+topic_slug: markitect
+created: "2026-05-17"
+updated: "2026-05-17"
+depends_on_workplans: []
+related_workplans:
+  - IB-WP-0016
+  - IB-WP-0014
+  - IB-WP-0018
+  - LLM-WP-0004
+---
+
+# IB-WP-0019 — Budget and Usage Registry for Infospaces
+
+## Goal
+
+Persist budget and usage signals at the per-infospace layer and emit
+organizational rollups, so every infospace can answer "what did we
+estimate, what did we actually spend, on which model, at what cost"
+without scraping commit messages or state-hub events.
+
+This workplan owns the *recording and rollup* layer. It does **not**
+own:
+
+- Adaptive routing decisions or per-task quality grading — those belong
+  to `llm-connect` `LLM-WP-0004` and the consumer workplan `IB-WP-0018`.
+- Authoritative provider pricing — we read a small rate table and
+  combine it with adapter-returned usage; the table itself is a static
+  artifact that consumers refresh.
+
+## Why
+
+`IB-WP-0016-T03` made the planning estimates cheap to obtain (chunks,
+calls, tokens, rough USD), but the numbers vanish after the JSON is
+printed. Run records under `output/workflows/runs/*.yaml` capture
+per-call `prompt_tokens` and `completion_tokens` but nothing rolls them
+up, no cost is computed, and there is no plan-vs-actual variance.
+Without this layer:
+
+- Each new infospace re-discovers the same cost surprises
+- `LLM-WP-0004`'s adaptive policy has no per-application history to
+  learn from when it lands
+- `IB-WP-0014`'s archive packages forget the budget shape of the work
+  that produced them
+- State-hub's organizational token ledger stays blind to
+  infospace-bench runs
+
+## Non-Goals
+
+- Owning a cross-application quality ledger (that is `LLM-WP-0004`)
+- Auto-refreshing provider price lists at runtime
+- Failing a `generate run` when state-hub is unreachable
+- Persisting full prompt text for retrospective replay (the existing
+  run records already keep what is needed)
+
+## Layered design (read first)
+
+Three layers, each owned by a different repo:
+
+| Layer | Lives in | Purpose | This workplan? |
+|---|---|---|---|
+| 1. Per-infospace budget log | `infospace-bench` (this workplan) | Plans + usage + variance, archived with the infospace | yes |
+| 2. Cross-application observations | `llm-connect` (LLM-WP-0004) | Per-task per-adapter (cost, tokens, latency, quality) for adaptive routing | no |
+| 3. Organizational rollup | `state-hub` (already exists) | `record_token_event` / `get_token_summary` across all projects | this workplan emits, hub stores |
+
+## Tasks
+
+### T01 — Plan snapshot persistence
+
+```task
+id: IB-WP-0019-T01
+status: todo
+priority: high
+```
+
+- Append the compact `plan_generation_summary` payload to
+  `output/budget/plans.yaml` on every `generate plan` invocation
+- Include a stable `snapshot_id` (hash of relevant fields), the stage,
+  selection filters, and a `recorded_at` timestamp
+- Cap the history length with a configurable retention (default keep
+  last 50 snapshots; older snapshots are pruned with a single rollup
+  entry preserved)
+- Tests: round-trip, retention, repeat plans produce distinct snapshots
+
+### T02 — Usage rollup from run records
+
+```task
+id: IB-WP-0019-T02
+status: todo
+priority: high
+```
+
+- On `run` and `resume` completion, scan the run-record YAML written by
+  the workflow engine and aggregate per-call usage into
+  `output/budget/usage.yaml`
+- Aggregate buckets: workflow, stage, provider, model
+- Fields per bucket: `calls`, `prompt_tokens`, `completion_tokens`,
+  `total_tokens`, `cost_usd_known` (sum over calls with known cost),
+  `cost_usd_estimated` (computed via rate table fallback)
+- Append a top-level `runs[]` entry per completed run with the run's
+  rollup, the `snapshot_id` of the plan it executed against (when one
+  exists), and the wall-clock duration
+- Tests: aggregate across multiple stages, fixture-mode produces zero
+  cost without erroring, missing-usage entries do not abort the rollup
+
+### T03 — Cost computation from a rate table
+
+```task
+id: IB-WP-0019-T03
+status: todo
+priority: high
+```
+
+- Add `docs/model-rates.yaml` with `model -> {prompt_per_1k,
+  completion_per_1k, currency, source_url, captured_at}` for the
+  OpenRouter models we have actually used (start small: the ones
+  currently exercised in tests/smoke)
+- Resolver order: adapter-returned cost (when present) > rate table >
+  unknown (recorded explicitly, not silently zeroed)
+- Allow a per-workspace override via `${workspace}/model-rates.yaml`
+  for self-hosted or private-rate setups
+- Tests: known model, unknown model surfaces as `cost_usd: null` with
+  `cost_status: "unknown"`, override file takes precedence
+
+### T04 — Plan-vs-actual variance and surfacing
+
+```task
+id: IB-WP-0019-T04
+status: todo
+priority: medium
+```
+
+- Compute a small variance record on each run: actual_calls /
+  estimated_calls, actual_tokens / estimated_tokens, actual_cost /
+  estimated_cost, plus per-stage variance
+- Persist to `output/budget/summary.yaml` (overwrite each run; previous
+  versions live in usage.yaml history)
+- Surface a one-line variance summary in
+  `reports/generation-summary.md` (touches T07 of IB-WP-0016)
+- Add the variance summary to `generate status` JSON output
+- Tests: zero-cost fixture run, known-model OpenRouter mock run,
+  missing-plan run (variance fields are null but the run still records)
+
+### T05 — State-hub token-event emission
+
+```task
+id: IB-WP-0019-T05
+status: todo
+priority: medium
+```
+
+- After each completed run, call state-hub `record_token_event` with
+  the run's rollup (tokens in/out, model, USD cost when known,
+  `infospace_slug`, `workspace`)
+- Emit at most one event per run; tag the event with the workplan
+  context when available
+- Failure isolation: a state-hub error must not fail the run; log the
+  failure and continue
+- Honor an opt-out env var `INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS`
+- Tests: monkey-patched hub client, opt-out flag respected, run
+  succeeds when the hub raises
+
+### T06 — Workspace-level rollup CLI
+
+```task
+id: IB-WP-0019-T06
+status: todo
+priority: medium
+```
+
+- Add `infospace-bench budget list ` that walks
+  `infospaces/*/output/budget/` and prints a JSON table:
+  `slug`, `plans_count`, `runs_count`, `total_tokens`,
+  `total_cost_usd_known`, `total_cost_usd_estimated`, `last_run_at`
+- Add `infospace-bench budget show ` that prints the
+  full per-infospace budget structure
+- Tests: empty workspace, multiple infospaces, missing budget dir is
+  treated as zero, not an error
+
+### T07 — Archive integration
+
+```task
+id: IB-WP-0019-T07
+status: todo
+priority: low
+```
+
+- Confirm `output/budget/` ends up inside the archive package built by
+  `IB-WP-0014`'s `archive_infospace()` (it should, via the existing
+  default-include rules — verify with a test)
+- Add a `budget_summary` field to the archive manifest so
+  catalog-level tools can find the cost shape of an archived infospace
+  without unpacking it
+
+## Acceptance
+
+- A `generate plan` invocation persists a snapshot to
+  `output/budget/plans.yaml` and is idempotent across runs
+- A `generate run` invocation appends a usage rollup to
+  `output/budget/usage.yaml`, writes a variance summary, and emits one
+  state-hub token event (when the hub is reachable)
+- `generate status` and the generation-summary report surface the
+  plan-vs-actual variance for the most recent run
+- `infospace-bench budget list ` returns a parseable rollup
+  across all infospaces in a workspace
+- Archived infospace packages carry their budget log and expose a
+  `budget_summary` field in the archive manifest
+- Tests cover plan persistence, run rollup, rate-table resolution,
+  variance, state-hub emission with hub-down isolation, and the
+  workspace CLI
+
+## Risks and open questions
+
+- **Rate-table drift.** Provider prices change. The rate table will go
+  stale unless someone refreshes it. Add `captured_at` to every entry
+  and surface "rate older than 90 days" as a warning in budget output;
+  do not block.
+- **Multiple-provider cost.** When a single run mixes providers (e.g.
+  fixture for cheap stages + OpenRouter for expensive ones), the
+  rollup must split clearly. The model+provider bucketing in T02
+  covers this; tests should pin the behaviour.
+- **State-hub coupling.** Emitting token events introduces a
+  cross-repo write. T05 keeps it opt-outable and failure-isolated, but
+  callers running offline want zero coupling — make sure the default
+  is "emit if reachable, silent skip otherwise" rather than "fail if
+  unreachable".
+- **Concurrency.** Two `generate run` invocations on the same
+  infospace would race on `usage.yaml`. Existing infospace workflows
+  assume sequential runs; document the constraint rather than building
+  locks.
+- **Budget vs adaptive observations.** This workplan records *what
+  happened*. `LLM-WP-0004` records *what we learned about quality*.
+  Keep them as two distinct files / schemas so the layering stays
+  inspectable; do not merge.
+- **Privacy.** Usage records do not include prompt or completion
+  text — only counts and identifiers. State-hub events likewise. If
+  this assumption later changes, add an explicit redaction hook before
+  doing so.
+
+## Downstream effects
+
+- `IB-WP-0018` (adaptive routing consumer) gains a local history to
+  cross-check against the `QualityLedger` once `LLM-WP-0004` lands
+- `IB-WP-0016-T07` (review report and output policy) can pull the
+  variance summary directly instead of regenerating numbers
+- `IB-WP-0014` archives become budget-bearing artifacts without code
+  changes beyond T07's manifest field
+- State-hub's `get_token_summary` finally sees infospace-bench runs
+  alongside other domains' token spend
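
Reviewer sketch for T03: the resolver order in the workplan (adapter-returned cost, then rate table, then an explicit unknown) could look roughly like the Python below. This is a minimal illustration under stated assumptions, not the planned implementation. The names `resolve_cost`, `CostResult`, and `RATES`, the model `example/model-a`, and its rates are all hypothetical. Only `prompt_per_1k`, `completion_per_1k`, `cost_usd`, and `cost_status: "unknown"` come from the workplan text; the status labels `"known"` and `"estimated"` are assumed.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory rate table. The workplan would load this from
# docs/model-rates.yaml, optionally overridden per workspace.
RATES = {
    "example/model-a": {"prompt_per_1k": 0.5, "completion_per_1k": 1.5},
}


@dataclass
class CostResult:
    cost_usd: Optional[float]
    # "unknown" is named in T03; "known" and "estimated" are assumed labels.
    cost_status: str


def resolve_cost(model, prompt_tokens, completion_tokens,
                 adapter_cost=None, rates=RATES):
    """Resolve cost as: adapter-returned cost > rate table > unknown."""
    # 1. Prefer the cost reported by the adapter. The `is not None` check
    #    keeps a genuine zero cost (e.g. fixture mode) from falling through.
    if adapter_cost is not None:
        return CostResult(adapter_cost, "known")

    # 2. Fall back to the rate table; rates are priced per 1,000 tokens.
    rate = rates.get(model)
    if rate is not None:
        cost = (prompt_tokens / 1000) * rate["prompt_per_1k"] \
            + (completion_tokens / 1000) * rate["completion_per_1k"]
        return CostResult(round(cost, 6), "estimated")

    # 3. Record the gap explicitly instead of silently zeroing it.
    return CostResult(None, "unknown")
```

Returning `None` rather than `0.0` for an unresolvable model keeps the T02 split between `cost_usd_known` and `cost_usd_estimated` honest, since an unknown cost never gets summed into either bucket as a zero.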