infospace-bench/workplans/IB-WP-0019-budget-and-usage-registry.md
tegwick bb70b2f4b9 IB-WP-0019-T07: archive integration; close IB-WP-0019
The default archive include set already pulls output/ in wholesale, so
output/budget/ already lands inside the archive package with no code
change. Add a budget_summary block to ArchiveRecord.metadata so
catalog-level tools can see plans_count, runs_count, total_tokens,
total_cost_usd_known, total_cost_usd_estimated, and the
latest_snapshot_id without unpacking the archive. An infospace with no
budget data still archives cleanly with an empty metadata dict.

Closes IB-WP-0019 (Budget and Usage Registry): T01-T07 all done.
Three-layer design landed end-to-end — layer 1 (per-infospace
plans.yaml / usage.yaml / summary.yaml) and layer 3 (state-hub
record_token_event emission with failure isolation) live here; layer 2
(cross-application QualityLedger for adaptive routing) is parked in
llm-connect LLM-WP-0004 and infospace-bench IB-WP-0018 awaits it.

122 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 21:53:28 +02:00

id type title domain repo status owner topic_slug created updated depends_on_workplans related_workplans state_hub_workstream_id
IB-WP-0019 workplan Budget and Usage Registry for Infospaces markitect infospace-bench done markitect markitect 2026-05-17 2026-05-17
IB-WP-0016
IB-WP-0014
IB-WP-0018
LLM-WP-0004
063c6285-a56e-476b-8666-109d6fa35858

IB-WP-0019 — Budget and Usage Registry for Infospaces

Goal

Persist budget and usage signals at the per-infospace layer and emit organizational rollups, so every infospace can answer "what did we estimate, what did we actually spend, on which model, at what cost" without scraping commit messages or state-hub events.

This workplan owns the recording and rollup layer. It does not own:

  • Adaptive routing decisions or per-task quality grading — those belong to llm-connect LLM-WP-0004 and the consumer workplan IB-WP-0018.
  • Authoritative provider pricing — we read a small rate table and combine it with adapter-returned usage; the table itself is a static artifact that consumers refresh.

Why

IB-WP-0016-T03 made the planning estimates cheap to obtain (chunks, calls, tokens, rough USD), but the numbers vanish after the JSON is printed. Run records under output/workflows/runs/*.yaml capture per-call prompt_tokens and completion_tokens but nothing rolls them up, no cost is computed, and there is no plan-vs-actual variance. Without this layer:

  • Each new infospace re-discovers the same cost surprises
  • LLM-WP-0004's adaptive policy has no per-application history to learn from when it lands
  • IB-WP-0014's archive packages forget the budget shape of the work that produced them
  • State-hub's organizational token ledger stays blind to infospace-bench runs

Non-Goals

  • Owning a cross-application quality ledger (that is LLM-WP-0004)
  • Auto-refreshing provider price lists at runtime
  • Failing a generate run when state-hub is unreachable
  • Persisting full prompt text for retrospective replay (the existing run records already keep what is needed)

Layered design (read first)

Three layers, each owned by a different repo:

| Layer | Lives in | Purpose | This workplan? |
| 1. Per-infospace budget log | infospace-bench (this workplan) | Plans + usage + variance, archived with the infospace | yes |
| 2. Cross-application observations | llm-connect (LLM-WP-0004) | Per-task per-adapter (cost, tokens, latency, quality) for adaptive routing | no |
| 3. Organizational rollup | state-hub (already exists) | record_token_event / get_token_summary across all projects | this workplan emits, hub stores |

Tasks

T01 — Plan snapshot persistence

id: IB-WP-0019-T01
status: done
priority: high
state_hub_task_id: "7f1a4e0a-c1ad-49f3-aad1-6946de9b1219"
  • Append the compact plan_generation_summary payload to output/budget/plans.yaml on every generate plan invocation
  • Include a stable snapshot_id (hash of relevant fields), the stage, selection filters, and a recorded_at timestamp
  • Cap the history length with a configurable retention limit (default: keep the last 50 snapshots; older snapshots are pruned, with a single rollup entry preserved in their place)
  • Tests: round-trip, retention, repeat plans produce distinct snapshots
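
The snapshot id and retention rules above can be sketched roughly as follows. This is a minimal illustration, not the shipped implementation: the function names, the `kind: rollup` marker, the `estimated_tokens` field, and the choice to exclude `recorded_at` from the hash are all assumptions made for the example.

```python
import hashlib
import json


def snapshot_id(stage, filters, summary):
    """Hash the plan-defining fields (assumption: recorded_at is excluded)."""
    payload = json.dumps(
        {"stage": stage, "filters": filters, "summary": summary},
        sort_keys=True,  # stable key order so equal plans hash equally
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def prune_snapshots(entries, keep=50):
    """Keep the newest `keep` snapshots; fold older ones into one rollup entry."""
    rollups = [e for e in entries if e.get("kind") == "rollup"]
    snaps = [e for e in entries if e.get("kind") != "rollup"]
    if len(snaps) <= keep:
        return list(entries)
    cut = len(snaps) - keep
    pruned, kept = snaps[:cut], snaps[cut:]
    rollup = {
        "kind": "rollup",  # hypothetical marker for the preserved summary entry
        "pruned_count": sum(r["pruned_count"] for r in rollups) + len(pruned),
        "estimated_tokens": sum(r.get("estimated_tokens", 0) for r in rollups)
        + sum(s.get("estimated_tokens", 0) for s in pruned),
    }
    return [rollup] + kept
```

Under this sketch, two invocations with identical inputs share a `snapshot_id` but still produce two history entries, distinguished by their `recorded_at` timestamps.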

T02 — Usage rollup from run records

id: IB-WP-0019-T02
status: done
priority: high
state_hub_task_id: "a612f8d4-f96d-4fae-9aa6-66a7946414f5"
  • On run and resume completion, scan the run-record YAML written by the workflow engine and aggregate per-call usage into output/budget/usage.yaml
  • Aggregate buckets: workflow, stage, provider, model
  • Fields per bucket: calls, prompt_tokens, completion_tokens, total_tokens, cost_usd_known (sum over calls with known cost), cost_usd_estimated (computed via rate table fallback)
  • Append a top-level runs[] entry per completed run with the run's rollup, the snapshot_id of the plan it executed against (when one exists), and the wall-clock duration
  • Tests: aggregate across multiple stages, fixture-mode produces zero cost without erroring, missing-usage entries do not abort the rollup
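
A minimal sketch of the aggregation step, assuming each run-record call is already loaded as a dict with `workflow`, `stage`, `provider`, `model`, and an optional `usage` mapping. The composite bucket key and the record shape are assumptions for illustration; the real run-record schema may differ.

```python
from collections import defaultdict


def aggregate_usage(calls):
    """Sum per-call usage into buckets keyed by (workflow, stage, provider, model)."""
    buckets = defaultdict(
        lambda: {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    )
    for call in calls:
        key = tuple(call.get(k) for k in ("workflow", "stage", "provider", "model"))
        usage = call.get("usage") or {}  # a missing usage block counts as zero tokens
        prompt = usage.get("prompt_tokens") or 0
        completion = usage.get("completion_tokens") or 0
        bucket = buckets[key]
        bucket["calls"] += 1
        bucket["prompt_tokens"] += prompt
        bucket["completion_tokens"] += completion
        bucket["total_tokens"] += prompt + completion
    return dict(buckets)
```

Treating missing usage as zero rather than raising mirrors the requirement that missing-usage entries do not abort the rollup.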

T03 — Cost computation from a rate table

id: IB-WP-0019-T03
status: done
priority: high
state_hub_task_id: "688c590d-8885-455e-bcf6-61409a45e001"
  • Add docs/model-rates.yaml with model -> {prompt_per_1k, completion_per_1k, currency, source_url, captured_at} for the OpenRouter models we have actually used (start small: the ones currently exercised in tests/smoke)
  • Resolver order: adapter-returned cost (when present) > rate table > unknown (recorded explicitly, not silently zeroed)
  • Allow a per-workspace override via ${workspace}/model-rates.yaml for self-hosted or private-rate setups
  • Tests: known model, unknown model surfaces as cost_usd: null with cost_status: "unknown", override file takes precedence
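
The resolver order can be illustrated with a short sketch. The function name and arguments are hypothetical, the `"known"` and `"estimated"` status strings are assumed labels (only `"unknown"` appears in this workplan), and the example assumes all rates are in USD.

```python
def resolve_cost(model, prompt_tokens, completion_tokens,
                 adapter_cost=None, rates=None, overrides=None):
    """Resolve cost as: adapter-returned cost > rate table > explicit unknown."""
    if adapter_cost is not None:
        return {"cost_usd": adapter_cost, "cost_status": "known"}
    # Workspace override entries take precedence over the shared table.
    table = {**(rates or {}), **(overrides or {})}
    rate = table.get(model)
    if rate is None:
        # Unknown models are recorded explicitly, never silently zeroed.
        return {"cost_usd": None, "cost_status": "unknown"}
    cost = (prompt_tokens / 1000) * rate["prompt_per_1k"] \
        + (completion_tokens / 1000) * rate["completion_per_1k"]
    return {"cost_usd": round(cost, 6), "cost_status": "estimated"}
```

For example, with a rate of 0.5 per 1k prompt tokens and 1.5 per 1k completion tokens, a call with 2000 prompt tokens and 1000 completion tokens resolves to 2000/1000 × 0.5 + 1000/1000 × 1.5 = 2.5.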

T04 — Plan-vs-actual variance and surfacing

id: IB-WP-0019-T04
status: done
priority: medium
state_hub_task_id: "c6adc4fb-9062-4c81-a0b2-98d3166e047d"
  • Compute a small variance record on each run: actual_calls / estimated_calls, actual_tokens / estimated_tokens, actual_cost / estimated_cost, plus per-stage variance
  • Persist to output/budget/summary.yaml (overwritten on each run; previous versions live in usage.yaml history)
  • Surface a one-line variance summary in reports/generation-summary.md (touches T07 of IB-WP-0016)
  • Add the variance summary to generate status JSON output
  • Tests: zero-cost fixture run, known-model OpenRouter mock run, missing-plan run (variance fields are null but the run still records)
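
A sketch of the top-level variance ratios, assuming the plan and the actuals are each reduced to a dict with `calls`, `tokens`, and `cost` keys. The key names and the `*_ratio` output fields are illustrative, and per-stage variance is omitted for brevity.

```python
def ratio(actual, estimated):
    """Return actual / estimated, or None when the ratio is undefined."""
    if actual is None or not estimated:  # missing or zero estimate
        return None
    return actual / estimated


def variance_record(actual, plan):
    """Build a variance record; every field is None when no plan exists."""
    fields = ("calls", "tokens", "cost")
    if plan is None:
        return {f"{name}_ratio": None for name in fields}
    return {f"{name}_ratio": ratio(actual.get(name), plan.get(name)) for name in fields}
```

Returning None for a zero estimate keeps a zero-cost fixture run from dividing by zero, and returning an all-None record for a missing plan lets the run still be recorded.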

T05 — State-hub token-event emission

id: IB-WP-0019-T05
status: done
priority: medium
state_hub_task_id: "968bca1d-63ff-4818-83bb-ca314b1e633c"
  • After each completed run, call state-hub record_token_event with the run's rollup (tokens in/out, model, USD cost when known, infospace_slug, workspace)
  • Emit at most one event per run; tag the event with the workplan context when available
  • Failure isolation: a state-hub error must not fail the run; log the failure and continue
  • Honor an opt-out env var INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS
  • Tests: monkey-patched hub client, opt-out flag respected, run succeeds when the hub raises
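
The failure-isolation and opt-out behaviour might look like the sketch below. The `hub_client` object, the keyword-argument call shape for `record_token_event`, and the set of values that count as "disabled" are assumptions; only the method name and the environment variable come from this workplan.

```python
import logging
import os

logger = logging.getLogger("infospace_bench.budget")
OPT_OUT_ENV = "INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS"


def emit_token_event(hub_client, rollup, env=None):
    """Emit one token event per run; never let a hub failure fail the run.

    Returns True when the event was sent, False when skipped or failed.
    """
    env = os.environ if env is None else env
    # Assumption: "1", "true", and "yes" (case-insensitive) disable emission.
    if env.get(OPT_OUT_ENV, "").strip().lower() in {"1", "true", "yes"}:
        return False
    try:
        hub_client.record_token_event(**rollup)  # hypothetical call shape
    except Exception:
        # Log and continue: the run itself has already completed.
        logger.warning("state-hub token event failed; continuing", exc_info=True)
        return False
    return True
```

Passing `env` explicitly makes the opt-out easy to test without mutating the process environment.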

T06 — Workspace-level rollup CLI

id: IB-WP-0019-T06
status: done
priority: medium
state_hub_task_id: "7cb34bfc-c562-4dda-a6d4-b44158644e19"
  • Add infospace-bench budget list <workspace> that walks infospaces/*/output/budget/ and prints a JSON table: slug, plans_count, runs_count, total_tokens, total_cost_usd_known, total_cost_usd_estimated, last_run_at
  • Add infospace-bench budget show <infospace-root> that prints the full per-infospace budget structure
  • Tests: empty workspace, multiple infospaces, missing budget dir is treated as zero, not an error
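
A rough sketch of the workspace walk behind `budget list`, assuming the per-infospace totals are produced by a caller-supplied `load_budget` function (hypothetical; YAML parsing is left out so the example stays standard-library only). The zero-row defaults are also an assumption.

```python
from pathlib import Path

ZERO_ROW = {
    "plans_count": 0,
    "runs_count": 0,
    "total_tokens": 0,
    "total_cost_usd_known": 0.0,
    "total_cost_usd_estimated": 0.0,
    "last_run_at": None,
}


def budget_rows(workspace, load_budget):
    """Return one row per infospace; a missing budget dir yields zeros."""
    root = Path(workspace) / "infospaces"
    if not root.is_dir():
        return []  # empty workspace is not an error
    rows = []
    for infospace in sorted(p for p in root.iterdir() if p.is_dir()):
        budget_dir = infospace / "output" / "budget"
        totals = load_budget(budget_dir) if budget_dir.is_dir() else {}
        rows.append({"slug": infospace.name, **ZERO_ROW, **totals})
    return rows
```

The resulting list can be passed straight to `json.dumps` to print the table.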

T07 — Archive integration

id: IB-WP-0019-T07
status: done
priority: low
state_hub_task_id: "b97906e0-2835-4246-9868-840c02d64fae"
  • Confirm output/budget/ ends up inside the archive package built by IB-WP-0014's archive_infospace() (it should, via the existing default-include rules — verify with a test)
  • Add a budget_summary field to the archive manifest so catalog-level tools can find the cost shape of an archived infospace without unpacking it
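
A sketch of how the `budget_summary` block could be assembled, using the field names listed in the T07 commit message. The function names and the input shapes (lists of plan snapshots and run rollups, already loaded as dicts) are assumptions.

```python
def build_budget_summary(plans, runs):
    """Summarise budget data for the archive; empty input gives an empty dict."""
    if not plans and not runs:
        return {}
    return {
        "plans_count": len(plans),
        "runs_count": len(runs),
        "total_tokens": sum(r.get("total_tokens") or 0 for r in runs),
        "total_cost_usd_known": sum(r.get("cost_usd_known") or 0.0 for r in runs),
        "total_cost_usd_estimated": sum(r.get("cost_usd_estimated") or 0.0 for r in runs),
        "latest_snapshot_id": plans[-1]["snapshot_id"] if plans else None,
    }


def archive_metadata(plans, runs):
    """Return the metadata dict; it stays empty when there is no budget data."""
    summary = build_budget_summary(plans, runs)
    return {"budget_summary": summary} if summary else {}
```

Keeping the no-data case as an empty dict matches the stated behaviour that an infospace without budget data still archives cleanly.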

Acceptance

  • A generate plan invocation persists a snapshot to output/budget/plans.yaml and is idempotent across runs
  • A generate run invocation appends a usage rollup to output/budget/usage.yaml, writes a variance summary, and emits one state-hub token event (when the hub is reachable)
  • generate status and the generation-summary report surface the plan-vs-actual variance for the most recent run
  • infospace-bench budget list <workspace> returns a parseable rollup across all infospaces in a workspace
  • Archived infospace packages carry their budget log and expose a budget_summary field in the archive manifest
  • Tests cover plan persistence, run rollup, rate-table resolution, variance, state-hub emission with hub-down isolation, and the workspace CLI

Risks and open questions

  • Rate-table drift. Provider prices change. The rate table will go stale unless someone refreshes it. Add captured_at to every entry and surface "rate older than 90 days" as a warning in budget output; do not block.
  • Multiple-provider cost. When a single run mixes providers (e.g. fixture for cheap stages + OpenRouter for expensive ones), the rollup must split clearly. The model+provider bucketing in T02 covers this; tests should pin the behaviour.
  • State-hub coupling. Emitting token events introduces a cross-repo write. T05 keeps emission optional (via the opt-out env var) and failure-isolated, but callers running offline want zero coupling. Make sure the default is "emit if reachable, silent skip otherwise" rather than "fail if unreachable".
  • Concurrency. Two generate run invocations on the same infospace would race on usage.yaml. Existing infospace workflows assume sequential runs; document the constraint rather than building locks.
  • Budget vs adaptive observations. This workplan records what happened. LLM-WP-0004 records what we learned about quality. Keep them as two distinct files / schemas so the layering stays inspectable; do not merge.
  • Privacy. Usage records do not include prompt or completion text — only counts and identifiers. State-hub events likewise. If this assumption later changes, add an explicit redaction hook before doing so.
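
The rate-table drift warning from the first risk above could be sketched like this, assuming `captured_at` is stored as an ISO `YYYY-MM-DD` date string. The function name and the message wording are illustrative only.

```python
from datetime import date


def stale_rate_warnings(rates, today, max_age_days=90):
    """Return warning strings for stale rate entries; never raises to block a run."""
    warnings = []
    for model, entry in sorted(rates.items()):
        captured = date.fromisoformat(entry["captured_at"])
        age = (today - captured).days
        if age > max_age_days:
            warnings.append(
                f"{model}: rate older than {max_age_days} days "
                f"(captured {entry['captured_at']})"
            )
    return warnings
```

The function only returns messages, so the caller decides where to surface them in budget output.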

Downstream effects

  • IB-WP-0018 (adaptive routing consumer) gains a local history to cross-check against the QualityLedger once LLM-WP-0004 lands
  • IB-WP-0016-T07 (review report and output policy) can pull the variance summary directly instead of regenerating numbers
  • IB-WP-0014 archives become budget-bearing artifacts without code changes beyond T07's manifest field
  • State-hub's get_token_summary finally sees infospace-bench runs alongside other domains' token spend