tegwick bb70b2f4b9 IB-WP-0019-T07: archive integration; close IB-WP-0019

The default archive include set already pulls output/ in wholesale, so
output/budget/ already lands inside the archive package with no code
change. Add a budget_summary block to ArchiveRecord.metadata so
catalog-level tools can see plans_count, runs_count, total_tokens,
total_cost_usd_known, total_cost_usd_estimated, and the
latest_snapshot_id without unpacking the archive. An infospace with no
budget data still archives cleanly with an empty metadata dict.

Closes IB-WP-0019 (Budget and Usage Registry): T01-T07 all done.
Three-layer design landed end-to-end — layer 1 (per-infospace
plans.yaml / usage.yaml / summary.yaml) and layer 3 (state-hub
record_token_event emission with failure isolation) live here; layer 2
(cross-application QualityLedger for adaptive routing) is parked in
llm-connect LLM-WP-0004 and infospace-bench IB-WP-0018 awaits it.

122 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-17 21:53:28 +02:00

id, type, title, domain, repo, status, owner, topic_slug, created, updated, depends_on_workplans, related_workplans, state_hub_workstream_id

IB-WP-0019

workplan

Budget and Usage Registry for Infospaces

markitect

infospace-bench

done

markitect

2026-05-17

IB-WP-0016

IB-WP-0014

IB-WP-0018

LLM-WP-0004

063c6285-a56e-476b-8666-109d6fa35858

IB-WP-0019 — Budget and Usage Registry for Infospaces

Goal

Persist budget and usage signals at the per-infospace layer and emit organizational rollups, so every infospace can answer "what did we estimate, what did we actually spend, on which model, at what cost" without scraping commit messages or state-hub events.

This workplan owns the recording and rollup layer. It does not own:

Adaptive routing decisions or per-task quality grading — those belong to llm-connect LLM-WP-0004 and the consumer workplan IB-WP-0018.
Authoritative provider pricing — we read a small rate table and combine it with adapter-returned usage; the table itself is a static artifact that consumers refresh.

Why

IB-WP-0016-T03 made the planning estimates cheap to obtain (chunks, calls, tokens, rough USD), but the numbers vanish after the JSON is printed. Run records under output/workflows/runs/*.yaml capture per-call prompt_tokens and completion_tokens but nothing rolls them up, no cost is computed, and there is no plan-vs-actual variance. Without this layer:

Each new infospace re-discovers the same cost surprises
LLM-WP-0004's adaptive policy has no per-application history to learn from when it lands
IB-WP-0014's archive packages forget the budget shape of the work that produced them
State-hub's organizational token ledger stays blind to infospace-bench runs

Non-Goals

Owning a cross-application quality ledger (that is LLM-WP-0004)
Auto-refreshing provider price lists at runtime
Failing a generate run when state-hub is unreachable
Persisting full prompt text for retrospective replay (the existing run records already keep what is needed)

Layered design (read first)

Three layers, each owned by a different repo:

Layer	Lives in	Purpose	This workplan?
1. Per-infospace budget log	`infospace-bench` (this workplan)	Plans + usage + variance, archived with the infospace	yes
2. Cross-application observations	`llm-connect` (LLM-WP-0004)	Per-task per-adapter (cost, tokens, latency, quality) for adaptive routing	no
3. Organizational rollup	`state-hub` (already exists)	`record_token_event` / `get_token_summary` across all projects	this workplan emits, hub stores

Tasks

T01 — Plan snapshot persistence

id: IB-WP-0019-T01
status: done
priority: high
state_hub_task_id: "7f1a4e0a-c1ad-49f3-aad1-6946de9b1219"

Append the compact plan_generation_summary payload to output/budget/plans.yaml on every generate plan invocation
Include a stable snapshot_id (hash of relevant fields), the stage, selection filters, and a recorded_at timestamp
Cap the history length with a configurable retention setting (default: keep the last 50 snapshots; older snapshots are pruned, with a single rollup entry preserved)
Tests: round-trip, retention, repeat plans produce distinct snapshots
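
A minimal sketch of the snapshot bookkeeping described above, working on an in-memory dict rather than on plans.yaml itself. The function names, the snapshots / pruned_rollup keys, and the 16-character truncated SHA-256 are illustrative assumptions, not the shipped schema. The workplan does not say whether recorded_at feeds the hash; this sketch leaves it out, so two identical plans share a snapshot_id and differ only in recorded_at.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def snapshot_id(summary: dict, stage: str, filters: dict) -> str:
    """Hash only the plan-defining fields so the id is stable across key order."""
    canonical = json.dumps(
        {"summary": summary, "stage": stage, "filters": filters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def append_snapshot(history: dict, summary: dict, stage: str, filters: dict,
                    retention: int = 50, now: Optional[datetime] = None) -> dict:
    """Append one snapshot, then prune to `retention`, keeping one rollup entry."""
    now = now or datetime.now(timezone.utc)
    snapshots = list(history.get("snapshots", []))
    snapshots.append({
        "snapshot_id": snapshot_id(summary, stage, filters),
        "stage": stage,
        "filters": filters,
        "recorded_at": now.isoformat(),
        "summary": summary,
    })
    # Hypothetical rollup shape: a single counter of how many snapshots were pruned.
    pruned_count = history.get("pruned_rollup", {}).get("pruned_count", 0)
    overflow = len(snapshots) - retention
    if overflow > 0:
        pruned_count += overflow
        snapshots = snapshots[overflow:]
    result = {"snapshots": snapshots}
    if pruned_count:
        result["pruned_rollup"] = {"pruned_count": pruned_count}
    return result
```

Returning a new dict instead of mutating the input keeps the round-trip and retention tests independent of file I/O.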

T02 — Usage rollup from run records

id: IB-WP-0019-T02
status: done
priority: high
state_hub_task_id: "a612f8d4-f96d-4fae-9aa6-66a7946414f5"

On run and resume completion, scan the run-record YAML written by the workflow engine and aggregate per-call usage into output/budget/usage.yaml
Aggregate buckets: workflow, stage, provider, model
Fields per bucket: calls, prompt_tokens, completion_tokens, total_tokens, cost_usd_known (sum over calls with known cost), cost_usd_estimated (computed via rate table fallback)
Append a top-level runs[] entry per completed run with the run's rollup, the snapshot_id of the plan it executed against (when one exists), and the wall-clock duration
Tests: aggregate across multiple stages, fixture-mode produces zero cost without erroring, missing-usage entries do not abort the rollup
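
The bucketing above can be sketched as follows. This assumes each per-call record is a flat dict carrying workflow, stage, provider, model, prompt_tokens, and completion_tokens; the real run-record layout may nest these differently. The calls_missing_usage counter is a hypothetical addition that makes the "missing-usage entries do not abort the rollup" case visible. Cost fields are left to the rate-table resolver in T03.

```python
from collections import defaultdict

BUCKET_KEYS = ("workflow", "stage", "provider", "model")


def rollup_calls(calls):
    """Aggregate per-call usage into one row per (workflow, stage, provider, model)."""
    def empty_bucket():
        return {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0,
                "total_tokens": 0, "calls_missing_usage": 0}

    buckets = defaultdict(empty_bucket)
    for call in calls:
        key = tuple(str(call.get(k) or "unknown") for k in BUCKET_KEYS)
        bucket = buckets[key]
        bucket["calls"] += 1
        prompt = call.get("prompt_tokens")
        completion = call.get("completion_tokens")
        if prompt is None and completion is None:
            # Count the call but do not abort: usage was never reported for it.
            bucket["calls_missing_usage"] += 1
            continue
        prompt, completion = int(prompt or 0), int(completion or 0)
        bucket["prompt_tokens"] += prompt
        bucket["completion_tokens"] += completion
        bucket["total_tokens"] += prompt + completion
    return [dict(zip(BUCKET_KEYS, key), **fields)
            for key, fields in sorted(buckets.items())]
```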

T03 — Cost computation from a rate table

id: IB-WP-0019-T03
status: done
priority: high
state_hub_task_id: "688c590d-8885-455e-bcf6-61409a45e001"

Add docs/model-rates.yaml with model -> {prompt_per_1k, completion_per_1k, currency, source_url, captured_at} for the OpenRouter models we have actually used (start small: the ones currently exercised in tests/smoke)
Resolver order: adapter-returned cost (when present) > rate table > unknown (recorded explicitly, not silently zeroed)
Allow a per-workspace override via ${workspace}/model-rates.yaml for self-hosted or private-rate setups
Tests: known model, unknown model surfaces as cost_usd: null with cost_status: "unknown", override file takes precedence
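
The resolver order can be illustrated with a small function. This is a sketch under stated assumptions: the call dict shape and the "known" / "estimated" status labels are hypothetical (only cost_usd: null with cost_status: "unknown" comes from the task text), the rate tables are passed in already parsed, and every rate is assumed to be in USD, so the currency field is ignored here.

```python
from typing import Optional


def resolve_cost(call: dict, rates: dict, overrides: Optional[dict] = None) -> dict:
    """Resolve cost as: adapter-returned cost > rate table > explicit unknown."""
    adapter_cost = call.get("cost_usd")
    if adapter_cost is not None:
        return {"cost_usd": float(adapter_cost), "cost_status": "known"}

    model = call.get("model")
    # The per-workspace override table wins over the repo-level table.
    rate = (overrides or {}).get(model) or rates.get(model)
    if rate is None:
        # Recorded explicitly, never silently zeroed.
        return {"cost_usd": None, "cost_status": "unknown"}

    cost = (
        call.get("prompt_tokens", 0) / 1000 * rate["prompt_per_1k"]
        + call.get("completion_tokens", 0) / 1000 * rate["completion_per_1k"]
    )
    return {"cost_usd": round(cost, 6), "cost_status": "estimated"}
```

For example, 2,000 prompt tokens at 0.5 per 1k plus 1,000 completion tokens at 1.5 per 1k resolves to an estimated cost of 2.5.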

T04 — Plan-vs-actual variance and surfacing

id: IB-WP-0019-T04
status: done
priority: medium
state_hub_task_id: "c6adc4fb-9062-4c81-a0b2-98d3166e047d"

Compute a small variance record on each run: actual_calls / estimated_calls, actual_tokens / estimated_tokens, actual_cost / estimated_cost, plus per-stage variance
Persist to output/budget/summary.yaml (overwrite each run; previous versions live in usage.yaml history)
Surface a one-line variance summary in reports/generation-summary.md (touches T07 of IB-WP-0016)
Add the variance summary to generate status JSON output
Tests: zero-cost fixture run, known-model OpenRouter mock run, missing-plan run (variance fields are null but the run still records)
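
One way to read the variance record is as a set of actual/estimated ratios that degrade to null instead of raising. The sketch below assumes both the actual rollup and the plan expose calls, tokens, and cost_usd plus an optional per-stage stages mapping; those key names and the *_ratio output names are illustrative, not the persisted schema.

```python
from typing import Optional

FIELDS = ("calls", "tokens", "cost_usd")


def ratio(actual, estimated) -> Optional[float]:
    """actual / estimated, or None when either side is missing or the estimate is 0."""
    if actual is None or estimated is None or estimated == 0:
        return None
    return actual / estimated


def _ratios(actual: dict, plan: dict) -> dict:
    return {f"{name}_ratio": ratio(actual.get(name), plan.get(name)) for name in FIELDS}


def variance_record(actual: dict, plan: Optional[dict]) -> dict:
    """Top-level and per-stage variance; a missing plan yields null ratios."""
    plan = plan or {}
    record = _ratios(actual, plan)
    plan_stages = plan.get("stages", {})
    record["stages"] = {
        stage: _ratios(stage_actual, plan_stages.get(stage, {}))
        for stage, stage_actual in actual.get("stages", {}).items()
    }
    return record
```

Treating a zero estimate as null covers the zero-cost fixture case: the cost ratio is undefined there, but the run still records.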

T05 — State-hub token-event emission

id: IB-WP-0019-T05
status: done
priority: medium
state_hub_task_id: "968bca1d-63ff-4818-83bb-ca314b1e633c"

After each completed run, call state-hub record_token_event with the run's rollup (tokens in/out, model, USD cost when known, infospace_slug, workspace)
Emit at most one event per run; tag the event with the workplan context when available
Failure isolation: a state-hub error must not fail the run; log the failure and continue
Honor an opt-out env var INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS
Tests: monkey-patched hub client, opt-out flag respected, run succeeds when the hub raises
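
The failure-isolation and opt-out rules can be sketched as a thin wrapper. The hub client is injected as a plain callable because the real record_token_event signature is not shown here; passing the rollup as a single dict, and treating "1", "true", and "yes" as opt-out values, are both assumptions.

```python
import logging
import os

logger = logging.getLogger(__name__)

OPT_OUT_ENV = "INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS"


def emit_token_event(record_token_event, rollup: dict, env=None) -> bool:
    """Emit at most one event for a run; return True only if the hub accepted it."""
    env = os.environ if env is None else env
    if env.get(OPT_OUT_ENV, "").strip().lower() in {"1", "true", "yes"}:
        return False
    try:
        record_token_event(rollup)
    except Exception:
        # A hub error must not fail the run: log it and carry on.
        logger.warning("state-hub token event failed; continuing", exc_info=True)
        return False
    return True
```

Injecting the callable also makes the monkey-patched test trivial: pass a list's append method to capture the payload, or a function that raises to simulate a hub outage.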

T06 — Workspace-level rollup CLI

id: IB-WP-0019-T06
status: done
priority: medium
state_hub_task_id: "7cb34bfc-c562-4dda-a6d4-b44158644e19"

Add infospace-bench budget list <workspace> that walks infospaces/*/output/budget/ and prints a JSON table: slug, plans_count, runs_count, total_tokens, total_cost_usd_known, total_cost_usd_estimated, last_run_at
Add infospace-bench budget show <infospace-root> that prints the full per-infospace budget structure
Tests: empty workspace, multiple infospaces, missing budget dir is treated as zero, not an error
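
A rough sketch of the walk behind budget list. The loader is passed in so the example stays standard-library only; in practice it would likely be a YAML loader such as PyYAML's yaml.safe_load. The inner keys read from plans.yaml and usage.yaml (snapshots, and the per-run total_tokens, cost_usd_known, cost_usd_estimated, recorded_at) are assumptions about the file layout; only the top-level runs[] list and the output columns come from the task text.

```python
from pathlib import Path
from typing import Callable


def _read(path: Path, load: Callable[[str], dict]) -> dict:
    """Missing or empty files count as empty budget data, not as errors."""
    if not path.is_file():
        return {}
    return load(path.read_text(encoding="utf-8")) or {}


def budget_list(workspace: Path, load: Callable[[str], dict]) -> list:
    base = workspace / "infospaces"
    if not base.is_dir():
        return []
    rows = []
    for root in sorted(p for p in base.iterdir() if p.is_dir()):
        budget = root / "output" / "budget"
        plans = _read(budget / "plans.yaml", load)
        runs = _read(budget / "usage.yaml", load).get("runs", [])
        rows.append({
            "slug": root.name,
            "plans_count": len(plans.get("snapshots", [])),
            "runs_count": len(runs),
            "total_tokens": sum(r.get("total_tokens") or 0 for r in runs),
            "total_cost_usd_known": sum(r.get("cost_usd_known") or 0 for r in runs),
            "total_cost_usd_estimated": sum(r.get("cost_usd_estimated") or 0 for r in runs),
            "last_run_at": max(
                (r["recorded_at"] for r in runs if r.get("recorded_at")), default=None
            ),
        })
    return rows
```

The result is a list of plain dicts, so printing the JSON table is a single json.dumps(rows, indent=2) call.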

T07 — Archive integration

id: IB-WP-0019-T07
status: done
priority: low
state_hub_task_id: "b97906e0-2835-4246-9868-840c02d64fae"

Confirm output/budget/ ends up inside the archive package built by IB-WP-0014's archive_infospace() (it should, via the existing default-include rules — verify with a test)
Add a budget_summary field to the archive manifest so catalog-level tools can find the cost shape of an archived infospace without unpacking it
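
A sketch of how the budget_summary block might be assembled before it is attached to the archive metadata. The field names follow the closing commit message for this task; the function name and the per-run and per-plan input keys are hypothetical, and how the result is wired into archive_infospace() is not shown.

```python
def archive_metadata(plans: list, runs: list) -> dict:
    """Build the metadata dict; an infospace with no budget data yields {}."""
    if not plans and not runs:
        return {}
    return {
        "budget_summary": {
            "plans_count": len(plans),
            "runs_count": len(runs),
            "total_tokens": sum(r.get("total_tokens") or 0 for r in runs),
            "total_cost_usd_known": sum(r.get("cost_usd_known") or 0 for r in runs),
            "total_cost_usd_estimated": sum(r.get("cost_usd_estimated") or 0 for r in runs),
            # Assumes plans are stored oldest-first, so the last entry is the latest.
            "latest_snapshot_id": plans[-1].get("snapshot_id") if plans else None,
        }
    }
```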

Acceptance

A generate plan invocation persists a snapshot to output/budget/plans.yaml and is idempotent across runs
A generate run invocation appends a usage rollup to output/budget/usage.yaml, writes a variance summary, and emits one state-hub token event (when the hub is reachable)
generate status and the generation-summary report surface the plan-vs-actual variance for the most recent run
infospace-bench budget list <workspace> returns a parseable rollup across all infospaces in a workspace
Archived infospace packages carry their budget log and expose a budget_summary field in the archive manifest
Tests cover plan persistence, run rollup, rate-table resolution, variance, state-hub emission with hub-down isolation, and the workspace CLI

Risks and open questions

Rate-table drift. Provider prices change. The rate table will go stale unless someone refreshes it. Add captured_at to every entry and surface "rate older than 90 days" as a warning in budget output; do not block.
Multiple-provider cost. When a single run mixes providers (e.g. fixture for cheap stages + OpenRouter for expensive ones), the rollup must split clearly. The model+provider bucketing in T02 covers this; tests should pin the behaviour.
State-hub coupling. Emitting token events introduces a cross-repo write. T05 keeps emission opt-out and failure-isolated, but callers running offline want zero coupling. Make sure the default is "emit if reachable, silent skip otherwise" rather than "fail if unreachable".
Concurrency. Two generate run invocations on the same infospace would race on usage.yaml. Existing infospace workflows assume sequential runs; document the constraint rather than building locks.
Budget vs adaptive observations. This workplan records what happened. LLM-WP-0004 records what we learned about quality. Keep them as two distinct files / schemas so the layering stays inspectable; do not merge.
Privacy. Usage records do not include prompt or completion text — only counts and identifiers. State-hub events likewise. If this assumption later changes, add an explicit redaction hook before doing so.
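
The rate-table drift mitigation above (warn on rates older than 90 days, never block) could look roughly like this. It assumes captured_at is an ISO date or timestamp string, as a YAML loader configured to keep strings would return; the function name and message wording are illustrative.

```python
from datetime import date


def stale_rate_warnings(rates: dict, today: date, max_age_days: int = 90) -> list:
    """Return one warning string per rate entry older than `max_age_days`."""
    warnings = []
    for model, entry in sorted(rates.items()):
        # Keep only the YYYY-MM-DD prefix so full timestamps parse too.
        captured = date.fromisoformat(str(entry["captured_at"])[:10])
        age_days = (today - captured).days
        if age_days > max_age_days:
            warnings.append(
                f"{model}: rate older than {max_age_days} days "
                f"(captured {captured.isoformat()}, {age_days} days ago)"
            )
    return warnings
```

The function only returns strings; the caller decides where to print them, which keeps the check advisory.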

Downstream effects

IB-WP-0018 (adaptive routing consumer) gains a local history to cross-check against the QualityLedger once LLM-WP-0004 lands
IB-WP-0016-T07 (review report and output policy) can pull the variance summary directly instead of regenerating numbers
IB-WP-0014 archives become budget-bearing artifacts without code changes beyond T07's manifest field
State-hub's get_token_summary finally sees infospace-bench runs alongside other domains' token spend
