IB-WP-0019: budget and usage registry workplan (todo)

Open a separate workplan for the budget/usage recording layer surfaced by the T03 conversation. Three-layer design: layer 1 (per-infospace budget log) and layer 3 (state-hub emission) live here; layer 2 (cross-application quality observations for adaptive routing) stays in llm-connect LLM-WP-0004. Seven tasks cover plan snapshot persistence, run usage rollup, rate-table cost computation, plan-vs-actual variance, state-hub token events with hub-down isolation, a workspace-level rollup CLI, and archive integration so IB-WP-0014 packages carry their budget shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 18:30:49 +02:00
parent 74c52c6239
commit d2deebe081
1 changed files with 256 additions and 0 deletions
--- a/workplans/IB-WP-0019-budget-and-usage-registry.md
+++ b/workplans/IB-WP-0019-budget-and-usage-registry.md
@@ -0,0 +1,256 @@
+---
+id: IB-WP-0019
+type: workplan
+title: "Budget and Usage Registry for Infospaces"
+domain: markitect
+repo: infospace-bench
+status: todo
+owner: markitect
+topic_slug: markitect
+created: "2026-05-17"
+updated: "2026-05-17"
+depends_on_workplans: []
+related_workplans:
+  - IB-WP-0016
+  - IB-WP-0014
+  - IB-WP-0018
+  - LLM-WP-0004
+---
+
+# IB-WP-0019 — Budget and Usage Registry for Infospaces
+
+## Goal
+
+Persist budget and usage signals at the per-infospace layer and emit
+organizational rollups, so every infospace can answer "what did we
+estimate, what did we actually spend, on which model, at what cost"
+without scraping commit messages or state-hub events.
+
+This workplan owns the *recording and rollup* layer. It does **not**
+own:
+
+- Adaptive routing decisions or per-task quality grading — those belong
+  to `llm-connect` `LLM-WP-0004` and the consumer workplan `IB-WP-0018`.
+- Authoritative provider pricing — we read a small rate table and
+  combine it with adapter-returned usage; the table itself is a static
+  artifact that consumers refresh.
+
+## Why
+
+`IB-WP-0016-T03` made the planning estimates cheap to obtain (chunks,
+calls, tokens, rough USD), but the numbers vanish after the JSON is
+printed. Run records under `output/workflows/runs/*.yaml` capture
+per-call `prompt_tokens` and `completion_tokens` but nothing rolls them
+up, no cost is computed, and there is no plan-vs-actual variance.
+Without this layer:
+
+- Each new infospace re-discovers the same cost surprises
+- `LLM-WP-0004`'s adaptive policy has no per-application history to
+  learn from when it lands
+- `IB-WP-0014`'s archive packages forget the budget shape of the work
+  that produced them
+- State-hub's organizational token ledger stays blind to
+  infospace-bench runs
+
+## Non-Goals
+
+- Owning a cross-application quality ledger (that is `LLM-WP-0004`)
+- Auto-refreshing provider price lists at runtime
+- Failing a `generate run` when state-hub is unreachable
+- Persisting full prompt text for retrospective replay (the existing
+  run records already keep what is needed)
+
+## Layered design (read first)
+
+Three layers, each owned by a different repo:
+
+| Layer | Lives in | Purpose | This workplan? |
+|---|---|---|---|
+| 1. Per-infospace budget log | `infospace-bench` (this workplan) | Plans + usage + variance, archived with the infospace | yes |
+| 2. Cross-application observations | `llm-connect` (LLM-WP-0004) | Per-task per-adapter (cost, tokens, latency, quality) for adaptive routing | no |
+| 3. Organizational rollup | `state-hub` (already exists) | `record_token_event` / `get_token_summary` across all projects | this workplan emits, hub stores |
+
+## Tasks
+
+### T01 — Plan snapshot persistence
+
+```task
+id: IB-WP-0019-T01
+status: todo
+priority: high
+```
+
+- Append the compact `plan_generation_summary` payload to
+  `output/budget/plans.yaml` on every `generate plan` invocation
+- Include a stable `snapshot_id` (hash of relevant fields), the stage,
+  selection filters, and a `recorded_at` timestamp
+- Cap the history length with a configurable retention limit (default:
+  keep the last 50 snapshots; older snapshots are pruned and folded
+  into a single preserved rollup entry)
+- Tests: round-trip, retention, repeat plans produce distinct snapshots
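
Reviewer sketch for T01, not part of the committed file. It shows one way the snapshot hash and the retention rule could fit together. The helper names (`snapshot_id`, `make_entry`, `append_snapshot`) and the `pruned` / `snapshots` layout are assumptions for illustration; the workplan does not define them. The sketch also assumes the hash covers only plan-defining fields, so two identical plans share an id while still being stored as separate entries with their own `recorded_at`.

```python
import hashlib
import json
from datetime import datetime, timezone


def snapshot_id(summary, stage, filters):
    """Hash the plan-defining fields only, so the id ignores the timestamp."""
    canonical = json.dumps(
        {"summary": summary, "stage": stage, "filters": filters},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def make_entry(summary, stage, filters, now=None):
    """Build one snapshot entry (hypothetical field layout)."""
    now = now or datetime.now(timezone.utc)
    return {
        "snapshot_id": snapshot_id(summary, stage, filters),
        "stage": stage,
        "filters": filters,
        "recorded_at": now.isoformat(),
        "summary": summary,
    }


def append_snapshot(log, entry, keep=50):
    """Append an entry, then fold anything beyond `keep` into one rollup entry."""
    snapshots = list(log.get("snapshots", [])) + [entry]
    pruned = dict(log.get("pruned", {"count": 0}))
    overflow = len(snapshots) - keep
    if overflow > 0:
        pruned["count"] += overflow
        pruned["newest_pruned_at"] = snapshots[overflow - 1]["recorded_at"]
        snapshots = snapshots[overflow:]
    return {"pruned": pruned, "snapshots": snapshots}
```

Returning a new dict instead of mutating in place keeps the round-trip and retention tests simple: load, append, dump, compare.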
+
+### T02 — Usage rollup from run records
+
+```task
+id: IB-WP-0019-T02
+status: todo
+priority: high
+```
+
+- On `run` and `resume` completion, scan the run-record YAML written by
+  the workflow engine and aggregate per-call usage into
+  `output/budget/usage.yaml`
+- Aggregate buckets: workflow, stage, provider, model
+- Fields per bucket: `calls`, `prompt_tokens`, `completion_tokens`,
+  `total_tokens`, `cost_usd_known` (sum over calls with known cost),
+  `cost_usd_estimated` (computed via rate table fallback)
+- Append a top-level `runs[]` entry per completed run with the run's
+  rollup, the `snapshot_id` of the plan it executed against (when one
+  exists), and the wall-clock duration
+- Tests: aggregate across multiple stages, fixture-mode produces zero
+  cost without erroring, missing-usage entries do not abort the rollup
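
Reviewer sketch for T02, not part of the committed file. It illustrates the bucketing by workflow, stage, provider, and model. It assumes each call record is a flat dict carrying those four keys plus `prompt_tokens` and `completion_tokens`; the real run-record schema may differ. `calls_missing_usage` is a hypothetical counter added so missing-usage entries stay visible without aborting the rollup. Cost fields are left out here since T03 owns their resolution.

```python
from collections import defaultdict

BUCKET_KEYS = ("workflow", "stage", "provider", "model")


def rollup_usage(calls):
    """Aggregate per-call usage into one bucket per (workflow, stage, provider, model)."""
    buckets = defaultdict(lambda: {
        "calls": 0,
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "total_tokens": 0,
        "calls_missing_usage": 0,  # hypothetical field, not in the workplan
    })
    for call in calls:
        key = tuple(call.get(name) for name in BUCKET_KEYS)
        bucket = buckets[key]
        bucket["calls"] += 1
        prompt = call.get("prompt_tokens")
        completion = call.get("completion_tokens")
        if prompt is None and completion is None:
            # Count the call but skip the token sums instead of failing.
            bucket["calls_missing_usage"] += 1
            continue
        bucket["prompt_tokens"] += prompt or 0
        bucket["completion_tokens"] += completion or 0
        bucket["total_tokens"] += (prompt or 0) + (completion or 0)
    return [dict(zip(BUCKET_KEYS, key), **values) for key, values in buckets.items()]
```

Because provider is part of the bucket key, a run that mixes fixture and OpenRouter calls splits into separate rows without extra logic.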
+
+### T03 — Cost computation from a rate table
+
+```task
+id: IB-WP-0019-T03
+status: todo
+priority: high
+```
+
+- Add `docs/model-rates.yaml` with `model -> {prompt_per_1k,
+  completion_per_1k, currency, source_url, captured_at}` for the
+  OpenRouter models we have actually used (start small: the ones
+  currently exercised in tests/smoke)
+- Resolver order: adapter-returned cost (when present) > rate table >
+  unknown (recorded explicitly, not silently zeroed)
+- Allow a per-workspace override via `${workspace}/model-rates.yaml`
+  for self-hosted or private-rate setups
+- Tests: known model, unknown model surfaces as `cost_usd: null` with
+  `cost_status: "unknown"`, override file takes precedence
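
Reviewer sketch for T03, not part of the committed file. It walks the resolver order stated above: adapter-returned cost, then rate table, then an explicit unknown. The function name `resolve_cost` and the status values `"known"` and `"estimated"` are assumptions chosen to mirror `cost_usd_known` and `cost_usd_estimated`; only `"unknown"` comes from the workplan. The model names in the usage example are placeholders.

```python
def resolve_cost(call, rates, override=None):
    """Resolve cost as: adapter-returned > rate table (with override) > unknown."""
    adapter_cost = call.get("cost_usd")
    if adapter_cost is not None:
        # A reported cost of 0.0 (for example in fixture mode) is still a known cost.
        return {"cost_usd": float(adapter_cost), "cost_status": "known"}

    # Per-workspace override entries take precedence over the shared table.
    table = {**rates, **(override or {})}
    rate = table.get(call.get("model"))
    if rate is None:
        # Record the gap explicitly instead of silently zeroing it.
        return {"cost_usd": None, "cost_status": "unknown"}

    prompt = call.get("prompt_tokens") or 0
    completion = call.get("completion_tokens") or 0
    cost = (prompt / 1000) * rate["prompt_per_1k"] \
        + (completion / 1000) * rate["completion_per_1k"]
    return {"cost_usd": round(cost, 6), "cost_status": "estimated"}


rates = {"example-model": {"prompt_per_1k": 0.5, "completion_per_1k": 1.5}}
call = {"model": "example-model", "prompt_tokens": 2000, "completion_tokens": 1000}
print(resolve_cost(call, rates))
```

Keeping the status next to the amount lets the T02 rollup sum known and estimated costs into separate fields.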
+
+### T04 — Plan-vs-actual variance and surfacing
+
+```task
+id: IB-WP-0019-T04
+status: todo
+priority: medium
+```
+
+- Compute a small variance record on each run:
+  `actual_calls / estimated_calls`,
+  `actual_tokens / estimated_tokens`,
+  `actual_cost / estimated_cost`, plus per-stage variance
+- Persist to `output/budget/summary.yaml` (overwrite each run; previous
+  versions live in the `usage.yaml` history)
+- Surface a one-line variance summary in
+  `reports/generation-summary.md` (touches T07 of IB-WP-0016)
+- Add the variance summary to `generate status` JSON output
+- Tests: zero-cost fixture run, known-model OpenRouter mock run,
+  missing-plan run (variance fields are null but the run still records)
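
Reviewer sketch for T04, not part of the committed file. It computes the three ratios plus per-stage variance and returns `None` wherever the plan or the actual figure is missing, which matches the missing-plan test case. The dict shape (`calls`, `tokens`, `cost_usd`, `stages`) and the `*_ratio` key names are assumptions.

```python
METRICS = ("calls", "tokens", "cost_usd")


def ratio(actual, estimated):
    """Return actual / estimated, or None when a side is missing or the estimate is zero."""
    if actual is None or estimated is None or estimated == 0:
        return None
    return actual / estimated


def variance_record(actual, plan=None):
    """Build run-level and per-stage variance; a missing plan yields null ratios."""
    plan = plan or {}
    record = {f"{m}_ratio": ratio(actual.get(m), plan.get(m)) for m in METRICS}
    plan_stages = plan.get("stages", {})
    record["stages"] = {
        stage: {
            f"{m}_ratio": ratio(values.get(m), plan_stages.get(stage, {}).get(m))
            for m in METRICS
        }
        for stage, values in actual.get("stages", {}).items()
    }
    return record
```

Treating a zero estimate as `None` avoids a division error on zero-cost fixture runs while keeping the record shape stable.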
+
+### T05 — State-hub token-event emission
+
+```task
+id: IB-WP-0019-T05
+status: todo
+priority: medium
+```
+
+- After each completed run, call state-hub `record_token_event` with
+  the run's rollup (tokens in/out, model, USD cost when known,
+  `infospace_slug`, `workspace`)
+- Emit at most one event per run; tag the event with the workplan
+  context when available
+- Failure isolation: a state-hub error must not fail the run; log the
+  failure and continue
+- Honor an opt-out env var `INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS`
+- Tests: monkey-patched hub client, opt-out flag respected, run
+  succeeds when the hub raises
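
Reviewer sketch for T05, not part of the committed file. It shows the failure isolation and the opt-out flag around a single emission. The state-hub client is passed in as a plain callable because the actual `record_token_event` signature is not shown here; passing the rollup as one dict, the logger name, and the set of values that count as "disabled" are all assumptions.

```python
import logging
import os

log = logging.getLogger("infospace_bench.budget")  # hypothetical logger name
OPT_OUT_ENV = "INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS"


def emit_token_event(record_token_event, rollup, environ=None):
    """Emit at most one event for a run; never let a hub error escape."""
    environ = os.environ if environ is None else environ
    # Assumption: any of these values disables emission.
    if environ.get(OPT_OUT_ENV, "").strip().lower() in {"1", "true", "yes"}:
        return "disabled"
    try:
        record_token_event(rollup)
    except Exception as exc:  # broad on purpose: the run must not fail here
        log.warning("state-hub token event failed; continuing: %s", exc)
        return "failed"
    return "emitted"
```

Injecting the callable also makes the monkey-patched test trivial: pass a function that raises and assert the return value is `"failed"`.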
+
+### T06 — Workspace-level rollup CLI
+
+```task
+id: IB-WP-0019-T06
+status: todo
+priority: medium
+```
+
+- Add `infospace-bench budget list <workspace>` that walks
+  `infospaces/*/output/budget/` and prints a JSON table:
+  `slug`, `plans_count`, `runs_count`, `total_tokens`,
+  `total_cost_usd_known`, `total_cost_usd_estimated`, `last_run_at`
+- Add `infospace-bench budget show <infospace-root>` that prints the
+  full per-infospace budget structure
+- Tests: empty workspace, multiple infospaces, missing budget dir is
+  treated as zero, not an error
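
Reviewer sketch for T06, not part of the committed file. It walks `infospaces/*/output/budget/` and builds the table rows named above, treating a missing budget directory as zero. The file parser is injected as `load` (for example `yaml.safe_load` from PyYAML) to keep the sketch dependency-free. The per-file keys it reads (`snapshots`, `runs`, `recorded_at`, and the per-run cost fields) are assumptions about the eventual schema.

```python
from pathlib import Path


def _read(path, load):
    """Parse a budget file, or return an empty mapping when it does not exist."""
    if not path.is_file():
        return {}
    return load(path.read_text(encoding="utf-8")) or {}


def list_budgets(workspace, load):
    """Return one summary row per infospace under <workspace>/infospaces/."""
    rows = []
    for root in sorted(Path(workspace).glob("infospaces/*")):
        if not root.is_dir():
            continue
        budget = root / "output" / "budget"
        plans = _read(budget / "plans.yaml", load)
        runs = _read(budget / "usage.yaml", load).get("runs", [])
        rows.append({
            "slug": root.name,
            "plans_count": len(plans.get("snapshots", [])),
            "runs_count": len(runs),
            "total_tokens": sum(r.get("total_tokens") or 0 for r in runs),
            "total_cost_usd_known": sum(r.get("cost_usd_known") or 0 for r in runs),
            "total_cost_usd_estimated": sum(r.get("cost_usd_estimated") or 0 for r in runs),
            "last_run_at": max(
                (r["recorded_at"] for r in runs if "recorded_at" in r), default=None
            ),
        })
    return rows
```

Comparing `recorded_at` values as strings only orders correctly if every run writes ISO 8601 timestamps in the same timezone, which is worth pinning in the T02 schema.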
+
+### T07 — Archive integration
+
+```task
+id: IB-WP-0019-T07
+status: todo
+priority: low
+```
+
+- Confirm `output/budget/` ends up inside the archive package built by
+  `IB-WP-0014`'s `archive_infospace()` (it should, via the existing
+  default-include rules — verify with a test)
+- Add a `budget_summary` field to the archive manifest so
+  catalog-level tools can find the cost shape of an archived infospace
+  without unpacking it
+
+## Acceptance
+
+- A `generate plan` invocation persists a snapshot to
+  `output/budget/plans.yaml` and is idempotent across runs
+- A `generate run` invocation appends a usage rollup to
+  `output/budget/usage.yaml`, writes a variance summary, and emits one
+  state-hub token event (when the hub is reachable)
+- `generate status` and the generation-summary report surface the
+  plan-vs-actual variance for the most recent run
+- `infospace-bench budget list <workspace>` returns a parseable rollup
+  across all infospaces in a workspace
+- Archived infospace packages carry their budget log and expose a
+  `budget_summary` field in the archive manifest
+- Tests cover plan persistence, run rollup, rate-table resolution,
+  variance, state-hub emission with hub-down isolation, and the
+  workspace CLI
+
+## Risks and open questions
+
+- **Rate-table drift.** Provider prices change. The rate table will go
+  stale unless someone refreshes it. Add `captured_at` to every entry
+  and surface "rate older than 90 days" as a warning in budget output;
+  do not block.
+- **Multiple-provider cost.** When a single run mixes providers (e.g.
+  fixture for cheap stages + OpenRouter for expensive ones), the
+  rollup must split clearly. The model+provider bucketing in T02
+  covers this; tests should pin the behaviour.
+- **State-hub coupling.** Emitting token events introduces a
+  cross-repo write. T05 keeps it opt-out and failure-isolated, but
+  callers running offline want zero coupling, so the default must be
+  "emit if reachable, silent skip otherwise" rather than "fail if
+  unreachable".
+- **Concurrency.** Two `generate run` invocations on the same
+  infospace would race on `usage.yaml`. Existing infospace workflows
+  assume sequential runs; document the constraint rather than building
+  locks.
+- **Budget vs adaptive observations.** This workplan records *what
+  happened*. `LLM-WP-0004` records *what we learned about quality*.
+  Keep them as two distinct files / schemas so the layering stays
+  inspectable; do not merge.
+- **Privacy.** Usage records do not include prompt or completion
+  text — only counts and identifiers. State-hub events likewise. If
+  this assumption later changes, add an explicit redaction hook before
+  doing so.
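
Reviewer sketch for the rate-table drift risk listed above, not part of the committed file. It flags entries whose `captured_at` is older than 90 days so that budget output can warn without blocking. It assumes `captured_at` is stored as an ISO date string (`YYYY-MM-DD`); the function name `stale_rates` is hypothetical.

```python
from datetime import date, timedelta


def stale_rates(rates, today, max_age_days=90):
    """Return model names whose rate entry was captured more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        model
        for model, rate in rates.items()
        if date.fromisoformat(rate["captured_at"]) < cutoff
    )
```

The caller decides how to surface the result; returning a list rather than raising keeps the "warn, do not block" behaviour explicit.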
+
+## Downstream effects
+
+- `IB-WP-0018` (adaptive routing consumer) gains a local history to
+  cross-check against the `QualityLedger` once `LLM-WP-0004` lands
+- `IB-WP-0016-T07` (review report and output policy) can pull the
+  variance summary directly instead of regenerating numbers
+- `IB-WP-0014` archives become budget-bearing artifacts without code
+  changes beyond T07's manifest field
+- State-hub's `get_token_summary` finally sees infospace-bench runs
+  alongside other domains' token spend