Commit Graph

7 Commits

Author SHA1 Message Date
bb70b2f4b9 IB-WP-0019-T07: archive integration; close IB-WP-0019
The default archive include set pulls output/ in wholesale, so
output/budget/ already lands inside the archive package with no code
change. Add a budget_summary block to ArchiveRecord.metadata so
catalog-level tools can see plans_count, runs_count, total_tokens,
total_cost_usd_known, total_cost_usd_estimated, and the
latest_snapshot_id without unpacking the archive. An infospace with no
budget data still archives cleanly with an empty metadata dict.
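A minimal sketch of how such a metadata block could be assembled. The helper name build_archive_metadata and the SUMMARY_KEYS tuple are hypothetical; only the key names and the empty-dict behaviour come from the message above.

```python
# Keys named in the commit message; everything else here is illustrative.
SUMMARY_KEYS = (
    "plans_count",
    "runs_count",
    "total_tokens",
    "total_cost_usd_known",
    "total_cost_usd_estimated",
    "latest_snapshot_id",
)


def build_archive_metadata(budget_summary=None):
    """Return archive metadata; an infospace without budget data gets {}."""
    if not budget_summary:
        return {}
    # Copy only the catalog-level keys so the block stays small and stable.
    return {"budget_summary": {k: budget_summary.get(k) for k in SUMMARY_KEYS}}
```

Keys missing from the input come back as None in this sketch; the real ArchiveRecord may handle them differently.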

Closes IB-WP-0019 (Budget and Usage Registry): T01-T07 all done.
Of the three-layer design, layers 1 and 3 have landed here: layer 1
(per-infospace plans.yaml / usage.yaml / summary.yaml) and layer 3
(state-hub record_token_event emission with failure isolation). Layer 2
(cross-application QualityLedger for adaptive routing) is parked in
llm-connect LLM-WP-0004, and infospace-bench IB-WP-0018 is waiting on it.

122 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 21:53:28 +02:00
816a95b3ef IB-WP-0019-T06: workspace budget CLI
infospace-bench budget list <workspace> walks <workspace>/infospaces/*
and prints one row per infospace with slug, plans_count, runs_count,
total_tokens, total_cost_usd_known, total_cost_usd_estimated,
last_run_at, and latest_snapshot_id. infospace-bench budget show
<root> dumps the full plans/usage/summary structure for a single
infospace.

Missing budget directories are treated as zero rows rather than errors,
so the CLI is safe to run on partially-populated or fresh workspaces.
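A rough sketch of the walk, reading "zero rows" as rows whose counts are all zero (an assumption). The function budget_rows, the ZERO_ROW defaults, and the injected read_summary callable are hypothetical names, not the CLI's actual code.

```python
from pathlib import Path

# Default counts for an infospace that has no budget directory yet.
ZERO_ROW = {"plans_count": 0, "runs_count": 0, "total_tokens": 0}


def budget_rows(workspace, read_summary):
    """Build one row per infospace under <workspace>/infospaces/*."""
    rows = []
    infospaces = Path(workspace) / "infospaces"
    if not infospaces.is_dir():
        return rows  # fresh workspace: nothing to list, but no error
    for root in sorted(p for p in infospaces.iterdir() if p.is_dir()):
        row = {"slug": root.name, **ZERO_ROW}
        budget_dir = root / "output" / "budget"
        if budget_dir.is_dir():
            # read_summary is supplied by the caller (e.g. a YAML loader).
            row.update(read_summary(budget_dir))
        rows.append(row)
    return rows
```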

120 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 20:44:40 +02:00
110c78b9ad IB-WP-0019-T05: state-hub token-event emission with failure isolation
Emit one record_token_event payload per completed generate run, derived
from the just-recorded usage rollup. tokens_in/out come from the
rollup, model defaults to the dominant model used (or "mixed" when
buckets disagree), agent="infospace-bench", ref_type="session", and
ref_id="<slug>/run-<run_index>". The note carries the infospace slug,
workspace, snapshot_id, and any known/estimated cost so the hub event
is self-describing.

Failure isolation: any exception from the HTTP poster (hub down,
timeout, 5xx) is caught, logged to stderr, and reported as
status=failed; the generate run still completes. INFOSPACE_BENCH_HUB_URL
overrides the default http://127.0.0.1:8000 base;
INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS skips emission entirely.
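The failure-isolation pattern can be sketched as follows. The function emit_token_event, the poster callable, and the "sent"/"skipped" status strings are assumptions for illustration; the environment variable names, the default base URL, and status=failed come from the message above. Treating any non-empty value of the disable variable as "on" is also an assumption.

```python
import os
import sys

DEFAULT_HUB_URL = "http://127.0.0.1:8000"


def emit_token_event(payload, poster, env=None):
    """Post one token event; never let a hub failure escape to the caller."""
    env = os.environ if env is None else env
    if env.get("INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS"):
        return {"status": "skipped"}
    base_url = env.get("INFOSPACE_BENCH_HUB_URL", DEFAULT_HUB_URL)
    try:
        poster(base_url, payload)  # hypothetical HTTP poster callable
    except Exception as exc:  # hub down, timeout, 5xx, ...
        print(f"token event emission failed: {exc}", file=sys.stderr)
        return {"status": "failed", "error": str(exc)}
    return {"status": "sent"}
```

Because the exception is converted into a status value, the surrounding generate run can record the outcome and carry on.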

Tests cover the happy path, the disable env var, poster failure, the
no-usage skip, multi-model coalescing to "mixed", and an end-to-end
run_generation against an unbindable hub port to prove the run survives
when the hub is unreachable. 116 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 20:33:29 +02:00
d4c9c56f5c IB-WP-0019-T04: plan-vs-actual variance and surfacing
After every generate run, compute variance between the executing plan
snapshot and the just-recorded usage rollup, persist it to
output/budget/summary.yaml (overwrite-on-run), and surface it both in
the generate status JSON (new budget_summary field) and as a "Plan
variance" line in reports/generation-summary.md.

Variance fields: calls / prompt_tokens / total_tokens each carry
{estimated, actual, delta, ratio}; cost_usd carries {estimated,
actual_known, actual_estimated_from_rates, actual_total, delta, ratio};
per_workflow rolls the per-bucket usage up to the same workflow_id grain
the plan reports. Runs whose snapshot_id cannot be resolved (no prior
plan, or pruned from the retention window) still record a variance row
with null comparison fields and snapshot_resolved=false, so the
consumer always sees a current summary.
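As an illustration of the {estimated, actual, delta, ratio} shape, here is a minimal sketch. It assumes delta = actual - estimated and ratio = actual / estimated, neither of which the message spells out; variance_field is a hypothetical name.

```python
def variance_field(estimated, actual):
    """Compare one planned figure with its actual value.

    When no plan snapshot could be resolved (estimated is None), the
    comparison fields are null but the actual value is still reported.
    """
    if estimated is None:
        return {"estimated": None, "actual": actual, "delta": None, "ratio": None}
    delta = actual - estimated
    # Avoid dividing by zero when the plan estimated nothing.
    ratio = actual / estimated if estimated else None
    return {"estimated": estimated, "actual": actual, "delta": delta, "ratio": ratio}
```

For example, variance_field(100, 150) yields a delta of 50 and a ratio of 1.5 under these assumptions.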

Reordered run_generation so usage and variance are written before the
generation report, allowing the report to embed the variance line on
the same pass.

110 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 20:06:19 +02:00
a4dde53fc3 IB-WP-0019-T03: rate-table cost computation
Ship a starter model rate table at src/infospace_bench/model_rates.yaml
(prompt_per_1k / completion_per_1k for the OpenRouter models we have
actually touched: gpt-4o, gpt-4o-mini, gpt-4-turbo, claude 3.5 sonnet
and haiku, claude 3 opus, gemini 1.5 flash/pro, llama 3.1 70b) and a
load_rate_table() / estimate_cost_usd() pair that overlays an optional
<workspace>/model-rates.yaml on top of the bundled defaults.

generate run now passes a workspace-aware cost_resolver into
record_run_usage, so cost_usd_estimated lands on every usage bucket
whose model matches the table. Adapter-returned cost still wins
(cost_status="known"); rate-table cost is reported under
cost_status="estimated"; unmatched models are recorded as
cost_status="unknown" rather than silently zeroed. Rate-table file is
listed in pyproject.toml package-data so pip-installed users keep the
defaults.
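The precedence described above (adapter cost, then rate table, then unknown) could look roughly like this. The names overlay_rates and resolve_cost are hypothetical and do not reflect the real signatures of load_rate_table() or estimate_cost_usd(), which the log does not show; the per-model replacement semantics of the overlay is likewise an assumption.

```python
def overlay_rates(bundled, workspace=None):
    """Workspace entries replace bundled entries model by model."""
    merged = dict(bundled)
    merged.update(workspace or {})
    return merged


def resolve_cost(adapter_cost, model, prompt_tokens, completion_tokens, rates):
    """Return (cost_status, cost_usd) for one usage bucket."""
    if adapter_cost is not None:
        return ("known", adapter_cost)  # adapter-returned cost wins
    rate = rates.get(model)
    if rate is None:
        return ("unknown", None)  # never silently zeroed
    cost = (
        prompt_tokens / 1000 * rate["prompt_per_1k"]
        + completion_tokens / 1000 * rate["completion_per_1k"]
    )
    return ("estimated", cost)
```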

106 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:54:30 +02:00
678508226a IB-WP-0019-T02: usage rollup from run records
Every completed generate run now aggregates per-call adapter usage from
the workflow-engine run records into output/budget/usage.yaml. Per-call
data is bucketed by (workflow_id, stage_id, provider, model) with
running totals for calls, prompt_tokens, completion_tokens,
total_tokens, and cost_usd_known (sum of adapter-reported cost when the
provider returns it; usually zero today). A run-level entry captures
run_index, started_at, completed_at, duration_seconds, the executing
plan snapshot_id (resolved from the latest plans.yaml entry), and the
workflow-level run_id / stage_count summaries.
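A small sketch of the bucketing step, assuming each per-call record is a dict carrying the four key fields plus token counts and an optional cost_usd. The record layout and the rollup_calls name are illustrative, not taken from the repository.

```python
from collections import defaultdict


def rollup_calls(calls):
    """Aggregate per-call usage into (workflow, stage, provider, model) buckets."""
    buckets = defaultdict(lambda: {
        "calls": 0,
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "total_tokens": 0,
        "cost_usd_known": 0.0,
    })
    for call in calls:
        key = (call["workflow_id"], call["stage_id"], call["provider"], call["model"])
        prompt = call.get("prompt_tokens", 0)
        completion = call.get("completion_tokens", 0)
        bucket = buckets[key]
        bucket["calls"] += 1
        bucket["prompt_tokens"] += prompt
        bucket["completion_tokens"] += completion
        bucket["total_tokens"] += prompt + completion
        # Only adapter-reported cost is summed; missing cost counts as zero.
        bucket["cost_usd_known"] += call.get("cost_usd") or 0.0
    return dict(buckets)
```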

cost_usd_estimated is left as None for this task; T03 wires the
rate-table resolver so the same bucket gets a model-priced fallback
when the adapter does not return cost directly.

Fixture-mode runs are recorded with provider='fixture', zero tokens,
and cost_status='unknown' rather than silently skipped, so the rollup
honestly reflects which stages actually ran.

102 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:46:40 +02:00
182f7011bb IB-WP-0019-T01: plan snapshot persistence
Every generate plan invocation now appends its compact summary to
output/budget/plans.yaml with a deterministic 12-char snapshot_id
hashed over the selection filters and the estimated call/token/cost
totals. Identical-fingerprint plans refresh the most recent entry's
recorded_at instead of stacking duplicates. Retention defaults to the
last 50 snapshots; older entries are pruned and tallied in a top-level
pruned_count field.
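The snapshot behaviour can be sketched like this. The choice of SHA-256 over canonical JSON, the in-memory document layout, and the names snapshot_id_for and record_plan are all assumptions; the message only fixes the 12-character length, the refresh-on-duplicate rule, the retention of 50, and pruned_count. The sketch also assumes the fingerprint is compared against the most recent entry only.

```python
import hashlib
import json


def snapshot_id_for(filters, totals):
    """Deterministic 12-char fingerprint of the filters and estimated totals."""
    payload = json.dumps(
        {"filters": filters, "totals": totals},
        sort_keys=True,  # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record_plan(doc, filters, totals, recorded_at, retention=50):
    """Append a plan snapshot to doc, deduplicating and pruning as needed."""
    sid = snapshot_id_for(filters, totals)
    plans = doc.setdefault("plans", [])
    doc.setdefault("pruned_count", 0)
    if plans and plans[-1]["snapshot_id"] == sid:
        plans[-1]["recorded_at"] = recorded_at  # refresh instead of stacking
        return sid
    plans.append({
        "snapshot_id": sid,
        "recorded_at": recorded_at,
        "filters": filters,
        "totals": totals,
    })
    overflow = len(plans) - retention
    if overflow > 0:
        del plans[:overflow]  # drop the oldest entries
        doc["pruned_count"] += overflow
    return sid
```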

The summary now echoes its input filters (chapter_filter, chunk_filter,
from_chapter, to_chapter) so reviewers can read the snapshot without
cross-referencing the CLI invocation.

New module src/infospace_bench/budget.py owns layer 1 (per-infospace
recording) of the IB-WP-0019 three-layer design; layer 2 still belongs
in llm-connect LLM-WP-0004 and layer 3 in state-hub.

99 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 19:19:35 +02:00