infospace-bench

coulomb/infospace-bench


generated from coulomb/repo-seed

Commit Graph



tegwick

3ca891de4a

fix: review findings from Lefevre live smoke

Two small fixes informed by the 2026-05-18 live OpenRouter chapter-I run.

1. extract-entities templates (trading-literature and general-knowledge):
   the # Entity Title placeholder was interpreted by gpt-4o-mini as a
   literal heading prefix, so every entity came back as "# Entity Title:
   Bucket Shop" etc. The instruction now spells the placeholder out
   with concrete examples and an explicit "not the literal string"
   note, so smaller models hit the intended shape.

2. generate plan grows --model <id>. When supplied, the cost estimate
   pulls per-prompt and per-completion rates from the bundled
   model_rates.yaml instead of multiplying a single blended
   --cost-per-1k value across all tokens. The summary now also returns
   a separate estimated_completion_tokens field plus a cost_source tag
   ("rate_table:<model>" | "cost_per_1k_blended" | None).
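A minimal sketch of the two costing paths in item 2. It assumes a Python code base and a rate table already loaded from model_rates.yaml into a dict of USD-per-1k rates. The function name estimate_cost, the dict keys prompt_per_1k / completion_per_1k, and the fallback to the blended rate when the model id is missing from the table are all assumptions. Only the cost_source values come from the commit message.

```python
from typing import Optional

# Hypothetical shape of the parsed model_rates.yaml (USD per 1k tokens):
# {model_id: {"prompt_per_1k": float, "completion_per_1k": float}}
RateTable = dict[str, dict[str, float]]


def estimate_cost(
    prompt_tokens: int,
    completion_tokens: int,
    model: Optional[str] = None,
    rates: Optional[RateTable] = None,
    cost_per_1k: Optional[float] = None,
) -> tuple[Optional[float], Optional[str]]:
    """Return (estimated_usd, cost_source); a sketch, not the project's code."""
    # Preferred path: separate prompt and completion rates for a known model.
    if model is not None and rates is not None and model in rates:
        r = rates[model]
        usd = (prompt_tokens / 1000) * r["prompt_per_1k"] + (
            completion_tokens / 1000
        ) * r["completion_per_1k"]
        return usd, f"rate_table:{model}"
    # Fallback (assumed): one blended rate multiplied across all tokens.
    if cost_per_1k is not None:
        usd = ((prompt_tokens + completion_tokens) / 1000) * cost_per_1k
        return usd, "cost_per_1k_blended"
    # No pricing information supplied, so no estimate.
    return None, None
```

Returning the source tag alongside the number lets a caller tell a per-model estimate from a blended guess without re-deriving which flags were passed.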

This is a stopgap. LLM-WP-0005 (proposed in llm-connect this round)
will move the rate registry and token-shape problem classes upstream
so consumers stop re-implementing them.

The live smoke ran 28k prompt tokens / 7.5k completion / $0.0088
actual. With --model openai/gpt-4o-mini the plan estimate now lands at
$0.0076 (within 14% of actual) versus the prior $8.40 estimate at
--cost-per-1k 0.30.
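The accuracy claim can be checked using only the numbers in this paragraph; no per-model rates are assumed.

```python
# Numbers quoted in the commit message.
actual_usd = 0.0088          # live smoke, actual spend
new_estimate_usd = 0.0076    # plan estimate with --model openai/gpt-4o-mini
prompt_tokens = 28_000       # live smoke prompt tokens
blended_per_1k = 0.30        # prior --cost-per-1k value (USD per 1k tokens)

# Relative error of the new estimate against the actual spend.
rel_error = abs(actual_usd - new_estimate_usd) / actual_usd
print(f"new estimate off by {rel_error:.1%}")

# The prior $8.40 figure matches 28k tokens priced at the blended $0.30/1k.
prior = prompt_tokens / 1000 * blended_per_1k
print(f"blended: ${prior:.2f}, {prior / actual_usd:.0f}x the actual spend")
```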

181 tests pass, 2 skipped (both live OpenRouter smokes).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-19 04:30:33 +02:00

tegwick

13f9c1895c

IB-WP-0016-T03: scale-aware planning

Replace generate plan's full-prompt dump with a compact summary that
reports selected-chunk counts, selected chapter numbers, per-workflow
call counts, prompt-word and token estimates, and a rough USD cost when
--cost-per-1k is supplied. Selection filters --chapter (label or number,
repeatable), --from-chapter / --to-chapter (numeric range), and --chunk
(repeatable id) shape the estimate. Budget caps --max-calls and
--cost-cap are reported as exceeds_* booleans so callers can fail fast
before run.
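A sketch of how the selection filters and budget caps could fit together, assuming Python. Chunk, select_chunks, budget_flags, and the exceeds_max_calls / exceeds_cost_cap key names are hypothetical. The commit does not say how filters combine, so the sketch assumes a chunk must pass every filter that was supplied, and that repeated --chapter or --chunk values are alternatives.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Chunk:
    # Hypothetical chunk record; real field names may differ.
    chunk_id: str
    chapter_number: int
    chapter_label: str


def select_chunks(
    chunks: Iterable[Chunk],
    chapters: Iterable[str] = (),
    from_chapter: Optional[int] = None,
    to_chapter: Optional[int] = None,
    chunk_ids: Iterable[str] = (),
) -> list[Chunk]:
    """Keep chunks that pass every supplied filter (assumed AND semantics)."""
    wanted_chapters = set(chapters)
    wanted_ids = set(chunk_ids)
    selected = []
    for c in chunks:
        # --chapter accepts either the chapter label or its number.
        if wanted_chapters and not (
            c.chapter_label in wanted_chapters
            or str(c.chapter_number) in wanted_chapters
        ):
            continue
        # --from-chapter / --to-chapter bound an inclusive numeric range.
        if from_chapter is not None and c.chapter_number < from_chapter:
            continue
        if to_chapter is not None and c.chapter_number > to_chapter:
            continue
        # --chunk restricts the selection to explicit chunk ids.
        if wanted_ids and c.chunk_id not in wanted_ids:
            continue
        selected.append(c)
    return selected


def budget_flags(
    calls: int,
    estimated_usd: Optional[float],
    max_calls: Optional[int] = None,
    cost_cap: Optional[float] = None,
) -> dict[str, bool]:
    """Report caps as booleans instead of aborting; unset caps never trip."""
    return {
        "exceeds_max_calls": max_calls is not None and calls > max_calls,
        "exceeds_cost_cap": (
            cost_cap is not None
            and estimated_usd is not None
            and estimated_usd > cost_cap
        ),
    }
```

Returning booleans rather than raising leaves the fail-fast decision to the caller, as the commit describes.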

The old full per-workflow plan with prompts remains available behind
--full so deep inspection is opt-in instead of the default.

Whole-Lefevre estimate at default max_words=800: 146 chunks, 730 calls,
~518k prompt tokens, ~$155 at $0.30/1k. Chapters 3-5 only: 19 chunks,
95 calls, ~64k tokens. 87 tests pass.
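The quoted estimates are internally consistent. Using only the figures above, both selections work out to 5 calls per chunk (plausibly one per workflow, though that is an inference), and ~518k prompt tokens at $0.30/1k reproduces the ~$155 total. The chapters 3-5 cost printed below is derived here, not quoted in the commit.

```python
# Figures quoted in the commit message (token counts are rounded "~" values).
whole = {"chunks": 146, "calls": 730, "prompt_tokens": 518_000}
ch3_5 = {"chunks": 19, "calls": 95, "prompt_tokens": 64_000}
blended_per_1k = 0.30  # USD per 1k tokens

for name, est in (("whole Lefevre", whole), ("chapters 3-5", ch3_5)):
    calls_per_chunk = est["calls"] / est["chunks"]
    tokens_per_call = est["prompt_tokens"] / est["calls"]
    usd = est["prompt_tokens"] / 1000 * blended_per_1k
    print(f"{name}: {calls_per_chunk:g} calls/chunk, "
          f"~{tokens_per_call:.0f} tokens/call, ~${usd:.2f}")
```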

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-17 18:18:09 +02:00

2 Commits