Two small fixes informed by the 2026-05-18 live OpenRouter chapter-I run.
1. extract-entities templates (trading-literature and general-knowledge):
the # Entity Title placeholder was interpreted by gpt-4o-mini as a
literal heading prefix, so every entity came back as "# Entity Title:
Bucket Shop" etc. The instruction now spells the placeholder out
with concrete examples and an explicit "not the literal string"
note, so smaller models hit the intended shape.
2. generate plan grows --model <id>. When supplied, the cost estimate
pulls per-prompt and per-completion rates from the bundled
model_rates.yaml instead of multiplying a single blended
--cost-per-1k value across all tokens. The summary now also returns
a separate estimated_completion_tokens field plus a cost_source tag
("rate_table:<model>" | "cost_per_1k_blended" | None).
This is a stopgap. LLM-WP-0005 (proposed in llm-connect this round)
will move the rate registry and token-shape problem classes upstream
so consumers stop re-implementing them.
The live smoke used 28k prompt tokens and 7.5k completion tokens for an
actual cost of $0.0088. With --model openai/gpt-4o-mini the plan
estimate now lands at $0.0076 (within 14% of actual) versus the prior
$8.40 estimate at --cost-per-1k 0.30.
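The gap between the two estimates is plain arithmetic. A minimal sketch,
assuming gpt-4o-mini rates of $0.00015 per 1k prompt tokens and $0.0006
per 1k completion tokens (illustrative values, not read from the bundled
model_rates.yaml). The function names are hypothetical and the token
counts are the live-run actuals quoted above, not the plan's estimates:

```python
def blended_cost(tokens: int, cost_per_1k: float) -> float:
    """Old shape: one blended rate multiplied across the token count."""
    return tokens / 1000 * cost_per_1k


def rate_table_cost(prompt_tokens: int, completion_tokens: int, rates: dict) -> float:
    """New shape: prompt and completion tokens are priced separately."""
    return (
        prompt_tokens / 1000 * rates["prompt_per_1k"]
        + completion_tokens / 1000 * rates["completion_per_1k"]
    )


# Assumed rates for openai/gpt-4o-mini (USD per 1k tokens).
ASSUMED_RATES = {"prompt_per_1k": 0.00015, "completion_per_1k": 0.0006}

old_estimate = blended_cost(28_000, 0.30)                    # about 8.40
new_estimate = rate_table_cost(28_000, 7_500, ASSUMED_RATES) # about 0.0087
```

Under these assumed rates the split pricing lands within a fraction of
a cent of the observed $0.0088, while the blended rate is off by three
orders of magnitude.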
181 tests pass, 2 skipped (both live OpenRouter smokes).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
fix(evaluation_io): tolerate code-fenced frontmatter and varied score
shapes from small LLMs
Two bugs surfaced running the first live Lefevre chapter-I smoke
against openai/gpt-4o-mini.
1. _relative_to_root doubled artifact paths when --workspace was a
relative path (e.g. "."). The function received an already
CWD-relative path like infospaces/foo/artifacts/sources/x.md and
re-prepended root, so infospaces/foo/infospaces/foo/... was stored in
artifacts/index.yaml and file reads failed on the subsequent
workflow stage. Fix: when raw is relative, try
CWD-relative resolution first (matches root / sub call shapes);
fall back to root-prefixing only when the CWD interpretation does
not land under root (matches bare relative-subpath call shapes
from rendered template outputs).
2. _read_frontmatter_markdown only accepted a literal ---/---
delimited block at the start of the file. gpt-4o-mini emitted four
other shapes across the seven evaluation files this chapter
produced:
- ```yaml ... ``` fence (no --- delimiters)
- ```markdown ... ``` outer fence wrapping --- frontmatter
- scores as mapping ({groundedness: 4, ...}) instead of the
canonical list of {name, value} dicts
- scores as list of single-key dicts ([{groundedness: 4}, ...])
Fix: _extract_frontmatter_block tolerates ```yaml fences and strips
```markdown outer fences; _normalise_scores rewrites mapping- and
single-key-dict shapes into the canonical form so ScoreEntry.from_dict
keeps working.
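The tolerant parsing in fix 2 can be sketched as follows. This is an
illustration under assumptions, not the shipped _extract_frontmatter_block
or _normalise_scores: the function names, return shapes, and the
ValueError on unknown shapes are hypothetical.

```python
FENCE = "`" * 3  # built at runtime so this example nests inside a fenced block


def extract_frontmatter_block(text):
    """Return (frontmatter_yaml, body) for the delimiter shapes listed above."""
    lines = text.strip().splitlines()
    # Strip an outer markdown fence that wraps the whole file.
    if lines and lines[0].strip() == FENCE + "markdown":
        lines = lines[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
    if not lines:
        return "", ""
    opener = lines[0].strip()
    if opener == "---":
        closer = "---"
    elif opener == FENCE + "yaml":
        closer = FENCE
    else:
        return "", "\n".join(lines)  # no frontmatter found
    for index in range(1, len(lines)):
        if lines[index].strip() == closer:
            return "\n".join(lines[1:index]), "\n".join(lines[index + 1:]).strip()
    return "", "\n".join(lines)  # opener without a closer


def normalise_scores(scores):
    """Rewrite mapping and single-key-dict shapes into [{name, value}, ...]."""
    if isinstance(scores, dict):
        return [{"name": name, "value": value} for name, value in scores.items()]
    canonical = []
    for item in scores or []:
        if isinstance(item, dict) and "name" in item and "value" in item:
            canonical.append({"name": item["name"], "value": item["value"]})
        elif isinstance(item, dict) and len(item) == 1:
            (name, value), = item.items()
            canonical.append({"name": name, "value": value})
        else:
            raise ValueError(f"unrecognised score entry: {item!r}")
    return canonical
```

Normalising before the dataclass constructor runs keeps the canonical
shape as the only one the rest of the pipeline has to understand.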
Both fixes are pure-Python; no API changes. 179 tests pass, 2 skipped.
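For the path fix in item 1, the resolution order might look like the
sketch below. It assumes the helper returns a root-relative POSIX path;
the function name, signature, and cwd parameter are hypothetical and
the real _relative_to_root may differ.

```python
from pathlib import Path


def relative_to_root(raw, root, cwd=None):
    """Return raw as a path relative to root without doubling the prefix."""
    cwd = Path(cwd) if cwd is not None else Path.cwd()
    root_abs = (cwd / root).resolve()
    raw_path = Path(raw)
    if raw_path.is_absolute():
        return raw_path.resolve().relative_to(root_abs).as_posix()
    # First try the CWD-relative reading (root / sub call shapes).
    candidate = (cwd / raw_path).resolve()
    try:
        return candidate.relative_to(root_abs).as_posix()
    except ValueError:
        # Not under root when read from CWD: treat raw as a bare
        # subpath that is already relative to root.
        return raw_path.as_posix()
```

Both call shapes then collapse to the same stored value, for example
artifacts/sources/x.md for a root of infospaces/foo.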
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add --shadow-baseline <id> and --shadow-rate <float> opt-in flags to
generate run, generate resume, and generate from-source. When
--shadow-baseline names a candidate id from the routing config,
build_routing_policy_from_config wraps every other candidate in an
llm-connect ShadowingAdapter using that baseline plus a
PairedGrader(ExactMatchJudge()) and the workspace-resolved
QualityLedger. The baseline candidate itself is never wrapped — that
would shadow it against itself. --shadow-rate defaults to 0.1 when
--shadow-baseline is set; passing --shadow-rate without
--shadow-baseline fails fast with shadow_rate_without_baseline.
Setting --shadow-baseline without a ledger_path in the config fails
with missing_routing_ledger_for_shadow so observations have a place to
land before any call goes out.
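The flag rules above reduce to a small validation step. A sketch under
assumptions: InfospaceError here is a local stand-in, the helper name
and return shape are hypothetical, the order of checks is a guess, and
"unknown_shadow_baseline" is an invented code (the other two codes are
the ones named above).

```python
class InfospaceError(Exception):
    """Local stand-in for the project's error type; carries a short code."""


DEFAULT_SHADOW_RATE = 0.1


def resolve_shadow_options(shadow_baseline, shadow_rate, candidate_ids, ledger_path):
    """Validate the shadow flags and decide which candidates get wrapped."""
    if shadow_baseline is None:
        if shadow_rate is not None:
            raise InfospaceError("shadow_rate_without_baseline")
        return None  # shadow mode is off
    if not ledger_path:
        raise InfospaceError("missing_routing_ledger_for_shadow")
    if shadow_baseline not in candidate_ids:
        raise InfospaceError("unknown_shadow_baseline")  # hypothetical code
    rate = DEFAULT_SHADOW_RATE if shadow_rate is None else shadow_rate
    # The baseline is never wrapped: that would shadow it against itself.
    wrapped = [cid for cid in candidate_ids if cid != shadow_baseline]
    return {"baseline": shadow_baseline, "rate": rate, "wrapped": wrapped}
```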
run_generation grew shadow_baseline + shadow_rate kwargs and
_adapter_for("routing", ...) plumbs them into
build_routing_policy_from_config. The wrapped ShadowingAdapter slots
into the policy's prefer/fallback per task type via a
(candidate_id, task_type) reverse lookup, and adapters_by_id on the
adaptive policy gets the string-keyed entries.
Five new tests cover: shadow_rate without baseline fails fast, shadow
mode without a ledger fails fast, unknown shadow baseline id fails
fast, structural assertion that ShadowingAdapter wraps non-baseline
candidates and leaves the baseline raw, and a behavioural check that
shadow_rate=1.0 calls the baseline on every call while shadow_rate=0.0
skips it entirely. The behavioural test forces async_shadow=False so the
call counter is deterministic.
Closes IB-WP-0020: T01-T05 all done. Workplan status flips from active
to finished. 179 tests pass, 2 skipped (both live OpenRouter smokes).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
examples/routing/trading-literature.yaml is the checked-in starting
config for a Lefevre-style run. It applies the IB-WP-0018 task-type
taxonomy: cheap candidates for summary + evaluation, smart candidates
for entity + relation extraction, and a separate baseline rule wiring
claude_code for a follow-on T05 ShadowingAdapter step. A
workspace-relative ledger_path keeps adaptive observations with the
workspace.
tests/test_routing_config.py gains a regression test that asserts the
shipped example parses cleanly, every stage in stage_to_task_type maps
to a declared task type, and the baseline candidate uses the
claude_code provider — so the example will not bit-rot silently.
tests/test_openrouter_live.py gains test_provider_routing_one_chapter_live_smoke
gated on the same INFOSPACE_BENCH_ENABLE_LIVE_OPENROUTER + OPENROUTER_API_KEY
opt-in as the existing static smoke. It builds a one-candidate routing
config, runs a single chapter through --provider routing, and asserts
the per-stage adapter-choices report section names the routed model
and the routed artifacts carry adapter_id provenance.
docs/generic-source-generator.md gains a "Live runs with --provider
routing" subsection that walks through the one-command routed run,
explains the --quality-floor override, and points at the parallel
live smoke test.
174 tests pass, 2 skipped (both live smokes, correctly gated).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add --provider routing, --routing-config <yaml>, and --quality-floor
<float> to generate run, generate resume, and generate from-source.
The CLI flag wiring constructs a RoutingAssistedGenerationAdapter from
the parsed config, with the workspace handed in so any ledger_path in
the config resolves relative to it. --quality-floor overrides the
config-level default_quality_floor for a single invocation.
run_generation gains routing_config + quality_floor kwargs and
_adapter_for grew a "routing" branch. Missing --routing-config with
--provider routing fails fast with InfospaceError("missing_routing_config");
missing API key for any candidate fails fast with
InfospaceError("missing_routing_api_key").
Two small bug fixes surfaced while writing T03:
- routing._identify_adapter now also reads _model from llm-connect
adapters (they keep the model id on a private attribute), so the
per-stage adapter-choice line shows the model id rather than just the
class name.
- budget.TOKEN_EVENTS_PATH corrected from /state/token-events to the
state-hub HTTP endpoint /token-events/ that actually exists; the
failure-isolation in emit_token_event already kept the prior typo
from breaking runs, but the hub never saw the events.
Five new tests cover: _adapter_for refusal on missing config,
_adapter_for happy path, run_generation end-to-end through routing
with a stubbed OpenRouterAdapter.execute_prompt (no network),
workspace-relative ledger resolution, and a CLI subprocess smoke
asserting fast-fail on missing API key.
173 tests pass, 1 skipped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
build_routing_policy_from_config(config, *, workspace=None, env=None,
adapter_factory=None) materialises a parsed RoutingConfig into a live
llm-connect routing policy:
- Static RoutingPolicy when the config has no adaptive signals; one
RoutingRule per task type, prefer = first candidate, fallback =
second candidate (when present), max_cost_per_1k pulled from the
preferred candidate.
- AdaptiveRoutingPolicy when default_quality_floor, any per-task
quality_floor, or ledger_path is set. ledger_path resolves relative
to the supplied workspace; parent directory is created so the
ledger writes never fail on first call.
- API-key resolution from env (default os.environ) using the
per-provider DEFAULT_API_KEY_ENV map; candidate.api_key_env overrides
the default. Missing key raises InfospaceError("missing_routing_api_key")
before any provider constructor runs.
- claude_code candidates need no API key (shells out to the local CLI).
- adapter_factory hook lets tests inject a sentinel-returning factory
so policy construction stays network- and llm-adapter-free.
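Two of the decisions in the list above, sketched with plain dicts. The
dict shapes, helper names, and the OPENAI_API_KEY / GEMINI_API_KEY
defaults are assumptions for illustration; only OPENROUTER_API_KEY and
the error code appear in this repository's own messages.

```python
import os


class InfospaceError(Exception):
    """Local stand-in for the project's error type."""


# Assumed per-provider defaults; candidate["api_key_env"] overrides them.
DEFAULT_API_KEY_ENV = {
    "openrouter": "OPENROUTER_API_KEY",
    "openai": "OPENAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def resolve_api_key(candidate, env=None):
    """Return the API key for a candidate, or None for claude_code."""
    env = os.environ if env is None else env
    if candidate["provider"] == "claude_code":
        return None  # shells out to the local CLI, no key needed
    variable = candidate.get("api_key_env") or DEFAULT_API_KEY_ENV[candidate["provider"]]
    key = env.get(variable)
    if not key:
        raise InfospaceError("missing_routing_api_key")
    return key


def wants_adaptive(config):
    """True when any adaptive signal is present in the parsed config."""
    if config.get("default_quality_floor") is not None or config.get("ledger_path"):
        return True
    return any(
        spec.get("quality_floor") is not None
        for spec in config.get("task_types", {}).values()
    )
```

Passing env explicitly is what lets the tests exercise the missing-key
path without touching the real process environment.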
Eight new tests cover: static-policy default, adaptive selection via
ledger_path, adaptive selection via quality_floor, multi-candidate
fallback rule, real-factory smoke (OpenRouterAdapter constructed with
env API key), missing-key fast-fail, claude_code zero-key path, and
custom api_key_env override.
168 tests pass, 1 skipped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add a small YAML routing config schema (schema_version 1) and a
parser-only loader at src/infospace_bench/routing_config.py. The
loader validates the declarative shape — task_types with candidates,
optional per-task quality_floor, optional default_quality_floor,
optional ledger_path, optional stage_to_task_type override map — and
refuses bad shapes before any network or workspace work happens.
Supported provider names: openrouter, claude_code, openai, gemini.
Unknown providers, missing required candidate fields, out-of-range
quality floors, negative max_cost_per_1k, duplicate candidate ids
within a task type, and non-mapping stage_to_task_type all raise
focused InfospaceError codes that callers can pattern-match.
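A parser-only validator for this shape could be sketched as below. The
error code strings, the required candidate fields (id, provider), and
the 0..1 range for quality floors are assumptions; the real loader's
codes and rules live in routing_config.py.

```python
SUPPORTED_PROVIDERS = {"openrouter", "claude_code", "openai", "gemini"}


class ConfigError(ValueError):
    """Stand-in for InfospaceError; every code below is hypothetical."""


def _check_floor(value):
    if value is not None and not 0.0 <= value <= 1.0:
        raise ConfigError("quality_floor_out_of_range")


def validate_routing_config(data):
    """Refuse bad shapes before any network or workspace work happens."""
    if data.get("schema_version") != 1:
        raise ConfigError("unsupported_schema_version")
    _check_floor(data.get("default_quality_floor"))
    if not isinstance(data.get("stage_to_task_type", {}), dict):
        raise ConfigError("stage_map_not_mapping")
    task_types = data.get("task_types")
    if not isinstance(task_types, dict) or not task_types:
        raise ConfigError("empty_task_types")
    for spec in task_types.values():
        _check_floor(spec.get("quality_floor"))
        candidates = spec.get("candidates") or []
        if not candidates:
            raise ConfigError("empty_candidates")
        seen = set()
        for candidate in candidates:
            if "id" not in candidate or "provider" not in candidate:
                raise ConfigError("missing_candidate_field")
            if candidate["provider"] not in SUPPORTED_PROVIDERS:
                raise ConfigError("unsupported_provider")
            if candidate.get("max_cost_per_1k", 0) < 0:
                raise ConfigError("negative_max_cost")
            if candidate["id"] in seen:
                raise ConfigError("duplicate_candidate_id")
            seen.add(candidate["id"])
    return data
```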
docs/routing-config.md documents the schema with two annotated
examples (OpenRouter-only two-tier, and adaptive with a ClaudeCode
baseline) plus the full "what fails fast" list.
16 parser tests cover happy-path round-trip, file load, missing file,
malformed YAML, and every validation surface (wrong/missing schema
version, empty task_types, empty candidates, missing required fields,
unsupported provider, negative cost, out-of-range quality_floor,
duplicate ids, non-mapping stage_map, non-string ledger_path).
T02 will turn a RoutingConfig into a live llm-connect RoutingPolicy /
AdaptiveRoutingPolicy with constructed LLMAdapter instances.
160 tests pass, 1 skipped.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Open a workplan that turns the IB-WP-0018 RoutingAssistedGenerationAdapter
bridge into a first-class CLI option. Adds --provider routing, a YAML
routing config schema, --quality-floor and --shadow-rate /
--shadow-baseline opt-in flags so a real multi-chapter Lefevre live
run can use adaptive cost-quality routing without writing any Python.
Workstream registered with state-hub
(172bb082-610a-477b-b5e0-26c9f4bdfd95) with five tasks:
- T01 routing config file schema (medium)
- T02 routing config loader (high)
- T03 --provider routing + --routing-config + --quality-floor CLI flags
(high)
- T04 example config + optional live routing smoke test (medium)
- T05 --shadow-rate / --shadow-baseline opt-in flags (medium)
Depends on IB-WP-0018 (already done) and LLM-WP-0004 (already done in
~/llm-connect).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
T03 — wrap_with_shadow_sampling() helper in routing.py: builds a
llm-connect ShadowingAdapter around any candidate LLMAdapter with a
caller-supplied baseline, grader, and QualityLedger. async_shadow=True
by default so production load is not doubled; on_shadow_error escape
hatch keeps caller logs informed when a baseline outage swallows the
shadow path. The returned adapter is still an LLMAdapter so it slots
into a RoutingPolicy rule without further code change.
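The shadow-sampling idea can be illustrated with a toy wrapper. This is
not llm-connect's ShadowingAdapter: the class, its constructor, and the
record callback are hypothetical, adapters are modelled as plain
callables, and the shadow call runs synchronously for simplicity.

```python
import random


class ToyShadowingAdapter:
    """Call the candidate always; call the baseline on a sampled fraction."""

    def __init__(self, candidate, baseline, record, shadow_rate=0.1,
                 on_shadow_error=None, rng=None):
        self.candidate = candidate
        self.baseline = baseline
        self.record = record              # receives (prompt, candidate_out, baseline_out)
        self.shadow_rate = shadow_rate
        self.on_shadow_error = on_shadow_error
        self.rng = rng or random.Random()

    def execute_prompt(self, prompt):
        result = self.candidate(prompt)
        # random() is in [0, 1): rate 1.0 always shadows, rate 0.0 never does.
        if self.rng.random() < self.shadow_rate:
            try:
                self.record(prompt, result, self.baseline(prompt))
            except Exception as exc:
                # A baseline outage must not affect the caller's result.
                if self.on_shadow_error is not None:
                    self.on_shadow_error(exc)
        return result
```

Because the wrapper exposes the same execute_prompt call as the adapter
it wraps, it can stand wherever the raw candidate stood.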
T04 — generation report enrichment plus a small CLI helper:
- _collect_adapter_choices walks artifact provenance, groups by
(stage_id, adapter_id), and surfaces calls + prompt/completion tokens
per (stage, adapter) pair in a new ## Per-stage adapter choices
section. Runs that did not go through the bridge have no
provider_metadata.adapter_id and emit an empty list, so fixture-only
reports stay terse.
- summarise_quality_ledger() rolls an llm-connect QualityLedger up by
(task_type, adapter_id) with mean quality, mean cost, observations,
and cumulative tokens.
- infospace-bench routing ledger <path> CLI prints the rollup as JSON.
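The ledger rollup is a group-by over observations. A sketch that works
on plain dicts rather than a real QualityLedger; the observation keys
(task_type, adapter_id, quality, cost, tokens), the output field names,
and the function name are assumptions.

```python
from collections import defaultdict


def summarise_observations(observations):
    """Roll observations up by (task_type, adapter_id)."""
    groups = defaultdict(list)
    for obs in observations:
        groups[(obs["task_type"], obs["adapter_id"])].append(obs)
    rows = []
    for (task_type, adapter_id), items in sorted(groups.items()):
        count = len(items)
        rows.append({
            "task_type": task_type,
            "adapter_id": adapter_id,
            "observations": count,
            "mean_quality": sum(o["quality"] for o in items) / count,
            "mean_cost": sum(o["cost"] for o in items) / count,
            "total_tokens": sum(o["tokens"] for o in items),
        })
    return rows
```

The rows are JSON-serialisable as they stand, which matches a CLI that
prints the rollup as JSON.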
Five new tests cover shadow happy-path, shadow failure isolation,
ledger rollup, the routing CLI, and the report's adapter-choice
aggregation. Closes IB-WP-0018: T01-T05 are all done and the workplan
status flips from blocked to done now that LLM-WP-0004's primitives
have shipped.
144 tests pass, 1 skipped (the OpenRouter live smoke, gated as before).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
T01 — task-type taxonomy. docs/routing-task-types.md names the five
generation stages as the default identity-mapped task types
(summarize-source, extract-entities, extract-relations,
evaluate-entity, synthesize-report) and records the recommended quality
floors per stage. The taxonomy explicitly does not decide which adapter
ships per task type, where the ledger lives, or what a quality score
means — those stay with the caller per the LLM-WP-0004 scope guardrail.
T02 — RoutingAssistedGenerationAdapter bridge in
src/infospace_bench/routing.py. Wraps any llm-connect RoutingPolicy or
AdaptiveRoutingPolicy as an infospace-bench AssistedGenerationAdapter:
maps stage_id -> task_type (overridable), resolves an LLMAdapter,
delegates execute_prompt with a configurable RunConfig, and surfaces
the resolved adapter id, task type, model, usage, and finish_reason
back on AssistedGenerationResult.metadata. Provider tag stays
back-compatible with the strings already used in run records and the
budget rollup (openrouter / claude_code / openai / gemini / mock /
routing).
T05 — eight tests in tests/test_routing_adapter.py cover: static-policy
per-stage resolution, stage_to_task_type overrides, default-mapping
completeness, fall-through for unmapped stage ids, the adaptive path
selecting the cheaper qualifying adapter when a quality_floor is set,
adaptive policy falling back to static when no floor is set, response
metadata round-trip with provider tagging, and estimated_cost_per_1k
pass-through.
Adds llm-connect as a path dependency in pyproject.toml and to the
pytest pythonpath. Static OpenRouter and fixture paths are unchanged;
this commit only adds the option of routing.
139 tests pass, 1 skipped (the OpenRouter live smoke, gated as before).
T03 (shadow-mode integration) and T04 (CLI + per-stage chosen-adapter
in the generation report) follow next.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Enrich reports/generation-summary.md with the review-oriented sections
that the 2026-05-17 smoke run flagged as missing: ## Chapter coverage
(per-chapter source/entity/relation/anchor counts), ## Entities (the
deduped title list), ## Unmapped source chunks (sources with no
downstream generated artifact), and ## Page anchors (total plus
deterministic sample). Sections are conditional on data being present
so generic non-Lefevre runs stay terse.
Add docs/lefevre-readiness.md as the final sign-off document for
IB-WP-0016: what is wired (T01-T06 recap), an output policy table
(checked-in fixture sources vs disposable generated infospaces vs
archive targets), a seven-item reviewer checklist (duplicate entities,
relation endpoints, weak evidence, overgeneralization, anchor
coverage, unmapped sources, plan-vs-actual variance), a scale-up plan
from one-chapter to full-book, and the load-bearing risks still
outstanding (cross-chunk dedup, whole-run resume, adaptive routing
deferred to LLM-WP-0004 / IB-WP-0018, rate-table drift).
Closes IB-WP-0016 (Lefevre EPUB3 Infospace Readiness Pilot): T01-T07
all done; the workplan is set to status=done.
131 tests pass, 1 skipped (live OpenRouter smoke, correctly gated).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Add --chapter / --from-chapter / --to-chapter / --chunk selection flags
to generate init and generate from-source, plumb them into
init_generation_infospace via a new _filter_chunks_by_chapter helper,
and refuse to create an infospace when the filters reject every chunk
(InfospaceError "empty_chapter_selection"). The flags use the same
T03/T02 plumbing (chapter labels, roman numerals, chunk ids) so a
single-chapter selection is a one-flag command.
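The selection logic can be sketched as below. It assumes chunks are
dicts with id and chapter_number keys, that chapter labels have already
been resolved to numbers, and that the filters combine with AND; the
real _filter_chunks_by_chapter may differ on all three points.

```python
class InfospaceError(Exception):
    """Local stand-in for the project's error type."""


def filter_chunks_by_chapter(chunks, chapters=None, from_chapter=None,
                             to_chapter=None, chunk_ids=None):
    """Keep chunks matching every supplied filter; refuse an empty result."""
    chapters = set(chapters or [])
    chunk_ids = set(chunk_ids or [])
    selected = []
    for chunk in chunks:
        number = chunk["chapter_number"]
        if chapters and number not in chapters:
            continue
        if from_chapter is not None and number < from_chapter:
            continue
        if to_chapter is not None and number > to_chapter:
            continue
        if chunk_ids and chunk["id"] not in chunk_ids:
            continue
        selected.append(chunk)
    if not selected:
        # Fail before any infospace directory is created.
        raise InfospaceError("empty_chapter_selection")
    return selected
```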
OpenRouter run-record metadata (model, request_id, usage tokens,
retry_count, duration_seconds) already lands in
output/workflows/runs/*.yaml; this task just adds the smoke test that
proves it stays there, plus the parallel guarantee that the same
provider metadata reaches generated artifact provenance via
provenance.provider_metadata.
tests/test_openrouter_live.py covers:
- chapter-filter, from/to-chapter range, and empty-selection failure on
init (non-live, deterministic)
- CLI smoke through generate from-source with --chapter
- a pytest-skipped live OpenRouter one-chapter end-to-end gated by
OPENROUTER_API_KEY + INFOSPACE_BENCH_ENABLE_LIVE_OPENROUTER, with
INFOSPACE_BENCH_LIVE_MODEL override (default openai/gpt-4o-mini)
docs/generic-source-generator.md gains a "Live OpenRouter runs (handle
with care)" section that walks plan-before-run, single-chapter live
run, the budget/usage artifacts, and the checks a reviewer should run
before scaling to the full book.
129 tests pass, 1 skipped (the live smoke, correctly gated).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Check in a small Lefevre-shaped EPUB fixture as separate source files
under tests/fixtures/lefevre/sources/ (container.xml, OPF, nav, cover,
PG header, three roman-numeral chapters with page anchors,
transcriber notes, license, PG footer). The test helper assembles
these into an EPUB at test time so the inputs stay inspectable in git.
Fixture responses tuned to the trading-literature profile (T04) live
at tests/fixtures/lefevre/responses.yaml: trader / institution /
strategy categories on entities, strategy_outcome / actor_venue
relation types, and all four trading-tuned evaluation criteria.
Three tests cover the acceptance:
- end-to-end Python pipeline: stable chapter-NN source slugs, full
artifact tree (entities, relations, evaluations, metrics, history,
generation report), budget registry persisted, chapter_number
provenance round-trips through artifacts/index.yaml
- regression: PG boilerplate (cover, nav, header, notes, license,
footer) is excluded by default and only appears under
include_non_body=True
- CLI smoke through generate from-source --profile trading-literature
--fixture-responses ...
125 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The default archive include set already pulls output/ in wholesale, so
output/budget/ already lands inside the archive package with no code
change. Add a budget_summary block to ArchiveRecord.metadata so
catalog-level tools can see plans_count, runs_count, total_tokens,
total_cost_usd_known, total_cost_usd_estimated, and the
latest_snapshot_id without unpacking the archive. An infospace with no
budget data still archives cleanly with an empty metadata dict.
Closes IB-WP-0019 (Budget and Usage Registry): T01-T07 all done.
Three-layer design landed end-to-end — layer 1 (per-infospace
plans.yaml / usage.yaml / summary.yaml) and layer 3 (state-hub
record_token_event emission with failure isolation) live here; layer 2
(cross-application QualityLedger for adaptive routing) is parked in
llm-connect LLM-WP-0004, and infospace-bench IB-WP-0018 waits on it.
122 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
infospace-bench budget list <workspace> walks <workspace>/infospaces/*
and prints one row per infospace with slug, plans_count, runs_count,
total_tokens, total_cost_usd_known, total_cost_usd_estimated,
last_run_at, and latest_snapshot_id. infospace-bench budget show
<root> dumps the full plans/usage/summary structure for a single
infospace.
Missing budget directories are treated as zero rows rather than errors,
so the CLI is safe to run on partially-populated or fresh workspaces.
120 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Emit one record_token_event payload per completed generate run, derived
from the just-recorded usage rollup. tokens_in/out come from the
rollup, model defaults to the dominant model used (or "mixed" when
buckets disagree), agent="infospace-bench", ref_type="session", and
ref_id="<slug>/run-<run_index>". The note carries the infospace slug,
workspace, snapshot_id, and any known/estimated cost so the hub event
is self-describing.
Failure isolation: any exception from the HTTP poster (hub down,
timeout, 5xx) is caught, logged to stderr, and reported as
status=failed; the generate run still completes. INFOSPACE_BENCH_HUB_URL
overrides the default http://127.0.0.1:8000 base;
INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS skips emission entirely.
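Payload construction and failure isolation can be sketched together.
The bucket shape, function names, and the "sent" / "skipped" status
strings are assumptions (only status=failed is named above); the poster
is injected so the sketch needs no HTTP client, and the note field is
left out for brevity.

```python
import sys


def build_token_event(slug, run_index, buckets):
    """Build one event from the usage rollup, or None when nothing ran."""
    if not buckets:
        return None
    models = {bucket["model"] for bucket in buckets}
    return {
        "tokens_in": sum(bucket["prompt_tokens"] for bucket in buckets),
        "tokens_out": sum(bucket["completion_tokens"] for bucket in buckets),
        # One model across all buckets is reported as-is, otherwise "mixed".
        "model": models.pop() if len(models) == 1 else "mixed",
        "agent": "infospace-bench",
        "ref_type": "session",
        "ref_id": f"{slug}/run-{run_index}",
    }


def emit_token_event(payload, poster):
    """Post the event; never let a hub failure break the generate run."""
    if payload is None:
        return "skipped"
    try:
        poster(payload)
    except Exception as exc:  # hub down, timeout, 5xx, ...
        print(f"token event emission failed: {exc}", file=sys.stderr)
        return "failed"
    return "sent"
```

Returning a status instead of raising is what lets the caller record the
outcome and carry on with the report.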
Tests cover the happy path, the disable env var, poster failure, the
no-usage skip, multi-model coalescing to "mixed", and an end-to-end
run_generation against an unbindable hub port to prove the run survives
when the hub is unreachable. 116 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
After every generate run, compute variance between the executing plan
snapshot and the just-recorded usage rollup, persist it to
output/budget/summary.yaml (overwrite-on-run), and surface it both in
the generate status JSON (new budget_summary field) and as a "Plan
variance" line in reports/generation-summary.md.
Variance fields: calls / prompt_tokens / total_tokens each carry
{estimated, actual, delta, ratio}; cost_usd carries {estimated,
actual_known, actual_estimated_from_rates, actual_total, delta, ratio};
per_workflow rolls the per-bucket usage up to the same workflow_id grain
the plan reports. Runs whose snapshot_id cannot be resolved (no prior
plan, or pruned from the retention window) still record a variance row
with null comparison fields and snapshot_resolved=false, so the
consumer always sees a current summary.
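Each variance field follows the same four-key pattern. A minimal sketch,
assuming delta means actual minus estimated and ratio means actual
divided by estimated; the function name is hypothetical and the real
summary may define the two differently.

```python
def variance_row(estimated, actual):
    """Compare one estimated quantity with its recorded actual."""
    if estimated is None:
        # Unresolved snapshot: keep the actual, null out the comparison.
        return {"estimated": None, "actual": actual, "delta": None, "ratio": None}
    return {
        "estimated": estimated,
        "actual": actual,
        "delta": actual - estimated,
        # Guard against division by zero for empty plans.
        "ratio": actual / estimated if estimated else None,
    }
```

For example, a plan estimate of 100 calls against 125 actual calls
yields a delta of 25 and a ratio of 1.25.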
Reordered run_generation so usage and variance are written before the
generation report, allowing the report to embed the variance line on
the same pass.
110 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ship a starter model rate table at src/infospace_bench/model_rates.yaml
(prompt_per_1k / completion_per_1k for the OpenRouter models we have
actually touched: gpt-4o, gpt-4o-mini, gpt-4-turbo, claude 3.5 sonnet
and haiku, claude 3 opus, gemini 1.5 flash/pro, llama 3.1 70b) and a
load_rate_table() / estimate_cost_usd() pair that overlays an optional
<workspace>/model-rates.yaml on top of the bundled defaults.
generate run now passes a workspace-aware cost_resolver into
record_run_usage, so cost_usd_estimated lands on every usage bucket
whose model matches the table. Adapter-returned cost still wins
(cost_status="known"); rate-table cost is reported under
cost_status="estimated"; unmatched models are recorded as
cost_status="unknown" rather than silently zeroed. Rate-table file is
listed in pyproject.toml package-data so pip-installed users keep the
defaults.
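The precedence between adapter-reported cost, rate-table cost, and
unknown models can be sketched as follows. The function names and the
per-field overlay merge are assumptions; prompt_per_1k,
completion_per_1k, and the three cost_status values come from the text
above.

```python
def merge_rate_tables(bundled, overlay=None):
    """Overlay workspace rates on the bundled defaults, model by model."""
    merged = {model: dict(rates) for model, rates in bundled.items()}
    for model, rates in (overlay or {}).items():
        merged.setdefault(model, {}).update(rates)
    return merged


def resolve_cost(adapter_cost, model, prompt_tokens, completion_tokens, rates):
    """Return (cost_usd, cost_status) for one usage bucket."""
    if adapter_cost is not None:
        return adapter_cost, "known"  # adapter-returned cost always wins
    entry = rates.get(model)
    if entry is None:
        return None, "unknown"  # never silently zeroed
    cost = (
        prompt_tokens / 1000 * entry["prompt_per_1k"]
        + completion_tokens / 1000 * entry["completion_per_1k"]
    )
    return cost, "estimated"
```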
106 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every completed generate run now aggregates per-call adapter usage from
the workflow-engine run records into output/budget/usage.yaml. Per-call
data is bucketed by (workflow_id, stage_id, provider, model) with
running totals for calls, prompt_tokens, completion_tokens,
total_tokens, and cost_usd_known (sum of adapter-reported cost when the
provider returns it; usually zero today). A run-level entry captures
run_index, started_at, completed_at, duration_seconds, the executing
plan snapshot_id (resolved from the latest plans.yaml entry), and the
workflow-level run_id / stage_count summaries.
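The bucketing step is a keyed accumulation. A sketch assuming each call
record is a dict with the key fields plus optional token counts and
cost_usd; the record shape and function name are hypothetical.

```python
from collections import defaultdict


def rollup_usage(calls):
    """Aggregate per-call usage by (workflow_id, stage_id, provider, model)."""
    buckets = defaultdict(lambda: {
        "calls": 0,
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "total_tokens": 0,
        "cost_usd_known": 0.0,
    })
    for call in calls:
        key = (call["workflow_id"], call["stage_id"], call["provider"], call["model"])
        prompt = call.get("prompt_tokens", 0)
        completion = call.get("completion_tokens", 0)
        bucket = buckets[key]
        bucket["calls"] += 1
        bucket["prompt_tokens"] += prompt
        bucket["completion_tokens"] += completion
        bucket["total_tokens"] += prompt + completion
        # Providers that report no cost contribute zero to the known sum.
        bucket["cost_usd_known"] += call.get("cost_usd") or 0.0
    return dict(buckets)
```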
cost_usd_estimated is left as None for this task; T03 wires the
rate-table resolver so the same bucket gets a model-priced fallback
when the adapter does not return cost directly.
Fixture-mode runs are recorded with provider='fixture', zero tokens,
and cost_status='unknown' rather than silently skipped, so the rollup
honestly reflects which stages actually ran.
102 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every generate plan invocation now appends its compact summary to
output/budget/plans.yaml with a deterministic 12-char snapshot_id
hashed over the selection filters and the estimated call/token/cost
totals. Identical-fingerprint plans refresh the most recent entry's
recorded_at instead of stacking duplicates. Retention defaults to the
last 50 snapshots; older entries are pruned and counted in a top-level
pruned_count field.
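Snapshot identity, duplicate refresh, and retention fit in a few lines.
A sketch under assumptions: SHA-256 over canonical JSON is a guess at
the hash, the log is modelled as a dict with a snapshots list, and the
function names are hypothetical.

```python
import hashlib
import json


def snapshot_id(filters, totals):
    """Deterministic 12-char fingerprint of the plan's filters and totals."""
    payload = json.dumps({"filters": filters, "totals": totals},
                         sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record_plan(log, filters, totals, recorded_at, retention=50):
    """Append a plan snapshot, refreshing duplicates and pruning old entries."""
    sid = snapshot_id(filters, totals)
    snapshots = log.setdefault("snapshots", [])
    if snapshots and snapshots[-1]["snapshot_id"] == sid:
        # Same fingerprint as the latest entry: refresh instead of stacking.
        snapshots[-1]["recorded_at"] = recorded_at
    else:
        snapshots.append({"snapshot_id": sid, "recorded_at": recorded_at,
                          "filters": filters, "totals": totals})
    overflow = len(snapshots) - retention
    if overflow > 0:
        del snapshots[:overflow]
        log["pruned_count"] = log.get("pruned_count", 0) + overflow
    return sid
```

Sorting keys before hashing keeps the id stable regardless of the order
in which the filters were assembled.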
The summary now echoes its input filters (chapter_filter, chunk_filter,
from_chapter, to_chapter) so reviewers can read the snapshot without
cross-referencing the CLI invocation.
New module src/infospace_bench/budget.py owns layer 1 (per-infospace
recording) of the IB-WP-0019 three-layer design; layer 2 still belongs
in llm-connect LLM-WP-0004 and layer 3 in state-hub.
99 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ship a specialized profile for trading memoirs and market-structure
texts. The profile names eight entity categories (trader, market,
strategy, error, psychological_pattern, institution, instrument,
evidence_bearing_claim), five relation types (cause_effect,
lesson_evidence, risk_mitigation, actor_venue, strategy_outcome), and
four evaluation criteria (groundedness, lesson_clarity,
historical_context, overgeneralization_risk). Each is reflected in the
prompts and contracts so the LLM is steered toward operator-level
findings rather than biographical detail or moralising.
The generic profile remains the default. A 2-chapter Lefevre smoke run
with --profile trading-literature completes end-to-end with viable
metrics; 93 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Open a separate workplan for the budget/usage recording layer surfaced
by the T03 conversation. Three-layer design: layer 1 (per-infospace
budget log) and layer 3 (state-hub emission) live here; layer 2
(cross-application quality observations for adaptive routing) stays in
llm-connect LLM-WP-0004.
Seven tasks cover plan snapshot persistence, run usage rollup,
rate-table cost computation, plan-vs-actual variance, state-hub token
events with hub-down isolation, a workspace-level rollup CLI, and
archive integration so IB-WP-0014 packages carry their budget shape.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace generate plan's full-prompt dump with a compact summary that
reports selected-chunk counts, selected chapter numbers, per-workflow
call counts, prompt-word and token estimates, and a rough USD cost when
--cost-per-1k is supplied. Selection filters --chapter (label or number,
repeatable), --from-chapter / --to-chapter (numeric range), and --chunk
(repeatable id) shape the estimate. Budget caps --max-calls and
--cost-cap are reported as exceeds_* booleans so callers can fail fast
before run.
The old full per-workflow plan with prompts remains available behind
--full so deep inspection is opt-in instead of the default.
Whole-Lefevre estimate at default max_words=800: 146 chunks, 730 calls,
~518k prompt tokens, ~$155 at $0.30/1k. Chapters 3-5 only: 19 chunks,
95 calls, ~64k tokens. 87 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Blocked stub that names the dependency on llm-connect WP-0004 (adaptive
cost-quality routing). Activates once T01..T03 of that workplan land
and the QualityLedger / BaselineGrader / AdaptiveRoutingPolicy APIs are
stable.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Run a fixture-backed end-to-end smoke against the real Lefevre EPUB
(max-chunks 3) and capture the result in the validation note and the
workplan. The pipeline produces a complete infospace with stable
chapter-01-part-NNN source IDs, full chapter/book/anchor provenance on
every source artifact, viable metrics, and exact-title entity dedupe.
Refresh the workplan validation baseline to reflect the post-T01/T02
state, and add a remaining-gaps section that maps the open issues to the
right follow-on tasks: cost/scope controls and plan preview to T03, the
trading-literature profile to T04, chunk-level resume to T06, and a
richer generation-summary report (entity titles, chapter coverage,
anchor links) to T07.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolve chapter labels from EPUB nav entries (when present) and from the
first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N"
labels into numeric chapter indices, and generate stable IDs of the form
chapter-NN with -part-NNN suffix when a chapter exceeds max_words. The
chunker now operates on cleaned body text, distributes id="Page_*" page
anchors per part via inline markers extracted before splitting, and
supports a configurable overlap_words evidence window between adjacent
parts of the same chapter. Reclassify body sections whose chapter label
matches contents/transcriber-notes/license/colophon tokens so they leave
the body stream by default. Strip <head>...</head> from HTML body
extraction to stop the <title> tag from duplicating heading text in the
chunk markdown.
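Label parsing and ID formatting can be sketched as below. The function
names and the accepted label patterns are assumptions; only the
chapter-NN and -part-NNN ID shapes are taken from the text above.

```python
import re

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text):
    """Convert a roman numeral such as XIV to an integer."""
    total = 0
    previous = 0
    for char in reversed(text.upper()):
        value = ROMAN_VALUES[char]
        # A smaller value to the left of a larger one is subtracted.
        total += value if value >= previous else -value
        previous = value
    return total


def parse_chapter_number(label):
    """Parse 'Chapter 12', 'Chapter IV', or a bare 'XIV'; None otherwise."""
    text = label.strip()
    match = re.fullmatch(r"chapter\s+(\d+)", text, re.IGNORECASE)
    if match:
        return int(match.group(1))
    match = re.fullmatch(r"(?:chapter\s+)?([IVXLCDM]+)\.?", text, re.IGNORECASE)
    if match:
        return roman_to_int(match.group(1))
    return None


def chunk_id(chapter_number, part=None):
    """Stable ID: chapter-NN, plus -part-NNN when the chapter was split."""
    base = f"chapter-{chapter_number:02d}"
    return base if part is None else f"{base}-part-{part:03d}"
```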
Real Lefevre EPUB now detects all 24 roman-numeral chapters with stable
chapter-NN IDs, distributes Page_N anchors across multi-part chapters,
and reclassifies Contents and Transcriber's Notes out of body
(role histogram body=67, cover=1, header=1, toc=1, notes=1, footer=2).
82 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Parse META-INF/container.xml and the OPF package document, then iterate
documents in spine reading order instead of archive-name sort. Classify
each spine item (body, cover, nav, toc, header, footer, notes, license,
auxiliary) and exclude non-body sections by default; include_non_body=True
opts them back in for inspection. Capture OPF book metadata (title,
creator, language, subjects, rights, identifier, source_url, modified)
onto every chunk and propagate it through source artifact provenance.
Preserve the legacy zip-without-OPF fallback for malformed EPUBs.
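Spine-order iteration follows the standard EPUB lookup chain:
container.xml names the OPF, the OPF manifest maps ids to hrefs, and the
spine lists ids in reading order. A standard-library sketch; the
function name is hypothetical, and role classification and the
zip-without-OPF fallback are left out.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_documents(epub):
    """Return archive paths of spine documents in reading order."""
    with zipfile.ZipFile(epub) as archive:
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(archive.read(opf_path))
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]


# Tiny in-memory EPUB whose spine order differs from archive-name order.
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _zip:
    _zip.writestr(
        "META-INF/container.xml",
        '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf"/></rootfiles>'
        "</container>",
    )
    _zip.writestr(
        "OEBPS/content.opf",
        '<package xmlns="http://www.idpf.org/2007/opf"><manifest>'
        '<item id="a" href="a.xhtml"/><item id="b" href="z.xhtml"/>'
        '</manifest><spine><itemref idref="b"/><itemref idref="a"/></spine>'
        "</package>",
    )
_buffer.seek(0)
demo_order = spine_documents(_buffer)
```

In the demo the spine puts z.xhtml before a.xhtml, which an
archive-name sort would have reversed.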
Real Lefevre EPUB now yields 148 body chunks in spine order (was 155
mixed, archive-sorted) with cover=1, header=1, footer=4 detected and
dropped. 78 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two of yesterday's archives silently dropped infospace content: the default
include set was missing contracts/, so wealth-vsm-generation-pilot (16 files)
and wealth-vsm-legacy-slice (12 files) were preserved as 14 and 10 files
respectively. Fix the include set and make silent drops visible.
- DEFAULT_INCLUDE now: infospace.yaml, artifacts, contracts, schemas,
workflows, output, reports, exports
- ArchiveRecord gains skipped_top_level: top-level entries present in the
live root that are not in the include set, not excluded, and not
auto-hidden (hidden dotfiles, empty dirs, .store/index.yaml). Surfaces
in index.yaml only when non-empty.
- Re-archived the two affected pilots with correct counts. Prior records
remain in each index.yaml as history.
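The skipped_top_level computation can be sketched as a single directory
scan. The function name and signature are hypothetical; the include set
is the DEFAULT_INCLUDE listed above, and the auto-hide rules are
simplified to dotfiles and empty directories.

```python
import tempfile
from pathlib import Path

DEFAULT_INCLUDE = frozenset({
    "infospace.yaml", "artifacts", "contracts", "schemas",
    "workflows", "output", "reports", "exports",
})


def skipped_top_level(root, include=DEFAULT_INCLUDE, exclude=()):
    """List top-level entries that an archive would silently leave behind."""
    skipped = []
    for entry in sorted(Path(root).iterdir()):
        name = entry.name
        if name in include or name in exclude:
            continue
        if name.startswith("."):
            continue  # hidden dotfiles, including .store/
        if entry.is_dir() and not any(entry.iterdir()):
            continue  # empty directories carry no content
        skipped.append(name)
    return skipped


# Demo on a throwaway directory: only notes/ should be reported.
with tempfile.TemporaryDirectory() as _tmp:
    _root = Path(_tmp)
    for _name in ("artifacts", "empty", ".store", "notes"):
        (_root / _name).mkdir()
    (_root / "notes" / "todo.md").write_text("x")
    demo_skipped = skipped_top_level(_root)
```

Reporting the leftover names is what turns a silent drop like the
missing contracts/ directory into something a reviewer can see.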
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Capture the in-flight and legacy pilots as artifact-store packages, all at
retention class release-evidence (default expiry 2033-05-15).
- wealth-vsm-generation-pilot — pkg ed977a9c, 14 files (in flight, IB-WP-0013)
- wealth-vsm-legacy-slice — pkg 9d114264, 10 files (legacy parity ref)
- bootstrap-pilot — pkg fb31721e, 9 files (initial scaffold ref)
Each infospace now has its own self-contained .store/ (gitignored) and an
output/archives/index.yaml pointer log (tracked).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>