infospace-bench

Author	SHA1	Message	Date
tegwick	71177379bc	Add capability registry scaffold (REUSE-WP-0014-T05 B03)	2026-06-16 01:53:47 +02:00
tegwick	344fb4f9f9	Implement scope intent reconciliation	2026-06-05 01:13:22 +02:00
tegwick	799a1a370a	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-05: - update .custodian-brief.md for infospace-bench	2026-06-05 01:11:53 +02:00
tegwick	6018c4a2b3	Add scope intent reconciliation workplan	2026-06-05 01:00:43 +02:00
tegwick	9e1d964f4b	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-06-05: - update .custodian-brief.md for infospace-bench	2026-06-05 00:57:44 +02:00
tegwick	20996150d5	New infospace lefevre-reminiscences-of-a-stock-operator	2026-05-20 23:51:59 +02:00
tegwick	5bb4b40b86	Add security architecture pattern infospace	2026-05-19 07:12:07 +02:00
tegwick	3ca891de4a	fix: review findings from Lefevre live smoke Two small fixes informed by the 2026-05-18 live OpenRouter chapter-I run. 1. extract-entities templates (trading-literature and general-knowledge): the # Entity Title placeholder was interpreted by gpt-4o-mini as a literal heading prefix, so every entity came back as "# Entity Title: Bucket Shop" etc. The instruction now spells the placeholder out with concrete examples and an explicit "not the literal string" note, so smaller models hit the intended shape. 2. generate plan grows --model <id>. When supplied, the cost estimate pulls per-prompt and per-completion rates from the bundled model_rates.yaml instead of multiplying a single blended --cost-per-1k value across all tokens. The summary now also returns a separate estimated_completion_tokens field plus a cost_source tag ("rate_table:<model>" | "cost_per_1k_blended" | None). This is a stopgap. LLM-WP-0005 (proposed in llm-connect this round) will move the rate registry and token-shape problem classes upstream so consumers stop re-implementing them. The live smoke ran 28k prompt tokens / 7.5k completion / $0.0088 actual. With --model openai/gpt-4o-mini the plan estimate now lands at $0.0076 (within 14% of actual) versus the prior $8.40 estimate at --cost-per-1k 0.30. 181 tests pass, 2 skipped (both live OpenRouter smokes). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 04:30:33 +02:00
tegwick	9404831069	fix(lifecycle): _relative_to_root path doubling with relative workspace fix(evaluation_io): tolerate code-fenced frontmatter and varied score shapes from small LLMs Two bugs surfaced running the first live Lefevre chapter-I smoke against openai/gpt-4o-mini. 1. _relative_to_root doubled artifact paths when --workspace was a relative path (e.g. "."). The function received an already-CWD- relative path like infospaces/foo/artifacts/sources/x.md and re-prepended root, producing infospaces/foo/infospaces/foo/... stored in artifacts/index.yaml — which then failed file reads on the subsequent workflow stage. Fix: when raw is relative, try CWD-relative resolution first (matches root / sub call shapes); fall back to root-prefixing only when the CWD interpretation does not land under root (matches bare relative-subpath call shapes from rendered template outputs). 2. _read_frontmatter_markdown only accepted a literal ---/--- delimited block at the start of the file. gpt-4o-mini emitted three other shapes across the seven evaluation files this chapter produced: - ```yaml ... ``` fence (no --- delimiters) - ```markdown ... ``` outer fence wrapping --- frontmatter - scores as mapping ({groundedness: 4, ...}) instead of the canonical list of {name, value} dicts - scores as list of single-key dicts ([{groundedness: 4}, ...]) Fix: _extract_frontmatter_block tolerates ```yaml fences and strips ```markdown outer fences; _normalise_scores rewrites mapping- and single-key-dict shapes into the canonical form so ScoreEntry.from_dict keeps working. Both fixes are pure-Python; no API changes. 179 tests pass, 2 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-19 03:26:55 +02:00
tegwick	08ecefe309	started a new infospace on it-sec knowledge	2026-05-19 02:04:09 +02:00
tegwick	b0d67ae79e	IB-WP-0020-T05: shadow-mode CLI flags; close IB-WP-0020 Add --shadow-baseline <id> and --shadow-rate <float> opt-in flags to generate run, generate resume, and generate from-source. When --shadow-baseline names a candidate id from the routing config, build_routing_policy_from_config wraps every other candidate in an llm-connect ShadowingAdapter using that baseline plus a PairedGrader(ExactMatchJudge()) and the workspace-resolved QualityLedger. The baseline candidate itself is never wrapped — that would shadow it against itself. --shadow-rate defaults to 0.1 when --shadow-baseline is set; passing --shadow-rate without --shadow-baseline fails fast with shadow_rate_without_baseline. Setting --shadow-baseline without a ledger_path in the config fails with missing_routing_ledger_for_shadow so observations have a place to land before any call goes out. run_generation grew shadow_baseline + shadow_rate kwargs and _adapter_for("routing", ...) plumbs them into build_routing_policy_from_config. The wrapped ShadowingAdapter slots into the policy's prefer/fallback per task type via a (candidate_id, task_type) reverse lookup, and adapters_by_id on the adaptive policy gets the string-keyed entries. Five new tests cover: shadow_rate without baseline fails fast, shadow mode without a ledger fails fast, unknown shadow baseline id fails fast, structural assertion that ShadowingAdapter wraps non-baseline candidates and leaves the baseline raw, and a behavioural check that shadow_rate=1.0 calls the baseline on every call while shadow_rate=0.0 skips entirely. Test forces async_shadow=False so the call counter is deterministic. Closes IB-WP-0020: T01-T05 all done. Workplan status flips from active to finished. 179 tests pass, 2 skipped (both live OpenRouter smokes). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 23:30:36 +02:00
tegwick	debd2b8e69	IB-WP-0020-T04: example routing config + live routing smoke examples/routing/trading-literature.yaml is the checked-in starting config for a Lefevre-style run. It applies the IB-WP-0018 task-type taxonomy: cheap candidates for summary + evaluation, smart candidates for entity + relation extraction, and a separate baseline rule wiring claude_code for a follow-on T05 ShadowingAdapter step. Workspace- relative ledger_path keeps adaptive observations with the workspace. tests/test_routing_config.py gains a regression test that asserts the shipped example parses cleanly, every stage in stage_to_task_type maps to a declared task type, and the baseline candidate uses the claude_code provider — so the example will not bit-rot silently. tests/test_openrouter_live.py gains test_provider_routing_one_chapter_live_smoke gated on the same INFOSPACE_BENCH_ENABLE_LIVE_OPENROUTER + OPENROUTER_API_KEY opt-in as the existing static smoke. It builds a one-candidate routing config, runs a single chapter through --provider routing, and asserts the per-stage adapter-choices report section names the routed model and the routed artifacts carry adapter_id provenance. docs/generic-source-generator.md gains a "Live runs with --provider routing" subsection that walks through the one-command routed run, explains the --quality-floor override, and points at the parallel live smoke test. 174 tests pass, 2 skipped (both live smokes, correctly gated). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 22:19:54 +02:00
tegwick	d3562454d7	IB-WP-0020-T03: routing CLI flags Add --provider routing, --routing-config <yaml>, and --quality-floor <float> to generate run, generate resume, and generate from-source. The CLI flag wiring constructs a RoutingAssistedGenerationAdapter from the parsed config, with the workspace handed in so any ledger_path in the config resolves relative to it. --quality-floor overrides the config-level default_quality_floor for a single invocation. run_generation gains routing_config + quality_floor kwargs and _adapter_for gains a "routing" branch. Missing --routing-config with --provider routing fails fast with InfospaceError("missing_routing_config"); missing API key for any candidate fails fast with InfospaceError("missing_routing_api_key"). Two small bug fixes surfaced while writing T03: - routing._identify_adapter now also reads ``_model`` from llm-connect adapters (those adapters expose the model id only through this private attribute), so the per-stage adapter-choice line shows the model id rather than just the class name. - budget.TOKEN_EVENTS_PATH corrected from /state/token-events to the state-hub HTTP endpoint /token-events/ that actually exists; the failure-isolation in emit_token_event already kept the prior typo from breaking runs, but the hub never saw the events. Five new tests cover: _adapter_for refusal on missing config, _adapter_for happy path, run_generation end-to-end through routing with a stubbed OpenRouterAdapter.execute_prompt (no network), workspace-relative ledger resolution, and a CLI subprocess smoke asserting fast-fail on missing API key. 173 tests pass, 1 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 22:08:51 +02:00
tegwick	82468c2165	IB-WP-0020-T02: routing config loader build_routing_policy_from_config(config, *, workspace=None, env=None, adapter_factory=None) materialises a parsed RoutingConfig into a live llm-connect routing policy: - Static RoutingPolicy when the config has no adaptive signals; one RoutingRule per task type, prefer = first candidate, fallback = second candidate (when present), max_cost_per_1k pulled from the preferred candidate. - AdaptiveRoutingPolicy when default_quality_floor, any per-task quality_floor, or ledger_path is set. ledger_path resolves relative to the supplied workspace; parent directory is created so the ledger writes never fail on first call. - API-key resolution from env (default os.environ) using the per-provider DEFAULT_API_KEY_ENV map; candidate.api_key_env overrides the default. Missing key raises InfospaceError("missing_routing_api_key") before any provider constructor runs. - claude_code candidates need no API key (shells out to the local CLI). - adapter_factory hook lets tests inject a sentinel-returning factory so policy construction stays network- and llm-adapter-free. Eight new tests cover: static-policy default, adaptive selection via ledger_path, adaptive selection via quality_floor, multi-candidate fallback rule, real-factory smoke (OpenRouterAdapter constructed with env API key), missing-key fast-fail, claude_code zero-key path, and custom api_key_env override. 168 tests pass, 1 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 19:58:15 +02:00
tegwick	c11a942bb7	IB-WP-0020-T01: routing config schema and parser Add a small YAML routing config schema (schema_version 1) and a parser-only loader at src/infospace_bench/routing_config.py. The loader validates the declarative shape — task_types with candidates, optional per-task quality_floor, optional default_quality_floor, optional ledger_path, optional stage_to_task_type override map — and refuses bad shapes before any network or workspace work happens. Supported provider names: openrouter, claude_code, openai, gemini. Unknown providers, missing required candidate fields, out-of-range quality floors, negative max_cost_per_1k, duplicate candidate ids within a task type, and non-mapping stage_to_task_type all raise focused InfospaceError codes that callers can pattern-match. docs/routing-config.md documents the schema with two annotated examples (OpenRouter-only two-tier, and adaptive with a ClaudeCode baseline) plus the full "what fails fast" list. 16 parser tests cover happy-path round-trip, file load, missing file, malformed YAML, and every validation surface (wrong/missing schema version, empty task_types, empty candidates, missing required fields, unsupported provider, negative cost, out-of-range quality_floor, duplicate ids, non-mapping stage_map, non-string ledger_path). T02 will turn a RoutingConfig into a live llm-connect RoutingPolicy / AdaptiveRoutingPolicy with constructed LLMAdapter instances. 160 tests pass, 1 skipped. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 18:09:28 +02:00
tegwick	706ace3661	Refresh agent instruction files	2026-05-18 16:55:43 +02:00
tegwick	a95322051f	IB-WP-0020: provider routing CLI workplan (todo) Open a workplan that turns the IB-WP-0018 RoutingAssistedGenerationAdapter bridge into a first-class CLI option. Adds --provider routing, a YAML routing config schema, --quality-floor and --shadow-rate / --shadow-baseline opt-in flags so a real multi-chapter Lefevre live run can use adaptive cost-quality routing without writing any Python. Workstream registered with state-hub (172bb082-610a-477b-b5e0-26c9f4bdfd95) with five tasks: - T01 routing config file schema (medium) - T02 routing config loader (high) - T03 --provider routing + --routing-config + --quality-floor CLI flags (high) - T04 example config + optional live routing smoke test (medium) - T05 --shadow-rate / --shadow-baseline opt-in flags (medium) Depends on IB-WP-0018 (already done) and LLM-WP-0004 (already done in ~/llm-connect). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 13:50:26 +02:00
tegwick	f818acfc62	IB-WP-0018-T03+T04: shadow sampling + report/CLI surfacing; close IB-WP-0018 T03 — wrap_with_shadow_sampling() helper in routing.py: builds a llm-connect ShadowingAdapter around any candidate LLMAdapter with a caller-supplied baseline, grader, and QualityLedger. async_shadow=True by default so production load is not doubled; on_shadow_error escape hatch keeps caller logs informed when a baseline outage swallows the shadow path. The returned adapter is still an LLMAdapter so it slots into a RoutingPolicy rule without further code change. T04 — generation report enrichment plus a small CLI helper: - _collect_adapter_choices walks artifact provenance, groups by (stage_id, adapter_id), and surfaces calls + prompt/completion tokens per (stage, adapter) pair in a new ## Per-stage adapter choices section. Runs that did not go through the bridge have no provider_metadata.adapter_id and emit an empty list, so fixture-only reports stay terse. - summarise_quality_ledger() rolls a llm-connect QualityLedger up by (task_type, adapter_id) with mean quality, mean cost, observations, and cumulative tokens. - infospace-bench routing ledger <path> CLI prints the rollup as JSON. Five new tests cover shadow happy-path, shadow failure isolation, ledger rollup, the routing CLI, and the report's adapter-choice aggregation. Closes IB-WP-0018: T01-T05 are all done and the workplan status flips from blocked to done now that LLM-WP-0004's primitives have shipped. 144 tests pass, 1 skipped (the OpenRouter live smoke, gated as before). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 11:52:05 +02:00
tegwick	0a83e908ce	IB-WP-0018-T01+T02+T05: routing bridge to llm-connect T01 — task-type taxonomy. docs/routing-task-types.md names the five generation stages as the default identity-mapped task types (summarize-source, extract-entities, extract-relations, evaluate-entity, synthesize-report) and records the recommended quality floors per stage. The taxonomy explicitly does not decide which adapter ships per task type, where the ledger lives, or what a quality score means — those stay with the caller per the LLM-WP-0004 scope guardrail. T02 — RoutingAssistedGenerationAdapter bridge in src/infospace_bench/routing.py. Wraps any llm-connect RoutingPolicy or AdaptiveRoutingPolicy as an infospace-bench AssistedGenerationAdapter: maps stage_id -> task_type (overridable), resolves an LLMAdapter, delegates execute_prompt with a configurable RunConfig, and surfaces the resolved adapter id, task type, model, usage, and finish_reason back on AssistedGenerationResult.metadata. Provider tag stays back-compatible with the strings already used in run records and the budget rollup (openrouter / claude_code / openai / gemini / mock / routing). T05 — eight tests in tests/test_routing_adapter.py cover: static-policy per-stage resolution, stage_to_task_type overrides, default-mapping completeness, fall-through for unmapped stage ids, the adaptive path selecting the cheaper qualifying adapter when a quality_floor is set, adaptive policy falling back to static when no floor is set, response metadata round-trip with provider tagging, and estimated_cost_per_1k pass-through. Adds llm-connect as a path dependency on pyproject.toml and to the pytest pythonpath. Static OpenRouter and fixture paths are unchanged; this commit only adds the option of routing. 139 tests pass, 1 skipped (the OpenRouter live smoke, gated as before). T03 (shadow-mode integration) and T04 (CLI + per-stage chosen-adapter in the generation report) follow next. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 11:33:58 +02:00
tegwick	1d62dffae9	IB-WP-0016-T07: review report and output policy; close IB-WP-0016 Enrich reports/generation-summary.md with the review-oriented sections that the 2026-05-17 smoke run flagged as missing: ## Chapter coverage (per-chapter source/entity/relation/anchor counts), ## Entities (the deduped title list), ## Unmapped source chunks (sources with no downstream generated artifact), and ## Page anchors (total plus deterministic sample). Sections are conditional on data being present so generic non-Lefevre runs stay terse. Add docs/lefevre-readiness.md as the final sign-off document for IB-WP-0016: what is wired (T01-T06 recap), an output policy table (checked-in fixture sources vs disposable generated infospaces vs archive targets), a seven-item reviewer checklist (duplicate entities, relation endpoints, weak evidence, overgeneralization, anchor coverage, unmapped sources, plan-vs-actual variance), a scale-up plan from one-chapter to full-book, and the load-bearing risks still outstanding (cross-chunk dedup, whole-run resume, adaptive routing deferred to LLM-WP-0004 / IB-WP-0018, rate-table drift). Closes IB-WP-0016 (Lefevre EPUB3 Infospace Readiness Pilot): T01-T07 all done; the workplan is set to status=done. 131 tests pass, 1 skipped (live OpenRouter smoke, correctly gated). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-18 01:22:41 +02:00
tegwick	ab23c5873e	IB-WP-0016-T06: OpenRouter live-run guardrails Add --chapter / --from-chapter / --to-chapter / --chunk selection flags to generate init and generate from-source, plumb them into init_generation_infospace via a new _filter_chunks_by_chapter helper, and refuse to create an infospace when the filters reject every chunk (InfospaceError "empty_chapter_selection"). The flags use the same T03/T02 plumbing (chapter labels, roman numerals, chunk ids) so a single-chapter selection is a one-flag command. OpenRouter run-record metadata (model, request_id, usage tokens, retry_count, duration_seconds) already lands in output/workflows/runs/*.yaml; this task just adds the smoke test that proves it stays there, plus the parallel guarantee that the same provider metadata reaches generated artifact provenance via provenance.provider_metadata. tests/test_openrouter_live.py covers: - chapter-filter, from/to-chapter range, and empty-selection failure on init (non-live, deterministic) - CLI smoke through generate from-source with --chapter - a pytest-skipped live OpenRouter one-chapter end-to-end gated by OPENROUTER_API_KEY + INFOSPACE_BENCH_ENABLE_LIVE_OPENROUTER, with INFOSPACE_BENCH_LIVE_MODEL override (default openai/gpt-4o-mini) docs/generic-source-generator.md gains a "Live OpenRouter runs (handle with care)" section that walks plan-before-run, single-chapter live run, the budget/usage artifacts, and the checks a reviewer should run before scaling to the full book. 129 tests pass, 1 skipped (the live smoke, correctly gated). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 23:04:19 +02:00
tegwick	348deca9f2	IB-WP-0016-T05: deterministic Lefevre acceptance fixture Check in a small Lefevre-shaped EPUB fixture as separate source files under tests/fixtures/lefevre/sources/ (container.xml, OPF, nav, cover, PG header, three roman-numeral chapters with page anchors, transcriber notes, license, PG footer). The test helper assembles these into an EPUB at test time so the inputs stay inspectable in git. Fixture responses tuned to the trading-literature profile (T04) live at tests/fixtures/lefevre/responses.yaml: trader / institution / strategy categories on entities, strategy_outcome / actor_venue relation types, and all four trading-tuned evaluation criteria. Three tests cover the acceptance: - end-to-end Python pipeline: stable chapter-NN source slugs, full artifact tree (entities, relations, evaluations, metrics, history, generation report), budget registry persisted, chapter_number provenance round-trips through artifacts/index.yaml - regression: PG boilerplate (cover, nav, header, notes, license, footer) is excluded by default and only appears under include_non_body=True - CLI smoke through generate from-source --profile trading-literature --fixture-responses ... 125 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 22:31:17 +02:00
tegwick	bb70b2f4b9	IB-WP-0019-T07: archive integration; close IB-WP-0019 The default archive include set already pulls output/ in wholesale, so output/budget/ already lands inside the archive package with no code change. Add a budget_summary block to ArchiveRecord.metadata so catalog-level tools can see plans_count, runs_count, total_tokens, total_cost_usd_known, total_cost_usd_estimated, and the latest_snapshot_id without unpacking the archive. An infospace with no budget data still archives cleanly with an empty metadata dict. Closes IB-WP-0019 (Budget and Usage Registry): T01-T07 all done. Three-layer design landed end-to-end — layer 1 (per-infospace plans.yaml / usage.yaml / summary.yaml) and layer 3 (state-hub record_token_event emission with failure isolation) live here; layer 2 (cross-application QualityLedger for adaptive routing) is parked in llm-connect LLM-WP-0004 and infospace-bench IB-WP-0018 awaits it. 122 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 21:53:28 +02:00
tegwick	816a95b3ef	IB-WP-0019-T06: workspace budget CLI infospace-bench budget list <workspace> walks <workspace>/infospaces/* and prints one row per infospace with slug, plans_count, runs_count, total_tokens, total_cost_usd_known, total_cost_usd_estimated, last_run_at, and latest_snapshot_id. infospace-bench budget show <root> dumps the full plans/usage/summary structure for a single infospace. Missing budget directories are treated as zero rows rather than errors, so the CLI is safe to run on partially-populated or fresh workspaces. 120 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 20:44:40 +02:00
tegwick	110c78b9ad	IB-WP-0019-T05: state-hub token-event emission with failure isolation Emit one record_token_event payload per completed generate run, derived from the just-recorded usage rollup. tokens_in/out come from the rollup, model defaults to the dominant model used (or "mixed" when buckets disagree), agent="infospace-bench", ref_type="session", and ref_id="<slug>/run-<run_index>". The note carries the infospace slug, workspace, snapshot_id, and any known/estimated cost so the hub event is self-describing. Failure isolation: any exception from the HTTP poster (hub down, timeout, 5xx) is caught, logged to stderr, and reported as status=failed; the generate run still completes. INFOSPACE_BENCH_HUB_URL overrides the default http://127.0.0.1:8000 base; INFOSPACE_BENCH_DISABLE_HUB_TOKEN_EVENTS skips emission entirely. Tests cover the happy path, the disable env var, poster failure, the no-usage skip, multi-model coalescing to "mixed", and an end-to-end run_generation against an unbindable hub port to prove the run survives when the hub is unreachable. 116 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 20:33:29 +02:00
tegwick	d4c9c56f5c	IB-WP-0019-T04: plan-vs-actual variance and surfacing After every generate run, compute variance between the executing plan snapshot and the just-recorded usage rollup, persist it to output/budget/summary.yaml (overwrite-on-run), and surface it both in the generate status JSON (new budget_summary field) and as a "Plan variance" line in reports/generation-summary.md. Variance fields: calls / prompt_tokens / total_tokens each carry {estimated, actual, delta, ratio}; cost_usd carries {estimated, actual_known, actual_estimated_from_rates, actual_total, delta, ratio}; per_workflow rolls the per-bucket usage up to the same workflow_id grain the plan reports. Runs whose snapshot_id cannot be resolved (no prior plan, or pruned from the retention window) still record a variance row with null comparison fields and snapshot_resolved=false, so the consumer always sees a current summary. Reordered run_generation so usage and variance are written before the generation report, allowing the report to embed the variance line on the same pass. 110 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 20:06:19 +02:00
tegwick	a4dde53fc3	IB-WP-0019-T03: rate-table cost computation Ship a starter model rate table at src/infospace_bench/model_rates.yaml (prompt_per_1k / completion_per_1k for the OpenRouter models we have actually touched: gpt-4o, gpt-4o-mini, gpt-4-turbo, claude 3.5 sonnet and haiku, claude 3 opus, gemini 1.5 flash/pro, llama 3.1 70b) and a load_rate_table() / estimate_cost_usd() pair that overlays an optional <workspace>/model-rates.yaml on top of the bundled defaults. generate run now passes a workspace-aware cost_resolver into record_run_usage, so cost_usd_estimated lands on every usage bucket whose model matches the table. Adapter-returned cost still wins (cost_status="known"); rate-table cost is reported under cost_status="estimated"; unmatched models are recorded as cost_status="unknown" rather than silently zeroed. Rate-table file is listed in pyproject.toml package-data so pip-installed users keep the defaults. 106 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 19:54:30 +02:00
tegwick	678508226a	IB-WP-0019-T02: usage rollup from run records Every completed generate run now aggregates per-call adapter usage from the workflow-engine run records into output/budget/usage.yaml. Per-call data is bucketed by (workflow_id, stage_id, provider, model) with running totals for calls, prompt_tokens, completion_tokens, total_tokens, and cost_usd_known (sum of adapter-reported cost when the provider returns it; usually zero today). A run-level entry captures run_index, started_at, completed_at, duration_seconds, the executing plan snapshot_id (resolved from the latest plans.yaml entry), and the workflow-level run_id / stage_count summaries. cost_usd_estimated is left as None for this task; T03 wires the rate-table resolver so the same bucket gets a model-priced fallback when the adapter does not return cost directly. Fixture-mode runs are recorded with provider='fixture', zero tokens, and cost_status='unknown' rather than silently skipped, so the rollup honestly reflects which stages actually ran. 102 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 19:46:40 +02:00
tegwick	37bbaf9fab	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 19:45:00 +02:00
tegwick	182f7011bb	IB-WP-0019-T01: plan snapshot persistence Every generate plan invocation now appends its compact summary to output/budget/plans.yaml with a deterministic 12-char snapshot_id hashed over the selection filters and the estimated call/token/cost totals. Identical-fingerprint plans refresh the most recent entry's recorded_at instead of stacking duplicates. Retention defaults to the last 50 snapshots; older entries are pruned and counted on a top-level pruned_count field. The summary now echoes its input filters (chapter_filter, chunk_filter, from_chapter, to_chapter) so reviewers can read the snapshot without cross-referencing the CLI invocation. New module src/infospace_bench/budget.py owns layer 1 (per-infospace recording) of the IB-WP-0019 three-layer design; layer 2 still belongs in llm-connect LLM-WP-0004 and layer 3 in state-hub. 99 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 19:19:35 +02:00
tegwick	df87e212a2	IB-WP-0016-T04: trading-literature profile Ship a specialized profile for trading memoirs and market-structure texts. The profile names eight entity categories (trader, market, strategy, error, psychological_pattern, institution, instrument, evidence_bearing_claim), five relation types (cause_effect, lesson_evidence, risk_mitigation, actor_venue, strategy_outcome), and four evaluation criteria (groundedness, lesson_clarity, historical_context, overgeneralization_risk). Each is reflected in the prompts and contracts so the LLM is steered toward operator-level findings rather than biographical detail or moralising. The generic profile remains the default. A 2-chapter Lefevre smoke run with --profile trading-literature completes end-to-end with viable metrics; 93 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 18:59:45 +02:00
tegwick	ba8c0a100c	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 18:54:23 +02:00
tegwick	d2deebe081	IB-WP-0019: budget and usage registry workplan (todo) Open a separate workplan for the budget/usage recording layer surfaced by the T03 conversation. Three-layer design: layer 1 (per-infospace budget log) and layer 3 (state-hub emission) live here; layer 2 (cross-application quality observations for adaptive routing) stays in llm-connect LLM-WP-0004. Seven tasks cover plan snapshot persistence, run usage rollup, rate-table cost computation, plan-vs-actual variance, state-hub token events with hub-down isolation, a workspace-level rollup CLI, and archive integration so IB-WP-0014 packages carry their budget shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 18:30:49 +02:00
tegwick	74c52c6239	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 18:18:51 +02:00
tegwick	13f9c1895c	IB-WP-0016-T03: scale-aware planning Replace generate plan's full-prompt dump with a compact summary that reports selected-chunk counts, selected chapter numbers, per-workflow call counts, prompt-word and token estimates, and a rough USD cost when --cost-per-1k is supplied. Selection filters --chapter (label or number, repeatable), --from-chapter / --to-chapter (numeric range), and --chunk (repeatable id) shape the estimate. Budget caps --max-calls and --cost-cap are reported as exceeds_* booleans so callers can fail fast before run. The old full per-workflow plan with prompts remains available behind --full so deep inspection is opt-in instead of the default. Whole-Lefevre estimate at default max_words=800: 146 chunks, 730 calls, ~518k prompt tokens, ~$155 at $0.30/1k. Chapters 3-5 only: 19 chunks, 95 calls, ~64k tokens. 87 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 18:18:09 +02:00
tegwick	f8289699e7	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 17:42:49 +02:00
tegwick	c38eca0186	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 17:27:27 +02:00
tegwick	a4fac2943d	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - IB-WP-0016-T03: todo → in_progress	2026-05-17 17:27:26 +02:00
tegwick	cb37a7f408	IB-WP-0018: stub workplan for adaptive LLM routing consumer wiring Blocked stub that names the dependency on llm-connect WP-0004 (adaptive cost-quality routing). Activates once T01..T03 of that workplan land and the QualityLedger / BaselineGrader / AdaptiveRoutingPolicy APIs are stable. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 17:26:36 +02:00
tegwick	001b64d67b	IB-WP-0016: refresh validation baseline after T01/T02 smoke run Run a fixture-backed end-to-end smoke against the real Lefevre EPUB (max-chunks 3) and capture the result in the validation note and the workplan. The pipeline produces a complete infospace with stable chapter-01-part-NNN source IDs, full chapter/book/anchor provenance on every source artifact, viable metrics, and exact-title entity dedupe. Refresh the workplan validation baseline to reflect the post-T01/T02 state, and add a remaining-gaps section that maps the open issues to the right follow-on tasks: cost/scope controls and plan preview to T03, the trading-literature profile to T04, chunk-level resume to T06, and a richer generation-summary report (entity titles, chapter coverage, anchor links) to T07. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 16:13:39 +02:00
tegwick	745edc8b81	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 16:00:10 +02:00
tegwick	b9173b6569	IB-WP-0016-T02: chapter-aware chunking and stable IDs Resolve chapter labels from EPUB nav entries (when present) and from the first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N" labels into numeric chapter indices, and generate stable IDs of the form chapter-NN with -part-NNN suffix when a chapter exceeds max_words. The chunker now operates on cleaned body text, distributes id="Page_*" page anchors per part via inline markers extracted before splitting, and supports a configurable overlap_words evidence window between adjacent parts of the same chapter. Reclassify body sections whose chapter label matches contents/transcriber-notes/license/colophon tokens so they leave the body stream by default. Strip <head>...</head> from HTML body extraction to stop the <title> tag from duplicating heading text in the chunk markdown. Real Lefevre EPUB now detects all 24 roman-numeral chapters with stable chapter-NN IDs, distributes Page_N anchors across multi-part chapters, and reclassifies Contents and Transcriber's Notes out of body (role histogram body=67, cover=1, header=1, toc=1, notes=1, footer=2). 82 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 15:52:47 +02:00
tegwick	ef19aa6de7	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 13:55:50 +02:00
tegwick	a696f75280	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - IB-WP-0016-T02: todo → in_progress	2026-05-17 13:55:49 +02:00
tegwick	5b6a63fb7a	IB-WP-0016-T01: spine-aware EPUB3 intake Parse META-INF/container.xml and the OPF package document, then iterate documents in spine reading order instead of archive-name sort. Classify each spine item (body, cover, nav, toc, header, footer, notes, license, auxiliary) and exclude non-body sections by default; include_non_body=True opts them back in for inspection. Capture OPF book metadata (title, creator, language, subjects, rights, identifier, source_url, modified) onto every chunk and propagate it through source artifact provenance. Preserve the legacy zip-without-OPF fallback for malformed EPUBs. Real Lefevre EPUB now yields 148 body chunks in spine order (was 155 mixed, archive-sorted) with cover=1, header=1, footer=4 detected and dropped. 78 tests pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 13:52:24 +02:00
tegwick	ead2f335f3	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 13:38:30 +02:00
tegwick	e05cdab042	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - update .custodian-brief.md for infospace-bench	2026-05-17 12:28:39 +02:00
tegwick	2bcd9396f8	chore(consistency): sync task status from DB [auto] Updated by fix-consistency on 2026-05-17: - IB-WP-0016-T01: todo → in_progress	2026-05-17 12:28:38 +02:00
tegwick	37c28d2298	archive: include contracts/, schemas/; report skipped top-level dirs Two of yesterday's archives silently dropped infospace content: the default include set was missing contracts/, so wealth-vsm-generation-pilot (16 files) and wealth-vsm-legacy-slice (12 files) were preserved as 14 and 10 files respectively. Fix the include set and make silent drops visible. - DEFAULT_INCLUDE now: infospace.yaml, artifacts, contracts, schemas, workflows, output, reports, exports - ArchiveRecord gains skipped_top_level: top-level entries present in the live root that are not in the include set, not excluded, and not auto-hidden (hidden dotfiles, empty dirs, .store/index.yaml). Surfaces in index.yaml only when non-empty. - Re-archived the two affected pilots with correct counts. Prior records remain in each index.yaml as history. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 12:21:19 +02:00
tegwick	523db6d341	IB-WP-0014: archive remaining three infospaces Capture the in-flight and legacy pilots as artifact-store packages, all at retention class release-evidence (default expiry 2033-05-15). - wealth-vsm-generation-pilot — pkg ed977a9c, 14 files (in flight, IB-WP-0013) - wealth-vsm-legacy-slice — pkg 9d114264, 10 files (legacy parity ref) - bootstrap-pilot — pkg fb31721e, 9 files (initial scaffold ref) Each infospace now has its own self-contained .store/ (gitignored) and an output/archives/index.yaml pointer log (tracked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-17 12:14:54 +02:00


109 Commits