Every generate plan invocation now appends its compact summary to
output/budget/plans.yaml with a deterministic 12-char snapshot_id
hashed over the selection filters and the estimated call/token/cost
totals. Identical-fingerprint plans refresh the most recent entry's
recorded_at instead of stacking duplicates. Retention defaults to the
last 50 snapshots; older entries are pruned and counted in a top-level
pruned_count field.
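A minimal sketch of the fingerprint, dedupe, and retention behaviour described above. SHA-256 over canonical JSON, the in-memory ledger dict, and the names snapshot_id(), record_plan(), and RETENTION are illustrative assumptions, not the actual contents of budget.py.

```python
import hashlib
import json

RETENTION = 50  # default retention stated in this commit message


def snapshot_id(filters: dict, totals: dict) -> str:
    # Canonical JSON (sorted keys) keeps the hash stable across runs.
    payload = json.dumps({"filters": filters, "totals": totals}, sort_keys=True)
    # Assumption: SHA-256 truncated to 12 hex characters.
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def record_plan(ledger: dict, filters: dict, totals: dict, recorded_at: str) -> dict:
    sid = snapshot_id(filters, totals)
    snapshots = ledger.setdefault("snapshots", [])
    # Refresh the most recent entry with the same fingerprint, if there is one.
    for entry in reversed(snapshots):
        if entry["snapshot_id"] == sid:
            entry["recorded_at"] = recorded_at
            return ledger
    snapshots.append(
        {
            "snapshot_id": sid,
            "recorded_at": recorded_at,
            "filters": filters,
            "totals": totals,
        }
    )
    # Keep only the newest RETENTION entries and count what was dropped.
    overflow = len(snapshots) - RETENTION
    if overflow > 0:
        del snapshots[:overflow]
        ledger["pruned_count"] = ledger.get("pruned_count", 0) + overflow
    return ledger
```

Hashing a sorted-key serialization is what makes the id deterministic: two plans with the same filters and totals produce the same 12 characters regardless of dict ordering.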
The summary now echoes its input filters (chapter_filter, chunk_filter,
from_chapter, to_chapter) so reviewers can read the snapshot without
cross-referencing the CLI invocation.
New module src/infospace_bench/budget.py owns layer 1 (per-infospace
recording) of the IB-WP-0019 three-layer design; layer 2 still belongs
in llm-connect (LLM-WP-0004) and layer 3 in state-hub.
99 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Ship a specialized profile for trading memoirs and market-structure
texts. The profile names eight entity categories (trader, market,
strategy, error, psychological_pattern, institution, instrument,
evidence_bearing_claim), five relation types (cause_effect,
lesson_evidence, risk_mitigation, actor_venue, strategy_outcome), and
four evaluation criteria (groundedness, lesson_clarity,
historical_context, overgeneralization_risk). Each is reflected in the
prompts and contracts so the LLM is steered toward operator-level
findings rather than biographical detail or moralising.
The generic profile remains the default. A 2-chapter Lefevre smoke run
with --profile trading-literature completes end-to-end with viable
metrics; 93 tests pass.
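For orientation, the vocabulary named above can be written out as a small data structure. The Profile dataclass and its field names are hypothetical; only the category, relation, and criterion names come from this commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    # Hypothetical container; the real profile format may differ.
    name: str
    entity_categories: tuple[str, ...]
    relation_types: tuple[str, ...]
    evaluation_criteria: tuple[str, ...]


TRADING_LITERATURE = Profile(
    name="trading-literature",
    entity_categories=(
        "trader",
        "market",
        "strategy",
        "error",
        "psychological_pattern",
        "institution",
        "instrument",
        "evidence_bearing_claim",
    ),
    relation_types=(
        "cause_effect",
        "lesson_evidence",
        "risk_mitigation",
        "actor_venue",
        "strategy_outcome",
    ),
    evaluation_criteria=(
        "groundedness",
        "lesson_clarity",
        "historical_context",
        "overgeneralization_risk",
    ),
)
```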
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace generate plan's full-prompt dump with a compact summary that
reports selected-chunk counts, selected chapter numbers, per-workflow
call counts, prompt-word and token estimates, and a rough USD cost when
--cost-per-1k is supplied. Selection filters --chapter (label or number,
repeatable), --from-chapter / --to-chapter (numeric range), and --chunk
(repeatable id) shape the estimate. Budget caps --max-calls and
--cost-cap are reported as exceeds_* booleans so callers can fail fast
before starting a run.
The old full per-workflow plan with prompts remains available behind
--full so deep inspection is opt-in instead of the default.
Whole-Lefevre estimate at default max_words=800: 146 chunks, 730 calls,
~518k prompt tokens, ~$155 at $0.30 per 1k tokens. Chapters 3-5 only:
19 chunks, 95 calls, ~64k tokens. 87 tests pass.
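The arithmetic behind these figures can be sketched as follows. The function summarize(), its parameters, and the exceeds_max_calls / exceeds_cost_cap key names are assumptions for illustration. The figure of five calls per chunk is inferred from 730/146 and 95/19 rather than stated by the tool.

```python
def summarize(
    chunks: int,
    calls_per_chunk: int,
    prompt_tokens: int,
    cost_per_1k=None,
    max_calls=None,
    cost_cap=None,
) -> dict:
    calls = chunks * calls_per_chunk
    # Cost is only estimated when a per-1k-token rate is supplied.
    cost = None if cost_per_1k is None else round(prompt_tokens / 1000 * cost_per_1k, 2)
    return {
        "chunks": chunks,
        "calls": calls,
        "prompt_tokens": prompt_tokens,
        "estimated_cost_usd": cost,
        # A cap that was not supplied can never be exceeded.
        "exceeds_max_calls": max_calls is not None and calls > max_calls,
        "exceeds_cost_cap": cost_cap is not None and cost is not None and cost > cost_cap,
    }


# Whole-book figures from this commit message, with hypothetical caps.
whole_book = summarize(146, 5, 518_000, cost_per_1k=0.30, max_calls=500, cost_cap=200.0)
```

With these inputs the call cap of 500 is exceeded (730 calls) while the cost cap of 200 USD is not (roughly 155 USD), so a caller could abort on the first flag alone.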
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Resolve chapter labels from EPUB nav entries (when present) and from the
first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N"
labels into numeric chapter indices, and generate stable IDs of the form
chapter-NN with -part-NNN suffix when a chapter exceeds max_words. The
chunker now operates on cleaned body text, distributes id="Page_*" page
anchors per part via inline markers extracted before splitting, and
supports a configurable overlap_words evidence window between adjacent
parts of the same chapter. Reclassify body sections whose chapter label
matches contents/transcriber-notes/license/colophon tokens so they leave
the body stream by default. Strip <head>...</head> from HTML body
extraction to stop the <title> tag from duplicating heading text in the
chunk markdown.
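A rough sketch of the label-to-number parsing and ID format described above. The helpers roman_to_int(), parse_chapter_number(), and chunk_id() are hypothetical names, and the real parser may accept more label shapes than this one does.

```python
import re
from typing import Optional

_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
_ROMAN_RE = re.compile(r"^[IVXLCDM]+$", re.IGNORECASE)
_CHAPTER_RE = re.compile(r"^chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def roman_to_int(text: str) -> int:
    total = 0
    prev = 0
    # Walk right to left; a smaller value before a larger one is subtracted.
    for ch in reversed(text.upper()):
        value = _ROMAN_VALUES[ch]
        total += value if value >= prev else -value
        prev = value
    return total


def parse_chapter_number(label: str) -> Optional[int]:
    text = label.strip().rstrip(".")
    match = _CHAPTER_RE.match(text)
    if match:
        text = match.group(1)
    if text.isdigit():
        return int(text)
    # Caveat: words made only of roman letters (e.g. "MIX") would also match.
    if _ROMAN_RE.match(text):
        return roman_to_int(text)
    return None


def chunk_id(chapter: int, part: Optional[int] = None) -> str:
    base = f"chapter-{chapter:02d}"
    # The part suffix is only added when a chapter was split.
    return base if part is None else f"{base}-part-{part:03d}"
```

Zero-padded numbers keep the IDs stable and lexicographically sortable, which is presumably why the fixed-width chapter-NN form was chosen.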
Real Lefevre EPUB now detects all 24 roman-numeral chapters with stable
chapter-NN IDs, distributes Page_N anchors across multi-part chapters,
and reclassifies Contents and Transcriber's Notes out of body
(role histogram body=67, cover=1, header=1, toc=1, notes=1, footer=2).
82 tests pass.
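The overlap_words window between adjacent parts of a chapter can be illustrated with a small word-level splitter. This is a sketch under the assumption that the overlap counts toward each part's max_words budget; split_with_overlap() is a hypothetical name and the real chunker also tracks page anchors.

```python
def split_with_overlap(words: list, max_words: int, overlap_words: int = 0) -> list:
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be in [0, max_words)")
    if len(words) <= max_words:
        return [words]
    # Each new part starts overlap_words before the end of the previous one.
    step = max_words - overlap_words
    parts = []
    start = 0
    while start < len(words):
        parts.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break  # the last part already reaches the end of the chapter
        start += step
    return parts
```

Requiring overlap_words to be smaller than max_words guarantees that every step makes forward progress, so the loop always terminates.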
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Parse META-INF/container.xml and the OPF package document, then iterate
documents in spine reading order instead of archive-name sort. Classify
each spine item (body, cover, nav, toc, header, footer, notes, license,
auxiliary) and exclude non-body sections by default; include_non_body=True
opts them back in for inspection. Capture OPF book metadata (title,
creator, language, subjects, rights, identifier, source_url, modified)
onto every chunk and propagate it through source artifact provenance.
Preserve the legacy zip-without-OPF fallback for malformed EPUBs.
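A minimal sketch of resolving spine reading order with the standard library. The function name spine_documents() is hypothetical, and the sketch omits the zip-without-OPF fallback, section classification, and URL-decoding of hrefs.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub) -> list:
    """Return archive paths of content documents in spine reading order."""
    with zipfile.ZipFile(epub) as zf:
        # container.xml points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf_path = rootfile.attrib["full-path"]

        package = ET.fromstring(zf.read(opf_path))
        # The manifest maps item ids to hrefs relative to the OPF file.
        manifest = {
            item.attrib["id"]: item.attrib["href"]
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)
        # The spine lists item ids in reading order.
        return [
            posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
            for ref in package.findall("opf:spine/opf:itemref", NS)
        ]
```

Reading order comes from the spine's itemref sequence, not from file names, which is why an archive-name sort can interleave front matter with chapters.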
Real Lefevre EPUB now yields 148 body chunks in spine order (was 155
mixed, archive-sorted) with cover=1, header=1, footer=4 detected and
dropped. 78 tests pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two of yesterday's archives silently dropped infospace content: the default
include set was missing contracts/, so wealth-vsm-generation-pilot (16 files)
and wealth-vsm-legacy-slice (12 files) were preserved as 14 and 10 files
respectively. Fix the include set and make silent drops visible.
- DEFAULT_INCLUDE now: infospace.yaml, artifacts, contracts, schemas,
workflows, output, reports, exports
- ArchiveRecord gains skipped_top_level: top-level entries present in the
live root that are not in the include set, not excluded, and not
auto-hidden (dotfiles, empty dirs, .store/index.yaml). The field appears
in index.yaml only when non-empty.
- Re-archived the two affected pilots with correct counts. Prior records
remain in each index.yaml as history.
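The skipped_top_level check can be sketched roughly as below. The function signature and the exclude parameter are assumptions; only the include list and the auto-hide rules come from this commit message, and the .store/index.yaml rule is covered here only through the dotfile check.

```python
from pathlib import Path

DEFAULT_INCLUDE = (
    "infospace.yaml",
    "artifacts",
    "contracts",
    "schemas",
    "workflows",
    "output",
    "reports",
    "exports",
)


def skipped_top_level(root: Path, include=DEFAULT_INCLUDE, exclude=()) -> list:
    skipped = []
    for entry in sorted(root.iterdir()):
        name = entry.name
        if name in include or name in exclude:
            continue  # archived, or deliberately left out
        if name.startswith("."):
            continue  # auto-hidden: dotfiles and dot-directories
        if entry.is_dir() and not any(entry.iterdir()):
            continue  # auto-hidden: empty directories
        skipped.append(name)
    return skipped
```

Anything this returns is content that exists in the live root but would not be archived, which is the kind of silent drop the fix is meant to surface.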
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round out IB-WP-0014 with the remaining archive operations and docs.
- restore_archive() and `infospace-bench restore <pkg> --target <dir>` write
a finalized package's bytes back to disk. The command refuses to overwrite a
non-empty target unless --force is given. --from <infospace-root> resolves
the store location.
- archive-list CLI with --with-retention flag; annotate_retention() opens the
per-infospace registry and joins each record with its current retention
state (effective class, expires, holds, eligibility).
- docs/archive-integration.md covers when to archive, the include set,
retention classes, storage layout, credentials policy, and the explicit
non-goal that S3/git backends live in artifact-store.
- SCOPE.md cross-links the new doc.
- Workplan flipped to status: done. Full pytest suite: 72 passed.
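The overwrite guard on restore can be illustrated with a small helper. prepare_restore_target() is a hypothetical name and this sketch covers only the target check, not the artifact-store read path.

```python
from pathlib import Path


def prepare_restore_target(target: Path, force: bool = False) -> Path:
    if target.exists() and not target.is_dir():
        raise NotADirectoryError(f"{target} exists and is not a directory")
    # Refuse to write into a populated directory unless the caller opts in.
    if target.is_dir() and any(target.iterdir()) and not force:
        raise FileExistsError(f"{target} is not empty; pass force=True to overwrite")
    target.mkdir(parents=True, exist_ok=True)
    return target
```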
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reframe IB-WP-0014 from "in-repo S3/git backend adapters" to "durable archive
surface via artifact-store". The live infospace stays in a local working folder;
finalized snapshots are bundled into content-addressed artifact-store packages.
- New module infospace_bench.archive: archive_infospace(), list_archives(),
ArchiveRecord. Self-bootstraps a SQLite + local-FS registry under
output/archives/.store/ when no Registry is passed in.
- New output/archives/index.yaml records each archive event (package id,
manifest digest, retention class, included paths, file count, note).
- artifactstore added as a path dep; Python floor bumped to 3.12 to match.
- Makefile for venv-based dev setup; stack-and-commands.md updated.
- tests/test_archive.py covers index write, list, recursive-capture guard,
caller-supplied include, and empty-include error. Full suite 65 passed.
Remaining tasks (T03 list CLI, T04 restore, T05 docs) tracked in the workplan.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>