---
id: IB-WP-0016
type: workplan
title: "Lefevre EPUB3 Infospace Readiness Pilot"
domain: markitect
repo: infospace-bench
status: active
owner: markitect
topic_slug: markitect
created: "2026-05-14"
updated: "2026-05-17"
state_hub_workstream_slug: "ib-wp-0016-lefevre-ebook-infospace-readiness"
state_hub_workstream_id: "23be7d20-b01f-4b17-9851-4d540e4c0984"
depends_on_workplans:
  - IB-WP-0015
related_workplans:
  - IB-WP-0014
---

# IB-WP-0016 - Lefevre EPUB3 Infospace Readiness Pilot

## Goal

Use Edwin Lefevre's `Reminiscences of a Stock Operator` EPUB3 as the next real ebook example for `infospace-bench`, and close the gaps that prevent a serious OpenRouter-backed full-book infospace build.

This workplan should leave us able to run a bounded command like:

```bash
infospace-bench generate from-source \
  /mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub \
  --slug reminiscences-stock-operator \
  --name "Reminiscences of a Stock Operator" \
  --profile trading-literature \
  --provider openrouter \
  --model \
  --chapter I \
  --cost-cap \
  --apply
```

and then scale from one reviewed chapter to the full book without losing provenance, reviewability, or cost control.

## Validation Baseline

Validation note: `docs/lefevre-epub3-validation.md`.

Current WP-0015 infrastructure can initialize the local EPUB and run source-only metrics in a disposable workspace:

- source chunks: 155
- entity count: 0
- relation count: 0
- evaluation count: 0
- source-only metrics history can be written without provider calls

The run proves the basic intake path works, but also shows why a live all-book run should wait:

- most generated chunk titles collapse to the same Gutenberg page title
- EPUB spine/chapter metadata is not yet honored deeply enough
- archive-order sorting risks confusing reading order
- non-body sections such as cover/header/footer/license need explicit policy
- plan output is too prompt-heavy for cost review on a 155-chunk book
- long-book resume needs chunk-level state, not only whole-run skip
- generated entities need cross-chunk dedupe/merge policy

## Non-Goals

- Do not commit a full generated Lefevre infospace before review.
- Do not make live OpenRouter calls in the default test suite.
- Do not store API keys or provider secrets in the infospace.
- Do not build a general-purpose EPUB conversion suite beyond what the infospace generator needs.

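
The archive-order concern from the validation baseline can be illustrated with a minimal sketch. It assumes a standard EPUB container layout (`META-INF/container.xml` pointing at an OPF package document) and uses only the Python standard library. The function names `spine_order` and `build_fixture_epub`, and the file names inside the fixture, are hypothetical and are not taken from `infospace-bench`.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the EPUB container and package formats.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_order(epub_bytes: bytes) -> list:
    """Return content document paths in spine (reading) order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))
    # Manifest hrefs are relative to the directory holding the OPF file.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]


def build_fixture_epub() -> bytes:
    """Build a tiny in-memory EPUB whose file names sort in the wrong order.

    Hypothetical fixture: chapter 1 lives in a file that sorts after chapter 2.
    """
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        "</container>"
    )
    opf = (
        '<package xmlns="http://www.idpf.org/2007/opf" version="3.0">'
        "<manifest>"
        '<item id="ch1" href="text/z-chapter-01.xhtml" '
        'media-type="application/xhtml+xml"/>'
        '<item id="ch2" href="text/a-chapter-02.xhtml" '
        'media-type="application/xhtml+xml"/>'
        "</manifest>"
        '<spine><itemref idref="ch1"/><itemref idref="ch2"/></spine>'
        "</package>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", container)
        zf.writestr("OEBPS/content.opf", opf)
        zf.writestr("OEBPS/text/a-chapter-02.xhtml", "<html/>")
        zf.writestr("OEBPS/text/z-chapter-01.xhtml", "<html/>")
    return buf.getvalue()
```

In this fixture, sorting the archive names would put chapter 2 before chapter 1, while following the spine keeps the intended reading order. A real intake would also need a policy for cover, nav, and license items, which this sketch does not address.
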
## Tasks

### T01 - Spine-aware EPUB3 intake

```task
id: IB-WP-0016-T01
status: done
priority: high
state_hub_task_id: "a672fcf9-1b80-4faf-b16d-84ca52601dc9"
```

- Parse `META-INF/container.xml` to find the package document
- Parse OPF metadata, manifest, and spine
- Follow spine reading order instead of archive-name sorting
- Preserve book title, creator, source URL, subjects, language, rights, and modified timestamp in source provenance
- Exclude or tag cover, nav, table-of-contents, Project Gutenberg header, transcriber notes, and license/footer material by explicit policy
- Add tests using a small EPUB3 fixture with nav, cover, body, notes, and footer

### T02 - Chapter-aware chunking and IDs

```task
id: IB-WP-0016-T02
status: done
priority: high
state_hub_task_id: "47de1110-36d0-4d63-bf87-389746509e03"
```

- Resolve chapter labels from EPUB nav entries and in-document headings
- Generate stable IDs like `chapter-01`, `chapter-01-part-002`, not repeated Gutenberg document titles
- Chunk within chapter boundaries with a configurable word limit
- Consider overlap or evidence-window context without duplicating headings
- Preserve page anchors where available as optional provenance
- Add tests showing `Reminiscences`-style roman numeral chapters become stable ordered source chunks

### T03 - Scale-aware generation planning

```task
id: IB-WP-0016-T03
status: todo
priority: high
state_hub_task_id: "bee5c38a-f052-4edb-9313-b3a2ee5a6c26"
```

- Add compact plan output for long sources
- Report estimated chunks, workflow stages, provider call count, prompt word or token estimate, and rough cost inputs
- Add CLI selection filters such as `--chapter`, `--chunk`, `--from-chapter`, `--to-chapter`, `--max-calls`, and `--cost-cap`
- Keep full prompt inspection available, but do not make it the default for large corpora
- Add tests proving plan output is compact and does not dump hundreds of prompts

### T04 - Trading-literature profile

```task
id: IB-WP-0016-T04
status: todo
priority: medium
state_hub_task_id: "1a1b8fde-773f-46a6-887a-3c87a425d7a3"
```

- Add or specialize a profile for trading memoir and market-structure texts
- Tune entity prompts for traders, markets, strategies, errors, psychological patterns, institutions, instruments, and evidence-bearing claims
- Tune relation prompts for cause/effect, lesson/evidence, risk/mitigation, actor/venue, and strategy/outcome links
- Tune evaluation criteria for groundedness, lesson clarity, historical context, and overgeneralization risk
- Keep the generic profile usable for non-trading books

### T05 - Deterministic Lefevre acceptance fixture

```task
id: IB-WP-0016-T05
status: todo
priority: high
state_hub_task_id: "c9bbc84e-691b-4530-a79a-6ecfa9c41fdd"
```

- Add a small checked-in EPUB-like or extracted chapter fixture derived from public-domain Lefevre structure
- Add deterministic fixture responses for source summary, entity extraction, relation extraction, and evaluation
- Prove the fixture generates a manifest-backed infospace with stable source, entity, relation, evaluation, metrics, history, and report artifacts
- Include a regression test for excluding Gutenberg boilerplate when requested

### T06 - OpenRouter live-run guardrails

```task
id: IB-WP-0016-T06
status: todo
priority: high
state_hub_task_id: "c6bf97c3-1c2c-4993-8f4f-97a48e01cce2"
```

- Add an optional live smoke test path that is skipped unless credentials and an explicit opt-in environment variable are present
- Support a one-chapter OpenRouter run with selected model, bounded retries, cost/call cap, provider metadata, and resume
- Record provider model, request IDs, timing, usage, and retry counts in run records and generated artifact provenance
- Document how to run the smoke safely and how to stop before a full-book build

### T07 - Example output and review policy

```task
id: IB-WP-0016-T07
status: todo
priority: medium
state_hub_task_id: "5ff1f11e-49ad-4c2d-bd4c-b8cc261309bc"
```

- Define where generated Lefevre outputs live
- Decide what is committed, what remains disposable, and what needs human review
- Add a review checklist for duplicate entities, relation endpoints, weak evidence, and over-broad trading lessons
- Add a final readiness report before generating the full book

## Acceptance

- Current local EPUB can be inspected as EPUB3 with metadata and ordered body sections
- `generate init` can import the book as body-only ordered chapter chunks
- Chunk titles and IDs are stable, readable, and not dominated by Project Gutenberg boilerplate
- `generate plan` gives compact cost/call planning for the full book
- A deterministic Lefevre-style fixture generates a complete infospace without network access
- Optional one-chapter OpenRouter smoke run is explicit, bounded, resumable, and skipped by default
- A full-book run has documented review and output policy before execution
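
As an illustration of the stable-ID acceptance criterion, the sketch below shows one way roman numeral chapter labels could map to IDs of the form `chapter-01` and `chapter-01-part-002`. This is a sketch under stated assumptions, not the implemented chunker: the names `roman_to_int`, `chunk_id`, and `chunk_chapter`, the default word limit, and the choice to leave single-part chapters without a `-part-NNN` suffix are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
# Accepts canonical roman numerals from I to XCIX (1..99).
_ROMAN_RE = re.compile(r"(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})")


def roman_to_int(label: str) -> int:
    """Convert a chapter label such as 'XIV' to its integer value."""
    text = label.strip().upper()
    if not text or not _ROMAN_RE.fullmatch(text):
        raise ValueError(f"not a supported roman numeral: {label!r}")
    total = 0
    # Subtract a symbol when a larger one follows it (IV, IX, XL, XC).
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN_VALUES[ch]
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def chunk_id(chapter_label: str, part: Optional[int] = None) -> str:
    """Build a zero-padded ID so lexical order matches reading order."""
    base = f"chapter-{roman_to_int(chapter_label):02d}"
    return base if part is None else f"{base}-part-{part:03d}"


def chunk_chapter(
    chapter_label: str, text: str, max_words: int = 800
) -> List[Tuple[str, str]]:
    """Split one chapter into (id, text) chunks without crossing chapters.

    Assumption: a chapter that fits in one chunk keeps the bare chapter ID.
    """
    words = text.split()
    parts = [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
    if len(parts) <= 1:
        return [(chunk_id(chapter_label), parts[0] if parts else "")]
    return [
        (chunk_id(chapter_label, n), part) for n, part in enumerate(parts, 1)
    ]
```

Zero-padding keeps a plain string sort aligned with reading order for the chapter counts this pattern supports, which is one simple way to avoid the archive-order ambiguity at the chunk level.
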