| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_slug | state_hub_workstream_id | depends_on_workplans | related_workplans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IB-WP-0016 | workplan | Lefevre EPUB3 Infospace Readiness Pilot | markitect | infospace-bench | active | markitect | markitect | 2026-05-14 | 2026-05-17 | ib-wp-0016-lefevre-ebook-infospace-readiness | 23be7d20-b01f-4b17-9851-4d540e4c0984 | | |
IB-WP-0016 - Lefevre EPUB3 Infospace Readiness Pilot
Goal
Use Edwin Lefevre's Reminiscences of a Stock Operator EPUB3 as the next real
ebook example for infospace-bench, and close the gaps that prevent a serious
OpenRouter-backed full-book infospace build.
This workplan should leave us able to run a bounded command like:
    infospace-bench generate from-source \
      /mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub \
      --slug reminiscences-stock-operator \
      --name "Reminiscences of a Stock Operator" \
      --profile trading-literature \
      --provider openrouter \
      --model <openrouter-model-id> \
      --chapter I \
      --cost-cap <cap> \
      --apply
and then scale from one reviewed chapter to the full book without losing provenance, reviewability, or cost control.
Validation Baseline
Validation note: docs/lefevre-epub3-validation.md.
Current WP-0015 infrastructure can initialize the local EPUB and run source-only metrics in a disposable workspace:
- source chunks: 155
- entity count: 0
- relation count: 0
- evaluation count: 0
- source-only metrics history can be written without provider calls
The run proves the basic intake path works, but also shows why a live all-book run should wait:
- most generated chunk titles collapse to the same Gutenberg page title
- EPUB spine/chapter metadata is not yet honored deeply enough
- archive-order sorting risks confusing reading order
- non-body sections such as cover/header/footer/license need explicit policy
- plan output is too prompt-heavy for cost review on a 155-chunk book
- long-book resume needs chunk-level state, not only whole-run skip
- generated entities need cross-chunk dedupe/merge policy
Non-Goals
- Do not commit a full generated Lefevre infospace before review.
- Do not make live OpenRouter calls in the default test suite.
- Do not store API keys or provider secrets in the infospace.
- Do not build a general-purpose EPUB conversion suite beyond what the infospace generator needs.
Tasks
T01 - Spine-aware EPUB3 intake
id: IB-WP-0016-T01
status: done
priority: high
state_hub_task_id: "a672fcf9-1b80-4faf-b16d-84ca52601dc9"
- Parse `META-INF/container.xml` to find the package document
- Parse OPF metadata, manifest, and spine
- Follow spine reading order instead of archive-name sorting
- Preserve book title, creator, source URL, subjects, language, rights, and modified timestamp in source provenance
- Exclude or tag cover, nav, table-of-contents, Project Gutenberg header, transcriber notes, and license/footer material by explicit policy
- Add tests using a small EPUB3 fixture with nav, cover, body, notes, and footer
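
The container-to-spine lookup described in this task can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual intake code; the function name `spine_reading_order` is hypothetical, and the sketch assumes manifest hrefs are plain relative paths (no percent-decoding or fragment handling).

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the OCF container file and the OPF package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_reading_order(epub_path):
    """Return content document paths in spine order, not archive order.

    Hypothetical helper: accepts a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_path) as zf:
        # Step 1: container.xml points at the package (OPF) document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # Step 2: the manifest maps item ids to hrefs relative to the OPF directory.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }

    # Step 3: the spine lists item ids in reading order.
    base = posixpath.dirname(opf_path)
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

Sorting the result of `ZipFile.namelist()` would instead give archive-name order, which is the behavior this task replaces.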
T02 - Chapter-aware chunking and IDs
id: IB-WP-0016-T02
status: done
priority: high
state_hub_task_id: "47de1110-36d0-4d63-bf87-389746509e03"
- Resolve chapter labels from EPUB nav entries and in-document headings
- Generate stable IDs like `chapter-01`, `chapter-01-part-002`, not repeated Gutenberg document titles
- Chunk within chapter boundaries with a configurable word limit
- Consider overlap or evidence-window context without duplicating headings
- Preserve page anchors where available as optional provenance
- Add tests showing Reminiscences-style roman numeral chapters become stable ordered source chunks
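
As a rough sketch of the label parsing, ID scheme, and overlap window listed above: all function names here are hypothetical, and the choice that the first part keeps the bare `chapter-NN` ID while later parts get a `-part-NNN` suffix is one reading of the ID examples, not a confirmed rule.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text):
    """Convert a roman numeral to an int, or return None for other text.

    Lenient by design: it does not reject malformed numerals such as 'IIII'.
    """
    text = text.upper()
    if not text or any(ch not in ROMAN for ch in text):
        return None
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one (IV, IX, XL).
        total += -value if ROMAN.get(nxt, 0) > value else value
    return total


def chapter_index(label):
    """Parse labels such as 'XIV', 'Chapter 7', or 'CHAPTER IX.' into an int."""
    token = re.sub(r"^chapter\s+", "", label.strip(), flags=re.IGNORECASE)
    token = token.rstrip(".")
    return int(token) if token.isdigit() else roman_to_int(token)


def split_words(words, max_words, overlap_words=0):
    """Split words into parts of at most max_words; each later part starts
    with the last overlap_words of the previous part as context."""
    if overlap_words >= max_words:
        raise ValueError("overlap_words must be smaller than max_words")
    step = max_words - overlap_words
    parts, start = [], 0
    while True:
        parts.append(words[start:start + max_words])
        if start + max_words >= len(words):
            return parts
        start += step


def chunk_ids(index, part_count):
    """Assumed scheme: chapter-NN for part 1, chapter-NN-part-NNN afterwards."""
    base = f"chapter-{index:02d}"
    return [
        base if n == 1 else f"{base}-part-{n:03d}"
        for n in range(1, part_count + 1)
    ]
```

Because the IDs depend only on the parsed chapter index and the part position, re-running intake on the same source yields the same IDs, provided `max_words` and `overlap_words` are unchanged.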
T03 - Scale-aware generation planning
id: IB-WP-0016-T03
status: todo
priority: high
state_hub_task_id: "bee5c38a-f052-4edb-9313-b3a2ee5a6c26"
- Add compact plan output for long sources
- Report estimated chunks, workflow stages, provider call count, prompt word or token estimate, and rough cost inputs
- Add CLI selection filters such as `--chapter`, `--chunk`, `--from-chapter`, `--to-chapter`, `--max-calls`, and `--cost-cap`
- Keep full prompt inspection available, but do not make it the default for large corpora
- Add tests proving plan output is compact and does not dump hundreds of prompts
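
The arithmetic behind a compact plan might look like the sketch below. It assumes every workflow stage runs once per chunk and sends the chunk text plus a fixed prompt overhead; `PlanSummary`, `summarize_plan`, and the per-1k-word price input are hypothetical names for illustration, not the tool's actual interface.

```python
from dataclasses import dataclass


@dataclass
class PlanSummary:
    chunks: int
    stages: int
    calls: int
    prompt_words: int
    estimated_cost: float
    within_cap: bool


def summarize_plan(chunk_word_counts, stages, overhead_words,
                   cost_per_1k_words, cost_cap):
    """Summarize a run as a handful of numbers instead of full prompts.

    Assumption: one provider call per (chunk, stage) pair.
    """
    chunks = len(chunk_word_counts)
    calls = chunks * len(stages)
    # Each call carries the chunk text plus a fixed instruction overhead.
    prompt_words = sum(chunk_word_counts) * len(stages) + calls * overhead_words
    cost = prompt_words / 1000 * cost_per_1k_words
    return PlanSummary(
        chunks=chunks,
        stages=len(stages),
        calls=calls,
        prompt_words=prompt_words,
        estimated_cost=cost,
        within_cap=cost <= cost_cap,
    )
```

The summary stays the same size whether the source has 2 chunks or 155, which is the property the compact-output tests would check.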
T04 - Trading-literature profile
id: IB-WP-0016-T04
status: todo
priority: medium
state_hub_task_id: "1a1b8fde-773f-46a6-887a-3c87a425d7a3"
- Add or specialize a profile for trading memoir and market-structure texts
- Tune entity prompts for traders, markets, strategies, errors, psychological patterns, institutions, instruments, and evidence-bearing claims
- Tune relation prompts for cause/effect, lesson/evidence, risk/mitigation, actor/venue, and strategy/outcome links
- Tune evaluation criteria for groundedness, lesson clarity, historical context, and overgeneralization risk
- Keep the generic profile usable for non-trading books
T05 - Deterministic Lefevre acceptance fixture
id: IB-WP-0016-T05
status: todo
priority: high
state_hub_task_id: "c9bbc84e-691b-4530-a79a-6ecfa9c41fdd"
- Add a small checked-in EPUB-like or extracted chapter fixture derived from public-domain Lefevre structure
- Add deterministic fixture responses for source summary, entity extraction, relation extraction, and evaluation
- Prove the fixture generates a manifest-backed infospace with stable source, entity, relation, evaluation, metrics, history, and report artifacts
- Include a regression test for excluding Gutenberg boilerplate when requested
T06 - OpenRouter live-run guardrails
id: IB-WP-0016-T06
status: todo
priority: high
state_hub_task_id: "c6bf97c3-1c2c-4993-8f4f-97a48e01cce2"
- Add an optional live smoke test path that is skipped unless credentials and an explicit opt-in environment variable are present
- Support a one-chapter OpenRouter run with selected model, bounded retries, cost/call cap, provider metadata, and resume
- Record provider model, request IDs, timing, usage, and retry counts in run records and generated artifact provenance
- Document how to run the smoke safely and how to stop before a full-book build
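
One possible shape for the opt-in gate and call cap is sketched below. The environment variable names are assumptions (`INFOSPACE_BENCH_LIVE_SMOKE` is invented for this example, and `OPENROUTER_API_KEY` is assumed rather than confirmed by this workplan), as are the helper names.

```python
import os

OPT_IN_VAR = "INFOSPACE_BENCH_LIVE_SMOKE"  # hypothetical opt-in flag
API_KEY_VAR = "OPENROUTER_API_KEY"         # assumed credential variable


def live_smoke_enabled(env=None):
    """True only when the opt-in flag is '1' AND a credential is present."""
    env = os.environ if env is None else env
    return env.get(OPT_IN_VAR) == "1" and bool(env.get(API_KEY_VAR))


class CallBudget:
    """Refuse further provider calls once the cap is reached."""

    def __init__(self, max_calls):
        self.max_calls = max_calls
        self.used = 0

    def spend(self):
        if self.used >= self.max_calls:
            raise RuntimeError(f"call cap of {self.max_calls} reached")
        self.used += 1
```

In a pytest suite, a gate like this would typically feed `pytest.mark.skipif(not live_smoke_enabled(), reason="live smoke not enabled")`, so the default test run never touches the network.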
T07 - Example output and review policy
id: IB-WP-0016-T07
status: todo
priority: medium
state_hub_task_id: "5ff1f11e-49ad-4c2d-bd4c-b8cc261309bc"
- Define where generated Lefevre outputs live
- Decide what is committed, what remains disposable, and what needs human review
- Add a review checklist for duplicate entities, relation endpoints, weak evidence, and over-broad trading lessons
- Add a final readiness report before generating the full book
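
Two of the checklist items, duplicate entities and relation endpoints, lend themselves to automated pre-checks before human review. The sketch below assumes entities are dicts with `id` and `name` keys and relations are dicts with `id`, `source`, and `target` keys; these field names are hypothetical and may not match the generated artifacts.

```python
from collections import defaultdict


def normalize_name(name):
    """Fold case and collapse whitespace so near-identical names compare equal."""
    return " ".join(name.casefold().split())


def duplicate_entities(entities):
    """Group entity ids whose names match after normalization."""
    groups = defaultdict(list)
    for entity in entities:
        groups[normalize_name(entity["name"])].append(entity["id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}


def dangling_relations(entities, relations):
    """Return ids of relations whose source or target is not a known entity."""
    known = {entity["id"] for entity in entities}
    return [
        rel["id"]
        for rel in relations
        if rel["source"] not in known or rel["target"] not in known
    ]
```

Checks like these only surface candidates; weak evidence and over-broad trading lessons still need a human reader.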
Acceptance
- Current local EPUB can be inspected as EPUB3 with metadata and ordered body sections
- `generate init` can import the book as body-only ordered chapter chunks
- Chunk titles and IDs are stable, readable, and not dominated by Project Gutenberg boilerplate
- `generate plan` gives compact cost/call planning for the full book
- A deterministic Lefevre-style fixture generates a complete infospace without network access
- Optional one-chapter OpenRouter smoke run is explicit, bounded, resumable, and skipped by default
- A full-book run has documented review and output policy before execution