Check in a small Lefevre-shaped EPUB fixture as separate source files under tests/fixtures/lefevre/sources/ (container.xml, OPF, nav, cover, PG header, three roman-numeral chapters with page anchors, transcriber notes, license, PG footer). The test helper assembles these into an EPUB at test time so the inputs stay inspectable in git.

Fixture responses tuned to the trading-literature profile (T04) live at tests/fixtures/lefevre/responses.yaml: trader / institution / strategy categories on entities, strategy_outcome / actor_venue relation types, and all four trading-tuned evaluation criteria.

Three tests cover the acceptance:
- end-to-end Python pipeline: stable chapter-NN source slugs, full artifact tree (entities, relations, evaluations, metrics, history, generation report), budget registry persisted, chapter_number provenance round-trips through artifacts/index.yaml
- regression: PG boilerplate (cover, nav, header, notes, license, footer) is excluded by default and only appears under include_non_body=True
- CLI smoke through generate from-source --profile trading-literature --fixture-responses ...

125 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| id | type | title | domain | repo | status | owner | topic_slug | created | updated | state_hub_workstream_slug | state_hub_workstream_id | depends_on_workplans | related_workplans |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IB-WP-0016 | workplan | Lefevre EPUB3 Infospace Readiness Pilot | markitect | infospace-bench | active | markitect | markitect | 2026-05-14 | 2026-05-17 | ib-wp-0016-lefevre-ebook-infospace-readiness | 23be7d20-b01f-4b17-9851-4d540e4c0984 |  |  |
IB-WP-0016 - Lefevre EPUB3 Infospace Readiness Pilot
Goal
Use Edwin Lefevre's Reminiscences of a Stock Operator EPUB3 as the next real
ebook example for infospace-bench, and close the gaps that prevent a serious
OpenRouter-backed full-book infospace build.
This workplan should leave us able to run a bounded command like:
infospace-bench generate from-source \
/mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub \
--slug reminiscences-stock-operator \
--name "Reminiscences of a Stock Operator" \
--profile trading-literature \
--provider openrouter \
--model <openrouter-model-id> \
--chapter I \
--cost-cap <cap> \
--apply
and then scale from one reviewed chapter to the full book without losing provenance, reviewability, or cost control.
Validation Baseline
Validation note: docs/lefevre-epub3-validation.md (includes T01 and T02
result sections).
After T01 and T02, the local Lefevre EPUB is intake-ready:
- 67 body chunks at default max_words=800, all 24 roman-numeral chapters detected, stable IDs chapter-01..chapter-24 with -part-NNN suffix
- Cover, PG header/footer, Contents, Transcriber's Notes, and license sections classified out of the body stream by default
- Per-chunk provenance carries full OPF book metadata, chapter label and number, page anchors, and spine index
Smoke Run (2026-05-17)
A fixture-backed end-to-end smoke run with --max-chunks 3 against the
real EPUB produced a complete infospace:
- 3 source chunks (chapter-01-part-001..003), 3 entities, 3 relations, 3 evaluations, 1 generation-summary report
- All chapter/book/anchor provenance fields land in artifacts/index.yaml (verified: chapter_label=I, chapter_number=1, page_anchors=[Page_1, Page_2, Page_3] on the first chunk)
- Metrics viable: coverage=1.0, redundancy=0.0, granularity_entropy=1.79, viability gates pass
- Same-title entities returned by repeated stages were upserted to single artifact files; basic dedupe works for exact-title matches
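The exact-title upsert behaviour noted above can be sketched as follows. This is a minimal illustration, not the project's implementation: `slugify`, `upsert_entity`, and the in-memory `store` dict are hypothetical names, and the slug rule is an assumption.

```python
import re


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def upsert_entity(store: dict, entity: dict) -> str:
    """Insert or update an entity keyed by its title slug (hypothetical policy).

    Fields from the newer entity overwrite fields already stored under the
    same slug, so repeated stages returning the same title yield one record.
    """
    slug = slugify(entity["title"])
    store[slug] = {**store.get(slug, {}), **entity}
    return slug


store = {}
upsert_entity(store, {"title": "Larry Livingston", "category": "trader"})
upsert_entity(store, {"title": "Larry Livingston", "summary": "Narrator of the memoir."})
upsert_entity(store, {"title": "the narrator", "category": "trader"})
```

Under this sketch the two "Larry Livingston" results collapse into one record, while "the narrator" stays separate, since only exact-title matches share a slug.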
Remaining Gaps
These are the gaps a serious full-book run still hits:
- No compact plan output for cost/call preview on a 67-chunk run: ~5 stages per chunk = ~335 provider calls at default max_words (T03)
- No --chapter, --from-chapter, --to-chapter, --cost-cap, or --max-calls selection (T03)
- Generic profile produces sensible structure but does not push concepts toward traders, markets, lessons, or strategies (T04)
- The generation-summary report only shows counts and metrics; it should surface entity titles, chapter coverage, page-anchor links, and unmapped chunks for human review (T07)
- Long-book resume is still whole-run-skip, not chunk-level (T06)
- Near-duplicate entities across chunks (e.g. "Larry Livingston" vs "the narrator") need cross-chunk merge/dedupe policy before a 24-chapter run
Non-Goals
- Do not commit a full generated Lefevre infospace before review.
- Do not make live OpenRouter calls in the default test suite.
- Do not store API keys or provider secrets in the infospace.
- Do not build a general-purpose EPUB conversion suite beyond what the infospace generator needs.
Tasks
T01 - Spine-aware EPUB3 intake
id: IB-WP-0016-T01
status: done
priority: high
state_hub_task_id: "a672fcf9-1b80-4faf-b16d-84ca52601dc9"
- Parse META-INF/container.xml to find the package document
- Parse OPF metadata, manifest, and spine
- Follow spine reading order instead of archive-name sorting
- Preserve book title, creator, source URL, subjects, language, rights, and modified timestamp in source provenance
- Exclude or tag cover, nav, table-of-contents, Project Gutenberg header, transcriber notes, and license/footer material by explicit policy
- Add tests using a small EPUB3 fixture with nav, cover, body, notes, and footer
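The first three steps of this task (container lookup, OPF parsing, spine-order traversal) can be sketched with the Python standard library. This is an illustrative sketch under assumptions, not the project's intake code: `spine_paths` is a hypothetical helper, and it ignores metadata, encryption, and malformed packages.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces defined by the EPUB container and package formats.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_paths(epub_bytes: bytes) -> list[str]:
    """Return archive paths of spine documents in reading order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        # container.xml points at the package document (the OPF file).
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the directory holding the OPF file.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    # The spine, not the archive listing, defines reading order.
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

Resolving through the spine matters when file names do not sort in reading order, which is the failure mode that archive-name sorting would hit.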
T02 - Chapter-aware chunking and IDs
id: IB-WP-0016-T02
status: done
priority: high
state_hub_task_id: "47de1110-36d0-4d63-bf87-389746509e03"
- Resolve chapter labels from EPUB nav entries and in-document headings
- Generate stable IDs like chapter-01, chapter-01-part-002, not repeated Gutenberg document titles
- Chunk within chapter boundaries with a configurable word limit
- Consider overlap or evidence-window context without duplicating headings
- Preserve page anchors where available as optional provenance
- Add tests showing Reminiscences-style roman numeral chapters become stable ordered source chunks
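A minimal sketch of how roman-numeral chapter labels could map to stable chunk IDs with a word limit. The function names are hypothetical, and always emitting a -part-NNN suffix is an assumption based on the IDs reported in the smoke run; overlap, headings, and page anchors are left out.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(label: str) -> int:
    """Convert a roman numeral such as 'XXIV' to an integer."""
    values = [ROMAN_VALUES[ch] for ch in label.upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtractive (IV, IX, XL).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def chunk_chapter(label: str, text: str, max_words: int = 800) -> list[tuple[str, str]]:
    """Split one chapter into (chunk_id, chunk_text) pairs of at most max_words words.

    Chunks never cross the chapter boundary because each call handles a
    single chapter's text.
    """
    number = roman_to_int(label)
    words = text.split()
    parts = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    return [
        (f"chapter-{number:02d}-part-{index:03d}", " ".join(part))
        for index, part in enumerate(parts, start=1)
    ]
```

Deriving the ID from the chapter number and part index, rather than from the document title, keeps IDs identical across reruns as long as the text and max_words are unchanged.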
T03 - Scale-aware generation planning
id: IB-WP-0016-T03
status: done
priority: high
state_hub_task_id: "bee5c38a-f052-4edb-9313-b3a2ee5a6c26"
- Add compact plan output for long sources
- Report estimated chunks, workflow stages, provider call count, prompt word or token estimate, and rough cost inputs
- Add CLI selection filters such as --chapter, --chunk, --from-chapter, --to-chapter, --max-calls, and --cost-cap
- Keep full prompt inspection available, but do not make it the default for large corpora
- Add tests proving plan output is compact and does not dump hundreds of prompts
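The arithmetic behind a compact plan can be sketched as below. This is a hedged illustration: `PlanSummary` and `summarize_plan` are hypothetical names, the default of five stages per chunk follows the rough estimate in Remaining Gaps, and the prompt-word figure assumes each stage resends the chunk text without counting prompt-template overhead.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlanSummary:
    chunks: int
    stages_per_chunk: int
    provider_calls: int
    prompt_words: int
    within_max_calls: bool


def summarize_plan(
    chunk_word_counts: Sequence[int],
    stages_per_chunk: int = 5,
    max_calls: Optional[int] = None,
) -> PlanSummary:
    """Summarize a run as a few totals instead of listing every prompt."""
    chunks = len(chunk_word_counts)
    calls = chunks * stages_per_chunk
    # Assumption: every stage sends the full chunk text once.
    prompt_words = sum(chunk_word_counts) * stages_per_chunk
    within = max_calls is None or calls <= max_calls
    return PlanSummary(chunks, stages_per_chunk, calls, prompt_words, within)


full_book = summarize_plan([800] * 67)
one_chapter = summarize_plan([800] * 3, max_calls=20)
```

For 67 chunks this gives 335 provider calls, matching the estimate in Remaining Gaps; a run whose call count exceeds max_calls would be flagged before any provider request is made.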
T04 - Trading-literature profile
id: IB-WP-0016-T04
status: done
priority: medium
state_hub_task_id: "1a1b8fde-773f-46a6-887a-3c87a425d7a3"
- Add or specialize a profile for trading memoir and market-structure texts
- Tune entity prompts for traders, markets, strategies, errors, psychological patterns, institutions, instruments, and evidence-bearing claims
- Tune relation prompts for cause/effect, lesson/evidence, risk/mitigation, actor/venue, and strategy/outcome links
- Tune evaluation criteria for groundedness, lesson clarity, historical context, and overgeneralization risk
- Keep the generic profile usable for non-trading books
T05 - Deterministic Lefevre acceptance fixture
id: IB-WP-0016-T05
status: done
priority: high
state_hub_task_id: "c9bbc84e-691b-4530-a79a-6ecfa9c41fdd"
- Add a small checked-in EPUB-like or extracted chapter fixture derived from public-domain Lefevre structure
- Add deterministic fixture responses for source summary, entity extraction, relation extraction, and evaluation
- Prove the fixture generates a manifest-backed infospace with stable source, entity, relation, evaluation, metrics, history, and report artifacts
- Include a regression test for excluding Gutenberg boilerplate when requested
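The boilerplate regression can be pictured with a small filter. This sketch assumes each section already carries a `kind` label; the label values and the `select_sections` helper are hypothetical, while the `include_non_body` flag name follows the behaviour described for the fixture tests.

```python
# Section kinds treated as non-body material (assumed labels).
NON_BODY_KINDS = frozenset({"cover", "nav", "header", "notes", "license", "footer"})


def select_sections(sections: list[dict], include_non_body: bool = False) -> list[dict]:
    """Keep only body sections unless non-body material is explicitly requested."""
    if include_non_body:
        return list(sections)
    return [s for s in sections if s["kind"] not in NON_BODY_KINDS]


sections = [
    {"id": "cover", "kind": "cover"},
    {"id": "pg-header", "kind": "header"},
    {"id": "chapter-01", "kind": "body"},
    {"id": "chapter-02", "kind": "body"},
    {"id": "license", "kind": "license"},
]
body_only = select_sections(sections)
everything = select_sections(sections, include_non_body=True)
```

A regression test in this style asserts that no boilerplate ID appears in the default selection and that all of them reappear once the flag is set.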
T06 - OpenRouter live-run guardrails
id: IB-WP-0016-T06
status: todo
priority: high
state_hub_task_id: "c6bf97c3-1c2c-4993-8f4f-97a48e01cce2"
- Add an optional live smoke test path that is skipped unless credentials and an explicit opt-in environment variable are present
- Support a one-chapter OpenRouter run with selected model, bounded retries, cost/call cap, provider metadata, and resume
- Record provider model, request IDs, timing, usage, and retry counts in run records and generated artifact provenance
- Document how to run the smoke safely and how to stop before a full-book build
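One way to keep the live path skipped by default is a guard that requires both a credential and an explicit opt-in. The sketch below uses the standard `unittest` skip decorator; the environment variable names `OPENROUTER_API_KEY` and `INFOSPACE_BENCH_LIVE_SMOKE` are assumptions for illustration, not names confirmed by this workplan.

```python
import os
import unittest
from typing import Mapping


def live_smoke_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """True only when a key is present AND the opt-in flag is exactly '1'."""
    has_key = bool(env.get("OPENROUTER_API_KEY"))  # assumed variable name
    opted_in = env.get("INFOSPACE_BENCH_LIVE_SMOKE") == "1"  # assumed variable name
    return has_key and opted_in


@unittest.skipUnless(live_smoke_enabled(), "live OpenRouter smoke run is opt-in")
class LiveOpenRouterSmoke(unittest.TestCase):
    def test_one_chapter_run(self):
        # Placeholder: a real test would start the bounded one-chapter run here.
        self.skipTest("one-chapter run is not wired up in this sketch")
```

Requiring both conditions means a developer who merely has a key exported in their shell does not trigger paid calls when running the default suite.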
T07 - Example output and review policy
id: IB-WP-0016-T07
status: todo
priority: medium
state_hub_task_id: "5ff1f11e-49ad-4c2d-bd4c-b8cc261309bc"
- Define where generated Lefevre outputs live
- Decide what is committed, what remains disposable, and what needs human review
- Add a review checklist for duplicate entities, relation endpoints, weak evidence, and over-broad trading lessons
- Add a final readiness report before generating the full book
- Enrich reports/generation-summary.md beyond counts and metrics: list entity titles, per-chapter coverage, page-anchor links, and any unmapped source chunks (gap found in the 2026-05-17 smoke run)
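The data behind an enriched summary report could be assembled roughly as follows. This is a sketch under assumptions: `build_review_summary` is a hypothetical helper, and the dict fields (`id`, `chapter_number`, `page_anchors`, `title`, `source_chunk_ids`) are illustrative rather than the actual artifact schema.

```python
def build_review_summary(chunks: list[dict], entities: list[dict]) -> dict:
    """Collect entity titles, per-chapter coverage, anchors, and unmapped chunks."""
    by_id = {c["id"]: c for c in chunks}
    # A chunk counts as mapped when at least one entity cites it.
    mapped = {cid for e in entities for cid in e["source_chunk_ids"]}

    coverage = {}  # chapter_number -> (mapped chunks, total chunks)
    for c in chunks:
        hit, total = coverage.get(c["chapter_number"], (0, 0))
        coverage[c["chapter_number"]] = (hit + int(c["id"] in mapped), total + 1)

    return {
        "entities": [
            {
                "title": e["title"],
                "page_anchors": [
                    anchor
                    for cid in e["source_chunk_ids"]
                    for anchor in by_id[cid]["page_anchors"]
                ],
            }
            for e in sorted(entities, key=lambda e: e["title"])
        ],
        "chapter_coverage": coverage,
        "unmapped_chunks": [c["id"] for c in chunks if c["id"] not in mapped],
    }


chunks = [
    {"id": "chapter-01-part-001", "chapter_number": 1, "page_anchors": ["Page_1", "Page_2"]},
    {"id": "chapter-01-part-002", "chapter_number": 1, "page_anchors": ["Page_4"]},
]
entities = [{"title": "Bucket shop", "source_chunk_ids": ["chapter-01-part-001"]}]
summary = build_review_summary(chunks, entities)
```

A renderer could then turn this structure into the Markdown report, so the reviewer sees which chunks produced nothing before a full-book run is approved.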
Acceptance
- Current local EPUB can be inspected as EPUB3 with metadata and ordered body sections
- generate init can import the book as body-only ordered chapter chunks
- Chunk titles and IDs are stable, readable, and not dominated by Project Gutenberg boilerplate
- generate plan gives compact cost/call planning for the full book
- A deterministic Lefevre-style fixture generates a complete infospace without network access
- Optional one-chapter OpenRouter smoke run is explicit, bounded, resumable, and skipped by default
- A full-book run has documented review and output policy before execution