Workplan for practical example

2026-05-14 22:05:10 +02:00
parent 937acde0b7
commit 9d1a2088aa
3 changed files with 285 additions and 0 deletions
--- a/workplans/IB-WP-0016-lefevre-ebook-infospace-readiness.md
+++ b/workplans/IB-WP-0016-lefevre-ebook-infospace-readiness.md
@@ -0,0 +1,214 @@
+---
+id: IB-WP-0016
+type: workplan
+title: "Lefevre EPUB3 Infospace Readiness Pilot"
+domain: markitect
+repo: infospace-bench
+status: active
+owner: markitect
+topic_slug: markitect
+created: "2026-05-14"
+updated: "2026-05-14"
+state_hub_workstream_slug: "ib-wp-0016-lefevre-ebook-infospace-readiness"
+state_hub_workstream_id: "23be7d20-b01f-4b17-9851-4d540e4c0984"
+depends_on_workplans:
+  - IB-WP-0015
+related_workplans:
+  - IB-WP-0014
+---
+
+# IB-WP-0016 - Lefevre EPUB3 Infospace Readiness Pilot
+
+## Goal
+
+Use Edwin Lefevre's `Reminiscences of a Stock Operator` EPUB3 as the next real
+ebook example for `infospace-bench`, and close the gaps that prevent a serious
+OpenRouter-backed full-book infospace build.
+
+This workplan should leave us able to run a bounded command like:
+
+```bash
+infospace-bench generate from-source \
+  /mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub \
+  --slug reminiscences-stock-operator \
+  --name "Reminiscences of a Stock Operator" \
+  --profile trading-literature \
+  --provider openrouter \
+  --model <openrouter-model-id> \
+  --chapter I \
+  --cost-cap <cap> \
+  --apply
+```
+
+and then scale from one reviewed chapter to the full book without losing
+provenance, reviewability, or cost control.
+
+## Validation Baseline
+
+Validation note: `docs/lefevre-epub3-validation.md`.
+
+The current IB-WP-0015 infrastructure can initialize the local EPUB and run
+source-only metrics in a disposable workspace:
+
+- source chunks: 155
+- entity count: 0
+- relation count: 0
+- evaluation count: 0
+- source-only metrics history can be written without provider calls
+
+The run proves the basic intake path works, but also shows why a live
+full-book run should wait:
+
+- most generated chunk titles collapse to the same Gutenberg page title
+- EPUB spine and chapter metadata are not yet used to order or label chunks
+- sorting by archive name can scramble the reading order
+- non-body sections such as cover/header/footer/license need explicit policy
+- plan output is too prompt-heavy for cost review on a 155-chunk book
+- long-book resume needs chunk-level state, not only whole-run skip
+- generated entities need cross-chunk dedupe/merge policy
+
+## Non-Goals
+
+- Do not commit a full generated Lefevre infospace before review.
+- Do not make live OpenRouter calls in the default test suite.
+- Do not store API keys or provider secrets in the infospace.
+- Do not build a general-purpose EPUB conversion suite beyond what the
+  infospace generator needs.
+
+## Tasks
+
+### T01 - Spine-aware EPUB3 intake
+
+```task
+id: IB-WP-0016-T01
+status: todo
+priority: high
+state_hub_task_id: "a672fcf9-1b80-4faf-b16d-84ca52601dc9"
+```
+
+- Parse `META-INF/container.xml` to find the package document
+- Parse OPF metadata, manifest, and spine
+- Follow spine reading order instead of archive-name sorting
+- Preserve book title, creator, source URL, subjects, language, rights, and
+  modified timestamp in source provenance
+- Exclude or tag cover, nav, table-of-contents, Project Gutenberg header,
+  transcriber notes, and license/footer material by explicit policy
+- Add tests using a small EPUB3 fixture with nav, cover, body, notes, and footer
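+
+The container-to-spine lookup in the first three bullets could be sketched as
+follows. This is a minimal illustration using only the Python standard library,
+not the `infospace-bench` implementation; the function name `spine_paths` is
+hypothetical, and the sketch assumes manifest hrefs are not percent-encoded.
+
+```python
+import posixpath
+import zipfile
+import xml.etree.ElementTree as ET
+
+# XML namespaces used by the OCF container file and the OPF package document.
+NS = {
+    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
+    "opf": "http://www.idpf.org/2007/opf",
+}
+
+
+def spine_paths(epub_path):
+    """Return archive paths of content documents in spine (reading) order."""
+    with zipfile.ZipFile(epub_path) as zf:
+        # META-INF/container.xml names the package document (the OPF file).
+        container = ET.fromstring(zf.read("META-INF/container.xml"))
+        rootfile = container.find("c:rootfiles/c:rootfile", NS)
+        opf_path = rootfile.attrib["full-path"]
+        package = ET.fromstring(zf.read(opf_path))
+
+    # Manifest hrefs are relative to the directory holding the OPF file.
+    base = posixpath.dirname(opf_path)
+    manifest = {
+        item.attrib["id"]: item.attrib["href"]
+        for item in package.findall("opf:manifest/opf:item", NS)
+    }
+    # The spine lists manifest ids in reading order; archive order is ignored.
+    return [
+        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
+        for ref in package.findall("opf:spine/opf:itemref", NS)
+    ]
+```
+
+Exclusion or tagging of cover, nav, and license sections would sit on top of
+this list and is deliberately left out of the sketch.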
+
+### T02 - Chapter-aware chunking and IDs
+
+```task
+id: IB-WP-0016-T02
+status: todo
+priority: high
+state_hub_task_id: "47de1110-36d0-4d63-bf87-389746509e03"
+```
+
+- Resolve chapter labels from EPUB nav entries and in-document headings
+- Generate stable IDs like `chapter-01`, `chapter-01-part-002`, not repeated
+  Gutenberg document titles
+- Chunk within chapter boundaries with a configurable word limit
+- Consider overlap or evidence-window context without duplicating headings
+- Preserve page anchors where available as optional provenance
+- Add tests showing `Reminiscences`-style roman numeral chapters become stable
+  ordered source chunks
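+
+One way to turn roman numeral chapter labels into the ID shapes named above is
+sketched below. The helper names `roman_to_int` and `chunk_id` are hypothetical,
+and the sketch assumes the label has already been reduced to a bare numeral
+such as `IX`.
+
+```python
+ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
+
+
+def roman_to_int(numeral):
+    """Convert a roman numeral such as 'XXIV' to an integer."""
+    values = [ROMAN[ch] for ch in numeral.upper()]
+    total = 0
+    for i, value in enumerate(values):
+        following = values[i + 1] if i + 1 < len(values) else 0
+        # A smaller value before a larger one is subtracted (IV = 4, IX = 9).
+        total += -value if value < following else value
+    return total
+
+
+def chunk_id(chapter_label, part=None):
+    """Build 'chapter-NN' or 'chapter-NN-part-NNN' from a chapter numeral."""
+    base = f"chapter-{roman_to_int(chapter_label):02d}"
+    return base if part is None else f"{base}-part-{part:03d}"
+```
+
+Zero-padding keeps the IDs in reading order under a plain string sort, so
+`chunk_id("IX", 2)` gives `chapter-09-part-002`, which sorts before
+`chapter-10`.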
+
+### T03 - Scale-aware generation planning
+
+```task
+id: IB-WP-0016-T03
+status: todo
+priority: high
+state_hub_task_id: "bee5c38a-f052-4edb-9313-b3a2ee5a6c26"
+```
+
+- Add compact plan output for long sources
+- Report estimated chunks, workflow stages, provider call count, prompt word or
+  token estimate, and rough cost inputs
+- Add CLI selection filters such as `--chapter`, `--chunk`, `--from-chapter`,
+  `--to-chapter`, `--max-calls`, and `--cost-cap`
+- Keep full prompt inspection available, but do not make it the default for
+  large corpora
+- Add tests proving plan output is compact and does not dump hundreds of prompts
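+
+A compact plan summary could be computed from chunk sizes alone, without
+rendering any prompt. The sketch below is illustrative only: `PlanSummary` and
+`summarize_plan` are hypothetical names, and it assumes one provider call per
+chunk per workflow stage, which the real workflow may not follow.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class PlanSummary:
+    chunks: int
+    stages: int
+    calls: int
+    prompt_words: int
+
+
+def summarize_plan(chunk_word_counts, stages, max_calls=None):
+    """Summarize a run from chunk word counts without building prompts."""
+    # Assumption: every stage makes one provider call per chunk.
+    calls = len(chunk_word_counts) * len(stages)
+    if max_calls is not None and calls > max_calls:
+        raise ValueError(f"plan needs {calls} calls, cap is {max_calls}")
+    # Assumption: each stage resends the chunk text, so words scale by stages.
+    prompt_words = sum(chunk_word_counts) * len(stages)
+    return PlanSummary(len(chunk_word_counts), len(stages), calls, prompt_words)
+```
+
+Under these assumptions, the 155-chunk baseline with four stages would report
+620 calls in one line of output instead of 620 rendered prompts.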
+
+### T04 - Trading-literature profile
+
+```task
+id: IB-WP-0016-T04
+status: todo
+priority: medium
+state_hub_task_id: "1a1b8fde-773f-46a6-887a-3c87a425d7a3"
+```
+
+- Add or specialize a profile for trading memoir and market-structure texts
+- Tune entity prompts for traders, markets, strategies, errors, psychological
+  patterns, institutions, instruments, and evidence-bearing claims
+- Tune relation prompts for cause/effect, lesson/evidence, risk/mitigation,
+  actor/venue, and strategy/outcome links
+- Tune evaluation criteria for groundedness, lesson clarity, historical context,
+  and overgeneralization risk
+- Keep the generic profile usable for non-trading books
+
+### T05 - Deterministic Lefevre acceptance fixture
+
+```task
+id: IB-WP-0016-T05
+status: todo
+priority: high
+state_hub_task_id: "c9bbc84e-691b-4530-a79a-6ecfa9c41fdd"
+```
+
+- Add a small checked-in EPUB-like or extracted chapter fixture derived from
+  public-domain Lefevre structure
+- Add deterministic fixture responses for source summary, entity extraction,
+  relation extraction, and evaluation
+- Prove the fixture generates a manifest-backed infospace with stable source,
+  entity, relation, evaluation, metrics, history, and report artifacts
+- Include a regression test for excluding Gutenberg boilerplate when requested
+
+### T06 - OpenRouter live-run guardrails
+
+```task
+id: IB-WP-0016-T06
+status: todo
+priority: high
+state_hub_task_id: "c6bf97c3-1c2c-4993-8f4f-97a48e01cce2"
+```
+
+- Add an optional live smoke test path that is skipped unless credentials and an
+  explicit opt-in environment variable are present
+- Support a one-chapter OpenRouter run with selected model, bounded retries,
+  cost/call cap, provider metadata, and resume
+- Record provider model, request IDs, timing, usage, and retry counts in run
+  records and generated artifact provenance
+- Document how to run the smoke test safely and how to stop before a full-book build
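+
+The skip-by-default behavior in the first bullet could look like the sketch
+below. Both environment variable names, `INFOSPACE_BENCH_LIVE` and
+`OPENROUTER_API_KEY`, are assumptions chosen for illustration, and the test
+body is a placeholder.
+
+```python
+import os
+import unittest
+
+
+def live_run_enabled(env=None):
+    """Return True only when the opt-in flag and a credential are both set."""
+    env = os.environ if env is None else env
+    # Hypothetical variable names; the project may choose different ones.
+    opted_in = env.get("INFOSPACE_BENCH_LIVE") == "1"
+    has_key = bool(env.get("OPENROUTER_API_KEY"))
+    return opted_in and has_key
+
+
+class LiveSmokeTest(unittest.TestCase):
+    @unittest.skipUnless(live_run_enabled(), "live OpenRouter run not opted in")
+    def test_one_chapter_smoke(self):
+        # Placeholder: a real test would run one chapter under a call cap.
+        self.assertTrue(live_run_enabled())
+```
+
+Requiring both the flag and the key means a developer with credentials in
+their shell still cannot trigger a paid run by accident.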
+
+### T07 - Example output and review policy
+
+```task
+id: IB-WP-0016-T07
+status: todo
+priority: medium
+state_hub_task_id: "5ff1f11e-49ad-4c2d-bd4c-b8cc261309bc"
+```
+
+- Define where generated Lefevre outputs live
+- Decide what is committed, what remains disposable, and what needs human review
+- Add a review checklist for duplicate entities, relation endpoints, weak
+  evidence, and over-broad trading lessons
+- Add a final readiness report before generating the full book
+
+## Acceptance
+
+- Current local EPUB can be inspected as EPUB3 with metadata and ordered body
+  sections
+- `generate init` can import the book as body-only ordered chapter chunks
+- Chunk titles and IDs are stable, readable, and not dominated by Project
+  Gutenberg boilerplate
+- `generate plan` gives compact cost/call planning for the full book
+- A deterministic Lefevre-style fixture generates a complete infospace without
+  network access
+- Optional one-chapter OpenRouter smoke run is explicit, bounded, resumable, and
+  skipped by default
+- A full-book run has documented review and output policy before execution
+