From 9d1a2088aabffeae25a8972dd23bdae1b99edb39 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Thu, 14 May 2026 22:05:10 +0200
Subject: [PATCH] Workplan for practical example

---
 README.md                                     |   1 +
 docs/lefevre-epub3-validation.md              |  70 ++++++
 ...-0016-lefevre-ebook-infospace-readiness.md | 214 ++++++++++++++++++
 3 files changed, 285 insertions(+)
 create mode 100644 docs/lefevre-epub3-validation.md
 create mode 100644 workplans/IB-WP-0016-lefevre-ebook-infospace-readiness.md

diff --git a/README.md b/README.md
index 3b4671d..5202057 100644
--- a/README.md
+++ b/README.md
@@ -32,6 +32,7 @@ Start with:
 - `docs/replacement-readiness-decision.md`
 - `docs/wealth-vsm-generation-pipeline.md`
 - `docs/generic-source-generator.md`
+- `docs/lefevre-epub3-validation.md`
 - `infospaces/bootstrap-pilot/`
 - `infospaces/wealth-vsm-legacy-slice/`
 - `infospaces/wealth-vsm-generation-pilot/`
diff --git a/docs/lefevre-epub3-validation.md b/docs/lefevre-epub3-validation.md
new file mode 100644
index 0000000..056dcf9
--- /dev/null
+++ b/docs/lefevre-epub3-validation.md
@@ -0,0 +1,70 @@
+# Lefevre EPUB3 Validation
+
+Date: 2026-05-14
+
+## Source
+
+Local source file:
+
+`/mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub`
+
+The EPUB is Project Gutenberg edition 60979, EPUB package version 3.0. The OPF
+metadata identifies:
+
+- title: `Reminiscences of a Stock Operator`
+- creator: `Edwin Lefevre`
+- subjects: `Speculation`, `New York Stock Exchange`, `Investments`
+- rights: public domain in the USA
+
+## Current Infrastructure Result
+
+The current generic generator can initialize a disposable infospace from the
+file and run non-provider metrics:
+
+- disposable root:
+  `/tmp/infospace-bench-lefevre-583mopy_/infospaces/reminiscences-stock-operator`
+- source chunks: 155
+- entities: 0
+- relations: 0
+- evaluations: 0
+- stale status: false
+- metrics snapshot: `5978ece0`
+
+The source-only metrics were:
+
+- redundancy ratio: `0.9225806451612903`
+- coverage ratio: `1.0`
+- coherence components: `155.0`
+- consistency cycles: `0.0`
+- granularity entropy: `-0.0`
+
+## Findings
+
+The EPUB intake works mechanically, but it is not ready for a serious full-book
+OpenRouter generation run.
+
+- EPUB spine order is visible in `OEBPS/content.opf`, but current intake reads
+  XHTML files by archive-name sorting.
+- Current titles mostly collapse to the same long Gutenberg page title instead
+  of chapter labels such as `I`, `II`, and `III`.
+- Current intake includes non-body material such as cover/header/footer/license
+  candidates unless the caller manually filters after import.
+- `generate plan` is not yet a compact cost/risk plan for a long book; a full
+  all-stage run would imply hundreds of provider calls.
+- Resume state is run-level enough for the small generic path, but a long ebook
+  needs chunk-level retry, stale, and skip policy.
+- Cross-chunk entity deduplication and merge/review policy are needed before a
+  full narrative book becomes a coherent infospace.
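
The gap between spine order and archive-name sorting can be illustrated with a minimal Python sketch. It is not the current `infospace-bench` intake code: the function name `spine_order` is hypothetical, and the sketch assumes a well-formed EPUB whose manifest `href` values are plain relative paths (no percent-encoding, no fragments).

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# XML namespaces used by the OCF container file and the OPF package document.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_order(epub: zipfile.ZipFile) -> List[str]:
    """Return archive paths of the spine documents in reading order.

    Hypothetical helper: assumes a well-formed EPUB with plain relative hrefs.
    """
    # Step 1: META-INF/container.xml points at the package document (OPF).
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    opf_path = rootfile.attrib["full-path"]
    opf_dir = posixpath.dirname(opf_path)

    # Step 2: the manifest maps item ids to hrefs relative to the OPF file.
    package = ET.fromstring(epub.read(opf_path))
    hrefs = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }

    # Step 3: the spine lists item ids in reading order; resolve each to a path.
    return [
        posixpath.normpath(posixpath.join(opf_dir, hrefs[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]
```

Comparing this list with `sorted(epub.namelist())` on the same archive would show where archive-name sorting and reading order diverge.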
+
+## Desired Readiness Bar
+
+Before building the real Lefevre infospace with OpenRouter, the CLI should be
+able to show:
+
+- book metadata and selected source sections
+- body-only chapter order
+- stable chapter/chunk IDs
+- estimated provider call count and token/cost budget
+- selected chapter or chunk filters for smoke runs
+- deterministic fixture acceptance on a small Lefevre-like subset
+- optional live one-chapter smoke run with explicit provider/model/cost caps
diff --git a/workplans/IB-WP-0016-lefevre-ebook-infospace-readiness.md b/workplans/IB-WP-0016-lefevre-ebook-infospace-readiness.md
new file mode 100644
index 0000000..aa245a6
--- /dev/null
+++ b/workplans/IB-WP-0016-lefevre-ebook-infospace-readiness.md
@@ -0,0 +1,214 @@
+---
+id: IB-WP-0016
+type: workplan
+title: "Lefevre EPUB3 Infospace Readiness Pilot"
+domain: markitect
+repo: infospace-bench
+status: active
+owner: markitect
+topic_slug: markitect
+created: "2026-05-14"
+updated: "2026-05-14"
+state_hub_workstream_slug: "ib-wp-0016-lefevre-ebook-infospace-readiness"
+state_hub_workstream_id: "23be7d20-b01f-4b17-9851-4d540e4c0984"
+depends_on_workplans:
+  - IB-WP-0015
+related_workplans:
+  - IB-WP-0014
+---
+
+# IB-WP-0016 - Lefevre EPUB3 Infospace Readiness Pilot
+
+## Goal
+
+Use Edwin Lefevre's `Reminiscences of a Stock Operator` EPUB3 as the next real
+ebook example for `infospace-bench`, and close the gaps that prevent a serious
+OpenRouter-backed full-book infospace build.
+
+This workplan should leave us able to run a bounded command like:
+
+```bash
+infospace-bench generate from-source \
+  /mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub \
+  --slug reminiscences-stock-operator \
+  --name "Reminiscences of a Stock Operator" \
+  --profile trading-literature \
+  --provider openrouter \
+  --model \
+  --chapter I \
+  --cost-cap \
+  --apply
+```
+
+and then scale from one reviewed chapter to the full book without losing
+provenance, reviewability, or cost control.
+
+## Validation Baseline
+
+Validation note: `docs/lefevre-epub3-validation.md`.
+
+Current WP-0015 infrastructure can initialize the local EPUB and run
+source-only metrics in a disposable workspace:
+
+- source chunks: 155
+- entity count: 0
+- relation count: 0
+- evaluation count: 0
+- source-only metrics history can be written without provider calls
+
+The run proves the basic intake path works, but also shows why a live all-book
+run should wait:
+
+- most generated chunk titles collapse to the same Gutenberg page title
+- EPUB spine/chapter metadata is not yet honored deeply enough
+- archive-order sorting risks confusing reading order
+- non-body sections such as cover/header/footer/license need explicit policy
+- plan output is too prompt-heavy for cost review on a 155-chunk book
+- long-book resume needs chunk-level state, not only whole-run skip
+- generated entities need cross-chunk dedupe/merge policy
+
+## Non-Goals
+
+- Do not commit a full generated Lefevre infospace before review.
+- Do not make live OpenRouter calls in the default test suite.
+- Do not store API keys or provider secrets in the infospace.
+- Do not build a general-purpose EPUB conversion suite beyond what the
+  infospace generator needs.
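
One way to keep live OpenRouter calls out of the default test suite is a double opt-in check, sketched below in Python. The environment variable names `INFOSPACE_BENCH_LIVE_SMOKE` and `OPENROUTER_API_KEY` and the helper `live_smoke_enabled` are assumptions made for illustration; the workplan does not name them.

```python
import os
from typing import Mapping, Optional


def live_smoke_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when a live provider run is explicitly requested.

    Both conditions must hold (variable names are hypothetical):
    - INFOSPACE_BENCH_LIVE_SMOKE is set to "1" (explicit opt-in)
    - OPENROUTER_API_KEY is non-empty (credentials come from the
      environment and are never written into the infospace)
    """
    env = os.environ if env is None else env
    opted_in = env.get("INFOSPACE_BENCH_LIVE_SMOKE") == "1"
    has_key = bool(env.get("OPENROUTER_API_KEY"))
    return opted_in and has_key
```

If the suite uses pytest, a live test could be marked with `pytest.mark.skipif(not live_smoke_enabled(), reason="live smoke run not enabled")`, so a plain test run never reaches the provider.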
+
+## Tasks
+
+### T01 - Spine-aware EPUB3 intake
+
+```task
+id: IB-WP-0016-T01
+status: todo
+priority: high
+state_hub_task_id: "a672fcf9-1b80-4faf-b16d-84ca52601dc9"
+```
+
+- Parse `META-INF/container.xml` to find the package document
+- Parse OPF metadata, manifest, and spine
+- Follow spine reading order instead of archive-name sorting
+- Preserve book title, creator, source URL, subjects, language, rights, and
+  modified timestamp in source provenance
+- Exclude or tag cover, nav, table-of-contents, Project Gutenberg header,
+  transcriber notes, and license/footer material by explicit policy
+- Add tests using a small EPUB3 fixture with nav, cover, body, notes, and footer
+
+### T02 - Chapter-aware chunking and IDs
+
+```task
+id: IB-WP-0016-T02
+status: todo
+priority: high
+state_hub_task_id: "47de1110-36d0-4d63-bf87-389746509e03"
+```
+
+- Resolve chapter labels from EPUB nav entries and in-document headings
+- Generate stable IDs like `chapter-01`, `chapter-01-part-002`, not repeated
+  Gutenberg document titles
+- Chunk within chapter boundaries with a configurable word limit
+- Consider overlap or evidence-window context without duplicating headings
+- Preserve page anchors where available as optional provenance
+- Add tests showing `Reminiscences`-style roman numeral chapters become stable
+  ordered source chunks
+
+### T03 - Scale-aware generation planning
+
+```task
+id: IB-WP-0016-T03
+status: todo
+priority: high
+state_hub_task_id: "bee5c38a-f052-4edb-9313-b3a2ee5a6c26"
+```
+
+- Add compact plan output for long sources
+- Report estimated chunks, workflow stages, provider call count, prompt word or
+  token estimate, and rough cost inputs
+- Add CLI selection filters such as `--chapter`, `--chunk`, `--from-chapter`,
+  `--to-chapter`, `--max-calls`, and `--cost-cap`
+- Keep full prompt inspection available, but do not make it the default for
+  large corpora
+- Add tests proving plan output is compact and does not dump hundreds of prompts
+
+### T04 - Trading-literature profile
+
+```task
+id: IB-WP-0016-T04
+status: todo
+priority: medium
+state_hub_task_id: "1a1b8fde-773f-46a6-887a-3c87a425d7a3"
+```
+
+- Add or specialize a profile for trading memoir and market-structure texts
+- Tune entity prompts for traders, markets, strategies, errors, psychological
+  patterns, institutions, instruments, and evidence-bearing claims
+- Tune relation prompts for cause/effect, lesson/evidence, risk/mitigation,
+  actor/venue, and strategy/outcome links
+- Tune evaluation criteria for groundedness, lesson clarity, historical context,
+  and overgeneralization risk
+- Keep the generic profile usable for non-trading books
+
+### T05 - Deterministic Lefevre acceptance fixture
+
+```task
+id: IB-WP-0016-T05
+status: todo
+priority: high
+state_hub_task_id: "c9bbc84e-691b-4530-a79a-6ecfa9c41fdd"
+```
+
+- Add a small checked-in EPUB-like or extracted chapter fixture derived from
+  public-domain Lefevre structure
+- Add deterministic fixture responses for source summary, entity extraction,
+  relation extraction, and evaluation
+- Prove the fixture generates a manifest-backed infospace with stable source,
+  entity, relation, evaluation, metrics, history, and report artifacts
+- Include a regression test for excluding Gutenberg boilerplate when requested
+
+### T06 - OpenRouter live-run guardrails
+
+```task
+id: IB-WP-0016-T06
+status: todo
+priority: high
+state_hub_task_id: "c6bf97c3-1c2c-4993-8f4f-97a48e01cce2"
+```
+
+- Add an optional live smoke test path that is skipped unless credentials and an
+  explicit opt-in environment variable are present
+- Support a one-chapter OpenRouter run with selected model, bounded retries,
+  cost/call cap, provider metadata, and resume
+- Record provider model, request IDs, timing, usage, and retry counts in run
+  records and generated artifact provenance
+- Document how to run the smoke safely and how to stop before a full-book build
+
+### T07 - Example output and review policy
+
+```task
+id: IB-WP-0016-T07
+status: todo
+priority: medium
+state_hub_task_id: "5ff1f11e-49ad-4c2d-bd4c-b8cc261309bc"
+```
+
+- Define where generated Lefevre outputs live
+- Decide what is committed, what remains disposable, and what needs human review
+- Add a review checklist for duplicate entities, relation endpoints, weak
+  evidence, and over-broad trading lessons
+- Add a final readiness report before generating the full book
+
+## Acceptance
+
+- Current local EPUB can be inspected as EPUB3 with metadata and ordered body
+  sections
+- `generate init` can import the book as body-only ordered chapter chunks
+- Chunk titles and IDs are stable, readable, and not dominated by Project
+  Gutenberg boilerplate
+- `generate plan` gives compact cost/call planning for the full book
+- A deterministic Lefevre-style fixture generates a complete infospace without
+  network access
+- Optional one-chapter OpenRouter smoke run is explicit, bounded, resumable, and
+  skipped by default
+- A full-book run has documented review and output policy before execution
+
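
The stable-ID criterion can be made concrete with a small Python sketch that maps roman numeral chapter labels to the `chapter-01` and `chapter-01-part-002` shapes named in T02. The function names `roman_to_int` and `chunk_id` are hypothetical, and the sketch assumes labels are well-formed roman numerals below 400; it does not reject malformed sequences such as `IIII`.

```python
import re
from typing import Optional

# Numeral values; enough for chapter labels below 400 (assumption).
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}
ROMAN_LABEL = re.compile(r"^[IVXLC]+$")


def roman_to_int(label: str) -> int:
    """Convert a roman numeral such as 'XXIV' to 24 (subtractive notation)."""
    values = [ROMAN_VALUES[ch] for ch in label]
    total = 0
    for i, value in enumerate(values):
        following = values[i + 1] if i + 1 < len(values) else 0
        # A smaller numeral before a larger one is subtracted (IV, IX, XL).
        total += -value if value < following else value
    return total


def chunk_id(chapter_label: str, part: Optional[int] = None) -> str:
    """Build a zero-padded ID so lexical order matches reading order."""
    label = chapter_label.strip().upper()
    if not ROMAN_LABEL.match(label):
        raise ValueError(f"not a roman numeral chapter label: {chapter_label!r}")
    base = f"chapter-{roman_to_int(label):02d}"
    return base if part is None else f"{base}-part-{part:03d}"
```

Zero-padding is the design choice that matters here: `chapter-02` sorts before `chapter-10` as plain strings, so IDs stay ordered without a separate sort key.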