---
id: IB-WP-0016
type: workplan
title: "Lefevre EPUB3 Infospace Readiness Pilot"
domain: markitect
repo: infospace-bench
status: active
owner: markitect
topic_slug: markitect
created: "2026-05-14"
updated: "2026-05-17"
state_hub_workstream_slug: "ib-wp-0016-lefevre-ebook-infospace-readiness"
state_hub_workstream_id: "23be7d20-b01f-4b17-9851-4d540e4c0984"
depends_on_workplans:
  - IB-WP-0015
related_workplans:
  - IB-WP-0014
---

# IB-WP-0016 - Lefevre EPUB3 Infospace Readiness Pilot

## Goal

Use Edwin Lefevre's `Reminiscences of a Stock Operator` EPUB3 as the next real
ebook example for `infospace-bench`, and close the gaps that prevent a serious
OpenRouter-backed full-book infospace build.

This workplan should leave us able to run a bounded command like:

```bash
infospace-bench generate from-source \
  /mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub \
  --slug reminiscences-stock-operator \
  --name "Reminiscences of a Stock Operator" \
  --profile trading-literature \
  --provider openrouter \
  --model <openrouter-model-id> \
  --chapter I \
  --cost-cap <cap> \
  --apply
```

and then scale from one reviewed chapter to the full book without losing
provenance, reviewability, or cost control.

## Validation Baseline

Validation note: `docs/lefevre-epub3-validation.md` (includes T01 and T02
result sections).

After T01 and T02, the local Lefevre EPUB is intake-ready:

- 67 body chunks at default `max_words=800`, all 24 roman-numeral chapters
  detected, stable IDs `chapter-01..chapter-24` with `-part-NNN` suffix
- Cover, PG header/footer, Contents, Transcriber's Notes, and license
  sections classified out of the body stream by default
- Per-chunk provenance carries full OPF book metadata, chapter label and
  number, page anchors, and spine index

### Smoke Run (2026-05-17)

A fixture-backed end-to-end smoke run with `--max-chunks 3` against the
real EPUB produced a complete infospace:

- 3 source chunks (`chapter-01-part-001..003`), 3 entities, 3 relations,
  3 evaluations, 1 generation-summary report
- All chapter/book/anchor provenance fields land in `artifacts/index.yaml`
  (verified: `chapter_label=I`, `chapter_number=1`,
  `page_anchors=[Page_1, Page_2, Page_3]` on the first chunk)
- Metrics viable: `coverage=1.0`, `redundancy=0.0`,
  `granularity_entropy=1.79`, viability gates pass
- Same-title entities returned by repeated stages were upserted into single
  artifact files, so basic dedupe works for exact-title matches

### Remaining Gaps

These are the gaps a serious full-book run still hits:

- No compact `plan` output for cost/call preview on a 67-chunk run: at ~5
  stages per chunk that is ~335 provider calls at default `max_words` (T03)
- No `--chapter`, `--from-chapter`, `--to-chapter`, `--cost-cap`, or
  `--max-calls` selection (T03)
- The generic profile produces sensible structure but does not push concepts
  toward traders, markets, lessons, or strategies (T04)
- The generation-summary report only shows counts and metrics; it should
  surface entity titles, chapter coverage, page-anchor links, and unmapped
  chunks for human review (T07)
- Long-book resume is still whole-run-skip, not chunk-level (T06)
- Near-duplicate entities across chunks (e.g. "Larry Livingston" vs "the
  narrator") need cross-chunk merge/dedupe policy before a 24-chapter run
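
The near-duplicate gap can be illustrated with a minimal sketch of title-keyed upsert. This is not the `infospace-bench` implementation: `title_key`, `upsert`, and the entity dict shape are hypothetical names chosen for illustration. It shows why exact-title matching merges repeated mentions but cannot tell that two different titles name the same person.

```python
import re


def title_key(title: str) -> str:
    # Hypothetical normalization: lowercase, collapse non-alphanumerics to "-".
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def upsert(store: dict[str, dict], entity: dict) -> None:
    # Entities with the same normalized title share one record.
    key = title_key(entity["title"])
    merged = store.setdefault(key, {"title": entity["title"], "source_chunks": []})
    for chunk_id in entity["source_chunks"]:
        if chunk_id not in merged["source_chunks"]:
            merged["source_chunks"].append(chunk_id)


store: dict[str, dict] = {}
upsert(store, {"title": "Larry Livingston", "source_chunks": ["chapter-01-part-001"]})
upsert(store, {"title": "Larry Livingston", "source_chunks": ["chapter-01-part-002"]})
upsert(store, {"title": "the narrator", "source_chunks": ["chapter-01-part-003"]})
# Two records remain: the exact-title repeats merged, the alias did not.
```

Closing the gap would need an explicit alias or merge policy on top of a key like this; which policy fits is still open.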

## Non-Goals

- Do not commit a full generated Lefevre infospace before review.
- Do not make live OpenRouter calls in the default test suite.
- Do not store API keys or provider secrets in the infospace.
- Do not build a general-purpose EPUB conversion suite beyond what the
  infospace generator needs.

## Tasks

### T01 - Spine-aware EPUB3 intake

```task
id: IB-WP-0016-T01
status: done
priority: high
state_hub_task_id: "a672fcf9-1b80-4faf-b16d-84ca52601dc9"
```

- Parse `META-INF/container.xml` to find the package document
- Parse OPF metadata, manifest, and spine
- Follow spine reading order instead of archive-name sorting
- Preserve book title, creator, source URL, subjects, language, rights, and
  modified timestamp in source provenance
- Exclude or tag cover, nav, table-of-contents, Project Gutenberg header,
  transcriber notes, and license/footer material by explicit policy
- Add tests using a small EPUB3 fixture with nav, cover, body, notes, and footer
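
As a rough illustration of the first three bullets, the sketch below resolves the package document through `META-INF/container.xml` and returns spine items in reading order, using only the standard library. The function names and the in-memory demo archive are assumptions for illustration, not the actual intake code.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(epub: zipfile.ZipFile) -> list[str]:
    """Return archive paths of spine documents in reading order."""
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    opf_path = rootfile.attrib["full-path"]

    package = ET.fromstring(epub.read(opf_path))
    # Manifest hrefs are relative to the directory that holds the OPF file.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.attrib["id"]: posixpath.normpath(posixpath.join(base, item.attrib["href"]))
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine, not the archive listing, defines reading order.
    return [
        manifest[ref.attrib["idref"]]
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]


def build_demo_epub() -> zipfile.ZipFile:
    """Build a tiny in-memory archive whose spine order differs from name order."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(
            "META-INF/container.xml",
            '<container xmlns="urn:oasis:names:tc:opendocument:xmlns:container" version="1.0">'
            '<rootfiles><rootfile full-path="OEBPS/content.opf" '
            'media-type="application/oebps-package+xml"/></rootfiles></container>',
        )
        zf.writestr(
            "OEBPS/content.opf",
            '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"><manifest>'
            '<item id="a" href="z-first.xhtml" media-type="application/xhtml+xml"/>'
            '<item id="b" href="a-second.xhtml" media-type="application/xhtml+xml"/>'
            '</manifest><spine><itemref idref="a"/><itemref idref="b"/></spine></package>',
        )
    return zipfile.ZipFile(io.BytesIO(buf.getvalue()))


print(spine_hrefs(build_demo_epub()))
# ['OEBPS/z-first.xhtml', 'OEBPS/a-second.xhtml']
```

Sorting by archive name would have put `a-second.xhtml` first, which is the failure mode spine-aware intake avoids.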

### T02 - Chapter-aware chunking and IDs

```task
id: IB-WP-0016-T02
status: done
priority: high
state_hub_task_id: "47de1110-36d0-4d63-bf87-389746509e03"
```

- Resolve chapter labels from EPUB nav entries and in-document headings
- Generate stable IDs like `chapter-01`, `chapter-01-part-002`, not repeated
  Gutenberg document titles
- Chunk within chapter boundaries with a configurable word limit
- Consider overlap or evidence-window context without duplicating headings
- Preserve page anchors where available as optional provenance
- Add tests showing `Reminiscences`-style roman numeral chapters become stable
  ordered source chunks
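
The ID scheme can be sketched as follows. This is a simplified illustration under stated assumptions: it splits purely on word count (the real chunker may respect paragraph boundaries), it always appends the `-part-NNN` suffix, and `roman_to_int` and `chunk_chapter` are hypothetical helper names.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(label: str) -> int:
    """Convert a chapter label such as 'XXIV' to 24."""
    values = [ROMAN[ch] for ch in label.strip().upper()]
    total = 0
    for i, value in enumerate(values):
        # Subtractive notation: a smaller numeral before a larger one is subtracted.
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def chunk_chapter(label: str, text: str, max_words: int = 800) -> list[tuple[str, str]]:
    """Split one chapter into (chunk_id, chunk_text) pairs within its own boundary."""
    number = roman_to_int(label)
    words = text.split()
    chunks = []
    for part, start in enumerate(range(0, len(words), max_words), start=1):
        chunk_id = f"chapter-{number:02d}-part-{part:03d}"
        chunks.append((chunk_id, " ".join(words[start:start + max_words])))
    return chunks


ids = [chunk_id for chunk_id, _ in chunk_chapter("I", "word " * 1700)]
print(ids)
# ['chapter-01-part-001', 'chapter-01-part-002', 'chapter-01-part-003']
```

Zero-padding the chapter and part numbers keeps lexical sort order equal to reading order, which is what makes the IDs stable across runs.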

### T03 - Scale-aware generation planning

```task
id: IB-WP-0016-T03
status: in_progress
priority: high
state_hub_task_id: "bee5c38a-f052-4edb-9313-b3a2ee5a6c26"
```

- Add compact plan output for long sources
- Report estimated chunks, workflow stages, provider call count, prompt word or
  token estimate, and rough cost inputs
- Add CLI selection filters such as `--chapter`, `--chunk`, `--from-chapter`,
  `--to-chapter`, `--max-calls`, and `--cost-cap`
- Keep full prompt inspection available, but do not make it the default for
  large corpora
- Add tests proving plan output is compact and does not dump hundreds of prompts
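
A compact plan could look roughly like the sketch below. `PlanEstimate`, `estimate_plan`, and `within_call_cap` are hypothetical names, and the prompt-word figure assumes every stage resends the full chunk text once; real cost inputs would depend on the prompts and the model's pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    chunks: int
    stages_per_chunk: int
    prompt_words: int

    @property
    def provider_calls(self) -> int:
        # One provider call per (chunk, stage) pair.
        return self.chunks * self.stages_per_chunk

    def summary(self) -> str:
        # One line regardless of corpus size; prompts are never dumped here.
        return (
            f"chunks={self.chunks} stages={self.stages_per_chunk} "
            f"calls={self.provider_calls} prompt_words~{self.prompt_words}"
        )


def estimate_plan(chunk_word_counts: list[int], stages_per_chunk: int = 5) -> PlanEstimate:
    # Assumption: each stage sends the chunk text once, with no prompt overhead.
    prompt_words = sum(chunk_word_counts) * stages_per_chunk
    return PlanEstimate(len(chunk_word_counts), stages_per_chunk, prompt_words)


def within_call_cap(plan: PlanEstimate, max_calls: int) -> bool:
    # A run should refuse to start when the plan exceeds the cap.
    return plan.provider_calls <= max_calls


# Upper bound for the Lefevre book: 67 chunks of at most 800 words each.
plan = estimate_plan([800] * 67)
print(plan.summary())
# chunks=67 stages=5 calls=335 prompt_words~268000
```

Checking the cap against the plan before any call is made keeps `--max-calls` a hard precondition rather than a mid-run surprise.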

### T04 - Trading-literature profile

```task
id: IB-WP-0016-T04
status: todo
priority: medium
state_hub_task_id: "1a1b8fde-773f-46a6-887a-3c87a425d7a3"
```

- Add or specialize a profile for trading memoir and market-structure texts
- Tune entity prompts for traders, markets, strategies, errors, psychological
  patterns, institutions, instruments, and evidence-bearing claims
- Tune relation prompts for cause/effect, lesson/evidence, risk/mitigation,
  actor/venue, and strategy/outcome links
- Tune evaluation criteria for groundedness, lesson clarity, historical context,
  and overgeneralization risk
- Keep the generic profile usable for non-trading books

### T05 - Deterministic Lefevre acceptance fixture

```task
id: IB-WP-0016-T05
status: todo
priority: high
state_hub_task_id: "c9bbc84e-691b-4530-a79a-6ecfa9c41fdd"
```

- Add a small checked-in fixture, either EPUB-like or an extracted chapter,
  derived from the public-domain Lefevre structure
- Add deterministic fixture responses for source summary, entity extraction,
  relation extraction, and evaluation
- Prove the fixture generates a manifest-backed infospace with stable source,
  entity, relation, evaluation, metrics, history, and report artifacts
- Include a regression test for excluding Gutenberg boilerplate when requested
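
One way to get deterministic responses is a provider stand-in that serves canned payloads keyed by stage and chunk ID. The sketch below is an assumption about shape only: `FixtureProvider`, the stage names, and the response payloads are hypothetical and do not describe the existing provider interface.

```python
class FixtureProvider:
    """Serves canned responses keyed by (stage, chunk_id); never touches the network."""

    def __init__(self, responses: dict[tuple[str, str], dict]) -> None:
        self._responses = dict(responses)
        self.calls: list[tuple[str, str]] = []

    def complete(self, stage: str, chunk_id: str) -> dict:
        key = (stage, chunk_id)
        self.calls.append(key)
        if key not in self._responses:
            # Fail loudly so a missing fixture cannot silently fall back to a live call.
            raise LookupError(f"no fixture response for {key}")
        return self._responses[key]


provider = FixtureProvider(
    {
        ("entity_extraction", "chapter-01-part-001"): {
            "entities": [{"title": "Larry Livingston"}]
        },
    }
)
result = provider.complete("entity_extraction", "chapter-01-part-001")
```

Recording `calls` lets an acceptance test assert both the generated artifacts and the exact sequence of stages that produced them.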

### T06 - OpenRouter live-run guardrails

```task
id: IB-WP-0016-T06
status: todo
priority: high
state_hub_task_id: "c6bf97c3-1c2c-4993-8f4f-97a48e01cce2"
```

- Add an optional live smoke test path that is skipped unless credentials and an
  explicit opt-in environment variable are present
- Support a one-chapter OpenRouter run with selected model, bounded retries,
  cost/call cap, provider metadata, and resume
- Record provider model, request IDs, timing, usage, and retry counts in run
  records and generated artifact provenance
- Document how to run the smoke test safely and how to stop before a full-book
  build
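
The opt-in gate and call cap might be sketched like this. Both environment variable names (`OPENROUTER_API_KEY`, `INFOSPACE_BENCH_LIVE`) and the `CallBudget` class are assumptions for illustration; the project may settle on different names.

```python
import os
import unittest
from collections.abc import Mapping


def live_smoke_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Both conditions are required: a credential alone must not trigger live calls.
    has_key = bool(env.get("OPENROUTER_API_KEY"))       # assumed variable name
    opted_in = env.get("INFOSPACE_BENCH_LIVE") == "1"   # hypothetical opt-in flag
    return has_key and opted_in


class BudgetExceeded(RuntimeError):
    pass


class CallBudget:
    """Hard cap on provider calls; check before each request, not after."""

    def __init__(self, max_calls: int) -> None:
        self.max_calls = max_calls
        self.used = 0

    def spend(self) -> None:
        if self.used >= self.max_calls:
            raise BudgetExceeded(f"call cap of {self.max_calls} reached")
        self.used += 1


@unittest.skipUnless(live_smoke_enabled(), "live OpenRouter smoke test not enabled")
class LiveSmokeTest(unittest.TestCase):
    def test_one_chapter_stays_within_budget(self) -> None:
        budget = CallBudget(max_calls=20)
        budget.spend()  # placeholder for one real provider call
        self.assertLessEqual(budget.used, budget.max_calls)
```

With neither variable set, the test class is reported as skipped, so the default suite stays offline.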

### T07 - Example output and review policy

```task
id: IB-WP-0016-T07
status: todo
priority: medium
state_hub_task_id: "5ff1f11e-49ad-4c2d-bd4c-b8cc261309bc"
```

- Define where generated Lefevre outputs live
- Decide what is committed, what remains disposable, and what needs human review
- Add a review checklist for duplicate entities, relation endpoints, weak
  evidence, and over-broad trading lessons
- Add a final readiness report before generating the full book
- Enrich `reports/generation-summary.md` beyond counts and metrics: list
  entity titles, per-chapter coverage, page-anchor links, and any unmapped
  source chunks (gap found in the 2026-05-17 smoke run)
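
The enriched report could be rendered along these lines. The dict shapes (`id`, `chapter_number`, `title`, `source_chunks`) and the function name are hypothetical; this is a sketch of the review-oriented sections, not the current report generator.

```python
def render_review_sections(chunks: list[dict], entities: list[dict]) -> str:
    """Render entity titles, per-chapter coverage, and unmapped chunks as Markdown."""
    mapped = {cid for entity in entities for cid in entity["source_chunks"]}

    by_chapter: dict[int, list[str]] = {}
    for chunk in chunks:
        by_chapter.setdefault(chunk["chapter_number"], []).append(chunk["id"])

    lines = ["## Entities", ""]
    for entity in sorted(entities, key=lambda e: e["title"]):
        lines.append(f"- {entity['title']} ({', '.join(entity['source_chunks'])})")

    lines += ["", "## Chapter coverage", ""]
    for chapter in sorted(by_chapter):
        ids = by_chapter[chapter]
        covered = sum(1 for cid in ids if cid in mapped)
        lines.append(f"- Chapter {chapter}: {covered}/{len(ids)} chunks mapped")

    lines += ["", "## Unmapped chunks", ""]
    unmapped = [chunk["id"] for chunk in chunks if chunk["id"] not in mapped]
    lines += [f"- {cid}" for cid in unmapped] or ["- none"]
    return "\n".join(lines)


demo_chunks = [
    {"id": "chapter-01-part-001", "chapter_number": 1},
    {"id": "chapter-01-part-002", "chapter_number": 1},
]
demo_entities = [{"title": "Larry Livingston", "source_chunks": ["chapter-01-part-001"]}]
report = render_review_sections(demo_chunks, demo_entities)
```

Listing unmapped chunks explicitly gives reviewers a direct worklist instead of leaving them to infer gaps from a coverage number.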

## Acceptance

- The current local EPUB can be inspected as EPUB3 with metadata and ordered
  body sections
- `generate init` can import the book as body-only ordered chapter chunks
- Chunk titles and IDs are stable, readable, and not dominated by Project
  Gutenberg boilerplate
- `generate plan` gives compact cost/call planning for the full book
- A deterministic Lefevre-style fixture generates a complete infospace without
  network access
- Optional one-chapter OpenRouter smoke run is explicit, bounded, resumable, and
  skipped by default
- A full-book run has documented review and output policy before execution