infospace-bench/workplans/IB-WP-0016-lefevre-ebook-infospace-readiness.md
tegwick 1d62dffae9 IB-WP-0016-T07: review report and output policy; close IB-WP-0016
Enrich reports/generation-summary.md with the review-oriented sections
that the 2026-05-17 smoke run flagged as missing: ## Chapter coverage
(per-chapter source/entity/relation/anchor counts), ## Entities (the
deduped title list), ## Unmapped source chunks (sources with no
downstream generated artifact), and ## Page anchors (total plus
deterministic sample). Sections are conditional on data being present
so generic non-Lefevre runs stay terse.

Add docs/lefevre-readiness.md as the final sign-off document for
IB-WP-0016: what is wired (T01-T06 recap), an output policy table
(checked-in fixture sources vs disposable generated infospaces vs
archive targets), a seven-item reviewer checklist (duplicate entities,
relation endpoints, weak evidence, overgeneralization, anchor
coverage, unmapped sources, plan-vs-actual variance), a scale-up plan
from one-chapter to full-book, and the load-bearing risks still
outstanding (cross-chunk dedup, whole-run resume, adaptive routing
deferred to LLM-WP-0004 / IB-WP-0018, rate-table drift).

Closes IB-WP-0016 (Lefevre EPUB3 Infospace Readiness Pilot): T01-T07
all done; the workplan is set to status=done.

131 tests pass, 1 skipped (live OpenRouter smoke, correctly gated).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 01:22:41 +02:00

id: IB-WP-0016
type: workplan
title: Lefevre EPUB3 Infospace Readiness Pilot
domain: markitect
repo: infospace-bench
status: done
owner: markitect
topic_slug: markitect
created: 2026-05-14
updated: 2026-05-17
state_hub_workstream_slug: ib-wp-0016-lefevre-ebook-infospace-readiness
state_hub_workstream_id: 23be7d20-b01f-4b17-9851-4d540e4c0984
depends_on_workplans: IB-WP-0015
related_workplans: IB-WP-0014

IB-WP-0016 - Lefevre EPUB3 Infospace Readiness Pilot

Goal

Use Edwin Lefevre's Reminiscences of a Stock Operator EPUB3 as the next real ebook example for infospace-bench, and close the gaps that prevent a serious OpenRouter-backed full-book infospace build.

This workplan should leave us able to run a bounded command like:

infospace-bench generate from-source \
  /mnt/c/Users/bernd.worsch/Downloads/LefevreEdwin-ReminiscencesOfAStockOperator.epub \
  --slug reminiscences-stock-operator \
  --name "Reminiscences of a Stock Operator" \
  --profile trading-literature \
  --provider openrouter \
  --model <openrouter-model-id> \
  --chapter I \
  --cost-cap <cap> \
  --apply

and then scale from one reviewed chapter to the full book without losing provenance, reviewability, or cost control.

Validation Baseline

Validation note: docs/lefevre-epub3-validation.md (includes T01 and T02 result sections).

After T01 and T02, the local Lefevre EPUB is intake-ready:

  • 67 body chunks at default max_words=800, all 24 roman-numeral chapters detected, stable IDs chapter-01..chapter-24 with -part-NNN suffix
  • Cover, PG header/footer, Contents, Transcriber's Notes, and license sections classified out of the body stream by default
  • Per-chunk provenance carries full OPF book metadata, chapter label and number, page anchors, and spine index

Smoke Run (2026-05-17)

A fixture-backed end-to-end smoke run with --max-chunks 3 against the real EPUB produced a complete infospace:

  • 3 source chunks (chapter-01-part-001..003), 3 entities, 3 relations, 3 evaluations, 1 generation-summary report
  • All chapter/book/anchor provenance fields land in artifacts/index.yaml (verified: chapter_label=I, chapter_number=1, page_anchors=[Page_1, Page_2, Page_3] on the first chunk)
  • Metrics viable: coverage=1.0, redundancy=0.0, granularity_entropy=1.79, viability gates pass
  • Same-title entities returned by repeated stages were upserted to single artifact files — basic dedupe works for exact-title matches

Remaining Gaps

These are the gaps a serious full-book run still hits:

  • No compact plan output for cost/call preview on a 67-chunk run (~5 stages per chunk = ~335 provider calls at default max_words) — T03
  • No --chapter, --from-chapter, --to-chapter, --cost-cap, or --max-calls selection — T03
  • Generic profile produces sensible structure but does not push concepts toward traders, markets, lessons, or strategies — T04
  • The generation-summary report only shows counts and metrics; it should surface entity titles, chapter coverage, page-anchor links, and unmapped chunks for human review — T07
  • Long-book resume is still whole-run-skip, not chunk-level — T06
  • Near-duplicate entities across chunks (e.g. "Larry Livingston" vs "the narrator") need cross-chunk merge/dedupe policy before a 24-chapter run
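
The difference between exact-title upsert (which the smoke run confirmed) and cross-chunk near-duplicate merging (still open) can be illustrated with a minimal sketch. The `upsert_entity` function, the dict-based store, and the `aliases` table are all hypothetical names invented for this illustration; an alias table is only one possible merge policy and is not the project's decided approach.

```python
from typing import Optional


def upsert_entity(
    store: dict,
    entity: dict,
    aliases: Optional[dict] = None,
) -> str:
    """Insert or merge an entity keyed by its title and return the key used.

    `aliases` is a hypothetical, hand-reviewed map from a variant title to a
    canonical title. Without it, only exact-title matches are merged.
    """
    title = entity["title"]
    key = (aliases or {}).get(title, title)
    # Later fields overwrite earlier ones; the canonical title always wins.
    store[key] = {**store.get(key, {}), **entity, "title": key}
    return key


store = {}
upsert_entity(store, {"title": "Larry Livingston", "chunk": "chapter-01-part-001"})
upsert_entity(store, {"title": "Larry Livingston", "chunk": "chapter-01-part-002"})
upsert_entity(store, {"title": "the narrator", "chunk": "chapter-01-part-003"})
# Exact titles collapse into one record; the near-duplicate stays separate.
print(sorted(store))
```

Passing `aliases={"the narrator": "Larry Livingston"}` would fold the third record into the first, which is the kind of decision a reviewer would need to sign off on before a 24-chapter run.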

Non-Goals

  • Do not commit a full generated Lefevre infospace before review.
  • Do not make live OpenRouter calls in the default test suite.
  • Do not store API keys or provider secrets in the infospace.
  • Do not build a general-purpose EPUB conversion suite beyond what the infospace generator needs.

Tasks

T01 - Spine-aware EPUB3 intake

id: IB-WP-0016-T01
status: done
priority: high
state_hub_task_id: "a672fcf9-1b80-4faf-b16d-84ca52601dc9"
  • Parse META-INF/container.xml to find the package document
  • Parse OPF metadata, manifest, and spine
  • Follow spine reading order instead of archive-name sorting
  • Preserve book title, creator, source URL, subjects, language, rights, and modified timestamp in source provenance
  • Exclude or tag cover, nav, table-of-contents, Project Gutenberg header, transcriber notes, and license/footer material by explicit policy
  • Add tests using a small EPUB3 fixture with nav, cover, body, notes, and footer
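
The first three steps above (container lookup, OPF parsing, spine-order traversal) can be sketched with the standard library alone. This is an illustrative sketch, not the infospace-bench implementation: `spine_hrefs` is a hypothetical helper name, and the sketch does no error handling or section classification. The two namespace URIs are the standard OCF container and OPF package namespaces.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_hrefs(epub: zipfile.ZipFile) -> list:
    """Return content-document paths in spine (reading) order."""
    # 1. container.xml names the package document (the OPF file).
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    opf_path = container.find("c:rootfiles/c:rootfile", NS).attrib["full-path"]

    # 2. The manifest maps item ids to hrefs relative to the OPF location.
    package = ET.fromstring(epub.read(opf_path))
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }

    # 3. The spine lists item ids in reading order; archive names are ignored.
    base = posixpath.dirname(opf_path)
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

Called as `spine_hrefs(zipfile.ZipFile(path))`, this returns archive paths that can be read in order, which is what makes the result independent of how the files happen to be named inside the archive.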

T02 - Chapter-aware chunking and IDs

id: IB-WP-0016-T02
status: done
priority: high
state_hub_task_id: "47de1110-36d0-4d63-bf87-389746509e03"
  • Resolve chapter labels from EPUB nav entries and in-document headings
  • Generate stable IDs like chapter-01, chapter-01-part-002, not repeated Gutenberg document titles
  • Chunk within chapter boundaries with a configurable word limit
  • Consider overlap or evidence-window context without duplicating headings
  • Preserve page anchors where available as optional provenance
  • Add tests showing Reminiscences-style roman numeral chapters become stable ordered source chunks
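
A minimal sketch of chapter-bounded chunking with stable IDs follows. The function names are hypothetical, overlap and page anchors are left out, and the sketch assumes every chunk carries a `-part-NNN` suffix, which matches the IDs seen in the smoke run but may not match how single-chunk chapters are actually named.

```python
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(label: str) -> int:
    """Convert a roman-numeral chapter label such as 'XXIV' to an integer."""
    values = [ROMAN_VALUES[ch] for ch in label.strip().upper()]
    total = 0
    for i, value in enumerate(values):
        # A smaller numeral before a larger one is subtracted (IV, IX, XL).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def chunk_chapter(label: str, text: str, max_words: int = 800) -> list:
    """Split one chapter into (chunk_id, chunk_text) pairs of <= max_words."""
    number = roman_to_int(label)
    words = text.split()
    parts = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    return [
        (f"chapter-{number:02d}-part-{index:03d}", " ".join(part))
        for index, part in enumerate(parts, start=1)
    ]
```

Because the ID depends only on the chapter number and the part index, re-running intake on the same EPUB with the same `max_words` yields the same IDs, and chunks never span a chapter boundary.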

T03 - Scale-aware generation planning

id: IB-WP-0016-T03
status: done
priority: high
state_hub_task_id: "bee5c38a-f052-4edb-9313-b3a2ee5a6c26"
  • Add compact plan output for long sources
  • Report estimated chunks, workflow stages, provider call count, prompt word or token estimate, and rough cost inputs
  • Add CLI selection filters such as --chapter, --chunk, --from-chapter, --to-chapter, --max-calls, and --cost-cap
  • Keep full prompt inspection available, but do not make it the default for large corpora
  • Add tests proving plan output is compact and does not dump hundreds of prompts
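
The arithmetic behind a compact plan is simple enough to sketch. The names below (`PlanSummary`, `summarize_plan`, `format_plan`) are hypothetical, and the prompt-word figure assumes each stage resends the full chunk text, which is a rough upper-level guess rather than a measured value.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class PlanSummary:
    chunks: int
    stages_per_chunk: int
    provider_calls: int
    prompt_words: int
    exceeds_max_calls: bool


def summarize_plan(
    chunk_words: Sequence[int],
    stages_per_chunk: int,
    max_calls: Optional[int] = None,
) -> PlanSummary:
    """Estimate call volume for a run without rendering any prompts."""
    calls = len(chunk_words) * stages_per_chunk
    return PlanSummary(
        chunks=len(chunk_words),
        stages_per_chunk=stages_per_chunk,
        provider_calls=calls,
        # Assumption: every stage sends the whole chunk once.
        prompt_words=sum(chunk_words) * stages_per_chunk,
        exceeds_max_calls=max_calls is not None and calls > max_calls,
    )


def format_plan(plan: PlanSummary) -> str:
    """Render the plan on a single line so large runs stay readable."""
    return (
        f"chunks={plan.chunks} stages={plan.stages_per_chunk} "
        f"calls={plan.provider_calls} prompt_words~{plan.prompt_words} "
        f"exceeds_max_calls={plan.exceeds_max_calls}"
    )
```

For the Lefevre baseline of 67 chunks and roughly 5 stages per chunk, `summarize_plan([800] * 67, 5).provider_calls` gives 335, the same figure quoted in the gap list.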

T04 - Trading-literature profile

id: IB-WP-0016-T04
status: done
priority: medium
state_hub_task_id: "1a1b8fde-773f-46a6-887a-3c87a425d7a3"
  • Add or specialize a profile for trading memoir and market-structure texts
  • Tune entity prompts for traders, markets, strategies, errors, psychological patterns, institutions, instruments, and evidence-bearing claims
  • Tune relation prompts for cause/effect, lesson/evidence, risk/mitigation, actor/venue, and strategy/outcome links
  • Tune evaluation criteria for groundedness, lesson clarity, historical context, and overgeneralization risk
  • Keep the generic profile usable for non-trading books

T05 - Deterministic Lefevre acceptance fixture

id: IB-WP-0016-T05
status: done
priority: high
state_hub_task_id: "c9bbc84e-691b-4530-a79a-6ecfa9c41fdd"
  • Add a small checked-in EPUB-like or extracted chapter fixture derived from public-domain Lefevre structure
  • Add deterministic fixture responses for source summary, entity extraction, relation extraction, and evaluation
  • Prove the fixture generates a manifest-backed infospace with stable source, entity, relation, evaluation, metrics, history, and report artifacts
  • Include a regression test for excluding Gutenberg boilerplate when requested
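
One way to keep such a fixture deterministic is to replay canned responses and fail loudly on anything unexpected. The sketch below is an assumption about shape only: `FixtureProvider`, its `complete` method, and the `(stage, chunk_id)` key are hypothetical and do not describe the real provider interface in infospace-bench.

```python
class FixtureProvider:
    """Replay canned responses keyed by (stage, chunk_id); no network access."""

    def __init__(self, responses: dict):
        self._responses = dict(responses)
        self.calls = []  # recorded so tests can assert on call order

    def complete(self, stage: str, chunk_id: str) -> dict:
        self.calls.append((stage, chunk_id))
        try:
            return self._responses[(stage, chunk_id)]
        except KeyError:
            # A missing fixture is a test bug; never fall back to a live call.
            raise LookupError(
                f"no fixture response for stage={stage!r} chunk={chunk_id!r}"
            ) from None


provider = FixtureProvider(
    {("entity", "chapter-01-part-001"): {"entities": [{"title": "Larry Livingston"}]}}
)
result = provider.complete("entity", "chapter-01-part-001")
```

Raising on an unknown key, rather than returning an empty response, means a prompt or chunking change that alters the request set shows up as a test failure instead of a silently thinner infospace.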

T06 - OpenRouter live-run guardrails

id: IB-WP-0016-T06
status: done
priority: high
state_hub_task_id: "c6bf97c3-1c2c-4993-8f4f-97a48e01cce2"
  • Add an optional live smoke test path that is skipped unless credentials and an explicit opt-in environment variable are present
  • Support a one-chapter OpenRouter run with selected model, bounded retries, cost/call cap, provider metadata, and resume
  • Record provider model, request IDs, timing, usage, and retry counts in run records and generated artifact provenance
  • Document how to run the smoke safely and how to stop before a full-book build
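
The double gate described in the first bullet (credentials plus explicit opt-in) could look like the sketch below. Both environment variable names are assumptions made for illustration: `OPENROUTER_API_KEY` is a common convention for the key, and `INFOSPACE_BENCH_LIVE_SMOKE` is an invented opt-in name, not one confirmed by this workplan.

```python
import os
from collections.abc import Mapping


def live_smoke_enabled(env: Mapping = os.environ) -> bool:
    """Return True only when a key is present AND the run is opted in."""
    has_key = bool(env.get("OPENROUTER_API_KEY", "").strip())
    opted_in = env.get("INFOSPACE_BENCH_LIVE_SMOKE", "") == "1"
    return has_key and opted_in
```

In a pytest suite this predicate could drive `pytest.mark.skipif(not live_smoke_enabled(), reason="live OpenRouter smoke is opt-in")`, so that having a key in the shell is not enough on its own to trigger paid calls.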

T07 - Example output and review policy

id: IB-WP-0016-T07
status: done
priority: medium
state_hub_task_id: "5ff1f11e-49ad-4c2d-bd4c-b8cc261309bc"
  • Define where generated Lefevre outputs live
  • Decide what is committed, what remains disposable, and what needs human review
  • Add a review checklist for duplicate entities, relation endpoints, weak evidence, and over-broad trading lessons
  • Add a final readiness report before generating the full book
  • Enrich reports/generation-summary.md beyond counts and metrics: list entity titles, per-chapter coverage, page-anchor links, and any unmapped source chunks (gap found in the 2026-05-17 smoke run)
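
The unmapped-chunk check in the last bullet amounts to a set difference, and the section can be omitted when there is nothing to show. In this sketch the function names and the `source_id` field on artifacts are hypothetical; only the `## Unmapped source chunks` heading comes from the report description.

```python
def unmapped_sources(source_ids: list, artifacts: list) -> list:
    """Return source chunk IDs that no generated artifact refers to."""
    referenced = {artifact["source_id"] for artifact in artifacts}
    # Preserve the original chunk order so the report reads front to back.
    return [sid for sid in source_ids if sid not in referenced]


def render_unmapped_section(unmapped: list) -> str:
    """Render the report section, or nothing when every chunk is mapped."""
    if not unmapped:
        return ""
    lines = ["## Unmapped source chunks", ""]
    lines += [f"- {sid}" for sid in unmapped]
    return "\n".join(lines) + "\n"
```

Returning an empty string for the no-data case keeps reports for small or generic runs short, while a full-book run surfaces every chunk that produced no entity, relation, or evaluation.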

Acceptance

  • Current local EPUB can be inspected as EPUB3 with metadata and ordered body sections
  • generate init can import the book as body-only ordered chapter chunks
  • Chunk titles and IDs are stable, readable, and not dominated by Project Gutenberg boilerplate
  • generate plan gives compact cost/call planning for the full book
  • A deterministic Lefevre-style fixture generates a complete infospace without network access
  • Optional one-chapter OpenRouter smoke run is explicit, bounded, resumable, and skipped by default
  • A full-book run has documented review and output policy before execution