IB-WP-0016-T02: chapter-aware chunking and stable IDs

Resolve chapter labels from EPUB nav entries (when present) and from the
first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N"
labels into numeric chapter indices, and generate stable IDs of the form
chapter-NN with -part-NNN suffix when a chapter exceeds max_words. The
chunker now operates on cleaned body text, distributes id="Page_*" page
anchors per part via inline markers extracted before splitting, and
supports a configurable overlap_words evidence window between adjacent
parts of the same chapter. Reclassify body sections whose chapter label
matches contents/transcriber-notes/license/colophon tokens so they leave
the body stream by default. Strip <head>...</head> from HTML body
extraction to stop the <title> tag from duplicating heading text in the
chunk markdown.

On the real Lefevre EPUB, the chunker now detects all 24 roman-numeral
chapters with stable chapter-NN IDs, distributes Page_N anchors across
multi-part chapters, and reclassifies Contents and Transcriber's Notes
out of body (role histogram: body=67, cover=1, header=1, toc=1, notes=1,
footer=2). 82 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 15:52:47 +02:00
parent ef19aa6de7
commit b9173b6569
5 changed files with 449 additions and 36 deletions


@@ -89,3 +89,26 @@ The remaining gap is title collapse: all body sections still share the
Project Gutenberg page title because chapter headings are not yet read from
in-document `<h1>` content. That collapse is T02's scope (chapter-aware
chunking and stable IDs from in-document headings).

## T02 Result (2026-05-17)

Chapter-aware chunking and stable IDs landed. The same local Lefevre EPUB
now produces:

- 67 body chunks at the default `max_words=800`; raising the limit to
`max_words=2000` collapses these to 24 single-chunk chapters
- All 24 roman-numeral chapters detected and assigned stable IDs
`chapter-01` .. `chapter-24`; multi-part chapters get
`chapter-NN-part-001`, `chapter-NN-part-002`, ...
- Chapter labels resolved from the EPUB nav doc (when present) and from
the first in-document `<h2>`/`<h1>` heading
- Project Gutenberg page-title collapse is gone: each chunk's title is the
chapter label, not the shared book title
- TOC body section ("Contents") reclassified to `toc`; transcriber's notes
section reclassified to `notes`; section-role histogram is now
`body=67, cover=1, header=1, toc=1, notes=1, footer=2`
- Page anchors of the form `id="Page_N"` are preserved per chunk via the
`page_anchors` provenance field (e.g. `chapter-01` carries
`Page_1..Page_14` distributed across its three parts)
- Optional `overlap_words` parameter supports evidence-window context
between adjacent parts of the same chapter without duplicating headings
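
The label parsing, ID scheme, and overlap window listed above can be illustrated with a minimal sketch. It is not the code from this commit: the function names `parse_chapter_index`, `chunk_ids`, and `split_with_overlap` are hypothetical, and it assumes that single-part chapters keep the bare `chapter-NN` ID and that the overlap is prepended to a part on top of `max_words` rather than counted against it.

```python
import re
from typing import List, Optional

# Strict roman-numeral shape (1..3999), so words like "Mill" are rejected.
_ROMAN_RE = re.compile(
    r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})"
)
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# "Chapter 7", "CHAPTER IX. The Storm", ...
_CHAPTER_RE = re.compile(r"^\s*chapter\s+([0-9]+|[ivxlcdm]+)\b", re.IGNORECASE)


def roman_to_int(text: str) -> Optional[int]:
    """Convert a well-formed roman numeral to an int, else return None."""
    text = text.strip().upper()
    if not text or not _ROMAN_RE.fullmatch(text):
        return None
    total, prev = 0, 0
    # Walk right to left: a smaller value before a larger one is subtracted.
    for ch in reversed(text):
        value = _ROMAN_VALUES[ch]
        if value < prev:
            total -= value
        else:
            total += value
            prev = value
    return total


def parse_chapter_index(label: str) -> Optional[int]:
    """Parse 'Chapter N', 'Chapter IX', or a bare roman label like 'XIV.'."""
    match = _CHAPTER_RE.match(label)
    token = match.group(1) if match else label.strip().split(".")[0].strip()
    if token.isdigit():
        return int(token)
    return roman_to_int(token)


def chunk_ids(chapter_index: int, part_count: int) -> List[str]:
    """Return stable IDs; the -part-NNN suffix appears only for multi-part chapters."""
    base = f"chapter-{chapter_index:02d}"
    if part_count == 1:
        return [base]
    return [f"{base}-part-{n:03d}" for n in range(1, part_count + 1)]


def split_with_overlap(
    words: List[str], max_words: int, overlap_words: int = 0
) -> List[List[str]]:
    """Split words into parts of at most max_words new words each.

    Assumption: each part after the first is prefixed with the last
    overlap_words words of the previous part as an evidence window.
    """
    if max_words <= 0 or not 0 <= overlap_words < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap_words < max_words")
    parts = []
    start = 0
    while start < len(words):
        end = min(start + max_words, len(words))
        lead = max(0, start - overlap_words)
        parts.append(words[lead:end])
        start = end
    return parts
```

Under these assumptions, a chapter labelled `XXIV` whose text splits into three parts would receive `chapter-24-part-001` through `chapter-24-part-003`, while a label such as `Contents` yields no chapter index and can be routed to role reclassification instead.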