IB-WP-0016-T02: chapter-aware chunking and stable IDs
Resolve chapter labels from EPUB nav entries (when present) and from the first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N" labels into numeric chapter indices, and generate stable IDs of the form chapter-NN, with a -part-NNN suffix when a chapter exceeds max_words.

The chunker now operates on cleaned body text, distributes id="Page_*" page anchors per part via inline markers extracted before splitting, and supports a configurable overlap_words evidence window between adjacent parts of the same chapter.

Reclassify body sections whose chapter label matches contents/transcriber-notes/license/colophon tokens so they leave the body stream by default. Strip <head>...</head> from HTML body extraction to stop the <title> tag from duplicating heading text in the chunk markdown.

On the real Lefevre EPUB, the pipeline now detects all 24 roman-numeral chapters with stable chapter-NN IDs, distributes Page_N anchors across multi-part chapters, and reclassifies Contents and Transcriber's Notes out of body (role histogram body=67, cover=1, header=1, toc=1, notes=1, footer=2).

82 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -89,3 +89,26 @@ The remaining gap is title collapse: all body sections still share the
Project Gutenberg page title because chapter headings are not yet read from
in-document `<h1>` content. That collapse is T02's scope (chapter-aware
chunking and stable IDs from in-document headings).

## T02 Result (2026-05-17)

Chapter-aware chunking and stable IDs landed. The same local Lefevre EPUB
now produces:

- 67 body chunks at the default `max_words=800`; with `max_words=2000` this
  collapses to 24 single-chunk chapters
- All 24 roman-numeral chapters detected and assigned stable IDs
  `chapter-01` .. `chapter-24`; multi-part chapters get
  `chapter-NN-part-001`, `chapter-NN-part-002`, ...
- Chapter labels resolved from the EPUB nav doc (when present) and from
  the first in-document `<h2>`/`<h1>` heading
- Project Gutenberg page-title collapse is gone: each chunk's title is the
  chapter label, not the shared book title
- TOC body section ("Contents") reclassified to `toc`; transcriber's notes
  section reclassified to `notes`; section-role histogram is now
  `body=67, cover=1, header=1, toc=1, notes=1, footer=2`
- Page anchors of the form `id="Page_N"` are preserved per chunk via the
  `page_anchors` provenance field (e.g. `chapter-01` carries
  `Page_1..Page_14` distributed across its three parts)
- Optional `overlap_words` parameter supports evidence-window context
  between adjacent parts of the same chapter without duplicating headings
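
To make the ID scheme above concrete, here is a minimal sketch of how a chapter label might be turned into a numeric index and then into a stable ID. The function names (`roman_to_int`, `chapter_index`, `chunk_id`) and the exact label-cleaning rules are assumptions for illustration, not the code that landed in this commit.

```python
import re
from typing import Optional

_ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Accept only well-formed roman numerals, so words like "CONTENTS" are rejected.
_ROMAN_RE = re.compile(
    r"^(?=[IVXLCDM])M*(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)


def roman_to_int(text: str) -> int:
    """Convert a validated upper-case roman numeral to an integer."""
    total = 0
    for ch, nxt in zip(text, text[1:] + " "):
        value = _ROMAN[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if _ROMAN.get(nxt, 0) > value else value
    return total


def chapter_index(label: str) -> Optional[int]:
    """Return a chapter index for labels like 'IV.' or 'Chapter 12', else None."""
    cleaned = label.strip().rstrip(".").strip()
    match = re.match(r"(?i)^chapter\s+(\S+)", cleaned)
    token = match.group(1).rstrip(".:") if match else cleaned
    if token.isdigit():
        return int(token)
    token = token.upper()
    if _ROMAN_RE.match(token):
        return roman_to_int(token)
    return None


def chunk_id(index: int, part: Optional[int] = None) -> str:
    """Build 'chapter-NN', or 'chapter-NN-part-NNN' for multi-part chapters."""
    base = f"chapter-{index:02d}"
    return base if part is None else f"{base}-part-{part:03d}"
```

Under these assumptions, `chunk_id(chapter_index("XXIV."))` gives `chapter-24` and `chunk_id(1, 3)` gives `chapter-01-part-003`. Labels that do not parse return `None`, which leaves room for a separate role check to move sections such as "Contents" out of the body stream.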
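
The interaction between `max_words` and `overlap_words` can be illustrated with a small word-window splitter. This is a sketch under stated assumptions: the helper name `split_with_overlap` is hypothetical, and it assumes the overlap counts toward the `max_words` budget of each part, which the notes above do not specify.

```python
from typing import List, Sequence


def split_with_overlap(
    words: Sequence[str], max_words: int, overlap_words: int = 0
) -> List[List[str]]:
    """Split a chapter's words into parts of at most max_words words.

    Each part after the first starts overlap_words words before the end of
    the previous part, so adjacent parts share a small evidence window.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must satisfy 0 <= overlap_words < max_words")

    # Short chapters stay as a single part (no -part-NNN suffix needed).
    if len(words) <= max_words:
        return [list(words)]

    step = max_words - overlap_words
    parts: List[List[str]] = []
    start = 0
    while start < len(words):
        parts.append(list(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the chapter
        start += step
    return parts
```

Because the splitter only sees body words, a chapter heading kept outside the word list is never repeated in the overlap, which matches the "without duplicating headings" behavior described above.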
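
Distributing `Page_N` anchors per part relies on inline markers that survive the split. The following is a rough sketch of that idea, not the implementation in this commit: the `[[Page_N]]` marker syntax and the helper names are assumptions, and the regex-based tag handling is a deliberate simplification where a real pipeline would more likely use an HTML parser.

```python
import re
from typing import List, Sequence, Tuple

# Any opening tag that carries id="Page_*" (assumed anchor shape).
_ANCHOR_RE = re.compile(r'<[^>]*\bid="(Page_\w+)"[^>]*>')
_TAG_RE = re.compile(r"<[^>]+>")
_MARKER_RE = re.compile(r"\[\[(Page_\w+)\]\]")


def to_marked_words(html: str) -> List[str]:
    """Replace page anchors with [[Page_N]] tokens, drop other tags, split."""
    text = _ANCHOR_RE.sub(lambda m: f" [[{m.group(1)}]] ", html)
    text = _TAG_RE.sub(" ", text)
    return text.split()


def collect_page_anchors(part_words: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Separate one part's words into (clean_words, page_anchors)."""
    clean: List[str] = []
    anchors: List[str] = []
    for word in part_words:
        match = _MARKER_RE.fullmatch(word)
        if match:
            anchors.append(match.group(1))  # goes to the provenance field
        else:
            clean.append(word)
    return clean, anchors
```

In this sketch the markers are extracted before splitting and resolved after it, so each part reports only the anchors that fall inside its own word range. One simplification to note: marker tokens count as words during splitting here, so part sizes can be slightly below the nominal limit.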