generated from coulomb/repo-seed
IB-WP-0016-T02: chapter-aware chunking and stable IDs
Resolve chapter labels from EPUB nav entries (when present) and from the first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N" labels into numeric chapter indices, and generate stable IDs of the form chapter-NN, with a -part-NNN suffix when a chapter exceeds max_words.

The chunker now operates on cleaned body text, distributes id="Page_*" page anchors per part via inline markers extracted before splitting, and supports a configurable overlap_words evidence window between adjacent parts of the same chapter.

Reclassify body sections whose chapter label matches contents/transcriber-notes/license/colophon tokens so they leave the body stream by default. Strip <head>...</head> from HTML body extraction to stop the <title> tag from duplicating heading text in the chunk markdown.

On the real Lefevre EPUB, the chunker now detects all 24 roman-numeral chapters with stable chapter-NN IDs, distributes Page_N anchors across multi-part chapters, and reclassifies Contents and Transcriber's Notes out of body (role histogram: body=67, cover=1, header=1, toc=1, notes=1, footer=2). 82 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
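The label parsing and ID scheme described in this message could look roughly like the sketch below. It is not the code from this commit: the helper names (`roman_to_int`, `parse_chapter_number`, `chunk_id`) are hypothetical, and numbering parts from 1 is an assumption. Only the `chapter-NN` and `-part-NNN` shapes come from the message itself.

```python
import re

# Hypothetical helpers; the commit does not show the real implementation.
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
_ROMAN_RE = re.compile(
    r"^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
_CHAPTER_RE = re.compile(r"^chapter\s+(\d+|[ivxlcdm]+)\b", re.IGNORECASE)


def roman_to_int(text):
    """Return the value of a well-formed roman numeral, or None."""
    numeral = text.strip().rstrip(".").upper()
    if not _ROMAN_RE.match(numeral):
        return None
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = _ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if _ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def parse_chapter_number(label):
    """Map 'Chapter 7', 'CHAPTER IV.' or a bare 'XXIV' to an int, else None."""
    label = label.strip()
    match = _CHAPTER_RE.match(label)
    if match:
        token = match.group(1)
        return int(token) if token.isdigit() else roman_to_int(token)
    return roman_to_int(label)


def chunk_id(chapter_number, part=None):
    """Build a stable ID; part numbering from 1 is an assumption."""
    base = f"chapter-{chapter_number:02d}"
    return base if part is None else f"{base}-part-{part:03d}"


print(chunk_id(parse_chapter_number("XXIV")))
print(chunk_id(parse_chapter_number("Chapter 7"), part=2))
```

Labels such as "Contents" return None here, which leaves room for the separate reclassification step to handle them.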
@@ -261,6 +261,9 @@ def _register_source_chunks(root: Path, chunks: list[SourceChunk]) -> None:
             "section_role": chunk.section_role,
             "spine_index": chunk.spine_index,
             "book_metadata": dict(chunk.book_metadata),
+            "chapter_label": chunk.chapter_label,
+            "chapter_number": chunk.chapter_number,
+            "page_anchors": list(chunk.page_anchors),
         },
     )
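As an illustration of how the `page_anchors` list registered above might be filled per part, here is a minimal sketch of word-based splitting with an overlap window. The `[[Page_N]]` marker syntax and the function name `split_with_overlap` are assumptions. The commit only states that inline markers are extracted before splitting and that `max_words` and `overlap_words` are configurable.

```python
import re

# Assumed marker syntax; the real inline marker format is not shown in the commit.
_MARKER_RE = re.compile(r"^\[\[(Page_\w+)\]\]$")


def split_with_overlap(text, max_words, overlap_words=0):
    """Split marker-annotated text into parts of at most max_words words.

    Returns a list of (part_text, page_anchors) tuples. Adjacent parts
    share overlap_words words, so an anchor inside the shared window is
    reported for both parts.
    """
    if max_words <= 0 or not 0 <= overlap_words < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap_words < max_words")

    words, anchors = [], []  # anchors holds (word_offset, anchor_id)
    for token in text.split():
        match = _MARKER_RE.match(token)
        if match:
            # The marker is dropped from the text; remember where it sat.
            anchors.append((len(words), match.group(1)))
        else:
            words.append(token)

    # A marker after the final word is attached to that final word.
    last = max(len(words) - 1, 0)
    anchors = [(min(offset, last), name) for offset, name in anchors]

    parts = []
    start = 0
    step = max_words - overlap_words
    while True:
        end = min(start + max_words, len(words))
        part_anchors = [name for offset, name in anchors if start <= offset < end]
        parts.append((" ".join(words[start:end]), part_anchors))
        if end >= len(words):
            break
        start += step
    return parts


sample = "[[Page_1]] a b c [[Page_2]] d e f g"
for part_text, page_anchors in split_with_overlap(sample, max_words=4, overlap_words=1):
    print(part_text, page_anchors)
```

Reporting an anchor in both parts when it falls inside the overlap is one possible design choice. The actual chunker may assign each anchor to a single part.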