IB-WP-0016-T02: chapter-aware chunking and stable IDs

Resolve chapter labels from EPUB nav entries (when present) and from the first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N" labels into numeric chapter indices, and generate stable IDs of the form chapter-NN, with a -part-NNN suffix when a chapter exceeds max_words.

The chunker now operates on cleaned body text, distributes id="Page_*" page anchors per part via inline markers extracted before splitting, and supports a configurable overlap_words evidence window between adjacent parts of the same chapter.

Reclassify body sections whose chapter label matches contents/transcriber-notes/license/colophon tokens so they leave the body stream by default.

Strip <head>...</head> from HTML body extraction to stop the <title> tag from duplicating heading text in the chunk markdown.

On the real Lefevre EPUB, the chunker now detects all 24 roman-numeral chapters with stable chapter-NN IDs, distributes Page_N anchors across multi-part chapters, and reclassifies Contents and Transcriber's Notes out of body (role histogram: body=67, cover=1, header=1, toc=1, notes=1, footer=2). 82 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
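
The label parsing and ID scheme described in the message can be illustrated with a minimal sketch. It is not the code from this commit: the names parse_chapter_number, roman_to_int, and chunk_id are hypothetical, the regular expressions are one plausible reading of "roman-numeral and Chapter N labels", and 1-based part numbering is an assumption. Only the chapter-NN and -part-NNN shapes come from the message itself.

```python
import re

# Hypothetical helpers; none of these names are taken from the commit.
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Accept only well-formed numerals (rejects e.g. "IIII" or "CIVIL").
_ROMAN_RE = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")
# Optional "Chapter " prefix, then a roman or arabic token ending at a word boundary.
_LABEL_RE = re.compile(r"^\s*(?:chapter\s+)?([ivxlcdm]+|\d+)\b", re.IGNORECASE)


def roman_to_int(text):
    """Return the value of a well-formed roman numeral, or None."""
    s = text.upper()
    if not s or not _ROMAN_RE.match(s):
        return None
    total = 0
    for i, ch in enumerate(s):
        value = _ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtractive (IV, IX, XL, ...).
        if i + 1 < len(s) and value < _ROMAN_VALUES[s[i + 1]]:
            total -= value
        else:
            total += value
    return total


def parse_chapter_number(label):
    """Map labels such as 'XIV', 'Chapter 7', or 'I. Title' to an int, else None."""
    match = _LABEL_RE.match(label)
    if match is None:
        return None
    token = match.group(1)
    return int(token) if token.isdigit() else roman_to_int(token)


def chunk_id(chapter_number, part_index=None):
    """Build chapter-NN, plus -part-NNN when the chapter was split (1-based, assumed)."""
    base = f"chapter-{chapter_number:02d}"
    if part_index is None:
        return base
    return f"{base}-part-{part_index:03d}"
```

Under these assumptions, a label like "Contents" yields None because its leading "C" is not followed by a word boundary, which leaves such sections to the role reclassification step rather than the chapter numbering. Short English words that happen to be valid numerals (for example "Mix") would still be misread by this sketch.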
2026-05-17 15:52:47 +02:00
parent ef19aa6de7
commit b9173b6569
5 changed files with 449 additions and 36 deletions
--- a/src/infospace_bench/generator.py
+++ b/src/infospace_bench/generator.py
@@ -261,6 +261,9 @@ def _register_source_chunks(root: Path, chunks: list[SourceChunk]) -> None:
                "section_role": chunk.section_role,
                "spine_index": chunk.spine_index,
                "book_metadata": dict(chunk.book_metadata),
+                "chapter_label": chunk.chapter_label,
+                "chapter_number": chunk.chapter_number,
+                "page_anchors": list(chunk.page_anchors),
            },
        )