Commit Graph

18 Commits

Author SHA1 Message Date
b9173b6569 IB-WP-0016-T02: chapter-aware chunking and stable IDs
Resolve chapter labels from EPUB nav entries (when present) and from the
first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N"
labels into numeric chapter indices, and generate stable IDs of the form
chapter-NN with -part-NNN suffix when a chapter exceeds max_words. The
chunker now operates on cleaned body text, distributes id="Page_*" page
anchors per part via inline markers extracted before splitting, and
supports a configurable overlap_words evidence window between adjacent
parts of the same chapter. Reclassify body sections whose chapter label
matches contents/transcriber-notes/license/colophon tokens so they leave
the body stream by default. Strip <head>...</head> from HTML body
extraction to stop the <title> tag from duplicating heading text in the
chunk markdown.
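
The label parsing and ID scheme described above can be sketched roughly as follows. This is an illustrative sketch, not the code from this commit: the helper names (`roman_to_int`, `chapter_index`, `split_with_overlap`, `chunk_ids`), the 1-based part numbering, and the choice to drop the bare chapter ID once a chapter is split are all assumptions.

```python
import re

# Strict roman-numeral shape (1..3999). The empty string also matches,
# so it is rejected separately below.
_ROMAN_RE = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
# Optional "Chapter " prefix, then digits or a roman-numeral token.
_LABEL_RE = re.compile(r"^\s*(?:chapter\s+)?([0-9]+|[ivxlcdm]+)\b", re.IGNORECASE)


def roman_to_int(token):
    """Return the value of a well-formed roman numeral, else None."""
    token = token.upper()
    if not token or not _ROMAN_RE.fullmatch(token):
        return None
    total = 0
    for i, ch in enumerate(token):
        value = _ROMAN_VALUES[ch]
        following = _ROMAN_VALUES[token[i + 1]] if i + 1 < len(token) else 0
        # A smaller value before a larger one is subtractive (IV, IX, XL, ...).
        total += -value if value < following else value
    return total


def chapter_index(label):
    """Parse 'XXIV', 'Chapter 7' or 'CHAPTER IX.' into a number, else None."""
    match = _LABEL_RE.match(label)
    if match is None:
        return None
    token = match.group(1)
    return int(token) if token.isdigit() else roman_to_int(token)


def split_with_overlap(words, max_words, overlap_words=0):
    """Split a word list into parts of at most max_words, where each part
    repeats the last overlap_words words of the previous part."""
    if max_words <= 0 or not 0 <= overlap_words < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap_words < max_words")
    if len(words) <= max_words:
        return [list(words)]
    step = max_words - overlap_words
    parts, start = [], 0
    while True:
        parts.append(list(words[start:start + max_words]))
        if start + max_words >= len(words):
            return parts
        start += step


def chunk_ids(chapter, part_count):
    """chapter-NN for a single part, chapter-NN-part-NNN otherwise (assumed 1-based)."""
    base = f"chapter-{chapter:02d}"
    if part_count <= 1:
        return [base]
    return [f"{base}-part-{n:03d}" for n in range(1, part_count + 1)]
```

Because the IDs depend only on the parsed chapter number and the part position, re-running the chunker over the same source with the same max_words and overlap_words would reproduce the same IDs. Note that this sketch accepts any heading that starts with a lone roman-numeral word, so a heading beginning with the word "I" would parse as chapter 1.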

Real Lefevre EPUB now detects all 24 roman-numeral chapters with stable
chapter-NN IDs, distributes Page_N anchors across multi-part chapters,
and reclassifies Contents and Transcriber's Notes out of body
(role histogram body=67, cover=1, header=1, toc=1, notes=1, footer=2).
82 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 15:52:47 +02:00
5b6a63fb7a IB-WP-0016-T01: spine-aware EPUB3 intake
Parse META-INF/container.xml and the OPF package document, then iterate
documents in spine reading order instead of archive-name sort. Classify
each spine item (body, cover, nav, toc, header, footer, notes, license,
auxiliary) and exclude non-body sections by default; include_non_body=True
opts them back in for inspection. Capture OPF book metadata (title,
creator, language, subjects, rights, identifier, source_url, modified)
onto every chunk and propagate it through source artifact provenance.
Preserve the legacy zip-without-OPF fallback for malformed EPUBs.
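
For readers unfamiliar with the EPUB container layout, spine-order iteration with an archive-sorted fallback might look like the sketch below. It uses only the standard library; the function names `spine_documents` and `_opf_path` and the HTML suffix list are assumptions for illustration, not the project's API, and section classification and metadata capture are left out.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}
HTML_SUFFIXES = (".xhtml", ".html", ".htm")  # assumed fallback filter


def _opf_path(zf, names):
    """Return the OPF path named by META-INF/container.xml, or None."""
    if "META-INF/container.xml" not in names:
        return None
    try:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
    except ET.ParseError:
        return None
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    path = rootfile.get("full-path") if rootfile is not None else None
    return path if path in names else None


def spine_documents(epub):
    """List content documents in spine reading order.

    Falls back to archive-name sort when no usable OPF is found.
    """
    with zipfile.ZipFile(epub) as zf:
        names = set(zf.namelist())
        opf_path = _opf_path(zf, names)
        if opf_path is None:
            return sorted(n for n in names if n.lower().endswith(HTML_SUFFIXES))
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the directory holding the OPF file.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    ordered = []
    for itemref in package.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(itemref.get("idref"))
        if href is None:
            continue  # spine entry without a manifest item
        path = posixpath.normpath(posixpath.join(base, unquote(href)))
        if path in names:
            ordered.append(path)
    return ordered
```

The spine can differ from the archive-name order whenever file names do not sort in reading order, which is the case the commit message describes.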

Real Lefevre EPUB now yields 148 body chunks in spine order (was 155
mixed, archive-sorted) with cover=1, header=1, footer=4 detected and
dropped. 78 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 13:52:24 +02:00
ddefd69f71 IB-WP-0014: archive-list, restore, retention annotation, docs (T03-T05)
Round out IB-WP-0014 with the remaining archive operations and docs.

- restore_archive() and `infospace-bench restore <pkg> --target <dir>` round-trip
  a finalized package's bytes back to disk. Refuses to overwrite a non-empty
  target unless --force. --from <infospace-root> resolves the store location.
- archive-list CLI with --with-retention flag; annotate_retention() opens the
  per-infospace registry and joins each record with its current retention
  state (effective class, expires, holds, eligibility).
- docs/archive-integration.md covers when to archive, the include set,
  retention classes, storage layout, credentials policy, and the explicit
  non-goal that S3/git backends live in artifact-store.
- SCOPE.md cross-links the new doc.
- Workplan flipped to status: done. Full pytest suite: 72 passed.
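
The overwrite guard described for restore can be illustrated with a small sketch. The function `restore_bytes` and its `files` mapping are hypothetical stand-ins; the actual `restore_archive()` signature and store lookup are not shown in this log.

```python
from pathlib import Path


def restore_bytes(files, target, force=False):
    """Write {relative_path: bytes} under target (hypothetical helper).

    Refuses to write into a non-empty target directory unless force is set.
    """
    target = Path(target)
    if target.is_dir() and any(target.iterdir()) and not force:
        raise FileExistsError(f"{target} is not empty; pass force=True to overwrite")
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for rel, data in files.items():
        dest = target / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(data)
        written.append(dest)
    return written
```

A production version would also need to reject relative paths that escape the target directory; that check is omitted here for brevity.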

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 11:46:23 +02:00
c3b62a6ec3 Agentic memory profile 2026-05-15 16:01:35 +02:00
9d1a2088aa Workplan for practical example 2026-05-14 22:05:10 +02:00
46aad3cce8 generic source-to-infospace generator 2026-05-14 19:33:22 +02:00
a729a7643e infospace pipeline for wealth of nations example 2026-05-14 18:04:38 +02:00
3de72eb0d2 command parity and migration guide 2026-05-14 17:16:39 +02:00
5d53c33d3e Kontextual Engine Integration Boundary 2026-05-14 16:43:29 +02:00
fc70acb257 engine and lifecycle 2026-05-14 16:26:42 +02:00
55405d8a5a acceptance matrix and workflow generation 2026-05-14 16:01:32 +02:00
7f54dec585 eval history and metrics 2026-05-14 15:35:04 +02:00
9627d03c1a entity relationship model 2026-05-14 15:06:17 +02:00
6eb3c6a0fb markitect-tool integration 2026-05-14 14:53:16 +02:00
28de86f13e docs and stuff 2026-05-14 13:47:36 +02:00
9d643f6e99 Reestablishing intent based goals and workplans 2026-05-14 13:09:58 +02:00
916a895a85 Initial implementation 2026-05-14 11:32:25 +02:00
f25bd2cf84 State-hub connect and initial workplans 2026-05-03 20:43:56 +02:00