Commit Graph

36 Commits

Author SHA1 Message Date
b9173b6569 IB-WP-0016-T02: chapter-aware chunking and stable IDs
Resolve chapter labels from EPUB nav entries (when present) and from the
first in-document h1/h2/h3 heading, parse roman-numeral and "Chapter N"
labels into numeric chapter indices, and generate stable IDs of the form
chapter-NN with -part-NNN suffix when a chapter exceeds max_words. The
chunker now operates on cleaned body text, distributes id="Page_*" page
anchors per part via inline markers extracted before splitting, and
supports a configurable overlap_words evidence window between adjacent
parts of the same chapter. Reclassify body sections whose chapter label
matches contents/transcriber-notes/license/colophon tokens so they leave
the body stream by default. Strip <head>...</head> from HTML body
extraction to stop the <title> tag from duplicating heading text in the
chunk markdown.

Real Lefevre EPUB now detects all 24 roman-numeral chapters with stable
chapter-NN IDs, distributes Page_N anchors across multi-part chapters,
and reclassifies Contents and Transcriber's Notes out of body
(role histogram body=67, cover=1, header=1, toc=1, notes=1, footer=2).
82 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 15:52:47 +02:00
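
The ID scheme described in b9173b6569 can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the function names (`roman_to_int`, `chapter_index`, `split_with_overlap`, `ids_for_chapter`) are hypothetical, and 1-based part numbering is an assumption. Only the `chapter-NN` / `-part-NNN` shape, `max_words`, and `overlap_words` come from the commit message.

```python
import re

_ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}
_ROMAN_RE = re.compile(r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")


def roman_to_int(text):
    """Return the value of a well-formed roman numeral, or None."""
    s = text.strip().rstrip(".").upper()
    if not s or not _ROMAN_RE.match(s):
        return None
    total = 0
    # Subtract a symbol when the next symbol is larger (e.g. the I in IV).
    for ch, nxt in zip(s, s[1:] + " "):
        value = _ROMAN[ch]
        total += -value if _ROMAN.get(nxt, 0) > value else value
    return total


def chapter_index(label):
    """Parse 'Chapter 7', 'Chapter IX', or a bare 'XIV.' into an int, else None."""
    m = re.match(r"\s*chapter\s+([0-9]+|[ivxlcdm]+)\b", label, re.IGNORECASE)
    token = m.group(1) if m else label.strip().rstrip(".")
    if token.isdigit():
        return int(token)
    return roman_to_int(token)


def split_with_overlap(words, max_words, overlap_words=0):
    """Split words into parts of at most max_words, repeating overlap_words
    words from the end of each part at the start of the next one."""
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be >= 0 and < max_words")
    step = max_words - overlap_words
    parts, start = [], 0
    while start < len(words):
        parts.append(words[start:start + max_words])
        if start + max_words >= len(words):
            break
        start += step
    return parts


def ids_for_chapter(chapter, words, max_words, overlap_words=0):
    """One ID per part; the -part-NNN suffix appears only for split chapters.
    Part numbering from 1 is an assumption of this sketch."""
    parts = split_with_overlap(words, max_words, overlap_words)
    base = f"chapter-{chapter:02d}"
    if len(parts) <= 1:
        return [base]
    return [f"{base}-part-{n:03d}" for n in range(1, len(parts) + 1)]
```

Because the IDs depend only on the chapter index and the part position, re-running the chunker on the same cleaned text with the same `max_words` and `overlap_words` yields the same IDs.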
a696f75280 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-17:
  - IB-WP-0016-T02: todo → in_progress
2026-05-17 13:55:49 +02:00
5b6a63fb7a IB-WP-0016-T01: spine-aware EPUB3 intake
Parse META-INF/container.xml and the OPF package document, then iterate
documents in spine reading order instead of archive-name sort. Classify
each spine item (body, cover, nav, toc, header, footer, notes, license,
auxiliary) and exclude non-body sections by default; include_non_body=True
opts them back in for inspection. Capture OPF book metadata (title,
creator, language, subjects, rights, identifier, source_url, modified)
onto every chunk and propagate it through source artifact provenance.
Preserve the legacy zip-without-OPF fallback for malformed EPUBs.

Real Lefevre EPUB now yields 148 body chunks in spine order (was 155
mixed, archive-sorted) with cover=1, header=1, footer=4 detected and
dropped. 78 tests pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 13:52:24 +02:00
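
The spine-order lookup described in 5b6a63fb7a can be illustrated with the standard library alone. This is a sketch under assumptions: `spine_documents` is a hypothetical name, it does not classify spine items or read metadata, it does not decode percent-encoded hrefs, and it omits the zip-without-OPF fallback the commit mentions.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
}


def spine_documents(epub):
    """Return archive paths of the spine documents in reading order.

    `epub` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        # container.xml points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile entry")
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # The manifest maps item ids to hrefs relative to the OPF file.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    # The spine lists item ids in reading order; archive-name order is ignored.
    return [
        posixpath.normpath(posixpath.join(base, manifest[ref.attrib["idref"]]))
        for ref in package.findall("opf:spine/opf:itemref", NS)
    ]
```

The returned list follows the order of `itemref` elements, so a document named `z_first.xhtml` can precede `a_second.xhtml` if the spine says so.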
2bcd9396f8 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-17:
  - IB-WP-0016-T01: todo → in_progress
2026-05-17 12:28:38 +02:00
7825608307 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-17:
  - IB-WP-0014-T05: todo → done
2026-05-17 11:52:20 +02:00
e7be3f41b8 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-17:
  - IB-WP-0014-T04: todo → done
2026-05-17 11:52:20 +02:00
d31be49db6 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-17:
  - IB-WP-0014-T03: todo → done
2026-05-17 11:52:20 +02:00
ddefd69f71 IB-WP-0014: archive-list, restore, retention annotation, docs (T03-T05)
Round out IB-WP-0014 with the remaining archive operations and docs.

- restore_archive() and `infospace-bench restore <pkg> --target <dir>` round-trip
  a finalized package's bytes back to disk. Refuses to overwrite a non-empty
  target unless --force. --from <infospace-root> resolves the store location.
- archive-list CLI with --with-retention flag; annotate_retention() opens the
  per-infospace registry and joins each record with its current retention
  state (effective class, expires, holds, eligibility).
- docs/archive-integration.md covers when to archive, the include set,
  retention classes, storage layout, credentials policy, and the explicit
  non-goal that S3/git backends live in artifact-store.
- SCOPE.md cross-links the new doc.
- Workplan flipped to status: done. Full pytest suite: 72 passed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 11:46:23 +02:00
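
The overwrite guard that ddefd69f71 describes for restore ("refuses to overwrite a non-empty target unless --force") might look something like the sketch below. `prepare_restore_target` is a hypothetical helper, not the signature of `restore_archive()`, and the exception types are assumptions.

```python
from pathlib import Path


def prepare_restore_target(target, force=False):
    """Create the target directory, refusing a non-empty one unless force is set."""
    target = Path(target)
    if target.exists() and not target.is_dir():
        raise NotADirectoryError(f"restore target is not a directory: {target}")
    if target.is_dir() and any(target.iterdir()) and not force:
        raise FileExistsError(
            f"restore target is not empty: {target} (pass force=True to overwrite)"
        )
    target.mkdir(parents=True, exist_ok=True)
    return target
```

Checking before any bytes are written keeps a refused restore from leaving a half-populated directory behind.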
f1085e8571 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-17:
  - IB-WP-0014-T02: todo → done
2026-05-17 11:36:34 +02:00
a8177474d2 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-17:
  - IB-WP-0014-T01: in_progress → done
2026-05-17 11:36:34 +02:00
36bfa33fb9 IB-WP-0014: archive integration with artifact-store (T01+T02)
Reframe IB-WP-0014 from "in-repo S3/git backend adapters" to "durable archive
surface via artifact-store". The live infospace stays in a local working folder;
finalized snapshots are bundled into content-addressed artifact-store packages.

- New module infospace_bench.archive: archive_infospace(), list_archives(),
  ArchiveRecord. Self-bootstraps a SQLite + local-FS registry under
  output/archives/.store/ when no Registry is passed in.
- New output/archives/index.yaml records each archive event (package id,
  manifest digest, retention class, included paths, file count, note).
- artifactstore added as a path dep; Python floor bumped to 3.12 to match.
- Makefile for venv-based dev setup; stack-and-commands.md updated.
- tests/test_archive.py covers index write, list, recursive-capture guard,
  caller-supplied include, and empty-include error. Full suite 65 passed.

Remaining tasks (T03 list CLI, T04 restore, T05 docs) tracked in the workplan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 11:30:49 +02:00
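
As general background for the "content-addressed" packages and "manifest digest" mentioned in 36bfa33fb9, the sketch below shows one common way to derive a digest from file contents. It is an assumption for illustration only: the manifest layout and hash choice used by artifact-store are not stated in the commit, and `manifest_digest` is a hypothetical name.

```python
import hashlib
import json


def manifest_digest(files):
    """Hash a canonical manifest built from a {relative_path: bytes} mapping.

    Assumed scheme: SHA-256 per file, then SHA-256 over a sorted JSON manifest.
    """
    entries = [
        {
            "path": path,
            "sha256": hashlib.sha256(files[path]).hexdigest(),
            "size": len(files[path]),
        }
        for path in sorted(files)
    ]
    # Compact separators and sorted keys make the serialization deterministic.
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this scheme, two snapshots with identical paths and bytes get the same digest regardless of the order in which files were collected, and any changed byte produces a different one.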
c3b62a6ec3 Agentic memory profile 2026-05-15 16:01:35 +02:00
a2daf9a46b docs(workplans): add memory profile pilot plan 2026-05-15 00:23:29 +02:00
9d1a2088aa Workplan for practical example 2026-05-14 22:05:10 +02:00
46aad3cce8 generic source-to-infospace generator 2026-05-14 19:33:22 +02:00
b442a2de47 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - IB-WP-0015-T03: todo → in_progress
2026-05-14 18:37:40 +02:00
ca9929d659 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - IB-WP-0015-T02: todo → in_progress
2026-05-14 18:37:40 +02:00
66cd85d0fc chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - IB-WP-0015-T01: todo → in_progress
2026-05-14 18:37:40 +02:00
b0acd3725b Workplan for infospace creation 2026-05-14 18:30:44 +02:00
a729a7643e infospace pipeline for wealth of nations example 2026-05-14 18:04:38 +02:00
7b5510a4c3 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - IB-WP-0013-T02: todo → in_progress
2026-05-14 17:50:20 +02:00
97a9c3b155 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - IB-WP-0013-T01: todo → in_progress
2026-05-14 17:50:20 +02:00
448b432942 Workplans to actually create infospaces 2026-05-14 17:46:48 +02:00
3de72eb0d2 command parity and migration guide 2026-05-14 17:16:39 +02:00
5d53c33d3e Kontextual Engine Integration Boundary 2026-05-14 16:43:29 +02:00
fc70acb257 engine and lifecycle 2026-05-14 16:26:42 +02:00
55405d8a5a acceptance matrix and workflow generation 2026-05-14 16:01:32 +02:00
7f54dec585 eval history and metrics 2026-05-14 15:35:04 +02:00
9627d03c1a entity relationship model 2026-05-14 15:06:17 +02:00
6eb3c6a0fb markitect-tool integration 2026-05-14 14:53:16 +02:00
28de86f13e docs and stuff 2026-05-14 13:47:36 +02:00
4b1e64f199 chore(consistency): sync task status from DB [auto]
Updated by fix-consistency on 2026-05-14:
  - IB-WP-0005-T01: todo → in_progress
2026-05-14 13:13:08 +02:00
9d643f6e99 Reestablishing intent based goals and workplans 2026-05-14 13:09:58 +02:00
8472b31ce3 workplan cleanup 2026-05-14 12:46:32 +02:00
916a895a85 Initial implementation 2026-05-14 11:32:25 +02:00
f25bd2cf84 State-hub connect and initial workplans 2026-05-03 20:43:56 +02:00