markitect-tool/docs/source-adapter-migration.md
2026-05-14 22:05:34 +02:00

Source Adapter Migration Notes

Purpose

These notes describe how sibling repositories should consume the markitect-tool source adapter contract implemented by MKTT-WP-0018.

The source adapter layer is deliberately split:

external source files
  -> markitect-filter concrete adapters
  -> markitect-tool source adapter protocol and normalized Markdown model
  -> infospace-bench workflows
  -> optional kontextual-engine ingestion

Markitect-tool

markitect-tool owns the stable contract:

  • normalized source data model
  • read-only source adapter protocol
  • adapter registry and Python entry point discovery
  • mkt source CLI commands
  • public API helpers such as inspect_source and normalize_source
  • fake adapter contract tests

It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.

Markitect-filter

markitect-filter should implement concrete adapters behind the entry point group:

[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"

The first adapter should be:

id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
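The object behind `epub3_adapter_descriptor` could look roughly like the sketch below. The real descriptor type is defined by the markitect-tool contract and is not shown in these notes, so the `AdapterDescriptor` class and its `matches` helper are hypothetical stand-ins. Only the field values come from the declaration above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterDescriptor:
    """Hypothetical descriptor shape; the real one lives in markitect-tool."""

    id: str
    operations: tuple[str, ...]
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]

    def matches(self, path: str) -> bool:
        # Case-insensitive extension check; str.endswith accepts a tuple.
        return path.lower().endswith(self.extensions)


# Values copied from the adapter declaration in these notes.
epub3_adapter_descriptor = AdapterDescriptor(
    id="source.epub3",
    operations=("read",),
    media_types=("application/epub+zip",),
    extensions=(".epub",),
)
```

Keeping the descriptor immutable and free of EPUB imports would let the registry inspect it cheaply before any parsing code is loaded.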

The EPUB3 adapter should satisfy the contract tests described in docs/source-adapter-contract.md. It should also add EPUB-specific fixtures covering the container, OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and lossy extraction diagnostics.
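A container, OPF, and spine fixture can be built in memory with the standard library alone. The sketch below is one possible starting point and is not taken from markitect-filter: the helper names `build_fixture` and `read_spine_hrefs` are hypothetical, and the fixture omits nav and body XHTML. The XML namespaces and the `META-INF/container.xml` location follow the EPUB packaging conventions.

```python
import io
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

CONTAINER_XML = (
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    '<rootfiles><rootfile full-path="OEBPS/content.opf" '
    'media-type="application/oebps-package+xml"/></rootfiles></container>'
)

# The manifest order differs from the spine order on purpose.
OPF_XML = (
    '<package xmlns="http://www.idpf.org/2007/opf" version="3.0"><manifest>'
    '<item id="ch2" href="text/ch2.xhtml" media-type="application/xhtml+xml"/>'
    '<item id="ch1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>'
    '</manifest><spine><itemref idref="ch1"/><itemref idref="ch2"/></spine>'
    "</package>"
)


def build_fixture() -> bytes:
    """Return the bytes of a tiny EPUB-like archive (hypothetical fixture)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip")
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", OPF_XML)
    return buf.getvalue()


def read_spine_hrefs(epub: zipfile.ZipFile) -> list[str]:
    """Return archive paths of the spine documents in reading order."""
    container = ET.fromstring(epub.read("META-INF/container.xml"))
    rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        raise ValueError("container.xml does not declare a package document")
    opf_path = rootfile.get("full-path")

    package = ET.fromstring(epub.read(opf_path))
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF file
    hrefs = []
    for itemref in package.findall("opf:spine/opf:itemref", OPF_NS):
        idref = itemref.get("idref")
        if idref not in manifest:
            raise ValueError(f"spine references unknown manifest item {idref!r}")
        hrefs.append(posixpath.normpath(posixpath.join(base, manifest[idref])))
    return hrefs


if __name__ == "__main__":
    with zipfile.ZipFile(io.BytesIO(build_fixture())) as epub:
        print(read_spine_hrefs(epub))
```

Malformed-package fixtures can be derived from the same builder by dropping the `rootfile` element or pointing an `itemref` at a missing manifest item. A real adapter would presumably report these cases as diagnostics rather than raise.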

Infospace-bench

infospace-bench should replace its local EPUB intake spike with the public source adapter API:

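from markitect_tool import normalize_source

# Normalize the source file through the public API.
result = normalize_source("source.epub")
if not result.is_valid:
    # Fail loudly and surface the adapter diagnostics.
    raise RuntimeError(result.to_dict()["diagnostics"])

# Consume only the normalized outputs, never EPUB internals.
markdown = result.document.markdown
segments = result.document.segments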

Application workflows should consume normalized Markdown and segment metadata. They should not depend on EPUB package internals, spine parsing, XHTML extraction, or boilerplate classification directly.

Kontextual-engine

When it needs durable ingestion, kontextual-engine can ingest normalized source outputs as derived knowledge assets. The durable layer should own policy, indexing, retrieval, permissions, audit, and lifecycle state. It should not require markitect-tool to carry source-format dependencies.

Recommended ingestion boundary:

  • keep NormalizedMarkdownDocument.to_dict() as the portable derivative
  • preserve source asset digest and normalization cache key
  • record adapter ID, adapter version, and read options
  • preserve document and segment provenance
  • store diagnostics and quality signals for human review
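The boundary above could be captured in a single record, as in the following sketch. Every field name here is hypothetical, and the cache key derivation is only a stand-in. Where markitect-tool already reports a source digest and normalization cache key, those values should be stored as-is rather than recomputed. The `document` argument is assumed to be the output of `NormalizedMarkdownDocument.to_dict()`.

```python
import hashlib
import json
from typing import Any, Mapping


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_ingestion_record(
    source_bytes: bytes,
    document: Mapping[str, Any],
    adapter_id: str,
    adapter_version: str,
    read_options: Mapping[str, Any],
    diagnostics: list,
) -> dict:
    """Assemble a portable ingestion record (hypothetical layout)."""
    source_digest = sha256_hex(source_bytes)
    # Canonical JSON so that option ordering does not change the key.
    canonical_options = json.dumps(
        dict(read_options), sort_keys=True, separators=(",", ":")
    )
    # Stand-in derivation: digest + adapter identity + read options.
    cache_key = sha256_hex(
        "\n".join(
            [source_digest, adapter_id, adapter_version, canonical_options]
        ).encode("utf-8")
    )
    return {
        "source_digest": f"sha256:{source_digest}",
        "normalization_cache_key": cache_key,
        "adapter": {
            "id": adapter_id,
            "version": adapter_version,
            "read_options": dict(read_options),
        },
        # Document and segment provenance travel inside the derivative.
        "document": dict(document),
        "diagnostics": list(diagnostics),
    }
```

Tying the key to the adapter version and read options means a re-normalization with a newer adapter produces a distinct record instead of silently overwriting the earlier derivative.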

Follow-up Workplan Seeds

Recommended markitect-filter workplan:

MKTF-WP-0001: EPUB3 Read Adapter

Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations

Recommended infospace-bench workplan:

ISB-WP-source-adapter-intake

Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts