Source Adapter Migration Notes
Purpose
These notes describe how sibling repositories should consume the
markitect-tool source adapter contract implemented by MKTT-WP-0018.
The source adapter layer is deliberately split:
external source files
-> markitect-filter concrete adapters
-> markitect-tool source adapter protocol and normalized Markdown model
-> infospace-bench workflows
-> optional kontextual-engine ingestion
Markitect-tool
markitect-tool owns the stable contract:
- normalized source data model
- read-only source adapter protocol
- adapter registry and Python entry point discovery
- mkt source CLI commands
- public API helpers such as inspect_source and normalize_source
- fake adapter contract tests
It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.
Markitect-filter
markitect-filter should implement concrete adapters behind the entry point
group:
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
The first adapter should be:
id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
The EPUB3 adapter should satisfy the contract tests described in
docs/source-adapter-contract.md and add EPUB-specific fixtures for container,
OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and
lossy extraction diagnostics.
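To make the descriptor fields above concrete, here is a minimal sketch of how they could be carried in Python. The AdapterDescriptor class and its claims method are hypothetical stand-ins; the authoritative descriptor type is whatever markitect-tool defines in its source adapter contract.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional


@dataclass(frozen=True)
class AdapterDescriptor:
    """Hypothetical descriptor shape, not the markitect-tool type."""

    id: str
    operations: tuple[str, ...]
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]

    def claims(self, filename: str, media_type: Optional[str] = None) -> bool:
        """True if the file extension or media type matches this adapter."""
        suffix = PurePath(filename).suffix.lower()
        return suffix in self.extensions or media_type in self.media_types


# Values copied from the field list above.
epub3_adapter_descriptor = AdapterDescriptor(
    id="source.epub3",
    operations=("read",),
    media_types=("application/epub+zip",),
    extensions=(".epub",),
)
```

Keeping the descriptor immutable and free of EPUB imports would let a registry inspect it without loading optional dependencies, though whether the real contract works that way is an assumption here.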
Infospace-bench
infospace-bench should replace its local EPUB intake spike with the public
source adapter API:
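from markitect_tool import normalize_source

result = normalize_source("source.epub")

# Stop early and surface the adapter diagnostics if normalization failed.
if not result.is_valid:
    raise RuntimeError(result.to_dict()["diagnostics"])

# Downstream code works only with the normalized output.
markdown = result.document.markdown
segments = result.document.segments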
Application workflows should consume normalized Markdown and segment metadata. They should not depend on EPUB package internals, spine parsing, XHTML extraction, or boilerplate classification directly.
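One way to keep that boundary explicit is to wrap the call in a small function that receives the normalizer as a parameter. This is a sketch, not part of markitect-tool: load_normalized is a hypothetical name, and it uses only the attributes shown in the snippet above (is_valid, to_dict, document.markdown, document.segments).

```python
from typing import Any, Callable, Tuple


def load_normalized(path: str, normalize: Callable[[str], Any]) -> Tuple[Any, Any]:
    """Return (markdown, segments) for a source file.

    `normalize` is expected to behave like markitect_tool.normalize_source.
    Injecting it keeps the workflow testable without any adapter installed.
    """
    result = normalize(path)
    if not result.is_valid:
        # Surface the diagnostics instead of continuing with partial output.
        raise RuntimeError(result.to_dict()["diagnostics"])
    document = result.document
    return document.markdown, document.segments
```

In application code this would be called as load_normalized("source.epub", normalize_source), while tests can pass a stub that returns a prepared result object.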
Kontextual-engine
kontextual-engine can treat normalized source outputs as ingestible derived
knowledge assets when it needs durable ingestion. The durable layer should
persist policy, indexing, retrieval, permissions, audit, and lifecycle state.
It should not require source-format dependencies inside markitect-tool.
Recommended ingestion boundary:
- keep NormalizedMarkdownDocument.to_dict() as the portable derivative
- preserve source asset digest and normalization cache key
- record adapter ID, adapter version, and read options
- preserve document and segment provenance
- store diagnostics and quality signals for human review
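The boundary above could be captured in a single JSON-serializable record. The following is a sketch under stated assumptions: build_ingestion_record and every key name in the returned dict are hypothetical, SHA-256 is an assumed digest choice, and the cache key is passed in because its derivation belongs to markitect-tool.

```python
import hashlib
import json
from typing import Any, Mapping, Optional, Sequence


def build_ingestion_record(
    source_bytes: bytes,
    document_dict: Mapping[str, Any],
    adapter_id: str,
    adapter_version: str,
    read_options: Mapping[str, Any],
    diagnostics: Sequence[Any],
    cache_key: Optional[str] = None,
) -> dict:
    """Assemble a portable record for durable ingestion (hypothetical layout)."""
    return {
        "source": {
            # Digest of the original asset; the algorithm is an assumption.
            "sha256": hashlib.sha256(source_bytes).hexdigest(),
            "normalization_cache_key": cache_key,
        },
        "adapter": {
            "id": adapter_id,
            "version": adapter_version,
            "read_options": dict(read_options),
        },
        # Stored unmodified so document and segment provenance inside the
        # derivative survives (assumes provenance is part of to_dict()).
        "document": dict(document_dict),
        "diagnostics": list(diagnostics),
    }


if __name__ == "__main__":
    record = build_ingestion_record(
        b"example bytes", {"markdown": "# Title", "segments": []},
        "source.epub3", "0.0.0", {}, [],
    )
    print(json.dumps(record, indent=2))
```

The document_dict argument would be the output of NormalizedMarkdownDocument.to_dict(); the version string in the usage example is a placeholder.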
Follow-up Workplan Seeds
Recommended markitect-filter workplan:
MKTF-WP-0001: EPUB3 Read Adapter
Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations
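As an illustration of the META-INF/container.xml step, the sketch below locates the OPF package document using only the standard library. The function name find_opf_path and the use of ValueError for malformed packages are assumptions; a real adapter would report these cases through the contract's diagnostics instead.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace defined by the EPUB Open Container Format for container.xml.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub: zipfile.ZipFile) -> str:
    """Return the full-path of the first rootfile declared in container.xml."""
    try:
        raw = epub.read("META-INF/container.xml")
    except KeyError as exc:
        raise ValueError("malformed EPUB: META-INF/container.xml is missing") from exc
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        raise ValueError("malformed EPUB: container.xml is not well-formed XML") from exc
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        raise ValueError("malformed EPUB: no rootfile full-path in container.xml")
    return rootfile.get("full-path")


def make_minimal_container(opf_path: str) -> bytes:
    """Build an in-memory zip holding only container.xml (fixture helper)."""
    xml = (
        '<?xml version="1.0"?>'
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        f'<rootfiles><rootfile full-path="{opf_path}" '
        'media-type="application/oebps-package+xml"/></rootfiles>'
        "</container>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("META-INF/container.xml", xml)
    return buf.getvalue()
```

The fixture helper builds only enough of a package to exercise this one step; it is not a valid EPUB and could serve as a starting point for the malformed-package fixtures mentioned earlier.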
Recommended infospace-bench workplan:
ISB-WP-source-adapter-intake
Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts