markitect-tool/docs/source-adapter-migration.md
2026-05-14 22:05:34 +02:00

# Source Adapter Migration Notes
## Purpose
These notes describe how sibling repositories should consume the
`markitect-tool` source adapter contract implemented by `MKTT-WP-0018`.
The source adapter layer is deliberately split:
```text
external source files
-> markitect-filter concrete adapters
-> markitect-tool source adapter protocol and normalized Markdown model
-> infospace-bench workflows
-> optional kontextual-engine ingestion
```
## Markitect-tool
`markitect-tool` owns the stable contract:
- normalized source data model
- read-only source adapter protocol
- adapter registry and Python entry point discovery
- `mkt source` CLI commands
- public API helpers such as `inspect_source` and `normalize_source`
- fake adapter contract tests
It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.
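
To make the idea of a read-only adapter protocol more concrete, here is a minimal sketch using `typing.Protocol`. Every name in it (`SketchDocument`, `ReadOnlySourceAdapter`, `can_read`, `read`, `FakeTextAdapter`) is hypothetical and chosen only for illustration. The actual protocol and fake adapter live in `markitect-tool` and may look different.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class SketchDocument:
    """Hypothetical stand-in for the normalized Markdown model."""

    markdown: str
    segments: tuple[str, ...] = ()


@runtime_checkable
class ReadOnlySourceAdapter(Protocol):
    """Hypothetical protocol shape: read operations only, no write or export."""

    adapter_id: str

    def can_read(self, path: str) -> bool: ...

    def read(self, path: str) -> SketchDocument: ...


class FakeTextAdapter:
    """A fake adapter of the kind a contract test could exercise."""

    adapter_id = "source.fake"

    def can_read(self, path: str) -> bool:
        return path.endswith(".txt")

    def read(self, path: str) -> SketchDocument:
        # A real adapter would parse the file; the fake only echoes the path.
        text = f"# {path}\n"
        return SketchDocument(markdown=text, segments=(text,))
```

Because the protocol is structural, a concrete adapter in `markitect-filter` would not need to inherit from anything in `markitect-tool` to satisfy a check like `isinstance(adapter, ReadOnlySourceAdapter)`.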
## Markitect-filter
`markitect-filter` should implement concrete adapters behind the entry point
group:
```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```
The first adapter should be:
```text
id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
```
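The descriptor fields above could be carried by a small immutable record. The sketch below assumes a dataclass named `AdapterDescriptor` and a helper `select_by_extension`. Both are hypothetical, and the real descriptor type is whatever `docs/source-adapter-contract.md` defines.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class AdapterDescriptor:
    """Hypothetical descriptor mirroring the four fields listed above."""

    id: str
    operations: tuple[str, ...]
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]


EPUB3 = AdapterDescriptor(
    id="source.epub3",
    operations=("read",),
    media_types=("application/epub+zip",),
    extensions=(".epub",),
)


def select_by_extension(path, descriptors):
    """Return the first descriptor whose extensions match the path, else None."""
    suffix = PurePath(path).suffix.lower()
    for descriptor in descriptors:
        if suffix in descriptor.extensions:
            return descriptor
    return None
```
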
The EPUB3 adapter should satisfy the contract tests described in
`docs/source-adapter-contract.md` and add EPUB-specific fixtures for container,
OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and
lossy extraction diagnostics.
## Infospace-bench
`infospace-bench` should replace its local EPUB intake spike with the public
source adapter API:
```python
from markitect_tool import normalize_source

result = normalize_source("source.epub")
if not result.is_valid:
    # Surface the adapter diagnostics instead of failing silently.
    raise RuntimeError(result.to_dict()["diagnostics"])

markdown = result.document.markdown
segments = result.document.segments
```
Application workflows should consume normalized Markdown and segment metadata.
They should not depend on EPUB package internals, spine parsing, XHTML
extraction, or boilerplate classification directly.
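
One way to keep workflows at that boundary is a thin wrapper that touches only the attributes used in the snippet above (`is_valid`, `to_dict()["diagnostics"]`, `document.markdown`, `document.segments`). The names `SourceIntakeError` and `load_normalized` are hypothetical. The normalizer is passed in as an argument so the sketch runs without `markitect-tool` installed; in the application it would be `normalize_source`.

```python
from types import SimpleNamespace


class SourceIntakeError(RuntimeError):
    """Hypothetical error type that keeps diagnostics available for review."""

    def __init__(self, path, diagnostics):
        super().__init__(f"normalization failed for {path}")
        self.diagnostics = diagnostics


def load_normalized(path, normalize):
    """Return (markdown, segments) or raise SourceIntakeError."""
    result = normalize(path)
    if not result.is_valid:
        raise SourceIntakeError(path, result.to_dict()["diagnostics"])
    return result.document.markdown, result.document.segments


def fake_normalize(path):
    """Stand-in for normalize_source, used here for demonstration only."""
    document = SimpleNamespace(markdown="# Title\n", segments=["# Title\n"])
    return SimpleNamespace(
        is_valid=True,
        document=document,
        to_dict=lambda: {"diagnostics": []},
    )


if __name__ == "__main__":
    markdown, segments = load_normalized("source.epub", fake_normalize)
    print(len(markdown), len(segments))
```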
## Kontextual-engine
`kontextual-engine` can treat normalized source outputs as ingestible derived
knowledge assets when it needs durable ingestion. The durable layer should
persist policy, indexing, retrieval, permissions, audit, and lifecycle state.
It should not require source-format dependencies inside `markitect-tool`.
Recommended ingestion boundary:
- keep `NormalizedMarkdownDocument.to_dict()` as the portable derivative
- preserve source asset digest and normalization cache key
- record adapter ID, adapter version, and read options
- preserve document and segment provenance
- store diagnostics and quality signals for human review
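
The boundary listed above could be captured in a single ingestion record. The sketch below is an assumption throughout: the function name, the record field names, and in particular the way the cache key is derived (SHA-256 over digest, adapter ID, adapter version, and sorted read options) are illustrative and not the scheme `markitect-tool` actually uses.

```python
import hashlib
import json


def build_ingestion_record(source_bytes, document_dict, diagnostics,
                           adapter_id, adapter_version, read_options):
    """Assemble a portable record for a durable ingestion layer.

    document_dict is expected to be the output of
    NormalizedMarkdownDocument.to_dict(); it is stored unchanged.
    """
    digest = hashlib.sha256(source_bytes).hexdigest()
    # Sorting keys makes the serialized options stable across runs.
    options_json = json.dumps(read_options, sort_keys=True)
    # Hypothetical cache key: any change in input, adapter, or options alters it.
    key_material = "\n".join([digest, adapter_id, adapter_version, options_json])
    cache_key = hashlib.sha256(key_material.encode("utf-8")).hexdigest()
    return {
        "document": document_dict,
        "source_digest": f"sha256:{digest}",
        "normalization_cache_key": cache_key,
        "adapter": {"id": adapter_id, "version": adapter_version},
        "read_options": read_options,
        "diagnostics": list(diagnostics),
    }
```

Keeping the adapter version and read options in the key material means a re-run with a newer adapter or different options yields a new key instead of silently reusing a stale derivative.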
## Follow-up Workplan Seeds
Recommended `markitect-filter` workplan:
```text
MKTF-WP-0001: EPUB3 Read Adapter
Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations
```
Recommended `infospace-bench` workplan:
```text
ISB-WP-source-adapter-intake
Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts
```