# Source Adapter Migration Notes

## Purpose

These notes describe how sibling repositories should consume the
`markitect-tool` source adapter contract implemented by `MKTT-WP-0018`.

The source adapter layer is deliberately split:

```text
external source files
  -> markitect-filter concrete adapters
  -> markitect-tool source adapter protocol and normalized Markdown model
  -> infospace-bench workflows
  -> optional kontextual-engine ingestion
```

## Markitect-tool

`markitect-tool` owns the stable contract:

- normalized source data model
- read-only source adapter protocol
- adapter registry and Python entry point discovery
- `mkt source` CLI commands
- public API helpers such as `inspect_source` and `normalize_source`
- fake adapter contract tests

It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.

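The registry and entry point discovery listed above could work roughly as follows. This is a minimal sketch, not the `markitect-tool` implementation: the function name `discover_source_adapters` is hypothetical, and only the entry point group name `markitect_tool.source_adapters` comes from these notes. It relies on the standard-library `importlib.metadata.entry_points`.

```python
from importlib.metadata import entry_points

# Entry point group named in these notes.
ADAPTER_GROUP = "markitect_tool.source_adapters"


def discover_source_adapters(group: str = ADAPTER_GROUP) -> dict:
    """Return {entry point name: loaded object} for installed adapters.

    Hypothetical helper; the real registry API may differ.
    """
    all_entry_points = entry_points()
    if hasattr(all_entry_points, "select"):
        # Python 3.10+ returns a selectable collection.
        selected = all_entry_points.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict keyed by group name.
        selected = all_entry_points.get(group, [])

    adapters = {}
    for entry_point in selected:
        # load() imports the target module and resolves the attribute,
        # e.g. "markitect_filter.adapters:epub3_adapter_descriptor".
        adapters[entry_point.name] = entry_point.load()
    return adapters
```

With no adapter package installed, `discover_source_adapters()` returns an empty dict, which is why `markitect-tool` can stay free of source-format dependencies.
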
## Markitect-filter

`markitect-filter` should implement concrete adapters behind the entry point
group:

```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```

The first adapter should be:

```text
id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
```

The EPUB3 adapter should satisfy the contract tests described in
`docs/source-adapter-contract.md` and add EPUB-specific fixtures for container,
OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and
lossy extraction diagnostics.

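To make the descriptor fields above concrete, here is a minimal sketch of what `epub3_adapter_descriptor` might look like. The `SourceAdapterDescriptor` dataclass, its field names, and the `matches` helper are assumptions for illustration; the authoritative shape is whatever `docs/source-adapter-contract.md` specifies.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional, Tuple


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    """Hypothetical descriptor; field names mirror the listing above."""

    id: str
    operations: Tuple[str, ...]
    media_types: Tuple[str, ...]
    extensions: Tuple[str, ...]

    def matches(self, path: str, media_type: Optional[str] = None) -> bool:
        """Return True if the adapter claims this path or media type."""
        if media_type is not None and media_type in self.media_types:
            return True
        # Compare extensions case-insensitively (".EPUB" == ".epub").
        return PurePath(path).suffix.lower() in self.extensions


# Values taken from the "first adapter" listing in these notes.
epub3_adapter_descriptor = SourceAdapterDescriptor(
    id="source.epub3",
    operations=("read",),
    media_types=("application/epub+zip",),
    extensions=(".epub",),
)
```

A registry could then call `epub3_adapter_descriptor.matches("book.epub")` to decide which adapter handles a given source before any EPUB parsing code is imported.
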
## Infospace-bench

`infospace-bench` should replace its local EPUB intake spike with the public
source adapter API:

```python
from markitect_tool import normalize_source

result = normalize_source("source.epub")
if not result.is_valid:
    # Surface the adapter diagnostics instead of continuing with bad input.
    raise RuntimeError(result.to_dict()["diagnostics"])
markdown = result.document.markdown
segments = result.document.segments
```

Application workflows should consume normalized Markdown and segment metadata.
They should not depend on EPUB package internals, spine parsing, XHTML
extraction, or boilerplate classification directly.

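One way to keep workflows independent of EPUB internals is to route intake through a single thin function. The sketch below is an illustration only: `load_normalized`, `SourceNormalizationError`, and the `normalize` parameter are hypothetical names, and the code relies solely on the result attributes used in the snippet above (`is_valid`, `to_dict()`, `document.markdown`, `document.segments`).

```python
from typing import Any, Callable, List, Tuple


class SourceNormalizationError(RuntimeError):
    """Raised when a source cannot be normalized (hypothetical name)."""

    def __init__(self, path: str, diagnostics: Any) -> None:
        super().__init__(f"normalization failed for {path}: {diagnostics}")
        self.path = path
        self.diagnostics = diagnostics


def load_normalized(
    path: str, normalize: Callable[[str], Any]
) -> Tuple[str, List[Any]]:
    """Return (markdown, segments) or raise with the adapter diagnostics.

    `normalize` is expected to behave like markitect_tool.normalize_source
    as used in these notes; nothing else about its API is assumed.
    """
    result = normalize(path)
    if not result.is_valid:
        raise SourceNormalizationError(path, result.to_dict()["diagnostics"])
    return result.document.markdown, list(result.document.segments)
```

In production the call would be `load_normalized("source.epub", normalize_source)`. Passing the normalizer as an argument lets a workflow be exercised with a stub, without EPUB fixtures or `markitect-filter` installed.
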
## Kontextual-engine

`kontextual-engine` can treat normalized source outputs as ingestible derived
knowledge assets when it needs durable ingestion. The durable layer should
persist policy, indexing, retrieval, permissions, audit, and lifecycle state.
It should not require source-format dependencies inside `markitect-tool`.

Recommended ingestion boundary:

- keep `NormalizedMarkdownDocument.to_dict()` as the portable derivative
- preserve source asset digest and normalization cache key
- record adapter ID, adapter version, and read options
- preserve document and segment provenance
- store diagnostics and quality signals for human review

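The boundary above can be pictured as a single record handed to the durable layer. The following is a sketch under stated assumptions: `IngestionRecord`, `build_ingestion_record`, and every field name are hypothetical, SHA-256 is one plausible choice for the source asset digest, and the cache key is passed through as an opaque string because these notes do not define how it is derived. Document and segment provenance are assumed to travel inside the `to_dict()` payload.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class IngestionRecord:
    """Hypothetical envelope for one normalized source (not a real API)."""

    document: Dict[str, Any]   # NormalizedMarkdownDocument.to_dict() payload
    source_digest: str         # digest of the original source bytes
    cache_key: str             # normalization cache key, kept opaque
    adapter_id: str
    adapter_version: str
    read_options: Dict[str, Any] = field(default_factory=dict)
    diagnostics: List[Any] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize with sorted keys so stored records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


def build_ingestion_record(source_bytes, document, cache_key, adapter_id,
                           adapter_version, read_options=None,
                           diagnostics=None):
    """Assemble a record; SHA-256 is an assumed digest algorithm."""
    return IngestionRecord(
        document=document,
        source_digest="sha256:" + hashlib.sha256(source_bytes).hexdigest(),
        cache_key=cache_key,
        adapter_id=adapter_id,
        adapter_version=adapter_version,
        read_options=dict(read_options or {}),
        diagnostics=list(diagnostics or []),
    )
```

Keeping the record a plain, JSON-serializable structure means `kontextual-engine` can persist it without importing `markitect-tool` or any source-format library.
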
## Follow-up Workplan Seeds

Recommended `markitect-filter` workplan:

```text
MKTF-WP-0001: EPUB3 Read Adapter

Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations
```

Recommended `infospace-bench` workplan:

```text
ISB-WP-source-adapter-intake

Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts
```