generated from coulomb/repo-seed
source adapter framework
docs/source-adapter-migration.md
# Source Adapter Migration Notes

## Purpose

These notes describe how sibling repositories should consume the
`markitect-tool` source adapter contract implemented by `MKTT-WP-0018`.

The source adapter layer is deliberately split:

```text
external source files
-> markitect-filter concrete adapters
-> markitect-tool source adapter protocol and normalized Markdown model
-> infospace-bench workflows
-> optional kontextual-engine ingestion
```

## Markitect-tool

`markitect-tool` owns the stable contract:

- normalized source data model
- read-only source adapter protocol
- adapter registry and Python entry point discovery
- `mkt source` CLI commands
- public API helpers such as `inspect_source` and `normalize_source`
- fake adapter contract tests

It does not own EPUB3, PDF, DOCX, ODT, OCR, or browser extraction.

## Markitect-filter

`markitect-filter` should implement concrete adapters behind the entry point
group:

```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```

The first adapter should be:

```text
id: source.epub3
operations: read
media_types: application/epub+zip
extensions: .epub
```
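
As a rough illustration, the descriptor fields above could be carried by a small immutable object. The `AdapterDescriptor` class and its `matches` helper are assumptions made for this sketch; the real descriptor type is whatever `docs/source-adapter-contract.md` defines.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class AdapterDescriptor:  # hypothetical shape, not the markitect-tool class
    id: str
    operations: tuple[str, ...]
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]

    def matches(self, path: str) -> bool:
        """True when the file suffix is one this adapter declares."""
        return Path(path).suffix.lower() in self.extensions


# Values copied from the descriptor listing in these notes.
epub3_adapter_descriptor = AdapterDescriptor(
    id="source.epub3",
    operations=("read",),
    media_types=("application/epub+zip",),
    extensions=(".epub",),
)
```

Matching on a lowercased suffix is a simplification; a real adapter would likely also check the media type or the container contents.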

The EPUB3 adapter should satisfy the contract tests described in
`docs/source-adapter-contract.md` and add EPUB-specific fixtures for container,
OPF, spine, nav, body XHTML, malformed package structure, skipped assets, and
lossy extraction diagnostics.
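
The container and malformed-package fixtures can be built in memory rather than checked in as binary files. The sketch below assumes nothing about the adapter API: `make_epub` and `find_package_path` are hypothetical helpers, and the diagnostics are plain strings for brevity. The fixture is deliberately minimal and omits parts a valid EPUB needs, such as the `mimetype` entry.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Optional

# Namespace used by META-INF/container.xml in EPUB packages.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = """<container version="1.0"
    xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def make_epub(files: dict[str, str]) -> bytes:
    """Build an in-memory zip archive to use as a test fixture."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, text in files.items():
            zf.writestr(name, text)
    return buf.getvalue()


def find_package_path(epub_bytes: bytes) -> tuple[Optional[str], list[str]]:
    """Return (OPF path, diagnostics); the path is None for a malformed package."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        try:
            raw = zf.read("META-INF/container.xml")
        except KeyError:
            return None, ["missing META-INF/container.xml"]
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        return None, [f"container.xml is not well-formed: {exc}"]
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        return None, ["container.xml declares no rootfile"]
    return rootfile.get("full-path"), []
```

Returning diagnostics instead of raising keeps malformed input on the same code path as valid input, which makes the malformed fixtures straightforward to assert against.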

## Infospace-bench

`infospace-bench` should replace its local EPUB intake spike with the public
source adapter API:

```python
from markitect_tool import normalize_source

# Normalize through whichever adapter is registered for this source.
result = normalize_source("source.epub")
if not result.is_valid:
    # Surface the adapter diagnostics instead of continuing with bad input.
    raise RuntimeError(result.to_dict()["diagnostics"])
markdown = result.document.markdown
segments = result.document.segments
```

Application workflows should consume normalized Markdown and segment metadata.
They should not depend on EPUB package internals, spine parsing, XHTML
extraction, or boilerplate classification directly.
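
One way to keep that boundary narrow is to pass the normalizer in as a callable, so workflow code can be tested without any adapter installed. This is only a sketch: `intake_source` and the fake result objects are hypothetical, and the only attributes used are the ones shown in the snippet above (`is_valid`, `to_dict()["diagnostics"]`, `document.markdown`, `document.segments`).

```python
from types import SimpleNamespace
from typing import Any, Callable


def intake_source(path: str, normalize: Callable[[str], Any]) -> dict[str, Any]:
    """Return only what a workflow needs: Markdown, segments, diagnostics."""
    result = normalize(path)
    diagnostics = result.to_dict().get("diagnostics", [])
    if not result.is_valid:
        raise RuntimeError(diagnostics)
    return {
        "markdown": result.document.markdown,
        "segments": list(result.document.segments),
        "diagnostics": diagnostics,
    }


def fake_normalize(path: str) -> Any:
    """Stand-in for normalize_source; the attribute values are made up."""
    document = SimpleNamespace(markdown="# Title", segments=["segment-1"])
    return SimpleNamespace(
        is_valid=path.endswith(".epub"),
        document=document,
        to_dict=lambda: {"diagnostics": []},
    )


intake = intake_source("source.epub", fake_normalize)
```

In the real workflow the call would be `intake_source("source.epub", normalize_source)`, with `normalize_source` imported from `markitect_tool`.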

## Kontextual-engine

`kontextual-engine` can treat normalized source outputs as ingestible derived
knowledge assets when it needs durable ingestion. The durable layer should
persist policy, indexing, retrieval, permissions, audit, and lifecycle state.
It should not require source-format dependencies inside `markitect-tool`.

Recommended ingestion boundary:

- keep `NormalizedMarkdownDocument.to_dict()` as the portable derivative
- preserve source asset digest and normalization cache key
- record adapter ID, adapter version, and read options
- preserve document and segment provenance
- store diagnostics and quality signals for human review
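
The boundary above could be captured in a single serializable record. Everything named here is an assumption for illustration: `IngestionRecord`, `sha256_digest`, and `cache_key` are hypothetical, and `markitect-tool` presumably supplies its own digest and cache key, which should be stored as given rather than recomputed.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any


def sha256_digest(data: bytes) -> str:
    """Digest of the raw source asset bytes."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def cache_key(source_digest: str, adapter_id: str, adapter_version: str,
              read_options: dict[str, Any]) -> str:
    """Hypothetical key that changes when source, adapter, or options change."""
    payload = json.dumps(
        [source_digest, adapter_id, adapter_version, read_options],
        sort_keys=True,  # makes the key independent of option ordering
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class IngestionRecord:
    document: dict[str, Any]  # output of NormalizedMarkdownDocument.to_dict()
    source_digest: str
    normalization_cache_key: str
    adapter_id: str
    adapter_version: str
    read_options: dict[str, Any] = field(default_factory=dict)
    diagnostics: list[Any] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole record for durable storage."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the document as a plain dictionary means the durable layer never imports adapter code, which matches the requirement that source-format dependencies stay out of the ingestion path.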

## Follow-up Workplan Seeds

Recommended `markitect-filter` workplan:

```text
MKTF-WP-0001: EPUB3 Read Adapter

Implement source.epub3 against docs/source-adapter-contract.md:
- package scaffold and pyproject entry point
- optional EPUB dependencies
- META-INF/container.xml parsing
- OPF metadata and spine reading order
- nav/chapter label extraction
- body XHTML to normalized Markdown segments
- explicit boilerplate skip policy
- malformed/unsupported/lossy diagnostics
- contract tests using markitect-tool fake adapter expectations
```
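
For the spine reading order item, a minimal sketch with the standard library is shown below. The function name `spine_reading_order` and the sample package document are made up for illustration; the sketch ignores `linear="no"` and returns hrefs relative to the OPF file's directory, both of which a real adapter would need to handle.

```python
import xml.etree.ElementTree as ET

# Namespace of the EPUB package document (OPF).
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="3.0">
  <manifest>
    <item id="ch1" href="ch1.xhtml" media-type="application/xhtml+xml"/>
    <item id="cover" href="cover.xhtml" media-type="application/xhtml+xml"/>
    <item id="css" href="style.css" media-type="text/css"/>
  </manifest>
  <spine>
    <itemref idref="cover"/>
    <itemref idref="ch1"/>
  </spine>
</package>
"""


def spine_reading_order(opf_xml: str) -> list[str]:
    """Resolve spine itemrefs to manifest hrefs, in reading order."""
    root = ET.fromstring(opf_xml)
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    order: list[str] = []
    for itemref in root.findall("opf:spine/opf:itemref", OPF_NS):
        href = manifest.get(itemref.get("idref"))
        if href is not None:  # dangling idrefs are skipped in this sketch
            order.append(href)
    return order


reading_order = spine_reading_order(SAMPLE_OPF)
```

The order comes from the spine, not the manifest, which is why `cover.xhtml` precedes `ch1.xhtml` here even though the manifest lists them the other way round.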

Recommended `infospace-bench` workplan:

```text
ISB-WP-source-adapter-intake

Replace local EPUB source intake with markitect-tool normalize_source:
- install markitect-filter[epub3] in the relevant environment
- call normalize_source for source documents
- consume NormalizedMarkdownDocument markdown and segments
- remove app-local EPUB package parsing
- preserve source diagnostics in benchmark review artifacts
```