Source Adapter Contract
Purpose
This document pins the v1 contract for source-format adapters. It is the
handoff from MKTT-WP-0019 to MKTT-WP-0018: markitect-tool implements the
contract, registry, CLI, public API, and tests; markitect-filter implements
concrete adapters, starting with EPUB3.
The v1 contract is intentionally read-only. It normalizes heterogeneous source formats into canonical Markitect Markdown plus metadata, provenance, quality signals, and diagnostics. Writer/export adapters are future scope.
Scope
The v1 source adapter layer supports:
- local filesystem source inputs
- deterministic inspection and normalization
- package-provided read adapters discovered through Python entry points
- optional dependencies isolated in adapter packages
- JSON-serializable normalized Markdown outputs
- contract tests with fake adapters and small fixtures
The v1 layer does not support:
- EPUB3, PDF, DOCX, ODT, OCR, browser, or archive parsing in markitect-tool
- write/export adapters
- network fetching for source URIs
- durable ingestion, permissions, retrieval, or governance
- hidden AI-assisted repair or enrichment
URI fields appear in the model so adapters can preserve source identity, but v1 CLI/API inputs are local paths unless a later workplan opens remote source loading explicitly.
Package Shape
External adapter packages should depend on markitect-tool and register one or
more read adapter descriptors through the entry point group
markitect_tool.source_adapters.
Recommended markitect-filter shape:
markitect_filter/
    src/markitect_filter/
        __init__.py
        epub3.py
        adapters.py
    tests/
        fixtures/epub3/
        test_epub3_adapter.py
    pyproject.toml
The package should expose a lightweight descriptor function that does not import heavyweight format dependencies until the adapter is instantiated or used. For example:
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
Adapter packages may use extras such as markitect-filter[epub3] or
markitect-filter[pdf]. Missing optional dependencies must be reported through
structured diagnostics; they must not surface as raw import errors.
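A minimal sketch of a lightweight descriptor function with a deferred import, under stated assumptions: a plain dict stands in for the descriptor, `hypothetical_epub_lib` is a placeholder module name, and the shape of the value returned by the factory is invented for illustration. Only the deferred import and the source.missing_dependency code come from this document.

```python
import importlib


def _epub3_factory():
    """Import the heavy dependency only when the adapter is instantiated."""
    try:
        # 'hypothetical_epub_lib' is a placeholder, not a real requirement.
        lib = importlib.import_module("hypothetical_epub_lib")
    except ImportError as exc:
        # Report a structured diagnostic instead of letting ImportError escape.
        return {
            "adapter": None,
            "diagnostics": [
                {
                    "code": "source.missing_dependency",
                    "severity": "error",
                    "message": f"install markitect-filter[epub3]: {exc}",
                }
            ],
        }
    return {"adapter": lib, "diagnostics": []}


def epub3_adapter_descriptor():
    """Cheap to call: no format-specific imports happen here."""
    return {
        "id": "source.epub3",
        "operations": ["read"],
        "media_types": ["application/epub+zip"],
        "extensions": [".epub"],
        "factory": _epub3_factory,
    }
```

Listing adapters only calls `epub3_adapter_descriptor()`, so a missing extra costs nothing until a read is attempted.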
Descriptor Contract
MKTT-WP-0018 should implement a SourceAdapterDescriptor dataclass and map
it into the existing ExtensionDescriptor catalog with kind source-adapter.
Required descriptor fields:
| Field | Type | Meaning |
|---|---|---|
| id | str | Stable adapter id, for example source.epub3. |
| version | str | Adapter contract implementation version. |
| name | str | Human-readable adapter name. |
| operations | list[str] | V1 must contain only read. |
| media_types | list[str] | Supported media types, lower-case. |
| extensions | list[str] | Supported file suffixes including dots. |
| factory | callable | Returns a SourceReadAdapter. |
Optional descriptor fields:
| Field | Type | Meaning |
|---|---|---|
| summary | `str \| None` | |
| option_schema | dict | JSON-schema-like adapter options. |
| optional_dependencies | list[OptionalDependency] | Runtime libraries needed by the adapter. |
| safety | dict | Reads files, network, external processes, and related flags. |
| quality_profile | dict | Known extraction quality behavior. |
| metadata | dict | Adapter-specific metadata. |
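One possible shape for the descriptor, sketched as a frozen dataclass. The field names and types follow the two tables above; the default values, the `Any` return type of `factory`, and the validation in `__post_init__` are assumptions of this sketch, not requirements of the contract.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    # Required fields.
    id: str
    version: str
    name: str
    operations: list[str]
    media_types: list[str]
    extensions: list[str]
    factory: Callable[[], Any]  # expected to return a SourceReadAdapter
    # Optional fields; these defaults are assumptions.
    summary: str | None = None
    option_schema: dict = field(default_factory=dict)
    optional_dependencies: list = field(default_factory=list)
    safety: dict = field(default_factory=dict)
    quality_profile: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def __post_init__(self):
        # V1 must contain only "read".
        if self.operations != ["read"]:
            raise ValueError("v1 descriptors must declare only the read operation")
        # Media types are lower-case; extensions include the leading dot.
        if any(m != m.lower() for m in self.media_types):
            raise ValueError("media types must be lower-case")
        if any(not e.startswith(".") for e in self.extensions):
            raise ValueError("extensions must include the dot")
```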
The corresponding ExtensionDescriptor should use:
id: same as SourceAdapterDescriptor.id
kind: source-adapter
input_contract: SourceInspectRequest | SourceReadRequest
output_contract: SourceInspectResult | SourceReadResult
diagnostics_namespace: source
provenance_prefix: source.<adapter>
Capabilities should include:
source read
markdown normalize
diagnostics emit
provenance emit
filesystem read
Descriptor IDs are globally unique. Duplicate IDs from external packages are registry errors. Descriptors from package entry points are sorted by ID for deterministic listing.
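The uniqueness and ordering rules could be enforced along these lines. `SketchRegistry` and `DuplicateAdapterError` are hypothetical names used only for illustration; the real SourceAdapterRegistry may report duplicates differently.

```python
class DuplicateAdapterError(Exception):
    """Hypothetical error type for duplicate descriptor IDs."""


class SketchRegistry:
    def __init__(self):
        self._by_id = {}

    def register(self, descriptor_id, descriptor):
        # Duplicate IDs are registry errors, whichever package they come from.
        if descriptor_id in self._by_id:
            raise DuplicateAdapterError(f"duplicate adapter id: {descriptor_id}")
        self._by_id[descriptor_id] = descriptor

    def list_ids(self):
        # Sorted by ID so listings do not depend on discovery order.
        return sorted(self._by_id)
```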
Entry Point Contract
The entry point group is:
markitect_tool.source_adapters
Each entry point may load to one of:
- a SourceAdapterDescriptor
- an iterable of SourceAdapterDescriptor
- a callable returning either of the above
Discovery must not instantiate adapters unless the loaded object itself is a descriptor factory. Descriptors should remain cheap enough to list without format-specific imports.
Discovery errors should produce diagnostics with code
source.discovery_failed. Missing optional dependencies declared by a
descriptor should produce source.missing_dependency and mark that adapter
unavailable for reads until the dependency is installed.
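A sketch of discovery over the entry point group, assuming Python 3.10 or later for `entry_points(group=...)`. The `Descriptor` stand-in, the `discover` function name, and the diagnostic dict fields other than the code are assumptions. The sketch calls a loaded callable once to obtain descriptors but never calls an adapter factory.

```python
from dataclasses import dataclass
from importlib.metadata import entry_points

GROUP = "markitect_tool.source_adapters"


@dataclass(frozen=True)
class Descriptor:
    """Stand-in for SourceAdapterDescriptor; only the id matters here."""
    id: str


def normalize_loaded(obj):
    """Flatten a loaded entry point object into a list of descriptors."""
    if not isinstance(obj, Descriptor) and callable(obj):
        obj = obj()  # descriptor function, not an adapter factory
    if isinstance(obj, Descriptor):
        return [obj]
    return list(obj)  # an iterable of descriptors


def discover(eps=None):
    descriptors, diagnostics = [], []
    if eps is None:
        eps = entry_points(group=GROUP)  # keyword selection needs Python 3.10+
    for ep in eps:
        try:
            descriptors.extend(normalize_loaded(ep.load()))
        except Exception as exc:
            diagnostics.append({
                "code": "source.discovery_failed",
                "severity": "error",
                "entry_point": ep.name,
                "message": str(exc),
            })
    descriptors.sort(key=lambda d: d.id)
    return descriptors, diagnostics
```

Passing `eps` explicitly lets tests supply fake entry points without touching installed packages.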
Data Model
All model objects must support stable to_dict() serialization. Serialization
rules:
- omit None, empty lists, empty dicts, and empty strings
- preserve False, 0, and empty Markdown content where semantically valid
- use UTF-8 text
- use canonical JSON with sorted keys and compact separators when computing hashes or cache keys
- keep all timestamps and dates as strings unless they are filesystem metadata such as mtime_ns
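The serialization rules above can be sketched as a pruning helper plus a canonical JSON encoder. The function names and the `keep_empty` parameter (the set of keys whose empty string is meaningful) are assumptions; the contract only states the rules.

```python
import json


def to_plain(obj, keep_empty=frozenset({"markdown"})):
    """Drop None, empty lists, empty dicts, and empty strings; keep False and 0."""
    out = {}
    for key, value in obj.items():
        if isinstance(value, dict):
            value = to_plain(value, keep_empty)
        elif isinstance(value, list):
            value = [to_plain(v, keep_empty) if isinstance(v, dict) else v
                     for v in value]
        if value is None:
            continue
        if isinstance(value, str) and value == "" and key not in keep_empty:
            continue
        if isinstance(value, (list, dict)) and not value:
            continue
        out[key] = value
    return out


def canonical_json(obj):
    """Sorted keys and compact separators, for hashes and cache keys."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
```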
SourceAsset
| Field | Required | Type | Meaning |
|---|---|---|---|
| uri | yes | str | Stable source URI. For local files, use a normalized path URI or path string. |
| path | no | str | Local path when available. |
| name | no | str | Display name or basename. |
| media_type | no | str | Detected or declared media type. |
| extension | no | str | Lower-case suffix including the dot. |
| size | no | int | Byte size for local files. |
| mtime_ns | no | int | Local file modification timestamp in nanoseconds. |
| digest | no | str | sha256:<hex> of source bytes when available. |
| metadata | no | dict | Source asset metadata that is not document metadata. |
SourceMetadata
| Field | Required | Type | Meaning |
|---|---|---|---|
| title | no | str | Source title. |
| creators | no | list[str] | Authors or creators in source order. |
| language | no | str | BCP 47 language tag when known. |
| rights | no | str | Rights or license text from the source. |
| source_url | no | str | Original public URL when known. |
| publication_date | no | str | Source publication date string. |
| publisher | no | str | Publisher name. |
| identifiers | no | dict[str, str] | ISBN, DOI, package IDs, and similar identifiers. |
| raw | no | dict | Adapter-preserved raw metadata. |
SourceProvenance
| Field | Required | Type | Meaning |
|---|---|---|---|
| source_uri | yes | str | Source asset URI. |
| source_path | no | str | Local source path. |
| source_href | no | str | Package-internal href or document-relative reference. |
| package_path | no | str | Archive/package member path, such as EPUB XHTML. |
| anchor | no | str | Source anchor or fragment. |
| page | no | str | Page label or number where available. |
| section | no | str | Chapter, section, or nav label. |
| start_offset | no | int | Adapter-defined start offset. |
| end_offset | no | int | Adapter-defined end offset. |
| digest | no | str | Digest of the specific source component. |
| metadata | no | dict | Adapter-specific provenance details. |
NormalizedMarkdownSegment
| Field | Required | Type | Meaning |
|---|---|---|---|
| segment_id | yes | str | Stable ID unique within the document. |
| order | yes | int | Zero-based reading order. |
| markdown | yes | str | Canonical Markdown for the segment. |
| heading | no | str | Primary heading text for the segment. |
| heading_level | no | int | Markdown heading level when known. |
| anchors | no | list[str] | Source anchors covered by the segment. |
| provenance | no | list[SourceProvenance] | Source spans contributing to the segment. |
| metadata | no | dict | Adapter-specific segment metadata. |
Segment IDs should be deterministic. Prefer source anchors when they are stable
and unique. Otherwise use ordinal IDs such as seg-0001, seg-0002, and so
on. Segment order is always authoritative for reading order.
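A small sketch of deterministic segment ID assignment. It assumes an all-or-nothing policy: anchors are used only when every segment has one and none repeat, otherwise every segment gets an ordinal ID. That policy and the function name are assumptions; the contract only states the preference.

```python
def assign_segment_ids(anchors):
    """anchors: one optional source anchor per segment, in reading order."""
    usable = [a for a in anchors if a]
    anchors_ok = len(usable) == len(anchors) and len(set(usable)) == len(usable)
    if anchors_ok:
        return list(anchors)
    # Ordinal IDs start at seg-0001 even though the order field is zero-based.
    return [f"seg-{i:04d}" for i in range(1, len(anchors) + 1)]
```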
NormalizationQuality
| Field | Required | Type | Meaning |
|---|---|---|---|
| lossiness | yes | str | One of none, low, medium, high, or unknown. |
| confidence | no | float | Adapter confidence from 0.0 to 1.0. |
| skipped_items | no | int | Count of skipped source items. |
| warnings | no | int | Count of warning diagnostics. |
| metadata | no | dict | Adapter-specific quality details. |
NormalizedMarkdownDocument
| Field | Required | Type | Meaning |
|---|---|---|---|
| schema_version | yes | str | V1 uses markitect.source.v1. |
| document_id | yes | str | Stable normalized document ID. |
| asset | yes | SourceAsset | Original source identity. |
| metadata | yes | SourceMetadata | Source document metadata. |
| markdown | yes | str | Full normalized Markdown. |
| segments | yes | list[NormalizedMarkdownSegment] | Ordered segment list. |
| quality | yes | NormalizationQuality | Extraction quality summary. |
| diagnostics | no | list[Diagnostic] | Existing Markitect diagnostic shape. |
| provenance | no | list[SourceProvenance] | Document-level provenance. |
| attachments | no | list[SourceAsset] | Referenced binary assets; v1 metadata only. |
| adapter | yes | dict | Adapter id, version, and options. |
| cache_key | yes | str | Deterministic normalization cache key. |
The full markdown field should be equal to the ordered segment Markdown joined
with exactly two newlines, unless an adapter has a documented reason to emit
document-level frontmatter or separators.
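The default relationship between the document and its segments is small enough to show directly. This sketch uses plain dicts for segments and a hypothetical `join_segments` helper; it does not cover adapters that emit frontmatter or separators.

```python
def join_segments(segments):
    """Join segment Markdown in reading order with exactly two newlines."""
    ordered = sorted(segments, key=lambda s: s["order"])
    return "\n\n".join(s["markdown"] for s in ordered)
```

A contract test could assert that `document["markdown"] == join_segments(document["segments"])` for adapters without a documented exception.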
Hashing And Cache Keys
Source asset digests use the source bytes:
sha256:<hex>
Document IDs should be stable across machines and based on:
- normalized source asset URI or path
- source asset digest when available
- adapter ID
- adapter version
Normalization cache keys should be based on canonical JSON containing:
- source asset URI or path
- source asset digest
- adapter ID
- adapter version
- normalized model version
- read options
Use this prefix:
source-normalize:sha256:<hex>
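A sketch of both hashes using only the standard library. The payload key names (`source`, `digest`, `adapter_id`, and so on) are assumptions; the contract lists the inputs and the prefixes but not the exact JSON keys.

```python
import hashlib
import json


def source_digest(data: bytes) -> str:
    """Digest of the raw source bytes in sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def normalization_cache_key(source, digest, adapter_id, adapter_version,
                            model_version, options):
    payload = {
        "source": source,              # source asset URI or path
        "digest": digest,              # source asset digest
        "adapter_id": adapter_id,
        "adapter_version": adapter_version,
        "model_version": model_version,
        "options": options,            # read options, JSON-serializable
    }
    # Canonical JSON: sorted keys, compact separators, UTF-8 bytes.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    hexdigest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "source-normalize:sha256:" + hexdigest
```

Because keys are sorted before hashing, the order in which read options were supplied does not change the cache key.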
Read Adapter Protocol
MKTT-WP-0018 should implement Python Protocol classes equivalent to:
from typing import Protocol


class SourceReadAdapter(Protocol):
    descriptor: SourceAdapterDescriptor

    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
        ...

    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
        ...

    def read(self, request: SourceReadRequest) -> SourceReadResult:
        ...
Request and result objects:
| Type | Required fields | Meaning |
|---|---|---|
| SourceAdapterMatchRequest | asset, options | Cheap matching request. |
| SourceAdapterMatch | adapter_id, matched, confidence, reason, diagnostics | Match result. Confidence is 0 to 100. |
| SourceInspectRequest | asset, options | Metadata-only inspection request. |
| SourceInspectResult | valid, asset, adapter, metadata, capabilities, diagnostics, quality | Inspection result without full Markdown conversion. |
| SourceReadRequest | asset, options | Full normalization request. |
| SourceReadResult | valid, document, diagnostics | Normalized read result. |
inspect must not perform full conversion. It may open enough of the source to
validate structure and collect metadata. read may perform full extraction.
Options must be JSON-serializable. Adapter-specific options should be declared
in option_schema. Unknown options should produce
source.unknown_option unless the descriptor explicitly permits free-form
options.
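Option checking might look like the following. The sketch assumes that option_schema lists known options under a `properties` key, as JSON Schema does, and that the free-form permission arrives as a boolean flag; both are assumptions, as is the shape of the diagnostic dict.

```python
import json


def check_options(options, option_schema, allow_free_form=False):
    """Return source.unknown_option diagnostics for undeclared options."""
    json.dumps(options)  # raises TypeError if options are not JSON-serializable
    if allow_free_form:
        return []
    known = set(option_schema.get("properties", {}))
    return [
        {"code": "source.unknown_option", "option": key}
        for key in sorted(options)
        if key not in known
    ]
```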
Adapter Selection
Selection is deterministic:
- If an explicit adapter ID is provided, use only that descriptor.
- Prefer media type matches over extension-only matches.
- Prefer higher can_read().confidence.
- Prefer descriptors with required optional dependencies available.
- Break remaining ties by descriptor ID in ascending lexical order and emit warning source.adapter_ambiguous.
No matching adapter returns an error diagnostic:
source.unsupported_format
Malformed sources return an error diagnostic:
source.malformed
Missing required optional dependencies return:
source.missing_dependency
Warnings do not make a result invalid. Any error diagnostic makes valid
false.
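The selection rules can be expressed as a single sort key. This sketch assumes candidates have already been reduced to adapters whose can_read() matched and are represented as dicts with hypothetical keys (`media_type_match`, `deps_available`); the `severity` field on diagnostics is also an assumption about the existing Markitect diagnostic shape.

```python
def select_adapter(candidates, explicit_id=None):
    diagnostics = []
    if explicit_id is not None:
        # An explicit adapter ID restricts selection to that descriptor.
        candidates = [c for c in candidates if c["adapter_id"] == explicit_id]
    if not candidates:
        diagnostics.append({"code": "source.unsupported_format",
                            "severity": "error"})
        return None, diagnostics

    def rank(c):
        # Media type match first, then higher confidence, then available deps.
        return (not c["media_type_match"], -c["confidence"],
                not c["deps_available"])

    ordered = sorted(candidates, key=lambda c: (rank(c), c["adapter_id"]))
    best = ordered[0]
    if len(ordered) > 1 and rank(ordered[1]) == rank(best):
        # Tie broken by ascending descriptor ID; warn about the ambiguity.
        diagnostics.append({"code": "source.adapter_ambiguous",
                            "severity": "warning"})
    if not best["deps_available"]:
        diagnostics.append({"code": "source.missing_dependency",
                            "severity": "error"})
    return best["adapter_id"], diagnostics


def is_valid(diagnostics):
    """Warnings keep a result valid; any error makes it invalid."""
    return not any(d["severity"] == "error" for d in diagnostics)
```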
CLI Contract
The public commands are:
mkt source adapters
mkt source inspect <path>
mkt source normalize <path> --format markdown
Common options:
--adapter <adapter-id> Explicit adapter selection.
--format text|json|yaml For adapters and inspect.
--format markdown|json|yaml For normalize.
--option key=value Adapter-specific option, repeatable.
--output <path> Write normalized output.
Exit behavior:
| Exit | Meaning |
|---|---|
| 0 | Operation valid; warning diagnostics may exist. |
| 1 | Operation completed with error diagnostics. |
| 2 | CLI usage error from Click. |
JSON output must contain a top-level valid field for operations that can
fail. Markdown output writes only normalized Markdown to stdout or --output;
diagnostics for Markdown output go to stderr. If normalization is invalid, do
not emit partial Markdown unless a future option explicitly requests it.
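The Markdown output rules and exit codes might be wired together as follows. The `emit_markdown` helper, the dict-shaped result, and the one-line diagnostic format on stderr are assumptions for illustration; exit code 2 is left to Click and does not appear here.

```python
import sys


def emit_markdown(result, out=None, err=None):
    """Write Markdown to out, diagnostics to err, and return the exit code."""
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    for diag in result.get("diagnostics", []):
        # Diagnostics never mix into the Markdown stream.
        print(f"{diag['severity']}: {diag['code']}", file=err)
    if not result["valid"]:
        return 1  # invalid normalization: no partial Markdown
    out.write(result["document"]["markdown"])
    return 0
```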
API Contract
MKTT-WP-0018 should export these names from markitect_tool:
SourceAsset
SourceMetadata
SourceProvenance
NormalizedMarkdownSegment
NormalizedMarkdownDocument
NormalizationQuality
SourceAdapterDescriptor
SourceReadAdapter
SourceAdapterRegistry
SourceAdapterMatchRequest
SourceAdapterMatch
SourceInspectRequest
SourceInspectResult
SourceReadRequest
SourceReadResult
default_source_adapter_registry
discover_source_adapters
inspect_source
normalize_source
Direct API helpers should accept an optional registry and adapter ID so tests and sibling repos can avoid global discovery when they need deterministic fixtures.
Contract Tests For MKTT-WP-0018
Implementation should add tests for:
- SourceAsset, metadata, provenance, quality, segment, and document serialization
- source document cache-key determinism
- fake in-tree adapter registration and read behavior
- fake external entry point discovery
- optional dependency diagnostics
- unsupported format diagnostics
- malformed source diagnostics
- adapter selection tie behavior
- CLI source adapters JSON fixture
- CLI source inspect JSON fixture
- CLI source normalize --format json fixture
- CLI source normalize --format markdown fixture
- public API exports
Fixtures live in examples/source-adapters/ and should be reused by tests where
practical.
Markitect-filter Handoff
The first markitect-filter implementation should provide an EPUB3 descriptor:
id: source.epub3
name: EPUB3
operations: read
media_types: application/epub+zip
extensions: .epub
entry_point: markitect_filter.adapters:epub3_adapter_descriptor
The EPUB3 adapter should inspect and normalize:
- META-INF/container.xml
- the OPF package document
- Dublin Core and package metadata
- spine reading order
- navigation labels
- body XHTML as ordered Markdown segments
- source hrefs, anchors, sections, and page references where available
It should classify or skip cover, navigation, table-of-contents, header, footer, license, and transcriber-note material through explicit options and diagnostics. It should report unsupported media, malformed package structure, skipped assets, and lossy extraction.
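As a starting point for the first inspection step, here is a sketch that locates the OPF package document through META-INF/container.xml using only the standard library. The `find_opf_path` helper and the diagnostic dict fields are assumptions; the container namespace and the `full-path` attribute come from the EPUB Open Container Format.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def _malformed(message):
    return [{"code": "source.malformed", "severity": "error",
             "message": message}]


def find_opf_path(epub_file):
    """Return (opf_path, diagnostics) for a path or binary file object."""
    try:
        with zipfile.ZipFile(epub_file) as zf:
            root = ET.fromstring(zf.read("META-INF/container.xml"))
    except (zipfile.BadZipFile, KeyError, ET.ParseError) as exc:
        # Not a zip, container.xml missing, or container.xml not well-formed.
        return None, _malformed(str(exc))
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        return None, _malformed("container.xml has no rootfile full-path")
    return rootfile.get("full-path"), []
```

Because this step opens only one small archive member, it fits the inspect operation, which must not perform full conversion.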