# Source Adapter Contract

## Purpose

This document pins the v1 contract for source-format adapters. It is the handoff from `MKTT-WP-0019` to `MKTT-WP-0018`: `markitect-tool` implements the contract, registry, CLI, public API, and tests; `markitect-filter` implements concrete adapters, starting with EPUB3.

The v1 contract is intentionally read-only. It normalizes heterogeneous source formats into canonical Markitect Markdown plus metadata, provenance, quality signals, and diagnostics. Writer/export adapters are future scope.

## Scope

The v1 source adapter layer supports:

- local filesystem source inputs
- deterministic inspection and normalization
- package-provided read adapters discovered through Python entry points
- optional dependencies isolated in adapter packages
- JSON-serializable normalized Markdown outputs
- contract tests with fake adapters and small fixtures

The v1 layer does not support:

- EPUB3, PDF, DOCX, ODT, OCR, browser, or archive parsing in `markitect-tool`
- write/export adapters
- network fetching for source URIs
- durable ingestion, permissions, retrieval, or governance
- hidden AI-assisted repair or enrichment

URI fields appear in the model so adapters can preserve source identity, but v1 CLI/API inputs are local paths unless a later workplan opens remote source loading explicitly.

## Package Shape

External adapter packages should depend on `markitect-tool` and register one or more read adapter descriptors through the entry point group `markitect_tool.source_adapters`.

Recommended `markitect-filter` shape:

```text
markitect_filter/
  src/markitect_filter/
    __init__.py
    epub3.py
    adapters.py
  tests/
    fixtures/epub3/
    test_epub3_adapter.py
  pyproject.toml
```

The package should expose a lightweight descriptor function that does not import heavyweight format dependencies until the adapter is instantiated or used.
For example:

```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```

Adapter packages may use extras such as `markitect-filter[epub3]` or `markitect-filter[pdf]`. Missing optional dependencies must be reported through structured diagnostics; they must not surface as raw import errors.

## Descriptor Contract

`MKTT-WP-0018` should implement a `SourceAdapterDescriptor` dataclass and map it into the existing `ExtensionDescriptor` catalog with kind `source-adapter`.

Required descriptor fields:

| Field | Type | Meaning |
| --- | --- | --- |
| `id` | `str` | Stable adapter id, for example `source.epub3`. |
| `version` | `str` | Adapter contract implementation version. |
| `name` | `str` | Human-readable adapter name. |
| `operations` | `list[str]` | V1 must contain only `read`. |
| `media_types` | `list[str]` | Supported media types, lower-case. |
| `extensions` | `list[str]` | Supported file suffixes including dots. |
| `factory` | callable | Returns a `SourceReadAdapter`. |

Optional descriptor fields:

| Field | Type | Meaning |
| --- | --- | --- |
| `summary` | `str \| None` | Short description for CLI and docs. |
| `option_schema` | `dict` | JSON-schema-like adapter options. |
| `optional_dependencies` | `list[OptionalDependency]` | Runtime libraries needed by the adapter. |
| `safety` | `dict` | Reads files, network, external processes, and related flags. |
| `quality_profile` | `dict` | Known extraction quality behavior. |
| `metadata` | `dict` | Adapter-specific metadata. |

The corresponding `ExtensionDescriptor` should use:

```text
id: same as SourceAdapterDescriptor.id
kind: source-adapter
input_contract: SourceInspectRequest | SourceReadRequest
output_contract: SourceInspectResult | SourceReadResult
diagnostics_namespace: source
provenance_prefix: source.
```

Capabilities should include:

```text
source read
markdown normalize
diagnostics emit
provenance emit
filesystem read
```

Descriptor IDs are globally unique. Duplicate IDs from external packages are registry errors. Descriptors from package entry points are sorted by ID for deterministic listing.

## Entry Point Contract

The entry point group is:

```text
markitect_tool.source_adapters
```

Each entry point may load to one of:

- a `SourceAdapterDescriptor`
- an iterable of `SourceAdapterDescriptor`
- a callable returning either of the above

Discovery must not instantiate adapters unless the loaded object itself is a descriptor factory. Descriptors should remain cheap enough to list without format-specific imports.

Discovery errors should produce diagnostics with code `source.discovery_failed`. Missing optional dependencies declared by a descriptor should produce `source.missing_dependency` and mark that adapter unavailable for reads until the dependency is installed.

## Data Model

All model objects must support stable `to_dict()` serialization.

Serialization rules:

- omit `None`, empty lists, empty dicts, and empty strings
- preserve `False`, `0`, and empty Markdown content where semantically valid
- use UTF-8 text
- use canonical JSON with sorted keys and compact separators when computing hashes or cache keys
- keep all timestamps and dates as strings unless they are filesystem metadata such as `mtime_ns`

### `SourceAsset`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `uri` | yes | `str` | Stable source URI. For local files, use a normalized path URI or path string. |
| `path` | no | `str` | Local path when available. |
| `name` | no | `str` | Display name or basename. |
| `media_type` | no | `str` | Detected or declared media type. |
| `extension` | no | `str` | Lower-case suffix including the dot. |
| `size` | no | `int` | Byte size for local files. |
| `mtime_ns` | no | `int` | Local file modification timestamp in nanoseconds. |
| `digest` | no | `str` | `sha256:` of source bytes when available. |
| `metadata` | no | `dict` | Source asset metadata that is not document metadata. |

### `SourceMetadata`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `title` | no | `str` | Source title. |
| `creators` | no | `list[str]` | Authors or creators in source order. |
| `language` | no | `str` | BCP 47 language tag when known. |
| `rights` | no | `str` | Rights or license text from the source. |
| `source_url` | no | `str` | Original public URL when known. |
| `publication_date` | no | `str` | Source publication date string. |
| `publisher` | no | `str` | Publisher name. |
| `identifiers` | no | `dict[str, str]` | ISBN, DOI, package IDs, and similar identifiers. |
| `raw` | no | `dict` | Adapter-preserved raw metadata. |

### `SourceProvenance`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `source_uri` | yes | `str` | Source asset URI. |
| `source_path` | no | `str` | Local source path. |
| `source_href` | no | `str` | Package-internal href or document-relative reference. |
| `package_path` | no | `str` | Archive/package member path, such as EPUB XHTML. |
| `anchor` | no | `str` | Source anchor or fragment. |
| `page` | no | `str` | Page label or number where available. |
| `section` | no | `str` | Chapter, section, or nav label. |
| `start_offset` | no | `int` | Adapter-defined start offset. |
| `end_offset` | no | `int` | Adapter-defined end offset. |
| `digest` | no | `str` | Digest of the specific source component. |
| `metadata` | no | `dict` | Adapter-specific provenance details. |

### `NormalizedMarkdownSegment`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `segment_id` | yes | `str` | Stable ID unique within the document. |
| `order` | yes | `int` | Zero-based reading order. |
| `markdown` | yes | `str` | Canonical Markdown for the segment. |
| `heading` | no | `str` | Primary heading text for the segment. |
| `heading_level` | no | `int` | Markdown heading level when known. |
| `anchors` | no | `list[str]` | Source anchors covered by the segment. |
| `provenance` | no | `list[SourceProvenance]` | Source spans contributing to the segment. |
| `metadata` | no | `dict` | Adapter-specific segment metadata. |

Segment IDs should be deterministic. Prefer source anchors when they are stable and unique. Otherwise use ordinal IDs such as `seg-0001`, `seg-0002`, and so on. Segment order is always authoritative for reading order.

### `NormalizationQuality`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `lossiness` | yes | `str` | One of `none`, `low`, `medium`, `high`, or `unknown`. |
| `confidence` | no | `float` | Adapter confidence from `0.0` to `1.0`. |
| `skipped_items` | no | `int` | Count of skipped source items. |
| `warnings` | no | `int` | Count of warning diagnostics. |
| `metadata` | no | `dict` | Adapter-specific quality details. |

### `NormalizedMarkdownDocument`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `schema_version` | yes | `str` | V1 uses `markitect.source.v1`. |
| `document_id` | yes | `str` | Stable normalized document ID. |
| `asset` | yes | `SourceAsset` | Original source identity. |
| `metadata` | yes | `SourceMetadata` | Source document metadata. |
| `markdown` | yes | `str` | Full normalized Markdown. |
| `segments` | yes | `list[NormalizedMarkdownSegment]` | Ordered segment list. |
| `quality` | yes | `NormalizationQuality` | Extraction quality summary. |
| `diagnostics` | no | `list[Diagnostic]` | Existing Markitect diagnostic shape. |
| `provenance` | no | `list[SourceProvenance]` | Document-level provenance. |
| `attachments` | no | `list[SourceAsset]` | Referenced binary assets; v1 metadata only. |
| `adapter` | yes | `dict` | Adapter id, version, and options. |
| `cache_key` | yes | `str` | Deterministic normalization cache key. |

The full `markdown` field should be equal to the ordered segment Markdown joined with exactly two newlines, unless an adapter has a documented reason to emit document-level frontmatter or separators.

## Hashing And Cache Keys

Source asset digests use the source bytes:

```text
sha256:
```

Document IDs should be stable across machines and based on:

- normalized source asset URI or path
- source asset digest when available
- adapter ID
- adapter version

Normalization cache keys should be based on canonical JSON containing:

- source asset URI or path
- source asset digest
- adapter ID
- adapter version
- normalized model version
- read options

Use this prefix:

```text
source-normalize:sha256:
```

## Read Adapter Protocol

`MKTT-WP-0018` should implement Python `Protocol` classes equivalent to:

```python
from typing import Protocol


class SourceReadAdapter(Protocol):
    # The descriptor that registered this adapter.
    descriptor: SourceAdapterDescriptor

    # Cheap match check; must not perform extraction.
    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch: ...

    # Metadata-only inspection without full Markdown conversion.
    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult: ...

    # Full normalization into a NormalizedMarkdownDocument.
    def read(self, request: SourceReadRequest) -> SourceReadResult: ...
```

Request and result objects:

| Type | Required fields | Meaning |
| --- | --- | --- |
| `SourceAdapterMatchRequest` | `asset`, `options` | Cheap matching request. |
| `SourceAdapterMatch` | `adapter_id`, `matched`, `confidence`, `reason`, `diagnostics` | Match result. Confidence is `0` to `100`. |
| `SourceInspectRequest` | `asset`, `options` | Metadata-only inspection request. |
| `SourceInspectResult` | `valid`, `asset`, `adapter`, `metadata`, `capabilities`, `diagnostics`, `quality` | Inspection result without full Markdown conversion. |
| `SourceReadRequest` | `asset`, `options` | Full normalization request. |
| `SourceReadResult` | `valid`, `document`, `diagnostics` | Normalized read result. |

`inspect` must not perform full conversion. It may open enough of the source to validate structure and collect metadata. `read` may perform full extraction.
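
A minimal sketch of a fake adapter with the three protocol methods, of the kind the contract tests could use. Plain dicts stand in for the request, result, and document dataclasses, the document is abbreviated, and `FakeTextAdapter`, its `source.fake-text` ID, and the inline `text` metadata key are all hypothetical; this is not the `markitect-tool` implementation.

```python
from typing import Any


class FakeTextAdapter:
    """Hypothetical in-memory adapter; dicts stand in for the contract dataclasses."""

    adapter_id = "source.fake-text"  # hypothetical ID, not a real adapter

    def can_read(self, request: dict[str, Any]) -> dict[str, Any]:
        # Cheap match: look only at the declared extension, never open the source.
        matched = request["asset"].get("extension") == ".txt"
        return {
            "adapter_id": self.adapter_id,
            "matched": matched,
            "confidence": 80 if matched else 0,  # match confidence runs 0 to 100
            "reason": "extension" if matched else "unsupported extension",
            "diagnostics": [],
        }

    def inspect(self, request: dict[str, Any]) -> dict[str, Any]:
        # Metadata only: no Markdown conversion happens here.
        asset = request["asset"]
        return {
            "valid": True,
            "asset": asset,
            "adapter": {"id": self.adapter_id},
            "metadata": {"title": asset.get("name", "")},
            "diagnostics": [],
        }

    def read(self, request: dict[str, Any]) -> dict[str, Any]:
        # Full extraction: one segment per blank-line-separated block.
        text = request["asset"]["metadata"]["text"]  # assumption: content carried inline
        blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
        segments = [
            {"segment_id": f"seg-{i + 1:04d}", "order": i, "markdown": block}
            for i, block in enumerate(blocks)
        ]
        # Abbreviated document: a real one also carries document_id, asset,
        # metadata, adapter, and cache_key.
        document = {
            "schema_version": "markitect.source.v1",
            "markdown": "\n\n".join(s["markdown"] for s in segments),
            "segments": segments,
            "quality": {"lossiness": "none"},
        }
        return {"valid": True, "document": document, "diagnostics": []}
```

Keeping `can_read` and `inspect` free of conversion work is what lets a registry list and rank adapters cheaply before committing to a full `read`.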

Options must be JSON-serializable. Adapter-specific options should be declared in `option_schema`. Unknown options should produce `source.unknown_option` unless the descriptor explicitly permits free-form options.

## Adapter Selection

Selection is deterministic:

1. If an explicit adapter ID is provided, use only that descriptor.
2. Prefer media type matches over extension-only matches.
3. Prefer higher `can_read().confidence`.
4. Prefer descriptors with required optional dependencies available.
5. Break remaining ties by descriptor ID in ascending lexical order and emit warning `source.adapter_ambiguous`.

No matching adapter returns an error diagnostic:

```text
source.unsupported_format
```

Malformed sources return an error diagnostic:

```text
source.malformed
```

Missing required optional dependencies return:

```text
source.missing_dependency
```

Warnings do not make a result invalid. Any error diagnostic makes `valid` false.

## CLI Contract

The public commands are:

```bash
mkt source adapters
mkt source inspect
mkt source normalize --format markdown
```

Common options:

```text
--adapter                      Explicit adapter selection.
--format text|json|yaml        For adapters and inspect.
--format markdown|json|yaml    For normalize.
--option key=value             Adapter-specific option, repeatable.
--output                       Write normalized output.
```

Exit behavior:

| Exit | Meaning |
| --- | --- |
| `0` | Operation valid; warning diagnostics may exist. |
| `1` | Operation completed with error diagnostics. |
| `2` | CLI usage error from Click. |

JSON output must contain a top-level `valid` field for operations that can fail. Markdown output writes only normalized Markdown to stdout or `--output`; diagnostics for Markdown output go to stderr. If normalization is invalid, do not emit partial Markdown unless a future option explicitly requests it.
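
The validity and Markdown output rules above can be sketched as follows. This assumes diagnostics are dicts with `severity` and `code` keys, which is a guess at the existing Markitect diagnostic shape; `is_valid` and `emit_markdown` are hypothetical helper names. Exit code `2` is left to Click and is not modeled.

```python
import sys
from typing import TextIO


def is_valid(diagnostics: list[dict]) -> bool:
    """Warnings are tolerated; any error-severity diagnostic invalidates the result."""
    # Assumption: each diagnostic carries a "severity" key.
    return not any(d.get("severity") == "error" for d in diagnostics)


def emit_markdown(result: dict, out: TextIO = sys.stdout, err: TextIO = sys.stderr) -> int:
    """Write Markdown to `out`, diagnostics to `err`, and return the exit code."""
    diagnostics = result.get("diagnostics", [])
    for d in diagnostics:
        # Diagnostics never mix into the Markdown stream.
        print(f"{d['severity']}: {d['code']}", file=err)
    if not is_valid(diagnostics):
        return 1  # invalid result: emit no partial Markdown
    out.write(result["document"]["markdown"])
    return 0
```

Routing diagnostics to a separate stream keeps `mkt source normalize --format markdown` safe to pipe into other tools even when warnings are present.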
## API Contract

`MKTT-WP-0018` should export these names from `markitect_tool`:

```text
SourceAsset
SourceMetadata
SourceProvenance
NormalizedMarkdownSegment
NormalizedMarkdownDocument
NormalizationQuality
SourceAdapterDescriptor
SourceReadAdapter
SourceAdapterRegistry
SourceAdapterMatchRequest
SourceAdapterMatch
SourceInspectRequest
SourceInspectResult
SourceReadRequest
SourceReadResult
default_source_adapter_registry
discover_source_adapters
inspect_source
normalize_source
```

Direct API helpers should accept an optional registry and adapter ID so tests and sibling repos can avoid global discovery when they need deterministic fixtures.

## Contract Tests For MKTT-WP-0018

Implementation should add tests for:

- `SourceAsset`, metadata, provenance, quality, segment, and document serialization
- source document cache-key determinism
- fake in-tree adapter registration and read behavior
- fake external entry point discovery
- optional dependency diagnostics
- unsupported format diagnostics
- malformed source diagnostics
- adapter selection tie behavior
- CLI `source adapters` JSON fixture
- CLI `source inspect` JSON fixture
- CLI `source normalize --format json` fixture
- CLI `source normalize --format markdown` fixture
- public API exports

Fixtures live in `examples/source-adapters/` and should be reused by tests where practical.
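
As an illustration of the cache-key determinism test listed above, here is a sketch that builds a key from canonical JSON (sorted keys, compact separators) and checks that option ordering does not change it. The function name, payload field names, and option names are hypothetical, and appending the hex SHA-256 digest directly after the `source-normalize:sha256:` prefix is an assumption.

```python
import hashlib
import json
from typing import Any, Optional

CACHE_KEY_PREFIX = "source-normalize:sha256:"


def normalization_cache_key(
    source: str,
    digest: Optional[str],
    adapter_id: str,
    adapter_version: str,
    model_version: str,
    options: dict[str, Any],
) -> str:
    # Hypothetical payload field names; the contract lists only the inputs.
    payload = {
        "source": source,
        "digest": digest,
        "adapter_id": adapter_id,
        "adapter_version": adapter_version,
        "model_version": model_version,
        "options": options,
    }
    # Canonical JSON: sorted keys and compact separators, encoded as UTF-8.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    # Assumption: the hex digest follows the prefix directly.
    return CACHE_KEY_PREFIX + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def test_cache_key_is_deterministic() -> None:
    base = ("book.epub", "sha256:abc", "source.epub3", "1.0", "markitect.source.v1")
    first = normalization_cache_key(*base, {"skip_cover": True, "lang": "en"})
    second = normalization_cache_key(*base, {"lang": "en", "skip_cover": True})
    assert first == second  # option order must not affect the key
    assert first != normalization_cache_key(*base, {"lang": "de"})
    assert first.startswith(CACHE_KEY_PREFIX)
```

Sorting keys before hashing is what makes the key independent of dict insertion order, so two callers passing the same options in a different order hit the same cache entry.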
## Markitect-filter Handoff

The first `markitect-filter` implementation should provide an EPUB3 descriptor:

```text
id: source.epub3
name: EPUB3
operations: read
media_types: application/epub+zip
extensions: .epub
entry_point: markitect_filter.adapters:epub3_adapter_descriptor
```

The EPUB3 adapter should inspect and normalize:

- `META-INF/container.xml`
- the OPF package document
- Dublin Core and package metadata
- spine reading order
- navigation labels
- body XHTML as ordered Markdown segments
- source hrefs, anchors, sections, and page references where available

It should classify or skip cover, navigation, table-of-contents, header, footer, license, and transcriber-note material through explicit options and diagnostics. It should report unsupported media, malformed package structure, skipped assets, and lossy extraction.
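
A minimal sketch of the first inspection step, locating the OPF package document through `META-INF/container.xml` using only the standard library. The function name `find_opf_path` and the dict diagnostic shape with `severity`, `code`, and `message` keys are assumptions for illustration; a real adapter would go on to parse the OPF metadata, spine, and navigation.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Optional

# Namespace used by the OCF container document.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def _malformed(message: str) -> list[dict]:
    # Assumption: diagnostics are dicts with severity, code, and message.
    return [{"severity": "error", "code": "source.malformed", "message": message}]


def find_opf_path(epub_bytes: bytes) -> tuple[Optional[str], list[dict]]:
    """Return the OPF path declared in container.xml, plus any diagnostics."""
    try:
        with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
            container = archive.read("META-INF/container.xml")
    except zipfile.BadZipFile:
        return None, _malformed("source is not a ZIP archive")
    except KeyError:
        return None, _malformed("META-INF/container.xml is missing")
    try:
        root = ET.fromstring(container)
    except ET.ParseError as exc:
        return None, _malformed(f"container.xml is not well-formed: {exc}")
    # The first rootfile entry points at the package document.
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        return None, _malformed("container.xml declares no rootfile")
    return rootfile.get("full-path"), []
```

Every failure path returns a structured `source.malformed` diagnostic instead of raising, which matches the requirement that malformed package structure be reported, not surfaced as a raw exception.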