Source Adapter Contract

Purpose

This document pins the v1 contract for source-format adapters. It is the handoff from MKTT-WP-0019 to MKTT-WP-0018: markitect-tool implements the contract, registry, CLI, public API, and tests; markitect-filter implements concrete adapters, starting with EPUB3.

The v1 contract is intentionally read-only. It normalizes heterogeneous source formats into canonical Markitect Markdown plus metadata, provenance, quality signals, and diagnostics. Writer/export adapters are future scope.

Scope

The v1 source adapter layer supports:

  • local filesystem source inputs
  • deterministic inspection and normalization
  • package-provided read adapters discovered through Python entry points
  • optional dependencies isolated in adapter packages
  • JSON-serializable normalized Markdown outputs
  • contract tests with fake adapters and small fixtures

The v1 layer does not support:

  • EPUB3, PDF, DOCX, ODT, OCR, browser, or archive parsing in markitect-tool
  • write/export adapters
  • network fetching for source URIs
  • durable ingestion, permissions, retrieval, or governance
  • hidden AI-assisted repair or enrichment

URI fields appear in the model so adapters can preserve source identity, but v1 CLI/API inputs are local paths unless a later workplan opens remote source loading explicitly.

Package Shape

External adapter packages should depend on markitect-tool and register one or more read adapter descriptors through the entry point group markitect_tool.source_adapters.

Recommended markitect-filter shape:

markitect_filter/
  src/markitect_filter/
    __init__.py
    epub3.py
    adapters.py
  tests/
    fixtures/epub3/
    test_epub3_adapter.py
  pyproject.toml

The package should expose a lightweight descriptor function that does not import heavyweight format dependencies until the adapter is instantiated or used. For example:

[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"

Adapter packages may use extras such as markitect-filter[epub3] or markitect-filter[pdf]. Missing optional dependencies must be reported through structured diagnostics; they must not surface as raw import errors.
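One way an adapter package might turn a missing optional dependency into a structured diagnostic is sketched below. It assumes a diagnostic is a plain dict with `code`, `severity`, and `message` keys; the real Diagnostic shape lives in markitect-tool and may differ. The helper name `missing_dependency_diagnostics` is hypothetical.

```python
from importlib.util import find_spec


def missing_dependency_diagnostics(adapter_id, modules, extra):
    """Return one diagnostic per top-level module that cannot be found.

    find_spec() locates a top-level module without executing it, so the
    check stays cheap and a missing package never surfaces as ImportError.
    """
    diagnostics = []
    for module in modules:
        if find_spec(module) is None:
            diagnostics.append({
                "code": "source.missing_dependency",
                "severity": "error",  # assumed severity label
                "message": (
                    f"{adapter_id} needs '{module}'; "
                    f"install markitect-filter[{extra}]"
                ),
            })
    return diagnostics
```

A descriptor function could call this helper at listing time, since it imports nothing format-specific.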

Descriptor Contract

MKTT-WP-0018 should implement a SourceAdapterDescriptor dataclass and map it into the existing ExtensionDescriptor catalog with kind source-adapter.

Required descriptor fields:

| Field | Type | Meaning |
| --- | --- | --- |
| id | str | Stable adapter id, for example source.epub3. |
| version | str | Adapter contract implementation version. |
| name | str | Human-readable adapter name. |
| operations | list[str] | V1 must contain only read. |
| media_types | list[str] | Supported media types, lower-case. |
| extensions | list[str] | Supported file suffixes including dots. |
| factory | callable | Returns a SourceReadAdapter. |

Optional descriptor fields:

| Field | Type | Meaning |
| --- | --- | --- |
| summary | `str \| None` | |
| option_schema | dict | JSON-schema-like adapter options. |
| optional_dependencies | list[OptionalDependency] | Runtime libraries needed by the adapter. |
| safety | dict | Flags for file reads, network access, external processes, and related behavior. |
| quality_profile | dict | Known extraction quality behavior. |
| metadata | dict | Adapter-specific metadata. |

The corresponding ExtensionDescriptor should use:

id: same as SourceAdapterDescriptor.id
kind: source-adapter
input_contract: SourceInspectRequest | SourceReadRequest
output_contract: SourceInspectResult | SourceReadResult
diagnostics_namespace: source
provenance_prefix: source.<adapter>

Capabilities should include:

source read
markdown normalize
diagnostics emit
provenance emit
filesystem read

Descriptor IDs are globally unique. Duplicate IDs from external packages are registry errors. Descriptors from package entry points are sorted by ID for deterministic listing.
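The uniqueness and ordering rules can be sketched with a small in-memory registry. This is an illustration only: the dataclass carries just the required descriptor fields, and `DuplicateAdapterError` is a hypothetical name for the registry error.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    """Stand-in holding only the required descriptor fields."""
    id: str
    version: str
    name: str
    operations: list[str]
    media_types: list[str]
    extensions: list[str]
    factory: Callable[[], Any]


class DuplicateAdapterError(ValueError):
    """Hypothetical registry error for a repeated descriptor ID."""


class SourceAdapterRegistry:
    def __init__(self):
        self._by_id = {}

    def register(self, descriptor):
        # V1 descriptors may declare only the read operation.
        if descriptor.operations != ["read"]:
            raise ValueError(f"{descriptor.id}: v1 operations must be ['read']")
        if descriptor.id in self._by_id:
            raise DuplicateAdapterError(f"duplicate adapter id: {descriptor.id}")
        self._by_id[descriptor.id] = descriptor

    def descriptors(self):
        # Sorted by ID so listings are deterministic.
        return [self._by_id[key] for key in sorted(self._by_id)]
```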

Entry Point Contract

The entry point group is:

markitect_tool.source_adapters

Each entry point may load to one of:

  • a SourceAdapterDescriptor
  • an iterable of SourceAdapterDescriptor
  • a callable returning either of the above

Discovery must not instantiate adapters. It may call the loaded object only when that object is a descriptor factory, that is, a callable returning descriptors. Descriptors should remain cheap enough to list without format-specific imports.

Discovery errors should produce diagnostics with code source.discovery_failed. Missing optional dependencies declared by a descriptor should produce source.missing_dependency and mark that adapter unavailable for reads until the dependency is installed.
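The three accepted entry point shapes could be normalized along these lines. This is a sketch under assumptions: the descriptor is reduced to its `id`, diagnostics are plain dicts, and the function names `descriptors_from_loaded` and `discover` are hypothetical.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    """Stand-in descriptor reduced to its ID."""
    id: str


def descriptors_from_loaded(obj):
    """Normalize an entry point's loaded object to a list of descriptors."""
    if callable(obj) and not isinstance(obj, SourceAdapterDescriptor):
        # A descriptor factory: call it once and normalize what it returns.
        obj = obj()
    if isinstance(obj, SourceAdapterDescriptor):
        return [obj]
    if isinstance(obj, Iterable) and not isinstance(obj, (str, bytes)):
        items = list(obj)
        if all(isinstance(item, SourceAdapterDescriptor) for item in items):
            return items
    raise TypeError(f"unsupported entry point object: {type(obj).__name__}")


def discover(entry_points):
    """Load each entry point; failures become source.discovery_failed."""
    found, diagnostics = [], []
    for entry_point in entry_points:
        try:
            found.extend(descriptors_from_loaded(entry_point.load()))
        except Exception as exc:
            diagnostics.append({
                "code": "source.discovery_failed",
                "severity": "error",  # assumed severity label
                "message": f"{entry_point.name}: {exc}",
            })
    return sorted(found, key=lambda d: d.id), diagnostics
```

With installed packages, the `entry_points` argument could come from `importlib.metadata.entry_points(group="markitect_tool.source_adapters")` on Python 3.10 or later.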

Data Model

All model objects must support stable to_dict() serialization. Serialization rules:

  • omit None, empty lists, empty dicts, and empty strings
  • preserve False, 0, and empty Markdown content where semantically valid
  • use UTF-8 text
  • use canonical JSON with sorted keys and compact separators when computing hashes or cache keys
  • keep all timestamps and dates as strings unless they are filesystem metadata such as mtime_ns
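As a rough illustration of the serialization rules above, the helpers below prune empty values and produce canonical JSON bytes. The `keep_empty` parameter, which names keys whose empty string is still meaningful (such as `markdown`), is an assumption about how "semantically valid" empties might be declared.

```python
import json


def _is_empty(value):
    # None, "", [], and {} count as empty; False and 0 do not.
    return value is None or (isinstance(value, (str, list, dict)) and len(value) == 0)


def prune(value, keep_empty=("markdown",)):
    """Recursively drop empty dict entries, except keys listed in keep_empty."""
    if isinstance(value, dict):
        result = {}
        for key, item in value.items():
            item = prune(item, keep_empty)
            if key in keep_empty and item == "":
                result[key] = item
            elif not _is_empty(item):
                result[key] = item
        return result
    if isinstance(value, list):
        return [prune(item, keep_empty) for item in value]
    return value


def canonical_json(value):
    """Sorted keys, compact separators, UTF-8 bytes; suitable for hashing."""
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")
```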

SourceAsset

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| uri | yes | str | Stable source URI. For local files, use a normalized path URI or path string. |
| path | no | str | Local path when available. |
| name | no | str | Display name or basename. |
| media_type | no | str | Detected or declared media type. |
| extension | no | str | Lower-case suffix including the dot. |
| size | no | int | Byte size for local files. |
| mtime_ns | no | int | Local file modification timestamp in nanoseconds. |
| digest | no | str | `sha256:<hex>` of source bytes when available. |
| metadata | no | dict | Source asset metadata that is not document metadata. |

SourceMetadata

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| title | no | str | Source title. |
| creators | no | list[str] | Authors or creators in source order. |
| language | no | str | BCP 47 language tag when known. |
| rights | no | str | Rights or license text from the source. |
| source_url | no | str | Original public URL when known. |
| publication_date | no | str | Source publication date string. |
| publisher | no | str | Publisher name. |
| identifiers | no | dict[str, str] | ISBN, DOI, package IDs, and similar identifiers. |
| raw | no | dict | Adapter-preserved raw metadata. |

SourceProvenance

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| source_uri | yes | str | Source asset URI. |
| source_path | no | str | Local source path. |
| source_href | no | str | Package-internal href or document-relative reference. |
| package_path | no | str | Archive/package member path, such as EPUB XHTML. |
| anchor | no | str | Source anchor or fragment. |
| page | no | str | Page label or number where available. |
| section | no | str | Chapter, section, or nav label. |
| start_offset | no | int | Adapter-defined start offset. |
| end_offset | no | int | Adapter-defined end offset. |
| digest | no | str | Digest of the specific source component. |
| metadata | no | dict | Adapter-specific provenance details. |

NormalizedMarkdownSegment

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| segment_id | yes | str | Stable ID unique within the document. |
| order | yes | int | Zero-based reading order. |
| markdown | yes | str | Canonical Markdown for the segment. |
| heading | no | str | Primary heading text for the segment. |
| heading_level | no | int | Markdown heading level when known. |
| anchors | no | list[str] | Source anchors covered by the segment. |
| provenance | no | list[SourceProvenance] | Source spans contributing to the segment. |
| metadata | no | dict | Adapter-specific segment metadata. |

Segment IDs should be deterministic. Prefer source anchors when they are stable and unique. Otherwise use ordinal IDs such as seg-0001, seg-0002, and so on. Segment order is always authoritative for reading order.
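A possible ID assignment policy is sketched below. It makes two assumptions the contract does not state: anchors are used only when every segment has one and none repeat, and ordinal IDs are numbered from the zero-based `order` plus one so that the first segment is `seg-0001`. Whether an anchor is stable across source revisions is something only the adapter can judge.

```python
def assign_segment_ids(anchors):
    """Return one ID per segment, given each segment's anchor or None.

    anchors is listed in reading order, so the list index equals the
    segment's zero-based order.
    """
    all_present = all(anchors)
    all_unique = len(set(anchors)) == len(anchors)
    if all_present and all_unique:
        return list(anchors)
    # Fall back to ordinal IDs for the whole document to keep one scheme.
    return [f"seg-{order + 1:04d}" for order in range(len(anchors))]
```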

NormalizationQuality

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| lossiness | yes | str | One of none, low, medium, high, or unknown. |
| confidence | no | float | Adapter confidence from 0.0 to 1.0. |
| skipped_items | no | int | Count of skipped source items. |
| warnings | no | int | Count of warning diagnostics. |
| metadata | no | dict | Adapter-specific quality details. |

NormalizedMarkdownDocument

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| schema_version | yes | str | V1 uses markitect.source.v1. |
| document_id | yes | str | Stable normalized document ID. |
| asset | yes | SourceAsset | Original source identity. |
| metadata | yes | SourceMetadata | Source document metadata. |
| markdown | yes | str | Full normalized Markdown. |
| segments | yes | list[NormalizedMarkdownSegment] | Ordered segment list. |
| quality | yes | NormalizationQuality | Extraction quality summary. |
| diagnostics | no | list[Diagnostic] | Existing Markitect diagnostic shape. |
| provenance | no | list[SourceProvenance] | Document-level provenance. |
| attachments | no | list[SourceAsset] | Referenced binary assets; v1 metadata only. |
| adapter | yes | dict | Adapter id, version, and options. |
| cache_key | yes | str | Deterministic normalization cache key. |

The full markdown field should be equal to the ordered segment Markdown joined with exactly two newlines, unless an adapter has a documented reason to emit document-level frontmatter or separators.
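The default join rule is small enough to show directly. The sketch assumes segments are plain dicts with `order` and `markdown` keys; an adapter that emits frontmatter or separators would deviate from it as documented.

```python
def join_segments(segments):
    """Join segment Markdown in reading order with exactly two newlines."""
    ordered = sorted(segments, key=lambda segment: segment["order"])
    return "\n\n".join(segment["markdown"] for segment in ordered)
```

A contract test could compare this value against the document's `markdown` field for adapters that declare no deviation.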

Hashing And Cache Keys

Source asset digests use the source bytes:

sha256:<hex>

Document IDs should be stable across machines and based on:

  • normalized source asset URI or path
  • source asset digest when available
  • adapter ID
  • adapter version

Normalization cache keys should be based on canonical JSON containing:

  • source asset URI or path
  • source asset digest
  • adapter ID
  • adapter version
  • normalized model version
  • read options

Use this prefix:

source-normalize:sha256:<hex>
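The digest and cache key formats might be computed as follows. The payload key names (`asset`, `digest`, `adapter_id`, and so on) are assumptions for illustration; the contract fixes the inputs and the prefix but not the JSON field names.

```python
import hashlib
import json


def source_digest(data: bytes) -> str:
    """Digest of the raw source bytes in sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def normalization_cache_key(asset, digest, adapter_id, adapter_version,
                            model_version, options):
    """Hash canonical JSON of the normalization inputs."""
    payload = {
        "asset": asset,                    # source asset URI or path
        "digest": digest,                  # source asset digest
        "adapter_id": adapter_id,
        "adapter_version": adapter_version,
        "model_version": model_version,    # normalized model version
        "options": options,                # read options
    }
    # Sorted keys and compact separators make the bytes independent of
    # dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return "source-normalize:sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```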

Read Adapter Protocol

MKTT-WP-0018 should implement Python Protocol classes equivalent to:

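# Deferred annotations let the contract types be defined elsewhere.
from __future__ import annotations

from typing import Protocol


class SourceReadAdapter(Protocol):
    # Descriptor this adapter was created from.
    descriptor: SourceAdapterDescriptor

    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
        """Cheap match check for a source asset."""
        ...

    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
        """Validate structure and collect metadata without full conversion."""
        ...

    def read(self, request: SourceReadRequest) -> SourceReadResult:
        """Perform full extraction into a normalized Markdown document."""
        ...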

Request and result objects:

| Type | Required fields | Meaning |
| --- | --- | --- |
| SourceAdapterMatchRequest | asset, options | Cheap matching request. |
| SourceAdapterMatch | adapter_id, matched, confidence, reason, diagnostics | Match result. Confidence is 0 to 100. |
| SourceInspectRequest | asset, options | Metadata-only inspection request. |
| SourceInspectResult | valid, asset, adapter, metadata, capabilities, diagnostics, quality | Inspection result without full Markdown conversion. |
| SourceReadRequest | asset, options | Full normalization request. |
| SourceReadResult | valid, document, diagnostics | Normalized read result. |

inspect must not perform full conversion. It may open enough of the source to validate structure and collect metadata. read may perform full extraction.

Options must be JSON-serializable. Adapter-specific options should be declared in option_schema. Unknown options should produce source.unknown_option unless the descriptor explicitly permits free-form options.
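Option checking could look roughly like this. The sketch assumes `option_schema` follows JSON Schema conventions, with declared options under `properties` and `additionalProperties: true` as the way a descriptor permits free-form options; the contract only says the schema is "JSON-schema-like", so both key names are assumptions.

```python
import json


def check_options(options, option_schema):
    """Return source.unknown_option diagnostics for undeclared options."""
    # json.dumps raises TypeError for values that are not JSON-serializable.
    json.dumps(options)
    if option_schema.get("additionalProperties") is True:
        return []
    declared = option_schema.get("properties", {})
    return [
        {"code": "source.unknown_option", "message": f"unknown option: {name}"}
        for name in sorted(options)
        if name not in declared
    ]
```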

Adapter Selection

Selection is deterministic:

  1. If an explicit adapter ID is provided, use only that descriptor.
  2. Prefer media type matches over extension-only matches.
  3. Prefer higher can_read().confidence.
  4. Prefer descriptors with required optional dependencies available.
  5. Break remaining ties by descriptor ID in ascending lexical order and emit warning source.adapter_ambiguous.
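The selection order above can be expressed as a tuple comparison. In this sketch each candidate is a plain dict for an adapter whose `can_read()` already matched; the keys `media_type_match`, `confidence`, and `dependencies_available` are hypothetical names, and treating an unknown explicit ID as `source.unsupported_format` is an assumption.

```python
def select_adapter(candidates, explicit_id=None):
    """Pick one candidate; return (winner_or_None, diagnostics)."""
    if explicit_id is not None:
        # Rule 1: an explicit ID restricts selection to that descriptor.
        candidates = [c for c in candidates if c["id"] == explicit_id]
    if not candidates:
        return None, [{"code": "source.unsupported_format", "severity": "error"}]

    def preference(c):
        # Tuples compare left to right, mirroring rules 2 to 4; higher wins.
        return (c["media_type_match"], c["confidence"], c["dependencies_available"])

    best = max(preference(c) for c in candidates)
    # Rule 5: remaining ties resolve by ascending descriptor ID.
    tied = sorted((c for c in candidates if preference(c) == best),
                  key=lambda c: c["id"])
    diagnostics = []
    if len(tied) > 1:
        diagnostics.append({"code": "source.adapter_ambiguous", "severity": "warning"})
    return tied[0], diagnostics
```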

No matching adapter returns an error diagnostic:

source.unsupported_format

Malformed sources return an error diagnostic:

source.malformed

Missing required optional dependencies return:

source.missing_dependency

Warnings do not make a result invalid. Any error diagnostic makes valid false.

CLI Contract

The public commands are:

mkt source adapters
mkt source inspect <path>
mkt source normalize <path> --format markdown

Common options:

--adapter <adapter-id>       Explicit adapter selection.
--format text|json|yaml      For adapters and inspect.
--format markdown|json|yaml  For normalize.
--option key=value           Adapter-specific option, repeatable.
--output <path>              Write normalized output.
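Repeated `--option key=value` arguments need to end up as a JSON-serializable options dict. The sketch below shows one possible parsing policy, not the contract's: it assumes each value is decoded as JSON when possible (so `true` and `3` become a boolean and an integer) and kept as a string otherwise. The function name `parse_options` is hypothetical.

```python
import json


def parse_options(pairs):
    """Turn ["key=value", ...] into an options dict."""
    options = {}
    for pair in pairs:
        key, separator, raw = pair.partition("=")
        if not separator or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        try:
            # Assumed policy: JSON scalars and structures are decoded.
            options[key] = json.loads(raw)
        except json.JSONDecodeError:
            # Anything that is not valid JSON stays a plain string.
            options[key] = raw
    return options
```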

Exit behavior:

| Exit | Meaning |
| --- | --- |
| 0 | Operation valid; warning diagnostics may exist. |
| 1 | Operation completed with error diagnostics. |
| 2 | CLI usage error from Click. |

JSON output must contain a top-level valid field for operations that can fail. Markdown output writes only normalized Markdown to stdout or --output; diagnostics for Markdown output go to stderr. If normalization is invalid, do not emit partial Markdown unless a future option explicitly requests it.
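The Markdown output and exit rules can be summarized in one function. This is a sketch assuming diagnostics are plain dicts with `severity` and `code` keys; the function name and the stderr line format are hypothetical.

```python
import sys


def finish_markdown_normalize(markdown, diagnostics, stdout=None, stderr=None):
    """Write Markdown and diagnostics, then return the process exit code."""
    stdout = sys.stdout if stdout is None else stdout
    stderr = sys.stderr if stderr is None else stderr
    # Warnings do not invalidate a result; any error does.
    valid = not any(d["severity"] == "error" for d in diagnostics)
    for d in diagnostics:
        print(f"{d['severity']}: {d['code']}", file=stderr)
    if valid:
        # Only normalized Markdown reaches stdout; no partial output on error.
        stdout.write(markdown)
    return 0 if valid else 1
```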

API Contract

MKTT-WP-0018 should export these names from markitect_tool:

SourceAsset
SourceMetadata
SourceProvenance
NormalizedMarkdownSegment
NormalizedMarkdownDocument
NormalizationQuality
SourceAdapterDescriptor
SourceReadAdapter
SourceAdapterRegistry
SourceAdapterMatchRequest
SourceAdapterMatch
SourceInspectRequest
SourceInspectResult
SourceReadRequest
SourceReadResult
default_source_adapter_registry
discover_source_adapters
inspect_source
normalize_source

Direct API helpers should accept an optional registry and adapter ID so tests and sibling repos can avoid global discovery when they need deterministic fixtures.

Contract Tests For MKTT-WP-0018

Implementation should add tests for:

  • SourceAsset, metadata, provenance, quality, segment, and document serialization
  • source document cache-key determinism
  • fake in-tree adapter registration and read behavior
  • fake external entry point discovery
  • optional dependency diagnostics
  • unsupported format diagnostics
  • malformed source diagnostics
  • adapter selection tie behavior
  • CLI source adapters JSON fixture
  • CLI source inspect JSON fixture
  • CLI source normalize --format json fixture
  • CLI source normalize --format markdown fixture
  • public API exports

Fixtures live in examples/source-adapters/ and should be reused by tests where practical.

Markitect-filter Handoff

The first markitect-filter implementation should provide an EPUB3 descriptor:

id: source.epub3
name: EPUB3
operations: read
media_types: application/epub+zip
extensions: .epub
entry_point: markitect_filter.adapters:epub3_adapter_descriptor

The EPUB3 adapter should inspect and normalize:

  • META-INF/container.xml
  • the OPF package document
  • Dublin Core and package metadata
  • spine reading order
  • navigation labels
  • body XHTML as ordered Markdown segments
  • source hrefs, anchors, sections, and page references where available

It should classify or skip cover, navigation, table-of-contents, header, footer, license, and transcriber-note material through explicit options and diagnostics. It should report unsupported media, malformed package structure, skipped assets, and lossy extraction.
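The first step of EPUB3 inspection, locating the OPF package document through META-INF/container.xml, might be sketched with the standard library alone. The function name `find_opf_path` and the dict diagnostic shape are assumptions; the container namespace and the `full-path` attribute come from the EPUB Open Container Format.

```python
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub_file):
    """Return (opf_path, diagnostics) for a path or file-like EPUB archive."""
    try:
        with zipfile.ZipFile(epub_file) as archive:
            # archive.read raises KeyError when the member is absent.
            root = ET.fromstring(archive.read("META-INF/container.xml"))
    except (zipfile.BadZipFile, KeyError, ET.ParseError) as exc:
        return None, [{"code": "source.malformed", "severity": "error",
                       "message": str(exc)}]
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        return None, [{"code": "source.malformed", "severity": "error",
                       "message": "container.xml names no rootfile"}]
    return rootfile.get("full-path"), []
```

Because this step reads only one small archive member, it suits `inspect` as well as `read`.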