coulomb/markitect-tool


generated from coulomb/repo-seed


tegwick f8f20c7c32 Workplan refinement and examples

2026-05-14 21:49:43 +02:00

16 KiB


Source Adapter Contract

Purpose

This document pins the v1 contract for source-format adapters. It is the handoff from MKTT-WP-0019 to MKTT-WP-0018: markitect-tool implements the contract, registry, CLI, public API, and tests; markitect-filter implements concrete adapters, starting with EPUB3.

The v1 contract is intentionally read-only. It normalizes heterogeneous source formats into canonical Markitect Markdown plus metadata, provenance, quality signals, and diagnostics. Writer/export adapters are future scope.

Scope

The v1 source adapter layer supports:

local filesystem source inputs
deterministic inspection and normalization
package-provided read adapters discovered through Python entry points
optional dependencies isolated in adapter packages
JSON-serializable normalized Markdown outputs
contract tests with fake adapters and small fixtures

The v1 layer does not support:

EPUB3, PDF, DOCX, ODT, OCR, browser, or archive parsing in markitect-tool
write/export adapters
network fetching for source URIs
durable ingestion, permissions, retrieval, or governance
hidden AI-assisted repair or enrichment

URI fields appear in the model so adapters can preserve source identity, but v1 CLI/API inputs are local paths unless a later workplan opens remote source loading explicitly.

Package Shape

External adapter packages should depend on markitect-tool and register one or more read adapter descriptors through the entry point group markitect_tool.source_adapters.

Recommended markitect-filter shape:

markitect_filter/
  src/markitect_filter/
    __init__.py
    epub3.py
    adapters.py
  tests/
    fixtures/epub3/
    test_epub3_adapter.py
  pyproject.toml

The package should expose a lightweight descriptor function that does not import heavyweight format dependencies until the adapter is instantiated or used. For example:

[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"

Adapter packages may use extras such as markitect-filter[epub3] or markitect-filter[pdf]. Missing optional dependencies must be reported through structured diagnostics; they must not surface as raw import errors.
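A minimal sketch of a lazy descriptor function and a structured missing-dependency check follows. It assumes a simplified dict-based descriptor and diagnostic shape; `hypothetical_epub_backend` and the `dependency_diagnostic` helper are illustrative names, not part of the contract.

```python
import importlib
import importlib.util
from typing import Optional


def dependency_diagnostic(module_name: str, extra: str) -> Optional[dict]:
    """Return a structured diagnostic when a top-level module is not installed."""
    # find_spec checks availability without importing the module itself.
    if importlib.util.find_spec(module_name) is not None:
        return None
    return {
        "code": "source.missing_dependency",
        "severity": "error",
        "message": f"Install markitect-filter[{extra}] to enable this adapter.",
        "module": module_name,
    }


def epub3_adapter_descriptor() -> dict:
    """Cheap to call: nothing format-specific is imported here."""

    def factory():
        # The heavyweight import is deferred until an adapter is requested.
        return importlib.import_module("hypothetical_epub_backend")

    return {"id": "source.epub3", "operations": ["read"], "factory": factory}
```

Listing adapters only calls `epub3_adapter_descriptor()`, so a missing backend surfaces as a diagnostic rather than an `ImportError` at discovery time.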

Descriptor Contract

MKTT-WP-0018 should implement a SourceAdapterDescriptor dataclass and map it into the existing ExtensionDescriptor catalog with kind source-adapter.

Required descriptor fields:

Field	Type	Meaning
`id`	`str`	Stable adapter id, for example `source.epub3`.
`version`	`str`	Adapter contract implementation version.
`name`	`str`	Human-readable adapter name.
`operations`	`list[str]`	V1 must contain only `read`.
`media_types`	`list[str]`	Supported media types, lower-case.
`extensions`	`list[str]`	Supported file suffixes including dots.
`factory`	callable	Returns a `SourceReadAdapter`.

Optional descriptor fields:

Field	Type	Meaning
`summary`	`str | None`	Short adapter description.
`option_schema`	`dict`	JSON-schema-like adapter options.
`optional_dependencies`	`list[OptionalDependency]`	Runtime libraries needed by the adapter.
`safety`	`dict`	Flags for file reads, network access, external processes, and related behavior.
`quality_profile`	`dict`	Known extraction quality behavior.
`metadata`	`dict`	Adapter-specific metadata.
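The two field tables above could map onto a dataclass along these lines. This is a sketch under assumptions: the validation in `__post_init__` and the plain `list` type for `optional_dependencies` are illustrative choices, not requirements stated by the contract.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    # Required fields.
    id: str
    version: str
    name: str
    operations: list
    media_types: list
    extensions: list
    factory: Callable[[], Any]
    # Optional fields.
    summary: Optional[str] = None
    option_schema: dict = field(default_factory=dict)
    optional_dependencies: list = field(default_factory=list)
    safety: dict = field(default_factory=dict)
    quality_profile: dict = field(default_factory=dict)
    metadata: dict = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumed validation mirroring the "Meaning" column.
        if list(self.operations) != ["read"]:
            raise ValueError("v1 descriptors must contain only the 'read' operation")
        if any(m != m.lower() for m in self.media_types):
            raise ValueError("media types must be lower-case")
        if any(not e.startswith(".") for e in self.extensions):
            raise ValueError("extensions must include the leading dot")
```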

The corresponding ExtensionDescriptor should use:

id: same as SourceAdapterDescriptor.id
kind: source-adapter
input_contract: SourceInspectRequest | SourceReadRequest
output_contract: SourceInspectResult | SourceReadResult
diagnostics_namespace: source
provenance_prefix: source.<adapter>

Capabilities should include:

source read
markdown normalize
diagnostics emit
provenance emit
filesystem read

Descriptor IDs are globally unique. Duplicate IDs from external packages are registry errors. Descriptors from package entry points are sorted by ID for deterministic listing.
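The uniqueness and ordering rules can be sketched with a small registry. `DuplicateAdapterError` and the `descriptors()` method name are hypothetical; the contract only names `SourceAdapterRegistry` itself.

```python
class DuplicateAdapterError(ValueError):
    """Hypothetical error for a descriptor ID that is registered twice."""


class SourceAdapterRegistry:
    def __init__(self) -> None:
        self._by_id = {}

    def register(self, descriptor) -> None:
        # IDs are globally unique, so a second registration is an error.
        if descriptor.id in self._by_id:
            raise DuplicateAdapterError(f"duplicate source adapter id: {descriptor.id}")
        self._by_id[descriptor.id] = descriptor

    def descriptors(self) -> list:
        # Sort by ID so listings do not depend on registration order.
        return [self._by_id[key] for key in sorted(self._by_id)]
```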

Entry Point Contract

The entry point group is:

markitect_tool.source_adapters

Each entry point may load to one of:

a SourceAdapterDescriptor
an iterable of SourceAdapterDescriptor
a callable returning either of the above

Discovery must not instantiate adapters. It may call a loaded object only when that object is a callable that returns descriptors; it must never call a descriptor's factory. Descriptors should remain cheap enough to list without format-specific imports.

Discovery errors should produce diagnostics with code source.discovery_failed. Missing optional dependencies declared by a descriptor should produce source.missing_dependency and mark that adapter unavailable for reads until the dependency is installed.
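One way to sketch discovery, assuming Python 3.10+ for `entry_points(group=...)` and a dict-shaped diagnostic. `expand_loaded` and `discover` are illustrative names; descriptors are recognized here by having an `id` attribute, which is an assumption of this sketch.

```python
from importlib.metadata import entry_points

GROUP = "markitect_tool.source_adapters"


def expand_loaded(obj) -> list:
    """Normalize a loaded entry point object into a list of descriptors."""
    if callable(obj) and not hasattr(obj, "id"):
        obj = obj()  # descriptor function; no adapter is built here
    if hasattr(obj, "id"):
        return [obj]
    return list(obj)


def discover(eps=None) -> tuple:
    """Return (descriptors sorted by ID, discovery diagnostics)."""
    if eps is None:
        eps = entry_points(group=GROUP)
    descriptors, diagnostics = [], []
    for ep in eps:
        try:
            descriptors.extend(expand_loaded(ep.load()))
        except Exception as exc:  # one broken package must not hide the others
            diagnostics.append({
                "code": "source.discovery_failed",
                "entry_point": ep.name,
                "message": str(exc),
            })
    descriptors.sort(key=lambda d: d.id)
    return descriptors, diagnostics
```

Passing `eps` explicitly lets tests supply fake entry points without installing a package.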

Data Model

All model objects must support stable to_dict() serialization. Serialization rules:

omit None, empty lists, empty dicts, and empty strings
preserve False, 0, and empty Markdown content where semantically valid
use UTF-8 text
use canonical JSON with sorted keys and compact separators when computing hashes or cache keys
keep all timestamps and dates as strings unless they are filesystem metadata such as mtime_ns
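The omission rules above might be implemented with a recursive pruning helper that `to_dict()` methods call. `prune` and its `keep_empty` parameter are hypothetical; treating `markdown` as the only key allowed to stay empty is an assumption of this sketch.

```python
def prune(value, keep_empty=("markdown",)):
    """Drop None, empty strings, empty lists, and empty dicts from nested data."""
    if isinstance(value, dict):
        out = {}
        for key, item in value.items():
            item = prune(item, keep_empty)
            is_empty = item is None or item == "" or item == [] or item == {}
            # False and 0 are never "empty"; empty Markdown survives by key.
            if is_empty and not (item == "" and key in keep_empty):
                continue
            out[key] = item
        return out
    if isinstance(value, list):
        return [prune(item, keep_empty) for item in value]
    return value
```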

`SourceAsset`

Field	Required	Type	Meaning
`uri`	yes	`str`	Stable source URI. For local files, use a normalized path URI or path string.
`path`	no	`str`	Local path when available.
`name`	no	`str`	Display name or basename.
`media_type`	no	`str`	Detected or declared media type.
`extension`	no	`str`	Lower-case suffix including the dot.
`size`	no	`int`	Byte size for local files.
`mtime_ns`	no	`int`	Local file modification timestamp in nanoseconds.
`digest`	no	`str`	`sha256:<hex>` of source bytes when available.
`metadata`	no	`dict`	Source asset metadata that is not document metadata.
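For a local file, the `SourceAsset` fields can be filled from the filesystem. The sketch below returns a plain dict rather than the real model class, and `local_asset` is a hypothetical helper; a production version would likely hash in chunks instead of reading the whole file.

```python
import hashlib
import mimetypes
from pathlib import Path


def local_asset(path: str) -> dict:
    """Build SourceAsset-like fields for a local file."""
    p = Path(path).resolve()
    stat = p.stat()
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    asset = {
        "uri": p.as_uri(),               # normalized file:// URI
        "path": str(p),
        "name": p.name,
        "extension": p.suffix.lower(),   # lower-case suffix including the dot
        "size": stat.st_size,
        "mtime_ns": stat.st_mtime_ns,
        "digest": f"sha256:{digest}",
    }
    media_type, _ = mimetypes.guess_type(p.name)
    if media_type:
        asset["media_type"] = media_type
    return asset
```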

`SourceMetadata`

Field	Required	Type	Meaning
`title`	no	`str`	Source title.
`creators`	no	`list[str]`	Authors or creators in source order.
`language`	no	`str`	BCP 47 language tag when known.
`rights`	no	`str`	Rights or license text from the source.
`source_url`	no	`str`	Original public URL when known.
`publication_date`	no	`str`	Source publication date string.
`publisher`	no	`str`	Publisher name.
`identifiers`	no	`dict[str, str]`	ISBN, DOI, package IDs, and similar identifiers.
`raw`	no	`dict`	Adapter-preserved raw metadata.

`SourceProvenance`

Field	Required	Type	Meaning
`source_uri`	yes	`str`	Source asset URI.
`source_path`	no	`str`	Local source path.
`source_href`	no	`str`	Package-internal href or document-relative reference.
`package_path`	no	`str`	Archive/package member path, such as EPUB XHTML.
`anchor`	no	`str`	Source anchor or fragment.
`page`	no	`str`	Page label or number where available.
`section`	no	`str`	Chapter, section, or nav label.
`start_offset`	no	`int`	Adapter-defined start offset.
`end_offset`	no	`int`	Adapter-defined end offset.
`digest`	no	`str`	Digest of the specific source component.
`metadata`	no	`dict`	Adapter-specific provenance details.

`NormalizedMarkdownSegment`

Field	Required	Type	Meaning
`segment_id`	yes	`str`	Stable ID unique within the document.
`order`	yes	`int`	Zero-based reading order.
`markdown`	yes	`str`	Canonical Markdown for the segment.
`heading`	no	`str`	Primary heading text for the segment.
`heading_level`	no	`int`	Markdown heading level when known.
`anchors`	no	`list[str]`	Source anchors covered by the segment.
`provenance`	no	`list[SourceProvenance]`	Source spans contributing to the segment.
`metadata`	no	`dict`	Adapter-specific segment metadata.

Segment IDs should be deterministic. Prefer source anchors when they are stable and unique. Otherwise use ordinal IDs such as seg-0001, seg-0002, and so on. Segment order is always authoritative for reading order.
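A minimal sketch of deterministic ID assignment follows. It makes an all-or-nothing choice (anchors are used only when every segment has one and none repeat), which is one possible reading of the rule above, not the only one; `assign_segment_ids` is a hypothetical helper.

```python
def assign_segment_ids(anchors: list) -> list:
    """anchors holds one optional source anchor per segment, in reading order."""
    present = [a for a in anchors if a]
    all_unique = len(present) == len(anchors) and len(set(present)) == len(present)
    if all_unique:
        return list(anchors)
    # Fall back to one-based ordinal IDs: seg-0001, seg-0002, ...
    return [f"seg-{i:04d}" for i in range(1, len(anchors) + 1)]
```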

`NormalizationQuality`

Field	Required	Type	Meaning
`lossiness`	yes	`str`	One of `none`, `low`, `medium`, `high`, or `unknown`.
`confidence`	no	`float`	Adapter confidence from `0.0` to `1.0`.
`skipped_items`	no	`int`	Count of skipped source items.
`warnings`	no	`int`	Count of warning diagnostics.
`metadata`	no	`dict`	Adapter-specific quality details.

`NormalizedMarkdownDocument`

Field	Required	Type	Meaning
`schema_version`	yes	`str`	V1 uses `markitect.source.v1`.
`document_id`	yes	`str`	Stable normalized document ID.
`asset`	yes	`SourceAsset`	Original source identity.
`metadata`	yes	`SourceMetadata`	Source document metadata.
`markdown`	yes	`str`	Full normalized Markdown.
`segments`	yes	`list[NormalizedMarkdownSegment]`	Ordered segment list.
`quality`	yes	`NormalizationQuality`	Extraction quality summary.
`diagnostics`	no	`list[Diagnostic]`	Existing Markitect diagnostic shape.
`provenance`	no	`list[SourceProvenance]`	Document-level provenance.
`attachments`	no	`list[SourceAsset]`	Referenced binary assets; v1 metadata only.
`adapter`	yes	`dict`	Adapter id, version, and options.
`cache_key`	yes	`str`	Deterministic normalization cache key.

The full markdown field should be equal to the ordered segment Markdown joined with exactly two newlines, unless an adapter has a documented reason to emit document-level frontmatter or separators.
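The default relationship between `markdown` and `segments` can be expressed as a one-line join, shown here with dict-shaped segments for brevity; `join_segments` is an illustrative name.

```python
def join_segments(segments: list) -> str:
    """Join segment Markdown in reading order with exactly two newlines."""
    ordered = sorted(segments, key=lambda s: s["order"])
    return "\n\n".join(s["markdown"] for s in ordered)
```

A contract test could assert `document.markdown == join_segments(document.segments)` for adapters that emit no frontmatter or separators.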

Hashing And Cache Keys

Source asset digests use the source bytes:

sha256:<hex>

Document IDs should be stable across machines and based on:

normalized source asset URI or path
source asset digest when available
adapter ID
adapter version

Normalization cache keys should be based on canonical JSON containing:

source asset URI or path
source asset digest
adapter ID
adapter version
normalized model version
read options

Use this prefix:

source-normalize:sha256:<hex>
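The cache-key recipe could be sketched as below. The payload key names and the `cache_key` signature are assumptions; the canonical JSON settings (sorted keys, compact separators, UTF-8) and the prefix follow the rules in this document.

```python
import hashlib
import json


def canonical_json(payload: dict) -> str:
    # Sorted keys and compact separators make the text independent of dict order.
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def cache_key(asset_uri, asset_digest, adapter_id, adapter_version, options,
              model_version="markitect.source.v1") -> str:
    payload = {
        "asset_uri": asset_uri,
        "asset_digest": asset_digest,
        "adapter_id": adapter_id,
        "adapter_version": adapter_version,
        "model_version": model_version,
        "options": options,
    }
    digest = hashlib.sha256(canonical_json(payload).encode("utf-8")).hexdigest()
    return f"source-normalize:sha256:{digest}"
```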

Read Adapter Protocol

MKTT-WP-0018 should implement Python Protocol classes equivalent to:

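from __future__ import annotations  # contract types below are forward references

from typing import Protocol


class SourceReadAdapter(Protocol):
    # Descriptor this adapter was created from.
    descriptor: SourceAdapterDescriptor

    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
        """Cheap match check for a source asset."""
        ...

    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
        """Metadata-only inspection; no full Markdown conversion."""
        ...

    def read(self, request: SourceReadRequest) -> SourceReadResult:
        """Full normalization into a NormalizedMarkdownDocument."""
        ...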

Request and result objects:

Type	Required fields	Meaning
`SourceAdapterMatchRequest`	`asset`, `options`	Cheap matching request.
`SourceAdapterMatch`	`adapter_id`, `matched`, `confidence`, `reason`, `diagnostics`	Match result. Confidence is `0` to `100`.
`SourceInspectRequest`	`asset`, `options`	Metadata-only inspection request.
`SourceInspectResult`	`valid`, `asset`, `adapter`, `metadata`, `capabilities`, `diagnostics`, `quality`	Inspection result without full Markdown conversion.
`SourceReadRequest`	`asset`, `options`	Full normalization request.
`SourceReadResult`	`valid`, `document`, `diagnostics`	Normalized read result.

inspect must not perform full conversion. It may open enough of the source to validate structure and collect metadata. read may perform full extraction.
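The split between `inspect` and `read` can be illustrated with a fake adapter of the kind the contract tests call for. Requests and results are plain dicts here, with fewer fields than the table above requires, and `FakeTextAdapter` with its `source.fake_text` ID is hypothetical.

```python
from pathlib import Path


class FakeTextAdapter:
    """Hypothetical fake adapter for plain-text fixtures."""

    def can_read(self, request: dict) -> dict:
        matched = request["asset"].get("extension") == ".txt"
        return {
            "adapter_id": "source.fake_text",
            "matched": matched,
            "confidence": 100 if matched else 0,
            "reason": "extension",
            "diagnostics": [],
        }

    def inspect(self, request: dict) -> dict:
        # Metadata only: read just the first line and treat it as the title.
        with open(request["asset"]["path"], encoding="utf-8") as fh:
            title = fh.readline().strip()
        return {"valid": True, "metadata": {"title": title}, "diagnostics": []}

    def read(self, request: dict) -> dict:
        # Full extraction: the whole file becomes the normalized Markdown.
        text = Path(request["asset"]["path"]).read_text(encoding="utf-8")
        return {"valid": True, "document": {"markdown": text.strip()}, "diagnostics": []}
```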

Options must be JSON-serializable. Adapter-specific options should be declared in option_schema. Unknown options should produce source.unknown_option unless the descriptor explicitly permits free-form options.
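Unknown-option checking might look like the following. It assumes `option_schema` lists known options under a JSON-schema-style `properties` key, and the `allow_free_form` flag is a hypothetical stand-in for however a descriptor permits free-form options.

```python
def unknown_option_diagnostics(options: dict, option_schema: dict,
                               allow_free_form: bool = False) -> list:
    """Return one source.unknown_option diagnostic per undeclared option."""
    if allow_free_form:
        return []
    known = set(option_schema.get("properties", {}))
    return [
        {"code": "source.unknown_option", "option": name}
        for name in sorted(options)  # sorted for deterministic output
        if name not in known
    ]
```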

Adapter Selection

Selection is deterministic:

1. If an explicit adapter ID is provided, use only that descriptor.
2. Prefer media type matches over extension-only matches.
3. Prefer higher can_read().confidence.
4. Prefer descriptors with required optional dependencies available.
5. Break remaining ties by descriptor ID in ascending lexical order and emit warning source.adapter_ambiguous.

No matching adapter returns an error diagnostic:

source.unsupported_format

Malformed sources return an error diagnostic:

source.malformed

Missing required optional dependencies return:

source.missing_dependency

Warnings do not make a result invalid. Any error diagnostic makes valid false.
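The selection rules in this section can be sketched as a sort over candidates that already matched. The candidate dict shape (`media_type_match`, `confidence`, `deps_available`) and the `severity` field on diagnostics are assumptions made for this illustration.

```python
def select_adapter(candidates: list, explicit_id=None) -> tuple:
    """Return (chosen candidate or None, diagnostics)."""
    pool = list(candidates)
    if explicit_id is not None:
        # An explicit ID restricts selection to that descriptor only.
        pool = [c for c in pool if c["id"] == explicit_id]
    if not pool:
        return None, [{"code": "source.unsupported_format", "severity": "error"}]

    def rank(c):
        # Lower tuples win: media type match, then confidence, then dependencies.
        return (not c["media_type_match"], -c["confidence"], not c["deps_available"])

    pool.sort(key=lambda c: (rank(c), c["id"]))
    diagnostics = []
    if len(pool) > 1 and rank(pool[0]) == rank(pool[1]):
        diagnostics.append({"code": "source.adapter_ambiguous", "severity": "warning"})
    return pool[0], diagnostics


def is_valid(diagnostics: list) -> bool:
    # Warnings are tolerated; any error makes the result invalid.
    return not any(d["severity"] == "error" for d in diagnostics)
```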

CLI Contract

The public commands are:

mkt source adapters
mkt source inspect <path>
mkt source normalize <path> --format markdown

Common options:

--adapter <adapter-id>       Explicit adapter selection.
--format text|json|yaml      For adapters and inspect.
--format markdown|json|yaml  For normalize.
--option key=value           Adapter-specific option, repeatable.
--output <path>              Write normalized output.
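Repeated `--option key=value` arguments need to end up as a JSON-serializable dict. The sketch below assumes values are parsed as JSON when possible and kept as strings otherwise; that typing rule, the `parse_options` name, and the option names in the usage line are not specified by this contract.

```python
import json


def parse_options(pairs: list) -> dict:
    """Turn ["key=value", ...] into an options dict."""
    options = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        try:
            options[key] = json.loads(raw)  # true, 2, "text", [1, 2], ...
        except json.JSONDecodeError:
            options[key] = raw              # bare words stay strings
    return options
```

For example, `parse_options(["skip_cover=true", "mode=strict"])` yields a boolean and a string.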

Exit behavior:

Exit	Meaning
`0`	Operation valid; warning diagnostics may exist.
`1`	Operation completed with error diagnostics.
`2`	CLI usage error from Click.

JSON output must contain a top-level valid field for operations that can fail. Markdown output writes only normalized Markdown to stdout or --output; diagnostics for Markdown output go to stderr. If normalization is invalid, do not emit partial Markdown unless a future option explicitly requests it.
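For `--format markdown`, the output routing and exit code could be derived as follows. This is a sketch assuming diagnostics are dicts with `severity` and `code`; `finish_normalize` is a hypothetical helper, and exit code `2` is left to Click.

```python
def finish_normalize(markdown: str, diagnostics: list) -> tuple:
    """Return (stdout_text, stderr_text, exit_code) for Markdown output."""
    valid = not any(d["severity"] == "error" for d in diagnostics)
    # Diagnostics always go to stderr so stdout stays pure Markdown.
    stderr = "".join(f"{d['severity']}: {d['code']}\n" for d in diagnostics)
    # No partial Markdown is emitted when normalization is invalid.
    stdout = markdown if valid else ""
    return stdout, stderr, 0 if valid else 1
```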

API Contract

MKTT-WP-0018 should export these names from markitect_tool:

SourceAsset
SourceMetadata
SourceProvenance
NormalizedMarkdownSegment
NormalizedMarkdownDocument
NormalizationQuality
SourceAdapterDescriptor
SourceReadAdapter
SourceAdapterRegistry
SourceAdapterMatchRequest
SourceAdapterMatch
SourceInspectRequest
SourceInspectResult
SourceReadRequest
SourceReadResult
default_source_adapter_registry
discover_source_adapters
inspect_source
normalize_source

Direct API helpers should accept an optional registry and adapter ID so tests and sibling repos can avoid global discovery when they need deterministic fixtures.

Contract Tests For MKTT-WP-0018

Implementation should add tests for:

SourceAsset, metadata, provenance, quality, segment, and document serialization
source document cache-key determinism
fake in-tree adapter registration and read behavior
fake external entry point discovery
optional dependency diagnostics
unsupported format diagnostics
malformed source diagnostics
adapter selection tie behavior
CLI source adapters JSON fixture
CLI source inspect JSON fixture
CLI source normalize --format json fixture
CLI source normalize --format markdown fixture
public API exports

Fixtures live in examples/source-adapters/ and should be reused by tests where practical.

Markitect-filter Handoff

The first markitect-filter implementation should provide an EPUB3 descriptor:

id: source.epub3
name: EPUB3
operations: read
media_types: application/epub+zip
extensions: .epub
entry_point: markitect_filter.adapters:epub3_adapter_descriptor

The EPUB3 adapter should inspect and normalize:

META-INF/container.xml
the OPF package document
Dublin Core and package metadata
spine reading order
navigation labels
body XHTML as ordered Markdown segments
source hrefs, anchors, sections, and page references where available
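The first steps of that list (container, OPF package, spine order) can be sketched with only the standard library. `spine_hrefs` is a hypothetical helper; it resolves manifest hrefs relative to the OPF location and maps structural failures to `source.malformed`, with a diagnostic dict shape that is assumed rather than taken from the contract.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def spine_hrefs(epub_file) -> tuple:
    """Return (package member paths in spine order, diagnostics)."""
    try:
        with zipfile.ZipFile(epub_file) as zf:
            container = ET.fromstring(zf.read("META-INF/container.xml"))
            rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
            opf_path = rootfile.get("full-path")
            package = ET.fromstring(zf.read(opf_path))
        # Manifest maps item IDs to hrefs; the spine references those IDs.
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", OPF_NS)
        }
        base = posixpath.dirname(opf_path)
        hrefs = [
            posixpath.join(base, manifest[ref.get("idref")])
            for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
        ]
        return hrefs, []
    except (zipfile.BadZipFile, KeyError, AttributeError, ET.ParseError) as exc:
        return [], [{"code": "source.malformed", "severity": "error", "message": str(exc)}]
```

Each returned path is a natural value for `SourceProvenance.package_path` on the segments produced from that spine item.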

It should classify or skip cover, navigation, table-of-contents, header, footer, license, and transcriber-note material through explicit options and diagnostics. It should report unsupported media, malformed package structure, skipped assets, and lossy extraction.
