# Source Adapter Contract

## Purpose

This document pins the v1 contract for source-format adapters. It is the
handoff from `MKTT-WP-0019` to `MKTT-WP-0018`: `markitect-tool` implements the
contract, registry, CLI, public API, and tests; `markitect-filter` implements
concrete adapters, starting with EPUB3.

The v1 contract is intentionally read-only. It normalizes heterogeneous source
formats into canonical Markitect Markdown plus metadata, provenance, quality
signals, and diagnostics. Writer/export adapters are future scope.

## Scope

The v1 source adapter layer supports:

- local filesystem source inputs
- deterministic inspection and normalization
- package-provided read adapters discovered through Python entry points
- optional dependencies isolated in adapter packages
- JSON-serializable normalized Markdown outputs
- contract tests with fake adapters and small fixtures

The v1 layer does not support:

- EPUB3, PDF, DOCX, ODT, OCR, browser, or archive parsing in `markitect-tool`
- write/export adapters
- network fetching for source URIs
- durable ingestion, permissions, retrieval, or governance
- hidden AI-assisted repair or enrichment

URI fields appear in the model so adapters can preserve source identity, but
v1 CLI/API inputs are local paths unless a later workplan opens remote source
loading explicitly.

## Package Shape

External adapter packages should depend on `markitect-tool` and register one or
more read adapter descriptors through the entry point group
`markitect_tool.source_adapters`.

Recommended `markitect-filter` shape:

```text
markitect_filter/
  src/markitect_filter/
    __init__.py
    epub3.py
    adapters.py
  tests/
    fixtures/epub3/
    test_epub3_adapter.py
  pyproject.toml
```

The package should expose a lightweight descriptor function that does not
import heavyweight format dependencies until the adapter is instantiated or
used. For example:

```toml
[project.entry-points."markitect_tool.source_adapters"]
epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
```

Adapter packages may use extras such as `markitect-filter[epub3]` or
`markitect-filter[pdf]`. Missing optional dependencies must be reported through
structured diagnostics; they must not surface as raw import errors.
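
One way to keep the descriptor function cheap is to defer the format import to the factory and turn an `ImportError` into a diagnostic. The sketch below is illustrative only: the `SourceAdapterDescriptor` stand-in carries just the required fields, `hypothetical_epub_parser` is a made-up module name, and the diagnostic keys `severity` and `message` are assumptions about the existing Markitect diagnostic shape.

```python
from __future__ import annotations

from dataclasses import dataclass
from importlib import import_module
from typing import Any, Callable


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    """Stand-in for the real descriptor; required fields only."""
    id: str
    version: str
    name: str
    operations: list[str]
    media_types: list[str]
    extensions: list[str]
    factory: Callable[[], Any]


def _load_optional(module_name: str) -> tuple[Any, dict | None]:
    """Import a dependency lazily; return a diagnostic dict instead of raising."""
    try:
        return import_module(module_name), None
    except ImportError:
        return None, {
            "code": "source.missing_dependency",
            "severity": "error",  # assumed key
            "message": f"optional dependency {module_name!r} is not installed",
        }


def _epub3_factory() -> dict:
    # The heavyweight import happens only here, when an adapter is requested.
    # A real factory would return a SourceReadAdapter; a dict keeps this short.
    module, diagnostic = _load_optional("hypothetical_epub_parser")
    return {"parser": module, "diagnostics": [diagnostic] if diagnostic else []}


def epub3_adapter_descriptor() -> SourceAdapterDescriptor:
    """Cheap to call: builds the descriptor without importing any parser."""
    return SourceAdapterDescriptor(
        id="source.epub3",
        version="0.1.0",  # placeholder version
        name="EPUB3",
        operations=["read"],
        media_types=["application/epub+zip"],
        extensions=[".epub"],
        factory=_epub3_factory,
    )
```

Listing adapters only ever calls `epub3_adapter_descriptor()`, so a machine without the EPUB extra can still show the adapter and explain why it is unavailable.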

## Descriptor Contract

`MKTT-WP-0018` should implement a `SourceAdapterDescriptor` dataclass and map
it into the existing `ExtensionDescriptor` catalog with kind `source-adapter`.

Required descriptor fields:

| Field | Type | Meaning |
| --- | --- | --- |
| `id` | `str` | Stable adapter id, for example `source.epub3`. |
| `version` | `str` | Adapter contract implementation version. |
| `name` | `str` | Human-readable adapter name. |
| `operations` | `list[str]` | V1 must contain only `read`. |
| `media_types` | `list[str]` | Supported media types, lower-case. |
| `extensions` | `list[str]` | Supported file suffixes including dots. |
| `factory` | callable | Returns a `SourceReadAdapter`. |

Optional descriptor fields:

| Field | Type | Meaning |
| --- | --- | --- |
| `summary` | `str \| None` | Short description for CLI and docs. |
| `option_schema` | `dict` | JSON-schema-like adapter options. |
| `optional_dependencies` | `list[OptionalDependency]` | Runtime libraries needed by the adapter. |
| `safety` | `dict` | Reads files, network, external processes, and related flags. |
| `quality_profile` | `dict` | Known extraction quality behavior. |
| `metadata` | `dict` | Adapter-specific metadata. |

The corresponding `ExtensionDescriptor` should use:

```text
id: same as SourceAdapterDescriptor.id
kind: source-adapter
input_contract: SourceInspectRequest | SourceReadRequest
output_contract: SourceInspectResult | SourceReadResult
diagnostics_namespace: source
provenance_prefix: source.<adapter>
```

Capabilities should include:

```text
source read
markdown normalize
diagnostics emit
provenance emit
filesystem read
```

Descriptor IDs are globally unique. Duplicate IDs from external packages are
registry errors. Descriptors from package entry points are sorted by ID for
deterministic listing.

## Entry Point Contract

The entry point group is:

```text
markitect_tool.source_adapters
```

Each entry point may load to one of:

- a `SourceAdapterDescriptor`
- an iterable of `SourceAdapterDescriptor`
- a callable returning either of the above

Discovery must not instantiate adapters unless the loaded object itself is a
descriptor factory. Descriptors should remain cheap enough to list without
format-specific imports.

Discovery errors should produce diagnostics with code
`source.discovery_failed`. Missing optional dependencies declared by a
descriptor should produce `source.missing_dependency` and mark that adapter
unavailable for reads until the dependency is installed.
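
The three accepted entry point shapes, the duplicate-ID rule, and the sorted listing could be handled by a small collection step like the one below. This is a sketch under assumptions: `Descriptor` is a one-field stand-in, `coerce_descriptors` and `collect` are hypothetical helper names, and reporting duplicate IDs under `source.discovery_failed` is a guess, since the contract only calls them registry errors.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class Descriptor:
    """Stand-in for SourceAdapterDescriptor; only the id matters here."""
    id: str


def coerce_descriptors(loaded: Any) -> list[Descriptor]:
    """Flatten one loaded entry point object into a list of descriptors."""
    if callable(loaded):
        # A descriptor factory: call it, then treat the result like the rest.
        loaded = loaded()
    if isinstance(loaded, Descriptor):
        return [loaded]
    return list(loaded)


def collect(loaded_objects: Iterable[Any]) -> tuple[list[Descriptor], list[dict]]:
    """Gather descriptors, record failures, and return them sorted by id."""
    found: dict[str, Descriptor] = {}
    diagnostics: list[dict] = []
    for obj in loaded_objects:
        try:
            descriptors = coerce_descriptors(obj)
        except Exception as exc:  # a broken package must not break listing
            diagnostics.append({"code": "source.discovery_failed", "message": str(exc)})
            continue
        for descriptor in descriptors:
            if descriptor.id in found:
                # Assumed code for duplicates; the contract does not name one.
                diagnostics.append({
                    "code": "source.discovery_failed",
                    "message": f"duplicate adapter id {descriptor.id!r}",
                })
            else:
                found[descriptor.id] = descriptor
    return sorted(found.values(), key=lambda d: d.id), diagnostics
```

In a real registry the `loaded_objects` would most likely come from `importlib.metadata.entry_points(group="markitect_tool.source_adapters")` on Python 3.10 or later, calling `load()` on each entry point inside the same `try` block.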

## Data Model

All model objects must support stable `to_dict()` serialization. Serialization
rules:

- omit `None`, empty lists, empty dicts, and empty strings
- preserve `False`, `0`, and empty Markdown content where semantically valid
- use UTF-8 text
- use canonical JSON with sorted keys and compact separators when computing
  hashes or cache keys
- keep all timestamps and dates as strings unless they are filesystem metadata
  such as `mtime_ns`
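
The serialization rules above can be sketched as a pruning pass plus a canonical JSON dump. The helper names `prune` and `canonical_json` are hypothetical, and treating `markdown` as the only key allowed to stay empty is an assumption about what "semantically valid" means.

```python
from __future__ import annotations

import json
from typing import Any


def prune(value: Any, keep_empty: frozenset[str] = frozenset({"markdown"})) -> Any:
    """Drop None and empty list/dict/str values; keep False, 0, and named keys."""
    if isinstance(value, dict):
        out: dict[str, Any] = {}
        for key, item in value.items():
            item = prune(item, keep_empty)
            if key in keep_empty and item == "":
                out[key] = item  # empty Markdown content is preserved
            elif item is None or item == "" or item == [] or item == {}:
                continue
            else:
                out[key] = item
        return out
    if isinstance(value, list):
        return [prune(item, keep_empty) for item in value]
    return value


def canonical_json(value: Any) -> str:
    """Sorted keys and compact separators, suitable for hashing."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
```

For example, `prune({"uri": "a.epub", "name": "", "size": 0})` keeps `size` because `0` is a value, while the empty `name` is dropped.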

### `SourceAsset`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `uri` | yes | `str` | Stable source URI. For local files, use a normalized path URI or path string. |
| `path` | no | `str` | Local path when available. |
| `name` | no | `str` | Display name or basename. |
| `media_type` | no | `str` | Detected or declared media type. |
| `extension` | no | `str` | Lower-case suffix including the dot. |
| `size` | no | `int` | Byte size for local files. |
| `mtime_ns` | no | `int` | Local file modification timestamp in nanoseconds. |
| `digest` | no | `str` | `sha256:<hex>` of source bytes when available. |
| `metadata` | no | `dict` | Source asset metadata that is not document metadata. |

### `SourceMetadata`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `title` | no | `str` | Source title. |
| `creators` | no | `list[str]` | Authors or creators in source order. |
| `language` | no | `str` | BCP 47 language tag when known. |
| `rights` | no | `str` | Rights or license text from the source. |
| `source_url` | no | `str` | Original public URL when known. |
| `publication_date` | no | `str` | Source publication date string. |
| `publisher` | no | `str` | Publisher name. |
| `identifiers` | no | `dict[str, str]` | ISBN, DOI, package IDs, and similar identifiers. |
| `raw` | no | `dict` | Adapter-preserved raw metadata. |

### `SourceProvenance`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `source_uri` | yes | `str` | Source asset URI. |
| `source_path` | no | `str` | Local source path. |
| `source_href` | no | `str` | Package-internal href or document-relative reference. |
| `package_path` | no | `str` | Archive/package member path, such as EPUB XHTML. |
| `anchor` | no | `str` | Source anchor or fragment. |
| `page` | no | `str` | Page label or number where available. |
| `section` | no | `str` | Chapter, section, or nav label. |
| `start_offset` | no | `int` | Adapter-defined start offset. |
| `end_offset` | no | `int` | Adapter-defined end offset. |
| `digest` | no | `str` | Digest of the specific source component. |
| `metadata` | no | `dict` | Adapter-specific provenance details. |

### `NormalizedMarkdownSegment`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `segment_id` | yes | `str` | Stable ID unique within the document. |
| `order` | yes | `int` | Zero-based reading order. |
| `markdown` | yes | `str` | Canonical Markdown for the segment. |
| `heading` | no | `str` | Primary heading text for the segment. |
| `heading_level` | no | `int` | Markdown heading level when known. |
| `anchors` | no | `list[str]` | Source anchors covered by the segment. |
| `provenance` | no | `list[SourceProvenance]` | Source spans contributing to the segment. |
| `metadata` | no | `dict` | Adapter-specific segment metadata. |

Segment IDs should be deterministic. Prefer source anchors when they are stable
and unique. Otherwise use ordinal IDs such as `seg-0001`, `seg-0002`, and so
on. Segment order is always authoritative for reading order.

### `NormalizationQuality`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `lossiness` | yes | `str` | One of `none`, `low`, `medium`, `high`, or `unknown`. |
| `confidence` | no | `float` | Adapter confidence from `0.0` to `1.0`. |
| `skipped_items` | no | `int` | Count of skipped source items. |
| `warnings` | no | `int` | Count of warning diagnostics. |
| `metadata` | no | `dict` | Adapter-specific quality details. |

### `NormalizedMarkdownDocument`

| Field | Required | Type | Meaning |
| --- | --- | --- | --- |
| `schema_version` | yes | `str` | V1 uses `markitect.source.v1`. |
| `document_id` | yes | `str` | Stable normalized document ID. |
| `asset` | yes | `SourceAsset` | Original source identity. |
| `metadata` | yes | `SourceMetadata` | Source document metadata. |
| `markdown` | yes | `str` | Full normalized Markdown. |
| `segments` | yes | `list[NormalizedMarkdownSegment]` | Ordered segment list. |
| `quality` | yes | `NormalizationQuality` | Extraction quality summary. |
| `diagnostics` | no | `list[Diagnostic]` | Existing Markitect diagnostic shape. |
| `provenance` | no | `list[SourceProvenance]` | Document-level provenance. |
| `attachments` | no | `list[SourceAsset]` | Referenced binary assets; v1 metadata only. |
| `adapter` | yes | `dict` | Adapter id, version, and options. |
| `cache_key` | yes | `str` | Deterministic normalization cache key. |

The full `markdown` field should be equal to the ordered segment Markdown joined
with exactly two newlines, unless an adapter has a documented reason to emit
document-level frontmatter or separators.
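
The relationship between segments and the document-level `markdown` field can be illustrated with two small helpers. Both names are hypothetical, segments are shown as plain dicts rather than model objects, and mapping zero-based order `0` to `seg-0001` is an assumption about the ordinal ID style.

```python
from __future__ import annotations


def ordinal_segment_id(order: int) -> str:
    """Fallback ID when no stable source anchor exists (order 0 -> seg-0001)."""
    return f"seg-{order + 1:04d}"


def join_segments(segments: list[dict]) -> str:
    """Join segment Markdown in reading order with exactly two newlines."""
    ordered = sorted(segments, key=lambda segment: segment["order"])
    return "\n\n".join(segment["markdown"] for segment in ordered)
```

A contract test could assert `document.markdown == join_segments(...)` for adapters that do not declare frontmatter or custom separators.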

## Hashing And Cache Keys

Source asset digests use the source bytes:

```text
sha256:<hex>
```

Document IDs should be stable across machines and based on:

- normalized source asset URI or path
- source asset digest when available
- adapter ID
- adapter version

Normalization cache keys should be based on canonical JSON containing:

- source asset URI or path
- source asset digest
- adapter ID
- adapter version
- normalized model version
- read options

Use this prefix:

```text
source-normalize:sha256:<hex>
```
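
A minimal sketch of the digest and cache-key computation, assuming the canonical JSON rules from the Data Model section. The function names and the payload key names (`uri`, `digest`, `adapter_id`, and so on) are illustrative choices, not part of the contract.

```python
from __future__ import annotations

import hashlib
import json


def asset_digest(data: bytes) -> str:
    """Digest of the raw source bytes in sha256:<hex> form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def normalization_cache_key(
    *,
    uri: str,
    digest: str,
    adapter_id: str,
    adapter_version: str,
    model_version: str,
    options: dict,
) -> str:
    """Hash a canonical JSON payload of everything that affects the output."""
    payload = {
        "uri": uri,
        "digest": digest,
        "adapter_id": adapter_id,
        "adapter_version": adapter_version,
        "model_version": model_version,
        "options": options,
    }
    # Sorted keys and compact separators make the bytes independent of
    # dict insertion order, so equal inputs always give equal keys.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    hexdigest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "source-normalize:sha256:" + hexdigest
```

Because the options are part of the payload, changing any read option yields a different key, while reordering them does not.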

## Read Adapter Protocol

`MKTT-WP-0018` should implement Python `Protocol` classes equivalent to:

```python
from __future__ import annotations

from typing import Protocol


class SourceReadAdapter(Protocol):
    descriptor: SourceAdapterDescriptor

    # Cheap matching; must not open or parse the whole source.
    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
        ...

    # Metadata-only inspection; no full Markdown conversion.
    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
        ...

    # Full normalization into a NormalizedMarkdownDocument.
    def read(self, request: SourceReadRequest) -> SourceReadResult:
        ...
```

Request and result objects:

| Type | Required fields | Meaning |
| --- | --- | --- |
| `SourceAdapterMatchRequest` | `asset`, `options` | Cheap matching request. |
| `SourceAdapterMatch` | `adapter_id`, `matched`, `confidence`, `reason`, `diagnostics` | Match result. Confidence is `0` to `100`. |
| `SourceInspectRequest` | `asset`, `options` | Metadata-only inspection request. |
| `SourceInspectResult` | `valid`, `asset`, `adapter`, `metadata`, `capabilities`, `diagnostics`, `quality` | Inspection result without full Markdown conversion. |
| `SourceReadRequest` | `asset`, `options` | Full normalization request. |
| `SourceReadResult` | `valid`, `document`, `diagnostics` | Normalized read result. |

`inspect` must not perform full conversion. It may open enough of the source to
validate structure and collect metadata. `read` may perform full extraction.

Options must be JSON-serializable. Adapter-specific options should be declared
in `option_schema`. Unknown options should produce
`source.unknown_option` unless the descriptor explicitly permits free-form
options.
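
The unknown-option rule might be checked as follows. This is only a sketch: it assumes `option_schema` lists declared options under a `properties` key, as JSON Schema does, and it models the "permits free-form options" switch as a hypothetical `allow_free_form` flag because the contract does not say where that permission lives.

```python
from __future__ import annotations


def unknown_option_diagnostics(
    options: dict,
    option_schema: dict,
    allow_free_form: bool = False,
) -> list[dict]:
    """Return one source.unknown_option diagnostic per undeclared option."""
    if allow_free_form:
        return []
    known = set(option_schema.get("properties", {}))
    return [
        {"code": "source.unknown_option", "message": f"unknown option {name!r}"}
        for name in sorted(options)  # sorted for deterministic output
        if name not in known
    ]
```

For instance, passing `{"skip_cover": True, "typo": 1}` against a schema that declares only `skip_cover` would report `typo` and nothing else.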

## Adapter Selection

Selection is deterministic:

1. If an explicit adapter ID is provided, use only that descriptor.
2. Prefer media type matches over extension-only matches.
3. Prefer higher `can_read().confidence`.
4. Prefer descriptors with required optional dependencies available.
5. Break remaining ties by descriptor ID in ascending lexical order and emit
   warning `source.adapter_ambiguous`.

No matching adapter returns an error diagnostic:

```text
source.unsupported_format
```

Malformed sources return an error diagnostic:

```text
source.malformed
```

Missing required optional dependencies return:

```text
source.missing_dependency
```

Warnings do not make a result invalid. Any error diagnostic makes `valid`
false.
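
The selection order above maps naturally onto a sort key. The sketch below assumes each candidate has already matched via `can_read`; the `Candidate` class and `select` function are hypothetical, and returning `source.unsupported_format` when an explicit ID matches nothing is an assumption the contract does not spell out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One adapter that reported a match for the asset."""
    adapter_id: str
    media_type_match: bool
    confidence: int  # 0 to 100, as in SourceAdapterMatch
    dependencies_available: bool


def _rank(c: Candidate) -> tuple:
    # False sorts before True, so the preferred value is expressed as False.
    return (not c.media_type_match, -c.confidence, not c.dependencies_available)


def select(candidates: list[Candidate], explicit_id: str | None = None):
    """Return (best candidate or None, diagnostics) following rules 1 to 5."""
    if explicit_id is not None:
        candidates = [c for c in candidates if c.adapter_id == explicit_id]
    if not candidates:
        return None, [{"code": "source.unsupported_format", "severity": "error"}]
    ordered = sorted(candidates, key=lambda c: (_rank(c), c.adapter_id))
    best = ordered[0]
    diagnostics = []
    if len(ordered) > 1 and _rank(ordered[1]) == _rank(best):
        # Rules 2 to 4 could not separate the top candidates.
        diagnostics.append({"code": "source.adapter_ambiguous", "severity": "warning"})
    return best, diagnostics
```

Keeping the descriptor ID as the last sort component means the result is the same on every machine, even when the warning fires.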

## CLI Contract

The public commands are:

```bash
mkt source adapters
mkt source inspect <path>
mkt source normalize <path> --format markdown
```

Common options:

```text
--adapter <adapter-id>        Explicit adapter selection.
--format text|json|yaml       For adapters and inspect.
--format markdown|json|yaml   For normalize.
--option key=value            Adapter-specific option, repeatable.
--output <path>               Write normalized output.
```

Exit behavior:

| Exit | Meaning |
| --- | --- |
| `0` | Operation valid; warning diagnostics may exist. |
| `1` | Operation completed with error diagnostics. |
| `2` | CLI usage error from Click. |

JSON output must contain a top-level `valid` field for operations that can
fail. Markdown output writes only normalized Markdown to stdout or `--output`;
diagnostics for Markdown output go to stderr. If normalization is invalid, do
not emit partial Markdown unless a future option explicitly requests it.

## API Contract

`MKTT-WP-0018` should export these names from `markitect_tool`:

```text
SourceAsset
SourceMetadata
SourceProvenance
NormalizedMarkdownSegment
NormalizedMarkdownDocument
NormalizationQuality
SourceAdapterDescriptor
SourceReadAdapter
SourceAdapterRegistry
SourceAdapterMatchRequest
SourceAdapterMatch
SourceInspectRequest
SourceInspectResult
SourceReadRequest
SourceReadResult
default_source_adapter_registry
discover_source_adapters
inspect_source
normalize_source
```

Direct API helpers should accept an optional registry and adapter ID so tests
and sibling repos can avoid global discovery when they need deterministic
fixtures.

## Contract Tests For MKTT-WP-0018

Implementation should add tests for:

- `SourceAsset`, metadata, provenance, quality, segment, and document
  serialization
- source document cache-key determinism
- fake in-tree adapter registration and read behavior
- fake external entry point discovery
- optional dependency diagnostics
- unsupported format diagnostics
- malformed source diagnostics
- adapter selection tie behavior
- CLI `source adapters` JSON fixture
- CLI `source inspect` JSON fixture
- CLI `source normalize --format json` fixture
- CLI `source normalize --format markdown` fixture
- public API exports

Fixtures live in `examples/source-adapters/` and should be reused by tests where
practical.

## Markitect-filter Handoff

The first `markitect-filter` implementation should provide an EPUB3 descriptor:

```text
id: source.epub3
name: EPUB3
operations: read
media_types: application/epub+zip
extensions: .epub
entry_point: markitect_filter.adapters:epub3_adapter_descriptor
```

The EPUB3 adapter should inspect and normalize:

- `META-INF/container.xml`
- the OPF package document
- Dublin Core and package metadata
- spine reading order
- navigation labels
- body XHTML as ordered Markdown segments
- source hrefs, anchors, sections, and page references where available

It should classify or skip cover, navigation, table-of-contents, header,
footer, license, and transcriber-note material through explicit options and
diagnostics. It should report unsupported media, malformed package structure,
skipped assets, and lossy extraction.
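
As an illustration of the first inspection step, the OPF package document can be located from `META-INF/container.xml` using only the standard library. This is a sketch, not the adapter design: `find_opf_path` is a hypothetical helper, and the diagnostic dicts assume a `code` plus `message` shape.

```python
from __future__ import annotations

import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by the EPUB container document.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(epub: zipfile.ZipFile) -> tuple[str | None, list[dict]]:
    """Return the OPF member path, or None plus a source.malformed diagnostic."""
    def malformed(message: str) -> tuple[None, list[dict]]:
        return None, [{"code": "source.malformed", "message": message}]

    try:
        raw = epub.read("META-INF/container.xml")
    except KeyError:  # ZipFile.read raises KeyError for a missing member
        return malformed("missing META-INF/container.xml")
    try:
        root = ET.fromstring(raw)
    except ET.ParseError as exc:
        return malformed(f"container.xml is not well-formed XML: {exc}")
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None or not rootfile.get("full-path"):
        return malformed("container.xml declares no rootfile full-path")
    return rootfile.get("full-path"), []
```

The returned path is the archive member to parse next for Dublin Core metadata and the spine; a missing or broken container surfaces as a diagnostic rather than an exception.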