Workplan refinement and examples

2026-05-14 21:49:43 +02:00
parent 28ce4b3f65
commit f8f20c7c32
13 changed files with 726 additions and 23 deletions
--- a/docs/source-adapter-contract.md
+++ b/docs/source-adapter-contract.md
@@ -0,0 +1,461 @@
+# Source Adapter Contract
+
+## Purpose
+
+This document pins the v1 contract for source-format adapters. It is the
+handoff from `MKTT-WP-0019` to `MKTT-WP-0018`: `markitect-tool` implements the
+contract, registry, CLI, public API, and tests; `markitect-filter` implements
+concrete adapters, starting with EPUB3.
+
+The v1 contract is intentionally read-only. It normalizes heterogeneous source
+formats into canonical Markitect Markdown plus metadata, provenance, quality
+signals, and diagnostics. Writer/export adapters are future scope.
+
+## Scope
+
+The v1 source adapter layer supports:
+
+- local filesystem source inputs
+- deterministic inspection and normalization
+- package-provided read adapters discovered through Python entry points
+- optional dependencies isolated in adapter packages
+- JSON-serializable normalized Markdown outputs
+- contract tests with fake adapters and small fixtures
+
+The v1 layer does not support:
+
+- EPUB3, PDF, DOCX, ODT, OCR, browser, or archive parsing in `markitect-tool`
+- write/export adapters
+- network fetching for source URIs
+- durable ingestion, permissions, retrieval, or governance
+- hidden AI-assisted repair or enrichment
+
+URI fields appear in the model so adapters can preserve source identity, but
+v1 CLI/API inputs are local paths unless a later workplan opens remote source
+loading explicitly.
+
+## Package Shape
+
+External adapter packages should depend on `markitect-tool` and register one or
+more read adapter descriptors through the entry point group
+`markitect_tool.source_adapters`.
+
+Recommended `markitect-filter` shape:
+
+```text
+markitect_filter/
+  src/markitect_filter/
+    __init__.py
+    epub3.py
+    adapters.py
+  tests/
+    fixtures/epub3/
+    test_epub3_adapter.py
+  pyproject.toml
+```
+
+The package should expose a lightweight descriptor function that does not
+import heavyweight format dependencies until the adapter is instantiated or
+used. For example:
+
+```toml
+[project.entry-points."markitect_tool.source_adapters"]
+epub3 = "markitect_filter.adapters:epub3_adapter_descriptor"
+```
+
+Adapter packages may use extras such as `markitect-filter[epub3]` or
+`markitect-filter[pdf]`. Missing optional dependencies must be reported through
+structured diagnostics; they must not surface as raw import errors.
+
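A minimal sketch of such a lazy descriptor function. `SourceAdapterDescriptor` below is a simplified stand-in for the dataclass that `MKTT-WP-0018` is expected to define, and `_make_epub3_adapter`, the `0.1.0` version string, and the use of `zipfile` as the "heavy" dependency are illustrative assumptions.

```python
import importlib
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    """Simplified stand-in; the real dataclass belongs to markitect-tool."""
    id: str
    version: str
    name: str
    operations: Tuple[str, ...]
    media_types: Tuple[str, ...]
    extensions: Tuple[str, ...]
    factory: Callable[[], Any]


def _make_epub3_adapter() -> Any:
    # Format-specific imports happen only here, when an adapter is requested.
    # `zipfile` stands in for a real optional EPUB dependency.
    backend = importlib.import_module("zipfile")
    return {"adapter_id": "source.epub3", "backend": backend.__name__}


def epub3_adapter_descriptor() -> SourceAdapterDescriptor:
    # Cheap to call: builds metadata only and imports nothing format-specific.
    return SourceAdapterDescriptor(
        id="source.epub3",
        version="0.1.0",  # hypothetical version string
        name="EPUB3",
        operations=("read",),
        media_types=("application/epub+zip",),
        extensions=(".epub",),
        factory=_make_epub3_adapter,
    )
```

Listing adapters only calls `epub3_adapter_descriptor()`; the import cost is paid when `descriptor.factory()` runs.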
+## Descriptor Contract
+
+`MKTT-WP-0018` should implement a `SourceAdapterDescriptor` dataclass and map
+it into the existing `ExtensionDescriptor` catalog with kind `source-adapter`.
+
+Required descriptor fields:
+
+| Field | Type | Meaning |
+| --- | --- | --- |
+| `id` | `str` | Stable adapter id, for example `source.epub3`. |
+| `version` | `str` | Adapter contract implementation version. |
+| `name` | `str` | Human-readable adapter name. |
+| `operations` | `list[str]` | V1 must contain only `read`. |
+| `media_types` | `list[str]` | Supported media types, lower-case. |
+| `extensions` | `list[str]` | Supported file suffixes including dots. |
+| `factory` | callable | Returns a `SourceReadAdapter`. |
+
+Optional descriptor fields:
+
+| Field | Type | Meaning |
+| --- | --- | --- |
+| `summary` | `str \| None` | Short description for CLI and docs. |
+| `option_schema` | `dict` | JSON-schema-like adapter options. |
+| `optional_dependencies` | `list[OptionalDependency]` | Runtime libraries needed by the adapter. |
+| `safety` | `dict` | Flags declaring whether the adapter reads files, uses the network, or runs external processes, plus related flags. |
+| `quality_profile` | `dict` | Known extraction quality behavior. |
+| `metadata` | `dict` | Adapter-specific metadata. |
+
+The corresponding `ExtensionDescriptor` should use:
+
+```text
+id: same as SourceAdapterDescriptor.id
+kind: source-adapter
+input_contract: SourceInspectRequest | SourceReadRequest
+output_contract: SourceInspectResult | SourceReadResult
+diagnostics_namespace: source
+provenance_prefix: source.<adapter>
+```
+
+Capabilities should include:
+
+```text
+source read
+markdown normalize
+diagnostics emit
+provenance emit
+filesystem read
+```
+
+Descriptor IDs are globally unique. Duplicate IDs from external packages are
+registry errors. Descriptors from package entry points are sorted by ID for
+deterministic listing.
+
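The descriptor rules above (v1 operations limited to `read`, lower-case media types, dotted suffixes, unique IDs, sorted listing) could be checked roughly as follows. `Descriptor`, `validate`, and `build_catalog` are hypothetical names for this sketch, and raising `ValueError` on duplicates is one possible way to surface a registry error.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical, reduced descriptor with only the fields checked here."""
    id: str
    operations: Tuple[str, ...] = ("read",)
    media_types: Tuple[str, ...] = ()
    extensions: Tuple[str, ...] = ()


def validate(d: Descriptor) -> List[str]:
    """Return human-readable problems; an empty list means the fields conform."""
    problems = []
    if tuple(d.operations) != ("read",):
        problems.append(f"{d.id}: v1 operations must contain only 'read'")
    if any(m != m.lower() for m in d.media_types):
        problems.append(f"{d.id}: media types must be lower-case")
    if any(not e.startswith(".") for e in d.extensions):
        problems.append(f"{d.id}: extensions must include the leading dot")
    return problems


def build_catalog(descriptors: Iterable[Descriptor]) -> List[Descriptor]:
    """Reject duplicate IDs and return descriptors sorted by ID."""
    by_id = {}
    for d in descriptors:
        if d.id in by_id:
            raise ValueError(f"duplicate source adapter id: {d.id}")
        by_id[d.id] = d
    return [by_id[key] for key in sorted(by_id)]
```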
+## Entry Point Contract
+
+The entry point group is:
+
+```text
+markitect_tool.source_adapters
+```
+
+Each entry point may load to one of:
+
+- a `SourceAdapterDescriptor`
+- an iterable of `SourceAdapterDescriptor`
+- a callable returning either of the above
+
+Discovery may call a loaded descriptor callable to obtain descriptors, but it
+must not call a descriptor's `factory` or otherwise instantiate adapters.
+Descriptors should remain cheap enough to list without format-specific imports.
+
+Discovery errors should produce diagnostics with code
+`source.discovery_failed`. Missing optional dependencies declared by a
+descriptor should produce `source.missing_dependency` and mark that adapter
+unavailable for reads until the dependency is installed.
+
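A sketch of how the three accepted entry point shapes might be normalized into a flat descriptor list. `coerce_descriptors` and `discover` are hypothetical helper names, the reduced `SourceAdapterDescriptor` is a stand-in, and the diagnostic dict shape is an assumption (the real one is the existing Markitect diagnostic). `entry_points(group=...)` requires Python 3.10 or later.

```python
from collections.abc import Iterable
from dataclasses import dataclass
from importlib.metadata import entry_points

GROUP = "markitect_tool.source_adapters"


@dataclass(frozen=True)
class SourceAdapterDescriptor:
    """Reduced stand-in for the real descriptor dataclass."""
    id: str


def coerce_descriptors(loaded):
    """Accept a descriptor, an iterable of descriptors, or a callable returning either."""
    if isinstance(loaded, SourceAdapterDescriptor):
        return [loaded]
    if callable(loaded):
        # Calling a descriptor function is allowed; adapter factories are not called.
        return coerce_descriptors(loaded())
    if isinstance(loaded, Iterable) and not isinstance(loaded, (str, bytes)):
        items = list(loaded)
        if all(isinstance(item, SourceAdapterDescriptor) for item in items):
            return items
    raise TypeError(f"unsupported entry point object: {type(loaded).__name__}")


def discover():
    """Load every entry point in the group; failures become diagnostics."""
    descriptors, diagnostics = [], []
    for ep in entry_points(group=GROUP):
        try:
            descriptors.extend(coerce_descriptors(ep.load()))
        except Exception as exc:  # a broken package must not abort discovery
            diagnostics.append({
                "code": "source.discovery_failed",
                "severity": "error",  # assumed severity
                "message": f"{ep.name}: {exc}",
            })
    return sorted(descriptors, key=lambda d: d.id), diagnostics
```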
+## Data Model
+
+All model objects must support stable `to_dict()` serialization. Serialization
+rules:
+
+- omit `None`, empty lists, empty dicts, and empty strings
+- preserve `False`, `0`, and empty Markdown content where semantically valid
+- use UTF-8 text
+- use canonical JSON with sorted keys and compact separators when computing
+  hashes or cache keys
+- keep all timestamps and dates as strings unless they are filesystem metadata
+  such as `mtime_ns`
+
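The serialization rules above can be sketched as a pruning pass plus a canonical JSON encoder. `prune`, `canonical_json`, and the `KEEP_EMPTY` set are hypothetical names; treating `markdown` as the only field whose empty string is preserved is an assumption, as is leaving list items in place after pruning them.

```python
import json
from typing import Any

# Assumption: an empty string is meaningful only for these keys.
KEEP_EMPTY = frozenset({"markdown"})


def _is_empty(value: Any) -> bool:
    # False and 0 are never "empty" here; only None and zero-length containers are.
    return value is None or (isinstance(value, (str, list, dict)) and len(value) == 0)


def prune(value: Any) -> Any:
    """Recursively drop None, empty lists, empty dicts, and empty strings from dicts."""
    if isinstance(value, dict):
        pruned = {}
        for key, item in value.items():
            item = prune(item)
            if _is_empty(item) and not (key in KEEP_EMPTY and item == ""):
                continue
            pruned[key] = item
        return pruned
    if isinstance(value, list):
        return [prune(item) for item in value]
    return value


def canonical_json(value: Any) -> str:
    """Sorted keys and compact separators, for hashes and cache keys."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
```

Encoding the result of `canonical_json` as UTF-8 gives the byte string to hash.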
+### `SourceAsset`
+
+| Field | Required | Type | Meaning |
+| --- | --- | --- | --- |
+| `uri` | yes | `str` | Stable source URI. For local files, use a normalized path URI or path string. |
+| `path` | no | `str` | Local path when available. |
+| `name` | no | `str` | Display name or basename. |
+| `media_type` | no | `str` | Detected or declared media type. |
+| `extension` | no | `str` | Lower-case suffix including the dot. |
+| `size` | no | `int` | Byte size for local files. |
+| `mtime_ns` | no | `int` | Local file modification timestamp in nanoseconds. |
+| `digest` | no | `str` | `sha256:<hex>` of source bytes when available. |
+| `metadata` | no | `dict` | Source asset metadata that is not document metadata. |
+
+### `SourceMetadata`
+
+| Field | Required | Type | Meaning |
+| --- | --- | --- | --- |
+| `title` | no | `str` | Source title. |
+| `creators` | no | `list[str]` | Authors or creators in source order. |
+| `language` | no | `str` | BCP 47 language tag when known. |
+| `rights` | no | `str` | Rights or license text from the source. |
+| `source_url` | no | `str` | Original public URL when known. |
+| `publication_date` | no | `str` | Source publication date string. |
+| `publisher` | no | `str` | Publisher name. |
+| `identifiers` | no | `dict[str, str]` | ISBN, DOI, package IDs, and similar identifiers. |
+| `raw` | no | `dict` | Adapter-preserved raw metadata. |
+
+### `SourceProvenance`
+
+| Field | Required | Type | Meaning |
+| --- | --- | --- | --- |
+| `source_uri` | yes | `str` | Source asset URI. |
+| `source_path` | no | `str` | Local source path. |
+| `source_href` | no | `str` | Package-internal href or document-relative reference. |
+| `package_path` | no | `str` | Archive/package member path, such as EPUB XHTML. |
+| `anchor` | no | `str` | Source anchor or fragment. |
+| `page` | no | `str` | Page label or number where available. |
+| `section` | no | `str` | Chapter, section, or nav label. |
+| `start_offset` | no | `int` | Adapter-defined start offset. |
+| `end_offset` | no | `int` | Adapter-defined end offset. |
+| `digest` | no | `str` | Digest of the specific source component. |
+| `metadata` | no | `dict` | Adapter-specific provenance details. |
+
+### `NormalizedMarkdownSegment`
+
+| Field | Required | Type | Meaning |
+| --- | --- | --- | --- |
+| `segment_id` | yes | `str` | Stable ID unique within the document. |
+| `order` | yes | `int` | Zero-based reading order. |
+| `markdown` | yes | `str` | Canonical Markdown for the segment. |
+| `heading` | no | `str` | Primary heading text for the segment. |
+| `heading_level` | no | `int` | Markdown heading level when known. |
+| `anchors` | no | `list[str]` | Source anchors covered by the segment. |
+| `provenance` | no | `list[SourceProvenance]` | Source spans contributing to the segment. |
+| `metadata` | no | `dict` | Adapter-specific segment metadata. |
+
+Segment IDs should be deterministic. Prefer source anchors when they are stable
+and unique. Otherwise use ordinal IDs such as `seg-0001`, `seg-0002`, and so
+on. Segment order is always authoritative for reading order.
+
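One way to apply the segment ID rule, sketched under an assumption: anchors are used only when every segment has one and all are unique, otherwise every segment gets an ordinal ID. Whether an anchor is stable across runs is something only the adapter can judge. `assign_segment_ids` is a hypothetical helper.

```python
from typing import List, Optional


def assign_segment_ids(anchors: List[Optional[str]]) -> List[str]:
    """Return one ID per segment, in reading order.

    `anchors[i]` is the candidate source anchor of segment i, or None.
    """
    usable = all(anchors) and len(set(anchors)) == len(anchors)
    if usable:
        return [str(anchor) for anchor in anchors]
    # Ordinal fallback: seg-0001, seg-0002, ...
    return [f"seg-{index:04d}" for index in range(1, len(anchors) + 1)]
```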
+### `NormalizationQuality`
+
+| Field | Required | Type | Meaning |
+| --- | --- | --- | --- |
+| `lossiness` | yes | `str` | One of `none`, `low`, `medium`, `high`, or `unknown`. |
+| `confidence` | no | `float` | Adapter confidence from `0.0` to `1.0`. |
+| `skipped_items` | no | `int` | Count of skipped source items. |
+| `warnings` | no | `int` | Count of warning diagnostics. |
+| `metadata` | no | `dict` | Adapter-specific quality details. |
+
+### `NormalizedMarkdownDocument`
+
+| Field | Required | Type | Meaning |
+| --- | --- | --- | --- |
+| `schema_version` | yes | `str` | V1 uses `markitect.source.v1`. |
+| `document_id` | yes | `str` | Stable normalized document ID. |
+| `asset` | yes | `SourceAsset` | Original source identity. |
+| `metadata` | yes | `SourceMetadata` | Source document metadata. |
+| `markdown` | yes | `str` | Full normalized Markdown. |
+| `segments` | yes | `list[NormalizedMarkdownSegment]` | Ordered segment list. |
+| `quality` | yes | `NormalizationQuality` | Extraction quality summary. |
+| `diagnostics` | no | `list[Diagnostic]` | Existing Markitect diagnostic shape. |
+| `provenance` | no | `list[SourceProvenance]` | Document-level provenance. |
+| `attachments` | no | `list[SourceAsset]` | Referenced binary assets; v1 metadata only. |
+| `adapter` | yes | `dict` | Adapter id, version, and options. |
+| `cache_key` | yes | `str` | Deterministic normalization cache key. |
+
+The full `markdown` field should equal the segment Markdown, taken in `order`
+and joined with exactly two newline characters (one blank line), unless an
+adapter has a documented reason to emit document-level frontmatter or separators.
+
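The relationship between `markdown` and `segments` can be expressed as a small helper, which a contract test could use as a consistency check. `join_segments` is a hypothetical name, and segments are shown as plain dicts for brevity.

```python
from typing import Dict, List


def join_segments(segments: List[Dict]) -> str:
    """Join segment Markdown in reading order with exactly two newlines."""
    ordered = sorted(segments, key=lambda segment: segment["order"])
    return "\n\n".join(segment["markdown"] for segment in ordered)
```

An adapter without frontmatter or custom separators would then satisfy `document["markdown"] == join_segments(document["segments"])`.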
+## Hashing And Cache Keys
+
+Source asset digests use the source bytes:
+
+```text
+sha256:<hex>
+```
+
+Document IDs should be stable across machines and based on:
+
+- normalized source asset URI or path
+- source asset digest when available
+- adapter ID
+- adapter version
+
+Normalization cache keys should be based on canonical JSON containing:
+
+- source asset URI or path
+- source asset digest
+- adapter ID
+- adapter version
+- normalized model version (`schema_version`)
+- read options
+
+Use this prefix:
+
+```text
+source-normalize:sha256:<hex>
+```
+
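A sketch of both digests using only the standard library. The payload key names inside `cache_key` are assumptions; the contract fixes the inputs and the prefixes, not the exact JSON field names.

```python
import hashlib
import json
from typing import Any, Dict


def source_digest(data: bytes) -> str:
    """Digest of the raw source bytes, in `sha256:<hex>` form."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def cache_key(uri: str, digest: str, adapter_id: str, adapter_version: str,
              schema_version: str, options: Dict[str, Any]) -> str:
    """Deterministic normalization cache key built from canonical JSON."""
    payload = {  # key names are illustrative
        "adapter_id": adapter_id,
        "adapter_version": adapter_version,
        "digest": digest,
        "options": options,
        "schema_version": schema_version,
        "uri": uri,
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    hexdigest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "source-normalize:sha256:" + hexdigest
```

Because keys are sorted before hashing, the order in which read options are supplied does not change the cache key.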
+## Read Adapter Protocol
+
+`MKTT-WP-0018` should implement Python `Protocol` classes equivalent to:
+
+```python
+class SourceReadAdapter(Protocol):
+    descriptor: SourceAdapterDescriptor
+
+    def can_read(self, request: SourceAdapterMatchRequest) -> SourceAdapterMatch:
+        ...
+
+    def inspect(self, request: SourceInspectRequest) -> SourceInspectResult:
+        ...
+
+    def read(self, request: SourceReadRequest) -> SourceReadResult:
+        ...
+```
+
+Request and result objects:
+
+| Type | Required fields | Meaning |
+| --- | --- | --- |
+| `SourceAdapterMatchRequest` | `asset`, `options` | Cheap matching request. |
+| `SourceAdapterMatch` | `adapter_id`, `matched`, `confidence`, `reason`, `diagnostics` | Match result. Confidence is `0` to `100`; note that `NormalizationQuality.confidence` uses `0.0` to `1.0`. |
+| `SourceInspectRequest` | `asset`, `options` | Metadata-only inspection request. |
+| `SourceInspectResult` | `valid`, `asset`, `adapter`, `metadata`, `capabilities`, `diagnostics`, `quality` | Inspection result without full Markdown conversion. |
+| `SourceReadRequest` | `asset`, `options` | Full normalization request. |
+| `SourceReadResult` | `valid`, `document`, `diagnostics` | Normalized read result. |
+
+`inspect` must not perform full conversion. It may open enough of the source to
+validate structure and collect metadata. `read` may perform full extraction.
+
+Options must be JSON-serializable. Adapter-specific options should be declared
+in `option_schema`. Unknown options should produce
+`source.unknown_option` unless the descriptor explicitly permits free-form
+options.
+
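The unknown-option rule might look like this in practice. `check_options` is a hypothetical helper, the assumption that `option_schema` lists known options under a `properties` key follows JSON Schema convention, and the diagnostic dict is a placeholder for the existing Markitect diagnostic shape.

```python
from typing import Any, Dict, List


def check_options(options: Dict[str, Any], option_schema: Dict[str, Any],
                  allow_free_form: bool = False) -> List[Dict[str, str]]:
    """Return one `source.unknown_option` diagnostic per undeclared option."""
    if allow_free_form:
        # The descriptor explicitly permits free-form options.
        return []
    known = set(option_schema.get("properties", {}))
    return [
        {"code": "source.unknown_option", "message": f"unknown option: {name}"}
        for name in sorted(options)  # sorted for deterministic output
        if name not in known
    ]
```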
+## Adapter Selection
+
+Selection is deterministic:
+
+1. If an explicit adapter ID is provided, use only that descriptor.
+2. Prefer media type matches over extension-only matches.
+3. Prefer higher `can_read().confidence`.
+4. Prefer descriptors whose declared optional dependencies are installed.
+5. Break remaining ties by descriptor ID in ascending lexical order and emit
+   warning `source.adapter_ambiguous`.
+
+No matching adapter returns an error diagnostic:
+
+```text
+source.unsupported_format
+```
+
+Malformed sources return an error diagnostic:
+
+```text
+source.malformed
+```
+
+When optional dependencies that the selected adapter needs are missing, the result carries:
+
+```text
+source.missing_dependency
+```
+
+Warnings do not make a result invalid. Any error diagnostic makes `valid`
+false.
+
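The selection order and validity rule above can be sketched as a sort key. `Candidate`, `select`, and `is_valid` are hypothetical names; candidates are assumed to have already matched via `can_read`, the `severity` key is an assumed diagnostic field, and returning `source.unsupported_format` for an unknown explicit ID is an assumption the contract does not spell out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    adapter_id: str
    media_type_match: bool       # False means extension-only match
    confidence: int              # can_read().confidence, 0 to 100
    dependencies_available: bool


def _rank(c: Candidate) -> Tuple[bool, int, bool]:
    # Smaller sorts first: media type match, higher confidence, deps installed.
    return (not c.media_type_match, -c.confidence, not c.dependencies_available)


def select(candidates: List[Candidate], explicit_id: Optional[str] = None):
    """Return (chosen candidate or None, diagnostics)."""
    if explicit_id is not None:
        candidates = [c for c in candidates if c.adapter_id == explicit_id]
    if not candidates:
        return None, [{"code": "source.unsupported_format", "severity": "error"}]
    ordered = sorted(candidates, key=lambda c: (_rank(c), c.adapter_id))
    best = ordered[0]
    diagnostics: List[Dict[str, str]] = []
    if len(ordered) > 1 and _rank(ordered[1]) == _rank(best):
        # Tie broken by ascending ID; surface it as a warning.
        diagnostics.append({"code": "source.adapter_ambiguous", "severity": "warning"})
    return best, diagnostics


def is_valid(diagnostics: List[Dict[str, str]]) -> bool:
    """Warnings keep a result valid; any error makes it invalid."""
    return not any(d["severity"] == "error" for d in diagnostics)
```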
+## CLI Contract
+
+The public commands are:
+
+```bash
+mkt source adapters
+mkt source inspect <path>
+mkt source normalize <path> --format markdown
+```
+
+Common options:
+
+```text
+--adapter <adapter-id>       Explicit adapter selection.
+--format text|json|yaml      For adapters and inspect.
+--format markdown|json|yaml  For normalize.
+--option key=value           Adapter-specific option, repeatable.
+--output <path>              Write normalized output.
+```
+
+Exit behavior:
+
+| Exit | Meaning |
+| --- | --- |
+| `0` | Operation valid; warning diagnostics may exist. |
+| `1` | Operation completed with error diagnostics. |
+| `2` | CLI usage error from Click. |
+
+JSON output must contain a top-level `valid` field for operations that can
+fail. Markdown output writes only normalized Markdown to stdout or `--output`;
+diagnostics for Markdown output go to stderr. If normalization is invalid, do
+not emit partial Markdown unless a future option explicitly requests it.
+
+## API Contract
+
+`MKTT-WP-0018` should export these names from `markitect_tool`:
+
+```text
+SourceAsset
+SourceMetadata
+SourceProvenance
+NormalizedMarkdownSegment
+NormalizedMarkdownDocument
+NormalizationQuality
+SourceAdapterDescriptor
+SourceReadAdapter
+SourceAdapterRegistry
+SourceAdapterMatchRequest
+SourceAdapterMatch
+SourceInspectRequest
+SourceInspectResult
+SourceReadRequest
+SourceReadResult
+default_source_adapter_registry
+discover_source_adapters
+inspect_source
+normalize_source
+```
+
+Direct API helpers should accept an optional registry and adapter ID so tests
+and sibling repos can avoid global discovery when they need deterministic
+fixtures.
+
+## Contract Tests For MKTT-WP-0018
+
+Implementation should add tests for:
+
+- `SourceAsset`, metadata, provenance, quality, segment, and document
+  serialization
+- source document cache-key determinism
+- fake in-tree adapter registration and read behavior
+- fake external entry point discovery
+- optional dependency diagnostics
+- unsupported format diagnostics
+- malformed source diagnostics
+- adapter selection tie behavior
+- CLI `source adapters` JSON fixture
+- CLI `source inspect` JSON fixture
+- CLI `source normalize --format json` fixture
+- CLI `source normalize --format markdown` fixture
+- public API exports
+
+Fixtures live in `examples/source-adapters/` and should be reused by tests where
+practical.
+
+## Markitect-filter Handoff
+
+The first `markitect-filter` implementation should provide an EPUB3 descriptor:
+
+```text
+id: source.epub3
+name: EPUB3
+operations: read
+media_types: application/epub+zip
+extensions: .epub
+entry_point: markitect_filter.adapters:epub3_adapter_descriptor
+```
+
+The EPUB3 adapter should inspect and normalize:
+
+- `META-INF/container.xml`
+- the OPF package document
+- Dublin Core and package metadata
+- spine reading order
+- navigation labels
+- body XHTML as ordered Markdown segments
+- source hrefs, anchors, sections, and page references where available
+
+It should classify or skip cover, navigation, table-of-contents, header,
+footer, license, and transcriber-note material through explicit options and
+diagnostics. It should report unsupported media, malformed package structure,
+skipped assets, and lossy extraction.
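
As a starting point for the first two items in the list above, the OPF package document can be located through `META-INF/container.xml` using only the standard library. This is a minimal sketch, not the `markitect-filter` implementation: `find_opf_path` and `build_demo_epub` are hypothetical helpers, and a real adapter would likely want a hardened XML parser for untrusted input.

```python
import io
import xml.etree.ElementTree as ET
import zipfile
from typing import Optional

# Namespace used by EPUB container documents.
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}


def find_opf_path(archive: zipfile.ZipFile) -> Optional[str]:
    """Return the OPF path declared in container.xml, or None if malformed.

    A caller would turn None into a `source.malformed` error diagnostic.
    """
    try:
        raw = archive.read("META-INF/container.xml")
        root = ET.fromstring(raw)
    except (KeyError, ET.ParseError):
        # KeyError: member missing from the archive; ParseError: invalid XML.
        return None
    rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
    if rootfile is None:
        return None
    return rootfile.get("full-path") or None


def build_demo_epub() -> bytes:
    """Build a tiny in-memory archive containing only container.xml (demo data)."""
    container = (
        '<container version="1.0" '
        'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
        '<rootfiles><rootfile full-path="OEBPS/content.opf" '
        'media-type="application/oebps-package+xml"/></rootfiles></container>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("META-INF/container.xml", container)
    return buffer.getvalue()
```

With the demo archive, `find_opf_path(zipfile.ZipFile(io.BytesIO(build_demo_epub())))` yields the declared `full-path`, which the adapter would then open to read metadata and the spine.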