---
id: MKTT-WP-0018
type: workplan
title: "Source Adapter Interface And Markdown Normalization Contract"
domain: markitect
status: done
owner: markitect-tool
topic_slug: markitect
planning_priority: complete
planning_order: 145
depends_on_workplans:
  - MKTT-WP-0013
  - MKTT-WP-0017
  - MKTT-WP-0019
related_workplans:
  - MKTT-WP-0010
  - MKTT-WP-0011
  - MKTT-WP-0012
  - MKTT-WP-0019
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_id: "c4e4511f-13ea-40b4-9083-6d9ab6d12dad"
---

# MKTT-WP-0018: Source Adapter Interface And Markdown Normalization Contract

## Purpose

Define the `markitect-tool` framework for source-format adapters and canonical markdown normalization so a separate `markitect-filter` repository can provide concrete adapters for EPUB3 first, and later PDF, DOCX, ODT, HTML, and other formats.

This workplan deliberately does **not** make `markitect-tool` a document conversion product. It keeps `markitect-tool` focused on the syntax layer:

- structured markdown contracts
- canonical normalized markdown representations
- adapter protocols and descriptors
- registry/discovery hooks
- deterministic validation and diagnostics

Concrete format extraction lives outside the core toolkit, initially in `markitect-filter`.

## Background

The Lefevre EPUB3 example in `infospace-bench` exposed a boundary problem. `infospace-bench` can pragmatically read EPUB-like files, but that logic is not application-layer work. The durable split should be:

```text
source formats
  -> markitect-filter concrete adapters
  -> markitect-tool source adapter protocol and markdown normalization contract
  -> infospace-bench generation workflows
  -> optional kontextual-engine persistence, retrieval, governance
```

This lets `infospace-bench` consume normalized markdown sources without owning EPUB/PDF/DOCX details, while `kontextual-engine` can later use the same adapter outputs for managed ingestion.

## Decision

Establish a `markitect-tool` source adapter framework and normalization contract first. Then implement EPUB3 as the first concrete read adapter in a new `markitect-filter` repo.

`MKTT-WP-0019` must run first and pin the v1 contract details for this implementation: field-level normalized model semantics, read-only protocol shape, external package entry point discovery, CLI/API output envelopes, and fake adapter contract-test expectations. Those decisions are captured in `docs/source-adapter-contract.md`.

`markitect-tool` should define:

- adapter request/result types
- canonical markdown document and segment models
- provenance, metadata, quality, and diagnostic envelopes
- adapter capability descriptors
- registry and package discovery hooks
- CLI/API affordances for adapter inspection and conversion
- contract tests that an external adapter package can satisfy

`markitect-filter` should define:

- concrete source readers
- optional dependencies needed for each source format
- format-specific tests and fixtures
- EPUB3 spine/nav/body extraction as its first implementation

Write/export adapters remain future optional scope until a format-specific preservation contract exists.

## Non-Goals

- Do not implement EPUB3 parsing in `markitect-tool` core.
- Do not add heavyweight PDF, DOCX, or ODT dependencies to the base install.
- Do not move infospace lifecycle or generation concerns into `markitect-tool`.
- Do not implement persistent ingestion, permissions, retrieval, or governance; those remain `kontextual-engine` responsibilities.
- Do not define domain-specific entity/relation workflows here.
- Do not implement source writer/export adapters in the v1 slice; keep the protocol read-only unless a later workplan deliberately opens write scope.

## P18.1 - Architecture boundary and markitect-filter handoff

```task
id: MKTT-WP-0018-T001
status: done
priority: high
state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1"
```

Implement the cross-repo architecture pinned by `MKTT-WP-0019`:

- `markitect-tool` owns adapter contracts and markdown normalization
- `markitect-filter` owns concrete source-format adapters
- `infospace-bench` consumes normalized markdown for concrete infospaces
- `kontextual-engine` can ingest adapter outputs into durable knowledge assets

Output: architecture note covering responsibilities, extension package shape, the `docs/source-adapter-contract.md` entry point contract, dependency policy, and migration path from the current `infospace-bench` EPUB spike.

Implemented: `docs/source-adapter-contract.md` defines the contract boundary, external package shape, dependency policy, entry point group, and `markitect-filter` EPUB3 handoff. `docs/source-adapter-migration.md` documents the sibling-repo migration path.

## P18.2 - Canonical source-to-markdown data model

```task
id: MKTT-WP-0018-T002
status: done
priority: high
state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb"
```

Implement the normalized output model specified by `docs/source-adapter-contract.md`:

- `SourceAsset`
- `SourceMetadata`
- `NormalizedMarkdownDocument`
- `NormalizedMarkdownSegment`
- `SourceProvenance`
- `NormalizationDiagnostic`
- `NormalizationQuality`
- optional `SourceBinaryAttachment` or asset reference envelope

The model should represent:

- original path/URI/media type
- title, author/creator, language, rights, source URL, publication metadata
- ordered markdown content
- segment IDs, headings, anchors, page/section references
- digest and cache keys
- extraction diagnostics
- lossiness/quality signals
- adapter name/version/options

Output: public data model, serialization tests using `examples/source-adapters/normalized-document.json`, and normalization contract documentation matching the field-level v1 specification.

Implemented: `markitect_tool.source` exposes `SourceAsset`, `SourceMetadata`, `NormalizedMarkdownDocument`, `NormalizedMarkdownSegment`, `SourceProvenance`, and `NormalizationQuality`, with stable dictionary serialization, round-trip tests, digest/cache-key support, diagnostics, and fixture coverage.

## P18.3 - Source adapter protocol and capability descriptors

```task
id: MKTT-WP-0018-T003
status: done
priority: high
state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9"
```

Define the read adapter interface:

- source reader protocol: `can_read`, `inspect`, `read`
- media type and file extension matching
- adapter option schema
- capability descriptor shape
- safety and dependency flags
- deterministic diagnostics

Do not add writer protocols in this implementation slice. Preserve room for a future writer protocol by keeping descriptors capability-based, but avoid shipping `can_write`/`write` contracts before there is a format-specific preservation model.

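
As a rough illustration of the read-only shape described above, the reader protocol could be expressed as a `typing.Protocol`. This is a sketch under stated assumptions, not the shipped `markitect_tool.source` API: only the method names `can_read`, `inspect`, and `read` come from this workplan, while the signatures, the `Descriptor` and `ReadResult` types, and the `PlainTextAdapter` class are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePath
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class Descriptor:
    """Hypothetical capability descriptor: capability-based, read-only for now."""
    name: str
    version: str
    media_types: tuple[str, ...]
    extensions: tuple[str, ...]
    capabilities: tuple[str, ...] = ("read",)  # no "write" in the v1 slice


@dataclass(frozen=True)
class ReadResult:
    """Hypothetical result envelope: markdown plus deterministic diagnostics."""
    markdown: str
    diagnostics: tuple[str, ...] = ()


@runtime_checkable
class ReadAdapter(Protocol):
    """Assumed signatures; only the three method names come from the workplan."""
    descriptor: Descriptor

    def can_read(self, path: str, media_type: str | None = None) -> bool: ...
    def inspect(self, path: str) -> dict[str, object]: ...
    def read(self, path: str, data: bytes) -> ReadResult: ...


class PlainTextAdapter:
    """Toy adapter used only to exercise the protocol shape."""
    descriptor = Descriptor("plain-text", "0.1.0", ("text/plain",), (".txt",))

    def can_read(self, path, media_type=None):
        # An explicit media type wins; otherwise match on the file extension.
        if media_type is not None:
            return media_type in self.descriptor.media_types
        return PurePath(path).suffix.lower() in self.descriptor.extensions

    def inspect(self, path):
        # Cheap metadata only; no conversion happens here.
        return {"adapter": self.descriptor.name, "path": path}

    def read(self, path, data):
        # Unsupported input yields a diagnostic instead of an exception.
        if not self.can_read(path):
            return ReadResult("", (f"unsupported-format: {path}",))
        return ReadResult(data.decode("utf-8").strip() + "\n")
```

Keeping `capabilities` as a plain tuple of strings is one way to leave room for a later writer capability without committing to `can_write`/`write` signatures now.
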
The first implementation slice can ship a fake in-tree adapter for tests only. Concrete EPUB3 implementation belongs in `markitect-filter`.

Output: protocol module, descriptor integration, tests for matching, inspection, reading, diagnostics, and unsupported-format behavior.

Implemented: `SourceReadAdapter`, request/result types, `SourceAdapterDescriptor`, deterministic selection, dependency diagnostics, unsupported-format diagnostics, and read-only capability descriptors live in `markitect_tool.source`.

## P18.4 - Adapter registry and discovery hooks

```task
id: MKTT-WP-0018-T004
status: done
priority: high
state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3"
```

Wire source adapters into the existing internal extension framework:

- register source adapter descriptors
- discover package-provided adapters through the entry point group pinned by `docs/source-adapter-contract.md`
- expose adapter capabilities via extension listing/inspection
- report missing optional dependency diagnostics
- ensure adapter packages can remain independently versioned

Output: registry implementation, package discovery tests, and compatibility notes for `markitect-filter`.

Implemented: `SourceAdapterRegistry`, `discover_source_adapters`, and `default_source_adapter_registry` discover descriptors through `markitect_tool.source_adapters`, expose source adapter descriptors through the extension catalog, and report missing optional dependencies deterministically.

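
Entry point discovery of this kind can be sketched with the standard library's `importlib.metadata`. The group name `markitect_tool.source_adapters` is taken from the text above; everything else here (`load_descriptors`, `FakeEntryPoint`, the diagnostic wording) is hypothetical and is not the actual `discover_source_adapters` implementation. The keyword form `entry_points(group=...)` assumes Python 3.10 or later.

```python
from importlib.metadata import entry_points

GROUP = "markitect_tool.source_adapters"  # group name taken from the workplan


def load_descriptors(eps=None):
    """Load adapter descriptors from entry points, collecting diagnostics.

    `eps` may be any iterable of objects with `.name` and `.load()`; when
    omitted, installed packages are scanned for the group above.
    """
    if eps is None:
        eps = entry_points(group=GROUP)
    loaded, diagnostics = {}, []
    # Sort by name so results do not depend on installation order.
    for ep in sorted(eps, key=lambda e: e.name):
        try:
            loaded[ep.name] = ep.load()
        except ImportError as exc:
            # A missing optional dependency becomes a diagnostic, not a crash.
            diagnostics.append(f"{ep.name}: missing dependency ({exc})")
    return loaded, diagnostics


class FakeEntryPoint:
    """Hypothetical stand-in for an installed entry point, for tests only."""

    def __init__(self, name, obj=None, error=None):
        self.name, self._obj, self._error = name, obj, error

    def load(self):
        if self._error is not None:
            raise self._error
        return self._obj


fakes = [
    FakeEntryPoint("pdf", error=ImportError("No module named 'pypdf'")),
    FakeEntryPoint("epub3", obj={"name": "epub3", "capabilities": ["read"]}),
]
loaded, diagnostics = load_descriptors(fakes)
```

Passing the entry points in as an argument keeps the discovery logic testable without installing a real adapter package, which matches the goal of proving the contract with a fake adapter.
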
## P18.5 - Normalization CLI and public API surface

```task
id: MKTT-WP-0018-T005
status: done
priority: medium
state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3"
```

Expose a small CLI/API surface:

- `mkt source adapters`
- `mkt source inspect `
- `mkt source normalize --format markdown`
- structured JSON output for inspection and diagnostics
- markdown output for normalized content
- public API exports for adapter discovery and normalization

Output: CLI commands, API exports, generated command/API docs updates, and tests.

Implemented: `mkt source adapters`, `mkt source inspect`, and `mkt source normalize` expose JSON/YAML/text/Markdown behavior. Public API exports were added to `markitect_tool.__all__`, and generated CLI/API docs were refreshed.

## P18.6 - Contract tests and fake adapter fixture

```task
id: MKTT-WP-0018-T006
status: done
priority: high
state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3"
```

Add deterministic contract tests proving that an external read adapter can:

- register through the extension framework
- advertise read capabilities
- inspect a source without full conversion
- normalize a source into canonical markdown documents and segments
- emit provenance, metadata, diagnostics, and quality signals
- fail gracefully for unsupported or malformed sources

Output: fake adapter fixture, reusable contract-test helpers, and documentation for `markitect-filter` adapter implementers.

Implemented: `tests/test_source_adapter_contract.py` provides fake adapter coverage for model serialization, cache keys, registry selection, entry point discovery, dependency diagnostics, CLI JSON/Markdown envelopes, and public API exports. `examples/source-adapters/` contains expected-output fixtures.

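
A reusable contract-test helper in this spirit might look like the sketch below. It checks three of the properties listed above: deterministic output, serialization round-tripping, and graceful failure on malformed input. The `Segment` and `Document` types, `check_contract`, and `fake_normalize` are all hypothetical simplifications; they do not mirror the real models or the contents of `tests/test_source_adapter_contract.py`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Segment:
    segment_id: str
    heading: str
    markdown: str


@dataclass(frozen=True)
class Document:
    adapter: str
    adapter_version: str
    segments: tuple[Segment, ...]

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        segments = tuple(Segment(**s) for s in data["segments"])
        return cls(data["adapter"], data["adapter_version"], segments)

    def cache_key(self):
        # Canonical JSON (sorted keys, fixed separators) keeps the digest stable.
        payload = json.dumps(self.to_dict(), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_contract(normalize, good, bad):
    """Return contract violations for a `normalize(bytes) -> (doc, diagnostics)`."""
    failures = []
    doc, _ = normalize(good)
    again, _ = normalize(good)
    if doc is None or not doc.segments:
        return ["good source produced no segments"]
    if again is None or doc.cache_key() != again.cache_key():
        failures.append("output is not deterministic")
    if Document.from_dict(json.loads(json.dumps(doc.to_dict()))) != doc:
        failures.append("serialization does not round-trip")
    try:
        bad_doc, diagnostics = normalize(bad)
    except Exception as exc:
        failures.append(f"malformed source raised {type(exc).__name__}")
    else:
        if bad_doc is not None or not diagnostics:
            failures.append("malformed source lacks a diagnostic")
    return failures


def fake_normalize(data):
    """Fake adapter: one segment per non-empty line of UTF-8 text."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None, ["malformed-source: not valid UTF-8"]
    lines = [line for line in text.splitlines() if line]
    segments = tuple(
        Segment(f"seg-{i:04d}", line, line) for i, line in enumerate(lines, start=1)
    )
    return Document("fake", "0.0.0", segments), []
```

An adapter implementer could call `check_contract(my_normalize, good_bytes, bad_bytes)` from their own test suite and assert that the returned list is empty.
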
## P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter

```task
id: MKTT-WP-0018-T007
status: done
priority: medium
state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4"
```

Document how the new contract affects sibling repos:

- `infospace-bench` should replace its local EPUB/read normalization spike with calls to the source adapter API
- `markitect-filter` should implement EPUB3 first against this contract
- `kontextual-engine` should treat normalized source outputs as ingestible knowledge asset derivatives when it needs durable ingestion

Output: migration note and follow-up workplan seeds for `markitect-filter` and `infospace-bench`.

Implemented: `docs/source-adapter-migration.md` documents the migration path for `markitect-filter`, `infospace-bench`, and `kontextual-engine`, including follow-up workplan seeds.

## Acceptance

- `markitect-tool` exposes a stable source adapter protocol and canonical markdown normalization contract.
- The base install remains markdown-native and does not pull heavyweight format-conversion dependencies.
- External adapter packages can register and be discovered through the existing extension framework.
- CLI and API users can inspect available source adapters and normalize a source through a registered adapter.
- Tests prove the contract with a fake adapter and no network dependency.
- Documentation clearly assigns EPUB3 implementation to `markitect-filter`, not `markitect-tool` or `infospace-bench`.
- Writer/export adapter support is explicitly deferred beyond the v1 read adapter contract.
- Implementation behavior matches `docs/source-adapter-contract.md` and the fixtures in `examples/source-adapters/`.