markitect-tool/workplans/MKTT-WP-0018-source-adapter-contract.md

---
id: MKTT-WP-0018
type: workplan
title: "Source Adapter Interface And Markdown Normalization Contract"
domain: markitect
status: done
owner: markitect-tool
topic_slug: markitect
planning_priority: complete
planning_order: 145
depends_on_workplans:
  - MKTT-WP-0013
  - MKTT-WP-0017
  - MKTT-WP-0019
related_workplans:
  - MKTT-WP-0010
  - MKTT-WP-0011
  - MKTT-WP-0012
  - MKTT-WP-0019
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_id: "c4e4511f-13ea-40b4-9083-6d9ab6d12dad"
---

# MKTT-WP-0018: Source Adapter Interface And Markdown Normalization Contract

## Purpose

Define the `markitect-tool` framework for source-format adapters and canonical
markdown normalization so a separate `markitect-filter` repository can provide
concrete adapters for EPUB3 first, and later PDF, DOCX, ODT, HTML, and other
formats.

This workplan deliberately does **not** make `markitect-tool` a document
conversion product. It keeps `markitect-tool` focused on the syntax layer:

- structured markdown contracts
- canonical normalized markdown representations
- adapter protocols and descriptors
- registry/discovery hooks
- deterministic validation and diagnostics

Concrete format extraction lives outside the core toolkit, initially in
`markitect-filter`.

## Background

The Lefevre EPUB3 example in `infospace-bench` exposed a boundary problem.
`infospace-bench` can read EPUB-like files as a stopgap, but source-format
extraction is not application-layer work. The durable split should be:

```text
source formats
  -> markitect-filter concrete adapters
  -> markitect-tool source adapter protocol and markdown normalization contract
  -> infospace-bench generation workflows
  -> optional kontextual-engine persistence, retrieval, governance
```

This lets `infospace-bench` consume normalized markdown sources without owning
EPUB/PDF/DOCX details, while `kontextual-engine` can later use the same
adapter outputs for managed ingestion.

## Decision

Establish a `markitect-tool` source adapter framework and normalization
contract first. Then implement EPUB3 as the first concrete read adapter in a
new `markitect-filter` repo.

`MKTT-WP-0019` must run first and pin the v1 contract details for this
implementation: field-level normalized model semantics, read-only protocol
shape, external package entry point discovery, CLI/API output envelopes, and
fake adapter contract-test expectations. Those decisions are captured in
`docs/source-adapter-contract.md`.

`markitect-tool` should define:

- adapter request/result types
- canonical markdown document and segment models
- provenance, metadata, quality, and diagnostic envelopes
- adapter capability descriptors
- registry and package discovery hooks
- CLI/API affordances for adapter inspection and conversion
- contract tests that an external adapter package can satisfy

`markitect-filter` should define:

- concrete source readers
- optional dependencies needed for each source format
- format-specific tests and fixtures
- EPUB3 spine/nav/body extraction as its first implementation

Write/export adapters remain future optional scope until a format-specific
preservation contract exists.

## Non-Goals

- Do not implement EPUB3 parsing in `markitect-tool` core.
- Do not add heavyweight PDF, DOCX, or ODT dependencies to the base install.
- Do not move infospace lifecycle or generation concerns into `markitect-tool`.
- Do not implement persistent ingestion, permissions, retrieval, or governance;
  those remain `kontextual-engine` responsibilities.
- Do not define domain-specific entity/relation workflows here.
- Do not implement source writer/export adapters in the v1 slice; keep the
  protocol read-only unless a later workplan deliberately opens write scope.

## P18.1 - Architecture boundary and markitect-filter handoff

```task
id: MKTT-WP-0018-T001
status: done
priority: high
state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1"
```

Implement the cross-repo architecture pinned by `MKTT-WP-0019`:

- `markitect-tool` owns adapter contracts and markdown normalization
- `markitect-filter` owns concrete source-format adapters
- `infospace-bench` consumes normalized markdown for concrete infospaces
- `kontextual-engine` can ingest adapter outputs into durable knowledge assets

Output: architecture note covering responsibilities, extension package shape,
the `docs/source-adapter-contract.md` entry point contract, dependency policy,
and migration path from the current `infospace-bench` EPUB spike.

Implemented: `docs/source-adapter-contract.md` defines the contract boundary,
external package shape, dependency policy, entry point group, and
`markitect-filter` EPUB3 handoff. `docs/source-adapter-migration.md` documents
the sibling-repo migration path.

## P18.2 - Canonical source-to-markdown data model

```task
id: MKTT-WP-0018-T002
status: done
priority: high
state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb"
```

Implement the normalized output model specified by
`docs/source-adapter-contract.md`:

- `SourceAsset`
- `SourceMetadata`
- `NormalizedMarkdownDocument`
- `NormalizedMarkdownSegment`
- `SourceProvenance`
- `NormalizationDiagnostic`
- `NormalizationQuality`
- optional `SourceBinaryAttachment` or asset reference envelope

The model should represent:

- original path/URI/media type
- title, author/creator, language, rights, source URL, publication metadata
- ordered markdown content
- segment IDs, headings, anchors, page/section references
- digest and cache keys
- extraction diagnostics
- lossiness/quality signals
- adapter name/version/options

Output: public data model, serialization tests using
`examples/source-adapters/normalized-document.json`, and normalization contract
documentation matching the field-level v1 specification.

Implemented: `markitect_tool.source` exposes `SourceAsset`, `SourceMetadata`,
`NormalizedMarkdownDocument`, `NormalizedMarkdownSegment`,
`SourceProvenance`, and `NormalizationQuality`, with stable dictionary
serialization, round-trip tests, digest/cache-key support, diagnostics, and
fixture coverage.
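
As an illustration of the serialization and digest behavior described above,
here is a minimal sketch using plain dataclasses. The class names
(`SegmentSketch`, `DocumentSketch`), every field name, and the SHA-256 digest
scheme are assumptions made for this example; the authoritative field-level
model is the one in `docs/source-adapter-contract.md`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class SegmentSketch:
    """Hypothetical stand-in for NormalizedMarkdownSegment."""

    segment_id: str
    heading: str
    markdown: str


@dataclass(frozen=True)
class DocumentSketch:
    """Hypothetical stand-in for NormalizedMarkdownDocument."""

    source_uri: str
    media_type: str
    title: str
    adapter_name: str
    adapter_version: str
    segments: Tuple[SegmentSketch, ...] = ()

    def to_dict(self) -> dict:
        # Emit segments as a list so the result is JSON-friendly.
        data = asdict(self)
        data["segments"] = [asdict(segment) for segment in self.segments]
        return data

    @classmethod
    def from_dict(cls, data: dict) -> "DocumentSketch":
        segments = tuple(SegmentSketch(**item) for item in data["segments"])
        return cls(**{**data, "segments": segments})

    def digest(self) -> str:
        # Sorted keys and fixed separators keep the digest stable across runs.
        payload = json.dumps(self.to_dict(), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

A round trip through `to_dict` and `from_dict` should reproduce an equal
object with an identical digest, which is the property the serialization tests
are meant to protect.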

## P18.3 - Source adapter protocol and capability descriptors

```task
id: MKTT-WP-0018-T003
status: done
priority: high
state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9"
```

Define the read adapter interface:

- source reader protocol: `can_read`, `inspect`, `read`
- media type and file extension matching
- adapter option schema
- capability descriptor shape
- safety and dependency flags
- deterministic diagnostics

Do not add writer protocols in this implementation slice. Preserve room for a
future writer protocol by keeping descriptors capability-based, but avoid
shipping `can_write`/`write` contracts before there is a format-specific
preservation model.

The first implementation slice can ship a fake in-tree adapter for tests only.
Concrete EPUB3 implementation belongs in `markitect-filter`.

Output: protocol module, descriptor integration, tests for matching,
inspection, reading, diagnostics, and unsupported-format behavior.

Implemented: `SourceReadAdapter`, request/result types,
`SourceAdapterDescriptor`, deterministic selection, dependency diagnostics,
unsupported-format diagnostics, and read-only capability descriptors live in
`markitect_tool.source`.
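
The read-only protocol shape can be sketched with `typing.Protocol`. Only the
three method names (`can_read`, `inspect`, `read`) come from this workplan; the
request/result classes, their fields, the diagnostic string, and
`FakeTextReader` are hypothetical and may differ from the real
`SourceReadAdapter` signatures.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, runtime_checkable


@dataclass(frozen=True)
class ReadRequestSketch:
    """Hypothetical request type: a path plus an optional media type."""

    path: str
    media_type: str = ""


@dataclass
class ReadResultSketch:
    """Hypothetical result type: markdown plus diagnostics."""

    markdown: str = ""
    diagnostics: List[str] = field(default_factory=list)


@runtime_checkable
class ReaderSketch(Protocol):
    """Read-only protocol: no can_write/write methods by design."""

    def can_read(self, request: ReadRequestSketch) -> bool: ...
    def inspect(self, request: ReadRequestSketch) -> dict: ...
    def read(self, request: ReadRequestSketch) -> ReadResultSketch: ...


class FakeTextReader:
    """Test-only adapter that matches by media type or file extension."""

    extensions = (".txt",)
    media_types = ("text/plain",)

    def can_read(self, request: ReadRequestSketch) -> bool:
        return (
            request.media_type in self.media_types
            or request.path.lower().endswith(self.extensions)
        )

    def inspect(self, request: ReadRequestSketch) -> dict:
        # Inspection reports capability without performing a conversion.
        return {"adapter": "fake-text", "readable": self.can_read(request)}

    def read(self, request: ReadRequestSketch) -> ReadResultSketch:
        if not self.can_read(request):
            # Unsupported input yields a diagnostic rather than an exception.
            return ReadResultSketch(diagnostics=[f"unsupported-format: {request.path}"])
        return ReadResultSketch(markdown=f"# {request.path}\n")
```

Because the protocol is structural, an external package such as
`markitect-filter` would not need to inherit from a base class in order to
satisfy it.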

## P18.4 - Adapter registry and discovery hooks

```task
id: MKTT-WP-0018-T004
status: done
priority: high
state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3"
```

Wire source adapters into the existing internal extension framework:

- register source adapter descriptors
- discover package-provided adapters through the entry point group pinned by
  `docs/source-adapter-contract.md`
- expose adapter capabilities via extension listing/inspection
- report missing optional dependency diagnostics
- ensure adapter packages can remain independently versioned

Output: registry implementation, package discovery tests, and compatibility
notes for `markitect-filter`.

Implemented: `SourceAdapterRegistry`, `discover_source_adapters`, and
`default_source_adapter_registry` discover descriptors through
`markitect_tool.source_adapters`, expose source adapter descriptors through the
extension catalog, and report missing optional dependencies deterministically.
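
Entry point discovery of this kind is commonly built on the standard-library
`importlib.metadata` module. The sketch below assumes Python 3.10 or later
(for `entry_points(group=...)`); the function name `discover_sketch`, its
`candidates` parameter, and the diagnostic wording are hypothetical and are not
the actual `discover_source_adapters` implementation.

```python
from importlib.metadata import EntryPoint, entry_points
from typing import Dict, Iterable, List, Optional, Tuple

# Group name as referenced in this workplan.
GROUP = "markitect_tool.source_adapters"


def discover_sketch(
    candidates: Optional[Iterable[EntryPoint]] = None,
    group: str = GROUP,
) -> Tuple[Dict[str, object], List[str]]:
    """Load adapters from entry points, collecting diagnostics on failure.

    `candidates` lets tests inject entry points without installing packages.
    """
    eps = entry_points(group=group) if candidates is None else candidates
    found: Dict[str, object] = {}
    diagnostics: List[str] = []
    # Sorting by name keeps discovery order and diagnostics deterministic.
    for ep in sorted(eps, key=lambda item: item.name):
        try:
            found[ep.name] = ep.load()
        except ImportError as exc:
            # A missing optional dependency becomes a diagnostic, not a crash.
            diagnostics.append(f"{ep.name}: missing optional dependency ({exc})")
    return found, diagnostics
```

An adapter package would typically declare its entry point under this group in
its own packaging metadata, which keeps it independently versioned from
`markitect-tool`.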

## P18.5 - Normalization CLI and public API surface

```task
id: MKTT-WP-0018-T005
status: done
priority: medium
state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3"
```

Expose a small CLI/API surface:

- `mkt source adapters`
- `mkt source inspect <path-or-uri>`
- `mkt source normalize <path-or-uri> --format markdown`
- structured JSON output for inspection and diagnostics
- markdown output for normalized content
- public API exports for adapter discovery and normalization

Output: CLI commands, API exports, generated command/API docs updates, and
tests.

Implemented: `mkt source adapters`, `mkt source inspect`, and
`mkt source normalize` support JSON, YAML, text, and Markdown output. Public
API exports were added to `markitect_tool.__all__`, and generated CLI/API docs
were refreshed.

## P18.6 - Contract tests and fake adapter fixture

```task
id: MKTT-WP-0018-T006
status: done
priority: high
state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3"
```

Add deterministic contract tests proving that an external read adapter can:

- register through the extension framework
- advertise read capabilities
- inspect a source without full conversion
- normalize a source into canonical markdown documents and segments
- emit provenance, metadata, diagnostics, and quality signals
- fail gracefully for unsupported or malformed sources

Output: fake adapter fixture, reusable contract-test helpers, and documentation
for `markitect-filter` adapter implementers.

Implemented: `tests/test_source_adapter_contract.py` provides fake adapter
coverage for model serialization, cache keys, registry selection, entry point
discovery, dependency diagnostics, CLI JSON/Markdown envelopes, and public API
exports. `examples/source-adapters/` contains expected-output fixtures.
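
A reusable contract-test helper might look like the sketch below. It is not
the helper shipped in `tests/test_source_adapter_contract.py`: the function
name, the dictionary-shaped results, and the specific checks are assumptions
chosen to illustrate the expectations listed above (capability matching,
deterministic output, graceful failure).

```python
from typing import List


def check_reader_contract(adapter, supported: str, unsupported: str) -> List[str]:
    """Return contract violations; an empty list means the adapter passed."""
    problems: List[str] = []
    for name in ("can_read", "inspect", "read"):
        if not callable(getattr(adapter, name, None)):
            problems.append(f"missing method: {name}")
    if problems:
        return problems

    if not adapter.can_read(supported):
        problems.append("can_read rejected a supported source")
    if adapter.can_read(unsupported):
        problems.append("can_read accepted an unsupported source")

    # Reading the same source twice must give identical results.
    if adapter.read(supported) != adapter.read(supported):
        problems.append("read is not deterministic")

    # Unsupported sources must produce diagnostics instead of raising.
    try:
        result = adapter.read(unsupported)
    except Exception as exc:
        problems.append(f"read raised for unsupported source: {exc!r}")
    else:
        if not result.get("diagnostics"):
            problems.append("read returned no diagnostics for unsupported source")
    return problems


class FakeAdapter:
    """Test-only adapter that accepts paths ending in '.fake'."""

    def can_read(self, path: str) -> bool:
        return path.endswith(".fake")

    def inspect(self, path: str) -> dict:
        return {"path": path, "readable": self.can_read(path)}

    def read(self, path: str) -> dict:
        if not self.can_read(path):
            return {"markdown": "", "diagnostics": ["unsupported-format"]}
        return {"markdown": "# Fake\n", "diagnostics": []}
```

Returning a list of violations rather than asserting inside the helper lets an
adapter package report every failed expectation in a single test run.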

## P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter

```task
id: MKTT-WP-0018-T007
status: done
priority: medium
state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4"
```

Document how the new contract affects sibling repos:

- `infospace-bench` should replace its local EPUB read/normalization spike with
  calls to the source adapter API
- `markitect-filter` should implement EPUB3 first against this contract
- `kontextual-engine` should treat normalized source outputs as ingestible
  knowledge asset derivatives when it needs durable ingestion

Output: migration note and follow-up workplan seeds for `markitect-filter` and
`infospace-bench`.

Implemented: `docs/source-adapter-migration.md` documents the migration path
for `markitect-filter`, `infospace-bench`, and `kontextual-engine`, including
follow-up workplan seeds.

## Acceptance

- `markitect-tool` exposes a stable source adapter protocol and canonical
  markdown normalization contract.
- The base install remains markdown-native and does not pull in heavyweight
  format-conversion dependencies.
- External adapter packages can register and be discovered through the existing
  extension framework.
- CLI and API users can inspect available source adapters and normalize a source
  through a registered adapter.
- Tests prove the contract with a fake adapter and no network dependency.
- Documentation clearly assigns EPUB3 implementation to `markitect-filter`, not
  `markitect-tool` or `infospace-bench`.
- Writer/export adapter support is explicitly deferred beyond the v1 read
  adapter contract.
- Implementation behavior matches `docs/source-adapter-contract.md` and the
  fixtures in `examples/source-adapters/`.