markitect-tool/workplans/MKTT-WP-0018-source-adapter-contract.md
2026-05-14 22:05:34 +02:00

id: MKTT-WP-0018
type: workplan
title: Source Adapter Interface And Markdown Normalization Contract
domain: markitect
status: done
owner: markitect-tool
topic_slug: markitect
planning_priority: complete
planning_order: 145
depends_on_workplans: MKTT-WP-0013, MKTT-WP-0017, MKTT-WP-0019
related_workplans: MKTT-WP-0010, MKTT-WP-0011, MKTT-WP-0012, MKTT-WP-0019
created: 2026-05-14
updated: 2026-05-14
state_hub_workstream_id: c4e4511f-13ea-40b4-9083-6d9ab6d12dad

MKTT-WP-0018: Source Adapter Interface And Markdown Normalization Contract

Purpose

Define the markitect-tool framework for source-format adapters and canonical markdown normalization so a separate markitect-filter repository can provide concrete adapters for EPUB3 first, and later PDF, DOCX, ODT, HTML, and other formats.

This workplan deliberately does not make markitect-tool a document conversion product. It keeps markitect-tool focused on the syntax layer:

  • structured markdown contracts
  • canonical normalized markdown representations
  • adapter protocols and descriptors
  • registry/discovery hooks
  • deterministic validation and diagnostics

Concrete format extraction lives outside the core toolkit, initially in markitect-filter.

Background

The Lefevre EPUB3 example in infospace-bench exposed a boundary problem. infospace-bench can read EPUB-like files as a pragmatic stopgap, but that logic does not belong in the application layer. The durable split should be:

source formats
  -> markitect-filter concrete adapters
  -> markitect-tool source adapter protocol and markdown normalization contract
  -> infospace-bench generation workflows
  -> optional kontextual-engine persistence, retrieval, governance

This lets infospace-bench consume normalized markdown sources without owning EPUB/PDF/DOCX details, while kontextual-engine can later use the same adapter outputs for managed ingestion.

Decision

Establish a markitect-tool source adapter framework and normalization contract first. Then implement EPUB3 as the first concrete read adapter in a new markitect-filter repo.

MKTT-WP-0019 must run first and pin the v1 contract details for this implementation: field-level normalized model semantics, read-only protocol shape, external package entry point discovery, CLI/API output envelopes, and fake adapter contract-test expectations. Those decisions are captured in docs/source-adapter-contract.md.

markitect-tool should define:

  • adapter request/result types
  • canonical markdown document and segment models
  • provenance, metadata, quality, and diagnostic envelopes
  • adapter capability descriptors
  • registry and package discovery hooks
  • CLI/API affordances for adapter inspection and conversion
  • contract tests that an external adapter package can satisfy

markitect-filter should define:

  • concrete source readers
  • optional dependencies needed for each source format
  • format-specific tests and fixtures
  • EPUB3 spine/nav/body extraction as its first implementation

Write/export adapters remain future optional scope until a format-specific preservation contract exists.

Non-Goals

  • Do not implement EPUB3 parsing in markitect-tool core.
  • Do not add heavyweight PDF, DOCX, or ODT dependencies to the base install.
  • Do not move infospace lifecycle or generation concerns into markitect-tool.
  • Do not implement persistent ingestion, permissions, retrieval, or governance; those remain kontextual-engine responsibilities.
  • Do not define domain-specific entity/relation workflows here.
  • Do not implement source writer/export adapters in the v1 slice; keep the protocol read-only unless a later workplan deliberately opens write scope.

P18.1 - Architecture boundary and markitect-filter handoff

id: MKTT-WP-0018-T001
status: done
priority: high
state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1"

Implement the cross-repo architecture pinned by MKTT-WP-0019:

  • markitect-tool owns adapter contracts and markdown normalization
  • markitect-filter owns concrete source-format adapters
  • infospace-bench consumes normalized markdown for concrete infospaces
  • kontextual-engine can ingest adapter outputs into durable knowledge assets

Output: architecture note covering responsibilities, extension package shape, the docs/source-adapter-contract.md entry point contract, dependency policy, and migration path from the current infospace-bench EPUB spike.

Implemented: docs/source-adapter-contract.md defines the contract boundary, external package shape, dependency policy, entry point group, and markitect-filter EPUB3 handoff. docs/source-adapter-migration.md documents the sibling-repo migration path.

P18.2 - Canonical source-to-markdown data model

id: MKTT-WP-0018-T002
status: done
priority: high
state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb"

Implement the normalized output model specified by docs/source-adapter-contract.md:

  • SourceAsset
  • SourceMetadata
  • NormalizedMarkdownDocument
  • NormalizedMarkdownSegment
  • SourceProvenance
  • NormalizationDiagnostic
  • NormalizationQuality
  • optional SourceBinaryAttachment or asset reference envelope

The model should represent:

  • original path/URI/media type
  • title, author/creator, language, rights, source URL, publication metadata
  • ordered markdown content
  • segment IDs, headings, anchors, page/section references
  • digest and cache keys
  • extraction diagnostics
  • lossiness/quality signals
  • adapter name/version/options

Output: public data model, serialization tests using examples/source-adapters/normalized-document.json, and normalization contract documentation matching the field-level v1 specification.

Implemented: markitect_tool.source exposes SourceAsset, SourceMetadata, NormalizedMarkdownDocument, NormalizedMarkdownSegment, SourceProvenance, and NormalizationQuality, with stable dictionary serialization, round-trip tests, digest/cache-key support, diagnostics, and fixture coverage.
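
As an illustration of "stable dictionary serialization, round-trip tests, digest/cache-key support", the following is a minimal sketch only. The class names `Segment` and `Document`, every field name, the `digest_bytes` helper, and the cache-key recipe are assumptions made for this example; the authoritative field-level model is the one in docs/source-adapter-contract.md and markitect_tool.source.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for NormalizedMarkdownSegment."""
    segment_id: str
    heading: str
    markdown: str


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for NormalizedMarkdownDocument."""
    source_uri: str
    media_type: str
    source_digest: str
    adapter_name: str
    adapter_version: str
    adapter_options: dict[str, Any] = field(default_factory=dict)
    segments: tuple[Segment, ...] = ()

    def to_dict(self) -> dict[str, Any]:
        # Plain dicts and lists only, so the result is JSON-serializable.
        data = asdict(self)
        data["segments"] = [asdict(s) for s in self.segments]
        return data

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Document":
        segments = tuple(Segment(**s) for s in data["segments"])
        return cls(**{**data, "segments": segments})

    def cache_key(self) -> str:
        # Assumed recipe: source digest plus adapter identity and options,
        # serialized with sorted keys so the key is stable across runs.
        payload = json.dumps(
            {
                "digest": self.source_digest,
                "adapter": self.adapter_name,
                "version": self.adapter_version,
                "options": self.adapter_options,
            },
            sort_keys=True,
            separators=(",", ":"),
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def digest_bytes(raw: bytes) -> str:
    """Digest of the original source bytes (assumed 'sha256:' prefix)."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


doc = Document(
    source_uri="file:///books/example.epub",
    media_type="application/epub+zip",
    source_digest=digest_bytes(b"example bytes"),
    adapter_name="fake",
    adapter_version="0.0.0",
    segments=(Segment("seg-0001", "Chapter 1", "# Chapter 1\n\nText."),),
)
# Round trip through JSON text and back into the model.
restored = Document.from_dict(json.loads(json.dumps(doc.to_dict(), sort_keys=True)))
```

In this sketch the cache key changes whenever the source digest, adapter version, or adapter options change, which is the property a consumer such as infospace-bench would likely rely on to decide whether a cached normalization is still valid.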

P18.3 - Source adapter protocol and capability descriptors

id: MKTT-WP-0018-T003
status: done
priority: high
state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9"

Define the read adapter interface:

  • source reader protocol: can_read, inspect, read
  • media type and file extension matching
  • adapter option schema
  • capability descriptor shape
  • safety and dependency flags
  • deterministic diagnostics

Do not add writer protocols in this implementation slice. Preserve room for a future writer protocol by keeping descriptors capability-based, but avoid shipping can_write/write contracts before there is a format-specific preservation model.

The first implementation slice can ship a fake in-tree adapter for tests only. Concrete EPUB3 implementation belongs in markitect-filter.

Output: protocol module, descriptor integration, tests for matching, inspection, reading, diagnostics, and unsupported-format behavior.

Implemented: SourceReadAdapter, request/result types, SourceAdapterDescriptor, deterministic selection, dependency diagnostics, unsupported-format diagnostics, and read-only capability descriptors live in markitect_tool.source.
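
The read-only protocol shape (can_read, inspect, read) and deterministic selection can be sketched as below. This is an assumption-laden illustration, not the shipped API: `ReadRequest`, `ReadResult`, `ReadAdapter`, `FakeTextAdapter`, `select_adapter`, and the diagnostic wording are hypothetical names chosen for the example, and the real signatures in markitect_tool.source may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class ReadRequest:
    path: str
    media_type: Optional[str] = None


@dataclass(frozen=True)
class ReadResult:
    markdown: str
    diagnostics: tuple[str, ...] = ()


class ReadAdapter(Protocol):
    """Hypothetical read-only protocol; note there is no can_write/write."""

    name: str

    def can_read(self, request: ReadRequest) -> bool: ...
    def inspect(self, request: ReadRequest) -> dict[str, str]: ...
    def read(self, request: ReadRequest) -> ReadResult: ...


class FakeTextAdapter:
    """Test-only adapter that matches by media type or file extension."""

    name = "fake-text"
    media_types = ("text/plain",)
    extensions = (".txt",)

    def can_read(self, request: ReadRequest) -> bool:
        return (
            request.media_type in self.media_types
            or request.path.lower().endswith(self.extensions)
        )

    def inspect(self, request: ReadRequest) -> dict[str, str]:
        # Cheap metadata only; no conversion happens here.
        return {"adapter": self.name, "path": request.path}

    def read(self, request: ReadRequest) -> ReadResult:
        # A real adapter would parse the source; the fake emits a fixed shape.
        return ReadResult(markdown=f"# {request.path}\n")


def select_adapter(
    adapters: Iterable[ReadAdapter], request: ReadRequest
) -> tuple[Optional[ReadAdapter], tuple[str, ...]]:
    """Pick the first matching adapter in name order, or return a diagnostic."""
    for adapter in sorted(adapters, key=lambda a: a.name):
        if adapter.can_read(request):
            return adapter, ()
    return None, (f"unsupported-format: no adapter can read {request.path!r}",)


chosen, _ = select_adapter([FakeTextAdapter()], ReadRequest("notes.txt"))
missing, diagnostics = select_adapter([FakeTextAdapter()], ReadRequest("book.epub"))
```

Sorting by adapter name before matching is one simple way to make selection independent of registration order; unsupported input yields a diagnostic rather than an exception.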

P18.4 - Adapter registry and discovery hooks

id: MKTT-WP-0018-T004
status: done
priority: high
state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3"

Wire source adapters into the existing internal extension framework:

  • register source adapter descriptors
  • discover package-provided adapters through the entry point group pinned by docs/source-adapter-contract.md
  • expose adapter capabilities via extension listing/inspection
  • report missing optional dependency diagnostics
  • ensure adapter packages can remain independently versioned

Output: registry implementation, package discovery tests, and compatibility notes for markitect-filter.

Implemented: SourceAdapterRegistry, discover_source_adapters, and default_source_adapter_registry discover descriptors through markitect_tool.source_adapters, expose source adapter descriptors through the extension catalog, and report missing optional dependencies deterministically.
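
Package discovery with deterministic missing-dependency diagnostics might look roughly like the sketch below, built on the standard library's importlib.metadata (Python 3.10+ for the `group=` keyword). It assumes that markitect_tool.source_adapters is the entry point group name; the `discover` function and the diagnostic format are hypothetical and do not reproduce discover_source_adapters.

```python
from __future__ import annotations

from importlib.metadata import EntryPoint, entry_points
from typing import Any, Iterable, Optional

# Assumed group name, taken from the text above.
GROUP = "markitect_tool.source_adapters"


def discover(
    eps: Optional[Iterable[EntryPoint]] = None,
) -> tuple[dict[str, Any], list[str]]:
    """Load adapter objects from entry points, collecting import failures."""
    if eps is None:
        eps = entry_points(group=GROUP)  # installed packages only
    adapters: dict[str, Any] = {}
    diagnostics: list[str] = []
    # Name order keeps both results and diagnostics deterministic.
    for ep in sorted(eps, key=lambda e: e.name):
        try:
            adapters[ep.name] = ep.load()
        except ImportError as exc:
            # An optional dependency (or the adapter module) is not installed.
            diagnostics.append(f"missing-dependency: {ep.name}: {exc}")
    return adapters, diagnostics


# Hand-built entry points stand in for an installed adapter package.
fake_eps = [
    EntryPoint("zeta", "no_such_module_for_sketch:Adapter", GROUP),
    EntryPoint("alpha", "json:loads", GROUP),
]
found, problems = discover(fake_eps)
```

Because each adapter is loaded lazily and failures are recorded instead of raised, an adapter package with an uninstalled optional dependency would show up as a diagnostic while the remaining adapters stay usable.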

P18.5 - Normalization CLI and public API surface

id: MKTT-WP-0018-T005
status: done
priority: medium
state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3"

Expose a small CLI/API surface:

  • mkt source adapters
  • mkt source inspect <path-or-uri>
  • mkt source normalize <path-or-uri> --format markdown
  • structured JSON output for inspection and diagnostics
  • markdown output for normalized content
  • public API exports for adapter discovery and normalization

Output: CLI commands, API exports, generated command/API docs updates, and tests.

Implemented: mkt source adapters, mkt source inspect, and mkt source normalize support JSON, YAML, text, and Markdown output. Public API exports were added to markitect_tool.__all__, and generated CLI/API docs were refreshed.
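
The command layout listed above can be pictured with a small argparse sketch. Only the command names and the `--format markdown` flag come from this workplan; the use of argparse, the `build_parser` function, the `source` argument name, and the set of `--format` choices are assumptions, and the real mkt CLI may be built differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the 'mkt source ...' commands."""
    parser = argparse.ArgumentParser(prog="mkt")
    top = parser.add_subparsers(dest="command", required=True)

    source = top.add_parser("source", help="source adapter commands")
    actions = source.add_subparsers(dest="action", required=True)

    # mkt source adapters
    actions.add_parser("adapters", help="list registered source adapters")

    # mkt source inspect <path-or-uri>
    inspect_cmd = actions.add_parser("inspect", help="inspect without converting")
    inspect_cmd.add_argument("source", metavar="path-or-uri")

    # mkt source normalize <path-or-uri> --format markdown
    normalize = actions.add_parser("normalize", help="normalize to markdown")
    normalize.add_argument("source", metavar="path-or-uri")
    normalize.add_argument(
        "--format",
        choices=("markdown", "json"),  # assumed set of choices
        default="markdown",
    )
    return parser


args = build_parser().parse_args(
    ["source", "normalize", "book.epub", "--format", "markdown"]
)
```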

P18.6 - Contract tests and fake adapter fixture

id: MKTT-WP-0018-T006
status: done
priority: high
state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3"

Add deterministic contract tests proving that an external read adapter can:

  • register through the extension framework
  • advertise read capabilities
  • inspect a source without full conversion
  • normalize a source into canonical markdown documents and segments
  • emit provenance, metadata, diagnostics, and quality signals
  • fail gracefully for unsupported or malformed sources

Output: fake adapter fixture, reusable contract-test helpers, and documentation for markitect-filter adapter implementers.

Implemented: tests/test_source_adapter_contract.py provides fake adapter coverage for model serialization, cache keys, registry selection, entry point discovery, dependency diagnostics, CLI JSON/Markdown envelopes, and public API exports. examples/source-adapters/ contains expected-output fixtures.
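
A reusable contract-test helper of the kind an adapter implementer could run against their own reader might be sketched as follows. `Result`, `FakeAdapter`, `check_read_adapter_contract`, and the failure messages are hypothetical and simplified; they are not the helpers in tests/test_source_adapter_contract.py.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    ok: bool
    markdown: str = ""
    diagnostics: tuple[str, ...] = ()


class FakeAdapter:
    """Test-only adapter: treats '.fake' files as UTF-8 markdown."""

    name = "fake"

    def can_read(self, path: str) -> bool:
        return path.endswith(".fake")

    def inspect(self, path: str) -> dict[str, str]:
        return {"adapter": self.name, "path": path}

    def read(self, path: str, data: bytes) -> Result:
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            # Malformed input becomes a diagnostic, not an exception.
            return Result(ok=False, diagnostics=("malformed-source: not valid UTF-8",))
        return Result(ok=True, markdown=text.strip() + "\n")


def check_read_adapter_contract(
    adapter, good_path: str, good_data: bytes, bad_data: bytes
) -> list[str]:
    """Return a list of contract violations; an empty list means compliant."""
    failures: list[str] = []
    if not adapter.can_read(good_path):
        failures.append("can_read rejected a supported path")
    if adapter.can_read(good_path + ".unsupported"):
        failures.append("can_read accepted an unsupported path")
    if "adapter" not in adapter.inspect(good_path):
        failures.append("inspect did not report the adapter name")
    first = adapter.read(good_path, good_data)
    if first != adapter.read(good_path, good_data):
        failures.append("read is not deterministic")
    if not first.ok or not first.markdown:
        failures.append("read produced no markdown for valid input")
    bad = adapter.read(good_path, bad_data)
    if bad.ok or not bad.diagnostics:
        failures.append("malformed input did not fail gracefully")
    return failures


violations = check_read_adapter_contract(
    FakeAdapter(), "sample.fake", b"# Title", b"\xff\xfe"
)
```

Returning a list of violations instead of asserting inside the helper lets the same check be reused from any test runner, with no network or filesystem dependency.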

P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter

id: MKTT-WP-0018-T007
status: done
priority: medium
state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4"

Document how the new contract affects sibling repos:

  • infospace-bench should replace its local EPUB read and normalization spike with calls to the source adapter API
  • markitect-filter should implement EPUB3 first against this contract
  • kontextual-engine should treat normalized source outputs as ingestible knowledge asset derivatives when it needs durable ingestion

Output: migration note and follow-up workplan seeds for markitect-filter and infospace-bench.

Implemented: docs/source-adapter-migration.md documents the migration path for markitect-filter, infospace-bench, and kontextual-engine, including follow-up workplan seeds.

Acceptance

  • markitect-tool exposes a stable source adapter protocol and canonical markdown normalization contract.
  • The base install remains markdown-native and does not pull heavyweight format-conversion dependencies.
  • External adapter packages can register and be discovered through the existing extension framework.
  • CLI and API users can inspect available source adapters and normalize a source through a registered adapter.
  • Tests prove the contract with a fake adapter and no network dependency.
  • Documentation clearly assigns EPUB3 implementation to markitect-filter, not markitect-tool or infospace-bench.
  • Writer/export adapter support is explicitly deferred beyond the v1 read adapter contract.
  • Implementation behavior matches docs/source-adapter-contract.md and the fixtures in examples/source-adapters/.