| id | type | title | domain | status | owner | topic_slug | planning_priority | planning_order | depends_on_workplans | related_workplans | created | updated | state_hub_workstream_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MKTT-WP-0018 | workplan | Source Adapter Interface And Markdown Normalization Contract | markitect | done | markitect-tool | markitect | complete | 145 | | | 2026-05-14 | 2026-05-14 | c4e4511f-13ea-40b4-9083-6d9ab6d12dad |
MKTT-WP-0018: Source Adapter Interface And Markdown Normalization Contract
Purpose
Define the markitect-tool framework for source-format adapters and canonical
markdown normalization so a separate markitect-filter repository can provide
concrete adapters for EPUB3 first, and later PDF, DOCX, ODT, HTML, and other
formats.
This workplan deliberately does not make markitect-tool a document
conversion product. It keeps markitect-tool focused on the syntax layer:
- structured markdown contracts
- canonical normalized markdown representations
- adapter protocols and descriptors
- registry/discovery hooks
- deterministic validation and diagnostics
Concrete format extraction lives outside the core toolkit, initially in
markitect-filter.
Background
The Lefevre EPUB3 example in infospace-bench exposed a boundary problem.
infospace-bench can pragmatically read EPUB-like files, but that logic is not
application-layer work. The durable split should be:
source formats
-> markitect-filter concrete adapters
-> markitect-tool source adapter protocol and markdown normalization contract
-> infospace-bench generation workflows
-> optional kontextual-engine persistence, retrieval, governance
This lets infospace-bench consume normalized markdown sources without owning
EPUB/PDF/DOCX details, while kontextual-engine can later use the same
adapter outputs for managed ingestion.
Decision
Establish a markitect-tool source adapter framework and normalization
contract first. Then implement EPUB3 as the first concrete read adapter in a
new markitect-filter repo.
MKTT-WP-0019 must run first and pin the v1 contract details for this
implementation: field-level normalized model semantics, read-only protocol
shape, external package entry point discovery, CLI/API output envelopes, and
fake adapter contract-test expectations. Those decisions are captured in
docs/source-adapter-contract.md.
markitect-tool should define:
- adapter request/result types
- canonical markdown document and segment models
- provenance, metadata, quality, and diagnostic envelopes
- adapter capability descriptors
- registry and package discovery hooks
- CLI/API affordances for adapter inspection and conversion
- contract tests that an external adapter package can satisfy
markitect-filter should define:
- concrete source readers
- optional dependencies needed for each source format
- format-specific tests and fixtures
- EPUB3 spine/nav/body extraction as its first implementation
Write/export adapters remain future optional scope until a format-specific preservation contract exists.
Non-Goals
- Do not implement EPUB3 parsing in markitect-tool core.
- Do not add heavyweight PDF, DOCX, or ODT dependencies to the base install.
- Do not move infospace lifecycle or generation concerns into markitect-tool.
- Do not implement persistent ingestion, permissions, retrieval, or governance; those remain kontextual-engine responsibilities.
- Do not define domain-specific entity/relation workflows here.
- Do not implement source writer/export adapters in the v1 slice; keep the protocol read-only unless a later workplan deliberately opens write scope.
P18.1 - Architecture boundary and markitect-filter handoff
id: MKTT-WP-0018-T001
status: done
priority: high
state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1"
Implement the cross-repo architecture pinned by MKTT-WP-0019:
- markitect-tool owns adapter contracts and markdown normalization
- markitect-filter owns concrete source-format adapters
- infospace-bench consumes normalized markdown for concrete infospaces
- kontextual-engine can ingest adapter outputs into durable knowledge assets
Output: architecture note covering responsibilities, extension package shape,
the docs/source-adapter-contract.md entry point contract, dependency policy,
and migration path from the current infospace-bench EPUB spike.
Implemented: docs/source-adapter-contract.md defines the contract boundary,
external package shape, dependency policy, entry point group, and
markitect-filter EPUB3 handoff. docs/source-adapter-migration.md documents
the sibling-repo migration path.
P18.2 - Canonical source-to-markdown data model
id: MKTT-WP-0018-T002
status: done
priority: high
state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb"
Implement the normalized output model specified by
docs/source-adapter-contract.md:
- SourceAsset
- SourceMetadata
- NormalizedMarkdownDocument
- NormalizedMarkdownSegment
- SourceProvenance
- NormalizationDiagnostic
- NormalizationQuality
- optional SourceBinaryAttachment or asset reference envelope
The model should represent:
- original path/URI/media type
- title, author/creator, language, rights, source URL, publication metadata
- ordered markdown content
- segment IDs, headings, anchors, page/section references
- digest and cache keys
- extraction diagnostics
- lossiness/quality signals
- adapter name/version/options
Output: public data model, serialization tests using
examples/source-adapters/normalized-document.json, and normalization contract
documentation matching the field-level v1 specification.
Implemented: markitect_tool.source exposes SourceAsset, SourceMetadata,
NormalizedMarkdownDocument, NormalizedMarkdownSegment,
SourceProvenance, and NormalizationQuality, with stable dictionary
serialization, round-trip tests, digest/cache-key support, diagnostics, and
fixture coverage.
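As an illustration of "stable dictionary serialization, round-trip tests, digest/cache-key support", the following is a minimal sketch, not the actual markitect_tool.source implementation. The class names Segment and Document, all field names, and the digest() helper are hypothetical stand-ins for NormalizedMarkdownSegment and NormalizedMarkdownDocument; the authoritative field-level definitions are in docs/source-adapter-contract.md.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical stand-in for NormalizedMarkdownSegment."""

    segment_id: str
    heading: str
    markdown: str


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for NormalizedMarkdownDocument."""

    source_uri: str
    media_type: str
    title: str
    adapter_name: str
    adapter_version: str
    segments: tuple[Segment, ...] = ()

    def to_dict(self) -> dict:
        # Emit plain JSON-compatible types; segments become a list of dicts.
        data = asdict(self)
        data["segments"] = [asdict(segment) for segment in self.segments]
        return data

    @classmethod
    def from_dict(cls, data: dict) -> "Document":
        # Rebuild nested segments first, then pass the scalar fields through.
        segments = tuple(Segment(**item) for item in data["segments"])
        scalars = {key: value for key, value in data.items() if key != "segments"}
        return cls(segments=segments, **scalars)

    def digest(self) -> str:
        # Sorted keys and fixed separators make the digest independent of
        # dictionary ordering, so it can serve as a cache key.
        payload = json.dumps(self.to_dict(), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


doc = Document(
    source_uri="file:///tmp/example.epub",
    media_type="application/epub+zip",
    title="Example",
    adapter_name="fake",
    adapter_version="0.0.0",
    segments=(Segment("s1", "Chapter 1", "# Chapter 1\n\nText.\n"),),
)
restored = Document.from_dict(json.loads(json.dumps(doc.to_dict())))
```

In this sketch, restored compares equal to doc and both yield the same digest, which is the property a round-trip test would assert.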
P18.3 - Source adapter protocol and capability descriptors
id: MKTT-WP-0018-T003
status: done
priority: high
state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9"
Define the read adapter interface:
- source reader protocol: can_read, inspect, read
- media type and file extension matching
- adapter option schema
- capability descriptor shape
- safety and dependency flags
- deterministic diagnostics
Do not add writer protocols in this implementation slice. Preserve room for a
future writer protocol by keeping descriptors capability-based, but avoid
shipping can_write/write contracts before there is a format-specific
preservation model.
The first implementation slice can ship a fake in-tree adapter for tests only.
Concrete EPUB3 implementation belongs in markitect-filter.
Output: protocol module, descriptor integration, tests for matching, inspection, reading, diagnostics, and unsupported-format behavior.
Implemented: SourceReadAdapter, request/result types,
SourceAdapterDescriptor, deterministic selection, dependency diagnostics,
unsupported-format diagnostics, and read-only capability descriptors live in
markitect_tool.source.
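A read-only protocol with can_read, inspect, and read could look roughly like the sketch below. Only the three method names come from this workplan; ReadRequest, ReadResult, ReadAdapter, FakeTextAdapter, select_adapter, and every field and signature are assumptions made for illustration and may differ from SourceReadAdapter in markitect_tool.source.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class ReadRequest:
    """Hypothetical request type: a path plus an optional media type."""

    path: str
    media_type: Optional[str] = None


@dataclass(frozen=True)
class ReadResult:
    """Hypothetical result type: markdown plus diagnostics."""

    markdown: str
    diagnostics: tuple = ()


class ReadAdapter(Protocol):
    """Read-only shape; no can_write/write members by design."""

    name: str

    def can_read(self, request: ReadRequest) -> bool: ...
    def inspect(self, request: ReadRequest) -> dict: ...
    def read(self, request: ReadRequest) -> ReadResult: ...


class FakeTextAdapter:
    """Test-only adapter: matches by extension or media type, does no I/O."""

    name = "fake-text"
    extensions = (".txt",)
    media_types = ("text/plain",)

    def can_read(self, request: ReadRequest) -> bool:
        suffix = PurePosixPath(request.path).suffix.lower()
        return suffix in self.extensions or request.media_type in self.media_types

    def inspect(self, request: ReadRequest) -> dict:
        return {"adapter": self.name, "readable": self.can_read(request)}

    def read(self, request: ReadRequest) -> ReadResult:
        if not self.can_read(request):
            # Unsupported input yields a diagnostic instead of an exception.
            message = f"unsupported-format: {request.path}"
            return ReadResult(markdown="", diagnostics=(message,))
        # The fake derives a heading from the file name only.
        return ReadResult(markdown=f"# {PurePosixPath(request.path).stem}\n")


def select_adapter(adapters: Iterable[ReadAdapter], request: ReadRequest):
    """Pick the first matching adapter in name order, so selection is deterministic."""
    for adapter in sorted(adapters, key=lambda item: item.name):
        if adapter.can_read(request):
            return adapter
    return None
```

Sorting by name before matching is one simple way to keep selection deterministic when several adapters claim the same format; the real contract may use a different tie-break rule.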
P18.4 - Adapter registry and discovery hooks
id: MKTT-WP-0018-T004
status: done
priority: high
state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3"
Wire source adapters into the existing internal extension framework:
- register source adapter descriptors
- discover package-provided adapters through the entry point group pinned by docs/source-adapter-contract.md
- expose adapter capabilities via extension listing/inspection
- report missing optional dependency diagnostics
- ensure adapter packages can remain independently versioned
Output: registry implementation, package discovery tests, and compatibility
notes for markitect-filter.
Implemented: SourceAdapterRegistry, discover_source_adapters, and
default_source_adapter_registry discover descriptors through
markitect_tool.source_adapters, expose source adapter descriptors through the
extension catalog, and report missing optional dependencies deterministically.
P18.5 - Normalization CLI and public API surface
id: MKTT-WP-0018-T005
status: done
priority: medium
state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3"
Expose a small CLI/API surface:
- mkt source adapters
- mkt source inspect <path-or-uri>
- mkt source normalize <path-or-uri> --format markdown
- structured JSON output for inspection and diagnostics
- markdown output for normalized content
- public API exports for adapter discovery and normalization
Output: CLI commands, API exports, generated command/API docs updates, and tests.
Implemented: mkt source adapters, mkt source inspect, and
mkt source normalize expose JSON/YAML/text/Markdown behavior. Public API
exports were added to markitect_tool.__all__, and generated CLI/API docs were
refreshed.
P18.6 - Contract tests and fake adapter fixture
id: MKTT-WP-0018-T006
status: done
priority: high
state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3"
Add deterministic contract tests proving that an external read adapter can:
- register through the extension framework
- advertise read capabilities
- inspect a source without full conversion
- normalize a source into canonical markdown documents and segments
- emit provenance, metadata, diagnostics, and quality signals
- fail gracefully for unsupported or malformed sources
Output: fake adapter fixture, reusable contract-test helpers, and documentation
for markitect-filter adapter implementers.
Implemented: tests/test_source_adapter_contract.py provides fake adapter
coverage for model serialization, cache keys, registry selection, entry point
discovery, dependency diagnostics, CLI JSON/Markdown envelopes, and public API
exports. examples/source-adapters/ contains expected-output fixtures.
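A reusable contract-test helper for adapter implementers might take the shape sketched here. It is not the content of tests/test_source_adapter_contract.py: check_read_adapter_contract, FakeAdapter, and the dictionary-based result shape with segments and diagnostics keys are all hypothetical, chosen only to show the checks listed above in executable form.

```python
def check_read_adapter_contract(adapter, supported, unsupported):
    """Assert the minimum read-adapter behaviour assumed by this sketch."""
    # Capability matching must give a plain yes/no answer.
    assert adapter.can_read(supported) is True
    assert adapter.can_read(unsupported) is False

    # Inspection must at least identify the answering adapter.
    assert adapter.inspect(supported)["adapter"] == adapter.name

    # Reading the same source twice must give identical output.
    first = adapter.read(supported)
    assert first == adapter.read(supported)
    assert first["segments"] and not first["diagnostics"]

    # Unsupported sources yield diagnostics instead of raising.
    failed = adapter.read(unsupported)
    assert failed["segments"] == [] and failed["diagnostics"]
    return True


class FakeAdapter:
    """Test-only adapter that recognises a made-up .fake extension."""

    name = "fake"

    def can_read(self, path):
        return path.endswith(".fake")

    def inspect(self, path):
        return {"adapter": self.name, "path": path}

    def read(self, path):
        if not self.can_read(path):
            return {"segments": [], "diagnostics": [f"unsupported-format: {path}"]}
        segment = {"id": "s1", "markdown": "# Fake\n"}
        return {"segments": [segment], "diagnostics": []}


passed = check_read_adapter_contract(FakeAdapter(), "book.fake", "book.pdf")
```

An external package such as markitect-filter could call a helper like this from its own test suite with its concrete adapter and fixture paths, so the same expectations apply to every adapter.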
P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter
id: MKTT-WP-0018-T007
status: done
priority: medium
state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4"
Document how the new contract affects sibling repos:
- infospace-bench should replace its local EPUB/read normalization spike with calls to the source adapter API
- markitect-filter should implement EPUB3 first against this contract
- kontextual-engine should treat normalized source outputs as ingestible knowledge asset derivatives when it needs durable ingestion
Output: migration note and follow-up workplan seeds for markitect-filter and
infospace-bench.
Implemented: docs/source-adapter-migration.md documents the migration path
for markitect-filter, infospace-bench, and kontextual-engine, including
follow-up workplan seeds.
Acceptance
- markitect-tool exposes a stable source adapter protocol and canonical markdown normalization contract.
- The base install remains markdown-native and does not pull heavyweight format-conversion dependencies.
- External adapter packages can register and be discovered through the existing extension framework.
- CLI and API users can inspect available source adapters and normalize a source through a registered adapter.
- Tests prove the contract with a fake adapter and no network dependency.
- Documentation clearly assigns EPUB3 implementation to markitect-filter, not markitect-tool or infospace-bench.
- Writer/export adapter support is explicitly deferred beyond the v1 read adapter contract.
- Implementation behavior matches docs/source-adapter-contract.md and the fixtures in examples/source-adapters/.