Workplans to establish markitect-filter integration
workplans/MKTT-WP-0018-source-adapter-contract.md
@@ -0,0 +1,286 @@
---
id: MKTT-WP-0018
type: workplan
title: "Source Adapter Interface And Markdown Normalization Contract"
domain: markitect
status: todo
owner: markitect-tool
topic_slug: markitect
planning_priority: P1
planning_order: 145
depends_on_workplans:
  - MKTT-WP-0013
  - MKTT-WP-0017
  - MKTT-WP-0019
related_workplans:
  - MKTT-WP-0010
  - MKTT-WP-0011
  - MKTT-WP-0012
  - MKTT-WP-0019
created: "2026-05-14"
updated: "2026-05-14"
state_hub_workstream_id: "c4e4511f-13ea-40b4-9083-6d9ab6d12dad"
---

# MKTT-WP-0018: Source Adapter Interface And Markdown Normalization Contract

## Purpose

Define the `markitect-tool` framework for source-format adapters and canonical
markdown normalization so a separate `markitect-filter` repository can provide
concrete adapters for EPUB3 first, and later PDF, DOCX, ODT, HTML, and other
formats.

This workplan deliberately does **not** make `markitect-tool` a document
conversion product. It keeps `markitect-tool` focused on the syntax layer:

- structured markdown contracts
- canonical normalized markdown representations
- adapter protocols and descriptors
- registry/discovery hooks
- deterministic validation and diagnostics

Concrete format extraction lives outside the core toolkit, initially in
`markitect-filter`.

## Background

The Lefevre EPUB3 example in `infospace-bench` exposed a boundary problem.
`infospace-bench` can pragmatically read EPUB-like files, but that logic is not
application-layer work. The durable split should be:

```text
source formats
  -> markitect-filter concrete adapters
    -> markitect-tool source adapter protocol and markdown normalization contract
      -> infospace-bench generation workflows
        -> optional kontextual-engine persistence, retrieval, governance
```

This lets `infospace-bench` consume normalized markdown sources without owning
EPUB/PDF/DOCX details, while `kontextual-engine` can later use the same
adapter outputs for managed ingestion.

## Decision

Establish a `markitect-tool` source adapter framework and normalization
contract first. Then implement EPUB3 as the first concrete read adapter in a
new `markitect-filter` repo.

`MKTT-WP-0019` must run first and pin the v1 contract details for this
implementation: field-level normalized model semantics, read-only protocol
shape, external package entry point discovery, CLI/API output envelopes, and
fake adapter contract-test expectations.

`markitect-tool` should define:

- adapter request/result types
- canonical markdown document and segment models
- provenance, metadata, quality, and diagnostic envelopes
- adapter capability descriptors
- registry and package discovery hooks
- CLI/API affordances for adapter inspection and conversion
- contract tests that an external adapter package can satisfy

`markitect-filter` should define:

- concrete source readers
- optional dependencies needed for each source format
- format-specific tests and fixtures
- EPUB3 spine/nav/body extraction as its first implementation

Write/export adapters remain future optional scope until a format-specific
preservation contract exists.

## Non-Goals

- Do not implement EPUB3 parsing in `markitect-tool` core.
- Do not add heavyweight PDF, DOCX, or ODT dependencies to the base install.
- Do not move infospace lifecycle or generation concerns into `markitect-tool`.
- Do not implement persistent ingestion, permissions, retrieval, or governance;
  those remain `kontextual-engine` responsibilities.
- Do not define domain-specific entity/relation workflows here.
- Do not implement source writer/export adapters in the v1 slice; keep the
  protocol read-only unless a later workplan deliberately opens write scope.

## P18.1 - Architecture boundary and markitect-filter handoff

```task
id: MKTT-WP-0018-T001
status: todo
priority: high
state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1"
```

Implement the cross-repo architecture pinned by `MKTT-WP-0019`:

- `markitect-tool` owns adapter contracts and markdown normalization
- `markitect-filter` owns concrete source-format adapters
- `infospace-bench` consumes normalized markdown for concrete infospaces
- `kontextual-engine` can ingest adapter outputs into durable knowledge assets

Output: architecture note covering responsibilities, extension package shape,
the pinned entry point contract, dependency policy, and migration path from the
current `infospace-bench` EPUB spike.

## P18.2 - Canonical source-to-markdown data model

```task
id: MKTT-WP-0018-T002
status: todo
priority: high
state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb"
```

Implement the normalized output model specified by `MKTT-WP-0019`:

- `SourceAsset`
- `SourceMetadata`
- `NormalizedMarkdownDocument`
- `NormalizedMarkdownSegment`
- `SourceProvenance`
- `NormalizationDiagnostic`
- `NormalizationQuality`
- optional `SourceBinaryAttachment` or asset reference envelope

The model should represent:

- original path/URI/media type
- title, author/creator, language, rights, source URL, publication metadata
- ordered markdown content
- segment IDs, headings, anchors, page/section references
- digest and cache keys
- extraction diagnostics
- lossiness/quality signals
- adapter name/version/options

Output: public data model, serialization tests, and normalization contract
documentation matching the field-level v1 specification.
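
To make the shape of this model easier to discuss, here is a minimal Python sketch. The class names come from the list above. Every field name, the `sha256:` digest prefix, and the helpers `content_digest`, `to_json`, and `to_markdown` are assumptions for illustration only; the authoritative field-level semantics are whatever `MKTT-WP-0019` pins.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SourceAsset:
    uri: str          # original path or URI
    media_type: str
    digest: str       # assumed form: "sha256:<hex>" of the original bytes


@dataclass(frozen=True)
class SourceMetadata:
    title: Optional[str] = None
    creators: tuple = ()
    language: Optional[str] = None
    rights: Optional[str] = None


@dataclass(frozen=True)
class SourceProvenance:
    adapter_name: str
    adapter_version: str
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class NormalizationDiagnostic:
    code: str
    severity: str     # assumed vocabulary: "info", "warning", "error"
    message: str
    segment_id: Optional[str] = None


@dataclass(frozen=True)
class NormalizationQuality:
    lossy: bool = False
    notes: tuple = ()


@dataclass(frozen=True)
class NormalizedMarkdownSegment:
    segment_id: str
    order: int
    markdown: str
    heading: Optional[str] = None
    anchor: Optional[str] = None
    source_ref: Optional[str] = None   # page or section reference


@dataclass(frozen=True)
class NormalizedMarkdownDocument:
    asset: SourceAsset
    metadata: SourceMetadata
    provenance: SourceProvenance
    segments: tuple
    diagnostics: tuple = ()
    quality: NormalizationQuality = field(default_factory=NormalizationQuality)


def content_digest(data: bytes) -> str:
    """Digest of the raw source bytes, usable as part of a cache key."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def to_json(doc: NormalizedMarkdownDocument) -> str:
    """Serialize with sorted keys so repeated runs produce identical text."""
    return json.dumps(asdict(doc), sort_keys=True, ensure_ascii=False)


def to_markdown(doc: NormalizedMarkdownDocument) -> str:
    """Join segments in their declared order into one markdown string."""
    ordered = sorted(doc.segments, key=lambda seg: seg.order)
    return "\n\n".join(seg.markdown.strip() for seg in ordered) + "\n"
```

Frozen dataclasses and sorted-key JSON are one way to keep serialization tests deterministic; the real model may prefer a different representation.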

## P18.3 - Source adapter protocol and capability descriptors

```task
id: MKTT-WP-0018-T003
status: todo
priority: high
state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9"
```

Define the read adapter interface:

- source reader protocol: `can_read`, `inspect`, `read`
- media type and file extension matching
- adapter option schema
- capability descriptor shape
- safety and dependency flags
- deterministic diagnostics

Do not add writer protocols in this implementation slice. Preserve room for a
future writer protocol by keeping descriptors capability-based, but avoid
shipping `can_write`/`write` contracts before there is a format-specific
preservation model.

The first implementation slice can ship a fake in-tree adapter for tests only.
Concrete EPUB3 implementation belongs in `markitect-filter`.

Output: protocol module, descriptor integration, tests for matching,
inspection, reading, diagnostics, and unsupported-format behavior.
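
The read-only interface above could be sketched roughly as follows. Only the method names `can_read`, `inspect`, and `read` come from this workplan. `AdapterDescriptor`, `SourceReader`, `FakeTextReader`, their fields, the dictionary result shape, and the choice to report unsupported sources as an error diagnostic rather than an exception are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Any, Mapping, Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class AdapterDescriptor:
    """Capability descriptor; every field name here is an assumption."""
    name: str
    version: str
    media_types: tuple
    extensions: tuple
    capabilities: tuple = ("read",)     # no "write" capability in the v1 slice
    requires_network: bool = False      # example safety flag
    optional_dependencies: tuple = ()   # example dependency flag

    def matches(self, source: str, media_type: Optional[str] = None) -> bool:
        # An explicit media type decides; otherwise fall back to the extension.
        if media_type is not None:
            return media_type in self.media_types
        return PurePosixPath(source).suffix.lower() in self.extensions


@runtime_checkable
class SourceReader(Protocol):
    descriptor: AdapterDescriptor

    def can_read(self, source: str, media_type: Optional[str] = None) -> bool: ...
    def inspect(self, source: str) -> Mapping[str, Any]: ...
    def read(self, source: str,
             options: Optional[Mapping[str, Any]] = None) -> Mapping[str, Any]: ...


class FakeTextReader:
    """Test-only adapter backed by an in-memory mapping of name -> text."""
    descriptor = AdapterDescriptor(
        name="fake-text", version="0.0.0",
        media_types=("text/plain",), extensions=(".txt",),
    )

    def __init__(self, sources: Mapping[str, str]):
        self._sources = dict(sources)

    def can_read(self, source, media_type=None):
        return self.descriptor.matches(source, media_type)

    def inspect(self, source):
        # Cheap summary only; no conversion happens here.
        supported = self.can_read(source) and source in self._sources
        return {"adapter": self.descriptor.name, "source": source,
                "supported": supported}

    def read(self, source, options=None):
        if not (self.can_read(source) and source in self._sources):
            return {"markdown": "", "diagnostics": [
                {"code": "unsupported-source", "severity": "error"}]}
        return {"markdown": self._sources[source].strip() + "\n",
                "diagnostics": []}
```

Because `SourceReader` is a structural protocol, an external package would not need to import a base class from `markitect-tool` to satisfy it; whether the real contract is structural or nominal is for `MKTT-WP-0019` to decide.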

## P18.4 - Adapter registry and discovery hooks

```task
id: MKTT-WP-0018-T004
status: todo
priority: high
state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3"
```

Wire source adapters into the existing internal extension framework:

- register source adapter descriptors
- discover package-provided adapters through the entry point group pinned by
  `MKTT-WP-0019`
- expose adapter capabilities via extension listing/inspection
- report missing optional dependency diagnostics
- ensure adapter packages can remain independently versioned

Output: registry implementation, package discovery tests, and compatibility
notes for `markitect-filter`.
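
As a rough illustration of package discovery with missing-dependency reporting, the sketch below uses the standard library's `importlib.metadata.entry_points` (the `group=` keyword needs Python 3.10 or later). The group name `markitect.source_adapters` is a placeholder, since the real group is pinned by `MKTT-WP-0019`; `DiscoveryResult`, `discover_adapters`, and the diagnostic keys are likewise hypothetical.

```python
from dataclasses import dataclass, field
from importlib.metadata import entry_points
from typing import Any, Iterable, Optional

# Placeholder group name; the real one is pinned by MKTT-WP-0019.
ENTRY_POINT_GROUP = "markitect.source_adapters"


@dataclass
class DiscoveryResult:
    adapters: dict = field(default_factory=dict)      # entry point name -> loaded object
    diagnostics: list = field(default_factory=list)   # one dict per load failure


def discover_adapters(candidates: Optional[Iterable[Any]] = None) -> DiscoveryResult:
    """Load adapters from entry points, recording import failures as diagnostics.

    `candidates` lets tests inject entry-point-like objects exposing `.name`
    and `.load()`; when omitted, installed packages are scanned.
    """
    if candidates is None:
        candidates = entry_points(group=ENTRY_POINT_GROUP)
    result = DiscoveryResult()
    # Sort by name so listings and diagnostics come out in a stable order.
    for ep in sorted(candidates, key=lambda item: item.name):
        try:
            result.adapters[ep.name] = ep.load()
        except ImportError as exc:
            # A missing optional dependency should not break discovery.
            result.diagnostics.append({
                "code": "missing-optional-dependency",
                "adapter": ep.name,
                "module": getattr(exc, "name", None) or "unknown",
            })
    return result
```

An adapter package such as `markitect-filter` would then declare its adapters under the pinned group in its own packaging metadata, which keeps its versioning independent of `markitect-tool`.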

## P18.5 - Normalization CLI and public API surface

```task
id: MKTT-WP-0018-T005
status: todo
priority: medium
state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3"
```

Expose a small CLI/API surface:

- `mkt source adapters`
- `mkt source inspect <path-or-uri>`
- `mkt source normalize <path-or-uri> --format markdown`
- structured JSON output for inspection and diagnostics
- markdown output for normalized content
- public API exports for adapter discovery and normalization

Output: CLI commands, API exports, generated command/API docs updates, and
tests.
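
One possible way to wire the three commands is sketched below with `argparse`. The command names and the `--format markdown` flag come from the list above. The `json` format choice, the JSON payload shapes, the exit code `2` for unsupported sources, and the `FakeAdapter` class are assumptions; the actual output envelopes are pinned by `MKTT-WP-0019`, and the real `mkt` CLI may use a different framework.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mkt")
    top = parser.add_subparsers(dest="command", required=True)
    source = top.add_parser("source")
    actions = source.add_subparsers(dest="action", required=True)
    actions.add_parser("adapters")
    inspect_cmd = actions.add_parser("inspect")
    inspect_cmd.add_argument("source")
    normalize_cmd = actions.add_parser("normalize")
    normalize_cmd.add_argument("source")
    # "json" is an assumed extra choice; the workplan only names "markdown".
    normalize_cmd.add_argument("--format", choices=["markdown", "json"],
                               default="markdown")
    return parser


class FakeAdapter:
    """Hypothetical stand-in for a registered adapter."""
    name = "fake-text"
    extensions = (".txt",)

    def can_read(self, source):
        return source.endswith(self.extensions)

    def inspect(self, source):
        return {"adapter": self.name, "source": source}

    def read(self, source):
        return "# " + source + "\n"


def run(argv, adapters):
    """Return (exit_code, output_text) instead of printing, for easy testing."""
    args = build_parser().parse_args(argv)
    if args.action == "adapters":
        payload = {"adapters": sorted(a.name for a in adapters)}
        return 0, json.dumps(payload, sort_keys=True)
    adapter = next((a for a in adapters if a.can_read(args.source)), None)
    if adapter is None:
        payload = {"diagnostics": [
            {"code": "unsupported-source", "source": args.source}]}
        return 2, json.dumps(payload, sort_keys=True)
    if args.action == "inspect":
        return 0, json.dumps(adapter.inspect(args.source), sort_keys=True)
    markdown = adapter.read(args.source)
    if args.format == "json":
        return 0, json.dumps({"markdown": markdown}, sort_keys=True)
    return 0, markdown
```

Returning the exit code and text from `run` keeps the command logic testable without capturing standard output; a thin `main` wrapper would print the text and call `sys.exit` with the code.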

## P18.6 - Contract tests and fake adapter fixture

```task
id: MKTT-WP-0018-T006
status: todo
priority: high
state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3"
```

Add deterministic contract tests proving that an external read adapter can:

- register through the extension framework
- advertise read capabilities
- inspect a source without full conversion
- normalize a source into canonical markdown documents and segments
- emit provenance, metadata, diagnostics, and quality signals
- fail gracefully for unsupported or malformed sources

Output: fake adapter fixture, reusable contract-test helpers, and documentation
for `markitect-filter` adapter implementers.
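
A reusable contract-test helper might look something like the sketch below. It covers most of the checks listed above against a dictionary-shaped result; registration through the extension framework is left out. The helper name `check_read_adapter_contract`, the required keys, the `FakeAdapter` fixture, and the reading of "fail gracefully" as "return an error diagnostic instead of raising" are all assumptions, not the pinned contract.

```python
REQUIRED_RESULT_KEYS = ("segments", "metadata", "provenance", "diagnostics", "quality")


def check_read_adapter_contract(adapter, good_source, bad_source):
    """Return a list of contract violations; an empty list means the adapter passes."""
    problems = []
    if "read" not in adapter.capabilities:
        problems.append("missing 'read' capability")
    if not adapter.can_read(good_source):
        problems.append("can_read() rejected a supported source")
    if adapter.can_read(bad_source):
        problems.append("can_read() accepted an unsupported source")
    if "media_type" not in adapter.inspect(good_source):
        problems.append("inspect() did not report a media type")

    first = adapter.read(good_source)
    if first != adapter.read(good_source):
        problems.append("read() is not deterministic")
    for key in REQUIRED_RESULT_KEYS:
        if key not in first:
            problems.append("read() result is missing '" + key + "'")
    ids = [seg["segment_id"] for seg in first.get("segments", [])]
    if len(ids) != len(set(ids)):
        problems.append("segment IDs are not unique")

    failed = adapter.read(bad_source)
    if not any(d.get("severity") == "error" for d in failed.get("diagnostics", [])):
        problems.append("unsupported source did not yield an error diagnostic")
    return problems


class FakeAdapter:
    """In-memory fixture: splits text on blank lines into ordered segments."""
    name = "fake-text"
    version = "0.0.0"
    capabilities = ("read",)

    def __init__(self, sources):
        self._sources = dict(sources)

    def can_read(self, source):
        return source.endswith(".txt") and source in self._sources

    def inspect(self, source):
        return {"media_type": "text/plain", "source": source}

    def read(self, source):
        result = {
            "segments": [], "metadata": {}, "diagnostics": [],
            "provenance": {"adapter": self.name, "version": self.version},
            "quality": {"lossy": False},
        }
        if not self.can_read(source):
            result["diagnostics"].append(
                {"code": "unsupported-source", "severity": "error"})
            return result
        blocks = [b.strip() for b in self._sources[source].split("\n\n") if b.strip()]
        result["segments"] = [
            {"segment_id": "seg-%04d" % i, "order": i, "markdown": block}
            for i, block in enumerate(blocks)
        ]
        return result
```

Returning violations as a list, rather than asserting inside the helper, lets an adapter package such as `markitect-filter` report every failure in one test run.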

## P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter

```task
id: MKTT-WP-0018-T007
status: todo
priority: medium
state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4"
```

Document how the new contract affects sibling repos:

- `infospace-bench` should replace its local EPUB/read normalization spike with
  calls to the source adapter API
- `markitect-filter` should implement EPUB3 first against this contract
- `kontextual-engine` should treat normalized source outputs as ingestible
  knowledge asset derivatives when it needs durable ingestion

Output: migration note and follow-up workplan seeds for `markitect-filter` and
`infospace-bench`.

## Acceptance

- `markitect-tool` exposes a stable source adapter protocol and canonical
  markdown normalization contract.
- The base install remains markdown-native and does not pull heavyweight
  format-conversion dependencies.
- External adapter packages can register and be discovered through the existing
  extension framework.
- CLI and API users can inspect available source adapters and normalize a source
  through a registered adapter.
- Tests prove the contract with a fake adapter and no network dependency.
- Documentation clearly assigns EPUB3 implementation to `markitect-filter`, not
  `markitect-tool` or `infospace-bench`.
- Writer/export adapter support is explicitly deferred beyond the v1 read
  adapter contract.