From 28ce4b3f658fb96382eb6520f51c70ea0b14470e Mon Sep 17 00:00:00 2001
From: tegwick
Date: Thu, 14 May 2026 21:33:43 +0200
Subject: [PATCH] Workplans to establish markitect-filter integration

---
 docs/workplan-planning-map.md                 |  14 +
 .../MKTT-WP-0018-source-adapter-contract.md   | 286 ++++++++++++++++++
 ...0019-source-adapter-contract-refinement.md | 234 ++++++++++++++
 3 files changed, 534 insertions(+)
 create mode 100644 workplans/MKTT-WP-0018-source-adapter-contract.md
 create mode 100644 workplans/MKTT-WP-0019-source-adapter-contract-refinement.md

diff --git a/docs/workplan-planning-map.md b/docs/workplan-planning-map.md
index 01da16c..cee6584 100644
--- a/docs/workplan-planning-map.md
+++ b/docs/workplan-planning-map.md
@@ -42,6 +42,8 @@ and descriptions mirror the operational view.
 | `MKTT-WP-0012` | complete | done | `MKTT-WP-0004`, `MKTT-WP-0010`, `MKTT-WP-0011` | Document function layer is complete: deterministic Markdown-native function descriptors, registry, inline/fenced syntax, pipelines, context bindings, CLI, docs, examples, diagnostics, provenance, and extension descriptor. |
 | `MKTT-WP-0008` | complete | done | `MKTT-WP-0006`, `MKTT-WP-0007`, `MKTT-WP-0009` | Agent working-memory context cache is complete: context package schema, local registry, package creation from queries/search/manifests, deterministic summaries, namespaces, activation/deactivation/refresh/explain lifecycle, policy re-checks, CLI, docs, and examples. |
 | `MKTT-WP-0017` | complete | done | `MKTT-WP-0003`, `MKTT-WP-0013` | CLI/API polish and practical adoption track is complete: shell completion, extension discovery, generated CLI/API docs, usecase relevance matrix, E2E fixture matrix, large-corpus smoke coverage, first-use docs, examples index, and command cheat sheet. |
+| `MKTT-WP-0019` | P0 | active | `MKTT-WP-0013`, `MKTT-WP-0017` | Source adapter contract refinement: pin v1 read-only scope, field-level normalized model semantics, external adapter entry point discovery, CLI/API envelopes, fake adapter contract tests, and `markitect-filter` handoff before implementation. |
+| `MKTT-WP-0018` | P1 | todo | `MKTT-WP-0013`, `MKTT-WP-0017`, `MKTT-WP-0019` | Source adapter framework implementation: implement the contract refined in `MKTT-WP-0019`, keeping format extraction in `markitect-filter` and the base install free of heavyweight conversion dependencies. |
 | `MKTT-WP-0015` | P2 | todo | `MKTT-WP-0010`, `MKTT-WP-0011`, `MKTT-WP-0012` | Future render and document-function extensions: typed values, richer syntax, document-local reusable functions, Quarkdown/export adapters, render-aware references, assets, and permission sandboxing. Defer unless publishing/export pressure becomes current. |
 | `MKTT-WP-0016` | P2 | todo | `MKTT-WP-0008`, `MKTT-WP-0007`, `MKTT-WP-0009`, `MKTT-WP-0013` | Follow-on agentic memory architecture: reasoning decision graphs, conversational paths, long-term knowledge graphs, memory service blueprints/profiles, graph-to-context-package compilation, and adapter boundaries. |
 
@@ -123,6 +125,13 @@ before the remaining advanced extension work because users and agents need a
 complete, documented, shell-friendly, test-backed public surface before the tool
 grows further.
 
+`MKTT-WP-0019` is the current source-adapter refinement track. It intentionally
+precedes `MKTT-WP-0018` so the implementation work does not have to invent
+field-level normalized model semantics, package entry point discovery, read
+protocol behavior, CLI/API envelopes, or `markitect-filter` handoff criteria
+while coding. The v1 contract should stay read-only; writer/export adapters
+belong in later format-specific work once preservation semantics are explicit.
+
 ## State Hub Mirror
 
 Native State Hub dependency edges should mirror the whole-workstream
@@ -162,3 +171,8 @@ dependencies:
 - `MKTT-WP-0016 -> MKTT-WP-0013`
 - `MKTT-WP-0017 -> MKTT-WP-0003`
 - `MKTT-WP-0017 -> MKTT-WP-0013`
+- `MKTT-WP-0019 -> MKTT-WP-0013`
+- `MKTT-WP-0019 -> MKTT-WP-0017`
+- `MKTT-WP-0018 -> MKTT-WP-0013`
+- `MKTT-WP-0018 -> MKTT-WP-0017`
+- `MKTT-WP-0018 -> MKTT-WP-0019`
diff --git a/workplans/MKTT-WP-0018-source-adapter-contract.md b/workplans/MKTT-WP-0018-source-adapter-contract.md
new file mode 100644
index 0000000..4197c59
--- /dev/null
+++ b/workplans/MKTT-WP-0018-source-adapter-contract.md
@@ -0,0 +1,286 @@
+---
+id: MKTT-WP-0018
+type: workplan
+title: "Source Adapter Interface And Markdown Normalization Contract"
+domain: markitect
+status: todo
+owner: markitect-tool
+topic_slug: markitect
+planning_priority: P1
+planning_order: 145
+depends_on_workplans:
+  - MKTT-WP-0013
+  - MKTT-WP-0017
+  - MKTT-WP-0019
+related_workplans:
+  - MKTT-WP-0010
+  - MKTT-WP-0011
+  - MKTT-WP-0012
+  - MKTT-WP-0019
+created: "2026-05-14"
+updated: "2026-05-14"
+state_hub_workstream_id: "c4e4511f-13ea-40b4-9083-6d9ab6d12dad"
+---
+
+# MKTT-WP-0018: Source Adapter Interface And Markdown Normalization Contract
+
+## Purpose
+
+Define the `markitect-tool` framework for source-format adapters and canonical
+markdown normalization so a separate `markitect-filter` repository can provide
+concrete adapters for EPUB3 first, and later PDF, DOCX, ODT, HTML, and other
+formats.
+
+This workplan deliberately does **not** make `markitect-tool` a document
+conversion product. It keeps `markitect-tool` focused on the syntax layer:
+
+- structured markdown contracts
+- canonical normalized markdown representations
+- adapter protocols and descriptors
+- registry/discovery hooks
+- deterministic validation and diagnostics
+
+Concrete format extraction lives outside the core toolkit, initially in
+`markitect-filter`.
+
+## Background
+
+The Lefevre EPUB3 example in `infospace-bench` exposed a boundary problem.
+`infospace-bench` can pragmatically read EPUB-like files, but that logic is not
+application-layer work. The durable split should be:
+
+```text
+source formats
+  -> markitect-filter concrete adapters
+  -> markitect-tool source adapter protocol and markdown normalization contract
+  -> infospace-bench generation workflows
+  -> optional kontextual-engine persistence, retrieval, governance
+```
+
+This lets `infospace-bench` consume normalized markdown sources without owning
+EPUB/PDF/DOCX details, while `kontextual-engine` can later use the same
+adapter outputs for managed ingestion.
+
+## Decision
+
+Establish a `markitect-tool` source adapter framework and normalization
+contract first. Then implement EPUB3 as the first concrete read adapter in a
+new `markitect-filter` repo.
+
+`MKTT-WP-0019` must run first and pin the v1 contract details for this
+implementation: field-level normalized model semantics, read-only protocol
+shape, external package entry point discovery, CLI/API output envelopes, and
+fake adapter contract-test expectations.
+
+`markitect-tool` should define:
+
+- adapter request/result types
+- canonical markdown document and segment models
+- provenance, metadata, quality, and diagnostic envelopes
+- adapter capability descriptors
+- registry and package discovery hooks
+- CLI/API affordances for adapter inspection and conversion
+- contract tests that an external adapter package can satisfy
+
+`markitect-filter` should define:
+
+- concrete source readers
+- optional dependencies needed for each source format
+- format-specific tests and fixtures
+- EPUB3 spine/nav/body extraction as its first implementation
+
+Write/export adapters remain future optional scope until a format-specific
+preservation contract exists.
+
+## Non-Goals
+
+- Do not implement EPUB3 parsing in `markitect-tool` core.
+- Do not add heavyweight PDF, DOCX, or ODT dependencies to the base install.
+- Do not move infospace lifecycle or generation concerns into `markitect-tool`.
+- Do not implement persistent ingestion, permissions, retrieval, or governance;
+  those remain `kontextual-engine` responsibilities.
+- Do not define domain-specific entity/relation workflows here.
+- Do not implement source writer/export adapters in the v1 slice; keep the
+  protocol read-only unless a later workplan deliberately opens write scope.
+
+## P18.1 - Architecture boundary and markitect-filter handoff
+
+```task
+id: MKTT-WP-0018-T001
+status: todo
+priority: high
+state_hub_task_id: "a5d05b2a-b9d8-43c6-9e52-5a77094b49d1"
+```
+
+Implement the cross-repo architecture pinned by `MKTT-WP-0019`:
+
+- `markitect-tool` owns adapter contracts and markdown normalization
+- `markitect-filter` owns concrete source-format adapters
+- `infospace-bench` consumes normalized markdown for concrete infospaces
+- `kontextual-engine` can ingest adapter outputs into durable knowledge assets
+
+Output: architecture note covering responsibilities, extension package shape,
+the pinned entry point contract, dependency policy, and migration path from the
+current `infospace-bench` EPUB spike.
+
+## P18.2 - Canonical source-to-markdown data model
+
+```task
+id: MKTT-WP-0018-T002
+status: todo
+priority: high
+state_hub_task_id: "f8164264-a9c1-4c82-8617-76bbb84a51bb"
+```
+
+Implement the normalized output model specified by `MKTT-WP-0019`:
+
+- `SourceAsset`
+- `SourceMetadata`
+- `NormalizedMarkdownDocument`
+- `NormalizedMarkdownSegment`
+- `SourceProvenance`
+- `NormalizationDiagnostic`
+- `NormalizationQuality`
+- optional `SourceBinaryAttachment` or asset reference envelope
+
+The model should represent:
+
+- original path/URI/media type
+- title, author/creator, language, rights, source URL, publication metadata
+- ordered markdown content
+- segment IDs, headings, anchors, page/section references
+- digest and cache keys
+- extraction diagnostics
+- lossiness/quality signals
+- adapter name/version/options
+
+Output: public data model, serialization tests, and normalization contract
+documentation matching the field-level v1 specification.
+
+## P18.3 - Source adapter protocol and capability descriptors
+
+```task
+id: MKTT-WP-0018-T003
+status: todo
+priority: high
+state_hub_task_id: "5036ff34-49f4-4900-9e90-95c4555b4ce9"
+```
+
+Define the read adapter interface:
+
+- source reader protocol: `can_read`, `inspect`, `read`
+- media type and file extension matching
+- adapter option schema
+- capability descriptor shape
+- safety and dependency flags
+- deterministic diagnostics
+
+Do not add writer protocols in this implementation slice. Preserve room for a
+future writer protocol by keeping descriptors capability-based, but avoid
+shipping `can_write`/`write` contracts before there is a format-specific
+preservation model.
+
+The first implementation slice can ship a fake in-tree adapter for tests only.
+Concrete EPUB3 implementation belongs in `markitect-filter`.
+
+Output: protocol module, descriptor integration, tests for matching,
+inspection, reading, diagnostics, and unsupported-format behavior.
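A minimal sketch of how the P18.3 read protocol might look as Python `Protocol` classes. Only the method names `can_read`, `inspect`, and `read` come from the workplan; `ReadRequest`, `ReadResult`, `SourceReader`, `FakeTextReader`, `find_reader`, and every field name are hypothetical placeholders, since `MKTT-WP-0019` has not yet pinned the real signatures.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Protocol, runtime_checkable


@dataclass(frozen=True)
class ReadRequest:
    """Hypothetical request type: a local path plus adapter options."""
    path: Path
    options: dict = field(default_factory=dict)


@dataclass(frozen=True)
class ReadResult:
    """Hypothetical result type: markdown text plus diagnostic codes."""
    markdown: str
    diagnostics: tuple = ()


@runtime_checkable
class SourceReader(Protocol):
    """Read-only protocol; no can_write/write, matching the v1 scope."""
    extensions: tuple

    def can_read(self, request: ReadRequest) -> bool: ...
    def inspect(self, request: ReadRequest) -> dict: ...
    def read(self, request: ReadRequest) -> ReadResult: ...


class FakeTextReader:
    """Test-only adapter that matches sources by file extension."""
    extensions = (".txt",)

    def can_read(self, request: ReadRequest) -> bool:
        return request.path.suffix.lower() in self.extensions

    def inspect(self, request: ReadRequest) -> dict:
        # Inspection reports cheap facts without converting the source.
        return {"path": str(request.path), "media_type": "text/plain"}

    def read(self, request: ReadRequest) -> ReadResult:
        if not self.can_read(request):
            # Unsupported input yields a diagnostic instead of an exception.
            return ReadResult(markdown="", diagnostics=("unsupported-format",))
        return ReadResult(markdown=request.path.read_text(encoding="utf-8"))


def find_reader(readers, request: ReadRequest) -> Optional[SourceReader]:
    """Return the first reader that claims the request, or None."""
    for reader in readers:
        if reader.can_read(request):
            return reader
    return None
```

Returning a diagnostic rather than raising for an unsupported format is one possible reading of "deterministic diagnostics"; the pinned contract may choose differently.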
+
+## P18.4 - Adapter registry and discovery hooks
+
+```task
+id: MKTT-WP-0018-T004
+status: todo
+priority: high
+state_hub_task_id: "391fb723-8990-4086-ac6c-656a3d637ba3"
+```
+
+Wire source adapters into the existing internal extension framework:
+
+- register source adapter descriptors
+- discover package-provided adapters through the entry point group pinned by
+  `MKTT-WP-0019`
+- expose adapter capabilities via extension listing/inspection
+- report missing optional dependency diagnostics
+- ensure adapter packages can remain independently versioned
+
+Output: registry implementation, package discovery tests, and compatibility
+notes for `markitect-filter`.
+
+## P18.5 - Normalization CLI and public API surface
+
+```task
+id: MKTT-WP-0018-T005
+status: todo
+priority: medium
+state_hub_task_id: "c6233bd1-0403-498b-a6ed-c1874b172aa3"
+```
+
+Expose a small CLI/API surface:
+
+- `mkt source adapters`
+- `mkt source inspect `
+- `mkt source normalize --format markdown`
+- structured JSON output for inspection and diagnostics
+- markdown output for normalized content
+- public API exports for adapter discovery and normalization
+
+Output: CLI commands, API exports, generated command/API docs updates, and
+tests.
+
+## P18.6 - Contract tests and fake adapter fixture
+
+```task
+id: MKTT-WP-0018-T006
+status: todo
+priority: high
+state_hub_task_id: "263d0351-2942-4c2a-b333-b3aa96f2b8e3"
+```
+
+Add deterministic contract tests proving that an external read adapter can:
+
+- register through the extension framework
+- advertise read capabilities
+- inspect a source without full conversion
+- normalize a source into canonical markdown documents and segments
+- emit provenance, metadata, diagnostics, and quality signals
+- fail gracefully for unsupported or malformed sources
+
+Output: fake adapter fixture, reusable contract-test helpers, and documentation
+for `markitect-filter` adapter implementers.
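The reusable contract-test helper described in P18.6 could take roughly the following shape. This is a sketch under stated assumptions: `FakeAdapter`, `check_read_contract`, `run_fake_adapter_demo`, the `.fake` extension, and the dict-shaped results are all hypothetical and stand in for whatever model `MKTT-WP-0019` pins. It uses only the standard library and no network access.

```python
import tempfile
from pathlib import Path


class FakeAdapter:
    """Test-only adapter: treats .fake files as already-normalized markdown."""
    name = "fake"
    version = "0.0.1"

    def can_read(self, path):
        return Path(path).suffix == ".fake"

    def inspect(self, path):
        # Cheap inspection: report size only, no conversion.
        return {"adapter": self.name, "bytes": Path(path).stat().st_size}

    def read(self, path):
        if not self.can_read(path):
            return {"ok": False, "segments": [],
                    "diagnostics": [{"code": "unsupported-format"}]}
        blocks = Path(path).read_text(encoding="utf-8").split("\n\n")
        segments = [{"id": f"seg-{n:04d}", "markdown": block.strip()}
                    for n, block in enumerate(blocks, start=1) if block.strip()]
        return {"ok": True, "segments": segments, "diagnostics": [],
                "provenance": {"adapter": self.name, "version": self.version}}


def check_read_contract(adapter, supported, unsupported):
    """Return a list of contract failures; an empty list means conformance."""
    failures = []
    if not adapter.can_read(supported):
        failures.append("can_read rejected a supported source")
    if adapter.can_read(unsupported):
        failures.append("can_read accepted an unsupported source")
    if "adapter" not in adapter.inspect(supported):
        failures.append("inspect did not name the adapter")
    first, second = adapter.read(supported), adapter.read(supported)
    if first != second:
        failures.append("read is not deterministic")
    if not first["ok"] or not first["segments"]:
        failures.append("read produced no segments")
    if "provenance" not in first:
        failures.append("read omitted provenance")
    rejected = adapter.read(unsupported)
    if rejected["ok"] or not rejected["diagnostics"]:
        failures.append("unsupported source did not fail with a diagnostic")
    return failures


def run_fake_adapter_demo():
    """Exercise the helper against the fake adapter with temporary fixtures."""
    with tempfile.TemporaryDirectory() as tmp:
        good = Path(tmp) / "sample.fake"
        good.write_text("# Title\n\nFirst paragraph.\n", encoding="utf-8")
        bad = Path(tmp) / "sample.bin"
        bad.write_bytes(b"\x00\x01")
        return check_read_contract(FakeAdapter(), good, bad)
```

Returning a list of failure messages instead of asserting inside the helper lets an external package such as `markitect-filter` wrap the same checks in whichever test runner it uses.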
+
+## P18.7 - Cross-repo migration notes for infospace-bench and markitect-filter
+
+```task
+id: MKTT-WP-0018-T007
+status: todo
+priority: medium
+state_hub_task_id: "dfc81c61-f7ca-4266-8908-56b221101fd4"
+```
+
+Document how the new contract affects sibling repos:
+
+- `infospace-bench` should replace its local EPUB/read normalization spike with
+  calls to the source adapter API
+- `markitect-filter` should implement EPUB3 first against this contract
+- `kontextual-engine` should treat normalized source outputs as ingestible
+  knowledge asset derivatives when it needs durable ingestion
+
+Output: migration note and follow-up workplan seeds for `markitect-filter` and
+`infospace-bench`.
+
+## Acceptance
+
+- `markitect-tool` exposes a stable source adapter protocol and canonical
+  markdown normalization contract.
+- The base install remains markdown-native and does not pull heavyweight
+  format-conversion dependencies.
+- External adapter packages can register and be discovered through the existing
+  extension framework.
+- CLI and API users can inspect available source adapters and normalize a source
+  through a registered adapter.
+- Tests prove the contract with a fake adapter and no network dependency.
+- Documentation clearly assigns EPUB3 implementation to `markitect-filter`, not
+  `markitect-tool` or `infospace-bench`.
+- Writer/export adapter support is explicitly deferred beyond the v1 read
+  adapter contract.
diff --git a/workplans/MKTT-WP-0019-source-adapter-contract-refinement.md b/workplans/MKTT-WP-0019-source-adapter-contract-refinement.md
new file mode 100644
index 0000000..1eee6c6
--- /dev/null
+++ b/workplans/MKTT-WP-0019-source-adapter-contract-refinement.md
@@ -0,0 +1,234 @@
+---
+id: MKTT-WP-0019
+type: workplan
+title: "Source Adapter Contract Refinement"
+domain: markitect
+status: active
+owner: markitect-tool
+topic_slug: markitect
+planning_priority: P0
+planning_order: 142
+depends_on_workplans:
+  - MKTT-WP-0013
+  - MKTT-WP-0017
+related_workplans:
+  - MKTT-WP-0018
+  - MKTT-WP-0010
+  - MKTT-WP-0011
+created: "2026-05-14"
+updated: "2026-05-14"
+state_hub_workstream_id: "702e94d9-35e8-4a83-b0ea-40cc19afbe51"
+---
+
+# MKTT-WP-0019: Source Adapter Contract Refinement
+
+## Purpose
+
+Refine the source adapter contract before implementing
+`MKTT-WP-0018`. The goal is to remove the remaining ambiguity in the external
+adapter surface so `markitect-tool` can implement the framework and
+`markitect-filter` can implement EPUB3 without guessing about model fields,
+entry points, CLI behavior, or contract-test expectations.
+
+This is a short gating workplan. It should produce decisions, documentation,
+and test fixtures that make `MKTT-WP-0018` implementation straightforward.
+
+## Background
+
+`MKTT-WP-0018` establishes the correct architecture boundary:
+
+```text
+markitect-tool   -> contracts, normalized markdown model, registry, CLI/API
+markitect-filter -> concrete source-format adapters, EPUB3 first
+```
+
+The boundary is sound, but a feasibility review found that the implementation
+workplan still leaves several decisions too implicit:
+
+- the existing internal extension framework does not yet define external
+  package entry point discovery
+- the normalized source-to-markdown model names are listed, but field-level
+  contracts and serialization rules are not pinned
+- v1 should be read-only, with write/export support reserved for a later
+  format-by-format decision
+- CLI/API output envelopes, adapter selection, and unsupported-format behavior
+  need deterministic contracts
+- `markitect-filter` needs a concrete handoff shape for its first EPUB3 adapter
+
+## Decision
+
+Add a refinement pass ahead of `MKTT-WP-0018`. This workplan should define the
+minimum stable v1 contract and explicitly defer nonessential scope.
+
+The v1 source adapter contract should be:
+
+- read-only
+- deterministic
+- local-file-first, with URI support documented as future or explicitly scoped
+- discoverable through a named package entry point group
+- serializable without heavyweight optional format dependencies
+- testable through fake adapters and small fixtures
+
+## Non-Goals
+
+- Do not implement EPUB3 parsing here.
+- Do not implement the full `markitect-tool` source adapter framework here.
+- Do not add PDF, DOCX, ODT, OCR, or browser dependencies.
+- Do not design write/export adapters beyond recording the future extension
+  point.
+- Do not make `markitect-filter` a knowledge platform or ingestion service.
+
+## P19.1 - Pin v1 scope and external adapter package shape
+
+```task
+id: MKTT-WP-0019-T001
+status: todo
+priority: high
+state_hub_task_id: "0aa1d9a3-6cf8-47ab-8585-f23b2512d19b"
+```
+
+Define the v1 source adapter scope:
+
+- read adapters only
+- local filesystem inputs first
+- explicit future status for URI inputs, binary attachments, and write adapters
+- expected external package layout for `markitect-filter`
+- dependency policy for optional format libraries
+- compatibility expectations between `markitect-tool` and adapter packages
+
+Output: concise architecture note or source-adapter contract section that
+`MKTT-WP-0018` can implement directly.
+
+## P19.2 - Specify normalized data model fields and serialization
+
+```task
+id: MKTT-WP-0019-T002
+status: todo
+priority: high
+state_hub_task_id: "fabd3e76-3c2c-43cb-92b2-2322bd933fa7"
+```
+
+Specify the field-level v1 model for:
+
+- `SourceAsset`
+- `SourceMetadata`
+- `NormalizedMarkdownDocument`
+- `NormalizedMarkdownSegment`
+- `SourceProvenance`
+- `NormalizationQuality`
+- adapter diagnostics using the existing `Diagnostic`/`SourceLocation` shape
+- optional asset reference envelopes, if needed for v1
+
+The specification should define required vs optional fields, stable dict/JSON
+serialization, digest/cache-key inputs, segment ordering, segment IDs, headings,
+anchors, source hrefs, page/section references, and adapter metadata.
+
+Output: model contract documentation and fixture-shaped examples.
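To make the "stable dict/JSON serialization" and "digest/cache-key inputs" requirements of P19.2 concrete, here is one possible sketch. The class names `NormalizedMarkdownDocument` and `NormalizedMarkdownSegment` come from the workplan, but every field shown (`segment_id`, `order`, `source_digest`, and so on) and the helpers `stable_json` and `cache_key` are assumptions made for illustration; the real field list is exactly what this task is meant to decide.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class NormalizedMarkdownSegment:
    # Hypothetical fields; the v1 specification may name these differently.
    segment_id: str
    order: int
    markdown: str
    heading: Optional[str] = None
    source_href: Optional[str] = None


@dataclass(frozen=True)
class NormalizedMarkdownDocument:
    # Hypothetical fields covering digest inputs and adapter metadata.
    source_digest: str
    adapter_name: str
    adapter_version: str
    segments: tuple = ()

    def to_dict(self) -> dict:
        data = asdict(self)
        # Segment order is part of the contract, so sort explicitly rather
        # than trusting the order in which an adapter appended segments.
        data["segments"] = sorted(data["segments"], key=lambda s: s["order"])
        return data


def stable_json(document: NormalizedMarkdownDocument) -> str:
    """Serialize with sorted keys and fixed separators so bytes are stable."""
    return json.dumps(document.to_dict(), sort_keys=True,
                      separators=(",", ":"), ensure_ascii=False)


def cache_key(document: NormalizedMarkdownDocument) -> str:
    """SHA-256 over the stable JSON form; one possible cache-key choice."""
    return hashlib.sha256(stable_json(document).encode("utf-8")).hexdigest()
```

Under these assumptions, two documents that hold the same segments in a different in-memory order serialize to identical bytes and share a cache key, which is the property a deterministic fixture round trip would check.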
+
+## P19.3 - Specify read adapter protocol and selection semantics
+
+```task
+id: MKTT-WP-0019-T003
+status: todo
+priority: high
+state_hub_task_id: "2d559e3b-1515-4c88-8ed9-3895026cd2ca"
+```
+
+Define the v1 read protocol:
+
+- request/result type names and fields
+- `can_read`, `inspect`, and `read` method signatures
+- media type and file extension matching rules
+- adapter option schema conventions
+- malformed-source and unsupported-format diagnostics
+- deterministic adapter selection when multiple adapters match
+- behavior when optional adapter dependencies are missing
+
+Output: protocol contract that can be implemented as Python `Protocol`
+classes in `MKTT-WP-0018`.
+
+## P19.4 - Define package entry point and registry contract
+
+```task
+id: MKTT-WP-0019-T004
+status: todo
+priority: high
+state_hub_task_id: "3d661d24-2496-405a-b525-c7e6d8eb4e68"
+```
+
+Define how external source adapter packages register with `markitect-tool`:
+
+- entry point group name, initially `markitect_tool.source_adapters`
+- expected entry point object shape
+- descriptor ID and versioning rules
+- relationship between source adapter descriptors and
+  `ExtensionDescriptor`
+- duplicate descriptor handling
+- dependency diagnostics for missing optional format libraries
+- compatibility notes for separately versioned packages
+
+Output: discovery contract and fake entry point test plan for
+`MKTT-WP-0018`.
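The discovery rules in P19.4 might be prototyped along these lines. The group name `markitect_tool.source_adapters` is taken from the workplan; everything else is an assumption made for illustration: the `adapter_id` attribute, the "first entry point in name order wins" duplicate policy, the diagnostic codes, and the `FakeEntryPoint` and `FakeDescriptor` stand-ins. `importlib.metadata.entry_points(group=...)` is the standard-library call and needs Python 3.10 or newer.

```python
from importlib.metadata import entry_points

GROUP = "markitect_tool.source_adapters"  # group name from the workplan


def collect_adapters(eps):
    """Load entry points in name order and return (adapters, diagnostics).

    Duplicates and import failures become diagnostics instead of exceptions,
    so one broken adapter package cannot hide the others.
    """
    adapters, diagnostics = {}, []
    for ep in sorted(eps, key=lambda e: e.name):
        try:
            descriptor = ep.load()
        except ImportError as exc:
            # Typically a missing optional format library.
            diagnostics.append({"code": "missing-dependency",
                                "entry_point": ep.name, "detail": str(exc)})
            continue
        adapter_id = getattr(descriptor, "adapter_id", ep.name)
        if adapter_id in adapters:
            # Assumed policy: the first entry point in name order wins.
            diagnostics.append({"code": "duplicate-adapter",
                                "entry_point": ep.name, "adapter_id": adapter_id})
            continue
        adapters[adapter_id] = descriptor
    return adapters, diagnostics


def discover():
    """Discover installed adapter packages (Python 3.10+ selection API)."""
    return collect_adapters(entry_points(group=GROUP))


class FakeDescriptor:
    """Hypothetical descriptor object carrying only an adapter ID."""
    def __init__(self, adapter_id):
        self.adapter_id = adapter_id


class FakeEntryPoint:
    """Stand-in for an installed entry point, for tests without packaging."""
    def __init__(self, name, descriptor=None, error=None):
        self.name, self._descriptor, self._error = name, descriptor, error

    def load(self):
        if self._error is not None:
            raise self._error
        return self._descriptor
```

Sorting by entry point name before loading is what makes the outcome independent of installation order; a monkeypatched or fake entry point list like the one above is enough to test that without building a real external package.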
+
+## P19.5 - Pin CLI/API output envelopes and exit behavior
+
+```task
+id: MKTT-WP-0019-T005
+status: todo
+priority: medium
+state_hub_task_id: "2c30b0c7-683e-4d60-8268-0b49660f2e30"
+```
+
+Specify the public source commands and library functions:
+
+- `mkt source adapters`
+- `mkt source inspect `
+- `mkt source normalize --format markdown`
+- JSON output for adapters, inspection, normalization, and diagnostics
+- Markdown output for normalized document content
+- adapter selection and explicit adapter override options
+- exit behavior for unsupported, malformed, or dependency-missing inputs
+- public API names that should be exported from `markitect_tool`
+
+Output: CLI/API contract note and expected-output fixtures.
+
+## P19.6 - Prepare contract-test and markitect-filter handoff criteria
+
+```task
+id: MKTT-WP-0019-T006
+status: todo
+priority: high
+state_hub_task_id: "f6845a4d-3465-40b3-970a-714cfafe282c"
+```
+
+Define the contract tests that `MKTT-WP-0018` must implement:
+
+- fake in-tree adapter for core behavior
+- fake external adapter package or monkeypatched entry point for discovery
+- serialization round trips for normalized model fixtures
+- unsupported-format and missing-dependency diagnostics
+- CLI JSON and Markdown output fixtures
+- reusable adapter conformance expectations for `markitect-filter`
+
+Also seed the `markitect-filter` handoff:
+
+- expected package entry point declaration
+- first EPUB3 adapter descriptor shape
+- minimal fixture expectations for EPUB3 spine/nav/body extraction
+- follow-up workplan seed for `markitect-filter` implementation
+
+Output: contract-test checklist and handoff note.
+
+## Acceptance
+
+- `MKTT-WP-0018` has no unresolved v1 contract ambiguity around model fields,
+  read protocol shape, entry point discovery, CLI/API output, or fake adapter
+  tests.
+- v1 is explicitly read-only; write/export support is deferred to a later
+  workplan.
+- External adapter discovery has a named entry point group and descriptor
+  object contract.
+- `markitect-filter` has enough handoff detail to implement EPUB3 without
+  importing implementation decisions from `infospace-bench`.
+- The existing `MKTT-WP-0018` workplan is updated to depend on this refinement
+  pass and to reference the pinned decisions rather than reopening them.