INTENT
Purpose
This repository provides source-format adapters that convert heterogeneous content formats into canonical Markitect markdown representations. Eventually it may also write supported markdown representations back to selected formats, where that is practical and safe.
It is the adapter layer between external document/media formats and the
markdown-native tooling provided by markitect-tool.
The first concrete target is EPUB3 read support.
Primary Utility
The repository provides concrete filters/adapters that:
- Inspect source files and report useful format metadata
- Read supported source formats into normalized markdown documents and segments
- Preserve provenance from source paths, package entries, anchors, pages, chapters, and metadata fields where available
- Emit diagnostics about extraction quality, unsupported features, lossiness, malformed sources, and skipped boilerplate
- Keep optional format dependencies isolated from the markitect-tool core
- Implement the adapter interfaces and markdown normalization contracts defined by markitect-tool
It turns external document formats into structured markdown inputs that higher layers can trust, inspect, cache, validate, and transform.
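As a minimal sketch of what a read adapter could hand to higher layers, the example below bundles markdown segments with metadata, provenance, and diagnostics. Every name in it (`Provenance`, `Diagnostic`, `Segment`, `ReadResult`, `SourceReader`, `PlainTextReader`) is hypothetical and chosen for illustration only; the actual adapter contract is defined by markitect-tool and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass(frozen=True)
class Provenance:
    source_path: str               # file the content came from
    entry: Optional[str] = None    # package entry, e.g. a file inside an archive
    anchor: Optional[str] = None   # position inside the entry


@dataclass(frozen=True)
class Diagnostic:
    severity: str   # "info", "warning", or "error"
    code: str
    message: str


@dataclass
class Segment:
    markdown: str
    provenance: Provenance


@dataclass
class ReadResult:
    segments: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    diagnostics: list = field(default_factory=list)


class SourceReader(Protocol):
    """Hypothetical shape of a read adapter; not the real markitect-tool protocol."""

    def read(self, data: bytes, source_path: str) -> ReadResult: ...


class PlainTextReader:
    """Toy adapter: one segment per blank-line-separated paragraph."""

    def read(self, data: bytes, source_path: str) -> ReadResult:
        result = ReadResult(metadata={"format": "text/plain"})
        try:
            text = data.decode("utf-8")
        except UnicodeDecodeError:
            # Keep going, but make the lossiness visible to the caller.
            text = data.decode("utf-8", errors="replace")
            result.diagnostics.append(
                Diagnostic("warning", "lossy-decode", "invalid UTF-8 bytes were replaced")
            )
        paragraphs = [" ".join(block.split()) for block in text.split("\n\n") if block.strip()]
        for number, paragraph in enumerate(paragraphs, start=1):
            provenance = Provenance(source_path, anchor=f"paragraph-{number}")
            result.segments.append(Segment(paragraph, provenance))
        return result
```

The point of the sketch is the shape of the result rather than the plain-text logic: a caller can inspect `diagnostics` before trusting `segments`, and each segment carries enough provenance to be traced back to its source.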
Initial read targets may include:
- EPUB3
- HTML/article exports
- plain text variants
- DOCX
- ODT
Write/export targets should be added more cautiously, format by format, only when there is a clear contract for preserving structure and provenance.
Intended Users
- Developers building source-ingestion pipelines on top of Markitect
- Automation systems (atm) normalizing local document corpora
- LLM agents (agt) needing predictable markdown access to source documents
- Higher-level repositories such as infospace-bench and kontextual-engine that need reusable source normalization without owning format-specific code
Strategic Role in the System
This repository is part of the Markitect layered knowledge system:
- markitect-tool → defines markdown structure, contracts, and adapter interfaces
- markitect-filter → implements concrete source-format filters/adapters
- kontextual-engine → makes knowledge persistent, retrievable, and operable
- infospace-bench → builds concrete, evaluated infospaces from normalized sources
The responsibility split is:
- Syntax contract layer (markitect-tool): Defines canonical markdown models, adapter protocols, validation, parsing, templating, and extension contracts.
- Format adapter layer (markitect-filter): Implements format-specific read/write adapters that satisfy the Markitect source adapter contract.
- Knowledge operations layer (kontextual-engine): Persists, governs, indexes, retrieves, and orchestrates knowledge assets.
- Application layer (infospace-bench): Applies normalized source content to concrete infospace workflows, evaluation, generation, and review.
This repository should remain a format adapter library, not a knowledge platform and not the canonical markdown core.
Strategic Boundaries
This repository is not intended to:
- Define the core Markitect markdown AST, parser, schema system, or contract model
- Own the source adapter interface itself; that belongs in markitect-tool
- Persist long-lived knowledge assets, permissions, indexes, or operational state
- Build domain-specific infospaces, entity models, relation models, or evaluation workflows
- Become a full document editor, publishing suite, OCR product, or general ETL platform
- Hide extraction uncertainty; lossiness and diagnostics should remain visible
- Force heavyweight optional dependencies into projects that only need markdown core functionality
Format-specific complexity belongs here, but only behind clear Markitect contracts.
Design Principles
- Adapters over platforms: Provide focused source-format adapters, not an end-user document system.
- Markdown as the interchange surface: Every read adapter should produce canonical structured markdown plus metadata, provenance, diagnostics, and quality signals.
- Contract-first integration: markitect-tool defines the adapter protocol; this repo implements it.
- Optional dependencies by format: EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries. Those dependencies should be isolated and discoverable.
- Traceability over pretty conversion: The output should preserve where content came from, what was skipped, and what may have been lost.
- Deterministic core behavior: Source conversion should be deterministic by default. AI-assisted repair or enrichment may exist later, but only as explicit optional behavior.
- Small fixtures, real formats: Tests should use small fixtures that exercise real package structures, metadata, ordering, anchors, malformed inputs, and boilerplate handling.
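One way the "isolated and discoverable" dependency principle could be approached is to check for each adapter's modules without importing them, so that a missing library disables only the adapter that needs it. The sketch below uses the standard library's `importlib.util.find_spec`; the `missing_modules` and `adapter_status` helpers and the example requirement table are assumptions made for illustration, not part of this repository.

```python
from importlib.util import find_spec


def missing_modules(modules):
    """Return the module names from `modules` that cannot be located."""
    missing = []
    for name in modules:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            # A missing parent package or a broken module spec counts as unavailable.
            found = False
        if not found:
            missing.append(name)
    return missing


def adapter_status(requirements):
    """Map each format name to the list of modules it still lacks."""
    return {fmt: missing_modules(mods) for fmt, mods in requirements.items()}


# Hypothetical requirement table; real adapters may need other libraries.
EXAMPLE_REQUIREMENTS = {
    "epub3": ["zipfile", "xml.etree.ElementTree"],
    "imaginary-format": ["module_that_is_not_installed_anywhere"],
}

if __name__ == "__main__":
    for fmt, missing in adapter_status(EXAMPLE_REQUIREMENTS).items():
        state = "available" if not missing else "missing " + ", ".join(missing)
        print(f"{fmt}: {state}")
```

A check like this lets an inspection command report which adapters are usable in the current environment while projects that only need the markdown core never import the heavier libraries.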
First Milestone: EPUB3 Read Adapter
The first concrete implementation should normalize EPUB3 into Markitect markdown by:
- Reading META-INF/container.xml
- Parsing the OPF package document
- Preserving Dublin Core and package metadata
- Following spine reading order
- Resolving navigation/chapter labels
- Extracting body XHTML into markdown segments
- Preserving source hrefs, anchors, and section/page references where possible
- Classifying or skipping cover, navigation, table-of-contents, header, footer, license, and transcriber-note material through explicit policy
- Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets, and lossy extraction
This EPUB3 reader should be usable by infospace-bench to replace its current
local EPUB intake spike.
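The first few steps of the milestone (container lookup, OPF parsing, a metadata field, and spine order) can be sketched with the Python standard library alone. This is a rough illustration under stated assumptions, not the planned implementation: the function name `read_spine` and its return shape are hypothetical, only `dc:title` is read, and percent-encoded hrefs, navigation documents, and XHTML-to-markdown conversion are not handled.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the EPUB container, the OPF package, and Dublin Core.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_spine(epub_file):
    """Return (title, ordered archive paths of spine documents, diagnostics)."""
    diagnostics = []
    with zipfile.ZipFile(epub_file) as archive:
        # Step 1: container.xml names the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None or not rootfile.get("full-path"):
            raise ValueError("container.xml does not name a package document")
        opf_path = rootfile.get("full-path")

        # Step 2: parse the package document and keep one Dublin Core field.
        package = ET.fromstring(archive.read(opf_path))
        title = package.findtext("opf:metadata/dc:title", default="", namespaces=NS)

        # Step 3: the manifest maps item ids to hrefs relative to the OPF file.
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)
        names = set(archive.namelist())

        # Step 4: the spine gives reading order; report problems, do not hide them.
        ordered = []
        for itemref in package.findall("opf:spine/opf:itemref", NS):
            idref = itemref.get("idref")
            href = manifest.get(idref)
            if href is None:
                diagnostics.append(f"spine itemref {idref!r} has no manifest item")
                continue
            entry = posixpath.normpath(posixpath.join(base, href))
            if entry not in names:
                diagnostics.append(f"manifest href {href!r} is missing from the archive")
                continue
            ordered.append(entry)
    return title, ordered, diagnostics
```

`read_spine` accepts a path or a binary file object, which keeps fixtures small: a test can build a tiny EPUB in memory and check both the reading order and the diagnostics for a deliberately broken spine.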
Maturity Target
A mature version of this repository should:
- Provide a suite of reliable source-format adapters that implement the Markitect source adapter contract
- Support clear inspection and normalization CLIs for each installed adapter
- Keep format dependencies modular and optional
- Produce canonical markdown outputs suitable for parsing, validation, indexing, workflow execution, and infospace generation
- Preserve provenance and diagnostics well enough for human review and agent-safe automation
- Serve as the default adapter package family for Markitect-based systems
Stability Note
Changes to this file represent a deliberate shift in the role of
markitect-filter within the Markitect ecosystem.
Such changes should be made with care, because this repository defines where
format-specific source conversion lives relative to markitect-tool,
kontextual-engine, and infospace-bench.