# INTENT

## Purpose

This repository exists to provide **source-format adapters for converting heterogeneous content formats into canonical Markitect markdown representations, and eventually writing supported markdown representations back to selected formats where that is practical and safe**.

It is the adapter layer between external document/media formats and the markdown-native tooling provided by `markitect-tool`.

The first concrete target is **EPUB3 read support**.

---

## Primary Utility

The repository provides concrete filters/adapters that:

* Inspect source files and report useful format metadata
* Read supported source formats into normalized markdown documents and segments
* Preserve provenance from source paths, package entries, anchors, pages, chapters, and metadata fields where available
* Emit diagnostics about extraction quality, unsupported features, lossiness, malformed sources, and skipped boilerplate
* Keep optional format dependencies isolated from the `markitect-tool` core
* Implement the adapter interfaces and markdown normalization contracts defined by `markitect-tool`

It turns external document formats into **structured markdown inputs that higher layers can trust, inspect, cache, validate, and transform**.

Initial read targets may include:

* EPUB3
* HTML/article exports
* plain text variants
* PDF
* DOCX
* ODT

Write/export targets should be added more cautiously, format by format, only when there is a clear contract for preserving structure and provenance.

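
As a rough sketch of what a read adapter's output could carry, the Python below models a markdown segment together with its provenance and diagnostics. All class and field names here (`Provenance`, `Diagnostic`, `Segment`, `is_lossy`) are hypothetical and exist only for illustration; the actual contract is defined by `markitect-tool`, not by this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Provenance:
    """Where a segment came from. Field names are illustrative assumptions."""
    source_path: str
    package_entry: Optional[str] = None  # e.g. a file inside an EPUB package
    anchor: Optional[str] = None         # fragment identifier, if known


@dataclass(frozen=True)
class Diagnostic:
    """A single extraction note that stays visible to callers."""
    severity: str  # "info", "warning", or "error"
    code: str      # short machine-readable label (hypothetical values)
    message: str


@dataclass
class Segment:
    """A unit of normalized markdown plus its traceability data."""
    markdown: str
    provenance: Provenance
    diagnostics: List[Diagnostic] = field(default_factory=list)

    @property
    def is_lossy(self) -> bool:
        # In this sketch, any diagnostic above "info" marks the segment as lossy.
        return any(d.severity != "info" for d in self.diagnostics)


segment = Segment(
    markdown="# Chapter 1\n\nIt was a dark night.",
    provenance=Provenance("book.epub", "OEBPS/ch1.xhtml", "ch1"),
    diagnostics=[Diagnostic("warning", "unsupported-media", "Dropped an audio element")],
)
```

Keeping diagnostics attached to each segment, rather than in a separate log, is one way to ensure lossiness travels with the content it describes.
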

---

## Intended Users

* Developers building source-ingestion pipelines on top of Markitect
* Automation systems (`atm`) normalizing local document corpora
* LLM agents (`agt`) needing predictable markdown access to source documents
* Higher-level repositories such as `infospace-bench` and `kontextual-engine` that need reusable source normalization without owning format-specific code

---

## Strategic Role in the System

This repository is part of the Markitect layered knowledge system:

- `markitect-tool` → defines markdown structure, contracts, and adapter interfaces
- **`markitect-filter`** → implements concrete source-format filters/adapters
- `kontextual-engine` → makes knowledge persistent, retrievable, and operable
- `infospace-bench` → builds concrete, evaluated infospaces from normalized sources

The responsibility split is:

* **Syntax contract layer**: `markitect-tool`
  Defines canonical markdown models, adapter protocols, validation, parsing, templating, and extension contracts.

* **Format adapter layer**: `markitect-filter`
  Implements format-specific read/write adapters that satisfy the Markitect source adapter contract.

* **Knowledge operations layer**: `kontextual-engine`
  Persists, governs, indexes, retrieves, and orchestrates knowledge assets.

* **Application layer**: `infospace-bench`
  Applies normalized source content to concrete infospace workflows, evaluation, generation, and review.

This repository should remain a **format adapter library**, not a knowledge platform and not the canonical markdown core.


---

## Strategic Boundaries

This repository is **not** intended to:

* Define the core Markitect markdown AST, parser, schema system, or contract model
* Own the source adapter interface itself; that belongs in `markitect-tool`
* Persist long-lived knowledge assets, permissions, indexes, or operational state
* Build domain-specific infospaces, entity models, relation models, or evaluation workflows
* Become a full document editor, publishing suite, OCR product, or general ETL platform
* Hide extraction uncertainty; lossiness and diagnostics should remain visible
* Force heavyweight optional dependencies into projects that only need markdown core functionality

Format-specific complexity belongs here, but only behind clear Markitect contracts.

---

## Design Principles

* **Adapters over platforms**
  Provide focused source-format adapters, not an end-user document system.

* **Markdown as the interchange surface**
  Every read adapter should produce canonical structured markdown plus metadata, provenance, diagnostics, and quality signals.

* **Contract-first integration**
  `markitect-tool` defines the adapter protocol; this repo implements it.

* **Optional dependencies by format**
  EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries. Those dependencies should be isolated and discoverable.

* **Traceability over pretty conversion**
  The output should preserve where content came from, what was skipped, and what may have been lost.

* **Deterministic core behavior**
  Source conversion should be deterministic by default. AI-assisted repair or enrichment may exist later, but only as explicit optional behavior.

* **Small fixtures, real formats**
  Tests should use small fixtures that exercise real package structures, metadata, ordering, anchors, malformed inputs, and boilerplate handling.

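
To illustrate the "small fixtures, real formats" principle, the Python sketch below builds a minimal EPUB package in memory and resolves its spine reading order using only the standard library. The helper names (`build_fixture`, `spine_entries`) and the file layout under `OEBPS/` are assumptions made for this example; they are not part of any existing Markitect API, and a real adapter would also need error handling and diagnostics.

```python
import io
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from typing import List

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

# The manifest deliberately lists ch2 before ch1 so the test proves that
# reading order comes from the spine, not from manifest order.
CONTENT_OPF = """<?xml version="1.0"?>
<package xmlns="http://www.idpf.org/2007/opf" version="3.0" unique-identifier="id">
  <manifest>
    <item id="ch2" href="text/ch2.xhtml" media-type="application/xhtml+xml"/>
    <item id="ch1" href="text/ch1.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine>
    <itemref idref="ch1"/>
    <itemref idref="ch2"/>
  </spine>
</package>"""


def build_fixture() -> bytes:
    """Return the bytes of a tiny EPUB-like zip package (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry comes first and is stored uncompressed.
        zf.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML)
        zf.writestr("OEBPS/content.opf", CONTENT_OPF)
        for name in ("ch1", "ch2"):
            zf.writestr(f"OEBPS/text/{name}.xhtml", f"<html><body><p>{name}</p></body></html>")
    return buf.getvalue()


def spine_entries(epub_bytes: bytes) -> List[str]:
    """Return package-relative paths of content documents in spine order."""
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", OPF_NS)
    }
    base = posixpath.dirname(opf_path)  # manifest hrefs are relative to the OPF file
    return [
        posixpath.join(base, manifest[ref.attrib["idref"]])
        for ref in package.findall("opf:spine/opf:itemref", OPF_NS)
    ]


print(spine_entries(build_fixture()))
# prints ['OEBPS/text/ch1.xhtml', 'OEBPS/text/ch2.xhtml']
```

Building the fixture in code keeps it small and reviewable, while still exercising the real `container.xml` to OPF to spine lookup chain.
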

---

## First Milestone: EPUB3 Read Adapter

The first concrete implementation should normalize EPUB3 into Markitect markdown by:

* Reading `META-INF/container.xml`
* Parsing the OPF package document
* Preserving Dublin Core and package metadata
* Following spine reading order
* Resolving navigation/chapter labels
* Extracting body XHTML into markdown segments
* Preserving source hrefs, anchors, and section/page references where possible
* Classifying or skipping cover, navigation, table-of-contents, header, footer, license, and transcriber-note material through explicit policy
* Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets, and lossy extraction

This EPUB3 reader should be usable by `infospace-bench` to replace its current local EPUB intake spike.

---

## Maturity Target

A mature version of this repository should:

* Provide a suite of reliable source-format adapters that implement the Markitect source adapter contract
* Support clear inspection and normalization CLIs for each installed adapter
* Keep format dependencies modular and optional
* Produce canonical markdown outputs suitable for parsing, validation, indexing, workflow execution, and infospace generation
* Preserve provenance and diagnostics well enough for human review and agent-safe automation
* Serve as the default adapter package family for Markitect-based systems

---

## Stability Note

Changes to this file represent a deliberate shift in the role of `markitect-filter` within the Markitect ecosystem.

Such changes should be made with care, because this repository defines where format-specific source conversion lives relative to `markitect-tool`, `kontextual-engine`, and `infospace-bench`.