From 8d62b2d2417a1bc578c023c22b3f422411cebe47 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Thu, 14 May 2026 20:40:24 +0200
Subject: [PATCH] Added intent specification

---
 INTENT.md | 186 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 186 insertions(+)
 create mode 100644 INTENT.md

diff --git a/INTENT.md b/INTENT.md
new file mode 100644
index 0000000..4753b90
--- /dev/null
+++ b/INTENT.md
@@ -0,0 +1,186 @@
+# INTENT
+
+## Purpose
+
+This repository exists to provide **source-format adapters for converting
+heterogeneous content formats into canonical Markitect markdown representations,
+and eventually writing supported markdown representations back to selected
+formats where that is practical and safe**.
+
+It is the adapter layer between external document/media formats and the
+markdown-native tooling provided by `markitect-tool`.
+
+The first concrete target is **EPUB3 read support**.
+
+---
+
+## Primary Utility
+
+The repository provides concrete filters/adapters that:
+
+* Inspect source files and report useful format metadata
+* Read supported source formats into normalized markdown documents and segments
+* Preserve provenance from source paths, package entries, anchors, pages,
+  chapters, and metadata fields where available
+* Emit diagnostics about extraction quality, unsupported features, lossiness,
+  malformed sources, and skipped boilerplate
+* Keep optional format dependencies isolated from the `markitect-tool` core
+* Implement the adapter interfaces and markdown normalization contracts defined
+  by `markitect-tool`
+
+It turns external document formats into **structured markdown inputs that higher
+layers can trust, inspect, cache, validate, and transform**.
+
+Initial read targets may include:
+
+* EPUB3
+* HTML/article exports
+* plain text variants
+* PDF
+* DOCX
+* ODT
+
+Write/export targets should be added more cautiously, format by format, only
+when there is a clear contract for preserving structure and provenance.
+
+---
+
+## Intended Users
+
+* Developers building source-ingestion pipelines on top of Markitect
+* Automation systems (`atm`) normalizing local document corpora
+* LLM agents (`agt`) needing predictable markdown access to source documents
+* Higher-level repositories such as `infospace-bench` and `kontextual-engine`
+  that need reusable source normalization without owning format-specific code
+
+---
+
+## Strategic Role in the System
+
+This repository is part of the Markitect layered knowledge system:
+
+- `markitect-tool` → defines markdown structure, contracts, and adapter interfaces
+- **`markitect-filter`** → implements concrete source-format filters/adapters
+- `kontextual-engine` → makes knowledge persistent, retrievable, and operable
+- `infospace-bench` → builds concrete, evaluated infospaces from normalized sources
+
+The responsibility split is:
+
+* **Syntax contract layer** — `markitect-tool`
+  Defines canonical markdown models, adapter protocols, validation, parsing,
+  templating, and extension contracts.
+
+* **Format adapter layer** — `markitect-filter`
+  Implements format-specific read/write adapters that satisfy the Markitect
+  source adapter contract.
+
+* **Knowledge operations layer** — `kontextual-engine`
+  Persists, governs, indexes, retrieves, and orchestrates knowledge assets.
+
+* **Application layer** — `infospace-bench`
+  Applies normalized source content to concrete infospace workflows, evaluation,
+  generation, and review.
+
+This repository should remain a **format adapter library**, not a knowledge
+platform and not the canonical markdown core.
+
+---
+
+## Strategic Boundaries
+
+This repository is **not** intended to:
+
+* Define the core Markitect markdown AST, parser, schema system, or contract
+  model
+* Own the source adapter interface itself; that belongs in `markitect-tool`
+* Persist long-lived knowledge assets, permissions, indexes, or operational
+  state
+* Build domain-specific infospaces, entity models, relation models, or
+  evaluation workflows
+* Become a full document editor, publishing suite, OCR product, or general ETL
+  platform
+* Hide extraction uncertainty; lossiness and diagnostics should remain visible
+* Force heavyweight optional dependencies into projects that only need markdown
+  core functionality
+
+Format-specific complexity belongs here, but only behind clear Markitect
+contracts.
+
+---
+
+## Design Principles
+
+* **Adapters over platforms**
+  Provide focused source-format adapters, not an end-user document system.
+
+* **Markdown as the interchange surface**
+  Every read adapter should produce canonical structured markdown plus metadata,
+  provenance, diagnostics, and quality signals.
+
+* **Contract-first integration**
+  `markitect-tool` defines the adapter protocol; this repo implements it.
+
+* **Optional dependencies by format**
+  EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries.
+  Those dependencies should be isolated and discoverable.
+
+* **Traceability over pretty conversion**
+  The output should preserve where content came from, what was skipped, and what
+  may have been lost.
+
+* **Deterministic core behavior**
+  Source conversion should be deterministic by default. AI-assisted repair or
+  enrichment may exist later, but only as explicit optional behavior.
+
+* **Small fixtures, real formats**
+  Tests should use small fixtures that exercise real package structures,
+  metadata, ordering, anchors, malformed inputs, and boilerplate handling.
+
+---
+
+## First Milestone: EPUB3 Read Adapter
+
+The first concrete implementation should normalize EPUB3 into Markitect
+markdown by:
+
+* Reading `META-INF/container.xml`
+* Parsing the OPF package document
+* Preserving Dublin Core and package metadata
+* Following spine reading order
+* Resolving navigation/chapter labels
+* Extracting body XHTML into markdown segments
+* Preserving source hrefs, anchors, and section/page references where possible
+* Classifying or skipping cover, navigation, table-of-contents, header, footer,
+  license, and transcriber-note material through explicit policy
+* Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets,
+  and lossy extraction
+
+This EPUB3 reader should be usable by `infospace-bench` to replace its current
+local EPUB intake spike.
+
+---
+
+## Maturity Target
+
+A mature version of this repository should:
+
+* Provide a suite of reliable source-format adapters that implement the
+  Markitect source adapter contract
+* Support clear inspection and normalization CLIs for each installed adapter
+* Keep format dependencies modular and optional
+* Produce canonical markdown outputs suitable for parsing, validation, indexing,
+  workflow execution, and infospace generation
+* Preserve provenance and diagnostics well enough for human review and
+  agent-safe automation
+* Serve as the default adapter package family for Markitect-based systems
+
+---
+
+## Stability Note
+
+Changes to this file represent a deliberate shift in the role of
+`markitect-filter` within the Markitect ecosystem.
+
+Such changes should be made with care, because this repository defines where
+format-specific source conversion lives relative to `markitect-tool`,
+`kontextual-engine`, and `infospace-bench`.
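
The first four steps of the EPUB3 milestone (read `META-INF/container.xml`, parse the OPF package document, pick up Dublin Core metadata, follow the spine) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the adapter interface that `markitect-tool` defines: the helper name `read_spine_hrefs` and its return shape are hypothetical, only `dc:title` is read, hrefs are assumed not to be percent-encoded, and markdown extraction, navigation labels, boilerplate policy, and structured diagnostics are left out.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OCF container and the OPF package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_spine_hrefs(epub_source):
    """Return (title, hrefs): dc:title and spine documents in reading order.

    Hypothetical helper for illustration only; the name and return shape
    are assumptions, not an interface defined by markitect-tool.
    `epub_source` is a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(epub_source) as zf:
        # Step 1: container.xml points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None:
            raise ValueError("malformed EPUB: container.xml has no rootfile")
        opf_path = rootfile.attrib["full-path"]

        # Step 2: parse the OPF package document and read one metadata field.
        package = ET.fromstring(zf.read(opf_path))
        title = package.findtext("opf:metadata/dc:title", None, NS)

        # Step 3: map manifest ids to archive paths, resolved against the OPF dir.
        opf_dir = posixpath.dirname(opf_path)
        manifest = {
            item.attrib["id"]: posixpath.normpath(
                posixpath.join(opf_dir, item.attrib["href"])
            )
            for item in package.findall("opf:manifest/opf:item", NS)
        }

        # Step 4: follow spine order; itemrefs without a manifest entry are
        # dropped here, where a real adapter would emit a diagnostic instead.
        hrefs = [
            manifest[ref.attrib["idref"]]
            for ref in package.findall("opf:spine/opf:itemref", NS)
            if ref.attrib.get("idref") in manifest
        ]
    return title, hrefs
```

The returned paths are archive-relative entry names, so they can double as provenance for each extracted segment. Manifest order is deliberately ignored: reading order comes from the spine only.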