Added intent specification

2026-05-14 20:40:24 +02:00
parent 328c59b2c2
commit 8d62b2d241
1 changed file with 186 additions and 0 deletions
--- a/INTENT.md
+++ b/INTENT.md
@@ -0,0 +1,186 @@
+# INTENT
+
+## Purpose
+
+This repository exists to provide **source-format adapters that convert
+heterogeneous content formats into canonical Markitect markdown representations
+and, eventually, write supported markdown representations back to selected
+formats where that is practical and safe**.
+
+It is the adapter layer between external document/media formats and the
+markdown-native tooling provided by `markitect-tool`.
+
+The first concrete target is **EPUB3 read support**.
+
+---
+
+## Primary Utility
+
+The repository provides concrete filters/adapters that:
+
+* Inspect source files and report useful format metadata
+* Read supported source formats into normalized markdown documents and segments
+* Preserve provenance from source paths, package entries, anchors, pages,
+  chapters, and metadata fields where available
+* Emit diagnostics about extraction quality, unsupported features, lossiness,
+  malformed sources, and skipped boilerplate
+* Keep optional format dependencies isolated from the `markitect-tool` core
+* Implement the adapter interfaces and markdown normalization contracts defined
+  by `markitect-tool`
+
+It turns external document formats into **structured markdown inputs that higher
+layers can trust, inspect, cache, validate, and transform**.
+
+Initial read targets may include:
+
+* EPUB3
+* HTML/article exports
+* plain text variants
+* PDF
+* DOCX
+* ODT
+
+Write/export targets should be added more cautiously, format by format, only
+when there is a clear contract for preserving structure and provenance.
+
+---
+
+## Intended Users
+
+* Developers building source-ingestion pipelines on top of Markitect
+* Automation systems (`atm`) normalizing local document corpora
+* LLM agents (`agt`) needing predictable markdown access to source documents
+* Higher-level repositories such as `infospace-bench` and `kontextual-engine`
+  that need reusable source normalization without owning format-specific code
+
+---
+
+## Strategic Role in the System
+
+This repository is part of the Markitect layered knowledge system:
+
+- `markitect-tool`     → defines markdown structure, contracts, and adapter interfaces
+- **`markitect-filter`**   → implements concrete source-format filters/adapters
+- `kontextual-engine`  → makes knowledge persistent, retrievable, and operable
+- `infospace-bench`    → builds concrete, evaluated infospaces from normalized sources
+
+The responsibility split is:
+
+* **Syntax contract layer** — `markitect-tool`
+  Defines canonical markdown models, adapter protocols, validation, parsing,
+  templating, and extension contracts.
+
+* **Format adapter layer** — `markitect-filter`
+  Implements format-specific read/write adapters that satisfy the Markitect
+  source adapter contract.
+
+* **Knowledge operations layer** — `kontextual-engine`
+  Persists, governs, indexes, retrieves, and orchestrates knowledge assets.
+
+* **Application layer** — `infospace-bench`
+  Applies normalized source content to concrete infospace workflows, evaluation,
+  generation, and review.
+
+This repository should remain a **format adapter library**, not a knowledge
+platform and not the canonical markdown core.
+
+---
+
+## Strategic Boundaries
+
+This repository is **not** intended to:
+
+* Define the core Markitect markdown AST, parser, schema system, or contract
+  model
+* Own the source adapter interface itself; that belongs in `markitect-tool`
+* Persist long-lived knowledge assets, permissions, indexes, or operational
+  state
+* Build domain-specific infospaces, entity models, relation models, or
+  evaluation workflows
+* Become a full document editor, publishing suite, OCR product, or general ETL
+  platform
+* Hide extraction uncertainty; lossiness and diagnostics should remain visible
+* Force heavyweight optional dependencies into projects that only need markdown
+  core functionality
+
+Format-specific complexity belongs here, but only behind clear Markitect
+contracts.
+
+---
+
+## Design Principles
+
+* **Adapters over platforms**
+  Provide focused source-format adapters, not an end-user document system.
+
+* **Markdown as the interchange surface**
+  Every read adapter should produce canonical structured markdown plus metadata,
+  provenance, diagnostics, and quality signals.
+
+* **Contract-first integration**
+  `markitect-tool` defines the adapter protocol; this repo implements it.
+
+* **Optional dependencies by format**
+  EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries.
+  Those dependencies should be isolated and discoverable.
+
+* **Traceability over pretty conversion**
+  The output should preserve where content came from, what was skipped, and what
+  may have been lost.
+
+* **Deterministic core behavior**
+  Source conversion should be deterministic by default. AI-assisted repair or
+  enrichment may exist later, but only as explicit optional behavior.
+
+* **Small fixtures, real formats**
+  Tests should use small fixtures that exercise real package structures,
+  metadata, ordering, anchors, malformed inputs, and boilerplate handling.
+
+---
+
+## First Milestone: EPUB3 Read Adapter
+
+The first concrete implementation should normalize EPUB3 into Markitect
+markdown by:
+
+* Reading `META-INF/container.xml`
+* Parsing the OPF package document
+* Preserving Dublin Core and package metadata
+* Following spine reading order
+* Resolving navigation/chapter labels
+* Extracting body XHTML into markdown segments
+* Preserving source hrefs, anchors, and section/page references where possible
+* Classifying or skipping cover, navigation, table-of-contents, header, footer,
+  license, and transcriber-note material through explicit policy
+* Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets,
+  and lossy extraction
+
+This EPUB3 reader should be usable by `infospace-bench` to replace its current
+local EPUB intake spike.
+
+---
+
+## Maturity Target
+
+A mature version of this repository should:
+
+* Provide a suite of reliable source-format adapters that implement the
+  Markitect source adapter contract
+* Support clear inspection and normalization CLIs for each installed adapter
+* Keep format dependencies modular and optional
+* Produce canonical markdown outputs suitable for parsing, validation, indexing,
+  workflow execution, and infospace generation
+* Preserve provenance and diagnostics well enough for human review and
+  agent-safe automation
+* Serve as the default adapter package family for Markitect-based systems
+
+---
+
+## Stability Note
+
+Changes to this file represent a deliberate shift in the role of
+`markitect-filter` within the Markitect ecosystem.
+
+Such changes should be made with care, because this repository defines where
+format-specific source conversion lives relative to `markitect-tool`,
+`kontextual-engine`, and `infospace-bench`.
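
The first three steps listed under "First Milestone: EPUB3 Read Adapter" (read `META-INF/container.xml`, parse the OPF package document, follow spine reading order) can be sketched with the Python standard library alone. This is a minimal illustration, not the adapter contract: `SpineEntry`, `EpubOutline`, and `read_outline` are hypothetical names that do not come from `markitect-tool`, and the sketch skips details a real adapter would need, such as percent-decoding hrefs, navigation labels, and XHTML-to-markdown extraction.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from dataclasses import dataclass, field
from typing import List, Optional

# XML namespaces used by the OCF container file and the OPF package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


@dataclass
class SpineEntry:
    """One content document in reading order (hypothetical type)."""
    idref: str
    href: str  # path inside the EPUB archive, kept as provenance


@dataclass
class EpubOutline:
    """Package-level view of an EPUB (hypothetical type)."""
    opf_path: str
    title: Optional[str]
    spine: List[SpineEntry] = field(default_factory=list)
    diagnostics: List[str] = field(default_factory=list)


def read_outline(source) -> EpubOutline:
    """Read container.xml, locate the OPF, and resolve the spine order.

    `source` is a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None or not rootfile.get("full-path"):
            raise ValueError("container.xml does not declare a rootfile")
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    # Manifest hrefs are relative to the directory holding the OPF file.
    base = posixpath.dirname(opf_path)
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    outline = EpubOutline(
        opf_path=opf_path,
        title=package.findtext("opf:metadata/dc:title", namespaces=NS),
    )
    for itemref in package.findall("opf:spine/opf:itemref", NS):
        idref = itemref.get("idref")
        href = manifest.get(idref)
        if href is None:
            # Record the problem as a diagnostic instead of dropping it silently.
            outline.diagnostics.append(
                f"spine idref {idref!r} is missing from the manifest"
            )
            continue
        full = posixpath.normpath(posixpath.join(base, href))
        outline.spine.append(SpineEntry(idref=idref, href=full))
    return outline
```

Calling `read_outline("book.epub")` on a well-formed file would return the package title, the archive path of each spine document in reading order, and a list of diagnostics for spine entries that could not be resolved. Keeping the archive paths and the diagnostics on the result object mirrors the intent's emphasis on provenance and visible lossiness.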