Added intent specification

2026-05-14 20:40:24 +02:00
parent 328c59b2c2
commit 8d62b2d241

INTENT.md Normal file

@@ -0,0 +1,186 @@
# INTENT
## Purpose
This repository exists to provide **source-format adapters for converting
heterogeneous content formats into canonical Markitect markdown representations,
and eventually writing supported markdown representations back to selected
formats where that is practical and safe**.
It is the adapter layer between external document/media formats and the
markdown-native tooling provided by `markitect-tool`.
The first concrete target is **EPUB3 read support**.

---
## Primary Utility
The repository provides concrete filters/adapters that:
* Inspect source files and report useful format metadata
* Read supported source formats into normalized markdown documents and segments
* Preserve provenance from source paths, package entries, anchors, pages,
chapters, and metadata fields where available
* Emit diagnostics about extraction quality, unsupported features, lossiness,
malformed sources, and skipped boilerplate
* Keep optional format dependencies isolated from the `markitect-tool` core
* Implement the adapter interfaces and markdown normalization contracts defined
by `markitect-tool`
It turns external document formats into **structured markdown inputs that higher
layers can trust, inspect, cache, validate, and transform**.
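As a rough illustration of what such an adapter might hand to higher layers, the sketch below models a read result that carries markdown segments together with provenance and diagnostics. Every name in it (`Provenance`, `Diagnostic`, `Segment`, `ReadResult`, `SourceAdapter`, and their fields) is hypothetical; the actual adapter protocol is defined by `markitect-tool` and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional, Protocol


@dataclass(frozen=True)
class Provenance:
    """Where a segment came from (all fields are illustrative)."""
    source_path: str                      # file the content was read from
    package_entry: Optional[str] = None   # e.g. an entry inside an archive
    anchor: Optional[str] = None          # fragment identifier, if any


@dataclass(frozen=True)
class Diagnostic:
    severity: str   # "info", "warning", or "error"
    code: str       # hypothetical stable identifier, e.g. "skipped-asset"
    message: str


@dataclass
class Segment:
    markdown: str
    provenance: Provenance


@dataclass
class ReadResult:
    metadata: dict[str, str]
    segments: list[Segment] = field(default_factory=list)
    diagnostics: list[Diagnostic] = field(default_factory=list)

    @property
    def is_lossy(self) -> bool:
        # Anything above "info" signals that content may be missing or altered.
        return any(d.severity in ("warning", "error") for d in self.diagnostics)


class SourceAdapter(Protocol):
    """Assumed shape of an adapter: inspect a file, or read it fully."""
    def inspect(self, path: str) -> dict[str, str]: ...
    def read(self, path: str) -> ReadResult: ...


# Example: a result with one segment and one warning about a skipped asset.
result = ReadResult(metadata={"title": "Demo"})
result.segments.append(
    Segment("# Chapter 1", Provenance("book.epub", "OEBPS/ch1.xhtml", "c1"))
)
result.diagnostics.append(
    Diagnostic("warning", "skipped-asset", "cover image was not converted")
)
```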
Initial read targets may include:
* EPUB3
* HTML/article exports
* plain text variants
* PDF
* DOCX
* ODT
Write/export targets should be added more cautiously, format by format, only
when there is a clear contract for preserving structure and provenance.

---
## Intended Users
* Developers building source-ingestion pipelines on top of Markitect
* Automation systems (`atm`) normalizing local document corpora
* LLM agents (`agt`) needing predictable markdown access to source documents
* Higher-level repositories such as `infospace-bench` and `kontextual-engine`
that need reusable source normalization without owning format-specific code
---
## Strategic Role in the System
This repository is part of the Markitect layered knowledge system:
- `markitect-tool` → defines markdown structure, contracts, and adapter interfaces
- **`markitect-filter`** → implements concrete source-format filters/adapters
- `kontextual-engine` → makes knowledge persistent, retrievable, and operable
- `infospace-bench` → builds concrete, evaluated infospaces from normalized sources
The responsibility split is:
* **Syntax contract layer** — `markitect-tool`
Defines canonical markdown models, adapter protocols, validation, parsing,
templating, and extension contracts.
* **Format adapter layer** — `markitect-filter`
Implements format-specific read/write adapters that satisfy the Markitect
source adapter contract.
* **Knowledge operations layer** — `kontextual-engine`
Persists, governs, indexes, retrieves, and orchestrates knowledge assets.
* **Application layer** — `infospace-bench`
Applies normalized source content to concrete infospace workflows, evaluation,
generation, and review.
This repository should remain a **format adapter library**, not a knowledge
platform and not the canonical markdown core.

---
## Strategic Boundaries
This repository is **not** intended to:
* Define the core Markitect markdown AST, parser, schema system, or contract
model
* Own the source adapter interface itself; that belongs in `markitect-tool`
* Persist long-lived knowledge assets, permissions, indexes, or operational
state
* Build domain-specific infospaces, entity models, relation models, or
evaluation workflows
* Become a full document editor, publishing suite, OCR product, or general ETL
platform
* Hide extraction uncertainty; lossiness and diagnostics should remain visible
* Force heavyweight optional dependencies into projects that only need markdown
core functionality
Format-specific complexity belongs here, but only behind clear Markitect
contracts.

---
## Design Principles
* **Adapters over platforms**
Provide focused source-format adapters, not an end-user document system.
* **Markdown as the interchange surface**
Every read adapter should produce canonical structured markdown plus metadata,
provenance, diagnostics, and quality signals.
* **Contract-first integration**
`markitect-tool` defines the adapter protocol; this repo implements it.
* **Optional dependencies by format**
EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries.
Those dependencies should be isolated and discoverable.
* **Traceability over pretty conversion**
The output should preserve where content came from, what was skipped, and what
may have been lost.
* **Deterministic core behavior**
Source conversion should be deterministic by default. AI-assisted repair or
enrichment may exist later, but only as explicit optional behavior.
* **Small fixtures, real formats**
Tests should use small fixtures that exercise real package structures,
metadata, ordering, anchors, malformed inputs, and boilerplate handling.
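One possible way to honor the "Optional dependencies by format" principle above is to resolve each format's backend lazily, so that a missing library only matters when that format is actually requested. The sketch below is an assumption, not the repository's mechanism: `BACKENDS`, `MissingDependency`, `available_formats`, and `load_backend` are hypothetical names, the stdlib `zipfile` module stands in for an EPUB backend, and `hypothetical_pdf_backend` is a placeholder that is deliberately not a real package. Calling `load_backend("pdf")` here raises `MissingDependency` with a message naming the missing module, while `available_formats()` lists only the formats whose backends can be imported.

```python
import importlib
import importlib.util

# Hypothetical mapping from format name to the module its adapter needs.
BACKENDS = {
    "epub": "zipfile",                   # stdlib stand-in for a real backend
    "pdf": "hypothetical_pdf_backend",   # placeholder, not a real package
}


class MissingDependency(RuntimeError):
    """Raised when a format is known but its optional module is absent."""


def available_formats() -> list[str]:
    """Names of formats whose backend module can be found, without importing it."""
    return sorted(
        name
        for name, module_name in BACKENDS.items()
        if importlib.util.find_spec(module_name) is not None
    )


def load_backend(fmt: str):
    """Import and return the backend module for a format, on demand."""
    if fmt not in BACKENDS:
        raise KeyError(f"unknown format: {fmt!r}")
    module_name = BACKENDS[fmt]
    if importlib.util.find_spec(module_name) is None:
        raise MissingDependency(
            f"format {fmt!r} needs optional module {module_name!r}, "
            "which is not installed"
        )
    return importlib.import_module(module_name)
```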
---
## First Milestone: EPUB3 Read Adapter
The first concrete implementation should normalize EPUB3 into Markitect
markdown by:
* Reading `META-INF/container.xml`
* Parsing the OPF package document
* Preserving Dublin Core and package metadata
* Following spine reading order
* Resolving navigation/chapter labels
* Extracting body XHTML into markdown segments
* Preserving source hrefs, anchors, and section/page references where possible
* Classifying or skipping cover, navigation, table-of-contents, header, footer,
license, and transcriber-note material through explicit policy
* Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets,
and lossy extraction
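The first steps in the list above (locating the package document through `META-INF/container.xml`, reading Dublin Core metadata, and resolving spine order against the manifest) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the planned implementation: the function name `read_spine` and its return shape are hypothetical, percent-encoded hrefs and navigation documents are ignored, and diagnostics are plain strings rather than structured records.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the OCF container, the OPF package, and Dublin Core.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_spine(epub):
    """Return (metadata, spine_paths, diagnostics) for an EPUB archive.

    `epub` may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None or "full-path" not in rootfile.attrib:
            raise ValueError("container.xml does not name a package document")
        opf_path = rootfile.attrib["full-path"]
        package = ET.fromstring(zf.read(opf_path))

    # Keep the first value of each Dublin Core element (title, language, ...).
    metadata = {}
    dc_prefix = "{" + NS["dc"] + "}"
    meta_el = package.find("opf:metadata", NS)
    if meta_el is not None:
        for child in meta_el:
            if child.tag.startswith(dc_prefix) and child.text:
                metadata.setdefault(child.tag[len(dc_prefix):], child.text.strip())

    # The manifest maps item ids to hrefs relative to the package document.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
        if "id" in item.attrib and "href" in item.attrib
    }

    # The spine lists item ids in reading order; resolve each to an archive path.
    base = posixpath.dirname(opf_path)
    spine, diagnostics = [], []
    for itemref in package.findall("opf:spine/opf:itemref", NS):
        idref = itemref.attrib.get("idref")
        href = manifest.get(idref)
        if href is None:
            diagnostics.append(f"spine references unknown manifest id {idref!r}")
            continue
        spine.append(posixpath.normpath(posixpath.join(base, href)))
    return metadata, spine, diagnostics
```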
This EPUB3 reader should be usable by `infospace-bench` to replace its current
local EPUB intake spike.

---
## Maturity Target
A mature version of this repository should:
* Provide a suite of reliable source-format adapters that implement the
Markitect source adapter contract
* Support clear inspection and normalization CLIs for each installed adapter
* Keep format dependencies modular and optional
* Produce canonical markdown outputs suitable for parsing, validation, indexing,
workflow execution, and infospace generation
* Preserve provenance and diagnostics well enough for human review and
agent-safe automation
* Serve as the default adapter package family for Markitect-based systems
---
## Stability Note
Changes to this file represent a deliberate shift in the role of
`markitect-filter` within the Markitect ecosystem.
Such changes should be made with care, because this repository defines where
format-specific source conversion lives relative to `markitect-tool`,
`kontextual-engine`, and `infospace-bench`.