generated from coulomb/repo-seed
Added intent specification
186
INTENT.md
Normal file
@@ -0,0 +1,186 @@
# INTENT

## Purpose

This repository exists to provide **source-format adapters for converting
heterogeneous content formats into canonical Markitect markdown representations,
and eventually writing supported markdown representations back to selected
formats where that is practical and safe**.

It is the adapter layer between external document/media formats and the
markdown-native tooling provided by `markitect-tool`.

The first concrete target is **EPUB3 read support**.

---
## Primary Utility

The repository provides concrete filters/adapters that:

* Inspect source files and report useful format metadata
* Read supported source formats into normalized markdown documents and segments
* Preserve provenance from source paths, package entries, anchors, pages,
  chapters, and metadata fields where available
* Emit diagnostics about extraction quality, unsupported features, lossiness,
  malformed sources, and skipped boilerplate
* Keep optional format dependencies isolated from the `markitect-tool` core
* Implement the adapter interfaces and markdown normalization contracts defined
  by `markitect-tool`

It turns external document formats into **structured markdown inputs that higher
layers can trust, inspect, cache, validate, and transform**.
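As a rough illustration of what "structured markdown inputs" with provenance and diagnostics might look like, the sketch below models a read result as plain dataclasses. Every class and field name here (`Provenance`, `Diagnostic`, `Segment`, `ReadResult`, `is_lossy`) is hypothetical; the real contracts are owned by `markitect-tool` and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Where a segment came from (all field names are illustrative)."""
    source_path: str
    package_entry: Optional[str] = None  # e.g. a file inside an EPUB archive
    anchor: Optional[str] = None         # fragment identifier, if any


@dataclass(frozen=True)
class Diagnostic:
    """One extraction finding; the severity vocabulary is an assumption."""
    severity: str  # "info", "warning", or "error"
    code: str      # short machine-readable label, e.g. "skipped-boilerplate"
    message: str


@dataclass
class Segment:
    """A chunk of normalized markdown plus its provenance."""
    markdown: str
    provenance: Provenance


@dataclass
class ReadResult:
    """What a read adapter might hand to higher layers."""
    metadata: dict[str, str] = field(default_factory=dict)
    segments: list[Segment] = field(default_factory=list)
    diagnostics: list[Diagnostic] = field(default_factory=list)

    def is_lossy(self) -> bool:
        # Treat any warning or error as a sign that content may have been lost.
        return any(d.severity in ("warning", "error") for d in self.diagnostics)
```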

Initial read targets may include:

* EPUB3
* HTML/article exports
* plain text variants
* PDF
* DOCX
* ODT

Write/export targets should be added more cautiously, format by format, only
when there is a clear contract for preserving structure and provenance.

---
## Intended Users

* Developers building source-ingestion pipelines on top of Markitect
* Automation systems (`atm`) normalizing local document corpora
* LLM agents (`agt`) needing predictable markdown access to source documents
* Higher-level repositories such as `infospace-bench` and `kontextual-engine`
  that need reusable source normalization without owning format-specific code

---
## Strategic Role in the System

This repository is part of the Markitect layered knowledge system:

- `markitect-tool` → defines markdown structure, contracts, and adapter interfaces
- **`markitect-filter`** → implements concrete source-format filters/adapters
- `kontextual-engine` → makes knowledge persistent, retrievable, and operable
- `infospace-bench` → builds concrete, evaluated infospaces from normalized sources

The responsibility split is:

* **Syntax contract layer** — `markitect-tool`
  Defines canonical markdown models, adapter protocols, validation, parsing,
  templating, and extension contracts.
* **Format adapter layer** — `markitect-filter`
  Implements format-specific read/write adapters that satisfy the Markitect
  source adapter contract.
* **Knowledge operations layer** — `kontextual-engine`
  Persists, governs, indexes, retrieves, and orchestrates knowledge assets.
* **Application layer** — `infospace-bench`
  Applies normalized source content to concrete infospace workflows, evaluation,
  generation, and review.

This repository should remain a **format adapter library**, not a knowledge
platform and not the canonical markdown core.

---
## Strategic Boundaries

This repository is **not** intended to:

* Define the core Markitect markdown AST, parser, schema system, or contract
  model
* Own the source adapter interface itself; that belongs in `markitect-tool`
* Persist long-lived knowledge assets, permissions, indexes, or operational
  state
* Build domain-specific infospaces, entity models, relation models, or
  evaluation workflows
* Become a full document editor, publishing suite, OCR product, or general ETL
  platform
* Hide extraction uncertainty; lossiness and diagnostics should remain visible
* Force heavyweight optional dependencies into projects that only need markdown
  core functionality

Format-specific complexity belongs here, but only behind clear Markitect
contracts.

---
## Design Principles

* **Adapters over platforms**
  Provide focused source-format adapters, not an end-user document system.

* **Markdown as the interchange surface**
  Every read adapter should produce canonical structured markdown plus metadata,
  provenance, diagnostics, and quality signals.

* **Contract-first integration**
  `markitect-tool` defines the adapter protocol; this repo implements it.

* **Optional dependencies by format**
  EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries.
  Those dependencies should be isolated and discoverable.

* **Traceability over pretty conversion**
  The output should preserve where content came from, what was skipped, and what
  may have been lost.

* **Deterministic core behavior**
  Source conversion should be deterministic by default. AI-assisted repair or
  enrichment may exist later, but only as explicit optional behavior.

* **Small fixtures, real formats**
  Tests should use small fixtures that exercise real package structures,
  metadata, ordering, anchors, malformed inputs, and boilerplate handling.
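One way the "optional dependencies by format" principle could be made isolated and discoverable is to probe for each adapter's libraries without importing them. The sketch below is only an illustration under assumptions: the `FORMAT_REQUIREMENTS` table, both function names, and the third-party module names listed are placeholders, not decisions recorded in this document.

```python
from importlib.util import find_spec

# Hypothetical mapping from format name to the top-level modules its adapter
# would need. The third-party names are placeholders, not project decisions.
FORMAT_REQUIREMENTS = {
    "epub3": ["zipfile"],  # standard library only in this sketch
    "pdf": ["pypdf"],
    "docx": ["docx"],
    "odt": ["odf"],
}


def missing_modules(fmt: str) -> list[str]:
    """Return the modules a format needs that cannot be found here."""
    return [name for name in FORMAT_REQUIREMENTS[fmt] if find_spec(name) is None]


def available_formats() -> list[str]:
    """List formats whose dependencies are all installed.

    find_spec only locates top-level modules; it does not import them, so
    probing stays cheap even when a heavy library is present.
    """
    return sorted(fmt for fmt in FORMAT_REQUIREMENTS if not missing_modules(fmt))
```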

---

## First Milestone: EPUB3 Read Adapter

The first concrete implementation should normalize EPUB3 into Markitect
markdown by:

* Reading `META-INF/container.xml`
* Parsing the OPF package document
* Preserving Dublin Core and package metadata
* Following spine reading order
* Resolving navigation/chapter labels
* Extracting body XHTML into markdown segments
* Preserving source hrefs, anchors, and section/page references where possible
* Classifying or skipping cover, navigation, table-of-contents, header, footer,
  license, and transcriber-note material through explicit policy
* Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets,
  and lossy extraction
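The first four steps in the list above (container lookup, OPF parsing, Dublin Core metadata, spine order) can be sketched with the Python standard library alone. This is a minimal illustration, not the planned adapter: the function name `read_spine` and its return shape are assumptions, and it ignores navigation documents, percent-encoded hrefs, XHTML conversion, boilerplate policy, and most malformed-input handling.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container, OPF package, and Dublin Core.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_spine(epub) -> tuple[dict[str, str], list[str]]:
    """Return (Dublin Core metadata, archive paths in spine reading order).

    `epub` may be a filesystem path or a binary file object.
    """
    with zipfile.ZipFile(epub) as zf:
        # Step 1: container.xml points at the OPF package document.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None:
            raise ValueError("container.xml has no rootfile entry")
        opf_path = rootfile.attrib["full-path"]
        # Step 2: parse the package document itself.
        package = ET.fromstring(zf.read(opf_path))

    # Step 3: keep the first value of each Dublin Core element.
    dc_prefix = "{" + NS["dc"] + "}"
    metadata: dict[str, str] = {}
    meta_el = package.find("opf:metadata", NS)
    for el in (meta_el if meta_el is not None else []):
        if el.tag.startswith(dc_prefix) and el.text:
            metadata.setdefault(el.tag[len(dc_prefix):], el.text.strip())

    # Step 4: map manifest ids to hrefs, then walk the spine in order.
    manifest = {
        item.attrib["id"]: item.attrib["href"]
        for item in package.findall("opf:manifest/opf:item", NS)
        if "id" in item.attrib and "href" in item.attrib
    }
    base = posixpath.dirname(opf_path)  # hrefs are relative to the OPF file
    spine = []
    for ref in package.findall("opf:spine/opf:itemref", NS):
        href = manifest.get(ref.attrib.get("idref", ""))
        if href is not None:
            spine.append(posixpath.normpath(posixpath.join(base, href)))
    return metadata, spine
```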

This EPUB3 reader should be usable by `infospace-bench` to replace its current
local EPUB intake spike.

---
## Maturity Target

A mature version of this repository should:

* Provide a suite of reliable source-format adapters that implement the
  Markitect source adapter contract
* Support clear inspection and normalization CLIs for each installed adapter
* Keep format dependencies modular and optional
* Produce canonical markdown outputs suitable for parsing, validation, indexing,
  workflow execution, and infospace generation
* Preserve provenance and diagnostics well enough for human review and
  agent-safe automation
* Serve as the default adapter package family for Markitect-based systems

---
## Stability Note

Changes to this file represent a deliberate shift in the role of
`markitect-filter` within the Markitect ecosystem.

Such changes should be made with care, because this repository defines where
format-specific source conversion lives relative to `markitect-tool`,
`kontextual-engine`, and `infospace-bench`.