# INTENT

## Purpose

This repository exists to provide **source-format adapters for converting
heterogeneous content formats into canonical Markitect markdown representations,
and eventually writing supported markdown representations back to selected
formats where that is practical and safe**.

It is the adapter layer between external document/media formats and the
markdown-native tooling provided by `markitect-tool`.

The first concrete target is **EPUB3 read support**.

---

## Primary Utility

The repository provides concrete filters/adapters that:

* Inspect source files and report useful format metadata
* Read supported source formats into normalized markdown documents and segments
* Preserve provenance from source paths, package entries, anchors, pages,
  chapters, and metadata fields where available
* Emit diagnostics about extraction quality, unsupported features, lossiness,
  malformed sources, and skipped boilerplate
* Keep optional format dependencies isolated from the `markitect-tool` core
* Implement the adapter interfaces and markdown normalization contracts defined
  by `markitect-tool`

It turns external document formats into **structured markdown inputs that higher
layers can trust, inspect, cache, validate, and transform**.
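As a rough illustration of the responsibilities listed above, the following Python sketch shows one way an adapter could expose inspection, reading, provenance, and diagnostics. The real adapter interface is defined by `markitect-tool` and is not shown in this document, so every name here (`SourceAdapter`, `ReadResult`, `Segment`, `Provenance`, `Diagnostic`, `PlainTextAdapter`) is a hypothetical stand-in, and plain text is used only because it needs no optional dependency.

```python
import re
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional, Protocol


@dataclass(frozen=True)
class Provenance:
    """Where a segment came from (hypothetical shape)."""
    source_path: str
    anchor: Optional[str] = None  # e.g. a chapter id, page, or paragraph index


@dataclass(frozen=True)
class Diagnostic:
    """A visible note about extraction quality (hypothetical shape)."""
    severity: str  # "info", "warning", or "error"
    message: str


@dataclass(frozen=True)
class Segment:
    markdown: str
    provenance: Provenance


@dataclass
class ReadResult:
    segments: list = field(default_factory=list)
    diagnostics: list = field(default_factory=list)


class SourceAdapter(Protocol):
    """Stand-in for the adapter interface that `markitect-tool` owns."""
    def inspect(self, path: Path) -> dict: ...
    def read(self, path: Path) -> ReadResult: ...


class PlainTextAdapter:
    """Toy adapter: each blank-line-separated paragraph becomes one segment."""

    def inspect(self, path: Path) -> dict:
        return {"format": "text/plain", "size_bytes": str(path.stat().st_size)}

    def read(self, path: Path) -> ReadResult:
        result = ReadResult()
        raw = path.read_bytes()
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Keep going, but make the lossiness visible instead of hiding it.
            text = raw.decode("utf-8", errors="replace")
            result.diagnostics.append(
                Diagnostic("warning", "invalid UTF-8 replaced; extraction is lossy")
            )
        paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
        for number, paragraph in enumerate(paragraphs, start=1):
            provenance = Provenance(str(path), anchor=f"paragraph-{number}")
            result.segments.append(Segment(paragraph, provenance))
        return result
```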

Initial read targets may include:

* EPUB3
* HTML/article exports
* plain text variants
* PDF
* DOCX
* ODT

Write/export targets should be added more cautiously, format by format, only
when there is a clear contract for preserving structure and provenance.

---

## Intended Users

* Developers building source-ingestion pipelines on top of Markitect
* Automation systems (`atm`) normalizing local document corpora
* LLM agents (`agt`) needing predictable markdown access to source documents
* Higher-level repositories such as `infospace-bench` and `kontextual-engine`
  that need reusable source normalization without owning format-specific code

---

## Strategic Role in the System

This repository is part of the Markitect layered knowledge system:

- `markitect-tool` → defines markdown structure, contracts, and adapter interfaces
- **`markitect-filter`** → implements concrete source-format filters/adapters
- `kontextual-engine` → makes knowledge persistent, retrievable, and operable
- `infospace-bench` → builds concrete, evaluated infospaces from normalized sources

The responsibility split is:

* **Syntax contract layer** — `markitect-tool`
  Defines canonical markdown models, adapter protocols, validation, parsing,
  templating, and extension contracts.

* **Format adapter layer** — `markitect-filter`
  Implements format-specific read/write adapters that satisfy the Markitect
  source adapter contract.

* **Knowledge operations layer** — `kontextual-engine`
  Persists, governs, indexes, retrieves, and orchestrates knowledge assets.

* **Application layer** — `infospace-bench`
  Applies normalized source content to concrete infospace workflows, evaluation,
  generation, and review.

This repository should remain a **format adapter library**, not a knowledge
platform and not the canonical markdown core.

---

## Strategic Boundaries

This repository is **not** intended to:

* Define the core Markitect markdown AST, parser, schema system, or contract
  model
* Own the source adapter interface itself; that belongs in `markitect-tool`
* Persist long-lived knowledge assets, permissions, indexes, or operational
  state
* Build domain-specific infospaces, entity models, relation models, or
  evaluation workflows
* Become a full document editor, publishing suite, OCR product, or general ETL
  platform
* Hide extraction uncertainty; lossiness and diagnostics should remain visible
* Force heavyweight optional dependencies into projects that only need markdown
  core functionality

Format-specific complexity belongs here, but only behind clear Markitect
contracts.

---

## Design Principles

* **Adapters over platforms**
  Provide focused source-format adapters, not an end-user document system.

* **Markdown as the interchange surface**
  Every read adapter should produce canonical structured markdown plus metadata,
  provenance, diagnostics, and quality signals.

* **Contract-first integration**
  `markitect-tool` defines the adapter protocol; this repo implements it.

* **Optional dependencies by format**
  EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries.
  Those dependencies should be isolated and discoverable.

* **Traceability over pretty conversion**
  The output should preserve where content came from, what was skipped, and what
  may have been lost.

* **Deterministic core behavior**
  Source conversion should be deterministic by default. AI-assisted repair or
  enrichment may exist later, but only as explicit optional behavior.

* **Small fixtures, real formats**
  Tests should use small fixtures that exercise real package structures,
  metadata, ordering, anchors, malformed inputs, and boilerplate handling.
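The "optional dependencies by format" principle could be approached along the following lines. This is a minimal sketch, not the project's mechanism: the `REQUIREMENTS` table, the function names, and the module name `markitect_hypothetical_pdf_backend` are all invented for illustration. The only library call it relies on is the standard `importlib.util.find_spec`, which reports whether a module can be found without importing the adapter code that needs it.

```python
from importlib.util import find_spec

# Hypothetical table: format name -> modules its adapter would need.
# The PDF entry is a placeholder name, not a real dependency.
REQUIREMENTS = {
    "epub": ("zipfile", "xml.etree.ElementTree"),  # standard library only
    "pdf": ("markitect_hypothetical_pdf_backend",),
}


def missing_modules(fmt):
    """Return the modules that `fmt` needs but that are not installed."""
    return [name for name in REQUIREMENTS[fmt] if find_spec(name) is None]


def available_formats():
    """Make installed adapters discoverable without importing any of them."""
    return sorted(fmt for fmt in REQUIREMENTS if not missing_modules(fmt))


def require_format(fmt):
    """Fail early, naming what is missing, instead of at first use."""
    if fmt not in REQUIREMENTS:
        raise KeyError(f"unknown format: {fmt!r}")
    missing = missing_modules(fmt)
    if missing:
        raise RuntimeError(
            f"format {fmt!r} is not available; missing modules: {', '.join(missing)}"
        )


# Usage: list what can be read in this environment before choosing an adapter.
print(available_formats())
```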

---

## First Milestone: EPUB3 Read Adapter

The first concrete implementation should normalize EPUB3 into Markitect
markdown by:

* Reading `META-INF/container.xml`
* Parsing the OPF package document
* Preserving Dublin Core and package metadata
* Following spine reading order
* Resolving navigation/chapter labels
* Extracting body XHTML into markdown segments
* Preserving source hrefs, anchors, and section/page references where possible
* Classifying or skipping cover, navigation, table-of-contents, header, footer,
  license, and transcriber-note material through explicit policy
* Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets,
  and lossy extraction
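The first four steps above (container lookup, OPF parsing, Dublin Core metadata, spine order) can be sketched with the Python standard library alone. This is an illustration under stated assumptions rather than the adapter's implementation: the function name `read_spine`, its return shape, and the plain-string diagnostics are hypothetical. Navigation labels, XHTML-to-markdown conversion, and boilerplate policy are deliberately left out.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET
from urllib.parse import unquote

CONTAINER_NS = "urn:oasis:names:tc:opendocument:xmlns:container"
OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"
NS = {"c": CONTAINER_NS, "opf": OPF_NS}


def read_spine(epub_file):
    """Return (metadata, spine_entries, diagnostics) for an EPUB archive.

    `epub_file` is a path or a binary file-like object. The name and the
    return shape are illustrative, not a Markitect contract.
    """
    diagnostics = []
    with zipfile.ZipFile(epub_file) as archive:
        # Step 1: container.xml names the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf_path = rootfile.get("full-path") if rootfile is not None else None
        if not opf_path:
            raise ValueError("container.xml does not name a package document")
        # Step 2: parse the OPF package document.
        package = ET.fromstring(archive.read(opf_path))

    # Step 3: keep Dublin Core metadata, allowing repeated fields.
    metadata = {}
    metadata_element = package.find("opf:metadata", NS)
    for child in (metadata_element if metadata_element is not None else []):
        if child.tag.startswith("{" + DC_NS + "}") and child.text:
            name = child.tag.split("}", 1)[1]
            metadata.setdefault(name, []).append(child.text.strip())

    # Step 4: follow the spine, resolving each idref through the manifest.
    manifest = {
        item.get("id"): item.get("href")
        for item in package.findall("opf:manifest/opf:item", NS)
    }
    base = posixpath.dirname(opf_path)
    spine_entries = []
    for itemref in package.findall("opf:spine/opf:itemref", NS):
        href = manifest.get(itemref.get("idref"))
        if href is None:
            # Report the problem instead of silently dropping the entry.
            diagnostics.append(f"spine idref {itemref.get('idref')!r} not in manifest")
            continue
        # Manifest hrefs are relative to the OPF file and may be URL-encoded.
        spine_entries.append(posixpath.normpath(posixpath.join(base, unquote(href))))
    return metadata, spine_entries, diagnostics
```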

This EPUB3 reader should be usable by `infospace-bench` to replace its current
local EPUB intake spike.

---

## Maturity Target

A mature version of this repository should:

* Provide a suite of reliable source-format adapters that implement the
  Markitect source adapter contract
* Support clear inspection and normalization CLIs for each installed adapter
* Keep format dependencies modular and optional
* Produce canonical markdown outputs suitable for parsing, validation, indexing,
  workflow execution, and infospace generation
* Preserve provenance and diagnostics well enough for human review and
  agent-safe automation
* Serve as the default adapter package family for Markitect-based systems

---

## Stability Note

Changes to this file represent a deliberate shift in the role of
`markitect-filter` within the Markitect ecosystem.

Such changes should be made with care, because this repository defines where
format-specific source conversion lives relative to `markitect-tool`,
`kontextual-engine`, and `infospace-bench`.