markitect-filter/INTENT.md
2026-05-14 20:40:24 +02:00

INTENT

Purpose

This repository exists to provide source-format adapters for converting heterogeneous content formats into canonical Markitect markdown representations, and eventually writing supported markdown representations back to selected formats where that is practical and safe.

It is the adapter layer between external document/media formats and the markdown-native tooling provided by markitect-tool.

The first concrete target is EPUB3 read support.


Primary Utility

The repository provides concrete filters/adapters that:

  • Inspect source files and report useful format metadata
  • Read supported source formats into normalized markdown documents and segments
  • Preserve provenance from source paths, package entries, anchors, pages, chapters, and metadata fields where available
  • Emit diagnostics about extraction quality, unsupported features, lossiness, malformed sources, and skipped boilerplate
  • Keep optional format dependencies isolated from the markitect-tool core
  • Implement the adapter interfaces and markdown normalization contracts defined by markitect-tool

It turns external document formats into structured markdown inputs that higher layers can trust, inspect, cache, validate, and transform.
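As an illustration of what a read adapter might hand to higher layers, here is a minimal sketch of a result shape that carries segments, provenance, and diagnostics together. Every name in it (`Provenance`, `Diagnostic`, `Segment`, `ReadResult`, `read_plain_text`) is hypothetical; the actual adapter interfaces are defined by markitect-tool and may look quite different.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    """Where a segment came from (hypothetical fields)."""
    source_path: str                      # path of the source file
    package_entry: Optional[str] = None   # e.g. an entry inside an archive
    anchor: Optional[str] = None          # fragment or block identifier


@dataclass(frozen=True)
class Diagnostic:
    severity: str   # "info", "warning", or "error"
    code: str       # short machine-readable code
    message: str


@dataclass
class Segment:
    markdown: str
    provenance: Provenance


@dataclass
class ReadResult:
    metadata: dict[str, str] = field(default_factory=dict)
    segments: list[Segment] = field(default_factory=list)
    diagnostics: list[Diagnostic] = field(default_factory=list)

    def is_lossy(self) -> bool:
        """True if any diagnostic is more severe than informational."""
        return any(d.severity != "info" for d in self.diagnostics)


def read_plain_text(path: str, text: str) -> ReadResult:
    """Toy adapter: split text on blank lines, one segment per block."""
    result = ReadResult(metadata={"format": "text/plain"})
    for index, block in enumerate(text.split("\n\n")):
        block = block.strip()
        if not block:
            # Skipped content stays visible as a diagnostic.
            result.diagnostics.append(
                Diagnostic("info", "empty-block", f"skipped empty block {index}")
            )
            continue
        result.segments.append(
            Segment(block, Provenance(path, anchor=f"block-{index}"))
        )
    return result
```

The point of the sketch is that skipped or lossy content is recorded next to the output rather than silently dropped, so a caller can decide whether the result is trustworthy enough for its purpose.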

Initial read targets may include:

  • EPUB3
  • HTML/article exports
  • plain text variants
  • PDF
  • DOCX
  • ODT

Write/export targets should be added more cautiously, format by format, only when there is a clear contract for preserving structure and provenance.


Intended Users

  • Developers building source-ingestion pipelines on top of Markitect
  • Automation systems (atm) normalizing local document corpora
  • LLM agents (agt) needing predictable markdown access to source documents
  • Higher-level repositories such as infospace-bench and kontextual-engine that need reusable source normalization without owning format-specific code

Strategic Role in the System

This repository is part of the Markitect layered knowledge system:

  • markitect-tool → defines markdown structure, contracts, and adapter interfaces
  • markitect-filter → implements concrete source-format filters/adapters
  • kontextual-engine → makes knowledge persistent, retrievable, and operable
  • infospace-bench → builds concrete, evaluated infospaces from normalized sources

The responsibility split is:

  • Syntax contract layer (markitect-tool): Defines canonical markdown models, adapter protocols, validation, parsing, templating, and extension contracts.

  • Format adapter layer (markitect-filter): Implements format-specific read/write adapters that satisfy the Markitect source adapter contract.

  • Knowledge operations layer (kontextual-engine): Persists, governs, indexes, retrieves, and orchestrates knowledge assets.

  • Application layer (infospace-bench): Applies normalized source content to concrete infospace workflows, evaluation, generation, and review.

This repository should remain a format adapter library, not a knowledge platform and not the canonical markdown core.


Strategic Boundaries

This repository is not intended to:

  • Define the core Markitect markdown AST, parser, schema system, or contract model
  • Own the source adapter interface itself; that belongs in markitect-tool
  • Persist long-lived knowledge assets, permissions, indexes, or operational state
  • Build domain-specific infospaces, entity models, relation models, or evaluation workflows
  • Become a full document editor, publishing suite, OCR product, or general ETL platform
  • Hide extraction uncertainty; lossiness and diagnostics should remain visible
  • Force heavyweight optional dependencies into projects that only need markdown core functionality

Format-specific complexity belongs here, but only behind clear Markitect contracts.


Design Principles

  • Adapters over platforms: Provide focused source-format adapters, not an end-user document system.

  • Markdown as the interchange surface: Every read adapter should produce canonical structured markdown plus metadata, provenance, diagnostics, and quality signals.

  • Contract-first integration: markitect-tool defines the adapter protocol; this repo implements it.

  • Optional dependencies by format: EPUB, PDF, DOCX, ODT, and other adapters may depend on different libraries. Those dependencies should be isolated and discoverable.

  • Traceability over pretty conversion: The output should preserve where content came from, what was skipped, and what may have been lost.

  • Deterministic core behavior: Source conversion should be deterministic by default. AI-assisted repair or enrichment may exist later, but only as explicit optional behavior.

  • Small fixtures, real formats: Tests should use small fixtures that exercise real package structures, metadata, ordering, anchors, malformed inputs, and boilerplate handling.
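One possible way to keep format dependencies isolated and discoverable is to check for each adapter's required modules without importing them. The sketch below assumes a hypothetical `ADAPTER_REQUIREMENTS` table; the format names and the placeholder module name are illustrative only and do not describe the repository's actual packaging.

```python
import importlib.util

# Hypothetical table: format name -> top-level modules the adapter needs.
# "markitect_hypothetical_pdf_backend" is a placeholder, not a real package.
ADAPTER_REQUIREMENTS = {
    "epub3": ("zipfile", "xml"),
    "pdf": ("markitect_hypothetical_pdf_backend",),
}


def missing_modules(format_name: str) -> list[str]:
    """Return the required modules that cannot be found for one format."""
    missing = []
    for module_name in ADAPTER_REQUIREMENTS[format_name]:
        try:
            spec = importlib.util.find_spec(module_name)
        except (ModuleNotFoundError, ValueError):
            spec = None
        if spec is None:
            missing.append(module_name)
    return missing


def available_formats() -> list[str]:
    """List formats whose dependencies are all installed, in sorted order."""
    return sorted(
        name for name in ADAPTER_REQUIREMENTS if not missing_modules(name)
    )
```

Because `importlib.util.find_spec` only locates a module and does not execute it, a project that needs just one adapter never pays the import cost of the others, and a missing dependency can be reported as a diagnostic instead of an import-time crash.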


First Milestone: EPUB3 Read Adapter

The first concrete implementation should normalize EPUB3 into Markitect markdown by:

  • Reading META-INF/container.xml
  • Parsing the OPF package document
  • Preserving Dublin Core and package metadata
  • Following spine reading order
  • Resolving navigation/chapter labels
  • Extracting body XHTML into markdown segments
  • Preserving source hrefs, anchors, and section/page references where possible
  • Classifying or skipping cover, navigation, table-of-contents, header, footer, license, and transcriber-note material through explicit policy
  • Reporting diagnostics for malformed EPUBs, unsupported media, skipped assets, and lossy extraction

This EPUB3 reader should be usable by infospace-bench to replace its current local EPUB intake spike.
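The first few steps in the list above (container lookup, OPF parsing, Dublin Core metadata, spine order, and diagnostics for broken references) can be sketched with the Python standard library alone. This is a simplified illustration rather than the planned implementation: the function name `read_spine` and its return shape are assumptions, diagnostics are plain strings, hrefs are not percent-decoded, and navigation labels, XHTML-to-markdown conversion, and boilerplate policy are left out.

```python
import posixpath
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container and OPF package documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_spine(epub_file):
    """Return (metadata, spine_entries, diagnostics) for an EPUB archive.

    epub_file may be a filesystem path or a binary file-like object.
    """
    diagnostics = []
    with zipfile.ZipFile(epub_file) as archive:
        # 1. META-INF/container.xml points at the OPF package document.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None:
            raise ValueError("container.xml declares no rootfile")
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(archive.read(opf_path))

        # 2. Keep a few Dublin Core fields when they are present.
        metadata = {}
        for name in ("title", "creator", "language", "identifier"):
            node = package.find(f"opf:metadata/dc:{name}", NS)
            if node is not None and node.text:
                metadata[name] = node.text.strip()

        # 3. The manifest maps item ids to hrefs relative to the OPF file.
        manifest = {
            item.get("id"): item.get("href")
            for item in package.findall("opf:manifest/opf:item", NS)
        }
        base = posixpath.dirname(opf_path)
        entries = set(archive.namelist())

        # 4. The spine gives reading order; report anything unresolvable.
        spine = []
        for itemref in package.findall("opf:spine/opf:itemref", NS):
            idref = itemref.get("idref")
            href = manifest.get(idref)
            if href is None:
                diagnostics.append(f"spine idref {idref!r} not in manifest")
                continue
            entry = posixpath.normpath(posixpath.join(base, href))
            if entry not in entries:
                diagnostics.append(f"manifest href {href!r} missing from archive")
                continue
            spine.append(entry)
    return metadata, spine, diagnostics
```

Returning archive entry names in spine order gives later stages a stable key for provenance, since each extracted markdown segment can point back to the exact package entry it came from.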


Maturity Target

A mature version of this repository should:

  • Provide a suite of reliable source-format adapters that implement the Markitect source adapter contract
  • Support clear inspection and normalization CLIs for each installed adapter
  • Keep format dependencies modular and optional
  • Produce canonical markdown outputs suitable for parsing, validation, indexing, workflow execution, and infospace generation
  • Preserve provenance and diagnostics well enough for human review and agent-safe automation
  • Serve as the default adapter package family for Markitect-based systems

Stability Note

Changes to this file represent a deliberate shift in the role of markitect-filter within the Markitect ecosystem.

Such changes should be made with care, because this repository defines where format-specific source conversion lives relative to markitect-tool, kontextual-engine, and infospace-bench.