Content References, Processors, and Literate Workflows
Date: 2026-05-04
Purpose
This note records the follow-up research after the first transform, compose,
and include implementation. The goal is to keep markitect-tool close to
Markdown while preserving the richer ideas that made markitect-main
interesting: reversible explode/implode, transclusion, processors, namespaces,
content references, and Knuth-style weave/tangle workflows.
Research Inputs
- WEB on CTAN and Knuth/Levy CWEB: literate source is processed in two directions, one for compilable source and one for readable documentation.
- noweb Hacker's Guide: language-independent literate programming benefits from a pipeline representation and named chunks that tools can extend.
- Org Babel: source blocks can be executable, parameterized, named, reused, tangled to files, and woven into reproducible documents.
- CommonMark fenced code blocks: fenced blocks are first-class Markdown structure and must be handled by the parser, not by naive global text rewrites.
- Asciidoctor include directives and tagged regions: includes need predictable base-dir resolution, safe-mode boundaries, line and tag selection, and source-code-region reuse.
- Sphinx literalinclude: code inclusion commonly needs line ranges, object-level extraction, highlighting metadata, dedent, and original line-number handling.
- DITA conref and conkeyref: content reuse becomes much stronger when references have IDs, keys, scoped indirection, validity checks, and clear attribute merge rules.
- W3C XInclude: inclusion should have an explicit processing model, target addressing, and fallback behavior.
- JSON-LD 1.1 contexts: namespaces can map short terms to stable global identifiers while retaining compact authoring.
- Python C3 MRO and CLOS concepts: multiple inheritance needs deterministic linearization, monotonicity, local precedence, and explicit method/slot combination rules.
- Pandoc filters: processors can be cleanly modeled as AST transformations over document nodes and code blocks.
Lessons
Markdown can carry a surprisingly rich system if the extra semantics are placed in stable, inspectable constructs:
- Frontmatter declares document-level identity, namespaces, and defaults.
- Headings and fenced blocks become addressable content units.
- Include/transclusion is a resolver over content references, not only file expansion.
- Processors operate on typed blocks and produce diagnostics, dependencies, generated content, or files.
- Weave/tangle is a special case of named content units plus processor targets.
- Explode/implode needs a manifest with source spans and stable IDs so the directory form is not a lossy export.
- Multiple inheritance is useful for document templates, regulatory overlays, style/persona overlays, and reusable content classes, but only if merge behavior is deterministic and diagnosable.
Use Cases
1. Reversible Large-Document Editing
An author explodes a long PRD/FRS into a directory, edits sections in separate files, then implodes it back into a canonical single Markdown document. The manifest preserves frontmatter policy, heading levels, ordering, source spans, and generated filenames.
2. Knuth-Style Markdown Weave/Tangle
A document explains a program in the order best for human understanding. Named code chunks are declared in fenced blocks, cross-reference each other, and tangle into one or more source files. The woven output keeps prose, chunk cross-links, and optionally generated indexes.
3. Executable Documentation Pipelines
Fenced blocks act as processors: shell, Python, SQL, validation, diagram, or custom processors can consume inputs, emit outputs, and record dependencies. Execution is optional and controlled; pure transforms remain deterministic.
4. Reusable Legal, Contract, and Product Clauses
Common clauses are defined once with stable IDs. Documents include them by namespace/key and can select variants by jurisdiction, customer type, language, or document class. Diagnostics explain missing keys and conflicting variants.
5. Source Snippet Documentation
Docs include code by tag, line range, parser object, or named block while preserving source line references. This supports API docs, changelog examples, and tutorials that stay aligned with source files.
6. Content Classes and Multiple Inheritance
A document can be treated as an instance of several content classes: for
example base:prd, market:enterprise, jurisdiction:eu, and
style:board-brief. Slot values, assertions, sections, and snippets resolve in
a deterministic order with explicit merge strategies.
7. Agent Context Packages
An agent can request a namespace, topic, chunk, section, or graph slice and get a bounded context package with provenance, dependencies, hashes, and security labels. This dovetails with later cache and memory work.
8. Security-Sensitive Knowledge Gateways
References and processor outputs carry labels. Policy can filter or redact content before transclusion, weaving, tangling, or context-package creation.
Architecture Blueprint
Content Unit Model
The parser should expose addressable units beyond the current document, section, and block lists:
- document
- frontmatter path
- section
- block
- fenced block
- named region
- named chunk
- processor result
Each unit should have:
- stable local ID
- optional global name
- source path and source span
- kind/type
- content hash
- dependency list
- labels/policy metadata
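As a rough sketch of how the fields above could be carried together, the following dataclass is one possible shape. The names `ContentUnit` and `SourceSpan` and all field names are assumptions for illustration, not the current parser API. The content hash is derived from the text rather than stored, so it cannot drift from the content.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Half-open line range [start_line, end_line) within a source file."""
    start_line: int
    end_line: int


@dataclass(frozen=True)
class ContentUnit:
    """Hypothetical shape for an addressable unit (illustration only)."""
    local_id: str                         # stable local ID
    kind: str                             # e.g. "section", "fenced_block", "named_chunk"
    source_path: str
    span: SourceSpan
    content: str
    global_name: str | None = None        # optional global name
    dependencies: tuple[str, ...] = ()    # IDs of units this one depends on
    labels: frozenset[str] = frozenset()  # labels/policy metadata

    @property
    def content_hash(self) -> str:
        # Hash the UTF-8 bytes so identical text always yields the same digest.
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()
```

Making the unit immutable keeps hashes and provenance records trustworthy once a unit has been handed to a processor.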
Reference Syntax
Keep Markdown readable and allow several levels of precision:
<!-- mkt:include ref="std:clauses/payment" -->
<!-- mkt:include path="sections/intro.md" selector="sections[heading=Summary]" -->
<!-- mkt:include ref="src:api#tag:create-user" mode="literal" -->
Frontmatter can define namespaces:
namespaces:
  std: ./standards/
  src: ../src/
  contract: ./contracts/
References should resolve through a single resolver API:
namespace + address + selector + mode + context -> resolved content unit(s)
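A minimal sketch of the first step of such a resolver, splitting a `ref` value and mapping its namespace onto a root from frontmatter, might look like this. The grammar `namespace:address#selector` is inferred from the examples above, and `Ref`, `parse_ref`, and `resolve_location` are hypothetical names. Include-boundary checks, modes, and context are deliberately left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Ref:
    namespace: str
    address: str
    selector: str | None = None


def parse_ref(ref: str) -> Ref:
    """Split 'namespace:address#selector' into its parts (assumed grammar)."""
    namespace, sep, rest = ref.partition(":")
    if not sep or not namespace or not rest:
        raise ValueError(f"reference needs 'namespace:address': {ref!r}")
    # Only the first '#' separates the selector, so 'tag:create-user' stays whole.
    address, _, selector = rest.partition("#")
    return Ref(namespace, address, selector or None)


def resolve_location(ref: Ref, namespaces: dict[str, str]) -> PurePosixPath:
    """Map a reference onto its namespace root; unknown namespaces are an error."""
    try:
        root = namespaces[ref.namespace]
    except KeyError:
        raise KeyError(f"unknown namespace: {ref.namespace!r}") from None
    return PurePosixPath(root) / ref.address
```

For example, `parse_ref("src:api#tag:create-user")` yields namespace `src`, address `api`, and selector `tag:create-user`. Turning the selector into concrete content units would be the job of the parser-backed layers described earlier.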
Region and Chunk Syntax
Use comments for regions so they can live inside Markdown or source files:
<!-- mkt:region id="overview" -->
Reusable content.
<!-- /mkt:region -->
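To illustrate how such markers could be located, here is a small sketch that collects region bodies by ID. It scans raw text with a regular expression, which is an assumption made for brevity: it would also match markers inside fenced code blocks and does not handle nested regions, so a real implementation should work on parser output instead. The function name `extract_regions` is hypothetical.

```python
import re

# Matches an opening marker, a lazily captured body, and the closing marker.
REGION_RE = re.compile(
    r'<!--\s*mkt:region\s+id="(?P<id>[^"]+)"\s*-->\n?'
    r'(?P<body>.*?)'
    r'<!--\s*/mkt:region\s*-->',
    re.DOTALL,
)


def extract_regions(text: str) -> dict[str, str]:
    """Return {region id: body text}; nested regions are not handled here."""
    regions: dict[str, str] = {}
    for match in REGION_RE.finditer(text):
        region_id = match.group("id")
        if region_id in regions:
            raise ValueError(f"duplicate region id: {region_id!r}")
        regions[region_id] = match.group("body")
    return regions
```

Rejecting duplicate IDs early keeps later references unambiguous.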
Use fenced blocks for executable or tangleable chunks:
```python {#load-config tangle="src/config.py"}
def load_config(path):
    return {}
```
Chunk references can stay close to noweb:
<<load-config>>
The processor layer decides whether chunk references are expanded during tangle, displayed during weave, or left literal.
Processor Registry
Processors should be pluggable but explicit. A processor receives:
- unit content and metadata
- resolver
- execution context
- policy context
- output target request
It returns:
- transformed content, generated files, or computed values
- diagnostics
- dependency edges
- provenance events
Core processors should start deterministic: include, region, explode/implode, tangle, weave, and simple text/Markdown transforms. Executing arbitrary code is a later, opt-in capability.
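The following sketch shows one way an explicit registry could look. It is simplified on purpose: a processor here receives only content and a metadata dict, while the resolver, execution context, policy context, and output target listed above are omitted. `ProcessorResult`, `ProcessorRegistry`, and `upper_processor` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProcessorResult:
    content: str
    diagnostics: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


# In this sketch a processor is any callable taking (content, metadata).
Processor = Callable[[str, dict], ProcessorResult]


class ProcessorRegistry:
    """Explicit name -> processor mapping; nothing is discovered implicitly."""

    def __init__(self) -> None:
        self._processors: dict[str, Processor] = {}

    def register(self, name: str, processor: Processor) -> None:
        if name in self._processors:
            raise ValueError(f"processor already registered: {name!r}")
        self._processors[name] = processor

    def run(self, name: str, content: str, metadata: dict) -> ProcessorResult:
        processor = self._processors.get(name)
        if processor is None:
            # Report unknown processors as a diagnostic rather than raising.
            return ProcessorResult(content, diagnostics=[f"unknown processor: {name}"])
        return processor(content, metadata)


def upper_processor(content: str, metadata: dict) -> ProcessorResult:
    """Toy deterministic transform used only for illustration."""
    return ProcessorResult(content.upper())
```

Returning diagnostics instead of raising for an unknown processor is a design choice in this sketch; it lets a pipeline collect every problem in a document in a single pass.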
Explode/Implode
Explode/implode should become a first-class reversible operation, not a loose directory export. The manifest should include:
- original path and hash
- variant type (flat, hierarchical, semantic)
- frontmatter preservation policy
- section/chunk/source-span entries
- file paths and order
- heading-level policy
- warnings and lossless round-trip checks
The old markitect-main flat/hierarchical/semantic variants are worth
reimplementing behind a small variant interface.
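As an illustration of the round-trip idea only, the sketch below implements a toy flat variant: it slices the text at top-level heading lines, records the original hash and file order in a manifest, and checks that imploding reproduces the input byte for byte. The `Manifest` fields, the `000.md` naming scheme, and the function names are assumptions. A real variant would split on parser sections, since a line starting with `# ` inside a fenced block is not a heading.

```python
from __future__ import annotations

import hashlib
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^# ", re.MULTILINE)


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class Manifest:
    original_hash: str
    variant: str            # always "flat" in this sketch
    order: tuple[str, ...]  # generated file names, in document order


def explode(text: str) -> tuple[Manifest, dict[str, str]]:
    """Slice at top-level headings; text before the first heading is part 0."""
    starts = [m.start() for m in HEADING_RE.finditer(text)]
    bounds = sorted({0, *starts, len(text)})
    files = {
        f"{index:03d}.md": text[begin:end]
        for index, (begin, end) in enumerate(zip(bounds, bounds[1:]))
    }
    return Manifest(_sha256(text), "flat", tuple(files)), files


def implode(manifest: Manifest, files: dict[str, str]) -> str:
    """Concatenate the parts in manifest order."""
    return "".join(files[name] for name in manifest.order)


def roundtrip_is_lossless(text: str) -> bool:
    """Explode then implode unchanged parts and compare against the original hash."""
    manifest, files = explode(text)
    return _sha256(implode(manifest, files)) == manifest.original_hash
```

Because the parts are plain slices of the original text, the unedited round trip is exact. The hash in the manifest also makes it possible to tell after editing whether the imploded document differs from its source.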
Weave/Tangle
Tangle extracts named chunks to target files, expanding chunk references in a deterministic dependency order. Weave renders human-readable documentation with chunk backlinks and optional source indexes.
Minimum useful MVP:
- discover named fenced blocks
- support tangle="<path>"
- concatenate multiple chunks for the same target in document order
- expand <<chunk-id>> inside code
- detect missing/cyclic chunk references
- emit source mapping comments optionally
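The MVP steps above can be sketched in a few functions. This version assumes the fence shape from the earlier `load-config` example (an info string followed by `{#id tangle="path"}`), treats a chunk reference as a line containing only `<<id>>`, and finds fences with a regular expression instead of the parser, which a real implementation should not do. The names `discover_chunks`, `expand`, and `tangle` are hypothetical, and source-mapping comments are not covered.

```python
from __future__ import annotations

import re

# Assumed fence shape: three backticks, optional language, then {#id tangle="path"}.
FENCE_RE = re.compile(
    r'^`{3}\w*[ \t]*\{#(?P<id>[\w-]+)(?:[ \t]+tangle="(?P<target>[^"]+)")?\}[ \t]*\n'
    r'(?P<body>.*?)^`{3}[ \t]*$',
    re.MULTILINE | re.DOTALL,
)
CHUNK_REF_RE = re.compile(r"^(?P<indent>[ \t]*)<<(?P<id>[\w-]+)>>[ \t]*$", re.MULTILINE)


def discover_chunks(markdown: str) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Return (chunk bodies by id, chunk ids per tangle target in document order)."""
    chunks: dict[str, str] = {}
    targets: dict[str, list[str]] = {}
    for match in FENCE_RE.finditer(markdown):
        chunks[match["id"]] = match["body"]
        if match["target"]:
            targets.setdefault(match["target"], []).append(match["id"])
    return chunks, targets


def expand(chunk_id: str, chunks: dict[str, str], stack: tuple[str, ...] = ()) -> str:
    """Recursively expand <<id>> lines, raising on missing or cyclic references."""
    if chunk_id in stack:
        raise ValueError("cyclic chunk reference: " + " -> ".join((*stack, chunk_id)))
    if chunk_id not in chunks:
        raise KeyError(f"missing chunk: {chunk_id}")

    def substitute(match: re.Match) -> str:
        body = expand(match["id"], chunks, (*stack, chunk_id))
        # Re-indent the expanded chunk to the indentation of the reference line.
        lines = body.rstrip("\n").split("\n")
        return "\n".join(match["indent"] + line if line else line for line in lines)

    return CHUNK_REF_RE.sub(substitute, chunks[chunk_id])


def tangle(markdown: str) -> dict[str, str]:
    """Build {target path: file content}, concatenating chunks in document order."""
    chunks, targets = discover_chunks(markdown)
    return {
        target: "".join(expand(chunk_id, chunks) for chunk_id in ids)
        for target, ids in targets.items()
    }
```

The function returns file contents instead of writing them, which keeps the transform pure and leaves path-boundary checks and file output to a separate, policy-aware step.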
Content Class and Multiple Inheritance
Document classes should be data, not Python inheritance. A class can define:
- slots
- required sections
- snippets
- assertions
- processors
- merge policies
An instance declares:
document_class:
  extends:
    - contract:prd
    - market:enterprise
    - jurisdiction:eu
Resolution should use a C3-like linearization. Merge policies must be explicit:
- replace
- append
- prepend
- deep_merge
- before:<slot>
- after:<slot>
- error_on_conflict
Diagnostics should report inconsistent precedence, ambiguous slot definitions, and merge-policy violations.
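To make the linearization step concrete, here is a compact sketch of C3 over class data held in a plain mapping from class name to direct parents. The functions `c3_merge` and `linearize` are hypothetical helpers, and the shared ancestor `base:document` in the usage note is invented for the example. The sketch assumes the `extends` graph is acyclic; a cycle would recurse without terminating and needs a separate check.

```python
from __future__ import annotations


def c3_merge(sequences: list[list[str]]) -> list[str]:
    """Merge parent linearizations, preserving local precedence and monotonicity."""
    sequences = [list(seq) for seq in sequences if seq]
    result: list[str] = []
    while sequences:
        for seq in sequences:
            head = seq[0]
            # A head is acceptable only if it appears in no sequence's tail.
            if not any(head in other[1:] for other in sequences):
                break
        else:
            raise ValueError("inconsistent precedence: no valid linearization")
        result.append(head)
        # Drop the chosen class everywhere, then discard exhausted sequences.
        sequences = [[name for name in seq if name != head] for seq in sequences]
        sequences = [seq for seq in sequences if seq]
    return result


def linearize(name: str, extends: dict[str, list[str]]) -> list[str]:
    """C3 linearization of `name`; `extends` maps each class to its direct parents."""
    parents = extends.get(name, [])
    parent_orders = [linearize(parent, extends) for parent in parents]
    return [name] + c3_merge(parent_orders + [list(parents)])
```

If a document extends `contract:prd`, `market:enterprise`, and `jurisdiction:eu`, and each of those extends `base:document`, the resolved order is the document, the three classes in declared order, then `base:document`. Two classes that declare the same parents in opposite orders raise the inconsistent-precedence error, which is the kind of diagnostic described above.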
Comparison with Current Implementation
What we have now is a good kernel:
- Parser/frontmatter/sections/blocks
- Contracts and deterministic diagnostics
- Query/extraction over structured documents
- Transform, compose, and include operations
- Safe include path boundaries and cycle checks
What is missing for the richer framework:
- stable content IDs and namespaces
- region/tag selectors
- fenced-block-aware transforms
- operation provenance and dependency graphs
- structured include diagnostics instead of fail-fast exceptions only
- reversible explode/implode with manifests
- processor registry
- named chunks and weave/tangle
- class/object composition with deterministic multi-inheritance
- line/source maps across generated outputs
- security labels and policy hooks on resolved units
The clean path is to keep current ops as the small deterministic surface and grow this richer system as a framework layer. That protects simple CLI use while opening a strong route to sophisticated knowledge/programming pipelines.