# Content References, Processors, and Literate Workflows

Date: 2026-05-04

## Purpose

This note records the follow-up research after the first transform, compose, and include implementation. The goal is to keep `markitect-tool` close to Markdown while preserving the richer ideas that made `markitect-main` interesting: reversible explode/implode, transclusion, processors, namespaces, content references, and Knuth-style weave/tangle workflows.

## Research Inputs

- [WEB on CTAN](https://ctan.org/pkg/web) and [Knuth/Levy CWEB](https://cs.stanford.edu/~knuth/cweb.html): literate source is processed in two directions, one for compilable source and one for readable documentation.
- [noweb Hacker's Guide](https://www.cs.tufts.edu/~nr/noweb/guide.html): language-independent literate programming benefits from a pipeline representation and named chunks that tools can extend.
- [Org Babel](https://orgmode.org/worg/org-contrib/babel/intro.html): source blocks can be executable, parameterized, named, reused, tangled to files, and woven into reproducible documents.
- [CommonMark fenced code blocks](https://spec.commonmark.org/0.31.2/#fenced-code-blocks): fenced blocks are first-class Markdown structure and must be handled by the parser, not by naive global text rewrites.
- [Asciidoctor include directives](https://docs.asciidoctor.org/asciidoc/latest/directives/include/) and [tagged regions](https://docs.asciidoctor.org/asciidoc/latest/directives/include-tagged-regions/): includes need predictable base-dir resolution, safe-mode boundaries, line and tag selection, and source-code-region reuse.
- [Sphinx literalinclude](https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html): code inclusion commonly needs line ranges, object-level extraction, highlighting metadata, dedent, and original line-number handling.
- [DITA conref](https://docs.oasis-open.org/dita/dita/v1.3/os/part2-tech-content/archSpec/base/conref.html) and [conkeyref](https://docs.oasis-open.org/dita/dita/v1.3/os/part1-base/langRef/attributes/theconkeyrefattribute.html): content reuse becomes much stronger when references have IDs, keys, scoped indirection, validity checks, and clear attribute merge rules.
- [W3C XInclude](https://www.w3.org/TR/xinclude/): inclusion should have an explicit processing model, target addressing, and fallback behavior.
- [JSON-LD 1.1 contexts](https://www.w3.org/TR/json-ld/): namespaces can map short terms to stable global identifiers while retaining compact authoring.
- [Python C3 MRO](https://www.python.org/download/releases/2.3/mro/) and [CLOS concepts](https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node261.html): multiple inheritance needs deterministic linearization, monotonicity, local precedence, and explicit method/slot combination rules.
- [Pandoc filters](https://pandoc.org/filters.html): processors can be cleanly modeled as AST transformations over document nodes and code blocks.

## Lessons

Markdown can carry a surprisingly rich system if the extra semantics are placed in stable, inspectable constructs:

- Frontmatter declares document-level identity, namespaces, and defaults.
- Headings and fenced blocks become addressable content units.
- Include/transclusion is a resolver over content references, not only file expansion.
- Processors operate on typed blocks and produce diagnostics, dependencies, generated content, or files.
- Weave/tangle is a special case of named content units plus processor targets.
- Explode/implode needs a manifest with source spans and stable IDs so the directory form is not a lossy export.
- Multiple inheritance is useful for document templates, regulatory overlays, style/persona overlays, and reusable content classes, but only if merge behavior is deterministic and diagnosable.

## Use Cases

### 1. Reversible Large-Document Editing

An author explodes a long PRD/FRS into a directory, edits sections in separate files, then implodes it back into a canonical single Markdown document. The manifest preserves frontmatter policy, heading levels, ordering, source spans, and generated filenames.

### 2. Knuth-Style Markdown Weave/Tangle

A document explains a program in the order best for human understanding. Named code chunks are declared in fenced blocks, cross-reference each other, and tangle into one or more source files. The woven output keeps prose, chunk cross-links, and optionally generated indexes.

### 3. Executable Documentation Pipelines

Fenced blocks act as processors: shell, Python, SQL, validation, diagram, or custom processors can consume inputs, emit outputs, and record dependencies. Execution is optional and controlled; pure transforms remain deterministic.

### 4. Reusable Legal, Contract, and Product Clauses

Common clauses are defined once with stable IDs. Documents include them by namespace/key and can select variants by jurisdiction, customer type, language, or document class. Diagnostics explain missing keys and conflicting variants.

### 5. Source Snippet Documentation

Docs include code by tag, line range, parser object, or named block while preserving source line references. This supports API docs, changelog examples, and tutorials that stay aligned with source files.

### 6. Content Classes and Multiple Inheritance

A document can be treated as an instance of several content classes: for example `base:prd`, `market:enterprise`, `jurisdiction:eu`, and `style:board-brief`. Slot values, assertions, sections, and snippets resolve in a deterministic order with explicit merge strategies.

### 7. Agent Context Packages

An agent can request a namespace, topic, chunk, section, or graph slice and get a bounded context package with provenance, dependencies, hashes, and security labels. This dovetails with later cache and memory work.

### 8. Security-Sensitive Knowledge Gateways

References and processor outputs carry labels. Policy can filter or redact content before transclusion, weaving, tangling, or context-package creation.

## Architecture Blueprint

### Content Unit Model

The parser should expose addressable units beyond the current document, section, and block lists:

- document
- frontmatter path
- section
- block
- fenced block
- named region
- named chunk
- processor result

Each unit should have:

- stable local ID
- optional global name
- source path and source span
- kind/type
- content hash
- dependency list
- labels/policy metadata

### Reference Syntax

Keep Markdown readable and allow several levels of precision:

```markdown
```

Frontmatter can define namespaces:

```yaml
namespaces:
  std: ./standards/
  src: ../src/
  contract: ./contracts/
```

References should resolve through a single resolver API:

```text
namespace + address + selector + mode + context -> resolved content unit(s)
```

### Region and Chunk Syntax

Use comments for regions so they can live inside Markdown or source files:

```markdown
Reusable content.
```

Use fenced blocks for executable or tangleable chunks:

````markdown
```python {#load-config tangle="src/config.py"}
def load_config(path):
    return {}
```
````

Chunk references can stay close to noweb:

```text
<<chunk-name>>
```

The processor layer decides whether chunk references are expanded during tangle, displayed during weave, or left literal.

### Processor Registry

Processors should be pluggable but explicit. A processor receives:

- unit content and metadata
- resolver
- execution context
- policy context
- output target request

It returns:

- transformed content, generated files, or computed values
- diagnostics
- dependency edges
- provenance events

Core processors should start deterministic: include, region, explode/implode, tangle, weave, and simple text/Markdown transforms. Executing arbitrary code is a later, opt-in capability.
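
The registry contract above could be sketched roughly as follows. This is only an illustration, not the `markitect-tool` API: `Unit`, `ProcessorContext`, `ProcessorResult`, `ProcessorRegistry`, and `strip_trailing_whitespace` are hypothetical names, the execution context is omitted, and generated files and computed values are left out of the result type for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Unit:
    """Hypothetical content unit with only the fields this sketch needs."""
    unit_id: str
    kind: str
    content: str
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class ProcessorContext:
    """Assumed bundle of resolver, policy context, and output target request."""
    resolver: Callable[[str], Optional[Unit]]
    policy: Dict[str, str] = field(default_factory=dict)
    target: Optional[str] = None


@dataclass
class ProcessorResult:
    """Transformed content plus diagnostics, dependency edges, and provenance."""
    content: str
    diagnostics: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    provenance: List[str] = field(default_factory=list)


Processor = Callable[[Unit, ProcessorContext], ProcessorResult]


class ProcessorRegistry:
    """Explicit name-to-processor mapping; nothing is discovered implicitly."""

    def __init__(self) -> None:
        self._processors: Dict[str, Processor] = {}

    def register(self, name: str, processor: Processor) -> None:
        if name in self._processors:
            raise ValueError(f"processor already registered: {name}")
        self._processors[name] = processor

    def run(self, name: str, unit: Unit, context: ProcessorContext) -> ProcessorResult:
        processor = self._processors.get(name)
        if processor is None:
            # Report an unknown processor as a diagnostic and pass content through.
            return ProcessorResult(
                content=unit.content,
                diagnostics=[f"unknown processor: {name}"],
            )
        result = processor(unit, context)
        # Record which processor touched which unit.
        result.provenance.append(f"{name} <- {unit.unit_id}")
        return result


def strip_trailing_whitespace(unit: Unit, context: ProcessorContext) -> ProcessorResult:
    """A deterministic text transform: no I/O, no code execution."""
    lines = [line.rstrip() for line in unit.content.split("\n")]
    return ProcessorResult(content="\n".join(lines))
```

Returning a diagnostic for an unknown processor, instead of raising, is one possible design choice here; it keeps a pipeline run inspectable when a document names a processor that is not installed.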
### Explode/Implode

Explode/implode should become a first-class reversible operation, not a loose directory export. The manifest should include:

- original path and hash
- variant type (`flat`, `hierarchical`, `semantic`)
- frontmatter preservation policy
- section/chunk/source-span entries
- file paths and order
- heading-level policy
- warnings and non-lossy roundtrip checks

The old `markitect-main` flat/hierarchical/semantic variants are worth reimplementing behind a small variant interface.

### Weave/Tangle

Tangle extracts named chunks to target files, expanding chunk references in a deterministic dependency order. Weave renders human-readable documentation with chunk backlinks and optional source indexes.

Minimum useful MVP:

- discover named fenced blocks
- support `tangle=""`
- concatenate multiple chunks for the same target in document order
- expand `<<chunk-name>>` references inside code
- detect missing/cyclic chunk references
- optionally emit source-mapping comments

### Content Class and Multiple Inheritance

Document classes should be data, not Python inheritance. A class can define:

- slots
- required sections
- snippets
- assertions
- processors
- merge policies

An instance declares:

```yaml
document_class:
  extends:
    - contract:prd
    - market:enterprise
    - jurisdiction:eu
```

Resolution should use a C3-like linearization. Merge policies must be explicit:

- `replace`
- `append`
- `prepend`
- `deep_merge`
- `before:`
- `after:`
- `error_on_conflict`

Diagnostics should report inconsistent precedence, ambiguous slot definitions, and merge-policy violations.
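
To make the "C3-like linearization" concrete, here is a minimal sketch over plain class data. It follows the standard C3 merge used for Python's MRO; the function names, the `doc` instance name, and the shared `base:document` ancestor are assumptions made for illustration, and slot merging under the listed policies is not covered.

```python
from typing import Dict, List, Tuple


def c3_merge(sequences: List[List[str]]) -> List[str]:
    """Merge linearizations, preserving local precedence order (C3)."""
    result: List[str] = []
    seqs = [list(s) for s in sequences if s]
    while seqs:
        for seq in seqs:
            head = seq[0]
            # A candidate is valid only if it is not in the tail of any sequence.
            if not any(head in other[1:] for other in seqs):
                break
        else:
            raise ValueError("inconsistent precedence: no valid linearization")
        result.append(head)
        # Drop the chosen class everywhere, then discard emptied sequences.
        seqs = [[name for name in s if name != head] for s in seqs]
        seqs = [s for s in seqs if s]
    return result


def linearize(
    name: str,
    extends: Dict[str, List[str]],
    _stack: Tuple[str, ...] = (),
) -> List[str]:
    """Return the resolution order for `name`, most specific class first."""
    if name in _stack:
        raise ValueError(f"cyclic extends involving: {name}")
    parents = extends.get(name, [])
    parent_orders = [linearize(p, extends, _stack + (name,)) for p in parents]
    return [name] + c3_merge(parent_orders + [list(parents)])


# Hypothetical class data: each class maps to the classes it extends.
CLASSES = {
    "doc": ["contract:prd", "market:enterprise", "jurisdiction:eu"],
    "contract:prd": ["base:document"],
    "market:enterprise": ["base:document"],
    "jurisdiction:eu": ["base:document"],
    "base:document": [],
}

print(linearize("doc", CLASSES))
```

With this data the order is the instance first, then the three `extends` entries in declared order, then the shared ancestor. The two `ValueError` cases correspond to the "inconsistent precedence" diagnostic above; a real implementation would presumably return structured diagnostics instead of raising.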
## Comparison with Current Implementation

What we have now is a good kernel:

- Parser/frontmatter/sections/blocks
- Contracts and deterministic diagnostics
- Query/extraction over structured documents
- Transform, compose, and include operations
- Safe include path boundaries and cycle checks

What is missing for the richer framework:

- stable content IDs and namespaces
- region/tag selectors
- fenced-block-aware transforms
- operation provenance and dependency graphs
- structured include diagnostics instead of fail-fast exceptions only
- reversible explode/implode with manifests
- processor registry
- named chunks and weave/tangle
- class/object composition with deterministic multi-inheritance
- line/source maps across generated outputs
- security labels and policy hooks on resolved units

The clean path is to keep current ops as the small deterministic surface and grow this richer system as a framework layer. That protects simple CLI use while opening a strong route to sophisticated knowledge/programming pipelines.