# Content References, Processors, and Literate Workflows
Date: 2026-05-04
## Purpose
This note records the follow-up research after the first transform, compose,
and include implementation. The goal is to keep `markitect-tool` close to
Markdown while preserving the richer ideas that made `markitect-main`
interesting: reversible explode/implode, transclusion, processors, namespaces,
content references, and Knuth-style weave/tangle workflows.
## Research Inputs
- [WEB on CTAN](https://ctan.org/pkg/web) and
[Knuth/Levy CWEB](https://cs.stanford.edu/~knuth/cweb.html): literate source
is processed in two directions, one for compilable source and one for
readable documentation.
- [noweb Hacker's Guide](https://www.cs.tufts.edu/~nr/noweb/guide.html):
language-independent literate programming benefits from a pipeline
representation and named chunks that tools can extend.
- [Org Babel](https://orgmode.org/worg/org-contrib/babel/intro.html): source
blocks can be executable, parameterized, named, reused, tangled to files, and
woven into reproducible documents.
- [CommonMark fenced code blocks](https://spec.commonmark.org/0.31.2/#fenced-code-blocks):
fenced blocks are first-class Markdown structure and must be handled by the
parser, not by naive global text rewrites.
- [Asciidoctor include directives](https://docs.asciidoctor.org/asciidoc/latest/directives/include/)
and [tagged regions](https://docs.asciidoctor.org/asciidoc/latest/directives/include-tagged-regions/):
includes need predictable base-dir resolution, safe-mode boundaries, line and
tag selection, and source-code-region reuse.
- [Sphinx literalinclude](https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html):
code inclusion commonly needs line ranges, object-level extraction,
highlighting metadata, dedent, and original line-number handling.
- [DITA conref](https://docs.oasis-open.org/dita/dita/v1.3/os/part2-tech-content/archSpec/base/conref.html)
and [conkeyref](https://docs.oasis-open.org/dita/dita/v1.3/os/part1-base/langRef/attributes/theconkeyrefattribute.html):
content reuse becomes much stronger when references have IDs, keys, scoped
indirection, validity checks, and clear attribute merge rules.
- [W3C XInclude](https://www.w3.org/TR/xinclude/): inclusion should have an
explicit processing model, target addressing, and fallback behavior.
- [JSON-LD 1.1 contexts](https://www.w3.org/TR/json-ld/): namespaces can map
short terms to stable global identifiers while retaining compact authoring.
- [Python C3 MRO](https://www.python.org/download/releases/2.3/mro/) and
[CLOS concepts](https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node261.html):
multiple inheritance needs deterministic linearization, monotonicity, local
precedence, and explicit method/slot combination rules.
- [Pandoc filters](https://pandoc.org/filters.html): processors can be cleanly
modeled as AST transformations over document nodes and code blocks.
## Lessons
Markdown can carry a surprisingly rich system if the extra semantics are placed
in stable, inspectable constructs:
- Frontmatter declares document-level identity, namespaces, and defaults.
- Headings and fenced blocks become addressable content units.
- Include/transclusion is a resolver over content references, not only file
expansion.
- Processors operate on typed blocks and produce diagnostics, dependencies,
generated content, or files.
- Weave/tangle is a special case of named content units plus processor targets.
- Explode/implode needs a manifest with source spans and stable IDs so the
directory form is not a lossy export.
- Multiple inheritance is useful for document templates, regulatory overlays,
style/persona overlays, and reusable content classes, but only if merge
behavior is deterministic and diagnosable.
## Use Cases
### 1. Reversible Large-Document Editing
An author explodes a long PRD/FRS into a directory, edits sections in separate
files, then implodes it back into a canonical single Markdown document. The
manifest preserves frontmatter policy, heading levels, ordering, source spans,
and generated filenames.
### 2. Knuth-Style Markdown Weave/Tangle
A document explains a program in the order best for human understanding. Named
code chunks are declared in fenced blocks, cross-reference each other, and
tangle into one or more source files. The woven output keeps prose, chunk
cross-links, and optionally generated indexes.
### 3. Executable Documentation Pipelines
Fenced blocks act as processors: shell, Python, SQL, validation, diagram, or
custom processors can consume inputs, emit outputs, and record dependencies.
Execution is optional and controlled; pure transforms remain deterministic.
### 4. Reusable Legal, Contract, and Product Clauses
Common clauses are defined once with stable IDs. Documents include them by
namespace/key and can select variants by jurisdiction, customer type, language,
or document class. Diagnostics explain missing keys and conflicting variants.
### 5. Source Snippet Documentation
Docs include code by tag, line range, parser object, or named block while
preserving source line references. This supports API docs, changelog examples,
and tutorials that stay aligned with source files.
### 6. Content Classes and Multiple Inheritance
A document can be treated as an instance of several content classes: for
example `base:prd`, `market:enterprise`, `jurisdiction:eu`, and
`style:board-brief`. Slot values, assertions, sections, and snippets resolve in
a deterministic order with explicit merge strategies.
### 7. Agent Context Packages
An agent can request a namespace, topic, chunk, section, or graph slice and get
a bounded context package with provenance, dependencies, hashes, and security
labels. This dovetails with later cache and memory work.
### 8. Security-Sensitive Knowledge Gateways
References and processor outputs carry labels. Policy can filter or redact
content before transclusion, weaving, tangling, or context-package creation.
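
A minimal sketch of such a policy step, assuming each unit is a plain dict with hypothetical `content` and `labels` keys and that the policy is simply the set of labels a consumer is allowed to see. None of these names come from `markitect-tool`; they are placeholders for illustration.

```python
from __future__ import annotations

REDACTED = "[redacted]"  # hypothetical placeholder text


def apply_policy(units: list[dict], allowed_labels: set[str]) -> list[dict]:
    """Return units unchanged if all their labels are allowed, else redacted copies."""
    result = []
    for unit in units:
        if set(unit["labels"]) <= allowed_labels:
            result.append(unit)
        else:
            # Copy the unit so the original content is never mutated.
            result.append({**unit, "content": REDACTED})
    return result
```

Redacting rather than dropping keeps ordering and provenance visible to the consumer; dropping the unit entirely would be an equally plausible policy.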
## Architecture Blueprint
### Content Unit Model
The parser should expose addressable units beyond the current document,
section, and block lists:
- document
- frontmatter path
- section
- block
- fenced block
- named region
- named chunk
- processor result

Each unit should have:
- stable local ID
- optional global name
- source path and source span
- kind/type
- content hash
- dependency list
- labels/policy metadata
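
The fields above could be carried by a small immutable record. This is only a sketch: the class and field names are assumptions, and the content hash is computed with SHA-256 purely as an example choice.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """1-based inclusive line range within a source file."""
    start_line: int
    end_line: int


@dataclass(frozen=True)
class ContentUnit:
    """Hypothetical record for one addressable unit."""
    local_id: str                          # stable within the document
    kind: str                              # e.g. "section", "fenced_block"
    source_path: str
    span: SourceSpan
    content: str
    global_name: str | None = None         # optional, e.g. "std:clauses/payment"
    dependencies: tuple[str, ...] = ()     # IDs of units this one references
    labels: frozenset[str] = frozenset()   # labels/policy metadata

    @property
    def content_hash(self) -> str:
        """SHA-256 of the UTF-8 content, so any edit changes the hash."""
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()
```

Deriving the hash from the content instead of storing it avoids a stale hash after edits, at the cost of recomputing it on each access.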
### Reference Syntax
Keep Markdown readable and allow several levels of precision:
```markdown
<!-- mkt:include ref="std:clauses/payment" -->
<!-- mkt:include path="sections/intro.md" selector="sections[heading=Summary]" -->
<!-- mkt:include ref="src:api#tag:create-user" mode="literal" -->
```
Frontmatter can define namespaces:
```yaml
namespaces:
  std: ./standards/
  src: ../src/
  contract: ./contracts/
```
References should resolve through a single resolver API:
```text
namespace + address + selector + mode + context -> resolved content unit(s)
```
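
As a rough illustration of the first half of that pipeline, the sketch below parses an include comment and maps a `namespace:address#selector` reference onto a base path. The function names are hypothetical, and the `#` separator for selectors is inferred from the `src:api#tag:create-user` example rather than from a defined grammar.

```python
from __future__ import annotations

import re
from pathlib import PurePosixPath

DIRECTIVE_RE = re.compile(r"<!--\s*mkt:include\s+(?P<attrs>.*?)\s*-->")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_include(line: str) -> dict[str, str] | None:
    """Return the directive's attributes, or None if the line is not an include."""
    match = DIRECTIVE_RE.search(line)
    if match is None:
        return None
    return dict(ATTR_RE.findall(match.group("attrs")))


def resolve_ref(ref: str, namespaces: dict[str, str]) -> tuple[str, str | None]:
    """Split 'ns:address#selector' and join the address onto the namespace base."""
    prefix, sep, rest = ref.partition(":")
    if not sep or prefix not in namespaces:
        raise KeyError(f"unknown namespace in reference: {ref!r}")
    address, _, selector = rest.partition("#")
    base = PurePosixPath(namespaces[prefix])
    return str(base / address), (selector or None)
```

A real resolver would also enforce the include path boundaries and cycle checks the current implementation already has; this sketch only covers parsing and namespace mapping.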
### Region and Chunk Syntax
Use comments for regions so they can live inside Markdown or source files:
```markdown
<!-- mkt:region id="overview" -->
Reusable content.
<!-- /mkt:region -->
```
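Extracting such regions can be sketched with a single regular expression, assuming regions are not nested and that IDs are unique within a file. The function name `extract_regions` is hypothetical.

```python
from __future__ import annotations

import re

REGION_RE = re.compile(
    r'<!--\s*mkt:region\s+id="(?P<id>[^"]+)"\s*-->\n'
    r"(?P<body>.*?)"
    r"\n?<!--\s*/mkt:region\s*-->",
    re.DOTALL,
)


def extract_regions(text: str) -> dict[str, str]:
    """Map each region id to its body; nested regions are not handled here."""
    regions: dict[str, str] = {}
    for match in REGION_RE.finditer(text):
        region_id = match.group("id")
        if region_id in regions:
            raise ValueError(f"duplicate region id: {region_id!r}")
        regions[region_id] = match.group("body")
    return regions
```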
Use fenced blocks for executable or tangleable chunks:
````markdown
```python {#load-config tangle="src/config.py"}
def load_config(path):
    return {}
```
````
Chunk references can stay close to noweb:
```text
<<load-config>>
```
The processor layer decides whether chunk references are expanded during
tangle, displayed during weave, or left literal.
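
Reading the chunk ID and attributes from an info string such as `python {#load-config tangle="src/config.py"}` might look like the following. It is a sketch that assumes the brace syntax shown above, with `#id` and `key="value"` pairs only; `parse_info_string` is a hypothetical name.

```python
from __future__ import annotations

import re

INFO_RE = re.compile(r"^(?P<lang>[\w+-]*)\s*(?:\{(?P<attrs>[^}]*)\})?$")
KV_RE = re.compile(r'([\w-]+)="([^"]*)"')
ID_RE = re.compile(r"#([\w-]+)")


def parse_info_string(info: str) -> dict:
    """Split a fence info string into language, chunk id, and key/value attributes."""
    match = INFO_RE.match(info.strip())
    if match is None:
        raise ValueError(f"unsupported info string: {info!r}")
    attrs = match.group("attrs") or ""
    pairs = dict(KV_RE.findall(attrs))
    # Remove key="value" pairs first so a '#' inside a value is not read as an id.
    id_match = ID_RE.search(KV_RE.sub("", attrs))
    return {
        "lang": match.group("lang") or None,
        "id": id_match.group(1) if id_match else None,
        "attrs": pairs,
    }
```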
### Processor Registry
Processors should be pluggable but explicit. A processor receives:
- unit content and metadata
- resolver
- execution context
- policy context
- output target request

It returns:
- transformed content, generated files, or computed values
- diagnostics
- dependency edges
- provenance events

Core processors should start deterministic: include, region, explode/implode,
tangle, weave, and simple text/Markdown transforms. Executing arbitrary code is
a later, opt-in capability.
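
One possible shape for such a registry is sketched below. All names are hypothetical, the processor inputs are collapsed into a content string plus a context dict, and the `allow_execution` flag is one assumed way to model the opt-in for non-deterministic processors.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProcessorResult:
    content: str
    diagnostics: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


# A processor takes unit content plus a context dict and returns a result.
Processor = Callable[[str, dict], ProcessorResult]


class ProcessorRegistry:
    def __init__(self) -> None:
        self._processors: dict[str, tuple[Processor, bool]] = {}

    def register(self, name: str, processor: Processor, *, deterministic: bool = True) -> None:
        if name in self._processors:
            raise ValueError(f"processor already registered: {name!r}")
        self._processors[name] = (processor, deterministic)

    def run(self, name: str, content: str, context: dict | None = None,
            *, allow_execution: bool = False) -> ProcessorResult:
        if name not in self._processors:
            raise KeyError(f"unknown processor: {name!r}")
        processor, deterministic = self._processors[name]
        if not deterministic and not allow_execution:
            # Leave content untouched and report why, instead of failing hard.
            return ProcessorResult(
                content, diagnostics=[f"processor {name!r} skipped: execution not enabled"]
            )
        return processor(content, context or {})
```

Returning a diagnostic for a skipped processor, rather than raising, keeps a pure transform run usable even when a document contains executable blocks.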
### Explode/Implode
Explode/implode should become a first-class reversible operation, not a loose
directory export. The manifest should include:
- original path and hash
- variant type (`flat`, `hierarchical`, `semantic`)
- frontmatter preservation policy
- section/chunk/source-span entries
- file paths and order
- heading-level policy
- warnings and lossless round-trip checks

The old `markitect-main` flat/hierarchical/semantic variants are worth
reimplementing behind a small variant interface.
### Weave/Tangle
Tangle extracts named chunks to target files, expanding chunk references in a
deterministic dependency order. Weave renders human-readable documentation with
chunk backlinks and optional source indexes.

Minimum useful MVP:
- discover named fenced blocks
- support `tangle="<path>"`
- concatenate multiple chunks for the same target in document order
- expand `<<chunk-id>>` inside code
- detect missing/cyclic chunk references
- emit source mapping comments optionally
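
Most of that MVP fits in a short sketch. The version below assumes chunk IDs are unique, that only fences with a `{...}` attribute block are chunks, and that a `<<chunk-id>>` reference occupies a whole line (its indentation is applied to the expanded body). Function names are hypothetical, and source mapping comments are left out.

```python
from __future__ import annotations

import re

FENCE = "`" * 3  # built indirectly so this example can itself sit in a fenced block
FENCE_RE = re.compile(
    r"^" + FENCE + r"\w*[ \t]*\{(?P<attrs>[^}\n]*)\}[ \t]*\n(?P<code>.*?)^" + FENCE + r"[ \t]*$",
    re.DOTALL | re.MULTILINE,
)
ID_RE = re.compile(r"#([\w-]+)")
TANGLE_RE = re.compile(r'tangle="([^"]*)"')
REF_RE = re.compile(r"^(?P<indent>[ \t]*)<<(?P<id>[\w-]+)>>[ \t]*$", re.MULTILINE)


def discover_chunks(markdown: str) -> list[dict]:
    """Collect attributed fenced blocks in document order."""
    chunks = []
    for match in FENCE_RE.finditer(markdown):
        attrs = match.group("attrs")
        id_match = ID_RE.search(TANGLE_RE.sub("", attrs))
        target = TANGLE_RE.search(attrs)
        chunks.append({
            "id": id_match.group(1) if id_match else None,
            "target": target.group(1) if target else None,
            "code": match.group("code"),
        })
    return chunks


def expand_code(code: str, by_id: dict[str, str], stack: tuple[str, ...] = ()) -> str:
    """Recursively replace whole-line chunk references, detecting missing and cyclic ones."""
    def replace(match: re.Match) -> str:
        ref = match.group("id")
        if ref not in by_id:
            raise KeyError(f"missing chunk: {ref!r}")
        if ref in stack:
            raise ValueError("cyclic chunk reference: " + " -> ".join(stack + (ref,)))
        body = expand_code(by_id[ref], by_id, stack + (ref,)).rstrip("\n")
        indent = match.group("indent")
        return "\n".join(indent + line if line else line for line in body.split("\n"))
    return REF_RE.sub(replace, code)


def tangle(markdown: str) -> dict[str, str]:
    """Return target path -> generated source, concatenating chunks in document order."""
    chunks = discover_chunks(markdown)
    by_id: dict[str, str] = {}
    for chunk in chunks:
        if chunk["id"] is not None:
            if chunk["id"] in by_id:
                raise ValueError(f"duplicate chunk id: {chunk['id']!r}")
            by_id[chunk["id"]] = chunk["code"]
    outputs: dict[str, str] = {}
    for chunk in chunks:
        if chunk["target"] is not None:
            stack = (chunk["id"],) if chunk["id"] else ()
            expanded = expand_code(chunk["code"], by_id, stack)
            outputs[chunk["target"]] = outputs.get(chunk["target"], "") + expanded
    return outputs
```

The regex-based discovery is a stand-in for the parser's fenced-block nodes; a real implementation would walk parsed blocks so that nested or longer fences are handled correctly.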
### Content Class and Multiple Inheritance
Document classes should be data, not Python inheritance. A class can define:
- slots
- required sections
- snippets
- assertions
- processors
- merge policies

An instance declares:
```yaml
document_class:
  extends:
    - contract:prd
    - market:enterprise
    - jurisdiction:eu
```
Resolution should use a C3-like linearization. Merge policies must be explicit:
- `replace`
- `append`
- `prepend`
- `deep_merge`
- `before:<slot>`
- `after:<slot>`
- `error_on_conflict`

Diagnostics should report inconsistent precedence, ambiguous slot definitions,
and merge-policy violations.
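
The resolution order can be sketched with the standard C3 merge over a plain mapping from class name to declared parents. The function names are hypothetical, and the semantics given to the four simplest merge policies are an assumed reading of the list above; `deep_merge`, `before:<slot>`, and `after:<slot>` are not covered.

```python
from __future__ import annotations


def c3_merge(sequences: list[list[str]]) -> list[str]:
    """Standard C3 merge: repeatedly take a head that appears in no other tail."""
    result: list[str] = []
    pending = [list(seq) for seq in sequences if seq]
    while pending:
        for seq in pending:
            head = seq[0]
            if not any(head in other[1:] for other in pending):
                break
        else:
            heads = sorted({seq[0] for seq in pending})
            raise ValueError("inconsistent precedence among: " + ", ".join(heads))
        result.append(head)
        pending = [[c for c in seq if c != head] for seq in pending]
        pending = [seq for seq in pending if seq]
    return result


def linearize(name: str, parents: dict[str, list[str]]) -> list[str]:
    """Return the class itself followed by its ancestors, most specific first."""
    bases = parents.get(name, [])
    return [name] + c3_merge([linearize(b, parents) for b in bases] + [list(bases)])


def merge_values(policy: str, inherited: list[str], local: list[str]) -> list[str]:
    """Combine list-valued slots under one of the simpler policies (assumed semantics)."""
    if policy == "replace":
        return list(local)
    if policy == "append":
        return inherited + local
    if policy == "prepend":
        return local + inherited
    if policy == "error_on_conflict":
        if inherited and local and inherited != local:
            raise ValueError("conflicting slot values")
        return list(local or inherited)
    raise ValueError(f"unsupported merge policy: {policy!r}")
```

A `ValueError` from `c3_merge` corresponds to the "inconsistent precedence" diagnostic; a production version would report it as a structured diagnostic rather than an exception.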
## Comparison with Current Implementation
What we have now is a good kernel:
- Parser/frontmatter/sections/blocks
- Contracts and deterministic diagnostics
- Query/extraction over structured documents
- Transform, compose, and include operations
- Safe include path boundaries and cycle checks

What is missing for the richer framework:
- stable content IDs and namespaces
- region/tag selectors
- fenced-block-aware transforms
- operation provenance and dependency graphs
- structured include diagnostics, not only fail-fast exceptions
- reversible explode/implode with manifests
- processor registry
- named chunks and weave/tangle
- class/object composition with deterministic multi-inheritance
- line/source maps across generated outputs
- security labels and policy hooks on resolved units

The clean path is to keep current ops as the small deterministic surface and
grow this richer system as a framework layer. That protects simple CLI use while
opening a strong route to sophisticated knowledge/programming pipelines.