markitect-tool/docs/content-reference-literate-workflow-research.md

# Content References, Processors, and Literate Workflows

Date: 2026-05-04

## Purpose

This note records the follow-up research after the first transform, compose,
and include implementation. The goal is to keep `markitect-tool` close to
Markdown while preserving the richer ideas that made `markitect-main`
interesting: reversible explode/implode, transclusion, processors, namespaces,
content references, and Knuth-style weave/tangle workflows.

## Research Inputs

- [WEB on CTAN](https://ctan.org/pkg/web) and
  [Knuth/Levy CWEB](https://cs.stanford.edu/~knuth/cweb.html): literate source
  is processed in two directions, one for compilable source and one for
  readable documentation.
- [noweb Hacker's Guide](https://www.cs.tufts.edu/~nr/noweb/guide.html):
  language-independent literate programming benefits from a pipeline
  representation and named chunks that tools can extend.
- [Org Babel](https://orgmode.org/worg/org-contrib/babel/intro.html): source
  blocks can be executable, parameterized, named, reused, tangled to files, and
  woven into reproducible documents.
- [CommonMark fenced code blocks](https://spec.commonmark.org/0.31.2/#fenced-code-blocks):
  fenced blocks are first-class Markdown structure and must be handled by the
  parser, not by naive global text rewrites.
- [Asciidoctor include directives](https://docs.asciidoctor.org/asciidoc/latest/directives/include/)
  and [tagged regions](https://docs.asciidoctor.org/asciidoc/latest/directives/include-tagged-regions/):
  includes need predictable base-dir resolution, safe-mode boundaries, line and
  tag selection, and source-code-region reuse.
- [Sphinx literalinclude](https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html):
  code inclusion commonly needs line ranges, object-level extraction,
  highlighting metadata, dedent, and original line-number handling.
- [DITA conref](https://docs.oasis-open.org/dita/dita/v1.3/os/part2-tech-content/archSpec/base/conref.html)
  and [conkeyref](https://docs.oasis-open.org/dita/dita/v1.3/os/part1-base/langRef/attributes/theconkeyrefattribute.html):
  content reuse becomes much stronger when references have IDs, keys, scoped
  indirection, validity checks, and clear attribute merge rules.
- [W3C XInclude](https://www.w3.org/TR/xinclude/): inclusion should have an
  explicit processing model, target addressing, and fallback behavior.
- [JSON-LD 1.1 contexts](https://www.w3.org/TR/json-ld/): namespaces can map
  short terms to stable global identifiers while retaining compact authoring.
- [Python C3 MRO](https://www.python.org/download/releases/2.3/mro/) and
  [CLOS concepts](https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node261.html):
  multiple inheritance needs deterministic linearization, monotonicity, local
  precedence, and explicit method/slot combination rules.
- [Pandoc filters](https://pandoc.org/filters.html): processors can be cleanly
  modeled as AST transformations over document nodes and code blocks.

## Lessons

Markdown can carry a surprisingly rich system if the extra semantics are placed
in stable, inspectable constructs:

- Frontmatter declares document-level identity, namespaces, and defaults.
- Headings and fenced blocks become addressable content units.
- Include/transclusion is a resolver over content references, not only file
  expansion.
- Processors operate on typed blocks and produce diagnostics, dependencies,
  generated content, or files.
- Weave/tangle is a special case of named content units plus processor targets.
- Explode/implode needs a manifest with source spans and stable IDs so the
  directory form is not a lossy export.
- Multiple inheritance is useful for document templates, regulatory overlays,
  style/persona overlays, and reusable content classes, but only if merge
  behavior is deterministic and diagnosable.

## Use Cases

### 1. Reversible Large-Document Editing

An author explodes a long PRD/FRS into a directory, edits sections in separate
files, then implodes it back into a canonical single Markdown document. The
manifest preserves frontmatter policy, heading levels, ordering, source spans,
and generated filenames.

### 2. Knuth-Style Markdown Weave/Tangle

A document explains a program in the order best for human understanding. Named
code chunks are declared in fenced blocks, cross-reference each other, and
tangle into one or more source files. The woven output keeps prose, chunk
cross-links, and optionally generated indexes.

### 3. Executable Documentation Pipelines

Fenced blocks act as processors: shell, Python, SQL, validation, diagram, or
custom processors can consume inputs, emit outputs, and record dependencies.
Execution is optional and controlled; pure transforms remain deterministic.

### 4. Reusable Legal, Contract, and Product Clauses

Common clauses are defined once with stable IDs. Documents include them by
namespace/key and can select variants by jurisdiction, customer type, language,
or document class. Diagnostics explain missing keys and conflicting variants.

### 5. Source Snippet Documentation

Docs include code by tag, line range, parser object, or named block while
preserving source line references. This supports API docs, changelog examples,
and tutorials that stay aligned with source files.

### 6. Content Classes and Multiple Inheritance

A document can be treated as an instance of several content classes: for
example `base:prd`, `market:enterprise`, `jurisdiction:eu`, and
`style:board-brief`. Slot values, assertions, sections, and snippets resolve in
a deterministic order with explicit merge strategies.

### 7. Agent Context Packages

An agent can request a namespace, topic, chunk, section, or graph slice and get
a bounded context package with provenance, dependencies, hashes, and security
labels. This dovetails with later cache and memory work.

### 8. Security-Sensitive Knowledge Gateways

References and processor outputs carry labels. Policy can filter or redact
content before transclusion, weaving, tangling, or context-package creation.

## Architecture Blueprint

### Content Unit Model

The parser should expose addressable units beyond the current document,
section, and block lists:

- document
- frontmatter path
- section
- block
- fenced block
- named region
- named chunk
- processor result

Each unit should have:

- stable local ID
- optional global name
- source path and source span
- kind/type
- content hash
- dependency list
- labels/policy metadata
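
As a rough illustration, the unit fields listed above could be carried by a small immutable record. This is only a sketch: the names `ContentUnit` and `SourceSpan`, the half-open line-range convention, and the choice of SHA-256 for the content hash are assumptions, not part of the current `markitect-tool` API.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Hypothetical half-open line range [start_line, end_line) in a source file."""

    start_line: int
    end_line: int


@dataclass(frozen=True)
class ContentUnit:
    """Hypothetical addressable unit; field names mirror the list above."""

    local_id: str                       # stable local ID
    kind: str                           # e.g. "section", "fenced block", "named chunk"
    source_path: str
    span: SourceSpan
    content: str
    global_name: str | None = None      # optional global name
    dependencies: tuple[str, ...] = ()  # IDs of units this unit depends on
    labels: frozenset[str] = frozenset()  # labels/policy metadata

    @property
    def content_hash(self) -> str:
        # Assumption: SHA-256 over the UTF-8 bytes of the content.
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


unit = ContentUnit(
    local_id="overview",
    kind="section",
    source_path="docs/prd.md",
    span=SourceSpan(10, 24),
    content="## Overview\n",
)
```

Deriving the hash from the content instead of storing it keeps the record from drifting out of sync with its text.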

### Reference Syntax

Keep Markdown readable and allow several levels of precision:

```markdown
<!-- mkt:include ref="std:clauses/payment" -->
<!-- mkt:include path="sections/intro.md" selector="sections[heading=Summary]" -->
<!-- mkt:include ref="src:api#tag:create-user" mode="literal" -->
```

Frontmatter can define namespaces:

```yaml
namespaces:
  std: ./standards/
  src: ../src/
  contract: ./contracts/
```

References should resolve through a single resolver API:

```text
namespace + address + selector + mode + context -> resolved content unit(s)
```
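
The first step of such a resolver might be sketched as below. The split rules are inferred from the example references above and are assumptions: the namespace ends at the first `:`, and an optional selector follows the first `#`. `ParsedRef`, `parse_ref`, and `resolve_base` are hypothetical names; mode and context handling are left out.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ParsedRef:
    namespace: str
    address: str
    selector: str | None


def parse_ref(ref: str) -> ParsedRef:
    """Split 'namespace:address#selector' (assumed grammar) into its parts."""
    namespace, sep, rest = ref.partition(":")
    if not sep or not namespace or not rest:
        raise ValueError(f"reference needs the form 'namespace:address': {ref!r}")
    address, _, selector = rest.partition("#")
    return ParsedRef(namespace, address, selector or None)


def resolve_base(parsed: ParsedRef, namespaces: dict[str, str]) -> PurePosixPath:
    """Map the namespace to its frontmatter base path and append the address."""
    try:
        base = namespaces[parsed.namespace]
    except KeyError:
        raise KeyError(f"unknown namespace: {parsed.namespace!r}") from None
    return PurePosixPath(base) / parsed.address


namespaces = {"std": "./standards/", "src": "../src/"}
clause = parse_ref("std:clauses/payment")
snippet = parse_ref("src:api#tag:create-user")
```

Here `resolve_base(clause, namespaces)` yields `standards/clauses/payment`, and `snippet.selector` is `tag:create-user`. Boundary checks such as rejecting paths that escape the allowed roots would still have to run on the resolved path.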

### Region and Chunk Syntax

Use comments for regions so they can live inside Markdown or source files:

```markdown
<!-- mkt:region id="overview" -->
Reusable content.
<!-- /mkt:region -->
```
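
A line-based extractor for these markers could look roughly like the following. It is a sketch under stated assumptions: markers sit on their own line, regions do not nest, and fence tracking is reduced to toggling on lines that start with three backticks. A real implementation would take fence boundaries from the parser. `extract_regions` is a hypothetical name.

```python
from __future__ import annotations

import re

OPEN_RE = re.compile(r'<!--\s*mkt:region\s+id="([^"]+)"\s*-->')
CLOSE_RE = re.compile(r'<!--\s*/mkt:region\s*-->')


def extract_regions(text: str) -> dict[str, str]:
    """Return {region id: body}, ignoring markers inside fenced code blocks."""
    regions: dict[str, str] = {}
    current_id = None
    buffer: list[str] = []
    in_fence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("```"):
            in_fence = not in_fence  # simplified fence tracking (assumption)
        if not in_fence:
            opened = OPEN_RE.fullmatch(stripped)
            if opened:
                if current_id is not None:
                    raise ValueError("nested regions are not supported in this sketch")
                if opened.group(1) in regions:
                    raise ValueError(f"duplicate region id: {opened.group(1)!r}")
                current_id, buffer = opened.group(1), []
                continue
            if CLOSE_RE.fullmatch(stripped):
                if current_id is None:
                    raise ValueError("closing marker without an open region")
                regions[current_id] = "\n".join(buffer)
                current_id = None
                continue
        if current_id is not None:
            buffer.append(line)
    if current_id is not None:
        raise ValueError(f"unclosed region: {current_id!r}")
    return regions


sample = 'Intro\n<!-- mkt:region id="overview" -->\nReusable content.\n<!-- /mkt:region -->\n'
regions = extract_regions(sample)
```

For the sample above, `regions` maps `overview` to `Reusable content.`.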

Use fenced blocks for executable or tangleable chunks:

````markdown
```python {#load-config tangle="src/config.py"}
def load_config(path):
    return {}
```
````

Chunk references can stay close to noweb:

```text
<<load-config>>
```

The processor layer decides whether chunk references are expanded during
tangle, displayed during weave, or left literal.
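
Before the processor layer can make that decision, the fence info string has to be split into language, chunk ID, and attributes. The sketch below assumes the brace syntax shown above (`#id` plus `key="value"` pairs) and nothing more; `parse_info_string` is a hypothetical helper, and unquoted values or `.class` tokens are deliberately ignored.

```python
from __future__ import annotations

import re

# language token, then an optional {...} attribute group
INFO_RE = re.compile(r'^(?P<lang>[^\s{]*)\s*(?:\{(?P<attrs>[^}]*)\})?\s*$')
# either "#chunk-id" or key="value"
ATTR_RE = re.compile(r'#(?P<id>[\w-]+)|(?P<key>[\w-]+)="(?P<value>[^"]*)"')


def parse_info_string(info: str) -> tuple[str | None, str | None, dict[str, str]]:
    """Return (language, chunk id, attributes) for a fence info string."""
    match = INFO_RE.match(info.strip())
    if match is None:
        raise ValueError(f"unsupported info string: {info!r}")
    chunk_id = None
    attrs: dict[str, str] = {}
    for item in ATTR_RE.finditer(match.group("attrs") or ""):
        if item.group("id"):
            chunk_id = item.group("id")
        else:
            attrs[item.group("key")] = item.group("value")
    return match.group("lang") or None, chunk_id, attrs


lang, chunk_id, attrs = parse_info_string('python {#load-config tangle="src/config.py"}')
```

For the chunk shown earlier this gives the language `python`, the ID `load-config`, and a `tangle` attribute pointing at `src/config.py`.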

### Processor Registry

Processors should be pluggable but explicit. A processor receives:

- unit content and metadata
- resolver
- execution context
- policy context
- output target request

It returns:

- transformed content, generated files, or computed values
- diagnostics
- dependency edges
- provenance events

Core processors should start deterministic: include, region, explode/implode,
tangle, weave, and simple text/Markdown transforms. Executing arbitrary code is
a later, opt-in capability.
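
A minimal registry along these lines might be sketched as follows. It simplifies the inputs to content plus a metadata dict (resolver, execution context, and policy context are omitted), and every name in it (`ProcessorResult`, `ProcessorRegistry`, `uppercase_processor`) is hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProcessorResult:
    content: str | None = None                             # transformed content
    files: dict[str, str] = field(default_factory=dict)    # generated files
    diagnostics: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)  # dependency edges


# Simplified processor signature (assumption): (content, metadata) -> result.
Processor = Callable[[str, dict], ProcessorResult]


class ProcessorRegistry:
    def __init__(self) -> None:
        self._processors: dict[str, Processor] = {}

    def register(self, name: str, processor: Processor) -> None:
        # Explicit registration: silently replacing a processor is an error.
        if name in self._processors:
            raise ValueError(f"processor already registered: {name!r}")
        self._processors[name] = processor

    def run(self, name: str, content: str, metadata: dict) -> ProcessorResult:
        processor = self._processors.get(name)
        if processor is None:
            # Report a diagnostic instead of raising, so callers can collect issues.
            return ProcessorResult(diagnostics=[f"unknown processor: {name!r}"])
        return processor(content, metadata)


def uppercase_processor(content: str, metadata: dict) -> ProcessorResult:
    """Deterministic toy transform used only for illustration."""
    return ProcessorResult(content=content.upper())


registry = ProcessorRegistry()
registry.register("upper", uppercase_processor)
result = registry.run("upper", "abc", {})
```

Returning a diagnostic for an unknown processor, instead of raising, matches the preference for structured diagnostics expressed elsewhere in this note.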

### Explode/Implode

Explode/implode should become a first-class reversible operation, not a loose
directory export. The manifest should include:

- original path and hash
- variant type (`flat`, `hierarchical`, `semantic`)
- frontmatter preservation policy
- section/chunk/source-span entries
- file paths and order
- heading-level policy
- warnings and lossless round-trip checks

The old `markitect-main` flat/hierarchical/semantic variants are worth
reimplementing behind a small variant interface.
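
To make the round-trip idea concrete, here is an in-memory sketch of a `flat` variant. It assumes sections split at level-2 headings outside fenced blocks, uses a reduced manifest (variant, original hash, ordered entries), and tracks fences with a simple three-backtick toggle. The function names and the manifest keys are hypothetical.

```python
from __future__ import annotations

import hashlib


def sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def explode_flat(text: str) -> tuple[dict, dict[str, str]]:
    """Split at '## ' headings outside fences; return (manifest, files)."""
    parts: list[list[str]] = [[]]
    in_fence = False
    for line in text.splitlines(keepends=True):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # simplified fence tracking (assumption)
        if not in_fence and line.startswith("## ") and parts[-1]:
            parts.append([])
        parts[-1].append(line)
    files: dict[str, str] = {}
    entries = []
    for order, lines in enumerate(parts):
        path = f"{order:03d}.md"
        files[path] = "".join(lines)
        entries.append({"path": path, "order": order, "hash": sha256(files[path])})
    manifest = {"variant": "flat", "original_hash": sha256(text), "entries": entries}
    return manifest, files


def implode(manifest: dict, files: dict[str, str]) -> str:
    """Concatenate the files in manifest order."""
    ordered = sorted(manifest["entries"], key=lambda entry: entry["order"])
    return "".join(files[entry["path"]] for entry in ordered)


def is_lossless(manifest: dict, files: dict[str, str]) -> bool:
    """True when imploding reproduces the original document byte for byte."""
    return sha256(implode(manifest, files)) == manifest["original_hash"]


doc = "# Title\n\nIntro.\n\n## A\n\n```text\n## not a heading\n```\n\n## B\n\nEnd.\n"
manifest, files = explode_flat(doc)
```

The sample document explodes into three files, and `implode(manifest, files)` returns the original text as long as no file has been edited. After an edit, the per-entry hashes show which file changed.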

### Weave/Tangle

Tangle extracts named chunks to target files, expanding chunk references in a
deterministic dependency order. Weave renders human-readable documentation with
chunk backlinks and optional source indexes.

Minimum useful scope for an MVP:

- discover named fenced blocks
- support `tangle="<path>"`
- concatenate multiple chunks for the same target in document order
- expand `<<chunk-id>>` inside code
- detect missing/cyclic chunk references
- optionally emit source-mapping comments
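
The core of that list (expansion, concatenation per target, missing and cyclic reference detection) fits in a short sketch. It assumes chunks have already been discovered and are passed in as a dict of ID to code, that a reference occupies a whole line, and that expanded lines inherit the indentation of the reference. `expand`, `tangle`, and `TangleError` are hypothetical names, and source-mapping comments are left out.

```python
from __future__ import annotations

import re

# A reference must be the only thing on its line (assumption).
CHUNK_REF = re.compile(r"^(?P<indent>[ \t]*)<<(?P<id>[\w-]+)>>[ \t]*$")


class TangleError(Exception):
    """Raised for missing or cyclic chunk references."""


def expand(chunk_id: str, chunks: dict[str, str], stack: tuple[str, ...] = ()) -> list[str]:
    """Return the lines of a chunk with all <<refs>> expanded recursively."""
    if chunk_id in stack:
        cycle = " -> ".join((*stack, chunk_id))
        raise TangleError(f"cyclic chunk reference: {cycle}")
    if chunk_id not in chunks:
        raise TangleError(f"missing chunk: {chunk_id}")
    lines: list[str] = []
    for line in chunks[chunk_id].splitlines():
        ref = CHUNK_REF.match(line)
        if ref is None:
            lines.append(line)
            continue
        indent = ref.group("indent")
        for sub in expand(ref.group("id"), chunks, (*stack, chunk_id)):
            lines.append(indent + sub if sub else sub)
    return lines


def tangle(chunks: dict[str, str], targets: list[tuple[str, str]]) -> dict[str, str]:
    """targets lists (chunk id, target path) pairs in document order."""
    files: dict[str, list[str]] = {}
    for chunk_id, path in targets:
        files.setdefault(path, []).extend(expand(chunk_id, chunks))
    return {path: "\n".join(lines) + "\n" for path, lines in files.items()}


chunks = {
    "main": "def main():\n    <<body>>\n",
    "body": "print('hi')\nreturn 0\n",
    "run": "main()\n",
}
output = tangle(chunks, [("main", "app.py"), ("run", "app.py")])
```

Both `main` and `run` target `app.py`, so they are concatenated in the order given, and the `body` chunk is indented to match its reference inside `main`.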

### Content Class and Multiple Inheritance

Document classes should be data, not Python inheritance. A class can define:

- slots
- required sections
- snippets
- assertions
- processors
- merge policies

An instance declares:

```yaml
document_class:
  extends:
    - contract:prd
    - market:enterprise
    - jurisdiction:eu
```

Resolution should use a C3-like linearization. Merge policies must be explicit:

- `replace`
- `append`
- `prepend`
- `deep_merge`
- `before:<slot>`
- `after:<slot>`
- `error_on_conflict`

Diagnostics should report inconsistent precedence, ambiguous slot definitions,
and merge-policy violations.
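
For reference, a C3 merge over class names can be sketched in a few lines. The `parents` mapping is a hypothetical stand-in for resolved `extends` declarations, and the shared `base` class is invented for the example. Slot merge policies are not modeled here, only the ordering.

```python
from __future__ import annotations


def c3_merge(sequences: list[list[str]]) -> list[str]:
    """Merge linearizations, always taking a head that appears in no tail."""
    result: list[str] = []
    seqs = [list(seq) for seq in sequences if seq]
    while seqs:
        for seq in seqs:
            head = seq[0]
            if not any(head in other[1:] for other in seqs):
                break
        else:
            heads = ", ".join(seq[0] for seq in seqs)
            raise ValueError(f"inconsistent precedence among: {heads}")
        result.append(head)
        # Drop the chosen head from the front of every sequence that starts with it.
        seqs = [seq[1:] if seq[0] == head else seq for seq in seqs]
        seqs = [seq for seq in seqs if seq]
    return result


def linearize(name: str, parents: dict[str, list[str]]) -> list[str]:
    """C3 linearization: the class, then the merge of parent linearizations."""
    bases = parents.get(name, [])
    return [name] + c3_merge([linearize(base, parents) for base in bases] + [list(bases)])


parents = {
    "doc": ["contract:prd", "market:enterprise", "jurisdiction:eu"],
    "contract:prd": ["base"],
    "market:enterprise": ["base"],
    "jurisdiction:eu": ["base"],
    "base": [],
}
order = linearize("doc", parents)
```

With these parents the order is `doc`, `contract:prd`, `market:enterprise`, `jurisdiction:eu`, `base`: local declaration order is preserved and the shared base comes last. Two classes that list the same parents in opposite orders cannot be combined, and the `ValueError` is the hook for the inconsistent-precedence diagnostic.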

## Comparison with Current Implementation

What we have now is a good kernel:

- Parser/frontmatter/sections/blocks
- Contracts and deterministic diagnostics
- Query/extraction over structured documents
- Transform, compose, and include operations
- Safe include path boundaries and cycle checks

What is missing for the richer framework:

- stable content IDs and namespaces
- region/tag selectors
- fenced-block-aware transforms
- operation provenance and dependency graphs
- structured include diagnostics, not only fail-fast exceptions
- reversible explode/implode with manifests
- processor registry
- named chunks and weave/tangle
- class/object composition with deterministic multi-inheritance
- line/source maps across generated outputs
- security labels and policy hooks on resolved units

The clean path is to keep the current ops as the small deterministic surface and
grow the richer system as a framework layer on top. That keeps simple CLI use
intact while leaving room for more sophisticated knowledge and programming
pipelines.