# Content References, Processors, and Literate Workflows

Date: 2026-05-04

## Purpose

This note records the follow-up research after the first transform, compose,
and include implementation. The goal is to keep `markitect-tool` close to
Markdown while preserving the richer ideas that made `markitect-main`
interesting: reversible explode/implode, transclusion, processors, namespaces,
content references, and Knuth-style weave/tangle workflows.

## Research Inputs

- [WEB on CTAN](https://ctan.org/pkg/web) and
  [Knuth/Levy CWEB](https://cs.stanford.edu/~knuth/cweb.html): literate source
  is processed in two directions, one for compilable source and one for
  readable documentation.
- [noweb Hacker's Guide](https://www.cs.tufts.edu/~nr/noweb/guide.html):
  language-independent literate programming benefits from a pipeline
  representation and named chunks that tools can extend.
- [Org Babel](https://orgmode.org/worg/org-contrib/babel/intro.html): source
  blocks can be executable, parameterized, named, reused, tangled to files, and
  woven into reproducible documents.
- [CommonMark fenced code blocks](https://spec.commonmark.org/0.31.2/#fenced-code-blocks):
  fenced blocks are first-class Markdown structure and must be handled by the
  parser, not by naive global text rewrites.
- [Asciidoctor include directives](https://docs.asciidoctor.org/asciidoc/latest/directives/include/)
  and [tagged regions](https://docs.asciidoctor.org/asciidoc/latest/directives/include-tagged-regions/):
  includes need predictable base-dir resolution, safe-mode boundaries, line and
  tag selection, and source-code-region reuse.
- [Sphinx literalinclude](https://www.sphinx-doc.org/en/master/usage/restructuredtext/directives.html):
  code inclusion commonly needs line ranges, object-level extraction,
  highlighting metadata, dedent, and original line-number handling.
- [DITA conref](https://docs.oasis-open.org/dita/dita/v1.3/os/part2-tech-content/archSpec/base/conref.html)
  and [conkeyref](https://docs.oasis-open.org/dita/dita/v1.3/os/part1-base/langRef/attributes/theconkeyrefattribute.html):
  content reuse becomes much stronger when references have IDs, keys, scoped
  indirection, validity checks, and clear attribute merge rules.
- [W3C XInclude](https://www.w3.org/TR/xinclude/): inclusion should have an
  explicit processing model, target addressing, and fallback behavior.
- [JSON-LD 1.1 contexts](https://www.w3.org/TR/json-ld/): namespaces can map
  short terms to stable global identifiers while retaining compact authoring.
- [Python C3 MRO](https://www.python.org/download/releases/2.3/mro/) and
  [CLOS concepts](https://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node261.html):
  multiple inheritance needs deterministic linearization, monotonicity, local
  precedence, and explicit method/slot combination rules.
- [Pandoc filters](https://pandoc.org/filters.html): processors can be cleanly
  modeled as AST transformations over document nodes and code blocks.

## Lessons

Markdown can carry a surprisingly rich system if the extra semantics are placed
in stable, inspectable constructs:

- Frontmatter declares document-level identity, namespaces, and defaults.
- Headings and fenced blocks become addressable content units.
- Include/transclusion is a resolver over content references, not only file
  expansion.
- Processors operate on typed blocks and produce diagnostics, dependencies,
  generated content, or files.
- Weave/tangle is a special case of named content units plus processor targets.
- Explode/implode needs a manifest with source spans and stable IDs so the
  directory form is not a lossy export.
- Multiple inheritance is useful for document templates, regulatory overlays,
  style/persona overlays, and reusable content classes, but only if merge
  behavior is deterministic and diagnosable.

## Use Cases

### 1. Reversible Large-Document Editing

An author explodes a long PRD/FRS into a directory, edits sections in separate
files, then implodes it back into a canonical single Markdown document. The
manifest preserves frontmatter policy, heading levels, ordering, source spans,
and generated filenames.

### 2. Knuth-Style Markdown Weave/Tangle

A document explains a program in the order best for human understanding. Named
code chunks are declared in fenced blocks, cross-reference each other, and
tangle into one or more source files. The woven output keeps prose, chunk
cross-links, and optionally generated indexes.

### 3. Executable Documentation Pipelines

Fenced blocks act as processors: shell, Python, SQL, validation, diagram, or
custom processors can consume inputs, emit outputs, and record dependencies.
Execution is optional and controlled; pure transforms remain deterministic.

### 4. Reusable Legal, Contract, and Product Clauses

Common clauses are defined once with stable IDs. Documents include them by
namespace/key and can select variants by jurisdiction, customer type, language,
or document class. Diagnostics explain missing keys and conflicting variants.

### 5. Source Snippet Documentation

Docs include code by tag, line range, parser object, or named block while
preserving source line references. This supports API docs, changelog examples,
and tutorials that stay aligned with source files.

### 6. Content Classes and Multiple Inheritance

A document can be treated as an instance of several content classes: for
example `base:prd`, `market:enterprise`, `jurisdiction:eu`, and
`style:board-brief`. Slot values, assertions, sections, and snippets resolve in
a deterministic order with explicit merge strategies.

### 7. Agent Context Packages

An agent can request a namespace, topic, chunk, section, or graph slice and get
a bounded context package with provenance, dependencies, hashes, and security
labels. This dovetails with later cache and memory work.

### 8. Security-Sensitive Knowledge Gateways

References and processor outputs carry labels. Policy can filter or redact
content before transclusion, weaving, tangling, or context-package creation.

## Architecture Blueprint

### Content Unit Model

The parser should expose addressable units beyond the current document,
section, and block lists:

- document
- frontmatter path
- section
- block
- fenced block
- named region
- named chunk
- processor result

Each unit should have:

- stable local ID
- optional global name
- source path and source span
- kind/type
- content hash
- dependency list
- labels/policy metadata

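
The attribute list above maps naturally onto a small immutable record. The
following is a minimal sketch, not the project's actual model: the names
`ContentUnit`, `SourceSpan`, and every field name are assumptions made for
illustration, and SHA-256 is just one plausible choice of content hash.

```python
from dataclasses import dataclass
import hashlib
from typing import Optional


@dataclass(frozen=True)
class SourceSpan:
    """Half-open line range [start_line, end_line) within a source file."""
    start_line: int
    end_line: int


@dataclass(frozen=True)
class ContentUnit:
    """Hypothetical record for one addressable unit; field names are assumptions."""
    local_id: str
    kind: str                      # e.g. "section", "fenced_block", "named_chunk"
    source_path: str
    span: SourceSpan
    content: str
    global_name: Optional[str] = None
    dependencies: tuple[str, ...] = ()
    labels: frozenset[str] = frozenset()

    @property
    def content_hash(self) -> str:
        # SHA-256 over the UTF-8 content, so identical text hashes identically.
        return hashlib.sha256(self.content.encode("utf-8")).hexdigest()


unit = ContentUnit(
    local_id="overview",
    kind="named_region",
    source_path="docs/intro.md",
    span=SourceSpan(10, 14),
    content="Reusable content.\n",
    labels=frozenset({"public"}),
)
```

Deriving the hash from the content rather than storing it keeps the record
from drifting out of sync; a real implementation might cache it instead.
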
### Reference Syntax

Keep Markdown readable and allow several levels of precision:

```markdown
<!-- mkt:include ref="std:clauses/payment" -->
<!-- mkt:include path="sections/intro.md" selector="sections[heading=Summary]" -->
<!-- mkt:include ref="src:api#tag:create-user" mode="literal" -->
```

Frontmatter can define namespaces:

```yaml
namespaces:
  std: ./standards/
  src: ../src/
  contract: ./contracts/
```

References should resolve through a single resolver API:

```text
namespace + address + selector + mode + context -> resolved content unit(s)
```

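
As a rough illustration of the first step of such a resolver, the sketch below
splits a `ref` value into namespace, address, and optional selector, then maps
the namespace to its declared directory. The names `ParsedRef`, `parse_ref`,
and `resolve_base` are hypothetical, and the `ns:address#selector` grammar is
inferred from the examples above rather than from a specification. How an
address maps to a concrete file (extension, index file, and so on) is left
open here.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class ParsedRef:
    namespace: str
    address: str
    selector: Optional[str]


def parse_ref(ref: str) -> ParsedRef:
    """Split 'ns:address#selector' into parts; the selector is optional."""
    namespace, sep, rest = ref.partition(":")
    if not sep or not namespace or not rest:
        raise ValueError(f"reference needs a 'namespace:address' form: {ref!r}")
    address, _, selector = rest.partition("#")
    return ParsedRef(namespace, address, selector or None)


def resolve_base(parsed: ParsedRef, namespaces: dict[str, str]) -> PurePosixPath:
    """Map the namespace to its declared directory and join the address."""
    if parsed.namespace not in namespaces:
        raise KeyError(f"unknown namespace: {parsed.namespace}")
    return PurePosixPath(namespaces[parsed.namespace]) / parsed.address


namespaces = {"std": "./standards/", "src": "../src/"}
payment = parse_ref("std:clauses/payment")
create_user = parse_ref("src:api#tag:create-user")
```

Only the first `:` separates the namespace, so a selector such as
`tag:create-user` can contain colons of its own.
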
### Region and Chunk Syntax

Use comments for regions so they can live inside Markdown or source files:

```markdown
<!-- mkt:region id="overview" -->
Reusable content.
<!-- /mkt:region -->
```

Use fenced blocks for executable or tangleable chunks:

````markdown
```python {#load-config tangle="src/config.py"}
def load_config(path):
    return {}
```
````

Chunk references can stay close to noweb:

```text
<<load-config>>
```

The processor layer decides whether chunk references are expanded during
tangle, displayed during weave, or left literal.

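
To make the region markers concrete, here is a minimal line-based sketch that
collects region bodies by ID. It is an illustration only: the function name
`extract_regions` is hypothetical, nesting is not supported, and the scan does
not skip markers that appear inside fenced code blocks, which a parser-backed
implementation would need to do.

```python
import re

REGION_OPEN = re.compile(r'<!--\s*mkt:region\s+id="([^"]+)"\s*-->')
REGION_CLOSE = re.compile(r'<!--\s*/mkt:region\s*-->')


def extract_regions(text: str) -> dict[str, str]:
    """Return {region id: body} for non-nested regions, scanning line by line."""
    regions: dict[str, str] = {}
    current_id = None
    body: list[str] = []
    for line in text.splitlines():
        opened = REGION_OPEN.search(line)
        if current_id is None and opened:
            # Start collecting lines for a new region.
            current_id, body = opened.group(1), []
        elif current_id is not None and REGION_CLOSE.search(line):
            # Close the active region and store its body without the markers.
            regions[current_id] = "\n".join(body)
            current_id = None
        elif current_id is not None:
            body.append(line)
    if current_id is not None:
        raise ValueError(f"unclosed region: {current_id}")
    return regions


sample = (
    "Intro\n"
    '<!-- mkt:region id="overview" -->\n'
    "Reusable content.\n"
    "<!-- /mkt:region -->\n"
    "Tail\n"
)
regions = extract_regions(sample)
```
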
### Processor Registry

Processors should be pluggable but explicit. A processor receives:

- unit content and metadata
- resolver
- execution context
- policy context
- output target request

It returns:

- transformed content, generated files, or computed values
- diagnostics
- dependency edges
- provenance events

Core processors should start deterministic: include, region, explode/implode,
tangle, weave, and simple text/Markdown transforms. Executing arbitrary code is
a later, opt-in capability.

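
One way to read "pluggable but explicit" is a registry where every processor
is registered by name and nothing is discovered implicitly. The sketch below
assumes that reading. `ProcessorRegistry`, `ProcessorResult`, and the
two-argument processor signature are hypothetical simplifications; the
resolver, execution context, policy context, and provenance events listed
above are omitted to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProcessorResult:
    content: str
    diagnostics: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)


# A processor here takes (content, metadata) and returns a ProcessorResult.
Processor = Callable[[str, dict], ProcessorResult]


class ProcessorRegistry:
    """Explicit name -> processor mapping; nothing is discovered implicitly."""

    def __init__(self) -> None:
        self._processors: dict[str, Processor] = {}

    def register(self, name: str, processor: Processor) -> None:
        if name in self._processors:
            raise ValueError(f"processor already registered: {name}")
        self._processors[name] = processor

    def run(self, name: str, content: str, metadata: dict) -> ProcessorResult:
        if name not in self._processors:
            # Report unknown processors as a diagnostic instead of raising.
            return ProcessorResult(content, diagnostics=[f"unknown processor: {name}"])
        return self._processors[name](content, metadata)


def upper_processor(content: str, metadata: dict) -> ProcessorResult:
    """Toy deterministic transform used only for illustration."""
    return ProcessorResult(content.upper())


registry = ProcessorRegistry()
registry.register("upper", upper_processor)
```

Returning a diagnostic for an unknown processor, rather than raising, is a
design choice in this sketch; it leaves the input content untouched so a
caller can continue and report all problems at once.
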
### Explode/Implode

Explode/implode should become a first-class reversible operation, not a loose
directory export. The manifest should include:

- original path and hash
- variant type (`flat`, `hierarchical`, `semantic`)
- frontmatter preservation policy
- section/chunk/source-span entries
- file paths and order
- heading-level policy
- warnings and non-lossy roundtrip checks

The old `markitect-main` flat/hierarchical/semantic variants are worth
reimplementing behind a small variant interface.

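
The round-trip guarantee can be illustrated with a deliberately small `flat`
variant. This sketch is an assumption-laden toy, not the `markitect-main`
behavior: it splits before every line starting with `## ` (without consulting
the parser, so such a line inside a fenced block would also split), works on
in-memory strings instead of a directory, and records only a subset of the
manifest fields listed above. The manifest keys and generated filenames are
hypothetical.

```python
import hashlib


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def explode(text: str) -> tuple[dict, dict[str, str]]:
    """Split before each '## ' line; return (manifest, {filename: content})."""
    parts: list[list[str]] = [[]]
    for line in text.splitlines(keepends=True):
        if line.startswith("## ") and parts[-1]:
            parts.append([])
        parts[-1].append(line)
    files = {f"{i:03d}.md": "".join(lines) for i, lines in enumerate(parts)}
    manifest = {
        "variant": "flat",
        "original_hash": _sha256(text),
        # Entry order is the document order used by implode().
        "entries": [{"path": name, "hash": _sha256(body)} for name, body in files.items()],
    }
    return manifest, files


def implode(manifest: dict, files: dict[str, str]) -> str:
    """Rejoin files in manifest order and verify the round trip by hash."""
    text = "".join(files[entry["path"]] for entry in manifest["entries"])
    if _sha256(text) != manifest["original_hash"]:
        raise ValueError("round trip is lossy: hash mismatch")
    return text


doc = "# T\n\nintro\n\n## A\n\na\n\n## B\n\nb\n"
manifest, files = explode(doc)
restored = implode(manifest, files)
```

In a real workflow the author edits the exploded files, so the whole-document
hash check would apply only to an unedited round trip; per-entry hashes are
what would let a tool tell which files changed.
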
### Weave/Tangle

Tangle extracts named chunks to target files, expanding chunk references in a
deterministic dependency order. Weave renders human-readable documentation with
chunk backlinks and optional source indexes.

Minimum useful MVP:

- discover named fenced blocks
- support `tangle="<path>"`
- concatenate multiple chunks for the same target in document order
- expand `<<chunk-id>>` inside code
- detect missing/cyclic chunk references
- emit source mapping comments optionally

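
Several of the MVP items above (expansion, same-target concatenation, and
missing/cyclic detection) fit in a short sketch. It assumes chunks have
already been discovered and are passed in as a plain dictionary, and that a
reference occupies a line of its own, which is stricter than noweb. The
function names `expand` and `tangle` and the allowed ID characters are
assumptions for illustration.

```python
import re

CHUNK_REF = re.compile(r"^(\s*)<<([A-Za-z0-9_.-]+)>>\s*$")


def expand(chunk_id: str, chunks: dict[str, str], stack: tuple[str, ...] = ()) -> str:
    """Recursively expand <<id>> lines, keeping the reference's indentation."""
    if chunk_id not in chunks:
        raise KeyError(f"missing chunk: {chunk_id}")
    if chunk_id in stack:
        raise ValueError("cyclic chunk reference: " + " -> ".join(stack + (chunk_id,)))
    out: list[str] = []
    for line in chunks[chunk_id].splitlines():
        match = CHUNK_REF.match(line)
        if match is None:
            out.append(line)
            continue
        indent, ref = match.groups()
        expanded = expand(ref, chunks, stack + (chunk_id,))
        # Indent every non-empty expanded line to match the reference line.
        out.extend(indent + sub if sub else sub for sub in expanded.splitlines())
    return "\n".join(out)


def tangle(chunks: dict[str, str], targets: list[tuple[str, str]]) -> dict[str, str]:
    """targets lists (chunk id, path) in document order; same-path chunks concatenate."""
    files: dict[str, list[str]] = {}
    for chunk_id, path in targets:
        files.setdefault(path, []).append(expand(chunk_id, chunks))
    return {path: "\n".join(parts) + "\n" for path, parts in files.items()}


chunks = {
    "imports": "import json",
    "main": "def main():\n    <<body>>",
    "body": "cfg = load()\nprint(cfg)",
}
output = tangle(chunks, [("imports", "src/app.py"), ("main", "src/app.py")])
```

Carrying the indentation of the reference line into the expansion matters for
indentation-sensitive targets such as Python.
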
### Content Class and Multiple Inheritance

Document classes should be data, not Python inheritance. A class can define:

- slots
- required sections
- snippets
- assertions
- processors
- merge policies

An instance declares:

```yaml
document_class:
  extends:
    - contract:prd
    - market:enterprise
    - jurisdiction:eu
```

Resolution should use a C3-like linearization. Merge policies must be explicit:

- `replace`
- `append`
- `prepend`
- `deep_merge`
- `before:<slot>`
- `after:<slot>`
- `error_on_conflict`

Diagnostics should report inconsistent precedence, ambiguous slot definitions,
and merge-policy violations.

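
Because classes are data, the linearization can run over a plain mapping from
class name to parent names. The sketch below implements the standard C3 merge
over such a mapping; the function names and the `doc` and `base` class names
in the example are hypothetical, and slot merging under the policies above is
not shown.

```python
def c3_merge(sequences: list[list[str]]) -> list[str]:
    """Merge linearizations, always picking a head that is in no other tail."""
    sequences = [list(seq) for seq in sequences if seq]
    result: list[str] = []
    while sequences:
        for seq in sequences:
            head = seq[0]
            if not any(head in other[1:] for other in sequences):
                break
        else:
            # No candidate head is valid: the declared orders contradict each other.
            raise ValueError("inconsistent precedence: no valid linearization")
        result.append(head)
        sequences = [[c for c in seq if c != head] for seq in sequences]
        sequences = [seq for seq in sequences if seq]
    return result


def linearize(name: str, bases: dict[str, list[str]]) -> list[str]:
    """C3 order for `name`; `bases` maps each class to its direct parents."""
    parents = bases.get(name, [])
    return [name] + c3_merge([linearize(p, bases) for p in parents] + [list(parents)])


bases = {
    "doc": ["contract:prd", "market:enterprise", "jurisdiction:eu"],
    "contract:prd": ["base"],
    "market:enterprise": ["base"],
    "jurisdiction:eu": ["base"],
    "base": [],
}
order = linearize("doc", bases)
```

The `ValueError` path corresponds to the "inconsistent precedence" diagnostic:
it fires when two classes list the same parents in opposite orders and a third
class extends both.
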
## Comparison with Current Implementation

What we have now is a good kernel:

- Parser/frontmatter/sections/blocks
- Contracts and deterministic diagnostics
- Query/extraction over structured documents
- Transform, compose, and include operations
- Safe include path boundaries and cycle checks

What is missing for the richer framework:

- stable content IDs and namespaces
- region/tag selectors
- fenced-block-aware transforms
- operation provenance and dependency graphs
- structured include diagnostics instead of fail-fast exceptions only
- reversible explode/implode with manifests
- processor registry
- named chunks and weave/tangle
- class/object composition with deterministic multi-inheritance
- line/source maps across generated outputs
- security labels and policy hooks on resolved units

The clean path is to keep current ops as the small deterministic surface and
grow this richer system as a framework layer. That protects simple CLI use while
opening a strong route to sophisticated knowledge/programming pipelines.