diff --git a/docs/content-classes.md b/docs/content-classes.md new file mode 100644 index 0000000..0ab4267 --- /dev/null +++ b/docs/content-classes.md @@ -0,0 +1,79 @@ +# Content Classes + +Date: 2026-05-04 + +## Purpose + +Content classes are data-defined composition rules for reusable document +structures, overlays, and variants. They are not Python inheritance. They are a +deterministic way to combine slots such as sections, assertions, snippets, +processors, and style guidance. + +This is the P10.7 resolver spike for future class/object-style workflows. + +## Model + +A class can declare: + +- `extends`: parent classes +- `slots`: structured values to contribute +- `merge_policies`: per-slot merge behavior + +Example: + +```yaml +classes: + base-prd: + slots: + sections: + - Problem + - Decision + enterprise: + extends: + - base-prd + slots: + sections: + - Compliance + merge_policies: + sections: append +``` + +## Linearization + +Multiple inheritance uses a C3-style linearization. That gives us: + +- deterministic parent ordering +- monotonic inheritance behavior +- explicit diagnostics for cycles, unknown parents, and inconsistent precedence + +The resolved class is merged from base to leaf according to the computed +linearization. + +## Merge Policies + +Initial policies: + +- `replace` +- `append` +- `prepend` +- `deep_merge` +- `error_on_conflict` + +Unknown policies and invalid value shapes produce diagnostics. + +## CLI + +Resolve a class: + +```bash +mkt class resolve examples/classes/prd-classes.yaml enterprise-prd +``` + +JSON/YAML output includes the linearization, merged slots, and diagnostics. + +## Extension Boundary + +The current resolver does not yet instantiate Markdown documents or inject +snippets. It establishes the deterministic inheritance and merge floor. Later +work can connect resolved slots to contracts, references, processors, and +generation plans. 
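The Linearization and Merge Policies sections of `docs/content-classes.md` can be illustrated with a short sketch. This is not the resolver in `markitect_tool.content_class`. The helper names `c3_linearize`, `deep_merge`, and `merge_slots` are hypothetical, and two behaviors are assumptions made for illustration: the policy for a slot is read from the class contributing the value, and a slot with no declared policy falls back to `replace`. The real resolver also reports diagnostics rather than raising.

```python
from __future__ import annotations


def c3_linearize(name: str, parents: dict[str, list[str]]) -> list[str]:
    """Return a C3 linearization of `name`, leaf first (hypothetical helper)."""

    def merge(sequences: list[list[str]]) -> list[str]:
        result: list[str] = []
        pending = [list(seq) for seq in sequences if seq]
        while pending:
            for seq in pending:
                head = seq[0]
                # A head is acceptable only if it is in no other sequence's tail.
                if not any(head in other[1:] for other in pending):
                    break
            else:
                raise ValueError("inconsistent precedence")
            result.append(head)
            pending = [[c for c in seq if c != head] for seq in pending]
            pending = [seq for seq in pending if seq]
        return result

    def linearize(cls: str, stack: tuple[str, ...]) -> list[str]:
        if cls in stack:
            raise ValueError(f"cycle detected at {cls!r}")
        if cls not in parents:
            raise ValueError(f"unknown class {cls!r}")
        bases = parents[cls]
        sequences = [linearize(base, stack + (cls,)) for base in bases]
        return [cls] + merge(sequences + [list(bases)])

    return linearize(name, ())


def deep_merge(base: dict, overlay: dict) -> dict:
    """Recursively merge `overlay` into a copy of `base`."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def merge_slots(linearization: list[str], classes: dict[str, dict]) -> dict:
    """Merge slots from base to leaf (assumed default policy: replace)."""
    merged: dict = {}
    for cls in reversed(linearization):  # base first, leaf last
        spec = classes[cls]
        policies = spec.get("merge_policies", {})
        for slot, value in spec.get("slots", {}).items():
            policy = policies.get(slot, "replace")
            if slot not in merged or policy == "replace":
                merged[slot] = value
            elif policy == "append":
                merged[slot] = list(merged[slot]) + list(value)
            elif policy == "prepend":
                merged[slot] = list(value) + list(merged[slot])
            elif policy == "deep_merge":
                merged[slot] = deep_merge(merged[slot], value)
            elif policy == "error_on_conflict":
                if merged[slot] != value:
                    raise ValueError(f"conflict in slot {slot!r}")
            else:
                raise ValueError(f"unknown merge policy {policy!r}")
    return merged
```

Applied to the classes in `examples/classes/prd-classes.yaml`, the sketch would linearize `enterprise-prd` as `enterprise-prd`, `enterprise`, `base-prd` and merge `sections` in base-to-leaf order. Whether the shipped resolver produces the same result depends on the assumptions stated above.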
diff --git a/docs/content-references.md b/docs/content-references.md new file mode 100644 index 0000000..95b6b5c --- /dev/null +++ b/docs/content-references.md @@ -0,0 +1,139 @@ +# Content References + +Date: 2026-05-04 + +## Purpose + +Content references are the first WP-0010 extension layer. They give Markitect a +shared way to name and resolve Markdown content units without changing the +existing parser, query, transform, compose, include, contract, or cache APIs. + +The goal is a small resolver that later features can reuse: + +- includes can accept references as well as paths +- explode/implode can write manifests with stable unit IDs +- processors can receive typed units and dependency edges +- tangle/weave can address chunks and generated outputs +- cache and access-control backends can index the same IDs + +## Reference Syntax + +References are compact strings: + +```text +path/to/file.md +path/to/file.md#section:introduction +path/to/file.md::sections[heading=Decision] +std:clauses/payment.md +std:clauses/payment.md#payment-terms +std:clauses/payment.md#region:boilerplate +std:clauses/payment.md#tag:legal +#local-section +``` + +The parts are: + +- `namespace:`: optional namespace declared in frontmatter +- `path`: a Markdown file path relative to the current document, or relative to + the namespace target +- `#fragment`: optional unit lookup inside the target document +- `::selector`: optional existing Markitect query selector + +Fragments and selectors are mutually exclusive during resolution. Selectors are +delegated to the existing query engine, which keeps this layer small and avoids +inventing a second query language. + +## Namespaces + +Namespaces live in Markdown frontmatter: + +```yaml +--- +namespaces: + std: ./standard + product: ../product-docs +--- +``` + +Namespace keys may be written with or without a trailing colon. Namespace values +are string paths. Relative namespace paths resolve under the resolver root. 
All +resolved file paths must stay inside that root. + +## Content Units + +The resolver currently emits these unit kinds: + +- `document`: full Markdown file +- `section`: heading-led Markdown section +- `heading`: heading line +- existing query kinds such as `frontmatter`, `block`, `metrics`, or `section` + +Each unit includes: + +- `unit_id`: stable local ID +- `kind` +- `source_path` +- source line span when available +- `name` +- `content_hash` +- raw text +- metadata from the source or query match + +Heading and section IDs use an explicit trailing heading ID when present: + +```markdown +## Payment Terms {#payment-terms} +``` + +Otherwise the resolver derives a slug from the heading text and adds numeric +suffixes for collisions. + +Named regions use HTML comments so they can live in Markdown and many source +files without changing the rendered document: + +```markdown + +Reusable text. + +``` + +Fenced blocks can be addressed when their info string includes an ID: + +````markdown +```python {#load-config tags="code setup" tangle="src/config.py"} +def load_config(): + return {} +``` +```` + +Supported fragments now include: + +- `#section:` +- `#heading:` +- `#region:` +- `#fence:` +- `#tag:` +- `#line:` or `#line:-` +- `#` as a convenience lookup across sections, regions, fenced blocks, and + headings + +## CLI + +Resolve a reference from a context document: + +```bash +mkt ref resolve examples/references/context.md 'std:clauses.md#payment-terms' +``` + +JSON and YAML formats include the resolved text and metadata: + +```bash +mkt ref resolve examples/references/context.md 'std:clauses.md::sections[heading=Warranty]' --format json +``` + +## Extension Boundary + +This layer is intentionally read-only. It does not replace `mkt include`, +`mkt query`, or `mkt extract`. Instead it defines the address model those tools +can adopt when their next WP-0010 tasks require richer content identity, +processor dependencies, source maps, and reversible manifests. 
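The Reference Syntax and Namespaces sections of `docs/content-references.md` describe how a reference string splits into parts and how resolved paths are confined to a root. A minimal sketch of that logic follows. It is not `parse_reference` or `resolve_reference` from `markitect_tool.reference`; the names `ParsedRef`, `parse_ref`, and `resolve_under_root` are hypothetical, and treating any prefix before the first `:` that contains no `/` as a namespace is an assumption that would misread Windows drive letters.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ParsedRef:
    """Illustrative parts of a reference string (hypothetical type)."""

    namespace: str | None
    path: str | None
    fragment: str | None
    selector: str | None


def parse_ref(text: str) -> ParsedRef:
    """Split `[namespace:]path[#fragment | ::selector]` into its parts."""
    rest, fragment, selector = text, None, None
    if "::" in rest:
        rest, selector = rest.split("::", 1)
        if "#" in rest:
            # The docs state fragments and selectors are mutually exclusive.
            raise ValueError("fragment and selector are mutually exclusive")
    elif "#" in rest:
        rest, fragment = rest.split("#", 1)

    namespace = None
    head, sep, tail = rest.partition(":")
    # Assumption: a namespace prefix never contains a path separator.
    if sep and head and "/" not in head:
        namespace, rest = head, tail
    return ParsedRef(namespace, rest or None, fragment, selector)


def resolve_under_root(root: Path, base: Path, relative: str) -> Path:
    """Resolve `relative` against `base` and refuse paths that escape `root`."""
    resolved_root = root.resolve()
    candidate = (base / relative).resolve()
    if not candidate.is_relative_to(resolved_root):  # Python 3.9+
        raise ValueError(f"{relative!r} escapes the resolver root")
    return candidate
```

Splitting the fragment or selector off before looking for the namespace keeps fragment prefixes such as `region:` from being mistaken for a namespace.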
diff --git a/docs/explode-implode.md b/docs/explode-implode.md new file mode 100644 index 0000000..1f3107f --- /dev/null +++ b/docs/explode-implode.md @@ -0,0 +1,69 @@ +# Explode and Implode + +Date: 2026-05-04 + +## Purpose + +`mkt explode` and `mkt implode` reintroduce the useful old Markitect +large-document workflow as a slim WP-0010 extension. The design is +manifest-first: the exploded directory is editable, but the manifest preserves +ordering, source spans, heading metadata, hashes, frontmatter, and the selected +layout variant. + +This keeps the operation reversible without requiring a database or service. + +## Variants + +The initial variants are: + +- `flat`: writes ordered section files under `sections/`. +- `hierarchical`: writes child section files below parent heading directories. + +Both variants preserve the same manifest model. A later semantic variant can +reuse the reference and processor framework once those layers are stable. + +## CLI + +Explode a document: + +```bash +mkt explode docs/source.md --output-dir work/source-exploded +``` + +Use a hierarchical directory shape: + +```bash +mkt explode docs/source.md --output-dir work/source-tree --variant hierarchical +``` + +Implode the directory back into one Markdown file: + +```bash +mkt implode work/source-exploded --output docs/source-rebuilt.md +``` + +By default `mkt explode` refuses to write into a non-empty output directory. Use +`--force` when an explicit overwrite is intended. + +## Manifest + +The manifest is written as `markitect-explode.yaml` in the output directory. +It records: + +- manifest version +- original source path and SHA-256 hash +- variant +- raw frontmatter block +- ordered entries with file path, kind, unit ID, source line span, heading + metadata, and content hash + +Implode reads the manifest entries in order and concatenates the current entry +files. 
If users edit section files, the rebuilt document reflects those edits +while preserving the original frontmatter and ordering. + +## Extension Boundary + +This implementation is intentionally not semantic yet. It does not infer +contracts, classes, named chunks, or processor outputs. Instead it establishes a +small reversible substrate that later WP-0010 tasks can enrich with regions, +references, processors, source maps, and weave/tangle behavior. diff --git a/docs/literate-weave-tangle.md b/docs/literate-weave-tangle.md new file mode 100644 index 0000000..5e5ff18 --- /dev/null +++ b/docs/literate-weave-tangle.md @@ -0,0 +1,79 @@ +# Literate Weave and Tangle + +Date: 2026-05-04 + +## Purpose + +The literate workflow layer brings a small Knuth-style weave/tangle capability +to Markdown without requiring a separate language. Prose stays in Markdown. +Named code chunks live in fenced blocks. Tangling emits source files. +Weaving keeps the document readable and adds a deterministic chunk index. + +## Chunk Syntax + +Named chunks use fenced block attributes: + +````markdown +```python {#helpers} +def helper(): + return "ready" +``` +```` + +A chunk becomes an output root when it declares `tangle`: + +````markdown +```python {#main tangle="src/app.py"} +<> + +def main(): + return helper() +``` +```` + +Chunk references use noweb-style syntax: + +```text +<> +``` + +Whole-line chunk references preserve indentation when expanded. + +## CLI + +Tangle files: + +```bash +mkt tangle examples/literate/app.md --output-dir build/literate +``` + +Inspect without writing: + +```bash +mkt tangle examples/literate/app.md --format json +``` + +Weave documentation: + +```bash +mkt weave examples/literate/app.md --output build/app-woven.md +``` + +## Diagnostics + +Tangling reports structured diagnostics for missing chunks and cyclic chunk +references. Tangled files are only written by the CLI when the result is valid. 
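The chunk expansion and diagnostics described in `docs/literate-weave-tangle.md` can be sketched as a small recursive function. This is an illustration only, not `tangle_markdown` from `markitect_tool.literate`. The name `expand_chunk` and the plain-string diagnostics are hypothetical, and the sketch assumes whole-line references written in the conventional noweb form `<<name>>`.

```python
from __future__ import annotations

import re

# Assumption: a whole-line reference in the conventional noweb form <<name>>.
CHUNK_REF = re.compile(r"^(?P<indent>[ \t]*)<<(?P<name>[^<>]+)>>[ \t]*$")


def expand_chunk(
    name: str,
    chunks: dict[str, str],
    diagnostics: list[str],
    stack: tuple[str, ...] = (),
) -> list[str]:
    """Expand `name` into lines, collecting diagnostics instead of raising."""
    if name in stack:
        path = " -> ".join(stack + (name,))
        diagnostics.append(f"cyclic chunk reference: {path}")
        return []
    if name not in chunks:
        diagnostics.append(f"missing chunk: {name}")
        return []

    lines: list[str] = []
    for line in chunks[name].splitlines():
        match = CHUNK_REF.match(line)
        if match is None:
            lines.append(line)
            continue
        # Re-indent every expanded line by the indentation of the reference.
        indent = match["indent"]
        inner = expand_chunk(
            match["name"].strip(), chunks, diagnostics, stack + (name,)
        )
        lines.extend(indent + text if text else text for text in inner)
    return lines
```

A caller would write output files only when `diagnostics` is empty after expansion, which mirrors the rule that tangled files are written only for a valid result.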
+ +## Extension Boundary + +The MVP deliberately keeps the model narrow: + +- named fenced blocks +- `tangle=""` +- deterministic document-order concatenation for repeated targets +- noweb-style chunk expansion +- generated chunk index during weave + +Future extensions can add richer source maps, processor execution, +language-specific extraction, and class/namespace-aware chunk selection without +changing this initial chunk model. diff --git a/docs/markitect-main-wp0010-migration-notes.md b/docs/markitect-main-wp0010-migration-notes.md new file mode 100644 index 0000000..b7da551 --- /dev/null +++ b/docs/markitect-main-wp0010-migration-notes.md @@ -0,0 +1,46 @@ +# markitect-main WP-0010 Migration Notes + +Date: 2026-05-04 + +## Purpose + +This note captures the relevant `markitect-main` ideas that WP-0010 now +preserves in successor form. + +The migration is conceptual rather than source-compatible. The successor keeps +Markdown-native behavior and removes old platform, database, infospace, and +service assumptions. 
+ +## Parity Map + +| Legacy area | Successor shape | Status | +| --- | --- | --- | +| Explode/implode variants | `mkt explode`, `mkt implode`, manifest-first flat/hierarchical variants | Reimplemented | +| Transclusion/includes | `mkt include` for path markers; processor `mkt-include` for reference-backed content | Reimplemented with clearer boundaries | +| Spaces/infospace references | Frontmatter namespaces plus `mkt ref resolve` | Reframed as syntax-layer references | +| Fenced-block processors | Explicit deterministic processor registry | Reimplemented as opt-in extension | +| Literate workflows | `mkt tangle`, `mkt weave`, named fenced chunks, noweb references | Reimplemented as MVP | +| Content classes/overlays | Data-defined classes with C3-style linearization and merge policies | Resolver spike implemented | + +## Intentionally Not Migrated + +These old concerns stay out of the WP-0010 toolkit layer: + +- database-backed infospace lifecycle +- GraphQL/service APIs +- provider-specific LLM execution +- rendering/plugin/browser/editor infrastructure +- project finance, wishlist, and profile tooling + +## Migration Examples + +Examples live under `examples/migration/`: + +- `legacy-explode-source.md`: large document roundtrip via explode/implode. +- `legacy-transclusion-context.md`: namespace-backed reference include. +- `legacy-path-include.md`: simple path-based include marker. +- `legacy-literate.md`: named chunks tangled into source. + +The tests in `tests/test_wp0010_migration_examples.py` exercise these files as +successor fixtures. They are deliberately small, but they lock down the +behaviors we most wanted to keep from `markitect-main`. diff --git a/docs/processors.md b/docs/processors.md new file mode 100644 index 0000000..eceac5f --- /dev/null +++ b/docs/processors.md @@ -0,0 +1,81 @@ +# Fenced-Block Processors + +Date: 2026-05-04 + +## Purpose + +The processor registry is the deterministic execution boundary for WP-0010. 
+It lets Markdown fenced blocks opt into named processors while keeping +execution explicit, inspectable, and non-magical. + +Processors receive: + +- the fenced content unit +- resolver-capable context +- variables and policy maps + +Processors return: + +- generated content +- optional generated files +- diagnostics +- dependencies +- operation provenance + +No built-in processor runs arbitrary code. + +## Syntax + +A fenced block opts into processing by using an `mkt-` language: + +````markdown +```mkt-uppercase {#shout} +hello +``` +```` + +The processor can also be named with attributes: + +````markdown +```markdown {#example processor="identity"} +Rendered as-is by the identity processor. +``` +```` + +## Built-In Processors + +Initial deterministic processors: + +- `identity`: returns the fenced block content unchanged. +- `uppercase`: returns uppercased content; mainly a registry smoke-test. +- `include`: resolves a `ref` attribute through the content reference resolver. + +Reference-backed include: + +````markdown +```mkt-include {#payment ref="std:clauses.md#payment-terms"} +``` +```` + +The include processor returns the resolved content, records the target file as +a dependency, and emits operation provenance. + +## CLI + +Run processors in a document: + +```bash +mkt process examples/references/context.md --format json +``` + +Text output reports processor validity, block IDs, and the first generated +content line. JSON/YAML output includes diagnostics, dependencies, and +provenance. + +## Extension Boundary + +The registry is deliberately small. It does not render a final document yet and +does not execute shell, Python, SQL, or LLM calls. Those can become opt-in +processors later, but they should use the same result envelope so diagnostics, +dependencies, provenance, cache invalidation, and access-control hooks stay +consistent. 
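The registry and result envelope described in `docs/processors.md` can be sketched in a few lines. This is not the `ProcessorRegistry` or `ProcessorResult` API from `markitect_tool.processor`. The names `SketchResult`, `SketchRegistry`, and `resolve_processor_name` are hypothetical, the provenance record shape is invented for illustration, and letting the `processor` attribute take precedence over an `mkt-` language is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SketchResult:
    """Illustrative result envelope shared by every processor."""

    content: str = ""
    diagnostics: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)
    provenance: list[dict] = field(default_factory=list)

    @property
    def valid(self) -> bool:
        return not self.diagnostics


Processor = Callable[[str, dict], SketchResult]


class SketchRegistry:
    """Explicit name-to-callable registry; nothing runs unless registered."""

    def __init__(self) -> None:
        self._processors: dict[str, Processor] = {}

    def register(self, name: str, processor: Processor) -> None:
        self._processors[name] = processor

    def run(self, name: str, content: str, attrs: dict | None = None) -> SketchResult:
        processor = self._processors.get(name)
        if processor is None:
            return SketchResult(diagnostics=[f"unknown processor: {name}"])
        result = processor(content, attrs or {})
        # Hypothetical provenance record; the real envelope may differ.
        result.provenance.append({"operation": "process", "processor": name})
        return result


def resolve_processor_name(language: str, attrs: dict) -> str | None:
    """Pick a processor from a fence's language or `processor` attribute."""
    if "processor" in attrs:  # assumed to take precedence
        return attrs["processor"]
    if language.startswith("mkt-"):
        return language[len("mkt-"):]
    return None


registry = SketchRegistry()
registry.register("identity", lambda content, attrs: SketchResult(content=content))
registry.register("uppercase", lambda content, attrs: SketchResult(content=content.upper()))
```

Returning an unknown-processor diagnostic instead of raising keeps every outcome inside the same envelope, which is the consistency property the Extension Boundary section asks later processors to preserve.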
diff --git a/docs/transform-compose-include.md b/docs/transform-compose-include.md index e60d5e4..10fee12 100644 --- a/docs/transform-compose-include.md +++ b/docs/transform-compose-include.md @@ -27,6 +27,10 @@ Supported operations: The API equivalent is `transform_markdown(...)`. +Heading shifts are token-safe: Markdown fenced and indented code blocks are +left untouched even if their lines look like headings. `TransformResult` +includes structured provenance events alongside the older operation-name list. + ## Compose Use `mkt compose` to concatenate Markdown inputs with predictable separators: @@ -79,5 +83,12 @@ Resolution rules: directory. - Recursive includes are resolved up to `--max-depth`. - Cycles and missing files fail with explicit errors. +- Include markers inside fenced or indented code blocks are left literal. The API equivalent is `resolve_includes(...)`. + +`IncludeResult` includes structured provenance events. Each include event +records the source marker line when available, the resolved target path, +dependency edge, selector, heading shift, and frontmatter policy. This is the +first provenance envelope used by later WP-0010 processor, source-map, and +explode/implode work. diff --git a/docs/workplan-planning-map.md b/docs/workplan-planning-map.md index ee5e57d..3491738 100644 --- a/docs/workplan-planning-map.md +++ b/docs/workplan-planning-map.md @@ -32,7 +32,7 @@ and descriptions mirror the operational view. | `MKTT-WP-0004` | complete | done | `MKTT-WP-0001`, `MKTT-WP-0002` | Contract framework is complete and informs later validation/generation work. | | `MKTT-WP-0003` | complete | done | `MKTT-WP-0001`, `MKTT-WP-0002`, `MKTT-WP-0004` | Core toolkit implementation is complete. | | `MKTT-WP-0006` | P1 | todo | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T005` | Ready after transform/composition shape is clear; should account for future reference/provenance needs. 
| -| `MKTT-WP-0010` | P1 | todo | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T006` | Trigger is satisfied; keep as the richer content-reference, processor, explode/implode, and weave/tangle track. | +| `MKTT-WP-0010` | complete | done | `MKTT-WP-0004`; task-level trigger: `MKTT-WP-0003-T006` | Content references, processors, explode/implode, weave/tangle, content classes, and migration examples are complete as the first WP-0010 extension layer. | | `MKTT-WP-0007` | P2 | todo | `MKTT-WP-0006` | First practical cache backend use case: AST/JSONPath/SQLite/FTS. | | `MKTT-WP-0005` | P2 | todo | `MKTT-WP-0003`, `MKTT-WP-0004` | Pick up when generation/form/context or semantic assessment pressure appears. | | `MKTT-WP-0011` | P2 | todo | `MKTT-WP-0003`; task-level triggers: `MKTT-WP-0010-T001`, `MKTT-WP-0010-T005` | Declarative Markdown dataflow workflows: source extraction, deterministic/assisted processing, and multi-output generation. | diff --git a/examples/classes/prd-classes.yaml b/examples/classes/prd-classes.yaml new file mode 100644 index 0000000..1ec9b30 --- /dev/null +++ b/examples/classes/prd-classes.yaml @@ -0,0 +1,30 @@ +classes: + base-prd: + slots: + sections: + - Problem + - Decision + assertions: + tone: plain + audience: product + + enterprise: + extends: + - base-prd + slots: + sections: + - Compliance + assertions: + audience: enterprise buyers + merge_policies: + sections: append + assertions: deep_merge + + enterprise-prd: + extends: + - enterprise + slots: + sections: + - Rollout + merge_policies: + sections: append diff --git a/examples/literate/app.md b/examples/literate/app.md new file mode 100644 index 0000000..94df958 --- /dev/null +++ b/examples/literate/app.md @@ -0,0 +1,15 @@ +# Literate App Example + +This example explains the helper before showing the application entry point. 
+ +```python {#helpers} +def helper(): + return "ready" +``` + +```python {#main tangle="src/app.py"} +<> + +def main(): + return helper() +``` diff --git a/examples/migration/legacy-explode-source.md b/examples/migration/legacy-explode-source.md new file mode 100644 index 0000000..8d71aaa --- /dev/null +++ b/examples/migration/legacy-explode-source.md @@ -0,0 +1,17 @@ +--- +title: Legacy Explode Successor +--- + +Opening material that used to be easy to lose in section-only exports. + +# Overview + +The successor explode flow preserves preamble, headings, order, and frontmatter. + +## Detail + +Nested sections remain addressable and roundtrip through the manifest. + +# Follow-Up + +Later sections keep their document order. diff --git a/examples/migration/legacy-literate.md b/examples/migration/legacy-literate.md new file mode 100644 index 0000000..e27236d --- /dev/null +++ b/examples/migration/legacy-literate.md @@ -0,0 +1,12 @@ +# Legacy Literate Successor + +```python {#config} +CONFIG = {"ready": True} +``` + +```python {#main tangle="src/app.py"} +<> + +def main(): + return CONFIG["ready"] +``` diff --git a/examples/migration/legacy-path-include.md b/examples/migration/legacy-path-include.md new file mode 100644 index 0000000..4528dc0 --- /dev/null +++ b/examples/migration/legacy-path-include.md @@ -0,0 +1,3 @@ +# Path Include + + diff --git a/examples/migration/legacy-transclusion-context.md b/examples/migration/legacy-transclusion-context.md new file mode 100644 index 0000000..cc4c4df --- /dev/null +++ b/examples/migration/legacy-transclusion-context.md @@ -0,0 +1,13 @@ +--- +title: Legacy Transclusion Successor +namespaces: + std: ./standard +--- + +# Contract Draft + +The old broad transclusion idea is now split into path includes and +reference-backed processors. 
+ +```mkt-include {#payment-clause ref="std:clauses.md#payment"} +``` diff --git a/examples/migration/standard/clauses.md b/examples/migration/standard/clauses.md new file mode 100644 index 0000000..2079e82 --- /dev/null +++ b/examples/migration/standard/clauses.md @@ -0,0 +1,9 @@ +# Standard Clauses + +## Payment {#payment} + +Payment is due within 30 days. + +## Warranty {#warranty} + +Warranty begins on the effective date. diff --git a/examples/references/context.md b/examples/references/context.md new file mode 100644 index 0000000..de9c9f6 --- /dev/null +++ b/examples/references/context.md @@ -0,0 +1,26 @@ +--- +title: Reference Context +namespaces: + std: ./standard +--- + +# Reference Context + +This document declares the namespaces used by reference examples. + +## Local Overview + +Local sections can be addressed with `#local-overview`. + + +This named region can be resolved with `#region:summary-snippet` or +`#tag:summary`. + + +```python {#example-loader tags="code demo" tangle="src/example_loader.py"} +def load_example(): + return "ready" +``` + +```mkt-include {#payment-example ref="std:clauses.md#payment-terms"} +``` diff --git a/examples/references/standard/clauses.md b/examples/references/standard/clauses.md new file mode 100644 index 0000000..715ab18 --- /dev/null +++ b/examples/references/standard/clauses.md @@ -0,0 +1,9 @@ +# Standard Clauses + +## Payment Terms {#payment-terms} + +Payment is due within 30 days unless a governing contract says otherwise. + +## Warranty + +The warranty period starts on the effective date. 
diff --git a/src/markitect_tool/__init__.py b/src/markitect_tool/__init__.py index 4717cbf..ce557cf 100644 --- a/src/markitect_tool/__init__.py +++ b/src/markitect_tool/__init__.py @@ -32,7 +32,26 @@ from markitect_tool.cache import ( save_cache, scan_markdown_files, ) +from markitect_tool.content_class import ( + ClassCompositionResult, + ContentClass, + ContentClassRegistry, + ContentClassResolutionError, + load_content_class_file, + load_content_classes, +) from markitect_tool.diagnostics import Diagnostic, SourceLocation +from markitect_tool.explode import ( + EXPLODE_MANIFEST_NAME, + ExplodeEntry, + ExplodeError, + ExplodeManifest, + ExplodeResult, + ImplodeResult, + explode_markdown_file, + implode_markdown_directory, + load_explode_manifest, +) from markitect_tool.generation import ( GeneratedDocument, GenerationHookRequest, @@ -44,21 +63,55 @@ from markitect_tool.generation import ( load_generation_plan_file, run_generation_plan, ) +from markitect_tool.literate import ( + CodeChunk, + LiterateFile, + TangleResult, + WeaveResult, + discover_code_chunks, + tangle_markdown, + weave_markdown, + write_tangle_files, +) from markitect_tool.ops import ( ComposeResult, IncludeError, IncludeResult, + OperationProvenance, TransformResult, compose_files, resolve_includes, transform_markdown, ) +from markitect_tool.processor import ( + FencedProcessorBlock, + ProcessorContext, + ProcessorOutputFile, + ProcessorRegistry, + ProcessorRequest, + ProcessorResult, + ProcessorRun, + default_processor_registry, + discover_fenced_processors, + run_fenced_processors, +) from markitect_tool.query import ( InvalidQueryError, QueryMatch, extract_document, query_document, ) +from markitect_tool.reference import ( + ContentUnit, + ReferenceAddress, + ReferenceContext, + ReferenceResolution, + ReferenceResolutionError, + SourceSpan as ReferenceSourceSpan, + load_namespaces, + parse_reference, + resolve_reference, +) from markitect_tool.schema import ( MarkdownSchema, 
SchemaValidationResult, @@ -109,8 +162,23 @@ __all__ = [ "load_cache", "save_cache", "scan_markdown_files", + "ClassCompositionResult", + "ContentClass", + "ContentClassRegistry", + "ContentClassResolutionError", + "load_content_class_file", + "load_content_classes", "Diagnostic", "SourceLocation", + "EXPLODE_MANIFEST_NAME", + "ExplodeEntry", + "ExplodeError", + "ExplodeManifest", + "ExplodeResult", + "ImplodeResult", + "explode_markdown_file", + "implode_markdown_directory", + "load_explode_manifest", "GeneratedDocument", "GenerationHookRequest", "GenerationHookResult", @@ -120,17 +188,45 @@ __all__ = [ "generate_with_hook", "load_generation_plan_file", "run_generation_plan", + "CodeChunk", + "LiterateFile", + "TangleResult", + "WeaveResult", + "discover_code_chunks", + "tangle_markdown", + "weave_markdown", + "write_tangle_files", "ComposeResult", "IncludeError", "IncludeResult", + "OperationProvenance", "TransformResult", "compose_files", "resolve_includes", "transform_markdown", + "FencedProcessorBlock", + "ProcessorContext", + "ProcessorOutputFile", + "ProcessorRegistry", + "ProcessorRequest", + "ProcessorResult", + "ProcessorRun", + "default_processor_registry", + "discover_fenced_processors", + "run_fenced_processors", "InvalidQueryError", "QueryMatch", "extract_document", "query_document", + "ContentUnit", + "ReferenceAddress", + "ReferenceContext", + "ReferenceResolution", + "ReferenceResolutionError", + "ReferenceSourceSpan", + "load_namespaces", + "parse_reference", + "resolve_reference", "MissingTemplateVariable", "TemplateAnalysis", "TemplateError", diff --git a/src/markitect_tool/cli/main.py b/src/markitect_tool/cli/main.py index 12c5c91..05a7122 100644 --- a/src/markitect_tool/cli/main.py +++ b/src/markitect_tool/cli/main.py @@ -16,6 +16,10 @@ from markitect_tool.cache import ( load_cache, save_cache, ) +from markitect_tool.content_class import ( + ContentClassResolutionError, + load_content_class_file, +) from markitect_tool.core import 
parse_markdown_file from markitect_tool.contract import ( ContractLoaderError, @@ -24,6 +28,11 @@ from markitect_tool.contract import ( load_contract_file, validate_contract, ) +from markitect_tool.explode import ( + ExplodeError, + explode_markdown_file, + implode_markdown_directory, +) from markitect_tool.generation import ( GenerationPlanError, generate_stub_from_contract, @@ -31,8 +40,16 @@ from markitect_tool.generation import ( load_generation_plan_file, run_generation_plan, ) +from markitect_tool.literate import tangle_markdown, weave_markdown, write_tangle_files from markitect_tool.ops import IncludeError, compose_files, resolve_includes, transform_markdown +from markitect_tool.processor import ProcessorContext, run_fenced_processors from markitect_tool.query import InvalidQueryError, extract_document, query_document +from markitect_tool.reference import ( + ReferenceContext, + ReferenceResolutionError, + load_namespaces, + resolve_reference, +) from markitect_tool.schema import load_schema_file, validate_markdown_file, validate_schema from markitect_tool.template import ( MissingTemplateVariable, @@ -296,6 +313,224 @@ def include( _emit_markdown_result(result.to_dict(), output_format, output) +@main.command() +@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.option( + "--output-dir", + required=True, + type=click.Path(file_okay=False, path_type=Path), + help="Directory to write exploded Markdown files and manifest into.", +) +@click.option( + "--variant", + type=click.Choice(["flat", "hierarchical"], case_sensitive=False), + default="flat", + show_default=True, +) +@click.option("--force", is_flag=True, help="Allow writing into a non-empty output directory.") +@click.option( + "--format", + "output_format", + type=click.Choice(["json", "yaml", "text"], case_sensitive=False), + default="text", + show_default=True, +) +def explode( + file: Path, + output_dir: Path, + variant: str, + force: bool, + output_format: 
str, +) -> None: + """Explode a Markdown file into reversible section files.""" + + try: + result = explode_markdown_file(file, output_dir, variant=variant, overwrite=force) + except ExplodeError as exc: + raise click.ClickException(str(exc)) from exc + _emit_explode_result(result.to_dict(), output_format) + + +@main.command() +@click.argument("directory", type=click.Path(exists=True, file_okay=False, path_type=Path)) +@click.option( + "--manifest", + "manifest_path", + type=click.Path(exists=True, dir_okay=False, path_type=Path), + help="Manifest path. Defaults to markitect-explode.yaml in the input directory.", +) +@click.option( + "--output", + type=click.Path(dir_okay=False, path_type=Path), + help="Write imploded Markdown to a file.", +) +@click.option( + "--format", + "output_format", + type=click.Choice(["markdown", "json", "yaml"], case_sensitive=False), + default="markdown", + show_default=True, +) +def implode( + directory: Path, + manifest_path: Path | None, + output: Path | None, + output_format: str, +) -> None: + """Implode a Markdown directory created by `mkt explode`.""" + + try: + result = implode_markdown_directory(directory, manifest_path=manifest_path) + except ExplodeError as exc: + raise click.ClickException(str(exc)) from exc + _emit_markdown_result(result.to_dict(), output_format, output) + + +@main.group("ref") +def ref_group() -> None: + """Resolve namespaced Markdown content references.""" + + +@ref_group.command("resolve") +@click.argument("context_file", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.argument("reference") +@click.option( + "--root", + type=click.Path(exists=True, file_okay=False, path_type=Path), + default=Path("."), + show_default=True, + help="Root that relative paths and namespaces must stay within.", +) +@click.option( + "--format", + "output_format", + type=click.Choice(["json", "yaml", "text"], case_sensitive=False), + default="text", + show_default=True, +) +def ref_resolve(context_file: 
Path, reference: str, root: Path, output_format: str) -> None: + """Resolve a content reference using a Markdown document as context.""" + + context_document = parse_markdown_file(context_file) + context = ReferenceContext.from_document( + context_document, + root=root, + current_path=context_file, + ) + try: + resolution = resolve_reference(reference, context=context) + except ReferenceResolutionError as exc: + raise click.ClickException(str(exc)) from exc + _emit_reference_result(resolution.to_dict(), output_format) + + +@main.command("process") +@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.option( + "--root", + type=click.Path(exists=True, file_okay=False, path_type=Path), + default=Path("."), + show_default=True, + help="Root used for relative processor references.", +) +@click.option( + "--format", + "output_format", + type=click.Choice(["json", "yaml", "text"], case_sensitive=False), + default="text", + show_default=True, +) +def process(file: Path, root: Path, output_format: str) -> None: + """Run deterministic fenced-block processors in a Markdown file.""" + + document = parse_markdown_file(file) + context = ProcessorContext( + root=root, + current_path=file, + namespaces=load_namespaces(document.frontmatter), + ) + result = run_fenced_processors( + file.read_text(encoding="utf-8"), + context=context, + source_path=file, + ) + _emit_processor_run(result.to_dict(), output_format) + raise click.exceptions.Exit(0 if result.valid else 1) + + +@main.group("class") +def class_group() -> None: + """Resolve deterministic content classes.""" + + +@class_group.command("resolve") +@click.argument("class_file", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.argument("class_name") +@click.option( + "--format", + "output_format", + type=click.Choice(["json", "yaml", "text"], case_sensitive=False), + default="text", + show_default=True, +) +def class_resolve(class_file: Path, class_name: str, 
output_format: str) -> None: + """Resolve content class inheritance and merged slots.""" + + try: + registry = load_content_class_file(class_file) + result = registry.compose(class_name) + except ContentClassResolutionError as exc: + raise click.ClickException(str(exc)) from exc + _emit_content_class_result(result.to_dict(), output_format) + raise click.exceptions.Exit(0 if result.valid else 1) + + +@main.command() +@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.option( + "--output-dir", + type=click.Path(file_okay=False, path_type=Path), + help="Write tangled files under this directory. Omit for dry JSON/YAML/text output.", +) +@click.option( + "--format", + "output_format", + type=click.Choice(["json", "yaml", "text"], case_sensitive=False), + default="text", + show_default=True, +) +def tangle(file: Path, output_dir: Path | None, output_format: str) -> None: + """Tangle named Markdown code chunks into target files.""" + + result = tangle_markdown(file.read_text(encoding="utf-8"), source_path=file) + data = result.to_dict() + if output_dir and result.valid: + data["written_files"] = write_tangle_files(result, output_dir) + _emit_tangle_result(data, output_format) + raise click.exceptions.Exit(0 if result.valid else 1) + + +@main.command() +@click.argument("file", type=click.Path(exists=True, dir_okay=False, path_type=Path)) +@click.option( + "--output", + type=click.Path(dir_okay=False, path_type=Path), + help="Write woven Markdown to a file.", +) +@click.option( + "--format", + "output_format", + type=click.Choice(["markdown", "json", "yaml"], case_sensitive=False), + default="markdown", + show_default=True, +) +def weave(file: Path, output: Path | None, output_format: str) -> None: + """Weave Markdown documentation with a deterministic chunk index.""" + + result = weave_markdown(file.read_text(encoding="utf-8"), source_path=file) + _emit_markdown_result(result.to_dict(), output_format, output) + + @main.group() def 
cache() -> None: """Fingerprint Markdown files and detect changed inputs.""" @@ -788,6 +1023,83 @@ def _emit_cache_data(data: dict, output_format: str) -> None: click.echo(f"written: {data['written']}") +def _emit_reference_result(data: dict, output_format: str) -> None: + if output_format == "json": + click.echo(json.dumps(data, indent=2, ensure_ascii=False)) + elif output_format == "yaml": + click.echo(yaml.safe_dump(data, sort_keys=False)) + else: + click.echo(f"{data['count']} unit(s)") + click.echo(f"target: {data['target_path']}") + for unit in data["units"]: + span = unit.get("span", {}) + line = f":{span['line_start']}" if span.get("line_start") else "" + click.echo(f"- {unit['kind']} {unit['unit_id']} {unit['source_path']}{line}") + if unit.get("name"): + click.echo(f" {unit['name']}") + + +def _emit_explode_result(data: dict, output_format: str) -> None: + if output_format == "json": + click.echo(json.dumps(data, indent=2, ensure_ascii=False)) + elif output_format == "yaml": + click.echo(yaml.safe_dump(data, sort_keys=False)) + else: + manifest = data["manifest"] + click.echo(f"manifest: {data['manifest_path']}") + click.echo(f"variant: {manifest['variant']}") + click.echo(f"entries: {len(manifest['entries'])}") + for entry in manifest["entries"]: + click.echo(f"- {entry['kind']} {entry['file']}") + + +def _emit_processor_run(data: dict, output_format: str) -> None: + if output_format == "json": + click.echo(json.dumps(data, indent=2, ensure_ascii=False)) + elif output_format == "yaml": + click.echo(yaml.safe_dump(data, sort_keys=False)) + else: + click.echo("valid" if data["valid"] else "invalid") + click.echo(f"processors: {data['count']}") + for block, result in zip(data["blocks"], data["results"], strict=False): + line = f":{block['line_start']}" if block.get("line_start") else "" + click.echo(f"- {block['processor']} {block['unit_id']}{line}") + if result.get("content"): + click.echo(f" content: {result['content'].splitlines()[0]}") + for diagnostic 
in result.get("diagnostics", []): + click.echo(f" [{diagnostic['severity']}] {diagnostic['code']}: {diagnostic['message']}") + + +def _emit_content_class_result(data: dict, output_format: str) -> None: + if output_format == "json": + click.echo(json.dumps(data, indent=2, ensure_ascii=False)) + elif output_format == "yaml": + click.echo(yaml.safe_dump(data, sort_keys=False)) + else: + click.echo("valid" if data["valid"] else "invalid") + click.echo("linearization: " + " -> ".join(data["linearization"])) + for slot, value in data.get("slots", {}).items(): + click.echo(f"- {slot}: {value}") + for diagnostic in data.get("diagnostics", []): + click.echo(f"! [{diagnostic['severity']}] {diagnostic['code']}: {diagnostic['message']}") + + +def _emit_tangle_result(data: dict, output_format: str) -> None: + if output_format == "json": + click.echo(json.dumps(data, indent=2, ensure_ascii=False)) + elif output_format == "yaml": + click.echo(yaml.safe_dump(data, sort_keys=False)) + else: + click.echo("valid" if data["valid"] else "invalid") + click.echo(f"files: {len(data['files'])}") + for file in data["files"]: + click.echo(f"- {file['path']}: {', '.join(file['chunk_ids'])}") + for diagnostic in data.get("diagnostics", []): + click.echo(f"! 
[{diagnostic['severity']}] {diagnostic['code']}: {diagnostic['message']}") + for written in data.get("written_files", []): + click.echo(f"written: {written}") + + def _emit_jsonish(data: dict, output_format: str) -> None: if output_format == "yaml": click.echo(yaml.safe_dump(data, sort_keys=False)) diff --git a/src/markitect_tool/content_class/__init__.py b/src/markitect_tool/content_class/__init__.py new file mode 100644 index 0000000..b14b724 --- /dev/null +++ b/src/markitect_tool/content_class/__init__.py @@ -0,0 +1,19 @@ +"""Deterministic content class composition.""" + +from markitect_tool.content_class.engine import ( + ClassCompositionResult, + ContentClass, + ContentClassRegistry, + ContentClassResolutionError, + load_content_class_file, + load_content_classes, +) + +__all__ = [ + "ClassCompositionResult", + "ContentClass", + "ContentClassRegistry", + "ContentClassResolutionError", + "load_content_class_file", + "load_content_classes", +] diff --git a/src/markitect_tool/content_class/engine.py b/src/markitect_tool/content_class/engine.py new file mode 100644 index 0000000..2884f20 --- /dev/null +++ b/src/markitect_tool/content_class/engine.py @@ -0,0 +1,225 @@ +"""Small deterministic content class resolver.""" + +from __future__ import annotations + +from copy import deepcopy +from dataclasses import asdict, dataclass, field +from pathlib import Path +from typing import Any + +import yaml + +from markitect_tool.diagnostics import Diagnostic + + +class ContentClassResolutionError(ValueError): + """Raised when content class definitions cannot be loaded.""" + + +@dataclass(frozen=True) +class ContentClass: + """A data-defined content class.""" + + name: str + extends: list[str] = field(default_factory=list) + slots: dict[str, Any] = field(default_factory=dict) + merge_policies: dict[str, str] = field(default_factory=dict) + + def to_dict(self) -> dict[str, Any]: + return {key: value for key, value in asdict(self).items() if value not in ({}, [], None)} + + 
+@dataclass(frozen=True) +class ClassCompositionResult: + """Resolved content class slots plus diagnostics.""" + + class_name: str + linearization: list[str] + slots: dict[str, Any] + diagnostics: list[Diagnostic] = field(default_factory=list) + + @property + def valid(self) -> bool: + return not any(diagnostic.severity == "error" for diagnostic in self.diagnostics) + + def to_dict(self) -> dict[str, Any]: + return { + "valid": self.valid, + "class_name": self.class_name, + "linearization": self.linearization, + "slots": self.slots, + "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics], + } + + +class ContentClassRegistry: + """Registry and resolver for content classes.""" + + def __init__(self, classes: dict[str, ContentClass] | None = None) -> None: + self.classes = classes or {} + + def add(self, content_class: ContentClass) -> None: + self.classes[content_class.name] = content_class + + def linearize(self, class_name: str) -> list[str]: + if class_name not in self.classes: + raise ContentClassResolutionError(f"Unknown content class `{class_name}`") + return self._linearize(class_name, []) + + def compose(self, class_name: str) -> ClassCompositionResult: + diagnostics: list[Diagnostic] = [] + try: + linearization = self.linearize(class_name) + except ContentClassResolutionError as exc: + return ClassCompositionResult( + class_name=class_name, + linearization=[], + slots={}, + diagnostics=[ + Diagnostic( + severity="error", + code="content_class.resolution_error", + message=str(exc), + ) + ], + ) + + slots: dict[str, Any] = {} + for name in reversed(linearization): + content_class = self.classes[name] + for slot, value in content_class.slots.items(): + policy = content_class.merge_policies.get(slot, "replace") + try: + slots[slot] = _merge_slot(slots.get(slot), value, policy) + except ContentClassResolutionError as exc: + diagnostics.append( + Diagnostic( + severity="error", + code="content_class.merge_conflict", + message=str(exc), + 
details={"class": name, "slot": slot, "policy": policy}, + ) + ) + return ClassCompositionResult( + class_name=class_name, + linearization=linearization, + slots=slots, + diagnostics=diagnostics, + ) + + def _linearize(self, class_name: str, stack: list[str]) -> list[str]: + if class_name in stack: + raise ContentClassResolutionError( + "Cyclic content class inheritance: " + " -> ".join(stack + [class_name]) + ) + content_class = self.classes[class_name] + parent_mros = [ + self._linearize(parent, stack + [class_name]) + for parent in content_class.extends + if _known_parent(parent, self.classes) + ] + missing = [parent for parent in content_class.extends if parent not in self.classes] + if missing: + raise ContentClassResolutionError( + f"Content class `{class_name}` extends unknown class(es): {', '.join(missing)}" + ) + return [class_name] + _c3_merge(parent_mros + [list(content_class.extends)]) + + +def load_content_class_file(path: str | Path) -> ContentClassRegistry: + """Load content class definitions from YAML.""" + + data = yaml.safe_load(Path(path).read_text(encoding="utf-8")) + if not isinstance(data, dict): + raise ContentClassResolutionError("Content class file must be a mapping") + return load_content_classes(data) + + +def load_content_classes(data: dict[str, Any]) -> ContentClassRegistry: + """Load content class definitions from a mapping.""" + + raw_classes = data.get("classes", data) + if not isinstance(raw_classes, dict): + raise ContentClassResolutionError("Content classes must be a mapping") + classes: dict[str, ContentClass] = {} + for name, raw_class in raw_classes.items(): + if not isinstance(raw_class, dict): + raise ContentClassResolutionError(f"Content class `{name}` must be a mapping") + extends = raw_class.get("extends", []) + if isinstance(extends, str): + extends = [extends] + if not isinstance(extends, list): + raise ContentClassResolutionError(f"Content class `{name}` extends must be a list") + slots = raw_class.get("slots", {}) + 
policies = raw_class.get("merge_policies", {}) + if not isinstance(slots, dict) or not isinstance(policies, dict): + raise ContentClassResolutionError( + f"Content class `{name}` slots and merge_policies must be mappings" + ) + classes[str(name)] = ContentClass( + name=str(name), + extends=[str(parent) for parent in extends], + slots=slots, + merge_policies={str(key): str(value) for key, value in policies.items()}, + ) + return ContentClassRegistry(classes) + + +def _c3_merge(sequences: list[list[str]]) -> list[str]: + result: list[str] = [] + sequences = [list(sequence) for sequence in sequences if sequence] + while sequences: + candidate = None + for sequence in sequences: + head = sequence[0] + if not any(head in other[1:] for other in sequences): + candidate = head + break + if candidate is None: + raise ContentClassResolutionError("Inconsistent content class precedence order") + result.append(candidate) + sequences = [ + [item for item in sequence if item != candidate] + for sequence in sequences + ] + sequences = [sequence for sequence in sequences if sequence] + return result + + +def _merge_slot(existing: Any, value: Any, policy: str) -> Any: + incoming = deepcopy(value) + if existing is None: + return incoming + if policy == "replace": + return incoming + if policy == "append": + return _as_list(existing) + _as_list(incoming) + if policy == "prepend": + return _as_list(incoming) + _as_list(existing) + if policy == "deep_merge": + if not isinstance(existing, dict) or not isinstance(incoming, dict): + raise ContentClassResolutionError("deep_merge requires mapping values") + return _deep_merge(existing, incoming) + if policy == "error_on_conflict": + if existing != incoming: + raise ContentClassResolutionError("slot conflict") + return existing + raise ContentClassResolutionError(f"Unknown merge policy `{policy}`") + + +def _deep_merge(left: dict[str, Any], right: dict[str, Any]) -> dict[str, Any]: + merged = deepcopy(left) + for key, value in right.items(): 
+ if isinstance(merged.get(key), dict) and isinstance(value, dict): + merged[key] = _deep_merge(merged[key], value) + else: + merged[key] = deepcopy(value) + return merged + + +def _as_list(value: Any) -> list[Any]: + return value if isinstance(value, list) else [value] + + +def _known_parent(parent: str, classes: dict[str, ContentClass]) -> bool: + return parent in classes diff --git a/src/markitect_tool/explode/__init__.py b/src/markitect_tool/explode/__init__.py new file mode 100644 index 0000000..e6651b1 --- /dev/null +++ b/src/markitect_tool/explode/__init__.py @@ -0,0 +1,25 @@ +"""Reversible explode/implode operations for Markdown documents.""" + +from markitect_tool.explode.engine import ( + EXPLODE_MANIFEST_NAME, + ExplodeEntry, + ExplodeError, + ExplodeManifest, + ExplodeResult, + ImplodeResult, + explode_markdown_file, + implode_markdown_directory, + load_explode_manifest, +) + +__all__ = [ + "EXPLODE_MANIFEST_NAME", + "ExplodeEntry", + "ExplodeError", + "ExplodeManifest", + "ExplodeResult", + "ImplodeResult", + "explode_markdown_file", + "implode_markdown_directory", + "load_explode_manifest", +] diff --git a/src/markitect_tool/explode/engine.py b/src/markitect_tool/explode/engine.py new file mode 100644 index 0000000..d014f05 --- /dev/null +++ b/src/markitect_tool/explode/engine.py @@ -0,0 +1,324 @@ +"""Manifest-first reversible explode/implode for Markdown files.""" + +from __future__ import annotations + +import hashlib +import re +from dataclasses import asdict, dataclass, field +from pathlib import Path +from typing import Any + +import yaml + +from markitect_tool.core import Heading, parse_markdown + + +EXPLODE_MANIFEST_NAME = "markitect-explode.yaml" + + +class ExplodeError(ValueError): + """Raised when explode or implode cannot preserve a safe roundtrip.""" + + +@dataclass(frozen=True) +class ExplodeEntry: + """One file entry in an exploded Markdown directory.""" + + kind: str + file: str + order: int + unit_id: str + line_start: int + line_end: 
int + heading_level: int | None = None + heading_text: str | None = None + content_hash: str = "" + + def to_dict(self) -> dict[str, Any]: + return {key: value for key, value in asdict(self).items() if value is not None} + + +@dataclass(frozen=True) +class ExplodeManifest: + """Manifest used to implode an exploded Markdown directory.""" + + version: int + source_path: str + source_hash: str + variant: str + frontmatter_raw: str = "" + entries: list[ExplodeEntry] = field(default_factory=list) + + def to_dict(self) -> dict[str, Any]: + return { + "version": self.version, + "source_path": self.source_path, + "source_hash": self.source_hash, + "variant": self.variant, + "frontmatter_raw": self.frontmatter_raw, + "entries": [entry.to_dict() for entry in self.entries], + } + + +@dataclass(frozen=True) +class ExplodeResult: + """Result of exploding a Markdown file into a directory.""" + + manifest_path: str + output_dir: str + manifest: ExplodeManifest + written_files: list[str] + + def to_dict(self) -> dict[str, Any]: + return { + "manifest_path": self.manifest_path, + "output_dir": self.output_dir, + "manifest": self.manifest.to_dict(), + "written_files": self.written_files, + } + + +@dataclass(frozen=True) +class ImplodeResult: + """Result of rebuilding Markdown from an explode manifest.""" + + markdown: str + manifest_path: str + source_hash: str + current_hash: str + entries: list[str] + + def to_dict(self) -> dict[str, Any]: + return asdict(self) + + +def explode_markdown_file( + path: str | Path, + output_dir: str | Path, + *, + variant: str = "flat", + overwrite: bool = False, +) -> ExplodeResult: + """Explode a Markdown file into section files plus a roundtrip manifest.""" + + if variant not in {"flat", "hierarchical"}: + raise ExplodeError("Explode variant must be `flat` or `hierarchical`") + + source_path = Path(path) + target_dir = Path(output_dir) + markdown = source_path.read_text(encoding="utf-8") + if target_dir.exists() and any(target_dir.iterdir()) and 
not overwrite: + raise ExplodeError(f"Output directory is not empty: {target_dir}") + target_dir.mkdir(parents=True, exist_ok=True) + + frontmatter_raw, body_start_line = _split_frontmatter_raw(markdown) + entries_with_text = _explode_entries(markdown, body_start_line, variant) + written_files: list[str] = [] + entries: list[ExplodeEntry] = [] + + for entry, text in entries_with_text: + entry_path = _safe_entry_path(target_dir, entry.file) + entry_path.parent.mkdir(parents=True, exist_ok=True) + entry_path.write_text(text, encoding="utf-8") + written_files.append(str(entry_path)) + entries.append(entry) + + manifest = ExplodeManifest( + version=1, + source_path=str(source_path), + source_hash=_hash_text(markdown), + variant=variant, + frontmatter_raw=frontmatter_raw, + entries=entries, + ) + manifest_path = target_dir / EXPLODE_MANIFEST_NAME + manifest_path.write_text(yaml.safe_dump(manifest.to_dict(), sort_keys=False), encoding="utf-8") + return ExplodeResult( + manifest_path=str(manifest_path), + output_dir=str(target_dir), + manifest=manifest, + written_files=written_files + [str(manifest_path)], + ) + + +def implode_markdown_directory( + directory: str | Path, + *, + manifest_path: str | Path | None = None, +) -> ImplodeResult: + """Implode a Markdown directory created by :func:`explode_markdown_file`.""" + + root = Path(directory) + manifest_file = Path(manifest_path) if manifest_path else root / EXPLODE_MANIFEST_NAME + manifest = load_explode_manifest(manifest_file) + parts = [manifest.frontmatter_raw] + entry_files: list[str] = [] + + for entry in manifest.entries: + entry_path = _safe_entry_path(root, entry.file) + if not entry_path.exists() or not entry_path.is_file(): + raise ExplodeError(f"Exploded entry file not found: {entry.file}") + parts.append(entry_path.read_text(encoding="utf-8")) + entry_files.append(str(entry_path)) + + markdown = "".join(parts) + return ImplodeResult( + markdown=markdown, + manifest_path=str(manifest_file), + 
source_hash=manifest.source_hash, + current_hash=_hash_text(markdown), + entries=entry_files, + ) + + +def load_explode_manifest(path: str | Path) -> ExplodeManifest: + """Load an explode manifest from YAML.""" + + manifest_path = Path(path) + data = yaml.safe_load(manifest_path.read_text(encoding="utf-8")) + if not isinstance(data, dict): + raise ExplodeError("Explode manifest must be a mapping") + entries = data.get("entries", []) + if not isinstance(entries, list): + raise ExplodeError("Explode manifest entries must be a list") + return ExplodeManifest( + version=int(data.get("version", 1)), + source_path=str(data.get("source_path", "")), + source_hash=str(data.get("source_hash", "")), + variant=str(data.get("variant", "flat")), + frontmatter_raw=str(data.get("frontmatter_raw", "")), + entries=[_entry_from_mapping(entry) for entry in entries], + ) + + +def _explode_entries( + markdown: str, + body_start_line: int, + variant: str, +) -> list[tuple[ExplodeEntry, str]]: + lines = markdown.splitlines(keepends=True) + headings = parse_markdown(markdown).headings + entries: list[tuple[ExplodeEntry, str]] = [] + used_ids: dict[str, int] = {} + order = 0 + + first_heading_line = headings[0].line if headings else len(lines) + 1 + preamble_text = "".join(lines[body_start_line - 1:first_heading_line - 1]) + if preamble_text or not headings: + entry = ExplodeEntry( + kind="preamble", + file="00-preamble.md", + order=order, + unit_id="preamble", + line_start=body_start_line, + line_end=max(first_heading_line - 1, body_start_line), + content_hash=_hash_text(preamble_text), + ) + entries.append((entry, preamble_text)) + order += 1 + + hierarchy: dict[int, str] = {} + for index, heading in enumerate(headings): + start = heading.line + end = headings[index + 1].line - 1 if index + 1 < len(headings) else len(lines) + text = "".join(lines[start - 1:end]) + unit_id = _dedupe_id(_slug(_heading_title(heading)), used_ids) + file_path = _entry_file_for_heading(heading, index + 1, 
unit_id, variant, hierarchy) + entry = ExplodeEntry( + kind="section", + file=file_path, + order=order, + unit_id=unit_id, + line_start=start, + line_end=end, + heading_level=heading.level, + heading_text=heading.text, + content_hash=_hash_text(text), + ) + entries.append((entry, text)) + order += 1 + + return entries + + +def _entry_file_for_heading( + heading: Heading, + index: int, + unit_id: str, + variant: str, + hierarchy: dict[int, str], +) -> str: + filename = f"{index:02d}-{unit_id}.md" + if variant == "flat": + return f"sections/{filename}" + + for level in list(hierarchy): + if level >= heading.level: + del hierarchy[level] + parents = [hierarchy[level] for level in sorted(hierarchy) if level < heading.level] + hierarchy[heading.level] = f"{index:02d}-{unit_id}" + return str(Path(*parents, filename)) if parents else filename + + +def _entry_from_mapping(data: Any) -> ExplodeEntry: + if not isinstance(data, dict): + raise ExplodeError("Explode manifest entry must be a mapping") + return ExplodeEntry( + kind=str(data["kind"]), + file=str(data["file"]), + order=int(data["order"]), + unit_id=str(data["unit_id"]), + line_start=int(data["line_start"]), + line_end=int(data["line_end"]), + heading_level=int(data["heading_level"]) if data.get("heading_level") is not None else None, + heading_text=str(data["heading_text"]) if data.get("heading_text") is not None else None, + content_hash=str(data.get("content_hash", "")), + ) + + +def _safe_entry_path(root: Path, relative_path: str) -> Path: + path = Path(relative_path) + if path.is_absolute(): + raise ExplodeError(f"Exploded entry path must be relative: {relative_path}") + resolved = (root / path).resolve() + try: + resolved.relative_to(root.resolve()) + except ValueError as exc: + raise ExplodeError(f"Exploded entry path escapes directory: {relative_path}") from exc + return resolved + + +def _split_frontmatter_raw(markdown: str) -> tuple[str, int]: + if not markdown.startswith("---\n"): + return "", 1 + end = 
markdown.find("\n---", 4) + if end == -1: + return "", 1 + closing_end = markdown.find("\n", end + 4) + if closing_end == -1: + closing_end = len(markdown) + else: + closing_end += 1 + frontmatter_raw = markdown[:closing_end] + return frontmatter_raw, frontmatter_raw.count("\n") + 1 + + +def _heading_title(heading: Heading) -> str: + text = re.sub(r"\s+\{#[A-Za-z0-9_.:-]+\}\s*$", "", heading.text.strip()) + return text or "section" + + +def _dedupe_id(unit_id: str, used_ids: dict[str, int]) -> str: + count = used_ids.get(unit_id, 0) + 1 + used_ids[unit_id] = count + return unit_id if count == 1 else f"{unit_id}-{count}" + + +def _slug(value: str) -> str: + slug = re.sub(r"[^a-z0-9_.:-]+", "-", value.strip().lower()) + slug = re.sub(r"-+", "-", slug).strip("-") + return slug or "section" + + +def _hash_text(text: str) -> str: + return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest() diff --git a/src/markitect_tool/literate/__init__.py b/src/markitect_tool/literate/__init__.py new file mode 100644 index 0000000..6b1dd60 --- /dev/null +++ b/src/markitect_tool/literate/__init__.py @@ -0,0 +1,23 @@ +"""Markdown-native literate weave/tangle workflows.""" + +from markitect_tool.literate.engine import ( + CodeChunk, + LiterateFile, + TangleResult, + WeaveResult, + discover_code_chunks, + tangle_markdown, + weave_markdown, + write_tangle_files, +) + +__all__ = [ + "CodeChunk", + "LiterateFile", + "TangleResult", + "WeaveResult", + "discover_code_chunks", + "tangle_markdown", + "weave_markdown", + "write_tangle_files", +] diff --git a/src/markitect_tool/literate/engine.py b/src/markitect_tool/literate/engine.py new file mode 100644 index 0000000..b0643a0 --- /dev/null +++ b/src/markitect_tool/literate/engine.py @@ -0,0 +1,317 @@ +"""Literate programming helpers for Markdown fenced code chunks.""" + +from __future__ import annotations + +import hashlib +import re +import shlex +from dataclasses import asdict, dataclass, field +from pathlib import Path +from typing 
import Any + +from markdown_it import MarkdownIt + +from markitect_tool.diagnostics import Diagnostic, SourceLocation +from markitect_tool.ops import OperationProvenance + + +@dataclass(frozen=True) +class CodeChunk: + """A named fenced code chunk.""" + + chunk_id: str + content: str + language: str | None = None + target_path: str | None = None + references: list[str] = field(default_factory=list) + source_path: str | None = None + line_start: int | None = None + line_end: int | None = None + content_hash: str = "" + + def to_dict(self) -> dict[str, Any]: + return {key: value for key, value in asdict(self).items() if value not in (None, [], "")} + + +@dataclass(frozen=True) +class LiterateFile: + """One generated file from tangling.""" + + path: str + content: str + chunk_ids: list[str] + + def to_dict(self) -> dict[str, Any]: + return asdict(self) + + +@dataclass(frozen=True) +class TangleResult: + """Result of tangling Markdown code chunks.""" + + files: list[LiterateFile] + chunks: list[CodeChunk] + diagnostics: list[Diagnostic] = field(default_factory=list) + provenance: list[OperationProvenance] = field(default_factory=list) + + @property + def valid(self) -> bool: + return not any(diagnostic.severity == "error" for diagnostic in self.diagnostics) + + def to_dict(self) -> dict[str, Any]: + return { + "valid": self.valid, + "files": [file.to_dict() for file in self.files], + "chunks": [chunk.to_dict() for chunk in self.chunks], + "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics], + "provenance": [event.to_dict() for event in self.provenance], + } + + +@dataclass(frozen=True) +class WeaveResult: + """Result of weaving Markdown documentation with a chunk index.""" + + markdown: str + chunks: list[CodeChunk] + + def to_dict(self) -> dict[str, Any]: + return { + "markdown": self.markdown, + "chunks": [chunk.to_dict() for chunk in self.chunks], + } + + +_CHUNK_REF_RE = re.compile(r"<<(?P<id>[A-Za-z0-9_.:-]+)>>") +_CHUNK_LINE_REF_RE = 
re.compile(r"^(?P<indent>[ \t]*)<<(?P<id>[A-Za-z0-9_.:-]+)>>[ \t]*$", re.MULTILINE) + + +def discover_code_chunks( + markdown: str, + *, + source_path: str | Path | None = None, +) -> list[CodeChunk]: + """Discover named fenced code chunks in Markdown order.""" + + parser = MarkdownIt("commonmark", {"tables": True}).enable("table") + chunks: list[CodeChunk] = [] + used_ids: dict[str, int] = {} + for token in parser.parse(markdown): + if token.type != "fence": + continue + attrs = _parse_fence_info(token.info) + chunk_id = attrs.get("id") + if not chunk_id: + continue + chunk_id = _dedupe_id(_slug(chunk_id), used_ids) + line_start = token.map[0] + 1 if token.map else None + line_end = token.map[1] if token.map else None + chunks.append( + CodeChunk( + chunk_id=chunk_id, + content=token.content, + language=attrs.get("language"), + target_path=attrs.get("tangle") or attrs.get("target"), + references=_chunk_references(token.content), + source_path=str(source_path) if source_path else None, + line_start=line_start, + line_end=line_end, + content_hash=_hash_text(token.content), + ) + ) + return chunks + + +def tangle_markdown( + markdown: str, + *, + source_path: str | Path | None = None, +) -> TangleResult: + """Tangle named chunks into target files.""" + + chunks = discover_code_chunks(markdown, source_path=source_path) + chunks_by_id = {chunk.chunk_id: chunk for chunk in chunks} + diagnostics: list[Diagnostic] = [] + provenance: list[OperationProvenance] = [] + target_chunks: dict[str, list[CodeChunk]] = {} + for chunk in chunks: + if chunk.target_path: + target_chunks.setdefault(chunk.target_path, []).append(chunk) + + files: list[LiterateFile] = [] + for target_path, grouped_chunks in target_chunks.items(): + rendered_parts: list[str] = [] + for chunk in grouped_chunks: + rendered_parts.append(_expand_chunk(chunk, chunks_by_id, diagnostics, [])) + provenance.append( + OperationProvenance( + operation="literate.tangle", + source_path=chunk.source_path, + 
line_start=chunk.line_start, + line_end=chunk.line_end, + target_path=target_path, + dependencies=[chunk.source_path] if chunk.source_path else [], + metadata={"chunk_id": chunk.chunk_id, "references": chunk.references}, + ) + ) + files.append( + LiterateFile( + path=target_path, + content=_join_tangled_parts(rendered_parts), + chunk_ids=[chunk.chunk_id for chunk in grouped_chunks], + ) + ) + + return TangleResult( + files=files, + chunks=chunks, + diagnostics=diagnostics, + provenance=provenance, + ) + + +def weave_markdown( + markdown: str, + *, + source_path: str | Path | None = None, +) -> WeaveResult: + """Append a deterministic chunk index to human-readable Markdown.""" + + chunks = discover_code_chunks(markdown, source_path=source_path) + if not chunks: + return WeaveResult(markdown=markdown, chunks=[]) + + lines = [markdown.rstrip(), "", "## Code Chunk Index", ""] + for chunk in chunks: + target = f" -> `{chunk.target_path}`" if chunk.target_path else "" + refs = f"; refs: {', '.join(f'`{ref}`' for ref in chunk.references)}" if chunk.references else "" + location = f" line {chunk.line_start}" if chunk.line_start else "" + lines.append(f"- `{chunk.chunk_id}`{target}{refs}{location}") + return WeaveResult(markdown="\n".join(lines).rstrip() + "\n", chunks=chunks) + + +def write_tangle_files(result: TangleResult, output_dir: str | Path) -> list[str]: + """Write tangled files under an output directory.""" + + root = Path(output_dir) + root.mkdir(parents=True, exist_ok=True) + written: list[str] = [] + for file in result.files: + target = _safe_output_path(root, file.path) + target.parent.mkdir(parents=True, exist_ok=True) + target.write_text(file.content, encoding="utf-8") + written.append(str(target)) + return written + + +def _expand_chunk( + chunk: CodeChunk, + chunks_by_id: dict[str, CodeChunk], + diagnostics: list[Diagnostic], + stack: list[str], +) -> str: + if chunk.chunk_id in stack: + diagnostics.append( + Diagnostic( + severity="error", + 
code="literate.chunk_cycle", + message="Cyclic chunk reference: " + " -> ".join(stack + [chunk.chunk_id]), + source=SourceLocation(path=chunk.source_path, line=chunk.line_start), + ) + ) + return f"<<{chunk.chunk_id}>>" + + def replace_line(match: re.Match[str]) -> str: + indent = match.group("indent") + expanded = _expand_reference(match.group("id"), chunks_by_id, diagnostics, stack + [chunk.chunk_id], chunk) + return "\n".join(f"{indent}{line}" if line else line for line in expanded.splitlines()) + + rendered = _CHUNK_LINE_REF_RE.sub(replace_line, chunk.content) + + def replace_inline(match: re.Match[str]) -> str: + return _expand_reference(match.group("id"), chunks_by_id, diagnostics, stack + [chunk.chunk_id], chunk) + + return _CHUNK_REF_RE.sub(replace_inline, rendered) + + +def _expand_reference( + chunk_id: str, + chunks_by_id: dict[str, CodeChunk], + diagnostics: list[Diagnostic], + stack: list[str], + source_chunk: CodeChunk, +) -> str: + referenced = chunks_by_id.get(chunk_id) + if not referenced: + diagnostics.append( + Diagnostic( + severity="error", + code="literate.missing_chunk", + message=f"Missing chunk reference `{chunk_id}`", + source=SourceLocation(path=source_chunk.source_path, line=source_chunk.line_start), + ) + ) + return f"<<{chunk_id}>>" + return _expand_chunk(referenced, chunks_by_id, diagnostics, stack) + + +def _join_tangled_parts(parts: list[str]) -> str: + rendered = "\n".join(part.rstrip("\n") for part in parts if part is not None) + return rendered.rstrip() + "\n" if rendered else "" + + +def _safe_output_path(root: Path, relative_path: str) -> Path: + path = Path(relative_path) + if path.is_absolute(): + raise ValueError(f"Tangle target must be relative: {relative_path}") + resolved = (root / path).resolve() + try: + resolved.relative_to(root.resolve()) + except ValueError as exc: + raise ValueError(f"Tangle target escapes output directory: {relative_path}") from exc + return resolved + + +def _parse_fence_info(info: str) -> 
dict[str, str]: + match = re.match(r"^(?P<language>[^\s{]+)?(?:\s+\{(?P<attrs>.*)\})?\s*$", info.strip()) + if not match: + return {"language": info.strip()} if info.strip() else {} + attrs = _parse_attrs(match.group("attrs") or "") + language = match.group("language") + if language: + attrs["language"] = language + return attrs + + +def _parse_attrs(raw: str) -> dict[str, str]: + attrs: dict[str, str] = {} + for part in shlex.split(raw): + if part.startswith("#") and len(part) > 1: + attrs["id"] = part[1:] + continue + if "=" not in part: + attrs[part] = "true" + continue + key, value = part.split("=", 1) + attrs[key.strip()] = value.strip() + return attrs + + +def _chunk_references(content: str) -> list[str]: + return [match.group("id") for match in _CHUNK_REF_RE.finditer(content)] + + +def _dedupe_id(unit_id: str, used_ids: dict[str, int]) -> str: + count = used_ids.get(unit_id, 0) + 1 + used_ids[unit_id] = count + return unit_id if count == 1 else f"{unit_id}-{count}" + + +def _slug(value: str) -> str: + slug = re.sub(r"[^a-z0-9_.:-]+", "-", value.strip().lower()) + slug = re.sub(r"-+", "-", slug).strip("-") + return slug or "chunk" + + +def _hash_text(text: str) -> str: + return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest() diff --git a/src/markitect_tool/ops/__init__.py b/src/markitect_tool/ops/__init__.py index 1e68095..b438296 100644 --- a/src/markitect_tool/ops/__init__.py +++ b/src/markitect_tool/ops/__init__.py @@ -4,6 +4,7 @@ from markitect_tool.ops.engine import ( ComposeResult, IncludeError, IncludeResult, + OperationProvenance, TransformResult, compose_files, resolve_includes, @@ -14,6 +15,7 @@ __all__ = [ "ComposeResult", "IncludeError", "IncludeResult", + "OperationProvenance", "TransformResult", "compose_files", "resolve_includes", diff --git a/src/markitect_tool/ops/engine.py b/src/markitect_tool/ops/engine.py index 75da3ed..333d8a2 100644 --- a/src/markitect_tool/ops/engine.py +++ b/src/markitect_tool/ops/engine.py @@ -9,6 +9,7 @@ from pathlib 
import Path from typing import Any import yaml +from markdown_it import MarkdownIt from markitect_tool.core import parse_markdown from markitect_tool.query import extract_document @@ -18,15 +19,46 @@ class IncludeError(ValueError): """Raised when include resolution cannot continue.""" +@dataclass(frozen=True) +class OperationProvenance: + """Structured provenance for deterministic Markdown operations.""" + + operation: str + source_path: str | None = None + line_start: int | None = None + line_end: int | None = None + target_path: str | None = None + dependencies: list[str] = field(default_factory=list) + metadata: dict[str, Any] = field(default_factory=dict) + + def to_dict(self) -> dict[str, Any]: + data = { + "operation": self.operation, + "source_path": self.source_path, + "line_start": self.line_start, + "line_end": self.line_end, + "target_path": self.target_path, + "dependencies": self.dependencies or None, + "metadata": self.metadata or None, + } + return {key: value for key, value in data.items() if value is not None} + + @dataclass(frozen=True) class TransformResult: """Result of a deterministic Markdown transform.""" markdown: str operations: list[str] = field(default_factory=list) + provenance: list[OperationProvenance] = field(default_factory=list) def to_dict(self) -> dict[str, Any]: - return asdict(self) + data: dict[str, Any] = { + "markdown": self.markdown, + "operations": self.operations, + "provenance": [event.to_dict() for event in self.provenance], + } + return {key: value for key, value in data.items() if value} @dataclass(frozen=True) @@ -46,9 +78,15 @@ class IncludeResult: markdown: str included_paths: list[str] = field(default_factory=list) + provenance: list[OperationProvenance] = field(default_factory=list) def to_dict(self) -> dict[str, Any]: - return asdict(self) + data: dict[str, Any] = { + "markdown": self.markdown, + "included_paths": self.included_paths, + "provenance": [event.to_dict() for event in self.provenance], + } + return 
{key: value for key, value in data.items() if value} _COMMENT_INCLUDE_RE = re.compile(r"", re.DOTALL) @@ -68,15 +106,30 @@ def transform_markdown( """Apply deterministic operations to one Markdown document.""" operations: list[str] = [] + provenance: list[OperationProvenance] = [] frontmatter, body = _split_frontmatter(markdown) if set_frontmatter: frontmatter = _deep_merge(frontmatter, set_frontmatter) operations.append("set_frontmatter") + provenance.append( + OperationProvenance( + operation="set_frontmatter", + source_path=source_path, + metadata={"keys": sorted(set_frontmatter.keys())}, + ) + ) if heading_delta: - body = shift_heading_levels(body, heading_delta) + body, affected_lines = _shift_heading_levels(body, heading_delta) operations.append(f"shift_headings:{heading_delta}") + provenance.append( + OperationProvenance( + operation="shift_headings", + source_path=source_path, + metadata={"delta": heading_delta, "affected_lines": affected_lines}, + ) + ) if extract_selector: document_text = _join_frontmatter(frontmatter, body) if frontmatter else body @@ -84,24 +137,71 @@ def transform_markdown( body = "\n\n".join(extract_document(document, extract_selector)) frontmatter = {} operations.append(f"extract:{extract_selector}") + provenance.append( + OperationProvenance( + operation="extract", + source_path=source_path, + metadata={"selector": extract_selector}, + ) + ) if strip_frontmatter: frontmatter = {} operations.append("strip_frontmatter") + provenance.append( + OperationProvenance( + operation="strip_frontmatter", + source_path=source_path, + ) + ) - return TransformResult(markdown=_join_frontmatter(frontmatter, body), operations=operations) + return TransformResult( + markdown=_join_frontmatter(frontmatter, body), + operations=operations, + provenance=provenance, + ) def shift_heading_levels(markdown: str, delta: int) -> str: """Shift ATX heading levels by delta while clamping to levels 1 through 6.""" - def replace(match: re.Match[str]) -> str: + 
shifted, _affected_lines = _shift_heading_levels(markdown, delta) + return shifted + + +def _shift_heading_levels(markdown: str, delta: int) -> tuple[str, list[int]]: + ignored_lines = _code_line_numbers(markdown) + affected_lines: list[int] = [] + rendered_lines: list[str] = [] + + for line_number, line in enumerate(markdown.splitlines(keepends=True), start=1): + if line_number in ignored_lines: + rendered_lines.append(line) + continue + line_body = line.rstrip("\r\n") + line_ending = line[len(line_body) :] + match = _HEADING_RE.match(line_body) + if not match: + rendered_lines.append(line) + continue marks = match.group(1) suffix = match.group(2) level = min(max(len(marks) + delta, 1), 6) - return f"{'#' * level}{suffix}" + rendered_lines.append(f"{'#' * level}{suffix}{line_ending}") + affected_lines.append(line_number) - return _HEADING_RE.sub(replace, markdown) + return "".join(rendered_lines), affected_lines + + +def _code_line_numbers(markdown: str) -> set[int]: + parser = MarkdownIt("commonmark", {"tables": True}).enable("table") + ignored_lines: set[int] = set() + for token in parser.parse(markdown): + if token.type not in {"fence", "code_block"} or not token.map: + continue + start, end = token.map + ignored_lines.update(range(start + 1, end + 1)) + return ignored_lines def compose_files( @@ -154,18 +254,22 @@ def resolve_includes( root = Path(base_dir).resolve() stack = [Path(current_path).resolve()] if current_path else [] included: list[Path] = [] + provenance: list[OperationProvenance] = [] resolved = _resolve_include_text( markdown, root=root, current_dir=Path(current_path).resolve().parent if current_path else root, + source_path=Path(current_path).resolve() if current_path else None, stack=stack, included=included, + provenance=provenance, depth=0, max_depth=max_depth, ) return IncludeResult( markdown=resolved, included_paths=[str(path) for path in included], + provenance=provenance, ) @@ -174,34 +278,73 @@ def _resolve_include_text( *, root: Path, 
current_dir: Path, + source_path: Path | None, stack: list[Path], included: list[Path], + provenance: list[OperationProvenance], depth: int, max_depth: int, ) -> str: if depth > max_depth: raise IncludeError(f"Include depth exceeded max_depth={max_depth}") - def replace_comment(match: re.Match[str]) -> str: - attrs = _parse_include_attrs(match.group("attrs")) - return _render_include(attrs, root, current_dir, stack, included, depth, max_depth) + ignored_lines = _code_line_numbers(markdown) + rendered_lines: list[str] = [] - def replace_brace(match: re.Match[str]) -> str: - attrs = {"path": match.group("path").strip()} - return _render_include(attrs, root, current_dir, stack, included, depth, max_depth) + for line_number, line in enumerate(markdown.splitlines(keepends=True), start=1): + if line_number in ignored_lines: + rendered_lines.append(line) + continue - markdown = _COMMENT_INCLUDE_RE.sub(replace_comment, markdown) - return _BRACE_INCLUDE_RE.sub(replace_brace, markdown) + def replace_comment(match: re.Match[str]) -> str: + attrs = _parse_include_attrs(match.group("attrs")) + return _render_include( + attrs, + root, + current_dir, + source_path, + stack, + included, + provenance, + depth, + max_depth, + marker_line=line_number, + ) + + def replace_brace(match: re.Match[str]) -> str: + attrs = {"path": match.group("path").strip()} + return _render_include( + attrs, + root, + current_dir, + source_path, + stack, + included, + provenance, + depth, + max_depth, + marker_line=line_number, + ) + + line = _COMMENT_INCLUDE_RE.sub(replace_comment, line) + line = _BRACE_INCLUDE_RE.sub(replace_brace, line) + rendered_lines.append(line) + + return "".join(rendered_lines) def _render_include( attrs: dict[str, str], root: Path, current_dir: Path, + source_path: Path | None, stack: list[Path], included: list[Path], + provenance: list[OperationProvenance], depth: int, max_depth: int, + *, + marker_line: int, ) -> str: raw_path = attrs.get("path") if not raw_path: @@ -228,12 
+371,33 @@ def _render_include( body = shift_heading_levels(body, heading_delta) included.append(include_path) + provenance.append( + OperationProvenance( + operation="include", + source_path=str(source_path) if source_path else None, + line_start=marker_line, + line_end=marker_line, + target_path=str(include_path), + dependencies=[str(include_path)], + metadata={ + key: value + for key, value in { + "selector": selector, + "heading_delta": heading_delta if heading_delta else None, + "include_frontmatter": attrs.get("include_frontmatter"), + }.items() + if value is not None + }, + ) + ) return _resolve_include_text( body.strip(), root=root, current_dir=include_path.parent, + source_path=include_path, stack=stack + [include_path], included=included, + provenance=provenance, depth=depth + 1, max_depth=max_depth, ) diff --git a/src/markitect_tool/processor/__init__.py b/src/markitect_tool/processor/__init__.py new file mode 100644 index 0000000..3c31c71 --- /dev/null +++ b/src/markitect_tool/processor/__init__.py @@ -0,0 +1,27 @@ +"""Deterministic fenced-block processor registry.""" + +from markitect_tool.processor.engine import ( + FencedProcessorBlock, + ProcessorContext, + ProcessorOutputFile, + ProcessorRegistry, + ProcessorRequest, + ProcessorResult, + ProcessorRun, + default_processor_registry, + discover_fenced_processors, + run_fenced_processors, +) + +__all__ = [ + "FencedProcessorBlock", + "ProcessorContext", + "ProcessorOutputFile", + "ProcessorRegistry", + "ProcessorRequest", + "ProcessorResult", + "ProcessorRun", + "default_processor_registry", + "discover_fenced_processors", + "run_fenced_processors", +] diff --git a/src/markitect_tool/processor/engine.py b/src/markitect_tool/processor/engine.py new file mode 100644 index 0000000..31534a8 --- /dev/null +++ b/src/markitect_tool/processor/engine.py @@ -0,0 +1,374 @@ +"""Processor API for deterministic fenced-block workflows.""" + +from __future__ import annotations + +import hashlib +import re +import 
shlex +from dataclasses import asdict, dataclass, field +from pathlib import Path +from typing import Any, Callable + +from markdown_it import MarkdownIt + +from markitect_tool.diagnostics import Diagnostic, SourceLocation +from markitect_tool.ops import OperationProvenance +from markitect_tool.reference import ( + ReferenceContext, + ReferenceResolutionError, + resolve_reference, +) + + +ProcessorCallable = Callable[["ProcessorRequest"], "ProcessorResult"] + + +@dataclass(frozen=True) +class FencedProcessorBlock: + """A fenced Markdown block that opted into processor handling.""" + + processor: str + content: str + unit_id: str + attrs: dict[str, str] + language: str | None = None + source_path: str | None = None + line_start: int | None = None + line_end: int | None = None + content_hash: str = "" + + def to_dict(self) -> dict[str, Any]: + return {key: value for key, value in asdict(self).items() if value not in (None, {}, "")} + + +@dataclass(frozen=True) +class ProcessorContext: + """Execution context passed to deterministic processors.""" + + root: Path = Path(".") + current_path: Path | None = None + namespaces: dict[str, str] = field(default_factory=dict) + variables: dict[str, Any] = field(default_factory=dict) + policy: dict[str, Any] = field(default_factory=dict) + + def reference_context(self) -> ReferenceContext: + return ReferenceContext( + root=self.root, + current_path=self.current_path, + namespaces=self.namespaces, + ) + + def to_dict(self) -> dict[str, Any]: + data = { + "root": str(self.root), + "current_path": str(self.current_path) if self.current_path else None, + "namespaces": self.namespaces, + "variables": self.variables, + "policy": self.policy, + } + return {key: value for key, value in data.items() if value not in (None, {}, "")} + + +@dataclass(frozen=True) +class ProcessorRequest: + """One processor invocation.""" + + block: FencedProcessorBlock + context: ProcessorContext + + +@dataclass(frozen=True) +class ProcessorOutputFile: + """A 
generated file requested by a processor.""" + + path: str + content: str + + def to_dict(self) -> dict[str, Any]: + return asdict(self) + + +@dataclass(frozen=True) +class ProcessorResult: + """Deterministic processor result envelope.""" + + content: str | None = None + files: list[ProcessorOutputFile] = field(default_factory=list) + diagnostics: list[Diagnostic] = field(default_factory=list) + dependencies: list[str] = field(default_factory=list) + provenance: list[OperationProvenance] = field(default_factory=list) + + @property + def valid(self) -> bool: + return not any(diagnostic.severity == "error" for diagnostic in self.diagnostics) + + def to_dict(self) -> dict[str, Any]: + data = { + "valid": self.valid, + "content": self.content, + "files": [file.to_dict() for file in self.files], + "diagnostics": [diagnostic.to_dict() for diagnostic in self.diagnostics], + "dependencies": self.dependencies, + "provenance": [event.to_dict() for event in self.provenance], + } + return {key: value for key, value in data.items() if value not in (None, [], {})} + + +@dataclass(frozen=True) +class ProcessorRun: + """Results from running all processor blocks in a document.""" + + source_path: str | None + blocks: list[FencedProcessorBlock] + results: list[ProcessorResult] + + @property + def valid(self) -> bool: + return all(result.valid for result in self.results) + + def to_dict(self) -> dict[str, Any]: + return { + "valid": self.valid, + "source_path": self.source_path, + "count": len(self.results), + "blocks": [block.to_dict() for block in self.blocks], + "results": [result.to_dict() for result in self.results], + } + + +class ProcessorRegistry: + """Explicit registry for deterministic fenced-block processors.""" + + def __init__(self) -> None: + self._processors: dict[str, ProcessorCallable] = {} + + def register(self, name: str, processor: ProcessorCallable) -> None: + key = _slug(name) + if not key: + raise ValueError("Processor name cannot be empty") + 
self._processors[key] = processor + + def names(self) -> list[str]: + return sorted(self._processors) + + def run(self, request: ProcessorRequest) -> ProcessorResult: + processor = self._processors.get(_slug(request.block.processor)) + if processor is None: + return ProcessorResult( + diagnostics=[ + Diagnostic( + severity="error", + code="processor.unknown", + message=f"Unknown processor `{request.block.processor}`", + source=SourceLocation( + path=request.block.source_path, + line=request.block.line_start, + ), + ) + ] + ) + return processor(request) + + +def default_processor_registry() -> ProcessorRegistry: + """Create the default deterministic processor registry.""" + + registry = ProcessorRegistry() + registry.register("identity", _identity_processor) + registry.register("uppercase", _uppercase_processor) + registry.register("include", _include_processor) + return registry + + +def discover_fenced_processors( + markdown: str, + *, + source_path: str | Path | None = None, +) -> list[FencedProcessorBlock]: + """Discover fenced blocks that explicitly opt into processor handling.""" + + parser = MarkdownIt("commonmark", {"tables": True}).enable("table") + blocks: list[FencedProcessorBlock] = [] + used_ids: dict[str, int] = {} + for index, token in enumerate(parser.parse(markdown)): + if token.type != "fence": + continue + attrs = _parse_fence_info(token.info) + processor = _processor_name(attrs) + if not processor: + continue + unit_id = _dedupe_id(_slug(attrs.get("id") or f"{processor}-{index}"), used_ids) + line_start = token.map[0] + 1 if token.map else None + line_end = token.map[1] if token.map else None + blocks.append( + FencedProcessorBlock( + processor=processor, + content=token.content, + unit_id=unit_id, + attrs={ + key: value + for key, value in attrs.items() + if key not in {"id", "language", "processor"} + }, + language=attrs.get("language"), + source_path=str(source_path) if source_path else None, + line_start=line_start, + line_end=line_end, + 
content_hash=_hash_text(token.content), + ) + ) + return blocks + + +def run_fenced_processors( + markdown: str, + *, + context: ProcessorContext, + registry: ProcessorRegistry | None = None, + source_path: str | Path | None = None, +) -> ProcessorRun: + """Run all processor-marked fenced blocks in document order.""" + + active_registry = registry or default_processor_registry() + blocks = discover_fenced_processors(markdown, source_path=source_path or context.current_path) + results = [ + active_registry.run(ProcessorRequest(block=block, context=context)) + for block in blocks + ] + return ProcessorRun( + source_path=str(source_path or context.current_path) if source_path or context.current_path else None, + blocks=blocks, + results=results, + ) + + +def _identity_processor(request: ProcessorRequest) -> ProcessorResult: + return ProcessorResult( + content=request.block.content, + provenance=[ + OperationProvenance( + operation="processor.identity", + source_path=request.block.source_path, + line_start=request.block.line_start, + line_end=request.block.line_end, + metadata={"unit_id": request.block.unit_id}, + ) + ], + ) + + +def _uppercase_processor(request: ProcessorRequest) -> ProcessorResult: + return ProcessorResult( + content=request.block.content.upper(), + provenance=[ + OperationProvenance( + operation="processor.uppercase", + source_path=request.block.source_path, + line_start=request.block.line_start, + line_end=request.block.line_end, + metadata={"unit_id": request.block.unit_id}, + ) + ], + ) + + +def _include_processor(request: ProcessorRequest) -> ProcessorResult: + reference = request.block.attrs.get("ref") + if not reference: + return ProcessorResult( + diagnostics=[ + Diagnostic( + severity="error", + code="processor.include.missing_ref", + message="Include processor requires a `ref` attribute", + source=SourceLocation( + path=request.block.source_path, + line=request.block.line_start, + ), + ) + ] + ) + try: + resolution = 
resolve_reference(reference, context=request.context.reference_context()) + except ReferenceResolutionError as exc: + return ProcessorResult( + diagnostics=[ + Diagnostic( + severity="error", + code="processor.include.reference_error", + message=str(exc), + source=SourceLocation( + path=request.block.source_path, + line=request.block.line_start, + ), + ) + ] + ) + content = "\n\n".join(unit.text for unit in resolution.units) + return ProcessorResult( + content=content, + dependencies=[resolution.target_path], + provenance=[ + OperationProvenance( + operation="processor.include", + source_path=request.block.source_path, + line_start=request.block.line_start, + line_end=request.block.line_end, + target_path=resolution.target_path, + dependencies=[resolution.target_path], + metadata={"ref": reference, "unit_ids": [unit.unit_id for unit in resolution.units]}, + ) + ], + ) + + +def _processor_name(attrs: dict[str, str]) -> str | None: + if "processor" in attrs: + return attrs["processor"] + language = attrs.get("language", "") + if language.startswith("mkt-"): + return language.removeprefix("mkt-") + if language == "mkt" and "type" in attrs: + return attrs["type"] + return None + + +def _parse_fence_info(info: str) -> dict[str, str]: + match = re.match(r"^(?P<language>[^\s{]+)?(?:\s+\{(?P<attrs>.*)\})?\s*$", info.strip()) + if not match: + return {"language": info.strip()} if info.strip() else {} + attrs = _parse_attrs(match.group("attrs") or "") + language = match.group("language") + if language: + attrs["language"] = language + return attrs + + +def _parse_attrs(raw: str) -> dict[str, str]: + attrs: dict[str, str] = {} + for part in shlex.split(raw): + if part.startswith("#") and len(part) > 1: + attrs["id"] = part[1:] + continue + if "=" not in part: + attrs[part] = "true" + continue + key, value = part.split("=", 1) + attrs[key.strip()] = value.strip() + return attrs + + +def _dedupe_id(unit_id: str, used_ids: dict[str, int]) -> str: + count = used_ids.get(unit_id, 0) + 1 + 
used_ids[unit_id] = count + return unit_id if count == 1 else f"{unit_id}-{count}" + + +def _slug(value: str) -> str: + slug = re.sub(r"[^a-z0-9_.:-]+", "-", value.strip().lower()) + slug = re.sub(r"-+", "-", slug).strip("-") + return slug + + +def _hash_text(text: str) -> str: + return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest() diff --git a/src/markitect_tool/reference/__init__.py b/src/markitect_tool/reference/__init__.py new file mode 100644 index 0000000..10ee563 --- /dev/null +++ b/src/markitect_tool/reference/__init__.py @@ -0,0 +1,25 @@ +"""Namespaced content reference resolution for Markdown artifacts.""" + +from markitect_tool.reference.engine import ( + ContentUnit, + ReferenceAddress, + ReferenceContext, + ReferenceResolution, + ReferenceResolutionError, + SourceSpan, + load_namespaces, + parse_reference, + resolve_reference, +) + +__all__ = [ + "ContentUnit", + "ReferenceAddress", + "ReferenceContext", + "ReferenceResolution", + "ReferenceResolutionError", + "SourceSpan", + "load_namespaces", + "parse_reference", + "resolve_reference", +] diff --git a/src/markitect_tool/reference/engine.py b/src/markitect_tool/reference/engine.py new file mode 100644 index 0000000..c5bd8a0 --- /dev/null +++ b/src/markitect_tool/reference/engine.py @@ -0,0 +1,626 @@ +"""Reference parsing and resolution for Markdown content units.""" + +from __future__ import annotations + +import hashlib +import re +import shlex +from dataclasses import asdict, dataclass, field +from pathlib import Path +from typing import Any + +from markdown_it import MarkdownIt + +from markitect_tool.core import ContentBlock, Document, Heading, Section, parse_markdown +from markitect_tool.query import InvalidQueryError, QueryMatch, query_document + + +class ReferenceResolutionError(ValueError): + """Raised when a content reference cannot be resolved.""" + + +@dataclass(frozen=True) +class ReferenceAddress: + """Parsed content reference address. 
+ + Syntax is intentionally compact and Markdown-friendly: + + - ``path/to/file.md`` + - ``std:clauses/payment.md`` + - ``std:clauses/payment.md#section:terms`` + - ``std:clauses/payment.md::sections[heading=Terms]`` + - ``#intro`` for a fragment in the current document + """ + + raw: str + namespace: str | None = None + address: str = "" + fragment: str | None = None + selector: str | None = None + + def to_dict(self) -> dict[str, Any]: + return { + key: value + for key, value in asdict(self).items() + if value is not None and value != "" + } + + +@dataclass(frozen=True) +class ReferenceContext: + """Inputs used to resolve namespaced and relative content references.""" + + root: Path = Path(".") + current_path: Path | None = None + namespaces: dict[str, str] = field(default_factory=dict) + + @classmethod + def from_document( + cls, + document: Document, + *, + root: str | Path = ".", + current_path: str | Path | None = None, + ) -> "ReferenceContext": + """Build a reference context from document frontmatter.""" + + source_path = current_path or document.source_path + return cls( + root=Path(root), + current_path=Path(source_path) if source_path else None, + namespaces=load_namespaces(document.frontmatter), + ) + + def to_dict(self) -> dict[str, Any]: + data = { + "root": str(self.root), + "current_path": str(self.current_path) if self.current_path else None, + "namespaces": self.namespaces, + } + return {key: value for key, value in data.items() if value is not None} + + +@dataclass(frozen=True) +class SourceSpan: + """Line span for a resolved unit in its source file.""" + + line_start: int | None = None + line_end: int | None = None + + def to_dict(self) -> dict[str, Any]: + return {key: value for key, value in asdict(self).items() if value is not None} + + +@dataclass(frozen=True) +class ContentUnit: + """One addressable content unit resolved from Markdown.""" + + kind: str + unit_id: str + text: str + source_path: str + span: SourceSpan | None = None + name: 
str | None = None + content_hash: str = "" + metadata: dict[str, Any] = field(default_factory=dict) + + def to_dict(self) -> dict[str, Any]: + data = { + "kind": self.kind, + "unit_id": self.unit_id, + "name": self.name, + "source_path": self.source_path, + "span": self.span.to_dict() if self.span else None, + "content_hash": self.content_hash, + "metadata": self.metadata or None, + "text": self.text, + } + return {key: value for key, value in data.items() if value is not None} + + +@dataclass(frozen=True) +class ReferenceResolution: + """Resolved content reference and its dependency edge.""" + + reference: ReferenceAddress + source_path: str + target_path: str + units: list[ContentUnit] + + def to_dict(self) -> dict[str, Any]: + return { + "reference": self.reference.to_dict(), + "source_path": self.source_path, + "target_path": self.target_path, + "count": len(self.units), + "units": [unit.to_dict() for unit in self.units], + } + + +_NAMESPACE_RE = re.compile(r"^(?P<namespace>[A-Za-z][A-Za-z0-9_.-]*):(?P<address>.*)$")
+_HEADING_ID_RE = re.compile(r"^(?P.*?)(?:\s+\{#(?P<id>[A-Za-z0-9_.:-]+)\})?$") +_REGION_OPEN_RE = re.compile(r"<!--\s*mkt:region\s+(?P<attrs>.*?)\s*-->") +_REGION_CLOSE_RE = re.compile(r"<!--\s*/mkt:region\s*-->") +_FENCE_ATTRS_RE = re.compile(r"^(?P<language>[^\s{]+)?(?:\s+\{(?P<attrs>.*)\})?\s*$") + + +def parse_reference(reference: str) -> ReferenceAddress: + """Parse a compact Markitect content reference.""" + + raw = reference.strip() + if not raw: + raise ReferenceResolutionError("Reference cannot be empty") + + selector: str | None = None + base = raw + if "::" in base: + base, selector = base.split("::", 1) + selector = selector.strip() + if not selector: + raise ReferenceResolutionError(f"Reference selector is empty in `{reference}`") + + fragment: str | None = None + if "#" in base: + base, fragment = base.split("#", 1) + fragment = fragment.strip() + if not fragment: + raise ReferenceResolutionError(f"Reference fragment is empty in `{reference}`") + + namespace: str | None = None + address = base.strip() + match = _NAMESPACE_RE.match(address) + if match and "/" not in match.group("namespace") and "\\" not in match.group("namespace"): + namespace = match.group("namespace") + address = match.group("address").strip() + + return ReferenceAddress( + raw=raw, + namespace=namespace, + address=address, + fragment=fragment, + selector=selector, + ) + + +def load_namespaces(frontmatter: dict[str, Any]) -> dict[str, str]: + """Load namespace mappings from Markdown frontmatter.""" + + raw_namespaces = frontmatter.get("namespaces", {}) + if raw_namespaces is None: + return {} + if not isinstance(raw_namespaces, dict): + raise ReferenceResolutionError("Frontmatter `namespaces` must be a mapping") + + namespaces: dict[str, str] = {} + for raw_key, raw_value in raw_namespaces.items(): + key = str(raw_key).strip().rstrip(":") + if not key: + raise ReferenceResolutionError("Namespace keys cannot be empty") + if not _NAMESPACE_RE.match(f"{key}:"): + raise 
ReferenceResolutionError(f"Invalid namespace key `{raw_key}`") + if not isinstance(raw_value, str): + raise ReferenceResolutionError(f"Namespace `{key}` must map to a string path") + value = raw_value.strip() + if not value: + raise ReferenceResolutionError(f"Namespace `{key}` cannot map to an empty path") + namespaces[key] = value + return namespaces + + +def resolve_reference( + reference: str | ReferenceAddress, + *, + context: ReferenceContext, +) -> ReferenceResolution: + """Resolve a content reference to one or more content units.""" + + address = parse_reference(reference) if isinstance(reference, str) else reference + root = context.root.resolve() + source_path = context.current_path.resolve() if context.current_path else root + target_path = _resolve_target_path(address, context, root, source_path) + if not target_path.exists() or not target_path.is_file(): + raise ReferenceResolutionError(f"Referenced file not found: {target_path}") + + markdown = target_path.read_text(encoding="utf-8") + document = parse_markdown(markdown, source_path=str(target_path)) + + if address.selector and address.fragment: + raise ReferenceResolutionError("Reference cannot use both fragment and selector") + if address.selector: + units = _units_from_selector(document, address.selector, target_path) + elif address.fragment: + units = _units_from_fragment(document, address.fragment, target_path, markdown) + else: + units = [_document_unit(document, target_path, markdown)] + + if not units: + raise ReferenceResolutionError(f"Reference `{address.raw}` did not match any content units") + + return ReferenceResolution( + reference=address, + source_path=str(source_path), + target_path=str(target_path), + units=units, + ) + + +def _resolve_target_path( + address: ReferenceAddress, + context: ReferenceContext, + root: Path, + source_path: Path, +) -> Path: + if address.namespace: + if address.namespace not in context.namespaces: + raise ReferenceResolutionError(f"Unknown namespace 
`{address.namespace}`") + namespace_target = _path_from_namespace(context.namespaces[address.namespace], root) + candidate = namespace_target / address.address if namespace_target.is_dir() else namespace_target + elif address.address: + base_dir = source_path.parent if source_path.is_file() else root + candidate = Path(address.address) + candidate = candidate if candidate.is_absolute() else base_dir / candidate + elif context.current_path: + candidate = context.current_path + else: + raise ReferenceResolutionError("Pathless references require a current document") + + resolved = candidate.resolve() + try: + resolved.relative_to(root) + except ValueError as exc: + raise ReferenceResolutionError(f"Reference escapes root: {address.raw}") from exc + return resolved + + +def _path_from_namespace(raw_path: str, root: Path) -> Path: + path = Path(raw_path) + if not path.is_absolute(): + path = root / path + return path.resolve() + + +def _units_from_selector( + document: Document, + selector: str, + target_path: Path, +) -> list[ContentUnit]: + try: + matches = query_document(document, selector) + except InvalidQueryError as exc: + raise ReferenceResolutionError(str(exc)) from exc + return [_unit_from_query_match(match, target_path) for match in matches] + + +def _units_from_fragment( + document: Document, + fragment: str, + target_path: Path, + markdown: str, +) -> list[ContentUnit]: + kind, _, value = fragment.partition(":") + if not value: + kind, value = "id", kind + lookup = _slug(value) + + if kind == "document": + return [_document_unit(document, target_path, markdown)] + if kind == "id": + for units in [ + _section_units(document, target_path), + _region_units(markdown, target_path), + _fenced_block_units(markdown, target_path), + _heading_units(document, target_path), + ]: + matches = [ + unit for unit in units if unit.unit_id == lookup or _slug(unit.name or "") == lookup + ] + if matches: + return matches + return [] + if kind in {"id", "section"}: + sections = 
_section_units(document, target_path) + return [unit for unit in sections if unit.unit_id == lookup or _slug(unit.name or "") == lookup] + if kind == "heading": + headings = _heading_units(document, target_path) + return [unit for unit in headings if unit.unit_id == lookup or _slug(unit.name or "") == lookup] + if kind == "block": + return _block_fragment_units(document, target_path, value) + if kind == "region": + return [unit for unit in _region_units(markdown, target_path) if unit.unit_id == lookup] + if kind == "fence": + return [unit for unit in _fenced_block_units(markdown, target_path) if unit.unit_id == lookup] + if kind == "tag": + return [ + unit + for unit in _region_units(markdown, target_path) + _fenced_block_units(markdown, target_path) + if lookup in {_slug(tag) for tag in unit.metadata.get("tags", [])} + ] + if kind == "line": + return _line_range_units(markdown, target_path, value) + raise ReferenceResolutionError(f"Unsupported reference fragment kind `{kind}`") + + +def _document_unit(document: Document, target_path: Path, markdown: str) -> ContentUnit: + unit_id = _slug(str(document.frontmatter.get("id") or target_path.stem)) + return _content_unit( + kind="document", + unit_id=unit_id, + text=markdown, + source_path=target_path, + span=SourceSpan(1, len(markdown.splitlines())), + name=str(document.frontmatter.get("title") or target_path.stem), + metadata={"frontmatter": document.frontmatter}, + ) + + +def _unit_from_query_match(match: QueryMatch, target_path: Path) -> ContentUnit: + unit_id = _slug(match.path.replace("$.", "").replace("[", "-").replace("]", "")) + name = match.text.splitlines()[0].lstrip("# ").strip() if match.text else match.kind + return _content_unit( + kind=match.kind, + unit_id=unit_id, + text=match.text if match.text is not None else str(match.value), + source_path=target_path, + span=SourceSpan(match.line, None), + name=name, + metadata={"query_path": match.path, "value": match.value}, + ) + + +def 
_section_units(document: Document, target_path: Path) -> list[ContentUnit]: + used_ids: dict[str, int] = {} + return [ + _section_unit(section, target_path, used_ids) + for section in document.sections + ] + + +def _section_unit( + section: Section, + target_path: Path, + used_ids: dict[str, int], +) -> ContentUnit: + title, explicit_id = _heading_title_and_id(section.heading) + unit_id = _dedupe_id(_slug(explicit_id or title), used_ids) + line_end = section.blocks[-1].line_end if section.blocks else section.heading.line + lines = [f"{'#' * section.heading.level} {section.heading.text}"] + for block in section.blocks: + if block.text: + lines.extend(["", block.text]) + return _content_unit( + kind="section", + unit_id=unit_id, + text="\n".join(lines).strip(), + source_path=target_path, + span=SourceSpan(section.heading.line, line_end), + name=title, + metadata={"heading_level": section.heading.level}, + ) + + +def _heading_units(document: Document, target_path: Path) -> list[ContentUnit]: + used_ids: dict[str, int] = {} + units: list[ContentUnit] = [] + for heading in document.headings: + title, explicit_id = _heading_title_and_id(heading) + unit_id = _dedupe_id(_slug(explicit_id or title), used_ids) + units.append( + _content_unit( + kind="heading", + unit_id=unit_id, + text=f"{'#' * heading.level} {heading.text}", + source_path=target_path, + span=SourceSpan(heading.line, heading.line), + name=title, + metadata={"heading_level": heading.level}, + ) + ) + return units + + +def _block_fragment_units( + document: Document, + target_path: Path, + value: str, +) -> list[ContentUnit]: + blocks = _block_units(document.blocks, target_path) + if value.isdigit(): + index = int(value) + return [blocks[index]] if 0 <= index < len(blocks) else [] + lookup = _slug(value) + return [unit for unit in blocks if unit.unit_id == lookup] + + +def _block_units(blocks: list[ContentBlock], target_path: Path) -> list[ContentUnit]: + used_ids: dict[str, int] = {} + units: 
list[ContentUnit] = [] + for index, block in enumerate(blocks): + base_id = f"{block.type}-{block.line_start or index}" + units.append( + _content_unit( + kind=block.type, + unit_id=_dedupe_id(_slug(base_id), used_ids), + text=block.text, + source_path=target_path, + span=SourceSpan(block.line_start, block.line_end), + name=block.type, + metadata={"block_index": index}, + ) + ) + return units + + +def _region_units(markdown: str, target_path: Path) -> list[ContentUnit]: + lines = markdown.splitlines() + units: list[ContentUnit] = [] + open_region: tuple[int, str, list[str]] | None = None + + for index, line in enumerate(lines, start=1): + open_match = _REGION_OPEN_RE.search(line) + close_match = _REGION_CLOSE_RE.search(line) + if open_match and open_region is not None: + raise ReferenceResolutionError("Nested mkt:region blocks are not supported") + if close_match: + if open_region is None: + raise ReferenceResolutionError("Region close marker has no matching open marker") + start_line, region_id, tags = open_region + content_lines = lines[start_line:index - 1] + units.append( + _content_unit( + kind="region", + unit_id=_slug(region_id), + text="\n".join(content_lines).strip(), + source_path=target_path, + span=SourceSpan(start_line, index), + name=region_id, + metadata={"tags": tags}, + ) + ) + open_region = None + continue + if open_match: + attrs = _parse_attrs(open_match.group("attrs")) + region_id = attrs.get("id") + if not region_id: + raise ReferenceResolutionError("Region marker requires an id attribute") + open_region = (index, region_id, _tags_from_attrs(attrs)) + + if open_region is not None: + raise ReferenceResolutionError("Region open marker has no matching close marker") + return units + + +def _fenced_block_units(markdown: str, target_path: Path) -> list[ContentUnit]: + parser = MarkdownIt("commonmark", {"tables": True}).enable("table") + units: list[ContentUnit] = [] + used_ids: dict[str, int] = {} + for index, token in 
enumerate(parser.parse(markdown)): + if token.type != "fence": + continue + attrs = _parse_fence_info(token.info) + unit_id = attrs.get("id") + if not unit_id: + continue + line_start = token.map[0] + 1 if token.map else None + line_end = token.map[1] if token.map else None + units.append( + _content_unit( + kind="fenced_block", + unit_id=_dedupe_id(_slug(unit_id), used_ids), + text=token.content, + source_path=target_path, + span=SourceSpan(line_start, line_end), + name=unit_id, + metadata={ + "language": attrs.get("language"), + "tags": _tags_from_attrs(attrs), + "attrs": { + key: value + for key, value in attrs.items() + if key not in {"id", "language", "tag", "tags"} + }, + "block_index": index, + }, + ) + ) + return units + + +def _line_range_units(markdown: str, target_path: Path, value: str) -> list[ContentUnit]: + match = re.match(r"^(?P<start>\d+)(?:-(?P<end>\d+))?$", value) + if not match: + raise ReferenceResolutionError("Line fragments must use `line:start` or `line:start-end`") + start = int(match.group("start")) + end = int(match.group("end") or start) + lines = markdown.splitlines() + if start < 1 or end < start or end > len(lines): + return [] + text = "\n".join(lines[start - 1:end]) + return [ + _content_unit( + kind="line_range", + unit_id=f"line-{start}-{end}", + text=text, + source_path=target_path, + span=SourceSpan(start, end), + name=f"lines {start}-{end}", + metadata={}, + ) + ] + + +def _parse_fence_info(info: str) -> dict[str, str]: + match = _FENCE_ATTRS_RE.match(info.strip()) + if not match: + return {"language": info.strip()} if info.strip() else {} + attrs = _parse_attrs(match.group("attrs") or "") + language = match.group("language") + if language: + attrs["language"] = language + if "id" not in attrs and attrs: + for key in list(attrs): + if key.startswith("#"): + attrs["id"] = key[1:] + del attrs[key] + break + return attrs + + +def _parse_attrs(raw: str) -> dict[str, str]: + attrs: dict[str, str] = {} + for part in 
shlex.split(raw): + if part.startswith("#") and len(part) > 1: + attrs["id"] = part[1:] + continue + if "=" not in part: + attrs[part] = "true" + continue + key, value = part.split("=", 1) + attrs[key.strip()] = value.strip() + return attrs + + +def _tags_from_attrs(attrs: dict[str, str]) -> list[str]: + raw = attrs.get("tags") or attrs.get("tag") or "" + return [tag.strip() for tag in re.split(r"[, ]+", raw) if tag.strip()] + + +def _content_unit( + *, + kind: str, + unit_id: str, + text: str, + source_path: Path, + span: SourceSpan | None, + name: str | None, + metadata: dict[str, Any] | None = None, +) -> ContentUnit: + return ContentUnit( + kind=kind, + unit_id=unit_id, + text=text, + source_path=str(source_path), + span=span, + name=name, + content_hash="sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest(), + metadata=metadata or {}, + ) + + +def _heading_title_and_id(heading: Heading) -> tuple[str, str | None]: + match = _HEADING_ID_RE.match(heading.text.strip()) + if not match: + return heading.text.strip(), None + return match.group("title").strip(), match.group("id") + + +def _dedupe_id(unit_id: str, used_ids: dict[str, int]) -> str: + count = used_ids.get(unit_id, 0) + 1 + used_ids[unit_id] = count + return unit_id if count == 1 else f"{unit_id}-{count}" + + +def _slug(value: str) -> str: + slug = re.sub(r"[^a-z0-9_.:-]+", "-", value.strip().lower()) + slug = re.sub(r"-+", "-", slug).strip("-") + return slug or "unit" diff --git a/tests/test_content_class_resolution.py b/tests/test_content_class_resolution.py new file mode 100644 index 0000000..54130bd --- /dev/null +++ b/tests/test_content_class_resolution.py @@ -0,0 +1,106 @@ +from pathlib import Path + +from click.testing import CliRunner + +from markitect_tool.cli import main +from markitect_tool.content_class import load_content_classes + + +def test_c3_linearization_for_diamond_inheritance(): + registry = load_content_classes( + { + "classes": { + "base": {"slots": {"sections": 
["Overview"]}}, + "left": {"extends": ["base"], "slots": {"sections": ["Left"]}}, + "right": {"extends": ["base"], "slots": {"sections": ["Right"]}}, + "leaf": {"extends": ["left", "right"], "slots": {"title": "Leaf"}}, + } + } + ) + + assert registry.linearize("leaf") == ["leaf", "left", "right", "base"] + + +def test_compose_merges_slots_with_explicit_policies(): + registry = load_content_classes( + { + "classes": { + "base": { + "slots": { + "sections": ["Overview"], + "assertions": {"tone": "plain", "depth": "short"}, + } + }, + "market": { + "extends": ["base"], + "slots": { + "sections": ["Pricing"], + "assertions": {"depth": "detailed"}, + }, + "merge_policies": { + "sections": "append", + "assertions": "deep_merge", + }, + }, + "instance": { + "extends": ["market"], + "slots": {"sections": ["Risks"]}, + "merge_policies": {"sections": "append"}, + }, + } + } + ) + + result = registry.compose("instance") + + assert result.valid + assert result.slots["sections"] == ["Overview", "Pricing", "Risks"] + assert result.slots["assertions"] == {"tone": "plain", "depth": "detailed"} + + +def test_compose_reports_error_on_conflict(): + registry = load_content_classes( + { + "classes": { + "base": {"slots": {"owner": "A"}}, + "instance": { + "extends": ["base"], + "slots": {"owner": "B"}, + "merge_policies": {"owner": "error_on_conflict"}, + }, + } + } + ) + + result = registry.compose("instance") + + assert not result.valid + assert result.diagnostics[0].code == "content_class.merge_conflict" + + +def test_mkt_class_resolve_outputs_text(tmp_path: Path): + class_file = tmp_path / "classes.yaml" + class_file.write_text( + """classes: + base: + slots: + sections: + - Overview + instance: + extends: + - base + slots: + sections: + - Risks + merge_policies: + sections: append +""", + encoding="utf-8", + ) + + result = CliRunner().invoke(main, ["class", "resolve", str(class_file), "instance"]) + + assert result.exit_code == 0 + assert "linearization: instance -> base" in 
result.output + assert "Overview" in result.output + assert "Risks" in result.output diff --git a/tests/test_explode_implode.py b/tests/test_explode_implode.py new file mode 100644 index 0000000..0b1a121 --- /dev/null +++ b/tests/test_explode_implode.py @@ -0,0 +1,93 @@ +from pathlib import Path + +import pytest +from click.testing import CliRunner + +from markitect_tool.cli import main +from markitect_tool.explode import ( + EXPLODE_MANIFEST_NAME, + ExplodeError, + explode_markdown_file, + implode_markdown_directory, +) + + +ROUNDTRIP_DOC = """--- +title: Explode Example +--- + +Opening text before the first heading. + +# Intro + +Intro body. + +## Detail + +Detail body. + +# Later + +Later body. +""" + + +def test_flat_explode_implode_roundtrips_exact_markdown(tmp_path: Path): + source = tmp_path / "source.md" + output_dir = tmp_path / "exploded" + source.write_text(ROUNDTRIP_DOC, encoding="utf-8") + + result = explode_markdown_file(source, output_dir, variant="flat") + imploded = implode_markdown_directory(output_dir) + + assert Path(result.manifest_path).name == EXPLODE_MANIFEST_NAME + assert (output_dir / "00-preamble.md").exists() + assert (output_dir / "sections" / "01-intro.md").exists() + assert imploded.markdown == ROUNDTRIP_DOC + assert imploded.current_hash == result.manifest.source_hash + + +def test_hierarchical_explode_places_child_sections_under_parent(tmp_path: Path): + source = tmp_path / "source.md" + output_dir = tmp_path / "exploded" + source.write_text(ROUNDTRIP_DOC, encoding="utf-8") + + result = explode_markdown_file(source, output_dir, variant="hierarchical") + + files = {Path(path).relative_to(output_dir).as_posix() for path in result.written_files} + assert "01-intro.md" in files + assert "01-intro/02-detail.md" in files + assert implode_markdown_directory(output_dir).markdown == ROUNDTRIP_DOC + + +def test_explode_rejects_non_empty_output_without_force(tmp_path: Path): + source = tmp_path / "source.md" + output_dir = tmp_path / 
"exploded" + output_dir.mkdir() + (output_dir / "existing.md").write_text("Existing", encoding="utf-8") + source.write_text(ROUNDTRIP_DOC, encoding="utf-8") + + with pytest.raises(ExplodeError, match="not empty"): + explode_markdown_file(source, output_dir) + + +def test_mkt_explode_and_implode(tmp_path: Path): + source = tmp_path / "source.md" + output_dir = tmp_path / "exploded" + rebuilt = tmp_path / "rebuilt.md" + source.write_text(ROUNDTRIP_DOC, encoding="utf-8") + runner = CliRunner() + + explode_result = runner.invoke( + main, + ["explode", str(source), "--output-dir", str(output_dir), "--variant", "flat"], + ) + implode_result = runner.invoke( + main, + ["implode", str(output_dir), "--output", str(rebuilt)], + ) + + assert explode_result.exit_code == 0 + assert "entries: 4" in explode_result.output + assert implode_result.exit_code == 0 + assert rebuilt.read_text(encoding="utf-8") == ROUNDTRIP_DOC diff --git a/tests/test_literate_weave_tangle.py b/tests/test_literate_weave_tangle.py new file mode 100644 index 0000000..07e3a81 --- /dev/null +++ b/tests/test_literate_weave_tangle.py @@ -0,0 +1,91 @@ +from pathlib import Path + +from click.testing import CliRunner + +from markitect_tool.cli import main +from markitect_tool.literate import ( + discover_code_chunks, + tangle_markdown, + weave_markdown, + write_tangle_files, +) + + +LITERATE_DOC = """# Literate Example + +```python {#helpers} +def helper(): + return "ready" +``` + +```python {#main tangle="src/app.py"} +<<helpers>> + +def main(): + return helper() +``` +""" + + +def test_discover_code_chunks_with_references_and_targets(): + chunks = discover_code_chunks(LITERATE_DOC, source_path="example.md") + + assert [chunk.chunk_id for chunk in chunks] == ["helpers", "main"] + assert chunks[1].target_path == "src/app.py" + assert chunks[1].references == ["helpers"] + + +def test_tangle_expands_named_chunk_references(): + result = tangle_markdown(LITERATE_DOC, source_path="example.md") + + assert result.valid 
+ assert len(result.files) == 1 + assert result.files[0].path == "src/app.py" + assert "def helper" in result.files[0].content + assert "<<helpers>>" not in result.files[0].content + assert result.provenance[0].operation == "literate.tangle" + + +def test_tangle_reports_missing_chunk_reference(): + markdown = """```python {#main tangle="src/app.py"} +<<missing>> +``` +""" + + result = tangle_markdown(markdown, source_path="example.md") + + assert not result.valid + assert result.diagnostics[0].code == "literate.missing_chunk" + + +def test_weave_appends_chunk_index(): + result = weave_markdown(LITERATE_DOC, source_path="example.md") + + assert "## Code Chunk Index" in result.markdown + assert "`main` -> `src/app.py`; refs: `helpers`" in result.markdown + + +def test_write_tangle_files(tmp_path: Path): + result = tangle_markdown(LITERATE_DOC, source_path="example.md") + + written = write_tangle_files(result, tmp_path) + + assert written == [str(tmp_path / "src" / "app.py")] + assert "def main" in (tmp_path / "src" / "app.py").read_text(encoding="utf-8") + + +def test_mkt_tangle_and_weave(tmp_path: Path): + source = tmp_path / "literate.md" + output_dir = tmp_path / "out" + woven = tmp_path / "woven.md" + source.write_text(LITERATE_DOC, encoding="utf-8") + runner = CliRunner() + + tangle_result = runner.invoke(main, ["tangle", str(source), "--output-dir", str(output_dir)]) + weave_result = runner.invoke(main, ["weave", str(source), "--output", str(woven)]) + + assert tangle_result.exit_code == 0 + assert "files: 1" in tangle_result.output + assert (output_dir / "src" / "app.py").exists() + assert weave_result.exit_code == 0 + assert "## Code Chunk Index" in woven.read_text(encoding="utf-8") diff --git a/tests/test_ops_transform_compose_include.py b/tests/test_ops_transform_compose_include.py index 981a933..21754dd 100644 --- a/tests/test_ops_transform_compose_include.py +++ b/tests/test_ops_transform_compose_include.py @@ -34,6 +34,27 @@ title: Original assert "## 
Intro" in result.markdown assert "### Detail" in result.markdown assert result.operations == ["set_frontmatter", "shift_headings:1"] + assert [event.operation for event in result.provenance] == [ + "set_frontmatter", + "shift_headings", + ] + + +def test_transform_shifts_headings_without_touching_fenced_code(): + markdown = """# Intro + +```markdown +# Literal Heading +``` + +## Real Heading +""" + + result = transform_markdown(markdown, heading_delta=1) + + assert "```markdown\n# Literal Heading\n```" in result.markdown + assert "### Real Heading" in result.markdown + assert result.provenance[0].metadata["affected_lines"] == [1, 7] def test_transform_extracts_selector_text(): @@ -104,6 +125,25 @@ def test_resolve_includes_supports_brace_shorthand(tmp_path: Path): assert "Before" in result.markdown assert "Included body." in result.markdown assert "After" in result.markdown + assert result.provenance[0].operation == "include" + assert result.provenance[0].target_path == str(partial.resolve()) + + +def test_resolve_includes_ignores_markers_inside_fenced_code(tmp_path: Path): + partial = tmp_path / "partial.md" + partial.write_text("Included body.", encoding="utf-8") + markdown = """```markdown +{{include:partial.md}} +``` + +{{include:partial.md}} +""" + + result = resolve_includes(markdown, base_dir=tmp_path) + + assert result.markdown.count("Included body.") == 1 + assert "{{include:partial.md}}" in result.markdown + assert result.included_paths == [str(partial.resolve())] def test_resolve_includes_rejects_cycles(tmp_path: Path): diff --git a/tests/test_processor_registry.py b/tests/test_processor_registry.py new file mode 100644 index 0000000..0aaaa31 --- /dev/null +++ b/tests/test_processor_registry.py @@ -0,0 +1,105 @@ +from pathlib import Path + +from click.testing import CliRunner + +from markitect_tool.cli import main +from markitect_tool.core import parse_markdown +from markitect_tool.processor import ( + ProcessorContext, + default_processor_registry, + 
discover_fenced_processors, + run_fenced_processors, +) +from markitect_tool.reference import load_namespaces + + +def test_discover_fenced_processors_from_language_prefix(): + markdown = """# Doc + +```mkt-uppercase {#shout} +hello +``` +""" + + blocks = discover_fenced_processors(markdown, source_path="doc.md") + + assert len(blocks) == 1 + assert blocks[0].processor == "uppercase" + assert blocks[0].unit_id == "shout" + assert blocks[0].line_start == 3 + + +def test_default_registry_runs_uppercase_processor(): + markdown = """```mkt-uppercase {#shout} +hello +``` +""" + context = ProcessorContext() + + run = run_fenced_processors(markdown, context=context) + + assert run.valid + assert run.results[0].content == "HELLO\n" + assert run.results[0].provenance[0].operation == "processor.uppercase" + + +def test_include_processor_uses_reference_resolver(tmp_path: Path): + source = tmp_path / "doc.md" + partial = tmp_path / "partial.md" + source.write_text( + """--- +namespaces: + local: . +--- + +```mkt-include {#intro ref="local:partial.md#summary"} +``` +""", + encoding="utf-8", + ) + partial.write_text("# Partial\n\n## Summary\n\nIncluded summary.\n", encoding="utf-8") + document = parse_markdown(source.read_text(encoding="utf-8"), source_path=str(source)) + context = ProcessorContext( + root=tmp_path, + current_path=source, + namespaces=load_namespaces(document.frontmatter), + ) + + run = run_fenced_processors(source.read_text(encoding="utf-8"), context=context) + + assert run.valid + assert run.results[0].dependencies == [str(partial.resolve())] + assert "Included summary" in run.results[0].content + + +def test_unknown_processor_returns_diagnostic(): + markdown = """```mkt-nope {#x} +content +``` +""" + registry = default_processor_registry() + + run = run_fenced_processors(markdown, context=ProcessorContext(), registry=registry) + + assert not run.valid + assert run.results[0].diagnostics[0].code == "processor.unknown" + + +def 
test_mkt_process_outputs_text(tmp_path: Path): + source = tmp_path / "doc.md" + source.write_text( + """# Doc + +```mkt-uppercase {#shout} +hello +``` +""", + encoding="utf-8", + ) + + result = CliRunner().invoke(main, ["process", str(source), "--root", str(tmp_path)]) + + assert result.exit_code == 0 + assert "valid" in result.output + assert "uppercase shout" in result.output + assert "HELLO" in result.output diff --git a/tests/test_reference_resolution.py b/tests/test_reference_resolution.py new file mode 100644 index 0000000..eb3d23f --- /dev/null +++ b/tests/test_reference_resolution.py @@ -0,0 +1,195 @@ +from pathlib import Path + +import pytest +from click.testing import CliRunner + +from markitect_tool.cli import main +from markitect_tool.core import parse_markdown +from markitect_tool.reference import ( + ReferenceContext, + ReferenceResolutionError, + load_namespaces, + parse_reference, + resolve_reference, +) + + +def test_parse_reference_splits_namespace_fragment_and_selector(): + address = parse_reference("std:clauses/payment.md#section:fees::blocks[type=code]") + + assert address.namespace == "std" + assert address.address == "clauses/payment.md" + assert address.fragment == "section:fees" + assert address.selector == "blocks[type=code]" + + +def test_load_namespaces_accepts_optional_colon_suffix(): + namespaces = load_namespaces({"namespaces": {"std:": "./standard", "src": "../src"}}) + + assert namespaces == {"std": "./standard", "src": "../src"} + + +def test_resolve_path_reference_returns_document_unit(tmp_path: Path): + context_file = tmp_path / "context.md" + target_file = tmp_path / "target.md" + context_file.write_text("# Context\n", encoding="utf-8") + target_file.write_text("---\nid: target-doc\ntitle: Target\n---\n\n# Target\n\nBody.", encoding="utf-8") + context = ReferenceContext(root=tmp_path, current_path=context_file) + + resolution = resolve_reference("target.md", context=context) + + assert resolution.target_path == 
str(target_file.resolve()) + assert len(resolution.units) == 1 + assert resolution.units[0].kind == "document" + assert resolution.units[0].unit_id == "target-doc" + assert "# Target" in resolution.units[0].text + + +def test_resolve_namespace_reference_and_explicit_section_id(tmp_path: Path): + standard = tmp_path / "standard" + standard.mkdir() + context_file = tmp_path / "context.md" + clause_file = standard / "clauses.md" + context_file.write_text( + "---\nnamespaces:\n std: ./standard\n---\n\n# Context\n", + encoding="utf-8", + ) + clause_file.write_text( + "# Clauses\n\n## Payment Terms {#payment-terms}\n\nPay within 30 days.\n", + encoding="utf-8", + ) + document = parse_markdown(context_file.read_text(encoding="utf-8"), source_path=str(context_file)) + context = ReferenceContext.from_document(document, root=tmp_path) + + resolution = resolve_reference("std:clauses.md#section:payment-terms", context=context) + + assert resolution.units[0].kind == "section" + assert resolution.units[0].unit_id == "payment-terms" + assert resolution.units[0].name == "Payment Terms" + assert "Pay within 30 days" in resolution.units[0].text + + +def test_resolve_selector_reference_uses_existing_query_engine(tmp_path: Path): + standard = tmp_path / "standard" + standard.mkdir() + context_file = tmp_path / "context.md" + source_file = standard / "clauses.md" + context_file.write_text( + "---\nnamespaces:\n std: ./standard\n---\n\n# Context\n", + encoding="utf-8", + ) + source_file.write_text( + "# Clauses\n\n## Warranty\n\nWarranty text.\n\n## Liability\n\nLiability text.\n", + encoding="utf-8", + ) + context = ReferenceContext.from_document(parse_markdown(context_file.read_text(encoding="utf-8"), str(context_file)), root=tmp_path) + + resolution = resolve_reference("std:clauses.md::sections[heading=Warranty]", context=context) + + assert [unit.kind for unit in resolution.units] == ["section"] + assert resolution.units[0].name == "Warranty" + assert "Liability" not in 
resolution.units[0].text + + +def test_resolve_pathless_fragment_uses_current_document(tmp_path: Path): + context_file = tmp_path / "context.md" + context_file.write_text("# Context\n\n## Overview\n\nUseful local context.\n", encoding="utf-8") + context = ReferenceContext(root=tmp_path, current_path=context_file) + + resolution = resolve_reference("#overview", context=context) + + assert resolution.target_path == str(context_file.resolve()) + assert resolution.units[0].kind == "section" + assert resolution.units[0].unit_id == "overview" + assert "Useful local context" in resolution.units[0].text + + +def test_resolve_named_region_by_id_and_tag(tmp_path: Path): + context_file = tmp_path / "context.md" + context_file.write_text( + """# Context + +<!-- mkt:region id="overview" tags="reuse summary" --> +Reusable region text. +<!-- /mkt:region --> +""", + encoding="utf-8", + ) + context = ReferenceContext(root=tmp_path, current_path=context_file) + + by_id = resolve_reference("#region:overview", context=context) + by_tag = resolve_reference("#tag:summary", context=context) + + assert by_id.units[0].kind == "region" + assert by_id.units[0].text == "Reusable region text." 
+ assert by_tag.units[0].unit_id == "overview" + + +def test_resolve_fenced_block_by_id(tmp_path: Path): + context_file = tmp_path / "context.md" + context_file.write_text( + """# Context + +```python {#load-config tags="code setup" tangle="src/config.py"} +def load_config(): + return {} +``` +""", + encoding="utf-8", + ) + context = ReferenceContext(root=tmp_path, current_path=context_file) + + resolution = resolve_reference("#fence:load-config", context=context) + + assert resolution.units[0].kind == "fenced_block" + assert resolution.units[0].unit_id == "load-config" + assert resolution.units[0].metadata["language"] == "python" + assert resolution.units[0].metadata["attrs"]["tangle"] == "src/config.py" + assert "def load_config" in resolution.units[0].text + + +def test_resolve_line_range_fragment(tmp_path: Path): + context_file = tmp_path / "context.md" + context_file.write_text("# Context\n\nLine A\nLine B\nLine C\n", encoding="utf-8") + context = ReferenceContext(root=tmp_path, current_path=context_file) + + resolution = resolve_reference("#line:3-4", context=context) + + assert resolution.units[0].kind == "line_range" + assert resolution.units[0].span.line_start == 3 + assert resolution.units[0].text == "Line A\nLine B" + + +def test_resolve_rejects_unknown_namespace(tmp_path: Path): + context_file = tmp_path / "context.md" + context_file.write_text("# Context\n", encoding="utf-8") + context = ReferenceContext(root=tmp_path, current_path=context_file) + + with pytest.raises(ReferenceResolutionError, match="Unknown namespace"): + resolve_reference("missing:doc.md", context=context) + + +def test_resolve_rejects_paths_outside_root(tmp_path: Path): + context_file = tmp_path / "context.md" + context_file.write_text("# Context\n", encoding="utf-8") + context = ReferenceContext(root=tmp_path, current_path=context_file) + + with pytest.raises(ReferenceResolutionError, match="escapes root"): + resolve_reference("../outside.md", context=context) + + +def 
test_mkt_ref_resolve_outputs_text(tmp_path: Path): + context_file = tmp_path / "context.md" + target_file = tmp_path / "target.md" + context_file.write_text("# Context\n", encoding="utf-8") + target_file.write_text("# Target\n\n## Decision\n\nChosen.", encoding="utf-8") + + result = CliRunner().invoke( + main, + ["ref", "resolve", str(context_file), "target.md#decision", "--root", str(tmp_path)], + ) + + assert result.exit_code == 0 + assert "1 unit(s)" in result.output + assert "section decision" in result.output + assert "Decision" in result.output diff --git a/tests/test_wp0010_migration_examples.py b/tests/test_wp0010_migration_examples.py new file mode 100644 index 0000000..1642a73 --- /dev/null +++ b/tests/test_wp0010_migration_examples.py @@ -0,0 +1,60 @@ +from pathlib import Path + +from markitect_tool.core import parse_markdown_file +from markitect_tool.explode import explode_markdown_file, implode_markdown_directory +from markitect_tool.ops import resolve_includes +from markitect_tool.processor import ProcessorContext, run_fenced_processors +from markitect_tool.reference import load_namespaces +from markitect_tool.literate import tangle_markdown + + +EXAMPLES = Path("examples/migration") + + +def test_migration_explode_example_roundtrips(tmp_path: Path): + source = EXAMPLES / "legacy-explode-source.md" + original = source.read_text(encoding="utf-8") + + explode_markdown_file(source, tmp_path / "exploded", variant="hierarchical") + result = implode_markdown_directory(tmp_path / "exploded") + + assert result.markdown == original + + +def test_migration_reference_backed_transclusion_example(): + source = EXAMPLES / "legacy-transclusion-context.md" + document = parse_markdown_file(source) + context = ProcessorContext( + root=EXAMPLES, + current_path=source, + namespaces=load_namespaces(document.frontmatter), + ) + + result = run_fenced_processors(source.read_text(encoding="utf-8"), context=context) + + assert result.valid + assert "Payment is due within 30 
days" in result.results[0].content + + +def test_migration_path_include_example(): + source = EXAMPLES / "legacy-path-include.md" + + result = resolve_includes( + source.read_text(encoding="utf-8"), + base_dir=EXAMPLES, + current_path=source, + ) + + assert "## Warranty" in result.markdown + assert "Warranty begins on the effective date" in result.markdown + + +def test_migration_literate_example_tangles(): + source = EXAMPLES / "legacy-literate.md" + + result = tangle_markdown(source.read_text(encoding="utf-8"), source_path=source) + + assert result.valid + assert result.files[0].path == "src/app.py" + assert "CONFIG" in result.files[0].content + assert "<<config>>" not in result.files[0].content diff --git a/workplans/MKTT-WP-0010-content-reference-processor-literate-workflows.md b/workplans/MKTT-WP-0010-content-reference-processor-literate-workflows.md index 132a30d..0c5cfcf 100644 --- a/workplans/MKTT-WP-0010-content-reference-processor-literate-workflows.md +++ b/workplans/MKTT-WP-0010-content-reference-processor-literate-workflows.md @@ -3,7 +3,7 @@ id: MKTT-WP-0010 type: workplan title: "Content References, Processors, and Literate Workflows" domain: markitect -status: todo +status: done owner: markitect-tool topic_slug: markitect planning_priority: P1 @@ -55,7 +55,7 @@ See `docs/content-reference-literate-workflow-research.md`. ```task id: MKTT-WP-0010-T001 -status: todo +status: done priority: high state_hub_task_id: "f70d2b9d-151b-46c6-9613-bd6bdbf164e7" ``` @@ -66,11 +66,18 @@ resolver inputs/outputs, and error cases. Output: reference model docs, examples, and tests for path, namespace, selector, and ID resolution. +Initial implementation completed with a `reference` extension package, +frontmatter namespace loading, root-bounded path resolution, existing query +selector reuse, heading/section/block fragment IDs, CLI access via +`mkt ref resolve`, reference docs, examples, and tests. 
Region/tag/fenced-block +addressing continues in P10.3; processor dependency/provenance use continues in +P10.2 and P10.5. + ## P10.2 - Add token-safe transforms and operation provenance ```task id: MKTT-WP-0010-T002 -status: todo +status: done priority: high state_hub_task_id: "e35639b7-756f-4993-8b3c-2e58b23e0eca" ``` @@ -80,11 +87,17 @@ structured operation provenance, dependency edges, source spans, and diagnostics Output: token-safe transform implementation and provenance result envelope. +Initial implementation completed with token-safe heading shifts, include +markers that stay literal inside fenced or indented code blocks, additive +`OperationProvenance` events on transform/include results, dependency edges for +resolved includes, docs, and regression tests. Rich structured diagnostics and +source maps continue through P10.3, P10.4, and P10.5. + ## P10.3 - Implement named regions and addressable block selectors ```task id: MKTT-WP-0010-T003 -status: todo +status: done priority: high state_hub_task_id: "98cafe28-a364-48f1-ae55-cb47c71d9441" ``` @@ -94,11 +107,17 @@ selection by ID/tag/line range where appropriate. Output: region parser/resolver, CLI examples, and source-snippet tests. +Initial implementation completed as reference-layer extensions: named +`mkt:region` comments, region tags, fenced-block IDs and tags from info-string +attributes, `#line:start-end` ranges, convenience ID lookup ordering, docs, +examples, and tests. Deeper source maps and processor-owned block semantics +continue in P10.5 and P10.6. + ## P10.4 - Reimplement reversible explode/implode variants ```task id: MKTT-WP-0010-T004 -status: todo +status: done priority: high state_hub_task_id: "67f77aa1-a7ee-485c-891e-6ae7ecc52067" ``` @@ -111,11 +130,16 @@ reference and processor model is stable. Output: `mkt explode`, `mkt implode`, manifest schema, roundtrip tests. 
 
+Initial implementation completed with a separate `explode` extension package,
+manifest-first flat and hierarchical variants, exact roundtrip implode,
+non-empty output protection, CLI commands, docs, and tests. Semantic variants
+remain deferred until processor and content-class semantics are stable.
+
 ## P10.5 - Define processor registry for fenced blocks
 
 ```task
 id: MKTT-WP-0010-T005
-status: todo
+status: done
 priority: high
 state_hub_task_id: "eb7cde08-8a73-4163-ac54-19a2bc7b5f88"
 ```
@@ -126,11 +150,18 @@ and return generated content/files, diagnostics, dependencies, and provenance.
 
 Output: processor registry API, deterministic built-in processors, and tests.
 
+Initial implementation completed with a deterministic `processor` extension
+package, fenced-block discovery, explicit registry, context/policy envelope,
+result files/diagnostics/dependencies/provenance, built-in identity,
+uppercase, and reference-backed include processors, CLI `mkt process`, docs,
+examples, and tests. Arbitrary code or LLM execution remains intentionally
+outside this deterministic registry floor.
+
 ## P10.6 - Implement literate weave/tangle MVP
 
 ```task
 id: MKTT-WP-0010-T006
-status: todo
+status: done
 priority: high
 state_hub_task_id: "090fcc38-758b-4414-b941-40f217eb17ca"
 ```
@@ -141,11 +172,16 @@ cross-references.
 
 Output: `mkt tangle`, `mkt weave`, chunk-reference diagnostics, examples.
 
+Initial implementation completed with a `literate` extension package, named
+fenced code chunks, `tangle` targets, noweb-style `<<chunk-id>>` expansion,
+missing/cyclic chunk diagnostics, deterministic file writing, woven chunk
+index output, CLI `mkt tangle`/`mkt weave`, docs, examples, and tests.
+
 ## P10.7 - Design content class composition and multi-inheritance
 
 ```task
 id: MKTT-WP-0010-T007
-status: todo
+status: done
 priority: medium
 state_hub_task_id: "220e6b27-2d7b-4c22-b5e8-304198ecfea8"
 ```
@@ -156,11 +192,16 @@ diagnostics.
 
 Output: architecture note, examples, and a small deterministic resolver spike.
 
+Initial implementation completed with a `content_class` extension package,
+C3-style deterministic linearization, explicit slot merge policies, conflict
+diagnostics, CLI `mkt class resolve`, docs, examples, and tests. Markdown
+instantiation and snippet injection remain deferred to later integration work.
+
 ## P10.8 - Add migration examples from markitect-main
 
 ```task
 id: MKTT-WP-0010-T008
-status: todo
+status: done
 priority: high
 state_hub_task_id: "287637d3-1997-43b2-b97d-10587d565cec"
 ```
@@ -169,3 +210,9 @@ Translate the relevant old explode/implode, transclusion, and spaces reference
 graph tests into successor-style fixtures and examples.
 
 Output: migration test inventory, example documents, and parity notes.
+
+Initial implementation completed with WP-0010 migration parity notes,
+successor-style examples for explode/implode, path include, reference-backed
+transclusion, and literate tangling, plus tests that exercise these examples.
+Legacy platform, database, infospace, rendering, and provider-specific
+behaviors remain intentionally out of scope.
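
Note on P10.7: the C3-style linearization named in that task can be sketched roughly as below. This is a minimal illustration only, not the `content_class` package's code. The function name `c3_linearize`, the `parents` mapping shape, and the `security` and `compliance` class names are assumptions made for the example. It also raises exceptions where the real resolver is described as emitting diagnostics.

```python
from typing import Dict, List, Tuple


def c3_linearize(
    name: str,
    parents: Dict[str, List[str]],
    _stack: Tuple[str, ...] = (),
) -> List[str]:
    """Return a C3 linearization of `name`, ordered leaf first.

    `parents` maps each class name to its declared `extends` list
    (hypothetical shape, chosen for this sketch).
    """
    if name not in parents:
        raise KeyError(f"unknown class: {name}")
    if name in _stack:
        raise ValueError(f"inheritance cycle through: {name}")

    bases = parents[name]
    # Merge each parent's linearization plus the declared parent order itself.
    seqs = [c3_linearize(b, parents, _stack + (name,)) for b in bases]
    seqs.append(list(bases))

    result = [name]
    while any(seqs):
        for seq in seqs:
            if not seq:
                continue
            head = seq[0]
            # A candidate is valid only if it is in no other sequence's tail.
            if not any(head in other[1:] for other in seqs):
                break
        else:
            raise ValueError(f"inconsistent precedence for: {name}")
        result.append(head)
        seqs = [[c for c in seq if c != head] for seq in seqs]
    return result


# Hypothetical class graph: a diamond over `base-prd`.
example_parents = {
    "base-prd": [],
    "security": ["base-prd"],
    "compliance": ["base-prd"],
    "enterprise-prd": ["security", "compliance"],
}
print(c3_linearize("enterprise-prd", example_parents))
# -> ['enterprise-prd', 'security', 'compliance', 'base-prd']
```

Reversing the returned list gives a base-to-leaf order, which is the direction a slot merge would walk if it applies parents before children.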