feat: WP-0004 T01-T04 — stable corpus, ADRs, regression test

- corpus/markidocx-docs/manifest.yaml: specs as live markidocx project (FR-1101)
- corpus/markidocx-docs/known-drift.md: documented structural drift
- workflows.py: release-regression accepts manifest path; emits corpus_id (FR-1109)
- tests/regression/test_corpus_regression.py: corpus regression suite (FR-1102–1110)
- architecture/ADR-002: python-docx as conversion engine
- architecture/ADR-003: manifest YAML schema
- workplans/MRKD-WP-0004: T01–T04 done; T05 blocked (SBOM path mapping needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 17:48:33 +00:00
parent 039420caee
commit ebc5eaee77
8 changed files with 375 additions and 6 deletions
--- a/architecture/ADR-002-python-docx-as-conversion-engine.md
+++ b/architecture/ADR-002-python-docx-as-conversion-engine.md
@@ -0,0 +1,88 @@
+---
+id: ADR-002
+type: adr
+status: accepted
+created: 2026-03-16
+deciders: [Bernd, Custodian]
+---
+
+# ADR-002: python-docx as DOCX Conversion Engine
+
+## Status
+
+Accepted
+
+## Context
+
+markidocx must produce and consume `.docx` (Open XML) files from Python. The build
+pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown.
+Both directions must be controlled programmatically without shelling out to Office
+applications or external services.
+
+The following options were evaluated:
+
+| Option | Direction | Notes |
+|--------|-----------|-------|
+| **python-docx** | read + write | Pure Python, direct Open XML paragraph/run model |
+| **pandoc** (subprocess) | read + write | Requires external binary; limited structural control |
+| **mammoth** | read only | Focused on HTML output; no write support |
+| **docx2python** | read only | Good for extracting raw content; no write support |
+| **LibreOffice** (subprocess) | read + write | Heavy dependency; unreliable in headless environments |
+
+The primary requirements were:
+
+1. Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
+2. Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
+3. No external process dependency (no pandoc, no LibreOffice)
+4. Pure Python — installable via `pip install` with no system-level setup
+
+## Decision
+
+Use **python-docx** for both the build (write) and import (read) directions.
+
+python-docx provides:
+- Direct access to the Open XML paragraph / run model — each `Paragraph` maps cleanly
+  to a Markdown block element; each `Run` maps to inline formatting
+- Style name assignment (`Heading 1`, `Normal`, `List Bullet`, etc.) enabling
+  template-driven presentation
+- Table and image support in the standard API; footnotes via a compatibility shim
+- Low-level Open XML access for bookmarks and hyperlinks (used for LEVEL3 cross-references)
+- Stable, well-documented API; actively maintained
+
+## Consequences
+
+**Positive:**
+- Single dependency for both conversion directions
+- No subprocess execution; fully in-process
+- Paragraph/run model maps naturally to Markdown's block/inline structure
+- Template `.docx` files control presentation without touching content
+
+**Negative / accepted limitations:**
+- python-docx exposes only a subset of the Open XML specification. Complex Word
+  features are out of scope by design:
+  - Track changes (revision marks) — not parseable
+  - SmartArt, charts, embedded objects — ignored during import
+  - Advanced numbering schemes beyond simple ordered/unordered lists — not supported
+  - Content controls, form fields — not supported
+- python-docx's footnote write API is limited; markidocx uses a compatibility shim
+  for footnote construction (documented in `builder.py`)
+- markidocx does not modify an existing DOCX in place: every build produces
+  a fresh DOCX, and import never mutates its input
+
+**Out of scope by design:**
+The constraints above align with markidocx's defined semantic envelope (FC-01).
+The system only claims preservation for constructs within supported feature levels.
+
+## Alternatives Rejected
+
+**pandoc** — excellent general-purpose converter, but shelling out introduces a
+hard runtime dependency, reduces structural control, and makes it difficult to
+embed source-boundary markers needed for multi-file redistribution.
+
+**mammoth** — high-quality Word → HTML converter; read-only, so unsuitable for
+the build direction.
+
+**docx2python** — useful for raw content extraction; no write support.
+
+**LibreOffice** — handles the full Open XML spec, but requires a headless Office
+installation, is unreliable in CI, and introduces significant operational complexity.
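
As an illustration of the paragraph/run mapping that ADR-002's Decision section describes, here is a minimal sketch. It is not taken from markidocx's `builder.py`: `RunSpec`, `split_runs`, and `block_style` are hypothetical names, a run is modeled as a plain dataclass rather than a python-docx object, and only `**bold**`, `*italic*`, ATX headings, and `-`/`*` bullets are handled.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSpec:
    """Hypothetical stand-in for one Word run: text plus inline flags."""
    text: str
    bold: bool = False
    italic: bool = False


# Tries **bold** first, then *italic*; everything else is plain text.
_INLINE = re.compile(r"\*\*(?P<bold>.+?)\*\*|\*(?P<italic>.+?)\*")


def split_runs(line: str) -> list[RunSpec]:
    """Split one line of Markdown text into a sequence of runs."""
    runs: list[RunSpec] = []
    pos = 0
    for match in _INLINE.finditer(line):
        if match.start() > pos:
            # Plain text between two formatted spans.
            runs.append(RunSpec(line[pos:match.start()]))
        if match.group("bold") is not None:
            runs.append(RunSpec(match.group("bold"), bold=True))
        else:
            runs.append(RunSpec(match.group("italic"), italic=True))
        pos = match.end()
    if pos < len(line):
        runs.append(RunSpec(line[pos:]))
    return runs


def block_style(line: str) -> tuple[str, str]:
    """Map a Markdown block line to (Word style name, remaining text)."""
    heading = re.match(r"(#{1,6})\s+(.*)", line)
    if heading:
        return f"Heading {len(heading.group(1))}", heading.group(2)
    if line.startswith(("- ", "* ")):
        return "List Bullet", line[2:]
    return "Normal", line
```

In a python-docx based builder, each `(style, text)` pair would presumably become a `document.add_paragraph(style=style)` call, and each `RunSpec` a `paragraph.add_run(text)` whose `bold` and `italic` attributes are set from the flags.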
--- a/architecture/ADR-003-manifest-yaml-schema.md
+++ b/architecture/ADR-003-manifest-yaml-schema.md
@@ -0,0 +1,103 @@
+---
+id: ADR-003
+type: adr
+status: accepted
+created: 2026-03-16
+deciders: [Bernd, Custodian]
+---
+
+# ADR-003: Manifest YAML Schema
+
+## Status
+
+Accepted
+
+## Context
+
+markidocx needs a project definition format that:
+
+1. Describes which Markdown source files form a document project
+2. Declares the feature level (`level1` / `level3`) and document family (`article`,
+   `book`, `website`)
+3. Specifies output location and document metadata
+4. Is human-writable and version-controllable alongside source files
+5. Is parseable by the system without a schema registry or external validator
+
+The format must support single-file and multi-file projects, and be extensible
+enough for future additions (e.g. bibliography sources, asset directories) without
+breaking existing manifests.
+
+## Decision
+
+Use **YAML** with a fixed four-section top-level structure:
+
+```yaml
+project:
+  name: <string>
+  feature_level: level1 | level3
+  family: article | book | website
+
+sources:
+  - path: <relative path to .md file>
+  - path: <relative path to .md file>
+
+output:
+  dir: <relative path to output directory>
+
+metadata:
+  title: <string>
+  author: <string>
+  date: <string>
+```
+
+All paths are resolved relative to the manifest file's location. The `metadata`
+section and the individual `sources` entries may gain additional keys in future versions.
+
+Validation is performed on load by `manifest.py` using dataclass coercion:
+`load_manifest(path)` raises `ManifestError` on any schema violation (missing
+required fields, unknown feature levels, unresolvable source paths).
+
+## Current Field Definitions
+
+| Field | Type | Required | Default | Notes |
+|-------|------|----------|---------|-------|
+| `project.name` | string | yes | — | Project identifier; used in output filenames |
+| `project.feature_level` | enum | yes | — | `level1` or `level3` |
+| `project.family` | enum | yes | — | `article`, `book`, or `website` |
+| `sources[].path` | string | yes | — | Relative path; resolved against manifest dir |
+| `output.dir` | string | no | `./dist` | Relative path for generated artefacts |
+| `metadata.title` | string | no | — | Propagated to DOCX document properties |
+| `metadata.author` | string | no | — | Propagated to DOCX document properties |
+| `metadata.date` | string | no | — | Propagated to DOCX document properties |
+
+## Consequences
+
+**Positive:**
+- Human-readable and diff-friendly; natural fit for version-controlled documentation
+  repositories
+- No external schema validation library needed — `manifest.py` owns validation
+- Simple enough for a first-time user to write by hand
+- Relative paths keep manifests portable across machines
+
+**Negative / accepted limitations:**
+- Evolving the schema requires coordination between the manifest file format and
+  `manifest.py` — there is no formal schema version field
+- No auto-completion support in editors without a JSON Schema / YAML Language Server
+  configuration (out of scope for v0.1)
+- YAML's implicit type coercion can surprise users (e.g. bare `no` parsed as `False`);
+  `load_manifest` validates all fields explicitly to catch these cases
+
+## Alternatives Rejected
+
+**TOML** — good alternative, but YAML is more common in documentation tooling
+(MkDocs, GitHub Actions, Kubernetes) and more familiar to the target audience.
+
+**JSON** — less writable for humans; comments not supported; trailing commas
+disallowed; less pleasant for multi-line string values.
+
+**Database / registry** — over-engineered for the single-project use case; would
+require a running service just to define a document project.
+
+**Pydantic / JSON Schema** — considered for validation, but adds a dependency
+for functionality that a handful of explicit checks in `load_manifest()` already
+covers cleanly.
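
The validation that ADR-003 assigns to `load_manifest()` could look roughly like the sketch below. It is an assumption-laden illustration, not the contents of `manifest.py`: `Manifest`, `build_manifest`, and `_require_str` are hypothetical names, and the function takes an already-parsed dictionary so that the example needs no YAML library. Only the field names, enum values, the `./dist` default, and `ManifestError` come from the ADR.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import Path

FEATURE_LEVELS = {"level1", "level3"}
FAMILIES = {"article", "book", "website"}


class ManifestError(ValueError):
    """Raised on any schema violation."""


@dataclass(frozen=True)
class Manifest:
    # Hypothetical shape; the real dataclass in manifest.py is not shown in the ADR.
    name: str
    feature_level: str
    family: str
    sources: tuple[Path, ...]
    output_dir: Path
    metadata: dict[str, str] = field(default_factory=dict)


def _require_str(section: dict, key: str, where: str) -> str:
    value = section.get(key)
    # The isinstance check also rejects YAML coercions such as a bare `no`
    # that the parser has already turned into False.
    if not isinstance(value, str) or not value:
        raise ManifestError(f"{where}.{key} must be a non-empty string, got {value!r}")
    return value


def build_manifest(raw: dict, base_dir: Path) -> Manifest:
    """Validate a parsed manifest mapping; paths resolve against base_dir."""
    project = raw.get("project")
    if not isinstance(project, dict):
        raise ManifestError("missing 'project' section")
    name = _require_str(project, "name", "project")
    level = _require_str(project, "feature_level", "project")
    if level not in FEATURE_LEVELS:
        raise ManifestError(f"unknown feature level: {level!r}")
    family = _require_str(project, "family", "project")
    if family not in FAMILIES:
        raise ManifestError(f"unknown family: {family!r}")

    entries = raw.get("sources")
    if not isinstance(entries, list) or not entries:
        raise ManifestError("'sources' must be a non-empty list")
    sources = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ManifestError(f"sources[{index}] must be a mapping")
        rel = _require_str(entry, "path", f"sources[{index}]")
        path = (base_dir / rel).resolve()
        if not path.is_file():
            raise ManifestError(f"unresolvable source path: {rel}")
        sources.append(path)

    output = raw.get("output") or {}
    out_dir = output.get("dir", "./dist")  # default from the field table
    if not isinstance(out_dir, str):
        raise ManifestError("output.dir must be a string")

    metadata = raw.get("metadata") or {}
    for key, value in metadata.items():
        if not isinstance(value, str):
            raise ManifestError(f"metadata.{key} must be a string, got {value!r}")

    return Manifest(name, level, family, tuple(sources),
                    (base_dir / out_dir).resolve(), dict(metadata))
```

If the real loader parses with PyYAML (an assumption), a thin `load_manifest(path)` wrapper would pass `yaml.safe_load(path.read_text())` and `path.parent` to a function like this one.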