feat: WP-0004 T01-T04 — stable corpus, ADRs, regression test

- corpus/markidocx-docs/manifest.yaml: specs as live markidocx project (FR-1101)
- corpus/markidocx-docs/known-drift.md: documented structural drift
- workflows.py: release-regression accepts manifest path; emits corpus_id (FR-1109)
- tests/regression/test_corpus_regression.py: corpus regression suite (FR-1102–1110)
- architecture/ADR-002: python-docx as conversion engine
- architecture/ADR-003: manifest YAML schema
- workplans/MRKD-WP-0004: T01–T04 done; T05 blocked (SBOM path mapping needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 17:48:33 +00:00
parent 039420caee
commit ebc5eaee77
8 changed files with 375 additions and 6 deletions
--- a/architecture/ADR-002-python-docx-as-conversion-engine.md
+++ b/architecture/ADR-002-python-docx-as-conversion-engine.md
@@ -0,0 +1,88 @@
+---
+id: ADR-002
+type: adr
+status: accepted
+created: 2026-03-16
+deciders: [Bernd, Custodian]
+---
+
+# ADR-002: python-docx as DOCX Conversion Engine
+
+## Status
+
+Accepted
+
+## Context
+
+markidocx must produce and consume `.docx` (Open XML) files from Python. The build
+pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown.
+Both directions must be controlled programmatically without shelling out to Office
+applications or external services.
+
+The following options were evaluated:
+
+| Option | Direction | Notes |
+|--------|-----------|-------|
+| **python-docx** | read + write | Pure Python, direct Open XML paragraph/run model |
+| **pandoc** (subprocess) | read + write | Requires external binary; limited structural control |
+| **mammoth** | read only | Focused on HTML output; no write support |
+| **docx2python** | read only | Good for extracting raw content; no write support |
+| **LibreOffice** (subprocess) | read + write | Heavy dependency; unreliable in headless environments |
+
+The primary requirements were:
+
+1. Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
+2. Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
+3. No external process dependency (no pandoc, no LibreOffice)
+4. Pure Python — installable via `pip install` with no system-level setup
+
+## Decision
+
+Use **python-docx** for both the build (write) and import (read) directions.
+
+python-docx provides:
+- Direct access to the Open XML paragraph / run model — each `Paragraph` maps cleanly
+  to a Markdown block element; each `Run` maps to inline formatting
+- Style name assignment (`Heading 1`, `Normal`, `List Bullet`, etc.) enabling
+  template-driven presentation
+- Table and image support within the standard API surface (footnotes via a shim; see Consequences)
+- Bookmark creation and hyperlink insertion (used for LEVEL3 cross-references)
+- Stable, well-documented API; actively maintained
+
+## Consequences
+
+**Positive:**
+- Single dependency for both conversion directions
+- No subprocess execution; fully in-process
+- Paragraph/run model maps naturally to Markdown's block/inline structure
+- Template `.docx` files control presentation without touching content
+
+**Negative / accepted limitations:**
+- python-docx exposes only a subset of the Open XML specification. Complex Word
+  features are out of scope by design:
+  - Track changes (revision marks) — not parseable
+  - SmartArt, charts, embedded objects — ignored during import
+  - Advanced numbering schemes beyond simple ordered/unordered lists — not supported
+  - Content controls, form fields — not supported
+- python-docx's footnote write API is limited; markidocx uses a compatibility shim
+  for footnote construction (documented in `builder.py`)
+- In-place modification of an existing DOCX is not supported by markidocx: it always
+  builds a fresh DOCX and never mutates the input during import
+
+**Out of scope by design:**
+The constraints above align with markidocx's defined semantic envelope (FC-01).
+The system only claims preservation for constructs within supported feature levels.
+
+## Alternatives Rejected
+
+**pandoc** — excellent general-purpose converter, but shelling out introduces a
+hard runtime dependency, reduces structural control, and makes it difficult to
+embed source-boundary markers needed for multi-file redistribution.
+
+**mammoth** — high-quality Word → HTML converter; read-only, so unsuitable for
+the build direction.
+
+**docx2python** — useful for raw content extraction; no write support.
+
+**LibreOffice** — handles the full Open XML spec, but requires a headless Office
+installation, is unreliable in CI, and introduces significant operational complexity.
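
The Decision section's claim that each `Paragraph` maps to a Markdown block and each `Run` to inline formatting can be illustrated with a minimal sketch. This is not markidocx's actual import code: the `BLOCK_PREFIX` table, the helper names, and the stand-in dataclasses are all hypothetical. The stand-ins only mirror the attribute names that python-docx objects expose (`paragraph.style.name`, `paragraph.runs`, `run.text`, `run.bold`, `run.italic`), so the example runs without the library installed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Stand-ins that mirror python-docx attribute names. Hypothetical helpers
# for illustration only; they are not part of markidocx or python-docx.
@dataclass
class Style:
    name: str


@dataclass
class Run:
    text: str
    bold: Optional[bool] = None
    italic: Optional[bool] = None


@dataclass
class Paragraph:
    style: Style
    runs: List[Run] = field(default_factory=list)


# Assumed style-name -> Markdown block prefix table (illustrative subset).
BLOCK_PREFIX = {
    "Heading 1": "# ",
    "Heading 2": "## ",
    "List Bullet": "- ",
    "Normal": "",
}


def run_to_markdown(run) -> str:
    """Map one run's inline formatting to Markdown emphasis markers."""
    text = run.text
    if not text:
        return ""
    if run.bold and run.italic:
        return f"***{text}***"
    if run.bold:
        return f"**{text}**"
    if run.italic:
        return f"*{text}*"
    return text


def paragraph_to_markdown(paragraph) -> str:
    """Map one paragraph to a Markdown block, keyed on its style name."""
    # Unknown styles fall back to a plain paragraph (no prefix).
    prefix = BLOCK_PREFIX.get(paragraph.style.name, "")
    return prefix + "".join(run_to_markdown(r) for r in paragraph.runs)


paragraphs = [
    Paragraph(Style("Heading 1"), [Run("Overview")]),
    Paragraph(
        Style("Normal"),
        [
            Run("Plain, "),
            Run("bold", bold=True),
            Run(" and "),
            Run("italic", italic=True),
            Run("."),
        ],
    ),
    Paragraph(Style("List Bullet"), [Run("First item")]),
]

# Blocks are separated by a blank line, as in ordinary Markdown.
markdown = "\n\n".join(paragraph_to_markdown(p) for p in paragraphs)
print(markdown)
```

Because the two functions rely only on attribute access, the paragraphs of a document opened with python-docx could in principle be passed to `paragraph_to_markdown` unchanged. Tables, footnotes, and bookmarks are deliberately left out of this sketch.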