---
id: ADR-002
type: adr
status: accepted
created: 2026-03-16
deciders: [Bernd, Custodian]
---

# ADR-002: python-docx as DOCX Conversion Engine

## Status

Accepted

## Context

markidocx must produce and consume `.docx` (Open XML) files from Python. The build pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown. Both directions must be controlled programmatically without shelling out to Office applications or external services.

The following options were evaluated:

| Option | Direction | Notes |
|--------|-----------|-------|
| **python-docx** | read + write | Pure Python, direct Open XML paragraph/run model |
| **pandoc** (subprocess) | read + write | Requires external binary; limited structural control |
| **mammoth** | read only | Focused on HTML output; no write support |
| **docx2python** | read only | Good for extracting raw content; no write support |
| **LibreOffice** (subprocess) | read + write | Heavy dependency; unreliable in headless environments |

The primary requirements were:

1. Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
2. Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
3. No external process dependency (no pandoc, no LibreOffice)
4. Pure Python: installable via `pip install` with no system-level setup

## Decision

Use **python-docx** for both the build (write) and import (read) directions.

python-docx provides:

- Direct access to the Open XML paragraph / run model: each `Paragraph` maps cleanly to a Markdown block element; each `Run` maps to inline formatting
- Style name assignment (`Heading 1`, `Normal`, `List Bullet`, etc.) enabling template-driven presentation
- Footnote, table, and image support within the standard API surface
- Bookmark creation and hyperlink insertion (used for LEVEL3 cross-references)
- Stable, well-documented API; actively maintained

## Consequences

**Positive:**

- Single dependency for both conversion directions
- No subprocess execution; fully in-process
- Paragraph/run model maps naturally to Markdown's block/inline structure
- Template `.docx` files control presentation without touching content

**Negative / accepted limitations:**

- python-docx exposes only a subset of the Open XML specification. Complex Word features are out of scope by design:
  - Track changes (revision marks): not parseable
  - SmartArt, charts, embedded objects: ignored during import
  - Advanced numbering schemes beyond simple ordered/unordered lists: not supported
  - Content controls, form fields: not supported
- python-docx's footnote write API is limited; markidocx uses a compatibility shim for footnote construction (documented in `builder.py`)
- Modifying an existing DOCX in-place is not supported; markidocx always builds a fresh DOCX and never mutates the input during import

**Out of scope by design:**

The constraints above align with markidocx's defined semantic envelope (FC-01). The system only claims preservation for constructs within supported feature levels.

## Alternatives Rejected

**pandoc**: excellent general-purpose converter, but shelling out introduces a hard runtime dependency, reduces structural control, and makes it difficult to embed source-boundary markers needed for multi-file redistribution.

**mammoth**: high-quality Word → HTML converter; read-only, so unsuitable for the build direction.

**docx2python**: useful for raw content extraction; no write support.

**LibreOffice**: handles the full Open XML spec, but requires a headless Office installation, is unreliable in CI, and introduces significant operational complexity.