| id | type | status | created | deciders |
|---|---|---|---|---|
| ADR-002 | adr | accepted | 2026-03-16 | |
ADR-002: python-docx as DOCX Conversion Engine
Status
Accepted
Context
markidocx must produce and consume .docx (Open XML) files from Python. The build
pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown.
Both directions must be controlled programmatically without shelling out to Office
applications or external services.
The following options were evaluated:
| Option | Direction | Notes |
|---|---|---|
| python-docx | read + write | Pure Python, direct Open XML paragraph/run model |
| pandoc (subprocess) | read + write | Requires external binary; limited structural control |
| mammoth | read only | Focused on HTML output; no write support |
| docx2python | read only | Good for extracting raw content; no write support |
| LibreOffice (subprocess) | read + write | Heavy dependency; unreliable in headless environments |
The primary requirements were:
- Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
- Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
- No external process dependency (no pandoc, no LibreOffice)
- Pure Python: installable via `pip install` with no system-level setup
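The evaluation can be read as a simple filter over the options table. The sketch below is illustrative only: the `Option` type and `meets_requirements` helper are hypothetical names, the rows are transcribed from the table above, and the "programmatic control" requirement is not modelled because it is a qualitative judgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    reads: bool
    writes: bool
    external_process: bool  # needs a separate binary or Office install at runtime


# Rows transcribed from the evaluation table (hypothetical encoding).
OPTIONS = [
    Option("python-docx", reads=True, writes=True, external_process=False),
    Option("pandoc", reads=True, writes=True, external_process=True),
    Option("mammoth", reads=True, writes=False, external_process=False),
    Option("docx2python", reads=True, writes=False, external_process=False),
    Option("LibreOffice", reads=True, writes=True, external_process=True),
]


def meets_requirements(option: Option) -> bool:
    """Both directions in a single library, with no external process."""
    return option.reads and option.writes and not option.external_process


viable = [option.name for option in OPTIONS if meets_requirements(option)]
print(viable)
```

Under this encoding, python-docx is the only option that passes both hard requirements.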
Decision
Use python-docx for both the build (write) and import (read) directions.
python-docx provides:
- Direct access to the Open XML paragraph/run model: each `Paragraph` maps cleanly to a Markdown block element; each `Run` maps to inline formatting
- Style name assignment (`Heading 1`, `Normal`, `List Bullet`, etc.) enabling template-driven presentation
- Footnote, table, and image support within the standard API surface
- Bookmark creation and hyperlink insertion (used for LEVEL3 cross-references)
- Stable, well-documented API; actively maintained
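As a rough illustration of the block/inline mapping on the import side, the sketch below renders a paragraph to Markdown using only the attributes python-docx exposes on its objects (`Paragraph.style.name`, `Paragraph.runs`, `Run.text`, `Run.bold`, `Run.italic`). The `STYLE_PREFIX` table and the function names are assumptions made for this example, not markidocx's actual code, and the stand-in objects exist only so the sketch runs without a `.docx` file.

```python
from types import SimpleNamespace

# Hypothetical mapping from Word style names to Markdown block prefixes.
STYLE_PREFIX = {
    "Heading 1": "# ",
    "Heading 2": "## ",
    "List Bullet": "- ",
    "Normal": "",
}


def run_to_markdown(run):
    """Render one run as inline Markdown from its bold/italic flags."""
    text = run.text
    if not text:
        return ""
    if run.bold:  # python-docx uses True / False / None (inherit)
        text = f"**{text}**"
    if run.italic:
        text = f"*{text}*"
    return text


def paragraph_to_markdown(paragraph):
    """Render one paragraph as a Markdown block chosen by its style name."""
    prefix = STYLE_PREFIX.get(paragraph.style.name, "")
    return prefix + "".join(run_to_markdown(run) for run in paragraph.runs)


def make_paragraph(style_name, *runs):
    """Build a stand-in with the same attributes as a python-docx Paragraph."""
    return SimpleNamespace(
        style=SimpleNamespace(name=style_name),
        runs=[SimpleNamespace(text=t, bold=b, italic=i) for t, b, i in runs],
    )


heading = make_paragraph("Heading 1", ("Overview", None, None))
body = make_paragraph("Normal", ("Plain and ", None, None), ("bold", True, None))
print(paragraph_to_markdown(heading))
print(paragraph_to_markdown(body))
```

Because the functions only read attributes, the same code should accept the paragraphs of a document loaded with python-docx's `Document(path).paragraphs`.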
Consequences
Positive:
- Single dependency for both conversion directions
- No subprocess execution; fully in-process
- Paragraph/run model maps naturally to Markdown's block/inline structure
- Template `.docx` files control presentation without touching content
Negative / accepted limitations:
- python-docx exposes only a subset of the Open XML specification. Complex Word
features are out of scope by design:
  - Track changes (revision marks): not parseable
  - SmartArt, charts, embedded objects: ignored during import
  - Advanced numbering schemes beyond simple ordered/unordered lists: not supported
  - Content controls, form fields: not supported
- python-docx's footnote write API is limited; markidocx uses a compatibility shim for footnote construction (documented in `builder.py`)
- Modifying an existing DOCX in-place is not supported: markidocx always builds a fresh DOCX and never mutates the input during import
Out of scope by design: The constraints above align with markidocx's defined semantic envelope (FC-01). The system only claims preservation for constructs within supported feature levels.
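One way an importer could make the envelope visible is to scan the main document part for elements that signal unsupported constructs and report them. This is a sketch under stated assumptions: the `UNSUPPORTED` table and `find_unsupported` are hypothetical and not part of markidocx, and only a few WordprocessingML elements (`w:ins`, `w:del`, `w:sdt`, `w:object`) are covered.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping from WordprocessingML elements to the limitation they signal.
UNSUPPORTED = {
    f"{{{W_NS}}}ins": "track changes (insertion)",
    f"{{{W_NS}}}del": "track changes (deletion)",
    f"{{{W_NS}}}sdt": "content control",
    f"{{{W_NS}}}object": "embedded object",
}


def find_unsupported(document_xml: bytes) -> list[str]:
    """Return sorted labels of out-of-envelope constructs found in the XML."""
    root = ET.fromstring(document_xml)
    found = {UNSUPPORTED[el.tag] for el in root.iter() if el.tag in UNSUPPORTED}
    return sorted(found)


# Minimal stand-in for the contents of word/document.xml.
SAMPLE = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Kept text</w:t></w:r></w:p>
    <w:p><w:ins w:id="1" w:author="A"><w:r><w:t>Added</w:t></w:r></w:ins></w:p>
    <w:sdt><w:sdtContent><w:p/></w:sdtContent></w:sdt>
  </w:body>
</w:document>""".encode("utf-8")

print(find_unsupported(SAMPLE))
```

A report like this would let the import step warn about content it will drop rather than silently losing it.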
Alternatives Rejected
pandoc — excellent general-purpose converter, but shelling out introduces a hard runtime dependency, reduces structural control, and makes it difficult to embed source-boundary markers needed for multi-file redistribution.
mammoth — high-quality Word → HTML converter; read-only, so unsuitable for the build direction.
docx2python — useful for raw content extraction; no write support.
LibreOffice — handles the full Open XML spec, but requires a headless Office installation, is unreliable in CI, and introduces significant operational complexity.