marki-docx/architecture/ADR-002-python-docx-as-conversion-engine.md
Bernd Worsch ebc5eaee77 feat: WP-0004 T01-T04 — stable corpus, ADRs, regression test
- corpus/markidocx-docs/manifest.yaml: specs as live markidocx project (FR-1101)
- corpus/markidocx-docs/known-drift.md: documented structural drift
- workflows.py: release-regression accepts manifest path; emits corpus_id (FR-1109)
- tests/regression/test_corpus_regression.py: corpus regression suite (FR-1102–1110)
- architecture/ADR-002: python-docx as conversion engine
- architecture/ADR-003: manifest YAML schema
- workplans/MRKD-WP-0004: T01–T04 done; T05 blocked (SBOM path mapping needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 17:48:33 +00:00


id: ADR-002
type: adr
status: accepted
created: 2026-03-16
deciders: Bernd, Custodian

ADR-002: python-docx as DOCX Conversion Engine

Status

Accepted

Context

markidocx must produce and consume .docx (Open XML) files from Python. The build pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown. Both directions must be controlled programmatically without shelling out to Office applications or external services.

The following options were evaluated:

| Option | Direction | Notes |
|---|---|---|
| python-docx | read + write | Pure Python, direct Open XML paragraph/run model |
| pandoc (subprocess) | read + write | Requires external binary; limited structural control |
| mammoth | read only | Focused on HTML output; no write support |
| docx2python | read only | Good for extracting raw content; no write support |
| LibreOffice (subprocess) | read + write | Heavy dependency; unreliable in headless environments |

The primary requirements were:

  1. Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
  2. Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
  3. No external process dependency (no pandoc, no LibreOffice)
  4. Pure Python — installable via pip install with no system-level setup

Decision

Use python-docx for both the build (write) and import (read) directions.

python-docx provides:

  • Direct access to the Open XML paragraph / run model — each Paragraph maps cleanly to a Markdown block element; each Run maps to inline formatting
  • Style name assignment (Heading 1, Normal, List Bullet, etc.) enabling template-driven presentation
  • Footnote, table, and image support within the standard API surface
  • Bookmark creation and hyperlink insertion (used for LEVEL3 cross-references)
  • Stable, well-documented API; actively maintained
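
The write direction described above can be sketched as follows. This is a minimal illustration, not the markidocx builder: `split_inline` and `build_docx` are hypothetical names, the inline tokenizer handles only non-nested `**bold**` and `*italic*`, and the style names assume python-docx's default template.

```python
import io
import re

# Hypothetical inline tokenizer: only **bold** and *italic*, no nesting.
_INLINE = re.compile(r"\*\*(.+?)\*\*|\*(.+?)\*")


def split_inline(text):
    """Split Markdown inline text into (text, bold, italic) segments."""
    segments, pos = [], 0
    for m in _INLINE.finditer(text):
        if m.start() > pos:
            segments.append((text[pos:m.start()], False, False))
        if m.group(1) is not None:
            segments.append((m.group(1), True, False))
        else:
            segments.append((m.group(2), False, True))
        pos = m.end()
    if pos < len(text):
        segments.append((text[pos:], False, False))
    return segments


def build_docx(blocks):
    """Write (style_name, markdown_text) blocks to a DOCX and return its bytes."""
    # python-docx is imported lazily so the pure helper above stays usable
    # in environments where the package is not installed.
    from docx import Document

    doc = Document()  # default template provides "Heading 1", "Normal", "List Bullet"
    for style_name, text in blocks:
        para = doc.add_paragraph(style=style_name)  # one Paragraph per block element
        for chunk, bold, italic in split_inline(text):
            run = para.add_run(chunk)  # one Run per inline span
            run.bold = bold or None  # None means "inherit from the style"
            run.italic = italic or None
    buf = io.BytesIO()
    doc.save(buf)
    return buf.getvalue()
```

A call such as `build_docx([("Heading 1", "Title"), ("Normal", "Plain **bold** text")])` returns the bytes of a DOCX whose presentation is determined entirely by the style names, which is what allows a template to restyle the output without touching content.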

Consequences

Positive:

  • Single dependency for both conversion directions
  • No subprocess execution; fully in-process
  • Paragraph/run model maps naturally to Markdown's block/inline structure
  • Template .docx files control presentation without touching content
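
The block/inline mapping can also be illustrated in the read direction. The sketch below assumes the default English style names ("Heading 1" to "Heading 6", "List Bullet", "List Number"); `block_prefix`, `render_run`, and `docx_to_markdown` are hypothetical helpers, not the markidocx importer, and tables, footnotes, and nested lists are deliberately left out.

```python
import re


def block_prefix(style_name):
    """Map a paragraph style name to a Markdown block prefix."""
    m = re.fullmatch(r"Heading ([1-6])", style_name)
    if m:
        return "#" * int(m.group(1)) + " "
    return {"List Bullet": "- ", "List Number": "1. "}.get(style_name, "")


def render_run(text, bold, italic):
    """Wrap run text in Markdown emphasis markers.

    python-docx reports bold/italic as True, False, or None (inherited);
    this sketch treats anything other than True as "not set on the run".
    """
    if not text:
        return ""
    marker = ("**" if bold is True else "") + ("*" if italic is True else "")
    return marker + text + marker[::-1]


def docx_to_markdown(source):
    """Convert the body paragraphs of a DOCX (path or file-like) to Markdown."""
    from docx import Document  # python-docx, imported lazily for this sketch

    doc = Document(source)  # the input is only read, never saved back
    lines = []
    for para in doc.paragraphs:
        inline = "".join(render_run(r.text, r.bold, r.italic) for r in para.runs)
        lines.append(block_prefix(para.style.name) + inline)
    return "\n\n".join(lines)
```

Because headings carry their weight through the style rather than through run-level formatting, a "Heading 1" paragraph comes back as `# Title` and not as `# **Title**`.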

Negative / accepted limitations:

  • python-docx exposes only a subset of the Open XML specification. Complex Word features are out of scope by design:
    • Track changes (revision marks) — not parseable
    • SmartArt, charts, embedded objects — ignored during import
    • Advanced numbering schemes beyond simple ordered/unordered lists — not supported
    • Content controls, form fields — not supported
  • python-docx's footnote write API is limited; markidocx uses a compatibility shim for footnote construction (documented in builder.py)
  • Modifying an existing DOCX in place is not supported: markidocx always builds a fresh DOCX and never mutates the input during import

Out of scope by design: The constraints above align with markidocx's defined semantic envelope (FC-01). The system only claims preservation for constructs within supported feature levels.
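
One way an importer could surface constructs outside the supported envelope, rather than dropping them silently, is to scan the raw WordprocessingML before conversion. The following is a standard-library sketch under stated assumptions: `find_unsupported` and the `UNSUPPORTED` mapping are hypothetical, the element list is illustrative rather than exhaustive, and nothing here is taken from the markidocx code base.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping: WordprocessingML element name -> report label.
UNSUPPORTED = {
    "ins": "tracked insertion",
    "del": "tracked deletion",
    "sdt": "content control",
    "ffData": "form field",
    "object": "embedded object",
}


def find_unsupported(docx_bytes):
    """Return sorted labels of out-of-scope constructs in word/document.xml.

    The archive is opened read-only from memory, so the input is never mutated.
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    prefix = "{" + W_NS + "}"
    found = set()
    for el in root.iter():
        if el.tag.startswith(prefix):
            local = el.tag[len(prefix):]
            if local in UNSUPPORTED:
                found.add(UNSUPPORTED[local])
    return sorted(found)
```

An empty result means none of the listed constructs were found in the main document part; headers, footers, and other parts are not inspected by this sketch.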

Alternatives Rejected

pandoc — excellent general-purpose converter, but shelling out introduces a hard runtime dependency, reduces structural control, and makes it difficult to embed source-boundary markers needed for multi-file redistribution.

mammoth — high-quality Word → HTML converter; read-only, so unsuitable for the build direction.

docx2python — useful for raw content extraction; no write support.

LibreOffice — handles the full Open XML spec, but requires a headless Office installation, is unreliable in CI, and introduces significant operational complexity.