marki-docx/architecture/ADR-002-python-docx-as-conversion-engine.md

---
id: ADR-002
type: adr
status: accepted
created: 2026-03-16
deciders: [Bernd, Custodian]
---

# ADR-002: python-docx as DOCX Conversion Engine

## Status

Accepted

## Context

markidocx must produce and consume `.docx` (Open XML) files from Python. The build
pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown.
Both directions must be controlled programmatically without shelling out to Office
applications or external services.

The following options were evaluated:

| Option | Direction | Notes |
|--------|-----------|-------|
| **python-docx** | read + write | Pure Python with one compiled dependency (lxml, shipped as wheels); direct Open XML paragraph/run model |
| **pandoc** (subprocess) | read + write | Requires external binary; limited structural control |
| **mammoth** | read only | Focused on HTML output; no write support |
| **docx2python** | read only | Good for extracting raw content; no write support |
| **LibreOffice** (subprocess) | read + write | Heavy dependency; unreliable in headless environments |

The primary requirements were:

1. Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
2. Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
3. No external process dependency (no pandoc, no LibreOffice)
4. Pure Python — installable via `pip install` with no system-level setup

## Decision

Use **python-docx** for both the build (write) and import (read) directions.

python-docx provides:
- Direct access to the Open XML paragraph / run model — each `Paragraph` maps cleanly
  to a Markdown block element; each `Run` maps to inline formatting
- Style name assignment (`Heading 1`, `Normal`, `List Bullet`, etc.) enabling
  template-driven presentation
- Table and image support within the standard API surface; footnotes need a shim
  (see Consequences)
- Bookmark creation and hyperlink insertion through the underlying XML element tree
  (used for LEVEL3 cross-references)
- Stable, well-documented API; actively maintained

## Consequences

**Positive:**
- Single dependency for both conversion directions
- No subprocess execution; fully in-process
- Paragraph/run model maps naturally to Markdown's block/inline structure
- Template `.docx` files control presentation without touching content

**Negative / accepted limitations:**
- python-docx exposes only a subset of the Open XML specification. Complex Word
  features are out of scope by design:
  - Track changes (revision marks) — not parseable
  - SmartArt, charts, embedded objects — ignored during import
  - Advanced numbering schemes beyond simple ordered/unordered lists — not supported
  - Content controls, form fields — not supported
- python-docx's footnote write API is limited; markidocx uses a compatibility shim
  for footnote construction (documented in `builder.py`)
- markidocx does not modify an existing DOCX in place: it always builds a fresh
  DOCX and never mutates the input during import

**Out of scope by design:**
The constraints above align with markidocx's defined semantic envelope (FC-01).
The system only claims preservation for constructs within supported feature levels.

## Alternatives Rejected

**pandoc** — excellent general-purpose converter, but shelling out introduces a
hard runtime dependency, reduces structural control, and makes it difficult to
embed source-boundary markers needed for multi-file redistribution.

**mammoth** — high-quality Word → HTML converter; read-only, so unsuitable for
the build direction.

**docx2python** — useful for raw content extraction; no write support.

**LibreOffice** — handles the full Open XML spec, but requires a headless Office
installation, is unreliable in CI, and introduces significant operational complexity.