---
id: ADR-002
type: adr
status: accepted
created: 2026-03-16
deciders: [Bernd, Custodian]
---

# ADR-002: python-docx as DOCX Conversion Engine

## Status

Accepted

## Context

markidocx must produce and consume `.docx` (Open XML) files from Python. The build
pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown.
Both directions must be controlled programmatically without shelling out to Office
applications or external services.

The following options were evaluated:

| Option | Direction | Notes |
|--------|-----------|-------|
| **python-docx** | read + write | Pure Python, direct Open XML paragraph/run model |
| **pandoc** (subprocess) | read + write | Requires external binary; limited structural control |
| **mammoth** | read only | Focused on HTML output; no write support |
| **docx2python** | read only | Good for extracting raw content; no write support |
| **LibreOffice** (subprocess) | read + write | Heavy dependency; unreliable in headless environments |

The primary requirements were:

1. Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
2. Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
3. No external process dependency (no pandoc, no LibreOffice)
4. Pure Python: installable via `pip install` with no system-level setup

## Decision

Use **python-docx** for both the build (write) and import (read) directions.

python-docx provides:

- Direct access to the Open XML paragraph/run model: each `Paragraph` maps cleanly
  to a Markdown block element; each `Run` maps to inline formatting
- Style name assignment (`Heading 1`, `Normal`, `List Bullet`, etc.) enabling
  template-driven presentation
- Footnote, table, and image support within the standard API surface (footnote
  writing goes through a shim; see Consequences)
- Bookmark creation and hyperlink insertion (used for LEVEL3 cross-references)
- Stable, well-documented API; actively maintained

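The paragraph/run mapping described in the list above can be sketched for the import direction as follows. This is a minimal illustration, not markidocx's actual code: `BLOCK_PREFIX`, `FakeRun`, `run_to_markdown`, and `paragraph_to_markdown` are hypothetical names, and the style-to-prefix table is an assumption. The functions only read `text`, `bold`, and `italic`, which python-docx `Run` objects also expose, so the same logic should apply to real runs.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Hypothetical style-to-prefix table; the real markidocx mapping is not
# shown in this ADR.
BLOCK_PREFIX = {
    "Heading 1": "# ",
    "Heading 2": "## ",
    "Heading 3": "### ",
    "List Bullet": "- ",
    "Normal": "",
}


@dataclass
class FakeRun:
    """Stand-in for a run: only the attributes read below are modelled."""
    text: str
    bold: Optional[bool] = None    # tri-state, like python-docx: True/False/None
    italic: Optional[bool] = None


def run_to_markdown(run) -> str:
    """Wrap the run's text in Markdown emphasis markers based on its flags."""
    text = run.text
    if not text:
        return ""
    if run.bold and run.italic:
        return f"***{text}***"
    if run.bold:
        return f"**{text}**"
    if run.italic:
        return f"*{text}*"
    return text


def paragraph_to_markdown(style_name: str, runs: Iterable) -> str:
    """Map a paragraph style to a block prefix, then join the inline runs."""
    prefix = BLOCK_PREFIX.get(style_name, "")  # unknown styles become plain text
    return prefix + "".join(run_to_markdown(r) for r in runs)


runs = [FakeRun("Use "), FakeRun("python-docx", bold=True), FakeRun(" for both directions")]
print(paragraph_to_markdown("Heading 2", runs))
```

With a real python-docx paragraph `p`, the equivalent call would be `paragraph_to_markdown(p.style.name, p.runs)`.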
## Consequences

**Positive:**

- Single dependency for both conversion directions
- No subprocess execution; fully in-process
- Paragraph/run model maps naturally to Markdown's block/inline structure
- Template `.docx` files control presentation without touching content

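In the build direction, the split between content and template-controlled presentation might look like the sketch below: the code only picks a style name per Markdown line, and how that style looks is left to the template. `style_for_line` is a hypothetical helper, and handling one line at a time is a simplifying assumption; the style names are those found in python-docx's default template.

```python
import re
from typing import Tuple

_HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
_BULLET = re.compile(r"^[-*+]\s+(.*)$")
_ORDERED = re.compile(r"^\d+[.)]\s+(.*)$")


def style_for_line(line: str) -> Tuple[str, str]:
    """Return (style name, text) for a single Markdown line.

    Only the style *name* is chosen here; fonts, spacing, and colours
    come from whatever template defines that style.
    """
    line = line.rstrip()
    match = _HEADING.match(line)
    if match:
        return f"Heading {len(match.group(1))}", match.group(2)
    match = _BULLET.match(line)
    if match:
        return "List Bullet", match.group(1)
    match = _ORDERED.match(line)
    if match:
        return "List Number", match.group(1)
    return "Normal", line


for sample in ["## Context", "- Single dependency", "3. No external process", "Accepted"]:
    print(style_for_line(sample))
```

With python-docx, each result could then be passed to `document.add_paragraph(text, style=style_name)` on a `Document` opened from the template file.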
**Negative / accepted limitations:**

- python-docx exposes only a subset of the Open XML specification. Complex Word
  features are out of scope by design:
  - Track changes (revision marks): not parseable
  - SmartArt, charts, embedded objects: ignored during import
  - Advanced numbering schemes beyond simple ordered/unordered lists: not supported
  - Content controls, form fields: not supported
- python-docx's footnote write API is limited; markidocx uses a compatibility shim
  for footnote construction (documented in `builder.py`)
- markidocx does not modify an existing DOCX in place: it always builds
  a fresh DOCX and never mutates the input during import

**Out of scope by design:**

The constraints above align with markidocx's defined semantic envelope (FC-01).
The system only claims preservation for constructs within supported feature levels.

## Alternatives Rejected

**pandoc**: excellent general-purpose converter, but shelling out introduces a
hard runtime dependency, reduces structural control, and makes it difficult to
embed source-boundary markers needed for multi-file redistribution.

**mammoth**: high-quality Word → HTML converter; read-only, so unsuitable for
the build direction.

**docx2python**: useful for raw content extraction; no write support.

**LibreOffice**: handles the full Open XML spec, but requires a headless Office
installation, is unreliable in CI, and introduces significant operational complexity.