Bernd Worsch ebc5eaee77 feat: WP-0004 T01-T04 — stable corpus, ADRs, regression test

- corpus/markidocx-docs/manifest.yaml: specs as live markidocx project (FR-1101)
- corpus/markidocx-docs/known-drift.md: documented structural drift
- workflows.py: release-regression accepts manifest path; emits corpus_id (FR-1109)
- tests/regression/test_corpus_regression.py: corpus regression suite (FR-1102–1110)
- architecture/ADR-002: python-docx as conversion engine
- architecture/ADR-003: manifest YAML schema
- workplans/MRKD-WP-0004: T01–T04 done; T05 blocked (SBOM path mapping needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-16 17:48:33 +00:00

id	type	status	created	deciders
ADR-002	adr	accepted	2026-03-16	Bernd, Custodian

ADR-002: python-docx as DOCX Conversion Engine

Status

Accepted

Context

markidocx must produce and consume .docx (Open XML) files from Python. The build pipeline writes DOCX from Markdown; the import pipeline reads DOCX back into Markdown. Both directions must be controlled programmatically without shelling out to Office applications or external services.

The following options were evaluated:

Option	Direction	Notes
python-docx	read + write	Pure Python, direct Open XML paragraph/run model
pandoc (subprocess)	read + write	Requires external binary; limited structural control
mammoth	read only	Focused on HTML output; no write support
docx2python	read only	Good for extracting raw content; no write support
LibreOffice (subprocess)	read + write	Heavy dependency; unreliable in headless environments

The primary requirements were:

Both build (Markdown → DOCX) and import (DOCX → Markdown) in a single library
Programmatic control over paragraph styles, runs, tables, footnotes, and bookmarks
No external process dependency (no pandoc, no LibreOffice)
Pure Python — installable via pip install with no system-level setup

Decision

Use python-docx for both the build (write) and import (read) directions.

python-docx provides:

Direct access to the Open XML paragraph / run model — each Paragraph maps cleanly to a Markdown block element; each Run maps to inline formatting
Style name assignment (Heading 1, Normal, List Bullet, etc.) enabling template-driven presentation
Table and image support within the standard API surface; footnote construction goes through a compatibility shim (see the accepted limitations under Consequences)
Bookmark creation and hyperlink insertion (used for LEVEL3 cross-references)
Stable, well-documented API; actively maintained
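
The paragraph/run mapping listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not markidocx's actual builder: `RunSpec`, `split_inline`, `write_block`, and `build_docx` are hypothetical names, and the inline grammar handles only `**bold**` and `*italic*`. The python-docx calls used (`Document`, `add_paragraph`, `add_run`, `Run.bold`, `Run.italic`, `save`) are part of its public API. The import is deferred so the parsing half runs without the library installed.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSpec:
    """One inline span; becomes one python-docx Run."""
    text: str
    bold: bool = False
    italic: bool = False


# Hypothetical, deliberately tiny inline grammar: **bold** or *italic*.
_INLINE = re.compile(r"\*\*(?P<bold>.+?)\*\*|\*(?P<italic>.+?)\*")


def split_inline(text: str) -> list[RunSpec]:
    """Split one Markdown line into run specs (no nesting, no escapes)."""
    specs: list[RunSpec] = []
    pos = 0
    for match in _INLINE.finditer(text):
        if match.start() > pos:
            specs.append(RunSpec(text[pos:match.start()]))
        if match.group("bold") is not None:
            specs.append(RunSpec(match.group("bold"), bold=True))
        else:
            specs.append(RunSpec(match.group("italic"), italic=True))
        pos = match.end()
    if pos < len(text):
        specs.append(RunSpec(text[pos:]))
    return specs


def write_block(document, text: str, style: str = "Normal"):
    """Append one block as a styled Paragraph with one Run per inline span."""
    paragraph = document.add_paragraph(style=style)
    for spec in split_inline(text):
        run = paragraph.add_run(spec.text)
        run.bold = spec.bold or None      # None means "inherit from the style"
        run.italic = spec.italic or None
    return paragraph


def build_docx(path: str, blocks: list[tuple[str, str]]) -> None:
    """Write (style, text) blocks to a fresh DOCX at `path`."""
    from docx import Document  # python-docx; deferred so parsing works without it

    document = Document()  # pass a template path here for template-driven styling
    for style, text in blocks:
        write_block(document, text, style)
    document.save(path)
```

A call such as `build_docx("out.docx", [("Heading 1", "Title"), ("Normal", "plain **bold** text")])` would produce one heading paragraph and one body paragraph with three runs.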

Consequences

Positive:

Single dependency for both conversion directions
No subprocess execution; fully in-process
Paragraph/run model maps naturally to Markdown's block/inline structure
Template .docx files control presentation without touching content
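
For the import direction, the same mapping runs in reverse. The sketch below assumes a hypothetical style-to-prefix table and hypothetical helper names (`STYLE_PREFIX`, `runs_to_inline`, `paragraph_to_markdown`, `import_docx`); it is not markidocx's importer. It relies on `paragraph.style.name`, `paragraph.runs`, `run.text`, `run.bold`, and `run.italic`, which python-docx exposes, and it only reads the input file.

```python
# Hypothetical mapping from Word style names to Markdown block prefixes.
STYLE_PREFIX = {
    "Heading 1": "# ",
    "Heading 2": "## ",
    "Heading 3": "### ",
    "List Bullet": "- ",
    "Normal": "",
}


def runs_to_inline(runs) -> str:
    """Render (text, bold, italic) triples as Markdown inline text."""
    parts = []
    for text, bold, italic in runs:
        if not text:
            continue
        if bold:
            text = f"**{text}**"
        if italic:
            text = f"*{text}*"
        parts.append(text)
    return "".join(parts)


def paragraph_to_markdown(style_name: str, runs) -> str:
    """Render one paragraph; unknown styles fall back to plain text."""
    return STYLE_PREFIX.get(style_name, "") + runs_to_inline(runs)


def import_docx(path: str) -> str:
    """Read a DOCX and return Markdown. The input file is never written to."""
    from docx import Document  # python-docx; deferred import

    document = Document(path)
    blocks = []
    for paragraph in document.paragraphs:
        # run.bold / run.italic are tri-state (True, False, None = inherited);
        # this sketch treats inherited formatting as "off".
        runs = [(run.text, bool(run.bold), bool(run.italic)) for run in paragraph.runs]
        blocks.append(paragraph_to_markdown(paragraph.style.name, runs))
    return "\n\n".join(block for block in blocks if block)
```

Keeping the rendering functions independent of python-docx objects is a design choice of this sketch: the style mapping can be exercised with plain tuples, without a DOCX fixture.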

Negative / accepted limitations:

python-docx exposes only a subset of the Open XML specification. Complex Word features are out of scope by design:
- Track changes (revision marks) — not parseable
- SmartArt, charts, embedded objects — ignored during import
- Advanced numbering schemes beyond simple ordered/unordered lists — not supported
- Content controls, form fields — not supported
python-docx's footnote write API is limited; markidocx uses a compatibility shim for footnote construction (documented in builder.py)
In-place modification of an existing DOCX is not supported by markidocx: the build always produces a fresh DOCX, and the import never mutates its input

Out of scope by design: The constraints above align with markidocx's defined semantic envelope (FC-01). The system only claims preservation for constructs within supported feature levels.

Alternatives Rejected

pandoc — excellent general-purpose converter, but shelling out introduces a hard runtime dependency, reduces structural control, and makes it difficult to embed source-boundary markers needed for multi-file redistribution.

mammoth — high-quality Word → HTML converter; read-only, so unsuitable for the build direction.

docx2python — useful for raw content extraction; no write support.

LibreOffice — handles the full Open XML spec, but requires a headless Office installation, is unreliable in CI, and introduces significant operational complexity.
