Known Drift — markidocx-docs Corpus
Last updated: 2026-03-16
Summary
The markidocx-docs corpus (PRD + FRS v0.2 + UCC) produces known structural drift on round-trip at LEVEL1. This drift is expected and does not indicate a regression.
Import mode: fallback (merged)
The three source files are composed into a single DOCX. On import the system attempts
to redistribute content back to the three origin files using source-boundary markers.
The current build pipeline embeds section markers, but the 27 H1-level sections in the
combined document make boundary matching ambiguous, so the importer falls back to a
single merged output (dist/imported_merged.md).
Classification: expected / by-design. The merged output is complete and usable.
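The fallback decision described above can be illustrated with a minimal sketch. It assumes each H1 section yields a set of candidate origin files after boundary matching; the function name `choose_import_mode`, the input shape, and the file names are hypothetical and do not describe the importer's actual code.

```python
def choose_import_mode(section_candidates, expected_sources):
    """Decide between a per-file split and a single merged output.

    section_candidates: one set of candidate origin files per H1 section
    (hypothetical shape). A section is ambiguous when it does not map to
    exactly one origin file.
    """
    ambiguous = [i for i, c in enumerate(section_candidates) if len(c) != 1]
    if ambiguous:
        # At least one section cannot be attributed: merge everything.
        return "fallback (merged)", ambiguous

    seen = {next(iter(c)) for c in section_candidates}
    if seen != set(expected_sources):
        # Every section is attributed, but an origin file got no content.
        return "fallback (merged)", []

    return "split", []


# Hypothetical file names, for illustration only.
sources = ["prd.md", "frs.md", "ucc.md"]
mode, unclear = choose_import_mode(
    [{"prd.md"}, {"prd.md", "frs.md"}, {"ucc.md"}], sources
)
print(mode, unclear)
```

Returning the indices of the ambiguous sections alongside the mode keeps the reason for the fallback visible to whoever reads the import log.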
Structural drift items
Bold inline text in list items (broken: ~70 items)
List items containing **bold** inline spans lose the bold markers on round-trip.
python-docx represents inline bold as a Run with bold=True, but the importer's
list-item text extractor concatenates run text without restoring markdown bold syntax.
Classification: known limitation of LEVEL1 inline formatting in list items.
FR reference: FR-508 (unsupported construct visibility) — these are surfaced
explicitly as broken rather than silently accepted.
Impact: content is preserved; only the bold marker is lost.
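As an illustration of what restoring the markers could look like, here is a minimal sketch. It works on any objects exposing `text` and `bold` attributes, which is how python-docx `Run` objects (for example from `paragraph.runs`) present themselves. The function name `runs_to_markdown` is hypothetical and is not part of the importer. The sketch treats only an explicit `bold is True` as bold, so bold inherited from a style (`bold is None`) is not detected.

```python
from types import SimpleNamespace


def runs_to_markdown(runs):
    """Join run text, wrapping each stretch of consecutive bold runs in **."""
    parts = []
    bold_buf = []

    def flush():
        if not bold_buf:
            return
        text = "".join(bold_buf)
        bold_buf.clear()
        stripped = text.strip()
        if not stripped:
            # Whitespace-only bold run: emit as-is, no markers.
            parts.append(text)
            return
        # Keep leading/trailing whitespace outside the markers so the
        # result is valid Markdown emphasis.
        lead = text[: len(text) - len(text.lstrip())]
        trail = text[len(text.rstrip()):]
        parts.append(f"{lead}**{stripped}**{trail}")

    for run in runs:
        if run.bold is True:
            bold_buf.append(run.text)
        else:
            flush()
            parts.append(run.text)
    flush()
    return "".join(parts)


# Stand-ins for python-docx runs; real code would pass paragraph.runs.
demo = [
    SimpleNamespace(text="Use ", bold=None),
    SimpleNamespace(text="strict", bold=True),
    SimpleNamespace(text=" mode", bold=True),
    SimpleNamespace(text=" here", bold=None),
]
print(runs_to_markdown(demo))  # Use **strict mode** here
```

Adjacent bold runs are merged before wrapping because Word often splits one visually bold phrase across several runs.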
Table (broken: 1 of 1)
One table in the UCC is detected as missing after round-trip. Likely cause: the table contains merged cells or a header row structure that the importer does not reconstruct.
Classification: known LEVEL1 table limitation.
Impact: table content is present in the DOCX but not re-imported to Markdown.
Verdict
902 elements preserved; ~71 broken items (~70 bold spans in list items plus 1 table).
This corpus is suitable as a regression baseline: a round-trip regression test
can assert preserved >= 900 and broken <= 80 instead of requiring zero drift.
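A threshold check of this kind might look like the sketch below. The `DriftReport` type and `check_baseline` function are hypothetical names for illustration; only the bounds (900 and 80) and the observed counts (902 and ~71) come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    """Hypothetical summary of one round-trip run."""
    preserved: int
    broken: int


def check_baseline(report, min_preserved=900, max_broken=80):
    """Return a list of failure messages; an empty list means within bounds."""
    failures = []
    if report.preserved < min_preserved:
        failures.append(
            f"preserved {report.preserved} is below the floor of {min_preserved}"
        )
    if report.broken > max_broken:
        failures.append(
            f"broken {report.broken} exceeds the ceiling of {max_broken}"
        )
    return failures


# Counts taken from the verdict above (71 used for the approximate ~71).
current = DriftReport(preserved=902, broken=71)
assert check_baseline(current) == []
```

Using bounds rather than exact counts leaves a small margin for harmless variation while still failing the run if preservation drops or breakage grows noticeably.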