marki-docx/corpus/markidocx-docs/known-drift.md
Bernd Worsch ebc5eaee77 feat: WP-0004 T01-T04 — stable corpus, ADRs, regression test
- corpus/markidocx-docs/manifest.yaml: specs as live markidocx project (FR-1101)
- corpus/markidocx-docs/known-drift.md: documented structural drift
- workflows.py: release-regression accepts manifest path; emits corpus_id (FR-1109)
- tests/regression/test_corpus_regression.py: corpus regression suite (FR-1102–1110)
- architecture/ADR-002: python-docx as conversion engine
- architecture/ADR-003: manifest YAML schema
- workplans/MRKD-WP-0004: T01–T04 done; T05 blocked (SBOM path mapping needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 17:48:33 +00:00


Known Drift — markidocx-docs Corpus

Last updated: 2026-03-16

Summary

The markidocx-docs corpus (PRD + FRS v0.2 + UCC) produces known structural drift on round-trip at LEVEL1. This drift is expected and does not indicate a regression.

Import mode: fallback (merged)

The three source files are composed into a single DOCX. On import, the system attempts to redistribute content back to the three origin files using source-boundary markers. The current build pipeline embeds section markers, but the 27 H1-level sections in the combined document make boundary matching ambiguous, so the importer falls back to a single merged output (dist/imported_merged.md).

Classification: expected / by-design. The merged output is complete and usable.

Structural drift items

Bold inline text in list items (broken: ~70 items)

List items containing **bold** inline spans lose the bold markers on round-trip. python-docx represents inline bold as a Run with bold=True, but the importer's list-item text extractor concatenates run text without restoring markdown bold syntax.

Classification: known limitation of LEVEL1 inline formatting in list items. FR reference: FR-508 (unsupported construct visibility) — these are surfaced explicitly as broken rather than silently accepted. Impact: content is preserved, presentation marker is lost.
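The bold loss described above can be illustrated with a minimal sketch. It is not the importer's actual code: `FakeRun` and `runs_to_markdown` are hypothetical names, and the only assumption about python-docx is that a run exposes `.text` and a tri-state `.bold` (True, False, or None for inherited), so the function should also accept real `paragraph.runs`. Plain concatenation of run text drops the marker; grouping adjacent bold runs and wrapping them restores it.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import Iterable, Optional


@dataclass
class FakeRun:
    """Hypothetical stand-in for a python-docx Run; only .text and .bold are used."""
    text: str
    bold: Optional[bool] = None  # True / False / None (inherit from style)


def runs_to_markdown(runs: Iterable) -> str:
    """Join run text, wrapping consecutive bold runs in Markdown ** markers."""
    parts = []
    # Group adjacent runs by "explicitly bold" so split runs yield one ** span.
    for is_bold, group in groupby(runs, key=lambda r: r.bold is True):
        text = "".join(r.text for r in group)
        if not text:
            continue
        stripped = text.strip()
        if not is_bold or not stripped:
            # Non-bold text, or bold whitespace only: emit unchanged.
            parts.append(text)
            continue
        # Keep leading/trailing whitespace outside the markers,
        # because "** bold **" is not parsed as bold by Markdown.
        lead = text[: len(text) - len(text.lstrip())]
        trail = text[len(text.rstrip()):]
        parts.append(f"{lead}**{stripped}**{trail}")
    return "".join(parts)


runs = [FakeRun("Import mode: "), FakeRun("fallback", bold=True), FakeRun(" (merged)", bold=False)]
plain = "".join(r.text for r in runs)   # what a naive extractor produces
restored = runs_to_markdown(runs)       # bold marker preserved
```

Treating `None` as not bold is a simplification: a run may inherit bold from its character or paragraph style, which this sketch does not resolve.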

Table (broken: 1 of 1)

One table in the UCC is detected as missing after round-trip. Likely cause: the table contains merged cells or a header row structure that the importer does not reconstruct.

Classification: known LEVEL1 table limitation. Impact: table content is present in the DOCX but not re-imported to Markdown.

Verdict

902 elements preserved; ~71 broken items (about 70 list items that lost inline bold, plus 1 table). This corpus is suitable as a regression baseline: the round-trip regression test can assert preserved >= 900 and broken <= 80 instead of requiring zero drift.
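As a rough illustration of the threshold-style assertion, the check could look like the sketch below. `DriftReport` and `within_baseline` are hypothetical names, not part of the actual test suite in tests/regression/test_corpus_regression.py; only the two bounds (900 and 80) and the current counts (902 and ~71) come from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftReport:
    """Hypothetical summary of one round-trip run."""
    preserved: int
    broken: int


def within_baseline(report: DriftReport,
                    min_preserved: int = 900,
                    max_broken: int = 80) -> bool:
    """Return True when drift stays inside the documented tolerance band."""
    return report.preserved >= min_preserved and report.broken <= max_broken


# Current documented state of the corpus: passes with some headroom.
current = DriftReport(preserved=902, broken=71)
ok = within_baseline(current)
```

Bounds rather than exact counts keep the test stable against small, unrelated changes in the spec documents, while still failing if a change breaks many more elements.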