marki-docx/corpus/markidocx-docs/known-drift.md
Bernd Worsch ebc5eaee77 feat: WP-0004 T01-T04 — stable corpus, ADRs, regression test
- corpus/markidocx-docs/manifest.yaml: specs as live markidocx project (FR-1101)
- corpus/markidocx-docs/known-drift.md: documented structural drift
- workflows.py: release-regression accepts manifest path; emits corpus_id (FR-1109)
- tests/regression/test_corpus_regression.py: corpus regression suite (FR-1102–1110)
- architecture/ADR-002: python-docx as conversion engine
- architecture/ADR-003: manifest YAML schema
- workplans/MRKD-WP-0004: T01–T04 done; T05 blocked (SBOM path mapping needed)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-16 17:48:33 +00:00

# Known Drift — markidocx-docs Corpus
Last updated: 2026-03-16
## Summary
The markidocx-docs corpus (PRD + FRS v0.2 + UCC) produces known structural drift
on round-trip at LEVEL1. This drift is expected and does not indicate a regression.
## Import mode: fallback (merged)
The three source files are composed into a single DOCX. On import the system attempts
to redistribute content back to the three origin files using source-boundary markers.
The current build pipeline embeds section markers, but the combined document contains
27 H1-level sections, which makes boundary matching ambiguous. The importer therefore
falls back to a single merged output (`dist/imported_merged.md`).
**Classification:** expected / by-design. The merged output is complete and usable.
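
The fall-back decision described above could be structured roughly as follows. This is only a sketch under stated assumptions: the marker syntax (`<!-- source: ... -->`) and the function name `redistribute` are hypothetical, since the real marker format is not shown in this document. Redistribution succeeds only when every expected source file owns exactly one marker; anything else returns `None`, which the caller would treat as "write the merged output".

```python
import re
from typing import Dict, List, Optional

# Hypothetical marker syntax, used for illustration only.
MARKER = re.compile(r"^<!-- source: (?P<path>\S+) -->$")


def redistribute(merged_md: str, expected: List[str]) -> Optional[Dict[str, str]]:
    """Split merged Markdown back into per-source content.

    Returns None when the markers do not map one-to-one onto the
    expected source files, signalling a fall-back to merged output.
    """
    sections: Dict[str, List[str]] = {}
    seen: List[str] = []
    current: Optional[str] = None
    for line in merged_md.splitlines():
        match = MARKER.match(line)
        if match:
            current = match.group("path")
            seen.append(current)
            sections.setdefault(current, [])
            continue
        if current is None:
            if line.strip():
                return None  # content before any marker cannot be attributed
            continue
        sections[current].append(line)
    # A duplicated or missing marker makes the boundaries ambiguous.
    if sorted(seen) != sorted(expected):
        return None
    return {path: "\n".join(lines) + "\n" for path, lines in sections.items()}
```

Returning `None` instead of raising keeps the merged path a normal outcome rather than an error, which matches the expected / by-design classification.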
## Structural drift items
### Bold inline text in list items (broken: ~70 items)
List items containing `**bold**` inline spans lose the bold markers on round-trip.
python-docx represents inline bold as a `Run` with `bold=True`, but the importer's
list-item text extractor concatenates run text without restoring markdown bold syntax.
**Classification:** known limitation of LEVEL1 inline formatting in list items.
**FR reference:** FR-508 (unsupported construct visibility) — these are surfaced
explicitly as `broken` rather than silently accepted.
**Impact:** content is preserved, presentation marker is lost.
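
A minimal sketch of how bold markers could be restored while joining run text. It assumes only that each run exposes `.text` and `.bold`, as python-docx `Run` objects do. `FakeRun` and `runs_to_markdown` are hypothetical names introduced here for illustration and are not part of the importer.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeRun:
    """Stand-in for a python-docx Run; only .text and .bold are read."""
    text: str
    bold: Optional[bool] = None  # None means "inherited" in python-docx


def runs_to_markdown(runs: Iterable) -> str:
    """Join run text, wrapping each stretch of explicitly bold runs in **."""
    parts = []
    in_bold = False
    for run in runs:
        if not run.text:
            continue  # skip empty runs so they do not toggle bold state
        is_bold = run.bold is True
        if is_bold != in_bold:
            parts.append("**")  # open or close a bold span
            in_bold = is_bold
        parts.append(run.text)
    if in_bold:
        parts.append("**")  # close a span left open at the end
    return "".join(parts)


item = [FakeRun("Status: "), FakeRun("done", bold=True), FakeRun(" today")]
print(runs_to_markdown(item))  # prints: Status: **done** today
```

The sketch treats only `bold is True` as bold, so bold inherited from a paragraph or character style is ignored. It also does not trim whitespace at span edges, which Markdown emphasis rules require for the markers to render.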
### Table (broken: 1 of 1)
One table in the UCC is detected as missing after round-trip. Likely cause: the table
contains merged cells or a header row structure that the importer does not reconstruct.
**Classification:** known LEVEL1 table limitation.
**Impact:** table content is present in the DOCX but not re-imported to Markdown.
## Verdict
902 elements preserved; ~71 broken (about 70 list items that lose bold markers, plus 1 table).
This corpus is suitable as a regression baseline: a round-trip regression test
can assert `preserved >= 900` and `broken <= 80` instead of requiring zero drift.
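
The threshold check can be sketched as a small helper. The function name `check_drift` and its parameter names are hypothetical; only the two bounds come from this document.

```python
from typing import List


def check_drift(preserved: int, broken: int,
                min_preserved: int = 900, max_broken: int = 80) -> List[str]:
    """Return a list of threshold violations; an empty list means pass."""
    failures = []
    if preserved < min_preserved:
        failures.append(f"preserved {preserved} < {min_preserved}")
    if broken > max_broken:
        failures.append(f"broken {broken} > {max_broken}")
    return failures


# Counts reported in the verdict above.
assert check_drift(preserved=902, broken=71) == []
```

Returning the violations as messages, rather than a bare boolean, lets a test report which bound was crossed when the baseline shifts.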