# Known Drift — markidocx-docs Corpus

Last updated: 2026-03-16

## Summary

The markidocx-docs corpus (PRD + FRS v0.2 + UCC) produces known structural drift
on round-trip at LEVEL1. This drift is expected and does not indicate a regression.

## Import mode: fallback (merged)

The three source files are composed into a single DOCX. On import the system attempts
to redistribute content back to the three origin files using source-boundary markers.
The current build pipeline embeds section markers but the 27 H1-level sections in the
combined document make boundary matching ambiguous, so the importer falls back to a
single merged output (`dist/imported_merged.md`).

**Classification:** expected / by-design. The merged output is complete and usable.
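
The fallback decision described above can be illustrated with a minimal sketch. It assumes that "ambiguous" means a boundary marker's title matches zero or several H1 headings. The importer's actual matching rule is not documented here, and `choose_import_mode` and the sample titles are hypothetical.

```python
from collections import Counter


def choose_import_mode(h1_titles, boundary_titles):
    """Return "split" when every boundary marker matches exactly one H1, else "merged".

    Hypothetical helper: the real importer's matching rules are not documented here.
    """
    counts = Counter(h1_titles)
    # A marker is unambiguous only if its title occurs exactly once among the H1s.
    unambiguous = all(counts[title] == 1 for title in boundary_titles)
    return "split" if unambiguous else "merged"


# Illustrative data only, not taken from the corpus.
h1_titles = ["Introduction", "Scope", "Introduction", "Use Cases"]
print(choose_import_mode(h1_titles, ["Scope", "Use Cases"]))     # split
print(choose_import_mode(h1_titles, ["Introduction", "Scope"]))  # merged
```

Under this assumption, a single duplicated or missing boundary title is enough to force the merged output, which would be consistent with a 27-section document falling back.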

## Structural drift items

### Bold inline text in list items (broken: ~70 items)

List items containing `**bold**` inline spans lose the bold markers on round-trip.
python-docx represents inline bold as a `Run` with `bold=True`, but the importer's
list-item text extractor concatenates run text without restoring markdown bold syntax.

**Classification:** known limitation of LEVEL1 inline formatting in list items.
**FR reference:** FR-508 (unsupported construct visibility): these items are surfaced
explicitly as `broken` rather than silently accepted.
**Impact:** content is preserved; only the bold presentation marker is lost.
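
As a rough illustration of what restoring the markers could look like, the sketch below joins run text and wraps each maximal group of bold runs in `**`. It is not the importer's code. `runs_to_markdown` and `FakeRun` are hypothetical names, and the function relies only on the `text` and `bold` attributes that python-docx `Run` objects expose, so the example runs without python-docx installed.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FakeRun:
    """Stand-in for a python-docx Run; only `text` and `bold` are used."""
    text: str
    bold: Optional[bool] = None


def runs_to_markdown(runs: Iterable) -> str:
    """Join run text, wrapping each maximal group of bold runs in `**`."""
    parts = []
    in_bold = False
    for run in runs:
        # bold may be True, False, or None (inherit); None is treated as not bold here.
        is_bold = run.bold is True
        if is_bold != in_bold:
            parts.append("**")  # open or close the bold span at each transition
            in_bold = is_bold
        parts.append(run.text)
    if in_bold:
        parts.append("**")  # close a bold span that runs to the end of the item
    return "".join(parts)


runs = [FakeRun("Status: "), FakeRun("blocked", bold=True), FakeRun(" until T05")]
print(runs_to_markdown(runs))  # Status: **blocked** until T05
```

With python-docx the same function could be called as `runs_to_markdown(paragraph.runs)`. The sketch ignores bold inherited from styles and does not move leading or trailing whitespace outside the `**` markers, both of which a real fix would need to handle.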

### Table (broken: 1 of 1)

One table in the UCC is detected as missing after round-trip. Likely cause: the table
contains merged cells or a header row structure that the importer does not reconstruct.

**Classification:** known LEVEL1 table limitation.
**Impact:** table content is present in the DOCX but not re-imported to Markdown.

## Verdict

902 elements are preserved; ~71 items are broken (about 70 list items with lost inline
bold, plus 1 table).
This corpus is suitable as a regression baseline: a clean round-trip regression test
can assert `preserved >= 900` and `broken <= 80` rather than exact zero-drift.
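
A threshold check of this kind might look like the following sketch. The `summary` dictionary shape and the `check_drift` helper are assumptions for illustration; the real suite in `tests/regression/test_corpus_regression.py` may structure its results differently.

```python
def check_drift(summary, min_preserved=900, max_broken=80):
    """Return a list of threshold violations; an empty list means the baseline holds."""
    failures = []
    if summary["preserved"] < min_preserved:
        failures.append(f"preserved {summary['preserved']} < {min_preserved}")
    if summary["broken"] > max_broken:
        failures.append(f"broken {summary['broken']} > {max_broken}")
    return failures


def test_corpus_round_trip_within_known_drift():
    # Counts copied from the verdict in this document; a real test would obtain
    # them from the round-trip run instead of hard-coding them.
    summary = {"preserved": 902, "broken": 71}
    assert check_drift(summary) == []
```

Returning the list of violations rather than asserting inside the helper lets a failing run report both thresholds at once.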