citation-evidence/fixtures/pdfs/manifest.json
tegwick d54daf2e61 Implement CE-WP-0002 T03-T09: ingest, anchor resolution, engine, UI, persistence, e2e
Completes the PDF review slice end-to-end. After this commit a user can
open a fixture, select text, save an evidence item with commentary, see
it in the sidebar, reload the page, click the item, and the viewer
scrolls to the passage.

- T03 src/source/pdf/{fingerprint,extract,ingest}.ts + 39 fixture tests
  - SHA-256 fingerprint over a fresh ArrayBuffer (TS BufferSource-safe)
  - PDF.js text extract; per-page normalize then join with "\n\n"
  - PageMap + OffsetMap (gap-free coverage); pageLength = end - start
  - Updated manifest's Betriebskosten quote to one PDF.js extracts cleanly
- T04 src/anchor/selectors/{create,resolve}.ts + 25 unit + 7 fixture tests
  - createSelectors emits the maximal redundant set (TextQuote +
    TextPosition + PdfRect + PdfPageText when available)
  - resolveSelectors implements the SharedContracts §7 ladder; confidence
    1.0 (pos+quote) → 0.7 (rect-only) → 0 (unresolved)
  - Cross-module integration test moved to tests/integration/ to honor
    the anchor↛source boundary lint rule
- T05 engine: sync event bus over the closed §4 vocabulary, Map-backed
  repos, services, createEngine() composition root, 12 tests
- T06 work + app: three-pane shell (CollectionList | ViewerShell |
  EvidenceSidebar) wired through EngineProvider; EngineContext lives in
  src/work/ to respect the work↛app boundary; SpikeApp deleted
- T07 AnnotationToolbar: pendingSelection in context; Save runs
  createSelectors → engine.annotations.create → engine.evidence.create
- T08 click-to-reopen + localStorage persistence
  - scrollToAnnotation state in context with a version counter so a
    second click on the same item re-fires the viewer scroll
  - captureSnapshot/restoreSnapshot/attachPersister/restoreFromStorage;
    restore bypasses services to avoid event feedback loops
  - active-document id persisted alongside the snapshot so reload lands
    on the same fixture; ADR-0005 written
  - 9 persistence tests
- T09 tests/integration/app-prd-scenario.dom.test.tsx
  - end-to-end happy-dom test of PRD scenario steps 1-8 through the real
    React tree; viewer + ingest mocked per ADR-0004's headless-Chromium
    limitation. Fixed memo-deps bug in EvidenceSidebar/ViewerShell where
    useEngineEventTick values were not included in the useMemo deps,
    leaving stale memoization across event-driven re-renders
- vitest.config.ts: happy-dom for *.dom.test.{ts,tsx} files
- noEmit added to tsconfig so tsc -b doesn't litter src/ with .js outputs
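
For readers unfamiliar with the T03 text model, the per-page join and page map can be sketched roughly as follows. This is a minimal illustration, not the code in src/source/pdf/: the names `PageSpan`, `normalizePage`, `joinPages`, and `pageOfOffset` are hypothetical, the normalizer is a stand-in, and the choice to let each non-final page own its trailing "\n\n" separator is only one assumed way to get gap-free coverage.

```typescript
// Hypothetical shape; the real PageMap in src/source/pdf/ may differ.
interface PageSpan {
  page: number;  // 1-indexed physical page
  start: number; // inclusive offset into the joined text
  end: number;   // exclusive offset; pageLength = end - start
}

const PAGE_SEPARATOR = "\n\n";

// Stand-in normalizer (assumption): NFC plus whitespace collapsing.
function normalizePage(raw: string): string {
  return raw.normalize("NFC").replace(/\s+/g, " ").trim();
}

// Normalize each page, join with "\n\n", and record one span per page.
function joinPages(rawPages: string[]): { text: string; pages: PageSpan[] } {
  const pages: PageSpan[] = [];
  let text = "";
  rawPages.forEach((raw, i) => {
    const isLast = i === rawPages.length - 1;
    // Assumption: a non-final page owns its trailing separator,
    // so consecutive spans tile the joined text with no gaps.
    const chunk = normalizePage(raw) + (isLast ? "" : PAGE_SEPARATOR);
    pages.push({ page: i + 1, start: text.length, end: text.length + chunk.length });
    text += chunk;
  });
  return { text, pages };
}

// Map an offset in the joined text back to its 1-indexed page.
function pageOfOffset(pages: PageSpan[], offset: number): number | undefined {
  return pages.find((p) => offset >= p.start && offset < p.end)?.page;
}
```

Under these assumptions, looking up `text.indexOf(quote)` with `pageOfOffset` gives the page a fixture test could compare against the manifest's `known_good_quote_page`.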

Gates: typecheck ✓ lint ✓ test 109/109 across 11 files ✓ build ✓

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:58:11 +02:00

{
  "_schema_version": 1,
  "_description": "PDF fixture corpus for citation-evidence selector tests. Each entry binds a stable id (used by test code) to a file path, page count, and a verbatim known-good quote with its 1-indexed physical PDF page number. The quote is short, unique within the document, and chosen to round-trip cleanly through the canonical text normalizer.",
  "_provenance": "Page counts and quotes extracted on 2026-05-24 by reading each PDF directly, then re-verified on 2026-05-25 against the PDF.js v4 text extractor used by src/source/pdf/extract.ts. The Betriebskosten file is a scanned/handwritten form with noisy OCR text — its known-good quote was updated 2026-05-25 from 'Ich bitte um Überweisung auf das Konto bei' to 'Auf der Rückseite finden Sie Ihre Abrechnung' because PDF.js drops the capital-Ü in the original (the lowercase-ü in 'Rückseite' survives, so the new quote still exercises the umlaut code path).",
  "fixtures": [
    {
      "id": "betriebskosten-2024",
      "filename": "031-Kemal Güldag Betriebskosten 2024.pdf",
      "description": "German Betriebskostenabrechnung (utility-cost statement) for a Seeheim apartment — scanned cover letter + filled-in Abrechnung form. OCR-noisy text and handwritten field values. Useful for stress-testing canonical normalization and selector resolution on imperfect extraction.",
      "page_count": 2,
      "known_good_quote": "Auf der Rückseite finden Sie Ihre Abrechnung",
      "known_good_quote_page": 1,
      "characteristics": ["german", "umlauts", "scanned", "ocr-noisy", "form", "handwritten"]
    },
    {
      "id": "brief-trennung-angebot",
      "filename": "061-260215-brief-trennungRoxanaAngebot_v1_final.pdf",
      "description": "German correspondence — three-page settlement proposal letter with bullet lists, embedded amounts, and a signature on the final page. Single-column prose; representative of typical legal/personal letters.",
      "page_count": 3,
      "known_good_quote": "Dieser Vorschlag ist befristet bis zum 31.03.2026",
      "known_good_quote_page": 3,
      "characteristics": ["german", "umlauts", "single-column", "prose", "multi-page", "bullet-lists"]
    },
    {
      "id": "sonderkosten",
      "filename": "063-26.01_Sonderkosten.pdf",
      "description": "German Sonderkosten ledger — tabular expense statement across multiple years and people. Three pages of column-heavy tables; representative of spreadsheet-exported PDFs where text extraction order is column-driven.",
      "page_count": 3,
      "known_good_quote": "Einzahlung Bernd 23.09.24",
      "known_good_quote_page": 2,
      "characteristics": ["german", "tables", "spreadsheet-export", "multi-column", "amounts"]
    },
    {
      "id": "vollstaendigkeitserklaerung-2024",
      "filename": "61595286_Vollständigkeitserklärung_2024.pdf",
      "description": "Single-page German Vollständigkeitserklärung tax-prep form (VLH). Dense form with checkboxes, labelled fields, and small-print legal text — a good test of selector creation on form-heavy layouts.",
      "page_count": 1,
      "known_good_quote": "Mitglied beim Lohnsteuerhilfeverein Vereinigte Lohnsteuerhilfe e.V.",
      "known_good_quote_page": 1,
      "characteristics": ["german", "umlauts", "form", "checkboxes", "single-page", "dense"]
    },
    {
      "id": "aufnahmeschein-naturfriedhof",
      "filename": "Aufnahmeschein Naturfriedhof.pdf",
      "description": "Three-page German admission form (Aufnahmeschein) for the Mühltal natural cemetery: fillable form (p1-2) plus an excerpt of the Friedhofsordnung statute (p3). Tests selectors that must distinguish form labels from underline-fields and prose.",
      "page_count": 3,
      "known_good_quote": "Mehrere Verpflichtete haften als Gesamtschuldner",
      "known_good_quote_page": 3,
      "characteristics": ["german", "umlauts", "form", "statute", "mixed-layout"]
    },
    {
      "id": "fristsetzung-bezifferung",
      "filename": "Fristsetzung zur Bezifferung GÜ an Gegenseite 3 Wochen.pdf",
      "description": "Single-page formal court letter from the Amtsgericht Darmstadt — header block, addressed block, a one-sentence ruling, and a signature block. Excellent for clean selector round-trip tests.",
      "page_count": 1,
      "known_good_quote": "wird der Antragsgegnerin eine Frist von 3 Wochen zur Bezifferung gesetzt",
      "known_good_quote_page": 1,
      "characteristics": ["german", "umlauts", "legal", "single-page", "clean-text"]
    },
    {
      "id": "zeugnisspruche-klasse-7-8",
      "filename": "Zeugnissprüche_Klasse_7_8.pdf",
      "description": "Long-form reference document (29 PDF pages): title page + 28 pages of curated quotes for German school-year reports, each with author, dates, and short biography. Multi-language inclusions (English, Spanish, Greek). Ideal for cross-page selector and heading-hierarchy tests.",
      "page_count": 29,
      "known_good_quote": "Der Friede der Welt beginnt in den Herzen der Menschen",
      "known_good_quote_page": 2,
      "characteristics": ["german", "umlauts", "long-form", "multi-language", "hierarchy", "structured"]
    }
  ]
}
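
The invariants stated in `_description` (stable ids, a 1-indexed quote page that lies within the page count, a non-empty quote) could be checked before the fixture tests run. The following is a minimal sketch; `FixtureEntry`, `FixtureManifest`, and `checkManifest` are hypothetical names that mirror the keys above and are not taken from the repository.

```typescript
// Hypothetical types mirroring the manifest keys shown above.
interface FixtureEntry {
  id: string;
  filename: string;
  description: string;
  page_count: number;
  known_good_quote: string;
  known_good_quote_page: number; // 1-indexed physical PDF page
  characteristics: string[];
}

interface FixtureManifest {
  _schema_version: number;
  fixtures: FixtureEntry[];
}

// Returns human-readable problems; an empty array means every check passed.
function checkManifest(manifest: FixtureManifest): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();
  for (const f of manifest.fixtures) {
    if (seenIds.has(f.id)) problems.push(`duplicate id: ${f.id}`);
    seenIds.add(f.id);
    if (!Number.isInteger(f.page_count) || f.page_count < 1) {
      problems.push(`${f.id}: page_count must be a positive integer`);
    }
    const page = f.known_good_quote_page;
    if (!Number.isInteger(page) || page < 1 || page > f.page_count) {
      problems.push(`${f.id}: known_good_quote_page ${page} is outside 1..${f.page_count}`);
    }
    if (f.known_good_quote.trim().length === 0) {
      problems.push(`${f.id}: known_good_quote is empty`);
    }
  }
  return problems;
}
```

A test setup could parse manifest.json with `JSON.parse`, pass the result to `checkManifest`, and fail fast on a non-empty result. The check does not verify that the quote actually appears on the stated page; that still requires running the extractor against the PDF.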