coulomb/citation-evidence

generated from coulomb/repo-seed

Commit Graph

Author	SHA1	Message	Date

tegwick

c000ce6f73

Wire pdfjs cmaps + standard fonts so text layer positions correctly

Strong likelihood that the "text layer is misplaced / body text not
selectable" symptoms across multiple PDFs come from PDF.js falling
back to substitute font metrics. Without the cmaps directory (CID
character maps for non-Latin fonts) and the standard_fonts directory
(Helvetica/Times/Courier metrics for unembedded standard fonts), the
canvas glyphs use embedded font data while the text-layer span
positions are computed from fallback metrics. The two diverge — text
spans land in the wrong place, or text content can't be decoded at
all, leaving the body unselectable.

Both directories are now copied into the served root by
vite-plugin-static-copy and passed to pdfjs.getDocument() as
`cMapUrl: "/cmaps/"` + `cMapPacked: true` + `standardFontDataUrl:
"/standard_fonts/"` via PdfLoader's `document` prop (which accepts a
full DocumentInitParameters object).
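
As a rough illustration of the parameters described above, the sketch below builds the object that would be handed to PdfLoader's `document` prop. It is not the repository's code: `FontAssetParams`, `assetDir`, and `buildDocumentParams` are hypothetical names, the interface is a local stand-in for the relevant subset of DocumentInitParameters, and the example PDF path is made up. Only the three option names and their values come from the commit message.

```typescript
// Local stand-in for the subset of pdf.js DocumentInitParameters named in the
// commit message. In a real app the type would come from pdfjs-dist instead.
interface FontAssetParams {
  url: string;
  cMapUrl: string;
  cMapPacked: boolean;
  standardFontDataUrl: string;
}

// Hypothetical helper: joins the served root with an asset directory and
// guarantees a trailing slash, because pdf.js expects these URLs to end in one.
function assetDir(base: string, dir: string): string {
  const root = base.endsWith("/") ? base : `${base}/`;
  return `${root}${dir}/`;
}

// Hypothetical helper: assembles the init parameters for one PDF.
function buildDocumentParams(pdfUrl: string, base: string = "/"): FontAssetParams {
  return {
    url: pdfUrl,
    cMapUrl: assetDir(base, "cmaps"),
    cMapPacked: true,
    standardFontDataUrl: assetDir(base, "standard_fonts"),
  };
}

const params = buildDocumentParams("/papers/example.pdf");
console.log(params.cMapUrl, params.standardFontDataUrl);
```

Deriving both URLs from a single base keeps them in step with wherever the static-copy step places the directories, assuming the app may one day be served from a sub-path.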

If this is the right diagnosis, the textLayer overlay should now line
up with the visible glyphs on the same PDFs that were producing
fragmented captures. If the body text is still unselectable, the PDF
genuinely lacks a text layer for those glyphs (image-only content)
and OCR would be the only path forward.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 00:38:34 +02:00

tegwick

2a7b05c190

Implement CE-WP-0002 T01-T02: engine types + PDF viewer adapter spike

T01: shared engine types (Document, Selector union, Annotation, EvidenceItem,
branded IDs with newId factory) per wiki/SharedContracts.md §1-§3.
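
For readers unfamiliar with branded IDs, here is a minimal sketch of the pattern T01 refers to. The actual definitions live in wiki/SharedContracts.md and are not shown on this page, so everything below is an assumption: the `Brand` helper, the `DocumentId` and `AnnotationId` names, the `newId` signature, and the prefix-plus-random-suffix format are illustrative only.

```typescript
// A brand is a phantom property that exists only at the type level, so two
// string-based ID types cannot be mixed up by accident.
type Brand<T, B extends string> = T & { readonly __brand: B };

type DocumentId = Brand<string, "DocumentId">;
type AnnotationId = Brand<string, "AnnotationId">;

// Hypothetical factory: returns a prefixed pseudo-random string cast to the
// requested brand. Math.random is used only to keep the sketch dependency-free;
// it is not collision-resistant enough for production IDs.
function newId<B extends string>(prefix: string): Brand<string, B> {
  const suffix = Math.random().toString(36).slice(2, 10);
  return `${prefix}_${suffix}` as Brand<string, B>;
}

const docId: DocumentId = newId<"DocumentId">("doc");
const annId: AnnotationId = newId<"AnnotationId">("ann");

// At runtime both values are plain strings; the compiler alone enforces that
// an AnnotationId is never passed where a DocumentId is expected.
console.log(typeof docId, typeof annId);
```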

T02: react-pdf-highlighter-plus v1.1.4 spike behind the §5
DocumentViewerAdapter contract in src/anchor/. Pure round-trip math
extracted to pdf-selector-math.ts with 11 unit tests proving lossless
capture → selectors → JSON → restored-rects. ADR-0004 accepted; full
user-flow Playwright verification deferred to T09.
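
The capture → selectors → JSON → restored-rects round trip can be pictured with the sketch below. It is not the contents of pdf-selector-math.ts, which this page does not show. It assumes one plausible scheme, storing each rectangle as fractions of the page viewport, and the `Rect`, `Viewport`, and `RectSelector` types and all function names are hypothetical.

```typescript
interface Rect { x: number; y: number; width: number; height: number; }
interface Viewport { width: number; height: number; }

// Assumed selector shape: the rectangle expressed as fractions of the page,
// so it stays valid when the page is later rendered at another zoom level.
interface RectSelector { type: "RectSelector"; page: number; rect: Rect; }

// Capture: viewport pixels -> page-relative fractions.
function toSelector(page: number, rect: Rect, vp: Viewport): RectSelector {
  return {
    type: "RectSelector",
    page,
    rect: {
      x: rect.x / vp.width,
      y: rect.y / vp.height,
      width: rect.width / vp.width,
      height: rect.height / vp.height,
    },
  };
}

// Restore: page-relative fractions -> pixels in the target viewport.
function fromSelector(sel: RectSelector, vp: Viewport): Rect {
  return {
    x: sel.rect.x * vp.width,
    y: sel.rect.y * vp.height,
    width: sel.rect.width * vp.width,
    height: sel.rect.height * vp.height,
  };
}

// Full round trip through a JSON string, as a persisted annotation would take.
function roundTrip(page: number, rect: Rect, captureVp: Viewport, restoreVp: Viewport): Rect {
  const json = JSON.stringify(toSelector(page, rect, captureVp));
  return fromSelector(JSON.parse(json) as RectSelector, restoreVp);
}

const captured: Rect = { x: 50, y: 100, width: 200, height: 20 };
const viewport: Viewport = { width: 800, height: 1000 };
const restored = roundTrip(1, captured, viewport, viewport);
console.log(restored);
```

Because the functions are pure and take the viewport as an argument, they can be unit-tested without a browser or a rendered PDF, which is presumably why the commit extracts the math into its own module. In this fractional scheme, values that do not divide evenly may differ from the original by floating-point rounding, so a test would typically compare within a small tolerance.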

Adds Vite app shell (index.html, src/app/SpikeApp.tsx) so the spike is
exercisable via pnpm dev. tsconfig --noEmit prevents tsc -b from
littering src/ with stray .js outputs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 02:21:31 +02:00

2 Commits