diff --git a/docs/decisions/ADR-0005-persistence.md b/docs/decisions/ADR-0005-persistence.md index 120dead..8bece91 100644 --- a/docs/decisions/ADR-0005-persistence.md +++ b/docs/decisions/ADR-0005-persistence.md @@ -1,38 +1,85 @@ -# ADR-0005 — Persistence layer (MVP and beyond) +# ADR-0005 — Persistence for the MVP slice -- Status: proposed -- Date: 2026-05-24 -- Workplan: CE-WP-0001-T07 (stub); MVP placeholder in CE-WP-0002-T08 +- Status: accepted (provisional — durable storage owned by a later workplan) +- Date: 2026-05-25 +- Workplan: CE-WP-0002-T08 (click-to-reopen requires reload-survival) ## Context -The MVP needs persistence so that "click an evidence item and have the PDF -jump to and highlight the passage — even after a full page reload" works -(PRD §20 step 4). The acceptable MVP shortcut is `localStorage` (decided -explicitly in CE-WP-0002-T08). +CE-WP-0002 needs the click-to-reopen flow to survive a page reload (PRD +scenario step 4 → "even after a full page reload"). The full persistence +design (SQLite local-first vs Postgres server-first) is too large to land +inside this slice — `wiki/ArchitectureOverview.md` §10 lays out the bigger +picture but the workplan explicitly defers the decision. -This ADR is the durable home for the real persistence decision: where do -documents, annotations, evidence items, links, and sets live in v1.0? +The engine already runs `Map`-backed in-memory repositories +(`src/engine/repos/in-memory.ts`). To survive reloads we need *some* +persistence boundary now, without committing to the long-term store. ## Options -- **A. Browser-local only (IndexedDB via `idb` or `dexie`)** - - Pros: zero infra; great for a single-user reference workspace. - - Cons: no cross-device sync; export/import only via files. - -- **B. Local-first + sync server (e.g. CRDT-backed)** - - Pros: matches the long-term vision of a workspace tool; conflict-free - multi-device. - - Cons: significant infra and CRDT design cost; out of MVP scope. - -- **C. 
Traditional client/server with a REST or GraphQL API** - - Pros: familiar; easy team-sharing story. - - Cons: requires hosting; loses the local-first character. +- **A. localStorage snapshot (this ADR).** The SPA serializes the entire + engine state into a single JSON blob on every mutation and restores it + on mount. No new dependencies; no schema migrations; no networking. + Per-tab only. +- **B. IndexedDB-backed store.** More headroom, more API surface, async + reads. Needed eventually for binary blobs (PDF bytes) but overkill for + the few hundred annotations the MVP produces. +- **C. SQLite via `sql.js` or `wa-sqlite`.** Brings query semantics into + the browser. Heavy for the MVP and entangles us with a database we may + not keep. +- **D. Server-backed persistence from day one.** Requires shipping a + backend. Premature. ## Decision -(blank — to be answered before the second product slice past MVP.) +Adopt **A: localStorage snapshot**, deliberately temporary. + +Implementation lives in `src/engine/persistence.ts`: + +- `captureSnapshot(engine)` returns + `{ documents, representations, annotations, evidenceItems }`. +- `attachPersister(engine, { key })` subscribes to every mutating engine + event and writes a fresh snapshot to `localStorage` after each. +- `restoreFromStorage(engine, { key })` reads the snapshot on app mount + and hydrates the repos *directly* (bypassing service `create()` calls) + so no spurious `*Created` events fire — the persister would otherwise + loop on its own writes, and other UI listeners would see "the same + annotation was created again" on every reload. +- Snapshot is versioned (`SNAPSHOT_VERSION = 1`); a version mismatch + throws on restore so a future schema bump is loud. + +`src/work/EngineContext.tsx`'s `EngineProvider` wires this on first mount. +A sibling localStorage key holds the last-active `documentId` so reload +lands the user back on the same fixture. 
+ +## Why this is acceptable for the MVP + +- The engine never holds PDF bytes — only metadata + selectors + commentary. + A typical session is well under 1 MB even with hundreds of annotations, + comfortably within the ~5 MB localStorage budget. +- The repositories' `create()` signatures already match the shape an + eventual durable repo would expose; swapping the implementation is a + localised change. +- "Survives reload" is the only persistence requirement of CE-WP-0002. + Cross-device sync, multi-user access, query-by-tag, history — none are + in scope yet. + +## What this defers + +- A real persistence ADR (SQLite local-first vs Postgres server-first vs + IndexedDB) for CE-WP-0005+ work. +- PDF byte persistence. Today the SPA re-fetches `/fixtures/pdfs/*` on + load; bytes do not enter the snapshot. +- Multi-tab consistency. Tabs see each other's writes only on reload. +- Migrations beyond the version check. ## Consequences -(blank) +- `src/engine/persistence.ts` is the single point of contact for storage. + When the real durable-store ADR lands, that module is what changes. +- Tests inject a memory-Storage shim into `attachPersister` / + `restoreFromStorage` so they don't depend on a browser environment + (see `src/engine/persistence.test.ts`). +- Clearing the user's browser storage destroys all annotations — call + this out in the README once the MVP ships. diff --git a/fixtures/pdfs/manifest.json b/fixtures/pdfs/manifest.json index e36551d..0238919 100644 --- a/fixtures/pdfs/manifest.json +++ b/fixtures/pdfs/manifest.json @@ -1,14 +1,14 @@ { "_schema_version": 1, "_description": "PDF fixture corpus for citation-evidence selector tests. Each entry binds a stable id (used by test code) to a file path, page count, and a verbatim known-good quote with its 1-indexed physical PDF page number. 
The quote is short, unique within the document, and chosen to round-trip cleanly through the canonical text normalizer.", - "_provenance": "Page counts and quotes extracted on 2026-05-24 by reading each PDF directly. The Betriebskosten file is a scanned/handwritten form with noisy OCR text — its quote is taken from the reliably-extracted printed boilerplate, not from the handwritten fields.", + "_provenance": "Page counts and quotes extracted on 2026-05-24 by reading each PDF directly, then re-verified on 2026-05-25 against the PDF.js v4 text extractor used by src/source/pdf/extract.ts. The Betriebskosten file is a scanned/handwritten form with noisy OCR text — its known-good quote was updated 2026-05-25 from 'Ich bitte um Überweisung auf das Konto bei' to 'Auf der Rückseite finden Sie Ihre Abrechnung' because PDF.js drops the capital-Ü in the original (the lowercase-ü in 'Rückseite' survives, so the new quote still exercises the umlaut code path).", "fixtures": [ { "id": "betriebskosten-2024", "filename": "031-Kemal Güldag Betriebskosten 2024.pdf", "description": "German Betriebskostenabrechnung (utility-cost statement) for a Seeheim apartment — scanned cover letter + filled-in Abrechnung form. OCR-noisy text and handwritten field values. Useful for stress-testing canonical normalization and selector resolution on imperfect extraction.", "page_count": 2, - "known_good_quote": "Ich bitte um Überweisung auf das Konto bei", + "known_good_quote": "Auf der Rückseite finden Sie Ihre Abrechnung", "known_good_quote_page": 1, "characteristics": ["german", "umlauts", "scanned", "ocr-noisy", "form", "handwritten"] }, diff --git a/index.html b/index.html index 7a0522b..992f3b0 100644 --- a/index.html +++ b/index.html @@ -3,7 +3,7 @@ - citation-evidence · spike + citation-evidence
diff --git a/package.json b/package.json index 75fbc0e..51909a0 100644 --- a/package.json +++ b/package.json @@ -25,6 +25,9 @@ "react-pdf-highlighter-plus": "^1.1.4" }, "devDependencies": { + "@testing-library/dom": "^10.4.1", + "@testing-library/react": "^16.3.2", + "@testing-library/user-event": "^14.6.1", "@types/node": "^20.14.0", "@types/react": "^18.3.3", "@types/react-dom": "^18.3.0", @@ -34,6 +37,7 @@ "eslint-plugin-boundaries": "^4.2.2", "eslint-plugin-import": "^2.30.0", "globals": "^15.9.0", + "happy-dom": "^20.9.0", "typescript": "^5.5.4", "typescript-eslint": "^8.0.0", "vite": "^5.4.0", diff --git a/pnpm-lock.yaml b/pnpm-lock.yaml index 1b9f927..a07aa88 100644 --- a/pnpm-lock.yaml +++ b/pnpm-lock.yaml @@ -21,6 +21,15 @@ importers: specifier: ^1.1.4 version: 1.1.4(@types/react-dom@18.3.7(@types/react@18.3.29))(@types/react@18.3.29)(pdfjs-dist@4.10.38)(react-dom@18.3.1(react@18.3.1))(react@18.3.1) devDependencies: + '@testing-library/dom': + specifier: ^10.4.1 + version: 10.4.1 + '@testing-library/react': + specifier: ^16.3.2 + version: 16.3.2(@testing-library/dom@10.4.1)(@types/react-dom@18.3.7(@types/react@18.3.29))(@types/react@18.3.29)(react-dom@18.3.1(react@18.3.1))(react@18.3.1) + '@testing-library/user-event': + specifier: ^14.6.1 + version: 14.6.1(@testing-library/dom@10.4.1) '@types/node': specifier: ^20.14.0 version: 20.19.41 @@ -48,6 +57,9 @@ importers: globals: specifier: ^15.9.0 version: 15.15.0 + happy-dom: + specifier: ^20.9.0 + version: 20.9.0 typescript: specifier: ^5.5.4 version: 5.9.3 @@ -59,7 +71,7 @@ importers: version: 5.4.21(@types/node@20.19.41) vitest: specifier: ^2.0.5 - version: 2.1.9(@types/node@20.19.41) + version: 2.1.9(@types/node@20.19.41)(happy-dom@20.9.0) packages: @@ -134,6 +146,10 @@ packages: peerDependencies: '@babel/core': ^7.0.0-0 + '@babel/runtime@7.29.2': + resolution: {integrity: sha512-JiDShH45zKHWyGe4ZNVRrCjBz8Nh9TMmZG1kh4QTK8hCBTWBi8Da+i7s1fJw7/lYpM4ccepSNfqzZ/QvABBi5g==} + engines: {node: '>=6.9.0'} + 
'@babel/template@7.28.6': resolution: {integrity: sha512-YA6Ma2KsCdGb+WC6UpBVFJGXL58MDA6oyONbjyF/+5sBgxY/dwkhLogbMT2GXXyU84/IhRw/2D1Os1B/giz+BQ==} engines: {node: '>=6.9.0'} @@ -1051,9 +1067,37 @@ packages: '@tanstack/virtual-core@3.15.0': resolution: {integrity: sha512-0AwPGx0I8QxPYjAxShT/+z+ZOe9u8mW5rsXvivCTjRfRmz9a43+3mRyi4wwlyoUqOC56q/jatKa0Bh9M99BEHQ==} + '@testing-library/dom@10.4.1': + resolution: {integrity: sha512-o4PXJQidqJl82ckFaXUeoAW+XysPLauYI43Abki5hABd853iMhitooc6znOnczgbTYmEP6U6/y1ZyKAIsvMKGg==} + engines: {node: '>=18'} + + '@testing-library/react@16.3.2': + resolution: {integrity: sha512-XU5/SytQM+ykqMnAnvB2umaJNIOsLF3PVv//1Ew4CTcpz0/BRyy/af40qqrt7SjKpDdT1saBMc42CUok5gaw+g==} + engines: {node: '>=18'} + peerDependencies: + '@testing-library/dom': ^10.0.0 + '@types/react': ^18.0.0 || ^19.0.0 + '@types/react-dom': ^18.0.0 || ^19.0.0 + react: ^18.0.0 || ^19.0.0 + react-dom: ^18.0.0 || ^19.0.0 + peerDependenciesMeta: + '@types/react': + optional: true + '@types/react-dom': + optional: true + + '@testing-library/user-event@14.6.1': + resolution: {integrity: sha512-vq7fv0rnt+QTXgPxr5Hjc210p6YKq2kmdziLgnsZGgLJ9e6VAShx1pACLuRjd/AS/sr7phAR58OIIpf0LlmQNw==} + engines: {node: '>=12', npm: '>=6'} + peerDependencies: + '@testing-library/dom': '>=7.21.4' + '@tybys/wasm-util@0.10.2': resolution: {integrity: sha512-RoBvJ2X0wuKlWFIjrwffGw1IqZHKQqzIchKaadZZfnNpsAYp2mM0h36JtPCjNDAHGgYez/15uMBpfGwchhiMgg==} + '@types/aria-query@5.0.4': + resolution: {integrity: sha512-rfT93uj5s0PRL7EzccGMs3brplhcrghnDoV26NqKhCAS1hVo+WdNsPvE/yb6ilfr5hi2MEk6d5EWJTKdxg8jVw==} + '@types/babel__core@7.20.5': resolution: {integrity: sha512-qoQprZvz5wQFJwMDqeseRXWv3rqMvhgpbXFfVyWhbx9X47POIA6i/+dXefEmZKoAgOaTdaIgNSMqMIU61yRyzA==} @@ -1092,6 +1136,12 @@ packages: '@types/react@18.3.29': resolution: {integrity: sha512-ch0qJdr2JY0r04NXSprbK6TXOgnaJ1Tz23fm5W+z0/CBah6BSBc3n96h7K9GOtwh0HrilNWHIBzE1Ko4Dcw/Wg==} + '@types/whatwg-mimetype@3.0.2': + resolution: {integrity: 
sha512-c2AKvDT8ToxLIOUlN51gTiHXflsfIFisS4pO7pDPoKouJCESkhZnEy623gwP9laCy5lnLDAw1vAzu2vM2YLOrA==} + + '@types/ws@8.18.1': + resolution: {integrity: sha512-ThVF6DCVhA8kUGy+aazFQ4kXQ7E1Ty7A3ypFOe0IcJV8O/M511G99AW24irKrW56Wt44yG9+ij8FaqoBGkuBXg==} + '@typescript-eslint/eslint-plugin@8.59.4': resolution: {integrity: sha512-PegsU+XfyJJNjd4+u/k6f9yTyp0lEXXiPopUNobZcIAUJFGICFLN+sP0Rb3JehVmiij1Ph0dFGYqODoRo/2+6A==} engines: {node: ^18.18.0 || ^20.9.0 || >=21.1.0} @@ -1309,10 +1359,18 @@ packages: ajv@6.15.0: resolution: {integrity: sha512-fgFx7Hfoq60ytK2c7DhnF8jIvzYgOMxfugjLOSMHjLIPgenqa7S7oaagATUq99mV6IYvN2tRmC0wnTYX6iPbMw==} + ansi-regex@5.0.1: + resolution: {integrity: sha512-quJQXlTSUGL2LH9SUXo8VwsY4soanhgo6LNSm84E1LBcE8s3O0wpdiRzyR9z/ZZJMlMWv37qOOb9pdJlMUEKFQ==} + engines: {node: '>=8'} + ansi-styles@4.3.0: resolution: {integrity: sha512-zbB9rCJAT1rbjiVDb2hqKFHNYLxgtk8NURxZ3IZwD3F6NtxbXZQCnnSi1Lkx+IDohdPlFp222wVALIheZJQSEg==} engines: {node: '>=8'} + ansi-styles@5.2.0: + resolution: {integrity: sha512-Cxwpt2SfTzTtXcfOlzGEee8O+c+MmUgGrNiBcXnuWxuFJHe6a5Hz7qwhwe5OgaSYI0IJvkLqWX1ASG+cJOkEiA==} + engines: {node: '>=10'} + argparse@2.0.1: resolution: {integrity: sha512-8+9WqebbFzpX9OR+Wa6O29asIogeRMzcGtAINdpMHHyAg10f05aSFVBbcEqGf/PXw1EjAZ+q2/bEBg3DvurK3Q==} @@ -1320,6 +1378,9 @@ packages: resolution: {integrity: sha512-ik3ZgC9dY/lYVVM++OISsaYDeg1tb0VtP5uL3ouh1koGOaUMDPpbFIei4JkFimWUFPn90sbMNMXQAIVOlnYKJA==} engines: {node: '>=10'} + aria-query@5.3.0: + resolution: {integrity: sha512-b0P0sZPKtyu8HkeRAfCq0IfURZK+SuwMjY1UXGBU27wpAiTwQAIlq56IbIO+ytk/JjS1fMR14ee5WBBfKi5J6A==} + array-buffer-byte-length@1.0.2: resolution: {integrity: sha512-LHE+8BuR7RYGDKvnrmcuSq3tDcKv9OFEXQt/HpbZhY7V6h0zlUXutnAD82GiFx9rdieCMjkvtcsPqBwgUl1Iiw==} engines: {node: '>= 0.4'} @@ -1490,6 +1551,10 @@ packages: resolution: {integrity: sha512-8QmQKqEASLd5nx0U1B1okLElbUuuttJ/AnYmRXbbbGDWh6uS208EjD4Xqq/I9wK7u0v6O08XhTWnt5XtEbR6Dg==} engines: {node: '>= 0.4'} + dequal@2.0.3: + resolution: {integrity: 
sha512-0je+qPKHEMohvfRTCEo3CrPG6cAzAYgmzKyxRiYSSDkS6eGJdyVJm7WaYA5ECaAD9wLB2T4EEeymA5aFVcYXCA==} + engines: {node: '>=6'} + detect-node-es@1.1.0: resolution: {integrity: sha512-ypdmJU/TbBby2Dxibuv7ZLW3Bs1QEmM7nHjEANfohJLvE0XVujisn1qPJcZxg+qDucsr+bP6fLD1rPS3AhJ7EQ==} @@ -1497,6 +1562,9 @@ packages: resolution: {integrity: sha512-35mSku4ZXK0vfCuHEDAwt55dg2jNajHZ1odvF+8SSr82EsZY4QmXfuWso8oEd8zRhVObSN18aM0CjSdoBX7zIw==} engines: {node: '>=0.10.0'} + dom-accessibility-api@0.5.16: + resolution: {integrity: sha512-X7BJ2yElsnOJ30pZF4uIIDfBEVgF4XEBxL9Bxhy6dnrm5hkzqmsWHGTiHqRiITNhMyFLyAiWndIJP7Z1NTteDg==} + dunder-proto@1.0.1: resolution: {integrity: sha512-KIN/nDJBQRcXw0MLVhZE9iQHmG68qAVIBg9CqmUYjmQIhgij9U5MFvrqkUL5FbtyyzZuOeOt0zdeRe4UY7ct+A==} engines: {node: '>= 0.4'} @@ -1504,6 +1572,10 @@ packages: electron-to-chromium@1.5.361: resolution: {integrity: sha512-Q6Hts7N9FnJc5LeGRINFvLhCI9xZmNtTDe5ZbcVezQz7cU4a8Aua3GH1b8J2XY8Al9PF+OCwYqhgsOOheMdvkA==} + entities@7.0.1: + resolution: {integrity: sha512-TWrgLOFUQTH994YUyl1yT4uyavY5nNB5muff+RtWaqNVCAK408b5ZnnbNAUEWLTCpum9w6arT70i1XdQ4UeOPA==} + engines: {node: '>=0.12'} + es-abstract@1.24.2: resolution: {integrity: sha512-2FpH9Q5i2RRwyEP1AylXe6nYLR5OhaJTZwmlcP0dL/+JCbgg7yyEo/sEK6HeGZRf3dFpWwThaRHVApXSkW3xeg==} engines: {node: '>= 0.4'} @@ -1781,6 +1853,10 @@ packages: resolution: {integrity: sha512-ZUKRh6/kUFoAiTAtTYPZJ3hw9wNxx+BIBOijnlG9PnrJsCcSjs1wyyD6vJpaYtgnzDrKYRSqf3OO6Rfa93xsRg==} engines: {node: '>= 0.4'} + happy-dom@20.9.0: + resolution: {integrity: sha512-GZZ9mKe8r646NUAf/zemnGbjYh4Bt8/MqASJY+pSm5ZDtc3YQox+4gsLI7yi1hba6o+eCsGxpHn5+iEVn31/FQ==} + engines: {node: '>=20.0.0'} + has-bigints@1.1.0: resolution: {integrity: sha512-R3pbpkcIqv2Pm3dUwgjclDRVmWpTJW2DcMzcIhEXEx1oh/CEMObMm3KLmRJOdvhM7o4uQBnwr8pzRK2sJWIqfg==} engines: {node: '>= 0.4'} @@ -1999,6 +2075,10 @@ packages: peerDependencies: react: ^16.5.1 || ^17.0.0 || ^18.0.0 || ^19.0.0 + lz-string@1.5.0: + resolution: {integrity: 
sha512-h5bgJWpxJNswbU7qCrV0tIKQCaS3blPDrqKWx+QxzuzL1zGUzij9XCWLrSLsJPu5t+eWA/ycetzYAO5IOMcWAQ==} + hasBin: true + magic-string@0.30.21: resolution: {integrity: sha512-vd2F4YUyEXKGcLHoq+TEyCjxueSeHnFxyyjNp80yg0XV4vUhnDer/lvvlqM/arB5bXQN5K2/3oinyCRyx8T2CQ==} @@ -2147,6 +2227,10 @@ packages: resolution: {integrity: sha512-vkcDPrRZo1QZLbn5RLGPpg/WmIQ65qoWWhcGKf/b5eplkkarX0m9z8ppCat4mlOqUsWpyNuYgO3VRyrYHSzX5g==} engines: {node: '>= 0.8.0'} + pretty-format@27.5.1: + resolution: {integrity: sha512-Qb1gy5OrP5+zDf2Bvnzdl3jsTf1qXVMazbvCoKhtKqVs4/YK4ozX4gKQJJVyNe+cajNPn0KoC0MC3FUmaHWEmQ==} + engines: {node: ^10.13.0 || ^12.13.0 || ^14.15.0 || >=15.0.0} + prop-types@15.8.1: resolution: {integrity: sha512-oj87CgZICdulUohogVAR7AjlC0327U4el4L6eAvOqCeudMDVU0NThNaV+b9Df4dXgSP1gXMTnPdhfe/2qDH5cg==} @@ -2174,6 +2258,9 @@ packages: react-is@16.13.1: resolution: {integrity: sha512-24e6ynE2H+OKt4kqsOvNd8kBpV65zoxbA4BVsEOB3ARVWQki/DHzaUoC5KuON/BiccDaCCTZBuOcfZs70kR8bQ==} + react-is@17.0.2: + resolution: {integrity: sha512-w2GsyukL62IJnlaff/nRegPQR94C/XXamvMWmSHRJ4y7Ts/4ocGRmTHvOs8PSE6pB3dWOrD/nueuU5sduBsQ4w==} + react-pdf-highlighter-plus@1.1.4: resolution: {integrity: sha512-cJPFZnKjp4mmPjnamh11eC2I0W4waFAwLLG1E3mTg4TQRyMyUY+C6SyUm8MAcQnogbaXIAvCXP9B4hsnTSflnA==} peerDependencies: @@ -2542,6 +2629,10 @@ packages: jsdom: optional: true + whatwg-mimetype@3.0.0: + resolution: {integrity: sha512-nt+N2dzIutVRxARx1nghPKGv1xHikU7HKdfafKkLNLindmPU/ch3U31NOCGGA/dmPcmb1VlofO0vnKAcsm0o/Q==} + engines: {node: '>=12'} + which-boxed-primitive@1.1.1: resolution: {integrity: sha512-TbX3mj8n0odCBFVlY8AxkqcHASw3L60jIuF8jFP78az3C2YhmGvqbHBpAjTRH2/xqYunrJ9g1jSyjCjpoWzIAA==} engines: {node: '>= 0.4'} @@ -2572,6 +2663,18 @@ packages: resolution: {integrity: sha512-BN22B5eaMMI9UMtjrGd5g5eCYPpCPDUy0FJXbYsaT5zYxjFOckS53SQDE3pWkVoWpHXVb3BrYcEN4Twa55B5cA==} engines: {node: '>=0.10.0'} + ws@8.21.0: + resolution: {integrity: 
sha512-Vsp28b7DRcimFQvrqu2Wek3z1iYxDCWqHYB8Qsnk/S4RfaCQzPGPyBNuVjJV3cd6UiKtUtp6sNM77gWvzcCH+g==} + engines: {node: '>=10.0.0'} + peerDependencies: + bufferutil: ^4.0.1 + utf-8-validate: '>=5.0.2' + peerDependenciesMeta: + bufferutil: + optional: true + utf-8-validate: + optional: true + yallist@3.1.1: resolution: {integrity: sha512-a4UGQaWPH59mOXUYnAG2ewncQS4i4F43Tv3JoAM+s2VDAmS9NsK8GpDMLrCHPksFT7h3K6TOoUNn2pb7RoXx4g==} @@ -2670,6 +2773,8 @@ snapshots: '@babel/core': 7.29.0 '@babel/helper-plugin-utils': 7.28.6 + '@babel/runtime@7.29.2': {} + '@babel/template@7.28.6': dependencies: '@babel/code-frame': 7.29.0 @@ -3482,11 +3587,38 @@ snapshots: '@tanstack/virtual-core@3.15.0': {} + '@testing-library/dom@10.4.1': + dependencies: + '@babel/code-frame': 7.29.0 + '@babel/runtime': 7.29.2 + '@types/aria-query': 5.0.4 + aria-query: 5.3.0 + dom-accessibility-api: 0.5.16 + lz-string: 1.5.0 + picocolors: 1.1.1 + pretty-format: 27.5.1 + + '@testing-library/react@16.3.2(@testing-library/dom@10.4.1)(@types/react-dom@18.3.7(@types/react@18.3.29))(@types/react@18.3.29)(react-dom@18.3.1(react@18.3.1))(react@18.3.1)': + dependencies: + '@babel/runtime': 7.29.2 + '@testing-library/dom': 10.4.1 + react: 18.3.1 + react-dom: 18.3.1(react@18.3.1) + optionalDependencies: + '@types/react': 18.3.29 + '@types/react-dom': 18.3.7(@types/react@18.3.29) + + '@testing-library/user-event@14.6.1(@testing-library/dom@10.4.1)': + dependencies: + '@testing-library/dom': 10.4.1 + '@tybys/wasm-util@0.10.2': dependencies: tslib: 2.8.1 optional: true + '@types/aria-query@5.0.4': {} + '@types/babel__core@7.20.5': dependencies: '@babel/parser': 7.29.3 @@ -3531,6 +3663,12 @@ snapshots: '@types/prop-types': 15.7.15 csstype: 3.2.3 + '@types/whatwg-mimetype@3.0.2': {} + + '@types/ws@8.18.1': + dependencies: + '@types/node': 20.19.41 + '@typescript-eslint/eslint-plugin@8.59.4(@typescript-eslint/parser@8.59.4(eslint@9.39.4)(typescript@5.9.3))(eslint@9.39.4)(typescript@5.9.3)': dependencies: 
'@eslint-community/regexpp': 4.12.2 @@ -3757,16 +3895,24 @@ snapshots: json-schema-traverse: 0.4.1 uri-js: 4.4.1 + ansi-regex@5.0.1: {} + ansi-styles@4.3.0: dependencies: color-convert: 2.0.1 + ansi-styles@5.2.0: {} + argparse@2.0.1: {} aria-hidden@1.2.6: dependencies: tslib: 2.8.1 + aria-query@5.3.0: + dependencies: + dequal: 2.0.3 + array-buffer-byte-length@1.0.2: dependencies: call-bound: 1.0.4 @@ -3956,12 +4102,16 @@ snapshots: has-property-descriptors: 1.0.2 object-keys: 1.1.1 + dequal@2.0.3: {} + detect-node-es@1.1.0: {} doctrine@2.1.0: dependencies: esutils: 2.0.3 + dom-accessibility-api@0.5.16: {} + dunder-proto@1.0.1: dependencies: call-bind-apply-helpers: 1.0.2 @@ -3970,6 +4120,8 @@ snapshots: electron-to-chromium@1.5.361: {} + entities@7.0.1: {} + es-abstract@1.24.2: dependencies: array-buffer-byte-length: 1.0.2 @@ -4352,6 +4504,18 @@ snapshots: gopd@1.2.0: {} + happy-dom@20.9.0: + dependencies: + '@types/node': 20.19.41 + '@types/whatwg-mimetype': 3.0.2 + '@types/ws': 8.18.1 + entities: 7.0.1 + whatwg-mimetype: 3.0.0 + ws: 8.21.0 + transitivePeerDependencies: + - bufferutil + - utf-8-validate + has-bigints@1.1.0: {} has-flag@4.0.0: {} @@ -4558,6 +4722,8 @@ snapshots: dependencies: react: 18.3.1 + lz-string@1.5.0: {} + magic-string@0.30.21: dependencies: '@jridgewell/sourcemap-codec': 1.5.5 @@ -4704,6 +4870,12 @@ snapshots: prelude-ls@1.2.1: {} + pretty-format@27.5.1: + dependencies: + ansi-regex: 5.0.1 + ansi-styles: 5.2.0 + react-is: 17.0.2 + prop-types@15.8.1: dependencies: loose-envify: 1.4.0 @@ -4732,6 +4904,8 @@ snapshots: react-is@16.13.1: {} + react-is@17.0.2: {} + react-pdf-highlighter-plus@1.1.4(@types/react-dom@18.3.7(@types/react@18.3.29))(@types/react@18.3.29)(pdfjs-dist@4.10.38)(react-dom@18.3.1(react@18.3.1))(react@18.3.1): dependencies: '@radix-ui/react-collapsible': 1.1.12(@types/react-dom@18.3.7(@types/react@18.3.29))(@types/react@18.3.29)(react-dom@18.3.1(react@18.3.1))(react@18.3.1) @@ -5179,7 +5353,7 @@ snapshots: '@types/node': 
20.19.41 fsevents: 2.3.3 - vitest@2.1.9(@types/node@20.19.41): + vitest@2.1.9(@types/node@20.19.41)(happy-dom@20.9.0): dependencies: '@vitest/expect': 2.1.9 '@vitest/mocker': 2.1.9(vite@5.4.21(@types/node@20.19.41)) @@ -5203,6 +5377,7 @@ snapshots: why-is-node-running: 2.3.0 optionalDependencies: '@types/node': 20.19.41 + happy-dom: 20.9.0 transitivePeerDependencies: - less - lightningcss @@ -5214,6 +5389,8 @@ snapshots: - supports-color - terser + whatwg-mimetype@3.0.0: {} + which-boxed-primitive@1.1.1: dependencies: is-bigint: 1.1.0 @@ -5266,6 +5443,8 @@ snapshots: word-wrap@1.2.5: {} + ws@8.21.0: {} + yallist@3.1.1: {} yocto-queue@0.1.0: {} diff --git a/src/anchor/index.ts b/src/anchor/index.ts index 96abf3f..9da25b4 100644 --- a/src/anchor/index.ts +++ b/src/anchor/index.ts @@ -5,3 +5,9 @@ export { type PdfSpikeViewerProps, type StoredAnnotation, } from "./pdf-viewer-adapter-spike"; +export { + createSelectors, + resolveSelectors, + DEFAULT_CONTEXT_CHARS, + type CreateSelectorsOptions, +} from "./selectors"; diff --git a/src/anchor/selectors/create.test.ts b/src/anchor/selectors/create.test.ts new file mode 100644 index 0000000..e39c516 --- /dev/null +++ b/src/anchor/selectors/create.test.ts @@ -0,0 +1,136 @@ +import { describe, expect, it } from "vitest"; +import type { DocumentRepresentation } from "@shared/document"; +import type { DocumentId, RepresentationId } from "@shared/ids"; +import type { + PdfPageTextSelector, + PdfRectSelector, + TextPositionSelector, + TextQuoteSelector, +} from "@shared/selector"; +import { createSelectors } from "./create"; +import type { PdfSelectionCapture } from "../types"; + +function repr(canonicalText: string): DocumentRepresentation { + const pageLength = canonicalText.length; + return { + id: "rep_test" as RepresentationId, + documentId: "doc_test" as DocumentId, + representationType: "pdf-text", + contentHash: "test", + canonicalText, + pageMap: [{ page: 1, width: 595, height: 842 }], + offsetMap: [ + { page: 1, 
globalStart: 0, globalEnd: pageLength, pageLength }, + ], + generatedAt: "2026-05-25T00:00:00.000Z", + }; +} + +function capture(text: string, page = 1, rectsCount = 1): PdfSelectionCapture { + return { + kind: "pdf", + text, + page, + rects: Array.from({ length: rectsCount }, (_, i) => ({ + x: 0.1, + y: 0.2 + i * 0.05, + width: 0.5, + height: 0.04, + })), + boundingRect: { x: 0.1, y: 0.2, width: 0.5, height: 0.04 * rectsCount }, + }; +} + +describe("createSelectors", () => { + const text = "The quick brown fox jumps over the lazy dog near the river bank."; + const representation = repr(text); + + it("always includes a TextQuoteSelector with prefix and suffix from canonical text", () => { + const sels = createSelectors(capture("brown fox"), representation); + const quote = sels.find((s): s is TextQuoteSelector => s.type === "TextQuoteSelector"); + expect(quote).toBeDefined(); + expect(quote!.exact).toBe("brown fox"); + expect(quote!.prefix).toBe("The quick "); + expect(quote!.suffix).toBe(" jumps over the lazy dog near th"); + }); + + it("includes a TextPositionSelector pointing at the matched offset", () => { + const sels = createSelectors(capture("brown fox"), representation); + const pos = sels.find((s): s is TextPositionSelector => s.type === "TextPositionSelector"); + expect(pos).toBeDefined(); + expect(pos!.start).toBe(text.indexOf("brown fox")); + expect(pos!.end).toBe(text.indexOf("brown fox") + "brown fox".length); + }); + + it("includes a PdfRectSelector mirroring the capture's page and rects", () => { + const c = capture("brown fox", 1, 2); + const sels = createSelectors(c, representation); + const rect = sels.find((s): s is PdfRectSelector => s.type === "PdfRectSelector"); + expect(rect).toBeDefined(); + expect(rect!.page).toBe(1); + expect(rect!.rects).toEqual(c.rects); + }); + + it("includes a PdfPageTextSelector when the match falls inside the capture's page range", () => { + const sels = createSelectors(capture("brown fox"), representation); + const 
pageText = sels.find((s): s is PdfPageTextSelector => s.type === "PdfPageTextSelector"); + expect(pageText).toBeDefined(); + expect(pageText!.page).toBe(1); + expect(pageText!.start).toBe(text.indexOf("brown fox")); + }); + + it("omits the TextPositionSelector when the quote cannot be found in canonical text", () => { + const sels = createSelectors(capture("nonexistent phrase"), representation); + const pos = sels.find((s) => s.type === "TextPositionSelector"); + expect(pos).toBeUndefined(); + const quote = sels.find((s): s is TextQuoteSelector => s.type === "TextQuoteSelector"); + expect(quote!.exact).toBe("nonexistent phrase"); + expect(quote!.prefix).toBeUndefined(); + expect(quote!.suffix).toBeUndefined(); + }); + + it("clamps prefix at the start of the canonical text", () => { + const sels = createSelectors(capture("The quick"), representation); + const quote = sels.find((s): s is TextQuoteSelector => s.type === "TextQuoteSelector")!; + expect(quote.prefix).toBeUndefined(); + expect(quote.suffix).toBe(" brown fox jumps over the lazy d"); + }); + + it("clamps suffix at the end of the canonical text", () => { + const sels = createSelectors(capture("river bank."), representation); + const quote = sels.find((s): s is TextQuoteSelector => s.type === "TextQuoteSelector")!; + expect(quote.prefix).toBe("umps over the lazy dog near the "); + expect(quote.suffix).toBeUndefined(); + }); + + it("honors a custom contextChars option", () => { + const sels = createSelectors(capture("brown fox"), representation, { contextChars: 4 }); + const quote = sels.find((s): s is TextQuoteSelector => s.type === "TextQuoteSelector")!; + expect(quote.prefix).toBe("ick "); + expect(quote.suffix).toBe(" jum"); + }); + + it("prefers the on-page match when the quote appears on multiple pages", () => { + // Two-page representation where the quote appears once per page. 
+ const canonical = "alpha echo bravo" + "\n\n" + "charlie echo delta"; + const rep: DocumentRepresentation = { + id: "rep_multi" as RepresentationId, + documentId: "doc_multi" as DocumentId, + representationType: "pdf-text", + contentHash: "h", + canonicalText: canonical, + pageMap: [ + { page: 1, width: 100, height: 100 }, + { page: 2, width: 100, height: 100 }, + ], + offsetMap: [ + { page: 1, globalStart: 0, globalEnd: 18, pageLength: 18 }, + { page: 2, globalStart: 18, globalEnd: canonical.length, pageLength: canonical.length - 18 }, + ], + generatedAt: "2026-05-25T00:00:00.000Z", + }; + const sels = createSelectors(capture("echo", 2), rep); + const pos = sels.find((s): s is TextPositionSelector => s.type === "TextPositionSelector")!; + expect(pos.start).toBe(canonical.indexOf("echo", 18)); + }); +}); diff --git a/src/anchor/selectors/create.ts b/src/anchor/selectors/create.ts new file mode 100644 index 0000000..5eb9ebc --- /dev/null +++ b/src/anchor/selectors/create.ts @@ -0,0 +1,157 @@ +/** + * Build the maximal `Selector[]` from a viewer's `SelectionCapture`. + * + * Implements the "always store all selector types that are available" rule + * from `wiki/SharedContracts.md` §3 (selector redundancy) and the create + * half of the `AnchorAdapter` contract in + * `wiki/ArchitectureOverview.md` §3.3. + * + * Output guarantee: every returned `Selector[]` includes a + * `TextQuoteSelector` (always) and adds `TextPositionSelector`, + * `PdfRectSelector`, `PdfPageTextSelector` only when the underlying data + * actually supports them. Resolvers can rely on the union being trimmed — + * a missing selector means "not available", not "skipped". 
+ */ + +import type { DocumentRepresentation } from "@shared/document"; +import { normalize } from "@shared/text/normalize"; +import type { + PdfPageTextSelector, + PdfRectSelector, + Selector, + TextPositionSelector, + TextQuoteSelector, +} from "@shared/selector"; + +import type { PdfSelectionCapture, SelectionCapture } from "../types"; + +/** Default characters of prefix/suffix context stored on TextQuoteSelector. */ +export const DEFAULT_CONTEXT_CHARS = 32; + +export interface CreateSelectorsOptions { + readonly contextChars?: number; +} + +export function createSelectors( + capture: SelectionCapture, + representation: DocumentRepresentation, + options: CreateSelectorsOptions = {}, +): Selector[] { + // `SelectionCapture` is a discriminated union. The DOM branch is `never` + // in MVP, so the only runtime shape is `PdfSelectionCapture`. + return createSelectorsFromPdfCapture(capture, representation, options); +} + +function createSelectorsFromPdfCapture( + capture: PdfSelectionCapture, + representation: DocumentRepresentation, + options: CreateSelectorsOptions, +): Selector[] { + const contextChars = options.contextChars ?? DEFAULT_CONTEXT_CHARS; + const normalizedQuote = normalize(capture.text).text; + const out: Selector[] = []; + + const canonicalText = representation.canonicalText ?? ""; + const positions = canonicalText.length > 0 && normalizedQuote.length > 0 + ? findAllOccurrences(canonicalText, normalizedQuote) + : []; + + // Locate the match that falls on the capture's page (when offsetMap is + // known); otherwise fall back to the first match. If there is no match, + // we still emit a quote-only TextQuoteSelector so the annotation is + // recoverable later if the representation is rebuilt. + const pageRange = representation.offsetMap?.find((r) => r.page === capture.page); + const matchOffset = pickMatch(positions, pageRange); + + // 1. TextQuoteSelector — always included. + if (normalizedQuote.length > 0) { + const quote = matchOffset !== null + ? 
buildQuoteSelectorWithContext(canonicalText, matchOffset, normalizedQuote, contextChars) + : ({ type: "TextQuoteSelector", exact: normalizedQuote } satisfies TextQuoteSelector); + out.push(quote); + } + + // 2. TextPositionSelector — only when we have a unique-enough match. + if (matchOffset !== null) { + const pos: TextPositionSelector = { + type: "TextPositionSelector", + start: matchOffset, + end: matchOffset + normalizedQuote.length, + }; + out.push(pos); + } + + // 3. PdfRectSelector — straight from the capture; viewer-coordinate truth. + if (capture.rects.length > 0) { + const rect: PdfRectSelector = { + type: "PdfRectSelector", + page: capture.page, + rects: capture.rects, + }; + out.push(rect); + } + + // 4. PdfPageTextSelector — when we have offsetMap and a unique-enough match + // that falls inside the capture's page range. + if (matchOffset !== null && pageRange) { + if (matchOffset >= pageRange.globalStart && matchOffset + normalizedQuote.length <= pageRange.globalEnd) { + const pageText: PdfPageTextSelector = { + type: "PdfPageTextSelector", + page: capture.page, + start: matchOffset - pageRange.globalStart, + end: matchOffset - pageRange.globalStart + normalizedQuote.length, + }; + out.push(pageText); + } + } + + return out; +} + +function findAllOccurrences(haystack: string, needle: string): number[] { + if (needle.length === 0) return []; + const out: number[] = []; + let from = 0; + for (;;) { + const idx = haystack.indexOf(needle, from); + if (idx === -1) break; + out.push(idx); + from = idx + 1; + } + return out; +} + +function pickMatch( + positions: readonly number[], + pageRange: { globalStart: number; globalEnd: number } | undefined, +): number | null { + if (positions.length === 0) return null; + if (positions.length === 1) return positions[0]!; + if (pageRange) { + const onPage = positions.find( + (p) => p >= pageRange.globalStart && p < pageRange.globalEnd, + ); + if (onPage !== undefined) return onPage; + } + // Multiple matches and no 
page hint — return the first; resolve.ts will + // need prefix/suffix to disambiguate. + return positions[0]!; +} + +function buildQuoteSelectorWithContext( + canonicalText: string, + matchOffset: number, + exact: string, + contextChars: number, +): TextQuoteSelector { + const prefixStart = Math.max(0, matchOffset - contextChars); + const suffixEnd = Math.min(canonicalText.length, matchOffset + exact.length + contextChars); + const prefix = canonicalText.slice(prefixStart, matchOffset); + const suffix = canonicalText.slice(matchOffset + exact.length, suffixEnd); + return { + type: "TextQuoteSelector", + exact, + ...(prefix.length > 0 ? { prefix } : {}), + ...(suffix.length > 0 ? { suffix } : {}), + }; +} diff --git a/src/anchor/selectors/index.ts b/src/anchor/selectors/index.ts new file mode 100644 index 0000000..f47543c --- /dev/null +++ b/src/anchor/selectors/index.ts @@ -0,0 +1,6 @@ +export { + createSelectors, + DEFAULT_CONTEXT_CHARS, + type CreateSelectorsOptions, +} from "./create"; +export { resolveSelectors } from "./resolve"; diff --git a/src/anchor/selectors/resolve.test.ts b/src/anchor/selectors/resolve.test.ts new file mode 100644 index 0000000..028f95a --- /dev/null +++ b/src/anchor/selectors/resolve.test.ts @@ -0,0 +1,137 @@ +import { describe, expect, it } from "vitest"; +import type { DocumentRepresentation } from "@shared/document"; +import type { DocumentId, RepresentationId } from "@shared/ids"; +import type { Selector } from "@shared/selector"; +import { resolveSelectors } from "./resolve"; + +function repr(canonicalText: string, pages = 1): DocumentRepresentation { + const segmentLen = pages === 1 + ? canonicalText.length + : Math.floor(canonicalText.length / pages); + const offsetMap = []; + for (let i = 0; i < pages; i++) { + const start = i * segmentLen; + const end = i === pages - 1 ? 
canonicalText.length : start + segmentLen; + offsetMap.push({ page: i + 1, globalStart: start, globalEnd: end, pageLength: end - start }); + } + return { + id: "rep_test" as RepresentationId, + documentId: "doc_test" as DocumentId, + representationType: "pdf-text", + contentHash: "test", + canonicalText, + pageMap: Array.from({ length: pages }, (_, i) => ({ page: i + 1, width: 595, height: 842 })), + offsetMap, + generatedAt: "2026-05-25T00:00:00.000Z", + }; +} + +describe("resolveSelectors", () => { + const text = "The quick brown fox jumps over the lazy dog."; + const representation = repr(text); + const brownFoxStart = text.indexOf("brown fox"); + const brownFoxEnd = brownFoxStart + "brown fox".length; + + it("returns 1.0 confidence when position and quote agree exactly", () => { + const selectors: Selector[] = [ + { type: "TextPositionSelector", start: brownFoxStart, end: brownFoxEnd }, + { type: "TextQuoteSelector", exact: "brown fox" }, + ]; + const r = resolveSelectors(selectors, representation); + expect(r.status).toBe("resolved"); + expect(r.confidence).toBe(1.0); + expect(r.candidates[0]?.textPosition).toEqual({ start: brownFoxStart, end: brownFoxEnd }); + expect(r.candidates[0]?.page).toBe(1); + expect(r.usedSelectorTypes).toEqual(["TextPositionSelector", "TextQuoteSelector"]); + }); + + it("falls back to quote search when position is stale, and records a warning", () => { + const selectors: Selector[] = [ + { type: "TextPositionSelector", start: 0, end: 9 }, // "The quick" + { type: "TextQuoteSelector", exact: "brown fox" }, + ]; + const r = resolveSelectors(selectors, representation); + expect(r.status).toBe("resolved"); + expect(r.confidence).toBe(0.95); + expect(r.candidates[0]?.textPosition).toEqual({ start: brownFoxStart, end: brownFoxEnd }); + expect(r.warnings?.[0]).toMatch(/did not match/); + expect(r.usedSelectorTypes).toEqual(["TextQuoteSelector"]); + }); + + it("returns 0.85 for a position-only selector with no quote to verify", () => { + 
const selectors: Selector[] = [ + { type: "TextPositionSelector", start: brownFoxStart, end: brownFoxEnd }, + ]; + const r = resolveSelectors(selectors, representation); + expect(r.status).toBe("resolved"); + expect(r.confidence).toBe(0.85); + }); + + it("returns 0.95 when only TextQuoteSelector is present and the quote is unique", () => { + const r = resolveSelectors( + [{ type: "TextQuoteSelector", exact: "brown fox" }], + representation, + ); + expect(r.status).toBe("resolved"); + expect(r.confidence).toBe(0.95); + }); + + it("returns 0.9 when a duplicated quote is disambiguated by prefix/suffix", () => { + const dup = "alpha echo bravo charlie echo delta"; + const r = resolveSelectors( + [{ type: "TextQuoteSelector", exact: "echo", prefix: "charlie ", suffix: " delta" }], + repr(dup), + ); + expect(r.status).toBe("resolved"); + expect(r.confidence).toBe(0.9); + expect(r.candidates[0]?.textPosition?.start).toBe(dup.indexOf("echo", 10)); + }); + + it("returns ambiguous when a duplicated quote cannot be disambiguated", () => { + const dup = "echo and echo"; + const r = resolveSelectors( + [{ type: "TextQuoteSelector", exact: "echo" }], + repr(dup), + ); + expect(r.status).toBe("ambiguous"); + expect(r.confidence).toBe(0.5); + }); + + it("falls back to PdfPageTextSelector via the OffsetMap", () => { + // Single page, "brown fox" at offset 10..19. 
+ const r = resolveSelectors( + [{ type: "PdfPageTextSelector", page: 1, start: brownFoxStart, end: brownFoxEnd }], + representation, + ); + expect(r.status).toBe("resolved"); + expect(r.confidence).toBe(0.8); + expect(r.candidates[0]?.textPosition).toEqual({ start: brownFoxStart, end: brownFoxEnd }); + expect(r.candidates[0]?.page).toBe(1); + }); + + it("falls back to PdfRectSelector with page+rects only at 0.7 confidence", () => { + const r = resolveSelectors( + [{ + type: "PdfRectSelector", + page: 2, + rects: [{ x: 0.1, y: 0.2, width: 0.3, height: 0.04 }], + }], + repr(text, 1), + ); + expect(r.status).toBe("resolved"); + expect(r.confidence).toBe(0.7); + expect(r.candidates[0]?.page).toBe(2); + expect(r.candidates[0]?.textPosition).toBeUndefined(); + expect(r.candidates[0]?.rects).toHaveLength(1); + }); + + it("returns unresolved when nothing matches", () => { + const r = resolveSelectors( + [{ type: "TextQuoteSelector", exact: "missing string" }], + representation, + ); + expect(r.status).toBe("unresolved"); + expect(r.confidence).toBe(0); + expect(r.candidates).toEqual([]); + }); +}); diff --git a/src/anchor/selectors/resolve.ts b/src/anchor/selectors/resolve.ts new file mode 100644 index 0000000..af3efb6 --- /dev/null +++ b/src/anchor/selectors/resolve.ts @@ -0,0 +1,260 @@ +/** + * Resolve a `Selector[]` against a `DocumentRepresentation`. + * + * Implements the resolution strategy from `wiki/ArchitectureOverview.md` §7, + * MVP-trimmed: + * + * 1. Try `TextPositionSelector` (cheapest — direct slice). + * 2. Verify with `TextQuoteSelector` at that position. + * 3. Try `TextQuoteSelector` on its own. If multiple matches, disambiguate + * by prefix/suffix. + * 4. Try `PdfPageTextSelector` (page-local offsets through the OffsetMap). + * 5. Fall back to `PdfRectSelector` for a page+rects-only target. + * 6. Return `unresolved` if nothing above succeeds. + * + * Fuzzy matching is out of scope here; a later workplan owns it. 
+ * + * Confidence ladder (0..1): + * 1.00 — TextPosition + TextQuote agree exactly + * 0.95 — TextQuote unique match (no position to cross-check) + * 0.90 — TextQuote disambiguated by prefix/suffix + * 0.85 — TextPosition only (no quote to cross-check) + * 0.80 — PdfPageTextSelector resolved via OffsetMap + * 0.70 — PdfRectSelector only (page+rects, no text verification) + */ + +import type { DocumentRepresentation } from "@shared/document"; +import type { + PdfPageTextSelector, + PdfRectSelector, + Selector, + SelectorType, + TextPositionSelector, + TextQuoteSelector, +} from "@shared/selector"; + +import type { AnchorResolution, ResolvedAnchorTarget } from "../types"; + +export function resolveSelectors( + selectors: readonly Selector[], + representation: DocumentRepresentation, +): AnchorResolution { + const canonicalText = representation.canonicalText ?? ""; + const offsetMap = representation.offsetMap ?? []; + const representationId = representation.id; + + const byType = indexByType(selectors); + const used: SelectorType[] = []; + const warnings: string[] = []; + + // 1 & 2. Try TextPositionSelector, verify with TextQuoteSelector. + if (byType.TextPositionSelector && canonicalText.length > 0) { + const pos = byType.TextPositionSelector; + const slice = sliceSafely(canonicalText, pos.start, pos.end); + if (slice !== null) { + const quote = byType.TextQuoteSelector; + if (quote) { + if (slice === quote.exact) { + used.push("TextPositionSelector", "TextQuoteSelector"); + return resolved( + { representationId, textPosition: { start: pos.start, end: pos.end }, ...pageFor(pos, offsetMap) }, + 1.0, + used, + warnings, + ); + } + warnings.push( + "TextPositionSelector slice did not match TextQuoteSelector.exact; falling back to quote search.", + ); + } else { + // Position with no quote to verify — accept at lower confidence. 
+ used.push("TextPositionSelector"); + return resolved( + { representationId, textPosition: { start: pos.start, end: pos.end }, ...pageFor(pos, offsetMap) }, + 0.85, + used, + warnings, + ); + } + } + } + + // 3. TextQuoteSelector on its own (or after the position fallback above). + if (byType.TextQuoteSelector && canonicalText.length > 0) { + const quoteResult = resolveByQuote(canonicalText, byType.TextQuoteSelector); + if (quoteResult) { + used.push("TextQuoteSelector"); + return resolved( + { + representationId, + textPosition: { start: quoteResult.offset, end: quoteResult.offset + byType.TextQuoteSelector.exact.length }, + ...pageFor({ start: quoteResult.offset, end: quoteResult.offset + byType.TextQuoteSelector.exact.length }, offsetMap), + }, + quoteResult.confidence, + used, + warnings, + quoteResult.status, + ); + } + } + + // 4. PdfPageTextSelector through OffsetMap. + if (byType.PdfPageTextSelector && offsetMap.length > 0) { + const pageText = byType.PdfPageTextSelector; + const range = offsetMap.find((r) => r.page === pageText.page); + if (range && pageText.start >= 0 && pageText.end <= range.pageLength && pageText.start < pageText.end) { + const globalStart = range.globalStart + pageText.start; + const globalEnd = range.globalStart + pageText.end; + used.push("PdfPageTextSelector"); + return resolved( + { + representationId, + page: pageText.page, + textPosition: { start: globalStart, end: globalEnd }, + }, + 0.8, + used, + warnings, + ); + } + } + + // 5. PdfRectSelector fallback (no text verification possible). 
+ if (byType.PdfRectSelector) { + const rect = byType.PdfRectSelector; + used.push("PdfRectSelector"); + return resolved( + { representationId, page: rect.page, rects: rect.rects }, + 0.7, + used, + warnings, + ); + } + + return unresolved(warnings); +} + +interface QuoteResolutionResult { + readonly offset: number; + readonly confidence: number; + readonly status: "resolved" | "ambiguous"; +} + +function resolveByQuote(canonicalText: string, quote: TextQuoteSelector): QuoteResolutionResult | null { + const positions = findAllOccurrences(canonicalText, quote.exact); + if (positions.length === 0) return null; + if (positions.length === 1) { + return { offset: positions[0]!, confidence: 0.95, status: "resolved" }; + } + // Multiple matches — try to disambiguate by prefix/suffix. + const filtered = positions.filter((p) => prefixSuffixMatches(canonicalText, p, quote)); + if (filtered.length === 1) { + return { offset: filtered[0]!, confidence: 0.9, status: "resolved" }; + } + if (filtered.length > 1) { + return { offset: filtered[0]!, confidence: 0.5, status: "ambiguous" }; + } + // No prefix/suffix info or no matches with context — return ambiguous on first. 
+ return { offset: positions[0]!, confidence: 0.5, status: "ambiguous" }; +} + +function prefixSuffixMatches( + canonicalText: string, + offset: number, + quote: TextQuoteSelector, +): boolean { + if (quote.prefix !== undefined) { + const prefixEnd = offset; + const prefixStart = Math.max(0, prefixEnd - quote.prefix.length); + const actualPrefix = canonicalText.slice(prefixStart, prefixEnd); + if (!actualPrefix.endsWith(quote.prefix)) return false; + } + if (quote.suffix !== undefined) { + const suffixStart = offset + quote.exact.length; + const suffixEnd = Math.min(canonicalText.length, suffixStart + quote.suffix.length); + const actualSuffix = canonicalText.slice(suffixStart, suffixEnd); + if (!actualSuffix.startsWith(quote.suffix)) return false; + } + return true; +} + +interface SelectorIndex { + TextQuoteSelector?: TextQuoteSelector; + TextPositionSelector?: TextPositionSelector; + PdfRectSelector?: PdfRectSelector; + PdfPageTextSelector?: PdfPageTextSelector; +} + +function indexByType(selectors: readonly Selector[]): SelectorIndex { + const idx: SelectorIndex = {}; + for (const s of selectors) { + switch (s.type) { + case "TextQuoteSelector": + idx.TextQuoteSelector = s; + break; + case "TextPositionSelector": + idx.TextPositionSelector = s; + break; + case "PdfRectSelector": + idx.PdfRectSelector = s; + break; + case "PdfPageTextSelector": + idx.PdfPageTextSelector = s; + break; + } + } + return idx; +} + +function sliceSafely(text: string, start: number, end: number): string | null { + if (start < 0 || end > text.length || start >= end) return null; + return text.slice(start, end); +} + +function pageFor( + span: { start: number; end: number }, + offsetMap: readonly { page: number; globalStart: number; globalEnd: number }[], +): { page?: number } { + if (offsetMap.length === 0) return {}; + const range = offsetMap.find((r) => span.start >= r.globalStart && span.end <= r.globalEnd); + return range ? 
{ page: range.page } : {}; +} + +function findAllOccurrences(haystack: string, needle: string): number[] { + if (needle.length === 0) return []; + const out: number[] = []; + let from = 0; + for (;;) { + const idx = haystack.indexOf(needle, from); + if (idx === -1) break; + out.push(idx); + from = idx + 1; + } + return out; +} + +function resolved( + target: ResolvedAnchorTarget, + confidence: number, + used: readonly SelectorType[], + warnings: readonly string[], + status: "resolved" | "ambiguous" = "resolved", +): AnchorResolution { + return { + status, + confidence, + candidates: [target], + usedSelectorTypes: used, + ...(warnings.length > 0 ? { warnings } : {}), + }; +} + +function unresolved(warnings: readonly string[]): AnchorResolution { + return { + status: "unresolved", + confidence: 0, + candidates: [], + usedSelectorTypes: [], + ...(warnings.length > 0 ? { warnings } : {}), + }; +} diff --git a/src/app/App.tsx b/src/app/App.tsx new file mode 100644 index 0000000..996a98c --- /dev/null +++ b/src/app/App.tsx @@ -0,0 +1,40 @@ +/** + * App — the citation-evidence MVP shell. + * + * Three-pane layout per `wiki/ArchitectureOverview.md` §12.1: + * + * ┌────────────┬──────────────────┬────────────┐ + * │ Collection │ Document Viewer │ Evidence │ + * │ List │ │ Sidebar │ + * └────────────┴──────────────────┴────────────┘ + * + * CE-WP-0002-T06 stops at "viewer shell is rendered, evidence list is + * displayed". T07 wires the selection → annotation → evidence flow; T08 + * wires the sidebar-click → scroll-to-passage round-trip. + */ + +import { + CollectionList, + EngineProvider, + EvidenceSidebar, + ViewerShell, +} from "@work/index"; + +export function App() { + return ( + +
+ + + +
+
+ ); +} diff --git a/src/app/SpikeApp.tsx b/src/app/SpikeApp.tsx deleted file mode 100644 index 3f42eae..0000000 --- a/src/app/SpikeApp.tsx +++ /dev/null @@ -1,233 +0,0 @@ -/** - * CE-WP-0002-T02 spike host page. - * - * Lists the fixtures from `fixtures/pdfs/manifest.json`, lets the user load - * one in the spike PDF viewer, capture a selection (the viewer's - * `onSelection` fires when text is selected), persist the resulting - * selectors to `localStorage`, and on reload restore + scroll to them. - * - * Success looks like: select a quote → click "save" → reload the tab → - * the highlight is rendered on the same passage and the page is scrolled - * to it. - */ - -import { useEffect, useMemo, useState } from "react"; -import { - PdfSpikeViewer, - type PdfSelectionCapture, - type StoredAnnotation, -} from "@anchor/index"; -import type { Selector } from "@shared/selector"; -import { newId } from "@shared/ids"; -import manifest from "../../fixtures/pdfs/manifest.json"; - -interface FixtureEntry { - id: string; - filename: string; - description: string; - page_count: number; - known_good_quote: string; - known_good_quote_page: number; -} - -const FIXTURES: FixtureEntry[] = (manifest as { fixtures: FixtureEntry[] }).fixtures; - -const STORAGE_KEY = "ce-wp-0002-spike-annotations-v1"; - -interface StoredEntry { - id: string; - fixtureId: string; - text: string; - selectors: Selector[]; - createdAt: string; -} - -function loadStore(): StoredEntry[] { - try { - const raw = localStorage.getItem(STORAGE_KEY); - if (!raw) return []; - const parsed = JSON.parse(raw) as unknown; - if (!Array.isArray(parsed)) return []; - return parsed as StoredEntry[]; - } catch { - return []; - } -} - -function saveStore(entries: StoredEntry[]) { - localStorage.setItem(STORAGE_KEY, JSON.stringify(entries)); -} - -export function SpikeApp() { - const [activeFixtureId, setActiveFixtureId] = useState(null); - const [entries, setEntries] = useState(() => loadStore()); - const [pending, 
setPending] = useState< - | { capture: PdfSelectionCapture; selectors: Selector[] } - | null - >(null); - const [scrollTo, setScrollTo] = useState(null); - - useEffect(() => { - saveStore(entries); - }, [entries]); - - const activeFixture = useMemo( - () => FIXTURES.find((f) => f.id === activeFixtureId) ?? null, - [activeFixtureId], - ); - - const annotationsForActive = useMemo(() => { - if (!activeFixtureId) return []; - return entries - .filter((e) => e.fixtureId === activeFixtureId) - .map((e) => ({ id: e.id, text: e.text, selectors: e.selectors })); - }, [activeFixtureId, entries]); - - function handleSave() { - if (!pending || !activeFixtureId) return; - const entry: StoredEntry = { - id: newId("annotation"), - fixtureId: activeFixtureId, - text: pending.capture.text, - selectors: pending.selectors, - createdAt: new Date().toISOString(), - }; - setEntries((prev) => [...prev, entry]); - setPending(null); - } - - function handleClear() { - if (!activeFixtureId) return; - setEntries((prev) => prev.filter((e) => e.fixtureId !== activeFixtureId)); - } - - return ( -
- - -
- {activeFixture ? ( - - setPending({ capture, selectors }) - } - /> - ) : ( -
- Pick a fixture on the left to begin. -
- )} -
-
- ); -} diff --git a/src/app/index.ts b/src/app/index.ts index e9ca6a1..713869c 100644 --- a/src/app/index.ts +++ b/src/app/index.ts @@ -1 +1 @@ -export { SpikeApp } from "./SpikeApp"; +export { App } from "./App"; diff --git a/src/app/main.tsx b/src/app/main.tsx index 320147f..ef3196e 100644 --- a/src/app/main.tsx +++ b/src/app/main.tsx @@ -1,12 +1,12 @@ import { StrictMode } from "react"; import { createRoot } from "react-dom/client"; -import { SpikeApp } from "./SpikeApp"; +import { App } from "./App"; const container = document.getElementById("root"); if (!container) throw new Error("#root not found"); createRoot(container).render( <StrictMode> - <SpikeApp /> + <App /> </StrictMode>, ); diff --git a/src/engine/engine.test.ts b/src/engine/engine.test.ts new file mode 100644 index 0000000..30745d2 --- /dev/null +++ b/src/engine/engine.test.ts @@ -0,0 +1,168 @@ +import { beforeEach, describe, expect, it } from "vitest"; +import type { Document, DocumentRepresentation } from "@shared/document"; +import type { DocumentId, RepresentationId } from "@shared/ids"; +import type { Selector } from "@shared/selector"; +import { createEngine, type Engine, type EngineEvent } from "./index"; + +function fakeDocAndRep(): { document: Document; representation: DocumentRepresentation } { + const docId = "doc_fake" as DocumentId; + const repId = "rep_fake" as RepresentationId; + return { + document: { + id: docId, + mediaType: "application/pdf", + createdAt: "2026-05-25T00:00:00.000Z", + updatedAt: "2026-05-25T00:00:00.000Z", + }, + representation: { + id: repId, + documentId: docId, + representationType: "pdf-text", + contentHash: "h", + canonicalText: "The quick brown fox.", + pageMap: [{ page: 1, width: 100, height: 100 }], + offsetMap: [{ page: 1, globalStart: 0, globalEnd: 20, pageLength: 20 }], + generatedAt: "2026-05-25T00:00:00.000Z", + }, + }; +} + +describe("Engine integration", () => { + let engine: Engine; + let events: EngineEvent[]; + + beforeEach(() => { + engine = createEngine(); + events = []; +
engine.bus.onAny((e) => events.push(e)); + }); + + it("documentService.register stores both and emits DocumentImported + DocumentRepresentationGenerated", () => { + const { document, representation } = fakeDocAndRep(); + const result = engine.documents.register({ document, representation }); + expect(result.document).toBe(document); + expect(result.representation).toBe(representation); + expect(engine.documents.get(document.id)).toBe(document); + expect(engine.documents.getRepresentation(representation.id)).toBe(representation); + expect(events.map((e) => e.type)).toEqual(["DocumentImported", "DocumentRepresentationGenerated"]); + }); + + it("annotationService.create stamps an ID + normalize version + timestamps, then emits AnnotationCreated", () => { + const { document, representation } = fakeDocAndRep(); + engine.documents.register({ document, representation }); + const selectors: Selector[] = [{ type: "TextQuoteSelector", exact: "brown fox" }]; + const ann = engine.annotations.create({ + documentId: document.id, + representationId: representation.id, + selectors, + quote: "brown fox", + note: "a quick mark", + }); + expect(ann.id).toMatch(/^ann_/); + expect(ann.normalizeVersion).toBeGreaterThan(0); + expect(ann.createdAt).toBe(ann.updatedAt); + expect(engine.annotations.get(ann.id)).toBe(ann); + const created = events.find((e) => e.type === "AnnotationCreated"); + expect(created?.type).toBe("AnnotationCreated"); + }); + + it("setResolutionStatus emits AnnotationResolved for resolved/ambiguous and AnnotationResolutionFailed for unresolved/stale", () => { + const { document, representation } = fakeDocAndRep(); + engine.documents.register({ document, representation }); + const ann = engine.annotations.create({ + documentId: document.id, + representationId: representation.id, + selectors: [{ type: "TextQuoteSelector", exact: "x" }], + }); + events.length = 0; + engine.annotations.setResolutionStatus(ann.id, "resolved", { confidence: 0.95 }); + expect(events.map((e) 
=> e.type)).toEqual(["AnnotationResolved"]); + engine.annotations.setResolutionStatus(ann.id, "unresolved", { confidence: 0, reason: "no quote match" }); + expect(events.map((e) => e.type)).toEqual(["AnnotationResolved", "AnnotationResolutionFailed"]); + }); + + it("evidenceService.create requires at least one annotation and emits EvidenceItemCreated", () => { + const { document, representation } = fakeDocAndRep(); + engine.documents.register({ document, representation }); + const ann = engine.annotations.create({ + documentId: document.id, + representationId: representation.id, + selectors: [{ type: "TextQuoteSelector", exact: "brown fox" }], + }); + expect(() => engine.evidence.create({ annotationIds: [] })).toThrow(); + const item = engine.evidence.create({ + annotationIds: [ann.id], + commentary: "good quote", + }); + expect(item.status).toBe("candidate"); + expect(item.annotationIds).toEqual([ann.id]); + expect(events.find((e) => e.type === "EvidenceItemCreated")).toBeDefined(); + }); + + it("setStatus emits EvidenceItemUpdated only on real change and carries previousStatus", () => { + const { document, representation } = fakeDocAndRep(); + engine.documents.register({ document, representation }); + const ann = engine.annotations.create({ + documentId: document.id, + representationId: representation.id, + selectors: [{ type: "TextQuoteSelector", exact: "brown fox" }], + }); + const item = engine.evidence.create({ annotationIds: [ann.id] }); + events.length = 0; + const same = engine.evidence.setStatus(item.id, "candidate"); + expect(same).toBe(item); + expect(events).toEqual([]); + engine.evidence.setStatus(item.id, "confirmed"); + const updated = events.find((e) => e.type === "EvidenceItemUpdated"); + expect(updated).toBeDefined(); + if (updated?.type === "EvidenceItemUpdated") { + expect(updated.previousStatus).toBe("candidate"); + } + }); + + it("listByDocument scopes evidence items to a single document via annotation lookup", () => { + const a = 
fakeDocAndRep(); + engine.documents.register(a); + const annA = engine.annotations.create({ + documentId: a.document.id, + representationId: a.representation.id, + selectors: [{ type: "TextQuoteSelector", exact: "brown fox" }], + }); + engine.evidence.create({ annotationIds: [annA.id], commentary: "a" }); + + // Second, distinct document. + const otherDocId = "doc_other" as DocumentId; + const otherRepId = "rep_other" as RepresentationId; + engine.documents.register({ + document: { ...a.document, id: otherDocId }, + representation: { ...a.representation, id: otherRepId, documentId: otherDocId }, + }); + const annB = engine.annotations.create({ + documentId: otherDocId, + representationId: otherRepId, + selectors: [{ type: "TextQuoteSelector", exact: "z" }], + }); + engine.evidence.create({ annotationIds: [annB.id], commentary: "b" }); + + expect(engine.evidence.listByDocument(a.document.id)).toHaveLength(1); + expect(engine.evidence.listByDocument(otherDocId)).toHaveLength(1); + }); + + it("activate emits EvidenceItemActivated without mutating the item", () => { + const { document, representation } = fakeDocAndRep(); + engine.documents.register({ document, representation }); + const ann = engine.annotations.create({ + documentId: document.id, + representationId: representation.id, + selectors: [{ type: "TextQuoteSelector", exact: "x" }], + }); + const item = engine.evidence.create({ annotationIds: [ann.id] }); + events.length = 0; + engine.evidence.activate(item.id, "sidebar"); + const activated = events.find((e) => e.type === "EvidenceItemActivated"); + expect(activated).toBeDefined(); + if (activated?.type === "EvidenceItemActivated") { + expect(activated.source).toBe("sidebar"); + } + }); +}); diff --git a/src/engine/events/bus.test.ts b/src/engine/events/bus.test.ts new file mode 100644 index 0000000..e0ce4c3 --- /dev/null +++ b/src/engine/events/bus.test.ts @@ -0,0 +1,64 @@ +import { describe, expect, it, vi } from "vitest"; +import type { DocumentId } from 
"@shared/ids"; +import { createEventBus } from "./bus"; + +const docId = "doc_test" as DocumentId; +const minimalDoc = { + id: docId, + mediaType: "application/pdf", + createdAt: "2026-05-25T00:00:00.000Z", + updatedAt: "2026-05-25T00:00:00.000Z", +}; + +describe("EventBus", () => { + it("delivers typed events to the registered listener", () => { + const bus = createEventBus(); + const spy = vi.fn(); + bus.on("DocumentImported", spy); + const result = bus.emit({ type: "DocumentImported", documentId: docId, document: minimalDoc }); + expect(spy).toHaveBeenCalledOnce(); + expect(spy.mock.calls[0]![0]).toMatchObject({ type: "DocumentImported", documentId: docId }); + expect(result.listenerCount).toBe(1); + expect(result.errors).toEqual([]); + }); + + it("does not deliver an event to listeners of a different type", () => { + const bus = createEventBus(); + const spy = vi.fn(); + bus.on("AnnotationCreated", spy); + bus.emit({ type: "DocumentImported", documentId: docId, document: minimalDoc }); + expect(spy).not.toHaveBeenCalled(); + }); + + it("delivers every event to onAny listeners", () => { + const bus = createEventBus(); + const spy = vi.fn(); + bus.onAny(spy); + bus.emit({ type: "DocumentImported", documentId: docId, document: minimalDoc }); + bus.emit({ type: "EvidenceItemActivated", evidenceItemId: "ev_x" as never }); + expect(spy).toHaveBeenCalledTimes(2); + }); + + it("returns an unsubscribe function from on()", () => { + const bus = createEventBus(); + const spy = vi.fn(); + const off = bus.on("DocumentImported", spy); + off(); + bus.emit({ type: "DocumentImported", documentId: docId, document: minimalDoc }); + expect(spy).not.toHaveBeenCalled(); + }); + + it("captures listener errors and still calls subsequent listeners", () => { + const bus = createEventBus(); + const boom = new Error("listener exploded"); + const a = vi.fn(() => { throw boom; }); + const b = vi.fn(); + bus.on("DocumentImported", a); + bus.on("DocumentImported", b); + const result = 
bus.emit({ type: "DocumentImported", documentId: docId, document: minimalDoc }); + expect(a).toHaveBeenCalledOnce(); + expect(b).toHaveBeenCalledOnce(); + expect(result.errors).toEqual([boom]); + expect(result.listenerCount).toBe(2); + }); +}); diff --git a/src/engine/events/bus.ts b/src/engine/events/bus.ts new file mode 100644 index 0000000..0d844f3 --- /dev/null +++ b/src/engine/events/bus.ts @@ -0,0 +1,79 @@ +/** + * Synchronous in-process event bus. + * + * Listeners fire in registration order on the calling stack; `emit` returns + * after every listener has run. A listener throwing does not stop later + * listeners — its error surfaces through the returned `errors` array so + * callers can decide whether to log, rethrow, or ignore. + * + * MVP-sufficient. ADR-0005 (persistence) will decide whether to upgrade to + * an async/queued bus when storage becomes durable. + */ + +import type { EngineEvent, EngineEventOf, EngineEventType } from "./types"; + +export type EngineEventListener<T extends EngineEventType> = ( + event: EngineEventOf<T>, +) => void; + +export type AnyEngineEventListener = (event: EngineEvent) => void; + +export interface EmitResult { + readonly listenerCount: number; + readonly errors: readonly unknown[]; +} + +export interface EventBus { + on<T extends EngineEventType>(type: T, listener: EngineEventListener<T>): () => void; + onAny(listener: AnyEngineEventListener): () => void; + emit<T extends EngineEventType>(event: EngineEventOf<T>): EmitResult; +} + +export function createEventBus(): EventBus { + const typedListeners = new Map<EngineEventType, Set<EngineEventListener<EngineEventType>>>(); + const anyListeners = new Set<AnyEngineEventListener>(); + + return { + on(type, listener) { + let set = typedListeners.get(type); + if (!set) { + set = new Set(); + typedListeners.set(type, set); + } + set.add(listener as unknown as EngineEventListener<EngineEventType>); + return () => { + set!.delete(listener as unknown as EngineEventListener<EngineEventType>); + }; + }, + onAny(listener) { + anyListeners.add(listener); + return () => { + anyListeners.delete(listener); + }; + }, + emit(event) { + const errors: unknown[] = []; + let count = 0; + const
typedSet = typedListeners.get(event.type); + if (typedSet) { + for (const l of typedSet) { + count++; + try { + (l as AnyEngineEventListener)(event); + } catch (err) { + errors.push(err); + } + } + } + for (const l of anyListeners) { + count++; + try { + l(event); + } catch (err) { + errors.push(err); + } + } + return { listenerCount: count, errors }; + }, + }; +} diff --git a/src/engine/events/index.ts b/src/engine/events/index.ts new file mode 100644 index 0000000..daf22cc --- /dev/null +++ b/src/engine/events/index.ts @@ -0,0 +1,8 @@ +export * from "./types"; +export { + createEventBus, + type EventBus, + type EngineEventListener, + type AnyEngineEventListener, + type EmitResult, +} from "./bus"; diff --git a/src/engine/events/types.ts b/src/engine/events/types.ts new file mode 100644 index 0000000..2bff9c6 --- /dev/null +++ b/src/engine/events/types.ts @@ -0,0 +1,84 @@ +/** + * Engine event vocabulary. + * + * Implements `wiki/SharedContracts.md` §4 (closed event list). Each event + * carries the *minimum* identifying payload needed by downstream listeners; + * services hand back the full domain object to the caller separately. + * + * Adding an event requires updating SharedContracts.md first. 
+ */ + +import type { Annotation, AnnotationResolutionStatus } from "@shared/annotation"; +import type { Document, DocumentRepresentation } from "@shared/document"; +import type { EvidenceItem, EvidenceItemStatus } from "@shared/evidence"; +import type { + AnnotationId, + DocumentId, + EvidenceItemId, + RepresentationId, +} from "@shared/ids"; + +export interface DocumentImportedEvent { + readonly type: "DocumentImported"; + readonly documentId: DocumentId; + readonly document: Document; +} + +export interface DocumentRepresentationGeneratedEvent { + readonly type: "DocumentRepresentationGenerated"; + readonly documentId: DocumentId; + readonly representationId: RepresentationId; + readonly representation: DocumentRepresentation; +} + +export interface AnnotationCreatedEvent { + readonly type: "AnnotationCreated"; + readonly annotationId: AnnotationId; + readonly annotation: Annotation; +} + +export interface AnnotationResolvedEvent { + readonly type: "AnnotationResolved"; + readonly annotationId: AnnotationId; + readonly status: AnnotationResolutionStatus; + readonly confidence: number; +} + +export interface AnnotationResolutionFailedEvent { + readonly type: "AnnotationResolutionFailed"; + readonly annotationId: AnnotationId; + readonly reason: string; +} + +export interface EvidenceItemCreatedEvent { + readonly type: "EvidenceItemCreated"; + readonly evidenceItemId: EvidenceItemId; + readonly evidenceItem: EvidenceItem; +} + +export interface EvidenceItemUpdatedEvent { + readonly type: "EvidenceItemUpdated"; + readonly evidenceItemId: EvidenceItemId; + readonly evidenceItem: EvidenceItem; + readonly previousStatus: EvidenceItemStatus; +} + +export interface EvidenceItemActivatedEvent { + readonly type: "EvidenceItemActivated"; + readonly evidenceItemId: EvidenceItemId; + readonly source?: "sidebar" | "form-field" | "citation-card"; +} + +export type EngineEvent = + | DocumentImportedEvent + | DocumentRepresentationGeneratedEvent + | AnnotationCreatedEvent + | 
AnnotationResolvedEvent + | AnnotationResolutionFailedEvent + | EvidenceItemCreatedEvent + | EvidenceItemUpdatedEvent + | EvidenceItemActivatedEvent; + +export type EngineEventType = EngineEvent["type"]; + +export type EngineEventOf<T extends EngineEventType> = Extract<EngineEvent, { type: T }>; diff --git a/src/engine/index.ts b/src/engine/index.ts index cb0ff5c..0b165e8 100644 --- a/src/engine/index.ts +++ b/src/engine/index.ts @@ -1 +1,60 @@ -export {}; +/** + * Engine composition root. + * + * `createEngine()` wires in-memory repos to the services and shares a single + * event bus. The app layer holds the returned `Engine` instance and passes + * its services into the UI. + * + * Swapping the repository implementation later (ADR-0005) is a matter of + * replacing `createInMemoryRepos()` here. The service signatures don't + * change. + */ + +import { createEventBus, type EventBus } from "./events"; +import { + createInMemoryRepos, + type InMemoryRepos, +} from "./repos"; +import { + createAnnotationService, + createDocumentService, + createEvidenceService, + type AnnotationService, + type DocumentService, + type EvidenceService, +} from "./services"; + +export * from "./events"; +export * from "./repos"; +export * from "./services"; +export { + SNAPSHOT_VERSION, + attachPersister, + captureSnapshot, + documentIdsIn, + restoreFromStorage, + restoreSnapshot, + type EngineSnapshot, + type PersisterOptions, +} from "./persistence"; + +export interface Engine { + readonly bus: EventBus; + readonly repos: InMemoryRepos; + readonly documents: DocumentService; + readonly annotations: AnnotationService; + readonly evidence: EvidenceService; +} + +export function createEngine(): Engine { + const bus = createEventBus(); + const repos = createInMemoryRepos(); + const documents = createDocumentService(repos.documents, repos.representations, bus); + const annotations = createAnnotationService(repos.annotations, bus); + const evidence = createEvidenceService( + repos.evidenceItems, + (id) => repos.annotations.get(id), + bus, 
); + return { bus, repos, documents, annotations, evidence }; +} diff --git a/src/engine/persistence.test.ts b/src/engine/persistence.test.ts new file mode 100644 index 0000000..8ce7f4a --- /dev/null +++ b/src/engine/persistence.test.ts @@ -0,0 +1,183 @@ +import { beforeEach, describe, expect, it, vi } from "vitest"; +import type { Document, DocumentRepresentation } from "@shared/document"; +import type { DocumentId, RepresentationId } from "@shared/ids"; +import { + attachPersister, + captureSnapshot, + createEngine, + restoreFromStorage, + restoreSnapshot, + type Engine, + type EngineEvent, + type EngineSnapshot, +} from "./index"; + +function fakeDocAndRep(suffix: string): { + document: Document; + representation: DocumentRepresentation; +} { + const docId = `doc_${suffix}` as DocumentId; + const repId = `rep_${suffix}` as RepresentationId; + return { + document: { + id: docId, + mediaType: "application/pdf", + title: `Doc ${suffix}`, + createdAt: "2026-05-25T00:00:00.000Z", + updatedAt: "2026-05-25T00:00:00.000Z", + }, + representation: { + id: repId, + documentId: docId, + representationType: "pdf-text", + contentHash: `hash-${suffix}`, + canonicalText: "The quick brown fox.", + pageMap: [{ page: 1, width: 100, height: 100 }], + offsetMap: [{ page: 1, globalStart: 0, globalEnd: 20, pageLength: 20 }], + generatedAt: "2026-05-25T00:00:00.000Z", + }, + }; +} + +function memoryStorage(): Pick<Storage, "getItem" | "setItem" | "removeItem"> { + const map = new Map<string, string>(); + return { + getItem: (k) => map.get(k) ?? 
null, + setItem: (k, v) => void map.set(k, v), + removeItem: (k) => void map.delete(k), + }; +} + +function seed(engine: Engine, suffix: string) { + const { document, representation } = fakeDocAndRep(suffix); + engine.documents.register({ document, representation }); + const ann = engine.annotations.create({ + documentId: document.id, + representationId: representation.id, + selectors: [{ type: "TextQuoteSelector", exact: "brown fox" }], + quote: "brown fox", + }); + const item = engine.evidence.create({ + annotationIds: [ann.id], + commentary: `commentary-${suffix}`, + }); + return { document, representation, ann, item }; +} + +describe("captureSnapshot + restoreSnapshot", () => { + it("round-trips documents, representations, annotations and evidence items", () => { + const src = createEngine(); + const a = seed(src, "a"); + const b = seed(src, "b"); + const snap = captureSnapshot(src); + expect(snap.documents).toHaveLength(2); + expect(snap.representations).toHaveLength(2); + expect(snap.annotations).toHaveLength(2); + expect(snap.evidenceItems).toHaveLength(2); + + const dst = createEngine(); + restoreSnapshot(dst, snap); + expect(dst.documents.get(a.document.id)?.title).toBe("Doc a"); + expect(dst.documents.get(b.document.id)?.title).toBe("Doc b"); + expect(dst.annotations.get(a.ann.id)?.quote).toBe("brown fox"); + expect(dst.evidence.get(a.item.id)?.commentary).toBe("commentary-a"); + }); + + it("restoreSnapshot does NOT emit *Created events (events would loop the persister)", () => { + const src = createEngine(); + seed(src, "x"); + const snap = captureSnapshot(src); + + const dst = createEngine(); + const seen: EngineEvent["type"][] = []; + dst.bus.onAny((e) => seen.push(e.type)); + restoreSnapshot(dst, snap); + expect(seen).toEqual([]); + }); + + it("rejects a snapshot with a mismatching version", () => { + const dst = createEngine(); + expect(() => + restoreSnapshot(dst, { + version: 999, + documents: [], + representations: [], + annotations: [], + 
evidenceItems: [], + } as EngineSnapshot), + ).toThrow(/version/); + }); +}); + +describe("attachPersister", () => { + let storage: ReturnType<typeof memoryStorage>; + let engine: Engine; + const KEY = "ce-test-snap"; + + beforeEach(() => { + storage = memoryStorage(); + engine = createEngine(); + }); + + it("writes a snapshot to storage on every mutating event", () => { + const off = attachPersister(engine, { key: KEY, storage }); + expect(storage.getItem(KEY)).toBeNull(); + seed(engine, "z"); + const raw = storage.getItem(KEY); + expect(raw).not.toBeNull(); + const snap = JSON.parse(raw!) as EngineSnapshot; + expect(snap.documents).toHaveLength(1); + expect(snap.evidenceItems).toHaveLength(1); + off(); + }); + + it("stops writing after the unsubscribe is called", () => { + const off = attachPersister(engine, { key: KEY, storage }); + seed(engine, "q"); + const after = storage.getItem(KEY); + off(); + seed(engine, "r"); + expect(storage.getItem(KEY)).toBe(after); + }); + + it("survives a storage.setItem failure without throwing into the caller", () => { + const warn = vi.spyOn(console, "warn").mockImplementation(() => {}); + const failing = { ...memoryStorage(), setItem: () => { throw new Error("boom"); } }; + attachPersister(engine, { key: KEY, storage: failing }); + expect(() => seed(engine, "k")).not.toThrow(); + expect(warn).toHaveBeenCalled(); + warn.mockRestore(); + }); +}); + +describe("restoreFromStorage", () => { + it("returns {restored: false} when the key is empty", () => { + const storage = memoryStorage(); + const engine = createEngine(); + const result = restoreFromStorage(engine, { key: "missing", storage }); + expect(result.restored).toBe(false); + }); + + it("hydrates the engine when storage holds a valid snapshot", () => { + const src = createEngine(); + seed(src, "rs"); + const storage = memoryStorage(); + storage.setItem("snap", JSON.stringify(captureSnapshot(src))); + + const dst = createEngine(); + const result = restoreFromStorage(dst, { key: "snap", storage }); + 
expect(result.restored).toBe(true); + expect(dst.documents.list()).toHaveLength(1); + }); + + it("ignores malformed JSON without throwing", () => { + const warn = vi.spyOn(console, "warn").mockImplementation(() => {}); + const storage = memoryStorage(); + storage.setItem("snap", "not-json"); + const engine = createEngine(); + const result = restoreFromStorage(engine, { key: "snap", storage }); + expect(result.restored).toBe(false); + expect(warn).toHaveBeenCalled(); + warn.mockRestore(); + }); +}); diff --git a/src/engine/persistence.ts b/src/engine/persistence.ts new file mode 100644 index 0000000..1de5c8a --- /dev/null +++ b/src/engine/persistence.ts @@ -0,0 +1,138 @@ +/** + * Engine snapshot + restore. + * + * MVP "persistence" — capture the engine's in-memory state into a JSON blob + * and restore it later. Used by the SPA to survive page reloads via + * `localStorage` until ADR-0005 lands a real store. + * + * Restore deliberately bypasses the service layer: it writes directly to + * the repos so no `*Created` events fire. Without that, restoring would + * trigger the persister to re-write the same snapshot — and if the user + * has another tab open, it would also broadcast spurious "this annotation + * just appeared" events to UI listeners. 
+ */ + +import type { Annotation } from "@shared/annotation"; +import type { Document, DocumentRepresentation } from "@shared/document"; +import type { EvidenceItem } from "@shared/evidence"; +import type { DocumentId } from "@shared/ids"; + +import type { Engine } from "./index"; + +export const SNAPSHOT_VERSION = 1; + +export interface EngineSnapshot { + readonly version: number; + readonly documents: readonly Document[]; + readonly representations: readonly DocumentRepresentation[]; + readonly annotations: readonly Annotation[]; + readonly evidenceItems: readonly EvidenceItem[]; +} + +export function captureSnapshot(engine: Engine): EngineSnapshot { + const documents = engine.documents.list(); + // Gather representations per known document. + const representations: DocumentRepresentation[] = []; + const annotations: Annotation[] = []; + const evidenceItems: EvidenceItem[] = []; + const seenItemIds = new Set(); + for (const doc of documents) { + representations.push(...engine.documents.listRepresentations(doc.id)); + annotations.push(...engine.annotations.listByDocument(doc.id)); + for (const item of engine.evidence.listByDocument(doc.id)) { + // listByDocument keys off annotation lookup; an item that shares + // annotations across two documents would surface twice. De-dupe. 
+ if (!seenItemIds.has(item.id)) { + seenItemIds.add(item.id); + evidenceItems.push(item); + } + } + } + return { + version: SNAPSHOT_VERSION, + documents: [...documents], + representations, + annotations, + evidenceItems, + }; +} + +export function restoreSnapshot(engine: Engine, snapshot: EngineSnapshot): void { + if (snapshot.version !== SNAPSHOT_VERSION) { + throw new Error( + `restoreSnapshot: snapshot version ${snapshot.version} does not match current ${SNAPSHOT_VERSION}`, + ); + } + for (const d of snapshot.documents) engine.repos.documents.create(d); + for (const r of snapshot.representations) engine.repos.representations.create(r); + for (const a of snapshot.annotations) engine.repos.annotations.create(a); + for (const i of snapshot.evidenceItems) engine.repos.evidenceItems.create(i); +} + +export interface PersisterOptions { + /** Storage key. */ + readonly key: string; + /** Storage shim — defaults to globalThis.localStorage. */ + readonly storage?: Pick<Storage, "getItem" | "setItem">; +} + +/** + * Subscribe to engine events and write a fresh snapshot on every mutation. + * Returns the unsubscribe function. + * + * Initial snapshot is NOT written — call `captureSnapshot` + `storage.setItem` + * yourself if you want a baseline. + */ +export function attachPersister(engine: Engine, options: PersisterOptions): () => void { + const storage = options.storage ?? globalThis.localStorage; + const write = () => { + const snap = captureSnapshot(engine); + try { + storage.setItem(options.key, JSON.stringify(snap)); + } catch (err) { + // localStorage quota / serialization errors shouldn't crash the app. + // Surface to the console; ADR-0005 owns the durable fix. 
+ console.warn("attachPersister: write failed", err); + } + }; + const offs = [ + engine.bus.on("DocumentImported", write), + engine.bus.on("DocumentRepresentationGenerated", write), + engine.bus.on("AnnotationCreated", write), + engine.bus.on("AnnotationResolved", write), + engine.bus.on("AnnotationResolutionFailed", write), + engine.bus.on("EvidenceItemCreated", write), + engine.bus.on("EvidenceItemUpdated", write), + ]; + return () => { + for (const off of offs) off(); + }; +} + +export type RestoreFromStorageOptions = PersisterOptions; + +export function restoreFromStorage( + engine: Engine, + options: RestoreFromStorageOptions, +): { readonly restored: boolean; readonly snapshot?: EngineSnapshot } { + const storage = options.storage ?? globalThis.localStorage; + const raw = storage.getItem(options.key); + if (!raw) return { restored: false }; + try { + const parsed = JSON.parse(raw) as EngineSnapshot; + if (typeof parsed !== "object" || parsed === null) return { restored: false }; + restoreSnapshot(engine, parsed); + return { restored: true, snapshot: parsed }; + } catch (err) { + console.warn("restoreFromStorage: parse failed, ignoring stored snapshot", err); + return { restored: false }; + } +} + +/** + * Narrow helper: get the set of document ids restored from a snapshot. + * Useful for the SPA's "show me what was open last time" logic. + */ +export function documentIdsIn(snapshot: EngineSnapshot): readonly DocumentId[] { + return snapshot.documents.map((d) => d.id); +} diff --git a/src/engine/repos/in-memory.ts b/src/engine/repos/in-memory.ts new file mode 100644 index 0000000..1fe6fb5 --- /dev/null +++ b/src/engine/repos/in-memory.ts @@ -0,0 +1,151 @@ +/** + * In-memory `Map`-backed repositories. + * + * Implements the MVP storage layer. The repository interfaces match the + * shape that ADR-0005's eventual persistence implementation will satisfy, + * so swapping `createInMemoryRepos()` for a SQLite/Postgres factory later + * is a localised change. 
+ * + * All mutating methods return the *stored* object so callers can pick up + * server-assigned fields (none in MVP, but the contract anticipates it). + */ + +import type { Annotation } from "@shared/annotation"; +import type { Document, DocumentRepresentation } from "@shared/document"; +import type { EvidenceItem } from "@shared/evidence"; +import type { + AnnotationId, + DocumentId, + EvidenceItemId, + RepresentationId, +} from "@shared/ids"; + +export interface DocumentRepository { + create(document: Document): Document; + get(id: DocumentId): Document | null; + list(): readonly Document[]; + update(document: Document): Document; +} + +export interface RepresentationRepository { + create(representation: DocumentRepresentation): DocumentRepresentation; + get(id: RepresentationId): DocumentRepresentation | null; + listByDocument(documentId: DocumentId): readonly DocumentRepresentation[]; +} + +export interface AnnotationRepository { + create(annotation: Annotation): Annotation; + get(id: AnnotationId): Annotation | null; + listByDocument(documentId: DocumentId): readonly Annotation[]; + update(annotation: Annotation): Annotation; +} + +export interface EvidenceItemRepository { + create(item: EvidenceItem): EvidenceItem; + get(id: EvidenceItemId): EvidenceItem | null; + listByDocument( + documentId: DocumentId, + annotationLookup: (id: AnnotationId) => Annotation | null, + ): readonly EvidenceItem[]; + update(item: EvidenceItem): EvidenceItem; +} + +export interface InMemoryRepos { + readonly documents: DocumentRepository; + readonly representations: RepresentationRepository; + readonly annotations: AnnotationRepository; + readonly evidenceItems: EvidenceItemRepository; +} + +export function createInMemoryRepos(): InMemoryRepos { + const documents = new Map(); + const representations = new Map(); + const annotations = new Map(); + const evidenceItems = new Map(); + + return { + documents: { + create(document) { + documents.set(document.id, document); + return 
document; + }, + get(id) { + return documents.get(id) ?? null; + }, + list() { + return [...documents.values()]; + }, + update(document) { + if (!documents.has(document.id)) { + throw new Error(`DocumentRepository.update: unknown id ${document.id}`); + } + documents.set(document.id, document); + return document; + }, + }, + representations: { + create(representation) { + representations.set(representation.id, representation); + return representation; + }, + get(id) { + return representations.get(id) ?? null; + }, + listByDocument(documentId) { + const out: DocumentRepresentation[] = []; + for (const rep of representations.values()) { + if (rep.documentId === documentId) out.push(rep); + } + return out; + }, + }, + annotations: { + create(annotation) { + annotations.set(annotation.id, annotation); + return annotation; + }, + get(id) { + return annotations.get(id) ?? null; + }, + listByDocument(documentId) { + const out: Annotation[] = []; + for (const ann of annotations.values()) { + if (ann.documentId === documentId) out.push(ann); + } + return out; + }, + update(annotation) { + if (!annotations.has(annotation.id)) { + throw new Error(`AnnotationRepository.update: unknown id ${annotation.id}`); + } + annotations.set(annotation.id, annotation); + return annotation; + }, + }, + evidenceItems: { + create(item) { + evidenceItems.set(item.id, item); + return item; + }, + get(id) { + return evidenceItems.get(id) ?? 
null; + }, + listByDocument(documentId, annotationLookup) { + const out: EvidenceItem[] = []; + for (const item of evidenceItems.values()) { + if (item.annotationIds.some((aid) => annotationLookup(aid)?.documentId === documentId)) { + out.push(item); + } + } + return out; + }, + update(item) { + if (!evidenceItems.has(item.id)) { + throw new Error(`EvidenceItemRepository.update: unknown id ${item.id}`); + } + evidenceItems.set(item.id, item); + return item; + }, + }, + }; +} diff --git a/src/engine/repos/index.ts b/src/engine/repos/index.ts new file mode 100644 index 0000000..9f96c4c --- /dev/null +++ b/src/engine/repos/index.ts @@ -0,0 +1,8 @@ +export { + createInMemoryRepos, + type InMemoryRepos, + type DocumentRepository, + type RepresentationRepository, + type AnnotationRepository, + type EvidenceItemRepository, +} from "./in-memory"; diff --git a/src/engine/services/annotations.ts b/src/engine/services/annotations.ts new file mode 100644 index 0000000..6a25e27 --- /dev/null +++ b/src/engine/services/annotations.ts @@ -0,0 +1,102 @@ +/** + * Annotation service — creates technical marks on document ranges and + * emits `AnnotationCreated`. Resolution-status updates emit + * `AnnotationResolved` / `AnnotationResolutionFailed`. + * + * Annotation creation is the engine's response to a user action in the + * viewer (T07). The viewer adapter has already turned the selection into + * `Selector[]`; this service stamps an ID, normalize-version, timestamps, + * persists, and broadcasts. 
+ */ + +import type { + Annotation, + AnnotationResolutionStatus, +} from "@shared/annotation"; +import type { DocumentId, RepresentationId, AnnotationId } from "@shared/ids"; +import type { Selector } from "@shared/selector"; +import { newId } from "@shared/ids"; +import { NORMALIZE_VERSION } from "@shared/text/normalize"; + +import type { EventBus } from "../events"; +import type { AnnotationRepository } from "../repos"; + +export interface CreateAnnotationInput { + readonly documentId: DocumentId; + readonly representationId?: RepresentationId; + readonly selectors: readonly Selector[]; + readonly quote?: string; + readonly note?: string; + readonly createdBy?: string; +} + +export interface AnnotationService { + create(input: CreateAnnotationInput): Annotation; + get(id: AnnotationId): Annotation | null; + listByDocument(documentId: DocumentId): readonly Annotation[]; + setResolutionStatus( + id: AnnotationId, + status: AnnotationResolutionStatus, + opts: { readonly confidence: number; readonly reason?: string }, + ): Annotation; +} + +export function createAnnotationService( + annotations: AnnotationRepository, + bus: EventBus, + now: () => string = () => new Date().toISOString(), +): AnnotationService { + return { + create(input) { + const ts = now(); + const annotation: Annotation = { + id: newId("annotation"), + documentId: input.documentId, + ...(input.representationId !== undefined ? { representationId: input.representationId } : {}), + selectors: input.selectors, + ...(input.quote !== undefined ? { quote: input.quote } : {}), + ...(input.note !== undefined ? { note: input.note } : {}), + normalizeVersion: NORMALIZE_VERSION, + ...(input.createdBy !== undefined ? 
{ createdBy: input.createdBy } : {}), + createdAt: ts, + updatedAt: ts, + }; + const stored = annotations.create(annotation); + bus.emit({ type: "AnnotationCreated", annotationId: stored.id, annotation: stored }); + return stored; + }, + get(id) { + return annotations.get(id); + }, + listByDocument(documentId) { + return annotations.listByDocument(documentId); + }, + setResolutionStatus(id, status, opts) { + const existing = annotations.get(id); + if (!existing) { + throw new Error(`AnnotationService.setResolutionStatus: unknown id ${id}`); + } + const updated: Annotation = { + ...existing, + resolutionStatus: status, + updatedAt: now(), + }; + const stored = annotations.update(updated); + if (status === "unresolved" || status === "stale") { + bus.emit({ + type: "AnnotationResolutionFailed", + annotationId: stored.id, + reason: opts.reason ?? status, + }); + } else { + bus.emit({ + type: "AnnotationResolved", + annotationId: stored.id, + status, + confidence: opts.confidence, + }); + } + return stored; + }, + }; +} diff --git a/src/engine/services/documents.ts b/src/engine/services/documents.ts new file mode 100644 index 0000000..d1cafc0 --- /dev/null +++ b/src/engine/services/documents.ts @@ -0,0 +1,63 @@ +/** + * Document service — registers ingested documents and emits the §4 events. + * + * The ingest pipeline (`src/source/pdf/ingest.ts`) is a pure function over + * bytes — it does not touch the engine. The app composition root calls + * `ingestPdf` then hands the result to `documentService.register()`, which + * is where the engine takes over: persist into the repos, emit + * `DocumentImported` + `DocumentRepresentationGenerated`. 
+ */ + +import type { Document, DocumentRepresentation } from "@shared/document"; +import type { DocumentId, RepresentationId } from "@shared/ids"; + +import type { EventBus } from "../events"; +import type { DocumentRepository, RepresentationRepository } from "../repos"; + +export interface DocumentService { + register(input: { + readonly document: Document; + readonly representation: DocumentRepresentation; + }): { readonly document: Document; readonly representation: DocumentRepresentation }; + get(id: DocumentId): Document | null; + list(): readonly Document[]; + getRepresentation(id: RepresentationId): DocumentRepresentation | null; + listRepresentations(documentId: DocumentId): readonly DocumentRepresentation[]; +} + +export function createDocumentService( + documents: DocumentRepository, + representations: RepresentationRepository, + bus: EventBus, +): DocumentService { + return { + register({ document, representation }) { + const storedDocument = documents.create(document); + const storedRepresentation = representations.create(representation); + bus.emit({ + type: "DocumentImported", + documentId: storedDocument.id, + document: storedDocument, + }); + bus.emit({ + type: "DocumentRepresentationGenerated", + documentId: storedDocument.id, + representationId: storedRepresentation.id, + representation: storedRepresentation, + }); + return { document: storedDocument, representation: storedRepresentation }; + }, + get(id) { + return documents.get(id); + }, + list() { + return documents.list(); + }, + getRepresentation(id) { + return representations.get(id); + }, + listRepresentations(documentId) { + return representations.listByDocument(documentId); + }, + }; +} diff --git a/src/engine/services/evidence.ts b/src/engine/services/evidence.ts new file mode 100644 index 0000000..49bf71c --- /dev/null +++ b/src/engine/services/evidence.ts @@ -0,0 +1,127 @@ +/** + * Evidence service — creates EvidenceItems on top of annotations and + * tracks their lifecycle. 
Emits §4 events: `EvidenceItemCreated`, + * `EvidenceItemUpdated`, `EvidenceItemActivated`. + * + * MVP item shape per `wiki/SharedContracts.md` §2.2: status starts at + * `candidate`, may transition to `confirmed | rejected | needs-check`. + * Item-level relation/strength (supports/contradicts/...) lives on the + * link, not the item — that's CE-WP-0003. + */ + +import type { Annotation } from "@shared/annotation"; +import type { + EvidenceItem, + EvidenceItemStatus, +} from "@shared/evidence"; +import type { + AnnotationId, + DocumentId, + EvidenceItemId, +} from "@shared/ids"; +import { newId } from "@shared/ids"; + +import type { EventBus, EvidenceItemActivatedEvent } from "../events"; +import type { EvidenceItemRepository } from "../repos"; + +export interface CreateEvidenceItemInput { + readonly annotationIds: readonly AnnotationId[]; + readonly title?: string; + readonly commentary?: string; + readonly status?: EvidenceItemStatus; + readonly confidence?: number; + readonly tags?: readonly string[]; + readonly createdBy?: string; +} + +export interface EvidenceService { + create(input: CreateEvidenceItemInput): EvidenceItem; + get(id: EvidenceItemId): EvidenceItem | null; + listByDocument(documentId: DocumentId): readonly EvidenceItem[]; + setStatus(id: EvidenceItemId, status: EvidenceItemStatus): EvidenceItem; + updateCommentary(id: EvidenceItemId, commentary: string): EvidenceItem; + activate( + id: EvidenceItemId, + source?: EvidenceItemActivatedEvent["source"], + ): EvidenceItem; +} + +export function createEvidenceService( + items: EvidenceItemRepository, + annotationLookup: (id: AnnotationId) => Annotation | null, + bus: EventBus, + now: () => string = () => new Date().toISOString(), +): EvidenceService { + return { + create(input) { + if (input.annotationIds.length === 0) { + throw new Error("EvidenceService.create: at least one annotationId is required"); + } + const ts = now(); + const item: EvidenceItem = { + id: newId("evidence"), + annotationIds: 
input.annotationIds, + ...(input.title !== undefined ? { title: input.title } : {}), + ...(input.commentary !== undefined ? { commentary: input.commentary } : {}), + status: input.status ?? "candidate", + ...(input.confidence !== undefined ? { confidence: input.confidence } : {}), + ...(input.tags !== undefined ? { tags: input.tags } : {}), + ...(input.createdBy !== undefined ? { createdBy: input.createdBy } : {}), + createdAt: ts, + updatedAt: ts, + }; + const stored = items.create(item); + bus.emit({ type: "EvidenceItemCreated", evidenceItemId: stored.id, evidenceItem: stored }); + return stored; + }, + get(id) { + return items.get(id); + }, + listByDocument(documentId) { + return items.listByDocument(documentId, annotationLookup); + }, + setStatus(id, status) { + const existing = items.get(id); + if (!existing) { + throw new Error(`EvidenceService.setStatus: unknown id ${id}`); + } + if (existing.status === status) return existing; + const updated: EvidenceItem = { ...existing, status, updatedAt: now() }; + const stored = items.update(updated); + bus.emit({ + type: "EvidenceItemUpdated", + evidenceItemId: stored.id, + evidenceItem: stored, + previousStatus: existing.status, + }); + return stored; + }, + updateCommentary(id, commentary) { + const existing = items.get(id); + if (!existing) { + throw new Error(`EvidenceService.updateCommentary: unknown id ${id}`); + } + const updated: EvidenceItem = { ...existing, commentary, updatedAt: now() }; + const stored = items.update(updated); + bus.emit({ + type: "EvidenceItemUpdated", + evidenceItemId: stored.id, + evidenceItem: stored, + previousStatus: existing.status, + }); + return stored; + }, + activate(id, source) { + const existing = items.get(id); + if (!existing) { + throw new Error(`EvidenceService.activate: unknown id ${id}`); + } + bus.emit({ + type: "EvidenceItemActivated", + evidenceItemId: existing.id, + ...(source !== undefined ? 
{ source } : {}), + }); + return existing; + }, + }; +} diff --git a/src/engine/services/index.ts b/src/engine/services/index.ts new file mode 100644 index 0000000..bdce285 --- /dev/null +++ b/src/engine/services/index.ts @@ -0,0 +1,14 @@ +export { + createDocumentService, + type DocumentService, +} from "./documents"; +export { + createAnnotationService, + type AnnotationService, + type CreateAnnotationInput, +} from "./annotations"; +export { + createEvidenceService, + type EvidenceService, + type CreateEvidenceItemInput, +} from "./evidence"; diff --git a/src/source/index.ts b/src/source/index.ts index cb0ff5c..0eea433 100644 --- a/src/source/index.ts +++ b/src/source/index.ts @@ -1 +1,8 @@ -export {}; +export { + ingestPdf, + type IngestPdfInput, + type IngestPdfOptions, + type IngestPdfResult, +} from "./pdf/ingest"; +export { extractPdf, type PdfExtractionResult } from "./pdf/extract"; +export { fingerprintBytes } from "./pdf/fingerprint"; diff --git a/src/source/pdf/extract.ts b/src/source/pdf/extract.ts new file mode 100644 index 0000000..63de6bc --- /dev/null +++ b/src/source/pdf/extract.ts @@ -0,0 +1,122 @@ +/** + * PDF text extraction → canonical text + PageMap + OffsetMap. + * + * Implements `wiki/ArchitectureOverview.md` §3.4 ("extract canonical text / + * build format-specific maps") for the `pdf-text` representation + * (`wiki/SharedContracts.md` §1, §3) and §6 (canonical normalization). + * + * Runtime independence: the PDF.js worker must be configured by the host + * application (`GlobalWorkerOptions.workerSrc`) before this module is + * called. In Vite/browser code the worker is bundled via the viewer; in + * Node tests the test setup file points it at + * `pdfjs-dist/legacy/build/pdf.worker.mjs`. No worker setup happens here + * so the same module loads cleanly in both runtimes. + * + * Page boundary semantics: canonical text concatenates per-page normalized + * text with a single "\n\n" paragraph separator. 
The separator is treated + * as belonging to the *preceding* page in `OffsetMap`, so the map covers + * `[0, canonicalText.length)` with no gaps. The last page has no trailing + * separator. This means `pageLength = globalEnd - globalStart` for + * every page; for non-last pages it equals (normalized page text length + + * 2). See `PageOffsetRange` in `@shared/document.ts`. + */ + +import { getDocument } from "pdfjs-dist"; +import type { PDFPageProxy } from "pdfjs-dist"; +import type { + OffsetMap, + PageInfo, + PageMap, + PageOffsetRange, +} from "@shared/document"; +import { normalize } from "@shared/text/normalize"; + +const PAGE_SEPARATOR = "\n\n"; + +export interface PdfExtractionResult { + readonly canonicalText: string; + readonly pageMap: PageMap; + readonly offsetMap: OffsetMap; + readonly pageCount: number; +} + +export async function extractPdf(bytes: Uint8Array): Promise<PdfExtractionResult> { + // PDF.js mutates the bytes buffer (transfers ownership). Pass a fresh copy + // so the caller's Uint8Array stays usable for fingerprinting after extract. 
+ const data = new Uint8Array(bytes); + const loadingTask = getDocument({ data }); + const doc = await loadingTask.promise; + + try { + const pageCount = doc.numPages; + const pageInfos: PageInfo[] = []; + const pageNormalizedTexts: string[] = []; + + for (let pageNumber = 1; pageNumber <= pageCount; pageNumber++) { + const page = await doc.getPage(pageNumber); + try { + const viewport = page.getViewport({ scale: 1 }); + pageInfos.push({ + page: pageNumber, + width: viewport.width, + height: viewport.height, + }); + + const rawText = await extractPageText(page); + pageNormalizedTexts.push(normalize(rawText).text); + } finally { + page.cleanup(); + } + } + + const { canonicalText, offsetMap } = buildOffsetMap(pageNormalizedTexts); + + return { + canonicalText, + pageMap: pageInfos, + offsetMap, + pageCount, + }; + } finally { + await doc.destroy(); + } +} + +async function extractPageText(page: PDFPageProxy): Promise<string> { + const content = await page.getTextContent(); + // textContent.items are TextItem | TextMarkedContent. We want only the + // TextItem strings (those have a `str` field); marked-content entries are + // structural anchors and have no visible text. + const parts: string[] = []; + for (const item of content.items) { + if ("str" in item) { + parts.push(item.str); + if (item.hasEOL) parts.push("\n"); + } + } + return parts.join(""); +} + +function buildOffsetMap(pageTexts: readonly string[]): { + canonicalText: string; + offsetMap: OffsetMap; +} { + const ranges: PageOffsetRange[] = []; + let offset = 0; + for (let i = 0; i < pageTexts.length; i++) { + const text = pageTexts[i]!; + const isLast = i === pageTexts.length - 1; + const segmentLength = text.length + (isLast ? 
0 : PAGE_SEPARATOR.length); + const globalStart = offset; + const globalEnd = offset + segmentLength; + ranges.push({ + page: i + 1, + globalStart, + globalEnd, + pageLength: segmentLength, + }); + offset = globalEnd; + } + const canonicalText = pageTexts.join(PAGE_SEPARATOR); + return { canonicalText, offsetMap: ranges }; +} diff --git a/src/source/pdf/fingerprint.ts b/src/source/pdf/fingerprint.ts new file mode 100644 index 0000000..e5ee2e2 --- /dev/null +++ b/src/source/pdf/fingerprint.ts @@ -0,0 +1,31 @@ +/** + * SHA-256 fingerprint of raw document bytes. + * + * Implements the fingerprint half of `wiki/ArchitectureOverview.md` §3.4 + * (the "compute fingerprint" pipeline step) and populates + * `Document.fingerprint` (`wiki/SharedContracts.md` §1). + * + * Uses Web Crypto's `crypto.subtle.digest`, which is available in browsers + * and in Node ≥ 20 (where it is exposed on `globalThis.crypto`). No + * platform branching — the API is the same in both environments. + */ + +export async function fingerprintBytes(bytes: Uint8Array): Promise<string> { + // Copy into a fresh ArrayBuffer (not SharedArrayBuffer) so the digest call + // satisfies TS's updated `BufferSource` type, which excludes + // `SharedArrayBuffer`. The copy is O(n) — fine even for large PDFs since + // SHA-256 itself is already O(n). + const ab = new ArrayBuffer(bytes.byteLength); + new Uint8Array(ab).set(bytes); + const digest = await crypto.subtle.digest("SHA-256", ab); + return bytesToHex(new Uint8Array(digest)); +} + +function bytesToHex(bytes: Uint8Array): string { + let hex = ""; + for (let i = 0; i < bytes.length; i++) { + const b = bytes[i]!; + hex += (b < 0x10 ? "0" : "") + b.toString(16); + } + return hex; +} diff --git a/src/source/pdf/ingest.test.ts b/src/source/pdf/ingest.test.ts new file mode 100644 index 0000000..0ffbc59 --- /dev/null +++ b/src/source/pdf/ingest.test.ts @@ -0,0 +1,142 @@ +/** + * Fixture-driven contract tests for the PDF ingest pipeline.
+ * + * For each fixture in `fixtures/pdfs/manifest.json`: + * 1. Read the PDF bytes from disk. + * 2. Run `ingestPdf` end-to-end. + * 3. Assert the resulting Document + DocumentRepresentation honour the + * manifest contract: media type is application/pdf, fingerprint is a + * 64-hex SHA-256, pageMap matches `page_count`, canonicalText + * contains `known_good_quote`, and the offsetMap covers + * `[0, canonicalText.length)` with no gaps. + * + * This is the verification gate for CE-WP-0002-T03. + */ + +import { readFileSync } from "node:fs"; +import { dirname, resolve } from "node:path"; +import { createRequire } from "node:module"; +import { fileURLToPath } from "node:url"; +import { beforeAll, describe, expect, it } from "vitest"; + +import { ingestPdf } from "./ingest"; +import { fingerprintBytes } from "./fingerprint"; +import manifest from "../../../fixtures/pdfs/manifest.json" with { type: "json" }; + +const __dirname = dirname(fileURLToPath(import.meta.url)); +const FIXTURE_DIR = resolve(__dirname, "../../../fixtures/pdfs"); + +interface Fixture { + id: string; + filename: string; + page_count: number; + known_good_quote: string; + known_good_quote_page: number; +} + +const FIXTURES: readonly Fixture[] = manifest.fixtures; + +beforeAll(async () => { + // PDF.js needs a workerSrc set. In Node tests we point it at the legacy + // worker bundle — the modern bundle uses APIs that aren't present in + // Node. The legacy worker is bundled as plain JS and runs through the + // fake-worker fallback that PDF.js spins up when no real Worker is + // available. 
+ const pdfjs = await import("pdfjs-dist"); + const require = createRequire(import.meta.url); + pdfjs.GlobalWorkerOptions.workerSrc = require.resolve( + "pdfjs-dist/legacy/build/pdf.worker.mjs", + ); +}); + +describe("ingestPdf — fixture corpus", () => { + for (const fixture of FIXTURES) { + describe(fixture.id, () => { + const path = resolve(FIXTURE_DIR, fixture.filename); + const bytes = new Uint8Array(readFileSync(path)); + + it("produces a Document with PDF media type and SHA-256 fingerprint", async () => { + const { document } = await ingestPdf(bytes, { filename: fixture.filename }); + expect(document.mediaType).toBe("application/pdf"); + expect(document.fingerprint).toMatch(/^[0-9a-f]{64}$/); + expect(document.title).toBe(fixture.filename); + // Fingerprint must be deterministic across runs. + const expected = await fingerprintBytes(bytes); + expect(document.fingerprint).toBe(expected); + }); + + it("produces a pdf-text representation with the expected page count", async () => { + const { representation } = await ingestPdf(bytes); + expect(representation.representationType).toBe("pdf-text"); + expect(representation.pageMap?.length).toBe(fixture.page_count); + expect(representation.offsetMap?.length).toBe(fixture.page_count); + }); + + it("canonical text contains the manifest's known-good quote", async () => { + const { representation } = await ingestPdf(bytes); + const text = representation.canonicalText ?? ""; + expect(text).toContain(fixture.known_good_quote); + }); + + it("offsetMap is gap-free and covers [0, canonicalText.length)", async () => { + const { representation } = await ingestPdf(bytes); + const text = representation.canonicalText ?? ""; + const offsets = representation.offsetMap ?? 
[]; + expect(offsets.length).toBeGreaterThan(0); + expect(offsets[0]!.globalStart).toBe(0); + expect(offsets.at(-1)!.globalEnd).toBe(text.length); + for (let i = 0; i < offsets.length; i++) { + const r = offsets[i]!; + expect(r.page).toBe(i + 1); + expect(r.globalEnd - r.globalStart).toBe(r.pageLength); + if (i > 0) expect(r.globalStart).toBe(offsets[i - 1]!.globalEnd); + } + }); + + it("pageMap entries have positive width and height in user-space points", async () => { + const { representation } = await ingestPdf(bytes); + const pages = representation.pageMap ?? []; + for (let i = 0; i < pages.length; i++) { + const p = pages[i]!; + expect(p.page).toBe(i + 1); + expect(p.width).toBeGreaterThan(0); + expect(p.height).toBeGreaterThan(0); + } + }); + }); + } +}); + +describe("ingestPdf — option handling", () => { + const fixture = FIXTURES[0]!; + const path = resolve(FIXTURE_DIR, fixture.filename); + const bytes = new Uint8Array(readFileSync(path)); + + it("uses explicit title over filename", async () => { + const { document } = await ingestPdf(bytes, { + filename: fixture.filename, + title: "Custom Title", + }); + expect(document.title).toBe("Custom Title"); + }); + + it("omits title entirely when neither filename nor title is supplied", async () => { + const { document } = await ingestPdf(bytes); + expect(document.title).toBeUndefined(); + }); + + it("propagates uri and metadata when supplied", async () => { + const { document } = await ingestPdf(bytes, { + uri: "file:///example.pdf", + metadata: { source: "test" }, + }); + expect(document.uri).toBe("file:///example.pdf"); + expect(document.metadata).toEqual({ source: "test" }); + }); + + it("accepts ArrayBuffer input", async () => { + const ab = bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength); + const { document } = await ingestPdf(ab); + expect(document.fingerprint).toMatch(/^[0-9a-f]{64}$/); + }); +}); diff --git a/src/source/pdf/ingest.ts b/src/source/pdf/ingest.ts new file mode 100644 
index 0000000..a473e3f --- /dev/null +++ b/src/source/pdf/ingest.ts @@ -0,0 +1,88 @@ +/** + * PDF ingest pipeline → `{ document, representation }`. + * + * Implements `wiki/ArchitectureOverview.md` §3.4 ("Raw Source → identify + * media type → compute fingerprint → extract metadata → extract canonical + * text → build format-specific maps → persist Document + + * DocumentRepresentation") for the PDF source format. + * + * Ingest is a pure function over bytes: it does not persist anything. The + * caller (engine repositories in T05, app layer in T06) writes the returned + * Document + DocumentRepresentation into the chosen store. + */ + +import { + type Document, + type DocumentRepresentation, +} from "@shared/document"; +import { newId } from "@shared/ids"; +import { extractPdf } from "./extract"; +import { fingerprintBytes } from "./fingerprint"; + +const PDF_MEDIA_TYPE = "application/pdf"; + +export interface IngestPdfOptions { + /** Original filename, used as the default title when no title is given. */ + readonly filename?: string; + /** Optional pre-existing title (overrides filename). */ + readonly title?: string; + /** Optional source URI (e.g. file:// or https://). */ + readonly uri?: string; + /** Free-form metadata persisted on the Document record. */ + readonly metadata?: Readonly<Record<string, unknown>>; +} + +export interface IngestPdfResult { + readonly document: Document; + readonly representation: DocumentRepresentation; +} + +export type IngestPdfInput = Uint8Array | ArrayBuffer | Blob; + +export async function ingestPdf( + input: IngestPdfInput, + options: IngestPdfOptions = {}, +): Promise<IngestPdfResult> { + const bytes = await toBytes(input); + const [fingerprint, extraction] = await Promise.all([ + fingerprintBytes(bytes), + extractPdf(bytes), + ]); + + const now = new Date().toISOString(); + const documentId = newId("document"); + const representationId = newId("representation"); + const title = options.title ??
options.filename; + + const document: Document = { + id: documentId, + mediaType: PDF_MEDIA_TYPE, + fingerprint, + createdAt: now, + updatedAt: now, + ...(title !== undefined ? { title } : {}), + ...(options.uri !== undefined ? { uri: options.uri } : {}), + ...(options.metadata !== undefined ? { metadata: options.metadata } : {}), + }; + + const representation: DocumentRepresentation = { + id: representationId, + documentId, + representationType: "pdf-text", + contentHash: fingerprint, + canonicalText: extraction.canonicalText, + pageMap: extraction.pageMap, + offsetMap: extraction.offsetMap, + generatedAt: now, + }; + + return { document, representation }; +} + +async function toBytes(input: IngestPdfInput): Promise<Uint8Array> { + if (input instanceof Uint8Array) return input; + if (input instanceof ArrayBuffer) return new Uint8Array(input); + // Blob (covers `File` in browsers — File extends Blob). + const buf = await input.arrayBuffer(); + return new Uint8Array(buf); +} diff --git a/src/work/AnnotationToolbar.tsx b/src/work/AnnotationToolbar.tsx new file mode 100644 index 0000000..82eb49c --- /dev/null +++ b/src/work/AnnotationToolbar.tsx @@ -0,0 +1,100 @@ +/** + * AnnotationToolbar — wires "I selected text" into "evidence appears in + * the sidebar". + * + * Visible only when a `pendingSelection` is set (the viewer publishes + * captures into context, then this toolbar lets the user attach commentary + * and commit). On Save it runs the full pipeline: + * + * 1. `createSelectors(capture, representation)` — anchor builds the + * maximal selector set against the active representation. + * 2. `engine.annotations.create(...)` — engine mints an Annotation + + * emits AnnotationCreated. + * 3. `engine.evidence.create(...)` — engine mints the EvidenceItem with + * the user's commentary, emits EvidenceItemCreated. + * + * The sidebar re-renders via the engine event bus, so no other glue is + * needed.
+ */ + +import { useEffect, useState } from "react"; +import { createSelectors } from "@anchor/index"; +import { + useActiveDocument, + useEngine, + usePendingSelection, +} from "./EngineContext"; + +export function AnnotationToolbar() { + const engine = useEngine(); + const { document, representation } = useActiveDocument(); + const { pending, set } = usePendingSelection(); + const [commentary, setCommentary] = useState(""); + + // Reset the commentary box whenever a fresh selection arrives. + useEffect(() => { + setCommentary(""); + }, [pending]); + + if (!pending || !document || !representation) return null; + + const handleSave = () => { + const selectors = createSelectors(pending.capture, representation); + const annotation = engine.annotations.create({ + documentId: document.id, + representationId: representation.id, + selectors, + quote: pending.capture.text, + }); + engine.evidence.create({ + annotationIds: [annotation.id], + ...(commentary.trim().length > 0 ? { commentary: commentary.trim() } : {}), + }); + set(null); + }; + + const handleDiscard = () => set(null); + + const quote = pending.capture.text; + const shortQuote = quote.length > 200 ? `${quote.slice(0, 200)}…` : quote; + + return ( +
+
+ New annotation ({pending.selectors.length} selector{pending.selectors.length === 1 ? "" : "s"}) +
+
+ “{shortQuote}” +
+