Implement CE-WP-0001 Foundations: TS scaffold, lint boundaries, normalize v1, fixtures

T01 Toolchain — vite + pnpm 9.15 + React 18 + strict TS (ADR-0001).
T02 Folder layout — src/{shared,engine,anchor,source,binder,work,app}/
    mirroring the future subsystem split, with path aliases.
T03 Boundary lint — eslint-plugin-boundaries enforcing the dependency
    edges from wiki/DependencyMap.md §4; verified by a violating fixture.
T04 Canonical normalization v1 — src/shared/text/normalize.ts with
    NORMALIZE_VERSION=1; 10/10 vitest covering ligatures, CRLF, soft
    hyphens (including line-break reassembly), mixed whitespace.
T05 PDF fixture corpus — 7 user-supplied German PDFs in fixtures/pdfs/
    (gitignored binaries) plus a manifest with verbatim known-good
    quotes and page counts, ready for CE-WP-0002 selector tests.
T06 README upgrade — umbrella README points at wiki/docs/workplans
    and documents the dev workflow.
T07 ADR-0002..0006 stubs in docs/decisions/.

Toolchain end-to-end: pnpm install + lint + typecheck + test all green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 00:13:03 +02:00
parent 707620adfb
commit 2f25f99cae
32 changed files with 4756 additions and 9 deletions

35
.gitignore vendored

@@ -174,3 +174,38 @@ cython_debug/
# PyPI configuration file
.pypirc
# ---> Node / Vite / pnpm (CE-WP-0001-T01)
node_modules/
.pnpm-store/
.pnpm-debug.log*
npm-debug.log*
yarn-debug.log*
yarn-error.log*
# Vite build output
dist-ssr/
*.local
# Vitest
coverage/
# TypeScript incremental build cache
*.tsbuildinfo
# Editor / IDE
.vscode/
!.vscode/extensions.json
.idea/
*.swp
*.swo
# OS
.DS_Store
Thumbs.db
# ---> PDF fixtures (CE-WP-0001-T05)
# Binaries stay local until per-file licensing is cleared in
# fixtures/pdfs/SOURCES.md. The manifest and SOURCES.md are committed;
# add `!fixtures/pdfs/<filename>.pdf` exceptions as files are cleared.
fixtures/pdfs/*.pdf

1
.nvmrc Normal file

@@ -0,0 +1 @@
20.10.0


@@ -1 +1,77 @@
# citation-evidence
A document-centered evidence workspace for capturing, managing, presenting,
and re-opening citations. The umbrella over the six-package design described
in `INTENT.md` and `wiki/ArchitectureOverview.md`.
During the MVP all code lives here under `src/` (see "Repository layout"
below). Sister repos hold INTENT only — code migrates outward when each
subsystem stabilises.
## Documentation
| Where | What |
|------------------------|---------------------------------------------------------|
| `INTENT.md` | Project intent, scope, the umbrella-first decision |
| `wiki/` | PRD, Architecture, SharedContracts, DependencyMap |
| `docs/decisions/` | ADRs (architecturally significant decisions) |
| `workplans/` | Ralph-driven workplans that implement the MVP slice |
| `history/` | Time-stamped assessments and post-mortems |
The canonical contracts are in [`wiki/SharedContracts.md`](wiki/SharedContracts.md);
the partition boundaries are in [`wiki/DependencyMap.md`](wiki/DependencyMap.md).
Both are referenced from every workplan and from each sister repo's INTENT.md.
## Repository layout
```
src/
  shared/   # vocabulary, types, pure helpers → becomes part of citation-engine
  engine/   # services, repositories, event bus → becomes part of citation-engine
  anchor/   # selector creation/resolution, viewer adapter contract → becomes evidence-anchor
  source/   # ingest, fingerprint, extraction, recovery → becomes evidence-source
  binder/   # evidence-to-target binding, visual guide → becomes evidence-binder
  work/     # review UI (sidebar, viewer shell) → becomes citation-work
  app/      # the reference workspace shell → stays in citation-evidence
```
The dependency-edge rules between partitions are enforced by ESLint via
`eslint-plugin-boundaries` (see `eslint.config.js`). Extraction to a sister
repo is intended to be a `git mv` plus a `package.json` cut — nothing more.
## Sister repos
Peers under `~/`; each holds INTENT.md only during MVP:
- [`~/citation-engine`](../citation-engine/) — shared model + engine services
- [`~/evidence-anchor`](../evidence-anchor/) — selectors + adapter contract
- [`~/evidence-source`](../evidence-source/) — ingest, representation, recovery
- [`~/evidence-binder`](../evidence-binder/) — binding, visual guide, rect registry
- [`~/citation-work`](../citation-work/) — review UI surfaces
## Dev workflow
Requirements: Node 20 LTS (see `.nvmrc`) and `pnpm` 9.
```bash
pnpm install
pnpm dev # vite dev server (once src/app/ has a real entry)
pnpm test # vitest one-shot
pnpm test:watch
pnpm lint # eslint with boundary rules
pnpm typecheck # tsc -b --noEmit
pnpm build # production bundle
```
## Workplans (Ralph)
Workplans drive incremental implementation through the ralph loop. The harness
lives in `~/ralph-workplan/`; see `workplans/README.md` for the active list
and ordering.
```bash
/ralph-workplan workplans/CE-WP-0001-foundations.md
```
The loop self-retires when every task in the file has `status: done` and the
workplan's frontmatter is also `status: done`.


@@ -0,0 +1,68 @@
# ADR-0001 — Toolchain (Vite + pnpm + React 18 + strict TypeScript)
- Status: accepted
- Date: 2026-05-24
- Workplan: CE-WP-0001-T01
## Context
`citation-evidence` is the umbrella repo for an MVP that will eventually be
segmented into six packages (`shared/engine`, `anchor`, `source`, `binder`,
`work`, `app` per `wiki/DependencyMap.md`). We need a single toolchain that:
1. Gives a fast inner dev loop for a React-based reference workspace.
2. Plays well with a future pnpm workspace split (so each `src/<name>/` folder
can become a workspace package with a `git mv` and a `package.json` cut).
3. Provides first-class TypeScript with the strictest practical settings — the
shared contracts in `wiki/SharedContracts.md` only pay off if the type
system actually enforces them.
4. Has a credible unit-test story for the engine/anchor/source pure-logic code
and an integration path for the UI later.
## Options considered
- **Vite + pnpm + React + Vitest** *(chosen)*
- Fast HMR; well-supported React plugin; Vitest shares the Vite pipeline so
tests use the same module resolution as the app.
- pnpm workspaces are the most ergonomic path to the eventual multi-package
split.
- React 18 because the PRD's reference workspace is a desktop-class web app
and the ecosystem (PDF viewer libraries, drag-and-drop, etc.) targets it.
- **Next.js (App Router)**
- Heavier than needed for a local-first reference workspace; SSR/route
handlers add complexity the MVP doesn't use.
- Harder to split into independent packages later.
- **tsc-only + custom runner**
- Simplest, but no HMR and we'd hand-roll the React + bundler integration.
Pointless overhead for a UI-centric project.
- **Bun / Deno**
- Toolchain bets that would add risk to the PDF/viewer integration spike,
which is already the highest-risk part of the project (see
`CE-WP-0002-T02`).
## Decision
Use **Vite 5** + **pnpm 9** + **React 18** + **TypeScript 5 with `strict`,
`noUncheckedIndexedAccess`, `exactOptionalPropertyTypes`, `noImplicitOverride`,
`noFallthroughCasesInSwitch`, `verbatimModuleSyntax`** turned on. Use
**Vitest 2** as the test runner. Node version pinned to **20.10.0 LTS** via
`.nvmrc`. Path aliases (`@shared/*`, `@engine/*`, etc.) map to `src/<name>/*`
so import sites read the same whether or not the folder is later extracted.
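As a rough illustration of what two of these flags change at call sites, here is a minimal sketch. The names `firstQuote`, `Highlight`, and `makeHighlight` are hypothetical and do not exist in the repo; they only show the coding style that `noUncheckedIndexedAccess` and `exactOptionalPropertyTypes` push towards.

```typescript
// Hypothetical helpers, not part of the repo: they illustrate the effect of
// two of the strict flags listed in the Decision above.

// With noUncheckedIndexedAccess, `quotes[0]` has type `string | undefined`,
// so the empty-array case has to be handled explicitly.
export function firstQuote(quotes: readonly string[]): string {
  const first = quotes[0];
  if (first === undefined) {
    throw new Error("expected at least one quote");
  }
  return first;
}

// With exactOptionalPropertyTypes, `note?: string` means "absent or a string";
// writing `note: undefined` explicitly is rejected by the type checker.
export interface Highlight {
  page: number;
  note?: string;
}

export function makeHighlight(page: number, note?: string): Highlight {
  // Add the key only when a value exists instead of assigning `undefined`.
  return note === undefined ? { page } : { page, note };
}
```

The point of the second helper is that "key absent" and "key present but undefined" stay distinguishable, which matters once objects are serialized or compared structurally.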
## Consequences
- Bumping React or Node is a deliberate, ADR-worthy change.
- The eventual pnpm workspace split keeps the same import names — each
package's `name` becomes `@citation-evidence/<folder>` and the path aliases
are replaced by package resolution. No source-code churn required.
- Vitest's Vite-aware resolution means a contract test that imports across
partitions will fail at the same boundary that production code would —
there is no test-only loophole.
- ESLint rules enforcing the dependency map (CE-WP-0001-T03) layer on top
cleanly: `eslint-plugin-boundaries` resolves imports through the same
`tsconfig` paths (via `eslint-import-resolver-typescript`).
- No application dependencies are installed in this task — only the toolchain.
Subsequent workplans install PDF, drag-and-drop, etc. on demand and record
them in their own ADRs where the choice is non-obvious.


@@ -0,0 +1,50 @@
# ADR-0002 — Monorepo vs polyrepo for the six subsystems
- Status: proposed
- Date: 2026-05-24
- Workplan: CE-WP-0001-T07 (stub)
## Context
The umbrella-first MVP lives entirely in `citation-evidence/` under
`src/{shared,engine,anchor,source,binder,work,app}/`. Each folder is named
after its eventual extracted package. At some point — driven by an external
consumer needing one subsystem, or by independent release cadence — code
will move out into its sister repo.
We need a written answer to: when that moment comes, do we (a) keep one
repository with pnpm workspaces, (b) split into six independent repos with
published packages, or (c) something in between?
The decision affects: dependency management, release cadence, CI surface
area, contributor friction, and how `wiki/SharedContracts.md` is enforced
across the boundary.
## Options
- **A. Single repo, pnpm workspaces**
- Pros: one CI, one version of every dep, atomic cross-package PRs, easy
refactors. Shared contracts enforced by the type checker.
- Cons: any consumer outside this repo needs a private registry or
git-tag-based installs. Release cadence is shared.
- **B. Six independent repos, published packages**
- Pros: clean external publish story, independent versioning. Forces the
contract to be a real package boundary.
- Cons: dependency upgrades require coordinated PR trains. Refactors that
span subsystems become multi-repo dances. Hard to keep
`SharedContracts.md` in sync across repos.
- **C. Hybrid — monorepo with publishable workspaces**
- Pros: best of both: one repo for dev, but `pnpm publish` from any
workspace package. Tools: changesets / nx / turbo.
- Cons: more tooling to learn; per-workspace `package.json` cuts to
maintain.
## Decision
(blank — to be answered before the first subsystem extraction lands.)
## Consequences
(blank)


@@ -0,0 +1,44 @@
# ADR-0003 — W3C Web Annotation mapping: native model or import/export?
- Status: proposed
- Date: 2026-05-24
- Workplan: CE-WP-0001-T07 (stub)
## Context
The PRD mandates compatibility with the W3C Web Annotation Data Model
(FR-009 of `wiki/ProductRequirementsDocument.md`). `Selector` shapes already
mirror the W3C taxonomy. Open question: do we serialize our internal types
*as* JSON-LD Web Annotations natively, or maintain our own JSON shape with
an import/export mapping?
The choice affects: storage format, the public API of `evidence-source`'s
ingest/export paths, what "compatible" means when a user imports an existing
W3C annotation collection, and how much our internal model can diverge from
the spec (e.g. our `EvidenceItem` has no W3C analogue).
## Options
- **A. Native JSON-LD as the canonical store**
- Pros: maximally interoperable; no mapping layer to keep in sync.
- Cons: JSON-LD adds verbosity and context resolution; our extensions
(EvidenceItem, EvidenceLink, EvidenceSet) need custom JSON-LD contexts.
Bad fit for an in-memory MVP.
- **B. Internal model + import/export mapping** *(currently assumed)*
- Pros: terse internal types; clean fit for `wiki/SharedContracts.md`.
Mapping only runs at the system boundary.
- Cons: two shapes to maintain; subtle divergence risk.
- **C. Hybrid — internal model that is a strict superset of W3C JSON shape**
- Pros: serializes losslessly to W3C without a full JSON-LD context.
- Cons: ties internal naming to W3C naming forever, which constrains
future extensions.
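To make option B more concrete, the following is a minimal sketch of a boundary mapping for one selector kind. The internal shape `InternalTextQuote` (with a `kind` discriminator) is an assumption made for illustration; only the W3C side (`type: "TextQuoteSelector"` with `exact`, `prefix`, `suffix`) follows the Web Annotation Data Model.

```typescript
// Assumed internal shape; the real types in wiki/SharedContracts.md may differ.
export interface InternalTextQuote {
  kind: "text-quote";
  exact: string;
  prefix?: string;
  suffix?: string;
}

// W3C Web Annotation TextQuoteSelector, reduced to its core fields.
export interface W3CTextQuoteSelector {
  type: "TextQuoteSelector";
  exact: string;
  prefix?: string;
  suffix?: string;
}

// Export direction: runs only at the system boundary.
export function toW3C(sel: InternalTextQuote): W3CTextQuoteSelector {
  return {
    type: "TextQuoteSelector",
    exact: sel.exact,
    // Copy optional fields only when present, so no `undefined` keys leak out.
    ...(sel.prefix !== undefined ? { prefix: sel.prefix } : {}),
    ...(sel.suffix !== undefined ? { suffix: sel.suffix } : {}),
  };
}

// Import direction: the inverse mapping.
export function fromW3C(sel: W3CTextQuoteSelector): InternalTextQuote {
  return {
    kind: "text-quote",
    exact: sel.exact,
    ...(sel.prefix !== undefined ? { prefix: sel.prefix } : {}),
    ...(sel.suffix !== undefined ? { suffix: sel.suffix } : {}),
  };
}
```

A round-trip check (`fromW3C(toW3C(x))` equals `x`) is the kind of test that would catch the "subtle divergence risk" listed under the cons of option B.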
## Decision
(blank — required before evidence-source ships its first export path.)
## Consequences
(blank)


@@ -0,0 +1,47 @@
# ADR-0004 — PDF viewer library for the reference workspace
- Status: proposed
- Date: 2026-05-24
- Workplan: CE-WP-0001-T07 (stub); validated in CE-WP-0002-T02
## Context
The PDF round-trip (select text → store selectors → reload → resolve →
scroll → highlight) is the riskiest architectural assumption in the MVP
(see `history/2026-05-24-initial-assessment.md`). The viewer library must:
- Render PDF.js-backed pages in a React shell.
- Expose stable APIs for programmatic text selection and highlight overlay.
- Not leak its types into `src/shared/` or `src/engine/` (enforced by the
T03 boundary lint rules).
- Survive across versions of PDF.js without trapping us in old versions.
CE-WP-0002-T02 is the spike that validates whichever library we pick. If
the spike fails the success criteria, this ADR is the place to record the
failure and propose an alternative.
## Options
- **A. `react-pdf-highlighter-plus`** *(current assumption)*
- Pros: built for React, opinionated overlay layer, well-tested fixture
coverage in the community.
- Cons: bundles a particular PDF.js version; risk of needing to fork to
get clean adapter boundaries.
- **B. `react-pdf` (a community React wrapper around PDF.js) + custom overlay**
- Pros: thinnest abstraction; we own the overlay layer.
- Cons: significantly more code to write and maintain for selection/
highlight; reinventing PDF.js text-layer interaction.
- **C. PDF.js directly (no React wrapper)**
- Pros: maximum control.
- Cons: highest implementation cost; harder to integrate into the React
composition root.
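Whichever option wins, the "do not leak viewer types" requirement suggests an adapter that speaks only plain data. The sketch below is one possible shape under that assumption. The method names and `PageRect` are hypothetical; the real `DocumentViewerAdapter` contract is not part of this commit.

```typescript
// Hypothetical contract shape. Only plain data crosses the boundary, so no
// PDF.js or viewer-library type can reach src/shared/ or src/engine/.
export interface PageRect {
  page: number; // 1-indexed physical page
  x: number;
  y: number;
  width: number;
  height: number;
}

export interface DocumentViewerAdapter {
  scrollToPage(page: number): void;
  highlight(rects: readonly PageRect[]): void;
  clearHighlights(): void;
}

// In-memory fake that records calls; useful for testing resolution logic
// without loading a real viewer.
export class RecordingAdapter implements DocumentViewerAdapter {
  readonly calls: string[] = [];

  scrollToPage(page: number): void {
    this.calls.push(`scroll:${page}`);
  }

  highlight(rects: readonly PageRect[]): void {
    this.calls.push(`highlight:${rects.length}`);
  }

  clearHighlights(): void {
    this.calls.push("clear");
  }
}
```

A concrete adapter for any of the three options would implement the same interface, so swapping the library after a failed spike should not touch callers.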
## Decision
(blank — to be filled by the outcome of CE-WP-0002-T02.)
## Consequences
(blank)


@@ -0,0 +1,38 @@
# ADR-0005 — Persistence layer (MVP and beyond)
- Status: proposed
- Date: 2026-05-24
- Workplan: CE-WP-0001-T07 (stub); MVP placeholder in CE-WP-0002-T08
## Context
The MVP needs persistence so that "click an evidence item and have the PDF
jump to and highlight the passage — even after a full page reload" works
(PRD §20 step 4). The acceptable MVP shortcut is `localStorage` (decided
explicitly in CE-WP-0002-T08).
This ADR is the durable home for the real persistence decision: where do
documents, annotations, evidence items, links, and sets live in v1.0?
## Options
- **A. Browser-local only (IndexedDB via `idb` or `dexie`)**
- Pros: zero infra; great for a single-user reference workspace.
- Cons: no cross-device sync; export/import only via files.
- **B. Local-first + sync server (e.g. CRDT-backed)**
- Pros: matches the long-term vision of a workspace tool; conflict-free
multi-device.
- Cons: significant infra and CRDT design cost; out of MVP scope.
- **C. Traditional client/server with a REST or GraphQL API**
- Pros: familiar; easy team-sharing story.
- Cons: requires hosting; loses the local-first character.
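One way to keep the `localStorage` shortcut from hardening into the final design is to code against a narrow store interface. The sketch below assumes hypothetical names (`StringStore`, `MemoryStore`, `saveJson`, `loadJson`); the browser's `localStorage` satisfies `StringStore` structurally because it offers `getItem` and `setItem` with these signatures.

```typescript
// Narrow, hypothetical persistence seam. window.localStorage matches this
// shape, and so could an IndexedDB-backed or remote implementation later.
export interface StringStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// In-memory implementation for tests and non-browser environments.
export class MemoryStore implements StringStore {
  readonly entries = new Map<string, string>();

  getItem(key: string): string | null {
    return this.entries.get(key) ?? null;
  }

  setItem(key: string, value: string): void {
    this.entries.set(key, value);
  }
}

export function saveJson(store: StringStore, key: string, value: unknown): void {
  store.setItem(key, JSON.stringify(value));
}

// Returns undefined when the key was never written.
export function loadJson(store: StringStore, key: string): unknown {
  const raw = store.getItem(key);
  return raw === null ? undefined : JSON.parse(raw);
}
```

In a browser the MVP could pass `window.localStorage` where tests pass a `MemoryStore`, which leaves options A to C open behind the same seam.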
## Decision
(blank — to be answered before the second product slice past MVP.)
## Consequences
(blank)


@@ -0,0 +1,46 @@
# ADR-0006 — Selector ownership split (types in engine, algorithms in anchor)
- Status: proposed
- Date: 2026-05-24
- Workplan: CE-WP-0001-T07 (stub); echoes `wiki/SharedContracts.md` §8
## Context
The original sister-repo INTENT files had overlapping ownership claims for
`Selector`: `citation-engine` listed it as an owned domain type, while
`evidence-anchor`'s scope claimed "selector type definitions related to
anchoring". This was resolved on 2026-05-24 in `wiki/SharedContracts.md` §8:
type *interfaces* live in `citation-engine` (its shared half,
`src/shared/selector.ts`), creation and resolution *algorithms* live in anchor.
This ADR makes the split formal so that future code reviews have a written
answer when somebody proposes moving the types into anchor or moving the
algorithms into shared.
## Options
- **A. Status quo: types in `shared/`, algorithms in `anchor/`** *(default)*
- Pros: anchor depends on shared (allowed by DependencyMap §4); type
consumers (binder, work) never have to import anchor.
- Cons: tiny risk of types drifting out of sync with what anchor can
actually produce.
- **B. Co-locate types and algorithms in `anchor/`**
- Pros: one home for everything selector-related.
- Cons: any partition that mentions a `Selector` type (which is most of
them) would have to import from `anchor/`. Breaks the
"shared has no internal imports" invariant of DependencyMap §4.
- **C. Split selector kinds: text-quote in shared, PDF-rect in anchor**
- Pros: only adapter-specific selectors leave shared.
- Cons: forces a discriminated union spanning two packages — type
narrowing becomes painful for consumers.
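For reference, this is roughly what consumers get under option A, where the whole union lives in one place. The selector kinds and field names below are assumptions for illustration, not the shapes from `wiki/SharedContracts.md`.

```typescript
// Hypothetical selector kinds, all declared in one module (option A).
export interface TextQuoteSelector {
  kind: "text-quote";
  exact: string;
}

export interface PdfRectSelector {
  kind: "pdf-rect";
  page: number;
  rects: readonly (readonly [number, number, number, number])[];
}

export type Selector = TextQuoteSelector | PdfRectSelector;

// Narrowing on the `kind` discriminator works in a single switch because the
// compiler sees every member of the union.
export function describeSelector(sel: Selector): string {
  switch (sel.kind) {
    case "text-quote":
      return `quote "${sel.exact}"`;
    case "pdf-rect":
      return `${sel.rects.length} rect(s) on page ${sel.page}`;
  }
}
```

Under option C the `PdfRectSelector` member would be declared in another package, so `shared/` could not spell out the full union without importing from `anchor/`.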
## Decision
(blank — option A is the working assumption codified in SharedContracts.md;
fill this in if a future use case challenges it.)
## Consequences
(blank)

71
eslint.config.js Normal file

@@ -0,0 +1,71 @@
// ESLint flat config (ESLint 9+).
// Enforces the partition dependency map in wiki/DependencyMap.md §4.
//
// Element types (folders) and what each may import:
// shared : imports nothing internal (importable by every other element).
// engine : imports shared.
// anchor : imports shared, engine.
// source : imports shared, engine.
// binder : imports shared, engine, anchor.
// work : imports shared, engine, anchor, source. (NOT binder.)
// app : imports anything.
//
// Path aliases (@shared/*, @engine/*, etc.) come from tsconfig.json paths and
// are resolved by eslint-import-resolver-typescript.
import js from "@eslint/js";
import tseslint from "typescript-eslint";
import boundaries from "eslint-plugin-boundaries";
import importPlugin from "eslint-plugin-import";
import globals from "globals";
export default tseslint.config(
  {
    ignores: ["dist/", "node_modules/", "coverage/", "**/*.d.ts"],
  },
  js.configs.recommended,
  ...tseslint.configs.recommended,
  {
    files: ["src/**/*.{ts,tsx}"],
    languageOptions: {
      ecmaVersion: 2022,
      sourceType: "module",
      globals: { ...globals.browser, ...globals.node },
    },
    plugins: {
      boundaries,
      import: importPlugin,
    },
    settings: {
      "import/resolver": {
        typescript: { project: "./tsconfig.json" },
      },
      "boundaries/elements": [
        { type: "shared", pattern: "src/shared/**" },
        { type: "engine", pattern: "src/engine/**" },
        { type: "anchor", pattern: "src/anchor/**" },
        { type: "source", pattern: "src/source/**" },
        { type: "binder", pattern: "src/binder/**" },
        { type: "work", pattern: "src/work/**" },
        { type: "app", pattern: "src/app/**" },
      ],
    },
    rules: {
      "boundaries/element-types": [
        2,
        {
          default: "disallow",
          rules: [
            { from: "shared", allow: [] },
            { from: "engine", allow: ["shared"] },
            { from: "anchor", allow: ["shared", "engine"] },
            { from: "source", allow: ["shared", "engine"] },
            { from: "binder", allow: ["shared", "engine", "anchor"] },
            { from: "work", allow: ["shared", "engine", "anchor", "source"] },
            { from: "app", allow: ["shared", "engine", "anchor", "source", "binder", "work"] },
          ],
        },
      ],
    },
  },
);
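
The rule table above can also be read as plain data. The sketch below restates it as a lookup, which may help when reasoning about a proposed import. The names `ALLOWED` and `mayImport` are hypothetical, and treating same-partition imports as allowed is an assumption of this sketch, not something the config above states.

```typescript
// Hypothetical restatement of the "boundaries/element-types" rules as data.
type Partition = "shared" | "engine" | "anchor" | "source" | "binder" | "work" | "app";

const ALLOWED: Record<Partition, readonly Partition[]> = {
  shared: [],
  engine: ["shared"],
  anchor: ["shared", "engine"],
  source: ["shared", "engine"],
  binder: ["shared", "engine", "anchor"],
  work: ["shared", "engine", "anchor", "source"],
  app: ["shared", "engine", "anchor", "source", "binder", "work"],
};

// Assumption: imports inside one partition are always fine; only
// cross-partition edges are looked up in the table.
export function mayImport(from: Partition, to: Partition): boolean {
  return from === to || ALLOWED[from].includes(to);
}
```

For example, `mayImport("work", "binder")` is false, matching the "(NOT binder.)" note in the header comment.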

41
fixtures/pdfs/SOURCES.md Normal file

@@ -0,0 +1,41 @@
# PDF Fixture Sources
These PDFs back the selector and recovery tests for CE-WP-0002 and later
workplans. They are kept here as a stable, in-repo path that test code can
reference deterministically.
## Status: local-only by default
`fixtures/pdfs/*.pdf` is **gitignored** until each file's licensing and
privacy status has been resolved. The current corpus consists of
Bernd-supplied private documents (German correspondence, forms, and
reference material) — they exist locally for development but must not be
pushed without an explicit per-file decision.
When a PDF is cleared for commit, add a `!fixtures/pdfs/<filename>.pdf`
exception to `.gitignore` and append a row to the table below.
## Cleared for commit
(none yet)
| Filename | Source | License | Cleared by | Date |
|----------|--------|---------|------------|------|
| | | | | |
## Local-only fixtures (provided by Bernd 2026-05-24)
| Filename | Notes |
|---------------------------------------------------------|--------------------------------------|
| 031-Kemal Güldag Betriebskosten 2024.pdf | Personal correspondence |
| 061-260215-brief-trennungRoxanaAngebot_v1_final.pdf | Personal/legal correspondence |
| 063-26.01_Sonderkosten.pdf | Personal correspondence |
| 61595286_Vollständigkeitserklärung_2024.pdf | Form document |
| Aufnahmeschein Naturfriedhof.pdf | Form document |
| Fristsetzung zur Bezifferung GÜ an Gegenseite 3 Wochen.pdf | Personal/legal correspondence |
| Zeugnissprüche_Klasse_7_8.pdf | Reference document |
Future fixtures may also include synthesized or public-domain documents to
cover gaps in the matrix (single-column English text, two-column academic
PDFs, etc.) — those should land alongside these and be cleared for commit
immediately.


@@ -0,0 +1,70 @@
{
"_schema_version": 1,
"_description": "PDF fixture corpus for citation-evidence selector tests. Each entry binds a stable id (used by test code) to a file path, page count, and a verbatim known-good quote with its 1-indexed physical PDF page number. The quote is short, unique within the document, and chosen to round-trip cleanly through the canonical text normalizer.",
"_provenance": "Page counts and quotes extracted on 2026-05-24 by reading each PDF directly. The Betriebskosten file is a scanned/handwritten form with noisy OCR text — its quote is taken from the reliably-extracted printed boilerplate, not from the handwritten fields.",
"fixtures": [
{
"id": "betriebskosten-2024",
"filename": "031-Kemal Güldag Betriebskosten 2024.pdf",
"description": "German Betriebskostenabrechnung (utility-cost statement) for a Seeheim apartment — scanned cover letter + filled-in Abrechnung form. OCR-noisy text and handwritten field values. Useful for stress-testing canonical normalization and selector resolution on imperfect extraction.",
"page_count": 2,
"known_good_quote": "Ich bitte um Überweisung auf das Konto bei",
"known_good_quote_page": 1,
"characteristics": ["german", "umlauts", "scanned", "ocr-noisy", "form", "handwritten"]
},
{
"id": "brief-trennung-angebot",
"filename": "061-260215-brief-trennungRoxanaAngebot_v1_final.pdf",
"description": "German correspondence — three-page settlement proposal letter with bullet lists, embedded amounts, and a signature on the final page. Single-column prose; representative of typical legal/personal letters.",
"page_count": 3,
"known_good_quote": "Dieser Vorschlag ist befristet bis zum 31.03.2026",
"known_good_quote_page": 3,
"characteristics": ["german", "umlauts", "single-column", "prose", "multi-page", "bullet-lists"]
},
{
"id": "sonderkosten",
"filename": "063-26.01_Sonderkosten.pdf",
"description": "German Sonderkosten ledger — tabular expense statement across multiple years and people. Three pages of column-heavy tables; representative of spreadsheet-exported PDFs where text extraction order is column-driven.",
"page_count": 3,
"known_good_quote": "Einzahlung Bernd 23.09.24",
"known_good_quote_page": 2,
"characteristics": ["german", "tables", "spreadsheet-export", "multi-column", "amounts"]
},
{
"id": "vollstaendigkeitserklaerung-2024",
"filename": "61595286_Vollständigkeitserklärung_2024.pdf",
"description": "Single-page German Vollständigkeitserklärung tax-prep form (VLH). Dense form with checkboxes, labelled fields, and small-print legal text — a good test of selector creation on form-heavy layouts.",
"page_count": 1,
"known_good_quote": "Mitglied beim Lohnsteuerhilfeverein Vereinigte Lohnsteuerhilfe e.V.",
"known_good_quote_page": 1,
"characteristics": ["german", "umlauts", "form", "checkboxes", "single-page", "dense"]
},
{
"id": "aufnahmeschein-naturfriedhof",
"filename": "Aufnahmeschein Naturfriedhof.pdf",
"description": "Three-page German admission form (Aufnahmeschein) for the Mühltal natural cemetery: fillable form (p1-2) plus an excerpt of the Friedhofsordnung statute (p3). Tests selectors that must distinguish form labels from underline-fields and prose.",
"page_count": 3,
"known_good_quote": "Mehrere Verpflichtete haften als Gesamtschuldner",
"known_good_quote_page": 3,
"characteristics": ["german", "umlauts", "form", "statute", "mixed-layout"]
},
{
"id": "fristsetzung-bezifferung",
"filename": "Fristsetzung zur Bezifferung GÜ an Gegenseite 3 Wochen.pdf",
"description": "Single-page formal court letter from the Amtsgericht Darmstadt — header block, addressed block, a one-sentence ruling, and a signature block. Excellent for clean selector round-trip tests.",
"page_count": 1,
"known_good_quote": "wird der Antragsgegnerin eine Frist von 3 Wochen zur Bezifferung gesetzt",
"known_good_quote_page": 1,
"characteristics": ["german", "umlauts", "legal", "single-page", "clean-text"]
},
{
"id": "zeugnisspruche-klasse-7-8",
"filename": "Zeugnissprüche_Klasse_7_8.pdf",
"description": "Long-form reference document (29 PDF pages): title page + 28 pages of curated quotes for German school-year reports, each with author, dates, and short biography. Multi-language inclusions (English, Spanish, Greek). Ideal for cross-page selector and heading-hierarchy tests.",
"page_count": 29,
"known_good_quote": "Der Friede der Welt beginnt in den Herzen der Menschen",
"known_good_quote_page": 2,
"characteristics": ["german", "umlauts", "long-form", "multi-language", "hierarchy", "structured"]
}
]
}
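
Test code that consumes this manifest would probably want a typed view of each entry plus a cheap sanity check. The following is a minimal sketch: the field names are taken from the manifest above, while `PdfFixture` and `checkFixture` are hypothetical names that do not exist in the repo.

```typescript
// Typed view of one manifest entry; field names mirror the JSON above.
export interface PdfFixture {
  id: string;
  filename: string;
  description: string;
  page_count: number;
  known_good_quote: string;
  known_good_quote_page: number; // 1-indexed physical PDF page
  characteristics: readonly string[];
}

// Returns a list of human-readable problems; an empty list means the entry
// is internally consistent. It does not open the PDF itself.
export function checkFixture(f: PdfFixture): string[] {
  const problems: string[] = [];
  if (!Number.isInteger(f.page_count) || f.page_count < 1) {
    problems.push(`${f.id}: page_count must be a positive integer`);
  }
  if (
    !Number.isInteger(f.known_good_quote_page) ||
    f.known_good_quote_page < 1 ||
    f.known_good_quote_page > f.page_count
  ) {
    problems.push(`${f.id}: known_good_quote_page is outside 1..page_count`);
  }
  if (f.known_good_quote.trim() === "") {
    problems.push(`${f.id}: known_good_quote is empty`);
  }
  return problems;
}
```

Running such a check over every entry before the selector tests start would surface manifest typos separately from genuine extraction failures.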

40
package.json Normal file

@@ -0,0 +1,40 @@
{
"name": "citation-evidence",
"version": "0.0.0",
"private": true,
"description": "Document-centered evidence workspace — umbrella-first MVP.",
"license": "Apache-2.0",
"type": "module",
"packageManager": "pnpm@9.15.0",
"engines": {
"node": ">=20.10.0"
},
"scripts": {
"dev": "vite",
"build": "tsc -b && vite build",
"preview": "vite preview",
"test": "vitest run",
"test:watch": "vitest",
"lint": "eslint .",
"typecheck": "tsc -b --noEmit"
},
"dependencies": {
"react": "^18.3.1",
"react-dom": "^18.3.1"
},
"devDependencies": {
"@types/node": "^20.14.0",
"@types/react": "^18.3.3",
"@types/react-dom": "^18.3.0",
"@vitejs/plugin-react": "^4.3.1",
"eslint": "^9.7.0",
"eslint-import-resolver-typescript": "^3.6.3",
"eslint-plugin-boundaries": "^4.2.2",
"eslint-plugin-import": "^2.30.0",
"globals": "^15.9.0",
"typescript": "^5.5.4",
"typescript-eslint": "^8.0.0",
"vite": "^5.4.0",
"vitest": "^2.0.5"
}
}

3919
pnpm-lock.yaml generated Normal file

File diff suppressed because it is too large.

7
src/anchor/README.md Normal file

@@ -0,0 +1,7 @@
# `src/anchor/` — selector creation, resolution, viewer adapter contract
Future home: `evidence-anchor`.
Owns: `createSelectors`, `resolveSelectors`, the `DocumentViewerAdapter`
contract, and concrete viewer adapters (PDF first).
May import from: `shared/`, `engine/` (`wiki/DependencyMap.md` §4).

1
src/anchor/index.ts Normal file

@@ -0,0 +1 @@
export {};

7
src/app/README.md Normal file

@@ -0,0 +1,7 @@
# `src/app/` — reference workspace shell, routes, composition root
Future home: stays in `citation-evidence` (or splits into a thin shell).
Owns: top-level React app, routing, composition of the engine + work + binder
subsystems into a runnable product.
May import from: any partition (`wiki/DependencyMap.md` §4).

1
src/app/index.ts Normal file

@@ -0,0 +1 @@
export {};

8
src/binder/README.md Normal file

@@ -0,0 +1,8 @@
# `src/binder/` — evidence-to-target binding, visual guide, rect registry
Future home: `evidence-binder`.
Owns: `EvidenceLink`/`EvidenceSet` services, active-state machine, rect
registry, SVG visual-guide overlay.
May import from: `shared/`, `engine/`, `anchor/`
(`wiki/DependencyMap.md` §4).

1
src/binder/index.ts Normal file

@@ -0,0 +1 @@
export {};

7
src/engine/README.md Normal file

@@ -0,0 +1,7 @@
# `src/engine/` — services, repositories, event bus
Future home: `citation-engine` (the services half).
Owns: repositories for `Document`/`Annotation`/`EvidenceItem`/`EvidenceLink`,
ID generation orchestration, the event bus, and pure orchestration services.
May import from: `shared/` only (`wiki/DependencyMap.md` §4).

1
src/engine/index.ts Normal file

@@ -0,0 +1 @@
export {};

8
src/shared/README.md Normal file

@@ -0,0 +1,8 @@
# `src/shared/` — vocabulary, types, pure helpers
Future home: `citation-engine` (the shared types and contracts half of it).
Owns: `Document`, `Selector`, `Annotation`, `EvidenceItem`, `EvidenceLink`,
`EvidenceSet`, state enums, branded IDs, canonical text normalization.
May import from: nothing internal. Leaf node of the dependency graph
(`wiki/DependencyMap.md` §4).

1
src/shared/index.ts Normal file

@@ -0,0 +1 @@
export {};


@@ -0,0 +1,56 @@
import { describe, expect, it } from "vitest";
import { NORMALIZE_VERSION, normalize } from "./normalize.js";

describe("normalize (NORMALIZE_VERSION=1)", () => {
  it("returns the version constant alongside the text", () => {
    const out = normalize("hello");
    expect(out.version).toBe(NORMALIZE_VERSION);
    expect(out.text).toBe("hello");
  });

  it("applies Unicode NFC composition", () => {
    // "é" decomposed (e + U+0301 combining acute) vs precomposed (U+00E9).
    // Escapes keep the two forms distinct even if an editor normalizes the file.
    const decomposed = "cafe\u0301";
    const precomposed = "caf\u00E9";
    expect(normalize(decomposed).text).toBe(precomposed);
  });

  it("normalizes CRLF and CR line endings to LF", () => {
    expect(normalize("a\r\nb\rc").text).toBe("a\nb\nc");
  });

  it("collapses horizontal whitespace runs to a single space", () => {
    expect(normalize("a b\t\tc d").text).toBe("a b c d");
  });

  it("preserves paragraph boundaries but collapses runs of 3+ newlines to one blank line", () => {
    const input = "para one\n\n\n\npara two\n\npara three";
    expect(normalize(input).text).toBe("para one\n\npara two\n\npara three");
  });

  it("strips soft hyphens (German line-broken word reassembly)", () => {
    // German "Donaudampfschiff" with soft hyphens (U+00AD) at the break points.
    expect(normalize("Donau\u00ADdampf\u00ADschiff").text).toBe(
      "Donaudampfschiff",
    );
  });

  it("strips soft hyphens that span a newline ('word-\\nfragment' → 'wordfragment')", () => {
    expect(normalize("word\u00AD\nfragment").text).toBe("wordfragment");
  });

  it("does not mangle ligatures (preserves the round-trip)", () => {
    // The "fi" ligature (U+FB01) is left as-is: NFC does NOT decompose it.
    // This test documents the current behavior so a future change is intentional.
    expect(normalize("ef\uFB01cient").text).toBe("ef\uFB01cient");
  });

  it("handles a mixed-whitespace paragraph realistically", () => {
    const input = " First line \r\n Second line.\r\n\r\n\r\nNext para. ";
    expect(normalize(input).text).toBe("First line\nSecond line.\n\nNext para.");
  });

  it("returns an empty string for whitespace-only input", () => {
    expect(normalize(" \n\n \t ").text).toBe("");
  });
});


@@ -0,0 +1,49 @@
// Canonical text normalization for selectors and stored quotes.
// Contract: wiki/SharedContracts.md §6.
//
// IMPORTANT: NORMALIZE_VERSION is stored on every Annotation. Bumping it is a
// migration event — old selectors must be re-resolved against re-normalized
// text before the new version becomes the default.
export const NORMALIZE_VERSION = 1;
// Soft hyphen (U+00AD), optionally followed by a single \n so that a PDF-
// extracted "word\u00AD\nfragment" reassembles to "wordfragment" rather than
// leaving a stray line break in the middle of a hyphenated word. The escape
// is used because a literal U+00AD is invisible in most editors.
const SOFT_HYPHEN_AT_BREAK = /\u00AD\n?/g;
// Horizontal whitespace = any \s except \n and \r. The double-negation
// [^\S\r\n] is the idiomatic regex: \S is "not whitespace", so
// "not (not-whitespace or line-ending)" = "whitespace that is not a newline".
// Covers space, tab, NBSP, narrow NBSP, em-space, all Zs general-category.
const HORIZONTAL_WHITESPACE_RUN = /[^\S\r\n]+/g;
// 3+ newlines collapse to exactly two (one paragraph boundary).
const PARAGRAPH_RUN = /\n{3,}/g;
export function normalize(input: string): { text: string; version: number } {
// 1. Unicode NFC.
let text = input.normalize("NFC");
// 2. Normalize line endings: CRLF and CR -> LF.
text = text.replace(/\r\n?/g, "\n");
// 4. Strip soft hyphens (U+00AD) — including the line break that follows
// one — so PDF line-broken hyphenations reassemble. Done before
// horizontal collapse so no stray space remains.
text = text.replace(SOFT_HYPHEN_AT_BREAK, "");
// 3. Collapse horizontal whitespace runs to a single space.
text = text.replace(HORIZONTAL_WHITESPACE_RUN, " ");
// Trim line-edge whitespace first so whitespace-only lines count as blank.
text = text.replace(/ +\n/g, "\n").replace(/\n +/g, "\n");
// 5. Preserve paragraph boundaries (\n\n); collapse 3+ newlines to exactly two.
text = text.replace(PARAGRAPH_RUN, "\n\n");
// Trim leading/trailing whitespace from the whole document.
text = text.trim();
return { text, version: NORMALIZE_VERSION };
}
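
The two less obvious patterns in `normalize.ts` can be tried in isolation. The sketch below uses local copies of the regexes, written with `\u` escapes so the invisible soft hyphen is visible in source. The sample strings are made up for illustration and do not come from the fixture corpus.

```typescript
// Local copies of the patterns from normalize.ts (illustration only).
const SOFT_HYPHEN_AT_BREAK = /\u00AD\n?/g;
const HORIZONTAL_WHITESPACE_RUN = /[^\S\r\n]+/g;

// NBSP (U+00A0), tab and em space (U+2003) all fall into the class,
// so a mixed run collapses to one ordinary space.
const collapsed = "a\u00A0\t\u2003b".replace(HORIZONTAL_WHITESPACE_RUN, " ");

// \n is excluded from the class, so line structure survives the collapse.
const keptBreak = "a  \n\t b".replace(HORIZONTAL_WHITESPACE_RUN, " ");

// A soft hyphen inside a word is dropped; a soft hyphen at a line break is
// dropped together with the break, which rejoins the word.
const rejoined = "Donau\u00ADdampf\u00AD\nschiff".replace(SOFT_HYPHEN_AT_BREAK, "");

console.log(JSON.stringify({ collapsed, keptBreak, rejoined }));
```

Because the newline is consumed by the soft-hyphen pattern itself, no separate pass is needed to rejoin hyphenated words, and running it before the whitespace collapse means no stray space is left where the break used to be.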

7
src/source/README.md Normal file

@@ -0,0 +1,7 @@
# `src/source/` — ingest, fingerprint, representation extraction, recovery
Future home: `evidence-source`.
Owns: PDF/HTML/MD ingest, fingerprinting, page-/offset-map construction,
canonical-text extraction, and citation-recovery behavior.
May import from: `shared/`, `engine/` (`wiki/DependencyMap.md` §4).

1
src/source/index.ts Normal file

@@ -0,0 +1 @@
export {};

8
src/work/README.md Normal file

@@ -0,0 +1,8 @@
# `src/work/` — review UI, sidebar, viewer shell
Future home: `citation-work`.
Owns: the React surfaces for the review workflow — collection list, viewer
shell, evidence sidebar, annotation create flow.
May import from: `shared/`, `engine/`, `anchor/`, `source/`
(`wiki/DependencyMap.md` §4). May NOT import from `binder/`.
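
The import rules stated in the two READMEs above can be read as a small lookup table. The following is a hypothetical sketch, not the repository's eslint-plugin-boundaries configuration: `ALLOWED_IMPORTS` and `mayImport` are illustrative names, only the `source` and `work` edges quoted above are encoded, and treating same-subsystem imports as always allowed is an assumption.

```typescript
type Subsystem =
  | "shared" | "engine" | "anchor" | "source" | "binder" | "work" | "app";

// Only the edges quoted in src/source/README.md and src/work/README.md.
// Other subsystems are left out because their edges are not shown here.
const ALLOWED_IMPORTS: Partial<Record<Subsystem, readonly Subsystem[]>> = {
  source: ["shared", "engine"],
  work: ["shared", "engine", "anchor", "source"],
};

// Returns true/false when a rule is recorded for `from`, and undefined when
// this sketch has no rule for it.
function mayImport(from: Subsystem, to: Subsystem): boolean | undefined {
  if (from === to) return true; // assumption: intra-subsystem imports are fine
  const allowed = ALLOWED_IMPORTS[from];
  return allowed === undefined ? undefined : allowed.includes(to);
}

console.log(mayImport("work", "binder")); // the edge the work README forbids
```

Keeping the table as data makes it easy to compare against `wiki/DependencyMap.md` §4 by eye; the lint rule remains the enforcing mechanism.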

1
src/work/index.ts Normal file

@@ -0,0 +1 @@
export {};

37
tsconfig.json Normal file

@@ -0,0 +1,37 @@
{
"compilerOptions": {
"target": "ES2022",
"lib": ["ES2022", "DOM", "DOM.Iterable"],
"module": "ESNext",
"moduleResolution": "Bundler",
"jsx": "react-jsx",
"strict": true,
"noImplicitOverride": true,
"noFallthroughCasesInSwitch": true,
"noUncheckedIndexedAccess": true,
"exactOptionalPropertyTypes": true,
"noUnusedLocals": true,
"noUnusedParameters": true,
"esModuleInterop": true,
"forceConsistentCasingInFileNames": true,
"isolatedModules": true,
"skipLibCheck": true,
"resolveJsonModule": true,
"verbatimModuleSyntax": true,
"baseUrl": ".",
"paths": {
"@shared/*": ["src/shared/*"],
"@engine/*": ["src/engine/*"],
"@anchor/*": ["src/anchor/*"],
"@source/*": ["src/source/*"],
"@work/*": ["src/work/*"],
"@binder/*": ["src/binder/*"],
"@app/*": ["src/app/*"]
}
},
"include": ["src", "vite.config.ts", "vitest.config.ts"],
"exclude": ["node_modules", "dist"]
}
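
Two of the stricter flags in this `tsconfig.json` change how everyday code has to be written. The sketch below shows code that compiles under `noUncheckedIndexedAccess` and `exactOptionalPropertyTypes`; `firstWord`, `QuoteLocation` and `describeLocation` are hypothetical names chosen for illustration and are not part of the commit.

```typescript
// With noUncheckedIndexedAccess, words[0] has type `string | undefined`,
// so the empty-array case must be handled explicitly.
function firstWord(words: readonly string[]): string {
  const head = words[0];
  return head ?? "";
}

// With exactOptionalPropertyTypes, `page?: number` means the property may be
// absent, but writing `{ page: undefined }` is rejected by the compiler.
interface QuoteLocation {
  quote: string;
  page?: number;
}

function describeLocation(loc: QuoteLocation): string {
  return loc.page === undefined ? loc.quote : `${loc.quote} (p. ${loc.page})`;
}

const withPage: QuoteLocation = { quote: "Beispiel", page: 3 };
const withoutPage: QuoteLocation = { quote: "Beispiel" }; // property omitted

console.log(firstWord([]), describeLocation(withPage), describeLocation(withoutPage));
```

The same code also compiles with the flags off, so it can be tried in any TypeScript setup; the flags only remove the unsafe alternatives.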

View File

@@ -8,7 +8,7 @@ repo_id: a677c189-b4e2-4f2a-9e48-faa482c277e6
 topic_slug: citation_evidence_mvp
 topic_id: 96fa8e80-9f74-40f2-84cd-644e9747b9ec
 state_hub_workstream_id: 1737bf6e-a3cb-413e-81b8-932f6f85791c
-status: todo
+status: done
 owner: Bernd
 created: 2026-05-24
 updated: 2026-05-24
@@ -57,7 +57,7 @@ T01 (toolchain decision + package.json)
 id: CE-WP-0001-T01
 state_hub_task_id: 4de816d0-34de-4bdf-a802-da1b0feefc19
 priority: critical
-status: todo
+status: done
 ```
 Decide the TS toolchain (vite vs tsc-only vs Next.js) and write a single
@@ -85,7 +85,7 @@ Do not install application dependencies yet — just the toolchain.
 id: CE-WP-0001-T02
 state_hub_task_id: 448d2d93-9517-4649-8aac-e00907a12a0a
 priority: critical
-status: todo
+status: done
 depends_on: [T01]
 ```
@@ -116,7 +116,7 @@ Add path aliases in `tsconfig.json`: `@shared/*`, `@engine/*`, etc.
 id: CE-WP-0001-T03
 state_hub_task_id: abd08afb-78e5-4b41-b956-53e5605c1113
 priority: high
-status: todo
+status: done
 depends_on: [T02]
 ```
@@ -146,7 +146,7 @@ lint catches it; remove the fixture afterward.
 id: CE-WP-0001-T04
 state_hub_task_id: 0ca4f848-20c5-425e-8996-a73569c9be16
 priority: critical
-status: todo
+status: done
 depends_on: [T02]
 ```
@@ -179,7 +179,7 @@ changes can be detected as a migration concern.
 id: CE-WP-0001-T05
 state_hub_task_id: 0b686530-ef89-4172-b5c8-de97fa7b7ef0
 priority: high
-status: todo
+status: done
 depends_on: [T01]
 ```
@@ -209,7 +209,7 @@ Keep each PDF small (< 1 MB) and check sources/licenses into
 id: CE-WP-0001-T06
 state_hub_task_id: b0a5b5a4-81f0-4359-a6e1-67bc6c77e52b
 priority: medium
-status: todo
+status: done
 depends_on: [T01, T02]
 ```
@@ -233,7 +233,7 @@ the MVP phase.
 id: CE-WP-0001-T07
 state_hub_task_id: 15456374-73e0-403e-b805-2e259247e615
 priority: medium
-status: todo
+status: done
 depends_on: [T01]
 ```