# Architecture Overview: citation-evidence

## 1. Purpose

This document describes the initial architecture for **citation-evidence**, a modular evidence workspace for capturing, managing, presenting, and reopening citations across PDFs and other document formats.

The architecture is designed to support three primary product modes:

1. **Document Review**: add documents to a collection, mark passages, comment on them, and create reusable evidence items.
2. **Evidence-Backed Forms**: display documents next to forms, bind evidence to fields, and navigate from field to cited source context.
3. **Citation Recovery**: start from an external citation, quote, or source clue, find the digital source if available, locate the cited passage, and create an annotation.

The system should remain viewer-independent, format-neutral, and suitable for future agentic workflows.

---

## 2. Architectural Summary

At its core, **citation-evidence** separates five concerns:

```text
Document Source
  The original PDF, Markdown, HTML, web page, scan, or other document.

Document Representation
  A normalized, searchable, addressable representation derived from the source.

Annotation Anchor
  A durable technical reference to a passage inside a representation.

Evidence Item
  A meaningful evidence object built from one or more annotations and commentary.

Evidence Binding
  A connection between evidence and a structured target such as a form field, claim, requirement, or decision.
```

The high-level architecture is:

```text
┌────────────────────────────────────────────────────────────────────┐
│                         citation-evidence                          │
│      Umbrella app, workspace shell, integration, demos, docs       │
└────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌────────────────────────────────────────────────────────────────────┐
│                          citation-engine                           │
│ Core domain model, APIs, persistence contracts, citation rendering │
└────────────────────────────────────────────────────────────────────┘
                                   │
          ┌────────────────────────┼────────────────────────┐
          ▼                        ▼                        ▼
┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│ evidence-source  │     │ evidence-anchor  │     │ citation-work    │
│ ingestion,       │     │ selectors,       │     │ review UI,       │
│ extraction,      │     │ resolving,       │     │ collections,     │
│ recovery         │     │ highlighting     │     │ annotation UX    │
└──────────────────┘     └──────────────────┘     └──────────────────┘
          │                        │                        │
          └───────────┬────────────┴────────────┬───────────┘
                      ▼                         ▼
            ┌──────────────────┐      ┌──────────────────┐
            │ evidence-binder  │      │ viewer adapters  │
            │ field/claim      │      │ PDF / HTML / MD  │
            │ evidence links   │      │ and later more   │
            └──────────────────┘      └──────────────────┘
```

---

## 3. Repository and Subsystem Boundaries

### 3.1 citation-evidence

**Role:** Umbrella product and integration repository.

This repository ties the subsystem implementations together and provides the reference product experience.

Responsibilities:

* Workspace shell.
* Cross-subsystem integration.
* Reference web application.
* Demo scenarios.
* Product documentation.
* System-level tests.
* Example deployments.
* Developer onboarding.

Should contain:

```text
citation-evidence/
  README.md
  INTENT.md
  ARCHITECTURE.md
  PRODUCT_REQUIREMENTS.md
  apps/
    workspace-demo/
  docs/
    concepts/
    decisions/
  examples/
  integration-tests/
  docker-compose.yml
```

Should not contain:

* The low-level anchoring algorithms.
* The complete document ingestion implementation.
* The full domain engine implementation.
* Viewer-specific internals except as integration examples.

---

### 3.2 citation-engine

**Role:** Core domain engine and service layer.

This is the conceptual center of the system. It owns the stable domain model and the API contracts used by the other subsystems.

Responsibilities:

* Core domain model.
* Document, annotation, evidence, and binding APIs.
* Persistence interfaces.
* Citation card rendering contracts.
* Markdown and HTML export logic.
* W3C Web Annotation-compatible mapping.
* Event model.
* Orchestration between source, anchor, work, and binder subsystems.

Key concepts owned:

```text
Document
DocumentRepresentation
Annotation
Selector
EvidenceItem
EvidenceLink
EvidenceSet
CitationCard
CitationRecoveryAttempt
```

Suggested package structure:

```text
citation-engine/
  packages/
    model/
    api-contracts/
    persistence/
    citation-rendering/
    events/
    w3c-mapping/
  docs/
  tests/
```

Primary interfaces:

```ts
type DocumentId = string;
type AnnotationId = string;
type EvidenceItemId = string;
type EvidenceLinkId = string;

interface CitationEngine {
  documents: DocumentService;
  annotations: AnnotationService;
  evidence: EvidenceService;
  bindings: EvidenceBindingService;
  rendering: CitationRenderingService;
}
```

---

### 3.3 evidence-anchor

**Role:** Format-neutral anchoring, selector resolution, and highlighting contract.

This repository is responsible for making citations durable and reopenable.

Responsibilities:

* Selector model.
* Text quote selectors.
* Text position selectors.
* PDF page/rectangle selectors.
* DOM/structural selectors.
* Selector creation from user selections.
* Selector resolution against document representations.
* Fuzzy re-anchoring.
* Highlight rendering contract.
* Orphaned annotation detection.

Key architectural rule:

**No citation should depend on a single visual coordinate system only.**

The subsystem should store redundant selectors where possible:

```text
PDF citation:
  - exact quote
  - prefix/suffix
  - page number
  - normalized page rectangles
  - page-local text offsets
  - global canonical text offsets

HTML/Markdown citation:
  - exact quote
  - prefix/suffix
  - canonical text offsets
  - DOM range or structural path
  - heading/section context
```

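
The "normalized page rectangles" entry can be illustrated with a small sketch. It assumes that normalizing means storing each rectangle as fractions of the page size, so the same stored value can be projected onto a page rendered at any zoom level. The `Rect` and `PageSize` shapes and both function names are hypothetical; the document does not define them.

```ts
// Hypothetical shapes: the document names normalized rectangles but does not define them.
interface Rect { x: number; y: number; width: number; height: number; }
interface PageSize { width: number; height: number; }

/** Convert a rectangle in page units into fractions of the page size (0..1). */
function normalizeRect(rect: Rect, page: PageSize): Rect {
  return {
    x: rect.x / page.width,
    y: rect.y / page.height,
    width: rect.width / page.width,
    height: rect.height / page.height,
  };
}

/** Project a normalized rectangle onto a page rendered at any size. */
function denormalizeRect(rect: Rect, rendered: PageSize): Rect {
  return {
    x: rect.x * rendered.width,
    y: rect.y * rendered.height,
    width: rect.width * rendered.width,
    height: rect.height * rendered.height,
  };
}

// A selection captured on a 612 x 792 page, reopened on the same page drawn twice as large.
const stored = normalizeRect(
  { x: 61.2, y: 158.4, width: 306, height: 15.84 },
  { width: 612, height: 792 }
);
const onScreen = denormalizeRect(stored, { width: 1224, height: 1584 });
```
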
Suggested package structure:

```text
evidence-anchor/
  packages/
    selectors/
    resolver/
    fuzzy-match/
    highlight-contract/
    pdf-selectors/
    dom-selectors/
  docs/
  tests/
```

Core interface:

```ts
interface AnchorAdapter {
  createSelectors(selection: SelectionCapture): Promise<Selector[]>;
  resolveSelectors(
    representation: DocumentRepresentation,
    selectors: Selector[]
  ): Promise<AnchorResolution>;
  renderHighlight(
    target: ResolvedAnchorTarget,
    options?: HighlightRenderOptions
  ): Promise<void>;
  scrollToTarget(
    target: ResolvedAnchorTarget,
    options?: ScrollToTargetOptions
  ): Promise<void>;
}
```

Resolution result:

```ts
type AnchorResolution = {
  status: "resolved" | "ambiguous" | "unresolved" | "stale";
  confidence: number;
  candidates: ResolvedAnchorTarget[];
  usedSelectorTypes: string[];
  warnings?: string[];
};
```

---

### 3.4 evidence-source

**Role:** Document ingestion, source metadata, full-text extraction, and citation recovery.

This repository turns raw sources into usable document representations and supports the process of recovering cited passages from external references.

Responsibilities:

* Document import.
* Source URI handling.
* Metadata extraction.
* Fingerprinting.
* Text extraction.
* PDF text extraction pipeline.
* Markdown normalization.
* HTML normalization and sanitization.
* Optional OCR integration later.
* Local source matching.
* External source discovery hooks.
* Citation recovery attempts.

Suggested package structure:

```text
evidence-source/
  packages/
    ingest-core/
    fingerprinting/
    metadata/
    extract-pdf/
    extract-markdown/
    extract-html/
    source-lookup/
    citation-recovery/
  docs/
  tests/
```

Core ingestion pipeline:

```text
Raw Source
  → identify media type
  → compute fingerprint
  → extract metadata
  → extract canonical text
  → build format-specific maps
  → persist Document + DocumentRepresentation
```

PDF representation should include:

```text
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
optional normalized rectangles for selections
```

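
As a rough illustration of how "global canonical text" and a "page-local offset map" could relate, the sketch below joins per-page text into one string and records where each page starts and ends. The `PageOffsetEntry` shape, the whitespace normalization rule, the single-newline page separator, and both function names are assumptions made for this example, not part of the specified model.

```ts
// Hypothetical entry of a page-local offset map: [start, end) in the global canonical text.
interface PageOffsetEntry { page: number; start: number; end: number; }

function buildCanonicalText(
  pageTexts: string[]
): { canonicalText: string; pageMap: PageOffsetEntry[] } {
  const pageMap: PageOffsetEntry[] = [];
  let canonicalText = "";
  pageTexts.forEach((raw, index) => {
    // Collapse whitespace runs so offsets do not depend on extraction quirks.
    const text = raw.replace(/\s+/g, " ").trim();
    if (index > 0) canonicalText += "\n"; // one separator character between pages
    const start = canonicalText.length;
    canonicalText += text;
    pageMap.push({ page: index + 1, start, end: canonicalText.length });
  });
  return { canonicalText, pageMap };
}

/** Translate a global offset into a 1-based page number and a page-local offset. */
function toPageLocal(
  pageMap: PageOffsetEntry[],
  globalOffset: number
): { page: number; offset: number } | null {
  for (const entry of pageMap) {
    if (globalOffset >= entry.start && globalOffset < entry.end) {
      return { page: entry.page, offset: globalOffset - entry.start };
    }
  }
  return null; // the offset falls on a page separator or outside the text
}

const representation = buildCanonicalText(["First  page\ntext.", "  Second page. "]);
```
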
Markdown/HTML representation should include:

```text
canonical text
DOM or AST structure
heading map
offset-to-node map
source line map where available
sanitized render output
```

Citation recovery pipeline:

```text
Citation clue / quote / reference
  → parse clue
  → search local library
  → search configured external sources if allowed
  → identify candidate documents
  → extract/index candidate text
  → exact quote search
  → fuzzy quote search
  → present candidates
  → user confirms
  → create annotation + evidence item
```

---

### 3.5 citation-work

**Role:** Review workspace and annotation user experience.

This repository provides the user-facing workflows for reviewing document collections and creating evidence from selected passages.

Responsibilities:

* Document collection UI.
* Review queue.
* Document viewer composition.
* Annotation creation UX.
* Evidence sidebar.
* Review state management.
* Tagging and filtering.
* Navigation between evidence items and source context.

Suggested package structure:

```text
citation-work/
  packages/
    review-ui/
    collection-ui/
    evidence-sidebar/
    annotation-toolbar/
    viewer-shell/
    review-state/
  docs/
  tests/
```

Core UI layout:

```text
┌─────────────────────────────────────────────────────────────┐
│ Collection / Review Header                                  │
├───────────────┬───────────────────────────────┬─────────────┤
│ Document List │ Document Viewer               │ Evidence    │
│ / Filters     │ PDF / HTML / Markdown         │ Sidebar     │
└───────────────┴───────────────────────────────┴─────────────┘
```

Review states:

```text
unreviewed
in-review
marked
relevant
rejected
needs-follow-up
cited
verified
```

Evidence states:

```text
candidate
confirmed
rejected
needs-check
strong-support
weak-support
contradicts
```

---

### 3.6 evidence-binder

**Role:** Binding evidence to structured targets such as form fields, claims, requirements, decisions, or document sections.

This repository provides the graph-like layer between evidence and the things it supports.

Responsibilities:

* Evidence-to-field links.
* Evidence-to-claim links.
* Evidence-to-requirement links.
* Evidence sets.
* Relation types.
* Form synchronization state.
* Active field/evidence/annotation state.
* Visual guide model.
* Evidence completeness indicators.

Suggested package structure:

```text
evidence-binder/
  packages/
    binding-model/
    form-evidence-state/
    evidence-switcher/
    visual-guide-overlay/
    target-adapters/
  docs/
  tests/
```

Core model:

```ts
type EvidenceTargetType =
  | "form-field"
  | "claim"
  | "requirement"
  | "decision"
  | "document-section";

type EvidenceRelation =
  | "supports"
  | "contradicts"
  | "explains"
  | "source-for"
  | "qualifies";

interface EvidenceLink {
  id: string;
  evidenceItemId: string;
  targetType: EvidenceTargetType;
  targetId: string;
  relation: EvidenceRelation;
  confidence?: number;
  status?: "candidate" | "confirmed" | "rejected" | "needs-check";
}
```

Evidence form UI model:

```text
Form Field Activated
  → evidence-binder loads linked EvidenceSet
  → citation-engine resolves active evidence
  → evidence-anchor scrolls document viewer to annotation
  → visual-guide-overlay connects field, evidence card, and highlight
```

Visual guide architecture:

```text
Element Registry
  field target rect
  evidence card rect
  annotation highlight rect

Guide Overlay
  SVG line or curve from field to evidence card
  SVG line or curve from evidence card to annotation
  active state updates on scroll, resize, focus, and evidence switch
```

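
The guide overlay step can be sketched as a pure function from two registered rectangles to an SVG path string. This is one possible curve shape, assuming the source element sits to the left of its target; `guidePath` and the `Rect` shape are hypothetical names. Recomputing the path on scroll, resize, focus, and evidence switch would be the overlay's job and is not shown.

```ts
// Hypothetical rectangle shape, e.g. copied from element bounding boxes by the registry.
interface Rect { x: number; y: number; width: number; height: number; }

/** Build an SVG cubic Bezier path from the right edge of `from` to the left edge of `to`. */
function guidePath(from: Rect, to: Rect): string {
  const startX = from.x + from.width;
  const startY = from.y + from.height / 2;
  const endX = to.x;
  const endY = to.y + to.height / 2;
  // Both control points sit at the horizontal midpoint, which produces an S-shaped curve.
  const midX = (startX + endX) / 2;
  return `M ${startX} ${startY} C ${midX} ${startY}, ${midX} ${endY}, ${endX} ${endY}`;
}

const fieldRect: Rect = { x: 20, y: 100, width: 200, height: 40 };
const cardRect: Rect = { x: 300, y: 60, width: 180, height: 80 };
const highlightRect: Rect = { x: 560, y: 300, width: 240, height: 20 };

// One segment per hop: field to evidence card, then evidence card to annotation highlight.
const guideSegments = [guidePath(fieldRect, cardRect), guidePath(cardRect, highlightRect)];
```
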

---

## 4. Core Domain Model

### 4.1 Document

A source object known to the system.

```ts
interface Document {
  id: string;
  title?: string;
  uri?: string;
  mediaType: string;
  fingerprint?: string;
  version?: string;
  createdAt: string;
  updatedAt: string;
  metadata?: Record<string, unknown>;
}
```

### 4.2 DocumentRepresentation

A normalized representation generated from a document source.

```ts
interface DocumentRepresentation {
  id: string;
  documentId: string;
  representationType:
    | "pdf-text"
    | "html-dom"
    | "markdown-rendered"
    | "plain-text"
    | "ocr-text";
  contentHash: string;
  canonicalText?: string;
  pageMap?: PageMap;
  structureMap?: StructureMap;
  offsetMap?: OffsetMap;
  generatedAt: string;
}
```

### 4.3 Selector

A technical locator for a document passage.

```ts
type Selector =
  | TextQuoteSelector
  | TextPositionSelector
  | PdfRectSelector
  | DomRangeSelector
  | StructuralSelector;
```

Recommended selector redundancy:

```text
Always capture:
  - exact quote
  - prefix/suffix context

Capture when available:
  - canonical text offsets
  - PDF page/rectangles
  - DOM range
  - structural path
  - heading context
```

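
The member types of `Selector` are named above but not defined. The sketch below shows one way they might look as a discriminated union, plus a check for the "always capture" rule. The `TextQuoteSelector` and `TextPositionSelector` fields follow the W3C Web Annotation vocabulary (`exact`, `prefix`, `suffix`, `start`, `end`); the shapes of `PdfRectSelector`, `DomRangeSelector`, and `StructuralSelector`, and the helper `missingRequiredSelectors`, are assumptions for illustration only.

```ts
interface TextQuoteSelector { type: "TextQuoteSelector"; exact: string; prefix?: string; suffix?: string; }
interface TextPositionSelector { type: "TextPositionSelector"; start: number; end: number; }
// The three shapes below are hypothetical; the document only names these selector kinds.
interface PdfRectSelector {
  type: "PdfRectSelector";
  page: number;
  rects: { x: number; y: number; width: number; height: number }[];
}
interface DomRangeSelector {
  type: "DomRangeSelector";
  startContainer: string;
  startOffset: number;
  endContainer: string;
  endOffset: number;
}
interface StructuralSelector { type: "StructuralSelector"; path: string; headingContext?: string[]; }

type Selector =
  | TextQuoteSelector
  | TextPositionSelector
  | PdfRectSelector
  | DomRangeSelector
  | StructuralSelector;

/** Report which parts of the "always capture" rule a selector set is missing. */
function missingRequiredSelectors(selectors: Selector[]): string[] {
  const quote = selectors.find(
    (s): s is TextQuoteSelector => s.type === "TextQuoteSelector"
  );
  const missing: string[] = [];
  if (!quote || quote.exact.length === 0) missing.push("exact quote");
  if (!quote || (quote.prefix === undefined && quote.suffix === undefined)) {
    missing.push("prefix/suffix context");
  }
  return missing;
}
```
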
### 4.4 Annotation

A technical mark on a document range.

```ts
interface Annotation {
  id: string;
  documentId: string;
  representationId?: string;
  selectors: Selector[];
  quote?: string;
  note?: string;
  createdBy?: string;
  createdAt: string;
  updatedAt: string;
}
```

### 4.5 EvidenceItem

A meaningful evidence object built from one or more annotations.

```ts
interface EvidenceItem {
  id: string;
  annotationIds: string[];
  title?: string;
  commentary?: string;
  status: "candidate" | "confirmed" | "rejected" | "needs-check";
  confidence?: number;
  tags?: string[];
  createdBy?: string;
  createdAt: string;
  updatedAt: string;
}
```

### 4.6 EvidenceSet

A group of evidence items connected to a target or topic.

```ts
interface EvidenceSet {
  id: string;
  label?: string;
  targetType?: string;
  targetId?: string;
  evidenceItemIds: string[];
  activeEvidenceItemId?: string;
}
```

### 4.7 CitationCard

A presentable rendering of an evidence item.

```ts
interface CitationCard {
  id: string;
  evidenceItemId: string;
  quote: string;
  sourceLabel: string;
  commentary?: string;
  openContextUrl?: string;
  format: "html" | "markdown" | "web-component";
}
```

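
A Markdown export of a `CitationCard` could look roughly like the sketch below. The output layout (blockquote, source line, commentary paragraph) is an assumption, since the document does not specify the card format, and `renderCitationCardMarkdown` is a hypothetical name. The sketch does not escape Markdown special characters in the quote or label, which a real exporter would need to handle.

```ts
interface CitationCard {
  id: string;
  evidenceItemId: string;
  quote: string;
  sourceLabel: string;
  commentary?: string;
  openContextUrl?: string;
  format: "html" | "markdown" | "web-component";
}

/** Render a card as a Markdown blockquote with a source line and optional commentary. */
function renderCitationCardMarkdown(card: CitationCard): string {
  const quoteLines = card.quote.split("\n").map((line) => `> ${line}`);
  // Link the source label back to the viewer when a context URL is available.
  const source = card.openContextUrl
    ? `[${card.sourceLabel}](${card.openContextUrl})`
    : card.sourceLabel;
  const lines = [...quoteLines, ">", `> Source: ${source}`];
  if (card.commentary) lines.push("", card.commentary);
  return lines.join("\n");
}

const exampleCard: CitationCard = {
  id: "card_1",
  evidenceItemId: "ev_123",
  quote: "Line one\nLine two",
  sourceLabel: "Example report, p. 4",
  commentary: "Supports field A.",
  openContextUrl: "/viewer?document=doc_123&annotation=ann_456",
  format: "markdown",
};
const exampleMarkdown = renderCitationCardMarkdown(exampleCard);
```
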

---

## 5. Viewer Adapter Architecture

The system must not hard-code one viewer implementation into the citation model.

Each document format should be supported through a viewer adapter.

```ts
interface DocumentViewerAdapter {
  mediaTypes: string[];

  load(document: Document, representation?: DocumentRepresentation): Promise<void>;

  getCurrentSelection(): Promise<SelectionCapture | null>;

  createSelectorsFromSelection(
    selection: SelectionCapture
  ): Promise<Selector[]>;

  resolveSelectors(
    selectors: Selector[]
  ): Promise<AnchorResolution>;

  scrollToResolvedTarget(
    target: ResolvedAnchorTarget,
    options?: {
      center?: boolean;
      behavior?: "auto" | "smooth";
    }
  ): Promise<void>;

  renderHighlight(
    target: ResolvedAnchorTarget,
    options?: HighlightRenderOptions
  ): Promise<void>;

  getHighlightClientRects(
    annotationId: string
  ): Promise<DOMRect[]>;
}
```

Initial adapters:

```text
PDFViewerAdapter
  PDF.js / react-pdf-highlighter-plus based

HtmlViewerAdapter
  sanitized HTML, DOM selection, DOM ranges

MarkdownViewerAdapter
  markdown → HTML rendering, DOM selection, optional source-map support
```

Future adapters:

```text
DocxViewerAdapter
EpubViewerAdapter
ImageOcrViewerAdapter
PlainTextViewerAdapter
```

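
One way to keep viewer choice out of the citation model is a small registry that picks an adapter by media type. The sketch below is an assumption about how such a lookup might work; `ViewerAdapterRegistry` and `ViewerAdapterLike` are hypothetical names, and the registered objects are stand-ins that carry only the `mediaTypes` member of `DocumentViewerAdapter`.

```ts
// Reduced to the one member this sketch needs; real adapters implement the full interface.
interface ViewerAdapterLike { mediaTypes: string[]; }

class ViewerAdapterRegistry<T extends ViewerAdapterLike> {
  private readonly byMediaType = new Map<string, T>();

  register(adapter: T): void {
    for (const mediaType of adapter.mediaTypes) {
      this.byMediaType.set(mediaType.toLowerCase(), adapter);
    }
  }

  /** Look up by bare media type, ignoring parameters such as "; charset=utf-8". */
  resolve(mediaType: string): T | undefined {
    const bare = (mediaType.split(";")[0] ?? "").trim().toLowerCase();
    return this.byMediaType.get(bare);
  }
}

const registry = new ViewerAdapterRegistry<{ name: string; mediaTypes: string[] }>();
registry.register({ name: "PDFViewerAdapter", mediaTypes: ["application/pdf"] });
registry.register({ name: "HtmlViewerAdapter", mediaTypes: ["text/html"] });
registry.register({ name: "MarkdownViewerAdapter", mediaTypes: ["text/markdown"] });
```
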

---

## 6. Data Flow: Document Review

```text
User adds document
  → evidence-source imports source
  → document fingerprint is computed
  → document metadata is extracted
  → document representation is generated
  → citation-engine stores Document and DocumentRepresentation
  → citation-work displays document
  → user selects passage
  → viewer adapter captures selection
  → evidence-anchor creates selectors
  → citation-engine creates Annotation
  → user adds commentary
  → citation-engine creates EvidenceItem
  → citation-work shows item in evidence sidebar
```

Result:

```text
Document + Representation + Annotation + EvidenceItem
```

---

## 7. Data Flow: Reopen Citation Context

```text
User clicks citation or evidence item
  → citation-engine loads EvidenceItem
  → citation-engine loads Annotation
  → citation-engine loads Document and Representation
  → viewer adapter opens document if needed
  → evidence-anchor resolves selectors
  → viewer adapter scrolls target into center
  → viewer adapter renders highlight
  → citation-work/evidence-binder shows active state
```

Resolution strategy:

```text
1. Try exact representation/version match.
2. Try position selector.
3. Verify exact quote.
4. Try PDF page/rectangle selector if PDF.
5. Try text quote selector with prefix/suffix.
6. Try fuzzy quote matching.
7. If multiple matches, rank by structural/page context.
8. If unresolved, mark annotation as orphaned.
```

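
Steps 2, 3, and 5 of the strategy above can be sketched against plain canonical text. This is a simplified illustration, not the resolver itself: it skips the representation match, PDF rectangles, fuzzy matching, and ranking, and it returns character offsets instead of `ResolvedAnchorTarget` values. The `QuoteSelectors` and `Resolution` shapes and the `resolveInText` name are hypothetical.

```ts
// Hypothetical flattened view of a text quote selector plus an optional text position selector.
interface QuoteSelectors { exact: string; prefix?: string; suffix?: string; start?: number; end?: number; }
interface Resolution {
  status: "resolved" | "ambiguous" | "unresolved";
  candidates: number[]; // start offsets in the canonical text
  usedSelectorTypes: string[];
}

function resolveInText(text: string, sel: QuoteSelectors): Resolution {
  if (sel.exact.length === 0) {
    return { status: "unresolved", candidates: [], usedSelectorTypes: [] };
  }
  // Steps 2-3: try the stored position, then verify that the quote is still there.
  if (sel.start !== undefined && text.slice(sel.start, sel.end) === sel.exact) {
    return {
      status: "resolved",
      candidates: [sel.start],
      usedSelectorTypes: ["TextPositionSelector", "TextQuoteSelector"],
    };
  }
  // Step 5: collect every exact occurrence of the quote.
  const hits: number[] = [];
  for (let i = text.indexOf(sel.exact); i !== -1; i = text.indexOf(sel.exact, i + 1)) {
    hits.push(i);
  }
  // Keep occurrences whose surrounding text matches the stored prefix/suffix.
  const inContext = hits.filter((i) => {
    const before = text.slice(0, i);
    const after = text.slice(i + sel.exact.length);
    return (
      (sel.prefix === undefined || before.endsWith(sel.prefix)) &&
      (sel.suffix === undefined || after.startsWith(sel.suffix))
    );
  });
  // If the context no longer matches anywhere, fall back to all exact occurrences.
  const candidates = inContext.length > 0 ? inContext : hits;
  const status =
    candidates.length === 1 ? "resolved" : candidates.length > 1 ? "ambiguous" : "unresolved";
  return { status, candidates, usedSelectorTypes: ["TextQuoteSelector"] };
}
```
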

---

## 8. Data Flow: Evidence-Backed Form Field

```text
User focuses form field
  → evidence-binder identifies EvidenceSet for field
  → evidence-binder selects active EvidenceItem
  → citation-engine loads annotation and source context
  → viewer adapter resolves and scrolls to annotation
  → evidence sidebar highlights active evidence item
  → form field shows active evidence state
  → visual guide overlay connects field, evidence, and highlight
```

Evidence switch:

```text
User selects next evidence item
  → activeEvidenceItemId changes
  → annotation is resolved
  → viewer scrolls to new passage
  → guide overlay updates
```

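
The first step of the evidence switch, changing `activeEvidenceItemId`, can be sketched as a pure update on an `EvidenceSet`. Wrapping around at the end of the list and returning a new object instead of mutating are design assumptions of this example; `activateNext` is a hypothetical helper name.

```ts
interface EvidenceSet {
  id: string;
  label?: string;
  targetType?: string;
  targetId?: string;
  evidenceItemIds: string[];
  activeEvidenceItemId?: string;
}

/** Return a copy of the set with the next evidence item active, wrapping at the end. */
function activateNext(set: EvidenceSet): EvidenceSet {
  if (set.evidenceItemIds.length === 0) return set;
  // indexOf yields -1 when nothing (or an unknown id) is active, so the first item is chosen.
  const current =
    set.activeEvidenceItemId === undefined
      ? -1
      : set.evidenceItemIds.indexOf(set.activeEvidenceItemId);
  const next = (current + 1) % set.evidenceItemIds.length;
  return { ...set, activeEvidenceItemId: set.evidenceItemIds[next] };
}

const fieldEvidence: EvidenceSet = {
  id: "set_1",
  targetType: "form-field",
  targetId: "field_a",
  evidenceItemIds: ["ev_1", "ev_2", "ev_3"],
  activeEvidenceItemId: "ev_1",
};
const afterOneSwitch = activateNext(fieldEvidence);
```
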

---

## 9. Data Flow: Citation Recovery

```text
User enters citation clue / quote / source reference
  → evidence-source parses clue
  → search local document library
  → rank local candidates
  → if allowed, search configured external sources
  → fetch/load candidate representation where permitted
  → exact quote search
  → fuzzy quote search
  → show candidate passages
  → user confirms passage
  → evidence-anchor creates selectors
  → citation-engine creates Annotation
  → citation-engine creates EvidenceItem
  → optional: evidence-binder links item to target
```

Recovery states:

```text
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
manual-confirmation-needed
annotation-created
```

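
The document does not prescribe a fuzzy matching algorithm for the "fuzzy quote search" step. As one possible approach, the sketch below slides a word window over the candidate text and scores each window against the quote with the Dice coefficient over character bigrams. The function names, the window strategy, and the 0.8 threshold are all assumptions; a match found this way would still go through user confirmation as the pipeline requires.

```ts
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function bigrams(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i < text.length - 1; i++) {
    const gram = text.slice(i, i + 2);
    counts.set(gram, (counts.get(gram) ?? 0) + 1);
  }
  return counts;
}

/** Dice coefficient over character bigrams, in the range 0..1. */
function dice(a: string, b: string): number {
  const ga = bigrams(a);
  const gb = bigrams(b);
  let overlap = 0;
  let total = 0;
  ga.forEach((count, gram) => {
    overlap += Math.min(count, gb.get(gram) ?? 0);
    total += count;
  });
  gb.forEach((count) => {
    total += count;
  });
  return total === 0 ? 0 : (2 * overlap) / total;
}

interface FuzzyMatch { passage: string; score: number; }

/** Find the best-scoring window with as many words as the quote, or null below the threshold. */
function fuzzyFindQuote(documentText: string, quote: string, threshold = 0.8): FuzzyMatch | null {
  const words = normalize(documentText).split(" ");
  const target = normalize(quote);
  const windowSize = target.split(" ").length;
  let best: FuzzyMatch | null = null;
  for (let i = 0; i + windowSize <= words.length; i++) {
    const passage = words.slice(i, i + windowSize).join(" ");
    const score = dice(passage, target);
    if (score >= threshold && (best === null || score > best.score)) {
      best = { passage, score };
    }
  }
  return best;
}
```
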

---

## 10. Persistence Architecture

The architecture should support multiple persistence modes.

### 10.1 Local-First Development Mode

Suitable for early MVPs and personal use.

```text
SQLite / DuckDB / local filesystem
  documents stored as files
  metadata stored in SQLite
  extracted text cached locally
  annotations stored as JSON or relational rows
```

Advantages:

* Simple setup.
* Good for CLI and desktop-like workflows.
* Agent-friendly.
* Easy to version and inspect.

### 10.2 Web Application Mode

Suitable for team or server deployment.

```text
Object storage
  original documents

PostgreSQL
  documents
  representations
  annotations
  evidence items
  evidence links

Search index
  full-text and quote search
```

Recommended baseline:

```text
PostgreSQL
  canonical metadata and relationships

Object storage / filesystem
  document blobs and generated representations

Meilisearch / Typesense / OpenSearch
  full-text document and evidence search
```

### 10.3 Persistence Boundaries

`citation-engine` should define persistence interfaces.

Concrete storage implementations should be replaceable.

```ts
interface AnnotationRepository {
  create(annotation: Annotation): Promise<Annotation>;
  get(id: string): Promise<Annotation | null>;
  listByDocument(documentId: string): Promise<Annotation[]>;
  update(annotation: Annotation): Promise<Annotation>;
}
```

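
To show what "replaceable" could mean in practice, here is a minimal in-memory implementation of `AnnotationRepository`, the kind of stand-in that tests or a local prototype might use before a SQLite or PostgreSQL implementation exists. The `Annotation` type is reduced to a few fields so the sketch stays self-contained, and the error behavior on duplicate or missing ids is an assumption.

```ts
// Reduced Annotation: the full model also carries selectors, note, createdBy, and more.
interface Annotation { id: string; documentId: string; quote?: string; createdAt: string; updatedAt: string; }

interface AnnotationRepository {
  create(annotation: Annotation): Promise<Annotation>;
  get(id: string): Promise<Annotation | null>;
  listByDocument(documentId: string): Promise<Annotation[]>;
  update(annotation: Annotation): Promise<Annotation>;
}

class InMemoryAnnotationRepository implements AnnotationRepository {
  private readonly rows = new Map<string, Annotation>();

  async create(annotation: Annotation): Promise<Annotation> {
    if (this.rows.has(annotation.id)) {
      throw new Error(`Annotation ${annotation.id} already exists`);
    }
    this.rows.set(annotation.id, { ...annotation }); // store a copy, not the caller's object
    return { ...annotation };
  }

  async get(id: string): Promise<Annotation | null> {
    const row = this.rows.get(id);
    return row ? { ...row } : null;
  }

  async listByDocument(documentId: string): Promise<Annotation[]> {
    const result: Annotation[] = [];
    this.rows.forEach((row) => {
      if (row.documentId === documentId) result.push({ ...row });
    });
    return result;
  }

  async update(annotation: Annotation): Promise<Annotation> {
    if (!this.rows.has(annotation.id)) {
      throw new Error(`Annotation ${annotation.id} does not exist`);
    }
    this.rows.set(annotation.id, { ...annotation });
    return { ...annotation };
  }
}
```
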

---

## 11. Search and Indexing Architecture

Search is needed for:

* Finding documents.
* Finding evidence items.
* Searching within a document.
* Citation recovery.
* Fuzzy re-anchoring.

Index types:

```text
Document metadata index
  title, author, source URI, document type, collection

Full-text document index
  canonical text, page text, section text

Evidence index
  quote, commentary, tags, target links

Anchor recovery index
  n-grams, quote fragments, prefix/suffix context
```

For the MVP, local full-text search may be enough.

Later, source recovery and large document collections will benefit from a dedicated search service.

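
For the MVP-sized case, an anchor recovery index can be approximated in memory. The sketch below indexes word trigrams per document and ranks documents by how many trigrams of a quote fragment they contain. `NgramIndex`, `wordNgrams`, and the choice of word trigrams are assumptions for illustration; the tokenizer splits on non-word characters and is only adequate for ASCII text.

```ts
function wordNgrams(text: string, n: number): string[] {
  const words = text.toLowerCase().split(/\W+/).filter((w) => w.length > 0);
  const grams: string[] = [];
  for (let i = 0; i + n <= words.length; i++) {
    grams.push(words.slice(i, i + n).join(" "));
  }
  return grams;
}

class NgramIndex {
  private readonly n: number;
  private readonly postings = new Map<string, Set<string>>(); // n-gram -> document ids

  constructor(n = 3) {
    this.n = n;
  }

  add(documentId: string, canonicalText: string): void {
    for (const gram of wordNgrams(canonicalText, this.n)) {
      const ids = this.postings.get(gram) ?? new Set<string>();
      ids.add(documentId);
      this.postings.set(gram, ids);
    }
  }

  /** Rank documents by how many n-grams of the quote fragment they contain. */
  candidates(quote: string): { documentId: string; hits: number }[] {
    const hits = new Map<string, number>();
    for (const gram of wordNgrams(quote, this.n)) {
      const ids = this.postings.get(gram);
      if (ids) ids.forEach((id) => hits.set(id, (hits.get(id) ?? 0) + 1));
    }
    const ranked: { documentId: string; hits: number }[] = [];
    hits.forEach((count, documentId) => ranked.push({ documentId, hits: count }));
    return ranked.sort((a, b) => b.hits - a.hits);
  }
}

const recoveryIndex = new NgramIndex();
recoveryIndex.add("doc_a", "The committee approved the revised budget for the next fiscal year.");
recoveryIndex.add("doc_b", "The revised budget was rejected by the committee.");
```
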

---

## 12. UI Architecture

### 12.1 Review Workspace

```text
┌─────────────────────────────────────────────────────────────┐
│ Workspace Header                                            │
├───────────────┬───────────────────────────────┬─────────────┤
│ Collection    │ Document Viewer               │ Evidence    │
│ Navigation    │                               │ Sidebar     │
└───────────────┴───────────────────────────────┴─────────────┘
```

Primary interactions:

* Select text.
* Create annotation.
* Add commentary.
* Tag evidence.
* Click evidence to reopen context.
* Filter by status/tag/document.

### 12.2 Evidence Form Workspace

```text
┌───────────────────────────────┬─────────────────────────────┐
│ Structured Form               │ Document Viewer             │
│                               │                             │
│ Field A                       │ Active citation highlight   │
│   evidence chips              │                             │
│ Field B                       │                             │
│   evidence chips              │                             │
├───────────────────────────────┴─────────────────────────────┤
│ Optional Evidence Tray / Active Citation Details            │
└─────────────────────────────────────────────────────────────┘
```

Visual guide overlay:

```text
field element → evidence chip/card → document highlight
```

The overlay should be independent from both the form renderer and document viewer.

### 12.3 Citation Recovery Workspace

```text
┌─────────────────────────────────────────────────────────────┐
│ Citation / Quote Input                                      │
├──────────────────────┬──────────────────────────────────────┤
│ Candidate Sources    │ Candidate Passages                   │
├──────────────────────┴──────────────────────────────────────┤
│ Confirm / Create Annotation                                 │
└─────────────────────────────────────────────────────────────┘
```

---

## 13. Event Model

Subsystems should communicate through explicit domain events where useful.

Examples:

```text
DocumentImported
DocumentRepresentationGenerated
AnnotationCreated
AnnotationResolved
AnnotationResolutionFailed
EvidenceItemCreated
EvidenceItemLinked
EvidenceItemActivated
FormFieldActivated
CitationCardRendered
CitationRecoveryStarted
CitationRecoveryCandidateFound
CitationRecoveryConfirmed
```

Example event:

```ts
interface EvidenceItemActivatedEvent {
  type: "EvidenceItemActivated";
  evidenceItemId: string;
  source?: "sidebar" | "form-field" | "citation-card";
  targetContext?: {
    type: "form-field" | "claim" | "requirement";
    id: string;
  };
}
```

Events should be useful both for UI synchronization and later automation/agent workflows.

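
A small typed event bus is one way subsystems could exchange these events without importing each other. The sketch below covers two of the event names listed above; the payload of `AnnotationCreated` and the `EventBus` API itself are assumptions, not a specified contract. Handlers are narrowed by the `type` discriminator, so a subscriber to `EvidenceItemActivated` sees `evidenceItemId` without casting.

```ts
type DomainEvent =
  | { type: "AnnotationCreated"; annotationId: string } // payload shape is an assumption
  | {
      type: "EvidenceItemActivated";
      evidenceItemId: string;
      source?: "sidebar" | "form-field" | "citation-card";
    };

type Handler = (event: DomainEvent) => void;

class EventBus {
  private readonly handlers = new Map<string, Handler[]>();

  /** Subscribe to one event type; the returned function unsubscribes. */
  on<T extends DomainEvent["type"]>(
    type: T,
    handler: (event: Extract<DomainEvent, { type: T }>) => void
  ): () => void {
    // Safe at runtime because emit only passes events whose type matches the key.
    const wrapped = handler as unknown as Handler;
    const list = this.handlers.get(type) ?? [];
    list.push(wrapped);
    this.handlers.set(type, list);
    return () => {
      const current = this.handlers.get(type) ?? [];
      this.handlers.set(type, current.filter((h) => h !== wrapped));
    };
  }

  emit(event: DomainEvent): void {
    // Copy the list so handlers may unsubscribe while the event is being delivered.
    (this.handlers.get(event.type) ?? []).slice().forEach((h) => h(event));
  }
}

const bus = new EventBus();
const activated: string[] = [];
const unsubscribe = bus.on("EvidenceItemActivated", (event) => {
  activated.push(event.evidenceItemId);
});
bus.emit({ type: "EvidenceItemActivated", evidenceItemId: "ev_123", source: "form-field" });
bus.emit({ type: "AnnotationCreated", annotationId: "ann_456" });
```
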

---

## 14. External Standards and Compatibility

The architecture should align with existing standards where practical.

### 14.1 W3C Web Annotation

Use W3C Web Annotation concepts for:

* Annotation.
* Body.
* Target.
* Selector.
* TextQuoteSelector.
* TextPositionSelector.

Recommended approach:

```text
Internal model:
  optimized for citation-evidence workflows

Import/export mapping:
  W3C Web Annotation-compatible JSON where practical
```

This avoids forcing JSON-LD complexity into every internal operation while preserving standards compatibility.

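
An export mapping might look roughly like the sketch below, which turns a reduced internal annotation into a W3C Web Annotation-style JSON object. The `InternalAnnotation` shape and `toWebAnnotation` are hypothetical, and only the two text selectors are handled. The W3C model expects annotation ids to be IRIs, so a real exporter would need to map internal ids such as `ann_456` to full URLs; that step is left out here.

```ts
interface TextQuoteSelector { type: "TextQuoteSelector"; exact: string; prefix?: string; suffix?: string; }
interface TextPositionSelector { type: "TextPositionSelector"; start: number; end: number; }

// Hypothetical reduced internal annotation, carrying only what the mapping needs.
interface InternalAnnotation {
  id: string;
  documentUri: string;
  selectors: (TextQuoteSelector | TextPositionSelector)[];
  note?: string;
  createdAt: string;
}

function toWebAnnotation(annotation: InternalAnnotation): Record<string, unknown> {
  const result: Record<string, unknown> = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    id: annotation.id,
    type: "Annotation",
    created: annotation.createdAt,
    target: {
      source: annotation.documentUri,
      // Several selectors on one target describe the same passage in different ways.
      selector: annotation.selectors,
    },
  };
  if (annotation.note !== undefined) {
    result["body"] = { type: "TextualBody", value: annotation.note, format: "text/plain" };
  }
  return result;
}

const exported = toWebAnnotation({
  id: "https://example.org/annotations/ann_456",
  documentUri: "https://example.org/report.pdf",
  selectors: [
    { type: "TextQuoteSelector", exact: "durable and reopenable", prefix: "making citations " },
    { type: "TextPositionSelector", start: 120, end: 142 },
  ],
  note: "Key design goal.",
  createdAt: "2024-01-01T00:00:00Z",
});
```
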
### 14.2 Web Components

Citation presentation should be embeddable through web components where possible:

```html
<citation-card evidence-item-id="ev_123"></citation-card>
<citation-context-link annotation-id="ann_456"></citation-context-link>
<citation-document-viewer document-id="doc_123"></citation-document-viewer>
```

### 14.3 URL Deep Links

The system should provide stable internal URLs such as:

```text
/viewer?document=doc_123&annotation=ann_456
/workspace/collections/col_123/documents/doc_123?evidence=ev_456
```

For public HTML documents, optional browser text fragments may be generated as export aids, but should not be the only internal anchoring mechanism.

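
Building and parsing the `/viewer` deep link can be sketched with plain string handling. `viewerLink` and `parseViewerLink` are hypothetical helper names; only the `document` and `annotation` query parameters shown above are handled, and ids are percent-encoded so that unusual characters survive a round trip.

```ts
/** Build a viewer deep link such as /viewer?document=doc_123&annotation=ann_456. */
function viewerLink(documentId: string, annotationId?: string): string {
  const params = [`document=${encodeURIComponent(documentId)}`];
  if (annotationId !== undefined) {
    params.push(`annotation=${encodeURIComponent(annotationId)}`);
  }
  return `/viewer?${params.join("&")}`;
}

/** Parse a viewer deep link; returns null when the link is not a usable viewer URL. */
function parseViewerLink(url: string): { documentId: string; annotationId?: string } | null {
  const queryStart = url.indexOf("?");
  if (!url.startsWith("/viewer") || queryStart === -1) return null;
  const values = new Map<string, string>();
  for (const pair of url.slice(queryStart + 1).split("&")) {
    const eq = pair.indexOf("=");
    if (eq > 0) values.set(pair.slice(0, eq), decodeURIComponent(pair.slice(eq + 1)));
  }
  const documentId = values.get("document");
  if (documentId === undefined) return null;
  const annotationId = values.get("annotation");
  return annotationId === undefined ? { documentId } : { documentId, annotationId };
}
```
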

---

## 15. Security Architecture

Security principles:

1. Treat imported documents as untrusted input.
2. Sanitize imported HTML.
3. Avoid executing document scripts.
4. Isolate document rendering where needed.
5. Do not send private document text to external services without explicit user permission.
6. Make external source lookup configurable.
7. Preserve access control boundaries around collections and documents.

Important areas:

```text
HTML sanitization
PDF processing safety
external URL fetching
object storage access
annotation visibility
collection permissions
agent/tool permissions
```

For the MVP, single-user/local security is sufficient, but the model should not block later multi-user permissions.

---

## 16. Suggested Initial Technical Stack

### 16.1 Frontend

```text
TypeScript
React for first application shell
PDF.js or react-pdf-highlighter-plus for PDF MVP
unified / remark / rehype for Markdown rendering
DOMPurify for HTML sanitization
SVG overlay for visual guides
CSS Custom Highlight API with fallback for HTML/Markdown highlighting
```

### 16.2 Backend / Local Service

```text
Node.js or Python service for initial ingestion
PostgreSQL for server mode
SQLite for local-first mode
Filesystem or object storage for document blobs
Meilisearch or Typesense for search if needed early
```

### 16.3 Document Processing

```text
PDF.js text extraction for browser-side PDF workflows
Apache Tika or similar for broader server-side extraction later
Tesseract OCR for scanned documents later
```

### 16.4 Packaging Direction

Prefer TypeScript-first packages for the core web-facing model and UI integration.

A later backend may be polyglot, but the browser-facing contracts should remain TypeScript-native.

---

## 17. MVP Implementation Plan

### Phase 1: Core Model and PDF Review

Deliverables:

* Basic `Document`, `Annotation`, `EvidenceItem` model.
* PDF viewer integration.
* Text selection capture.
* Highlight creation.
* Commentary entry.
* Evidence sidebar.
* Click evidence to reopen context.
* Markdown/HTML citation card export.

Subsystems involved:

```text
citation-engine
citation-work
evidence-anchor
evidence-source
citation-evidence
```

### Phase 2: Evidence Binding and Form Mode

Deliverables:

* Simple form definition model.
* Evidence links to form fields.
* Evidence chips on fields.
* Activate field to focus evidence.
* Evidence switcher.
* Active state synchronization.
* Initial SVG visual guide overlay.

Subsystems involved:

```text
evidence-binder
citation-engine
citation-work
evidence-anchor
citation-evidence
```

### Phase 3: Markdown and HTML Documents

Deliverables:

* Markdown rendering adapter.
* HTML rendering adapter.
* DOM text selection capture.
* Text quote and text position selectors.
* Highlighting in non-paginated documents.
* Reuse evidence sidebar and binding workflows.

Subsystems involved:

```text
evidence-source
evidence-anchor
citation-work
citation-engine
```

### Phase 4: Local Citation Recovery

Deliverables:

* Recovery input UI.
* Local document search.
* Exact quote match.
* Fuzzy quote match.
* Candidate passage confirmation.
* Create annotation from confirmed match.

Subsystems involved:

```text
evidence-source
evidence-anchor
citation-engine
citation-work
```

---

## 18. Architectural Decisions to Make Early

### ADR-001: Internal model vs. native W3C Web Annotation

Recommendation:

Use an internal model optimized for citation-evidence, with W3C-compatible import/export mapping.

Reason:

The product needs evidence binding, form synchronization, recovery states, and citation cards, which go beyond the basic web annotation model.

### ADR-002: React-first vs. Web-component-first

Recommendation:

Build the first application in React, but keep core model and viewer adapter contracts framework-neutral. Add web components for citation cards and context links early.

Reason:

React accelerates MVP UI development, while framework-neutral contracts protect reuse.

### ADR-003: Local-first vs. server-first storage

Recommendation:

Design persistence interfaces from the beginning. Implement local-first storage first if the target is personal/agentic workflows; implement PostgreSQL-backed storage when collaboration or server deployment becomes necessary.

### ADR-004: PDF.js direct vs. react-pdf-highlighter-plus

Recommendation:

Use react-pdf-highlighter-plus for initial speed if it satisfies selector and rendering needs. Keep an abstraction boundary so the PDF viewer can be replaced with direct PDF.js integration later.

### ADR-005: Citation recovery scope

Recommendation:

Start with local document library recovery. Add external source lookup only after the local anchoring and quote matching pipeline is reliable.

---

## 19. Risks and Mitigations

| Risk                                                 | Impact | Mitigation                                                      |
| ---------------------------------------------------- | -----: | --------------------------------------------------------------- |
| PDF text extraction is inconsistent across documents |   High | Store both visual and text selectors; support manual correction |
| Highlight coordinates break with zoom/layout         |   High | Use normalized coordinates and viewer-independent selectors     |
| Imported HTML executes unsafe content                |   High | Sanitize and sandbox HTML rendering                             |
| Citation recovery finds wrong passage                | Medium | Require user confirmation for fuzzy or external matches         |
| Too many repos create coordination overhead          | Medium | Keep domain model in citation-engine and define clear contracts |
| Viewer library constraints leak into domain model    |   High | Enforce adapter boundary and selector abstraction               |
| Form binding becomes too domain-specific             | Medium | Model generic EvidenceTargets and target adapters               |
| Search/indexing becomes heavy too early              | Medium | Begin local/simple; add dedicated search service later          |

---

## 20. First Reference Scenario

The first end-to-end reference scenario should be:

```text
1. User creates a collection named “Application Evidence”.
2. User uploads a PDF.
3. User selects a passage and adds commentary.
4. System creates an annotation and evidence item.
5. User opens a form next to the PDF.
6. User links the evidence item to a form field.
7. User focuses the field.
8. System highlights the field, evidence item, and source passage.
9. System draws a guide from field to evidence to source passage.
10. User exports the evidence as a Markdown citation card.
```

This scenario exercises the essential product value without requiring external source lookup or advanced collaboration.

---

## 21. Summary

The architecture of **citation-evidence** should be organized around reusable evidence objects, not only document annotations.

The core design is:

```text
Source Document
  → Document Representation
  → Durable Annotation Anchor
  → Evidence Item with Commentary
  → Evidence Link to Field / Claim / Requirement
  → Portable Citation Card
  → Reopenable Source Context
```

The subsystem repositories provide a clean separation of responsibilities:

```text
citation-engine     owns the domain and APIs
evidence-anchor     owns selector creation, resolution, and highlighting
evidence-source     owns ingestion, extraction, and recovery
citation-work       owns review workflows
evidence-binder     owns evidence-to-target binding
citation-evidence   owns the integrated product shell
```

This gives the project a practical MVP path while preserving enough architectural clarity to grow into a reusable infrastructure layer for evidence-backed information work.