# Architecture Overview: citation-evidence

## 1. Purpose

This document describes the initial architecture for **citation-evidence**, a modular evidence workspace for capturing, managing, presenting, and reopening citations across PDFs and other document formats.

The architecture is designed to support three primary product modes:

1. **Document Review** — add documents to a collection, mark passages, comment on them, and create reusable evidence items.
2. **Evidence-Backed Forms** — display documents next to forms, bind evidence to fields, and navigate from field to cited source context.
3. **Citation Recovery** — start from an external citation, quote, or source clue, find the digital source if available, locate the cited passage, and create an annotation.

The system should remain viewer-independent, format-neutral, and suitable for future agentic workflows.

---

## 2. Architectural Summary

At its core, **citation-evidence** separates five concerns:

```text
Document Source
  The original PDF, Markdown, HTML, web page, scan, or other document.

Document Representation
  A normalized, searchable, addressable representation derived from the source.

Annotation Anchor
  A durable technical reference to a passage inside a representation.

Evidence Item
  A meaningful evidence object built from one or more annotations and commentary.

Evidence Binding
  A connection between evidence and a structured target such as a form field, claim, requirement, or decision.
```

The high-level architecture is:

```text
┌─────────────────────────────────────────────────────────────────────┐
│ citation-evidence                                                    │
│ Umbrella app, workspace shell, integration, demos, docs              │
└─────────────────────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────────┐
│ citation-engine                                                      │
│ Core domain model, APIs, persistence contracts, citation rendering   │
└─────────────────────────────────────────────────────────────────────┘
             │
             ├───────────────────────┬───────────────────────┐
             ▼                       ▼                       ▼
┌───────────────┐        ┌────────────────┐       ┌────────────────┐
│ evidence-     │        │ evidence-      │       │ citation-       │
│ source        │        │ anchor         │       │ work            │
│ ingestion,    │        │ selectors,     │       │ review UI,      │
│ extraction,   │        │ resolving,     │       │ collections,    │
│ recovery      │        │ highlighting   │       │ annotation UX   │
└───────────────┘        └────────────────┘       └────────────────┘
             │                       │                       │
             └───────────────┬───────┴───────────────┬───────┘
                             ▼                       ▼
                    ┌────────────────┐       ┌────────────────┐
                    │ evidence-      │       │ viewer adapters │
                    │ binder         │       │ PDF / HTML / MD │
                    │ field/claim    │       │ and later more  │
                    │ evidence links │       │                │
                    └────────────────┘       └────────────────┘
```

---

## 3. Repository and Subsystem Boundaries

### 3.1 citation-evidence

**Role:** Umbrella product and integration repository.

This repository ties the subsystem implementations together and provides the reference product experience.

Responsibilities:

* Workspace shell.
* Cross-subsystem integration.
* Reference web application.
* Demo scenarios.
* Product documentation.
* System-level tests.
* Example deployments.
* Developer onboarding.

Should contain:

```text
citation-evidence/
  README.md
  INTENT.md
  ARCHITECTURE.md
  PRODUCT_REQUIREMENTS.md
  apps/
    workspace-demo/
  docs/
    concepts/
    decisions/
    examples/
  integration-tests/
  docker-compose.yml
```

Should not contain:

* The low-level anchoring algorithms.
* The complete document ingestion implementation.
* The full domain engine implementation.
* Viewer-specific internals except as integration examples.

---

### 3.2 citation-engine

**Role:** Core domain engine and service layer.

This is the conceptual center of the system. It owns the stable domain model and the API contracts used by the other subsystems.

Responsibilities:

* Core domain model.
* Document, annotation, evidence, and binding APIs.
* Persistence interfaces.
* Citation card rendering contracts.
* Markdown and HTML export logic.
* W3C Web Annotation-compatible mapping.
* Event model.
* Orchestration between source, anchor, work, and binder subsystems.

Key concepts owned:

```text
Document
DocumentRepresentation
Annotation
Selector
EvidenceItem
EvidenceLink
EvidenceSet
CitationCard
CitationRecoveryAttempt
```

Suggested package structure:

```text
citation-engine/
  packages/
    model/
    api-contracts/
    persistence/
    citation-rendering/
    events/
    w3c-mapping/
  docs/
  tests/
```

Primary interfaces:

```ts
type DocumentId = string;
type AnnotationId = string;
type EvidenceItemId = string;
type EvidenceLinkId = string;

interface CitationEngine {
  documents: DocumentService;
  annotations: AnnotationService;
  evidence: EvidenceService;
  bindings: EvidenceBindingService;
  rendering: CitationRenderingService;
}
```

---

### 3.3 evidence-anchor

**Role:** Format-neutral anchoring, selector resolution, and highlighting contract.

This repository is responsible for making citations durable and reopenable.

Responsibilities:

* Selector model.
* Text quote selectors.
* Text position selectors.
* PDF page/rectangle selectors.
* DOM/structural selectors.
* Selector creation from user selections.
* Selector resolution against document representations.
* Fuzzy re-anchoring.
* Highlight rendering contract.
* Orphaned annotation detection.

Key architectural rule:

**No citation should depend solely on a single visual coordinate system.**

The subsystem should store redundant selectors where possible:

```text
PDF citation:
  - exact quote
  - prefix/suffix
  - page number
  - normalized page rectangles
  - page-local text offsets
  - global canonical text offsets

HTML/Markdown citation:
  - exact quote
  - prefix/suffix
  - canonical text offsets
  - DOM range or structural path
  - heading/section context
```
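The "normalized page rectangles" entry can be illustrated with a small sketch. It assumes rectangles are stored as fractions of the page width and height so that they survive zoom and layout changes; the type and function names (`PixelRect`, `NormalizedRect`, `normalizeRect`, `denormalizeRect`) are hypothetical and not defined elsewhere in this document.

```ts
// Hypothetical shapes for illustration only.
interface PixelRect { x: number; y: number; width: number; height: number }
interface NormalizedRect { left: number; top: number; width: number; height: number }
interface PageSize { width: number; height: number }

// Convert a rectangle in rendered pixels into page-relative fractions (0..1).
function normalizeRect(rect: PixelRect, page: PageSize): NormalizedRect {
  return {
    left: rect.x / page.width,
    top: rect.y / page.height,
    width: rect.width / page.width,
    height: rect.height / page.height,
  };
}

// Convert page-relative fractions back into pixels for the current render size.
function denormalizeRect(rect: NormalizedRect, page: PageSize): PixelRect {
  return {
    x: rect.left * page.width,
    y: rect.top * page.height,
    width: rect.width * page.width,
    height: rect.height * page.height,
  };
}

// Captured on a 600x800 px render of the page, redrawn on a 1200x1600 px render.
const capturedRect = normalizeRect(
  { x: 60, y: 200, width: 300, height: 20 },
  { width: 600, height: 800 }
);
const redrawnRect = denormalizeRect(capturedRect, { width: 1200, height: 1600 });
```

Only the normalized form would be persisted; the pixel form is recomputed whenever the viewer re-renders the page.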

Suggested package structure:

```text
evidence-anchor/
  packages/
    selectors/
    resolver/
    fuzzy-match/
    highlight-contract/
    pdf-selectors/
    dom-selectors/
  docs/
  tests/
```

Core interface:

```ts
interface AnchorAdapter {
  createSelectors(selection: SelectionCapture): Promise<Selector[]>;
  resolveSelectors(
    representation: DocumentRepresentation,
    selectors: Selector[]
  ): Promise<AnchorResolution>;
  renderHighlight(
    target: ResolvedAnchorTarget,
    options?: HighlightRenderOptions
  ): Promise<void>;
  scrollToTarget(
    target: ResolvedAnchorTarget,
    options?: ScrollToTargetOptions
  ): Promise<void>;
}
```

Resolution result:

```ts
type AnchorResolution = {
  status: "resolved" | "ambiguous" | "unresolved" | "stale";
  confidence: number;
  candidates: ResolvedAnchorTarget[];
  usedSelectorTypes: string[];
  warnings?: string[];
};
```
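As a rough illustration of how the statuses might be produced, the sketch below resolves a single text quote selector against canonical text. The reduced `TextQuoteSelector` and `ResolvedAnchorTarget` shapes, the `resolveTextQuote` function, and the confidence values are assumptions made for this example; a real resolver would combine several selector types.

```ts
// Simplified, hypothetical versions of the types named above.
interface TextQuoteSelector {
  type: "TextQuoteSelector";
  exact: string;
  prefix?: string;
  suffix?: string;
}
interface ResolvedAnchorTarget { start: number; end: number }
type AnchorResolution = {
  status: "resolved" | "ambiguous" | "unresolved" | "stale";
  confidence: number;
  candidates: ResolvedAnchorTarget[];
  usedSelectorTypes: string[];
  warnings?: string[];
};

function resolveTextQuote(canonicalText: string, selector: TextQuoteSelector): AnchorResolution {
  // Collect every occurrence of the exact quote.
  const all: ResolvedAnchorTarget[] = [];
  let from = 0;
  while (selector.exact.length > 0) {
    const start = canonicalText.indexOf(selector.exact, from);
    if (start === -1) break;
    all.push({ start, end: start + selector.exact.length });
    from = start + 1;
  }
  // Keep only occurrences whose surrounding text matches the stored context.
  const inContext = all.filter(({ start, end }) => {
    const prefixOk = !selector.prefix || canonicalText.slice(0, start).endsWith(selector.prefix);
    const suffixOk = !selector.suffix || canonicalText.slice(end).startsWith(selector.suffix);
    return prefixOk && suffixOk;
  });
  // If the context matches nowhere, fall back to all quote occurrences.
  const candidates = inContext.length > 0 ? inContext : all;
  const usedSelectorTypes = ["TextQuoteSelector"];

  if (candidates.length === 0) {
    return { status: "unresolved", confidence: 0, candidates, usedSelectorTypes };
  }
  if (candidates.length > 1) {
    return {
      status: "ambiguous",
      confidence: 1 / candidates.length, // illustrative value, not calibrated
      candidates,
      usedSelectorTypes,
      warnings: ["quote occurs more than once and context did not disambiguate"],
    };
  }
  // Lower confidence when the stored context no longer matches (illustrative values).
  const contextHeld = inContext.length === 1;
  return { status: "resolved", confidence: contextHeld ? 1 : 0.8, candidates, usedSelectorTypes };
}
```

The `stale` status is not produced here; it would presumably require comparing representation versions, which this sketch leaves out.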

---

### 3.4 evidence-source

**Role:** Document ingestion, source metadata, full-text extraction, and citation recovery.

This repository turns raw sources into usable document representations and supports the process of recovering cited passages from external references.

Responsibilities:

* Document import.
* Source URI handling.
* Metadata extraction.
* Fingerprinting.
* Text extraction.
* PDF text extraction pipeline.
* Markdown normalization.
* HTML normalization and sanitization.
* Optional OCR integration later.
* Local source matching.
* External source discovery hooks.
* Citation recovery attempts.

Suggested package structure:

```text
evidence-source/
  packages/
    ingest-core/
    fingerprinting/
    metadata/
    extract-pdf/
    extract-markdown/
    extract-html/
    source-lookup/
    citation-recovery/
  docs/
  tests/
```

Core ingestion pipeline:

```text
Raw Source
  → identify media type
  → compute fingerprint
  → extract metadata
  → extract canonical text
  → build format-specific maps
  → persist Document + DocumentRepresentation
```
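The pipeline steps could be sketched for plain-text and Markdown sources roughly as follows. Everything here is hypothetical: `RawSource`, `IngestedDocument`, and the helper functions are invented for illustration, the media type check only looks at the file extension, and the FNV-1a-style hash over UTF-16 code units is a stand-in for a real content hash such as SHA-256 so that the example needs no imports. The final persistence step is omitted.

```ts
// Hypothetical input and output shapes for this sketch.
interface RawSource { uri: string; text: string }
interface IngestedDocument {
  mediaType: string;
  fingerprint: string;
  metadata: { title?: string };
  canonicalText: string;
  lineOffsets: number[]; // start offset of each line in canonicalText
}

// Step 1: identify media type (extension-based; real detection would inspect content).
function identifyMediaType(uri: string): string {
  if (uri.endsWith(".md")) return "text/markdown";
  if (uri.endsWith(".html") || uri.endsWith(".htm")) return "text/html";
  return "text/plain";
}

// Step 2: compute fingerprint (non-cryptographic stand-in, 8 hex characters).
function fnv1a(input: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Step 4: canonical text with unified line endings and collapsed inline whitespace.
function canonicalize(text: string): string {
  return text
    .replace(/\r\n?/g, "\n")
    .split("\n")
    .map((line) => line.replace(/[ \t]+/g, " ").trim())
    .join("\n");
}

// Step 5: a format-specific map; here a simple line-start offset map.
function buildLineOffsets(canonicalText: string): number[] {
  const offsets = [0];
  for (let i = 0; i < canonicalText.length; i++) {
    if (canonicalText[i] === "\n") offsets.push(i + 1);
  }
  return offsets;
}

function ingest(source: RawSource): IngestedDocument {
  const mediaType = identifyMediaType(source.uri);
  const fingerprint = fnv1a(source.text);
  // Step 3: metadata; only a Markdown level-1 heading on the first line is used as title.
  const firstLine = source.text.split(/\r?\n/)[0] ?? "";
  const title =
    mediaType === "text/markdown" && firstLine.startsWith("# ")
      ? firstLine.slice(2).trim()
      : undefined;
  const canonicalText = canonicalize(source.text);
  return {
    mediaType,
    fingerprint,
    metadata: { title },
    canonicalText,
    lineOffsets: buildLineOffsets(canonicalText),
  };
}
```

The fingerprint is computed over the raw source rather than the canonical text, so the same file always maps to the same document even if the canonicalization rules change later.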

PDF representation should include:

```text
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
optional normalized rectangles for selections
```

Markdown/HTML representation should include:

```text
canonical text
DOM or AST structure
heading map
offset-to-node map
source line map where available
sanitized render output
```

Citation recovery pipeline:

```text
Citation clue / quote / reference
  → parse clue
  → search local library
  → search configured external sources if allowed
  → identify candidate documents
  → extract/index candidate text
  → exact quote search
  → fuzzy quote search
  → present candidates
  → user confirms
  → create annotation + evidence item
```
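The exact and fuzzy quote search steps might look like the following sketch. It is deliberately naive: the fuzzy pass compares words position by position within a sliding window, so it tolerates substituted words but not insertions or deletions. `PassageCandidate`, `tokenize`, `findQuote`, and the `minScore` default of 0.7 are assumptions for this example, and the tokenizer only recognizes ASCII letters and digits.

```ts
// Hypothetical candidate shape for this sketch.
interface PassageCandidate {
  docId: string;
  start: number;
  end: number;
  score: number;
  method: "exact" | "fuzzy";
}
interface Token { word: string; start: number; end: number }

// Split text into lowercase word tokens while remembering their offsets.
function tokenize(text: string): Token[] {
  const tokens: Token[] = [];
  const pattern = /[A-Za-z0-9]+/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    tokens.push({
      word: match[0].toLowerCase(),
      start: match.index,
      end: match.index + match[0].length,
    });
  }
  return tokens;
}

function findQuote(docId: string, text: string, quote: string, minScore = 0.7): PassageCandidate[] {
  // Exact quote search first.
  const exactAt = quote.length > 0 ? text.indexOf(quote) : -1;
  if (exactAt !== -1) {
    return [{ docId, start: exactAt, end: exactAt + quote.length, score: 1, method: "exact" }];
  }
  // Fuzzy fallback: slide a window of the quote's word count over the text.
  const quoteTokens = tokenize(quote);
  const textTokens = tokenize(text);
  const results: PassageCandidate[] = [];
  const size = quoteTokens.length;
  if (size === 0) return results;
  for (let i = 0; i + size <= textTokens.length; i++) {
    let same = 0;
    for (let j = 0; j < size; j++) {
      if (textTokens[i + j].word === quoteTokens[j].word) same++;
    }
    const score = same / size;
    if (score >= minScore) {
      results.push({
        docId,
        start: textTokens[i].start,
        end: textTokens[i + size - 1].end,
        score,
        method: "fuzzy",
      });
    }
  }
  // Best candidates first; the user still confirms before an annotation is created.
  return results.sort((a, b) => b.score - a.score);
}
```

Fuzzy candidates are returned with their score rather than accepted automatically, which matches the "present candidates, user confirms" steps of the pipeline.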

---

### 3.5 citation-work

**Role:** Review workspace and annotation user experience.

This repository provides the user-facing workflows for reviewing document collections and creating evidence from selected passages.

Responsibilities:

* Document collection UI.
* Review queue.
* Document viewer composition.
* Annotation creation UX.
* Evidence sidebar.
* Review state management.
* Tagging and filtering.
* Navigation between evidence items and source context.

Suggested package structure:

```text
citation-work/
  packages/
    review-ui/
    collection-ui/
    evidence-sidebar/
    annotation-toolbar/
    viewer-shell/
    review-state/
  docs/
  tests/
```

Core UI layout:

```text
┌─────────────────────────────────────────────────────────────┐
│ Collection / Review Header                                  │
├───────────────┬───────────────────────────────┬─────────────┤
│ Document List  │ Document Viewer                │ Evidence    │
│ / Filters      │ PDF / HTML / Markdown          │ Sidebar     │
└───────────────┴───────────────────────────────┴─────────────┘
```

Review states:

```text
unreviewed
in-review
marked
relevant
rejected
needs-follow-up
cited
verified
```

Evidence states:

```text
candidate
confirmed
rejected
needs-check
strong-support
weak-support
contradicts
```

---

### 3.6 evidence-binder

**Role:** Binding evidence to structured targets such as form fields, claims, requirements, decisions, or document sections.

This repository provides the graph-like layer that links evidence items to the targets they support, contradict, explain, or qualify.

Responsibilities:

* Evidence-to-field links.
* Evidence-to-claim links.
* Evidence-to-requirement links.
* Evidence sets.
* Relation types.
* Form synchronization state.
* Active field/evidence/annotation state.
* Visual guide model.
* Evidence completeness indicators.

Suggested package structure:

```text
evidence-binder/
  packages/
    binding-model/
    form-evidence-state/
    evidence-switcher/
    visual-guide-overlay/
    target-adapters/
  docs/
  tests/
```

Core model:

```ts
type EvidenceTargetType =
  | "form-field"
  | "claim"
  | "requirement"
  | "decision"
  | "document-section";

type EvidenceRelation =
  | "supports"
  | "contradicts"
  | "explains"
  | "source-for"
  | "qualifies";

interface EvidenceLink {
  id: string;
  evidenceItemId: string;
  targetType: EvidenceTargetType;
  targetId: string;
  relation: EvidenceRelation;
  confidence?: number;
  status?: "candidate" | "confirmed" | "rejected" | "needs-check";
}
```
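One way the "evidence completeness indicators" responsibility could build on this model is sketched below. The four indicator levels (`missing`, `unconfirmed`, `supported`, `contested`) and the rules that produce them are assumptions for illustration; the source does not define them. `FieldEvidenceLink` is a reduced, hypothetical form of `EvidenceLink`.

```ts
// Reduced, hypothetical form of EvidenceLink for form-field targets.
type LinkStatus = "candidate" | "confirmed" | "rejected" | "needs-check";
type LinkRelation = "supports" | "contradicts" | "explains" | "source-for" | "qualifies";
interface FieldEvidenceLink {
  evidenceItemId: string;
  targetId: string;
  relation: LinkRelation;
  status?: LinkStatus;
}

// Hypothetical indicator levels; not defined by the architecture itself.
type Completeness = "missing" | "unconfirmed" | "supported" | "contested";

function fieldCompleteness(fieldId: string, links: FieldEvidenceLink[]): Completeness {
  // Rejected links do not count as evidence for the field.
  const active = links.filter((l) => l.targetId === fieldId && l.status !== "rejected");
  if (active.length === 0) return "missing";

  const confirmed = active.filter((l) => l.status === "confirmed");
  // A confirmed contradiction takes priority over any support.
  if (confirmed.some((l) => l.relation === "contradicts")) return "contested";
  if (confirmed.some((l) => l.relation === "supports" || l.relation === "source-for")) {
    return "supported";
  }
  return "unconfirmed";
}
```

A form renderer could call `fieldCompleteness` per field to decide which evidence chip style to show.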

Evidence form UI model:

```text
Form Field Activated
  → evidence-binder loads linked EvidenceSet
  → citation-engine resolves active evidence
  → evidence-anchor scrolls document viewer to annotation
  → visual-guide-overlay connects field, evidence card, and highlight
```

Visual guide architecture:

```text
Element Registry
  field target rect
  evidence card rect
  annotation highlight rect

Guide Overlay
  SVG line or curve from field to evidence card
  SVG line or curve from evidence card to annotation
  active state updates on scroll, resize, focus, and evidence switch
```
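A minimal sketch of the overlay geometry follows. It assumes a left-to-right layout (field, then evidence card, then highlight) and a registry keyed by element id; `GuideRect`, `guidePath`, and `buildGuidePaths` are hypothetical names. The output is an SVG path `d` string using a cubic Bézier curve.

```ts
// Hypothetical rectangle shape in overlay (viewport) coordinates.
interface GuideRect { left: number; top: number; width: number; height: number }

// Curve from the right edge center of `from` to the left edge center of `to`.
function guidePath(from: GuideRect, to: GuideRect): string {
  const a = { x: from.left + from.width, y: from.top + from.height / 2 };
  const b = { x: to.left, y: to.top + to.height / 2 };
  const midX = (a.x + b.x) / 2;
  // Control points share midX so the curve leaves and arrives horizontally.
  return `M ${a.x} ${a.y} C ${midX} ${a.y}, ${midX} ${b.y}, ${b.x} ${b.y}`;
}

// Build both guide segments from the element registry; skip segments whose
// rectangles are not registered (for example, when the highlight is off-screen).
function buildGuidePaths(
  registry: Map<string, GuideRect>,
  fieldId: string,
  cardId: string,
  highlightId: string
): string[] {
  const field = registry.get(fieldId);
  const card = registry.get(cardId);
  const highlight = registry.get(highlightId);
  const paths: string[] = [];
  if (field && card) paths.push(guidePath(field, card));
  if (card && highlight) paths.push(guidePath(card, highlight));
  return paths;
}
```

The registry would be refreshed on scroll, resize, focus, and evidence switch, after which the paths are recomputed.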

---

## 4. Core Domain Model

### 4.1 Document

A source object known to the system.

```ts
interface Document {
  id: string;
  title?: string;
  uri?: string;
  mediaType: string;
  fingerprint?: string;
  version?: string;
  createdAt: string;
  updatedAt: string;
  metadata?: Record<string, unknown>;
}
```

### 4.2 DocumentRepresentation

A normalized representation generated from a document source.

```ts
interface DocumentRepresentation {
  id: string;
  documentId: string;
  representationType:
    | "pdf-text"
    | "html-dom"
    | "markdown-rendered"
    | "plain-text"
    | "ocr-text";
  contentHash: string;
  canonicalText?: string;
  pageMap?: PageMap;
  structureMap?: StructureMap;
  offsetMap?: OffsetMap;
  generatedAt: string;
}
```

### 4.3 Selector

A technical locator for a document passage.

```ts
type Selector =
  | TextQuoteSelector
  | TextPositionSelector
  | PdfRectSelector
  | DomRangeSelector
  | StructuralSelector;
```

Recommended selector redundancy:

```text
Always capture:
  - exact quote
  - prefix/suffix context

Capture when available:
  - canonical text offsets
  - PDF page/rectangles
  - DOM range
  - structural path
  - heading context
```
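The "always capture" part of this recommendation can be sketched for a selection expressed as offsets into canonical text. The selector shapes loosely follow the W3C names, while `createTextSelectors` and the default context length of 32 characters are assumptions chosen for this example.

```ts
// Reduced selector shapes; field names follow the W3C Web Annotation selectors.
interface TextQuoteSelector {
  type: "TextQuoteSelector";
  exact: string;
  prefix: string;
  suffix: string;
}
interface TextPositionSelector {
  type: "TextPositionSelector";
  start: number;
  end: number;
}

// Build a quote selector (with context) and a position selector for one selection.
function createTextSelectors(
  canonicalText: string,
  start: number,
  end: number,
  contextLength = 32 // arbitrary default for this sketch
): [TextQuoteSelector, TextPositionSelector] {
  if (start < 0 || end > canonicalText.length || start >= end) {
    throw new RangeError("selection is outside the canonical text");
  }
  return [
    {
      type: "TextQuoteSelector",
      exact: canonicalText.slice(start, end),
      prefix: canonicalText.slice(Math.max(0, start - contextLength), start),
      suffix: canonicalText.slice(end, end + contextLength),
    },
    { type: "TextPositionSelector", start, end },
  ];
}
```

Format-specific selectors such as PDF rectangles or DOM ranges would be appended by the relevant viewer adapter when they are available.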

### 4.4 Annotation

A technical mark on a document range.

```ts
interface Annotation {
  id: string;
  documentId: string;
  representationId?: string;
  selectors: Selector[];
  quote?: string;
  note?: string;
  createdBy?: string;
  createdAt: string;
  updatedAt: string;
}
```

### 4.5 EvidenceItem

A meaningful evidence object built from one or more annotations.

```ts
interface EvidenceItem {
  id: string;
  annotationIds: string[];
  title?: string;
  commentary?: string;
  status: "candidate" | "confirmed" | "rejected" | "needs-check";
  confidence?: number;
  tags?: string[];
  createdBy?: string;
  createdAt: string;
  updatedAt: string;
}
```

### 4.6 EvidenceSet

A group of evidence items connected to a target or topic.

```ts
interface EvidenceSet {
  id: string;
  label?: string;
  targetType?: EvidenceTargetType;
  targetId?: string;
  evidenceItemIds: string[];
  activeEvidenceItemId?: string;
}
```

### 4.7 CitationCard

A presentable rendering of an evidence item.

```ts
interface CitationCard {
  id: string;
  evidenceItemId: string;
  quote: string;
  sourceLabel: string;
  commentary?: string;
  openContextUrl?: string;
  format: "html" | "markdown" | "web-component";
}
```
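A Markdown rendering of such a card might be produced as in the sketch below. The output layout (blockquote, source line, then commentary) is an assumption, as are the names `CardInput` and `renderMarkdownCard`; the sketch also does not escape Markdown special characters in the quote or label.

```ts
// Subset of CitationCard fields needed for rendering (hypothetical input type).
interface CardInput {
  quote: string;
  sourceLabel: string;
  commentary?: string;
  openContextUrl?: string;
}

function renderMarkdownCard(card: CardInput): string {
  // Prefix every quote line so multi-line quotes stay inside the blockquote.
  const quoted = card.quote
    .split("\n")
    .map((line) => `> ${line}`)
    .join("\n");
  // Link the source label back to the reopenable context when a URL is known.
  const source = card.openContextUrl
    ? `[${card.sourceLabel}](${card.openContextUrl})`
    : card.sourceLabel;
  const lines = [quoted, ">", `> Source: ${source}`];
  if (card.commentary) lines.push("", card.commentary);
  return lines.join("\n");
}
```

An HTML or web-component renderer could consume the same input and differ only in its output format.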

---

## 5. Viewer Adapter Architecture

The system must not hard-code one viewer implementation into the citation model.

Each document format should be supported through a viewer adapter.

```ts
interface DocumentViewerAdapter {
  mediaTypes: string[];

  load(document: Document, representation?: DocumentRepresentation): Promise<void>;

  getCurrentSelection(): Promise<SelectionCapture | null>;

  createSelectorsFromSelection(
    selection: SelectionCapture
  ): Promise<Selector[]>;

  resolveSelectors(
    selectors: Selector[]
  ): Promise<AnchorResolution>;

  scrollToResolvedTarget(
    target: ResolvedAnchorTarget,
    options?: {
      center?: boolean;
      behavior?: "auto" | "smooth";
    }
  ): Promise<void>;

  renderHighlight(
    target: ResolvedAnchorTarget,
    options?: HighlightRenderOptions
  ): Promise<void>;

  getHighlightClientRects(
    annotationId: string
  ): Promise<DOMRect[]>;
}
```

Initial adapters:

```text
PdfViewerAdapter
  PDF.js / react-pdf-highlighter-plus based

HtmlViewerAdapter
  sanitized HTML, DOM selection, DOM ranges

MarkdownViewerAdapter
  markdown → HTML rendering, DOM selection, optional source-map support
```
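Selecting an adapter by media type could work roughly as follows. `ViewerAdapterRegistry` and the reduced `ViewerAdapterLike` interface are hypothetical; a real registry would hold full `DocumentViewerAdapter` instances.

```ts
// Reduced, hypothetical stand-in for DocumentViewerAdapter.
interface ViewerAdapterLike {
  label: string;
  mediaTypes: string[];
}

class ViewerAdapterRegistry {
  private byMediaType = new Map<string, ViewerAdapterLike>();

  // Register the adapter under every media type it declares.
  register(adapter: ViewerAdapterLike): void {
    for (const mediaType of adapter.mediaTypes) {
      this.byMediaType.set(mediaType.toLowerCase(), adapter);
    }
  }

  // Look up an adapter, ignoring parameters such as "; charset=utf-8".
  resolve(mediaType: string): ViewerAdapterLike | undefined {
    const base = mediaType.split(";")[0].trim().toLowerCase();
    return this.byMediaType.get(base);
  }
}
```

Keeping the lookup outside the citation model means a new format only needs a new adapter registration, not a model change.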

Future adapters:

```text
DocxViewerAdapter
EpubViewerAdapter
ImageOcrViewerAdapter
PlainTextViewerAdapter
```

---

## 6. Data Flow: Document Review

```text
User adds document
  → evidence-source imports source
  → document fingerprint is computed
  → document metadata is extracted
  → document representation is generated
  → citation-engine stores Document and DocumentRepresentation
  → citation-work displays document
  → user selects passage
  → viewer adapter captures selection
  → evidence-anchor creates selectors
  → citation-engine creates Annotation
  → user adds commentary
  → citation-engine creates EvidenceItem
  → citation-work shows item in evidence sidebar
```

Result:

```text
Document + Representation + Annotation + EvidenceItem
```

---

## 7. Data Flow: Reopen Citation Context

```text
User clicks citation or evidence item
  → citation-engine loads EvidenceItem
  → citation-engine loads Annotation
  → citation-engine loads Document and Representation
  → viewer adapter opens document if needed
  → evidence-anchor resolves selectors
  → viewer adapter scrolls target into center
  → viewer adapter renders highlight
  → citation-work/evidence-binder shows active state
```

Resolution strategy:

```text
1. Try exact representation/version match.
2. Try position selector.
3. Verify that the text at the resolved position matches the exact quote.
4. Try PDF page/rectangle selector if PDF.
5. Try text quote selector with prefix/suffix.
6. Try fuzzy quote matching.
7. If multiple matches, rank by structural/page context.
8. If unresolved, mark annotation as orphaned.
```
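A reduced sketch of this ordering is shown below. It covers only the position check, quote verification, quote search with prefix context, and the orphaned outcome; representation matching, PDF rectangles, fuzzy matching, and structural ranking are left out. `StoredAnchor`, `ReopenResult`, and `reopenAnchor` are hypothetical names.

```ts
// Hypothetical flattened anchor: a quote plus optional offsets and prefix.
interface StoredAnchor {
  exact: string;
  prefix?: string;
  start?: number;
  end?: number;
}
type ReopenResult =
  | { status: "resolved"; start: number; end: number; via: "position" | "quote" }
  | { status: "ambiguous"; candidates: number[] }
  | { status: "orphaned" };

function reopenAnchor(text: string, anchor: StoredAnchor): ReopenResult {
  if (anchor.exact.length === 0) return { status: "orphaned" };

  // Position selector: trust stored offsets only if the text there still equals the quote.
  if (
    anchor.start !== undefined &&
    anchor.end !== undefined &&
    text.slice(anchor.start, anchor.end) === anchor.exact
  ) {
    return { status: "resolved", start: anchor.start, end: anchor.end, via: "position" };
  }

  // Quote selector: find all occurrences, then prefer those preceded by the stored prefix.
  const hits: number[] = [];
  for (let at = text.indexOf(anchor.exact); at !== -1; at = text.indexOf(anchor.exact, at + 1)) {
    hits.push(at);
  }
  const prefix = anchor.prefix;
  const withPrefix =
    prefix === undefined ? [] : hits.filter((at) => text.slice(0, at).endsWith(prefix));
  const ranked = withPrefix.length > 0 ? withPrefix : hits;

  if (ranked.length === 1) {
    const start = ranked[0];
    return { status: "resolved", start, end: start + anchor.exact.length, via: "quote" };
  }
  if (ranked.length > 1) return { status: "ambiguous", candidates: ranked };

  // Nothing matched: the annotation is orphaned until a user re-anchors it.
  return { status: "orphaned" };
}
```

The cheap position check runs first because it is the common case when the representation has not changed; the quote search only runs once the offsets have gone stale.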

---

## 8. Data Flow: Evidence-Backed Form Field

```text
User focuses form field
  → evidence-binder identifies EvidenceSet for field
  → evidence-binder selects active EvidenceItem
  → citation-engine loads annotation and source context
  → viewer adapter resolves and scrolls to annotation
  → evidence sidebar highlights active evidence item
  → form field shows active evidence state
  → visual guide overlay connects field, evidence, and highlight
```

Evidence switch:

```text
User selects next evidence item
  → activeEvidenceItemId changes
  → annotation is resolved
  → viewer scrolls to new passage
  → guide overlay updates
```
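The first step of the evidence switch, changing `activeEvidenceItemId`, could be a pure state transition like the one sketched here. Wrapping around from the last item to the first is an assumption, and `EvidenceSetState` and `selectNextEvidence` are hypothetical names.

```ts
// Subset of EvidenceSet relevant to switching (hypothetical reduced type).
interface EvidenceSetState {
  evidenceItemIds: string[];
  activeEvidenceItemId?: string;
}

// Return a new state with the next evidence item active, wrapping at the end.
function selectNextEvidence(set: EvidenceSetState): EvidenceSetState {
  if (set.evidenceItemIds.length === 0) {
    return { ...set, activeEvidenceItemId: undefined };
  }
  // indexOf returns -1 for an unknown or missing id, so the first item becomes active.
  const current =
    set.activeEvidenceItemId === undefined
      ? -1
      : set.evidenceItemIds.indexOf(set.activeEvidenceItemId);
  const next = (current + 1) % set.evidenceItemIds.length;
  return { ...set, activeEvidenceItemId: set.evidenceItemIds[next] };
}
```

Resolving the annotation, scrolling the viewer, and updating the guide overlay would then react to the changed state.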

---

## 9. Data Flow: Citation Recovery

```text
User enters citation clue / quote / source reference
  → evidence-source parses clue
  → search local document library
  → rank local candidates
  → if allowed, search configured external sources
  → fetch/load candidate representation where permitted
  → exact quote search
  → fuzzy quote search
  → show candidate passages
  → user confirms passage
  → evidence-anchor creates selectors
  → citation-engine creates Annotation
  → citation-engine creates EvidenceItem
  → optional: evidence-binder links item to target
```

Recovery states:

```text
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
manual-confirmation-needed
annotation-created
```

---

## 10. Persistence Architecture

The architecture should support multiple persistence modes.

### 10.1 Local-First Development Mode

Suitable for early MVPs and personal use.

```text
SQLite / DuckDB / local filesystem
  documents stored as files
  metadata stored in SQLite
  extracted text cached locally
  annotations stored as JSON or relational rows
```

Advantages:

* Simple setup.
* Good for CLI and desktop-like workflows.
* Agent-friendly.
* Easy to version and inspect.

### 10.2 Web Application Mode

Suitable for team or server deployment.

```text
Object storage
  original documents

PostgreSQL
  documents
  representations
  annotations
  evidence items
  evidence links

Search index
  full-text and quote search
```

Recommended baseline:

```text
PostgreSQL
  canonical metadata and relationships

Object storage / filesystem
  document blobs and generated representations

Meilisearch / Typesense / OpenSearch
  full-text document and evidence search
```

### 10.3 Persistence Boundaries

`citation-engine` should define persistence interfaces.

Concrete storage implementations should be replaceable.

```ts
interface AnnotationRepository {
  create(annotation: Annotation): Promise<Annotation>;
  get(id: string): Promise<Annotation | null>;
  listByDocument(documentId: string): Promise<Annotation[]>;
  update(annotation: Annotation): Promise<Annotation>;
}
```
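To show that storage can be swapped behind this interface, here is a minimal in-memory implementation that could serve tests or a local prototype. `AnnotationRecord` is a reduced, hypothetical stand-in for `Annotation`, and the choice to throw on duplicate or missing ids is an assumption.

```ts
// Reduced, hypothetical stand-in for the Annotation model.
interface AnnotationRecord {
  id: string;
  documentId: string;
  quote?: string;
  updatedAt: string;
}

interface AnnotationRepository {
  create(annotation: AnnotationRecord): Promise<AnnotationRecord>;
  get(id: string): Promise<AnnotationRecord | null>;
  listByDocument(documentId: string): Promise<AnnotationRecord[]>;
  update(annotation: AnnotationRecord): Promise<AnnotationRecord>;
}

class InMemoryAnnotationRepository implements AnnotationRepository {
  private rows = new Map<string, AnnotationRecord>();

  async create(annotation: AnnotationRecord): Promise<AnnotationRecord> {
    if (this.rows.has(annotation.id)) {
      throw new Error(`annotation ${annotation.id} already exists`);
    }
    // Store and return copies so callers cannot mutate stored rows.
    this.rows.set(annotation.id, { ...annotation });
    return { ...annotation };
  }

  async get(id: string): Promise<AnnotationRecord | null> {
    const row = this.rows.get(id);
    return row ? { ...row } : null;
  }

  async listByDocument(documentId: string): Promise<AnnotationRecord[]> {
    return Array.from(this.rows.values())
      .filter((row) => row.documentId === documentId)
      .map((row) => ({ ...row }));
  }

  async update(annotation: AnnotationRecord): Promise<AnnotationRecord> {
    if (!this.rows.has(annotation.id)) {
      throw new Error(`annotation ${annotation.id} does not exist`);
    }
    this.rows.set(annotation.id, { ...annotation });
    return { ...annotation };
  }
}
```

A SQLite- or PostgreSQL-backed class would implement the same four methods, so callers in `citation-engine` would not need to change.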

---

## 11. Search and Indexing Architecture

Search is needed for:

* Finding documents.
* Finding evidence items.
* Searching within a document.
* Citation recovery.
* Fuzzy re-anchoring.

Index types:

```text
Document metadata index
  title, author, source URI, document type, collection

Full-text document index
  canonical text, page text, section text

Evidence index
  quote, commentary, tags, target links

Anchor recovery index
  n-grams, quote fragments, prefix/suffix context
```
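The anchor recovery index could start out as a small character n-gram index, as sketched below. The trigram size, the scoring rule (shared n-grams divided by n-grams in the query), and the names `charNgrams` and `NgramIndex` are assumptions for illustration; a dedicated search service would replace this for large collections.

```ts
// Character n-grams of a whitespace-normalized, lowercased string.
function charNgrams(text: string, n = 3): Set<string> {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i + n <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + n));
  }
  return grams;
}

class NgramIndex {
  // Maps each n-gram to the ids of the passages that contain it.
  private postings = new Map<string, Set<string>>();

  add(passageId: string, text: string): void {
    charNgrams(text).forEach((gram) => {
      const ids = this.postings.get(gram) ?? new Set<string>();
      ids.add(passageId);
      this.postings.set(gram, ids);
    });
  }

  // Rank passages by the fraction of the query's n-grams they contain.
  search(query: string): { passageId: string; score: number }[] {
    const grams = charNgrams(query);
    const shared = new Map<string, number>();
    grams.forEach((gram) => {
      const ids = this.postings.get(gram);
      if (!ids) return;
      ids.forEach((id) => shared.set(id, (shared.get(id) ?? 0) + 1));
    });
    return Array.from(shared.entries())
      .map(([passageId, count]) => ({ passageId, score: count / grams.size }))
      .sort((a, b) => b.score - a.score);
  }
}
```

Because scoring is based on overlapping fragments, a quote with a few changed words still ranks its original passage near the top, which is what fuzzy re-anchoring needs as a first filter.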

For the MVP, local full-text search may be enough.

Later, source recovery and large document collections will benefit from a dedicated search service.

---

## 12. UI Architecture

### 12.1 Review Workspace

```text
┌─────────────────────────────────────────────────────────────┐
│ Workspace Header                                             │
├───────────────┬───────────────────────────────┬─────────────┤
│ Collection     │ Document Viewer                │ Evidence    │
│ Navigation     │                               │ Sidebar     │
└───────────────┴───────────────────────────────┴─────────────┘
```

Primary interactions:

* Select text.
* Create annotation.
* Add commentary.
* Tag evidence.
* Click evidence to reopen context.
* Filter by status/tag/document.

### 12.2 Evidence Form Workspace

```text
┌───────────────────────────────┬─────────────────────────────┐
│ Structured Form                │ Document Viewer              │
│                               │                             │
│ Field A                        │ Active citation highlight    │
│  evidence chips                │                             │
│ Field B                        │                             │
│  evidence chips                │                             │
├───────────────────────────────┴─────────────────────────────┤
│ Optional Evidence Tray / Active Citation Details              │
└───────────────────────────────────────────────────────────────┘
```

Visual guide overlay:

```text
field element → evidence chip/card → document highlight
```

The overlay should be independent of both the form renderer and the document viewer.

### 12.3 Citation Recovery Workspace

```text
┌─────────────────────────────────────────────────────────────┐
│ Citation / Quote Input                                      │
├──────────────────────┬──────────────────────────────────────┤
│ Candidate Sources     │ Candidate Passages                   │
├──────────────────────┴──────────────────────────────────────┤
│ Confirm / Create Annotation                                  │
└─────────────────────────────────────────────────────────────┘
```

---

## 13. Event Model

Subsystems should communicate through explicit domain events where one subsystem needs to react to a state change in another.

Examples:

```text
DocumentImported
DocumentRepresentationGenerated
AnnotationCreated
AnnotationResolved
AnnotationResolutionFailed
EvidenceItemCreated
EvidenceItemLinked
EvidenceItemActivated
FormFieldActivated
CitationCardRendered
CitationRecoveryStarted
CitationRecoveryCandidateFound
CitationRecoveryConfirmed
```

Example event:

```ts
interface EvidenceItemActivatedEvent {
  type: "EvidenceItemActivated";
  evidenceItemId: string;
  source?: "sidebar" | "form-field" | "citation-card";
  targetContext?: {
    type: "form-field" | "claim" | "requirement";
    id: string;
  };
}
```

Events should be useful both for UI synchronization and later automation/agent workflows.
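A small typed event bus is one way to carry such events between subsystems. The sketch below is an assumption rather than a prescribed design: `DomainEventBus` is a hypothetical name, only two event types are modeled, and the `annotationId` payload on `AnnotationResolutionFailed` is invented for the example.

```ts
// Two of the events listed above; the second payload is hypothetical.
type DomainEvent =
  | {
      type: "EvidenceItemActivated";
      evidenceItemId: string;
      source?: "sidebar" | "form-field" | "citation-card";
    }
  | { type: "AnnotationResolutionFailed"; annotationId: string };

type EventType = DomainEvent["type"];
type EventOf<T extends EventType> = Extract<DomainEvent, { type: T }>;
type AnyHandler = (event: DomainEvent) => void;

class DomainEventBus {
  private handlers = new Map<string, AnyHandler[]>();

  // Subscribe to one event type; returns a function that removes the subscription.
  on<T extends EventType>(type: T, handler: (event: EventOf<T>) => void): () => void {
    const stored = handler as unknown as AnyHandler;
    const list = this.handlers.get(type) ?? [];
    list.push(stored);
    this.handlers.set(type, list);
    return () => {
      const index = list.indexOf(stored);
      if (index !== -1) list.splice(index, 1);
    };
  }

  // Deliver the event synchronously to every handler registered for its type.
  publish(event: DomainEvent): void {
    const list = this.handlers.get(event.type) ?? [];
    // Copy first so handlers may unsubscribe while the event is being delivered.
    list.slice().forEach((handler) => handler(event));
  }
}
```

The same bus could feed UI state and an automation log, since handlers only depend on the event shape and not on who published it.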

---

## 14. External Standards and Compatibility

The architecture should align with existing standards where practical.

### 14.1 W3C Web Annotation

Use W3C Web Annotation concepts for:

* Annotation.
* Body.
* Target.
* Selector.
* TextQuoteSelector.
* TextPositionSelector.

Recommended approach:

```text
Internal model:
  optimized for citation-evidence workflows

Import/export mapping:
  W3C Web Annotation-compatible JSON where practical
```

This avoids forcing JSON-LD complexity into every internal operation while preserving standards compatibility.
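An export mapping along these lines might look like the following sketch. The output uses the W3C Web Annotation vocabulary (`Annotation`, `TextualBody`, `TextQuoteSelector`, `TextPositionSelector`); the internal shapes, the `kind` discriminator, the `toWebAnnotation` function, and the `baseIri` parameter used to turn internal ids into IRIs are assumptions for this example.

```ts
// Hypothetical internal shapes, reduced for the example.
type InternalSelector =
  | { kind: "text-quote"; exact: string; prefix?: string; suffix?: string }
  | { kind: "text-position"; start: number; end: number };

interface InternalAnnotation {
  id: string;
  documentUri: string;
  note?: string;
  createdAt: string; // ISO 8601 timestamp
  selectors: InternalSelector[];
}

type W3cSelector =
  | { type: "TextQuoteSelector"; exact: string; prefix?: string; suffix?: string }
  | { type: "TextPositionSelector"; start: number; end: number };

interface WebAnnotationJson {
  "@context": "http://www.w3.org/ns/anno.jsonld";
  id: string;
  type: "Annotation";
  created: string;
  body?: { type: "TextualBody"; value: string; format: string };
  target: { source: string; selector: W3cSelector[] };
}

function toWebAnnotation(annotation: InternalAnnotation, baseIri: string): WebAnnotationJson {
  const selector = annotation.selectors.map((s): W3cSelector =>
    s.kind === "text-quote"
      ? { type: "TextQuoteSelector", exact: s.exact, prefix: s.prefix, suffix: s.suffix }
      : { type: "TextPositionSelector", start: s.start, end: s.end }
  );
  const result: WebAnnotationJson = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    // W3C annotation ids are IRIs, so the internal id is placed under a base IRI.
    id: `${baseIri}/annotations/${encodeURIComponent(annotation.id)}`,
    type: "Annotation",
    created: annotation.createdAt,
    target: { source: annotation.documentUri, selector },
  };
  if (annotation.note) {
    result.body = { type: "TextualBody", value: annotation.note, format: "text/plain" };
  }
  return result;
}
```

Evidence items, bindings, and recovery states have no direct counterpart in the W3C vocabulary and would stay in the internal model or in custom extension fields.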

### 14.2 Web Components

Citation presentation should be embeddable through web components where possible:

```html
<citation-card evidence-item-id="ev_123"></citation-card>
<citation-context-link annotation-id="ann_456"></citation-context-link>
<citation-document-viewer document-id="doc_123"></citation-document-viewer>
```

### 14.3 URL Deep Links

The system should provide stable internal URLs such as:

```text
/viewer?document=doc_123&annotation=ann_456
/workspace/collections/col_123/documents/doc_123?evidence=ev_456
```

For public HTML documents, optional browser text fragments may be generated as export aids, but should not be the only internal anchoring mechanism.
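Both link styles can be generated from stored data, as in this sketch. The internal URL follows the `/viewer` pattern shown above; the text fragment follows the browser `#:~:text=` directive syntax (`prefix-,text,-suffix`). The function names are hypothetical, and only the single-quote form of the directive (no range end) is handled.

```ts
// Internal deep link in the /viewer?document=...&annotation=... form.
function viewerDeepLink(documentId: string, annotationId: string): string {
  return (
    `/viewer?document=${encodeURIComponent(documentId)}` +
    `&annotation=${encodeURIComponent(annotationId)}`
  );
}

// Percent-encode one directive part. "-" is encoded as well because the
// directive syntax uses it to mark prefix and suffix; encodeURIComponent
// already encodes "," and "&".
function encodeFragmentPart(text: string): string {
  return encodeURIComponent(text).replace(/-/g, "%2D");
}

// Build a text fragment URL for a public page from quote selector data.
function textFragmentUrl(
  pageUrl: string,
  quote: { exact: string; prefix?: string; suffix?: string }
): string {
  const parts: string[] = [];
  if (quote.prefix) parts.push(`${encodeFragmentPart(quote.prefix)}-`);
  parts.push(encodeFragmentPart(quote.exact));
  if (quote.suffix) parts.push(`-${encodeFragmentPart(quote.suffix)}`);
  // Drop any existing fragment before appending the directive.
  const base = pageUrl.split("#")[0];
  return `${base}#:~:text=${parts.join(",")}`;
}
```

Text fragments depend on browser support and on the public page staying unchanged, which is why they are treated only as an export aid here.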

---

## 15. Security Architecture

Security principles:

1. Treat imported documents as untrusted input.
2. Sanitize imported HTML.
3. Avoid executing document scripts.
4. Isolate document rendering where needed.
5. Do not send private document text to external services without explicit user permission.
6. Make external source lookup configurable.
7. Preserve access control boundaries around collections and documents.

Important areas:

```text
HTML sanitization
PDF processing safety
external URL fetching
object storage access
annotation visibility
collection permissions
agent/tool permissions
```

For the MVP, single-user local security is sufficient, but the model should not preclude multi-user permissions later.

---

## 16. Suggested Initial Technical Stack

### 16.1 Frontend

```text
TypeScript
React for first application shell
PDF.js or react-pdf-highlighter-plus for PDF MVP
unified / remark / rehype for Markdown rendering
DOMPurify for HTML sanitization
SVG overlay for visual guides
CSS Custom Highlight API with fallback for HTML/Markdown highlighting
```

### 16.2 Backend / Local Service

```text
Node.js or Python service for initial ingestion
PostgreSQL for server mode
SQLite for local-first mode
Filesystem or object storage for document blobs
Meilisearch or Typesense for search if needed early
```

### 16.3 Document Processing

```text
PDF.js text extraction for browser-side PDF workflows
Apache Tika or similar for broader server-side extraction later
Tesseract OCR for scanned documents later
```

### 16.4 Packaging Direction

Prefer TypeScript-first packages for the core web-facing model and UI integration.

A later backend may be polyglot, but the browser-facing contracts should remain TypeScript-native.

---

## 17. MVP Implementation Plan

### Phase 1: Core Model and PDF Review

Deliverables:

* Basic `Document`, `Annotation`, `EvidenceItem` model.
* PDF viewer integration.
* Text selection capture.
* Highlight creation.
* Commentary entry.
* Evidence sidebar.
* Click evidence to reopen context.
* Markdown/HTML citation card export.

Subsystems involved:

```text
citation-engine
citation-work
evidence-anchor
evidence-source
citation-evidence
```

### Phase 2: Evidence Binding and Form Mode

Deliverables:

* Simple form definition model.
* Evidence links to form fields.
* Evidence chips on fields.
* Activate field to focus evidence.
* Evidence switcher.
* Active state synchronization.
* Initial SVG visual guide overlay.

Subsystems involved:

```text
evidence-binder
citation-engine
citation-work
evidence-anchor
citation-evidence
```

### Phase 3: Markdown and HTML Documents

Deliverables:

* Markdown rendering adapter.
* HTML rendering adapter.
* DOM text selection capture.
* Text quote and text position selectors.
* Highlighting in non-paginated documents.
* Reuse evidence sidebar and binding workflows.

Subsystems involved:

```text
evidence-source
evidence-anchor
citation-work
citation-engine
```

### Phase 4: Local Citation Recovery

Deliverables:

* Recovery input UI.
* Local document search.
* Exact quote match.
* Fuzzy quote match.
* Candidate passage confirmation.
* Create annotation from confirmed match.

Subsystems involved:

```text
evidence-source
evidence-anchor
citation-engine
citation-work
```

---

## 18. Architectural Decisions to Make Early

### ADR-001: Internal model vs. native W3C Web Annotation

Recommendation:

Use an internal model optimized for citation-evidence, with W3C-compatible import/export mapping.

Reason:

The product needs evidence binding, form synchronization, recovery states, and citation cards, all of which go beyond the W3C Web Annotation model.

### ADR-002: React-first vs. Web-component-first

Recommendation:

Build the first application in React, but keep core model and viewer adapter contracts framework-neutral. Add web components for citation cards and context links early.

Reason:

React accelerates MVP UI development, while framework-neutral contracts protect reuse.

### ADR-003: Local-first vs. server-first storage

Recommendation:

Design persistence interfaces from the beginning. Implement local-first storage first if the target is personal/agentic workflows; implement PostgreSQL-backed storage when collaboration or server deployment becomes necessary.

### ADR-004: PDF.js direct vs. react-pdf-highlighter-plus

Recommendation:

Use react-pdf-highlighter-plus for initial speed if it satisfies selector and rendering needs. Keep an abstraction boundary so the PDF viewer can be replaced with direct PDF.js integration later.

### ADR-005: Citation recovery scope

Recommendation:

Start with local document library recovery. Add external source lookup only after the local anchoring and quote matching pipeline is reliable.

---

## 19. Risks and Mitigations

| Risk                                                 | Impact | Mitigation                                                      |
| ---------------------------------------------------- | -----: | --------------------------------------------------------------- |
| PDF text extraction is inconsistent across documents |   High | Store both visual and text selectors; support manual correction |
| Highlight coordinates break with zoom/layout         |   High | Use normalized coordinates and viewer-independent selectors     |
| Imported HTML executes unsafe content                |   High | Sanitize and sandbox HTML rendering                             |
| Citation recovery finds wrong passage                | Medium | Require user confirmation for fuzzy or external matches         |
| Too many repos create coordination overhead          | Medium | Keep domain model in citation-engine and define clear contracts |
| Viewer library constraints leak into domain model    |   High | Enforce adapter boundary and selector abstraction               |
| Form binding becomes too domain-specific             | Medium | Model generic EvidenceTargets and target adapters               |
| Search/indexing becomes heavy too early              | Medium | Begin local/simple; add dedicated search service later          |

---

## 20. First Reference Scenario

The first end-to-end reference scenario should be:

```text
1. User creates a collection named “Application Evidence”.
2. User uploads a PDF.
3. User selects a passage and adds commentary.
4. System creates an annotation and evidence item.
5. User opens a form next to the PDF.
6. User links the evidence item to a form field.
7. User focuses the field.
8. System highlights the field, evidence item, and source passage.
9. System draws a guide from field to evidence to source passage.
10. User exports the evidence as a Markdown citation card.
```

This scenario exercises the essential product value without requiring external source lookup or advanced collaboration.

---

## 21. Summary

The architecture of **citation-evidence** should be organized around reusable evidence objects rather than around document annotations alone.

The core design is:

```text
Source Document
  → Document Representation
  → Durable Annotation Anchor
  → Evidence Item with Commentary
  → Evidence Link to Field / Claim / Requirement
  → Portable Citation Card
  → Reopenable Source Context
```

The subsystem repositories provide a clean separation of responsibilities:

```text
citation-engine   owns the domain and APIs
evidence-anchor   owns selector creation, resolution, and highlighting
evidence-source   owns ingestion, extraction, and recovery
citation-work     owns review workflows
evidence-binder   owns evidence-to-target binding
citation-evidence owns the integrated product shell
```

This gives the project a practical MVP path while preserving enough architectural clarity to grow into a reusable infrastructure layer for evidence-backed information work.