# INTENT

## Purpose

This repository provides the source layer of the **citation-evidence** ecosystem: document ingestion, text extraction, metadata, and citation recovery.

**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.

It is responsible for answering the source-side questions:

> What is this document?
> How can we extract usable text and structure from it?
> How can we find or recover a cited source passage?

---

## Primary Utility

The repository provides the source pipeline for citation-evidence.

It should make it possible to:

* import documents into a collection or workspace,
* identify document type and media type,
* compute stable document fingerprints,
* extract document metadata,
* extract canonical text,
* create document representations for PDFs, Markdown, HTML, and later other formats,
* build maps between text, pages, sections, and rendered views,
* support local full-text search,
* support source lookup and citation recovery,
* provide the document representations needed by **evidence-anchor** and **citation-work**.

This repository turns documents into evidence-ready sources.

---

## Intended Users

Primary users of this repository are developers and agents implementing source handling for citation-evidence.

They include:

* developers building document import workflows,
* developers building review collections,
* developers implementing PDF, Markdown, and HTML source handling,
* developers implementing citation recovery,
* developers integrating local or external source libraries,
* coding agents that need structured access to document text and metadata.

End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.

---

## Strategic Role

The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.

Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.

**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.

It enables the flow:

```text
Raw Source
  → Document Identity
  → Metadata
  → Canonical Text
  → Document Representation
  → Searchable Source
  → Anchorable Evidence Context
```

---

## Core Concept

The core concept of this repository is the **document representation**.

A document representation is a normalized, searchable, addressable view of a source document.

For a PDF, a representation may include:

```text
document fingerprint
metadata
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
source-to-rendering hints
```

For Markdown or HTML, a representation may include:

```text
canonical text
rendered HTML
sanitized content
heading map
section map
DOM or AST structure
offset-to-node map
source line map where available
```

These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.
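
As an illustration only, a PDF representation could be modeled roughly as follows. The authoritative `DocumentRepresentation` shape is a shared contract owned elsewhere, so every field and class name here (`PageSpan`, `page_spans`, `page_text`) is a hypothetical stand-in chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PageSpan:
    """Where one page's text sits inside the global canonical text."""
    page_number: int  # 1-based page number
    start: int        # inclusive offset into canonical_text
    end: int          # exclusive offset into canonical_text


@dataclass
class DocumentRepresentation:
    """Hypothetical shape for illustration; not the shared contract."""
    fingerprint: str
    media_type: str
    canonical_text: str
    metadata: dict = field(default_factory=dict)
    page_spans: list = field(default_factory=list)  # only meaningful for paged formats

    def page_text(self, page_number: int) -> Optional[str]:
        """Return the canonical text of one page, or None if the page is unknown."""
        for span in self.page_spans:
            if span.page_number == page_number:
                return self.canonical_text[span.start:span.end]
        return None


rep = DocumentRepresentation(
    fingerprint="sha256:example",
    media_type="application/pdf",
    canonical_text="First page.\nSecond page.",
    page_spans=[PageSpan(1, 0, 11), PageSpan(2, 12, 24)],
)
print(rep.page_text(2))
```

Keeping page text as spans over one global string, rather than as separate strings, means a selector can address either a page-local or a global offset against the same data.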

---

## Scope

This repository should own:

* document import workflows,
* document source identification,
* media type detection,
* document fingerprinting,
* source URI handling,
* metadata extraction,
* canonical text extraction,
* PDF text extraction,
* Markdown normalization,
* HTML normalization and sanitization,
* document representation generation,
* representation caching,
* local source search support,
* quote search support,
* citation clue parsing,
* local citation recovery,
* external source discovery hooks,
* recovery state tracking,
* privacy boundaries for source lookup.

It should provide the source-side capabilities consumed by:

* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
* **evidence-anchor** for selector creation and resolution,
* **citation-work** for document review workflows,
* **evidence-binder** when evidence needs source context,
* **citation-evidence** for the integrated product experience.

---

## Out of Scope

This repository should not own the broader evidence domain or user workflows.

Specifically, it should not own:

* the canonical evidence domain model,
* persistence policy beyond source and representation storage contracts,
* low-level anchor resolution algorithms,
* visual highlight rendering,
* review workspace UI,
* form-field binding semantics,
* visual guide overlay behavior,
* citation card rendering,
* application shell and deployment,
* final human validation of evidence quality.

Those responsibilities belong to the appropriate citation-evidence subsystem repositories.

---

## Architectural Position

```text
citation-evidence
  integrated product shell

citation-engine
  core domain model, services, persistence contracts

evidence-source
  document ingestion, extraction, metadata, representations, citation recovery

evidence-anchor
  selectors, anchor resolution, re-anchoring, highlighting contracts

citation-work
  review workspace and annotation UX

evidence-binder
  evidence-to-target binding and active evidence state
```

**evidence-source** should provide document representations, not define what evidence means.

It should feed reliable source material into the rest of the system.

---

## Primary Workflows

### 1. Import Document

A user or system adds a source document.

```text
Add Source
  → Identify Media Type
  → Compute Fingerprint
  → Extract Metadata
  → Extract Text
  → Build Representation
  → Register Document
```
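
The first steps of this workflow can be sketched with the Python standard library. This is a minimal illustration, assuming a SHA-256 content hash as the fingerprint and a magic-byte check plus filename guess for media type; the function names and the returned record shape are hypothetical, and metadata and text extraction are left out.

```python
import hashlib
import mimetypes


def detect_media_type(filename: str, data: bytes) -> str:
    """Prefer content sniffing for PDFs, then fall back to the filename."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def compute_fingerprint(data: bytes) -> str:
    """Content hash of the raw bytes (assumed scheme for this sketch)."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def import_source(filename: str, data: bytes) -> dict:
    """Produce a minimal registration record for a newly added source."""
    return {
        "filename": filename,
        "media_type": detect_media_type(filename, data),
        "fingerprint": compute_fingerprint(data),
        "size": len(data),
    }


record = import_source("paper.bin", b"%PDF-1.7\n...")
print(record["media_type"], record["fingerprint"][:15])
```

Because the fingerprint depends only on the bytes, importing the same file under a different name yields the same fingerprint, which is what makes it usable as a cache and deduplication key.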

### 2. Generate PDF Representation

A PDF is converted into a representation suitable for review and anchoring.

```text
PDF Source
  → Load PDF
  → Extract Page Text
  → Normalize Text
  → Build Page Map
  → Build Offset Map
  → Store Representation
```
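
The page map and offset map steps could look roughly like this. The sketch assumes page text has already been extracted and normalized by some PDF library (not shown), and that pages are joined with a single newline; `build_page_map` and `locate` are hypothetical names.

```python
from bisect import bisect_right


def build_page_map(page_texts, separator="\n"):
    """Join page texts into one global string and record each page's start offset."""
    starts = []
    cursor = 0
    for text in page_texts:
        starts.append(cursor)
        cursor += len(text) + len(separator)
    return separator.join(page_texts), starts


def locate(starts, global_offset):
    """Map a global offset to (1-based page number, page-local offset)."""
    if not starts or global_offset < 0:
        raise ValueError("offset is outside the document")
    index = bisect_right(starts, global_offset) - 1
    # An offset that lands on a separator is attributed to the preceding page.
    return index + 1, global_offset - starts[index]


canonical, starts = build_page_map(["Alpha beta.", "Gamma delta."])
print(locate(starts, canonical.index("delta")))
```

Storing only the page start offsets keeps the map small, and the binary search makes global-to-page lookups cheap even for long documents.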

### 3. Generate Markdown / HTML Representation

A Markdown or HTML source is converted into a normalized rendered and searchable representation.

```text
Markdown / HTML Source
  → Parse / Sanitize
  → Render if needed
  → Extract Canonical Text
  → Build Heading / Section Map
  → Build Offset Map
  → Store Representation
```
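
For HTML, the text extraction and heading map steps can be sketched with the standard-library `html.parser`. This is an illustration under simplifying assumptions: text chunks are joined with single spaces, `script` and `style` content is dropped, and no real sanitization is performed. The class name and the `(level, title, offset)` tuple layout are hypothetical.

```python
from html.parser import HTMLParser


class TextAndHeadings(HTMLParser):
    """Collect visible text plus a heading map of (level, title, offset)."""
    SKIPPED = {"script", "style"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.length = 0        # length of the canonical text built so far
        self.headings = []
        self._skip_depth = 0
        self._heading = None   # (level, [(offset, text), ...]) while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._heading = (int(tag[1]), [])

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.HEADINGS and self._heading is not None:
            level, parts = self._heading
            if parts:
                title = " ".join(text for _, text in parts)
                self.headings.append((level, title, parts[0][0]))
            self._heading = None

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace inside the chunk
        if self._skip_depth or not text:
            return
        start = self.length + (1 if self.chunks else 0)  # +1 for the joining space
        self.chunks.append(text)
        self.length = start + len(text)
        if self._heading is not None:
            self._heading[1].append((start, text))


def extract(html):
    parser = TextAndHeadings()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks), parser.headings


text, headings = extract(
    "<h1>Intro</h1><p>Some <b>bold</b> text.</p>"
    "<script>x=1</script><h2>Details</h2><p>More.</p>"
)
print(text)
print(headings)
```

Each heading offset points into the canonical text, so a section map can be derived by treating each heading as the start of a span that ends at the next heading of the same or higher level.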

### 4. Search Local Sources

A user or subsystem searches available source material.

```text
Search Query / Quote
  → Search Metadata
  → Search Full Text
  → Return Candidate Documents / Passages
```
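
A naive version of this search, with no index, might look like the following. It assumes each document is a plain dict with `fingerprint`, `metadata`, and `canonical_text` keys, which is a layout invented for this sketch; a real implementation would more likely use a full-text index.

```python
def search_local(documents, query):
    """Return metadata hits first, then full-text hits, for a query string."""
    needle = query.lower()
    if not needle:
        return []
    metadata_hits, text_hits = [], []
    for doc in documents:
        title = doc.get("metadata", {}).get("title", "")
        if needle in title.lower():
            metadata_hits.append(
                {"fingerprint": doc["fingerprint"], "field": "title", "offset": None}
            )
        # Lowercasing can change string length for a few non-ASCII characters,
        # so these offsets are only exact for text where it does not.
        haystack = doc["canonical_text"].lower()
        start = haystack.find(needle)
        while start != -1:
            text_hits.append(
                {"fingerprint": doc["fingerprint"], "field": "text", "offset": start}
            )
            start = haystack.find(needle, start + 1)
    return metadata_hits + text_hits


library = [
    {"fingerprint": "doc-a", "metadata": {"title": "On Evidence"},
     "canonical_text": "Evidence must be traceable. Weak evidence misleads."},
    {"fingerprint": "doc-b", "metadata": {"title": "Field Notes"},
     "canonical_text": "Nothing relevant here."},
]
hits = search_local(library, "evidence")
print(hits)
```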

### 5. Recover Citation

A user provides a citation, quote, or source clue.

```text
Citation Clue
  → Parse Source Metadata
  → Search Local Library
  → Optionally Search Configured External Sources
  → Load Candidate Source
  → Search Exact Quote
  → Search Fuzzy Quote
  → Present Candidate Passages
  → User Confirms
  → Create Source Context for Annotation
```
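
The exact-then-fuzzy quote steps can be sketched with `difflib.SequenceMatcher` from the standard library. The sliding-window strategy, the `0.8` threshold, and the candidate dict layout are assumptions made for this example, not a prescribed algorithm. The function only produces candidates; confirmation stays with the user.

```python
from difflib import SequenceMatcher


def find_quote(text, quote, threshold=0.8):
    """Try an exact match first; otherwise score sliding windows for similarity."""
    if not quote:
        return []
    exact = text.find(quote)
    if exact != -1:
        return [{"start": exact, "end": exact + len(quote),
                 "score": 1.0, "match": "exact"}]
    window = len(quote)
    step = max(1, window // 4)  # coarse stride keeps the scan cheap
    candidates = []
    for start in range(0, max(1, len(text) - window + 1), step):
        passage = text[start:start + window]
        score = SequenceMatcher(None, quote, passage).ratio()
        if score >= threshold:
            candidates.append({"start": start, "end": start + len(passage),
                               "score": score, "match": "fuzzy"})
    # Best candidates first; a human still decides which one, if any, is right.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)


source = "The quick brown fox jumps over the lazy dog."
print(find_quote(source, "brown fox"))
print(find_quote(source, "quick brwn fox jumps"))
```

A fuzzy hit gives an approximate window, not precise passage boundaries, so a follow-up alignment step would be needed before creating an anchor.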

---

## Initial Source Types

The first version should support or prepare for:

```text
PDF
Markdown
HTML
plain text
remote URL references
```

Later versions may support:

```text
DOCX
EPUB
scanned image documents
OCR-derived text
IIIF resources
TEI XML
structured datasets with source passages
```

---

## Citation Recovery States

Citation recovery should be modeled explicitly.

Initial recovery states may include:

```text
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
```

The system should distinguish between finding a source and finding the exact cited passage.
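
One way to make these states explicit is an enumeration whose values are the state names listed above. The grouping into source-level and passage-level sets, and the `describe` helper, are assumptions added for this sketch to show how the two kinds of success can be kept apart.

```python
from enum import Enum


class RecoveryState(str, Enum):
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


# Hypothetical groupings: a source was located vs. a passage was located.
SOURCE_LOCATED = frozenset({
    RecoveryState.SOURCE_FOUND_FULLTEXT,
    RecoveryState.SOURCE_FOUND_PREVIEW_ONLY,
    RecoveryState.SOURCE_FOUND_METADATA_ONLY,
})
PASSAGE_LOCATED = frozenset({
    RecoveryState.QUOTE_FOUND,
    RecoveryState.CANDIDATE_PASSAGES_FOUND,
})


def describe(state):
    """Classify a state as passage-level, source-only, or neither."""
    if state in PASSAGE_LOCATED:
        return "passage"
    if state in SOURCE_LOCATED:
        return "source-only"
    return "other"


print(describe(RecoveryState("source-found-metadata-only")))
```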

---

## Privacy and Source Lookup Principles

Source lookup can create privacy risks.

The repository should follow these principles:

* search local sources first,
* make external lookup explicit and configurable,
* avoid sending private document text to external services by default,
* record which external services were queried,
* distinguish public metadata lookup from full-text upload,
* allow deployments to disable external lookup completely,
* prefer deterministic local processing where possible.

External source discovery should be an extension point, not an unavoidable default behavior.
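
These principles could be enforced by a small policy check that runs before any external call. The sketch below is illustrative: `LookupPolicy`, `LookupAudit`, `may_query`, the service name, and the `"metadata"` / `"full_text"` payload labels are all hypothetical. The defaults are deliberately closed, so nothing leaves the machine unless a deployment opts in.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LookupPolicy:
    external_enabled: bool = False        # external lookup is off unless enabled
    allow_full_text_upload: bool = False  # metadata lookup and text upload are separate
    allowed_services: frozenset = frozenset()


@dataclass
class LookupAudit:
    queried: list = field(default_factory=list)  # (service, payload_kind) pairs


def may_query(policy, audit, service, payload_kind):
    """Decide whether a lookup is permitted and record it if so."""
    if payload_kind not in ("metadata", "full_text"):
        raise ValueError(f"unknown payload kind: {payload_kind}")
    allowed = (
        policy.external_enabled
        and service in policy.allowed_services
        and (payload_kind == "metadata" or policy.allow_full_text_upload)
    )
    if allowed:
        audit.queried.append((service, payload_kind))
    return allowed


audit = LookupAudit()
policy = LookupPolicy(external_enabled=True,
                      allowed_services=frozenset({"example-catalog"}))
print(may_query(policy, audit, "example-catalog", "metadata"))   # permitted
print(may_query(policy, audit, "example-catalog", "full_text"))  # refused
print(audit.queried)
```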

---

## Design Principles

### Source Identity First

Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.

### Canonical Text Matters

Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
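
As a rough illustration of what explicit and repeatable normalization can mean, the sketch below applies a fixed sequence of steps and carries a version label. The specific rules (NFKC, newline unification, soft-hyphen removal, whitespace collapsing) and the name `NORMALIZATION_VERSION` are assumptions for this example; the actual rules are whatever the shared normalization contract specifies.

```python
import re
import unicodedata

NORMALIZATION_VERSION = "example-1"  # hypothetical label, bumped when rules change


def canonicalize(text: str) -> str:
    """Apply a fixed, ordered set of normalization steps."""
    text = unicodedata.normalize("NFKC", text)             # fold ligatures, NBSP, etc.
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = text.replace("\u00ad", "")                      # drop soft hyphens
    text = re.sub(r"[ \t\f\v]+", " ", text)                # collapse horizontal space
    text = re.sub(r" ?\n ?", "\n", text)                   # trim space around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                 # at most one blank line
    return text.strip()


raw = "\ufb01rst\u00a0 line \r\n\r\n\r\n\r\nnext"
print(repr(canonicalize(raw)))
```

Recording the normalization version alongside each representation lets stored offsets be recognized as stale when the rules change.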

### Representation Is Not Source

The original source and generated representation are different things.

The system should preserve this distinction.

### Local Before External

Citation recovery should search local documents before looking elsewhere.

### Human Confirmation

Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.

### Format-Aware, Model-Neutral

The repository should understand document formats but should not own the broader evidence model.

### Cache Expensive Work

Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
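
A cache key along these lines could combine the source fingerprint with an extractor name and version, so that upgrading an extractor invalidates old entries without touching the source. The key layout, the `extractor` and `options` parameters, and the in-memory `RepresentationCache` are assumptions for this sketch.

```python
import hashlib
import json


def cache_key(fingerprint, extractor, version, options=None):
    """Derive a deterministic key from everything that affects the output."""
    payload = json.dumps(
        {"fingerprint": fingerprint, "extractor": extractor,
         "version": version, "options": options or {}},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RepresentationCache:
    """In-memory stand-in for whatever storage a deployment actually uses."""

    def __init__(self):
        self._store = {}

    def get_or_build(self, key, build):
        """Return the cached value, calling build() only on a miss."""
        if key not in self._store:
            self._store[key] = build()
        return self._store[key]


calls = []


def expensive_extract():
    calls.append(1)  # count how often extraction really runs
    return {"canonical_text": "example"}


cache = RepresentationCache()
key = cache_key("sha256:abc", "pdf-text", "1")
cache.get_or_build(key, expensive_extract)
cache.get_or_build(key, expensive_extract)
print(len(calls))
```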

### Agent-Friendly Output

Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.

---

## Expected Dependencies

This repository is expected to depend on shared types and service contracts from:

```text
citation-engine
  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
```

It may be consumed by:

```text
citation-work
  to load reviewable documents and document representations

evidence-anchor
  to resolve selectors against extracted representations

evidence-binder
  to retrieve source context for linked evidence

citation-evidence
  to provide integrated import and recovery workflows
```

It should avoid depending on review UI or form-binding implementation details.

---

## First Useful Version

A first useful version of **evidence-source** should provide:

* source import interface,
* media type detection,
* document fingerprinting,
* basic metadata extraction,
* PDF text extraction,
* Markdown text extraction,
* HTML sanitization and text extraction,
* canonical text normalization,
* document representation generation,
* simple local quote search,
* recovery attempt model or contract,
* examples showing how a document becomes a representation usable by **evidence-anchor**.

The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.

---

## Success Criteria

The repository is successful when another subsystem can use it to:

1. import a source document,
2. identify and fingerprint it,
3. extract useful metadata,
4. generate canonical text,
5. generate a document representation,
6. search the source text,
7. provide representation data to **evidence-anchor**,
8. support a local citation recovery attempt from a quote or citation clue.

A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.

---

## Repository Character

This repository should be:

* source-focused,
* ingestion-oriented,
* privacy-conscious,
* format-aware,
* representation-centered,
* cache-friendly,
* suitable for local-first and server-side use,
* explicit about uncertainty in citation recovery,
* careful not to absorb review or binding responsibilities.

---

## MVP Coordination — Code Lives Upstream

During the umbrella-first MVP phase (decided 2026-05-24), **the source code
for this subsystem does not live in this repository yet**. It lives in the
umbrella repo at `citation-evidence/src/source/`.

This INTENT.md documents the *intended* responsibilities and boundaries.
When the ingestion and representation interfaces have stabilized through
actual MVP use, the corresponding code will be extracted into this repository.

**Shared contracts** (Document and DocumentRepresentation shapes,
CitationRecoveryAttempt state enum, canonical text normalization, allowed
dependency edges) are maintained in the umbrella repo:

* `citation-evidence/wiki/SharedContracts.md`
* `citation-evidence/wiki/DependencyMap.md`
* `citation-evidence/docs/decisions/` (ADRs)

This subsystem's eventual code must not contradict those documents. Changes
to shared contracts happen in the umbrella, not here.

Under the dependency map, **`evidence-source` may depend only on
`citation-engine`** — not on `evidence-anchor`. When ingestion needs to know
"could a selector resolve here?", the answer travels through events, not
direct calls.

---

## Guiding Statement

**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**