evidence-source/INTENT.md
tegwick d8a08d6032 Add MVP Coordination section: code lives in citation-evidence umbrella during MVP
Documents the umbrella-first MVP decision (2026-05-24). This repo remains
INTENT-only until the ingestion and representation interfaces stabilize
through real product use. Reaffirms: source depends only on engine, not on
anchor — coordination between them flows through events.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 16:51:06 +02:00


# INTENT
## Purpose
This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.
**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.
It is responsible for answering the source-side questions:
> What is this document?
> How can we extract usable text and structure from it?
> How can we find or recover a cited source passage?
---
## Primary Utility
The repository provides the source pipeline for citation-evidence.
It should make it possible to:
- import documents into a collection or workspace,
- identify document type and media type,
- compute stable document fingerprints,
- extract document metadata,
- extract canonical text,
- create document representations for PDFs, Markdown, HTML, and later other formats,
- build maps between text, pages, sections, and rendered views,
- support local full-text search,
- support source lookup and citation recovery,
- provide the document representations needed by **evidence-anchor** and **citation-work**.
This repository turns documents into evidence-ready sources.
---
## Intended Users
Primary users of this repository are developers and agents implementing source handling for citation-evidence.
They include:
- developers building document import workflows,
- developers building review collections,
- developers implementing PDF, Markdown, and HTML source handling,
- developers implementing citation recovery,
- developers integrating local or external source libraries,
- coding agents that need structured access to document text and metadata.
End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.
---
## Strategic Role
The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.
Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.
**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.
It enables the flow:
```text
Raw Source
→ Document Identity
→ Metadata
→ Canonical Text
→ Document Representation
→ Searchable Source
→ Anchorable Evidence Context
```
---
## Core Concept
The core concept of this repository is the **document representation**.
A document representation is a normalized, searchable, addressable view of a source document.
For a PDF, a representation may include:
```text
document fingerprint
metadata
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
source-to-rendering hints
```
For Markdown or HTML, a representation may include:
```text
canonical text
rendered HTML
sanitized content
heading map
section map
DOM or AST structure
offset-to-node map
source line map where available
```
These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.
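
The PDF field list above could be modelled roughly as follows. This is a minimal sketch, assuming Python dataclasses; `PageText`, `PdfRepresentation`, and every field name are hypothetical and do not describe the real `DocumentRepresentation` contract, which is defined elsewhere. It only illustrates how page text, a page-local offset, and the global canonical text might relate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageText:
    """Text of one PDF page and where it starts in the global canonical text."""
    page_number: int    # 1-based page number
    text: str           # normalized text of this page
    global_start: int   # offset of the page's first character in canonical text


@dataclass(frozen=True)
class PdfRepresentation:
    """Hypothetical shape, for illustration only."""
    fingerprint: str
    metadata: dict[str, str]
    pages: tuple[PageText, ...]

    @property
    def page_count(self) -> int:
        return len(self.pages)

    @property
    def canonical_text(self) -> str:
        # In this sketch the global text is the concatenation of page texts.
        return "".join(page.text for page in self.pages)


rep = PdfRepresentation(
    fingerprint="sha256:example",
    metadata={"title": "Example"},
    pages=(PageText(1, "Hello ", 0), PageText(2, "world", 6)),
)
```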
---
## Scope
This repository should own:
* document import workflows,
* document source identification,
* media type detection,
* document fingerprinting,
* source URI handling,
* metadata extraction,
* canonical text extraction,
* PDF text extraction,
* Markdown normalization,
* HTML normalization and sanitization,
* document representation generation,
* representation caching,
* local source search support,
* quote search support,
* citation clue parsing,
* local citation recovery,
* external source discovery hooks,
* recovery state tracking,
* privacy boundaries for source lookup.
It should provide the source-side capabilities consumed by:
* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
* **evidence-anchor** for selector creation and resolution,
* **citation-work** for document review workflows,
* **evidence-binder** when evidence needs source context,
* **citation-evidence** for the integrated product experience.
---
## Out of Scope
This repository should not own the broader evidence domain or user workflows.
Specifically, it should not own:
* the canonical evidence domain model,
* persistence policy beyond source and representation storage contracts,
* low-level anchor resolution algorithms,
* visual highlight rendering,
* review workspace UI,
* form-field binding semantics,
* visual guide overlay behavior,
* citation card rendering,
* application shell and deployment,
* final human validation of evidence quality.
Those responsibilities belong to the appropriate citation-evidence subsystem repositories.
---
## Architectural Position
```text
citation-evidence
  integrated product shell
citation-engine
  core domain model, services, persistence contracts
evidence-source
  document ingestion, extraction, metadata, representations, citation recovery
evidence-anchor
  selectors, anchor resolution, re-anchoring, highlighting contracts
citation-work
  review workspace and annotation UX
evidence-binder
  evidence-to-target binding and active evidence state
```
**evidence-source** should provide document representations, not define what evidence means.
It should feed reliable source material into the rest of the system.
---
## Primary Workflows
### 1. Import Document
A user or system adds a source document.
```text
Add Source
→ Identify Media Type
→ Compute Fingerprint
→ Extract Metadata
→ Extract Text
→ Build Representation
→ Register Document
```
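The identification steps of this flow might look like the sketch below. It assumes a SHA-256 content hash as the fingerprint and the standard-library `mimetypes` table for media types; neither choice is stated in this document, and `detect_media_type`, `compute_fingerprint`, `import_source`, and the record keys are hypothetical names.

```python
import hashlib
import mimetypes


def detect_media_type(filename: str, data: bytes) -> str:
    """Check for the PDF magic bytes first, then fall back to the file name."""
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"


def compute_fingerprint(data: bytes) -> str:
    """Content-based fingerprint: identical bytes always give the same value."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def import_source(filename: str, data: bytes) -> dict:
    """Run the identification steps of the import flow (hypothetical record)."""
    return {
        "filename": filename,
        "media_type": detect_media_type(filename, data),
        "fingerprint": compute_fingerprint(data),
        "size_bytes": len(data),
    }


record = import_source("notes.html", b"<h1>Notes</h1>")
```

Hashing the raw bytes keeps the fingerprint independent of file name and location, so the same document imported twice resolves to the same identity. Metadata extraction, text extraction, and registration would follow as separate steps.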
### 2. Generate PDF Representation
A PDF is converted into a representation suitable for review and anchoring.
```text
PDF Source
→ Load PDF
→ Extract Page Text
→ Normalize Text
→ Build Page Map
→ Build Offset Map
→ Store Representation
```
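The page map and offset map steps can be sketched without committing to a PDF library. The example assumes page texts have already been extracted and normalized, and that the global canonical text is their plain concatenation; `build_page_offsets` and `locate` are hypothetical helper names.

```python
from bisect import bisect_right


def build_page_offsets(page_texts: list[str]) -> list[int]:
    """Return the global start offset of each page in the concatenated text."""
    offsets, position = [], 0
    for text in page_texts:
        offsets.append(position)
        position += len(text)
    return offsets


def locate(offsets: list[int], global_offset: int) -> tuple[int, int]:
    """Map a non-negative global offset to (1-based page, page-local offset)."""
    index = bisect_right(offsets, global_offset) - 1
    return index + 1, global_offset - offsets[index]


pages = ["First page. ", "Second page. ", "Third."]
offsets = build_page_offsets(pages)
canonical = "".join(pages)
page, local = locate(offsets, canonical.index("Second"))
```

Because the offsets are sorted, a binary search finds the page for any global offset, which is what a selector resolver would need when translating a text position back to a page.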
### 3. Generate Markdown / HTML Representation
A Markdown or HTML source is converted into a normalized rendered and searchable representation.
```text
Markdown / HTML Source
→ Parse / Sanitize
→ Render if needed
→ Extract Canonical Text
→ Build Heading / Section Map
→ Build Offset Map
→ Store Representation
```
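For HTML, the text extraction and heading map steps could be approximated with the standard-library `html.parser` module, as in the sketch below. It is not a sanitizer, and it inserts a space between every text chunk, so inline markup in the middle of a word would split that word. `TextAndHeadings` and `extract` are hypothetical names.

```python
from html.parser import HTMLParser


class TextAndHeadings(HTMLParser):
    """Collect visible text plus a heading map; skip script and style content."""

    SKIPPED = {"script", "style"}
    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []
        self.length = 0                            # length of text collected so far
        self.headings: list[tuple[str, int]] = []  # (tag, offset of heading text)
        self._skip_depth = 0
        self._pending_heading = ""                 # heading tag awaiting its text

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._pending_heading = tag

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        chunk = " ".join(data.split())             # collapse whitespace in the chunk
        if self._skip_depth or not chunk:
            return
        if self.parts:
            self._append(" ")                      # one space between text chunks
        if self._pending_heading:
            self.headings.append((self._pending_heading, self.length))
            self._pending_heading = ""
        self._append(chunk)

    def _append(self, piece: str) -> None:
        self.parts.append(piece)
        self.length += len(piece)


def extract(html: str) -> tuple[str, list[tuple[str, int]]]:
    parser = TextAndHeadings()
    parser.feed(html)
    parser.close()                                 # flush buffered trailing text
    return "".join(parser.parts), parser.headings


text, headings = extract(
    "<h1>Title</h1><script>x()</script><p>Body &amp; text</p><h2>Next</h2>"
)
```

Each heading offset points at the first character of the heading text in the extracted string, which gives a simple offset-to-heading map.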
### 4. Search Local Sources
A user or subsystem searches available source material.
```text
Search Query / Quote
→ Search Metadata
→ Search Full Text
→ Return Candidate Documents / Passages
```
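A first local search could be as small as the sketch below: a case-insensitive substring scan over titles, then over full text. `SourceDoc`, `Hit`, and `search` are hypothetical names, and a real implementation would more likely use a full-text index. Offsets assume that lowercasing does not change string length, which holds for most but not all Unicode text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDoc:
    doc_id: str
    title: str
    text: str


@dataclass(frozen=True)
class Hit:
    doc_id: str
    field: str   # "title" or "text"
    offset: int  # character offset of the match within that field


def search(docs: list[SourceDoc], query: str) -> list[Hit]:
    """Return title hits first, then every full-text hit, in document order."""
    needle = query.lower()
    if not needle:
        return []
    title_hits, text_hits = [], []
    for doc in docs:
        pos = doc.title.lower().find(needle)
        if pos != -1:
            title_hits.append(Hit(doc.doc_id, "title", pos))
        haystack = doc.text.lower()
        pos = haystack.find(needle)
        while pos != -1:
            text_hits.append(Hit(doc.doc_id, "text", pos))
            pos = haystack.find(needle, pos + 1)
    return title_hits + text_hits


docs = [
    SourceDoc("a", "On Evidence", "Evidence requires sources. Good evidence is cited."),
    SourceDoc("b", "Notes", "Nothing relevant here."),
]
hits = search(docs, "evidence")
```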
### 5. Recover Citation
A user provides a citation, quote, or source clue.
```text
Citation Clue
→ Parse Source Metadata
→ Search Local Library
→ Optionally Search Configured External Sources
→ Load Candidate Source
→ Search Exact Quote
→ Search Fuzzy Quote
→ Present Candidate Passages
→ User Confirms
→ Create Source Context for Annotation
```
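The exact and fuzzy quote search steps might be sketched as follows. The fuzzy fallback uses `difflib.SequenceMatcher` from the standard library over a sliding window. That is one possible technique rather than a prescribed one, and it is slow on long texts. `find_quote`, the 0.8 threshold, and the limit of three candidates are assumptions.

```python
from difflib import SequenceMatcher


def find_quote(text: str, quote: str, threshold: float = 0.8) -> list[tuple[str, int, float]]:
    """Return (kind, offset, score) candidates; never a final decision."""
    if not quote:
        return []
    exact = text.find(quote)
    if exact != -1:
        return [("exact", exact, 1.0)]
    # Fuzzy fallback: compare the quote with every window of the same length.
    window = len(quote)
    candidates = []
    for start in range(len(text) - window + 1):
        score = SequenceMatcher(None, quote, text[start:start + window]).ratio()
        if score >= threshold:
            candidates.append(("fuzzy", start, round(score, 3)))
    candidates.sort(key=lambda candidate: candidate[2], reverse=True)
    return candidates[:3]


text = "The quick brown fox jumps over the lazy dog."
exact_hits = find_quote(text, "quick brown fox")
fuzzy_hits = find_quote(text, "quick brwon fox")  # transposed letters in the quote
```

The function returns ranked candidates rather than a single answer, so the confirmation step stays with the user.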
---
## Initial Source Types
The first version should support or prepare for:
```text
PDF
Markdown
HTML
plain text
remote URL references
```
Later versions may support:
```text
DOCX
EPUB
scanned image documents
OCR-derived text
IIIF resources
TEI XML
structured datasets with source passages
```
---
## Citation Recovery States
Citation recovery should be modeled explicitly.
Initial recovery states may include:
```text
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
```
The system should distinguish between finding a source and finding the exact cited passage.
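
One way to make these states explicit is an enumeration whose values match the list above verbatim. The sketch below is an assumption about shape only; the two helper functions are hypothetical and encode one possible reading of the distinction between locating a source and locating the cited passage.

```python
from enum import Enum


class RecoveryState(str, Enum):
    """Recovery states; the string values match the list above."""
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


def source_located(state: RecoveryState) -> bool:
    """True only for the three source-found-* states."""
    return state.value.startswith("source-found-")


def passage_located(state: RecoveryState) -> bool:
    """True when the exact quote was matched (hypothetical reading)."""
    return state is RecoveryState.QUOTE_FOUND


state = RecoveryState("source-found-metadata-only")
```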
---
## Privacy and Source Lookup Principles
Source lookup can create privacy risks.
The repository should follow these principles:
* search local sources first,
* make external lookup explicit and configurable,
* avoid sending private document text to external services by default,
* record which external services were queried,
* distinguish public metadata lookup from full-text upload,
* allow deployments to disable external lookup completely,
* prefer deterministic local processing where possible.
External source discovery should be an extension point, not an unavoidable default behavior.
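
These principles could be enforced through a small policy object that every external lookup must pass through. The sketch below is an assumption about how that might look: `LookupPolicy`, its fields, and the service name `example-catalog` are hypothetical. External lookup and full-text upload are both off unless a deployment enables them, and every permitted query is recorded.

```python
from dataclasses import dataclass, field


@dataclass
class LookupPolicy:
    allow_external: bool = False          # external lookup is opt-in
    allow_fulltext_upload: bool = False   # metadata-only unless enabled
    allowed_services: frozenset[str] = frozenset()
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def authorize(self, service: str, sends_fulltext: bool) -> bool:
        """Decide whether a query may go out, and record it if it may."""
        if not self.allow_external or service not in self.allowed_services:
            return False
        if sends_fulltext and not self.allow_fulltext_upload:
            return False
        kind = "fulltext" if sends_fulltext else "metadata"
        self.audit_log.append((service, kind))
        return True


default_policy = LookupPolicy()
policy = LookupPolicy(
    allow_external=True,
    allowed_services=frozenset({"example-catalog"}),
)
metadata_ok = policy.authorize("example-catalog", sends_fulltext=False)
fulltext_ok = policy.authorize("example-catalog", sends_fulltext=True)
```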
---
## Design Principles
### Source Identity First
Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.
### Canonical Text Matters
Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
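
As an illustration of explicit and repeatable normalization, the sketch below applies Unicode NFKC, drops soft hyphens, and collapses whitespace. These particular rules are assumptions, not the normalization this project has agreed on; `normalize_text` is a hypothetical name.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def normalize_text(raw: str) -> str:
    """Apply a fixed, ordered set of normalization rules."""
    text = unicodedata.normalize("NFKC", raw)  # compose accents, expand ligatures
    text = text.replace("\u00ad", "")          # drop soft hyphens
    text = _WHITESPACE.sub(" ", text)          # collapse runs of whitespace
    return text.strip()


sample = "Cafe\u0301  \n\t\ufb01le\u00ad "
normalized = normalize_text(sample)
```

Running the function on its own output returns the same string, which is the property selectors and search offsets rely on. Changing any rule changes offsets, so the rule set would need to be versioned alongside stored representations.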
### Representation Is Not Source
The original source and generated representation are different things.
The system should preserve this distinction.
### Local Before External
Citation recovery should search local documents before looking elsewhere.
### Human Confirmation
Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.
### Format-Aware, Model-Neutral
The repository should understand document formats but should not own the broader evidence model.
### Cache Expensive Work
Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
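
A cache keyed this way might look like the sketch below. It reads "version" as the version of the extraction logic, which is an assumption; `RepresentationCache`, `get_or_build`, and the in-memory dictionary are hypothetical and stand in for whatever storage the project chooses.

```python
from typing import Callable


class RepresentationCache:
    """In-memory cache keyed by (source fingerprint, extractor version)."""

    def __init__(self) -> None:
        self._store: dict[tuple[str, str], dict] = {}
        self.builds = 0  # counts how often the expensive step actually ran

    def get_or_build(self, fingerprint: str, version: str,
                     build: Callable[[], dict]) -> dict:
        key = (fingerprint, version)
        if key not in self._store:
            self._store[key] = build()
            self.builds += 1
        return self._store[key]


def expensive_build() -> dict:
    # Placeholder for text extraction and representation generation.
    return {"canonical_text": "example"}


cache = RepresentationCache()
first = cache.get_or_build("sha256:example", "1", expensive_build)
second = cache.get_or_build("sha256:example", "1", expensive_build)  # cache hit
third = cache.get_or_build("sha256:example", "2", expensive_build)   # new version
```

Including the version in the key means a change to extraction logic produces a fresh representation instead of silently reusing a stale one.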
### Agent-Friendly Output
Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
---
## Expected Dependencies
This repository is expected to depend on shared types and service contracts from:
```text
citation-engine
  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
```
It may be consumed by:
```text
citation-work
  to load reviewable documents and document representations
evidence-anchor
  to resolve selectors against extracted representations
evidence-binder
  to retrieve source context for linked evidence
citation-evidence
  to provide integrated import and recovery workflows
```
It should avoid depending on review UI or form-binding implementation details.
---
## First Useful Version
A first useful version of **evidence-source** should provide:
* source import interface,
* media type detection,
* document fingerprinting,
* basic metadata extraction,
* PDF text extraction,
* Markdown text extraction,
* HTML sanitization and text extraction,
* canonical text normalization,
* document representation generation,
* simple local quote search,
* recovery attempt model or contract,
* examples showing how a document becomes a representation usable by **evidence-anchor**.
The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.
---
## Success Criteria
The repository is successful when another subsystem can use it to:
1. import a source document,
2. identify and fingerprint it,
3. extract useful metadata,
4. generate canonical text,
5. generate a document representation,
6. search the source text,
7. provide representation data to **evidence-anchor**,
8. support a local citation recovery attempt from a quote or citation clue.
A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
---
## Repository Character
This repository should be:
* source-focused,
* ingestion-oriented,
* privacy-conscious,
* format-aware,
* representation-centered,
* cache-friendly,
* suitable for local-first and server-side use,
* explicit about uncertainty in citation recovery,
* careful not to absorb review or binding responsibilities.
---
## MVP Coordination — Code Lives Upstream
During the umbrella-first MVP phase (decided 2026-05-24), **the source code
for this subsystem does not live in this repository yet**. It lives in the
umbrella repo at `citation-evidence/src/source/`.
This INTENT.md documents the *intended* responsibilities and boundaries.
When the ingestion and representation interfaces have stabilized through
actual MVP use, the corresponding code will be extracted into this repository.
**Shared contracts** (Document and DocumentRepresentation shapes,
CitationRecoveryAttempt state enum, canonical text normalization, allowed
dependency edges) are maintained in the umbrella repo:
* `citation-evidence/wiki/SharedContracts.md`
* `citation-evidence/wiki/DependencyMap.md`
* `citation-evidence/docs/decisions/` (ADRs)
This subsystem's eventual code must not contradict those documents. Changes
to shared contracts happen in the umbrella, not here.
Under the dependency map, **`evidence-source` may depend only on
`citation-engine`** — not on `evidence-anchor`. When ingestion needs to know
"could a selector resolve here?", the answer travels through events, not
direct calls.
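
The event-based coordination could be pictured with a tiny publish/subscribe sketch. Everything here is hypothetical: the `EventBus` class, the `representation.ready` event name, and the payload keys are illustrations, and where a real bus would live is not decided by this document. The point is only that the source side publishes and never imports the subscriber.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """Tiny in-process publish/subscribe bus (illustrative only)."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> None:
        for handler in list(self._handlers[event_type]):
            handler(payload)


bus = EventBus()
received: list[dict] = []

# Stand-in for a subscriber such as evidence-anchor; source never imports it.
bus.subscribe("representation.ready", received.append)

# Source side: announce the new representation instead of calling anchor.
bus.publish("representation.ready", {"fingerprint": "sha256:example", "pages": 3})
```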
---
## Guiding Statement
**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**