INTENT
Purpose
This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the citation-evidence ecosystem.
evidence-source turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.
It is responsible for answering the source-side questions:
What is this document?
How can we extract usable text and structure from it?
How can we find or recover a cited source passage?
Primary Utility
The repository provides the source pipeline for citation-evidence.
It should make it possible to:
- import documents into a collection or workspace,
- identify document type and media type,
- compute stable document fingerprints,
- extract document metadata,
- extract canonical text,
- create document representations for PDFs, Markdown, HTML, and later other formats,
- build maps between text, pages, sections, and rendered views,
- support local full-text search,
- support source lookup and citation recovery,
- provide the document representations needed by evidence-anchor and citation-work.
This repository turns documents into evidence-ready sources.
Intended Users
Primary users of this repository are developers and agents implementing source handling for citation-evidence.
They include:
- developers building document import workflows,
- developers building review collections,
- developers implementing PDF, Markdown, and HTML source handling,
- developers implementing citation recovery,
- developers integrating local or external source libraries,
- coding agents that need structured access to document text and metadata.
End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.
Strategic Role
The strategic role of evidence-source is to make source documents usable as reliable evidence substrates.
Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.
evidence-source creates the normalized source representations that allow the rest of the system to operate consistently across document formats.
It enables the flow:
Raw Source
→ Document Identity
→ Metadata
→ Canonical Text
→ Document Representation
→ Searchable Source
→ Anchorable Evidence Context
Core Concept
The core concept of this repository is the document representation.
A document representation is a normalized, searchable, addressable view of a source document.
For a PDF, a representation may include:
document fingerprint
metadata
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
source-to-rendering hints
For Markdown or HTML, a representation may include:
canonical text
rendered HTML
sanitized content
heading map
section map
DOM or AST structure
offset-to-node map
source line map where available
These representations allow evidence-anchor to create and resolve selectors and allow citation-work to display and search documents efficiently.
Scope
This repository should own:
- document import workflows,
- document source identification,
- media type detection,
- document fingerprinting,
- source URI handling,
- metadata extraction,
- canonical text extraction,
- PDF text extraction,
- Markdown normalization,
- HTML normalization and sanitization,
- document representation generation,
- representation caching,
- local source search support,
- quote search support,
- citation clue parsing,
- local citation recovery,
- external source discovery hooks,
- recovery state tracking,
- privacy boundaries for source lookup.
It should provide the source-side capabilities consumed by:
- citation-engine for creating Document and DocumentRepresentation records,
- evidence-anchor for selector creation and resolution,
- citation-work for document review workflows,
- evidence-binder when evidence needs source context,
- citation-evidence for the integrated product experience.
Out of Scope
This repository should not own the broader evidence domain or user workflows.
Specifically, it should not own:
- the canonical evidence domain model,
- persistence policy beyond source and representation storage contracts,
- low-level anchor resolution algorithms,
- visual highlight rendering,
- review workspace UI,
- form-field binding semantics,
- visual guide overlay behavior,
- citation card rendering,
- application shell and deployment,
- final human validation of evidence quality.
Those responsibilities belong to the appropriate citation-evidence subsystem repositories.
Architectural Position
citation-evidence
integrated product shell
citation-engine
core domain model, services, persistence contracts
evidence-source
document ingestion, extraction, metadata, representations, citation recovery
evidence-anchor
selectors, anchor resolution, re-anchoring, highlighting contracts
citation-work
review workspace and annotation UX
evidence-binder
evidence-to-target binding and active evidence state
evidence-source should provide document representations, not define what evidence means.
It should feed reliable source material into the rest of the system.
Primary Workflows
1. Import Document
A user or system adds a source document.
Add Source
→ Identify Media Type
→ Compute Fingerprint
→ Extract Metadata
→ Extract Text
→ Build Representation
→ Register Document
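The import steps above can be sketched as a single function. This is a minimal illustration, not the subsystem's actual interface: `ImportedDocument`, `identify_media_type`, and `import_document` are hypothetical names, SHA-256 over the raw bytes is an assumed fingerprint scheme, and only text-like sources are decoded (PDF extraction would need a real parser).

```python
import hashlib
import mimetypes
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportedDocument:
    """Hypothetical result of an import; not the shared Document contract."""
    source_uri: str
    media_type: str
    fingerprint: str
    metadata: dict
    canonical_text: str


def identify_media_type(source_uri: str, data: bytes) -> str:
    # Content sniffing wins over the file extension: PDF files start with "%PDF-".
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    guessed, _encoding = mimetypes.guess_type(source_uri)
    return guessed or "application/octet-stream"


def import_document(source_uri: str, data: bytes) -> ImportedDocument:
    media_type = identify_media_type(source_uri, data)
    # Assumption: the fingerprint is a SHA-256 digest of the raw source bytes.
    fingerprint = hashlib.sha256(data).hexdigest()
    metadata = {"byte_length": len(data)}
    # Only text-like sources are decoded in this sketch.
    if media_type.startswith("text/"):
        text = data.decode("utf-8", errors="replace")
    else:
        text = ""
    return ImportedDocument(source_uri, media_type, fingerprint, metadata, text)
```

Registering the document with citation-engine is deliberately left out, since that contract lives in the umbrella repo.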
2. Generate PDF Representation
A PDF is converted into a representation suitable for review and anchoring.
PDF Source
→ Load PDF
→ Extract Page Text
→ Normalize Text
→ Build Page Map
→ Build Offset Map
→ Store Representation
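The page map and offset map steps might look like the following sketch. It assumes the page texts have already been extracted and normalized by some PDF library (not shown), and that pages are joined with a single newline to form the global canonical text. The function names are hypothetical.

```python
from bisect import bisect_right


def build_page_offsets(page_texts, separator="\n"):
    """Return (global canonical text, start offset of every page)."""
    starts = []
    position = 0
    for text in page_texts:
        starts.append(position)
        position += len(text) + len(separator)
    return separator.join(page_texts), starts


def locate(starts, offset):
    """Map a global offset to (1-based page number, page-local offset).

    An offset that points at a separator is reported as one position past
    the end of the preceding page's text.
    """
    if offset < 0 or not starts:
        raise ValueError("offset must be non-negative and starts non-empty")
    index = bisect_right(starts, offset) - 1
    return index + 1, offset - starts[index]
```

Keeping only the page start offsets makes the global-to-local mapping a binary search, which stays cheap for long documents.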
3. Generate Markdown / HTML Representation
A Markdown or HTML source is converted into a normalized rendered and searchable representation.
Markdown / HTML Source
→ Parse / Sanitize
→ Render if needed
→ Extract Canonical Text
→ Build Heading / Section Map
→ Build Offset Map
→ Store Representation
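For Markdown, the heading and section map step can be illustrated with a small regex-based sketch. It is a simplification under stated assumptions: it only recognizes ATX headings (`#` to `######`), it does not skip headings inside fenced code blocks, and a real implementation would more likely walk a parser's AST. The function names are hypothetical.

```python
import re

# ATX heading: 1 to 6 "#" characters, whitespace, then the title text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def build_heading_map(markdown):
    """Return each heading with its level, title, and character offset."""
    return [
        {"level": len(m.group(1)), "title": m.group(2), "offset": m.start()}
        for m in HEADING.finditer(markdown)
    ]


def build_section_map(markdown):
    """Each section runs from its heading to the next heading or end of text."""
    headings = build_heading_map(markdown)
    ends = [h["offset"] for h in headings[1:]] + [len(markdown)]
    return [
        {"title": h["title"], "start": h["offset"], "end": end}
        for h, end in zip(headings, ends)
    ]
```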
4. Search Local Sources
A user or subsystem searches available source material.
Search Query / Quote
→ Search Metadata
→ Search Full Text
→ Return Candidate Documents / Passages
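A minimal sketch of this search flow, assuming documents are held in memory as dictionaries with hypothetical `id`, `title`, and `text` fields. It checks metadata (here just the title) first, then the full text, and returns passages with their offsets so a caller can anchor them later. A production version would presumably use an index rather than a linear scan.

```python
import re


def search_sources(documents, query, context=30):
    """Case-insensitive literal search over titles and full text."""
    if not query:
        return []
    # re.escape keeps the query literal; offsets refer to the original text.
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    hits = []
    for doc in documents:
        if pattern.search(doc["title"]):
            hits.append({"doc_id": doc["id"], "field": "title",
                         "start": None, "snippet": doc["title"]})
        text = doc["text"]
        for match in pattern.finditer(text):
            lo = max(0, match.start() - context)
            hi = min(len(text), match.end() + context)
            hits.append({"doc_id": doc["id"], "field": "text",
                         "start": match.start(), "snippet": text[lo:hi]})
    return hits
```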
5. Recover Citation
A user provides a citation, quote, or source clue.
Citation Clue
→ Parse Source Metadata
→ Search Local Library
→ Optionally Search Configured External Sources
→ Load Candidate Source
→ Search Exact Quote
→ Search Fuzzy Quote
→ Present Candidate Passages
→ User Confirms
→ Create Source Context for Annotation
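The exact-then-fuzzy quote search in this flow could be sketched with the standard library's `difflib`. The sliding window over words, the 0.8 threshold, and the `find_quote` name are all assumptions chosen for illustration. The function only returns candidates; confirmation remains a separate, human step.

```python
import re
from difflib import SequenceMatcher


def find_quote(text, quote, threshold=0.8):
    """Exact search first; fall back to a fuzzy sliding window over words."""
    if not quote.strip():
        return []
    start = text.find(quote)
    if start != -1:
        return [{"start": start, "end": start + len(quote),
                 "score": 1.0, "match": "exact"}]

    # Fuzzy pass: compare the quote with every window of the same word count.
    words = list(re.finditer(r"\S+", text))
    size = len(quote.split())
    candidates = []
    for i in range(max(0, len(words) - size + 1)):
        lo, hi = words[i].start(), words[i + size - 1].end()
        score = SequenceMatcher(None, quote, text[lo:hi]).ratio()
        if score >= threshold:
            candidates.append({"start": lo, "end": hi,
                               "score": score, "match": "fuzzy"})
    # Best candidates first, so a reviewer sees the likeliest passage on top.
    return sorted(candidates, key=lambda c: c["score"], reverse=True)
```

This window approach compares every position and is quadratic in practice, so it suits single candidate documents rather than a whole library.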
Initial Source Types
The first version should support or prepare for:
PDF
Markdown
HTML
plain text
remote URL references
Later versions may support:
DOCX
EPUB
scanned image documents
OCR-derived text
IIIF resources
TEI XML
structured datasets with source passages
Citation Recovery States
Citation recovery should be modeled explicitly.
Initial recovery states may include:
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
The system should distinguish between finding a source and finding the exact cited passage.
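One way to make these states explicit is an enum whose values are the state names listed above. The enum name, the two groupings, and the helper functions are assumptions for illustration. The authoritative state enum is the one maintained in the umbrella repo's shared contracts.

```python
from enum import Enum


class RecoveryState(str, Enum):
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


# Assumed grouping: a source was located, but nothing is said about the passage.
SOURCE_LOCATED = frozenset({
    RecoveryState.SOURCE_FOUND_FULLTEXT,
    RecoveryState.SOURCE_FOUND_PREVIEW_ONLY,
    RecoveryState.SOURCE_FOUND_METADATA_ONLY,
})

# Assumed grouping: the cited passage itself was located or confirmed.
PASSAGE_LOCATED = frozenset({
    RecoveryState.QUOTE_FOUND,
    RecoveryState.CONFIRMED,
    RecoveryState.ANNOTATION_CREATED,
})


def source_located(state):
    return state in SOURCE_LOCATED


def passage_located(state):
    return state in PASSAGE_LOCATED
```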
Privacy and Source Lookup Principles
Source lookup can create privacy risks.
The repository should follow these principles:
- search local sources first,
- make external lookup explicit and configurable,
- avoid sending private document text to external services by default,
- record which external services were queried,
- distinguish public metadata lookup from full-text upload,
- allow deployments to disable external lookup completely,
- prefer deterministic local processing where possible.
External source discovery should be an extension point, not an unavoidable default behavior.
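These principles could be expressed as a policy object that is checked before any external query is made. Everything here is hypothetical: the `LookupPolicy` fields, the `"metadata"` and `"fulltext"` payload kinds, and the service names used in examples. The point of the sketch is that external lookup is off by default, full-text upload needs a separate opt-in, and every permitted query is recorded.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LookupPolicy:
    external_enabled: bool = False        # external lookup is opt-in
    allow_fulltext_upload: bool = False   # metadata-only unless enabled
    allowed_services: frozenset = frozenset()


@dataclass
class LookupLog:
    queried: list = field(default_factory=list)


def plan_external_queries(policy, services, payload_kind, log):
    """Return the services that may be queried and record each one."""
    if not policy.external_enabled:
        return []
    if payload_kind == "fulltext" and not policy.allow_fulltext_upload:
        return []
    permitted = [s for s in services if s in policy.allowed_services]
    # Record what is about to be queried, and with which kind of payload.
    log.queried.extend((s, payload_kind) for s in permitted)
    return permitted
```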
Design Principles
Source Identity First
Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.
Canonical Text Matters
Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
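As an illustration of what "explicit and repeatable" could mean, here is a small normalization sketch. The specific rules (NFC normalization, unified line endings, soft hyphen removal, collapsed horizontal whitespace) are assumptions chosen for the example. The actual canonical text rules are defined in the umbrella repo's shared contracts.

```python
import re
import unicodedata


def canonicalize(text):
    """Apply a fixed, ordered list of normalization steps."""
    text = unicodedata.normalize("NFC", text)               # compose characters
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # unify line endings
    text = text.replace("\u00ad", "")                       # drop soft hyphens
    text = re.sub(r"[ \t\u00a0]+", " ", text)               # collapse spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)                    # trim around newlines
    return text.strip()
```

Because each step is deterministic and the function is idempotent, two runs over the same source yield the same offsets, which is what anchoring and search rely on.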
Representation Is Not Source
The original source and generated representation are different things.
The system should preserve this distinction.
Local Before External
Citation recovery should search local documents before looking elsewhere.
Human Confirmation
Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.
Format-Aware, Model-Neutral
The repository should understand document formats but should not own the broader evidence model.
Cache Expensive Work
Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
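A minimal in-memory sketch of such a cache follows. It assumes the fingerprint is a SHA-256 digest of the source bytes and reads "version" as the version of the extraction logic, so that a changed extractor invalidates old entries. Both are interpretations, and `RepresentationCache` is a hypothetical name.

```python
import hashlib


class RepresentationCache:
    """Caches built representations under (fingerprint, version)."""

    def __init__(self):
        self._entries = {}
        self.builds = 0  # counts cache misses, useful for inspection

    def get_or_build(self, data, version, build):
        # Assumption: fingerprint is the SHA-256 digest of the raw bytes.
        key = (hashlib.sha256(data).hexdigest(), version)
        if key not in self._entries:
            self._entries[key] = build(data)
            self.builds += 1
        return self._entries[key]
```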
Agent-Friendly Output
Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
Expected Dependencies
This repository is expected to depend on shared types and service contracts from:
citation-engine
Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
It may be consumed by:
citation-work
to load reviewable documents and document representations
evidence-anchor
to resolve selectors against extracted representations
evidence-binder
to retrieve source context for linked evidence
citation-evidence
to provide integrated import and recovery workflows
It should avoid depending on review UI or form-binding implementation details.
First Useful Version
A first useful version of evidence-source should provide:
- source import interface,
- media type detection,
- document fingerprinting,
- basic metadata extraction,
- PDF text extraction,
- Markdown text extraction,
- HTML sanitization and text extraction,
- canonical text normalization,
- document representation generation,
- simple local quote search,
- recovery attempt model or contract,
- examples showing how a document becomes a representation usable by evidence-anchor.
The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.
Success Criteria
The repository is successful when another subsystem can use it to:
- import a source document,
- identify and fingerprint it,
- extract useful metadata,
- generate canonical text,
- generate a document representation,
- search the source text,
- provide representation data to evidence-anchor,
- support a local citation recovery attempt from a quote or citation clue.
A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
Repository Character
This repository should be:
- source-focused,
- ingestion-oriented,
- privacy-conscious,
- format-aware,
- representation-centered,
- cache-friendly,
- suitable for local-first and server-side use,
- explicit about uncertainty in citation recovery,
- careful not to absorb review or binding responsibilities.
MVP Coordination — Code Lives Upstream
During the umbrella-first MVP phase (decided 2026-05-24), the source code
for this subsystem does not live in this repository yet. It lives in the
umbrella repo at citation-evidence/src/source/.
This INTENT.md documents the intended responsibilities and boundaries. When the ingestion and representation interfaces have stabilized through actual MVP use, the corresponding code will be extracted into this repository.
Shared contracts (Document and DocumentRepresentation shapes, CitationRecoveryAttempt state enum, canonical text normalization, allowed dependency edges) are maintained in the umbrella repo:
- citation-evidence/wiki/SharedContracts.md
- citation-evidence/wiki/DependencyMap.md
- citation-evidence/docs/decisions/ (ADRs)
This subsystem's eventual code must not contradict those documents. Changes to shared contracts happen in the umbrella, not here.
Under the dependency map, evidence-source may depend only on
citation-engine — not on evidence-anchor. When ingestion needs to know
"could a selector resolve here?", the answer travels through events, not
direct calls.
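The event-based coordination described above might look like this in-process sketch. The `EventBus` class, the topic names, and the payload fields are all hypothetical. The source does not specify an event mechanism, only that ingestion must not call evidence-anchor directly.

```python
from collections import defaultdict


class EventBus:
    """In-process publish/subscribe; handlers run synchronously in order."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self._handlers.get(topic, [])):
            handler(payload)


bus = EventBus()
answers = []


# Anchor side (hypothetical): reacts to new representations and reports back.
def on_representation_created(event):
    resolvable = bool(event.get("canonical_text"))
    bus.publish("selector.resolvability.reported",
                {"fingerprint": event["fingerprint"], "resolvable": resolvable})


bus.subscribe("representation.created", on_representation_created)

# Source side: listens for the answer without importing anything from anchor.
bus.subscribe("selector.resolvability.reported", answers.append)
bus.publish("representation.created",
            {"fingerprint": "abc123", "canonical_text": "Some text"})
```

Neither side references the other's code. Both depend only on the bus and on the agreed topic names, which is the dependency shape the map requires.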
Guiding Statement
evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.