# INTENT

## Purpose

This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.

**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.

It is responsible for answering the source-side questions:

> What is this document?
>
> How can we extract usable text and structure from it?
>
> How can we find or recover a cited source passage?

---

## Primary Utility

The repository provides the source pipeline for citation-evidence.

It should make it possible to:

- import documents into a collection or workspace,
- identify document type and media type,
- compute stable document fingerprints,
- extract document metadata,
- extract canonical text,
- create document representations for PDFs, Markdown, HTML, and later other formats,
- build maps between text, pages, sections, and rendered views,
- support local full-text search,
- support source lookup and citation recovery,
- provide the document representations needed by **evidence-anchor** and **citation-work**.

This repository turns documents into evidence-ready sources.

---

## Intended Users

Primary users of this repository are developers and agents implementing source handling for citation-evidence.

They include:

- developers building document import workflows,
- developers building review collections,
- developers implementing PDF, Markdown, and HTML source handling,
- developers implementing citation recovery,
- developers integrating local or external source libraries,
- coding agents that need structured access to document text and metadata.

End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.

---

## Strategic Role

The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.

Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.

**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.

It enables the flow:

```text
Raw Source
→ Document Identity
→ Metadata
→ Canonical Text
→ Document Representation
→ Searchable Source
→ Anchorable Evidence Context
```

---

## Core Concept

The core concept of this repository is the **document representation**.

A document representation is a normalized, searchable, addressable view of a source document.

For a PDF, a representation may include:

```text
document fingerprint
metadata
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
source-to-rendering hints
```

For Markdown or HTML, a representation may include:

```text
canonical text
rendered HTML
sanitized content
heading map
section map
DOM or AST structure
offset-to-node map
source line map where available
```

These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.

---

## Scope

This repository should own:

* document import workflows,
* document source identification,
* media type detection,
* document fingerprinting,
* source URI handling,
* metadata extraction,
* canonical text extraction,
* PDF text extraction,
* Markdown normalization,
* HTML normalization and sanitization,
* document representation generation,
* representation caching,
* local source search support,
* quote search support,
* citation clue parsing,
* local citation recovery,
* external source discovery hooks,
* recovery state tracking,
* privacy boundaries for source lookup.
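
To make the owned pieces concrete, the following is a minimal sketch of how a fingerprint, a global canonical text, and a page-local offset map from the PDF representation described under Core Concept could fit together. Everything named here (`PdfRepresentation`, `build_pdf_representation`, `fingerprint`, `PAGE_SEPARATOR`) is hypothetical, the SHA-256 fingerprint and the newline page separator are assumptions rather than decided contracts, and page text is assumed to be already extracted by some PDF library.

```python
import hashlib
from bisect import bisect_right
from dataclasses import dataclass

PAGE_SEPARATOR = "\n"  # assumption: page texts are joined with a single newline


def fingerprint(raw: bytes) -> str:
    """Fingerprint the raw source bytes (SHA-256 is an assumed choice)."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


@dataclass(frozen=True)
class PdfRepresentation:
    """Hypothetical shape; the real contract lives in the shared contracts."""

    document_fingerprint: str
    page_texts: tuple[str, ...]
    canonical_text: str
    page_starts: tuple[int, ...]  # global offset at which each page's text begins

    @property
    def page_count(self) -> int:
        return len(self.page_texts)

    def locate(self, offset: int) -> tuple[int, int]:
        """Map a global canonical-text offset to (page_index, page_local_offset).

        An offset that falls on a page separator maps to the position just
        past the end of the preceding page's text.
        """
        if not 0 <= offset < len(self.canonical_text):
            raise IndexError(f"offset {offset} is outside the canonical text")
        page = bisect_right(self.page_starts, offset) - 1
        return page, offset - self.page_starts[page]


def build_pdf_representation(raw: bytes, page_texts: list[str]) -> PdfRepresentation:
    """Join page texts into one canonical text and record where each page starts."""
    starts: list[int] = []
    cursor = 0
    for text in page_texts:
        starts.append(cursor)
        cursor += len(text) + len(PAGE_SEPARATOR)
    return PdfRepresentation(
        document_fingerprint=fingerprint(raw),
        page_texts=tuple(page_texts),
        canonical_text=PAGE_SEPARATOR.join(page_texts),
        page_starts=tuple(starts),
    )
```

With a map like this, a consumer holding a global offset can recover the page and page-local position without re-reading the PDF, which is the kind of lookup selector resolution would need.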

It should provide the source-side capabilities consumed by:

* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
* **evidence-anchor** for selector creation and resolution,
* **citation-work** for document review workflows,
* **evidence-binder** when evidence needs source context,
* **citation-evidence** for the integrated product experience.

---

## Out of Scope

This repository should not own the broader evidence domain or user workflows.

Specifically, it should not own:

* the canonical evidence domain model,
* persistence policy beyond source and representation storage contracts,
* low-level anchor resolution algorithms,
* visual highlight rendering,
* review workspace UI,
* form-field binding semantics,
* visual guide overlay behavior,
* citation card rendering,
* application shell and deployment,
* final human validation of evidence quality.

Those responsibilities belong to the appropriate citation-evidence subsystem repositories.

---

## Architectural Position

```text
citation-evidence
  integrated product shell

citation-engine
  core domain model, services, persistence contracts

evidence-source
  document ingestion, extraction, metadata, representations, citation recovery

evidence-anchor
  selectors, anchor resolution, re-anchoring, highlighting contracts

citation-work
  review workspace and annotation UX

evidence-binder
  evidence-to-target binding and active evidence state
```

**evidence-source** should provide document representations, not define what evidence means.

It should feed reliable source material into the rest of the system.

---

## Primary Workflows

### 1. Import Document

A user or system adds a source document.

```text
Add Source
→ Identify Media Type
→ Compute Fingerprint
→ Extract Metadata
→ Extract Text
→ Build Representation
→ Register Document
```

### 2. Generate PDF Representation

A PDF is converted into a representation suitable for review and anchoring.

```text
PDF Source
→ Load PDF
→ Extract Page Text
→ Normalize Text
→ Build Page Map
→ Build Offset Map
→ Store Representation
```

### 3. Generate Markdown / HTML Representation

A Markdown or HTML source is converted into a normalized rendered and searchable representation.

```text
Markdown / HTML Source
→ Parse / Sanitize
→ Render if needed
→ Extract Canonical Text
→ Build Heading / Section Map
→ Build Offset Map
→ Store Representation
```

### 4. Search Local Sources

A user or subsystem searches available source material.

```text
Search Query / Quote
→ Search Metadata
→ Search Full Text
→ Return Candidate Documents / Passages
```

### 5. Recover Citation

A user provides a citation, quote, or source clue.

```text
Citation Clue
→ Parse Source Metadata
→ Search Local Library
→ Optionally Search Configured External Sources
→ Load Candidate Source
→ Search Exact Quote
→ Search Fuzzy Quote
→ Present Candidate Passages
→ User Confirms
→ Create Source Context for Annotation
```

---

## Initial Source Types

The first version should support or prepare for:

```text
PDF
Markdown
HTML
plain text
remote URL references
```

Later versions may support:

```text
DOCX
EPUB
scanned image documents
OCR-derived text
IIIF resources
TEI XML
structured datasets with source passages
```

---

## Citation Recovery States

Citation recovery should be modeled explicitly.

Initial recovery states may include:

```text
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
```

The system should distinguish between finding a source and finding the exact cited passage.

---

## Privacy and Source Lookup Principles

Source lookup can create privacy risks.

The repository should follow these principles:

* search local sources first,
* make external lookup explicit and configurable,
* avoid sending private document text to external services by default,
* record which external services were queried,
* distinguish public metadata lookup from full-text upload,
* allow deployments to disable external lookup completely,
* prefer deterministic local processing where possible.

External source discovery should be an extension point, not an unavoidable default behavior.

---

## Design Principles

### Source Identity First

Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.

### Canonical Text Matters

Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.

### Representation Is Not Source

The original source and generated representation are different things. The system should preserve this distinction.

### Local Before External

Citation recovery should search local documents before looking elsewhere.

### Human Confirmation

Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.

### Format-Aware, Model-Neutral

The repository should understand document formats but should not own the broader evidence model.

### Cache Expensive Work

Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.

### Agent-Friendly Output

Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
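
As an illustration of "Canonical Text Matters" and "Cache Expensive Work", here is a minimal sketch of an explicit, repeatable normalization step together with a cache keyed by source fingerprint and version. The specific rules (NFC, newline unification, whitespace collapsing), the `NORMALIZATION_VERSION` constant, and the `RepresentationCache` class are illustrative assumptions; the actual normalization contract is maintained with the shared contracts, not defined here.

```python
import re
import unicodedata

NORMALIZATION_VERSION = "1"  # hypothetical; bump whenever the rules below change

# Horizontal whitespace, including the non-breaking space, collapses to one space.
_HORIZONTAL_WHITESPACE = re.compile(r"[ \t\f\v\u00a0]+")


def normalize_text(text: str) -> str:
    """Deterministic normalization: the same input always yields the same output."""
    text = unicodedata.normalize("NFC", text)              # compose combining marks
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = _HORIZONTAL_WHITESPACE.sub(" ", text)           # collapse runs of spaces
    lines = [line.strip() for line in text.split("\n")]    # trim each line
    return "\n".join(lines).strip()


def cache_key(source_fingerprint: str, kind: str) -> str:
    """Key derived work by source fingerprint, kind of output, and rule version."""
    return f"{source_fingerprint}:{kind}:v{NORMALIZATION_VERSION}"


class RepresentationCache:
    """In-memory stand-in for whatever storage a deployment actually uses."""

    def __init__(self) -> None:
        self._store: dict[str, object] = {}

    def get_or_build(self, source_fingerprint: str, kind: str, build):
        """Return the cached value, calling build() only on a cache miss."""
        key = cache_key(source_fingerprint, kind)
        if key not in self._store:
            self._store[key] = build()
        return self._store[key]
```

Including the rule version in the key means a change to normalization invalidates old entries instead of silently serving text that no longer matches the offsets other subsystems rely on.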

---

## Expected Dependencies

This repository is expected to depend on shared types and service contracts from:

```text
citation-engine
  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
```

It may be consumed by:

```text
citation-work
  to load reviewable documents and document representations

evidence-anchor
  to resolve selectors against extracted representations

evidence-binder
  to retrieve source context for linked evidence

citation-evidence
  to provide integrated import and recovery workflows
```

It should avoid depending on review UI or form-binding implementation details.

---

## First Useful Version

A first useful version of **evidence-source** should provide:

* source import interface,
* media type detection,
* document fingerprinting,
* basic metadata extraction,
* PDF text extraction,
* Markdown text extraction,
* HTML sanitization and text extraction,
* canonical text normalization,
* document representation generation,
* simple local quote search,
* recovery attempt model or contract,
* examples showing how a document becomes a representation usable by **evidence-anchor**.

The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.

---

## Success Criteria

The repository is successful when another subsystem can use it to:

1. import a source document,
2. identify and fingerprint it,
3. extract useful metadata,
4. generate canonical text,
5. generate a document representation,
6. search the source text,
7. provide representation data to **evidence-anchor**,
8. support a local citation recovery attempt from a quote or citation clue.

A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
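
Criteria 6 and 8 can be pictured with a small local quote search that tries an exact match first and falls back to fuzzy matching, reporting one of the recovery states listed under Citation Recovery States. This is only a sketch under stated assumptions: `find_quote`, `Candidate`, the sliding-window strategy, and the `min_score` threshold are hypothetical, and the standard-library `difflib.SequenceMatcher` stands in for whatever matcher the subsystem eventually adopts.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from enum import Enum


class RecoveryState(Enum):
    """Subset of the recovery states named in this document."""

    QUOTE_FOUND = "quote-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    QUOTE_NOT_FOUND = "quote-not-found"


@dataclass(frozen=True)
class Candidate:
    start: int    # offset into the canonical text
    end: int      # exclusive end offset
    score: float  # 1.0 for an exact match, otherwise a similarity ratio


def find_quote(canonical_text: str, quote: str, min_score: float = 0.8):
    """Search for a quote: exact match first, then overlapping fuzzy windows."""
    if not quote:
        raise ValueError("quote must not be empty")

    exact = canonical_text.find(quote)
    if exact != -1:
        return RecoveryState.QUOTE_FOUND, [Candidate(exact, exact + len(quote), 1.0)]

    # Fuzzy fallback: slide a quote-sized window over the text in coarse steps.
    window = len(quote)
    step = max(1, window // 4)
    last_start = max(1, len(canonical_text) - window + 1)
    candidates = []
    for start in range(0, last_start, step):
        passage = canonical_text[start:start + window]
        score = SequenceMatcher(None, quote, passage, autojunk=False).ratio()
        if score >= min_score:
            candidates.append(Candidate(start, start + len(passage), score))

    candidates.sort(key=lambda c: c.score, reverse=True)
    if candidates:
        return RecoveryState.CANDIDATE_PASSAGES_FOUND, candidates
    return RecoveryState.QUOTE_NOT_FOUND, []
```

Fuzzy hits are returned as candidates rather than as a found quote, in keeping with the Human Confirmation principle: a caller would present them for confirmation instead of treating them as evidence.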

---

## Repository Character

This repository should be:

* source-focused,
* ingestion-oriented,
* privacy-conscious,
* format-aware,
* representation-centered,
* cache-friendly,
* suitable for local-first and server-side use,
* explicit about uncertainty in citation recovery,
* careful not to absorb review or binding responsibilities.

---

## MVP Coordination: Code Lives Upstream

During the umbrella-first MVP phase (decided 2026-05-24), **the source code for this subsystem does not live in this repository yet**. It lives in the umbrella repo at `citation-evidence/src/source/`.

This INTENT.md documents the *intended* responsibilities and boundaries. When the ingestion and representation interfaces have stabilized through actual MVP use, the corresponding code extracts into this repository.

**Shared contracts** (Document and DocumentRepresentation shapes, CitationRecoveryAttempt state enum, canonical text normalization, allowed dependency edges) are maintained in the umbrella repo:

* `citation-evidence/wiki/SharedContracts.md`
* `citation-evidence/wiki/DependencyMap.md`
* `citation-evidence/docs/decisions/` (ADRs)

This subsystem's eventual code must not contradict those documents. Changes to shared contracts happen in the umbrella, not here.

Under the dependency map, **`evidence-source` may depend only on `citation-engine`**, not on `evidence-anchor`. When ingestion needs to know "could a selector resolve here?", the answer travels through events, not direct calls.

---

## Guiding Statement

**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**