From d8a08d60328333c76ba88d1694ca2a016ca578c0 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Sun, 24 May 2026 16:51:06 +0200
Subject: [PATCH] Add MVP Coordination section: code lives in citation-evidence umbrella during MVP
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents the umbrella-first MVP decision (2026-05-24). This repo remains
INTENT-only until the ingestion and representation interfaces stabilize
through real product use.

Reaffirms: source depends only on engine, not on anchor — coordination
between them flows through events.

Co-Authored-By: Claude Opus 4.7
---
 INTENT.md | 492 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 492 insertions(+)
 create mode 100644 INTENT.md

diff --git a/INTENT.md b/INTENT.md
new file mode 100644
index 0000000..e231bc1
--- /dev/null
+++ b/INTENT.md
@@ -0,0 +1,492 @@
+# INTENT
+
+## Purpose
+
+This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.
+
+**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.
+
+It is responsible for answering the source-side questions:
+
+> What is this document?
+> How can we extract usable text and structure from it?
+> How can we find or recover a cited source passage?
+
+---
+
+## Primary Utility
+
+The repository provides the source pipeline for citation-evidence.
+
+It should make it possible to:
+
+- import documents into a collection or workspace,
+- identify document type and media type,
+- compute stable document fingerprints,
+- extract document metadata,
+- extract canonical text,
+- create document representations for PDFs, Markdown, HTML, and later other formats,
+- build maps between text, pages, sections, and rendered views,
+- support local full-text search,
+- support source lookup and citation recovery,
+- provide the document representations needed by **evidence-anchor** and **citation-work**.
+
+This repository turns documents into evidence-ready sources.
+
+---
+
+## Intended Users
+
+Primary users of this repository are developers and agents implementing source handling for citation-evidence.
+
+They include:
+
+- developers building document import workflows,
+- developers building review collections,
+- developers implementing PDF, Markdown, and HTML source handling,
+- developers implementing citation recovery,
+- developers integrating local or external source libraries,
+- coding agents that need structured access to document text and metadata.
+
+End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.
+
+---
+
+## Strategic Role
+
+The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.
+
+Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.
+
+**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.
+
+It enables the flow:
+
+```text
+Raw Source
+  → Document Identity
+  → Metadata
+  → Canonical Text
+  → Document Representation
+  → Searchable Source
+  → Anchorable Evidence Context
+```
+
+---
+
+## Core Concept
+
+The core concept of this repository is the **document representation**.
+
+A document representation is a normalized, searchable, addressable view of a source document.
+
+For a PDF, a representation may include:
+
+```text
+document fingerprint
+metadata
+page count
+page text
+global canonical text
+page-local offset map
+text item map
+page dimensions
+source-to-rendering hints
+```
+
+For Markdown or HTML, a representation may include:
+
+```text
+canonical text
+rendered HTML
+sanitized content
+heading map
+section map
+DOM or AST structure
+offset-to-node map
+source line map where available
+```
+
+These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.
+
+---
+
+## Scope
+
+This repository should own:
+
+* document import workflows,
+* document source identification,
+* media type detection,
+* document fingerprinting,
+* source URI handling,
+* metadata extraction,
+* canonical text extraction,
+* PDF text extraction,
+* Markdown normalization,
+* HTML normalization and sanitization,
+* document representation generation,
+* representation caching,
+* local source search support,
+* quote search support,
+* citation clue parsing,
+* local citation recovery,
+* external source discovery hooks,
+* recovery state tracking,
+* privacy boundaries for source lookup.
+
+It should provide the source-side capabilities consumed by:
+
+* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
+* **evidence-anchor** for selector creation and resolution,
+* **citation-work** for document review workflows,
+* **evidence-binder** when evidence needs source context,
+* **citation-evidence** for the integrated product experience.
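
As an illustration of the PDF representation described under Core Concept, the following minimal sketch shows how a global canonical text and a page-local offset map could fit together. The class name `PdfRepresentation`, its fields, and the newline page separator are assumptions made for this example; the actual `DocumentRepresentation` shape is defined in the shared contracts, not here.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class PdfRepresentation:
    """Hypothetical shape for illustration only, not the shared contract."""

    fingerprint: str
    page_texts: list[str]
    separator: str = "\n"  # assumed joiner between pages
    canonical_text: str = field(init=False)
    page_starts: list[int] = field(init=False)

    def __post_init__(self) -> None:
        # Record where each page begins inside the global canonical text.
        starts, offset = [], 0
        for text in self.page_texts:
            starts.append(offset)
            offset += len(text) + len(self.separator)
        self.page_starts = starts
        self.canonical_text = self.separator.join(self.page_texts)

    def locate(self, global_offset: int) -> tuple[int, int]:
        """Map a global offset to (1-based page number, page-local offset).

        An offset that falls on a page separator is reported as one past
        the end of the preceding page.
        """
        if not 0 <= global_offset < len(self.canonical_text):
            raise IndexError("offset outside canonical text")
        index = bisect_right(self.page_starts, global_offset) - 1
        return index + 1, global_offset - self.page_starts[index]


rep = PdfRepresentation("sha256:example", ["First page.", "Second page."])
position = rep.canonical_text.find("Second")
page, local = rep.locate(position)
```

A selector that stores a global offset can then be shown on the right page without re-extracting the PDF, which is the kind of lookup the offset map is meant to support.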
+
+---
+
+## Out of Scope
+
+This repository should not own the broader evidence domain or user workflows.
+
+Specifically, it should not own:
+
+* the canonical evidence domain model,
+* persistence policy beyond source and representation storage contracts,
+* low-level anchor resolution algorithms,
+* visual highlight rendering,
+* review workspace UI,
+* form-field binding semantics,
+* visual guide overlay behavior,
+* citation card rendering,
+* application shell and deployment,
+* final human validation of evidence quality.
+
+Those responsibilities belong to the appropriate citation-evidence subsystem repositories.
+
+---
+
+## Architectural Position
+
+```text
+citation-evidence
+  integrated product shell
+
+citation-engine
+  core domain model, services, persistence contracts
+
+evidence-source
+  document ingestion, extraction, metadata, representations, citation recovery
+
+evidence-anchor
+  selectors, anchor resolution, re-anchoring, highlighting contracts
+
+citation-work
+  review workspace and annotation UX
+
+evidence-binder
+  evidence-to-target binding and active evidence state
+```
+
+**evidence-source** should provide document representations, not define what evidence means.
+
+It should feed reliable source material into the rest of the system.
+
+---
+
+## Primary Workflows
+
+### 1. Import Document
+
+A user or system adds a source document.
+
+```text
+Add Source
+  → Identify Media Type
+  → Compute Fingerprint
+  → Extract Metadata
+  → Extract Text
+  → Build Representation
+  → Register Document
+```
+
+### 2. Generate PDF Representation
+
+A PDF is converted into a representation suitable for review and anchoring.
+
+```text
+PDF Source
+  → Load PDF
+  → Extract Page Text
+  → Normalize Text
+  → Build Page Map
+  → Build Offset Map
+  → Store Representation
+```
+
+### 3. Generate Markdown / HTML Representation
+
+A Markdown or HTML source is converted into a normalized rendered and searchable representation.
+
+```text
+Markdown / HTML Source
+  → Parse / Sanitize
+  → Render if needed
+  → Extract Canonical Text
+  → Build Heading / Section Map
+  → Build Offset Map
+  → Store Representation
+```
+
+### 4. Search Local Sources
+
+A user or subsystem searches available source material.
+
+```text
+Search Query / Quote
+  → Search Metadata
+  → Search Full Text
+  → Return Candidate Documents / Passages
+```
+
+### 5. Recover Citation
+
+A user provides a citation, quote, or source clue.
+
+```text
+Citation Clue
+  → Parse Source Metadata
+  → Search Local Library
+  → Optionally Search Configured External Sources
+  → Load Candidate Source
+  → Search Exact Quote
+  → Search Fuzzy Quote
+  → Present Candidate Passages
+  → User Confirms
+  → Create Source Context for Annotation
+```
+
+---
+
+## Initial Source Types
+
+The first version should support or prepare for:
+
+```text
+PDF
+Markdown
+HTML
+plain text
+remote URL references
+```
+
+Later versions may support:
+
+```text
+DOCX
+EPUB
+scanned image documents
+OCR-derived text
+IIIF resources
+TEI XML
+structured datasets with source passages
+```
+
+---
+
+## Citation Recovery States
+
+Citation recovery should be modeled explicitly.
+
+Initial recovery states may include:
+
+```text
+created
+source-found-fulltext
+source-found-preview-only
+source-found-metadata-only
+source-not-found
+quote-found
+quote-not-found
+candidate-passages-found
+manual-confirmation-needed
+confirmed
+annotation-created
+failed
+```
+
+The system should distinguish between finding a source and finding the exact cited passage.
+
+---
+
+## Privacy and Source Lookup Principles
+
+Source lookup can create privacy risks.
+
+The repository should follow these principles:
+
+* search local sources first,
+* make external lookup explicit and configurable,
+* avoid sending private document text to external services by default,
+* record which external services were queried,
+* distinguish public metadata lookup from full-text upload,
+* allow deployments to disable external lookup completely,
+* prefer deterministic local processing where possible.
+
+External source discovery should be an extension point, not an unavoidable default behavior.
+
+---
+
+## Design Principles
+
+### Source Identity First
+
+Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.
+
+### Canonical Text Matters
+
+Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
+
+### Representation Is Not Source
+
+The original source and generated representation are different things.
+
+The system should preserve this distinction.
+
+### Local Before External
+
+Citation recovery should search local documents before looking elsewhere.
+
+### Human Confirmation
+
+Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.
+
+### Format-Aware, Model-Neutral
+
+The repository should understand document formats but should not own the broader evidence model.
+
+### Cache Expensive Work
+
+Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
+
+### Agent-Friendly Output
+
+Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
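
The principles on source identity, canonical text, and caching can be made concrete with a small sketch. Everything below is an assumption for illustration: the SHA-256 fingerprint, the specific normalization steps (NFC, unified line endings, collapsed horizontal whitespace), the `NORMALIZATION_VERSION` label, and the cache key layout are not taken from the shared contracts, which remain the authority on canonical text normalization.

```python
import hashlib
import re
import unicodedata

NORMALIZATION_VERSION = "v1"  # hypothetical version label for the rules below


def fingerprint(raw: bytes) -> str:
    """Content fingerprint over the original source bytes (SHA-256 assumed)."""
    return "sha256:" + hashlib.sha256(raw).hexdigest()


def normalize_text(text: str) -> str:
    """One possible explicit, repeatable normalization; the rules are assumptions."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")  # unify line endings
    text = re.sub(r"[ \t\f\v]+", " ", text)  # collapse horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)  # trim spaces around line breaks
    return text.strip()


def cache_key(raw: bytes, kind: str) -> str:
    """Key a cached artifact by source fingerprint, artifact kind, and version."""
    return f"{fingerprint(raw)}:{kind}:{NORMALIZATION_VERSION}"


normalized = normalize_text("A  quick\r\n brown\tfox ")
key = cache_key(b"example source bytes", "canonical-text")
```

Because the fingerprint is computed over the original bytes while the version label tracks the normalization rules, a change to either invalidates the cached representation without touching the source, which keeps the source and its representation distinct.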
+
+---
+
+## Expected Dependencies
+
+This repository is expected to depend on shared types and service contracts from:
+
+```text
+citation-engine
+  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
+```
+
+It may be consumed by:
+
+```text
+citation-work
+  to load reviewable documents and document representations
+
+evidence-anchor
+  to resolve selectors against extracted representations
+
+evidence-binder
+  to retrieve source context for linked evidence
+
+citation-evidence
+  to provide integrated import and recovery workflows
+```
+
+It should avoid depending on review UI or form-binding implementation details.
+
+---
+
+## First Useful Version
+
+A first useful version of **evidence-source** should provide:
+
+* source import interface,
+* media type detection,
+* document fingerprinting,
+* basic metadata extraction,
+* PDF text extraction,
+* Markdown text extraction,
+* HTML sanitization and text extraction,
+* canonical text normalization,
+* document representation generation,
+* simple local quote search,
+* recovery attempt model or contract,
+* examples showing how a document becomes a representation usable by **evidence-anchor**.
+
+The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.
+
+---
+
+## Success Criteria
+
+The repository is successful when another subsystem can use it to:
+
+1. import a source document,
+2. identify and fingerprint it,
+3. extract useful metadata,
+4. generate canonical text,
+5. generate a document representation,
+6. search the source text,
+7. provide representation data to **evidence-anchor**,
+8. support a local citation recovery attempt from a quote or citation clue.
+
+A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
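
The last success criterion, a local recovery attempt from a quote, could look roughly like the sketch below: an exact search first, then a coarse fuzzy pass. The function name `find_quote`, the sliding-window strategy, the `0.85` threshold, and the use of Python's `difflib` are all assumptions for illustration; only the returned state names come from the Citation Recovery States list.

```python
from difflib import SequenceMatcher


def find_quote(canonical_text: str, quote: str, threshold: float = 0.85):
    """Return (state, candidates); each candidate is (start, end, score)."""
    if not quote:
        return "quote-not-found", []

    # Exact pass: a verbatim hit needs no ranking.
    start = canonical_text.find(quote)
    if start != -1:
        return "quote-found", [(start, start + len(quote), 1.0)]

    # Fuzzy pass: slide a quote-sized window in coarse steps and score it.
    # The coarse step trades recall for speed and may miss the best alignment.
    window = len(quote)
    step = max(1, window // 4)
    last_start = max(1, len(canonical_text) - window + 1)
    candidates = []
    for pos in range(0, last_start, step):
        chunk = canonical_text[pos:pos + window]
        score = SequenceMatcher(None, quote, chunk, autojunk=False).ratio()
        if score >= threshold:
            candidates.append((pos, pos + len(chunk), round(score, 3)))

    candidates.sort(key=lambda item: item[2], reverse=True)
    if candidates:
        return "candidate-passages-found", candidates
    return "quote-not-found", []


text = "The quick brown fox jumps over the lazy dog."
exact_state, exact_hits = find_quote(text, "brown fox")
fuzzy_state, fuzzy_hits = find_quote(text, "quick browm fox jumps")
missing_state, missing_hits = find_quote(text, "completely unrelated words")
```

Fuzzy hits are returned as candidates rather than as a confirmed match, in line with the Human Confirmation principle: the caller is expected to present them for confirmation.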
+
+---
+
+## Repository Character
+
+This repository should be:
+
+* source-focused,
+* ingestion-oriented,
+* privacy-conscious,
+* format-aware,
+* representation-centered,
+* cache-friendly,
+* suitable for local-first and server-side use,
+* explicit about uncertainty in citation recovery,
+* careful not to absorb review or binding responsibilities.
+
+---
+
+## MVP Coordination — Code Lives Upstream
+
+During the umbrella-first MVP phase (decided 2026-05-24), **the source code
+for this subsystem does not live in this repository yet**. It lives in the
+umbrella repo at `citation-evidence/src/source/`.
+
+This INTENT.md documents the *intended* responsibilities and boundaries.
+When the ingestion and representation interfaces have stabilized through
+actual MVP use, the corresponding code will be extracted into this repository.
+
+**Shared contracts** (Document and DocumentRepresentation shapes,
+CitationRecoveryAttempt state enum, canonical text normalization, allowed
+dependency edges) are maintained in the umbrella repo:
+
+* `citation-evidence/wiki/SharedContracts.md`
+* `citation-evidence/wiki/DependencyMap.md`
+* `citation-evidence/docs/decisions/` (ADRs)
+
+This subsystem's eventual code must not contradict those documents. Changes
+to shared contracts happen in the umbrella, not here.
+
+Under the dependency map, **`evidence-source` may depend only on
+`citation-engine`** — not on `evidence-anchor`. When ingestion needs to know
+"could a selector resolve here?", the answer travels through events, not
+direct calls.
+
+---
+
+## Guiding Statement
+
+**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**
+