From d8a08d60328333c76ba88d1694ca2a016ca578c0 Mon Sep 17 00:00:00 2001
From: tegwick <bernd.worsch@gmail.com>
Date: Sun, 24 May 2026 16:51:06 +0200
Subject: [PATCH] Add MVP Coordination section: code lives in citation-evidence
 umbrella during MVP
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents the umbrella-first MVP decision (2026-05-24). This repo remains
INTENT-only until the ingestion and representation interfaces stabilize
through real product use. Reaffirms: source depends only on engine, not on
anchor — coordination between them flows through events.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 INTENT.md | 492 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 492 insertions(+)
 create mode 100644 INTENT.md

diff --git a/INTENT.md b/INTENT.md
new file mode 100644
index 0000000..e231bc1
--- /dev/null
+++ b/INTENT.md
@@ -0,0 +1,492 @@
+# INTENT
+
+## Purpose
+
+This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.
+
+**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.
+
+It is responsible for answering the source-side questions:
+
+> What is this document?  
+> How can we extract usable text and structure from it?  
+> How can we find or recover a cited source passage?
+
+---
+
+## Primary Utility
+
+The repository provides the source pipeline for citation-evidence.
+
+It should make it possible to:
+
+- import documents into a collection or workspace,
+- identify document type and media type,
+- compute stable document fingerprints,
+- extract document metadata,
+- extract canonical text,
+- create document representations for PDFs, Markdown, HTML, and later other formats,
+- build maps between text, pages, sections, and rendered views,
+- support local full-text search,
+- support source lookup and citation recovery,
+- provide the document representations needed by **evidence-anchor** and **citation-work**.
+
+This repository turns documents into evidence-ready sources.
+
+---
+
+## Intended Users
+
+Primary users of this repository are developers and agents implementing source handling for citation-evidence.
+
+They include:
+
+- developers building document import workflows,
+- developers building review collections,
+- developers implementing PDF, Markdown, and HTML source handling,
+- developers implementing citation recovery,
+- developers integrating local or external source libraries,
+- coding agents that need structured access to document text and metadata.
+
+End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.
+
+---
+
+## Strategic Role
+
+The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.
+
+Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.
+
+**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.
+
+It enables the flow:
+
+```text
+Raw Source
+  → Document Identity
+  → Metadata
+  → Canonical Text
+  → Document Representation
+  → Searchable Source
+  → Anchorable Evidence Context
+```
+
+---
+
+## Core Concept
+
+The core concept of this repository is the **document representation**.
+
+A document representation is a normalized, searchable, addressable view of a source document.
+
+For a PDF, a representation may include:
+
+```text
+document fingerprint
+metadata
+page count
+page text
+global canonical text
+page-local offset map
+text item map
+page dimensions
+source-to-rendering hints
+```
+
+For Markdown or HTML, a representation may include:
+
+```text
+canonical text
+rendered HTML
+sanitized content
+heading map
+section map
+DOM or AST structure
+offset-to-node map
+source line map where available
+```
+
+These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.
+
+---
+
+## Scope
+
+This repository should own:
+
+* document import workflows,
+* document source identification,
+* media type detection,
+* document fingerprinting,
+* source URI handling,
+* metadata extraction,
+* canonical text extraction,
+* PDF text extraction,
+* Markdown normalization,
+* HTML normalization and sanitization,
+* document representation generation,
+* representation caching,
+* local source search support,
+* quote search support,
+* citation clue parsing,
+* local citation recovery,
+* external source discovery hooks,
+* recovery state tracking,
+* privacy boundaries for source lookup.
+
+It should provide the source-side capabilities consumed by:
+
+* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
+* **evidence-anchor** for selector creation and resolution,
+* **citation-work** for document review workflows,
+* **evidence-binder** when evidence needs source context,
+* **citation-evidence** for the integrated product experience.
+
+---
+
+## Out of Scope
+
+This repository should not own the broader evidence domain or user workflows.
+
+Specifically, it should not own:
+
+* the canonical evidence domain model,
+* persistence policy beyond source and representation storage contracts,
+* low-level anchor resolution algorithms,
+* visual highlight rendering,
+* review workspace UI,
+* form-field binding semantics,
+* visual guide overlay behavior,
+* citation card rendering,
+* application shell and deployment,
+* final human validation of evidence quality.
+
+Those responsibilities belong to the appropriate citation-evidence subsystem repositories.
+
+---
+
+## Architectural Position
+
+```text
+citation-evidence
+  integrated product shell
+
+citation-engine
+  core domain model, services, persistence contracts
+
+evidence-source
+  document ingestion, extraction, metadata, representations, citation recovery
+
+evidence-anchor
+  selectors, anchor resolution, re-anchoring, highlighting contracts
+
+citation-work
+  review workspace and annotation UX
+
+evidence-binder
+  evidence-to-target binding and active evidence state
+```
+
+**evidence-source** should provide document representations, not define what evidence means.
+
+It should feed reliable source material into the rest of the system.
+
+---
+
+## Primary Workflows
+
+### 1. Import Document
+
+A user or system adds a source document.
+
+```text
+Add Source
+  → Identify Media Type
+  → Compute Fingerprint
+  → Extract Metadata
+  → Extract Text
+  → Build Representation
+  → Register Document
+```
+
+### 2. Generate PDF Representation
+
+A PDF is converted into a representation suitable for review and anchoring.
+
+```text
+PDF Source
+  → Load PDF
+  → Extract Page Text
+  → Normalize Text
+  → Build Page Map
+  → Build Offset Map
+  → Store Representation
+```
+
+### 3. Generate Markdown / HTML Representation
+
+A Markdown or HTML source is converted into a normalized rendered and searchable representation.
+
+```text
+Markdown / HTML Source
+  → Parse / Sanitize
+  → Render if needed
+  → Extract Canonical Text
+  → Build Heading / Section Map
+  → Build Offset Map
+  → Store Representation
+```
+
+### 4. Search Local Sources
+
+A user or subsystem searches available source material.
+
+```text
+Search Query / Quote
+  → Search Metadata
+  → Search Full Text
+  → Return Candidate Documents / Passages
+```
+
+### 5. Recover Citation
+
+A user provides a citation, quote, or source clue.
+
+```text
+Citation Clue
+  → Parse Source Metadata
+  → Search Local Library
+  → Optionally Search Configured External Sources
+  → Load Candidate Source
+  → Search Exact Quote
+  → Search Fuzzy Quote
+  → Present Candidate Passages
+  → User Confirms
+  → Create Source Context for Annotation
+```
+
+---
+
+## Initial Source Types
+
+The first version should support or prepare for:
+
+```text
+PDF
+Markdown
+HTML
+plain text
+remote URL references
+```
+
+Later versions may support:
+
+```text
+DOCX
+EPUB
+scanned image documents
+OCR-derived text
+IIIF resources
+TEI XML
+structured datasets with source passages
+```
+
+---
+
+## Citation Recovery States
+
+Citation recovery should be modeled explicitly.
+
+Initial recovery states may include:
+
+```text
+created
+source-found-fulltext
+source-found-preview-only
+source-found-metadata-only
+source-not-found
+quote-found
+quote-not-found
+candidate-passages-found
+manual-confirmation-needed
+confirmed
+annotation-created
+failed
+```
+
+The system should distinguish between finding a source and finding the exact cited passage.
+
+---
+
+## Privacy and Source Lookup Principles
+
+Source lookup can create privacy risks.
+
+The repository should follow these principles:
+
+* search local sources first,
+* make external lookup explicit and configurable,
+* avoid sending private document text to external services by default,
+* record which external services were queried,
+* distinguish public metadata lookup from full-text upload,
+* allow deployments to disable external lookup completely,
+* prefer deterministic local processing where possible.
+
+External source discovery should be an extension point, not an unavoidable default behavior.
+
+---
+
+## Design Principles
+
+### Source Identity First
+
+Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.
+
+### Canonical Text Matters
+
+Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
+
+### Representation Is Not Source
+
+The original source and generated representation are different things.
+
+The system should preserve this distinction.
+
+### Local Before External
+
+Citation recovery should search local documents before looking elsewhere.
+
+### Human Confirmation
+
+Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.
+
+### Format-Aware, Model-Neutral
+
+The repository should understand document formats but should not own the broader evidence model.
+
+### Cache Expensive Work
+
+Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
+
+### Agent-Friendly Output
+
+Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.
+
+---
+
+## Expected Dependencies
+
+This repository is expected to depend on shared types and service contracts from:
+
+```text
+citation-engine
+  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
+```
+
+It may be consumed by:
+
+```text
+citation-work
+  to load reviewable documents and document representations
+
+evidence-anchor
+  to resolve selectors against extracted representations
+
+evidence-binder
+  to retrieve source context for linked evidence
+
+citation-evidence
+  to provide integrated import and recovery workflows
+```
+
+It should avoid depending on review UI or form-binding implementation details.
+
+---
+
+## First Useful Version
+
+A first useful version of **evidence-source** should provide:
+
+* source import interface,
+* media type detection,
+* document fingerprinting,
+* basic metadata extraction,
+* PDF text extraction,
+* Markdown text extraction,
+* HTML sanitization and text extraction,
+* canonical text normalization,
+* document representation generation,
+* simple local quote search,
+* recovery attempt model or contract,
+* examples showing how a document becomes a representation usable by **evidence-anchor**.
+
+The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.
+
+---
+
+## Success Criteria
+
+The repository is successful when another subsystem can use it to:
+
+1. import a source document,
+2. identify and fingerprint it,
+3. extract useful metadata,
+4. generate canonical text,
+5. generate a document representation,
+6. search the source text,
+7. provide representation data to **evidence-anchor**,
+8. support a local citation recovery attempt from a quote or citation clue.
+
+A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.
+
+---
+
+## Repository Character
+
+This repository should be:
+
+* source-focused,
+* ingestion-oriented,
+* privacy-conscious,
+* format-aware,
+* representation-centered,
+* cache-friendly,
+* suitable for local-first and server-side use,
+* explicit about uncertainty in citation recovery,
+* careful not to absorb review or binding responsibilities.
+
+---
+
+## MVP Coordination: Code Lives in the Umbrella Repo
+
+During the umbrella-first MVP phase (decided 2026-05-24), **the source code
+for this subsystem does not live in this repository yet**. It lives in the
+umbrella repo at `citation-evidence/src/source/`.
+
+This INTENT.md documents the *intended* responsibilities and boundaries.
+When the ingestion and representation interfaces have stabilized through
+actual MVP use, the corresponding code will be extracted into this repository.
+
+**Shared contracts** (Document and DocumentRepresentation shapes,
+CitationRecoveryAttempt state enum, canonical text normalization, allowed
+dependency edges) are maintained in the umbrella repo:
+
+* `citation-evidence/wiki/SharedContracts.md`
+* `citation-evidence/wiki/DependencyMap.md`
+* `citation-evidence/docs/decisions/` (ADRs)
+
+This subsystem's eventual code must not contradict those documents. Changes
+to shared contracts happen in the umbrella, not here.
+
+Under the dependency map, **`evidence-source` may depend only on
+`citation-engine`** — not on `evidence-anchor`. When ingestion needs to know
+"could a selector resolve here?", the answer travels through events, not
+direct calls.
+
+---
+
+## Guiding Statement
+
+**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**
+