generated from coulomb/repo-seed
Add MVP Coordination section: code lives in citation-evidence umbrella during MVP
Documents the umbrella-first MVP decision (2026-05-24). This repo remains INTENT-only until the ingestion and representation interfaces stabilize through real product use. Reaffirms: source depends only on engine, not on anchor; coordination between them flows through events.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
INTENT.md (new file, 492 lines)

# INTENT

## Purpose

This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the **citation-evidence** ecosystem.

**evidence-source** turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.

It is responsible for answering the source-side questions:

> What is this document?
> How can we extract usable text and structure from it?
> How can we find or recover a cited source passage?

---

## Primary Utility

The repository provides the source pipeline for citation-evidence.

It should make it possible to:

- import documents into a collection or workspace,
- identify document type and media type,
- compute stable document fingerprints,
- extract document metadata,
- extract canonical text,
- create document representations for PDFs, Markdown, HTML, and later other formats,
- build maps between text, pages, sections, and rendered views,
- support local full-text search,
- support source lookup and citation recovery,
- provide the document representations needed by **evidence-anchor** and **citation-work**.

This repository turns documents into evidence-ready sources.

---

## Intended Users

Primary users of this repository are developers and agents implementing source handling for citation-evidence.

They include:

- developers building document import workflows,
- developers building review collections,
- developers implementing PDF, Markdown, and HTML source handling,
- developers implementing citation recovery,
- developers integrating local or external source libraries,
- coding agents that need structured access to document text and metadata.

End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.

---

## Strategic Role

The strategic role of **evidence-source** is to make source documents usable as reliable evidence substrates.

Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.

**evidence-source** creates the normalized source representations that allow the rest of the system to operate consistently across document formats.

It enables the flow:

```text
Raw Source
→ Document Identity
→ Metadata
→ Canonical Text
→ Document Representation
→ Searchable Source
→ Anchorable Evidence Context
```

---

## Core Concept

The core concept of this repository is the **document representation**.

A document representation is a normalized, searchable, addressable view of a source document.

For a PDF, a representation may include:

```text
document fingerprint
metadata
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
source-to-rendering hints
```

For Markdown or HTML, a representation may include:

```text
canonical text
rendered HTML
sanitized content
heading map
section map
DOM or AST structure
offset-to-node map
source line map where available
```

These representations allow **evidence-anchor** to create and resolve selectors and allow **citation-work** to display and search documents efficiently.

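As a rough illustration of the PDF field list above, a representation could be modeled as a plain, immutable data structure. This is only a sketch: the names `PageInfo` and `PdfRepresentation`, the chosen subset of fields, and the newline used to join pages are all assumptions made for the example. The authoritative `DocumentRepresentation` shape is whatever the shared contracts define.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PageInfo:
    """One page of a PDF representation (hypothetical shape)."""
    number: int      # 1-based page number
    text: str        # extracted text of this page
    width: float     # page dimensions, in the extractor's units
    height: float


@dataclass(frozen=True)
class PdfRepresentation:
    """Hypothetical container for a subset of the PDF fields listed above."""
    fingerprint: str
    metadata: dict
    pages: tuple     # tuple of PageInfo, in page order

    @property
    def page_count(self) -> int:
        return len(self.pages)

    @property
    def canonical_text(self) -> str:
        # Assumption: pages are joined with a single newline.
        return "\n".join(page.text for page in self.pages)


rep = PdfRepresentation(
    fingerprint="sha256:example",
    metadata={"title": "Example"},
    pages=(
        PageInfo(1, "Page one.", 612.0, 792.0),
        PageInfo(2, "Page two.", 612.0, 792.0),
    ),
)
record = asdict(rep)  # nested plain dicts and tuples, convenient for storage
```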
---

## Scope

This repository should own:

* document import workflows,
* document source identification,
* media type detection,
* document fingerprinting,
* source URI handling,
* metadata extraction,
* canonical text extraction,
* PDF text extraction,
* Markdown normalization,
* HTML normalization and sanitization,
* document representation generation,
* representation caching,
* local source search support,
* quote search support,
* citation clue parsing,
* local citation recovery,
* external source discovery hooks,
* recovery state tracking,
* privacy boundaries for source lookup.

It should provide the source-side capabilities consumed by:

* **citation-engine** for creating `Document` and `DocumentRepresentation` records,
* **evidence-anchor** for selector creation and resolution,
* **citation-work** for document review workflows,
* **evidence-binder** when evidence needs source context,
* **citation-evidence** for the integrated product experience.

---

## Out of Scope

This repository should not own the broader evidence domain or user workflows.

Specifically, it should not own:

* the canonical evidence domain model,
* persistence policy beyond source and representation storage contracts,
* low-level anchor resolution algorithms,
* visual highlight rendering,
* review workspace UI,
* form-field binding semantics,
* visual guide overlay behavior,
* citation card rendering,
* application shell and deployment,
* final human validation of evidence quality.

Those responsibilities belong to the appropriate citation-evidence subsystem repositories.

---

## Architectural Position

```text
citation-evidence
  integrated product shell

citation-engine
  core domain model, services, persistence contracts

evidence-source
  document ingestion, extraction, metadata, representations, citation recovery

evidence-anchor
  selectors, anchor resolution, re-anchoring, highlighting contracts

citation-work
  review workspace and annotation UX

evidence-binder
  evidence-to-target binding and active evidence state
```

**evidence-source** should provide document representations, not define what evidence means.

It should feed reliable source material into the rest of the system.

---

## Primary Workflows

### 1. Import Document

A user or system adds a source document.

```text
Add Source
→ Identify Media Type
→ Compute Fingerprint
→ Extract Metadata
→ Extract Text
→ Build Representation
→ Register Document
```
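The first four steps of the import flow above could be sketched as a single function. This is a minimal illustration, not the subsystem's interface: `ImportedDocument`, `import_document`, and the extension table `MEDIA_TYPES` are hypothetical names, media type detection is reduced to a file-extension lookup, and only text-like sources get their text extracted here.

```python
import hashlib
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical extension table; real detection would likely also inspect content.
MEDIA_TYPES = {
    ".pdf": "application/pdf",
    ".md": "text/markdown",
    ".html": "text/html",
    ".htm": "text/html",
    ".txt": "text/plain",
}


@dataclass(frozen=True)
class ImportedDocument:
    """Hypothetical result of the import steps."""
    media_type: str
    fingerprint: str
    metadata: dict
    text: str


def import_document(name: str, data: bytes) -> ImportedDocument:
    # Identify Media Type: look up the lower-cased file extension.
    suffix = PurePosixPath(name).suffix.lower()
    media_type = MEDIA_TYPES.get(suffix, "application/octet-stream")

    # Compute Fingerprint: hash the raw bytes so the result ignores the file name.
    fingerprint = "sha256:" + hashlib.sha256(data).hexdigest()

    # Extract Metadata: only trivial facts here; real extractors read titles, authors, dates.
    metadata = {"file_name": name, "byte_length": len(data)}

    # Extract Text: decode text-like sources; PDF text needs a real parser (not shown).
    is_text = media_type.startswith("text/")
    text = data.decode("utf-8", errors="replace") if is_text else ""

    return ImportedDocument(media_type, fingerprint, metadata, text)
```

Building the representation and registering the document are left out, since both depend on contracts defined outside this file.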

### 2. Generate PDF Representation

A PDF is converted into a representation suitable for review and anchoring.

```text
PDF Source
→ Load PDF
→ Extract Page Text
→ Normalize Text
→ Build Page Map
→ Build Offset Map
→ Store Representation
```
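The page map and offset map steps can be illustrated without a PDF library by starting from page texts that have already been extracted and normalized. In this sketch, `build_page_map` and `locate` are hypothetical helpers, and joining pages with a single newline is an assumption.

```python
from bisect import bisect_right


def build_page_map(page_texts):
    """Return (global_text, page_starts); page_starts[i] is the global offset of page i + 1."""
    starts = []
    cursor = 0
    for text in page_texts:
        starts.append(cursor)
        cursor += len(text) + 1  # +1 for the newline that joins pages
    return "\n".join(page_texts), starts


def locate(page_starts, offset):
    """Map a global offset to (page_number, page_local_offset); pages are 1-based."""
    index = bisect_right(page_starts, offset) - 1
    if index < 0:
        raise ValueError("offset must be non-negative")
    return index + 1, offset - page_starts[index]


global_text, page_starts = build_page_map(["Alpha beta.", "Gamma delta."])
page, local = locate(page_starts, global_text.index("Gamma"))
```

With a map like this, a selector expressed against the global canonical text can be translated back to a page and a page-local position for rendering.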

### 3. Generate Markdown / HTML Representation

A Markdown or HTML source is converted into a normalized rendered and searchable representation.

```text
Markdown / HTML Source
→ Parse / Sanitize
→ Render if needed
→ Extract Canonical Text
→ Build Heading / Section Map
→ Build Offset Map
→ Store Representation
```
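For the heading map step, a deliberately small sketch is shown below. It assumes ATX-style Markdown headings (`#` to `######`), skips lines inside triple-backtick fences, and records offsets into the raw Markdown source rather than into canonical text. `build_heading_map` is a hypothetical name, and a real implementation would more likely walk a parser's AST.

```python
import re

HEADING = re.compile(r"(#{1,6})[ \t]+(.*\S)")


def build_heading_map(markdown: str):
    """Return a list of {level, title, offset} dicts for ATX headings."""
    headings = []
    offset = 0
    in_fence = False
    for line in markdown.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("```"):
            # Toggle fence state so '#' lines inside code blocks are ignored.
            in_fence = not in_fence
        elif not in_fence:
            match = HEADING.fullmatch(stripped)
            if match:
                headings.append(
                    {"level": len(match.group(1)), "title": match.group(2), "offset": offset}
                )
        offset += len(line)  # offset of the next line in the raw source
    return headings
```

A section map could then be derived by treating each heading's offset as the start of a section that runs until the next heading of the same or a higher level.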

### 4. Search Local Sources

A user or subsystem searches available source material.

```text
Search Query / Quote
→ Search Metadata
→ Search Full Text
→ Return Candidate Documents / Passages
```

### 5. Recover Citation

A user provides a citation, quote, or source clue.

```text
Citation Clue
→ Parse Source Metadata
→ Search Local Library
→ Optionally Search Configured External Sources
→ Load Candidate Source
→ Search Exact Quote
→ Search Fuzzy Quote
→ Present Candidate Passages
→ User Confirms
→ Create Source Context for Annotation
```
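The "Search Exact Quote" and "Search Fuzzy Quote" steps could look roughly like the sketch below. It tries an exact substring match first and only then falls back to a fuzzy comparison using the standard library's `difflib.SequenceMatcher`. The function name `find_quote`, the sliding-window strategy, and the `0.8` threshold are assumptions chosen for illustration. Fuzzy results are candidates for the user to confirm, not confirmed matches.

```python
from difflib import SequenceMatcher


def find_quote(text: str, quote: str, threshold: float = 0.8):
    """Return candidate passages as (start, end, score, kind), best first."""
    start = text.find(quote)
    if start != -1:
        return [(start, start + len(quote), 1.0, "exact")]

    # Fuzzy fallback: compare the quote with a same-length window at each word start.
    candidates = []
    window = len(quote)
    position = 0
    for word in text.split(" "):
        passage = text[position:position + window]
        score = SequenceMatcher(None, quote, passage).ratio()
        if score >= threshold:
            candidates.append((position, position + len(passage), round(score, 3), "fuzzy"))
        position += len(word) + 1  # advance past the word and its trailing space
    return sorted(candidates, key=lambda item: item[2], reverse=True)


text = "The quick brown fox jumps over the lazy dog."
exact = find_quote(text, "brown fox")
fuzzy = find_quote(text, "quick brwn fox jumps")  # quote with a typo
```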

---

## Initial Source Types

The first version should support or prepare for:

```text
PDF
Markdown
HTML
plain text
remote URL references
```

Later versions may support:

```text
DOCX
EPUB
scanned image documents
OCR-derived text
IIIF resources
TEI XML
structured datasets with source passages
```

---

## Citation Recovery States

Citation recovery should be modeled explicitly.

Initial recovery states may include:

```text
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
```

The system should distinguish between finding a source and finding the exact cited passage.

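One way to model these states explicitly is a string-valued enumeration. The sketch below copies the state names from the list above. The class name `RecoveryState` and the two groupings are assumptions for illustration, and the authoritative `CitationRecoveryAttempt` state enum lives in the shared contracts.

```python
from enum import Enum


class RecoveryState(str, Enum):
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


# Hypothetical groupings that keep "source found" apart from "passage found".
SOURCE_LOCATED = frozenset({
    RecoveryState.SOURCE_FOUND_FULLTEXT,
    RecoveryState.SOURCE_FOUND_PREVIEW_ONLY,
    RecoveryState.SOURCE_FOUND_METADATA_ONLY,
})
PASSAGE_LOCATED = frozenset({
    RecoveryState.QUOTE_FOUND,
    RecoveryState.CANDIDATE_PASSAGES_FOUND,
})
```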
---

## Privacy and Source Lookup Principles

Source lookup can create privacy risks.

The repository should follow these principles:

* search local sources first,
* make external lookup explicit and configurable,
* avoid sending private document text to external services by default,
* record which external services were queried,
* distinguish public metadata lookup from full-text upload,
* allow deployments to disable external lookup completely,
* prefer deterministic local processing where possible.

External source discovery should be an extension point, not an unavoidable default behavior.

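Several of these principles could be enforced by a small policy object that every external lookup has to pass through. The sketch below is hypothetical throughout: `LookupPolicy`, its field names, and the service name `example-catalog` are invented for illustration. It shows external lookup being off by default, full-text upload being a separate opt-in, and each permitted query being recorded.

```python
from dataclasses import dataclass, field


@dataclass
class LookupPolicy:
    """Hypothetical per-deployment policy for external source lookup."""
    external_enabled: bool = False         # external lookup is opt-in
    allow_fulltext_upload: bool = False    # document text stays local by default
    allowed_services: frozenset = frozenset()
    audit_log: list = field(default_factory=list)

    def may_query(self, service: str, sends_fulltext: bool) -> bool:
        allowed = (
            self.external_enabled
            and service in self.allowed_services
            and (self.allow_fulltext_upload or not sends_fulltext)
        )
        if allowed:
            # Record which service was queried and what kind of data was sent.
            self.audit_log.append((service, "fulltext" if sends_fulltext else "metadata"))
        return allowed


locked_down = LookupPolicy()
metadata_only = LookupPolicy(
    external_enabled=True,
    allowed_services=frozenset({"example-catalog"}),
)
```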
---

## Design Principles

### Source Identity First

Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.

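A content fingerprint is the part of that identity that stays the same when a file is renamed or moved. The sketch below hashes a byte stream in chunks with SHA-256. The choice of SHA-256, the `sha256:` prefix, and the record built by `document_identity` are assumptions, since this document does not specify how fingerprint, URI, and metadata are combined.

```python
import hashlib
import io


def fingerprint_stream(stream, chunk_size: int = 65536) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def document_identity(fingerprint: str, source_uri=None, metadata=None) -> dict:
    """Hypothetical identity record: the fingerprint is the key, the rest are locators."""
    return {
        "id": fingerprint,
        "source_uri": source_uri,
        "title": (metadata or {}).get("title"),
    }


first = document_identity(fingerprint_stream(io.BytesIO(b"same bytes")), "file:///a.pdf")
second = document_identity(fingerprint_stream(io.BytesIO(b"same bytes")), "file:///b.pdf")
```

Here `first` and `second` share an `id` even though their URIs differ, which is one way to notice that the same document was imported twice.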
### Canonical Text Matters

Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.

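To show what "explicit and repeatable" might mean in practice, here is a small normalization function. The specific rules (Unicode NFKC, unified line endings, soft-hyphen removal, whitespace collapsing) and the version label are assumptions for this sketch. The actual canonical text rules are defined in the shared contracts.

```python
import re
import unicodedata

NORMALIZATION_VERSION = "nfkc-ws-1"  # hypothetical label stored with each representation


def normalize_text(raw: str) -> str:
    """Apply a fixed, ordered set of normalization rules."""
    text = unicodedata.normalize("NFKC", raw)               # fold ligatures, NBSP, etc.
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # unify line endings
    text = text.replace("\u00ad", "")                       # drop soft hyphens
    text = re.sub(r"[ \t\f\v]+", " ", text)                 # collapse horizontal whitespace
    text = re.sub(r" ?\n ?", "\n", text)                    # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)                  # at most one blank line
    return text.strip()


sample = "\ufb01rst\u00a0line \r\n\r\n\r\n\r\nsecond\tline  "
normalized = normalize_text(sample)
```

Because normalization can change the number of characters, offset maps need to be built against the normalized text, and a version label like the one above lets stored offsets be tied to the rules that produced them.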
### Representation Is Not Source

The original source and generated representation are different things.

The system should preserve this distinction.

### Local Before External

Citation recovery should search local documents before looking elsewhere.

### Human Confirmation

Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.

### Format-Aware, Model-Neutral

The repository should understand document formats but should not own the broader evidence model.

### Cache Expensive Work

Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.

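Caching by fingerprint and version can be as simple as using the pair as a lookup key. The in-memory sketch below assumes that "version" means the version of the extraction or representation logic. `RepresentationCache` and `get_or_build` are hypothetical names, and a real cache would presumably persist to disk.

```python
class RepresentationCache:
    """Hypothetical in-memory cache keyed by (fingerprint, version)."""

    def __init__(self):
        self._store = {}

    def get_or_build(self, fingerprint: str, version: str, build):
        key = (fingerprint, version)
        if key not in self._store:
            # Only run the expensive builder when this exact pair is unseen.
            self._store[key] = build()
        return self._store[key]


calls = []


def expensive_build():
    calls.append("built")
    return {"canonical_text": "Example text."}


cache = RepresentationCache()
cache.get_or_build("sha256:example", "v1", expensive_build)
cache.get_or_build("sha256:example", "v1", expensive_build)  # served from cache
cache.get_or_build("sha256:example", "v2", expensive_build)  # new version, rebuilt
```

Including the version in the key means a change to the extraction logic invalidates old entries without any explicit cache clearing.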
### Agent-Friendly Output

Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.

---

## Expected Dependencies

This repository is expected to depend on shared types and service contracts from:

```text
citation-engine
  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts
```

It may be consumed by:

```text
citation-work
  to load reviewable documents and document representations

evidence-anchor
  to resolve selectors against extracted representations

evidence-binder
  to retrieve source context for linked evidence

citation-evidence
  to provide integrated import and recovery workflows
```

It should avoid depending on review UI or form-binding implementation details.

---

## First Useful Version

A first useful version of **evidence-source** should provide:

* source import interface,
* media type detection,
* document fingerprinting,
* basic metadata extraction,
* PDF text extraction,
* Markdown text extraction,
* HTML sanitization and text extraction,
* canonical text normalization,
* document representation generation,
* simple local quote search,
* recovery attempt model or contract,
* examples showing how a document becomes a representation usable by **evidence-anchor**.

The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.

---

## Success Criteria

The repository is successful when another subsystem can use it to:

1. import a source document,
2. identify and fingerprint it,
3. extract useful metadata,
4. generate canonical text,
5. generate a document representation,
6. search the source text,
7. provide representation data to **evidence-anchor**,
8. support a local citation recovery attempt from a quote or citation clue.

A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.

---

## Repository Character

This repository should be:

* source-focused,
* ingestion-oriented,
* privacy-conscious,
* format-aware,
* representation-centered,
* cache-friendly,
* suitable for local-first and server-side use,
* explicit about uncertainty in citation recovery,
* careful not to absorb review or binding responsibilities.

---

## MVP Coordination: Code Lives Upstream

During the umbrella-first MVP phase (decided 2026-05-24), **the source code
for this subsystem does not live in this repository yet**. It lives in the
umbrella repo at `citation-evidence/src/source/`.

This INTENT.md documents the *intended* responsibilities and boundaries.
Once the ingestion and representation interfaces have stabilized through
actual MVP use, the corresponding code will be extracted into this repository.

**Shared contracts** (`Document` and `DocumentRepresentation` shapes,
`CitationRecoveryAttempt` state enum, canonical text normalization, allowed
dependency edges) are maintained in the umbrella repo:

* `citation-evidence/wiki/SharedContracts.md`
* `citation-evidence/wiki/DependencyMap.md`
* `citation-evidence/docs/decisions/` (ADRs)

This subsystem's eventual code must not contradict those documents. Changes
to shared contracts happen in the umbrella, not here.

Under the dependency map, **`evidence-source` may depend only on
`citation-engine`**, not on `evidence-anchor`. When ingestion needs to know
"could a selector resolve here?", the answer travels through events, not
direct calls.

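The event-based coordination described above could take a shape like the following. This is a sketch under stated assumptions: the in-process `EventBus`, the topic names `representation.ready` and `selector.check.completed`, and the payload fields are all hypothetical, and the real event mechanism is whatever the umbrella repo defines. The point is that the source side never imports or calls anchor code. It publishes an event and listens for an answer.

```python
from collections import defaultdict


class EventBus:
    """Minimal in-process publish/subscribe bus (hypothetical)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self._handlers.get(topic, [])):
            handler(payload)


bus = EventBus()
answers = {}

# Source side: record answers that arrive as events.
bus.subscribe(
    "selector.check.completed",
    lambda event: answers.update({event["fingerprint"]: event["resolvable"]}),
)


# Anchor side (would live in another package): reply with an event of its own.
def on_representation_ready(event):
    resolvable = len(event["canonical_text"]) > 0  # placeholder for real selector logic
    bus.publish(
        "selector.check.completed",
        {"fingerprint": event["fingerprint"], "resolvable": resolvable},
    )


bus.subscribe("representation.ready", on_representation_ready)

# Source side: announce a new representation without calling anchor code directly.
bus.publish(
    "representation.ready",
    {"fingerprint": "sha256:example", "canonical_text": "Some text."},
)
```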
---

## Guiding Statement

**evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.**