tegwick d8a08d6032 Add MVP Coordination section: code lives in citation-evidence umbrella during MVP

Documents the umbrella-first MVP decision (2026-05-24). This repo remains
INTENT-only until the ingestion and representation interfaces stabilize
through real product use. Reaffirms: source depends only on engine, not on
anchor — coordination between them flows through events.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-24 16:51:06 +02:00


INTENT

Purpose

This repository exists to provide the document source, ingestion, extraction, metadata, and citation recovery layer for the citation-evidence ecosystem.

evidence-source turns raw documents and source clues into usable, searchable, addressable document representations that can support annotations, evidence items, citation recovery, and source-backed workflows.

It is responsible for answering the source-side questions:

What is this document?
How can we extract usable text and structure from it?
How can we find or recover a cited source passage?

Primary Utility

The repository provides the source pipeline for citation-evidence.

It should make it possible to:

import documents into a collection or workspace,
identify document type and media type,
compute stable document fingerprints,
extract document metadata,
extract canonical text,
create document representations for PDFs, Markdown, HTML, and later other formats,
build maps between text, pages, sections, and rendered views,
support local full-text search,
support source lookup and citation recovery,
provide the document representations needed by evidence-anchor and citation-work.

This repository turns documents into evidence-ready sources.

Intended Users

Primary users of this repository are developers and agents implementing source handling for citation-evidence.

They include:

developers building document import workflows,
developers building review collections,
developers implementing PDF, Markdown, and HTML source handling,
developers implementing citation recovery,
developers integrating local or external source libraries,
coding agents that need structured access to document text and metadata.

End users should experience this repository indirectly whenever they add a document, search source text, or recover a citation.

Strategic Role

The strategic role of evidence-source is to make source documents usable as reliable evidence substrates.

Without this repository, the system would depend on whatever a viewer happens to show at runtime. That would make citation capture, re-opening, search, and recovery fragile.

evidence-source creates the normalized source representations that allow the rest of the system to operate consistently across document formats.

It enables the flow:

Raw Source
  → Document Identity
  → Metadata
  → Canonical Text
  → Document Representation
  → Searchable Source
  → Anchorable Evidence Context

Core Concept

The core concept of this repository is the document representation.

A document representation is a normalized, searchable, addressable view of a source document.

For a PDF, a representation may include:

document fingerprint
metadata
page count
page text
global canonical text
page-local offset map
text item map
page dimensions
source-to-rendering hints

For Markdown or HTML, a representation may include:

canonical text
rendered HTML
sanitized content
heading map
section map
DOM or AST structure
offset-to-node map
source line map where available

These representations allow evidence-anchor to create and resolve selectors and allow citation-work to display and search documents efficiently.
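
As a rough illustration, the PDF fields listed above could be carried in a structure like the following. This is a minimal sketch: the class names `PageEntry` and `PdfRepresentation`, the field names, and the newline join between pages are all assumptions made for the example. The authoritative shapes are the shared contracts, not this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageEntry:
    """One PDF page inside a representation (hypothetical shape)."""
    number: int      # 1-based page number
    text: str        # extracted page text
    width: float     # page width in PDF points
    height: float    # page height in PDF points


@dataclass(frozen=True)
class PdfRepresentation:
    """Normalized, addressable view of a PDF source (hypothetical shape)."""
    fingerprint: str
    metadata: dict
    pages: tuple

    @property
    def page_count(self) -> int:
        return len(self.pages)

    @property
    def canonical_text(self) -> str:
        # Assumption: pages are joined with a single newline.
        return "\n".join(page.text for page in self.pages)


rep = PdfRepresentation(
    fingerprint="sha256:example",
    metadata={"title": "Example"},
    pages=(
        PageEntry(1, "First page.", 595.0, 842.0),
        PageEntry(2, "Second page.", 595.0, 842.0),
    ),
)
print(rep.page_count)  # 2
```

Keeping the representation immutable makes it safe to cache and to share between consumers such as evidence-anchor and citation-work.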

Scope

This repository should own:

document import workflows,
document source identification,
media type detection,
document fingerprinting,
source URI handling,
metadata extraction,
canonical text extraction,
PDF text extraction,
Markdown normalization,
HTML normalization and sanitization,
document representation generation,
representation caching,
local source search support,
quote search support,
citation clue parsing,
local citation recovery,
external source discovery hooks,
recovery state tracking,
privacy boundaries for source lookup.

It should provide the source-side capabilities consumed by:

citation-engine for creating Document and DocumentRepresentation records,
evidence-anchor for selector creation and resolution,
citation-work for document review workflows,
evidence-binder when evidence needs source context,
citation-evidence for the integrated product experience.

Out of Scope

This repository should not own the broader evidence domain or user workflows.

Specifically, it should not own:

the canonical evidence domain model,
persistence policy beyond source and representation storage contracts,
low-level anchor resolution algorithms,
visual highlight rendering,
review workspace UI,
form-field binding semantics,
visual guide overlay behavior,
citation card rendering,
application shell and deployment,
final human validation of evidence quality.

Those responsibilities belong to the appropriate citation-evidence subsystem repositories.

Architectural Position

citation-evidence
  integrated product shell

citation-engine
  core domain model, services, persistence contracts

evidence-source
  document ingestion, extraction, metadata, representations, citation recovery

evidence-anchor
  selectors, anchor resolution, re-anchoring, highlighting contracts

citation-work
  review workspace and annotation UX

evidence-binder
  evidence-to-target binding and active evidence state

evidence-source should provide document representations, not define what evidence means.

It should feed reliable source material into the rest of the system.

Primary Workflows

1. Import Document

A user or system adds a source document.

Add Source
  → Identify Media Type
  → Compute Fingerprint
  → Extract Metadata
  → Extract Text
  → Build Representation
  → Register Document
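
The first steps of this flow can be sketched with the Python standard library alone. The names `import_source` and `ImportedSource` are hypothetical, the `sha256:` prefix is an assumed fingerprint format, and metadata extraction, text extraction, and registration are left out.

```python
import hashlib
import mimetypes
from dataclasses import dataclass


@dataclass(frozen=True)
class ImportedSource:
    """Result of the identification steps (hypothetical shape)."""
    uri: str
    media_type: str
    fingerprint: str
    size: int


def import_source(uri: str, data: bytes) -> ImportedSource:
    # Guess the media type from the file extension.
    media_type, _encoding = mimetypes.guess_type(uri)
    # Content sniffing wins over the extension for PDFs.
    if data.startswith(b"%PDF-"):
        media_type = "application/pdf"
    if media_type is None:
        media_type = "application/octet-stream"
    # The fingerprint depends only on the bytes, never on the URI.
    fingerprint = "sha256:" + hashlib.sha256(data).hexdigest()
    return ImportedSource(uri, media_type, fingerprint, len(data))
```

Because the fingerprint is derived from content only, the same file imported from two locations yields the same fingerprint.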

2. Generate PDF Representation

A PDF is converted into a representation suitable for review and anchoring.

PDF Source
  → Load PDF
  → Extract Page Text
  → Normalize Text
  → Build Page Map
  → Build Offset Map
  → Store Representation
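
The page map and offset map steps can be illustrated as follows. The sketch assumes page text has already been extracted and normalized, and that pages are joined with a single newline. The function names `build_page_offsets` and `locate` are hypothetical.

```python
from bisect import bisect_right


def build_page_offsets(page_texts, separator="\n"):
    """Join page texts and record where each page starts in the joined text."""
    starts = []
    position = 0
    for text in page_texts:
        starts.append(position)
        position += len(text) + len(separator)
    return separator.join(page_texts), starts


def locate(starts, offset):
    """Map a global offset to (1-based page number, page-local offset)."""
    if offset < 0 or not starts:
        raise ValueError("offset is outside the document")
    # The owning page is the last one whose start is <= offset.
    index = bisect_right(starts, offset) - 1
    return index + 1, offset - starts[index]


pages = ["alpha beta", "gamma", "delta epsilon"]
text, starts = build_page_offsets(pages)
print(starts)                                  # [0, 11, 17]
print(locate(starts, text.index("epsilon")))   # (3, 6)
```

An offset that falls on a separator is attributed to the preceding page, with a page-local offset one past the end of that page's text. A real implementation would need to decide how to treat such positions.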

3. Generate Markdown / HTML Representation

A Markdown or HTML source is converted into a normalized rendered and searchable representation.

Markdown / HTML Source
  → Parse / Sanitize
  → Render if needed
  → Extract Canonical Text
  → Build Heading / Section Map
  → Build Offset Map
  → Store Representation
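
For Markdown, the heading map step might look like this. It is a deliberately small sketch: it recognizes only ATX headings (`#` to `######`), does not skip fenced code blocks, and records offsets into the Markdown source rather than into canonical text. The name `build_heading_map` and the dictionary keys are hypothetical.

```python
import re

# ATX heading: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def build_heading_map(markdown: str) -> list:
    """Return one entry per heading with its level, title, offset, and line."""
    return [
        {
            "level": len(match.group(1)),
            "title": match.group(2),
            "offset": match.start(),
            # 1-based source line, counted from the newlines before the match.
            "line": markdown.count("\n", 0, match.start()) + 1,
        }
        for match in HEADING.finditer(markdown)
    ]


doc = "# Title\n\nIntro.\n\n## Methods\nText\n"
for entry in build_heading_map(doc):
    print(entry)
```

A production version would more likely walk the AST of a real Markdown parser, which also gives the section boundaries needed for a section map.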

4. Search Local Sources

A user or subsystem searches available source material.

Search Query / Quote
  → Search Metadata
  → Search Full Text
  → Return Candidate Documents / Passages

5. Recover Citation

A user provides a citation, quote, or source clue.

Citation Clue
  → Parse Source Metadata
  → Search Local Library
  → Optionally Search Configured External Sources
  → Load Candidate Source
  → Search Exact Quote
  → Search Fuzzy Quote
  → Present Candidate Passages
  → User Confirms
  → Create Source Context for Annotation
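
The exact and fuzzy quote search steps could be sketched as below. The fuzzy step here uses `difflib.SequenceMatcher` over a sliding window, which is one simple option and not necessarily the matching strategy this repository will adopt. The name `find_quote`, the result keys, and the default threshold of 0.8 are assumptions.

```python
from difflib import SequenceMatcher


def find_quote(text: str, quote: str, threshold: float = 0.8) -> list:
    """Return candidate passages: exact hits, else the best fuzzy window."""
    if not quote:
        return []

    # Step 1: exact search, collecting every occurrence.
    hits = []
    start = text.find(quote)
    while start != -1:
        hits.append({"start": start, "end": start + len(quote),
                     "score": 1.0, "match": "exact"})
        start = text.find(quote, start + 1)
    if hits:
        return hits

    # Step 2: fuzzy fallback. Slide a quote-sized window over the text and
    # keep the best-scoring window if it clears the threshold.
    best = None
    last_start = max(0, len(text) - len(quote))
    for start in range(last_start + 1):
        window = text[start:start + len(quote)]
        score = SequenceMatcher(None, quote, window).ratio()
        if best is None or score > best["score"]:
            best = {"start": start, "end": start + len(window),
                    "score": score, "match": "fuzzy"}
    if best is not None and best["score"] >= threshold:
        return [best]
    return []


source = "The quick brown fox jumps over the lazy dog."
print(find_quote(source, "brown fox"))
```

Fuzzy candidates carry a score below 1.0, so a caller can route them to the user confirmation step instead of treating them as confirmed. The sliding window compares every position and is too slow for large documents. It is shown only to make the two-stage idea concrete.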

Initial Source Types

The first version should support or prepare for:

PDF
Markdown
HTML
plain text
remote URL references

Later versions may support:

DOCX
EPUB
scanned image documents
OCR-derived text
IIIF resources
TEI XML
structured datasets with source passages

Citation Recovery States

Citation recovery should be modeled explicitly.

Initial recovery states may include:

created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed

The system should distinguish between finding a source and finding the exact cited passage.
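
One way to make these states explicit is an enumeration. The string values below are the states listed above. The grouping into source-level and passage-level states, and the helper `level`, are assumptions added to illustrate the distinction.

```python
from enum import Enum


class RecoveryState(Enum):
    CREATED = "created"
    SOURCE_FOUND_FULLTEXT = "source-found-fulltext"
    SOURCE_FOUND_PREVIEW_ONLY = "source-found-preview-only"
    SOURCE_FOUND_METADATA_ONLY = "source-found-metadata-only"
    SOURCE_NOT_FOUND = "source-not-found"
    QUOTE_FOUND = "quote-found"
    QUOTE_NOT_FOUND = "quote-not-found"
    CANDIDATE_PASSAGES_FOUND = "candidate-passages-found"
    MANUAL_CONFIRMATION_NEEDED = "manual-confirmation-needed"
    CONFIRMED = "confirmed"
    ANNOTATION_CREATED = "annotation-created"
    FAILED = "failed"


# Assumed grouping: states about the source vs. states about the passage.
SOURCE_STATES = frozenset({
    RecoveryState.SOURCE_FOUND_FULLTEXT,
    RecoveryState.SOURCE_FOUND_PREVIEW_ONLY,
    RecoveryState.SOURCE_FOUND_METADATA_ONLY,
    RecoveryState.SOURCE_NOT_FOUND,
})
PASSAGE_STATES = frozenset({
    RecoveryState.QUOTE_FOUND,
    RecoveryState.QUOTE_NOT_FOUND,
    RecoveryState.CANDIDATE_PASSAGES_FOUND,
})


def level(state: RecoveryState) -> str:
    """Classify a state as 'source', 'passage', or 'lifecycle'."""
    if state in SOURCE_STATES:
        return "source"
    if state in PASSAGE_STATES:
        return "passage"
    return "lifecycle"
```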

Privacy and Source Lookup Principles

Source lookup can create privacy risks.

The repository should follow these principles:

search local sources first,
make external lookup explicit and configurable,
avoid sending private document text to external services by default,
record which external services were queried,
distinguish public metadata lookup from full-text upload,
allow deployments to disable external lookup completely,
prefer deterministic local processing where possible.

External source discovery should be an extension point, not an unavoidable default behavior.
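
These principles could be enforced by a small policy object that every external lookup must pass through. This is a sketch under assumptions: `LookupPolicy`, its fields, and the service name `example-catalog` are hypothetical, and a real audit record would hold more than the service name.

```python
from dataclasses import dataclass, field


@dataclass
class LookupPolicy:
    """Gate for external source lookups; everything is off by default."""
    external_enabled: bool = False
    allow_fulltext_upload: bool = False
    allowed_services: frozenset = frozenset()
    audit_log: list = field(default_factory=list)

    def may_query(self, service: str, sends_fulltext: bool) -> bool:
        allowed = (
            self.external_enabled
            and service in self.allowed_services
            # Metadata-only lookups pass; full text needs explicit opt-in.
            and (self.allow_fulltext_upload or not sends_fulltext)
        )
        if allowed:
            # Record which external service is about to be queried.
            self.audit_log.append(service)
        return allowed


default_policy = LookupPolicy()
print(default_policy.may_query("example-catalog", sends_fulltext=False))  # False
```

With the defaults, a deployment that never configures the policy never sends anything to an external service.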

Design Principles

Source Identity First

Every imported document should receive a stable identity based on available metadata, source URI, and fingerprint.

Canonical Text Matters

Anchoring and search depend on canonical text. The repository should make text normalization explicit and repeatable.
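
A minimal sketch of an explicit, repeatable normalization follows. The specific rules (Unicode NFKC, soft hyphen removal, whitespace collapsing) are illustrative assumptions. The normalization this subsystem actually uses is a shared contract and may differ.

```python
import re
import unicodedata


def canonicalize(text: str) -> str:
    """Normalize text so that repeated runs give identical output."""
    # Fold compatibility characters, e.g. ligatures and no-break spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop soft hyphens, which are invisible layout hints.
    text = text.replace("\u00ad", "")
    # Collapse every run of whitespace, including line breaks, to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(canonicalize("  The \ufb01rst\u00a0line\r\n  wraps "))  # The first line wraps
```

Applying the function to its own output returns the same string, which is what makes offsets computed against canonical text reproducible.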

Representation Is Not Source

The original source and generated representation are different things.

The system should preserve this distinction.

Local Before External

Citation recovery should search local documents before looking elsewhere.

Human Confirmation

Recovered citations should not silently become confirmed evidence. Candidate matches should be presented for confirmation when uncertainty exists.

Format-Aware, Model-Neutral

The repository should understand document formats but should not own the broader evidence model.

Cache Expensive Work

Text extraction, fingerprinting, and representation generation should be cacheable by source fingerprint and version.
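
Keying the cache by fingerprint and version can be sketched with an in-memory store. `RepresentationCache` and `get_or_build` are hypothetical names, and "version" is assumed here to mean the version of the extraction logic.

```python
class RepresentationCache:
    """In-memory cache keyed by (source fingerprint, extractor version)."""

    def __init__(self):
        self._store = {}

    def get_or_build(self, fingerprint: str, version: str, build):
        key = (fingerprint, version)
        if key not in self._store:
            # Run the expensive step only on a cache miss.
            self._store[key] = build()
        return self._store[key]


calls = []


def expensive_extract():
    calls.append(1)
    return {"text": "extracted"}


cache = RepresentationCache()
cache.get_or_build("sha256:example", "v1", expensive_extract)
cache.get_or_build("sha256:example", "v1", expensive_extract)  # cache hit
cache.get_or_build("sha256:example", "v2", expensive_extract)  # new version
print(len(calls))  # 2
```

Including the version in the key means that changing the extraction logic invalidates old entries without touching entries for other versions.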

Agent-Friendly Output

Extracted metadata, representations, and recovery candidates should be structured enough for agents to inspect, rank, and explain.

Expected Dependencies

This repository is expected to depend on shared types and service contracts from:

citation-engine
  Document, DocumentRepresentation, CitationRecoveryAttempt, source-related contracts

It may be consumed by:

citation-work
  to load reviewable documents and document representations

evidence-anchor
  to resolve selectors against extracted representations

evidence-binder
  to retrieve source context for linked evidence

citation-evidence
  to provide integrated import and recovery workflows

It should avoid depending on review UI or form-binding implementation details.

First Useful Version

A first useful version of evidence-source should provide:

source import interface,
media type detection,
document fingerprinting,
basic metadata extraction,
PDF text extraction,
Markdown text extraction,
HTML sanitization and text extraction,
canonical text normalization,
document representation generation,
simple local quote search,
recovery attempt model or contract,
examples showing how a document becomes a representation usable by evidence-anchor.

The first version does not need full external source discovery or OCR, but it should establish the ingestion and representation pattern.

Success Criteria

The repository is successful when another subsystem can use it to:

import a source document,
identify and fingerprint it,
extract useful metadata,
generate canonical text,
generate a document representation,
search the source text,
provide representation data to evidence-anchor,
support a local citation recovery attempt from a quote or citation clue.

A developer or coding agent should be able to understand from this repository how raw documents become evidence-ready sources.

Repository Character

This repository should be:

source-focused,
ingestion-oriented,
privacy-conscious,
format-aware,
representation-centered,
cache-friendly,
suitable for local-first and server-side use,
explicit about uncertainty in citation recovery,
careful not to absorb review or binding responsibilities.

MVP Coordination — Code Lives Upstream

During the umbrella-first MVP phase (decided 2026-05-24), the source code for this subsystem does not live in this repository yet. It lives in the umbrella repo at citation-evidence/src/source/.

This INTENT.md documents the intended responsibilities and boundaries. Once the ingestion and representation interfaces have stabilized through actual MVP use, the corresponding code will be extracted into this repository.

Shared contracts (Document and DocumentRepresentation shapes, CitationRecoveryAttempt state enum, canonical text normalization, allowed dependency edges) are maintained in the umbrella repo:

citation-evidence/wiki/SharedContracts.md
citation-evidence/wiki/DependencyMap.md
citation-evidence/docs/decisions/ (ADRs)

This subsystem's eventual code must not contradict those documents. Changes to shared contracts happen in the umbrella, not here.

Under the dependency map, evidence-source may depend only on citation-engine — not on evidence-anchor. When ingestion needs to know "could a selector resolve here?", the answer travels through events, not direct calls.
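
To make the event-based coordination concrete, here is a toy in-process publish/subscribe sketch. Everything in it is assumed for illustration: the `EventBus` class, the topic names, and the payload fields are hypothetical, and the real event mechanism is whatever the umbrella repo defines.

```python
from collections import defaultdict


class EventBus:
    """Minimal synchronous publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in list(self._handlers[topic]):
            handler(payload)


bus = EventBus()
answers = []


# Anchor side (hypothetical): reacts to a new representation and reports back.
def on_representation_ready(event):
    bus.publish("anchor.resolution-checked",
                {"fingerprint": event["fingerprint"],
                 "resolvable": True})  # placeholder result


bus.subscribe("source.representation-ready", on_representation_ready)
# Source side: listens for the answer instead of importing evidence-anchor.
bus.subscribe("anchor.resolution-checked", answers.append)

bus.publish("source.representation-ready", {"fingerprint": "sha256:example"})
print(answers)
```

Neither side imports the other. Both depend only on the bus and on agreed topic names, which is the property the dependency rule is meant to protect.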

Guiding Statement

evidence-source exists to turn documents and citation clues into reliable, searchable, anchorable source context.
