citation-evidence/wiki/SharedContracts.md
tegwick d06a456c2a Establish shared-contracts home, dependency map, MVP workplans, and umbrella-first strategy
- INTENT.md: declare umbrella as the home for shared contracts; document
  umbrella-first MVP decision (code lives here until subsystems stabilize)
- wiki/SharedContracts.md: vocabulary, state enums, relation types,
  selector taxonomy, event vocabulary, viewer adapter contract,
  canonical text normalization, rect-registry contract
- wiki/DependencyMap.md: allowed dependency edges; folder layout +
  lint-rule strategy during umbrella-first phase
- history/2026-05-24-initial-assessment.md: alignment review, technical
  risks, and the umbrella-first pivot rationale
- workplans/CE-WP-0001..0004: four ralph-compatible workplans covering
  foundations, PDF review slice, form binding + visual guide, and
  citation card export — implementing PRD §20 end-to-end

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 16:42:25 +02:00


Shared Contracts — citation-evidence

This document is the single source of truth for everything that more than one subsystem in the citation-evidence ecosystem must agree on:

  • the vocabulary (entity names and what they mean),
  • the canonical state enums for entities that flow across repo boundaries,
  • the relation type vocabulary,
  • the selector type taxonomy,
  • the event type vocabulary,
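  • the viewer adapter contract,
  • the canonical text normalization,
  • the visual-guide rect registry contract,
  • the ownership rules for shared types versus shared behavior.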

The five sister repos (citation-engine, evidence-anchor, evidence-source, citation-work, evidence-binder) defer to this document. When their INTENT.md files refer to "shared contracts", they mean this file.

During the umbrella-first MVP phase, the TypeScript implementations of these contracts live in citation-evidence/src/shared/ and are imported by the per-subsystem code under citation-evidence/src/{engine,anchor,source,work,binder}/. When a subsystem extracts to its own repo, it takes its slice of the shared types with it — but this document remains the canonical vocabulary.


1. Vocabulary

These nine entities are the vocabulary every subsystem uses.

| Entity | One-line definition | Owner (post-extraction) |
|---|---|---|
| Document | An identified source object: PDF, Markdown, HTML, scan, etc. | citation-engine |
| DocumentRepresentation | A normalized, addressable view of a document (canonical text, page map, structure). | citation-engine |
| Selector | A technical locator for a passage inside a representation. | citation-engine (types) / evidence-anchor (behavior) |
| Annotation | A technical mark on a document range, expressed as one or more selectors plus quote text. | citation-engine |
| EvidenceItem | A meaningful evidence object built from one or more annotations, with commentary and status. | citation-engine |
| EvidenceSet | An ordered group of evidence items associated with a target or topic. | citation-engine (type) / evidence-binder (behavior) |
| EvidenceLink | A relation between an EvidenceItem and a structured target (form field, claim, requirement, …). | citation-engine (type) / evidence-binder (behavior) |
| CitationCard | A renderable, exportable presentation of an evidence item. | citation-engine |
| CitationRecoveryAttempt | A traceable attempt to locate a cited passage from an external clue. | citation-engine (type) / evidence-source (behavior) |

Ownership rule: types and interfaces flow downward from citation-engine; behavior flows upward into the specialised repos. Where the table shows a split, the engine repo holds the data shape and the other repo holds the algorithms and lifecycle.


2. Canonical state enums

These enums are the authoritative values. Subsystems must not invent local variants without updating this document first.

2.1 Annotation.resolutionStatus

resolved          — selectors located the passage with high confidence
ambiguous         — multiple plausible candidates found
unresolved        — no plausible candidate found
stale             — representation has changed since selectors were stored
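
One way to keep these values from drifting is to declare them once and derive both the type and a runtime check from the same list. The sketch below assumes that approach; the names ANNOTATION_RESOLUTION_STATUSES and isAnnotationResolutionStatus are hypothetical and not taken from citation-evidence/src/shared/. The same pattern would apply to the other enums in this section.

```typescript
// The four canonical values from §2.1, declared once as a readonly tuple so
// the union type and the runtime check are derived from the same list.
export const ANNOTATION_RESOLUTION_STATUSES = [
  "resolved",
  "ambiguous",
  "unresolved",
  "stale",
] as const;

export type AnnotationResolutionStatus =
  (typeof ANNOTATION_RESOLUTION_STATUSES)[number];

// Hypothetical helper: narrows an untrusted string (for example a value read
// from storage or JSON) to the canonical union.
export function isAnnotationResolutionStatus(
  value: string,
): value is AnnotationResolutionStatus {
  return (ANNOTATION_RESOLUTION_STATUSES as readonly string[]).indexOf(value) !== -1;
}
```

A local variant such as "partially-resolved" would fail the guard at runtime and would not type-check as an AnnotationResolutionStatus.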

2.2 EvidenceItem.status

candidate         — captured but not yet vetted
confirmed         — verified by a user as useful evidence
rejected          — explicitly discarded
needs-check       — flagged for review

Note: earlier subsystem drafts introduced strong-support, weak-support, and contradicts on the item. Those concepts now live on the link, not the item; see §2.4 (status) and §2.5 (relation).

2.3 Document.reviewStatus (when used by citation-work)

unreviewed
in-review
relevant
rejected
needs-follow-up
cited
verified

citation-work may treat any of these as the active state; the canonical storage lives on the Document record in citation-engine.

2.4 EvidenceLink.status (per target)

no-evidence
candidate
confirmed
conflicting
insufficient
verified

no-evidence is a derived state computed when a target has zero links; it is not stored on a link itself.
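
The split between stored and derived values can be made explicit in the types. The following is a minimal sketch under stated assumptions: LinkLike is a hypothetical, reduced link shape, and statusesForTarget is an illustrative helper. This document does not define how several link statuses roll up into one target status, so the sketch returns the stored statuses unchanged.

```typescript
// Statuses that may be stored on a link: §2.4 minus the derived value.
export type StoredLinkStatus =
  | "candidate"
  | "confirmed"
  | "conflicting"
  | "insufficient"
  | "verified";

// Full per-target vocabulary, including the derived "no-evidence".
export type EvidenceLinkStatus = StoredLinkStatus | "no-evidence";

// Hypothetical minimal link shape; the real EvidenceLink schema has more fields.
export interface LinkLike {
  targetId: string;
  status: StoredLinkStatus;
}

// Returns ["no-evidence"] only when the target has zero links. Otherwise it
// returns the stored statuses as-is, because no roll-up rule is specified.
export function statusesForTarget(
  links: readonly LinkLike[],
  targetId: string,
): EvidenceLinkStatus[] {
  const stored = links
    .filter((link) => link.targetId === targetId)
    .map((link) => link.status);
  return stored.length === 0 ? ["no-evidence"] : stored;
}
```

Because LinkLike.status is typed as StoredLinkStatus, code that tries to persist "no-evidence" on a link fails to compile.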

2.5 EvidenceLink.relation

supports
contradicts
explains
qualifies
source-for
context-for

This is the closed vocabulary for the MVP. Adding a relation requires updating this document and the EvidenceLink schema together.
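
A closed vocabulary can also be enforced at compile time with an exhaustive switch. The sketch below is illustrative only: the relationLabel function and its display labels are hypothetical, and only the six relation names come from this document.

```typescript
// The six relation names from §2.5.
export type EvidenceRelation =
  | "supports"
  | "contradicts"
  | "explains"
  | "qualifies"
  | "source-for"
  | "context-for";

// Hypothetical display labels for each relation.
export function relationLabel(relation: EvidenceRelation): string {
  switch (relation) {
    case "supports":
      return "Supports";
    case "contradicts":
      return "Contradicts";
    case "explains":
      return "Explains";
    case "qualifies":
      return "Qualifies";
    case "source-for":
      return "Source for";
    case "context-for":
      return "Context for";
    default: {
      // If a seventh relation is added to the union without a case above,
      // `relation` is no longer `never` here and this assignment fails to compile.
      const unreachable: never = relation;
      throw new Error(`Unknown relation: ${String(unreachable)}`);
    }
  }
}
```

With this pattern, adding a relation to the type without updating its consumers surfaces as a compile error rather than a silent fallthrough.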

2.6 CitationRecoveryAttempt.state

created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed

3. Selector taxonomy

A Selector is a discriminated union of:

TextQuoteSelector         exact quote + prefix/suffix context
TextPositionSelector      canonical text start/end offsets
PdfRectSelector           page number + normalized page rectangles
PdfPageTextSelector       page number + page-local text offsets
DomRangeSelector          DOM path + range offsets (HTML/Markdown)
StructuralSelector        heading/section/AST path
FragmentSelector          exported fragment / deep link (export-only)

Selector redundancy rule: when an annotation is created, the system stores all selector types that are available for that document representation, not just one. Resolution tries them in order of expected confidence and stops at the first high-confidence match.
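
The union and the redundancy rule can be sketched as follows. Everything beyond the seven selector names is an assumption made for illustration: the field names, the RESOLUTION_ORDER ranking, the 0.9 threshold, and the synchronous tryResolve callback. The real resolution in evidence-anchor is asynchronous and may rank selectors differently.

```typescript
// Discriminated union on `type`; the field names are hypothetical.
export type Selector =
  | { type: "TextQuoteSelector"; exact: string; prefix?: string; suffix?: string }
  | { type: "TextPositionSelector"; start: number; end: number }
  | { type: "PdfRectSelector"; page: number; rects: { x: number; y: number; width: number; height: number }[] }
  | { type: "PdfPageTextSelector"; page: number; start: number; end: number }
  | { type: "DomRangeSelector"; path: string; startOffset: number; endOffset: number }
  | { type: "StructuralSelector"; path: string[] }
  | { type: "FragmentSelector"; value: string };

export interface ResolutionHit {
  selector: Selector;
  confidence: number; // 0..1, as reported by the resolver callback
}

// Assumed ranking by expected confidence. FragmentSelector is left out
// because the taxonomy marks it as export-only.
const RESOLUTION_ORDER: Selector["type"][] = [
  "TextPositionSelector",
  "PdfPageTextSelector",
  "TextQuoteSelector",
  "PdfRectSelector",
  "DomRangeSelector",
  "StructuralSelector",
];

// Tries stored selectors in ranked order and stops at the first result whose
// confidence reaches the threshold. Returns null if none qualifies.
export function resolveFirstConfident(
  selectors: readonly Selector[],
  tryResolve: (selector: Selector) => number,
  threshold = 0.9,
): ResolutionHit | null {
  for (const type of RESOLUTION_ORDER) {
    for (const selector of selectors) {
      if (selector.type !== type) continue;
      const confidence = tryResolve(selector);
      if (confidence >= threshold) return { selector, confidence };
    }
  }
  return null;
}
```

Storing every available selector type is what makes the fallback useful: when offsets no longer line up, the quote-based selector still has a chance to match.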

W3C Web Annotation mapping uses these same concepts but as JSON-LD; the mapping is documented separately (see ADR-0003 — pending).


4. Event vocabulary

Events are the primary integration mechanism between subsystems. The closed event vocabulary for the MVP is:

DocumentImported
DocumentRepresentationGenerated
AnnotationCreated
AnnotationResolved
AnnotationResolutionFailed
EvidenceItemCreated
EvidenceItemUpdated
EvidenceLinkCreated
EvidenceLinkUpdated
EvidenceItemActivated
FormFieldActivated
CitationCardRendered
CitationRecoveryStarted
CitationRecoveryCandidateFound
CitationRecoveryConfirmed

Subsystems must emit these events through a shared event bus owned by citation-engine. Subsystems may listen to any event but must not invent event types without updating this document.
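
A typed bus can make the closed vocabulary checkable. The following is a minimal in-memory sketch, not the engine's actual bus: the BusEvent envelope, the EventBus class, and its on/emit methods are hypothetical names, and payload typing is left as unknown because this document does not define payload shapes.

```typescript
// The fifteen MVP event names from §4.
export const EVENT_TYPES = [
  "DocumentImported",
  "DocumentRepresentationGenerated",
  "AnnotationCreated",
  "AnnotationResolved",
  "AnnotationResolutionFailed",
  "EvidenceItemCreated",
  "EvidenceItemUpdated",
  "EvidenceLinkCreated",
  "EvidenceLinkUpdated",
  "EvidenceItemActivated",
  "FormFieldActivated",
  "CitationCardRendered",
  "CitationRecoveryStarted",
  "CitationRecoveryCandidateFound",
  "CitationRecoveryConfirmed",
] as const;

export type EventType = (typeof EVENT_TYPES)[number];

// Hypothetical envelope; payload shapes are not specified here.
export interface BusEvent {
  type: EventType;
  payload: unknown;
}

type Listener = (event: BusEvent) => void;

export class EventBus {
  private readonly listeners = new Map<EventType, Set<Listener>>();

  // Subscribes to one event type and returns an unsubscribe function.
  on(type: EventType, listener: Listener): () => void {
    let set = this.listeners.get(type);
    if (!set) {
      set = new Set<Listener>();
      this.listeners.set(type, set);
    }
    set.add(listener);
    const registered = set;
    return () => {
      registered.delete(listener);
    };
  }

  // Rejects event types outside the vocabulary, then notifies listeners.
  emit(event: BusEvent): void {
    if (EVENT_TYPES.indexOf(event.type) === -1) {
      throw new Error(`Unknown event type: ${String(event.type)}`);
    }
    const set = this.listeners.get(event.type);
    if (set) set.forEach((listener) => listener(event));
  }
}
```

The runtime check in emit covers callers that bypass the type system, for example events deserialized from another process.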


5. Viewer adapter contract

Viewer adapters are the bridge between a document format and the rest of the system. The adapter contract is owned by evidence-anchor; concrete adapters may live in either evidence-anchor or evidence-source, depending on whether the heavy lifting is selector logic or document representation logic.

interface DocumentViewerAdapter {
  mediaTypes: string[];
  load(document: Document, representation?: DocumentRepresentation): Promise<void>;
  getCurrentSelection(): Promise<SelectionCapture | null>;
  createSelectorsFromSelection(selection: SelectionCapture): Promise<Selector[]>;
  resolveSelectors(selectors: Selector[]): Promise<AnchorResolution>;
  scrollToResolvedTarget(target: ResolvedAnchorTarget, opts?: { center?: boolean; behavior?: "auto"|"smooth" }): Promise<void>;
  renderHighlight(target: ResolvedAnchorTarget, opts?: HighlightRenderOptions): Promise<void>;
  getHighlightClientRects(annotationId: string): Promise<DOMRect[]>;
}

MVP delivers a single PDFViewerAdapter. HTML and Markdown adapters are deferred.
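
The contract does not say how a caller chooses an adapter. One plausible use of the mediaTypes field is a lookup like the sketch below; pickAdapter and AdapterLike are hypothetical names, and the case-insensitive comparison is an assumption.

```typescript
// Only the field needed for selection; a real adapter implements the full
// DocumentViewerAdapter interface.
export interface AdapterLike {
  mediaTypes: string[];
}

// Returns the first adapter that declares the given media type, or undefined
// when none does (for example HTML during the MVP).
export function pickAdapter<A extends AdapterLike>(
  adapters: readonly A[],
  mediaType: string,
): A | undefined {
  const wanted = mediaType.toLowerCase();
  for (const adapter of adapters) {
    if (adapter.mediaTypes.some((t) => t.toLowerCase() === wanted)) {
      return adapter;
    }
  }
  return undefined;
}
```

With only a PDF adapter registered, a lookup for "text/html" returns undefined, which gives callers an explicit place to handle the deferred formats.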


6. Canonical text normalization

All text-based selectors and quote matching depend on a deterministic normalization function. The MVP normalization is:

  1. Unicode NFC normalization.
  2. Replace all line-ending sequences with \n.
  3. Collapse runs of horizontal whitespace into a single space.
  4. Strip soft hyphens (U+00AD).
  5. Preserve paragraph boundaries (double \n).

This function is versioned. Stored selectors record the normalization version they were created against. Changing the function later requires either backwards-compatible behavior or a re-anchoring migration.

The reference implementation lives in citation-evidence/src/shared/text/normalize.ts.
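
The five steps above can be sketched as follows. This is not the contents of normalize.ts. The NORMALIZATION_VERSION constant and the normalizeText name are hypothetical, the set of characters treated as horizontal whitespace (space, tab, no-break space) is an assumption, and soft hyphens are stripped before whitespace is collapsed so that a soft hyphen between two spaces does not leave a double space behind.

```typescript
// Hypothetical version tag that stored selectors would record.
export const NORMALIZATION_VERSION = 1;

export function normalizeText(input: string): string {
  return (
    input
      // Step 1: Unicode NFC normalization.
      .normalize("NFC")
      // Step 2: CRLF and lone CR become "\n". Other Unicode line separators
      // are left untouched in this sketch.
      .replace(/\r\n|\r/g, "\n")
      // Step 4 (applied early): strip soft hyphens (U+00AD).
      .replace(/\u00AD/g, "")
      // Step 3: collapse runs of space, tab, and no-break space to one space.
      .replace(/[ \t\u00A0]+/g, " ")
    // Step 5: "\n" is never rewritten after step 2, so paragraph boundaries
    // (double "\n") are preserved as they appear in the input.
  );
}
```

Because stored offsets depend on the exact output, any change to the character classes above would count as a new normalization version.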


7. Visual guide rect registry

The visual-guide overlay (form field → evidence card → source highlight) requires DOM rects from three independently-rendered subsystems. The contract is a rect registry owned by evidence-binder:

interface RectRegistry {
  register(kind: "field" | "evidence-card" | "highlight", id: string, getRect: () => DOMRect | null): () => void;
  getRect(kind: "field" | "evidence-card" | "highlight", id: string): DOMRect | null;
  subscribe(listener: (event: RectRegistryEvent) => void): () => void;
}

Each renderer (form, evidence sidebar, viewer adapter) registers a getRect callback. The overlay queries on-demand and re-renders on scroll, resize, focus, and active-evidence change.

This contract MUST be defined and stable before any of the three renderers hardens, or the overlay becomes the system's coupling bottleneck.
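
A minimal in-memory registry matching the shape of the interface above might look like the sketch below. Several parts are assumptions: RectRegistryEvent is not defined in this document, so a hypothetical registered/unregistered event is used; RectLike stands in for DOMRect so the sketch runs outside a browser; createRectRegistry is an illustrative name.

```typescript
export type RectKind = "field" | "evidence-card" | "highlight";

// Stand-in for DOMRect so the sketch does not depend on browser globals.
export interface RectLike {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Hypothetical event shape; the real RectRegistryEvent is not specified here.
export interface RectRegistryEvent {
  type: "registered" | "unregistered";
  kind: RectKind;
  id: string;
}

export function createRectRegistry() {
  const getters = new Map<string, () => RectLike | null>();
  const listeners = new Set<(event: RectRegistryEvent) => void>();
  const keyOf = (kind: RectKind, id: string): string => `${kind}:${id}`;
  const notify = (event: RectRegistryEvent): void => {
    listeners.forEach((listener) => listener(event));
  };

  return {
    // Stores the callback and returns an unregister function.
    register(kind: RectKind, id: string, getRect: () => RectLike | null): () => void {
      const key = keyOf(kind, id);
      getters.set(key, getRect);
      notify({ type: "registered", kind, id });
      return () => {
        // Only remove the entry if it still belongs to this registration.
        if (getters.get(key) === getRect) {
          getters.delete(key);
          notify({ type: "unregistered", kind, id });
        }
      };
    },
    // Calls the callback on demand, so the rect is always measured fresh.
    getRect(kind: RectKind, id: string): RectLike | null {
      const getter = getters.get(keyOf(kind, id));
      return getter ? getter() : null;
    },
    subscribe(listener: (event: RectRegistryEvent) => void): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener);
      };
    },
  };
}
```

Registering callbacks rather than rect values keeps the registry free of stale geometry: the overlay measures at draw time, after scroll or resize has settled.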


8. Ownership rules (the short version)

  1. Types and interfaces flow downward from citation-engine.
  2. Behavior and algorithms live in the specialised repos.
  3. Where a concept appears in both a type and a behavior context (e.g. Selector, EvidenceLink, EvidenceSet, CitationRecoveryAttempt), the engine owns the shape and the specialised repo owns the lifecycle.
  4. The shared event bus is engine-owned; subsystems publish and subscribe but do not extend the event vocabulary unilaterally.
  5. No new enum values, relation types, event types, or selector kinds land in code without first appearing in this document.
  6. During umbrella-first MVP: rules 1-5 are aspirational. We will tolerate small violations in citation-evidence/src/ and reconcile during extraction.

9. Change process

Changes to this document are changes to the contract.

  • Small additions (a new enum value, a new event type) can be made in a single PR that updates this doc + the type definitions + at least one consumer.
  • Breaking changes (renaming an entity, removing a state, changing an ownership split) require a short ADR in docs/decisions/ and a heads-up progress event on the state-hub.

10. Pending ADRs that will affect this document

These will be listed in docs/decisions/ once written. Until then, this document reflects the current best understanding from the architecture overview.

  • ADR-0001 — Umbrella-first MVP strategy (decided 2026-05-24, this session).
  • ADR-0002 — Monorepo vs polyrepo packaging (pending).
  • ADR-0003 — W3C Web Annotation: lossy mapping vs round-trip guarantee (pending).
  • ADR-0004 — PDF viewer library choice: react-pdf-highlighter-plus vs PDF.js direct (pending).
  • ADR-0005 — Persistence: local-first SQLite vs Postgres from day one (pending).
  • ADR-0006 — Selector ownership split (types in engine, algorithms in anchor) (pending — implied here).