citation-evidence/wiki/SharedContracts.md
tegwick 779ae0d317 Implement CE-WP-0005 T01-T08: demo app — sessions, uploads, ZIP archive
Turn the MVP into a self-contained demo. Users now:
  1. Land on an empty-state and create a named session.
  2. Drag-drop or pick arbitrary PDFs into that session.
  3. Annotate, build evidence, link to form fields — all session-scoped.
  4. Export the whole session as a single .zip archive (manifest +
     per-document PDFs).
  5. Import a .zip back — into a new session, or merged into an
     existing one (documents deduped by SHA-256 fingerprint;
     annotations/evidence/links added additively).

Architecture:
- New shared types: SessionId, Session, SessionArchiveManifest +
  parseSessionArchiveManifest with schema-version validation.
- SessionService (engine/services/sessions.ts) handles lifecycle
  (create/rename/delete/setActive) + emits 4 new events through its
  own bus; SharedContracts.md §4 lists the additions.
- SessionProvider (work/SessionContext.tsx) owns the cross-session
  state: service, per-session PdfByteStore registry, per-session
  version counter that drives EngineProvider remounts after imports.
- EngineProvider becomes session-aware (sessionId prop drives per-
  session localStorage keys). Bumping engineRevision after
  restoreFromStorage forces consumers to re-render so restored repos
  show up immediately.
- PdfByteStore (source/pdf/byte-store.ts) holds Uint8Array bytes per
  document and mints blob URLs; ingestPdfFromFile is the upload
  entry-point that wraps the existing ingestPdf pipeline.
- ADR-0008 locks the ZIP layout (manifest.json + documents/<id>.pdf),
  the manifest schema (schemaVersion 1), and the merge-on-collision
  policy. JSZip is the only new dependency.
- App.tsx restructured: SessionProvider at the root, EngineProvider
  keyed by ${sessionId}:${version}, hash routing #/s/<id>[/forms/demo],
  SessionMenu top-bar, CreateFirstSession empty state.
- New DocumentRemoved event for per-document delete cleanup in
  CollectionList; engine.documents.remove() is the new service method.

Tests:
- Unit: 16 SessionService lifecycle + persistence tests;
  per-session snapshot round-trip; PdfByteStore + ingestPdfFromFile;
  SessionArchive parser; exportSessionZip + importSessionZip with
  create + merge + corrupt-archive paths.
- DOM: UploadDropzone, session-scoped CollectionList delete,
  SessionMenu create/switch/rename, routing parser.
- E2E: tests/integration/session-export-reimport.dom.test.tsx walks
  the full create → annotate → export → reimport flow and asserts
  the additive merge (deduped doc + doubled evidence rows).
- Legacy E2Es updated to use a seed-session helper instead of the
  removed fixture-button flow.

Known limitation (documented in ADR-0008): re-importing your own
freshly-exported ZIP creates duplicate annotations. Forward pointer
left for an importBundleId follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 14:57:28 +02:00


Shared Contracts — citation-evidence

This document is the single source of truth for everything that more than one subsystem in the citation-evidence ecosystem must agree on:

  • the vocabulary (entity names and what they mean),
  • the canonical state enums for entities that flow across repo boundaries,
  • the relation type vocabulary,
  • the selector type taxonomy,
  • the event type vocabulary,
  • the ownership rules for shared types versus shared behavior.

The five sister repos (citation-engine, evidence-anchor, evidence-source, citation-work, evidence-binder) defer to this document. When their INTENT.md files refer to "shared contracts", they mean this file.

During the umbrella-first MVP phase, the TypeScript implementations of these contracts live in citation-evidence/src/shared/ and are imported by the per-subsystem code under citation-evidence/src/{engine,anchor,source,work,binder}/. When a subsystem extracts to its own repo, it takes its slice of the shared types with it — but this document remains the canonical vocabulary.


1. Vocabulary

These nine entities are the vocabulary every subsystem uses.

| Entity | One-line definition | Owner (post-extraction) |
| --- | --- | --- |
| Document | An identified source object: PDF, Markdown, HTML, scan, etc. | citation-engine |
| DocumentRepresentation | A normalized, addressable view of a document (canonical text, page map, structure). | citation-engine |
| Selector | A technical locator for a passage inside a representation. | citation-engine (types) / evidence-anchor (behavior) |
| Annotation | A technical mark on a document range, expressed as one or more selectors plus quote text. | citation-engine |
| EvidenceItem | A meaningful evidence object built from one or more annotations, with commentary and status. | citation-engine |
| EvidenceSet | An ordered group of evidence items associated with a target or topic. | citation-engine (type) / evidence-binder (behavior) |
| EvidenceLink | A relation between an EvidenceItem and a structured target (form field, claim, requirement, …). | citation-engine (type) / evidence-binder (behavior) |
| CitationCard | A renderable, exportable presentation of an evidence item. | citation-engine |
| CitationRecoveryAttempt | A traceable attempt to locate a cited passage from an external clue. | citation-engine (type) / evidence-source (behavior) |

Ownership rule: types and interfaces flow downward from citation-engine; behavior flows upward into the specialised repos. Where the table shows a split, the engine repo holds the data shape and the other repo holds the algorithms and lifecycle.


2. Canonical state enums

These enums are the authoritative values. Subsystems must not invent local variants without updating this document first.

2.1 Annotation.resolutionStatus

resolved          — selectors located the passage with high confidence
ambiguous         — multiple plausible candidates found
unresolved        — no plausible candidate found
stale             — representation has changed since selectors were stored

2.2 EvidenceItem.status

candidate         — captured but not yet vetted
confirmed         — verified by a user as useful evidence
rejected          — explicitly discarded
needs-check       — flagged for review

Note: earlier subsystem drafts introduced strong-support, weak-support, and contradicts on the item. Those concepts now live on the link, not the item (see §2.4 for link status and §2.5 for link relations).

2.3 Document.reviewStatus (when used by citation-work)

unreviewed
in-review
relevant
rejected
needs-follow-up
cited
verified

citation-work may treat any of these as the active state; the canonical storage lives on the Document record in citation-engine.

2.4 EvidenceLink.status (per target)

no-evidence
candidate
confirmed
conflicting
insufficient
verified

no-evidence is a derived state computed when a target has zero links; it is not stored on a link itself.
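The derived nature of no-evidence can be illustrated with a minimal sketch. The `LinkLike` shape and the `statusesForTarget` helper are hypothetical names introduced here for illustration; how several link statuses combine into one per-target status is not specified in this document and is not attempted.

```typescript
// Status values from §2.4. "no-evidence" is derived, so stored links exclude it.
type EvidenceLinkStatus =
  | "no-evidence"
  | "candidate"
  | "confirmed"
  | "conflicting"
  | "insufficient"
  | "verified";
type StoredLinkStatus = Exclude<EvidenceLinkStatus, "no-evidence">;

// Hypothetical minimal link shape; the real EvidenceLink carries more fields.
interface LinkLike {
  targetId: string;
  status: StoredLinkStatus;
}

// Returns the stored statuses for a target, or ["no-evidence"] when it has no links.
function statusesForTarget(links: LinkLike[], targetId: string): EvidenceLinkStatus[] {
  const own: EvidenceLinkStatus[] = links
    .filter((link) => link.targetId === targetId)
    .map((link) => link.status);
  return own.length === 0 ? ["no-evidence"] : own;
}
```

Typing stored links with `StoredLinkStatus` makes it a compile-time error to persist no-evidence on a link.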

2.5 EvidenceLink.relation

supports
contradicts
explains
qualifies
source-for
context-for

This is the closed vocabulary for the MVP. Adding a relation requires updating this document and the EvidenceLink schema together.
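One way to keep the vocabulary closed in code is to derive the type from a single constant list and validate untrusted input against it. This is a sketch only; the names `EVIDENCE_LINK_RELATIONS` and `isEvidenceLinkRelation` are assumptions and may not match the actual schema code.

```typescript
// The six relation values from §2.5, kept in one place so the type and the
// runtime check cannot drift apart.
const EVIDENCE_LINK_RELATIONS = [
  "supports",
  "contradicts",
  "explains",
  "qualifies",
  "source-for",
  "context-for",
] as const;

type EvidenceLinkRelation = (typeof EVIDENCE_LINK_RELATIONS)[number];

// Type guard for values arriving from storage, imports, or user input.
function isEvidenceLinkRelation(value: unknown): value is EvidenceLinkRelation {
  return (
    typeof value === "string" &&
    EVIDENCE_LINK_RELATIONS.some((relation) => relation === value)
  );
}
```

With this layout, adding a relation means editing one list, which pairs naturally with the rule that the document and schema change together.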

2.6 CitationRecoveryAttempt.state

created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed

3. Selector taxonomy

A Selector is a discriminated union of:

TextQuoteSelector         exact quote + prefix/suffix context
TextPositionSelector      canonical text start/end offsets
PdfRectSelector           page number + normalized page rectangles
PdfPageTextSelector       page number + page-local text offsets
DomRangeSelector          DOM path + range offsets (HTML/Markdown)
StructuralSelector        heading/section/AST path
FragmentSelector          exported fragment / deep link (export-only)

Selector redundancy rule: when an annotation is created, the system stores all selector types that are available for that document representation, not just one. Resolution tries them in order of expected confidence and stops at the first high-confidence match.
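The resolution half of the redundancy rule might look like the following sketch. The discriminant strings, the `Candidate` shape, the numeric confidence scale, and the 0.9 threshold are all assumptions made for illustration; the document does not fix the confidence order, so it is passed in as a parameter.

```typescript
// Discriminant values are assumed to mirror the type names listed in §3.
type SelectorType =
  | "TextQuoteSelector"
  | "TextPositionSelector"
  | "PdfRectSelector"
  | "PdfPageTextSelector"
  | "DomRangeSelector"
  | "StructuralSelector"
  | "FragmentSelector";

// Hypothetical minimal shapes; real selectors carry type-specific fields.
interface SelectorLike {
  type: SelectorType;
}
interface Candidate {
  selectorType: SelectorType;
  confidence: number; // assumed scale: 0 (no match) to 1 (certain)
}

// Tries selectors in the given order and stops at the first high-confidence match.
function resolveFirstConfident(
  selectors: SelectorLike[],
  order: SelectorType[],
  resolve: (selector: SelectorLike) => Candidate | null,
  threshold = 0.9,
): Candidate | null {
  // Selector types missing from `order` are tried last.
  const rank = (type: SelectorType): number => {
    const index = order.indexOf(type);
    return index === -1 ? order.length : index;
  };
  const sorted = selectors.slice().sort((a, b) => rank(a.type) - rank(b.type));
  for (const selector of sorted) {
    const candidate = resolve(selector);
    if (candidate !== null && candidate.confidence >= threshold) {
      return candidate;
    }
  }
  return null;
}
```

Because resolution stops early, storing every available selector type costs little at read time while still giving fallbacks when the preferred selector fails.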

W3C Web Annotation mapping uses these same concepts but as JSON-LD; the mapping is documented separately (see ADR-0003 — pending).


4. Event vocabulary

Events are the primary integration mechanism between subsystems. The closed event vocabulary for the MVP is:

DocumentImported
DocumentRepresentationGenerated
DocumentRemoved
AnnotationCreated
AnnotationResolved
AnnotationResolutionFailed
EvidenceItemCreated
EvidenceItemUpdated
EvidenceLinkCreated
EvidenceLinkUpdated
EvidenceItemActivated
FormFieldActivated
CitationCardRendered
CitationRecoveryStarted
CitationRecoveryCandidateFound
CitationRecoveryConfirmed
SessionCreated
SessionRenamed
SessionDeleted
SessionActivated

The Session* events live on the cross-session session bus (the SessionService's own EventBus instance — see CE-WP-0005). The remaining events live on the per-session engine bus and are scoped to whatever session is currently active.

Subsystems must emit engine events through the shared event bus owned by citation-engine; Session* events go through the session bus described above. Subsystems may listen to any event but must not invent event types without updating this document.
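The split between the two buses can be sketched with one generic bus class parameterised by a closed event-type union. The `EventBus` API shown here (`on`, `emit`) and the `BusEvent` shape are assumptions for illustration, not the actual engine or SessionService interfaces.

```typescript
// Event names are taken from the list in §4; everything else is hypothetical.
type SessionEventType =
  | "SessionCreated" | "SessionRenamed" | "SessionDeleted" | "SessionActivated";
type EngineEventType =
  | "DocumentImported" | "DocumentRepresentationGenerated" | "DocumentRemoved"
  | "AnnotationCreated" | "AnnotationResolved" | "AnnotationResolutionFailed"
  | "EvidenceItemCreated" | "EvidenceItemUpdated"
  | "EvidenceLinkCreated" | "EvidenceLinkUpdated"
  | "EvidenceItemActivated" | "FormFieldActivated" | "CitationCardRendered"
  | "CitationRecoveryStarted" | "CitationRecoveryCandidateFound"
  | "CitationRecoveryConfirmed";

interface BusEvent<T extends string> {
  type: T;
  payload?: unknown; // payload shapes are not defined in this document
}

class EventBus<T extends string> {
  private readonly listeners = new Map<T, Set<(event: BusEvent<T>) => void>>();

  // Subscribes to one event type and returns an unsubscribe function.
  on(type: T, listener: (event: BusEvent<T>) => void): () => void {
    const set = this.listeners.get(type) ?? new Set<(event: BusEvent<T>) => void>();
    this.listeners.set(type, set);
    set.add(listener);
    return () => {
      set.delete(listener);
    };
  }

  // Delivers the event synchronously to every listener of its type.
  emit(event: BusEvent<T>): void {
    this.listeners.get(event.type)?.forEach((listener) => listener(event));
  }
}

// One cross-session bus, plus one engine bus per active session.
const sessionBus = new EventBus<SessionEventType>();
const engineBus = new EventBus<EngineEventType>();
```

With this typing, `engineBus.emit({ type: "SessionCreated" })` fails to compile, so an event cannot be published on the wrong bus or under an unlisted name.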


5. Viewer adapter contract

Viewer adapters are the bridge between a document format and the rest of the system. They are owned by evidence-anchor as far as the contract goes; concrete adapters may live in either evidence-anchor or evidence-source depending on whether the heavy lifting is selector logic or document representation logic.

interface DocumentViewerAdapter {
  mediaTypes: string[];
  load(document: Document, representation?: DocumentRepresentation): Promise<void>;
  getCurrentSelection(): Promise<SelectionCapture | null>;
  createSelectorsFromSelection(selection: SelectionCapture): Promise<Selector[]>;
  resolveSelectors(selectors: Selector[]): Promise<AnchorResolution>;
  scrollToResolvedTarget(target: ResolvedAnchorTarget, opts?: { center?: boolean; behavior?: "auto"|"smooth" }): Promise<void>;
  renderHighlight(target: ResolvedAnchorTarget, opts?: HighlightRenderOptions): Promise<void>;
  getHighlightClientRects(annotationId: string): Promise<DOMRect[]>;
}

MVP delivers a single PDFViewerAdapter. HTML and Markdown adapters are deferred.


6. Canonical text normalization

All text-based selectors and quote matching depend on a deterministic normalization function. The MVP normalization is:

  1. Unicode NFC normalization.
  2. Replace all line-ending sequences with \n.
  3. Collapse runs of horizontal whitespace into a single space.
  4. Strip soft hyphens (U+00AD).
  5. Preserve paragraph boundaries (double \n).

This function is versioned. Stored selectors record the normalization version they were created against. Changing the function later requires either backwards-compatible behavior or a re-anchoring migration.

The reference implementation lives in citation-evidence/src/shared/text/normalize.ts.
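For orientation, the five steps could be written roughly as below. This is a sketch, not the contents of normalize.ts: `NORMALIZATION_VERSION` is a hypothetical name, line endings are assumed to mean CRLF and lone CR, and "horizontal whitespace" is assumed to mean spaces and tabs.

```typescript
// Hypothetical version tag; the document says the function is versioned
// but does not say how the version is represented.
const NORMALIZATION_VERSION = 1;

function normalizeText(input: string): string {
  // Step 5 (preserve paragraph boundaries) holds because no step below
  // removes or merges "\n" characters.
  return input
    .normalize("NFC") // 1. Unicode NFC normalization
    .replace(/\r\n|\r/g, "\n") // 2. CRLF and lone CR become \n (assumed set)
    .replace(/[ \t]+/g, " ") // 3. collapse runs of spaces/tabs (assumed set)
    .replace(/\u00AD/g, ""); // 4. strip soft hyphens
}
```

The steps run in the order listed in this section. Because stored selectors depend on exact offsets, any later change to a step (for example widening the whitespace set) would change the output and therefore needs a new version number.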


7. Visual guide rect registry

The visual-guide overlay (form field → evidence card → source highlight) requires DOM rects from three independently-rendered subsystems. The contract is a rect registry owned by evidence-binder:

interface RectRegistry {
  register(kind: "field" | "evidence-card" | "highlight", id: string, getRect: () => DOMRect | null): () => void;
  getRect(kind: "field" | "evidence-card" | "highlight", id: string): DOMRect | null;
  subscribe(listener: (event: RectRegistryEvent) => void): () => void;
}

Each renderer (form, evidence sidebar, viewer adapter) registers a getRect callback. The overlay queries on-demand and re-renders on scroll, resize, focus, and active-evidence change.
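A minimal in-memory registry along these lines is sketched below. `RectLike` is a stand-in for DOMRect so the example also runs outside a browser, and the `RectRegistryEvent` shape is an assumption because this document does not spell it out.

```typescript
// Stand-in for DOMRect (assumption) so the sketch has no DOM dependency.
interface RectLike {
  x: number;
  y: number;
  width: number;
  height: number;
}
type RectKind = "field" | "evidence-card" | "highlight";

// Hypothetical event shape; the real RectRegistryEvent may differ.
interface RectRegistryEvent {
  type: "registered" | "unregistered";
  kind: RectKind;
  id: string;
}

class InMemoryRectRegistry {
  private readonly getters = new Map<string, () => RectLike | null>();
  private readonly listeners = new Set<(event: RectRegistryEvent) => void>();

  // Stores the callback, not a rect, so every query measures fresh geometry.
  register(kind: RectKind, id: string, getRect: () => RectLike | null): () => void {
    const key = `${kind}:${id}`;
    this.getters.set(key, getRect);
    this.notify({ type: "registered", kind, id });
    return () => {
      // Only remove the entry if it still belongs to this registration.
      if (this.getters.get(key) === getRect) {
        this.getters.delete(key);
        this.notify({ type: "unregistered", kind, id });
      }
    };
  }

  getRect(kind: RectKind, id: string): RectLike | null {
    const getter = this.getters.get(`${kind}:${id}`);
    return getter ? getter() : null;
  }

  subscribe(listener: (event: RectRegistryEvent) => void): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  private notify(event: RectRegistryEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}
```

Registering a callback instead of a measured rect is what lets the overlay query on demand: the renderer is asked for its position only when the overlay redraws.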

This contract MUST be defined and stable before any of the three renderers hardens, or the overlay becomes the system's coupling bottleneck.


8. Ownership rules (the short version)

  1. Types and interfaces flow downward from citation-engine.
  2. Behavior and algorithms live in the specialised repos.
  3. Where a concept appears in both a type and a behavior context (e.g. Selector, EvidenceLink, EvidenceSet, CitationRecoveryAttempt), the engine owns the shape and the specialised repo owns the lifecycle.
  4. The shared event bus is engine-owned; subsystems publish and subscribe but do not extend the event vocabulary unilaterally.
  5. No new enum values, relation types, event types, or selector kinds land in code without first appearing in this document.
  6. During umbrella-first MVP: rules 1-5 are aspirational. We will tolerate small violations in citation-evidence/src/ and reconcile during extraction.

9. Change process

Changes to this document are changes to the contract.

  • Small additions (a new enum value, a new event type) can be made in a single PR that updates this doc + the type definitions + at least one consumer.
  • Breaking changes (renaming an entity, removing a state, changing an ownership split) require a short ADR in docs/decisions/ and a heads-up progress event on the state-hub.

10. Pending ADRs that will affect this document

These will be listed in docs/decisions/ once written. Until then, this document reflects the current best understanding from the architecture overview.

  • ADR-0001 — Umbrella-first MVP strategy (decided 2026-05-24, this session).
  • ADR-0002 — Monorepo vs polyrepo packaging (pending).
  • ADR-0003 — W3C Web Annotation: lossy mapping vs round-trip guarantee (pending).
  • ADR-0004 — PDF viewer library choice: react-pdf-highlighter-plus vs PDF.js direct (pending).
  • ADR-0005 — Persistence: local-first SQLite vs Postgres from day one (pending).
  • ADR-0006 — Selector ownership split (types in engine, algorithms in anchor) (pending — implied here).