generated from coulomb/repo-seed
Turn the MVP into a self-contained demo. Users now:

1. Land on an empty state and create a named session.
2. Drag-drop or pick arbitrary PDFs into that session.
3. Annotate, build evidence, link to form fields — all session-scoped.
4. Export the whole session as a single .zip archive (manifest +
   per-document PDFs).
5. Import a .zip back — into a new session, or merged into an
   existing one (documents deduped by SHA-256 fingerprint;
   annotations/evidence/links added additively).

Architecture:

- New shared types: SessionId, Session, SessionArchiveManifest +
  parseSessionArchiveManifest with schema-version validation.
- SessionService (engine/services/sessions.ts) handles lifecycle
  (create/rename/delete/setActive) + emits 4 new events through its
  own bus; SharedContracts.md §4 lists the additions.
- SessionProvider (work/SessionContext.tsx) owns the cross-session
  state: service, per-session PdfByteStore registry, per-session
  version counter that drives EngineProvider remounts after imports.
- EngineProvider becomes session-aware (sessionId prop drives
  per-session localStorage keys). Bumping engineRevision after
  restoreFromStorage forces consumers to re-render so restored repos
  show up immediately.
- PdfByteStore (source/pdf/byte-store.ts) holds Uint8Array bytes per
  document and mints blob URLs; ingestPdfFromFile is the upload
  entry-point that wraps the existing ingestPdf pipeline.
- ADR-0008 locks the ZIP layout (manifest.json + documents/<id>.pdf),
  the manifest schema (schemaVersion 1), and the merge-on-collision
  policy. JSZip is the only new dependency.
- App.tsx restructured: SessionProvider at the root, EngineProvider
  keyed by ${sessionId}:${version}, hash routing #/s/<id>[/forms/demo],
  SessionMenu top-bar, CreateFirstSession empty state.
- New DocumentRemoved event for per-document delete cleanup in
  CollectionList; engine.documents.remove() is the new service method.

Tests:

- Unit: 16 SessionService lifecycle + persistence tests;
  per-session snapshot round-trip; PdfByteStore + ingestPdfFromFile;
  SessionArchive parser; exportSessionZip + importSessionZip with
  create + merge + corrupt-archive paths.
- DOM: UploadDropzone, session-scoped CollectionList delete,
  SessionMenu create/switch/rename, routing parser.
- E2E: tests/integration/session-export-reimport.dom.test.tsx walks
  the full create → annotate → export → reimport flow and asserts
  the additive merge (deduped doc + doubled evidence rows).
- Legacy E2Es updated to use a seed-session helper instead of the
  removed fixture-button flow.

Known limitation (documented in ADR-0008): re-importing your own
freshly-exported ZIP creates duplicate annotations. Forward pointer
left for an importBundleId follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Shared Contracts — citation-evidence

This document is the **single source of truth** for everything that more than one
subsystem in the citation-evidence ecosystem must agree on:

- the **vocabulary** (entity names and what they mean),
- the **canonical state enums** for entities that flow across repo boundaries,
- the **relation type** vocabulary,
- the **selector type** taxonomy,
- the **event type** vocabulary,
- the **ownership rules** for shared types versus shared behavior.

The five sister repos (`citation-engine`, `evidence-anchor`, `evidence-source`,
`citation-work`, `evidence-binder`) defer to this document. When their
`INTENT.md` files refer to "shared contracts", they mean this file.

During the umbrella-first MVP phase, the **TypeScript implementations** of
these contracts live in `citation-evidence/src/shared/` and are imported by
the per-subsystem code under `citation-evidence/src/{engine,anchor,source,work,binder}/`.
When a subsystem extracts to its own repo, it takes its slice of the shared
types with it — but this document remains the canonical vocabulary.

---

## 1. Vocabulary

These nine entities are the vocabulary every subsystem uses.

| Entity | One-line definition | Owner (post-extraction) |
|---------------------------|----------------------------------------------------------------------------------------------------|-------------------------|
| `Document` | An identified source object: PDF, Markdown, HTML, scan, etc. | `citation-engine` |
| `DocumentRepresentation` | A normalized, addressable view of a document (canonical text, page map, structure). | `citation-engine` |
| `Selector` | A technical locator for a passage inside a representation. | `citation-engine` (types) / `evidence-anchor` (behavior) |
| `Annotation` | A technical mark on a document range, expressed as one or more selectors plus quote text. | `citation-engine` |
| `EvidenceItem` | A meaningful evidence object built from one or more annotations, with commentary and status. | `citation-engine` |
| `EvidenceSet` | An ordered group of evidence items associated with a target or topic. | `citation-engine` (type) / `evidence-binder` (behavior) |
| `EvidenceLink` | A relation between an `EvidenceItem` and a structured target (form field, claim, requirement, …). | `citation-engine` (type) / `evidence-binder` (behavior) |
| `CitationCard` | A renderable, exportable presentation of an evidence item. | `citation-engine` |
| `CitationRecoveryAttempt` | A traceable attempt to locate a cited passage from an external clue. | `citation-engine` (type) / `evidence-source` (behavior) |

**Ownership rule:** *types and interfaces flow downward from `citation-engine`;
behavior flows upward into the specialised repos*. Where the table shows a
split, the engine repo holds the data shape and the other repo holds the
algorithms and lifecycle.

---

## 2. Canonical state enums

These enums are the authoritative values. Subsystems must not invent local
variants without updating this document first.

### 2.1 `Annotation.resolutionStatus`

```
resolved    — selectors located the passage with high confidence
ambiguous   — multiple plausible candidates found
unresolved  — no plausible candidate found
stale       — representation has changed since selectors were stored
```

### 2.2 `EvidenceItem.status`

```
candidate    — captured but not yet vetted
confirmed    — verified by a user as useful evidence
rejected     — explicitly discarded
needs-check  — flagged for review
```

> **Note:** earlier subsystem drafts introduced `strong-support`, `weak-support`,
> and `contradicts` on the item. Those concepts now live on the **link**, not
> the item — see §2.4 (link status) and §2.5 (link relation).

### 2.3 `Document.reviewStatus` (when used by `citation-work`)

```
unreviewed
in-review
relevant
rejected
needs-follow-up
cited
verified
```

`citation-work` may treat any of these as the active state; the canonical
storage lives on the Document record in `citation-engine`.

### 2.4 `EvidenceLink.status` (per target)

```
no-evidence
candidate
confirmed
conflicting
insufficient
verified
```

`no-evidence` is a *derived* state computed when a target has zero links;
it is not stored on a link itself.
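Because `no-evidence` is derived rather than stored, one way to model §2.4 in TypeScript is to keep it out of the stored-status type altogether. The sketch below is illustrative only: the type names, the `LinkRecord` shape, and the `statusesForTarget` helper are assumptions, and how several link statuses roll up into a single per-target value is deliberately left open.

```ts
// Hypothetical type names; the string values are the ones listed in §2.4.
type StoredLinkStatus =
  | "candidate"
  | "confirmed"
  | "conflicting"
  | "insufficient"
  | "verified";

// `no-evidence` exists only at the target level, never on a stored link.
type TargetEvidenceStatus = StoredLinkStatus | "no-evidence";

// Assumed minimal link shape for the example.
interface LinkRecord {
  targetId: string;
  status: StoredLinkStatus;
}

// Returns the distinct statuses of the links that point at `targetId`,
// or ["no-evidence"] when the target has zero links.
function statusesForTarget(
  links: LinkRecord[],
  targetId: string,
): TargetEvidenceStatus[] {
  const seen: TargetEvidenceStatus[] = [];
  for (const link of links) {
    if (link.targetId === targetId && seen.indexOf(link.status) === -1) {
      seen.push(link.status);
    }
  }
  return seen.length === 0 ? ["no-evidence"] : seen;
}
```

With this split, a link record carrying `status: "no-evidence"` is rejected by the compiler instead of being caught in review.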

### 2.5 `EvidenceLink.relation`

```
supports
contradicts
explains
qualifies
source-for
context-for
```

This is the closed vocabulary for the MVP. Adding a relation requires updating
this document and the `EvidenceLink` schema together.
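A closed vocabulary is easiest to keep in step with the schema when a single constant drives both the static type and the runtime check. A minimal sketch, assuming the hypothetical names `EVIDENCE_RELATIONS` and `isEvidenceRelation` (neither is confirmed by this document):

```ts
// The six relation values from §2.5; the constant name is an assumption.
const EVIDENCE_RELATIONS = [
  "supports",
  "contradicts",
  "explains",
  "qualifies",
  "source-for",
  "context-for",
] as const;

// Union type derived from the list, so the two cannot drift apart.
type EvidenceRelation = (typeof EVIDENCE_RELATIONS)[number];

// Runtime guard for untrusted input (for example, an imported archive).
function isEvidenceRelation(value: unknown): value is EvidenceRelation {
  return (
    typeof value === "string" &&
    (EVIDENCE_RELATIONS as readonly string[]).indexOf(value) !== -1
  );
}
```

Values from earlier drafts, such as `strong-support`, fail the guard, and adding a relation means changing this one list alongside the document.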

### 2.6 `CitationRecoveryAttempt.state`

```
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
```

---

## 3. Selector taxonomy

A `Selector` is a discriminated union of:

```
TextQuoteSelector       exact quote + prefix/suffix context
TextPositionSelector    canonical text start/end offsets
PdfRectSelector         page number + normalized page rectangles
PdfPageTextSelector     page number + page-local text offsets
DomRangeSelector        DOM path + range offsets (HTML/Markdown)
StructuralSelector      heading/section/AST path
FragmentSelector        exported fragment / deep link (export-only)
```

**Selector redundancy rule:** when an annotation is created, the system stores
*all selector types that are available* for that document representation, not
just one. Resolution tries them in order of expected confidence and stops at
the first high-confidence match.
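The union and the resolve-in-order behaviour can be sketched as follows. Only the seven type names come from this section. The field names, the `RESOLUTION_ORDER` ranking, the `HIGH_CONFIDENCE` threshold, and the `resolveFirstConfident` helper are assumptions made for illustration, since the document does not state the actual confidence ordering.

```ts
// Field names are illustrative assumptions; the `type` tags match the list above.
type Selector =
  | { type: "TextQuoteSelector"; exact: string; prefix?: string; suffix?: string }
  | { type: "TextPositionSelector"; start: number; end: number }
  | { type: "PdfRectSelector"; page: number; rects: number[][] }
  | { type: "PdfPageTextSelector"; page: number; start: number; end: number }
  | { type: "DomRangeSelector"; path: string; startOffset: number; endOffset: number }
  | { type: "StructuralSelector"; path: string[] }
  | { type: "FragmentSelector"; fragment: string };

type SelectorType = Selector["type"];

interface SelectorResolution {
  selectorType: SelectorType;
  confidence: number; // assumed scale: 0 (no match) to 1 (certain)
}

// Assumed ranking. FragmentSelector is export-only, so it is never resolved.
const RESOLUTION_ORDER: SelectorType[] = [
  "PdfRectSelector",
  "PdfPageTextSelector",
  "TextPositionSelector",
  "TextQuoteSelector",
  "DomRangeSelector",
  "StructuralSelector",
];

const HIGH_CONFIDENCE = 0.9; // assumed threshold

// Tries the stored selectors in ranked order and stops at the first
// high-confidence match; returns null when none qualifies.
function resolveFirstConfident(
  stored: Selector[],
  tryResolve: (selector: Selector) => SelectorResolution | null,
): SelectorResolution | null {
  const ordered = stored
    .filter((s) => RESOLUTION_ORDER.indexOf(s.type) !== -1)
    .sort(
      (a, b) =>
        RESOLUTION_ORDER.indexOf(a.type) - RESOLUTION_ORDER.indexOf(b.type),
    );
  for (const selector of ordered) {
    const result = tryResolve(selector);
    if (result !== null && result.confidence >= HIGH_CONFIDENCE) {
      return result;
    }
  }
  return null;
}
```

Storing every available selector costs little, and the loop shows why it pays off: a low-confidence rectangle match simply falls through to the next stored selector.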

The W3C Web Annotation mapping expresses these same concepts as JSON-LD; the
mapping is documented separately (see ADR-0003 — pending).

---

## 4. Event vocabulary

Events are the primary integration mechanism between subsystems. The closed
event vocabulary for the MVP is:

```
DocumentImported
DocumentRepresentationGenerated
DocumentRemoved
AnnotationCreated
AnnotationResolved
AnnotationResolutionFailed
EvidenceItemCreated
EvidenceItemUpdated
EvidenceLinkCreated
EvidenceLinkUpdated
EvidenceItemActivated
FormFieldActivated
CitationCardRendered
CitationRecoveryStarted
CitationRecoveryCandidateFound
CitationRecoveryConfirmed
SessionCreated
SessionRenamed
SessionDeleted
SessionActivated
```
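A closed vocabulary like the one above can be enforced at compile time with a string-literal union. The sketch below is a hypothetical, simplified bus, not the engine's actual implementation: `ContractEventType` covers only a subset of the names for brevity, and the `payload` field and the `SketchEventBus` class are assumptions.

```ts
// A subset of the vocabulary above, kept short for the example.
type ContractEventType =
  | "DocumentImported"
  | "DocumentRemoved"
  | "AnnotationCreated"
  | "SessionCreated";

// The payload shape is an assumption; this document only fixes the names.
interface ContractEvent {
  type: ContractEventType;
  payload: unknown;
}

type ContractListener = (event: ContractEvent) => void;

class SketchEventBus {
  private readonly listeners = new Map<ContractEventType, ContractListener[]>();

  // Subscribes to one event type and returns an unsubscribe function.
  on(type: ContractEventType, listener: ContractListener): () => void {
    const current = this.listeners.get(type) || [];
    this.listeners.set(type, current.concat([listener]));
    return () => {
      const remaining = (this.listeners.get(type) || []).filter(
        (l) => l !== listener,
      );
      this.listeners.set(type, remaining);
    };
  }

  // Delivers the event synchronously to listeners registered for its type.
  emit(event: ContractEvent): void {
    (this.listeners.get(event.type) || []).forEach((listener) =>
      listener(event),
    );
  }
}
```

An attempt to emit an unlisted name such as `"DocumentDeleted"` fails type-checking, which is the code-level counterpart of the rule that new event types must appear in this document first.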

The `Session*` events live on the cross-session session bus (the
SessionService's own EventBus instance — see CE-WP-0005). The remaining
events live on the per-session engine bus and are scoped to whatever
session is currently active.

Subsystems must emit these events through a shared event bus owned by
`citation-engine`. Subsystems may listen to any event but must not invent
event types without updating this document.

---

## 5. Viewer adapter contract

Viewer adapters are the bridge between a document format and the rest of the
system. They are **owned by `evidence-anchor`** as far as the contract goes;
concrete adapters may live in either `evidence-anchor` or `evidence-source`
depending on whether the heavy lifting is selector logic or document
representation logic.

```ts
interface DocumentViewerAdapter {
  mediaTypes: string[];
  load(document: Document, representation?: DocumentRepresentation): Promise<void>;
  getCurrentSelection(): Promise<SelectionCapture | null>;
  createSelectorsFromSelection(selection: SelectionCapture): Promise<Selector[]>;
  resolveSelectors(selectors: Selector[]): Promise<AnchorResolution>;
  scrollToResolvedTarget(target: ResolvedAnchorTarget, opts?: { center?: boolean; behavior?: "auto"|"smooth" }): Promise<void>;
  renderHighlight(target: ResolvedAnchorTarget, opts?: HighlightRenderOptions): Promise<void>;
  getHighlightClientRects(annotationId: string): Promise<DOMRect[]>;
}
```
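To show how the methods are meant to chain, here is a hedged sketch of a capture-then-highlight flow written against a cut-down version of the interface. The `*Sketch` types are placeholders, because `SelectionCapture`, `AnchorResolution`, and `ResolvedAnchorTarget` are not defined in this document. In particular, the `status` and `target` fields on the resolution are assumptions, as is the `captureAndHighlight` helper itself.

```ts
// Placeholder shapes; the real types are not specified in this document.
interface SelectionCaptureSketch { text: string }
interface SelectorSketch { type: string }
interface ResolvedTargetSketch { page: number }
interface AnchorResolutionSketch {
  status: "resolved" | "ambiguous" | "unresolved" | "stale";
  target?: ResolvedTargetSketch;
}

// Only the adapter methods this flow calls.
interface AdapterSlice {
  getCurrentSelection(): Promise<SelectionCaptureSketch | null>;
  createSelectorsFromSelection(
    selection: SelectionCaptureSketch,
  ): Promise<SelectorSketch[]>;
  resolveSelectors(selectors: SelectorSketch[]): Promise<AnchorResolutionSketch>;
  scrollToResolvedTarget(
    target: ResolvedTargetSketch,
    opts?: { center?: boolean; behavior?: "auto" | "smooth" },
  ): Promise<void>;
  renderHighlight(target: ResolvedTargetSketch): Promise<void>;
}

// Captures the current selection, derives selectors, and highlights the
// passage if it resolves. Returns the selectors, or null without a selection.
async function captureAndHighlight(
  adapter: AdapterSlice,
): Promise<SelectorSketch[] | null> {
  const selection = await adapter.getCurrentSelection();
  if (selection === null) {
    return null;
  }
  const selectors = await adapter.createSelectorsFromSelection(selection);
  const resolution = await adapter.resolveSelectors(selectors);
  if (resolution.status === "resolved" && resolution.target !== undefined) {
    await adapter.scrollToResolvedTarget(resolution.target, {
      center: true,
      behavior: "smooth",
    });
    await adapter.renderHighlight(resolution.target);
  }
  return selectors;
}
```

Resolving immediately after capture doubles as a sanity check: selectors that cannot find their own passage right away are unlikely to find it later.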

MVP delivers a single `PDFViewerAdapter`. HTML and Markdown adapters are
deferred.

---

## 6. Canonical text normalization

All text-based selectors and quote matching depend on a deterministic
normalization function. The MVP normalization is:

1. Unicode NFC normalization.
2. Replace all line-ending sequences with `\n`.
3. Collapse runs of horizontal whitespace into a single space.
4. Strip soft hyphens (U+00AD).
5. Preserve paragraph boundaries (double `\n`).

**This function is versioned.** Stored selectors record the normalization
version they were created against. Changing the function later requires either
backwards-compatible behavior or a re-anchoring migration.
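The five steps can be sketched in a few lines of TypeScript. This is not the reference implementation. The names `NORMALIZATION_VERSION` and `normalizeCanonicalText` are hypothetical, "horizontal whitespace" is assumed to mean any whitespace character other than `\n`, and "line-ending sequences" is assumed to mean CRLF and lone CR. Soft hyphens are stripped before whitespace is collapsed so that removing one cannot leave two adjacent spaces; whether the real function orders these steps the same way is not stated here.

```ts
// Hypothetical version tag; stored selectors would record this value.
const NORMALIZATION_VERSION = 1;

function normalizeCanonicalText(input: string): string {
  // Step 5 needs no code: "\n" is never touched after line endings are
  // unified, so blank-line paragraph boundaries survive as "\n\n".
  return input
    .normalize("NFC") // step 1: Unicode NFC
    .replace(/\r\n?/g, "\n") // step 2: CRLF and lone CR become LF
    .replace(/\u00AD/g, "") // step 4: strip soft hyphens
    .replace(/[^\S\n]+/g, " "); // step 3: whitespace runs (except LF) become one space
}
```

For example, `normalizeCanonicalText("a\u00ADb  c\r\n\r\nd\te")` yields `"ab c\n\nd e"`: the soft hyphen and the doubled space are gone, the tab becomes a space, and the paragraph break is kept.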

The reference implementation lives in `citation-evidence/src/shared/text/normalize.ts`.

---

## 7. Visual guide rect registry

The visual-guide overlay (form field → evidence card → source highlight)
requires DOM rects from three independently-rendered subsystems. The contract
is a **rect registry** owned by `evidence-binder`:

```ts
interface RectRegistry {
  register(kind: "field" | "evidence-card" | "highlight", id: string, getRect: () => DOMRect | null): () => void;
  getRect(kind: "field" | "evidence-card" | "highlight", id: string): DOMRect | null;
  subscribe(listener: (event: RectRegistryEvent) => void): () => void;
}
```
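A minimal in-memory sketch of this interface is shown below, mainly to make the lazy `getRect` callback and the unregister handle concrete. It is not the `evidence-binder` implementation. `RectLike` stands in for `DOMRect` so the example also runs outside a browser, and the `RegistryEventSketch` shape is an assumption because `RectRegistryEvent` is not defined in this document.

```ts
// Stand-in for DOMRect so the sketch also runs outside a browser.
interface RectLike { x: number; y: number; width: number; height: number }

type RectKind = "field" | "evidence-card" | "highlight";

// Assumed event shape; RectRegistryEvent is not defined in this document.
interface RegistryEventSketch {
  type: "registered" | "unregistered";
  kind: RectKind;
  id: string;
}

class InMemoryRectRegistry {
  private readonly getters = new Map<string, () => RectLike | null>();
  private listeners: Array<(event: RegistryEventSketch) => void> = [];

  register(kind: RectKind, id: string, getRect: () => RectLike | null): () => void {
    const key = kind + ":" + id;
    this.getters.set(key, getRect);
    this.notify({ type: "registered", kind: kind, id: id });
    return () => {
      // Ignore stale unregister calls if the id has since been re-registered.
      if (this.getters.get(key) === getRect) {
        this.getters.delete(key);
        this.notify({ type: "unregistered", kind: kind, id: id });
      }
    };
  }

  getRect(kind: RectKind, id: string): RectLike | null {
    const getter = this.getters.get(kind + ":" + id);
    return getter ? getter() : null; // the rect is read lazily, at query time
  }

  subscribe(listener: (event: RegistryEventSketch) => void): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  private notify(event: RegistryEventSketch): void {
    this.listeners.slice().forEach((listener) => listener(event));
  }
}
```

Because the registry stores callbacks rather than rects, a renderer never has to push updates on scroll or resize; the overlay simply asks again.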

Each renderer (form, evidence sidebar, viewer adapter) registers a
`getRect` callback. The overlay queries on-demand and re-renders on scroll,
resize, focus, and active-evidence change.

This contract MUST be defined and stable before any of the three renderers
hardens, or the overlay becomes the system's coupling bottleneck.

---

## 8. Ownership rules (the short version)

1. **Types and interfaces** flow downward from `citation-engine`.
2. **Behavior and algorithms** live in the specialised repos.
3. Where a concept appears in both a type and a behavior context (e.g.
   `Selector`, `EvidenceLink`, `EvidenceSet`, `CitationRecoveryAttempt`),
   the engine owns the shape and the specialised repo owns the lifecycle.
4. **The shared event bus is engine-owned**; subsystems publish and subscribe
   but do not extend the event vocabulary unilaterally.
5. **No new enum values, relation types, event types, or selector kinds**
   land in code without first appearing in this document.
6. During umbrella-first MVP: rules 1-5 are aspirational. We will tolerate
   small violations in `citation-evidence/src/` and reconcile during extraction.

---

## 9. Change process

Changes to this document are changes to the contract.

- Small additions (a new enum value, a new event type) can be made in a single
  PR that updates this doc + the type definitions + at least one consumer.
- Breaking changes (renaming an entity, removing a state, changing an
  ownership split) require a short ADR in `docs/decisions/` and a heads-up
  progress event on the state-hub.

---

## 10. Pending ADRs that will affect this document

Each ADR below is added to `docs/decisions/` once it is written. Until then this
document reflects the current best understanding from the architecture overview.

- **ADR-0001** — Umbrella-first MVP strategy (decided 2026-05-24, this session).
- **ADR-0002** — Monorepo vs polyrepo packaging (pending).
- **ADR-0003** — W3C Web Annotation: lossy mapping vs round-trip guarantee (pending).
- **ADR-0004** — PDF viewer library choice: `react-pdf-highlighter-plus` vs PDF.js direct (pending).
- **ADR-0005** — Persistence: local-first SQLite vs Postgres from day one (pending).
- **ADR-0006** — Selector ownership split (types in engine, algorithms in anchor) (pending — implied here).