Establish shared-contracts home, dependency map, MVP workplans, and umbrella-first strategy

- INTENT.md: declare umbrella as the home for shared contracts; document
  umbrella-first MVP decision (code lives here until subsystems stabilize)
- wiki/SharedContracts.md: vocabulary, state enums, relation types,
  selector taxonomy, event vocabulary, viewer adapter contract,
  canonical text normalization, rect-registry contract
- wiki/DependencyMap.md: allowed dependency edges; folder layout +
  lint-rule strategy during umbrella-first phase
- history/2026-05-24-initial-assessment.md: alignment review, technical
  risks, and the umbrella-first pivot rationale
- workplans/CE-WP-0001..0004: four ralph-compatible workplans covering
  foundations, PDF review slice, form binding + visual guide, and
  citation card export — implementing PRD §20 end-to-end

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 16:42:25 +02:00
parent bc95737e6a
commit d06a456c2a
9 changed files with 1597 additions and 0 deletions

wiki/DependencyMap.md (new file, 155 lines)

@@ -0,0 +1,155 @@
# Dependency Map — citation-evidence
This document describes the **allowed dependency edges** between the
subsystems of the citation-evidence ecosystem. It is the cycle-prevention
contract.
It complements `SharedContracts.md` (which says *what* is shared) by saying
*who is allowed to depend on whom*.
---
## 1. The rule
> Types flow downward from `citation-engine`. Behavior flows upward into
> specialised repos. No subsystem may import another subsystem's behavior
> unless this map shows an edge.
The umbrella repo `citation-evidence` is allowed to depend on every
subsystem; nothing depends on the umbrella.
---
## 2. Allowed edges
```
citation-evidence (umbrella)
  └─ depends on ─► citation-work, evidence-binder, evidence-source,
                   evidence-anchor, citation-engine

citation-work    ─► evidence-anchor, evidence-source, citation-engine
evidence-binder  ─► evidence-anchor, citation-engine
evidence-anchor  ─► citation-engine
evidence-source  ─► citation-engine
citation-engine  ─► (nothing; leaf node)
```
In tabular form:
| Repo | May depend on | Must not depend on |
|--------------------|--------------------------------------------------------|-----------------------------------------|
| `citation-engine` | (nothing — it is the leaf) | every other subsystem |
| `evidence-anchor` | `citation-engine` | `evidence-source`, `citation-work`, `evidence-binder`, `citation-evidence` |
| `evidence-source` | `citation-engine` | `evidence-anchor`, `citation-work`, `evidence-binder`, `citation-evidence` |
| `evidence-binder` | `citation-engine`, `evidence-anchor` | `evidence-source`, `citation-work`, `citation-evidence` |
| `citation-work` | `citation-engine`, `evidence-anchor`, `evidence-source`| `evidence-binder`, `citation-evidence` |
| `citation-evidence`| all five subsystems | (nothing else in the ecosystem) |
Notes:
- `evidence-source` does NOT depend on `evidence-anchor`. When an ingestion
pipeline needs to know "could a selector resolve here?", the answer comes
through events, not direct calls.
- `citation-work` does NOT depend on `evidence-binder`. Linking evidence to
form fields is a separate workflow; the review workspace should function
without it. A separate "evidence-backed form" application composes work +
binder + engine.
- `evidence-binder` does NOT depend on `evidence-source`. When a binder needs
source context, it asks `evidence-anchor` to resolve the annotation, which
in turn knows nothing about how the document was ingested.
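The table and notes above can also be written down as data, which makes the "cycle-prevention contract" checkable. The following is a minimal sketch, not part of the repository: `ALLOWED_EDGES`, `mayDependOn`, and `hasCycle` are hypothetical names, and the edge lists are copied from the "May depend on" column.

```typescript
// Hypothetical encoding of the "May depend on" column; not an existing module.
type Repo =
  | "citation-engine"
  | "evidence-anchor"
  | "evidence-source"
  | "evidence-binder"
  | "citation-work"
  | "citation-evidence";

const ALLOWED_EDGES: Record<Repo, readonly Repo[]> = {
  "citation-engine": [],
  "evidence-anchor": ["citation-engine"],
  "evidence-source": ["citation-engine"],
  "evidence-binder": ["citation-engine", "evidence-anchor"],
  "citation-work": ["citation-engine", "evidence-anchor", "evidence-source"],
  "citation-evidence": [
    "citation-engine",
    "evidence-anchor",
    "evidence-source",
    "evidence-binder",
    "citation-work",
  ],
};

// True only when the map shows a direct edge from `from` to `to`.
function mayDependOn(from: Repo, to: Repo): boolean {
  return ALLOWED_EDGES[from].indexOf(to) !== -1;
}

// Depth-first search: reports a cycle if a repo is reached again
// while it is still on the current search path.
function hasCycle(edges: Record<Repo, readonly Repo[]>): boolean {
  const visiting = new Set<Repo>();
  const done = new Set<Repo>();
  const visit = (repo: Repo): boolean => {
    if (done.has(repo)) return false;
    if (visiting.has(repo)) return true;
    visiting.add(repo);
    const cyclic = edges[repo].some((next) => visit(next));
    visiting.delete(repo);
    done.add(repo);
    return cyclic;
  };
  return (Object.keys(edges) as Repo[]).some((repo) => visit(repo));
}
```

Running `hasCycle(ALLOWED_EDGES)` in a test would catch a future edit to the map that accidentally introduces a cycle.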
---
## 3. Communication channels
Direct imports are allowed only along the edges above. Where two subsystems
need to coordinate without being allowed to import each other, they use one
of these indirect channels:
| Channel | Owner | Notes |
|---------------------------------|------------------|---------------------------------------------------------|
| Shared event bus | `citation-engine`| Vocabulary frozen in `SharedContracts.md` §4 |
| Shared types package | `citation-engine`| Re-exported through `@citation-evidence/engine` (post-extraction) |
| Rect registry | `evidence-binder`| Used by form UI, evidence sidebar, viewer adapter |
| Persistence interfaces | `citation-engine`| Concrete adapters in subsystems but interfaces in engine|
---
## 4. During umbrella-first MVP
While all code lives in `citation-evidence/src/`, the rule is enforced by
**folder structure** and **lint rules**:
```
citation-evidence/src/
  shared/   ← what will become citation-engine (types + contracts)
  engine/   ← what will become citation-engine (services)
  anchor/   ← what will become evidence-anchor
  source/   ← what will become evidence-source
  work/     ← what will become citation-work (UI)
  binder/   ← what will become evidence-binder
  app/      ← the umbrella reference app
```
Lint rules (to be added in CE-WP-0001):
- `engine/` may import only from `shared/`.
- `anchor/` may import only from `shared/`, `engine/`.
- `source/` may import only from `shared/`, `engine/`.
- `binder/` may import only from `shared/`, `engine/`, `anchor/`.
- `work/` may import only from `shared/`, `engine/`, `anchor/`, `source/`.
- `app/` may import from any.
Violating these rules in MVP is a lint error, not a runtime error. When
subsystems extract into their own repos, the lint rule disappears and the
package boundary enforces the same constraint.
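As an illustration of what such a lint check has to decide, here is a small sketch of the folder rule as a pure function. It is an assumption-laden example rather than the planned lint rule: `MAY_IMPORT`, `folderOf`, and `isImportAllowed` are hypothetical names, and the entry for `shared/` (imports nothing outside itself) is an assumption, since the list above does not state a rule for it. A real setup would more likely configure an existing lint plugin than hand-roll this.

```typescript
type Folder = "shared" | "engine" | "anchor" | "source" | "binder" | "work" | "app";

// Mirrors the bullet list above. The `shared` entry is an assumption.
const MAY_IMPORT: Record<Folder, readonly Folder[]> = {
  shared: [],
  engine: ["shared"],
  anchor: ["shared", "engine"],
  source: ["shared", "engine"],
  binder: ["shared", "engine", "anchor"],
  work: ["shared", "engine", "anchor", "source"],
  app: ["shared", "engine", "anchor", "source", "binder", "work"],
};

// Extracts the folder directly under src/, e.g. "src/binder/links.ts" -> "binder".
function folderOf(path: string): Folder | null {
  const match = /(?:^|\/)src\/([^/]+)\//.exec(path);
  const name = match ? match[1] : undefined;
  if (name === undefined) return null;
  return Object.prototype.hasOwnProperty.call(MAY_IMPORT, name) ? (name as Folder) : null;
}

// Imports within one folder are always fine; paths outside src/ are not
// covered by this rule and are let through.
function isImportAllowed(importerPath: string, importedPath: string): boolean {
  const from = folderOf(importerPath);
  const to = folderOf(importedPath);
  if (from === null || to === null) return true;
  return from === to || MAY_IMPORT[from].indexOf(to) !== -1;
}
```

For example, `isImportAllowed("src/work/Panel.tsx", "src/binder/links.ts")` is `false`, matching the rule that `work/` may not import from `binder/`.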
---
## 5. Why these rules
1. **`citation-engine` as the leaf** prevents the most common monorepo pathology:
the "core" repo accumulating UI/IO dependencies because it was easier than
inverting a dependency.
2. **No edge from `citation-work` to `evidence-binder`** keeps the review workspace usable
even when there is no form context (e.g. just collecting evidence for a
report).
3. **No edge from `evidence-binder` to `evidence-source`** keeps binding logic from
accidentally caring about ingestion details.
4. **No subsystem depends on `citation-evidence`** — the umbrella is a
composition point, not a library.
---
## 6. Change process
Adding an edge to this map is a change to the contract.
- New edges require a short ADR in `docs/decisions/`.
- Removing an edge requires a refactoring plan (where do consumers go?).
- The MVP itself is an exception: edges that turn out to be wrong during
umbrella-first development are recorded as "deferred reshape" items in the
relevant workplan, not as ADRs.

wiki/SharedContracts.md (new file, 296 lines)

@@ -0,0 +1,296 @@
# Shared Contracts — citation-evidence
This document is the **single source of truth** for everything that more than one
subsystem in the citation-evidence ecosystem must agree on:
- the **vocabulary** (entity names and what they mean),
- the **canonical state enums** for entities that flow across repo boundaries,
- the **relation type** vocabulary,
- the **selector type** taxonomy,
- the **event type** vocabulary,
- the **ownership rules** for shared types versus shared behavior.
The five sister repos (`citation-engine`, `evidence-anchor`, `evidence-source`,
`citation-work`, `evidence-binder`) defer to this document. When their
`INTENT.md` files refer to "shared contracts", they mean this file.
During the umbrella-first MVP phase, the **TypeScript implementations** of
these contracts live in `citation-evidence/src/shared/` and are imported by
the per-subsystem code under `citation-evidence/src/{engine,anchor,source,work,binder}/`.
When a subsystem extracts to its own repo, it takes its slice of the shared
types with it — but this document remains the canonical vocabulary.
---
## 1. Vocabulary
These nine entities are the vocabulary every subsystem uses.
| Entity | One-line definition | Owner (post-extraction) |
|---------------------------|----------------------------------------------------------------------------------------------------|-------------------------|
| `Document` | An identified source object: PDF, Markdown, HTML, scan, etc. | `citation-engine` |
| `DocumentRepresentation` | A normalized, addressable view of a document (canonical text, page map, structure). | `citation-engine` |
| `Selector` | A technical locator for a passage inside a representation. | `citation-engine` (types) / `evidence-anchor` (behavior) |
| `Annotation` | A technical mark on a document range, expressed as one or more selectors plus quote text. | `citation-engine` |
| `EvidenceItem` | A meaningful evidence object built from one or more annotations, with commentary and status. | `citation-engine` |
| `EvidenceSet` | An ordered group of evidence items associated with a target or topic. | `citation-engine` (type) / `evidence-binder` (behavior) |
| `EvidenceLink` | A relation between an `EvidenceItem` and a structured target (form field, claim, requirement, …). | `citation-engine` (type) / `evidence-binder` (behavior) |
| `CitationCard` | A renderable, exportable presentation of an evidence item. | `citation-engine` |
| `CitationRecoveryAttempt` | A traceable attempt to locate a cited passage from an external clue. | `citation-engine` (type) / `evidence-source` (behavior) |
**Ownership rule:** *types and interfaces flow downward from `citation-engine`;
behavior flows upward into the specialised repos*. Where the table shows a
split, the engine repo holds the data shape and the other repo holds the
algorithms and lifecycle.
---
## 2. Canonical state enums
These enums are the authoritative values. Subsystems must not invent local
variants without updating this document first.
### 2.1 `Annotation.resolutionStatus`
```
resolved — selectors located the passage with high confidence
ambiguous — multiple plausible candidates found
unresolved — no plausible candidate found
stale — representation has changed since selectors were stored
```
### 2.2 `EvidenceItem.status`
```
candidate — captured but not yet vetted
confirmed — verified by a user as useful evidence
rejected — explicitly discarded
needs-check — flagged for review
```
> **Note:** earlier subsystem drafts introduced `strong-support`, `weak-support`,
> and `contradicts` on the item. Those concepts now live on the **link**, not
> the item; see §2.4 (link status) and §2.5 (link relation).
### 2.3 `Document.reviewStatus` (when used by `citation-work`)
```
unreviewed
in-review
relevant
rejected
needs-follow-up
cited
verified
```
`citation-work` may treat any of these as the active state; the canonical
storage lives on the Document record in `citation-engine`.
### 2.4 `EvidenceLink.status` (per target)
```
no-evidence
candidate
confirmed
conflicting
insufficient
verified
```
`no-evidence` is a *derived* state computed when a target has zero links;
it is not stored on a link itself.
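The derived nature of `no-evidence` can be sketched as follows. This is an illustration only: `EvidenceLinkRecord` and `targetStatuses` are hypothetical names, and the sketch deliberately does not aggregate several link statuses into one value, because this document does not define such a rule.

```typescript
// Statuses that can be stored on a link; "no-evidence" is intentionally absent.
type StoredLinkStatus = "candidate" | "confirmed" | "conflicting" | "insufficient" | "verified";
type TargetStatus = StoredLinkStatus | "no-evidence";

// Hypothetical minimal record shape, for illustration only.
interface EvidenceLinkRecord {
  id: string;
  targetId: string;
  status: StoredLinkStatus;
}

// Returns ["no-evidence"] when the target has zero links; otherwise returns
// the stored status of each link that points at the target.
function targetStatuses(targetId: string, links: readonly EvidenceLinkRecord[]): TargetStatus[] {
  const stored = links.filter((link) => link.targetId === targetId).map((link) => link.status);
  return stored.length === 0 ? ["no-evidence"] : stored;
}
```

Because `StoredLinkStatus` excludes `"no-evidence"`, the compiler rejects any attempt to persist the derived value on a link.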
### 2.5 `EvidenceLink.relation`
```
supports
contradicts
explains
qualifies
source-for
context-for
```
This is the closed vocabulary for the MVP. Adding a relation requires updating
this document and the `EvidenceLink` schema together.
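One way to keep a closed vocabulary like this in step between the document and the code is a literal tuple plus a type guard. This is a sketch under assumptions (the names `EVIDENCE_LINK_RELATIONS` and `isEvidenceLinkRelation` are hypothetical); the same pattern would apply to the other enums in this section.

```typescript
// The six relation values listed above, kept in one place.
const EVIDENCE_LINK_RELATIONS = [
  "supports",
  "contradicts",
  "explains",
  "qualifies",
  "source-for",
  "context-for",
] as const;

// Union type derived from the tuple, so the list and the type cannot drift apart.
type EvidenceLinkRelation = (typeof EVIDENCE_LINK_RELATIONS)[number];

// Runtime guard for untyped input such as parsed JSON.
function isEvidenceLinkRelation(value: unknown): value is EvidenceLinkRelation {
  return (
    typeof value === "string" &&
    (EVIDENCE_LINK_RELATIONS as readonly string[]).indexOf(value) !== -1
  );
}
```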
### 2.6 `CitationRecoveryAttempt.state`
```
created
source-found-fulltext
source-found-preview-only
source-found-metadata-only
source-not-found
quote-found
quote-not-found
candidate-passages-found
manual-confirmation-needed
confirmed
annotation-created
failed
```
---
## 3. Selector taxonomy
A `Selector` is a discriminated union of:
```
TextQuoteSelector exact quote + prefix/suffix context
TextPositionSelector canonical text start/end offsets
PdfRectSelector page number + normalized page rectangles
PdfPageTextSelector page number + page-local text offsets
DomRangeSelector DOM path + range offsets (HTML/Markdown)
StructuralSelector heading/section/AST path
FragmentSelector exported fragment / deep link (export-only)
```
**Selector redundancy rule:** when an annotation is created, the system stores
*all selector types that are available* for that document representation, not
just one. Resolution tries them in order of expected confidence and stops at
the first high-confidence match.
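The redundancy rule can be illustrated with a reduced two-selector union. Everything concrete in this sketch is an assumption: the field names (`exact`, `start`, `end`), the confidence numbers, the `HIGH_CONFIDENCE` threshold, and the priority order are invented for illustration, and the remaining five selector types are omitted for brevity.

```typescript
interface TextQuoteSelector { type: "TextQuoteSelector"; exact: string }
interface TextPositionSelector { type: "TextPositionSelector"; start: number; end: number }
type Selector = TextQuoteSelector | TextPositionSelector;

interface Candidate { start: number; end: number; confidence: number }

const HIGH_CONFIDENCE = 0.9; // hypothetical threshold
// Assumed order of expected confidence: quote match first, raw offsets second.
const PRIORITY: readonly Selector["type"][] = ["TextQuoteSelector", "TextPositionSelector"];

function countOccurrences(text: string, needle: string): number {
  if (needle.length === 0) return 0;
  let count = 0;
  let from = text.indexOf(needle);
  while (from !== -1) {
    count += 1;
    from = text.indexOf(needle, from + 1);
  }
  return count;
}

function resolveOne(selector: Selector, text: string): Candidate | null {
  switch (selector.type) {
    case "TextQuoteSelector": {
      // A unique quote scores 1; each extra occurrence lowers the score.
      const hits = countOccurrences(text, selector.exact);
      if (hits === 0) return null;
      const start = text.indexOf(selector.exact);
      return { start, end: start + selector.exact.length, confidence: 1 / hits };
    }
    case "TextPositionSelector": {
      // Offsets alone cannot confirm the text is unchanged, so the score stays low.
      const inRange =
        selector.start >= 0 && selector.start < selector.end && selector.end <= text.length;
      return inRange ? { start: selector.start, end: selector.end, confidence: 0.5 } : null;
    }
  }
}

// Tries selectors in priority order and stops at the first high-confidence match.
function resolveInPriorityOrder(selectors: readonly Selector[], text: string): Candidate | null {
  const ordered = selectors
    .slice()
    .sort((a, b) => PRIORITY.indexOf(a.type) - PRIORITY.indexOf(b.type));
  for (const selector of ordered) {
    const candidate = resolveOne(selector, text);
    if (candidate !== null && candidate.confidence >= HIGH_CONFIDENCE) return candidate;
  }
  return null;
}
```

In this sketch a `null` result means no selector reached the threshold; mapping that onto `ambiguous` versus `unresolved` (§2.1) is left out.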
W3C Web Annotation mapping uses these same concepts but as JSON-LD; the mapping
is documented separately (see ADR-0003 — pending).
---
## 4. Event vocabulary
Events are the primary integration mechanism between subsystems. The closed
event vocabulary for the MVP is:
```
DocumentImported
DocumentRepresentationGenerated
AnnotationCreated
AnnotationResolved
AnnotationResolutionFailed
EvidenceItemCreated
EvidenceItemUpdated
EvidenceLinkCreated
EvidenceLinkUpdated
EvidenceItemActivated
FormFieldActivated
CitationCardRendered
CitationRecoveryStarted
CitationRecoveryCandidateFound
CitationRecoveryConfirmed
```
Subsystems must emit these events through a shared event bus owned by
`citation-engine`. Subsystems may listen to any event but must not invent
event types without updating this document.
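A closed event vocabulary maps naturally onto a typed bus. The sketch below is a hypothetical in-process bus, not the engine's actual API: the class name `EventBus`, the `payload: unknown` field, and the synchronous delivery are all assumptions, since this document fixes only the event names.

```typescript
const EVENT_TYPES = [
  "DocumentImported",
  "DocumentRepresentationGenerated",
  "AnnotationCreated",
  "AnnotationResolved",
  "AnnotationResolutionFailed",
  "EvidenceItemCreated",
  "EvidenceItemUpdated",
  "EvidenceLinkCreated",
  "EvidenceLinkUpdated",
  "EvidenceItemActivated",
  "FormFieldActivated",
  "CitationCardRendered",
  "CitationRecoveryStarted",
  "CitationRecoveryCandidateFound",
  "CitationRecoveryConfirmed",
] as const;

type EventType = (typeof EVENT_TYPES)[number];

// Payload shapes are not specified in this document, so they stay `unknown` here.
interface DomainEvent { type: EventType; payload: unknown }
type Listener = (event: DomainEvent) => void;

class EventBus {
  private readonly listeners = new Map<EventType, Set<Listener>>();

  // Returns an unsubscribe function.
  subscribe(type: EventType, listener: Listener): () => void {
    const set = this.listeners.get(type) ?? new Set<Listener>();
    set.add(listener);
    this.listeners.set(type, set);
    return () => {
      set.delete(listener);
    };
  }

  // Rejects event names outside the closed vocabulary, even from untyped callers.
  publish(event: DomainEvent): void {
    if (EVENT_TYPES.indexOf(event.type) === -1) {
      throw new Error(`Unknown event type: ${String(event.type)}`);
    }
    this.listeners.get(event.type)?.forEach((listener) => listener(event));
  }
}
```

With this shape, adding an event type means editing `EVENT_TYPES` (and this document) in one place, and the compiler flags every caller that uses an unlisted name.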
---
## 5. Viewer adapter contract
Viewer adapters are the bridge between a document format and the rest of the
system. They are **owned by `evidence-anchor`** as far as the contract goes;
concrete adapters may live in either `evidence-anchor` or `evidence-source`
depending on whether the heavy lifting is selector logic or document
representation logic.
```ts
interface DocumentViewerAdapter {
  mediaTypes: string[];
  load(document: Document, representation?: DocumentRepresentation): Promise<void>;
  getCurrentSelection(): Promise<SelectionCapture | null>;
  createSelectorsFromSelection(selection: SelectionCapture): Promise<Selector[]>;
  resolveSelectors(selectors: Selector[]): Promise<AnchorResolution>;
  scrollToResolvedTarget(
    target: ResolvedAnchorTarget,
    opts?: { center?: boolean; behavior?: "auto" | "smooth" },
  ): Promise<void>;
  renderHighlight(target: ResolvedAnchorTarget, opts?: HighlightRenderOptions): Promise<void>;
  getHighlightClientRects(annotationId: string): Promise<DOMRect[]>;
}
```
MVP delivers a single `PDFViewerAdapter`. HTML and Markdown adapters are
deferred.
---
## 6. Canonical text normalization
All text-based selectors and quote matching depend on a deterministic
normalization function. The MVP normalization is:
1. Unicode NFC normalization.
2. Replace all line-ending sequences with `\n`.
3. Collapse runs of horizontal whitespace into a single space.
4. Strip soft hyphens (U+00AD).
5. Preserve paragraph boundaries (double `\n`).
**This function is versioned.** Stored selectors record the normalization
version they were created against. Changing the function later requires either
backwards-compatible behavior or a re-anchoring migration.
The reference implementation lives in `citation-evidence/src/shared/text/normalize.ts`.
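The five steps can be sketched as a single function. This is not the content of `normalize.ts`; it is a minimal illustration under stated assumptions: "line-ending sequences" is taken to mean `\r\n` and a lone `\r`, "horizontal whitespace" is taken to mean space, tab, and no-break space, and `NORMALIZATION_VERSION` is a hypothetical version tag.

```typescript
// Hypothetical version tag that stored selectors would record.
const NORMALIZATION_VERSION = 1;

function normalizeText(input: string): string {
  return input
    // 1. Unicode NFC normalization.
    .normalize("NFC")
    // 2. Line endings to "\n" (assumption: only \r\n and lone \r are handled).
    .replace(/\r\n?/g, "\n")
    // 3. Collapse runs of horizontal whitespace (assumption: space, tab, NBSP).
    .replace(/[ \t\u00A0]+/g, " ")
    // 4. Strip soft hyphens (U+00AD).
    .replace(/\u00AD/g, "");
  // 5. Paragraph boundaries survive because "\n" is never collapsed or removed.
}
```

For example, `normalizeText("co\u00ADoperate\r\n\r\nnext  para")` yields `"cooperate\n\nnext para"`. The steps run in the listed order, so a soft hyphen sitting between two spaces leaves both spaces in place; whether that is desired is a question for the versioned specification.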
---
## 7. Visual guide rect registry
The visual-guide overlay (form field → evidence card → source highlight)
requires DOM rects from three independently-rendered subsystems. The contract
is a **rect registry** owned by `evidence-binder`:
```ts
interface RectRegistry {
  register(kind: "field" | "evidence-card" | "highlight", id: string, getRect: () => DOMRect | null): () => void;
  getRect(kind: "field" | "evidence-card" | "highlight", id: string): DOMRect | null;
  subscribe(listener: (event: RectRegistryEvent) => void): () => void;
}
```
Each renderer (form, evidence sidebar, viewer adapter) registers a
`getRect` callback. The overlay queries on-demand and re-renders on scroll,
resize, focus, and active-evidence change.
This contract MUST be defined and stable before any of the three renderers
hardens, or the overlay becomes the system's coupling bottleneck.
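To make the registry contract concrete, here is a minimal in-memory sketch. It rests on several assumptions: `Rect` is a plain stand-in for `DOMRect` so the example runs outside a browser, the shape of `RectRegistryEvent` (`change: "registered" | "unregistered"`) is invented because this document does not define it, and `InMemoryRectRegistry` is a hypothetical name.

```typescript
type RectKind = "field" | "evidence-card" | "highlight";

// Stand-in for DOMRect so the sketch does not depend on browser globals.
interface Rect { x: number; y: number; width: number; height: number }

// Assumed event shape; the contract above leaves RectRegistryEvent undefined.
interface RectRegistryEvent { kind: RectKind; id: string; change: "registered" | "unregistered" }

class InMemoryRectRegistry {
  private readonly getters = new Map<string, () => Rect | null>();
  private readonly listeners = new Set<(event: RectRegistryEvent) => void>();

  // Stores the callback and returns a function that unregisters it.
  register(kind: RectKind, id: string, getRect: () => Rect | null): () => void {
    const key = `${kind}:${id}`;
    this.getters.set(key, getRect);
    this.emit({ kind, id, change: "registered" });
    return () => {
      // Only remove the entry if it still belongs to this registration.
      if (this.getters.get(key) === getRect) {
        this.getters.delete(key);
        this.emit({ kind, id, change: "unregistered" });
      }
    };
  }

  // Calls the callback on demand, so the rect is always measured fresh.
  getRect(kind: RectKind, id: string): Rect | null {
    const getter = this.getters.get(`${kind}:${id}`);
    return getter ? getter() : null;
  }

  subscribe(listener: (event: RectRegistryEvent) => void): () => void {
    this.listeners.add(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  private emit(event: RectRegistryEvent): void {
    this.listeners.forEach((listener) => listener(event));
  }
}
```

Storing callbacks rather than rect values is what lets the overlay query on demand after scroll or resize without the renderers pushing updates.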
---
## 8. Ownership rules (the short version)
1. **Types and interfaces** flow downward from `citation-engine`.
2. **Behavior and algorithms** live in the specialised repos.
3. Where a concept appears in both a type and a behavior context (e.g.
`Selector`, `EvidenceLink`, `EvidenceSet`, `CitationRecoveryAttempt`),
the engine owns the shape and the specialised repo owns the lifecycle.
4. **The shared event bus is engine-owned**; subsystems publish and subscribe
but do not extend the event vocabulary unilaterally.
5. **No new enum values, relation types, event types, or selector kinds**
land in code without first appearing in this document.
6. During umbrella-first MVP: rules 1-5 are aspirational. We will tolerate
small violations in `citation-evidence/src/` and reconcile during extraction.
---
## 9. Change process
Changes to this document are changes to the contract.
- Small additions (a new enum value, a new event type) can be made in a single
PR that updates this doc + the type definitions + at least one consumer.
- Breaking changes (renaming an entity, removing a state, changing an
ownership split) require a short ADR in `docs/decisions/` and a heads-up
progress event on the state-hub.
---
## 10. Pending ADRs that will affect this document
These will be listed in `docs/decisions/` once written. Until then, this document
reflects the current best understanding from the architecture overview.
- **ADR-0001** — Umbrella-first MVP strategy (decided 2026-05-24, this session).
- **ADR-0002** — Monorepo vs polyrepo packaging (pending).
- **ADR-0003** — W3C Web Annotation: lossy mapping vs round-trip guarantee (pending).
- **ADR-0004** — PDF viewer library choice: `react-pdf-highlighter-plus` vs PDF.js direct (pending).
- **ADR-0005** — Persistence: local-first SQLite vs Postgres from day one (pending).
- **ADR-0006** — Selector ownership split (types in engine, algorithms in anchor) (pending — implied here).