generated from coulomb/repo-seed
- INTENT.md: declare umbrella as the home for shared contracts; document umbrella-first MVP decision (code lives here until subsystems stabilize) - wiki/SharedContracts.md: vocabulary, state enums, relation types, selector taxonomy, event vocabulary, viewer adapter contract, canonical text normalization, rect-registry contract - wiki/DependencyMap.md: allowed dependency edges; folder layout + lint-rule strategy during umbrella-first phase - history/2026-05-24-initial-assessment.md: alignment review, technical risks, and the umbrella-first pivot rationale - workplans/CE-WP-0001..0004: four ralph-compatible workplans covering foundations, PDF review slice, form binding + visual guide, and citation card export — implementing PRD §20 end-to-end Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Initial Assessment: citation-evidence ecosystem

**Date:** 2026-05-24
**Author:** Claude (Opus 4.7), commissioned by Bernd
**Scope:** Review of the `citation-evidence` umbrella PRD and Architecture overview, plus all five sister-repo `INTENT.md` files, for alignment, risk, and recommended approach.

---

## 1. Overall alignment across the six INTENT.md files

The vocabulary is impressively coherent: every repo speaks of `Document → DocumentRepresentation → Annotation → Selector → EvidenceItem → EvidenceLink → CitationCard`. Each `INTENT.md` follows the same Purpose / Scope / Out-of-Scope / Architectural Position / First-Useful-Version / Success Criteria shape. The out-of-scope sections show the authors deliberately *pushing* responsibilities into other repos, which is a healthy signal.

The PRD and Architecture overview in `citation-evidence/wiki/` are also internally consistent: the PRD's functional requirements map cleanly to the architecture's data flows and to subsystem scopes.

However, the documents were authored in quick succession (all on 2026-05-24, within ~30 minutes of each other based on file timestamps) and were **never reconciled against each other**, which produced the issues below.

## 2. What should be improved

### 2.1 Concrete ownership ambiguities to resolve in short ADRs

| Concept | Conflict |
|---|---|
| **`Selector` types** | `citation-engine` claims it as a "key concept owned"; `evidence-anchor`'s scope lists "selector type definitions". Likely fix: *interfaces* in engine, *creation/resolution/algorithms* in anchor. |
| **`EvidenceLink` / `EvidenceSet`** | Engine claims both as owned domain types; `evidence-binder` lists "evidence-to-target binding model" and "evidence sets" in scope. The same engine-defines-type / binder-owns-behavior split is needed. |
| **Status enums** | Architecture's `EvidenceItem.status` is `candidate\|confirmed\|rejected\|needs-check`. `citation-work` adds `strong-support\|weak-support\|contradicts`. `evidence-binder` adds *target-specific* states (`conflicting-evidence`, `insufficient-evidence`, `verified`) plus extra relations (`context-for`, `derived-from`, `needs-check`). Three repos are inventing overlapping enums, and `needs-check` already appears both as a status and as a relation. |
| **Viewer adapters** | Architecture diagram shows them as a separate box with no owner. Adapter methods (`load`, `createSelectorsFromSelection`, `resolveSelectors`, `scrollToResolvedTarget`, `renderHighlight`) straddle `evidence-source` and `evidence-anchor`. Pick one home (likely `evidence-anchor`, with `evidence-source` providing the representation). |
| **`CitationRecoveryAttempt`** | Type in engine, behavior in `evidence-source`: a semantic ownership split that will rot. |
| **Document review status (FR-006)** | No repo claims it; `citation-work` hints it "may later be moved into a shared model". |
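
The enum overlap in the table can be made visible in code. The following is a minimal sketch, assuming each vocabulary is declared once as a `const` array with its union type derived from it. The constant names and the `findOverlaps` helper are hypothetical; only the string values come from the table above.

```typescript
// Hypothetical shared-vocabulary module. Constant and function names are
// illustrative assumptions; only the string values come from the table above.

// Architecture doc: EvidenceItem.status
const EVIDENCE_STATUSES = ["candidate", "confirmed", "rejected", "needs-check"] as const;
// citation-work additions
const SUPPORT_LEVELS = ["strong-support", "weak-support", "contradicts"] as const;
// evidence-binder: target-specific states
const TARGET_STATES = ["conflicting-evidence", "insufficient-evidence", "verified"] as const;
// evidence-binder: extra relations
const BINDER_RELATIONS = ["context-for", "derived-from", "needs-check"] as const;

// Union types derived from the arrays, so values and types cannot drift apart.
type EvidenceStatus = (typeof EVIDENCE_STATUSES)[number];
type SupportLevel = (typeof SUPPORT_LEVELS)[number];
type TargetState = (typeof TARGET_STATES)[number];
type BinderRelation = (typeof BINDER_RELATIONS)[number];

// Returns every value that occurs more than once across the given vocabularies.
function findOverlaps(vocabularies: ReadonlyArray<ReadonlyArray<string>>): string[] {
  const seen: string[] = [];
  const overlaps: string[] = [];
  for (const vocabulary of vocabularies) {
    for (const value of vocabulary) {
      if (seen.indexOf(value) === -1) {
        seen.push(value);
      } else if (overlaps.indexOf(value) === -1) {
        overlaps.push(value);
      }
    }
  }
  return overlaps;
}

const vocabularyOverlaps = findOverlaps([
  EVIDENCE_STATUSES,
  SUPPORT_LEVELS,
  TARGET_STATES,
  BINDER_RELATIONS,
]);
```

Run over the four lists as they stand today, the check flags `needs-check`. A test of this kind could guard the shared contract while the ADRs are still open.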

### 2.2 Repository scaffolding gaps

- The umbrella architecture (§3.1) promises `apps/workspace-demo/`, `docs/decisions/`, `integration-tests/`, and `docker-compose.yml`; none of these exist yet.
- All six READMEs are essentially empty (1 line). New contributors and agents won't know where to start.
- `citation-evidence` is **not registered in the state-hub**. For a project split across six repos, that means losing central memory of decisions, dependencies, and progress.

### 2.3 Architectural decisions still pending

ADR-001 through ADR-005 in the architecture doc are framed as "recommendations" rather than commitments. Each blocks code:

- React-first vs web-component-first (drives repo packaging)
- Local-first vs server-first storage (drives persistence interface shape)
- W3C internal model vs mapping (drives every type definition)
- `react-pdf-highlighter-plus` vs PDF.js direct (drives MVP timeline by weeks)
- Recovery scope local-only vs external

### 2.4 Missing cross-repo contract artefacts

There is no central dependency map. Each repo says "I expect to depend on X", but nothing names which repo *publishes* the shared types package(s). Choose between a monorepo (pnpm workspace) and a polyrepo with published npm packages such as `@citation-evidence/engine` before the first commit of code lands; switching later is painful.
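
A central dependency map can start as plain data plus a cycle check. The sketch below is hypothetical: `DependencyMap`, `findCycle`, and both example maps are assumptions for illustration. The first map encodes the bidirectional edges that the INTENT files currently imply. The second shows one possible acyclic arrangement, not a decided one.

```typescript
// Hypothetical dependency map: each key is a package, each value lists the
// packages it imports. Names and example edges are assumptions.
type DependencyMap = Record<string, readonly string[]>;

// Depth-first search. Returns one cycle as a path whose first node is
// repeated at the end, or null when the graph is acyclic.
function findCycle(deps: DependencyMap): string[] | null {
  const state: Record<string, "visiting" | "done"> = {};
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    if (state[node] === "done") return null;
    if (state[node] === "visiting") {
      return path.slice(path.indexOf(node)).concat(node);
    }
    state[node] = "visiting";
    path.push(node);
    for (const next of deps[node] || []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    path.pop();
    state[node] = "done";
    return null;
  };

  for (const node of Object.keys(deps)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

// Edges as the INTENT files read today: engine both defines types and
// reaches back into the repos that implement behavior.
const intentEdges: DependencyMap = {
  engine: ["anchor", "source", "binder"],
  anchor: ["engine"],
  source: ["engine"],
  binder: ["engine"],
};

// One possible acyclic arrangement (an assumption): engine imports nothing.
const typesDownwardEdges: DependencyMap = {
  engine: [],
  anchor: ["engine"],
  source: ["engine"],
  binder: ["engine", "anchor"],
  work: ["engine", "anchor", "source", "binder"],
};

const intentCycle = findCycle(intentEdges);
const downwardCycle = findCycle(typesDownwardEdges);
```

With these inputs, `intentCycle` reports the engine/anchor loop and `downwardCycle` is `null`. The same check works whether the edges come from a hand-maintained file or from `package.json` manifests.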

## 3. Technical risks to inspect first

In rough order of "if this is broken, the architecture doesn't work":

1. **PDF canonical-text stability**: the entire selector/anchor model assumes that a given PDF plus extraction pipeline produces *the same* canonical text every time. PDF.js text extraction has known issues with multi-column layouts, custom-glyph fonts, ligatures, soft hyphens, and reading order. Build a corpus of 15-20 representative PDFs (governmental forms, two-column papers, scanned-then-OCR'd documents, German umlauts) and confirm round-trip selector resolution before committing to the model.

2. **`react-pdf-highlighter-plus` abstraction leakage**: this library is opinionated; wrapping it cleanly while keeping the engine viewer-independent is the central architectural test. Do a focused spike: load PDF → select → store selectors as JSON → reload page → resolve from JSON → highlight. If this leaks PDF.js types into the engine API, the boundary fails on day one.

3. **Canonical-text normalization is a silent migration**: every stored annotation's `TextQuoteSelector` / `TextPositionSelector` depends on the *exact* normalization rules used at creation time. Treat normalization as a versioned, deterministic function from day one. If Unicode normalization or whitespace handling changes later, every stored annotation breaks silently.

4. **Visual guide overlay coupling**: `evidence-binder` owns the visual-guide *model*, but rendering needs DOM rects from three sources: the form (binder's UI?), the evidence sidebar (`citation-work`), and the document highlight (viewer adapter). Three subsystems contributing rects to one overlay is the highest-coupling part of the system. Define an explicit *rect registry* contract before any of them ships UI.

5. **CSS Custom Highlight API support**: the architecture mentions it for HTML/Markdown with a fallback. Browser support is uneven, so the fallback (usually DOM range-based span wrapping) is what runs wherever the API is missing. Verify that the fallback path is acceptable on its own instead of testing only the optimistic primary path.

6. **W3C Web Annotation mapping is not free**: JSON-LD selectors can express things the internal model can't (and vice versa). Round-tripping is a research task, not a one-day mapping. Decide whether the mapping is "lossy but useful" or "MUST round-trip" before stabilizing types.

7. **Multi-repo dependency cycle risk**: engine ↔ anchor (`Selector` ownership), engine ↔ source (`RecoveryAttempt`), and engine ↔ binder (`Link`/`Set`) all currently look bidirectional in the INTENT files. Without a strict "types-only flow downward, behavior flows upward" rule, you will hit `npm install` cycles.
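
Risk 3 asks for normalization to be a versioned, deterministic function. A minimal sketch of that idea follows. All names (`NORMALIZERS`, `canonicalize`, `quoteMatches`) and the concrete `v1` rules are assumptions for illustration; the real rule set still has to be decided against the PDF corpus.

```typescript
// Hypothetical sketch: names and the concrete v1 rules are assumptions.
type Normalizer = (raw: string) => string;

// Registry of frozen rule sets. A published entry is never edited;
// a rule change is added as a new version instead.
const NORMALIZERS = new Map<string, Normalizer>([
  [
    "v1",
    (raw: string) =>
      raw
        .normalize("NFC") // compose combining marks, e.g. "u" + U+0308
        .replace(/\u00AD/g, "") // drop soft hyphens
        .replace(/\s+/g, " ") // collapse whitespace runs, including line breaks
        .trim(),
  ],
]);

interface CanonicalText {
  readonly normalization: string; // version tag, stored next to every selector
  readonly text: string;
}

function canonicalize(raw: string, version: string): CanonicalText {
  const rule = NORMALIZERS.get(version);
  if (rule === undefined) {
    throw new Error(`Unknown normalization version: ${version}`);
  }
  return { normalization: version, text: rule(raw) };
}

// A stored quote may only be compared with text produced by the same version;
// a mismatch fails loudly instead of silently failing to resolve.
function quoteMatches(
  stored: { exact: string; normalization: string },
  doc: CanonicalText,
): boolean {
  if (stored.normalization !== doc.normalization) {
    throw new Error(
      `Normalization mismatch: ${stored.normalization} vs ${doc.normalization}`,
    );
  }
  return doc.text.indexOf(stored.exact) !== -1;
}
```

Note that NFC leaves ligature code points such as U+FB01 untouched. Whether to use a compatibility form like NFKC instead is exactly the kind of decision that belongs in a new version, not in an edit to `v1`.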

## 4. Rough approach (original phased plan)

**Phase 0: Foundations (1-2 weeks, no production code)**
- Register `citation-evidence` as a state-hub domain and register all six repos
- Write 5-7 micro-ADRs in `citation-evidence/docs/decisions/` resolving the ownership ambiguities above
- Pick monorepo vs polyrepo and pin the Node/TS toolchain
- Assemble a 15-20 PDF test corpus and check it into a fixtures location
- Write a real README for each repo pointing at INTENT + architecture

**Phase 1: Vertical slice on the easiest format (4-6 weeks)**
- Engine: TS types + in-memory repos only
- Anchor: text-quote + text-position selectors, fuzzy match deferred
- Source: PDF text extraction + fingerprint only
- Work: one-document UI, sidebar, create annotation, click-to-reopen
- Umbrella: wire it into a reference app
- Goal: prove viewer-independence on PDFs end-to-end. No forms, no recovery, no Markdown.
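
The Phase 1 anchor scope (text-quote plus text-position selectors, stored as JSON and resolved after reload) can be sketched against plain canonical text, with no viewer types involved. The selector field names follow the W3C Web Annotation selectors. The function names and `CONTEXT_CHARS` are hypothetical, and matching is exact only, since fuzzy matching is deferred. The stored selectors survive a `JSON.stringify` / `JSON.parse` round trip because they are plain objects.

```typescript
// Hypothetical sketch. Field names follow the W3C Web Annotation selectors;
// function names and CONTEXT_CHARS are assumptions.
interface TextPositionSelector {
  type: "TextPositionSelector";
  start: number;
  end: number;
}
interface TextQuoteSelector {
  type: "TextQuoteSelector";
  exact: string;
  prefix: string;
  suffix: string;
}
type StoredSelector = TextPositionSelector | TextQuoteSelector;
interface TextRange {
  start: number;
  end: number;
}

const CONTEXT_CHARS = 16; // how much surrounding text is kept for disambiguation

function createSelectors(text: string, range: TextRange): StoredSelector[] {
  return [
    { type: "TextPositionSelector", start: range.start, end: range.end },
    {
      type: "TextQuoteSelector",
      exact: text.slice(range.start, range.end),
      prefix: text.slice(Math.max(0, range.start - CONTEXT_CHARS), range.start),
      suffix: text.slice(range.end, range.end + CONTEXT_CHARS),
    },
  ];
}

function resolveSelectors(text: string, selectors: StoredSelector[]): TextRange | null {
  let position: TextPositionSelector | undefined;
  let quote: TextQuoteSelector | undefined;
  for (const selector of selectors) {
    if (selector.type === "TextPositionSelector") position = selector;
    else quote = selector;
  }
  if (quote === undefined) {
    return position ? { start: position.start, end: position.end } : null;
  }
  // Fast path: the stored offsets still point at the quoted text.
  if (position && text.slice(position.start, position.end) === quote.exact) {
    return { start: position.start, end: position.end };
  }
  // Fallback 1: search for the quote together with its stored context.
  const withContext = text.indexOf(quote.prefix + quote.exact + quote.suffix);
  if (withContext !== -1) {
    const start = withContext + quote.prefix.length;
    return { start, end: start + quote.exact.length };
  }
  // Fallback 2: bare quote, accepted only when it occurs exactly once.
  const first = text.indexOf(quote.exact);
  if (first !== -1 && text.indexOf(quote.exact, first + 1) === -1) {
    return { start: first, end: first + quote.exact.length };
  }
  return null;
}
```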

**Phase 2: Evidence binding & form mode (4 weeks)**
- Binder + visual-guide rect registry
- One form-schema example with side-by-side viewer
- This is where the active-state coordination claim gets stress-tested
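
The rect registry named in Phase 2 could take roughly the following shape: each of the three rect sources publishes into one shared registry, and the overlay reads from it without importing any of them. This is a sketch under stated assumptions, not the contract in `SharedContracts.md`; every name below (`RectRegistry`, `publish`, `rectsFor`, `isComplete`, `linkId`) is hypothetical.

```typescript
// Hypothetical contract: all names here are illustrative assumptions.
type RectSource = "form" | "sidebar" | "document";

// Plain numbers only, structurally compatible with the x/y/width/height
// fields of a DOMRect, so the registry itself needs no DOM types.
interface ViewportRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

interface RectEntry {
  source: RectSource;
  linkId: string; // the evidence link this rect belongs to
  rect: ViewportRect;
}

class RectRegistry {
  private readonly entries = new Map<string, RectEntry>();
  private listeners: Array<() => void> = [];

  // Publish (or replace) the rect for one source/link pair.
  // Returns a function that withdraws this exact entry again.
  publish(entry: RectEntry): () => void {
    const key = `${entry.source}:${entry.linkId}`;
    this.entries.set(key, entry);
    this.notify();
    return () => {
      if (this.entries.get(key) === entry) {
        this.entries.delete(key);
        this.notify();
      }
    };
  }

  // All rects currently published for one link.
  rectsFor(linkId: string): RectEntry[] {
    const found: RectEntry[] = [];
    this.entries.forEach((entry) => {
      if (entry.linkId === linkId) found.push(entry);
    });
    return found;
  }

  // The overlay can draw a full guide only when every source has reported.
  isComplete(linkId: string): boolean {
    const sources = this.rectsFor(linkId).map((entry) => entry.source);
    const required: RectSource[] = ["form", "sidebar", "document"];
    return required.every((source) => sources.indexOf(source) !== -1);
  }

  // The overlay subscribes once and re-reads the registry on every change.
  subscribe(listener: () => void): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  private notify(): void {
    this.listeners.slice().forEach((listener) => listener());
  }
}
```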

**Phase 3: Format expansion (4 weeks)**
- HTML adapter (sanitization + DOM range selectors)
- Markdown adapter
- Confirms the format-neutral claim
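
For the HTML and Markdown adapters, the choice between the CSS Custom Highlight API and the span-wrapping fallback from risk 5 can be isolated in one small, testable function. `CSS.highlights` and the `Highlight` constructor are the browser globals that the API defines; `HighlightEnv`, `HighlightStrategy`, and `chooseHighlightStrategy` are hypothetical names. In a browser the function would be called with the global object. Passing the environment in as a parameter lets the fallback branch be exercised in tests without a DOM.

```typescript
// Hypothetical sketch: type and function names are assumptions.
type HighlightStrategy = "css-custom-highlight" | "span-wrapping";

// Minimal structural view of the globals being probed, so that this module
// does not depend on DOM typings.
interface HighlightEnv {
  CSS?: { highlights?: unknown };
  Highlight?: unknown;
}

function chooseHighlightStrategy(env: HighlightEnv): HighlightStrategy {
  const hasConstructor = typeof env.Highlight === "function";
  const hasRegistry = env.CSS !== undefined && env.CSS.highlights !== undefined;
  // Both parts of the API must be present; otherwise wrap ranges in spans.
  return hasConstructor && hasRegistry ? "css-custom-highlight" : "span-wrapping";
}
```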

**Phase 4: Local citation recovery (4 weeks)**
- Local-library search, exact + fuzzy quote match, confirmation UI
- Defer external source lookup until the local pipeline is reliable

## 5. Pivot: umbrella-first MVP (decided 2026-05-24)

The user has chosen to **build the MVP entirely inside `citation-evidence`** before segmenting code into the sister repos. The reasoning: get the product working end-to-end with minimal coordination cost, then extract subsystems once the contracts have been validated by actual use.

This means:

- All MVP source code lives under `citation-evidence/` (likely `src/` partitioned by future-repo names: `engine/`, `anchor/`, `source/`, `work/`, `binder/`).
- The five sister repos remain INTENT-only placeholders during the MVP. They document the intended boundaries, and code moves in only when a subsystem's contract has stabilized.
- Interface design is explicitly deferred. Phase-0 ADRs become Phase-N extractions, informed by real friction points.
- Shared contracts live in `citation-evidence/wiki/SharedContracts.md` and `citation-evidence/wiki/DependencyMap.md`.

This trade-off accepts more rework later (when subsystems are extracted) in exchange for faster MVP velocity now and better-informed boundaries when extraction happens.