citation-evidence/docs/decisions/ADR-0008-session-archive-format.md
tegwick 779ae0d317 Implement CE-WP-0005 T01-T08: demo app — sessions, uploads, ZIP archive
Turn the MVP into a self-contained demo. Users now:
  1. Land on an empty-state and create a named session.
  2. Drag-drop or pick arbitrary PDFs into that session.
  3. Annotate, build evidence, link to form fields — all session-scoped.
  4. Export the whole session as a single .zip archive (manifest +
     per-document PDFs).
  5. Import a .zip back — into a new session, or merged into an
     existing one (documents deduped by SHA-256 fingerprint;
     annotations/evidence/links added additively).

Architecture:
- New shared types: SessionId, Session, SessionArchiveManifest +
  parseSessionArchiveManifest with schema-version validation.
- SessionService (engine/services/sessions.ts) handles lifecycle
  (create/rename/delete/setActive) + emits 4 new events through its
  own bus; SharedContracts.md §4 lists the additions.
- SessionProvider (work/SessionContext.tsx) owns the cross-session
  state: service, per-session PdfByteStore registry, per-session
  version counter that drives EngineProvider remounts after imports.
- EngineProvider becomes session-aware (sessionId prop drives per-
  session localStorage keys). Bumping engineRevision after
  restoreFromStorage forces consumers to re-render so restored repos
  show up immediately.
- PdfByteStore (source/pdf/byte-store.ts) holds Uint8Array bytes per
  document and mints blob URLs; ingestPdfFromFile is the upload
  entry-point that wraps the existing ingestPdf pipeline.
- ADR-0008 locks the ZIP layout (manifest.json + documents/<id>.pdf),
  the manifest schema (schemaVersion 1), and the merge-on-collision
  policy. JSZip is the only new dependency.
- App.tsx restructured: SessionProvider at the root, EngineProvider
  keyed by ${sessionId}:${version}, hash routing #/s/<id>[/forms/demo],
  SessionMenu top-bar, CreateFirstSession empty state.
- New DocumentRemoved event for per-document delete cleanup in
  CollectionList; engine.documents.remove() is the new service method.

Tests:
- Unit: 16 SessionService lifecycle + persistence tests;
  per-session snapshot round-trip; PdfByteStore + ingestPdfFromFile;
  SessionArchive parser; exportSessionZip + importSessionZip with
  create + merge + corrupt-archive paths.
- DOM: UploadDropzone, session-scoped CollectionList delete,
  SessionMenu create/switch/rename, routing parser.
- E2E: tests/integration/session-export-reimport.dom.test.tsx walks
  the full create → annotate → export → reimport flow and asserts
  the additive merge (deduped doc + doubled evidence rows).
- Legacy E2Es updated to use a seed-session helper instead of the
  removed fixture-button flow.

Known limitation (documented in ADR-0008): re-importing your own
freshly-exported ZIP creates duplicate annotations. Forward pointer
left for an importBundleId follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 14:57:28 +02:00


# ADR-0008 — Session archive format (ZIP layout, manifest schema, merge policy)
- Status: accepted
- Date: 2026-05-25
- Workplan: CE-WP-0005-T05 (schema), CE-WP-0005-T06 (export),
CE-WP-0005-T07 (import)
- Spec refs: `wiki/ProductRequirementsDocument.md` §20,
`wiki/ArchitectureOverview.md` §3.4, §14.3
## Context
The CE-WP-0005 demo loop ends with a user exporting an entire session
(documents, annotations, evidence, links) into a single `.zip`
archive and importing it back later. The archive is the **only**
persistence mechanism the demo provides that outlives a closed tab
(there is no IndexedDB in this workplan), so its shape needs to be
locked before two parallel tasks (T06, T07) and the integration test
(T08) land on top of it.
Three things need a written contract:
1. **ZIP layout** — what files live in the archive, named how.
2. **manifest.json shape** — versioned JSON schema, validated on
import.
3. **Conflict policy** — what happens when an imported session's name
already exists in the receiving repository.
## Decision
### ZIP layout
```
manifest.json
documents/
  <documentId>.pdf
```
- `<documentId>` is the engine-minted branded id (`doc_<uuid>`). Using
it as the filename means the manifest's `documentBindings[i]` can
cross-reference the binary file without an additional lookup table.
- Per-representation files (e.g. an extracted-text JSON alongside each
PDF) are intentionally deferred. The canonical text + selectors are
embedded in the engine snapshot inside `manifest.json`, so a
re-import can rebuild everything from the manifest plus the PDF binary.
- Future archive variants (multi-attachment documents, Markdown
documents) extend by adding subdirectories under the archive root.
Importers must ignore unknown top-level entries so older clients
remain compatible with newer archives that add new file types.
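
The layout rules above can be sketched as a small path classifier. This is a minimal illustration, not the importer's actual code: `classifyArchiveEntry`, `documentPath`, and the `ArchiveEntry` type are hypothetical names, and the sketch assumes entry paths arrive as plain strings with `/` separators.

```typescript
// Hypothetical helper names; only the layout itself comes from this ADR.
type ArchiveEntry =
  | { kind: "manifest" }
  | { kind: "document"; documentId: string }
  | { kind: "ignored"; path: string };

// Matches documents/<documentId>.pdf where the id has the doc_ prefix.
const DOCUMENT_PATH = /^documents\/(doc_[^/]+)\.pdf$/;

// Build the archive path for a document binary at export time.
function documentPath(documentId: string): string {
  return `documents/${documentId}.pdf`;
}

// Decide what an importer should do with one archive entry.
function classifyArchiveEntry(path: string): ArchiveEntry {
  if (path === "manifest.json") return { kind: "manifest" };
  const documentId = DOCUMENT_PATH.exec(path)?.[1];
  if (documentId !== undefined) return { kind: "document", documentId };
  // Unknown entries (future subdirectories, directory markers) are
  // skipped rather than rejected, so newer archives still import.
  return { kind: "ignored", path };
}
```

Returning an explicit `ignored` variant instead of throwing is one way to keep older importers tolerant of archives that carry extra file types.
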
### `manifest.json` shape (schemaVersion 1)
```ts
interface SessionArchiveManifest {
  schemaVersion: 1;
  exportedAt: string;        // ISO-8601 UTC timestamp
  session: {
    id: SessionId;           // sess_<uuid>
    name: string;            // trimmed display name
    createdAt: string;       // ISO-8601
    updatedAt: string;       // ISO-8601
  };
  engine: EngineSnapshot;    // shape from src/engine/persistence.ts
  documentBindings: Array<{
    documentId: DocumentId;  // matches the engine's record
    filename: string;        // original filename from upload
    fingerprint: string;     // SHA-256; used by the importer for dedup
  }>;
}
```
The `engine` field is the same shape that `captureSnapshot()` produces
in `src/engine/persistence.ts`. Re-using it verbatim keeps the
in-memory ↔ archive round-trip a single conversion (snapshot ↔
JSON) instead of growing a parallel schema that would drift.

Unknown fields at the top level **must be preserved** on import (a
newer client may have written them), but unknown fields inside
`session` or `documentBindings[i]` are dropped: the import constructs
typed domain objects from the validated subset.
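
The validation and unknown-field rules can be sketched as follows. This is an illustration under stated assumptions, not the real `parseSessionArchiveManifest`: the names `parseManifestSketch`, `ManifestSketch`, and `extras` are hypothetical, ids are plain strings instead of branded types, and `engine` is left opaque rather than validated as an `EngineSnapshot`.

```typescript
// Sketch only; simplified types stand in for the real branded ids.
interface BindingSketch {
  documentId: string;
  filename: string;
  fingerprint: string;
}

interface ManifestSketch {
  schemaVersion: 1;
  exportedAt: string;
  session: { id: string; name: string; createdAt: string; updatedAt: string };
  engine: unknown; // opaque here; the ADR types this as EngineSnapshot
  documentBindings: BindingSketch[];
  extras: Record<string, unknown>; // unknown top-level fields, carried along
}

const KNOWN_TOP_LEVEL = new Set([
  "schemaVersion", "exportedAt", "session", "engine", "documentBindings",
]);

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function str(obj: Record<string, unknown>, key: string): string {
  const value = obj[key];
  if (typeof value !== "string") throw new Error(`manifest: "${key}" must be a string`);
  return value;
}

function parseManifestSketch(json: string): ManifestSketch {
  const raw: unknown = JSON.parse(json);
  if (!isRecord(raw)) throw new Error("manifest: expected a JSON object");
  const version = raw["schemaVersion"];
  if (version !== 1) throw new Error(`manifest: unsupported schemaVersion ${String(version)}`);
  const session = raw["session"];
  if (!isRecord(session)) throw new Error("manifest: session must be an object");
  const bindings = raw["documentBindings"];
  if (!Array.isArray(bindings)) throw new Error("manifest: documentBindings must be an array");

  // Top-level keys this version does not know are preserved, not discarded.
  const extras: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(raw)) {
    if (!KNOWN_TOP_LEVEL.has(key)) extras[key] = value;
  }

  return {
    schemaVersion: 1,
    exportedAt: str(raw, "exportedAt"),
    // Rebuilding nested objects from named keys drops their unknown fields.
    session: {
      id: str(session, "id"),
      name: str(session, "name"),
      createdAt: str(session, "createdAt"),
      updatedAt: str(session, "updatedAt"),
    },
    engine: raw["engine"],
    documentBindings: bindings.map((entry: unknown) => {
      if (!isRecord(entry)) throw new Error("manifest: binding must be an object");
      return {
        documentId: str(entry, "documentId"),
        filename: str(entry, "filename"),
        fingerprint: str(entry, "fingerprint"),
      };
    }),
    extras,
  };
}
```

Keeping the preserved keys in a separate `extras` bag is one possible design; it makes the "preserve at the top level, drop when nested" asymmetry visible in the type.
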
### Merge-on-name-collision policy (T07)
When an imported manifest's `session.name` matches an existing
session, the existing session is the **target** (`outcome:
"merged-into"`). Otherwise a fresh session is created with the
imported name (`outcome: "created"`).
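
The target selection can be sketched in a few lines. `resolveImportTarget`, `SessionSummary`, and `ImportTarget` are hypothetical names; only the two `outcome` values come from this ADR. The sketch assumes an exact, case-sensitive name comparison, which the ADR does not specify.

```typescript
// Hypothetical shapes for illustration.
interface SessionSummary {
  id: string;
  name: string;
}

type ImportTarget =
  | { outcome: "merged-into"; sessionId: string }
  | { outcome: "created"; name: string };

// Pick the receiving session for an imported manifest.
// Assumption: names are compared exactly (case-sensitive, no normalization).
function resolveImportTarget(
  existing: readonly SessionSummary[],
  importedName: string,
): ImportTarget {
  const match = existing.find((session) => session.name === importedName);
  return match
    ? { outcome: "merged-into", sessionId: match.id }
    : { outcome: "created", name: importedName };
}
```
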
Within the target session:
- **Documents** are deduped by `fingerprint` (SHA-256 over the PDF
bytes). If a document with the same fingerprint already exists,
the import keeps the existing `documentId` and records a remap
from the incoming id. The binary file is **skipped** (we already
have the bytes). Otherwise a fresh `documentId` is minted and the
bytes go into the per-session byte store.
- **Annotations**, **evidence items**, and **evidence links** are
imported **additively**: each gets a freshly minted id, with any
`documentId`/`annotationId`/`evidenceItemId` references rewritten
via the remap. No update-in-place, no overwrite-by-id.
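
The two rules above can be sketched as a pair of pure functions. This is a simplified illustration, not the T07 importer: `planDocumentMerge`, `importAnnotations`, and the `MintId` callback are hypothetical names, annotations stand in for all three additive entity kinds, and ids are plain strings. Treating a fingerprint repeated within one archive as already known is an extra assumption the ADR does not state.

```typescript
// Hypothetical shapes for illustration.
interface Binding {
  documentId: string;
  filename: string;
  fingerprint: string;
}

interface Annotation {
  id: string;
  documentId: string;
}

type MintId = (prefix: string) => string;

interface DocumentMergePlan {
  remap: Map<string, string>; // incoming documentId -> id in the target session
  toStore: Array<{ incomingId: string; targetId: string }>; // binaries to copy
}

// Dedupe incoming documents against the target session by fingerprint.
function planDocumentMerge(
  existing: readonly Binding[],
  incoming: readonly Binding[],
  mintId: MintId,
): DocumentMergePlan {
  const byFingerprint = new Map<string, string>();
  for (const binding of existing) byFingerprint.set(binding.fingerprint, binding.documentId);

  const remap = new Map<string, string>();
  const toStore: DocumentMergePlan["toStore"] = [];
  for (const binding of incoming) {
    const known = byFingerprint.get(binding.fingerprint);
    if (known !== undefined) {
      // Same bytes already present: keep the existing id, skip the binary.
      remap.set(binding.documentId, known);
      continue;
    }
    const fresh = mintId("doc");
    remap.set(binding.documentId, fresh);
    byFingerprint.set(binding.fingerprint, fresh); // assumption: dedupe within the archive too
    toStore.push({ incomingId: binding.documentId, targetId: fresh });
  }
  return { remap, toStore };
}

// Additive import: every annotation gets a fresh id and a rewritten documentId.
function importAnnotations(
  incoming: readonly Annotation[],
  remap: ReadonlyMap<string, string>,
  mintId: MintId,
): Annotation[] {
  return incoming.map((annotation) => {
    const documentId = remap.get(annotation.documentId);
    if (documentId === undefined) {
      throw new Error(`no remap entry for ${annotation.documentId}`);
    }
    return { ...annotation, id: mintId("ann"), documentId };
  });
}
```

Evidence items and evidence links would follow the same pattern, with an `annotationId` and `evidenceItemId` remap built up as each layer is imported.
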
#### Known limitation: re-importing your own export duplicates annotations
Because annotations/evidence/links are always added with fresh ids,
re-importing a ZIP you just exported into the same session creates a
second copy of every annotation (the existing PDF bytes dedupe
correctly via fingerprint, but the annotations have nothing to
de-dupe against).
This is intentional for the demo loop and documented here so it's not
mistaken for a bug. A future workplan can introduce an
`importBundleId` field (a UUID minted at export time, stamped onto
the manifest and on every annotation/evidence-link the import
creates) plus a dedupe pass that skips entities already imported
under the same bundle id.
## Consequences
- **One source of truth for the engine snapshot.** Same shape on disk
and in memory; the persistence helpers stay re-usable.
- **Fingerprint-based dedup is byte-stable.** Two users uploading
byte-identical copies of the same PDF end up with identical
fingerprints, so merging their archives dedupes the document.
- **Import is not idempotent by default.** The basic T07 import always
adds; a user who wants exact round-trips must wait for a future
`importBundleId` flow.
- **Forward-compatible additions are cheap.** New top-level keys land
by adding fields; old importers preserve them and new importers
consume them.
## Status
Accepted. The TypeScript types + `parseSessionArchiveManifest` in
`src/shared/session-archive.ts` are the executable contract for
schemaVersion 1.