citation-evidence/docs/decisions/ADR-0008-session-archive-format.md
tegwick 779ae0d317 Implement CE-WP-0005 T01-T08: demo app — sessions, uploads, ZIP archive
Turn the MVP into a self-contained demo. Users now:
  1. Land on an empty-state and create a named session.
  2. Drag-drop or pick arbitrary PDFs into that session.
  3. Annotate, build evidence, link to form fields — all session-scoped.
  4. Export the whole session as a single .zip archive (manifest +
     per-document PDFs).
  5. Import a .zip back — into a new session, or merged into an
     existing one (documents deduped by SHA-256 fingerprint;
     annotations/evidence/links added additively).

Architecture:
- New shared types: SessionId, Session, SessionArchiveManifest +
  parseSessionArchiveManifest with schema-version validation.
- SessionService (engine/services/sessions.ts) handles lifecycle
  (create/rename/delete/setActive) + emits 4 new events through its
  own bus; SharedContracts.md §4 lists the additions.
- SessionProvider (work/SessionContext.tsx) owns the cross-session
  state: service, per-session PdfByteStore registry, per-session
  version counter that drives EngineProvider remounts after imports.
- EngineProvider becomes session-aware (sessionId prop drives per-
  session localStorage keys). Bumping engineRevision after
  restoreFromStorage forces consumers to re-render so restored repos
  show up immediately.
- PdfByteStore (source/pdf/byte-store.ts) holds Uint8Array bytes per
  document and mints blob URLs; ingestPdfFromFile is the upload
  entry-point that wraps the existing ingestPdf pipeline.
- ADR-0008 locks the ZIP layout (manifest.json + documents/<id>.pdf),
  the manifest schema (schemaVersion 1), and the merge-on-collision
  policy. JSZip is the only new dependency.
- App.tsx restructured: SessionProvider at the root, EngineProvider
  keyed by ${sessionId}:${version}, hash routing #/s/<id>[/forms/demo],
  SessionMenu top-bar, CreateFirstSession empty state.
- New DocumentRemoved event for per-document delete cleanup in
  CollectionList; engine.documents.remove() is the new service method.

Tests:
- Unit: 16 SessionService lifecycle + persistence tests;
  per-session snapshot round-trip; PdfByteStore + ingestPdfFromFile;
  SessionArchive parser; exportSessionZip + importSessionZip with
  create + merge + corrupt-archive paths.
- DOM: UploadDropzone, session-scoped CollectionList delete,
  SessionMenu create/switch/rename, routing parser.
- E2E: tests/integration/session-export-reimport.dom.test.tsx walks
  the full create → annotate → export → reimport flow and asserts
  the additive merge (deduped doc + doubled evidence rows).
- Legacy E2Es updated to use a seed-session helper instead of the
  removed fixture-button flow.

Known limitation (documented in ADR-0008): re-importing your own
freshly-exported ZIP creates duplicate annotations. Forward pointer
left for an importBundleId follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 14:57:28 +02:00


ADR-0008 — Session archive format (ZIP layout, manifest schema, merge policy)

  • Status: accepted
  • Date: 2026-05-25
  • Workplan: CE-WP-0005-T05 (schema), CE-WP-0005-T06 (export), CE-WP-0005-T07 (import)
  • Spec refs: wiki/ProductRequirementsDocument.md §20, wiki/ArchitectureOverview.md §3.4, §14.3

Context

The CE-WP-0005 demo loop ends with a user exporting an entire session (documents, annotations, evidence, links) into a single .zip archive and importing it back later. The archive has to be the only persistence mechanism the demo offers beyond a tab close (no IndexedDB in this workplan), so its shape must be locked before two parallel tasks (T06, T07) and the integration test (T08) land on top of it.

Three things need a written contract:

  1. ZIP layout — what files live in the archive, named how.
  2. manifest.json shape — versioned JSON schema, validated on import.
  3. Conflict policy — what happens when an imported session's name already exists in the receiving repository.

Decision

ZIP layout

manifest.json
documents/
  <documentId>.pdf
  • <documentId> is the engine-minted branded id (doc_<uuid>). Using it as the filename means the manifest's documentBindings[i] can cross-reference the binary file without an additional lookup table.
  • Per-representation files (e.g. an extracted-text JSON alongside each PDF) are intentionally deferred. The canonical text + selectors are embedded in the engine snapshot inside manifest.json, so a re-import needs nothing beyond the manifest and the PDF binaries.
  • Future archive variants (multi-attachment documents, Markdown documents) extend by adding subdirectories under the archive root. Importers must ignore unknown top-level entries so older clients remain compatible with newer archives that add new file types.
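
A minimal sketch of how an importer might sort archive entries under this layout, skipping anything it does not recognise. The helper name classifyArchiveEntries and its result shape are hypothetical and not taken from the codebase; it works on plain entry paths so it makes no assumption about the ZIP library's API. Treating unrecognised files inside documents/ as ignorable is also an assumption.

```typescript
// Hypothetical result shape: not part of the real importer.
interface ClassifiedEntries {
  hasManifest: boolean;
  documentIds: string[]; // ids recovered from documents/<documentId>.pdf
  ignored: string[];     // unknown entries, skipped rather than rejected
}

// Matches documents/<documentId>.pdf where the id has the doc_ prefix.
const DOCUMENT_ENTRY = /^documents\/(doc_[^/]+)\.pdf$/;

function classifyArchiveEntries(paths: readonly string[]): ClassifiedEntries {
  const result: ClassifiedEntries = { hasManifest: false, documentIds: [], ignored: [] };
  for (const path of paths) {
    if (path === "manifest.json") {
      result.hasManifest = true;
      continue;
    }
    if (path === "documents/") continue; // the directory entry itself carries no data
    const id = DOCUMENT_ENTRY.exec(path)?.[1];
    if (id !== undefined) {
      result.documentIds.push(id);
    } else {
      // Assumption: anything else is ignored so newer archives still import.
      result.ignored.push(path);
    }
  }
  return result;
}
```

An importer built this way would fail only when hasManifest is false, never because an unfamiliar file is present.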

manifest.json shape (schemaVersion 1)

interface SessionArchiveManifest {
  schemaVersion: 1;
  exportedAt: string;          // ISO-8601 UTC timestamp
  session: {
    id: SessionId;             // sess_<uuid>
    name: string;              // trimmed display name
    createdAt: string;         // ISO-8601
    updatedAt: string;         // ISO-8601
  };
  engine: EngineSnapshot;      // shape from src/engine/persistence.ts
  documentBindings: Array<{
    documentId: DocumentId;    // matches the engine's record
    filename: string;          // original filename from upload
    fingerprint: string;       // SHA-256 — used by the importer for dedup
  }>;
}

The engine field is the same shape that captureSnapshot() produces in src/engine/persistence.ts. Re-using it verbatim keeps the in-memory ↔ archive round-trip a single conversion step (snapshot ↔ JSON) instead of growing a parallel schema that would drift.

Unknown fields at the top level must be preserved on import (a future client can write them), but unknown fields inside session or documentBindings[i] are dropped: the import constructs typed domain objects from the validated subset.
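
The validation rules above could be sketched roughly as follows. This is not the real parseSessionArchiveManifest; the names parseManifestSketch, ParsedManifest and extras are hypothetical, the engine snapshot is left opaque, and ids are plain strings rather than branded types.

```typescript
type Json = Record<string, unknown>;

interface ParsedManifest {
  schemaVersion: 1;
  exportedAt: string;
  session: { id: string; name: string; createdAt: string; updatedAt: string };
  engine: unknown; // EngineSnapshot in the real code; opaque in this sketch
  documentBindings: Array<{ documentId: string; filename: string; fingerprint: string }>;
  extras: Json;    // unknown top-level fields, carried through untouched
}

const KNOWN_TOP_LEVEL = new Set(["schemaVersion", "exportedAt", "session", "engine", "documentBindings"]);

function isObject(value: unknown): value is Json {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function requireString(source: Json, key: string): string {
  const value = source[key];
  if (typeof value !== "string") throw new Error(`manifest: "${key}" must be a string`);
  return value;
}

function parseManifestSketch(raw: unknown): ParsedManifest {
  if (!isObject(raw)) throw new Error("manifest: expected a JSON object");
  if (raw["schemaVersion"] !== 1) {
    throw new Error(`manifest: unsupported schemaVersion ${String(raw["schemaVersion"])}`);
  }
  const session = raw["session"];
  if (!isObject(session)) throw new Error("manifest: session must be an object");
  const bindings = raw["documentBindings"];
  if (!Array.isArray(bindings)) throw new Error("manifest: documentBindings must be an array");

  // Preserve unknown top-level fields so a newer client's data survives.
  const extras: Json = {};
  for (const [key, value] of Object.entries(raw)) {
    if (!KNOWN_TOP_LEVEL.has(key)) extras[key] = value;
  }

  return {
    schemaVersion: 1,
    exportedAt: requireString(raw, "exportedAt"),
    // Nested objects are rebuilt field by field, which drops unknown nested keys.
    session: {
      id: requireString(session, "id"),
      name: requireString(session, "name"),
      createdAt: requireString(session, "createdAt"),
      updatedAt: requireString(session, "updatedAt"),
    },
    engine: raw["engine"],
    documentBindings: bindings.map((binding: unknown) => {
      if (!isObject(binding)) throw new Error("manifest: each documentBinding must be an object");
      return {
        documentId: requireString(binding, "documentId"),
        filename: requireString(binding, "filename"),
        fingerprint: requireString(binding, "fingerprint"),
      };
    }),
    extras,
  };
}
```

Rebuilding the nested objects field by field is what makes the "preserve at the top, drop inside" asymmetry fall out naturally.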

Merge-on-name-collision policy (T07)

When an imported manifest's session.name matches an existing session, the existing session is the target (outcome: "merged-into"). Otherwise a fresh session is created with the imported name (outcome: "created").
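
As an illustration of the rule above, a sketch of target resolution. The names resolveImportTarget, SessionSummary and ImportTarget are hypothetical; only the two outcome strings come from this ADR. Matching by exact, case-sensitive comparison after trimming is an assumption, since the ADR does not spell out how names are compared.

```typescript
interface SessionSummary {
  id: string;
  name: string;
}

type ImportTarget =
  | { outcome: "merged-into"; sessionId: string }
  | { outcome: "created"; name: string };

function resolveImportTarget(
  existing: readonly SessionSummary[],
  importedName: string,
): ImportTarget {
  // Assumption: names are compared exactly after trimming whitespace.
  const name = importedName.trim();
  const match = existing.find((session) => session.name === name);
  return match !== undefined
    ? { outcome: "merged-into", sessionId: match.id }
    : { outcome: "created", name };
}
```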

Within the target session:

  • Documents are deduped by fingerprint (SHA-256 over the PDF bytes). If a document with the same fingerprint already exists, the import keeps the existing documentId and records a remap from the incoming id. The binary file is skipped (we already have the bytes). Otherwise a fresh documentId is minted and the bytes go into the per-session byte store.
  • Annotations, evidence items, and evidence links are imported additively: each gets a freshly minted id, with any documentId/annotationId/evidenceItemId references rewritten via the remap. No update-in-place, no overwrite-by-id.
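
The two rules above can be sketched together as follows. This is a simplified model, not the T07 implementation: mergeInto, TargetSession and mintId are hypothetical names, only annotations are shown (evidence items and links would follow the same pattern), and a counter stands in for the engine's uuid minting.

```typescript
interface IncomingDoc {
  documentId: string;
  fingerprint: string; // SHA-256 over the PDF bytes
}

interface AnnotationRecord {
  id: string;
  documentId: string;
  body: string;
}

interface TargetSession {
  docsByFingerprint: Map<string, string>; // fingerprint -> existing documentId
  annotations: AnnotationRecord[];
}

// Stand-in for the engine's id minting (the real ids are uuid-based).
let mintCounter = 0;
const mintId = (prefix: string): string => `${prefix}_${++mintCounter}`;

function mergeInto(
  target: TargetSession,
  docs: readonly IncomingDoc[],
  annotations: readonly AnnotationRecord[],
): { remap: Map<string, string>; skippedBinaries: string[] } {
  const remap = new Map<string, string>(); // incoming documentId -> id used in target
  const skippedBinaries: string[] = [];

  for (const doc of docs) {
    const existingId = target.docsByFingerprint.get(doc.fingerprint);
    if (existingId !== undefined) {
      // Same bytes already present: keep the existing id, skip the binary.
      remap.set(doc.documentId, existingId);
      skippedBinaries.push(doc.documentId);
    } else {
      const freshId = mintId("doc");
      target.docsByFingerprint.set(doc.fingerprint, freshId);
      remap.set(doc.documentId, freshId);
    }
  }

  for (const annotation of annotations) {
    const documentId = remap.get(annotation.documentId);
    if (documentId === undefined) {
      throw new Error(`annotation ${annotation.id} references an unknown document`);
    }
    // Additive import: always a fresh id, never update-in-place.
    target.annotations.push({ ...annotation, id: mintId("ann"), documentId });
  }

  return { remap, skippedBinaries };
}
```

Running mergeInto twice with the same input leaves the document count unchanged but doubles the annotations, which is the behaviour the limitation below describes.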

Known limitation: re-importing your own export duplicates annotations

Because annotations/evidence/links are always added with fresh ids, re-importing a ZIP you just exported into the same session creates a second copy of every annotation, evidence item, and link (the existing PDF bytes dedupe correctly via fingerprint, but the annotations have nothing to dedupe against).

This is intentional for the demo loop and documented here so it's not mistaken for a bug. A future workplan can introduce an importBundleId field (a UUID minted at export time, stamped onto the manifest and on every annotation/evidence-link the import creates) plus a dedupe pass that skips entities already imported under the same bundle id.
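
One possible shape for that follow-up, purely as a sketch: nothing here exists in the codebase, and StampedAnnotation, importWithBundleId and the whole-bundle skip rule are all assumptions about a design that has not been written yet.

```typescript
interface StampedAnnotation {
  id: string;
  body: string;
  importBundleId?: string; // set only on entities created by an import
}

// Hypothetical dedupe pass: skip the whole bundle if it was imported before.
function importWithBundleId(
  existing: StampedAnnotation[],
  incoming: ReadonlyArray<{ body: string }>,
  bundleId: string,
  mint: () => string,
): { imported: number; skipped: boolean } {
  const alreadyImported = existing.some((a) => a.importBundleId === bundleId);
  if (alreadyImported) return { imported: 0, skipped: true };

  for (const item of incoming) {
    // Stamp every created entity so a later pass can recognise the bundle.
    existing.push({ id: mint(), body: item.body, importBundleId: bundleId });
  }
  return { imported: incoming.length, skipped: false };
}
```

Under these assumptions, annotations authored locally carry no stamp, so the first re-import of your own export would still duplicate them; only the second and later imports of the same bundle would be skipped. A real design would probably need to stamp entities at export time as well.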

Consequences

  • One source of truth for the engine snapshot. Same shape on disk and in memory; the persistence helpers stay re-usable.
  • Fingerprint-based dedup is byte-stable. Two users uploading the same byte-identical PDF end up with identical fingerprints; merging their archives works as expected.
  • Idempotency is opt-in, not the default. A user who wants exact round-trips must use a future importBundleId flow, not the basic T07 import.
  • Forward-compatible additions are cheap. A new top-level manifest key is just an added field; old importers preserve it and new importers consume it.

Status

Accepted. The TypeScript types + parseSessionArchiveManifest in src/shared/session-archive.ts are the executable contract for schemaVersion 1.