# ADR-0008 — Session archive format (ZIP layout, manifest schema, merge policy)

- Status: accepted
- Date: 2026-05-25
- Workplan: CE-WP-0005-T05 (schema), CE-WP-0005-T06 (export),
  CE-WP-0005-T07 (import)
- Spec refs: `wiki/ProductRequirementsDocument.md` §20,
  `wiki/ArchitectureOverview.md` §3.4, §14.3

## Context

The CE-WP-0005 demo loop ends with a user exporting an entire session
(documents, annotations, evidence, links) into a single `.zip`
archive and importing it back later. The archive is the **only**
persistence mechanism the demo provides that survives a tab close
(no IndexedDB in this workplan), so its shape needs to be locked
before two parallel tasks (T06, T07) and the integration test (T08)
land on top of it.

Three things need a written contract:

1. **ZIP layout** — what files live in the archive, named how.
2. **manifest.json shape** — versioned JSON schema, validated on
   import.
3. **Conflict policy** — what happens when an imported session's name
   already exists in the receiving repository.

## Decision

### ZIP layout

```
manifest.json
documents/
  <documentId>.pdf
```

- `<documentId>` is the engine-minted branded id (`doc_<uuid>`). Using
  it as the filename means the manifest's `documentBindings[i]` can
  cross-reference the binary file without an additional lookup table.
- Per-representation files (e.g. an extracted-text JSON alongside each
  PDF) are intentionally deferred. The canonical text + selectors are
  embedded in the engine snapshot inside `manifest.json`, so a
  re-import needs nothing beyond the manifest and the PDF binaries.
- Future archive variants (multi-attachment documents, Markdown
  documents) extend by adding subdirectories under the archive root.
  Importers must ignore unknown top-level entries so older clients
  remain compatible with newer archives that add new file types.
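
A minimal sketch of how an importer might classify entry paths under this
layout. `classifyEntry` and `ArchiveEntry` are hypothetical names, not part
of the codebase, and the id pattern assumes a lowercase, hyphenated UUID
after the `doc_` prefix. Treating unexpected files under `documents/` as
ignorable is also an assumption; this ADR only requires ignoring unknown
top-level entries.

```ts
// Hypothetical classification of a single archive entry path.
type ArchiveEntry =
  | { kind: "manifest" }
  | { kind: "document"; documentId: string }
  | { kind: "ignored"; path: string };

// Assumption: ids look like doc_<uuid> with a lowercase 8-4-4-4-12 hex UUID.
const DOCUMENT_ENTRY =
  /^documents\/(doc_[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\.pdf$/;

function classifyEntry(path: string): ArchiveEntry {
  if (path === "manifest.json") return { kind: "manifest" };
  const documentId = DOCUMENT_ENTRY.exec(path)?.[1];
  if (documentId !== undefined) return { kind: "document", documentId };
  // Unknown entries are skipped rather than rejected, so archives written
  // by newer clients still import on older ones.
  return { kind: "ignored", path };
}
```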

### `manifest.json` shape (schemaVersion 1)

```ts
interface SessionArchiveManifest {
  schemaVersion: 1;
  exportedAt: string;          // ISO-8601 UTC timestamp
  session: {
    id: SessionId;             // sess_<uuid>
    name: string;              // trimmed display name
    createdAt: string;         // ISO-8601
    updatedAt: string;         // ISO-8601
  };
  engine: EngineSnapshot;      // shape from src/engine/persistence.ts
  documentBindings: Array<{
    documentId: DocumentId;    // matches the engine's record
    filename: string;          // original filename from upload
    fingerprint: string;       // SHA-256 — used by the importer for dedup
  }>;
}
```

The `engine` field is the same shape that `captureSnapshot()` produces
in `src/engine/persistence.ts`. Re-using it verbatim keeps the
in-memory ↔ archive round-trip a single conversion step (snapshot ↔
JSON) instead of growing a parallel schema that would drift.

Unknown fields at the top level **must be preserved** on import (a
future client can write them) but unknown fields inside `session` or
`documentBindings[i]` are dropped — the import constructs typed
domain objects from the validated subset.
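
The preserve/drop rule can be sketched as below. This is not the real
`parseSessionArchiveManifest`; `normalizeManifest`, `pick`, and `isObject`
are hypothetical helpers, and the sketch only filters keys (it does not
validate field types or the `engine` snapshot).

```ts
type JsonObject = Record<string, unknown>;

function isObject(value: unknown): value is JsonObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Copy only the listed own keys; everything else is dropped.
function pick(source: JsonObject, keys: readonly string[]): JsonObject {
  const out: JsonObject = {};
  for (const key of keys) {
    if (Object.prototype.hasOwnProperty.call(source, key)) out[key] = source[key];
  }
  return out;
}

const SESSION_KEYS = ["id", "name", "createdAt", "updatedAt"];
const BINDING_KEYS = ["documentId", "filename", "fingerprint"];

function normalizeManifest(raw: unknown): JsonObject {
  if (!isObject(raw) || raw["schemaVersion"] !== 1) {
    throw new Error("unsupported or malformed manifest");
  }
  const session = raw["session"];
  const bindings = raw["documentBindings"];
  if (!isObject(session) || !Array.isArray(bindings)) {
    throw new Error("malformed manifest");
  }
  const cleanBindings = bindings.map((binding: unknown) => {
    if (!isObject(binding)) throw new Error("malformed document binding");
    return pick(binding, BINDING_KEYS); // nested unknown fields are dropped
  });
  // Spreading `raw` first keeps unknown top-level keys; the known nested
  // objects are then replaced by their filtered copies.
  return { ...raw, session: pick(session, SESSION_KEYS), documentBindings: cleanBindings };
}
```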

### Merge-on-name-collision policy (T07)

When an imported manifest's `session.name` matches an existing
session, the existing session is the **target** (`outcome:
"merged-into"`). Otherwise a fresh session is created with the
imported name (`outcome: "created"`).
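
As a sketch, target selection reduces to a lookup by name.
`resolveImportTarget`, `SessionSummary`, and `ImportTarget` are hypothetical
names. The comparison here is exact and case-sensitive after trimming, which
is an assumption; this ADR only says the names must match.

```ts
interface SessionSummary {
  id: string;
  name: string;
}

type ImportTarget =
  | { outcome: "merged-into"; sessionId: string }
  | { outcome: "created"; name: string };

function resolveImportTarget(
  existing: readonly SessionSummary[],
  importedName: string,
): ImportTarget {
  const name = importedName.trim();
  const match = existing.find((session) => session.name === name);
  return match !== undefined
    ? { outcome: "merged-into", sessionId: match.id }
    : { outcome: "created", name };
}
```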

Within the target session:

- **Documents** are deduped by `fingerprint` (SHA-256 over the PDF
  bytes). If a document with the same fingerprint already exists,
  the import keeps the existing `documentId` and records a remap
  from the incoming id. The binary file is **skipped** (we already
  have the bytes). Otherwise a fresh `documentId` is minted and the
  bytes go into the per-session byte store.
- **Annotations**, **evidence items**, and **evidence links** are
  imported **additively**: each gets a freshly minted id, with any
  `documentId`/`annotationId`/`evidenceItemId` references rewritten
  via the remap. No update-in-place, no overwrite-by-id.
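
The two rules above can be sketched as a planning pass over the document
bindings followed by an additive rewrite of dependent entities. All names
here (`planDocuments`, `importAnnotations`, `mintId`, the `body` field) are
illustrative assumptions; a real import would mint branded ids through the
engine and handle evidence items and links the same way as annotations.

```ts
interface DocumentBinding {
  documentId: string;
  filename: string;
  fingerprint: string;
}

interface IncomingAnnotation {
  id: string;
  documentId: string;
  body: string; // placeholder payload for the sketch
}

// Hypothetical id minting; stands in for the engine's doc_<uuid> style ids.
let counter = 0;
const mintId = (prefix: string): string => `${prefix}_${++counter}`;

interface DocumentPlan {
  remap: Map<string, string>; // incoming documentId -> documentId in target
  bytesToStore: string[];     // incoming ids whose PDF bytes must be copied
}

function planDocuments(
  existingByFingerprint: ReadonlyMap<string, string>,
  incoming: readonly DocumentBinding[],
): DocumentPlan {
  const remap = new Map<string, string>();
  const bytesToStore: string[] = [];
  for (const binding of incoming) {
    const existingId = existingByFingerprint.get(binding.fingerprint);
    if (existingId !== undefined) {
      remap.set(binding.documentId, existingId); // keep existing id, skip bytes
    } else {
      remap.set(binding.documentId, mintId("doc"));
      bytesToStore.push(binding.documentId);
    }
  }
  return { remap, bytesToStore };
}

function importAnnotations(
  incoming: readonly IncomingAnnotation[],
  remap: ReadonlyMap<string, string>,
): IncomingAnnotation[] {
  return incoming.map((annotation) => {
    const documentId = remap.get(annotation.documentId);
    if (documentId === undefined) {
      throw new Error(`no remap for ${annotation.documentId}`);
    }
    // Always a fresh id: nothing is updated in place or overwritten by id.
    return { ...annotation, id: mintId("ann"), documentId };
  });
}
```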

#### Known limitation: re-importing your own export duplicates annotations

Because annotations/evidence/links are always added with fresh ids,
re-importing a ZIP you just exported into the same session creates a
second copy of every annotation (the existing PDF bytes dedupe
correctly via fingerprint, but the annotations have nothing to
de-dupe against).

This is intentional for the demo loop and documented here so it's not
mistaken for a bug. A future workplan can introduce an
`importBundleId` field (a UUID minted at export time, stamped onto
the manifest and on every annotation/evidence-link the import
creates) plus a dedupe pass that skips entities already imported
under the same bundle id.

## Consequences

- **One source of truth for the engine snapshot.** Same shape on disk
  and in memory; the persistence helpers stay re-usable.
- **Fingerprint-based dedup is byte-stable.** Two users uploading
  byte-identical copies of the same PDF end up with identical
  fingerprints, so merging their archives stores the document once.
- **Import is not idempotent.** Re-importing the same archive
  duplicates annotations, evidence items, and evidence links. A user
  who wants exact round-trips must use a future `importBundleId` flow,
  not the basic T07 import.
- **Forward-compatible additions are cheap.** A newer client can add
  top-level manifest keys; old importers preserve them and new
  importers consume them.

## Status

Accepted. The TypeScript types + `parseSessionArchiveManifest` in
`src/shared/session-archive.ts` are the executable contract for
schemaVersion 1.