# ADR-0001 — Content-Addressed Storage with Dual Digest

- Status: accepted
- Date: 2026-05-15
- Supersedes: —
- Related: ADR-0003, ADR-0006, `docs/PLATFORM-AMBITION.md` commitments A1, A2, A9

## Context

The architecture blueprint as originally drafted addresses stored bytes by logical `(package, relative_path)`. That is sufficient for v1 ingestion but forecloses global deduplication, Merkle integrity proofs, partial replication, federation, and OCI artifact compatibility, all of which the platform ambition requires to remain reachable.

Independently, the original blueprint pins SHA-256 as the only file digest. SHA-256 with SHA-NI on modern x86 reaches ~1.5–2 GB/s/core. BLAKE3 on the same hardware reaches 6–10+ GB/s/core, parallelises across cores, and its construction *is* a Merkle tree, so package-level integrity becomes free. SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we cannot drop it.

## Decision

1. The canonical storage key for any byte sequence is its content address in the form `<algorithm>:<hex>`. Storage backends store and retrieve by this key. `relative_path` is logical metadata recorded in the manifest, not a storage-layer concept.
2. Every `artifact_files` row carries two digest columns:
   - `digest_primary`: the native digest; default algorithm `blake3`.
   - `digest_sha256`: always populated for interop, even when `blake3` is the primary.

   Both are computed in a single ingest pass (one read of the input).
3. The schema also carries a `digest_algorithm` column naming the primary algorithm. Additional algorithms are added by new columns or a side table, never by overloading `digest_primary`.
4. Storage backend object keys are derived from `digest_primary` only. Migrations between primary algorithms are explicit and audited; they are not silent.

## Consequences

Positive:

- Global deduplication is automatic: two identical files in two packages share one backend object.
- Merkle integrity over a package is free with BLAKE3 (use the tree mode).
- Federation, partial mirrors, and OCI compatibility (ADR-0006) become reachable without schema migration.
- Verification of a single file does not require fetching its package.

Negative:

- Two digests must be computed per ingest. Mitigated by streaming both through one buffer; the bottleneck is I/O, not hashing.
- Reference counting: deletion of an `artifact_file` row cannot unconditionally delete the backend object. A garbage-collector pass reconciles references before deleting bytes. This is correct anyway (deletion should be deliberate, per the blueprint).
- Producers requesting "store these N bytes at path P" must understand that their P is logical. This is a documentation problem, not a technical one.

## Implementation notes

- v1 ships BLAKE3 via the `blake3` PyPI wheel (Rust core, SIMD-accelerated; no asm we maintain).
- v1 ships SHA-256 via stdlib `hashlib` (SHA-NI used when the CPython build links against OpenSSL with SHA-NI support).
- A `Digest` value object wraps `(algorithm, hex)`; serialised forms always include the algorithm prefix.
- A garbage-collector workplan is filed at WP-0006 (TBD); v1 does not delete bytes automatically; it marks them eligible.

## Status of the original blueprint pin

The pre-cleanup blueprint's `artifact_files.sha256` column is replaced by `digest_algorithm`, `digest_primary`, `digest_sha256`. The pre-cleanup blueprint's implicit path-keyed storage is replaced by content-keyed storage. These changes are absorbed into `docs/ARCHITECTURE-BLUEPRINT.md`.
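
## Illustrative sketch

The Decision and Implementation notes above can be sketched in a few dozen lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: the names `digest_stream`, `MemoryStore`, `put`, `release`, and `gc_eligible` are hypothetical, the backend is an in-memory dict, and the primary hasher is passed in as a parameter so the block runs with the standard library alone. The `Digest` fields follow the `(algorithm, hex)` description in the notes; everything else about its shape is assumed.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO, Callable, Dict, Set, Tuple


@dataclass(frozen=True)
class Digest:
    """Value object wrapping (algorithm, hex); str() always carries the prefix."""
    algorithm: str
    hex: str

    def __str__(self) -> str:
        return f"{self.algorithm}:{self.hex}"

    @classmethod
    def parse(cls, text: str) -> "Digest":
        algorithm, sep, hex_value = text.partition(":")
        if not sep or not algorithm or not hex_value:
            raise ValueError(f"expected '<algorithm>:<hex>', got {text!r}")
        return cls(algorithm, hex_value)


def digest_stream(
    stream: BinaryIO,
    primary_name: str,
    primary_factory: Callable[[], "hashlib._Hash"],
    chunk_size: int = 1 << 20,
) -> Tuple[Digest, Digest]:
    """Read the input once and feed every chunk to both hashers."""
    primary = primary_factory()
    sha256 = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        primary.update(chunk)
        sha256.update(chunk)
    return Digest(primary_name, primary.hexdigest()), Digest("sha256", sha256.hexdigest())


class MemoryStore:
    """Hypothetical backend: objects keyed by the primary digest only."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        # Logical (package, relative_path) references per object key.
        self.refs: Dict[str, Set[Tuple[str, str]]] = {}

    def put(self, package: str, relative_path: str, data: bytes,
            primary_name: str, primary_factory: Callable) -> Tuple[Digest, Digest]:
        primary, sha256 = digest_stream(io.BytesIO(data), primary_name, primary_factory)
        key = str(primary)
        self.objects.setdefault(key, data)  # identical bytes share one object
        self.refs.setdefault(key, set()).add((package, relative_path))
        return primary, sha256

    def release(self, package: str, relative_path: str, primary: Digest) -> None:
        """Drop a logical reference; the bytes themselves are never deleted here."""
        self.refs.get(str(primary), set()).discard((package, relative_path))

    def gc_eligible(self) -> Set[str]:
        """Keys with no remaining references: marked eligible, not removed."""
        return {key for key in self.objects if not self.refs.get(key)}


if __name__ == "__main__":
    store = MemoryStore()
    # hashlib.blake2b is only a stdlib stand-in for the primary algorithm.
    a, _ = store.put("pkg-a", "bin/tool", b"same bytes", "blake2b", hashlib.blake2b)
    b, _ = store.put("pkg-b", "vendor/tool", b"same bytes", "blake2b", hashlib.blake2b)
    print(a == b, len(store.objects))
```

With the `blake3` wheel installed, passing `"blake3"` and `blake3.blake3` as the primary name and factory should work unchanged, since that package exposes a hashlib-style `update()` and `hexdigest()`. The store keeps releasing a reference separate from deleting bytes, mirroring the note that v1 only marks objects eligible for collection.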