Aligns the v1 architecture with the longer-horizon platform thesis so we can start implementation without the schema-level inconsistencies the prior review surfaced. ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only event log as source of truth, canonical CBOR manifests, control/data-plane contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core + asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI compatibility kept reachable. Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module layout, materialised-view data model over the event log, upload-session and event-stream endpoints pinned, retrieval tiering promoted into the schema. Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005 created carrying the existing state_hub_task_ids forward semantically: ingestion API (T004), retention lifecycle (T005), S3-compatible backend (T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001 with refined acceptance. README and AGENTS.md refreshed to reflect the new repo shape. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ADR-0001 — Content-Addressed Storage with Dual Digest
Status: accepted
Date: 2026-05-15
Supersedes: —
Related: ADR-0003, ADR-0006, docs/PLATFORM-AMBITION.md commitments A1, A2, A9
Context
The architecture blueprint as originally drafted addresses stored bytes by
logical (package, relative_path). That is sufficient for v1 ingestion but
forecloses global deduplication, Merkle integrity proofs, partial
replication, federation, and OCI artifact compatibility — all of which the
platform ambition requires to remain reachable.
Independently, the original blueprint pins SHA-256 as the only file digest. SHA-256 with SHA-NI on modern x86 reaches ~1.5–2 GB/s/core. BLAKE3 on the same hardware reaches 6–10+ GB/s/core, parallelises across cores, and its construction is a Merkle tree — package-level integrity becomes free. SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we cannot drop it.
Decision
- The canonical storage key for any byte sequence is its content address in the form <algorithm>:<lowercase-hex-digest>. Storage backends store and retrieve by this key. relative_path is logical metadata recorded in the manifest, not a storage-layer concept.
- Every artifact_files row carries two digest columns:
  - digest_primary: the native digest; default algorithm blake3.
  - digest_sha256: always populated for interop, even when blake3 is the primary.

  Both are computed in a single ingest pass (one read of the input).
- The schema also carries a digest_algorithm column naming the primary algorithm. Additional algorithms are added by new columns or a side table, never by overloading digest_primary.
- Storage backend object keys are derived from digest_primary only. Migrations between primary algorithms are explicit and audited; they are not silent.
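The single-pass, dual-digest ingest described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the function name dual_digest, the CHUNK_SIZE value, and the returned dict layout are assumptions. To stay within the standard library, the sketch uses hashlib.blake2b as a stand-in primary hasher; the blake3 package's blake3.blake3() exposes the same update()/hexdigest() interface and could be passed as primary_factory instead.

```python
import hashlib
import io
from typing import Any, BinaryIO, Callable

CHUNK_SIZE = 1024 * 1024  # hypothetical read size, not taken from the ADR


def dual_digest(
    stream: BinaryIO,
    primary_name: str = "blake2b",  # stand-in; the ADR's default is blake3
    primary_factory: Callable[[], Any] = hashlib.blake2b,
) -> dict[str, str]:
    """Read the stream once and feed every chunk to both hashers."""
    primary = primary_factory()
    sha256 = hashlib.sha256()
    while chunk := stream.read(CHUNK_SIZE):
        primary.update(chunk)
        sha256.update(chunk)
    primary_hex = primary.hexdigest()  # hashlib returns lowercase hex
    return {
        "digest_algorithm": primary_name,
        "digest_primary": primary_hex,
        "digest_sha256": sha256.hexdigest(),
        # Content address in the <algorithm>:<lowercase-hex-digest> form.
        "storage_key": f"{primary_name}:{primary_hex}",
    }


row = dual_digest(io.BytesIO(b"hello world"))
print(row["storage_key"])
```

Passing the hasher as a factory keeps the primary algorithm a parameter of the ingest path rather than a hard-coded choice, which matches the intent that a change of primary algorithm is an explicit step.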
Consequences
Positive:
- Global deduplication is automatic — two identical files in two packages share one backend object.
- Merkle integrity over a package is free with BLAKE3 (use the tree mode).
- Federation, partial mirrors, and OCI compatibility (ADR-0006) become reachable without schema migration.
- Verification of a single file does not require fetching its package.
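To make the deduplication and single-file verification points concrete, here is a small in-memory sketch. InMemoryContentStore and the manifests mapping are hypothetical names introduced for illustration, and SHA-256 stands in for the primary digest so the example needs only the standard library.

```python
import hashlib


class InMemoryContentStore:
    """Hypothetical backend: objects are keyed by content address, not path."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self.objects.setdefault(key, data)  # identical bytes are stored once
        return key

    def get(self, key: str) -> bytes:
        data = self.objects[key]
        algorithm, _, hex_digest = key.partition(":")
        # The key alone is enough to verify the bytes; no package is needed.
        if hashlib.new(algorithm, data).hexdigest() != hex_digest:
            raise ValueError(f"content does not match {key}")
        return data


store = InMemoryContentStore()
# Logical (package, relative_path) pairs live in the manifest layer.
manifests = {
    ("pkg-a", "LICENSE"): store.put(b"MIT License\n"),
    ("pkg-b", "docs/LICENSE"): store.put(b"MIT License\n"),
}
print(len(store.objects))  # both manifest entries point at one object
```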
Negative:
- Two digests must be computed per ingest. Mitigated by streaming both through one buffer; the bottleneck is I/O, not hashing.
- Reference counting: deletion of an artifact_files row cannot unconditionally delete the backend object. A garbage-collector pass reconciles references before deleting bytes. This is correct anyway (deletion should be deliberate, per the blueprint).
- Producers requesting "store these N bytes at path P" must understand that their P is logical. This is a documentation problem, not a technical one.
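The reference-counting consequence can be illustrated with a mark-only reconciliation pass. This is a sketch under assumptions: the function names, the row dicts, and the shortened placeholder digests are hypothetical, and the real garbage collector is still to be specified.

```python
from collections import Counter


def storage_key(row: dict) -> str:
    """Derive the backend object key from the primary digest columns."""
    return f"{row['digest_algorithm']}:{row['digest_primary']}"


def mark_eligible(backend_keys: set[str], rows: list[dict]) -> set[str]:
    """Return backend keys that no artifact_files row references any more.

    Nothing is deleted here; the result is only a set of candidates.
    """
    refs = Counter(storage_key(row) for row in rows)
    return {key for key in backend_keys if refs[key] == 0}


# Placeholder digests ("aa11", "bb22") are shortened for readability.
rows = [
    {"package": "pkg-a", "relative_path": "LICENSE",
     "digest_algorithm": "blake3", "digest_primary": "aa11"},
    {"package": "pkg-b", "relative_path": "LICENSE",
     "digest_algorithm": "blake3", "digest_primary": "aa11"},
    {"package": "pkg-b", "relative_path": "README",
     "digest_algorithm": "blake3", "digest_primary": "bb22"},
]
backend_keys = {"blake3:aa11", "blake3:bb22"}

# Dropping pkg-a's row leaves the shared object referenced by pkg-b.
after_one_delete = mark_eligible(backend_keys, rows[1:])
# Dropping both LICENSE rows makes the shared object a candidate.
after_both_deleted = mark_eligible(backend_keys, rows[2:])
print(after_one_delete, after_both_deleted)
```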
Implementation notes
- v1 ships BLAKE3 via the blake3 PyPI wheel (Rust core, SIMD-accelerated; no asm we maintain).
- v1 ships SHA-256 via stdlib hashlib (SHA-NI used when the CPython build links against OpenSSL with SHA-NI support).
- A Digest value object wraps (algorithm, hex); serialised forms always include the algorithm prefix.
- A garbage-collector workplan is filed at WP-0006 (TBD). v1 does not delete bytes automatically; it marks them eligible.
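One possible shape for the Digest value object is sketched below. The validation rules, the parse classmethod, and the error messages are assumptions made for illustration; the note above fixes only the (algorithm, hex) pair and the prefixed serialised form.

```python
import re
from dataclasses import dataclass

_ALGORITHM = re.compile(r"[a-z0-9]+")  # assumed naming rule
_HEX = re.compile(r"[0-9a-f]+")        # lowercase hex only


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    def __post_init__(self) -> None:
        if not _ALGORITHM.fullmatch(self.algorithm):
            raise ValueError(f"invalid algorithm name: {self.algorithm!r}")
        if not _HEX.fullmatch(self.hex):
            raise ValueError(f"digest must be lowercase hex: {self.hex!r}")

    def __str__(self) -> str:
        # The serialised form always carries the algorithm prefix.
        return f"{self.algorithm}:{self.hex}"

    @classmethod
    def parse(cls, text: str) -> "Digest":
        algorithm, sep, hex_digest = text.partition(":")
        if not sep:
            raise ValueError(f"missing algorithm prefix: {text!r}")
        return cls(algorithm, hex_digest)


d = Digest.parse("sha256:ab12")
print(d)
```

Making the dataclass frozen keeps a digest hashable and immutable, so it can serve directly as a dictionary key for storage lookups.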
Status of the original blueprint pin
The pre-cleanup blueprint's artifact_files.sha256 column is replaced by
digest_algorithm, digest_primary, digest_sha256. The pre-cleanup
blueprint's implicit path-keyed storage is replaced by content-keyed
storage. These changes are absorbed into docs/ARCHITECTURE-BLUEPRINT.md.
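As a rough picture of the replaced column set, the following sketch creates an artifact_files table in an in-memory SQLite database. SQLite is used only so the example runs with the standard library. The column types, the package column, the primary key, and the shortened placeholder digests are assumptions; the authoritative schema is the one in docs/ARCHITECTURE-BLUEPRINT.md.

```python
import sqlite3

DDL = """
CREATE TABLE artifact_files (
    package          TEXT NOT NULL,
    relative_path    TEXT NOT NULL,  -- logical metadata, not a storage key
    digest_algorithm TEXT NOT NULL DEFAULT 'blake3',
    digest_primary   TEXT NOT NULL,
    digest_sha256    TEXT NOT NULL,
    PRIMARY KEY (package, relative_path)
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)
conn.executemany(
    "INSERT INTO artifact_files "
    "(package, relative_path, digest_primary, digest_sha256) "
    "VALUES (?, ?, ?, ?)",
    [
        # Placeholder digests, shortened for readability.
        ("pkg-a", "LICENSE", "aa11", "cc33"),
        ("pkg-b", "docs/LICENSE", "aa11", "cc33"),
    ],
)

# Backend object keys come from the primary digest columns only.
keys = [
    key
    for (key,) in conn.execute(
        "SELECT DISTINCT digest_algorithm || ':' || digest_primary "
        "FROM artifact_files"
    )
]
print(keys)
```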