artifact-store/docs/adr/0001-content-addressed-storage.md
tegwick 747afc27a6 docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans
Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00


ADR-0001 — Content-Addressed Storage with Dual Digest

Status: accepted
Date: 2026-05-15
Supersedes: none
Related: ADR-0003, ADR-0006, docs/PLATFORM-AMBITION.md commitments A1, A2, A9

Context

The architecture blueprint as originally drafted addresses stored bytes by logical (package, relative_path). That is sufficient for v1 ingestion but forecloses global deduplication, Merkle integrity proofs, partial replication, federation, and OCI artifact compatibility — all of which the platform ambition requires to remain reachable.

Independently, the original blueprint pins SHA-256 as the only file digest. SHA-256 with SHA-NI on modern x86 reaches roughly 1.5 to 2 GB/s per core. BLAKE3 on the same hardware reaches 6 to 10+ GB/s per core, parallelises across cores, and its construction is a Merkle tree, so package-level integrity becomes free. SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we cannot drop it.

Decision

  1. The canonical storage key for any byte sequence is its content address in the form <algorithm>:<lowercase-hex-digest>. Storage backends store and retrieve by this key. relative_path is logical metadata recorded in the manifest, not a storage-layer concept.
  2. Every artifact_files row carries two digest columns:
    • digest_primary: the native digest; default algorithm blake3.
    • digest_sha256: always populated for interop, even when blake3 is the primary.
     Both are computed in a single ingest pass (one read of the input).
  3. The schema also carries a digest_algorithm column naming the primary algorithm. Additional algorithms are added by new columns or a side table, never by overloading digest_primary.
  4. Storage backend object keys are derived from digest_primary only. Migrations between primary algorithms are explicit and audited; they are not silent.
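
The single-pass rule in item 2 and the key format in item 1 can be sketched as follows. This is a minimal illustration, not the project's ingest code: the function name `dual_digest` and the chunk size are assumptions, and the `blake2b` fallback exists only so the sketch runs where the `blake3` wheel is not installed.

```python
import hashlib
import io
from typing import BinaryIO, Tuple

try:
    from blake3 import blake3 as new_primary  # the PyPI wheel this ADR names
    PRIMARY_ALGORITHM = "blake3"
except ImportError:
    # Stand-in so the sketch runs without the wheel; an assumption of this
    # example only, not an algorithm the ADR selects.
    new_primary = hashlib.blake2b
    PRIMARY_ALGORITHM = "blake2b"


def dual_digest(stream: BinaryIO, chunk_size: int = 1 << 20) -> Tuple[str, str]:
    """Read `stream` once; return (primary content address, sha256 content address)."""
    primary = new_primary()
    sha256 = hashlib.sha256()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The same buffer feeds both hashers: one read, two digests.
        primary.update(chunk)
        sha256.update(chunk)
    # hexdigest() is lowercase, matching <algorithm>:<lowercase-hex-digest>.
    return (
        f"{PRIMARY_ALGORITHM}:{primary.hexdigest()}",
        f"sha256:{sha256.hexdigest()}",
    )


primary_key, sha256_key = dual_digest(io.BytesIO(b"hello"))
```

Under item 4, only `primary_key` would be used as the backend object key; `sha256_key` is recorded for interop.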

Consequences

Positive:

  • Global deduplication is automatic — two identical files in two packages share one backend object.
  • Merkle integrity over a package is nearly free with BLAKE3, because the hash is itself computed as a Merkle tree over the input.
  • Federation, partial mirrors, and OCI compatibility (ADR-0006) become reachable without schema migration.
  • Verification of a single file does not require fetching its package.
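
To illustrate the deduplication and single-file verification points, here is a small in-memory sketch. `InMemoryContentStore`, `verify`, and the package names are hypothetical, and SHA-256 stands in for the primary digest so that only the standard library is needed.

```python
import hashlib
from typing import Dict, Tuple


class InMemoryContentStore:
    """Hypothetical backend: objects are keyed by content address only."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        # sha256 stands in for the primary digest in this sketch.
        key = f"sha256:{hashlib.sha256(data).hexdigest()}"
        self.objects.setdefault(key, data)  # identical bytes are stored once
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]


def verify(store: InMemoryContentStore, key: str) -> bool:
    """Re-hash one object and compare against its own key; no package needed."""
    algorithm, hex_digest = key.split(":", 1)
    return hashlib.new(algorithm, store.get(key)).hexdigest() == hex_digest


store = InMemoryContentStore()
# The manifest maps logical (package, relative_path) to a content address.
manifest: Dict[Tuple[str, str], str] = {}
manifest[("pkg-a", "bin/tool")] = store.put(b"same bytes")
manifest[("pkg-b", "vendor/tool")] = store.put(b"same bytes")
manifest[("pkg-b", "README")] = store.put(b"different bytes")
```

Three manifest entries end up pointing at two backend objects, and each object can be checked with `verify` without reading any other file in its package.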

Negative:

  • Two digests must be computed per ingest. Mitigated by streaming both through one buffer; the bottleneck is I/O, not hashing.
  • Reference counting: deletion of an artifact_files row cannot unconditionally delete the backend object. A garbage-collector pass reconciles references before deleting bytes. This is correct anyway (deletion should be deliberate, per the blueprint).
  • Producers requesting "store these N bytes at path P" must understand that their P is logical. This is a documentation problem, not a technical one.
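
The reference-counting consequence can be sketched as a reconciliation step that only reports which objects have no remaining references. The function name `gc_eligible`, the row shape, and the short `blake3:aa` style keys are placeholders for illustration; the real garbage collector is not specified in this ADR.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def gc_eligible(backend_keys: Iterable[str], rows: List[Dict[str, str]]) -> Set[str]:
    """Return backend keys that no remaining row references.

    Keys are only reported as eligible; nothing is deleted here.
    """
    refs = Counter(row["storage_key"] for row in rows)
    return {key for key in backend_keys if refs[key] == 0}


backend_keys = {"blake3:aa", "blake3:bb"}  # placeholder keys, not real digests
rows = [
    {"package": "pkg-a", "relative_path": "bin/tool", "storage_key": "blake3:aa"},
    {"package": "pkg-b", "relative_path": "vendor/tool", "storage_key": "blake3:aa"},
    {"package": "pkg-b", "relative_path": "README", "storage_key": "blake3:bb"},
]

# Dropping pkg-a leaves blake3:aa referenced by pkg-b, so nothing is eligible.
after_pkg_a = [r for r in rows if r["package"] != "pkg-a"]
eligible_after_pkg_a = gc_eligible(backend_keys, after_pkg_a)

# Dropping pkg-b as well removes the last references to both objects.
after_both = [r for r in after_pkg_a if r["package"] != "pkg-b"]
eligible_after_both = gc_eligible(backend_keys, after_both)
```

Deleting a row never touches the backend directly; an object becomes a candidate only once the reconciliation finds zero references to its key.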

Implementation notes

  • v1 ships BLAKE3 via the blake3 PyPI wheel (Rust core, SIMD-accelerated; no asm we maintain).
  • v1 ships SHA-256 via stdlib hashlib (SHA-NI used when the CPython build links against OpenSSL with SHA-NI support).
  • A Digest value object wraps (algorithm, hex); serialised forms always include the algorithm prefix.
  • A garbage-collector workplan is to be filed as WP-0006 (TBD); v1 does not delete bytes automatically, it only marks them eligible for deletion.
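
One possible shape for the Digest value object is sketched below. The ADR fixes only the (algorithm, hex) pair and the prefixed serialised form; the validation rules, the `parse` classmethod, and the error messages are assumptions of this example.

```python
import re
from dataclasses import dataclass

_LOWER_HEX = re.compile(r"[0-9a-f]+")


@dataclass(frozen=True)
class Digest:
    """Immutable (algorithm, hex) pair; str() always carries the algorithm prefix."""

    algorithm: str
    hex: str

    def __post_init__(self) -> None:
        if not self.algorithm or ":" in self.algorithm:
            raise ValueError(f"invalid algorithm name: {self.algorithm!r}")
        if not _LOWER_HEX.fullmatch(self.hex):
            raise ValueError("digest must be non-empty lowercase hex")

    def __str__(self) -> str:
        return f"{self.algorithm}:{self.hex}"

    @classmethod
    def parse(cls, text: str) -> "Digest":
        # Reject bare hex: the serialised form must name its algorithm.
        algorithm, sep, hex_part = text.partition(":")
        if not sep:
            raise ValueError("missing algorithm prefix")
        return cls(algorithm, hex_part)


example = Digest.parse("sha256:abc123")
```

Making the object frozen keeps it hashable, so it can serve directly as a dictionary key or set member when reconciling references.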

Status of the original blueprint pin

The pre-cleanup blueprint's artifact_files.sha256 column is replaced by digest_algorithm, digest_primary, digest_sha256. The pre-cleanup blueprint's implicit path-keyed storage is replaced by content-keyed storage. These changes are absorbed into docs/ARCHITECTURE-BLUEPRINT.md.
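
The replacement columns can be pictured with the following sketch. It uses the standard-library `sqlite3` module purely so that it runs anywhere; the real schema lives in docs/ARCHITECTURE-BLUEPRINT.md, and the `package` and `relative_path` columns, the primary key, and the placeholder digest values here are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE artifact_files (
        package          TEXT NOT NULL,  -- hypothetical column for this sketch
        relative_path    TEXT NOT NULL,  -- logical metadata, not a storage key
        digest_algorithm TEXT NOT NULL DEFAULT 'blake3',
        digest_primary   TEXT NOT NULL,
        digest_sha256    TEXT NOT NULL,
        PRIMARY KEY (package, relative_path)
    )
    """
)

PRIMARY_HEX = "ab" * 32  # placeholder, not a real BLAKE3 digest
SHA256_HEX = "cd" * 32   # placeholder, not a real SHA-256 digest
conn.executemany(
    "INSERT INTO artifact_files (package, relative_path, digest_primary, digest_sha256)"
    " VALUES (?, ?, ?, ?)",
    [
        ("pkg-a", "bin/tool", PRIMARY_HEX, SHA256_HEX),
        ("pkg-b", "vendor/tool", PRIMARY_HEX, SHA256_HEX),
    ],
)

# The storage key is derived from the primary digest only, never from the path.
storage_keys = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT digest_algorithm || ':' || digest_primary FROM artifact_files"
    )
]
```

Two rows with different logical paths resolve to a single storage key, which is the content-keyed behaviour that replaces the earlier path-keyed layout.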