artifact-store/docs/adr/0001-content-addressed-storage.md
tegwick 747afc27a6 docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans
Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00

# ADR-0001 — Content-Addressed Storage with Dual Digest
Status: accepted
Date: 2026-05-15
Supersedes: —
Related: ADR-0003, ADR-0006, `docs/PLATFORM-AMBITION.md` commitments A1, A2, A9
## Context
The architecture blueprint as originally drafted addresses stored bytes by
logical `(package, relative_path)`. That is sufficient for v1 ingestion but
forecloses global deduplication, Merkle integrity proofs, partial
replication, federation, and OCI artifact compatibility — all of which the
platform ambition requires to remain reachable.
Independently, the original blueprint pins SHA-256 as the only file digest.
SHA-256 with SHA-NI on modern x86 reaches ~1.5–2 GB/s/core. BLAKE3 on the
same hardware reaches 6–10+ GB/s/core, parallelises across cores, and its
construction *is* a Merkle tree — package-level integrity becomes free.
SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we
cannot drop it.
## Decision
1. The canonical storage key for any byte sequence is its content address
in the form `<algorithm>:<lowercase-hex-digest>`. Storage backends store
and retrieve by this key. `relative_path` is logical metadata recorded
in the manifest, not a storage-layer concept.
2. Every `artifact_files` row carries two digest columns:
   - `digest_primary`: the native digest; default algorithm `blake3`.
   - `digest_sha256`: always populated for interop, even when `blake3`
     is the primary.

   Both are computed in a single ingest pass (one read of the input).
3. The schema also carries a `digest_algorithm` column naming the primary
algorithm. Additional algorithms are added by new columns or a side
table, never by overloading `digest_primary`.
4. Storage backend object keys are derived from `digest_primary` only.
Migrations between primary algorithms are explicit and audited; they
are not silent.
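
The single-pass rule in item 2 and the key derivation in item 4 can be sketched as follows. This is a minimal illustration, not the project's ingest code: the function names `ingest_digests` and `storage_key`, the chunk size, and the choice to store bare hex in the digest columns are assumptions. If the `blake3` wheel is not installed, the sketch falls back to stdlib `hashlib.blake2b` purely so that it still runs.

```python
import hashlib
import io
from typing import BinaryIO

try:
    # The wheel named in this ADR; its hasher exposes update() and hexdigest().
    from blake3 import blake3 as new_primary
    PRIMARY_ALGORITHM = "blake3"
except ImportError:
    # ASSUMPTION: stdlib stand-in so the sketch runs without the wheel.
    new_primary = hashlib.blake2b
    PRIMARY_ALGORITHM = "blake2b"

CHUNK_SIZE = 1024 * 1024  # hypothetical 1 MiB read buffer


def ingest_digests(stream: BinaryIO) -> dict[str, str]:
    """Read the stream once and feed every chunk to both hashers."""
    primary = new_primary()
    sha256 = hashlib.sha256()
    while chunk := stream.read(CHUNK_SIZE):
        primary.update(chunk)
        sha256.update(chunk)
    return {
        "digest_algorithm": PRIMARY_ALGORITHM,
        "digest_primary": primary.hexdigest(),  # lowercase hex
        "digest_sha256": sha256.hexdigest(),    # always populated
    }


def storage_key(row: dict[str, str]) -> str:
    """Backend object key, derived from the primary digest only."""
    return f"{row['digest_algorithm']}:{row['digest_primary']}"


row = ingest_digests(io.BytesIO(b"hello artifact store"))
print(storage_key(row))
```

Because both hashers consume the same chunk before the next read, the input is traversed once regardless of its size.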
## Consequences
Positive:
- Global deduplication is automatic — two identical files in two packages
share one backend object.
- Merkle integrity over a package is free with BLAKE3 (use the tree mode).
- Federation, partial mirrors, and OCI compatibility (ADR-0006) become
reachable without schema migration.
- Verification of a single file does not require fetching its package.

Negative:
- Two digests must be computed per ingest. Mitigated by streaming both
through one buffer; the bottleneck is I/O, not hashing.
- Reference counting: deletion of an `artifact_files` row cannot
unconditionally delete the backend object. A garbage-collector pass
reconciles references before deleting bytes. This is correct anyway
(deletion should be deliberate, per the blueprint).
- Producers requesting "store these N bytes at path P" must understand
that their P is logical. This is a documentation problem, not a
technical one.
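
The deduplication and reference-counting consequences above can be illustrated with a small in-memory model. Everything here is hypothetical: the `ContentStore` class, its method names, and the use of SHA-256 for the key (chosen only to keep the sketch stdlib-only; the ADR derives keys from `digest_primary`).

```python
import hashlib


class ContentStore:
    """Toy model: rows reference objects; objects are keyed by content."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}           # content address -> bytes
        self.rows: dict[tuple[str, str], str] = {}    # (package, relative_path) -> key
        self.eligible: set[str] = set()               # marked, not yet deleted

    def put(self, package: str, relative_path: str, data: bytes) -> str:
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self.objects.setdefault(key, data)            # identical bytes stored once
        self.rows[(package, relative_path)] = key     # the path is logical metadata
        self.eligible.discard(key)
        return key

    def delete_row(self, package: str, relative_path: str) -> None:
        # Removing a row never touches the backend object directly.
        del self.rows[(package, relative_path)]

    def mark_eligible(self) -> set[str]:
        """GC pass: mark objects that no row references; delete nothing."""
        referenced = set(self.rows.values())
        self.eligible = {k for k in self.objects if k not in referenced}
        return self.eligible


store = ContentStore()
k1 = store.put("pkg-a", "bin/tool", b"same bytes")
k2 = store.put("pkg-b", "vendor/tool", b"same bytes")
store.delete_row("pkg-a", "bin/tool")
still_referenced = store.mark_eligible()   # pkg-b still points at the object
store.delete_row("pkg-b", "vendor/tool")
now_eligible = store.mark_eligible()
```

In this model the object only becomes eligible once the last referencing row is gone, and the bytes stay in `objects` until some separate, deliberate deletion step runs.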
## Implementation notes
- v1 ships BLAKE3 via the `blake3` PyPI wheel (Rust core, SIMD-accelerated;
no asm we maintain).
- v1 ships SHA-256 via stdlib `hashlib` (SHA-NI used when the CPython
build links against OpenSSL with SHA-NI support).
- A `Digest` value object wraps `(algorithm, hex)`; serialised forms
always include the algorithm prefix.
- A garbage-collector workplan is filed at WP-0006 (TBD); v1 does not
delete bytes automatically — it marks them eligible.
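
A `Digest` value object along the lines described above might look like this. It is a sketch under assumptions: the validation rules, the `parse` classmethod, and the use of a frozen dataclass are illustrative choices, not the project's actual type.

```python
import re
from dataclasses import dataclass

_LOWER_HEX = re.compile(r"[0-9a-f]+")


@dataclass(frozen=True)
class Digest:
    """Immutable (algorithm, hex) pair; str() always carries the prefix."""

    algorithm: str
    hex: str

    def __post_init__(self) -> None:
        if not self.algorithm or ":" in self.algorithm:
            raise ValueError(f"invalid algorithm name: {self.algorithm!r}")
        if not _LOWER_HEX.fullmatch(self.hex):
            raise ValueError("digest must be non-empty lowercase hex")

    def __str__(self) -> str:
        # Serialised form: <algorithm>:<lowercase-hex-digest>
        return f"{self.algorithm}:{self.hex}"

    @classmethod
    def parse(cls, text: str) -> "Digest":
        algorithm, sep, hex_part = text.partition(":")
        if not sep:
            raise ValueError("missing algorithm prefix")
        return cls(algorithm, hex_part)


d = Digest.parse("sha256:" + "ab" * 32)
assert str(d) == "sha256:" + "ab" * 32
```

Rejecting uppercase hex at construction time keeps two spellings of the same digest from ever producing two different storage keys.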
## Status of the original blueprint pin
The pre-cleanup blueprint's `artifact_files.sha256` column is replaced by
`digest_algorithm`, `digest_primary`, `digest_sha256`. The pre-cleanup
blueprint's implicit path-keyed storage is replaced by content-keyed
storage. These changes are absorbed into `docs/ARCHITECTURE-BLUEPRINT.md`.
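
For orientation, the replacement columns could be laid out roughly as below. This is only a sketch: SQLite from the Python stdlib stands in for the project's database, and the column types, the primary key, the index name, and the sample digests are all assumptions rather than the blueprint's actual DDL.

```python
import sqlite3

DDL = """
CREATE TABLE artifact_files (
    package          TEXT NOT NULL,
    relative_path    TEXT NOT NULL,   -- logical metadata, not a storage key
    digest_algorithm TEXT NOT NULL DEFAULT 'blake3',
    digest_primary   TEXT NOT NULL,   -- lowercase hex of the primary digest
    digest_sha256    TEXT NOT NULL,   -- always populated for interop
    PRIMARY KEY (package, relative_path)
);
CREATE INDEX artifact_files_by_content
    ON artifact_files (digest_algorithm, digest_primary);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# Two logical paths in two packages pointing at the same bytes (made-up digests).
rows = [
    ("pkg-a", "bin/tool", "blake3", "ab" * 32, "cd" * 32),
    ("pkg-b", "vendor/tool", "blake3", "ab" * 32, "cd" * 32),
]
conn.executemany("INSERT INTO artifact_files VALUES (?, ?, ?, ?, ?)", rows)

# Distinct content addresses = number of backend objects actually needed.
distinct_objects = conn.execute(
    "SELECT COUNT(*) FROM ("
    "SELECT DISTINCT digest_algorithm, digest_primary FROM artifact_files)"
).fetchone()[0]
```

Here two rows resolve to a single content address, which is the deduplication property the content-keyed layout is meant to provide.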