Aligns the v1 architecture with the longer-horizon platform thesis so we can start implementation without the schema-level inconsistencies the prior review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only event log as source of truth, canonical CBOR manifests, control/data-plane contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core + asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module layout, materialised-view data model over the event log, upload-session and event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases.

WP-0001 rewritten as the Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005 created carrying the existing state_hub_task_ids forward semantically: ingestion API (T004), retention lifecycle (T005), S3-compatible backend (T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001 with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# ADR-0001 — Content-Addressed Storage with Dual Digest

Status: accepted
Date: 2026-05-15
Supersedes: none
Related: ADR-0003, ADR-0006, `docs/PLATFORM-AMBITION.md` commitments A1, A2, A9

## Context

The architecture blueprint as originally drafted addresses stored bytes by
logical `(package, relative_path)`. That is sufficient for v1 ingestion but
forecloses global deduplication, Merkle integrity proofs, partial
replication, federation, and OCI artifact compatibility, all of which the
platform ambition requires to remain reachable.

Independently, the original blueprint pins SHA-256 as the only file digest.
SHA-256 with SHA-NI on modern x86 reaches ~1.5–2 GB/s/core. BLAKE3 on the
same hardware reaches 6–10+ GB/s/core, parallelises across cores, and its
construction *is* a Merkle tree, so package-level integrity becomes free.
SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we
cannot drop it.

## Decision

1. The canonical storage key for any byte sequence is its content address
   in the form `<algorithm>:<lowercase-hex-digest>`. Storage backends store
   and retrieve by this key. `relative_path` is logical metadata recorded
   in the manifest, not a storage-layer concept.
2. Every `artifact_files` row carries two digest columns:
   - `digest_primary`: the native digest; default algorithm `blake3`.
   - `digest_sha256`: always populated for interop, even when `blake3`
     is the primary.

   Both are computed in a single ingest pass (one read of the input).
3. The schema also carries a `digest_algorithm` column naming the primary
   algorithm. Additional algorithms are added by new columns or a side
   table, never by overloading `digest_primary`.
4. Storage backend object keys are derived from `digest_primary` only.
   Migrations between primary algorithms are explicit and audited; they
   are not silent.
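
The single-pass rule in item 2 and the key derivation in item 4 can be sketched as follows. This is a minimal illustration, not the project's ingest code: `dual_digest`, `CHUNK_SIZE`, and the returned `object_key` field are hypothetical names, storing bare hex in the digest columns is an assumption, and the `blake2b` fallback exists only so the sketch runs where the `blake3` wheel is not installed.

```python
import hashlib
import io
from typing import BinaryIO

try:
    # Third-party wheel named in the implementation notes.
    from blake3 import blake3 as _primary_factory
    PRIMARY_ALGORITHM = "blake3"
except ImportError:
    # Stand-in so the sketch still runs; this is NOT what the ADR specifies.
    _primary_factory = hashlib.blake2b
    PRIMARY_ALGORITHM = "blake2b"

CHUNK_SIZE = 1024 * 1024  # assumed read size, not taken from the ADR


def dual_digest(stream: BinaryIO) -> dict[str, str]:
    """Read the stream once and feed every chunk to both hashers."""
    primary = _primary_factory()
    sha256 = hashlib.sha256()
    while chunk := stream.read(CHUNK_SIZE):
        primary.update(chunk)
        sha256.update(chunk)
    primary_hex = primary.hexdigest()
    return {
        "digest_algorithm": PRIMARY_ALGORITHM,
        "digest_primary": primary_hex,
        "digest_sha256": sha256.hexdigest(),
        # The backend object key is derived from the primary digest only.
        "object_key": f"{PRIMARY_ALGORITHM}:{primary_hex}",
    }


result = dual_digest(io.BytesIO(b"example payload"))
print(result["object_key"])
```

Because each chunk is read once and handed to both hashers, the second digest adds CPU work but no extra I/O.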

## Consequences

Positive:

- Global deduplication is automatic: two identical files in two packages
  share one backend object.
- Merkle integrity over a package is free with BLAKE3 (use the tree mode).
- Federation, partial mirrors, and OCI compatibility (ADR-0006) become
  reachable without schema migration.
- Verification of a single file does not require fetching its package.

Negative:

- Two digests must be computed per ingest. This is mitigated by streaming
  both through one buffer; the bottleneck is I/O, not hashing.
- Reference counting: deletion of an `artifact_files` row cannot
  unconditionally delete the backend object. A garbage-collector pass
  reconciles references before deleting bytes. This is correct anyway
  (deletion should be deliberate, per the blueprint).
- Producers requesting "store these N bytes at path P" must understand
  that their P is logical. This is a documentation problem, not a
  technical one.
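
The deduplication and reference-counting consequences can be seen together in a toy model. The sketch below is an in-memory stand-in, not the real backend or schema: `ContentStore` and its methods are hypothetical names, and SHA-256 stands in for the primary digest so the example needs only the standard library.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ContentStore:
    """Toy model: backend objects plus the rows that reference them."""
    objects: dict[str, bytes] = field(default_factory=dict)         # key -> bytes
    rows: dict[tuple[str, str], str] = field(default_factory=dict)  # (package, path) -> key
    gc_eligible: set[str] = field(default_factory=set)

    def put(self, package: str, relative_path: str, data: bytes) -> str:
        # Assumption: sha256 plays the role of the primary digest here.
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self.objects.setdefault(key, data)  # identical bytes are stored once
        self.rows[(package, relative_path)] = key
        self.gc_eligible.discard(key)       # a new reference revives the object
        return key

    def delete_row(self, package: str, relative_path: str) -> None:
        # Removing the logical row never touches the stored bytes.
        self.rows.pop((package, relative_path), None)

    def gc_mark(self) -> set[str]:
        # Reconcile references; mark unreferenced objects, do not delete them.
        live = set(self.rows.values())
        self.gc_eligible = {key for key in self.objects if key not in live}
        return self.gc_eligible


store = ContentStore()
key_a = store.put("pkg-a", "bin/tool", b"same bytes")
key_b = store.put("pkg-b", "vendor/tool", b"same bytes")
assert key_a == key_b and len(store.objects) == 1  # one shared backend object

store.delete_row("pkg-a", "bin/tool")
assert store.gc_mark() == set()                    # pkg-b still references it

store.delete_row("pkg-b", "vendor/tool")
assert store.gc_mark() == {key_a}                  # eligible, but bytes remain
assert key_a in store.objects
```

The mark step only records eligibility, which matches the v1 stance that bytes are never deleted automatically.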

## Implementation notes

- v1 ships BLAKE3 via the `blake3` PyPI wheel (Rust core, SIMD-accelerated;
  no asm we maintain).
- v1 ships SHA-256 via stdlib `hashlib` (SHA-NI used when the CPython
  build links against OpenSSL with SHA-NI support).
- A `Digest` value object wraps `(algorithm, hex)`; serialised forms
  always include the algorithm prefix.
- A garbage-collector workplan is filed at WP-0006 (TBD); v1 does not
  delete bytes automatically, it only marks them eligible.
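
One possible shape for the `Digest` value object is sketched below. Only the `(algorithm, hex)` pair and the always-prefixed serialised form come from the notes above; the `parse` method, the validation rules, and the error messages are assumptions made for illustration.

```python
import re
from dataclasses import dataclass

_LOWER_HEX = re.compile(r"[0-9a-f]+")


@dataclass(frozen=True)
class Digest:
    """Immutable (algorithm, hex) pair; str() always carries the prefix."""
    algorithm: str
    hex: str

    def __post_init__(self) -> None:
        if not self.algorithm or ":" in self.algorithm:
            raise ValueError("algorithm must be non-empty and contain no ':'")
        if not _LOWER_HEX.fullmatch(self.hex):
            raise ValueError("digest must be non-empty lowercase hex")

    def __str__(self) -> str:
        # Serialised form: <algorithm>:<lowercase-hex-digest>
        return f"{self.algorithm}:{self.hex}"

    @classmethod
    def parse(cls, text: str) -> "Digest":
        # Hypothetical helper: refuse bare hex so the algorithm is never implied.
        algorithm, sep, hex_part = text.partition(":")
        if not sep:
            raise ValueError("missing algorithm prefix")
        return cls(algorithm, hex_part)


d = Digest.parse("sha256:" + "ab" * 32)
print(d.algorithm, len(d.hex), str(d) == "sha256:" + "ab" * 32)
```

Rejecting unprefixed input at the boundary keeps a later change of primary algorithm from silently reinterpreting stored values.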

## Status of the original blueprint pin

The pre-cleanup blueprint's `artifact_files.sha256` column is replaced by
`digest_algorithm`, `digest_primary`, and `digest_sha256`. The pre-cleanup
blueprint's implicit path-keyed storage is replaced by content-keyed
storage. These changes are absorbed into `docs/ARCHITECTURE-BLUEPRINT.md`.
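
To make the column change concrete, here is a rough sketch of the three digest columns using the standard-library `sqlite3` module as a stand-in for the project's database. Only the table name and the three digest column names come from this ADR; the `package` and `relative_path` columns, the types, the constraints, and the index are assumptions, and the inserted hex strings are placeholders, not real digests.

```python
import sqlite3

# Hypothetical DDL: only artifact_files and the digest_* names are from the ADR.
SCHEMA = """
CREATE TABLE artifact_files (
    package          TEXT NOT NULL,
    relative_path    TEXT NOT NULL,                   -- logical metadata only
    digest_algorithm TEXT NOT NULL DEFAULT 'blake3',  -- names the primary algorithm
    digest_primary   TEXT NOT NULL,                   -- backend object key derives from this
    digest_sha256    TEXT NOT NULL,                   -- always populated for interop
    PRIMARY KEY (package, relative_path)
);
CREATE INDEX artifact_files_by_primary
    ON artifact_files (digest_algorithm, digest_primary);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO artifact_files "
    "(package, relative_path, digest_primary, digest_sha256) "
    "VALUES (?, ?, ?, ?)",
    ("pkg-a", "bin/tool", "0" * 64, "1" * 64),  # placeholder hex values
)
row = conn.execute(
    "SELECT digest_algorithm, digest_primary, digest_sha256 FROM artifact_files"
).fetchone()
columns = [info[1] for info in conn.execute("PRAGMA table_info(artifact_files)")]
print(row[0], columns)
```

The index on the algorithm and primary digest is one way to support the reference lookups a garbage-collector pass would need.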