# Review: INTENT and Architecture Blueprint

Date: 2026-05-15

Reviewer: claude (opus-4-7)

Inputs: `INTENT.md`, `docs/ARCHITECTURE-BLUEPRINT.md`, `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md`, `SCOPE.md`, `AGENTS.md`

This review reframes the current scoped-internal-service design against a longer-horizon ambition: make `artifact-store` the leading open source substrate for generic artifact storage in the same sense that VLC and ffmpeg lead their domain. See `docs/PLATFORM-AMBITION.md` for the ambition framing this review is in service of.

## SWOT

### Strengths

- Clean separation between artifact *identity / lifecycle* and *bytes*. Registry owns metadata; storage adapter owns persistence. This is the single most consequential architectural decision and the docs get it right.
- Retention is a first-class concept from day one, not bolted on later.
- Audit log designed in from the start, with explicit room for signed events.
- Storage adapter contract is minimal and well-shaped (`put / get / head / delete / health`).
- Pilot-first discipline (`guide-board` / OpenCMIS TCK) anchors the work in a real producer rather than a hypothetical one.
- Manifest portability is an explicit goal: a package should be understandable without calling its producer.
- Boundary statements are explicit (will not replace StateHub, will not encode producer semantics).

### Weaknesses

- Storage is keyed by logical path, not by content hash. Blocks global deduplication, Merkle integrity proofs, partial replication, federation.
- No streaming, chunked, or resumable upload story. Multipart REST will cap throughput at the slowest Python/WSGI hop for multi-GB packages.
- No content-defined chunking (CDC). Evidence packages with logs are highly dedup-able; current design captures none of that.
- SHA-256 is the right *compatibility* digest but the wrong *throughput* digest at platform scale.
- Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps but no partitioning / sharding story exists.
- No event / change-data-capture stream for downstream consumers. StateHub, search, and UIs would have to poll.
- No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage without signed attestations leaves half the value on the table.
- Metadata is open-ended JSON without a schema-registration path. Hard to build typed tooling on top.
- No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit.
- No observability targets (latency / throughput SLOs, metrics, traces). Platform-grade claims will eventually require numbers.
- No OCI / `oras` artifact compatibility, which leaves the largest existing artifact ecosystem off the table.

### Opportunities

- **OCI Artifact + ORAS compatibility.** Inherit Helm, ML model, SBOM, cosign tooling for free. Probably the single highest-leverage external move.
- **Sigstore + in-toto + SLSA.** Evidence packages should be signed by default; this is exactly the gap most generic registries leave unfilled.
- **Content-addressed CAS + Merkle DAG** (Git / IPFS / restic pattern): enables global dedup, integrity proofs, federation, partial mirroring.
- **BLAKE3** as native digest with SHA-256 retained for interop: several times faster hashing per core (see the throughput figures under Performance hotspots) and parallelizable across cores. BLAKE3's construction *is* a Merkle tree, so package-level integrity comes for free.
- **WASM plugin surface for transforms, extractors, indexers, redactors.** The "ffmpeg moment" for this domain: a stable host API that ecosystem contributors can extend without forking the core.
- **Federation / mirroring** between artifact-store instances via signed manifests. Nothing comparable exists in the evidence space today.
- **FUSE / NFS / S3-gateway frontends.** Legacy producers ingest without code changes.
- **Embeddable mode.** A single static binary like `restic`, plus a server mode. Embedding is what makes ffmpeg ubiquitous.
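
The content-addressing idea behind the CAS opportunity above can be illustrated with a minimal in-memory sketch. Everything here is hypothetical: the `ContentStore` class, its method names, and the use of SHA-256 from the standard library as a stand-in for BLAKE3 (which would need a third-party package). It only shows how keying bytes by digest gives deduplication and read-time integrity checks while `relative_path` stays logical metadata.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store; names are illustrative, not the project's API."""

    def __init__(self):
        self.blobs = {}     # content address -> bytes
        self.manifest = {}  # relative_path -> content address

    @staticmethod
    def address(data: bytes) -> str:
        # SHA-256 is a stand-in digest; the review proposes BLAKE3 as native.
        return "sha256:" + hashlib.sha256(data).hexdigest()

    def put(self, relative_path: str, data: bytes) -> str:
        addr = self.address(data)
        # Identical content maps to the same key, so it is stored only once.
        self.blobs.setdefault(addr, data)
        self.manifest[relative_path] = addr
        return addr

    def get(self, relative_path: str) -> bytes:
        addr = self.manifest[relative_path]
        data = self.blobs[addr]
        # Re-hash on read: the address doubles as an integrity check.
        if self.address(data) != addr:
            raise ValueError(f"integrity check failed for {relative_path}")
        return data
```

Storing the same bytes under `run-1/log.txt` and `run-2/log.txt` in this sketch yields two manifest entries pointing at one blob, which is the dedup property a path-keyed store cannot offer.
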

### Threats

- Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic, IPFS, Sigstore, plain S3. None are exactly this, but each chips at the value proposition.
- Scope creep vs the carefully-scoped INTENT. The platform ambition pulls toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve this tension explicitly or you get neither.
- Python performance ceiling on the data plane (ingestion of multi-GB packages, hashing, chunking).
- Governance / maintenance debt. VLC and ffmpeg have decades of contributor depth; underestimating that is a project-killer.

## Architecture optimizations worth taking now

Each of these is cheap to lock in before code lands, and expensive (or breaking) to add later.

1. **Split control plane from data plane.** Registry / API / retention stays in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a separate process (Rust sidecar, eventually with hot kernels in C / asm) that can scale and be rewritten independently. Pin the contract now (Unix socket, gRPC, or framed bincode). See `docs/PLATFORM-AMBITION.md`.
2. **Make content the primary address.** Internal object key `blake3:` (or `sha256:` for compat). `relative_path` becomes logical metadata in the manifest. Unlocks dedup, integrity, federation, OCI compatibility.
3. **Append-only WAL as the source of truth.** Metadata DB is a materialized view rebuildable from the log. Same pattern as Kafka / EventStore / Datomic. Cheap audit, replication, point-in-time recovery.
4. **OCI artifact spec as a wire format**, even if the native API is richer. Buys instant interop with `oras`, `cosign`, `crane`, Helm.
5. **Signed manifests from day one.** Pin a signing format (cosign / Sigstore) and a canonicalization (JCS or canonical CBOR). Post-hoc signing means every legacy manifest is unsigned forever.
6. **Resumable, chunked uploads on the wire.** Upload session resource (`POST /uploads` → `PATCH /uploads/{id}` ranges → `POST /uploads/{id}/complete`). `tus.io` is a reasonable reference. v1 implementation can still be single-shot multipart.
7. **Event stream out.** A monotonic-sequence `events` table; consumers tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later.
8. **Schema-typed metadata with escape hatch.** Producers register a JSON Schema for their metadata variant (`guide-board.run.v1`). Stored as open JSON, validated at ingest, queryable by typed views.
9. **Tiering as a first-class column of `storage_location`.** Promote `retrieval_tier` and `restore_status` into the schema now (nullable, default `hot`).
10. **Ship a great CLI before any UI.** ffmpeg ships a binary, not a GUI.

## Performance hotspots: where native code actually matters

Ranked by realistic impact for this workload. Adopting libraries that already contain hand-tuned assembly is the cheap path; writing fresh assembly is an explicit research line (see `docs/ASSEMBLY-EXPERIMENT.md`).

1. **Hashing (dominant ingest cost).** SHA-256 with SHA-NI: ~1.5–2 GB/s/core. BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree. Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop.
2. **Content-defined chunking (FastCDC / Gear).** Rolling hash over every byte; pure-Python is unusable, optimized C / Rust hits 5–10 GB/s. Mandatory if dedup is on the roadmap.
3. **Compression.** Zstd with bundled SIMD reaches multi-GB/s. Evidence logs typically compress 5–20×. Apply at chunk level so dedup still works.
4. **I/O path.** Linux: `io_uring` for ingest writes; `sendfile(2)` / `splice(2)` for download zero-copy; `O_DIRECT` for very large objects.
5. **Encryption.** AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305 vector implementations for non-AES-NI hardware. Use Ring, BoringSSL, or AWS-LC. Never write crypto by hand.
6. **Metadata hot paths.** Bloom or Cuckoo filter in front of the "have I seen this hash?" lookup. ~50 lines of Rust, ~100× win.
7. **Manifest canonicalization.** Signed manifests canonicalize on every ingest and every verify. Pick a fast canonical CBOR / JCS impl.

Not worth native code: HTTP layer, retention engine, audit log, DB access, orchestration, workflow logic. Keep Python.

## Concrete suggestions before WP-0001 lands

- Add `digest_algorithm` to `artifact_files` (default `sha256`, allow `blake3`).
- Add `content_address` (e.g., `blake3:…`) as canonical storage key, with `relative_path` retained as logical metadata.
- Add `retrieval_tier` and `restore_status` to `storage_locations` now, nullable.
- Define the upload session resource shape even if v1 implements only single-shot multipart.
- Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a signing format target (cosign / Sigstore). Decide, do not implement.
- Add an `events` table with a monotonic sequence number so a change-data-capture feed is trivial later.
- Decide explicitly whether OCI artifact compatibility is a v2 goal or out of scope. Either is fine; ambiguity will distort schema decisions.

## What this review does not change

INTENT and SCOPE remain correctly scoped for v1. The pilot path through WP-0001 should ship as planned. The schema annotations above are additive, not redirective. The platform ambition lives in `docs/PLATFORM-AMBITION.md` so it can guide later decisions without expanding the current workplan.
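
For concreteness, the schema-level items under "Concrete suggestions before WP-0001 lands" might look roughly like the SQLite sketch below. This is one possible reading, not the blueprint's schema: the column types, the `id` / `artifact_file_id` keys, the `seq`, `event_type`, and `payload` column names, the `artifact_file.created` event name, and the helper functions `open_db`, `record_file`, and `events_since` are all assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical DDL: only the column names called out in the review are taken
# from the source; everything else is an assumption.
SCHEMA = """
CREATE TABLE artifact_files (
    id               INTEGER PRIMARY KEY,
    relative_path    TEXT NOT NULL,              -- logical metadata only
    digest_algorithm TEXT NOT NULL DEFAULT 'sha256'
        CHECK (digest_algorithm IN ('sha256', 'blake3')),
    content_address  TEXT NOT NULL               -- canonical storage key
);
CREATE TABLE storage_locations (
    id               INTEGER PRIMARY KEY,
    artifact_file_id INTEGER NOT NULL REFERENCES artifact_files(id),
    retrieval_tier   TEXT DEFAULT 'hot',         -- nullable, defaults to hot
    restore_status   TEXT                        -- nullable
);
CREATE TABLE events (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic, never reused
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_file(conn, relative_path, content_address, digest_algorithm="sha256"):
    """Insert a file, its storage location, and an event in one transaction."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "INSERT INTO artifact_files (relative_path, digest_algorithm, content_address) "
            "VALUES (?, ?, ?)",
            (relative_path, digest_algorithm, content_address),
        )
        conn.execute(
            "INSERT INTO storage_locations (artifact_file_id) VALUES (?)",
            (cur.lastrowid,),
        )
        payload = json.dumps({"relative_path": relative_path,
                              "content_address": content_address})
        cur = conn.execute(
            "INSERT INTO events (event_type, payload) VALUES (?, ?)",
            ("artifact_file.created", payload),
        )
        return cur.lastrowid  # the event's sequence number


def events_since(conn, last_seq):
    """Return events after last_seq in order; a consumer can poll this to tail."""
    return conn.execute(
        "SELECT seq, event_type, payload FROM events WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
```

A consumer that remembers the last `seq` it processed can call `events_since` repeatedly to tail the feed, which is the polling form of the event stream; swapping in NATS or Kafka later would not change the table.
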