Captures the longer-horizon thesis (sovereign-cloud artifact substrate) alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine schema/contract commitments the v1 must preserve to keep that horizon reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Review — INTENT and Architecture Blueprint
Date: 2026-05-15
Reviewer: claude (opus-4-7)
Inputs: INTENT.md, docs/ARCHITECTURE-BLUEPRINT.md,
workplans/ARTIFACT-STORE-WP-0001-service-baseline.md, SCOPE.md, AGENTS.md
This review reframes the current scoped-internal-service design against a
longer-horizon ambition: make artifact-store the leading open source
substrate for generic artifact storage in the same sense that VLC and ffmpeg
lead their domain. See docs/PLATFORM-AMBITION.md for the ambition framing
this review is in service of.
SWOT
Strengths
- Clean separation between artifact identity / lifecycle and bytes. Registry owns metadata; storage adapter owns persistence. This is the single most consequential architectural decision and the docs get it right.
- Retention is a first-class concept from day one, not bolted on later.
- Audit log designed in from the start, with explicit room for signed events.
- Storage adapter contract is minimal and well-shaped (put / get / head / delete / health).
- Pilot-first discipline (guide-board / OpenCMIS TCK) anchors the work in a real producer rather than a hypothetical one.
- Manifest portability is an explicit goal: a package should be understandable without calling its producer.
- Boundary statements are explicit (will not replace StateHub, will not encode producer semantics).
Weaknesses
- Storage is keyed by logical path, not by content hash. Blocks global deduplication, Merkle integrity proofs, partial replication, federation.
- No streaming, chunked, or resumable upload story. Multipart REST will cap throughput at the slowest Python/WSGI hop for multi-GB packages.
- No content-defined chunking (CDC). Evidence packages with logs are highly dedup-able; current design captures none of that.
- SHA-256 is the right compatibility digest but the wrong throughput digest at platform scale.
- Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps but no partitioning / sharding story exists.
- No event / change-data-capture stream for downstream consumers; StateHub, search, and UIs would have to poll.
- No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage without signed attestations leaves half the value on the table.
- Metadata is open-ended JSON without a schema-registration path. Hard to build typed tooling on top.
- No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit.
- No observability targets (latency / throughput SLOs, metrics, traces). Platform-grade claims will eventually require numbers.
- No OCI / oras artifact compatibility, which leaves the largest existing artifact ecosystem off the table.
Opportunities
- OCI Artifact + ORAS compatibility. Inherit Helm, ML model, SBOM, cosign tooling for free. Probably the single highest-leverage external move.
- Sigstore + in-toto + SLSA. Evidence packages should be signed by default; this is exactly the gap most generic registries leave unfilled.
- Content-addressed CAS + Merkle DAG (Git / IPFS / restic pattern): enables global dedup, integrity proofs, federation, partial mirroring.
- BLAKE3 as native digest with SHA-256 retained for interop: several times faster hashing per core (see the figures under Performance hotspots), more when parallelized across cores, and BLAKE3's construction is a Merkle tree, so package-level integrity comes nearly for free.
- WASM plugin surface for transforms, extractors, indexers, redactors. The "ffmpeg moment" for this domain: a stable host API that ecosystem contributors can extend without forking the core.
- Federation / mirroring between artifact-store instances via signed manifests. Nothing comparable exists in the evidence space today.
- FUSE / NFS / S3-gateway frontends. Legacy producers ingest without code changes.
- Embeddable mode. A single static binary like restic, plus a server mode. Embedding is what makes ffmpeg ubiquitous.
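To make the content-addressed CAS + Merkle DAG opportunity concrete, here is a minimal sketch of a binary Merkle root over chunk digests. It assumes SHA-256 from the standard library as a stand-in, since BLAKE3 needs a third-party package, and it is a generic tree, not BLAKE3's internal construction. The names leaf_hash, node_hash, and merkle_root are hypothetical.

```python
import hashlib


def leaf_hash(chunk: bytes) -> bytes:
    # Domain-separate leaves from interior nodes so a leaf can never
    # be reinterpreted as a node (prefix 0x00 for leaves).
    return hashlib.sha256(b"\x00" + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Interior nodes hash the concatenation of their children (prefix 0x01).
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(chunks: list[bytes]) -> bytes:
    """Return the root digest over an ordered list of chunks."""
    if not chunks:
        # Assumption: an empty package hashes to the digest of empty input.
        return hashlib.sha256(b"").digest()
    level = [leaf_hash(c) for c in chunks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                nxt.append(node_hash(level[i], level[i + 1]))
            else:
                # Odd node out is promoted unchanged to the next level.
                nxt.append(level[i])
        level = nxt
    return level[0]
```

Any single changed chunk changes the root, and a verifier only needs the sibling digests along one path to check one chunk, which is what makes partial mirroring and integrity proofs cheap.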
Threats
- Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic, IPFS, Sigstore, plain S3. None are exactly this, but each chips at the value proposition.
- Scope creep vs the carefully-scoped INTENT. The platform ambition pulls toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve this tension explicitly or you get neither.
- Python performance ceiling on the data plane (ingestion of multi-GB packages, hashing, chunking).
- Governance / maintenance debt. VLC and ffmpeg have decades of contributor depth; underestimating that is a project-killer.
Architecture optimizations worth taking now
Each of these is cheap to lock in before code lands, and expensive (or breaking) to add later.
- Split control plane from data plane. Registry / API / retention stays in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a separate process (Rust sidecar, eventually with hot kernels in C / asm) that can scale and be rewritten independently. Pin the contract now (Unix socket, gRPC or framed bincode). See docs/PLATFORM-AMBITION.md.
- Make content the primary address. Internal object key blake3:<digest> (or sha256:<digest> for compat). relative_path becomes logical metadata in the manifest. Unlocks dedup, integrity, federation, OCI compatibility.
- Append-only WAL as the source of truth. Metadata DB is a materialized view rebuildable from the log. Same pattern as Kafka / EventStore / Datomic. Cheap audit, replication, point-in-time recovery.
- OCI artifact spec as a wire format, even if the native API is richer. Buys instant interop with oras, cosign, crane, Helm.
- Signed manifests from day one. Pin a signing format (cosign / Sigstore) and a canonicalization (JCS or canonical CBOR). Post-hoc signing means every legacy manifest is unsigned forever.
- Resumable, chunked uploads on the wire. Upload session resource (POST /uploads → PATCH /uploads/{id} ranges → POST /uploads/{id}/complete). tus.io is a reasonable reference. v1 implementation can still be single-shot multipart.
- Event stream out. A monotonic-sequence events table; consumers tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later.
- Schema-typed metadata with escape hatch. Producers register a JSON Schema for their metadata variant (guide-board.run.v1). Stored as open JSON, validated at ingest, queryable by typed views.
- Tiering as a first-class column of storage_location. Promote retrieval_tier and restore_status into the schema now (nullable, default hot).
- Ship a great CLI before any UI. ffmpeg ships a binary, not a GUI.
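A minimal sketch of the "content as primary address" item, assuming an in-memory store and SHA-256 from the standard library standing in for BLAKE3. ContentStore, build_manifest, and the manifest field names other than relative_path and content_address are hypothetical, not part of the current design.

```python
import hashlib


class ContentStore:
    """In-memory blob store keyed by '<algorithm>:<hex digest>'."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = "sha256:" + hashlib.sha256(data).hexdigest()
        # Identical bytes map to the same key, so a second put is a no-op.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        algorithm, _, expected = address.partition(":")
        # Re-verify on read: the address doubles as an integrity check.
        if hashlib.new(algorithm, data).hexdigest() != expected:
            raise ValueError(f"corrupt blob at {address}")
        return data

    def blob_count(self) -> int:
        return len(self._blobs)


def build_manifest(store: ContentStore, files: dict[str, bytes]) -> list[dict]:
    """Store each file by content; keep the path only as logical metadata."""
    return [
        {
            "relative_path": path,
            "content_address": store.put(data),
            "size": len(data),
        }
        for path, data in sorted(files.items())
    ]
```

Two files with the same bytes under different paths produce two manifest entries but one stored blob, which is the dedup property the logical-path keying currently blocks.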
Performance hotspots — where native code actually matters
Ranked by realistic impact for this workload. Adopting libraries that already
contain hand-tuned assembly is the cheap path; writing fresh assembly is an
explicit research line — see docs/ASSEMBLY-EXPERIMENT.md.
- Hashing (dominant ingest cost). SHA-256 with SHA-NI: ~1.5–2 GB/s/core. BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree. Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop.
- Content-defined chunking (FastCDC / Gear). Rolling hash over every byte; pure-Python is unusable, optimized C / Rust hits 5–10 GB/s. Mandatory if dedup is on the roadmap.
- Compression. Zstd with bundled SIMD reaches multi-GB/s. Evidence logs typically compress 5–20×. Apply at chunk level so dedup still works.
- I/O path. Linux: io_uring for ingest writes; sendfile(2) / splice(2) for download zero-copy; O_DIRECT for very large objects.
- Encryption. AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305 vector implementations for non-AES-NI hardware. Use Ring, BoringSSL, or AWS-LC. Never write crypto by hand.
- Metadata hot paths. Bloom or Cuckoo filter in front of the "have I seen this hash?" lookup. ~50 lines of Rust, ~100× win.
- Manifest canonicalization. Signed manifests canonicalize on every ingest and every verify. Pick a fast canonical CBOR / JCS impl.
Not worth native code: HTTP layer, retention engine, audit log, DB access, orchestration, workflow logic. Keep Python.
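For readers unfamiliar with content-defined chunking, the boundary logic can be sketched in a few lines. This is a simplified Gear-style rolling hash, not FastCDC's normalized chunking, and it is pure Python, so it illustrates the technique only and would be far too slow for the data plane. The function name split_chunks and the size parameters are hypothetical.

```python
import hashlib

# Deterministic 256-entry table of 64-bit values, one per byte value.
# Assumption: derived from SHA-256 here for reproducibility; real
# implementations ship a fixed random table.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
    for i in range(256)
]
MASK64 = (1 << 64) - 1


def split_chunks(
    data: bytes, min_size: int = 64, avg_bits: int = 8, max_size: int = 1024
) -> list[bytes]:
    """Split data at positions chosen by content rather than by offset."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear update: older bytes shift out of the 64-bit state, so the
        # hash depends only on the most recent 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        # Cut when the top avg_bits bits are zero (about 1 in 2**avg_bits
        # positions), or when the chunk reaches the hard size limit.
        hit = length >= min_size and (h >> (64 - avg_bits)) == 0
        if hit or length >= max_size:
            chunks.append(data[start : i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be short
    return chunks
```

Because a cut depends only on nearby bytes, inserting data near the start of a log shifts offsets but tends to leave later boundaries, and therefore later chunk digests, unchanged. Fixed-size chunking does not have that property.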
Concrete suggestions before WP-0001 lands
- Add digest_algorithm to artifact_files (default sha256, allow blake3).
- Add content_address (e.g., blake3:…) as canonical storage key, with relative_path retained as logical metadata.
- Add retrieval_tier and restore_status to storage_locations now, nullable.
- Define the upload session resource shape even if v1 implements only single-shot multipart.
- Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a signing format target (cosign / Sigstore). Decide, do not implement.
- Add an events table with a monotonic sequence number so a change-data-capture feed is trivial later.
- Decide explicitly whether OCI artifact compatibility is a v2 goal or out of scope. Either is fine; ambiguity will distort schema decisions.
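The schema suggestions above could look roughly like the following, sketched against SQLite via Python's standard library. Only digest_algorithm, content_address, relative_path, retrieval_tier, restore_status, and the table names come from this review; every other column, the event type strings, and the helper functions are assumptions for illustration, and the real WP-0001 schema may differ.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE artifact_files (
    id               INTEGER PRIMARY KEY,
    relative_path    TEXT NOT NULL,          -- logical metadata only
    digest_algorithm TEXT NOT NULL DEFAULT 'sha256'
                     CHECK (digest_algorithm IN ('sha256', 'blake3')),
    content_address  TEXT NOT NULL           -- canonical storage key
);
CREATE TABLE storage_locations (
    id              INTEGER PRIMARY KEY,
    content_address TEXT NOT NULL,
    retrieval_tier  TEXT DEFAULT 'hot',      -- nullable, default hot
    restore_status  TEXT                     -- nullable
);
CREATE TABLE events (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic, never reused
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL                       -- JSON document
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def append_event(conn: sqlite3.Connection, event_type: str, payload: dict) -> int:
    cur = conn.execute(
        "INSERT INTO events (event_type, payload) VALUES (?, ?)",
        (event_type, json.dumps(payload, sort_keys=True)),
    )
    return cur.lastrowid


def tail_events(conn: sqlite3.Connection, after_seq: int) -> list[tuple]:
    """Return events with seq greater than after_seq, oldest first."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM events WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()
    return [(seq, kind, json.loads(body)) for seq, kind, body in rows]
```

A consumer that remembers the last seq it processed can resume with tail_events(conn, last_seq), which is all a long-poll feed needs in v1.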
What this review does not change
INTENT and SCOPE remain correctly scoped for v1. The pilot path through
WP-0001 should ship as planned. The schema annotations above are additive,
not redirective. The platform ambition lives in docs/PLATFORM-AMBITION.md
so it can guide later decisions without expanding the current workplan.