artifact-store/docs/REVIEW-2026-05-15-intent-and-blueprint.md
tegwick 403d903585 docs: add platform ambition, blueprint review, and assembly experiment
Captures the longer-horizon thesis (sovereign-cloud artifact substrate)
alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine
schema/contract commitments the v1 must preserve to keep that horizon
reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on
ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 20:56:01 +02:00


Review — INTENT and Architecture Blueprint

Date: 2026-05-15
Reviewer: claude (opus-4-7)
Inputs: INTENT.md, docs/ARCHITECTURE-BLUEPRINT.md, workplans/ARTIFACT-STORE-WP-0001-service-baseline.md, SCOPE.md, AGENTS.md

This review reframes the current scoped-internal-service design against a longer-horizon ambition: make artifact-store the leading open source substrate for generic artifact storage in the same sense that VLC and ffmpeg lead their domain. See docs/PLATFORM-AMBITION.md for the ambition framing this review is in service of.

SWOT

Strengths

  • Clean separation between artifact identity / lifecycle and bytes. Registry owns metadata; storage adapter owns persistence. This is the single most consequential architectural decision and the docs get it right.
  • Retention is a first-class concept from day one, not bolted on later.
  • Audit log designed in from the start, with explicit room for signed events.
  • Storage adapter contract is minimal and well-shaped (put / get / head / delete / health).
  • Pilot-first discipline (guide-board / OpenCMIS TCK) anchors the work in a real producer rather than a hypothetical one.
  • Manifest portability is an explicit goal — a package should be understandable without calling its producer.
  • Boundary statements are explicit (will not replace StateHub, will not encode producer semantics).

Weaknesses

  • Storage is keyed by logical path, not by content hash. This blocks global deduplication, Merkle integrity proofs, partial replication, and federation.
  • No streaming, chunked, or resumable upload story. Multipart REST will cap throughput at the slowest Python/WSGI hop for multi-GB packages.
  • No content-defined chunking (CDC). Evidence packages with logs are highly dedup-able; current design captures none of that.
  • SHA-256 is the right compatibility digest but the wrong throughput digest at platform scale.
  • Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps but no partitioning / sharding story exists.
  • No event / change-data-capture stream for downstream consumers: StateHub, search, and UIs would have to poll.
  • No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage without signed attestations leaves half the value on the table.
  • Metadata is open-ended JSON without a schema-registration path. Hard to build typed tooling on top.
  • No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit.
  • No observability targets (latency / throughput SLOs, metrics, traces). Platform-grade claims will eventually require numbers.
  • No OCI / oras artifact compatibility — leaves the largest existing artifact ecosystem off the table.
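
To make the content-defined chunking gap concrete, here is a minimal pure-Python sketch of Gear-style chunking. It is a toy under stated assumptions, not the proposed implementation: the mask, size limits, seeded gear table, and the helper names `chunk` and `shared_fraction` are all hypothetical, and a real data plane would use an optimized native library.

```python
import hashlib
import random

# Hypothetical parameters chosen for illustration; none come from the blueprint.
MASK = (1 << 11) - 1       # cut when the low 11 hash bits are zero (~2 KiB average)
MIN_SIZE = 256             # never cut before a chunk reaches this many bytes
MAX_SIZE = 16 * 1024       # always cut once a chunk reaches this many bytes

_rng = random.Random(0)    # fixed seed so boundaries are reproducible across runs
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries using a Gear rolling hash."""
    chunks: list[bytes] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Shift-and-add: the low 11 bits depend only on the last 11 bytes,
        # so boundaries follow the content, not the absolute offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        size = i - start + 1
        if (size >= MIN_SIZE and (h & MASK) == 0) or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])   # trailing partial chunk
    return chunks


def shared_fraction(old: bytes, new: bytes) -> float:
    """Fraction of new's chunks whose digest already exists among old's chunks."""
    seen = {hashlib.sha256(c).digest() for c in chunk(old)}
    new_chunks = chunk(new)
    if not new_chunks:
        return 0.0
    hits = sum(hashlib.sha256(c).digest() in seen for c in new_chunks)
    return hits / len(new_chunks)
```

Because boundaries depend on content, inserting a line near the start of a log changes only the chunks around the insertion; with fixed-size blocks, every later block would shift and miss the dedup index.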

Opportunities

  • OCI Artifact + ORAS compatibility. Inherit Helm, ML model, SBOM, cosign tooling for free. Probably the single highest-leverage external move.
  • Sigstore + in-toto + SLSA. Evidence packages should be signed by default; this is exactly the gap most generic registries leave unfilled.
  • Content-addressed CAS + Merkle DAG (Git / IPFS / restic pattern): enables global dedup, integrity proofs, federation, partial mirroring.
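  • BLAKE3 as native digest with SHA-256 retained for interop: several-fold faster hashing per core (see the hotspot estimates below) and parallelizable across cores. BLAKE3 is built as a Merkle tree internally, which makes chunk-level verification and package-level integrity proofs cheap to add.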
  • WASM plugin surface for transforms, extractors, indexers, redactors. The "ffmpeg moment" for this domain: a stable host API that ecosystem contributors can extend without forking the core.
  • Federation / mirroring between artifact-store instances via signed manifests. Nothing comparable exists in the evidence space today.
  • FUSE / NFS / S3-gateway frontends. Legacy producers ingest without code changes.
  • Embeddable mode. A single static binary like restic, plus a server mode. Embedding is what makes ffmpeg ubiquitous.
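
The content-addressed CAS and Merkle idea from the list above can be sketched in a few lines. This is an illustration under assumptions: `ContentStore` and `package_root` are hypothetical names, the store is in-memory, and SHA-256 stands in for BLAKE3 because only the former ships in the Python standard library.

```python
import hashlib


class ContentStore:
    """In-memory stand-in for a content-addressed blob store (hypothetical)."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        """Store data under its own digest and return the content address."""
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)   # identical bytes are stored once
        return key

    def get(self, key: str) -> bytes:
        """Fetch a blob and re-verify it against its address before returning."""
        data = self._blobs[key]
        algorithm, _, digest = key.partition(":")
        if hashlib.new(algorithm, data).hexdigest() != digest:
            raise ValueError(f"corrupt blob for {key}")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


def package_root(manifest: dict[str, str]) -> str:
    """Fold a {relative_path: content_address} manifest into one Merkle root.

    Leaves and inner nodes use different prefixes (0x00 / 0x01) so a leaf can
    never be confused with an inner node; an unpaired node is promoted as-is.
    """
    if not manifest:
        raise ValueError("manifest is empty")
    level = [
        hashlib.sha256(b"\x00" + path.encode() + b"\x00" + manifest[path].encode()).digest()
        for path in sorted(manifest)        # sorted: root ignores insertion order
    ]
    while len(level) > 1:
        paired = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0].hex()
```

Two paths holding identical bytes resolve to one stored blob, and `relative_path` survives only as manifest metadata. A single root over the manifest is what a signature or a federation peer would pin.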

Threats

  • Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic, IPFS, Sigstore, plain S3. None are exactly this, but each chips at the value proposition.
  • Scope creep vs the carefully-scoped INTENT. The platform ambition pulls toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve this tension explicitly or you get neither.
  • Python performance ceiling on the data plane (ingestion of multi-GB packages, hashing, chunking).
  • Governance / maintenance debt. VLC and ffmpeg have decades of contributor depth; underestimating that is a project-killer.

Architecture optimizations worth taking now

Each of these is cheap to lock in before code lands, and expensive (or breaking) to add later.

  1. Split control plane from data plane. Registry / API / retention stays in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a separate process (Rust sidecar, eventually with hot kernels in C / asm) that can scale and be rewritten independently. Pin the contract now (Unix socket, gRPC or framed bincode). See docs/PLATFORM-AMBITION.md.
  2. Make content the primary address. Internal object key blake3:<digest> (or sha256:<digest> for compat). relative_path becomes logical metadata in the manifest. Unlocks dedup, integrity, federation, OCI compatibility.
  3. Append-only WAL as the source of truth. Metadata DB is a materialized view rebuildable from the log. Same pattern as Kafka / EventStore / Datomic. Cheap audit, replication, point-in-time recovery.
  4. OCI artifact spec as a wire format, even if the native API is richer. Buys instant interop with oras, cosign, crane, Helm.
  5. Signed manifests from day one. Pin a signing format (cosign / Sigstore) and a canonicalization (JCS or canonical CBOR). Post-hoc signing means every legacy manifest is unsigned forever.
  6. Resumable, chunked uploads on the wire. Upload session resource (POST /uploads → PATCH /uploads/{id} ranges → POST /uploads/{id}/complete). tus.io is a reasonable reference. v1 implementation can still be single-shot multipart.
  7. Event stream out. A monotonic-sequence events table; consumers tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later.
  8. Schema-typed metadata with escape hatch. Producers register a JSON Schema for their metadata variant (guide-board.run.v1). Stored as open JSON, validated at ingest, queryable by typed views.
  9. Tiering as a first-class column of storage_location. Promote retrieval_tier and restore_status into the schema now (nullable, default hot).
  10. Ship a great CLI before any UI. ffmpeg ships a binary, not a GUI.
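
As one way to picture the upload session resource from item 6, the sketch below models the server-side state behind the three calls. Everything here is an assumption for illustration: the class name, its fields, the offset-based range check, and the choice to declare size and SHA-256 up front are not specified in the blueprint.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class UploadSession:
    """State behind a hypothetical /uploads/{id} resource."""

    expected_size: int
    expected_sha256: str
    received: bytearray = field(default_factory=bytearray)
    completed: bool = False

    @property
    def offset(self) -> int:
        """Next byte the server expects; a client resumes from here."""
        return len(self.received)

    def patch(self, offset: int, body: bytes) -> int:
        """Append one range (PATCH). Rejects gaps, overlaps, and overruns."""
        if self.completed:
            raise ValueError("session already completed")
        if offset != self.offset:
            raise ValueError(f"expected offset {self.offset}, got {offset}")
        if self.offset + len(body) > self.expected_size:
            raise ValueError("range exceeds declared size")
        self.received += body
        return self.offset

    def finish(self) -> str:
        """Seal the session (POST .../complete) and return the content address."""
        if self.offset != self.expected_size:
            raise ValueError(f"incomplete: {self.offset}/{self.expected_size} bytes")
        digest = hashlib.sha256(self.received).hexdigest()
        if digest != self.expected_sha256:
            raise ValueError("digest mismatch")
        self.completed = True
        return "sha256:" + digest
```

A v1 single-shot multipart upload then becomes the degenerate case: one `patch` at offset 0 followed by `finish`, so later chunked clients need no contract change.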

Performance hotspots — where native code actually matters

Ranked by realistic impact for this workload. Adopting libraries that already contain hand-tuned assembly is the cheap path; writing fresh assembly is an explicit research line — see docs/ASSEMBLY-EXPERIMENT.md.

  1. Hashing (dominant ingest cost). SHA-256 with SHA-NI: ~1.5–2 GB/s/core. BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree. Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop.
  2. Content-defined chunking (FastCDC / Gear). Rolling hash over every byte; pure-Python is unusable, optimized C / Rust hits 5–10 GB/s. Mandatory if dedup is on the roadmap.
  3. Compression. Zstd with bundled SIMD reaches multi-GB/s. Evidence logs typically compress 5–20×. Apply at chunk level so dedup still works.
  4. I/O path. Linux: io_uring for ingest writes; sendfile(2) / splice(2) for download zero-copy; O_DIRECT for very large objects.
  5. Encryption. AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305 vector implementations for non-AES-NI hardware. Use ring, BoringSSL, or AWS-LC. Never write crypto by hand.
  6. Metadata hot paths. Bloom or Cuckoo filter in front of the "have I seen this hash?" lookup. ~50 lines of Rust, ~100× win.
  7. Manifest canonicalization. Signed manifests canonicalize on every ingest and every verify. Pick a fast canonical CBOR / JCS impl.

Not worth native code: HTTP layer, retention engine, audit log, DB access, orchestration, workflow logic. Keep Python.
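
The Bloom-filter pre-check from hotspot 6 is small enough to sketch. The version below is in Python only for readability; the class name, bit count, and hash count are assumptions, and the review's suggestion is a Rust implementation in the data plane.

```python
import hashlib


class BloomFilter:
    """Approximate set: no false negatives, tunable false-positive rate."""

    def __init__(self, bits: int = 1 << 20, hashes: int = 7) -> None:
        self.bits = bits
        self.hashes = hashes
        self.array = bytearray((bits + 7) // 8)

    def _positions(self, key: bytes) -> list[int]:
        # Double hashing: derive all bit indexes from two 64-bit halves of
        # one SHA-256 digest instead of computing k independent hashes.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1   # odd step avoids a zero stride
        return [(h1 + i * h2) % self.bits for i in range(self.hashes)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.array[p >> 3] |= 1 << (p & 7)

    def __contains__(self, key: bytes) -> bool:
        return all(self.array[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


def probably_seen(index: BloomFilter, content_address: str) -> bool:
    """False means 'definitely new'; True means 'check the metadata DB'."""
    return content_address.encode() in index
```

The win comes from the negative case: a chunk the filter has never seen skips the database round trip entirely, and only positive answers need confirmation against the real index.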

Concrete suggestions before WP-0001 lands

  • Add digest_algorithm to artifact_files (default sha256, allow blake3).
  • Add content_address (e.g., blake3:…) as canonical storage key, with relative_path retained as logical metadata.
  • Add retrieval_tier and restore_status to storage_locations now, nullable.
  • Define the upload session resource shape even if v1 implements only single-shot multipart.
  • Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a signing format target (cosign / Sigstore). Decide, do not implement.
  • Add an events table with a monotonic sequence number so a change-data-capture feed is trivial later.
  • Decide explicitly whether OCI artifact compatibility is a v2 goal or out of scope. Either is fine; ambiguity will distort schema decisions.
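
The schema suggestions above could look roughly like the following, shown against SQLite for brevity. Only the column names digest_algorithm, content_address, relative_path, retrieval_tier, and restore_status and the table names come from this review; every other column, the event name `file.recorded`, and the helper functions are hypothetical, and the real WP-0001 tables will carry more fields.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE artifact_files (
    id               INTEGER PRIMARY KEY,
    relative_path    TEXT NOT NULL,                 -- logical metadata only
    digest_algorithm TEXT NOT NULL DEFAULT 'sha256'
                     CHECK (digest_algorithm IN ('sha256', 'blake3')),
    content_address  TEXT NOT NULL                  -- canonical storage key
);
CREATE TABLE storage_locations (
    id              INTEGER PRIMARY KEY,
    content_address TEXT NOT NULL,
    retrieval_tier  TEXT DEFAULT 'hot',             -- nullable, default hot
    restore_status  TEXT                            -- nullable
);
CREATE TABLE events (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,   -- monotonic, never reused
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def record_file(conn: sqlite3.Connection, relative_path: str, content_address: str) -> int:
    """Insert a file row and its event in one transaction; return the event seq."""
    algorithm = content_address.split(":", 1)[0]
    with conn:  # commits on success, rolls back both inserts on error
        conn.execute(
            "INSERT INTO artifact_files (relative_path, digest_algorithm, content_address)"
            " VALUES (?, ?, ?)",
            (relative_path, algorithm, content_address),
        )
        cur = conn.execute(
            "INSERT INTO events (event_type, payload) VALUES (?, ?)",
            ("file.recorded", json.dumps({"path": relative_path, "address": content_address})),
        )
    return cur.lastrowid


def events_after(conn: sqlite3.Connection, seq: int, limit: int = 100) -> list[tuple]:
    """Return events with a sequence number greater than seq, oldest first."""
    return conn.execute(
        "SELECT seq, event_type, payload FROM events WHERE seq > ? ORDER BY seq LIMIT ?",
        (seq, limit),
    ).fetchall()
```

A consumer that remembers the last `seq` it processed can resume with `events_after(conn, last_seq)`, which is all a long-poll endpoint needs before any broker is introduced.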

What this review does not change

INTENT and SCOPE remain correctly scoped for v1. The pilot path through WP-0001 should ship as planned. The schema annotations above are additive, not redirective. The platform ambition lives in docs/PLATFORM-AMBITION.md so it can guide later decisions without expanding the current workplan.