# Review: INTENT and Architecture Blueprint

Date: 2026-05-15
Reviewer: claude (opus-4-7)
Inputs: `INTENT.md`, `docs/ARCHITECTURE-BLUEPRINT.md`,
`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md`, `SCOPE.md`, `AGENTS.md`

This review reframes the current scoped-internal-service design against a
longer-horizon ambition: make `artifact-store` the leading open source
substrate for generic artifact storage in the same sense that VLC and ffmpeg
lead their domains. See `docs/PLATFORM-AMBITION.md` for the ambition framing
this review is in service of.

## SWOT

### Strengths

- Clean separation between artifact *identity / lifecycle* and *bytes*.
  Registry owns metadata; storage adapter owns persistence. This is the single
  most consequential architectural decision and the docs get it right.
- Retention is a first-class concept from day one, not bolted on later.
- Audit log designed in from the start, with explicit room for signed events.
- Storage adapter contract is minimal and well-shaped
  (`put / get / head / delete / health`).
- Pilot-first discipline (`guide-board` / OpenCMIS TCK) anchors the work in a
  real producer rather than a hypothetical one.
- Manifest portability is an explicit goal: a package should be understandable
  without calling its producer.
- Boundary statements are explicit (will not replace StateHub, will not encode
  producer semantics).

### Weaknesses

- Storage is keyed by logical path, not by content hash. This blocks global
  deduplication, Merkle integrity proofs, partial replication, and federation.
- No streaming, chunked, or resumable upload story. Multipart REST will cap
  throughput at the slowest Python/WSGI hop for multi-GB packages.
- No content-defined chunking (CDC). Evidence packages with logs are highly
  dedup-able; the current design captures none of that.
- SHA-256 is the right *compatibility* digest but the wrong *throughput*
  digest at platform scale.
- Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps but no
  partitioning / sharding story exists.
- No event or change-data-capture stream for downstream consumers: StateHub,
  search, and UIs would have to poll.
- No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage
  without signed attestations leaves half the value on the table.
- Metadata is open-ended JSON without a schema-registration path. Hard to
  build typed tooling on top.
- No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit.
- No observability targets (latency / throughput SLOs, metrics, traces).
  Platform-grade claims will eventually require numbers.
- No OCI / `oras` artifact compatibility, which leaves the largest existing
  artifact ecosystem off the table.

### Opportunities

- **OCI Artifact + ORAS compatibility.** Inherit Helm, ML model, SBOM, cosign
  tooling for free. Probably the single highest-leverage external move.
- **Sigstore + in-toto + SLSA.** Evidence packages should be signed by
  default; this is exactly the gap most generic registries leave unfilled.
- **Content-addressed CAS + Merkle DAG** (Git / IPFS / restic pattern):
  enables global dedup, integrity proofs, federation, partial mirroring.
- **BLAKE3** as native digest with SHA-256 retained for interop:
  several times faster per core than SHA-256 (see the hashing figures under
  performance hotspots) and parallelizable across cores. BLAKE3's
  construction *is* a Merkle tree over fixed-size chunks, so chunk-level
  integrity proofs within a file come almost for free.
- **WASM plugin surface for transforms, extractors, indexers, redactors.**
  The "ffmpeg moment" for this domain: a stable host API that ecosystem
  contributors can extend without forking the core.
- **Federation / mirroring** between artifact-store instances via signed
  manifests. Nothing comparable exists in the evidence space today.
- **FUSE / NFS / S3-gateway frontends.** Legacy producers ingest without code
  changes.
- **Embeddable mode.** A single static binary like `restic`, plus a server
  mode. Embedding is what makes ffmpeg ubiquitous.
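
To make the "Merkle DAG" bullet concrete, here is a minimal sketch of a Merkle root computed over a package's chunks. It is an illustration under stated assumptions, not the project's design: SHA-256 stands in for BLAKE3 only because it ships in the Python standard library, the function names `leaf_hash`, `node_hash`, and `merkle_root` are hypothetical, and this tree layout is a generic one rather than BLAKE3's internal tree.

```python
import hashlib


def leaf_hash(chunk: bytes) -> bytes:
    # Prefix 0x00 marks a leaf so a leaf can never be confused with a node.
    return hashlib.sha256(b"\x00" + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # Prefix 0x01 marks an interior node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(chunks: list[bytes]) -> bytes:
    """Return one digest that commits to every chunk and to their order."""
    if not chunks:
        return hashlib.sha256(b"").digest()
    level = [leaf_hash(c) for c in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(node_hash(level[i], level[i + 1]))
            else:
                # An unpaired node is promoted unchanged to the next level.
                next_level.append(level[i])
        level = next_level
    return level[0]
```

Because the root changes whenever any chunk changes, a mirror could compare one digest per package and then descend only into subtrees that differ, which is the property the dedup, integrity-proof, and partial-mirroring claims rest on.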

### Threats

- Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic,
  IPFS, Sigstore, plain S3. None are exactly this, but each chips at the
  value proposition.
- Scope creep vs the carefully-scoped INTENT. The platform ambition pulls
  toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve
  this tension explicitly or you get neither.
- Python performance ceiling on the data plane (ingestion of multi-GB
  packages, hashing, chunking).
- Governance / maintenance debt. VLC and ffmpeg have decades of contributor
  depth; underestimating that is a project-killer.

## Architecture optimizations worth taking now

Each of these is cheap to lock in before code lands, and expensive (or
breaking) to add later.

1. **Split control plane from data plane.** Registry / API / retention stays
   in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a
   separate process (Rust sidecar, eventually with hot kernels in C / asm)
   that can scale and be rewritten independently. Pin the contract now (Unix
   socket, gRPC or framed bincode). See `docs/PLATFORM-AMBITION.md`.
2. **Make content the primary address.** The internal object key becomes
   `blake3:<digest>` (or `sha256:<digest>` for compat). `relative_path`
   becomes logical metadata in the manifest. Unlocks dedup, integrity,
   federation, OCI compatibility.
3. **Append-only WAL as the source of truth.** Metadata DB is a materialized
   view rebuildable from the log. Same pattern as Kafka / EventStore /
   Datomic. Cheap audit, replication, point-in-time recovery.
4. **OCI artifact spec as a wire format**, even if the native API is richer.
   Buys instant interop with `oras`, `cosign`, `crane`, Helm.
5. **Signed manifests from day one.** Pin a signing format (cosign / Sigstore)
   and a canonicalization (JCS or canonical CBOR). Post-hoc signing means
   every legacy manifest is unsigned forever.
6. **Resumable, chunked uploads on the wire.** Upload session resource
   (`POST /uploads` → `PATCH /uploads/{id}` ranges → `POST /uploads/{id}/complete`).
   `tus.io` is a reasonable reference. v1 implementation can still be
   single-shot multipart.
7. **Event stream out.** A monotonic-sequence `events` table; consumers
   tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later.
8. **Schema-typed metadata with escape hatch.** Producers register a JSON
   Schema for their metadata variant (`guide-board.run.v1`). Stored as open
   JSON, validated at ingest, queryable by typed views.
9. **Tiering as a first-class column of `storage_location`.** Promote
   `retrieval_tier` and `restore_status` into the schema now (nullable,
   default `hot`).
10. **Ship a great CLI before any UI.** ffmpeg ships a binary, not a GUI.
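
Items 2 and 5 interact, so a small sketch may help show how they could fit together: blobs are stored under a content address, the manifest maps each `relative_path` to that address, and the manifest itself is canonicalized before it is hashed (and, later, signed). Everything here is an assumption made for illustration: the `ContentStore` class, `build_manifest`, `canonical_bytes`, and the `example.manifest.v0` schema name are hypothetical, SHA-256 is used because BLAKE3 is not in the standard library, and the `json.dumps` call only approximates JCS (RFC 8785) for simple manifests with string keys, strings, and integers.

```python
import hashlib
import json


class ContentStore:
    """In-memory stand-in for a storage adapter keyed by content address."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = "sha256:" + hashlib.sha256(data).hexdigest()
        # Identical bytes map to the same key, so a repeated put stores nothing new.
        self._blobs.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self._blobs[address]
        algorithm, _, digest = address.partition(":")
        # Re-hash on read: the key doubles as an integrity check.
        if hashlib.new(algorithm, data).hexdigest() != digest:
            raise ValueError(f"corrupt object at {address}")
        return data

    def __len__(self) -> int:
        return len(self._blobs)


def build_manifest(store: ContentStore, files: dict[str, bytes]) -> dict:
    """Store each file and record its logical path next to its content address."""
    entries = [
        {"relative_path": path, "content_address": store.put(data), "size": len(data)}
        for path, data in sorted(files.items())
    ]
    return {"schema": "example.manifest.v0", "files": entries}


def canonical_bytes(manifest: dict) -> bytes:
    # Sorted keys and no insignificant whitespace give a stable byte form.
    # This is NOT a full RFC 8785 implementation (floats and key ordering
    # for non-ASCII keys are not handled to spec).
    text = json.dumps(manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


store = ContentStore()
manifest = build_manifest(
    store,
    {"logs/run.log": b"ok\n", "logs/copy.log": b"ok\n", "report.json": b"{}"},
)
# The manifest is itself a content-addressed object; its address is what a
# signature would cover.
manifest_address = store.put(canonical_bytes(manifest))
```

In this run the two log files have identical bytes, so the store holds two file blobs plus the manifest rather than four objects. A production system would swap in a vetted canonicalization library once the format is pinned.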

## Performance hotspots: where native code actually matters

Ranked by realistic impact for this workload. Adopting libraries that already
contain hand-tuned assembly is the cheap path; writing fresh assembly is an
explicit research line (see `docs/ASSEMBLY-EXPERIMENT.md`).

1. **Hashing (dominant ingest cost).** SHA-256 with SHA-NI: ~1.5–2 GB/s/core.
   BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree.
   Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop.
2. **Content-defined chunking (FastCDC / Gear).** Rolling hash over every
   byte; pure-Python is unusable, optimized C / Rust hits 5–10 GB/s.
   Mandatory if dedup is on the roadmap.
3. **Compression.** Zstd with bundled SIMD reaches multi-GB/s. Evidence logs
   typically compress 5–20×. Apply at chunk level so dedup still works.
4. **I/O path.** Linux: `io_uring` for ingest writes; `sendfile(2)` /
   `splice(2)` for download zero-copy; `O_DIRECT` for very large objects.
5. **Encryption.** AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305
   vector implementations for non-AES-NI hardware. Use `ring`, BoringSSL, or
   AWS-LC. Never write crypto by hand.
6. **Metadata hot paths.** Bloom or Cuckoo filter in front of the
   "have I seen this hash?" lookup. ~50 lines of Rust, ~100× win.
7. **Manifest canonicalization.** Signed manifests canonicalize on every
   ingest and every verify. Pick a fast canonical CBOR / JCS impl.
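
The chunking item is easier to reason about with the algorithm in view. Below is a minimal Gear-style content-defined chunker, written in Python purely to show the logic. As the list notes, a pure-Python loop like this is far too slow for the data plane. The gear table seed, the size parameters, and the name `gear_chunks` are assumptions made for this sketch, and the boundary rule is a simplified one rather than FastCDC's normalized chunking.

```python
import random

# Hypothetical gear table: 256 pseudo-random 64-bit values. A fixed seed
# keeps chunk boundaries reproducible between runs and between hosts.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def gear_chunks(data: bytes, min_size: int = 2048, avg_bits: int = 13,
                max_size: int = 65536) -> list[bytes]:
    """Split data where a rolling fingerprint matches a bit pattern.

    Boundaries depend on nearby content rather than on absolute offsets,
    so a local edit tends to disturb only the chunks around it.
    """
    chunks = []
    start = 0
    fingerprint = 0
    shift = 64 - avg_bits  # test the top bits, which span the last 64 bytes
    for i, byte in enumerate(data):
        # Gear update: shift out the oldest byte's influence, mix in the new one.
        fingerprint = ((fingerprint << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue
        if (fingerprint >> shift) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            fingerprint = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks
```

Each chunk would then be hashed and stored by content address, so identical chunks across packages are stored once. Compressing per chunk, as item 3 suggests, keeps that property because equal chunks still produce equal stored objects.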

Not worth native code: HTTP layer, retention engine, audit log, DB access,
orchestration, workflow logic. Keep Python.

## Concrete suggestions before WP-0001 lands

- Add `digest_algorithm` to `artifact_files` (default `sha256`, allow
  `blake3`).
- Add `content_address` (e.g., `blake3:…`) as canonical storage key, with
  `relative_path` retained as logical metadata.
- Add `retrieval_tier` and `restore_status` to `storage_locations` now,
  nullable.
- Define the upload session resource shape even if v1 implements only
  single-shot multipart.
- Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a
  signing format target (cosign / Sigstore). Decide now; do not implement yet.
- Add an `events` table with a monotonic sequence number so a
  change-data-capture feed is trivial later.
- Decide explicitly whether OCI artifact compatibility is a v2 goal or out of
  scope. Either is fine; ambiguity will distort schema decisions.
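
The schema-level suggestions above can be sketched as follows, using SQLite in memory so the example runs anywhere. Only the table names and the columns `digest_algorithm`, `content_address`, `relative_path`, `retrieval_tier`, and `restore_status` come from this review. Every other column (`id`, `uri`, `seq`, `event_type`, `payload`), the event type strings, and the helpers `record_event` and `events_after` are hypothetical, and the real DDL in WP-0001 may differ.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE artifact_files (
    id               INTEGER PRIMARY KEY,
    relative_path    TEXT NOT NULL,               -- logical metadata only
    digest_algorithm TEXT NOT NULL DEFAULT 'sha256'
                     CHECK (digest_algorithm IN ('sha256', 'blake3')),
    content_address  TEXT NOT NULL                -- canonical storage key
);
CREATE TABLE storage_locations (
    id             INTEGER PRIMARY KEY,
    uri            TEXT NOT NULL,
    retrieval_tier TEXT DEFAULT 'hot',            -- nullable, defaults to hot
    restore_status TEXT                           -- nullable
);
CREATE TABLE events (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT, -- monotonic sequence number
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL                      -- JSON document
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)


def record_event(conn: sqlite3.Connection, event_type: str, payload: dict) -> int:
    """Append one event and return its sequence number."""
    cursor = conn.execute(
        "INSERT INTO events (event_type, payload) VALUES (?, ?)",
        (event_type, json.dumps(payload, sort_keys=True)),
    )
    return cursor.lastrowid


def events_after(conn: sqlite3.Connection, last_seq: int, limit: int = 100) -> list:
    """Let a consumer tail the feed from the last sequence number it saw."""
    return conn.execute(
        "SELECT seq, event_type, payload FROM events "
        "WHERE seq > ? ORDER BY seq LIMIT ?",
        (last_seq, limit),
    ).fetchall()


# Placeholder digest; a real row would carry the actual content hash.
conn.execute(
    "INSERT INTO artifact_files (relative_path, content_address) VALUES (?, ?)",
    ("logs/run.log", "sha256:" + "0" * 64),
)
conn.execute("INSERT INTO storage_locations (uri) VALUES (?)", ("file:///example",))
record_event(conn, "artifact.registered", {"relative_path": "logs/run.log"})
record_event(conn, "artifact.sealed", {"relative_path": "logs/run.log"})
```

A consumer that remembers the last `seq` it processed can resume with `events_after(conn, last_seq)` after a restart. One caveat worth checking before relying on this in PostgreSQL: sequence values are assigned before commit, so concurrent writers can make rows visible out of sequence order, and a tailing consumer needs a strategy for that.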

## What this review does not change

INTENT and SCOPE remain correctly scoped for v1. The pilot path through
WP-0001 should ship as planned. The schema annotations above are additive,
not redirective. The platform ambition lives in `docs/PLATFORM-AMBITION.md`
so it can guide later decisions without expanding the current workplan.