artifact-store/docs/REVIEW-2026-05-15-intent-and-blueprint.md
tegwick 403d903585 docs: add platform ambition, blueprint review, and assembly experiment
Captures the longer-horizon thesis (sovereign-cloud artifact substrate)
alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine
schema/contract commitments the v1 must preserve to keep that horizon
reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on
ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 20:56:01 +02:00

# Review — INTENT and Architecture Blueprint
Date: 2026-05-15
Reviewer: claude (opus-4-7)
Inputs: `INTENT.md`, `docs/ARCHITECTURE-BLUEPRINT.md`,
`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md`, `SCOPE.md`, `AGENTS.md`
This review reframes the current scoped-internal-service design against a
longer-horizon ambition: make `artifact-store` the leading open source
substrate for generic artifact storage in the same sense that VLC and ffmpeg
lead their domain. See `docs/PLATFORM-AMBITION.md` for the ambition framing
this review is in service of.
## SWOT
### Strengths
- Clean separation between artifact *identity / lifecycle* and *bytes*.
Registry owns metadata; storage adapter owns persistence. This is the single
most consequential architectural decision and the docs get it right.
- Retention is a first-class concept from day one, not bolted on later.
- Audit log designed in from the start, with explicit room for signed events.
- Storage adapter contract is minimal and well-shaped
(`put / get / head / delete / health`).
- Pilot-first discipline (`guide-board` / OpenCMIS TCK) anchors the work in a
real producer rather than a hypothetical one.
- Manifest portability is an explicit goal — a package should be understandable
without calling its producer.
- Boundary statements are explicit (will not replace StateHub, will not encode
producer semantics).
### Weaknesses
- Storage is keyed by logical path, not by content hash. Blocks global
deduplication, Merkle integrity proofs, partial replication, federation.
- No streaming, chunked, or resumable upload story. Multipart REST will cap
throughput at the slowest Python/WSGI hop for multi-GB packages.
- No content-defined chunking (CDC). Evidence packages with logs are highly
dedup-able; current design captures none of that.
- SHA-256 is the right *compatibility* digest but the wrong *throughput*
digest at platform scale.
- Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps but no
partitioning / sharding story exists.
- No event / change-data-capture stream for downstream consumers; StateHub,
search, and UIs would have to poll.
- No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage
without signed attestations leaves half the value on the table.
- Metadata is open-ended JSON without a schema-registration path. Hard to
build typed tooling on top.
- No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit.
- No observability targets (latency / throughput SLOs, metrics, traces).
Platform-grade claims will eventually require numbers.
- No OCI / `oras` artifact compatibility — leaves the largest existing
artifact ecosystem off the table.
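
To make the first weakness concrete, here is a minimal sketch contrasting a path-keyed store with a content-keyed one. Both classes, and the sample paths, are hypothetical illustrations and not the project's storage adapter; SHA-256 from the standard library stands in for whichever digest is eventually chosen.

```python
import hashlib


class PathKeyedStore:
    """Stores bytes under the logical path the producer supplied."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def put(self, relative_path: str, data: bytes) -> str:
        self.objects[relative_path] = data
        return relative_path


class ContentKeyedStore:
    """Stores bytes under a digest-derived key; paths live in a separate index."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}
        self.paths: dict[str, str] = {}  # relative_path -> content key

    def put(self, relative_path: str, data: bytes) -> str:
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self.objects.setdefault(key, data)  # identical bytes are stored once
        self.paths[relative_path] = key
        return key


# Two runs produce the same log under different logical paths (made-up data).
log = b"2026-05-15 TCK run passed\n" * 1000
by_path, by_content = PathKeyedStore(), ContentKeyedStore()
for path in ("run-1/tck.log", "run-2/tck.log"):
    by_path.put(path, log)
    by_content.put(path, log)

stored_by_path = sum(len(v) for v in by_path.objects.values())
stored_by_content = sum(len(v) for v in by_content.objects.values())
```

The path-keyed store holds the log twice, while the content-keyed store holds it once and keeps both logical paths as metadata pointing at the same key.
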
### Opportunities
- **OCI Artifact + ORAS compatibility.** Inherit Helm, ML model, SBOM, cosign
tooling for free. Probably the single highest-leverage external move.
- **Sigstore + in-toto + SLSA.** Evidence packages should be signed by
default; this is exactly the gap most generic registries leave unfilled.
- **Content-addressed CAS + Merkle DAG** (Git / IPFS / restic pattern):
enables global dedup, integrity proofs, federation, partial mirroring.
- **BLAKE3** as native digest with SHA-256 retained for interop:
severalfold faster hashing per core on SIMD hardware, parallelizable across
cores, and BLAKE3's construction *is* a Merkle tree, so package-level
integrity comes nearly for free.
- **WASM plugin surface for transforms, extractors, indexers, redactors.**
The "ffmpeg moment" for this domain: a stable host API that ecosystem
contributors can extend without forking the core.
- **Federation / mirroring** between artifact-store instances via signed
manifests. Nothing comparable exists in the evidence space today.
- **FUSE / NFS / S3-gateway frontends.** Legacy producers ingest without code
changes.
- **Embeddable mode.** A single static binary like `restic`, plus a server
mode. Embedding is what makes ffmpeg ubiquitous.
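
The Merkle-based integrity idea in the list above can be sketched as a root hash over an ordered list of chunks. This is an illustrative construction only: it uses SHA-256 from the standard library, and its tree layout (leaf/node prefixes, odd node promoted unchanged) is an assumption for the example, not BLAKE3's internal tree and not a format the project has committed to.

```python
import hashlib


def leaf_hash(chunk: bytes) -> bytes:
    # Domain-separate leaves (0x00) from interior nodes (0x01).
    return hashlib.sha256(b"\x00" + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(chunks: list[bytes]) -> bytes:
    """Root over an ordered chunk list; an odd node is promoted unchanged."""
    if not chunks:
        return hashlib.sha256(b"").digest()
    level = [leaf_hash(c) for c in chunks]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(node_hash(level[i], level[i + 1]))
        if len(level) % 2 == 1:
            nxt.append(level[-1])
        level = nxt
    return level[0]


package = [b"manifest.json bytes", b"tck.log part 1", b"tck.log part 2"]
root = merkle_root(package)

# Changing any single chunk changes the root, so one digest covers the package.
tampered = list(package)
tampered[1] = b"tck.log part 1 (edited)"
tampered_root = merkle_root(tampered)
```

A mirror that holds the root can verify any chunk it receives with a logarithmic number of sibling hashes, which is what makes partial mirroring and federation checkable.
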
### Threats
- Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic,
IPFS, Sigstore, plain S3. None are exactly this, but each chips at the
value proposition.
- Scope creep vs the carefully-scoped INTENT. The platform ambition pulls
toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve
this tension explicitly or you get neither.
- Python performance ceiling on the data plane (ingestion of multi-GB
packages, hashing, chunking).
- Governance / maintenance debt. VLC and ffmpeg have decades of contributor
depth; underestimating that is a project-killer.
## Architecture optimizations worth taking now
Each of these is cheap to lock in before code lands, and expensive (or
breaking) to add later.
1. **Split control plane from data plane.** Registry / API / retention stays
in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a
separate process (Rust sidecar, eventually with hot kernels in C / asm)
that can scale and be rewritten independently. Pin the contract now (Unix
socket, gRPC or framed bincode). See `docs/PLATFORM-AMBITION.md`.
2. **Make content the primary address.** Internal object key
`blake3:<digest>` (or `sha256:<digest>` for compat). `relative_path`
becomes logical metadata in the manifest. Unlocks dedup, integrity,
federation, OCI compatibility.
3. **Append-only WAL as the source of truth.** Metadata DB is a materialized
view rebuildable from the log. Same pattern as Kafka / EventStore /
Datomic. Cheap audit, replication, point-in-time recovery.
4. **OCI artifact spec as a wire format**, even if the native API is richer.
Buys instant interop with `oras`, `cosign`, `crane`, Helm.
5. **Signed manifests from day one.** Pin a signing format (cosign / Sigstore)
and a canonicalization (JCS or canonical CBOR). Post-hoc signing means
every legacy manifest is unsigned forever.
6. **Resumable, chunked uploads on the wire.** Upload session resource
(`POST /uploads` → `PATCH /uploads/{id}` ranges → `POST /uploads/{id}/complete`).
`tus.io` is a reasonable reference. v1 implementation can still be
single-shot multipart.
7. **Event stream out.** A monotonic-sequence `events` table; consumers
tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later.
8. **Schema-typed metadata with escape hatch.** Producers register a JSON
Schema for their metadata variant (`guide-board.run.v1`). Stored as open
JSON, validated at ingest, queryable by typed views.
9. **Tiering as a first-class column of `storage_location`.** Promote
`retrieval_tier` and `restore_status` into the schema now (nullable,
default `hot`).
10. **Ship a great CLI before any UI.** ffmpeg ships a binary, not a GUI.
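
The upload session in item 6 can be modelled without any HTTP framework. The sketch below is a hypothetical in-memory stand-in: the class name, method names, and the choice to verify a SHA-256 digest at completion are assumptions made for illustration, and the comments map each call to the route it would sit behind.

```python
import hashlib
import uuid


class UploadSession:
    """In-memory stand-in for the resource behind /uploads/{id}."""

    def __init__(self, total_size: int) -> None:
        self.id = uuid.uuid4().hex
        self.total_size = total_size
        self.buffer = bytearray()
        self.completed = False

    @property
    def offset(self) -> int:
        """Number of bytes accepted so far; a client resumes from here."""
        return len(self.buffer)

    def patch(self, offset: int, data: bytes) -> int:
        if self.completed:
            raise ValueError("session already completed")
        if offset != self.offset:
            raise ValueError(f"expected offset {self.offset}, got {offset}")
        if self.offset + len(data) > self.total_size:
            raise ValueError("range exceeds declared size")
        self.buffer += data
        return self.offset

    def complete(self, expected_sha256: str) -> str:
        if self.offset != self.total_size:
            raise ValueError("upload incomplete")
        actual = hashlib.sha256(self.buffer).hexdigest()
        if actual != expected_sha256:
            raise ValueError("digest mismatch")
        self.completed = True
        return "sha256:" + actual


payload = b"evidence-package-bytes" * 100
session = UploadSession(total_size=len(payload))  # POST /uploads
session.patch(0, payload[:1000])                  # PATCH /uploads/{id}
# After a dropped connection the client asks for the offset and resumes there.
resume_at = session.offset
session.patch(resume_at, payload[resume_at:])     # PATCH /uploads/{id}
address = session.complete(hashlib.sha256(payload).hexdigest())  # POST .../complete
```

A single-shot multipart upload is the degenerate case of one `patch` at offset 0 followed by `complete`, which is why v1 can ship the simple path without changing the resource shape later.
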
## Performance hotspots — where native code actually matters
Ranked by realistic impact for this workload. Adopting libraries that already
contain hand-tuned assembly is the cheap path; writing fresh assembly is an
explicit research line — see `docs/ASSEMBLY-EXPERIMENT.md`.
1. **Hashing (dominant ingest cost).** SHA-256 with SHA-NI: ~1.5–2 GB/s/core.
BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree.
Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop.
2. **Content-defined chunking (FastCDC / Gear).** Rolling hash over every
byte; pure-Python is unusable, optimized C / Rust hits 5–10 GB/s.
Mandatory if dedup is on the roadmap.
3. **Compression.** Zstd with bundled SIMD reaches multi-GB/s. Evidence logs
typically compress 5–20×. Apply at chunk level so dedup still works.
4. **I/O path.** Linux: `io_uring` for ingest writes; `sendfile(2)` /
`splice(2)` for download zero-copy; `O_DIRECT` for very large objects.
5. **Encryption.** AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305
vector implementations for non-AES-NI hardware. Use Ring, BoringSSL, or
AWS-LC. Never write crypto by hand.
6. **Metadata hot paths.** Bloom or Cuckoo filter in front of the
"have I seen this hash?" lookup. ~50 lines of Rust, ~100× win.
7. **Manifest canonicalization.** Signed manifests canonicalize on every
ingest and every verify. Pick a fast canonical CBOR / JCS impl.
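
To show what item 2 computes, here is a small Gear-style content-defined chunker. It is written in Python purely for readability (the list above already notes that pure Python is far too slow for production), the gear table is generated from a fixed seed as an assumption for the example, and the size parameters are arbitrary. It is a simplified Gear chunker, not the full FastCDC algorithm with normalized chunking.

```python
import hashlib
import random

# Fixed pseudo-random table: one 64-bit value per byte value (example only).
_table_rng = random.Random(0)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]
MASK64 = (1 << 64) - 1


def chunk_boundaries(data: bytes, avg_bits: int = 10,
                     min_size: int = 256, max_size: int = 4096) -> list[int]:
    """Return the end offset of each chunk, chosen from content rather than position."""
    # Test the top bits: they depend on the last 64 bytes of input.
    mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    cuts, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64  # rolling Gear hash
        length = i - start + 1
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            cuts.append(i + 1)
            start, h = i + 1, 0
    if start < len(data):
        cuts.append(len(data))  # trailing partial chunk
    return cuts


def split(data: bytes) -> list[bytes]:
    cuts = chunk_boundaries(data)
    return [data[a:b] for a, b in zip([0] + cuts[:-1], cuts)]


# Insert a few bytes at the front of a file and compare the resulting chunks.
data_rng = random.Random(1)
original = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))
edited = b"PREFIX:" + original

digests_a = {hashlib.sha256(c).digest() for c in split(original)}
digests_b = {hashlib.sha256(c).digest() for c in split(edited)}
shared = digests_a & digests_b
```

With fixed-size chunks the seven-byte prefix would shift every boundary and no chunk would match. Because boundaries here follow the content, the two inputs realign after the edit and most chunk digests are shared, which is the property deduplication depends on.
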
Not worth native code: HTTP layer, retention engine, audit log, DB access,
orchestration, workflow logic. Keep Python.
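
The Bloom-filter idea from item 6 can be sketched in a few lines. The review suggests Rust for the real thing; this Python version only illustrates the control flow, and the class names, filter size, and hash count are assumptions. A Python `set` stands in for the authoritative metadata lookup.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7) -> None:
        assert size_bits % 8 == 0
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key: bytes):
        # Double hashing: derive k bit positions from two 64-bit values.
        d = hashlib.sha256(key).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(key))


class DedupIndex:
    """Bloom filter in front of an authoritative 'have I seen this hash?' lookup."""

    def __init__(self) -> None:
        self.bloom = BloomFilter()
        self.known: set[bytes] = set()  # stand-in for the metadata DB
        self.db_lookups = 0

    def record(self, digest: bytes) -> None:
        self.bloom.add(digest)
        self.known.add(digest)

    def seen_before(self, digest: bytes) -> bool:
        if not self.bloom.might_contain(digest):
            return False          # definite miss: the DB is never touched
        self.db_lookups += 1      # possible hit: confirm against the DB
        return digest in self.known


index = DedupIndex()
digests = [hashlib.sha256(str(n).encode()).digest() for n in range(100)]
for d in digests:
    index.record(d)
```

The filter never returns a false negative, so a "no" answer skips the database entirely. False positives only cost one confirming lookup, which is why the filter is safe to place in front of the authoritative index.
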
## Concrete suggestions before WP-0001 lands
- Add `digest_algorithm` to `artifact_files` (default `sha256`, allow
`blake3`).
- Add `content_address` (e.g., `blake3:…`) as canonical storage key, with
`relative_path` retained as logical metadata.
- Add `retrieval_tier` and `restore_status` to `storage_locations` now,
nullable.
- Define the upload session resource shape even if v1 implements only
single-shot multipart.
- Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a
signing format target (cosign / Sigstore). Decide, do not implement.
- Add an `events` table with a monotonic sequence number so a
change-data-capture feed is trivial later.
- Decide explicitly whether OCI artifact compatibility is a v2 goal or out of
scope. Either is fine; ambiguity will distort schema decisions.
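
The schema suggestions above could look roughly like the following, shown with Python's built-in `sqlite3` so it runs as-is. Only the table names and the columns `digest_algorithm`, `content_address`, `relative_path`, `retrieval_tier`, and `restore_status` come from this review; every other column, the `CHECK` constraint, and the `emit` / `tail` helpers are assumptions for illustration, and the real WP-0001 schema will differ.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE artifact_files (
    id               INTEGER PRIMARY KEY,
    relative_path    TEXT NOT NULL,              -- logical metadata only
    digest_algorithm TEXT NOT NULL DEFAULT 'sha256'
                     CHECK (digest_algorithm IN ('sha256', 'blake3')),
    content_address  TEXT NOT NULL               -- canonical storage key
);
CREATE TABLE storage_locations (
    id              INTEGER PRIMARY KEY,
    content_address TEXT NOT NULL,
    retrieval_tier  TEXT DEFAULT 'hot',          -- nullable
    restore_status  TEXT                         -- nullable
);
CREATE TABLE events (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT, -- monotonic, never reused
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL
);
"""


def emit(conn: sqlite3.Connection, event_type: str, payload: dict) -> int:
    cur = conn.execute(
        "INSERT INTO events (event_type, payload) VALUES (?, ?)",
        (event_type, json.dumps(payload)),
    )
    return cur.lastrowid


def tail(conn: sqlite3.Connection, after_seq: int) -> list[tuple]:
    """Return every event a consumer has not yet seen, in order."""
    return conn.execute(
        "SELECT seq, event_type, payload FROM events WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO artifact_files (relative_path, content_address) VALUES (?, ?)",
    ("run-1/tck.log", "sha256:placeholder-digest"),
)
conn.execute(
    "INSERT INTO storage_locations (content_address) VALUES (?)",
    ("sha256:placeholder-digest",),
)
first = emit(conn, "artifact.registered", {"path": "run-1/tck.log"})
second = emit(conn, "artifact.stored", {"path": "run-1/tck.log"})
```

A consumer keeps the last `seq` it processed and calls `tail` with it. That is enough for long-polling today and maps directly onto NATS or Kafka offsets later.
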
## What this review does not change
INTENT and SCOPE remain correctly scoped for v1. The pilot path through
WP-0001 should ship as planned. The schema annotations above are additive,
not redirective. The platform ambition lives in `docs/PLATFORM-AMBITION.md`
so it can guide later decisions without expanding the current workplan.