docs: add platform ambition, blueprint review, and assembly experiment
Captures the longer-horizon thesis (sovereign-cloud artifact substrate) alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine schema/contract commitments the v1 must preserve to keep that horizon reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
175
docs/REVIEW-2026-05-15-intent-and-blueprint.md
Normal file
@@ -0,0 +1,175 @@
# Review — INTENT and Architecture Blueprint

Date: 2026-05-15
Reviewer: claude (opus-4-7)
Inputs: `INTENT.md`, `docs/ARCHITECTURE-BLUEPRINT.md`,
`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md`, `SCOPE.md`, `AGENTS.md`

This review reframes the current scoped-internal-service design against a
longer-horizon ambition: make `artifact-store` the leading open-source
substrate for generic artifact storage, in the same sense that VLC and ffmpeg
lead their domains. See `docs/PLATFORM-AMBITION.md` for the ambition framing
that this review serves.

## SWOT

### Strengths

- Clean separation between artifact *identity / lifecycle* and *bytes*.
  Registry owns metadata; storage adapter owns persistence. This is the single
  most consequential architectural decision and the docs get it right.
- Retention is a first-class concept from day one, not bolted on later.
- Audit log designed in from the start, with explicit room for signed events.
- Storage adapter contract is minimal and well-shaped
  (`put / get / head / delete / health`).
- Pilot-first discipline (`guide-board` / OpenCMIS TCK) anchors the work in a
  real producer rather than a hypothetical one.
- Manifest portability is an explicit goal: a package should be understandable
  without calling its producer.
- Boundary statements are explicit (will not replace StateHub, will not encode
  producer semantics).

### Weaknesses

- Storage is keyed by logical path, not by content hash. This blocks global
  deduplication, Merkle integrity proofs, partial replication, and federation.
- No streaming, chunked, or resumable upload story. Multipart REST will cap
  throughput at the slowest Python/WSGI hop for multi-GB packages.
- No content-defined chunking (CDC). Evidence packages with logs are highly
  dedup-able; the current design captures none of that.
- SHA-256 is the right *compatibility* digest but the wrong *throughput*
  digest at platform scale.
- Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps, but no
  partitioning / sharding story exists.
- No event / change-data-capture stream for downstream consumers: StateHub,
  search, and UIs would have to poll.
- No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage
  without signed attestations leaves half the value on the table.
- Metadata is open-ended JSON without a schema-registration path. Hard to
  build typed tooling on top.
- No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit.
- No observability targets (latency / throughput SLOs, metrics, traces).
  Platform-grade claims will eventually require numbers.
- No OCI / `oras` artifact compatibility, which leaves the largest existing
  artifact ecosystem off the table.

### Opportunities

- **OCI Artifact + ORAS compatibility.** Inherit Helm, ML model, SBOM, cosign
  tooling for free. Probably the single highest-leverage external move.
- **Sigstore + in-toto + SLSA.** Evidence packages should be signed by
  default; this is exactly the gap most generic registries leave unfilled.
- **Content-addressed CAS + Merkle DAG** (Git / IPFS / restic pattern):
  enables global dedup, integrity proofs, federation, partial mirroring.
- **BLAKE3** as native digest with SHA-256 retained for interop:
  several times faster per core than hardware-accelerated SHA-256 (more with
  multithreading), and BLAKE3's construction *is* a Merkle tree, so verified
  streaming and subtree-level integrity checks come with the digest.
- **WASM plugin surface for transforms, extractors, indexers, redactors.**
  The "ffmpeg moment" for this domain: a stable host API that ecosystem
  contributors can extend without forking the core.
- **Federation / mirroring** between artifact-store instances via signed
  manifests. Nothing comparable appears to exist in the evidence space today.
- **FUSE / NFS / S3-gateway frontends.** Legacy producers ingest without code
  changes.
- **Embeddable mode.** A single static binary like `restic`, plus a server
  mode. Embedding is what makes ffmpeg ubiquitous.

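
The integrity-proof idea behind the content-addressed CAS + Merkle DAG item
can be illustrated with a small sketch. It is a simplification under stated
assumptions: SHA-256 from the Python standard library stands in for the
digest (BLAKE3 is not in the standard library), the names `leaf_hash`,
`node_hash`, and `merkle_root` are hypothetical and not part of
`artifact-store`, and the tree shape is a generic binary Merkle tree, not
BLAKE3's internal construction.

```python
import hashlib
from typing import List

# Distinct prefixes keep a leaf digest from being confused with an
# interior-node digest (domain separation).
LEAF_PREFIX = b"\x00"
NODE_PREFIX = b"\x01"


def leaf_hash(chunk: bytes) -> bytes:
    """Digest of one chunk of file content."""
    return hashlib.sha256(LEAF_PREFIX + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Digest of an interior node built from its two children."""
    return hashlib.sha256(NODE_PREFIX + left + right).digest()


def merkle_root(chunks: List[bytes]) -> bytes:
    """Fold chunk digests pairwise until a single root digest remains."""
    if not chunks:
        return hashlib.sha256(b"").digest()
    level = [leaf_hash(c) for c in chunks]
    while len(level) > 1:
        nxt = [
            node_hash(level[i], level[i + 1])
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd node is promoted unchanged
        level = nxt
    return level[0]
```

Because the root changes whenever any chunk changes, a signed manifest only
needs to carry the root to commit to every byte of the package, and a mirror
can verify a subset of chunks against the intermediate digests.
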
### Threats

- Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic,
  IPFS, Sigstore, plain S3. None are exactly this, but each chips at the
  value proposition.
- Scope creep vs the carefully scoped INTENT. The platform ambition pulls
  toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve
  this tension explicitly or you get neither.
- Python performance ceiling on the data plane (ingestion of multi-GB
  packages, hashing, chunking).
- Governance / maintenance debt. VLC and ffmpeg have decades of contributor
  depth; underestimating that is a project-killer.

## Architecture optimizations worth taking now

Each of these is cheap to lock in before code lands, and expensive (or
breaking) to add later.

1. **Split control plane from data plane.** Registry / API / retention stays
   in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a
   separate process (Rust sidecar, eventually with hot kernels in C / asm)
   that can scale and be rewritten independently. Pin the contract now (Unix
   socket, gRPC, or framed bincode). See `docs/PLATFORM-AMBITION.md`.
2. **Make content the primary address.** Internal object key
   `blake3:<digest>` (or `sha256:<digest>` for compat). `relative_path`
   becomes logical metadata in the manifest. Unlocks dedup, integrity,
   federation, OCI compatibility.
3. **Append-only WAL as the source of truth.** Metadata DB is a materialized
   view rebuildable from the log. Same pattern as Kafka / EventStore /
   Datomic. Cheap audit, replication, point-in-time recovery.
4. **OCI artifact spec as a wire format**, even if the native API is richer.
   Buys instant interop with `oras`, `cosign`, `crane`, Helm.
5. **Signed manifests from day one.** Pin a signing format (cosign / Sigstore)
   and a canonicalization (JCS or canonical CBOR). Post-hoc signing means
   every legacy manifest is unsigned forever.
6. **Resumable, chunked uploads on the wire.** Upload session resource
   (`POST /uploads` → `PATCH /uploads/{id}` ranges → `POST /uploads/{id}/complete`).
   `tus.io` is a reasonable reference. v1 implementation can still be
   single-shot multipart.
7. **Event stream out.** A monotonic-sequence `events` table; consumers
   tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later.
8. **Schema-typed metadata with escape hatch.** Producers register a JSON
   Schema for their metadata variant (`guide-board.run.v1`). Stored as open
   JSON, validated at ingest, queryable by typed views.
9. **Tiering as a first-class column of `storage_locations`.** Promote
   `retrieval_tier` and `restore_status` into the schema now (nullable,
   default `hot`).
10. **Ship a great CLI before any UI.** ffmpeg ships a binary, not a GUI.

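
Item 2 (content as the primary address) can be sketched against the existing
`put / get / head / delete / health` adapter contract. This is a minimal
in-memory illustration, not the project's adapter: the class
`InMemoryContentStore` and the helpers `content_address` and `ingest_package`
are hypothetical names, the method signatures are assumptions, and only
`sha256` is implemented because BLAKE3 is not in the Python standard library.

```python
import hashlib
from typing import Dict, Optional


def content_address(data: bytes, algorithm: str = "sha256") -> str:
    """Return '<algorithm>:<hex digest>' for the given bytes."""
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm in this sketch: {algorithm}")
    return f"{algorithm}:{hashlib.sha256(data).hexdigest()}"


class InMemoryContentStore:
    """Toy storage adapter keyed by content address instead of path."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def __len__(self) -> int:
        return len(self._blobs)

    def put(self, data: bytes) -> str:
        key = content_address(data)
        self._blobs.setdefault(key, data)  # identical bytes are stored once
        return key

    def get(self, key: str) -> bytes:
        data = self._blobs[key]
        if content_address(data) != key:  # integrity check on every read
            raise ValueError(f"stored bytes do not match address {key}")
        return data

    def head(self, key: str) -> Optional[int]:
        """Size in bytes, or None when the object is absent."""
        blob = self._blobs.get(key)
        return None if blob is None else len(blob)

    def delete(self, key: str) -> bool:
        return self._blobs.pop(key, None) is not None

    def health(self) -> bool:
        return True


def ingest_package(store: InMemoryContentStore, files: Dict[str, bytes]) -> Dict[str, str]:
    """Store each file and return a manifest: relative_path -> content address."""
    return {path: store.put(data) for path, data in files.items()}
```

In this shape `relative_path` lives only in the manifest, so two paths with
identical bytes resolve to one stored object, which is where the dedup and
integrity properties in item 2 come from.
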
## Performance hotspots — where native code actually matters

Ranked by realistic impact for this workload. Adopting libraries that already
contain hand-tuned assembly is the cheap path; writing fresh assembly is an
explicit research line (see `docs/ASSEMBLY-EXPERIMENT.md`).

1. **Hashing (dominant ingest cost).** SHA-256 with SHA-NI: ~1.5–2 GB/s/core.
   BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree.
   Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop.
2. **Content-defined chunking (FastCDC / Gear).** Rolling hash over every
   byte; pure Python is unusable, while optimized C / Rust implementations
   reach GB/s-class throughput. Mandatory if dedup is on the roadmap.
3. **Compression.** Zstd, with its bundled SIMD / assembly paths, decompresses
   at GB/s-class rates per core; compression at fast levels is typically
   several hundred MB/s per core. Evidence logs typically compress 5–20×.
   Apply at chunk level so dedup still works.
4. **I/O path.** Linux: `io_uring` for ingest writes; `sendfile(2)` /
   `splice(2)` for download zero-copy; `O_DIRECT` for very large objects.
5. **Encryption.** AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305
   vector implementations for non-AES-NI hardware. Use `ring`, BoringSSL, or
   AWS-LC. Never write crypto by hand.
6. **Metadata hot paths.** Bloom or Cuckoo filter in front of the
   "have I seen this hash?" lookup. Roughly 50 lines of Rust; potentially a
   ~100× win on lookups for unseen hashes.
7. **Manifest canonicalization.** Signed manifests canonicalize on every
   ingest and every verify. Pick a fast canonical CBOR / JCS impl.

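
To make item 2 concrete, here is a readable sketch of Gear-style
content-defined chunking. It is for illustration only: as the item says,
pure Python is far too slow for the data plane, and this is not FastCDC
itself (no normalized chunking or tuned masks). The function name
`gear_chunks`, the default sizes, and the way the `GEAR` table is derived
are all assumptions made for this example.

```python
import hashlib
from typing import Iterator

MASK64 = (1 << 64) - 1

# 256 fixed pseudo-random 64-bit values, one per byte value. Deriving them
# from SHA-256 keeps the table deterministic without shipping constants.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:8], "big")
    for i in range(256)
]


def gear_chunks(
    data: bytes,
    min_size: int = 2048,
    avg_bits: int = 13,
    max_size: int = 65536,
) -> Iterator[bytes]:
    """Split data at boundaries chosen by content rather than by offset."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Gear rolling hash: each shift ages out older bytes, so after 64
        # bytes the value depends only on the most recent 64 bytes.
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        # Cut when the top avg_bits bits are all zero (about once every
        # 2**avg_bits bytes past min_size), or when max_size forces it.
        at_boundary = length >= min_size and (h >> (64 - avg_bits)) == 0
        if at_boundary or length >= max_size:
            yield data[start : i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]  # trailing remainder, may be shorter than min_size
```

Because boundaries follow the bytes themselves, an insertion near the start
of a log usually shifts only the nearby chunks; later chunks tend to keep
their content and therefore their content address.
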
Not worth native code: HTTP layer, retention engine, audit log, DB access,
orchestration, workflow logic. Keep Python.

## Concrete suggestions before WP-0001 lands

- Add `digest_algorithm` to `artifact_files` (default `sha256`, allow
  `blake3`).
- Add `content_address` (e.g., `blake3:…`) as canonical storage key, with
  `relative_path` retained as logical metadata.
- Add `retrieval_tier` and `restore_status` to `storage_locations` now,
  nullable.
- Define the upload session resource shape even if v1 implements only
  single-shot multipart.
- Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a
  signing format target (cosign / Sigstore). Decide, do not implement.
- Add an `events` table with a monotonic sequence number so a
  change-data-capture feed is trivial to add later.
- Decide explicitly whether OCI artifact compatibility is a v2 goal or out of
  scope. Either is fine; ambiguity will distort schema decisions.

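
The schema-level suggestions above can be sketched with the standard-library
`sqlite3` module. Only the table and column names quoted in the list
(`artifact_files`, `digest_algorithm`, `content_address`, `relative_path`,
`storage_locations`, `retrieval_tier`, `restore_status`, `events`) come from
this review; every other column, the `seq` name, and the helper functions
`open_db`, `append_event`, and `tail_events` are assumptions for
illustration, and the real WP-0001 schema may differ.

```python
import json
import sqlite3
from typing import List, Tuple

SCHEMA = """
CREATE TABLE artifact_files (
    id               INTEGER PRIMARY KEY,
    relative_path    TEXT NOT NULL,            -- logical metadata only
    digest_algorithm TEXT NOT NULL DEFAULT 'sha256'
                     CHECK (digest_algorithm IN ('sha256', 'blake3')),
    content_address  TEXT NOT NULL             -- canonical storage key
);
CREATE TABLE storage_locations (
    id              INTEGER PRIMARY KEY,
    content_address TEXT NOT NULL,
    retrieval_tier  TEXT DEFAULT 'hot',        -- nullable, default hot
    restore_status  TEXT                       -- nullable
);
CREATE TABLE events (
    seq        INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic, never reused
    event_type TEXT NOT NULL,
    payload    TEXT NOT NULL
);
"""


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def append_event(conn: sqlite3.Connection, event_type: str, payload: dict) -> int:
    """Append one event and return its sequence number."""
    cur = conn.execute(
        "INSERT INTO events (event_type, payload) VALUES (?, ?)",
        (event_type, json.dumps(payload, sort_keys=True)),
    )
    return cur.lastrowid


def tail_events(
    conn: sqlite3.Connection, after_seq: int = 0, limit: int = 100
) -> List[Tuple[int, str, dict]]:
    """Return events with seq greater than after_seq, oldest first."""
    rows = conn.execute(
        "SELECT seq, event_type, payload FROM events "
        "WHERE seq > ? ORDER BY seq LIMIT ?",
        (after_seq, limit),
    ).fetchall()
    return [(seq, etype, json.loads(body)) for seq, etype, body in rows]
```

A consumer that remembers the last `seq` it processed can resume with
`tail_events(conn, after_seq=last_seq)`, which is the polling form of the
feed; a push transport can be layered on later without changing the table.
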
## What this review does not change

INTENT and SCOPE remain correctly scoped for v1. The pilot path through
WP-0001 should ship as planned. The schema annotations above are additive,
not redirective. The platform ambition lives in `docs/PLATFORM-AMBITION.md`
so it can guide later decisions without expanding the current workplan.