docs: add platform ambition, blueprint review, and assembly experiment

Captures the longer-horizon thesis (sovereign-cloud artifact substrate)
alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine
schema/contract commitments the v1 must preserve to keep that horizon
reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on
ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 20:56:01 +02:00
parent 793c0c7ba5
commit 403d903585
4 changed files with 616 additions and 3 deletions


@@ -9,6 +9,9 @@ as a local filesystem, S3-compatible object storage, or Ceph RGW.
 Start here:
-- [INTENT.md](INTENT.md)
-- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md)
-- [workplans/ARTIFACT-STORE-WP-0001-service-baseline.md](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md)
+- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary
+- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — draft architecture
+- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon thesis and the schema commitments v1 preserves
+- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md) — SWOT and optimisation review
+- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in research line on hand-tuned assembly for hot kernels
+- [workplans/ARTIFACT-STORE-WP-0001-service-baseline.md](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md) — first implementation workplan

docs/ASSEMBLY-EXPERIMENT.md (new file, 209 lines)

@@ -0,0 +1,209 @@
# Assembly Experiment
Status: draft / opt-in research line
Created: 2026-05-15
This document defines an opt-in research line under `artifact-store`: can
agentic coding adopt, extend, and eventually originate ffmpeg-grade
hand-written assembly for the hot paths of an artifact-storage data plane?
This is a research experiment, not roadmap-critical work. The platform
ambition (`docs/PLATFORM-AMBITION.md`) stands on its own merits whether or
not we ever write a single line of assembly. The experiment runs alongside.
## Why this experiment exists
ffmpeg is the empirical proof that hand-written assembly with runtime CPU
dispatch still substantially outperforms even the best
Rust-with-SIMD-intrinsics codebases for tight inner loops, often by 1.5–3×
on the same hardware, sometimes more. The cost is steep: domain expertise, multi-arch
maintenance, calling-convention discipline, microarchitecture awareness.
ffmpeg has decades of contributor depth to amortise that cost.
We do not have that depth. The interesting question is whether large
language models, used as coding agents, change the cost equation enough to
make this approach viable for a focused project. If they do, an artifact
substrate that competes on raw throughput-per-core has a real edge against
generic object stores. If they do not, we adopt prebuilt asm-tuned
libraries and lose nothing.
## Strategic context
This experiment ties to the commercial horizon recorded in
`docs/PLATFORM-AMBITION.md`. A sovereign-cloud artifact product that
ingests, hashes, dedups, and serves bytes at noticeably higher
throughput-per-core than commodity object stores has a defensible edge.
"Cheaper per-GB than AWS" is a losing race; "more throughput per server,
on hardware you already own" is not.
## Constraints
### Licence
- `artifact-store` is MIT No Attribution.
- ffmpeg's `libavutil` (where the storage-relevant asm lives) is LGPL 2.1+.
- We **cannot** copy LGPL-licensed asm into MIT-0 source.
- We **can**:
  - dynamically link to `libavutil` at runtime (users get both licences);
  - re-license a *segregated optional native module* under LGPL 2.1+ while
    the rest of the repo stays MIT-0, provided the module is its own
    package and the boundary is explicit;
  - read LGPL code and implement the same algorithm from scratch
    (algorithms are not copyrightable; specific source text is). This is
    the standard practice for clean-room reimplementation. Document the
    process per file;
  - prefer asm sources under permissive licences (BSD, Apache, CC0,
    public domain) where they exist.
Preferred upstream licences for the experiment, in order:
1. Public domain / CC0 (Intel reference, BLAKE3 reference)
2. Apache-2.0 / BSD / MIT (xxhash, zstd, ring)
3. LGPL via dynamic linking (libavutil)
4. Clean-room reimplementation inspired by LGPL (last resort)
### Maintenance budget
The experiment is bounded. Any asm we adopt or write must:
- have a portable C / Rust fallback that is correctness-equivalent;
- be reachable through a runtime CPU-feature dispatch table (the ffmpeg
pattern) so the binary still runs on machines without the relevant
extension;
- carry a test that compares its output byte-for-byte against the fallback
on randomised inputs;
- carry a microbenchmark with a recorded baseline so regressions are
visible.
If we cannot meet those four bars for a candidate, we ship the library
implementation and revisit later.
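The four bars can be sketched in a few lines. This is a minimal illustration, not the project's dispatcher: the names `DISPATCH`, `select_kernel`, and `check_equivalence` are hypothetical, CRC-32 is used only because it is small, and zlib's C implementation stands in for a hand-tuned kernel.

```python
import random
import zlib
from typing import Callable, Dict, FrozenSet

Kernel = Callable[[bytes], int]


def crc32_portable(data: bytes) -> int:
    """Bitwise CRC-32 (reflected polynomial 0xEDB88320): the slow, portable fallback."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc32_accelerated(data: bytes) -> int:
    """Stand-in for a tuned kernel; zlib's C implementation plays that role here."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Hypothetical dispatch table: required CPU feature -> implementation, best first.
DISPATCH: Dict[str, Kernel] = {"pclmulqdq": crc32_accelerated}


def select_kernel(cpu_features: FrozenSet[str]) -> Kernel:
    """Pick the first kernel whose feature is present, else the portable fallback."""
    for feature, kernel in DISPATCH.items():
        if feature in cpu_features:
            return kernel
    return crc32_portable


def check_equivalence(candidate: Kernel, reference: Kernel,
                      rounds: int = 200, seed: int = 0) -> None:
    """Compare candidate and reference on randomised inputs of random length."""
    rng = random.Random(seed)
    for _ in range(rounds):
        data = bytes(rng.getrandbits(8) for _ in range(rng.randrange(0, 512)))
        assert candidate(data) == reference(data), f"mismatch on {len(data)} bytes"
```

A real dispatcher would populate `cpu_features` from CPUID at start-up; passing the set in explicitly keeps the fallback path testable on any machine.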
## What ffmpeg actually has that is reusable here
Inspection of `libavutil/x86/` (2026-05-15) found the following
storage-relevant assets:
| File / module | What it accelerates | Reuse value for artifact-store |
|------------------------------|-------------------------------|--------------------------------|
| `x86/crc.asm` | CRC-32 (LE + BE) via PCLMULQDQ | **High.** Fast non-crypto integrity check for chunks and network framing. Public function names `ff_crc_le`, `ff_crc`. LGPL — must dynamic-link or reimplement. |
| `x86/aes.asm` + `aes_init.c` | AES block cipher | **Low–medium.** ffmpeg's AES is unauthenticated. At-rest encryption needs AES-GCM, better adopted from Ring / BoringSSL / AWS-LC (permissive licences, FIPS-validatable). |
| `x86/cpuid.asm` + `cpu.c` | CPU feature detection | **High (pattern, not code).** Reimplement the `ff_get_cpu_flags_x86()` + `AV_CPU_FLAG_*` pattern under MIT-0. This is the dispatch backbone. |
| `x86/x86inc.asm` | Macro library for asm authoring | **High (technique).** Cross-platform calling conventions, register naming, function prologue/epilogue. ffmpeg's macros are the de-facto standard outside game-dev. NASM-syntax. |
| `x86/x86util.asm` | SIMD helper macros | **Medium.** Useful patterns; not directly liftable. |
| `x86/emms.asm` | MMX state clearing | **Zero.** Legacy. |
| `sha.c` | SHA-1 / SHA-224 / SHA-256 | **Zero.** Pure C, no SIMD. We are better off with BLAKE3 (asm-tuned upstream) and SHA-NI via OpenSSL / Ring for SHA-256. |
| `aes_ctr.c`, `blowfish.c`, `camellia.c`, `cast5.c`, `des.c` | Block ciphers | **Zero.** Not relevant for our threat model. |
| `adler32.c`, `crc.c` | Reference integrity (C) | **Zero.** Use the asm-accelerated variants. |
Everything in `libavcodec` (DCT, motion estimation, deblocking) and the
video / audio / image-utility `.asm` files in `libavutil` is irrelevant to
artifact-store and stays out of scope.
## Candidate hot kernels for artifact-store, ranked
Each kernel below is a candidate either for adoption (drop in a vetted
permissive library), extension (start from a permissive baseline and
optimise further), or origination (write fresh).
### Tier 1 — adopt now, do not write
| Kernel | Recommended source | Notes |
|---------------|-------------------------------------------------------|-------|
| BLAKE3 | `blake3` (C reference + Rust crate), Apache-2.0 / CC0 | Already ships hand-tuned AVX-512, AVX2, SSE4.1, ARM NEON, ARM64. We will never beat upstream. |
| SHA-256 (compat) | OpenSSL / Ring / AWS-LC, permissive | Uses SHA-NI on supporting CPUs. |
| AES-GCM | Ring / BoringSSL, ISC / BSD | AES-NI + PCLMULQDQ for GHASH. Authenticated; what we actually need. |
| Zstandard | `zstd` (Facebook), BSD-3 | Multi-GB/s with SIMD. |
| LZ4 | `lz4`, BSD-2 | Faster than zstd at lower ratio; useful for high-throughput cold paths. |
### Tier 2 — adopt + extend, this is where the experiment starts
| Kernel | Baseline source | Extension question |
|--------------------|----------------------------------------------|--------------------|
| FastCDC (rolling hash) | `fastcdc-rs` (MIT) or original C paper code | Can we squeeze a SIMD'd Gear-hash variant that maintains the same boundary distribution? Existing Rust impl is scalar. |
| CRC-32C (Castagnoli, for chunk integrity) | Intel reference white paper code (public domain) | PCLMULQDQ-accelerated; ffmpeg's `crc.asm` shows the technique under LGPL — reimplement under MIT-0 from the Intel paper. |
| xxhash3 | `xxhash` (BSD-2) | Already SIMD'd; the extension is whether we can fuse it with our chunk-boundary loop to read each byte once. |
| Manifest canonicalisation hash | Whatever canonical-CBOR lib we pin | Likely no asm needed; included to monitor whether it ever appears on a profile. |
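For orientation, the scalar Gear-hash loop that the FastCDC row asks about looks roughly like the sketch below. It is a simplified illustration, not FastCDC itself: it omits normalised chunking, the gear table is derived from SHA-256 purely for reproducibility, and the parameter defaults and function names are assumptions.

```python
import hashlib
from typing import List

MASK64 = (1 << 64) - 1


def gear_table(seed: bytes = b"artifact-store-gear") -> List[int]:
    """256 pseudo-random 64-bit values, derived deterministically (illustrative only)."""
    return [
        int.from_bytes(hashlib.sha256(seed + bytes([i])).digest()[:8], "big")
        for i in range(256)
    ]


GEAR = gear_table()


def chunk_boundaries(data: bytes, min_size: int = 2048, avg_bits: int = 13,
                     max_size: int = 65536) -> List[int]:
    """Return the end offset of every chunk in `data`.

    The rolling value is h = (h << 1) + GEAR[byte], so after 64 bytes the
    oldest byte has shifted out entirely. A cut is declared when the top
    `avg_bits` bits of h are zero, or when the chunk reaches `max_size`.
    """
    mask = ((1 << avg_bits) - 1) << (64 - avg_bits)
    cuts: List[int] = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & MASK64
        length = i - start + 1
        if length < min_size:
            continue  # never cut before the minimum chunk size
        if (h & mask) == 0 or length >= max_size:
            cuts.append(i + 1)
            start = i + 1
            h = 0
    if start < len(data):
        cuts.append(len(data))  # trailing chunk may be shorter than min_size
    return cuts
```

Any SIMD variant would have to reproduce exactly these cut points on the same input, which is what "maintains the same boundary distribution" means in practice for the equivalence test.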
### Tier 3 — originate, only if profiles justify it
These are deliberately speculative. None of them are committed work.
- A fused "scan + chunk + hash" pass that reads each byte from the
upload buffer once and emits chunk boundaries plus per-chunk BLAKE3
state in a single pass. Today this requires three passes (CDC, hash
per chunk, hash for manifest root).
- A SIMD'd content-type sniffer for the first N kilobytes of unknown
uploads.
- An AVX-512 implementation of a bloom / cuckoo filter probe for the
"have I seen this hash?" hot path.
- Fast batch verification: given a list of `(content_address, bytes)`
pairs, verify all of them in one SIMD-dispatched pass.
## Experiment protocol
For each Tier 2 or Tier 3 candidate that we take on:
1. **Frame the kernel.** One function, one clear input / output, one
measurable metric (bytes per second per core).
2. **Baseline.** Land a portable C or Rust implementation with full test
coverage and a recorded microbenchmark number.
3. **Dispatch.** Wire the kernel through the runtime CPU-feature
dispatcher (ffmpeg pattern, reimplemented MIT-0). Default path = the
baseline.
4. **Agentic asm attempt.** Use the coding agent to author a NASM-syntax
asm implementation targeting one ISA extension (start with AVX2 — most
broadly available). The agent must:
- produce annotated source with cycle-accurate comments where relevant;
- include the test that compares its output to baseline on randomised
input;
- include the microbenchmark.
5. **Independent review.** A second pass — human or a fresh agent context
— reviews for correctness, calling-convention compliance, and obvious
microarchitectural issues (false dependencies, port pressure, unaligned
loads, misuse of `vzeroupper`).
6. **Land or shelve.** If the asm beats the baseline by a meaningful
margin (≥ 1.5×) and passes review, it lands behind the dispatcher.
Otherwise it shelves with the benchmark numbers recorded so we know
not to retry without new techniques.
7. **Extend.** Repeat for AVX-512, then ARM NEON, then SVE2, in that
order of impact.
Each completed kernel produces an ADR-style note in `docs/asm/` recording
the algorithm, the source of inspiration, the licence chain, the
benchmark numbers, and any microarchitectural notes.
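The metric from step 1 and the decision rule from step 6 are simple enough to pin down in code. The sketch below is an assumption about how the harness might look; `throughput_bytes_per_sec` and `land_or_shelve` are hypothetical names, and a real microbenchmark would also pin the CPU and control for frequency scaling.

```python
import time
from typing import Callable


def throughput_bytes_per_sec(kernel: Callable[[bytes], object], payload: bytes,
                             repeats: int = 5) -> float:
    """Best-of-N single-core throughput of `kernel` over `payload`."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        kernel(payload)
        best = min(best, time.perf_counter() - started)
    # Guard against a zero reading on very small payloads.
    return len(payload) / max(best, 1e-9)


def land_or_shelve(baseline_bps: float, candidate_bps: float,
                   passed_review: bool, threshold: float = 1.5) -> str:
    """Apply the protocol's rule: land only if reviewed and at least `threshold` times faster."""
    speedup = candidate_bps / baseline_bps
    return "land" if passed_review and speedup >= threshold else "shelve"
```

Recording both throughput numbers in the `docs/asm/` note, not just the verdict, is what makes a shelved attempt useful later.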
## What the experiment proves or disproves
A succeeding experiment delivers:
- a portable asm-accelerated data plane that competes with hand-tuned C
storage stacks on throughput;
- a public record of which kernels the agentic approach handles well and
which it does not;
- a reusable dispatcher and macro foundation that other projects can adopt.
A failing experiment delivers:
- a published record of where agentic coding plateaus on hot-path asm;
- an artifact-store data plane that is still very good — because the
baseline is "use the asm-tuned library", which is already fast.
Either outcome is publishable. The downside is bounded.
## Out of scope for this experiment
- Cryptography written by us. Use vetted libraries. Always.
- Architectures with small deployment footprints in this domain (RISC-V,
POWER, MIPS). Revisit once x86_64 and ARM64 are solid.
- Kernel-bypass networking (DPDK, eBPF/XDP storage). Different
experiment, different document if we ever pursue it.
- GPU offload. Different cost model; not addressed here.
## Immediate next steps
None are committed. When the v1 baseline (WP-0001) lands and we have a
real profile of where time is spent, the first candidate to pick up is
almost certainly **FastCDC + BLAKE3 in a single pass**, because that is
the documented bottleneck of every CAS-style storage system that has
profiled it (restic, borg, kopia). Until then, this document is a
holding place for the ambition.

docs/PLATFORM-AMBITION.md (new file, 226 lines)

@@ -0,0 +1,226 @@
# Platform Ambition
Status: draft
Created: 2026-05-15
This document records the longer-horizon thesis behind `artifact-store` and
captures which decisions are taken now to keep that horizon reachable without
expanding the v1 workplan. It sits beside, not above, `INTENT.md` and
`SCOPE.md`. INTENT defines what we build first; this document defines what
the v1 must not foreclose.
## Thesis
Generated artifacts — evidence packages, build outputs, ML models, logs,
snapshots, reports, scorecards, exports — are first-class durable objects in
modern software work. They sit somewhere between source code (well-served by
Git) and binary releases (well-served by OCI registries). The space between
is currently filled by a fragmented mix of bespoke directories, ad-hoc S3
buckets, vendor registries (Artifactory, Nexus), and document-management
systems that were not built for machine producers.
`artifact-store` aims to occupy that gap with one substrate: a generic,
content-addressed, signed, deduplicated, retention-aware artifact registry
and storage gateway that other tools embed or speak to.
The reference points are deliberate. **VLC** and **ffmpeg** lead their domain
not by being the prettiest applications but by being correct, fast, embeddable,
portable, and indispensable infrastructure for everyone else. The same
strategy applies here: build a kernel that is so good at the
bytes-and-identity layer that every artifact-producing tool would rather speak its
protocol than reinvent it.
## Commercial horizon
The longer-horizon commercial target is **a sovereign artifact-storage
product line for European cloud providers**; STACKIT (Schwarz Group) is
the concrete example. The thesis is:
- Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do
not sell *artifact identity, retention, attestation, federation, evidence
preservation*. Customers either build it themselves or buy proprietary
registries on top.
- A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned
artifact substrate on top of its own object storage has a defensible
differentiation against AWS — not in raw price-per-GB, which is a losing
race, but in regulated workloads (evidence retention, audit, signed
attestations, legal-hold, sovereign jurisdiction guarantees).
- Open source is the wedge. A widely-adopted upstream that the provider
ships, supports, and extends is far stronger than a proprietary stack.
This is a multi-year horizon, not a v1 deliverable. It is recorded here so
schema and protocol decisions made now keep that path open.
## Reference points
| System | What we learn from it |
|-----------------|--------------------------------------------------------------|
| ffmpeg | Embeddable core, hand-tuned hot paths, runtime CPU dispatch |
| VLC | Plugin architecture, portability, ubiquity through being a library too |
| Git | Content-addressed storage, Merkle DAG, pack files, integrity |
| restic | Single static binary, CDC + dedup, encryption by default |
| IPFS | Content-addressing, federation, partial replication |
| OCI Registry | Standardised manifest + blob model with broad ecosystem |
| Sigstore / cosign | Signed attestations as a first-class artifact property |
| MinIO | Operator ergonomics, S3 wire compat as adoption vector |
| SeaweedFS / Ceph | Separation of metadata plane from data plane |
| RocksDB / LMDB | Embeddable storage engines with predictable performance |
| Kafka | Log-as-source-of-truth, materialised views |
| BLAKE3 | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned |
We are not trying to reproduce any of these. We are trying to occupy a
specific gap between them with the best ideas from each.
## Non-goals (still)
The platform ambition does not change the v1 boundary in `INTENT.md` or
`SCOPE.md`. In particular it does not:
- replace StateHub as the work / decision system of record;
- encode producer-specific assessment semantics in the registry core;
- require any of the optimisations listed in the "near-horizon" section
below to land in v1;
- commit the project to writing assembly. The assembly-experiment line
(`docs/ASSEMBLY-EXPERIMENT.md`) is opt-in research, not roadmap-critical.
## Architectural commitments — preserved by v1
The following decisions are taken now because reversing them later is
expensive. Each lands in v1 as a schema or contract decision; full
exploitation is later work.
### A1. Content as the primary address
Internal canonical key for stored bytes is `<algo>:<digest>`, not a logical
path. Files within a package keep a `relative_path` as logical metadata,
but the storage backend sees and addresses content hashes.
- Enables: global dedup, Merkle integrity proofs, partial mirrors,
federation, OCI compatibility.
- v1 cost: one schema column (`content_address`) and a deterministic key
derivation; no behaviour change.
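A minimal sketch of the deterministic key derivation, assuming SHA-256 as the v1 default. The two-level fan-out in `storage_key` is an illustrative layout, not a decided one, and both function names are hypothetical.

```python
import hashlib


def content_address(data: bytes, algo: str = "sha256") -> str:
    """Return the canonical `<algo>:<digest>` address for a byte string."""
    digest = hashlib.new(algo, data).hexdigest()
    return f"{algo}:{digest}"


def storage_key(address: str) -> str:
    """Derive a backend object key from a content address.

    The two-level prefix (hypothetical) keeps directory sizes manageable
    on filesystem backends; object stores could use the address directly.
    """
    algo, digest = address.split(":", 1)
    return f"{algo}/{digest[:2]}/{digest[2:4]}/{digest}"
```

Two files with different `relative_path` values but identical bytes map to the same key, which is where the dedup comes from.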
### A2. BLAKE3 as native digest, SHA-256 retained for interop
`digest_algorithm` is a column on `artifact_files`. v1 default may remain
`sha256` to ship the pilot quickly; the column exists so `blake3` can ship
without migration.
- Enables: faster hashing, free Merkle root over a package, alignment with
modern signing tooling.
- v1 cost: column + adapter table mapping algo → hashing impl.
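The adapter table can be as small as a dictionary keyed by the `digest_algorithm` value. In this sketch the `HASHERS` name and `digest_stream` helper are hypothetical, and the `blake3` entry assumes the third-party `blake3` Python package, registered only if it is installed.

```python
import hashlib
from typing import Any, Callable, Dict, Iterable

# Maps the `digest_algorithm` column value to a hasher factory.
HASHERS: Dict[str, Callable[[], Any]] = {"sha256": hashlib.sha256}

try:
    import blake3  # third-party package; optional in this sketch

    HASHERS["blake3"] = blake3.blake3
except ImportError:
    pass  # sha256 remains the only registered algorithm


def digest_stream(chunks: Iterable[bytes], algorithm: str = "sha256") -> str:
    """Hash an iterable of byte chunks with the registered implementation."""
    try:
        factory = HASHERS[algorithm]
    except KeyError:
        raise ValueError(f"unsupported digest_algorithm: {algorithm!r}") from None
    hasher = factory()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()
```

Adding an algorithm later is then a one-line registration rather than a schema migration.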
### A3. Append-only event log as source of truth
An `events` table with a monotonic sequence number is the authoritative
record of registry mutations. The current metadata tables are a
materialised view rebuildable from the log.
- Enables: CDC feeds, audit, replication, point-in-time recovery, signed
event streams.
- v1 cost: one extra table written on the same transaction as today's
mutations.
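The "same transaction" requirement is the part worth seeing concretely. The sketch below uses SQLite for brevity; the table layouts, the `file_registered` event kind, and the function names are all assumptions made for illustration, not the WP-0001 schema.

```python
import json
import sqlite3


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Create the event log and a simplified materialised view."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS events (
            seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sequence
            kind    TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS artifact_files (
            relative_path   TEXT PRIMARY KEY,
            content_address TEXT NOT NULL
        );
    """)
    return db


def register_file(db: sqlite3.Connection, content_address: str,
                  relative_path: str) -> int:
    """Write the event and the view row atomically; return the event sequence."""
    payload = json.dumps(
        {"content_address": content_address, "relative_path": relative_path},
        sort_keys=True,
    )
    with db:  # commits on success, rolls back both inserts on any error
        cur = db.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)",
            ("file_registered", payload),
        )
        db.execute(
            "INSERT INTO artifact_files (relative_path, content_address) VALUES (?, ?)",
            (relative_path, content_address),
        )
    return cur.lastrowid


def rebuild_view(db: sqlite3.Connection) -> None:
    """Drop the materialised view and replay it from the event log."""
    with db:
        db.execute("DELETE FROM artifact_files")
        rows = db.execute(
            "SELECT payload FROM events WHERE kind = 'file_registered' ORDER BY seq"
        ).fetchall()
        for (payload,) in rows:
            record = json.loads(payload)
            db.execute(
                "INSERT INTO artifact_files (relative_path, content_address) VALUES (?, ?)",
                (record["relative_path"], record["content_address"]),
            )
```

Because a failed view write also rolls back its event, the log never records a mutation that did not happen, and `rebuild_view` is a faithful replay.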
### A4. Signed manifests, canonicalisation pinned
Manifest serialisation uses a canonical form (recommendation: canonical
CBOR; JCS as alternative) so byte-identical signing is possible across
languages and time. v1 may not actually sign — the pin guarantees that
when signing lands, every prior manifest is re-signable byte-for-byte.
- Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style
manifest digests.
- v1 cost: pick one canonicalisation library and use it for manifest
writes. Zero runtime cost.
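To show what "byte-identical" buys, here is a deliberately restricted sketch in the spirit of JCS. It is not a conforming RFC 8785 implementation: it rejects floats instead of implementing the JCS number rules, and it sorts keys by code point, which matches JCS ordering only for keys within the Basic Multilingual Plane. The function names are hypothetical.

```python
import hashlib
import json
from typing import Any


def _reject_floats(value: Any) -> None:
    """Floats need the JCS number serialisation rules, which this sketch omits."""
    if isinstance(value, float):
        raise TypeError("floats are not supported by this canonicalisation sketch")
    if isinstance(value, dict):
        for item in value.values():
            _reject_floats(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            _reject_floats(item)


def canonical_bytes(manifest: Any) -> bytes:
    """Serialise with sorted keys, no insignificant whitespace, UTF-8 output."""
    _reject_floats(manifest)
    text = json.dumps(
        manifest,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
        allow_nan=False,
    )
    return text.encode("utf-8")


def manifest_digest(manifest: Any) -> str:
    """Digest of the canonical form: the value a later signature would cover."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest)).hexdigest()
```

Whichever library is pinned, the property to test is the same: logically equal manifests must produce equal bytes regardless of key insertion order.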
### A5. Control plane / data plane separation at the contract
Even if v1 implements both in one Python process, the boundary between
"registry / API / retention" (control plane) and "hash / chunk / store /
serve" (data plane) is a named contract. When the data plane is later
extracted into a Rust binary, the API does not change.
- Enables: native-speed ingestion, language flexibility on the hot path,
independent scaling.
- v1 cost: discipline (separate Python module with no API leakage), not
code.
### A6. Resumable upload wire shape
API exposes upload sessions: `POST /uploads`, `PATCH /uploads/{id}` with
range, `POST /uploads/{id}/complete`. v1 implementation may still be
single-shot multipart under the hood, but the resource shape exists so
chunked / resumable upload is additive.
- Enables: streaming, retry-safe ingestion, very-large-package support.
- v1 cost: route definitions only; underlying logic can remain simple.
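The resource shape can be modelled without any HTTP framework. The in-memory `UploadService` below is a hypothetical sketch of the session semantics behind the three routes; the offset check and the returned content address are assumptions about behaviour the routes would eventually need.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UploadSession:
    buffer: bytearray = field(default_factory=bytearray)
    completed: bool = False


class UploadService:
    """In-memory model of the upload-session resource."""

    def __init__(self) -> None:
        self._sessions: Dict[str, UploadSession] = {}

    def create(self) -> str:
        """POST /uploads: open a session and return its id."""
        upload_id = uuid.uuid4().hex
        self._sessions[upload_id] = UploadSession()
        return upload_id

    def patch(self, upload_id: str, offset: int, data: bytes) -> int:
        """PATCH /uploads/{id}: append a range; return the new offset."""
        session = self._open_session(upload_id)
        if offset != len(session.buffer):
            # A client that lost a response can re-query the offset and resume.
            raise ValueError(f"expected offset {len(session.buffer)}, got {offset}")
        session.buffer.extend(data)
        return len(session.buffer)

    def complete(self, upload_id: str) -> str:
        """POST /uploads/{id}/complete: finalise and return the content address."""
        session = self._open_session(upload_id)
        session.completed = True
        return "sha256:" + hashlib.sha256(bytes(session.buffer)).hexdigest()

    def _open_session(self, upload_id: str) -> UploadSession:
        session = self._sessions[upload_id]
        if session.completed:
            raise ValueError("upload already completed")
        return session
```

A single-shot v1 upload is then just `create`, one `patch` at offset 0 with the whole body, and `complete`, so clients written against the session shape keep working when real chunking arrives.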
### A7. Tiering as a property of storage locations
`storage_location` carries `retrieval_tier` (`hot|warm|cold|archive`) and
`restore_status` columns, nullable, default `hot`. The API can already
return "not immediately available" without changing artifact identity.
- Enables: future cold storage, Glacier-style restore flows.
- v1 cost: two nullable columns.
### A8. Schema-typed metadata with open escape hatch
Producers register a metadata schema (JSON Schema) per variant
(e.g. `guide-board.run.v1`). Stored as open JSON, validated against the
registered schema at ingest time. Queries can use typed views.
- Enables: tooling, search, GraphQL views, typed clients without losing
flexibility.
- v1 cost: a `metadata_schemas` table; v1 validation can be a no-op.
### A9. OCI compatibility kept reachable
We do not promise OCI compatibility in v1, but we do not adopt any
data model that prevents it. Concretely: keep content addresses as
`<algo>:<hex>`, keep manifest structure compatible with an OCI image
manifest (config + layers + annotations), and avoid invariants that the
OCI spec forbids.
- Enables: future `oras push` / `cosign sign` / Helm ecosystem entry.
- v1 cost: one design review per schema change against the OCI spec.
## Near-horizon technical roadmap (post-baseline)
Roughly ordered. Not commitments; planning hooks.
1. **Rust data-plane binary.** Receives chunked uploads, runs BLAKE3 + CDC +
optional Zstd + optional AES-GCM, writes to storage adapter. Speaks a
minimal gRPC or framed-bincode protocol to the Python control plane over
a Unix socket.
2. **Content-defined chunking (FastCDC).** Stored chunks become the dedup
unit. Package manifest references chunk digests; package digest is the
Merkle root.
3. **Cosign-compatible signing pipeline.** Every finalised manifest can be
signed; signatures stored alongside the manifest.
4. **Event stream out.** NATS or Kafka topic of registry events for
downstream consumers.
5. **OCI artifact endpoint.** A `/v2/` namespace that speaks the OCI
distribution spec on top of the same storage.
6. **WASM plugin host.** Producers and operators can ship signed `.wasm`
modules for content extraction, redaction, scorecard generation,
custom hashing, indexing. This is the "ffmpeg moment" — open extension
surface that does not require forking the core.
7. **Federation.** Signed manifest exchange between artifact-store
instances. Gossip or explicit peering.
8. **Cold tier adapters.** S3 Glacier, Tape, IA classes.
## How this document is used
- Every schema change in WP-0001 (or successors) is checked against
commitments A1–A9. A change that violates one is either rejected or
documented as a deliberate revision of this document.
- Every "we could do this faster in native code" idea is filed against
`docs/ASSEMBLY-EXPERIMENT.md`, not bolted onto a workplan.
- Every new producer integration is checked against the commercial
horizon: does it generalise, or does it bake in producer-specific
assumptions?
This document is allowed to be wrong. It is not allowed to be silent.
Update it when the thesis changes; do not let v1 quietly close doors that
the v3 needs open.

docs/REVIEW-2026-05-15-intent-and-blueprint.md (new file, 175 lines)

@@ -0,0 +1,175 @@
# Review — INTENT and Architecture Blueprint
Date: 2026-05-15
Reviewer: claude (opus-4-7)
Inputs: `INTENT.md`, `docs/ARCHITECTURE-BLUEPRINT.md`,
`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md`, `SCOPE.md`, `AGENTS.md`
This review reframes the current scoped-internal-service design against a
longer-horizon ambition: make `artifact-store` the leading open source
substrate for generic artifact storage in the same sense that VLC and ffmpeg
lead their domain. See `docs/PLATFORM-AMBITION.md` for the ambition framing
this review is in service of.
## SWOT
### Strengths
- Clean separation between artifact *identity / lifecycle* and *bytes*.
Registry owns metadata; storage adapter owns persistence. This is the single
most consequential architectural decision and the docs get it right.
- Retention is a first-class concept from day one, not bolted on later.
- Audit log designed in from the start, with explicit room for signed events.
- Storage adapter contract is minimal and well-shaped
(`put / get / head / delete / health`).
- Pilot-first discipline (`guide-board` / OpenCMIS TCK) anchors the work in a
real producer rather than a hypothetical one.
- Manifest portability is an explicit goal — a package should be understandable
without calling its producer.
- Boundary statements are explicit (will not replace StateHub, will not encode
producer semantics).
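As a reading aid for the `put / get / head / delete / health` contract named above, here is one way the five operations could be typed. The signatures, return types, and the `InMemoryAdapter` are assumptions for illustration; the blueprint's actual contract may differ.

```python
from typing import Dict, Optional, Protocol


class StorageAdapter(Protocol):
    """Hypothetical typing of the five-operation adapter contract."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def head(self, key: str) -> Optional[int]: ...
    def delete(self, key: str) -> bool: ...
    def health(self) -> bool: ...


class InMemoryAdapter:
    """Smallest adapter that satisfies the protocol; useful in tests."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = bytes(data)

    def get(self, key: str) -> bytes:
        return self._objects[key]  # raises KeyError when the key is absent

    def head(self, key: str) -> Optional[int]:
        """Return the stored size in bytes, or None when the key is absent."""
        data = self._objects.get(key)
        return None if data is None else len(data)

    def delete(self, key: str) -> bool:
        """Return True if something was removed."""
        return self._objects.pop(key, None) is not None

    def health(self) -> bool:
        return True


def store_and_verify(adapter: StorageAdapter, key: str, data: bytes) -> bool:
    """Registry-side helper that depends only on the contract."""
    adapter.put(key, data)
    return adapter.head(key) == len(data) and adapter.get(key) == data
```

Keeping the registry code written against the protocol, never a concrete adapter, is what lets local filesystem, S3-compatible, and Ceph RGW backends remain interchangeable.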
### Weaknesses
- Storage is keyed by logical path, not by content hash. Blocks global
deduplication, Merkle integrity proofs, partial replication, federation.
- No streaming, chunked, or resumable upload story. Multipart REST will cap
throughput at the slowest Python/WSGI hop for multi-GB packages.
- No content-defined chunking (CDC). Evidence packages with logs are highly
dedup-able; current design captures none of that.
- SHA-256 is the right *compatibility* digest but the wrong *throughput*
digest at platform scale.
- Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps but no
partitioning / sharding story exists.
- No event / CDC stream for downstream consumers: StateHub, search, and UIs would
have to poll.
- No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage
without signed attestations leaves half the value on the table.
- Metadata is open-ended JSON without a schema-registration path. Hard to
build typed tooling on top.
- No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit.
- No observability targets (latency / throughput SLOs, metrics, traces).
Platform-grade claims will eventually require numbers.
- No OCI / `oras` artifact compatibility — leaves the largest existing
artifact ecosystem off the table.
### Opportunities
- **OCI Artifact + ORAS compatibility.** Inherit Helm, ML model, SBOM, cosign
tooling for free. Probably the single highest-leverage external move.
- **Sigstore + in-toto + SLSA.** Evidence packages should be signed by
default; this is exactly the gap most generic registries leave unfilled.
- **Content-addressed CAS + Merkle DAG** (Git / IPFS / restic pattern):
enables global dedup, integrity proofs, federation, partial mirroring.
- **BLAKE3** as native digest with SHA-256 retained for interop:
several times faster hashing per core, and BLAKE3's construction *is* a Merkle
tree, so package-level integrity comes for free.
- **WASM plugin surface for transforms, extractors, indexers, redactors.**
The "ffmpeg moment" for this domain: a stable host API that ecosystem
contributors can extend without forking the core.
- **Federation / mirroring** between artifact-store instances via signed
manifests. Nothing comparable exists in the evidence space today.
- **FUSE / NFS / S3-gateway frontends.** Legacy producers ingest without code
changes.
- **Embeddable mode.** A single static binary like `restic`, plus a server
mode. Embedding is what makes ffmpeg ubiquitous.
### Threats
- Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic,
IPFS, Sigstore, plain S3. None are exactly this, but each chips at the
value proposition.
- Scope creep vs the carefully-scoped INTENT. The platform ambition pulls
toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve
this tension explicitly or you get neither.
- Python performance ceiling on the data plane (ingestion of multi-GB
packages, hashing, chunking).
- Governance / maintenance debt. VLC and ffmpeg have decades of contributor
depth; underestimating that is a project-killer.
## Architecture optimizations worth taking now
Each of these is cheap to lock in before code lands, and expensive (or
breaking) to add later.
1. **Split control plane from data plane.** Registry / API / retention stays
in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a
separate process (Rust sidecar, eventually with hot kernels in C / asm)
that can scale and be rewritten independently. Pin the contract now (Unix
socket, gRPC or framed bincode). See `docs/PLATFORM-AMBITION.md`.
2. **Make content the primary address.** Internal object key
`blake3:<digest>` (or `sha256:<digest>` for compat). `relative_path`
becomes logical metadata in the manifest. Unlocks dedup, integrity,
federation, OCI compatibility.
3. **Append-only WAL as the source of truth.** Metadata DB is a materialized
view rebuildable from the log. Same pattern as Kafka / EventStore /
Datomic. Cheap audit, replication, point-in-time recovery.
4. **OCI artifact spec as a wire format**, even if the native API is richer.
Buys instant interop with `oras`, `cosign`, `crane`, Helm.
5. **Signed manifests from day one.** Pin a signing format (cosign / Sigstore)
and a canonicalization (JCS or canonical CBOR). Post-hoc signing means
every legacy manifest is unsigned forever.
6. **Resumable, chunked uploads on the wire.** Upload session resource
(`POST /uploads` → `PATCH /uploads/{id}` ranges → `POST /uploads/{id}/complete`).
`tus.io` is a reasonable reference. v1 implementation can still be
single-shot multipart.
7. **Event stream out.** A monotonic-sequence `events` table; consumers
tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later.
8. **Schema-typed metadata with escape hatch.** Producers register a JSON
Schema for their metadata variant (`guide-board.run.v1`). Stored as open
JSON, validated at ingest, queryable by typed views.
9. **Tiering as a first-class column of `storage_location`.** Promote
`retrieval_tier` and `restore_status` into the schema now (nullable,
default `hot`).
10. **Ship a great CLI before any UI.** ffmpeg ships a binary, not a GUI.
## Performance hotspots — where native code actually matters
Ranked by realistic impact for this workload. Adopting libraries that already
contain hand-tuned assembly is the cheap path; writing fresh assembly is an
explicit research line — see `docs/ASSEMBLY-EXPERIMENT.md`.
1. **Hashing (dominant ingest cost).** SHA-256 with SHA-NI: ~1.5–2 GB/s/core.
BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree.
Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop.
2. **Content-defined chunking (FastCDC / Gear).** Rolling hash over every
byte; pure-Python is unusable, optimized C / Rust hits 5–10 GB/s.
Mandatory if dedup is on the roadmap.
3. **Compression.** Zstd with bundled SIMD reaches multi-GB/s. Evidence logs
typically compress 5–20×. Apply at chunk level so dedup still works.
4. **I/O path.** Linux: `io_uring` for ingest writes; `sendfile(2)` /
`splice(2)` for download zero-copy; `O_DIRECT` for very large objects.
5. **Encryption.** AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305
vector implementations for non-AES-NI hardware. Use Ring, BoringSSL, or
AWS-LC. Never write crypto by hand.
6. **Metadata hot paths.** Bloom or Cuckoo filter in front of the
"have I seen this hash?" lookup. ~50 lines of Rust, ~100× win.
7. **Manifest canonicalization.** Signed manifests canonicalize on every
ingest and every verify. Pick a fast canonical CBOR / JCS impl.
Not worth native code: HTTP layer, retention engine, audit log, DB access,
orchestration, workflow logic. Keep Python.
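The filter from item 6 is small enough to sketch here. This Python version only illustrates the data structure (the review suggests Rust for the real one); the sizing defaults and the double-hashing scheme built on SHA-256 are assumptions.

```python
import hashlib
from typing import List


class BloomFilter:
    """Probabilistic set: no false negatives, tunable false-positive rate."""

    def __init__(self, size_bits: int = 1 << 20, hashes: int = 7) -> None:
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, key: bytes) -> List[int]:
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        return [(h1 + i * h2) % self.size for i in range(self.hashes)]

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def seen_before(bloom: BloomFilter, content_address: str) -> bool:
    """Return False only when the address is definitely new.

    A True result is a "maybe" and must still be confirmed against the
    metadata store; the saving is on the negative path.
    """
    return content_address.encode("utf-8") in bloom
```

The win comes from ingest workloads where most chunks are new: those lookups never reach the database.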
## Concrete suggestions before WP-0001 lands
- Add `digest_algorithm` to `artifact_files` (default `sha256`, allow
`blake3`).
- Add `content_address` (e.g., `blake3:…`) as canonical storage key, with
`relative_path` retained as logical metadata.
- Add `retrieval_tier` and `restore_status` to `storage_locations` now,
nullable.
- Define the upload session resource shape even if v1 implements only
single-shot multipart.
- Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a
signing format target (cosign / Sigstore). Decide, do not implement.
- Add an `events` table with a monotonic sequence number so a CDC feed is
trivial later.
- Decide explicitly whether OCI artifact compatibility is a v2 goal or out of
scope. Either is fine; ambiguity will distort schema decisions.
## What this review does not change
INTENT and SCOPE remain correctly scoped for v1. The pilot path through
WP-0001 should ship as planned. The schema annotations above are additive,
not redirective. The platform ambition lives in `docs/PLATFORM-AMBITION.md`
so it can guide later decisions without expanding the current workplan.