artifact-store/docs/ASSEMBLY-EXPERIMENT.md
tegwick 403d903585 docs: add platform ambition, blueprint review, and assembly experiment
Captures the longer-horizon thesis (sovereign-cloud artifact substrate)
alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine
schema/contract commitments the v1 must preserve to keep that horizon
reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on
ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 20:56:01 +02:00

# Assembly Experiment

Status: draft / opt-in research line
Created: 2026-05-15

This document defines an opt-in research line under `artifact-store`: can
agentic coding adopt, extend, and eventually originate ffmpeg-grade
hand-written assembly for the hot paths of an artifact-storage data plane?

This is a research experiment, not roadmap-critical work. The platform
ambition (`docs/PLATFORM-AMBITION.md`) stands on its own merits whether or
not we ever write a single line of assembly. The experiment runs alongside.
## Why this experiment exists
ffmpeg is the empirical proof that hand-written assembly with runtime CPU
dispatch still substantially outperforms even the best
Rust-with-SIMD-intrinsics codebases for tight inner loops, often by 1.5× to
3× on the same hardware, sometimes more. The cost is steep: domain
expertise, multi-arch maintenance, calling-convention discipline,
microarchitecture awareness. ffmpeg has decades of contributor depth to
amortise that cost.

We do not have that depth. The interesting question is whether large
language models, used as coding agents, change the cost equation enough to
make this approach viable for a focused project. If they do, an artifact
substrate that competes on raw throughput-per-core has a real edge against
generic object stores. If they do not, we adopt prebuilt asm-tuned
libraries and lose nothing.
## Strategic context
This experiment ties to the commercial horizon recorded in
`docs/PLATFORM-AMBITION.md`. A sovereign-cloud artifact product that
ingests, hashes, dedups, and serves bytes at noticeably higher
throughput-per-core than commodity object stores has a defensible edge.
"Cheaper per-GB than AWS" is a losing race; "more throughput per server,
on hardware you already own" is not.
## Constraints
### Licence
- `artifact-store` is MIT No Attribution.
- ffmpeg's `libavutil` (where the storage-relevant asm lives) is LGPL 2.1+.
- We **cannot** copy LGPL-licensed asm into MIT-0 source.
- We **can**:
  - dynamically link to `libavutil` at runtime (users get both licences);
  - re-license a *segregated optional native module* under LGPL 2.1+ while
    the rest of the repo stays MIT-0, provided the module is its own
    package and the boundary is explicit;
  - read LGPL code and implement the same algorithm from scratch
    (algorithms are not copyrightable; specific source text is). This is
    reimplementation from the algorithm, not a strict clean-room process
    (which keeps the reader of the original separate from the author of
    the new code), so document the process per file;
  - prefer asm sources under permissive licences (BSD, Apache, CC0,
    public domain) where they exist.

Preferred upstream licences for the experiment, in order:

1. Public domain / CC0 (Intel reference, BLAKE3 reference)
2. Apache-2.0 / BSD / MIT (xxhash, zstd, ring)
3. LGPL via dynamic linking (libavutil)
4. Clean-room reimplementation inspired by LGPL (last resort)
### Maintenance budget
The experiment is bounded. Any asm we adopt or write must:
- have a portable C / Rust fallback that is correctness-equivalent;
- be reachable through a runtime CPU-feature dispatch table (the ffmpeg
pattern) so the binary still runs on machines without the relevant
extension;
- carry a test that compares its output byte-for-byte against the fallback
on randomised inputs;
- carry a microbenchmark with a recorded baseline so regressions are
  visible.

If we cannot meet those four bars for a candidate, we ship the library
implementation and revisit later.
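
The fallback and dispatch bars above can be sketched in Rust roughly as follows. This is a minimal illustration under assumptions, not the project's dispatcher: the names `crc32c`, `crc32c_portable`, `crc32c_sse42`, and `select_crc32c` are hypothetical, and CRC-32C through the SSE4.2 `crc32` instruction is used only because it is small enough to show in full. The accelerated path uses compiler intrinsics rather than NASM so the sketch stays self-contained.

```rust
use std::sync::OnceLock;

/// Signature shared by every implementation of the kernel (hypothetical).
type Crc32cFn = fn(u32, &[u8]) -> u32;

/// Portable fallback: bitwise CRC-32C (reflected polynomial 0x82F63B78).
/// Takes and returns the raw running state, without the final inversion.
pub fn crc32c_portable(mut crc: u32, data: &[u8]) -> u32 {
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    crc
}

/// Accelerated variant using the SSE4.2 `crc32` instruction.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse4.2")]
unsafe fn crc32c_sse42_inner(crc: u32, data: &[u8]) -> u32 {
    use std::arch::x86_64::{_mm_crc32_u64, _mm_crc32_u8};
    let mut chunks = data.chunks_exact(8);
    let mut acc = u64::from(crc);
    for chunk in &mut chunks {
        let mut word = [0u8; 8];
        word.copy_from_slice(chunk);
        acc = _mm_crc32_u64(acc, u64::from_le_bytes(word));
    }
    let mut crc = acc as u32;
    for &byte in chunks.remainder() {
        crc = _mm_crc32_u8(crc, byte);
    }
    crc
}

#[cfg(target_arch = "x86_64")]
fn crc32c_sse42(crc: u32, data: &[u8]) -> u32 {
    // Safety: this wrapper is only installed after the runtime check below.
    unsafe { crc32c_sse42_inner(crc, data) }
}

/// Pick the best available implementation for the running CPU.
fn select_crc32c() -> Crc32cFn {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("sse4.2") {
            return crc32c_sse42;
        }
    }
    crc32c_portable
}

/// One-entry dispatch table, filled on first use.
static CRC32C: OnceLock<Crc32cFn> = OnceLock::new();

/// Public entry point: standard CRC-32C with initial and final inversion.
pub fn crc32c(data: &[u8]) -> u32 {
    let kernel = *CRC32C.get_or_init(select_crc32c);
    !kernel(!0, data)
}
```

Callers only ever see `crc32c`; the selection cost is paid once, and a machine without SSE4.2 (or a non-x86 target) silently gets the portable path. A real table would carry one slot per kernel rather than one static per function.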
## What ffmpeg actually has that is reusable here
Inspection of `libavutil/x86/` (2026-05-15) found the following
storage-relevant assets:
| File / module | What it accelerates | Reuse value for artifact-store |
|------------------------------|-------------------------------|--------------------------------|
| `x86/crc.asm` | CRC-32 (LE + BE) via PCLMULQDQ | **High.** Fast non-crypto integrity check for chunks and network framing. Public function names `ff_crc_le`, `ff_crc`. LGPL — must dynamic-link or reimplement. |
| `x86/aes.asm` + `aes_init.c` | AES block cipher | **Low to medium.** ffmpeg's AES is unauthenticated. At-rest encryption needs AES-GCM, better adopted from Ring / BoringSSL / AWS-LC (permissive licences, FIPS-validatable). |
| `x86/cpuid.asm` + `cpu.c` | CPU feature detection | **High (pattern, not code).** Reimplement the `ff_get_cpu_flags_x86()` + `AV_CPU_FLAG_*` pattern under MIT-0. This is the dispatch backbone. |
| `x86/x86inc.asm` | Macro library for asm authoring | **High (technique).** Cross-platform calling conventions, register naming, function prologue/epilogue. ffmpeg's macros are the de-facto standard outside game-dev. NASM-syntax. |
| `x86/x86util.asm` | SIMD helper macros | **Medium.** Useful patterns; not directly liftable. |
| `x86/emms.asm` | MMX state clearing | **Zero.** Legacy. |
| `sha.c` | SHA-1 / SHA-224 / SHA-256 | **Zero.** Pure C, no SIMD. We are better off with BLAKE3 (asm-tuned upstream) and SHA-NI via OpenSSL / Ring for SHA-256. |
| `aes_ctr.c`, `blowfish.c`, `camellia.c`, `cast5.c`, `des.c` | Block ciphers | **Zero.** Not relevant for our threat model. |
| `adler32.c`, `crc.c` | Reference integrity (C) | **Zero.** Use the asm-accelerated variants. |

Everything in `libavcodec` (DCT, motion estimation, deblocking) and the
video / audio / image-utility `.asm` files in `libavutil` is irrelevant to
artifact-store and stays out of scope.
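
As a rough idea of what reimplementing the CPU-flag pattern (as opposed to the code) could look like, the sketch below builds a bitmask once and lets kernels test bits against it. Every name here (`CPU_FLAG_*`, `detect_cpu_flags`, `cpu_flags`, `best_tier`) is hypothetical and chosen for illustration; the sketch relies on the Rust standard library's feature-detection macros instead of a hand-written `cpuid` routine.

```rust
use std::sync::OnceLock;

// Hypothetical flag bits, in the spirit of ffmpeg's `AV_CPU_FLAG_*` constants.
pub const CPU_FLAG_SSE42: u32 = 1 << 0;
pub const CPU_FLAG_PCLMUL: u32 = 1 << 1;
pub const CPU_FLAG_AVX2: u32 = 1 << 2;
pub const CPU_FLAG_AVX512F: u32 = 1 << 3;
pub const CPU_FLAG_NEON: u32 = 1 << 4;

/// Probe the running CPU and return the set of supported extensions.
#[allow(unused_mut)]
pub fn detect_cpu_flags() -> u32 {
    let mut flags = 0u32;
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("sse4.2") {
            flags |= CPU_FLAG_SSE42;
        }
        if std::arch::is_x86_feature_detected!("pclmulqdq") {
            flags |= CPU_FLAG_PCLMUL;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            flags |= CPU_FLAG_AVX2;
        }
        if std::arch::is_x86_feature_detected!("avx512f") {
            flags |= CPU_FLAG_AVX512F;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            flags |= CPU_FLAG_NEON;
        }
    }
    flags
}

/// Cached flags: detection runs once per process.
pub fn cpu_flags() -> u32 {
    static FLAGS: OnceLock<u32> = OnceLock::new();
    *FLAGS.get_or_init(detect_cpu_flags)
}

/// Map a flag set to the widest tier a kernel family might offer.
pub fn best_tier(flags: u32) -> &'static str {
    if flags & CPU_FLAG_AVX512F != 0 {
        "avx512"
    } else if flags & CPU_FLAG_AVX2 != 0 {
        "avx2"
    } else if flags & CPU_FLAG_SSE42 != 0 {
        "sse4.2"
    } else if flags & CPU_FLAG_NEON != 0 {
        "neon"
    } else {
        "portable"
    }
}
```

Keeping detection behind a plain integer makes it easy to force a tier in tests, for example by passing `0` to `best_tier` to exercise the portable path on any machine.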
## Candidate hot kernels for artifact-store, ranked
Each kernel below is a candidate either for adoption (drop in a vetted
permissive library), extension (start from a permissive baseline and
optimise further), or origination (write fresh).
### Tier 1 — adopt now, do not write
| Kernel | Recommended source | Notes |
|---------------|-------------------------------------------------------|-------|
| BLAKE3 | `blake3` (C reference + Rust crate), Apache-2.0 / CC0 | Already ships hand-tuned AVX-512, AVX2, SSE4.1, and ARM NEON (AArch64) implementations. We will never beat upstream. |
| SHA-256 (compat) | OpenSSL / Ring / AWS-LC, permissive | Uses SHA-NI on supporting CPUs. |
| AES-GCM | Ring / BoringSSL, ISC / BSD | AES-NI + PCLMULQDQ for GHASH. Authenticated; what we actually need. |
| Zstandard | `zstd` (Facebook), BSD-3 | Multi-GB/s with SIMD. |
| LZ4 | `lz4`, BSD-2 | Faster than zstd at lower ratio; useful for high-throughput cold paths. |
### Tier 2 — adopt + extend, this is where the experiment starts
| Kernel | Baseline source | Extension question |
|--------------------|----------------------------------------------|--------------------|
| FastCDC (rolling hash) | `fastcdc-rs` (MIT) or the paper authors' original C code | Can we squeeze out a SIMD'd Gear-hash variant that maintains the same boundary distribution? Existing Rust impl is scalar. |
| CRC-32C (Castagnoli, for chunk integrity) | Intel reference white paper code (public domain) | PCLMULQDQ-accelerated; ffmpeg's `crc.asm` shows the technique under LGPL — reimplement under MIT-0 from the Intel paper. |
| XXH3 (xxHash) | `xxhash` (BSD-2) | Already SIMD'd; the extension is whether we can fuse it with our chunk-boundary loop to read each byte once. |
| Manifest canonicalisation hash | Whatever canonical-CBOR lib we pin | Likely no asm needed; included to monitor whether it ever appears on a profile. |
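
For orientation, a scalar Gear-hash chunker of the kind a SIMD variant would have to match might look like the sketch below. It is a simplified illustration, not `fastcdc-rs` or the paper's code: it omits FastCDC's normalised chunking, the table is generated with SplitMix64 as a stand-in for a published Gear table, and `ChunkerParams`, `gear_table`, and `chunk_boundaries` are hypothetical names.

```rust
/// Deterministic 256-entry table built with SplitMix64.
/// Assumption: a stand-in for a fixed, published Gear table.
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0;
    for slot in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *slot = z ^ (z >> 31);
    }
    table
}

/// Hypothetical parameter bundle for the sketch.
pub struct ChunkerParams {
    pub min_size: usize,
    pub max_size: usize,
    /// A cut is declared when `hash & mask == 0`.
    pub mask: u64,
}

/// Return the exclusive end offset of every chunk in `data`.
pub fn chunk_boundaries(data: &[u8], params: &ChunkerParams) -> Vec<usize> {
    assert!(params.max_size > 0 && params.min_size <= params.max_size);
    // Rebuilt per call for brevity; a real chunker would keep a static table.
    let gear = gear_table();
    let mut boundaries = Vec::new();
    let mut start = 0usize;
    while start < data.len() {
        let limit = (data.len() - start).min(params.max_size);
        // Default cut: max_size bytes, or whatever input remains.
        let mut len = limit;
        let mut hash = 0u64;
        // Skip the first min_size bytes without hashing them.
        let mut i = params.min_size.min(limit);
        while i < limit {
            // Gear update: shift out old bytes, add the table entry for the new one.
            hash = (hash << 1).wrapping_add(gear[data[start + i] as usize]);
            if (hash & params.mask) == 0 {
                len = i + 1;
                break;
            }
            i += 1;
        }
        start += len;
        boundaries.push(start);
    }
    boundaries
}
```

Any SIMD candidate would be expected to return exactly the same `Vec<usize>` as this scalar loop for the same table and mask, which is what makes "same boundary distribution" testable byte for byte.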
### Tier 3 — originate, only if profiles justify it
These are deliberately speculative. None of them are committed work.
- A fused "scan + chunk + hash" pass that reads each byte from the
upload buffer once and emits chunk boundaries plus per-chunk BLAKE3
state in a single pass. Today this requires three passes (CDC, hash
per chunk, hash for manifest root).
- A SIMD'd content-type sniffer for the first N kilobytes of unknown
uploads.
- An AVX-512 implementation of a bloom / cuckoo filter probe for the
"have I seen this hash?" hot path.
- Fast batch verification: given a list of `(content_address, bytes)`
pairs, verify all of them in one SIMD-dispatched pass.
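
To make the first idea above concrete, here is a scalar sketch of a fused pass in which each byte is loaded once and feeds both the boundary detector and the per-chunk digest. It is illustrative only: FNV-1a stands in for BLAKE3 (it is not cryptographic and must not be used as a content address), `gear` is a computed stand-in for a Gear table lookup, and `Chunk`, `scan_chunk_hash`, and `fnv1a` are hypothetical names.

```rust
/// Stand-in for a Gear table lookup: SplitMix64 finaliser on the byte value.
fn gear(byte: u8) -> u64 {
    let mut z = (u64::from(byte) + 1).wrapping_mul(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

const FNV_OFFSET: u64 = 0xCBF2_9CE4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01B3;

/// Reference two-pass digest, used only to check the fused pass.
pub fn fnv1a(data: &[u8]) -> u64 {
    data.iter()
        .fold(FNV_OFFSET, |h, &b| (h ^ u64::from(b)).wrapping_mul(FNV_PRIME))
}

/// One emitted chunk: exclusive end offset plus the digest of its bytes.
pub struct Chunk {
    pub end: usize,
    pub digest: u64,
}

/// Single pass over `data`: find boundaries and hash each chunk together.
pub fn scan_chunk_hash(data: &[u8], min_size: usize, max_size: usize, mask: u64) -> Vec<Chunk> {
    assert!(max_size > 0 && min_size <= max_size);
    let mut chunks = Vec::new();
    let mut roll = 0u64; // rolling Gear hash for boundary detection
    let mut digest = FNV_OFFSET; // running per-chunk digest
    let mut len = 0usize; // bytes in the current chunk so far
    for (i, &byte) in data.iter().enumerate() {
        // The same loaded byte updates both pieces of state.
        digest = (digest ^ u64::from(byte)).wrapping_mul(FNV_PRIME);
        roll = (roll << 1).wrapping_add(gear(byte));
        len += 1;
        let cut = (len >= min_size && (roll & mask) == 0) || len == max_size;
        if cut {
            chunks.push(Chunk { end: i + 1, digest });
            roll = 0;
            digest = FNV_OFFSET;
            len = 0;
        }
    }
    if len > 0 {
        // Trailing partial chunk at end of input.
        chunks.push(Chunk { end: data.len(), digest });
    }
    chunks
}
```

The check that matters for such a kernel is that every emitted digest equals the two-pass result over the same byte range, for example `fnv1a(&data[prev..chunk.end])`; with BLAKE3 the running state would be a hasher object rather than a single `u64`.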
## Experiment protocol
For each Tier 2 or Tier 3 candidate that we take on:
1. **Frame the kernel.** One function, one clear input / output, one
measurable metric (bytes per second per core).
2. **Baseline.** Land a portable C or Rust implementation with full test
coverage and a recorded microbenchmark number.
3. **Dispatch.** Wire the kernel through the runtime CPU-feature
dispatcher (ffmpeg pattern, reimplemented MIT-0). Default path = the
baseline.
4. **Agentic asm attempt.** Use the coding agent to author a NASM-syntax
   asm implementation targeting one ISA extension (start with AVX2, the
   most broadly available). The agent must:
   - produce annotated source with cycle-accurate comments where relevant;
   - include the test that compares its output to baseline on randomised
     input;
   - include the microbenchmark.
5. **Independent review.** A second pass — human or a fresh agent context
— reviews for correctness, calling-convention compliance, and obvious
microarchitectural issues (false dependencies, port pressure, unaligned
loads, misuse of `vzeroupper`).
6. **Land or shelve.** If the asm beats the baseline by a meaningful
margin (≥ 1.5×) and passes review, it lands behind the dispatcher.
Otherwise it shelves with the benchmark numbers recorded so we know
not to retry without new techniques.
7. **Extend.** Repeat for AVX-512, then ARM NEON, then SVE2, in that
   order of impact.

Each completed kernel produces an ADR-style note in `docs/asm/` recording
the algorithm, the source of inspiration, the licence chain, the
benchmark numbers, and any microarchitectural notes.
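
The equivalence check, the throughput metric, and the land-or-shelve rule from the protocol can be sketched as a small harness. This is an assumption-laden illustration rather than the project's tooling: `Kernel`, `matches_baseline`, `throughput`, `verdict`, and the two toy byte-sum kernels are hypothetical, and a real microbenchmark would want warm-up, core pinning, and repeated samples.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical kernel shape: one input buffer, one 64-bit result.
pub type Kernel = fn(&[u8]) -> u64;

/// Toy baseline kernel: sum of all bytes.
pub fn sum_baseline(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}

/// Toy candidate: the same sum with four independent accumulators.
pub fn sum_unrolled(data: &[u8]) -> u64 {
    let mut lanes = [0u64; 4];
    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        for (lane, &b) in lanes.iter_mut().zip(chunk) {
            *lane += u64::from(b);
        }
    }
    let tail: u64 = chunks.remainder().iter().map(|&b| u64::from(b)).sum();
    lanes.iter().sum::<u64>() + tail
}

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// True if the candidate agrees with the baseline on `rounds` random inputs.
pub fn matches_baseline(baseline: Kernel, candidate: Kernel, rounds: usize, seed: u64) -> bool {
    let mut rng = XorShift64(seed | 1); // xorshift state must be non-zero
    (0..rounds).all(|_| {
        let len = (rng.next() % 4096) as usize;
        let input: Vec<u8> = (0..len).map(|_| (rng.next() >> 32) as u8).collect();
        baseline(&input) == candidate(&input)
    })
}

/// Bytes per second for `iters` passes of `kernel` over `input`.
pub fn throughput(kernel: Kernel, input: &[u8], iters: u32) -> f64 {
    let start = Instant::now();
    let mut sink = 0u64;
    for _ in 0..iters {
        // black_box keeps the optimiser from deleting the measured work.
        sink ^= black_box(kernel(black_box(input)));
    }
    black_box(sink);
    let secs = start.elapsed().as_secs_f64().max(f64::MIN_POSITIVE);
    input.len() as f64 * f64::from(iters) / secs
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Land,
    Shelve,
}

/// Step 6: land only if correct and at least 1.5x the baseline throughput.
pub fn verdict(correct: bool, baseline_bps: f64, candidate_bps: f64) -> Verdict {
    if correct && candidate_bps >= 1.5 * baseline_bps {
        Verdict::Land
    } else {
        Verdict::Shelve
    }
}
```

A run would feed `matches_baseline(sum_baseline, sum_unrolled, ...)` and two `throughput` measurements into `verdict`, then record all three numbers in the kernel's note regardless of the outcome.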
## What the experiment proves or disproves
A succeeding experiment delivers:

- a portable asm-accelerated data plane that competes with hand-tuned C
  storage stacks on throughput;
- a public record of which kernels the agentic approach handles well and
  which it does not;
- a reusable dispatcher and macro foundation that other projects can adopt.

A failing experiment delivers:

- a published record of where agentic coding plateaus on hot-path asm;
- an artifact-store data plane that is still very good, because the
  baseline is "use the asm-tuned library", which is already fast.

Either outcome is publishable. The downside is bounded.
## Out of scope for this experiment
- Cryptography written by us. Use vetted libraries. Always.
- Architectures with small deployment footprints in this domain (RISC-V,
POWER, MIPS). Revisit once x86_64 and ARM64 are solid.
- Kernel-bypass networking (DPDK, eBPF/XDP storage). Different
experiment, different document if we ever pursue it.
- GPU offload. Different cost model; not addressed here.
## Immediate next steps
None are committed. When the v1 baseline (WP-0001) lands and we have a
real profile of where time is spent, the first candidate to pick up is
almost certainly **FastCDC + BLAKE3 in a single pass**, because that is
the documented bottleneck of every CAS-style storage system that has
profiled it (restic, borg, kopia). Until then, this document is a
holding place for the ambition.