Captures the longer-horizon thesis (sovereign-cloud artifact substrate) alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine schema/contract commitments the v1 must preserve to keep that horizon reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
# Assembly Experiment

Status: draft / opt-in research line
Created: 2026-05-15

This document defines an opt-in research line under `artifact-store`: can
agentic coding adopt, extend, and eventually originate ffmpeg-grade
hand-written assembly for the hot paths of an artifact-storage data plane?

This is a research experiment, not roadmap-critical work. The platform
ambition (`docs/PLATFORM-AMBITION.md`) stands on its own merits whether or
not we ever write a single line of assembly. The experiment runs alongside.

## Why this experiment exists

ffmpeg is the empirical proof that hand-written assembly with runtime CPU
dispatch still substantially outperforms even the best
Rust-with-SIMD-intrinsics codebases for tight inner loops, often by 1.5–3×
on the same hardware, sometimes more. The cost is steep: domain expertise,
multi-arch maintenance, calling-convention discipline, microarchitecture
awareness. ffmpeg has decades of contributor depth to amortise that cost.

We do not have that depth. The interesting question is whether large
language models, used as coding agents, change the cost equation enough to
make this approach viable for a focused project. If they do, an artifact
substrate that competes on raw throughput-per-core has a real edge against
generic object stores. If they do not, we adopt prebuilt asm-tuned
libraries and lose nothing.

## Strategic context

This experiment ties to the commercial horizon recorded in
`docs/PLATFORM-AMBITION.md`. A sovereign-cloud artifact product that
ingests, hashes, dedups, and serves bytes at noticeably higher
throughput-per-core than commodity object stores has a defensible edge.
"Cheaper per-GB than AWS" is a losing race; "more throughput per server,
on hardware you already own" is not.

## Constraints

### Licence

- `artifact-store` is MIT No Attribution.
- ffmpeg's `libavutil` (where the storage-relevant asm lives) is LGPL 2.1+.
- We **cannot** copy LGPL-licensed asm into MIT-0 source.
- We **can**:
  - dynamically link to `libavutil` at runtime (users get both licences);
  - re-license a *segregated optional native module* under LGPL 2.1+ while
    the rest of the repo stays MIT-0, provided the module is its own
    package and the boundary is explicit;
  - read LGPL code and implement the same algorithm from scratch
    (algorithms are not copyrightable; specific source text is). This is
    the standard practice for clean-room reimplementation. Document the
    process per file.
  - prefer asm sources under permissive licences (BSD, Apache, CC0,
    public domain) where they exist.

Preferred upstream licences for the experiment, in order:

1. Public domain / CC0 (Intel reference, BLAKE3 reference)
2. Apache-2.0 / BSD / MIT (xxhash, zstd, ring)
3. LGPL via dynamic linking (libavutil)
4. Clean-room reimplementation inspired by LGPL (last resort)

### Maintenance budget

The experiment is bounded. Any asm we adopt or write must:

- have a portable C / Rust fallback that is correctness-equivalent;
- be reachable through a runtime CPU-feature dispatch table (the ffmpeg
  pattern) so the binary still runs on machines without the relevant
  extension;
- carry a test that compares its output byte-for-byte against the fallback
  on randomised inputs;
- carry a microbenchmark with a recorded baseline so regressions are
  visible.

If we cannot meet those four bars for a candidate, we ship the library
implementation and revisit later.

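The dispatch requirement above can be sketched in Rust using only the standard library. This is a minimal illustration, not the project's dispatcher: the kernel (`byte_sum`), its function names, and the "AVX2" variant are hypothetical, and the accelerated variant is a plain-Rust placeholder standing in for where a NASM or intrinsics implementation would be linked. The only real API assumed is the standard `is_x86_feature_detected!` macro and `std::sync::OnceLock`.

```rust
use std::sync::OnceLock;

/// Signature shared by every implementation of the (hypothetical) kernel.
type SumFn = fn(&[u8]) -> u64;

/// Portable fallback: always available and defines the correct behaviour.
fn byte_sum_portable(data: &[u8]) -> u64 {
    data.iter().map(|&b| u64::from(b)).sum()
}

/// Placeholder for an accelerated variant. It is ordinary Rust that walks
/// 32-byte blocks; a real variant would be asm or intrinsics (assumption).
#[cfg(target_arch = "x86_64")]
fn byte_sum_avx2_placeholder(data: &[u8]) -> u64 {
    let mut total = 0u64;
    let mut blocks = data.chunks_exact(32);
    for block in &mut blocks {
        total += block.iter().map(|&b| u64::from(b)).sum::<u64>();
    }
    total + blocks.remainder().iter().map(|&b| u64::from(b)).sum::<u64>()
}

/// Pick the best implementation the running CPU supports.
fn select_byte_sum() -> SumFn {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            return byte_sum_avx2_placeholder;
        }
    }
    byte_sum_portable
}

/// Public entry point: feature detection runs once, later calls reuse
/// the cached function pointer.
pub fn byte_sum(data: &[u8]) -> u64 {
    static IMPL: OnceLock<SumFn> = OnceLock::new();
    let selected = *IMPL.get_or_init(select_byte_sum);
    selected(data)
}
```

Callers only ever see `byte_sum`; on a machine without the extension, or on a non-x86 target, the same call resolves to the portable fallback.
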
## What ffmpeg actually has that is reusable here

Inspection of `libavutil/x86/` (2026-05-15) found the following
storage-relevant assets:

| File / module | What it accelerates | Reuse value for artifact-store |
|------------------------------|-------------------------------|--------------------------------|
| `x86/crc.asm` | CRC-32 (LE + BE) via PCLMULQDQ | **High.** Fast non-crypto integrity check for chunks and network framing. Public function names `ff_crc_le`, `ff_crc`. LGPL: must dynamic-link or reimplement. |
| `x86/aes.asm` + `aes_init.c` | AES block cipher | **Low–medium.** ffmpeg's AES is unauthenticated. At-rest encryption needs AES-GCM, better adopted from ring / BoringSSL / AWS-LC (permissive licences, FIPS-validatable). |
| `x86/cpuid.asm` + `cpu.c` | CPU feature detection | **High (pattern, not code).** Reimplement the `ff_get_cpu_flags_x86()` + `AV_CPU_FLAG_*` pattern under MIT-0. This is the dispatch backbone. |
| `x86/x86inc.asm` | Macro library for asm authoring | **High (technique).** Cross-platform calling conventions, register naming, function prologue/epilogue. ffmpeg's macros are the de-facto standard outside game-dev. NASM-syntax. |
| `x86/x86util.asm` | SIMD helper macros | **Medium.** Useful patterns; not directly liftable. |
| `x86/emms.asm` | MMX state clearing | **Zero.** Legacy. |
| `sha.c` | SHA-1 / SHA-224 / SHA-256 | **Zero.** Pure C, no SIMD. We are better off with BLAKE3 (asm-tuned upstream) and SHA-NI via OpenSSL / ring for SHA-256. |
| `aes_ctr.c`, `blowfish.c`, `camellia.c`, `cast5.c`, `des.c` | Block ciphers | **Zero.** Not relevant for our threat model. |
| `adler32.c`, `crc.c` | Reference integrity (C) | **Zero.** Use the asm-accelerated variants. |

Everything in `libavcodec` (DCT, motion estimation, deblocking) and the
video / audio / image-utility `.asm` files in `libavutil` is irrelevant to
artifact-store and stays out of scope.

## Candidate hot kernels for artifact-store, ranked

Each kernel below is a candidate either for adoption (drop in a vetted
permissive library), extension (start from a permissive baseline and
optimise further), or origination (write fresh).

### Tier 1: adopt now, do not write

| Kernel | Recommended source | Notes |
|---------------|-------------------------------------------------------|-------|
| BLAKE3 | `blake3` (C reference + Rust crate), Apache-2.0 / CC0 | Already ships hand-tuned SSE2, SSE4.1, AVX2, and AVX-512 assembly for x86_64, plus a NEON implementation for ARM64. We will never beat upstream. |
| SHA-256 (compat) | OpenSSL / ring / AWS-LC, permissive | Uses SHA-NI on supporting CPUs. |
| AES-GCM | ring / BoringSSL, ISC / BSD | AES-NI + PCLMULQDQ for GHASH. Authenticated; what we actually need. |
| Zstandard | `zstd` (Facebook), BSD-3 | Multi-GB/s with SIMD. |
| LZ4 | `lz4`, BSD-2 | Faster than zstd at lower ratio; useful for high-throughput cold paths. |

### Tier 2: adopt + extend, this is where the experiment starts

| Kernel | Baseline source | Extension question |
|--------------------|----------------------------------------------|--------------------|
| FastCDC (rolling hash) | `fastcdc-rs` (MIT) or original C paper code | Can we squeeze a SIMD'd Gear-hash variant that maintains the same boundary distribution? Existing Rust impl is scalar. |
| CRC-32C (Castagnoli, for chunk integrity) | Intel reference white paper code (public domain) | PCLMULQDQ-accelerated; ffmpeg's `crc.asm` shows the technique under LGPL. Reimplement under MIT-0 from the Intel paper. |
| XXH3 | `xxhash` (BSD-2) | Already SIMD'd; the extension is whether we can fuse it with our chunk-boundary loop to read each byte once. |
| Manifest canonicalisation hash | Whatever canonical-CBOR lib we pin | Likely no asm needed; included to monitor whether it ever appears on a profile. |

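For the CRC-32C row, the portable fallback that any accelerated variant would be compared against can be very small. The sketch below is a plain bit-at-a-time reference using the standard CRC-32C parameters (reflected Castagnoli polynomial `0x82F63B78`, initial value and final XOR of `0xFFFFFFFF`). The function name is hypothetical, and this is the slow correctness oracle, not the PCLMULQDQ technique itself.

```rust
/// Reflected form of the Castagnoli polynomial 0x1EDC6F41.
const CRC32C_POLY_REFLECTED: u32 = 0x82F6_3B78;

/// Bit-at-a-time CRC-32C. Slow, but easy to audit, which is what a
/// fallback used as a test oracle needs to be.
pub fn crc32c_portable(data: &[u8]) -> u32 {
    let mut crc: u32 = !0; // initial value 0xFFFFFFFF
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (CRC32C_POLY_REFLECTED & mask);
        }
    }
    !crc // final XOR with 0xFFFFFFFF
}
```

The standard check input `b"123456789"` yields `0xE3069283`, which gives a fixed anchor point in addition to the randomised comparison against an accelerated variant.
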
### Tier 3: originate, only if profiles justify it

These are deliberately speculative. None of them are committed work.

- A fused "scan + chunk + hash" pass that reads each byte from the
  upload buffer once and emits chunk boundaries plus per-chunk BLAKE3
  state in a single pass. Today this requires three passes (CDC, hash
  per chunk, hash for manifest root).
- A SIMD'd content-type sniffer for the first N kilobytes of unknown
  uploads.
- An AVX-512 implementation of a bloom / cuckoo filter probe for the
  "have I seen this hash?" hot path.
- Fast batch verification: given a list of `(content_address, bytes)`
  pairs, verify all of them in one SIMD-dispatched pass.

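The shape of the fused pass in the first bullet can be illustrated with a scalar sketch in which each byte feeds both a Gear-style rolling hash and a per-chunk digest in the same loop iteration. Everything here is an assumption made for illustration: the Gear table is generated with SplitMix64 rather than taken from any published table, the cut rule is a simplified single-mask rule rather than FastCDC's normalised chunking, and FNV-1a stands in for BLAKE3 so the example needs no external crates. It does not produce the manifest-root hash.

```rust
/// Illustrative 256-entry Gear table built with SplitMix64 (assumption:
/// a real chunker would pin a fixed, documented table).
fn gear_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15;
    for slot in table.iter_mut() {
        state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        *slot = z ^ (z >> 31);
    }
    table
}

// FNV-1a 64-bit parameters; a stand-in for a real content hash.
const FNV_OFFSET: u64 = 0xCBF2_9CE4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01B3;

pub struct Chunk {
    pub offset: usize,
    pub len: usize,
    pub digest: u64,
}

/// Single pass over `data`: finds chunk boundaries and hashes each chunk.
pub fn chunk_and_hash(data: &[u8], min_len: usize, max_len: usize, mask: u64) -> Vec<Chunk> {
    let gear = gear_table();
    let mut chunks = Vec::new();
    let (mut start, mut roll, mut digest) = (0usize, 0u64, FNV_OFFSET);
    for (i, &byte) in data.iter().enumerate() {
        // The byte is read once and updates both running states.
        roll = (roll << 1).wrapping_add(gear[byte as usize]);
        digest = (digest ^ u64::from(byte)).wrapping_mul(FNV_PRIME);
        let len = i + 1 - start;
        if (len >= min_len && (roll & mask) == 0) || len >= max_len {
            chunks.push(Chunk { offset: start, len, digest });
            start = i + 1;
            roll = 0;
            digest = FNV_OFFSET;
        }
    }
    if start < data.len() {
        // Trailing bytes that never hit a boundary form the last chunk.
        chunks.push(Chunk { offset: start, len: data.len() - start, digest });
    }
    chunks
}
```

A call such as `chunk_and_hash(&buf, 2048, 65536, 0x1FFF)` returns contiguous chunks covering the whole buffer; the open question for the experiment is whether a SIMD version of this loop can keep the same boundaries.
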
## Experiment protocol

For each Tier 2 or Tier 3 candidate that we take on:

1. **Frame the kernel.** One function, one clear input / output, one
   measurable metric (bytes per second per core).
2. **Baseline.** Land a portable C or Rust implementation with full test
   coverage and a recorded microbenchmark number.
3. **Dispatch.** Wire the kernel through the runtime CPU-feature
   dispatcher (ffmpeg pattern, reimplemented MIT-0). Default path = the
   baseline.
4. **Agentic asm attempt.** Use the coding agent to author a NASM-syntax
   asm implementation targeting one ISA extension (start with AVX2, the
   most broadly available). The agent must:
   - produce annotated source with cycle-accurate comments where relevant;
   - include the test that compares its output to baseline on randomised
     input;
   - include the microbenchmark.
5. **Independent review.** A second pass (human or a fresh agent context)
   reviews for correctness, calling-convention compliance, and obvious
   microarchitectural issues (false dependencies, port pressure, unaligned
   loads, misuse of `vzeroupper`).
6. **Land or shelve.** If the asm beats the baseline by a meaningful
   margin (≥ 1.5×) and passes review, it lands behind the dispatcher.
   Otherwise it shelves with the benchmark numbers recorded so we know
   not to retry without new techniques.
7. **Extend.** Repeat for AVX-512, then ARM NEON, then SVE2, in that
   order of impact.

Each completed kernel produces an ADR-style note in `docs/asm/` recording
the algorithm, the source of inspiration, the licence chain, the
benchmark numbers, and any microarchitectural notes.

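Steps 4 and 6 imply two small pieces of tooling: a randomised differential check against the baseline, and a throughput comparison gated on the 1.5× margin. The sketch below shows one possible shape using only the standard library. The `Kernel` signature, the function names, and the xorshift64* generator are assumptions for illustration, and the timing loop is deliberately crude; a real setup would more likely use a dedicated benchmarking harness.

```rust
use std::time::Instant;

/// Small xorshift64* generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 >> 12;
        self.0 ^= self.0 << 25;
        self.0 ^= self.0 >> 27;
        self.0.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }
}

/// Hypothetical kernel shape: bytes in, one 64-bit result out.
type Kernel = fn(&[u8]) -> u64;

/// Runs both kernels on random inputs of random length. Returns the
/// length of the first input on which they disagree, or `None`.
pub fn differential_check(baseline: Kernel, candidate: Kernel, seed: u64, rounds: usize) -> Option<usize> {
    let mut rng = XorShift64(seed | 1); // state must be non-zero
    for _ in 0..rounds {
        let len = (rng.next() % 4096) as usize;
        let input: Vec<u8> = (0..len).map(|_| rng.next() as u8).collect();
        if baseline(&input) != candidate(&input) {
            return Some(len);
        }
    }
    None
}

/// Crude throughput estimate in bytes per second on the current core.
pub fn bytes_per_second(kernel: Kernel, input: &[u8], iterations: u32) -> f64 {
    let started = Instant::now();
    for _ in 0..iterations {
        // black_box keeps the optimiser from deleting the call.
        std::hint::black_box(kernel(std::hint::black_box(input)));
    }
    let secs = started.elapsed().as_secs_f64();
    (input.len() as f64 * f64::from(iterations)) / secs
}

/// The land-or-shelve margin from step 6.
pub fn should_land(baseline_bps: f64, candidate_bps: f64) -> bool {
    candidate_bps >= 1.5 * baseline_bps
}
```

Recording the seed alongside a failure makes the failing input reproducible, which matters when the reviewer in step 5 is a fresh agent context with no memory of the original run.
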
## What the experiment proves or disproves

A succeeding experiment delivers:

- a portable asm-accelerated data plane that competes with hand-tuned C
  storage stacks on throughput;
- a public record of which kernels the agentic approach handles well and
  which it does not;
- a reusable dispatcher and macro foundation that other projects can adopt.

A failing experiment delivers:

- a published record of where agentic coding plateaus on hot-path asm;
- an artifact-store data plane that is still very good, because the
  baseline is "use the asm-tuned library", which is already fast.

Either outcome is publishable. The downside is bounded.

## Out of scope for this experiment

- Cryptography written by us. Use vetted libraries. Always.
- Architectures with small deployment footprints in this domain (RISC-V,
  POWER, MIPS). Revisit once x86_64 and ARM64 are solid.
- Kernel-bypass networking (DPDK, eBPF/XDP storage). Different
  experiment, different document if we ever pursue it.
- GPU offload. Different cost model; not addressed here.

## Immediate next steps

None are committed. When the v1 baseline (WP-0001) lands and we have a
real profile of where time is spent, the first candidate to pick up is
almost certainly **FastCDC + BLAKE3 in a single pass**, because that is
the documented bottleneck of every CAS-style storage system that has
profiled it (restic, borg, kopia). Until then, this document is a
holding place for the ambition.