diff --git a/README.md b/README.md index 59e55c4..dd304bd 100644 --- a/README.md +++ b/README.md @@ -9,6 +9,9 @@ as a local filesystem, S3-compatible object storage, or Ceph RGW. Start here: -- [INTENT.md](INTENT.md) -- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) -- [workplans/ARTIFACT-STORE-WP-0001-service-baseline.md](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md) +- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary +- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — draft architecture +- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon thesis and the schema commitments v1 preserves +- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md) — SWOT and optimisation review +- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in research line on hand-tuned assembly for hot kernels +- [workplans/ARTIFACT-STORE-WP-0001-service-baseline.md](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md) — first implementation workplan diff --git a/docs/ASSEMBLY-EXPERIMENT.md b/docs/ASSEMBLY-EXPERIMENT.md new file mode 100644 index 0000000..36f08a5 --- /dev/null +++ b/docs/ASSEMBLY-EXPERIMENT.md @@ -0,0 +1,209 @@ +# Assembly Experiment + +Status: draft / opt-in research line +Created: 2026-05-15 + +This document defines an opt-in research line under `artifact-store`: can +agentic coding adopt, extend, and eventually originate ffmpeg-grade hand- +written assembly for the hot paths of an artifact-storage data plane? + +This is a research experiment, not roadmap-critical work. The platform +ambition (`docs/PLATFORM-AMBITION.md`) stands on its own merits whether or +not we ever write a single line of assembly. The experiment runs alongside. 
+ +## Why this experiment exists + +ffmpeg is the empirical proof that hand-written assembly with runtime CPU +dispatch still substantially outperforms even the best Rust-with-SIMD- +intrinsics codebases for tight inner loops — often by 1.5–3× on the same +hardware, sometimes more. The cost is steep: domain expertise, multi-arch +maintenance, calling-convention discipline, microarchitecture awareness. +ffmpeg has decades of contributor depth to amortise that cost. + +We do not have that depth. The interesting question is whether large +language models, used as coding agents, change the cost equation enough to +make this approach viable for a focused project. If they do, an artifact +substrate that competes on raw throughput-per-core has a real edge against +generic object stores. If they do not, we adopt prebuilt asm-tuned +libraries and lose nothing. + +## Strategic context + +This experiment ties to the commercial horizon recorded in +`docs/PLATFORM-AMBITION.md`. A sovereign-cloud artifact product that +ingests, hashes, dedups, and serves bytes at noticeably higher +throughput-per-core than commodity object stores has a defensible edge. +"Cheaper per-GB than AWS" is a losing race; "more throughput per server, +on hardware you already own" is not. + +## Constraints + +### Licence + +- `artifact-store` is MIT No Attribution. +- ffmpeg's `libavutil` (where the storage-relevant asm lives) is LGPL 2.1+. +- We **cannot** copy LGPL-licensed asm into MIT-0 source. +- We **can**: + - dynamically link to `libavutil` at runtime (users get both licences); + - re-license a *segregated optional native module* under LGPL 2.1+ while + the rest of the repo stays MIT-0, provided the module is its own + package and the boundary is explicit; + - read LGPL code and implement the same algorithm from scratch + (algorithms are not copyrightable; specific source text is). This is + the standard practice for clean-room reimplementation. Document the + process per file. 
+ - prefer asm sources under permissive licences (BSD, Apache, CC0, + public domain) where they exist. + +Preferred upstream licences for the experiment, in order: + +1. Public domain / CC0 (Intel reference, BLAKE3 reference) +2. Apache-2.0 / BSD / MIT (xxhash, zstd, ring) +3. LGPL via dynamic linking (libavutil) +4. Clean-room reimplementation inspired by LGPL (last resort) + +### Maintenance budget + +The experiment is bounded. Any asm we adopt or write must: + +- have a portable C / Rust fallback that is correctness-equivalent; +- be reachable through a runtime CPU-feature dispatch table (the ffmpeg + pattern) so the binary still runs on machines without the relevant + extension; +- carry a test that compares its output byte-for-byte against the fallback + on randomised inputs; +- carry a microbenchmark with a recorded baseline so regressions are + visible. + +If we cannot meet those four bars for a candidate, we ship the library +implementation and revisit later. + +## What ffmpeg actually has that is reusable here + +Inspection of `libavutil/x86/` (2026-05-15) found the following +storage-relevant assets: + +| File / module | What it accelerates | Reuse value for artifact-store | +|------------------------------|-------------------------------|--------------------------------| +| `x86/crc.asm` | CRC-32 (LE + BE) via PCLMULQDQ | **High.** Fast non-crypto integrity check for chunks and network framing. Public function names `ff_crc_le`, `ff_crc`. LGPL — must dynamic-link or reimplement. | +| `x86/aes.asm` + `aes_init.c` | AES block cipher | **Low–medium.** ffmpeg's AES is unauthenticated. At-rest encryption needs AES-GCM, better adopted from Ring / BoringSSL / AWS-LC (permissive licences, FIPS-validatable). | +| `x86/cpuid.asm` + `cpu.c` | CPU feature detection | **High (pattern, not code).** Reimplement the `ff_get_cpu_flags_x86()` + `AV_CPU_FLAG_*` pattern under MIT-0. This is the dispatch backbone. 
| +| `x86/x86inc.asm` | Macro library for asm authoring | **High (technique).** Cross-platform calling conventions, register naming, function prologue/epilogue. ffmpeg's macros are the de-facto standard outside game-dev. NASM-syntax. | +| `x86/x86util.asm` | SIMD helper macros | **Medium.** Useful patterns; not directly liftable. | +| `x86/emms.asm` | MMX state clearing | **Zero.** Legacy. | +| `sha.c` | SHA-1 / SHA-224 / SHA-256 | **Zero.** Pure C, no SIMD. We are better off with BLAKE3 (asm-tuned upstream) and SHA-NI via OpenSSL / Ring for SHA-256. | +| `aes_ctr.c`, `blowfish.c`, `camellia.c`, `cast5.c`, `des.c` | Block ciphers | **Zero.** Not relevant for our threat model. | +| `adler32.c`, `crc.c` | Reference integrity (C) | **Zero.** Use the asm-accelerated variants. | + +Everything in `libavcodec` (DCT, motion estimation, deblocking) and the +video / audio / image-utility `.asm` files in `libavutil` is irrelevant to +artifact-store and stays out of scope. + +## Candidate hot kernels for artifact-store, ranked + +Each kernel below is a candidate either for adoption (drop in a vetted +permissive library), extension (start from a permissive baseline and +optimise further), or origination (write fresh). + +### Tier 1 — adopt now, do not write + +| Kernel | Recommended source | Notes | +|---------------|-------------------------------------------------------|-------| +| BLAKE3 | `blake3` (C reference + Rust crate), Apache-2.0 / CC0 | Already ships hand-tuned AVX-512, AVX2, SSE4.1, ARM NEON, ARM64. We will never beat upstream. | +| SHA-256 (compat) | OpenSSL / Ring / AWS-LC, permissive | Uses SHA-NI on supporting CPUs. | +| AES-GCM | Ring / BoringSSL, ISC / BSD | AES-NI + PCLMULQDQ for GHASH. Authenticated; what we actually need. | +| Zstandard | `zstd` (Facebook), BSD-3 | Multi-GB/s with SIMD. | +| LZ4 | `lz4`, BSD-2 | Faster than zstd at lower ratio; useful for high-throughput cold paths. 
| + +### Tier 2 — adopt + extend, this is where the experiment starts + +| Kernel | Baseline source | Extension question | +|--------------------|----------------------------------------------|--------------------| +| FastCDC (rolling hash) | `fastcdc-rs` (MIT) or original C paper code | Can we squeeze a SIMD'd Gear-hash variant that maintains the same boundary distribution? Existing Rust impl is scalar. | +| CRC-32C (Castagnoli, for chunk integrity) | Intel reference white paper code (public domain) | PCLMULQDQ-accelerated; ffmpeg's `crc.asm` shows the technique under LGPL — reimplement under MIT-0 from the Intel paper. | +| xxhash3 | `xxhash` (BSD-2) | Already SIMD'd; the extension is whether we can fuse it with our chunk-boundary loop to read each byte once. | +| Manifest canonicalisation hash | Whatever canonical-CBOR lib we pin | Likely no asm needed; included to monitor whether it ever appears on a profile. | + +### Tier 3 — originate, only if profiles justify it + +These are deliberately speculative. None of them are committed work. + +- A fused "scan + chunk + hash" pass that reads each byte from the + upload buffer once and emits chunk boundaries plus per-chunk BLAKE3 + state in a single pass. Today this requires three passes (CDC, hash + per chunk, hash for manifest root). +- A SIMD'd content-type sniffer for the first N kilobytes of unknown + uploads. +- An AVX-512 implementation of a bloom / cuckoo filter probe for the + "have I seen this hash?" hot path. +- Fast batch verification: given a list of `(content_address, bytes)` + pairs, verify all of them in one SIMD-dispatched pass. + +## Experiment protocol + +For each Tier 2 or Tier 3 candidate that we take on: + +1. **Frame the kernel.** One function, one clear input / output, one + measurable metric (bytes per second per core). +2. **Baseline.** Land a portable C or Rust implementation with full test + coverage and a recorded microbenchmark number. +3. 
**Dispatch.** Wire the kernel through the runtime CPU-feature + dispatcher (ffmpeg pattern, reimplemented MIT-0). Default path = the + baseline. +4. **Agentic asm attempt.** Use the coding agent to author a NASM-syntax + asm implementation targeting one ISA extension (start with AVX2 — most + broadly available). The agent must: + - produce annotated source with cycle-accurate comments where relevant; + - include the test that compares its output to baseline on randomised + input; + - include the microbenchmark. +5. **Independent review.** A second pass — human or a fresh agent context + — reviews for correctness, calling-convention compliance, and obvious + microarchitectural issues (false dependencies, port pressure, unaligned + loads, misuse of `vzeroupper`). +6. **Land or shelve.** If the asm beats the baseline by a meaningful + margin (≥ 1.5×) and passes review, it lands behind the dispatcher. + Otherwise it shelves with the benchmark numbers recorded so we know + not to retry without new techniques. +7. **Extend.** Repeat for AVX-512, then ARM NEON, then SVE2, in that + order of impact. + +Each completed kernel produces an ADR-style note in `docs/asm/` recording +the algorithm, the source of inspiration, the licence chain, the +benchmark numbers, and any microarchitectural notes. + +## What the experiment proves or disproves + +A succeeding experiment delivers: + +- a portable asm-accelerated data plane that competes with hand-tuned C + storage stacks on throughput; +- a public record of which kernels the agentic approach handles well and + which it does not; +- a reusable dispatcher and macro foundation that other projects can adopt. + +A failing experiment delivers: + +- a published record of where agentic coding plateaus on hot-path asm; +- an artifact-store data plane that is still very good — because the + baseline is "use the asm-tuned library", which is already fast. + +Either outcome is publishable. The downside is bounded. 
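The dispatch and byte-for-byte comparison steps of the protocol above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not project code: the names `crc32_baseline`, `crc32_accelerated`, `select_kernel`, and `kernels_agree` are hypothetical, the feature string `"pclmulqdq"` stands in for whatever a real CPU-feature probe would report, and `zlib.crc32` plays the role of an asm kernel.

```python
import random
import zlib
from typing import Callable, FrozenSet, List, Tuple

Kernel = Callable[[bytes], int]


def crc32_baseline(data: bytes) -> int:
    """Portable bitwise CRC-32 (reflected polynomial 0xEDB88320)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (0xEDB88320 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


def crc32_accelerated(data: bytes) -> int:
    """Stand-in for a hand-tuned kernel; here it simply calls zlib."""
    return zlib.crc32(data) & 0xFFFFFFFF


# Hypothetical dispatch table: (required CPU features, kernel), best first.
CANDIDATES: List[Tuple[FrozenSet[str], Kernel]] = [
    (frozenset({"pclmulqdq"}), crc32_accelerated),
]


def select_kernel(cpu_flags: FrozenSet[str]) -> Kernel:
    """Return the first kernel whose requirements are met, else the baseline."""
    for required, kernel in CANDIDATES:
        if required <= cpu_flags:
            return kernel
    return crc32_baseline


def kernels_agree(candidate: Kernel, baseline: Kernel,
                  trials: int = 200, seed: int = 0) -> bool:
    """Compare candidate and baseline outputs on randomised inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randrange(0, 256)
        data = bytes(rng.getrandbits(8) for _ in range(length))
        if candidate(data) != baseline(data):
            return False
    return True
```

The design point is that the baseline is always the default: a machine that reports no relevant feature still gets a correct result, and any candidate must pass `kernels_agree` against the baseline before it is allowed into the table.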
+ +## Out of scope for this experiment + +- Cryptography written by us. Use vetted libraries. Always. +- Architectures with small deployment footprints in this domain (RISC-V, + POWER, MIPS). Revisit once x86_64 and ARM64 are solid. +- Kernel-bypass networking (DPDK, eBPF/XDP storage). Different + experiment, different document if we ever pursue it. +- GPU offload. Different cost model; not addressed here. + +## Immediate next steps + +None are committed. When the v1 baseline (WP-0001) lands and we have a +real profile of where time is spent, the first candidate to pick up is +almost certainly **FastCDC + BLAKE3 in a single pass**, because that is +the documented bottleneck of every CAS-style storage system that has +profiled it (restic, borg, kopia). Until then, this document is a +holding place for the ambition. diff --git a/docs/PLATFORM-AMBITION.md b/docs/PLATFORM-AMBITION.md new file mode 100644 index 0000000..4f51564 --- /dev/null +++ b/docs/PLATFORM-AMBITION.md @@ -0,0 +1,226 @@ +# Platform Ambition + +Status: draft +Created: 2026-05-15 + +This document records the longer-horizon thesis behind `artifact-store` and +captures which decisions are taken now to keep that horizon reachable without +expanding the v1 workplan. It sits beside, not above, `INTENT.md` and +`SCOPE.md`. INTENT defines what we build first; this document defines what +the v1 must not foreclose. + +## Thesis + +Generated artifacts — evidence packages, build outputs, ML models, logs, +snapshots, reports, scorecards, exports — are first-class durable objects in +modern software work. They sit somewhere between source code (well-served by +Git) and binary releases (well-served by OCI registries). The space between +is currently filled by a fragmented mix of bespoke directories, ad-hoc S3 +buckets, vendor registries (Artifactory, Nexus), and document-management +systems that were not built for machine producers. 
+ +`artifact-store` aims to occupy that gap with one substrate: a generic, +content-addressed, signed, deduplicated, retention-aware artifact registry +and storage gateway that other tools embed or speak to. + +The reference points are deliberate. **VLC** and **ffmpeg** lead their domain +not by being the prettiest applications but by being correct, fast, embeddable, +portable, and indispensable infrastructure for everyone else. The same +strategy applies here: build a kernel that is so good at the +bytes-and-identity layer that every artifact-producing tool would rather speak its +protocol than reinvent it. + +## Commercial horizon + +The longer-horizon commercial target is **a sovereign artifact-storage +product line for European cloud providers**, with STACKIT (Schwarz Group) as +the concrete example. The thesis is: + +- Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do + not sell *artifact identity, retention, attestation, federation, evidence + preservation*. Customers either build it themselves or buy proprietary + registries on top. +- A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned + artifact substrate on top of its own object storage has a defensible + differentiation against AWS: not in raw price-per-GB, which is a losing + race, but in regulated workloads (evidence retention, audit, signed + attestations, legal-hold, sovereign jurisdiction guarantees). +- Open source is the wedge. A widely-adopted upstream that the provider + ships, supports, and extends is far stronger than a proprietary stack. + +This is a multi-year horizon, not a v1 deliverable. It is recorded here so +schema and protocol decisions made now keep that path open.
+ +## Reference points + +| System | What we learn from it | +|-----------------|--------------------------------------------------------------| +| ffmpeg | Embeddable core, hand-tuned hot paths, runtime CPU dispatch | +| VLC | Plugin architecture, portability, ubiquity through being a library too | +| Git | Content-addressed storage, Merkle DAG, pack files, integrity | +| restic | Single static binary, CDC + dedup, encryption by default | +| IPFS | Content-addressing, federation, partial replication | +| OCI Registry | Standardised manifest + blob model with broad ecosystem | +| Sigstore / cosign | Signed attestations as a first-class artifact property | +| MinIO | Operator ergonomics, S3 wire compat as adoption vector | +| SeaweedFS / Ceph | Separation of metadata plane from data plane | +| RocksDB / LMDB | Embeddable storage engines with predictable performance | +| Kafka | Log-as-source-of-truth, materialised views | +| BLAKE3 | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned | + +We are not trying to reproduce any of these. We are trying to occupy a +specific gap between them with the best ideas from each. + +## Non-goals (still) + +The platform ambition does not change the v1 boundary in `INTENT.md` or +`SCOPE.md`. In particular it does not: + +- replace StateHub as the work / decision system of record; +- encode producer-specific assessment semantics in the registry core; +- require any of the optimisations listed in the "near-horizon" section + below to land in v1; +- commit the project to writing assembly. The assembly-experiment line + (`docs/ASSEMBLY-EXPERIMENT.md`) is opt-in research, not roadmap-critical. + +## Architectural commitments: preserved by v1 + +The following decisions are taken now because reversing them later is +expensive. Each lands in v1 as a schema or contract decision; full +exploitation is later work. + +### A1. Content as the primary address + +Internal canonical key for stored bytes is `<algo>:<hex>`, not a logical +path.
Files within a package keep a `relative_path` as logical metadata, +but the storage backend sees and addresses content hashes. + +- Enables: global dedup, Merkle integrity proofs, partial mirrors, + federation, OCI compatibility. +- v1 cost: one schema column (`content_address`) and a deterministic key + derivation; no behaviour change. + +### A2. BLAKE3 as native digest, SHA-256 retained for interop + +`digest_algorithm` is a column on `artifact_files`. v1 default may remain +`sha256` to ship the pilot quickly; the column exists so `blake3` can ship +without migration. + +- Enables: faster hashing, free Merkle root over a package, alignment with + modern signing tooling. +- v1 cost: column + adapter table mapping algo → hashing impl. + +### A3. Append-only event log as source of truth + +An `events` table with a monotonic sequence number is the authoritative +record of registry mutations. The current metadata tables are a +materialised view rebuildable from the log. + +- Enables: CDC feeds, audit, replication, point-in-time recovery, signed + event streams. +- v1 cost: one extra table written on the same transaction as today's + mutations. + +### A4. Signed manifests, canonicalisation pinned + +Manifest serialisation uses a canonical form (recommendation: canonical +CBOR; JCS as alternative) so byte-identical signing is possible across +languages and time. v1 may not actually sign — the pin guarantees that +when signing lands, every prior manifest is re-signable byte-for-byte. + +- Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style + manifest digests. +- v1 cost: pick one canonicalisation library and use it for manifest + writes. Zero runtime cost. + +### A5. Control plane / data plane separation at the contract + +Even if v1 implements both in one Python process, the boundary between +"registry / API / retention" (control plane) and "hash / chunk / store / +serve" (data plane) is a named contract. 
When the data plane is later +extracted into a Rust binary, the API does not change. + +- Enables: native-speed ingestion, language flexibility on the hot path, + independent scaling. +- v1 cost: discipline (separate Python module with no API leakage), not + code. + +### A6. Resumable upload wire shape + +API exposes upload sessions: `POST /uploads`, `PATCH /uploads/{id}` with +range, `POST /uploads/{id}/complete`. v1 implementation may still be +single-shot multipart under the hood, but the resource shape exists so +chunked / resumable upload is additive. + +- Enables: streaming, retry-safe ingestion, very-large-package support. +- v1 cost: route definitions only; underlying logic can remain simple. + +### A7. Tiering as a property of storage locations + +`storage_location` carries `retrieval_tier` (`hot|warm|cold|archive`) and +`restore_status` columns, nullable, default `hot`. The API can already +return "not immediately available" without changing artifact identity. + +- Enables: future cold storage, Glacier-style restore flows. +- v1 cost: two nullable columns. + +### A8. Schema-typed metadata with open escape hatch + +Producers register a metadata schema (JSON Schema) per variant +(e.g. `guide-board.run.v1`). Stored as open JSON, validated against the +registered schema at ingest time. Queries can use typed views. + +- Enables: tooling, search, GraphQL views, typed clients without losing + flexibility. +- v1 cost: a `metadata_schemas` table; v1 validation can be a no-op. + +### A9. OCI compatibility kept reachable + +We do not promise OCI compatibility in v1, but we do not adopt any +data model that prevents it. Concretely: keep content addresses as +`<algo>:<hex>`, keep manifest structure compatible with an OCI image +manifest (config + layers + annotations), and avoid invariants that the +OCI spec forbids. + +- Enables: future `oras push` / `cosign sign` / Helm ecosystem entry. +- v1 cost: one design review per schema change against the OCI spec.
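Commitments A1 and A2 above (content as the primary address, with `digest_algorithm` selecting the hash) can be sketched as below. This is an illustration under assumptions rather than the project's implementation: the `algorithm:hex` address form follows the OCI digest convention, the names `HASHERS`, `content_address`, and `storage_key` are hypothetical, and only `sha256` is wired in because BLAKE3 is not part of the Python standard library.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical adapter table: digest_algorithm -> hasher factory.
# A "blake3" entry would be registered here once a BLAKE3 binding is pinned.
HASHERS: Dict[str, Callable[[], "hashlib._Hash"]] = {
    "sha256": hashlib.sha256,
}


def content_address(data: bytes, digest_algorithm: str = "sha256") -> str:
    """Derive a deterministic content address of the form algorithm:hex."""
    if digest_algorithm not in HASHERS:
        raise ValueError(f"unsupported digest_algorithm: {digest_algorithm}")
    hasher = HASHERS[digest_algorithm]()
    hasher.update(data)
    return f"{digest_algorithm}:{hasher.hexdigest()}"


def storage_key(address: str) -> str:
    """Map a content address to a backend key (the layout is an assumption)."""
    algorithm, _, hex_digest = address.partition(":")
    return f"{algorithm}/{hex_digest[:2]}/{hex_digest}"


# Two logical files with identical bytes share one stored object.
files = {"logs/run-1.txt": b"same bytes", "logs/run-2.txt": b"same bytes"}
addresses = {path: content_address(body) for path, body in files.items()}
```

Because the key depends only on the bytes, `relative_path` stays pure metadata and identical content uploaded under different paths resolves to a single backend object.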
+ +## Near-horizon technical roadmap (post-baseline) + +Roughly ordered. Not commitments; planning hooks. + +1. **Rust data-plane binary.** Receives chunked uploads, runs BLAKE3 + CDC + + optional Zstd + optional AES-GCM, writes to storage adapter. Speaks a + minimal gRPC or framed-bincode protocol to the Python control plane over + a Unix socket. +2. **Content-defined chunking (FastCDC).** Stored chunks become the dedup + unit. Package manifest references chunk digests; package digest is the + Merkle root. +3. **Cosign-compatible signing pipeline.** Every finalised manifest can be + signed; signatures stored alongside the manifest. +4. **Event stream out.** NATS or Kafka topic of registry events for + downstream consumers. +5. **OCI artifact endpoint.** A `/v2/` namespace that speaks the OCI + distribution spec on top of the same storage. +6. **WASM plugin host.** Producers and operators can ship signed `.wasm` + modules for content extraction, redaction, scorecard generation, + custom hashing, indexing. This is the "ffmpeg moment" — open extension + surface that does not require forking the core. +7. **Federation.** Signed manifest exchange between artifact-store + instances. Gossip or explicit peering. +8. **Cold tier adapters.** S3 Glacier, Tape, IA classes. + +## How this document is used + +- Every schema change in WP-0001 (or successors) is checked against + commitments A1–A9. A change that violates one is either rejected or + documented as a deliberate revision of this document. +- Every "we could do this faster in native code" idea is filed against + `docs/ASSEMBLY-EXPERIMENT.md`, not bolted onto a workplan. +- Every new producer integration is checked against the commercial + horizon: does it generalise, or does it bake in producer-specific + assumptions? + +This document is allowed to be wrong. It is not allowed to be silent. +Update it when the thesis changes; do not let v1 quietly close doors that +the v3 needs open. 
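Commitment A3 (an append-only `events` table written in the same transaction as each mutation, with the metadata tables rebuildable from it) might look like the following sketch. It assumes SQLite through the standard `sqlite3` module purely for illustration; the column set, the `file_registered` event type, and the function names are hypothetical.

```python
import json
import sqlite3


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    """Create the event log and one materialised view table."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS events (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sequence
            event_type TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS artifact_files (
            content_address TEXT PRIMARY KEY,
            relative_path TEXT NOT NULL
        );
        """
    )
    return conn


def register_file(conn: sqlite3.Connection, address: str, path: str) -> int:
    """Write the event and the view row in one transaction; return seq."""
    payload = json.dumps({"content_address": address, "relative_path": path})
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO events (event_type, payload) VALUES (?, ?)",
            ("file_registered", payload),
        )
        conn.execute(
            "INSERT OR REPLACE INTO artifact_files VALUES (?, ?)",
            (address, path),
        )
    return cur.lastrowid


def rebuild_view(conn: sqlite3.Connection) -> None:
    """Discard the view and replay the log in sequence order."""
    events = conn.execute(
        "SELECT event_type, payload FROM events ORDER BY seq"
    ).fetchall()
    with conn:
        conn.execute("DELETE FROM artifact_files")
        for event_type, payload in events:
            if event_type == "file_registered":
                body = json.loads(payload)
                conn.execute(
                    "INSERT OR REPLACE INTO artifact_files VALUES (?, ?)",
                    (body["content_address"], body["relative_path"]),
                )
```

Replaying the log after wiping `artifact_files` should reproduce the same rows, which is the property that makes audit, replication, and point-in-time recovery cheap later.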
diff --git a/docs/REVIEW-2026-05-15-intent-and-blueprint.md b/docs/REVIEW-2026-05-15-intent-and-blueprint.md new file mode 100644 index 0000000..b39b04e --- /dev/null +++ b/docs/REVIEW-2026-05-15-intent-and-blueprint.md @@ -0,0 +1,175 @@ +# Review — INTENT and Architecture Blueprint + +Date: 2026-05-15 +Reviewer: claude (opus-4-7) +Inputs: `INTENT.md`, `docs/ARCHITECTURE-BLUEPRINT.md`, +`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md`, `SCOPE.md`, `AGENTS.md` + +This review reframes the current scoped-internal-service design against a +longer-horizon ambition: make `artifact-store` the leading open source +substrate for generic artifact storage in the same sense that VLC and ffmpeg +lead their domain. See `docs/PLATFORM-AMBITION.md` for the ambition framing +this review is in service of. + +## SWOT + +### Strengths + +- Clean separation between artifact *identity / lifecycle* and *bytes*. + Registry owns metadata; storage adapter owns persistence. This is the single + most consequential architectural decision and the docs get it right. +- Retention is a first-class concept from day one, not bolted on later. +- Audit log designed in from the start, with explicit room for signed events. +- Storage adapter contract is minimal and well-shaped + (`put / get / head / delete / health`). +- Pilot-first discipline (`guide-board` / OpenCMIS TCK) anchors the work in a + real producer rather than a hypothetical one. +- Manifest portability is an explicit goal — a package should be understandable + without calling its producer. +- Boundary statements are explicit (will not replace StateHub, will not encode + producer semantics). + +### Weaknesses + +- Storage is keyed by logical path, not by content hash. Blocks global + deduplication, Merkle integrity proofs, partial replication, federation. +- No streaming, chunked, or resumable upload story. Multipart REST will cap + throughput at the slowest Python/WSGI hop for multi-GB packages. +- No content-defined chunking (CDC). 
Evidence packages with logs are highly + dedup-able; current design captures none of that. +- SHA-256 is the right *compatibility* digest but the wrong *throughput* + digest at platform scale. +- Single-writer SQLite is a real concurrency ceiling; PostgreSQL helps but no + partitioning / sharding story exists. +- No event / change-data-capture stream for downstream consumers: StateHub, search, and UIs would + have to poll. +- No signing / attestation story (Sigstore, in-toto, SLSA). Evidence storage + without signed attestations leaves half the value on the table. +- Metadata is open-ended JSON without a schema-registration path. Hard to + build typed tooling on top. +- No multi-tenancy, quota, or rate-limiting primitives. Painful to retrofit. +- No observability targets (latency / throughput SLOs, metrics, traces). + Platform-grade claims will eventually require numbers. +- No OCI / `oras` artifact compatibility, which leaves the largest existing + artifact ecosystem off the table. + +### Opportunities + +- **OCI Artifact + ORAS compatibility.** Inherit Helm, ML model, SBOM, cosign + tooling for free. Probably the single highest-leverage external move. +- **Sigstore + in-toto + SLSA.** Evidence packages should be signed by + default; this is exactly the gap most generic registries leave unfilled. +- **Content-addressed storage (CAS) + Merkle DAG** (Git / IPFS / restic pattern): + enables global dedup, integrity proofs, federation, partial mirroring. +- **BLAKE3** as native digest with SHA-256 retained for interop: + several-fold faster hashing per core (by this review's own estimates, + roughly 1.5–2 GB/s for SHA-256 with SHA-NI versus 6–10+ GB/s for BLAKE3 + with AVX-512), and BLAKE3's construction *is* a Merkle + tree, so package-level integrity comes for free. +- **WASM plugin surface for transforms, extractors, indexers, redactors.** + The "ffmpeg moment" for this domain: a stable host API that ecosystem + contributors can extend without forking the core. +- **Federation / mirroring** between artifact-store instances via signed + manifests. Nothing comparable exists in the evidence space today.
+- **FUSE / NFS / S3-gateway frontends.** Legacy producers ingest without code + changes. +- **Embeddable mode.** A single static binary like `restic`, plus a server + mode. Embedding is what makes ffmpeg ubiquitous. + +### Threats + +- Crowded adjacency: MinIO, Pulp, Harbor / Zot, Artifactory / Nexus, restic, + IPFS, Sigstore, plain S3. None are exactly this, but each chips at the + value proposition. +- Scope creep vs the carefully-scoped INTENT. The platform ambition pulls + toward "do everything"; the INTENT pulls toward "ship the pilot." Resolve + this tension explicitly or you get neither. +- Python performance ceiling on the data plane (ingestion of multi-GB + packages, hashing, chunking). +- Governance / maintenance debt. VLC and ffmpeg have decades of contributor + depth; underestimating that is a project-killer. + +## Architecture optimizations worth taking now + +Each of these is cheap to lock in before code lands, and expensive (or +breaking) to add later. + +1. **Split control plane from data plane.** Registry / API / retention stays + in Python with PostgreSQL. Ingestion + hashing + storage I/O becomes a + separate process (Rust sidecar, eventually with hot kernels in C / asm) + that can scale and be rewritten independently. Pin the contract now (Unix + socket, gRPC or framed bincode). See `docs/PLATFORM-AMBITION.md`. +2. **Make content the primary address.** Internal object key + `blake3:<hex>` (or `sha256:<hex>` for compat). `relative_path` + becomes logical metadata in the manifest. Unlocks dedup, integrity, + federation, OCI compatibility. +3. **Append-only event log as the source of truth.** Metadata DB is a materialized + view rebuildable from the log. Same pattern as Kafka / EventStore / + Datomic. Cheap audit, replication, point-in-time recovery. +4. **OCI artifact spec as a wire format**, even if the native API is richer. + Buys instant interop with `oras`, `cosign`, `crane`, Helm. +5.
**Signed manifests from day one.** Pin a signing format (cosign / Sigstore) + and a canonicalization (JCS or canonical CBOR). Post-hoc signing means + every legacy manifest is unsigned forever. +6. **Resumable, chunked uploads on the wire.** Upload session resource + (`POST /uploads` → `PATCH /uploads/{id}` ranges → `POST /uploads/{id}/complete`). + `tus.io` is a reasonable reference. v1 implementation can still be + single-shot multipart. +7. **Event stream out.** A monotonic-sequence `events` table; consumers + tail via long-poll, NATS, or Kafka. Trivial to add now, expensive later. +8. **Schema-typed metadata with escape hatch.** Producers register a JSON + Schema for their metadata variant (`guide-board.run.v1`). Stored as open + JSON, validated at ingest, queryable by typed views. +9. **Tiering as a first-class column of `storage_location`.** Promote + `retrieval_tier` and `restore_status` into the schema now (nullable, + default `hot`). +10. **Ship a great CLI before any UI.** ffmpeg ships a binary, not a GUI. + +## Performance hotspots — where native code actually matters + +Ranked by realistic impact for this workload. Adopting libraries that already +contain hand-tuned assembly is the cheap path; writing fresh assembly is an +explicit research line — see `docs/ASSEMBLY-EXPERIMENT.md`. + +1. **Hashing (dominant ingest cost).** SHA-256 with SHA-NI: ~1.5–2 GB/s/core. + BLAKE3 with AVX-512: 6–10+ GB/s/core, parallelizable, free Merkle tree. + Adopt BLAKE3 as native; retain SHA-256 for SLSA / OCI interop. +2. **Content-defined chunking (FastCDC / Gear).** Rolling hash over every + byte; pure-Python is unusable, optimized C / Rust hits 5–10 GB/s. + Mandatory if dedup is on the roadmap. +3. **Compression.** Zstd with bundled SIMD reaches multi-GB/s. Evidence logs + typically compress 5–20×. Apply at chunk level so dedup still works. +4. 
**I/O path.** Linux: `io_uring` for ingest writes; `sendfile(2)` / + `splice(2)` for download zero-copy; `O_DIRECT` for very large objects. +5. **Encryption.** AES-GCM with AES-NI: ~5 GB/s/core. ChaCha20-Poly1305 + vector implementations for non-AES-NI hardware. Use Ring, BoringSSL, or + AWS-LC. Never write crypto by hand. +6. **Metadata hot paths.** Bloom or Cuckoo filter in front of the + "have I seen this hash?" lookup. ~50 lines of Rust, ~100× win. +7. **Manifest canonicalization.** Signed manifests canonicalize on every + ingest and every verify. Pick a fast canonical CBOR / JCS impl. + +Not worth native code: HTTP layer, retention engine, audit log, DB access, +orchestration, workflow logic. Keep Python. + +## Concrete suggestions before WP-0001 lands + +- Add `digest_algorithm` to `artifact_files` (default `sha256`, allow + `blake3`). +- Add `content_address` (e.g., `blake3:…`) as canonical storage key, with + `relative_path` retained as logical metadata. +- Add `retrieval_tier` and `restore_status` to `storage_locations` now, + nullable. +- Define the upload session resource shape even if v1 implements only + single-shot multipart. +- Pin a manifest canonicalization (recommend JCS or canonical CBOR) and a + signing format target (cosign / Sigstore). Decide, do not implement. +- Add an `events` table with a monotonic sequence number so a CDC feed is + trivial later. +- Decide explicitly whether OCI artifact compatibility is a v2 goal or out of + scope. Either is fine; ambiguity will distort schema decisions. + +## What this review does not change + +INTENT and SCOPE remain correctly scoped for v1. The pilot path through +WP-0001 should ship as planned. The schema annotations above are additive, +not redirective. The platform ambition lives in `docs/PLATFORM-AMBITION.md` +so it can guide later decisions without expanding the current workplan.
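The upload session resource suggested above (`POST /uploads`, `PATCH /uploads/{id}`, `POST /uploads/{id}/complete`) could behave roughly like this in-memory sketch. It is a hedged illustration only: the class and method names are hypothetical, the offset check is one possible way to make retries safe, and returning a `sha256:` address on completion is an assumption about how the session would hand off to the registry.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UploadSession:
    upload_id: str
    expected_size: int
    received: bytearray = field(default_factory=bytearray)
    completed: bool = False


class UploadSessions:
    """In-memory stand-in for the three upload routes."""

    def __init__(self) -> None:
        self._sessions: Dict[str, UploadSession] = {}

    def create(self, expected_size: int) -> UploadSession:
        """POST /uploads: open a session for a declared byte count."""
        session = UploadSession(uuid.uuid4().hex, expected_size)
        self._sessions[session.upload_id] = session
        return session

    def patch(self, upload_id: str, offset: int, chunk: bytes) -> int:
        """PATCH /uploads/{id}: append a range; return the next offset."""
        session = self._sessions[upload_id]
        if session.completed:
            raise ValueError("session already completed")
        if offset != len(session.received):
            # A retrying client must resume from the server's offset.
            raise ValueError(f"expected offset {len(session.received)}")
        if offset + len(chunk) > session.expected_size:
            raise ValueError("chunk exceeds declared size")
        session.received.extend(chunk)
        return len(session.received)

    def complete(self, upload_id: str) -> str:
        """POST /uploads/{id}/complete: finalise and return the address."""
        session = self._sessions[upload_id]
        if len(session.received) != session.expected_size:
            raise ValueError("upload incomplete")
        session.completed = True
        digest = hashlib.sha256(bytes(session.received)).hexdigest()
        return f"sha256:{digest}"
```

A single-shot multipart upload is then just the degenerate case of one `patch` call covering the whole body, which is why the resource shape can exist in v1 without chunked logic behind it.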