# Platform Ambition

Status: draft
Created: 2026-05-15

This document records the longer-horizon thesis behind `artifact-store` and captures which decisions are taken now to keep that horizon reachable without expanding the v1 workplan.

It sits beside, not above, `INTENT.md` and `SCOPE.md`. INTENT defines what we build first; this document defines what the v1 must not foreclose.

## Thesis

Generated artifacts (evidence packages, build outputs, ML models, logs, snapshots, reports, scorecards, exports) are first-class durable objects in modern software work. They sit somewhere between source code (well served by Git) and binary releases (well served by OCI registries). The space between is currently filled by a fragmented mix of bespoke directories, ad-hoc S3 buckets, vendor registries (Artifactory, Nexus), and document-management systems that were not built for machine producers.

`artifact-store` aims to occupy that gap with one substrate: a generic, content-addressed, signed, deduplicated, retention-aware artifact registry and storage gateway that other tools embed or speak to.

The reference points are deliberate. **VLC** and **ffmpeg** lead their domain not by being the prettiest applications but by being correct, fast, embeddable, portable, and indispensable infrastructure for everyone else. The same strategy applies here: build a kernel that is so good at the bytes-and-identity layer that every artifact-producing tool would rather speak its protocol than reinvent it.

## Commercial horizon

The longer-horizon commercial target is **a sovereign artifact-storage product line for European cloud providers**; STACKIT (Schwarz Group) is the concrete example. The thesis is:

- Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do not sell *artifact identity, retention, attestation, federation, evidence preservation*. Customers either build it themselves or buy proprietary registries on top.
- A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned artifact substrate on top of its own object storage has a defensible differentiation against AWS: not in raw price-per-GB, which is a losing race, but in regulated workloads (evidence retention, audit, signed attestations, legal-hold, sovereign jurisdiction guarantees).
- Open source is the wedge. A widely-adopted upstream that the provider ships, supports, and extends is far stronger than a proprietary stack.

This is a multi-year horizon, not a v1 deliverable. It is recorded here so schema and protocol decisions made now keep that path open.

## Reference points

| System            | What we learn from it                                                   |
|-------------------|-------------------------------------------------------------------------|
| ffmpeg            | Embeddable core, hand-tuned hot paths, runtime CPU dispatch             |
| VLC               | Plugin architecture, portability, ubiquity through being a library too  |
| Git               | Content-addressed storage, Merkle DAG, pack files, integrity            |
| restic            | Single static binary, CDC + dedup, encryption by default                |
| IPFS              | Content-addressing, federation, partial replication                     |
| OCI Registry      | Standardised manifest + blob model with broad ecosystem                 |
| Sigstore / cosign | Signed attestations as a first-class artifact property                  |
| MinIO             | Operator ergonomics, S3 wire compat as adoption vector                  |
| SeaweedFS / Ceph  | Separation of metadata plane from data plane                            |
| RocksDB / LMDB    | Embeddable storage engines with predictable performance                 |
| Kafka             | Log-as-source-of-truth, materialised views                              |
| BLAKE3            | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned          |

We are not trying to reproduce any of these. We are trying to occupy a specific gap between them with the best ideas from each.

## Non-goals (still)

The platform ambition does not change the v1 boundary in `INTENT.md` or `SCOPE.md`.
In particular it does not:

- replace StateHub as the work / decision system of record;
- encode producer-specific assessment semantics in the registry core;
- require any of the optimisations listed in the "near-horizon" section below to land in v1;
- commit the project to writing assembly. The assembly-experiment line (`docs/ASSEMBLY-EXPERIMENT.md`) is opt-in research, not roadmap-critical.

## Architectural commitments preserved by v1

The following decisions are taken now because reversing them later is expensive. Each lands in v1 as a schema or contract decision; full exploitation is later work.

### A1. Content as the primary address

The internal canonical key for stored bytes is `<algorithm>:<digest>`, not a logical path. Files within a package keep a `relative_path` as logical metadata, but the storage backend sees and addresses content hashes.

- Enables: global dedup, Merkle integrity proofs, partial mirrors, federation, OCI compatibility.
- v1 cost: one schema column (`content_address`) and a deterministic key derivation; no behaviour change.

### A2. BLAKE3 as native digest, SHA-256 retained for interop

`digest_algorithm` is a column on `artifact_files`. The v1 default may remain `sha256` to ship the pilot quickly; the column exists so `blake3` can ship without migration.

- Enables: faster hashing, free Merkle root over a package, alignment with modern signing tooling.
- v1 cost: column + adapter table mapping algo → hashing impl.

### A3. Append-only event log as source of truth

An `events` table with a monotonic sequence number is the authoritative record of registry mutations. The current metadata tables are a materialised view rebuildable from the log.

- Enables: CDC feeds, audit, replication, point-in-time recovery, signed event streams.
- v1 cost: one extra table written in the same transaction as today's mutations.

### A4. Signed manifests, canonicalisation pinned

Manifest serialisation uses a canonical form (recommendation: canonical CBOR; JCS as alternative) so byte-identical signing is possible across languages and time. v1 may not actually sign; the pin guarantees that when signing lands, every prior manifest is re-signable byte-for-byte.

- Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style manifest digests.
- v1 cost: pick one canonicalisation library and use it for manifest writes. Zero runtime cost.

### A5. Control plane / data plane separation at the contract

Even if v1 implements both in one Python process, the boundary between "registry / API / retention" (control plane) and "hash / chunk / store / serve" (data plane) is a named contract. When the data plane is later extracted into a Rust binary, the API does not change.

- Enables: native-speed ingestion, language flexibility on the hot path, independent scaling.
- v1 cost: discipline (separate Python module with no API leakage), not code.

### A6. Resumable upload wire shape

The API exposes upload sessions: `POST /uploads`, `PATCH /uploads/{id}` with range, `POST /uploads/{id}/complete`. The v1 implementation may still be single-shot multipart under the hood, but the resource shape exists so chunked / resumable upload is additive.

- Enables: streaming, retry-safe ingestion, very-large-package support.
- v1 cost: route definitions only; underlying logic can remain simple.

### A7. Tiering as a property of storage locations

`storage_location` carries two nullable columns: `retrieval_tier` (`hot|warm|cold|archive`, default `hot`) and `restore_status`. The API can already return "not immediately available" without changing artifact identity.

- Enables: future cold storage, Glacier-style restore flows.
- v1 cost: two nullable columns.

### A8. Schema-typed metadata with open escape hatch

Producers register a metadata schema (JSON Schema) per variant (e.g. `guide-board.run.v1`).
Metadata is stored as open JSON and validated against the registered schema at ingest time. Queries can use typed views.

- Enables: tooling, search, GraphQL views, typed clients without losing flexibility.
- v1 cost: a `metadata_schemas` table; v1 validation can be a no-op.

### A9. OCI compatibility kept reachable

We do not promise OCI compatibility in v1, but we do not adopt any data model that prevents it. Concretely: keep content addresses as `<algorithm>:<digest>`, keep manifest structure compatible with an OCI image manifest (config + layers + annotations), and avoid invariants that the OCI spec forbids.

- Enables: future `oras push` / `cosign sign` / Helm ecosystem entry.
- v1 cost: one design review per schema change against the OCI spec.

## Near-horizon technical roadmap (post-baseline)

Roughly ordered. Not commitments; planning hooks.

1. **Rust data-plane binary.** Receives chunked uploads, runs BLAKE3 + CDC + optional Zstd + optional AES-GCM, writes to storage adapter. Speaks a minimal gRPC or framed-bincode protocol to the Python control plane over a Unix socket.
2. **Content-defined chunking (FastCDC).** Stored chunks become the dedup unit. Package manifest references chunk digests; package digest is the Merkle root.
3. **Cosign-compatible signing pipeline.** Every finalised manifest can be signed; signatures stored alongside the manifest.
4. **Event stream out.** NATS or Kafka topic of registry events for downstream consumers.
5. **OCI artifact endpoint.** A `/v2/` namespace that speaks the OCI distribution spec on top of the same storage.
6. **WASM plugin host.** Producers and operators can ship signed `.wasm` modules for content extraction, redaction, scorecard generation, custom hashing, indexing. This is the "ffmpeg moment": an open extension surface that does not require forking the core.
7. **Federation.** Signed manifest exchange between artifact-store instances. Gossip or explicit peering.
8. **Cold tier adapters.** S3 Glacier, tape, IA classes.
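
The chunk-level dedup and Merkle-root package digest described in roadmap item 2, together with the content-addressed keys of A1, can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it uses SHA-256 from the Python standard library instead of BLAKE3, fixed-size splitting instead of FastCDC, and an in-memory `dict` instead of a storage adapter. The names `content_address`, `split_fixed`, `merkle_root`, and `ingest`, the `sha256:<hex>` key spelling, and the tree pairing rule are all hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow


def content_address(data: bytes, algorithm: str = "sha256") -> str:
    """Derive a storage key of the form '<algorithm>:<hex digest>' (assumed format)."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def split_fixed(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Fixed-size chunking, used here only as a stand-in for FastCDC."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def merkle_root(addresses: list[str]) -> str:
    """Hash chunk addresses pairwise until one root address remains.

    Pairing rule (an assumption of this sketch): an unpaired last node
    is carried up to the next level unchanged.
    """
    if not addresses:
        return content_address(b"")
    level = list(addresses)
    while len(level) > 1:
        paired = [
            content_address((left + right).encode())
            for left, right in zip(level[::2], level[1::2])
        ]
        if len(level) % 2:
            paired.append(level[-1])
        level = paired
    return level[0]


def ingest(store: dict[str, bytes], data: bytes) -> tuple[list[str], str]:
    """Store each chunk under its content address; return (chunk addresses, root)."""
    addresses = []
    for chunk in split_fixed(data):
        address = content_address(chunk)
        store.setdefault(address, chunk)  # identical chunks are stored once
        addresses.append(address)
    return addresses, merkle_root(addresses)


if __name__ == "__main__":
    store: dict[str, bytes] = {}
    chunk_addresses, package_digest = ingest(store, b"aaaabbbbaaaa")
    print(len(chunk_addresses), "chunks referenced,", len(store), "chunks stored")
    print("package digest:", package_digest)
```

Because the store is keyed by content rather than by `relative_path`, the repeated `aaaa` chunk occupies one slot while the manifest still references it twice. A real data plane would also need to separate leaf and interior hashes, which this sketch omits.
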
## How this document is used

- Every schema change in WP-0001 (or successors) is checked against commitments A1–A9. A change that violates one is either rejected or documented as a deliberate revision of this document.
- Every "we could do this faster in native code" idea is filed against `docs/ASSEMBLY-EXPERIMENT.md`, not bolted onto a workplan.
- Every new producer integration is checked against the commercial horizon: does it generalise, or does it bake in producer-specific assumptions?

This document is allowed to be wrong. It is not allowed to be silent. Update it when the thesis changes; do not let v1 quietly close doors that the v3 needs open.