artifact-store/docs/PLATFORM-AMBITION.md
tegwick 403d903585 docs: add platform ambition, blueprint review, and assembly experiment
Captures the longer-horizon thesis (sovereign-cloud artifact substrate)
alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine
schema/contract commitments the v1 must preserve to keep that horizon
reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on
ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 20:56:01 +02:00


Platform Ambition

Status: draft Created: 2026-05-15

This document records the longer-horizon thesis behind artifact-store and captures which decisions are taken now to keep that horizon reachable without expanding the v1 workplan. It sits beside, not above, INTENT.md and SCOPE.md. INTENT defines what we build first; this document defines what the v1 must not foreclose.

Thesis

Generated artifacts — evidence packages, build outputs, ML models, logs, snapshots, reports, scorecards, exports — are first-class durable objects in modern software work. They sit somewhere between source code (well-served by Git) and binary releases (well-served by OCI registries). The space between is currently filled by a fragmented mix of bespoke directories, ad-hoc S3 buckets, vendor registries (Artifactory, Nexus), and document-management systems that were not built for machine producers.

artifact-store aims to occupy that gap with one substrate: a generic, content-addressed, signed, deduplicated, retention-aware artifact registry and storage gateway that other tools embed or speak to.

The reference points are deliberate. VLC and ffmpeg lead their domain not by being the prettiest applications but by being correct, fast, embeddable, portable, and indispensable infrastructure for everyone else. The same strategy applies here: build a kernel so good at the bytes-and-identity layer that every artifact-producing tool would rather speak its protocol than reinvent it.

Commercial horizon

The longer-horizon commercial target is a sovereign artifact-storage product line for European cloud providers; STACKIT (Schwarz Group) is the concrete example. The thesis is:

  • Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do not sell artifact identity, retention, attestation, federation, evidence preservation. Customers either build it themselves or buy proprietary registries on top.
  • A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned artifact substrate on top of its own object storage has a defensible differentiation against AWS — not in raw price-per-GB, which is a losing race, but in regulated workloads (evidence retention, audit, signed attestations, legal-hold, sovereign jurisdiction guarantees).
  • Open source is the wedge. A widely-adopted upstream that the provider ships, supports, and extends is far stronger than a proprietary stack.

This is a multi-year horizon, not a v1 deliverable. It is recorded here so schema and protocol decisions made now keep that path open.

Reference points

| System | What we learn from it |
| --- | --- |
| ffmpeg | Embeddable core, hand-tuned hot paths, runtime CPU dispatch |
| VLC | Plugin architecture, portability, ubiquity through being a library too |
| Git | Content-addressed storage, Merkle DAG, pack files, integrity |
| restic | Single static binary, CDC + dedup, encryption by default |
| IPFS | Content-addressing, federation, partial replication |
| OCI Registry | Standardised manifest + blob model with broad ecosystem |
| Sigstore / cosign | Signed attestations as a first-class artifact property |
| MinIO | Operator ergonomics, S3 wire compat as adoption vector |
| SeaweedFS / Ceph | Separation of metadata plane from data plane |
| RocksDB / LMDB | Embeddable storage engines with predictable performance |
| Kafka | Log-as-source-of-truth, materialised views |
| BLAKE3 | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned |

We are not trying to reproduce any of these. We are trying to occupy a specific gap between them with the best ideas from each.

Non-goals (still)

The platform ambition does not change the v1 boundary in INTENT.md or SCOPE.md. In particular it does not:

  • replace StateHub as the work / decision system of record;
  • encode producer-specific assessment semantics in the registry core;
  • require any of the optimisations listed in the "near-horizon" section below to land in v1;
  • commit the project to writing assembly. The assembly-experiment line (docs/ASSEMBLY-EXPERIMENT.md) is opt-in research, not roadmap-critical.

Architectural commitments — preserved by v1

The following decisions are taken now because reversing them later is expensive. Each lands in v1 as a schema or contract decision; full exploitation is later work.

A1. Content as the primary address

Internal canonical key for stored bytes is <algo>:<digest>, not a logical path. Files within a package keep a relative_path as logical metadata, but the storage backend sees and addresses content hashes.

  • Enables: global dedup, Merkle integrity proofs, partial mirrors, federation, OCI compatibility.
  • v1 cost: one schema column (content_address) and a deterministic key derivation; no behaviour change.
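A minimal sketch of the key derivation described in A1, assuming SHA-256 and an in-memory dictionary standing in for the storage backend. The function name `content_address` and the `blobs` / `index` dictionaries are illustrative, not part of the v1 schema.

```python
import hashlib


def content_address(data: bytes, algo: str = "sha256") -> str:
    """Return the canonical storage key '<algo>:<hexdigest>' for a byte string."""
    digest = hashlib.new(algo, data).hexdigest()
    return f"{algo}:{digest}"


# Two files with different logical paths but identical bytes.
files = {
    "reports/summary.txt": b"hello artifact-store\n",
    "exports/copy-of-summary.txt": b"hello artifact-store\n",
}

blobs = {}  # content_address -> bytes (what the storage backend sees)
index = {}  # relative_path -> content_address (logical metadata only)

for relative_path, data in files.items():
    key = content_address(data)
    blobs.setdefault(key, data)  # identical content is stored once
    index[relative_path] = key
```

Both paths resolve to the same key, so the backend holds a single blob while the package keeps two logical entries.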

A2. BLAKE3 as native digest, SHA-256 retained for interop

digest_algorithm is a column on artifact_files. v1 default may remain sha256 to ship the pilot quickly; the column exists so blake3 can ship without migration.

  • Enables: faster hashing, free Merkle root over a package, alignment with modern signing tooling.
  • v1 cost: column + adapter table mapping algo → hashing impl.
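The adapter table in A2 could look roughly like the sketch below. It assumes the optional third-party `blake3` package for the BLAKE3 entry and falls back to SHA-256 only when that package is not installed. `HASHERS` and `digest_stream` are hypothetical names.

```python
import hashlib
from typing import Any, Callable, Dict, Iterable

# Maps a digest_algorithm column value to a constructor for a hasher object
# exposing update() and hexdigest().
HASHERS: Dict[str, Callable[[], Any]] = {"sha256": hashlib.sha256}

try:
    import blake3  # optional third-party package; absent in a minimal install

    HASHERS["blake3"] = blake3.blake3
except ImportError:
    pass


def digest_stream(chunks: Iterable[bytes], digest_algorithm: str = "sha256") -> str:
    """Hash an iterable of byte chunks with the algorithm named in the column."""
    try:
        hasher = HASHERS[digest_algorithm]()
    except KeyError:
        raise ValueError(f"unsupported digest_algorithm: {digest_algorithm!r}")
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()
```

Adding an algorithm later is then one new entry in the table; rows written with `sha256` stay valid because the column records which implementation produced each digest.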

A3. Append-only event log as source of truth

An events table with a monotonic sequence number is the authoritative record of registry mutations. The current metadata tables are a materialised view rebuildable from the log.

  • Enables: CDC feeds, audit, replication, point-in-time recovery, signed event streams.
  • v1 cost: one extra table written on the same transaction as today's mutations.
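As a sketch of A3, the following uses SQLite to write an event and the corresponding metadata change in one transaction, then rebuilds the metadata table from the log alone. The table layout, the event kinds, and the `packages` view are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sequence number
        kind    TEXT NOT NULL,
        payload TEXT NOT NULL                       -- JSON document
    );
    CREATE TABLE packages (                         -- materialised view
        package_id TEXT PRIMARY KEY,
        state      TEXT NOT NULL
    );
    """
)


def apply_event(conn, kind, payload):
    """Project one event onto the materialised view."""
    if kind == "package_created":
        conn.execute(
            "INSERT INTO packages (package_id, state) VALUES (?, 'open')",
            (payload["package_id"],),
        )
    elif kind == "package_finalised":
        conn.execute(
            "UPDATE packages SET state = 'finalised' WHERE package_id = ?",
            (payload["package_id"],),
        )


def record(conn, kind, payload):
    """Append the event and update the view in the same transaction."""
    with conn:
        conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)",
            (kind, json.dumps(payload)),
        )
        apply_event(conn, kind, payload)


def rebuild_view(conn):
    """Drop the view contents and replay the log in sequence order."""
    with conn:
        conn.execute("DELETE FROM packages")
        rows = conn.execute("SELECT kind, payload FROM events ORDER BY seq").fetchall()
        for kind, payload in rows:
            apply_event(conn, kind, json.loads(payload))


record(conn, "package_created", {"package_id": "pkg-1"})
record(conn, "package_finalised", {"package_id": "pkg-1"})
before = conn.execute("SELECT package_id, state FROM packages").fetchall()
rebuild_view(conn)
after = conn.execute("SELECT package_id, state FROM packages").fetchall()
```

If `before` and `after` ever diverge, some mutation bypassed the log, which is the failure this commitment is meant to make detectable.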

A4. Signed manifests, canonicalisation pinned

Manifest serialisation uses a canonical form (recommendation: canonical CBOR; JCS as alternative) so byte-identical signing is possible across languages and time. v1 may not actually sign — the pin guarantees that when signing lands, every prior manifest is re-signable byte-for-byte.

  • Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style manifest digests.
  • v1 cost: pick one canonicalisation library and use it for manifest writes. Zero runtime cost.
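To illustrate why A4 pins the serialisation, here is a sketch using only the standard library. It approximates JCS with sorted keys and compact separators; it is not a complete RFC 8785 implementation (number formatting and key ordering for non-BMP characters differ) and it is not canonical CBOR. A real deployment would use a dedicated canonicalisation library, as the commitment says. The manifest fields are made up.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialise a manifest deterministically (JCS-like approximation)."""
    return json.dumps(
        manifest,
        sort_keys=True,          # key order no longer depends on insertion order
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,
        allow_nan=False,         # NaN / Infinity have no canonical JSON form
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """Digest over the canonical bytes; this is what a signature would cover."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# The same logical manifest built in two different key orders.
m1 = {"package": "pkg-1", "files": [{"path": "a.txt", "size": 3}]}
m2 = {"files": [{"size": 3, "path": "a.txt"}], "package": "pkg-1"}
same = manifest_digest(m1) == manifest_digest(m2)
```

Because both orderings yield identical bytes, a manifest written today can be signed later without re-serialisation changing its digest.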

A5. Control plane / data plane separation at the contract

Even if v1 implements both in one Python process, the boundary between "registry / API / retention" (control plane) and "hash / chunk / store / serve" (data plane) is a named contract. When the data plane is later extracted into a Rust binary, the API does not change.

  • Enables: native-speed ingestion, language flexibility on the hot path, independent scaling.
  • v1 cost: discipline (separate Python module with no API leakage), not code.
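One way to make the A5 boundary a named contract in Python is a `typing.Protocol`. The sketch below is an assumption about shape, not the project's actual interface: `DataPlane`, `InProcessDataPlane`, and `Registry` are hypothetical names, and the in-process implementation stands in for a later native binary.

```python
import hashlib
from typing import Dict, Iterable, Protocol


class DataPlane(Protocol):
    """Contract for the hash / store / serve side. No registry concepts leak in."""

    def ingest(self, chunks: Iterable[bytes]) -> str: ...

    def read(self, content_address: str) -> bytes: ...


class InProcessDataPlane:
    """v1-style implementation living in the same Python process."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def ingest(self, chunks: Iterable[bytes]) -> str:
        data = b"".join(chunks)
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def read(self, content_address: str) -> bytes:
        return self._blobs[content_address]


class Registry:
    """Control plane: knows paths and packages, only sees content addresses."""

    def __init__(self, data_plane: DataPlane) -> None:
        self._data_plane = data_plane
        self._files: Dict[str, str] = {}

    def add_file(self, relative_path: str, chunks: Iterable[bytes]) -> str:
        key = self._data_plane.ingest(chunks)
        self._files[relative_path] = key
        return key

    def get_file(self, relative_path: str) -> bytes:
        return self._data_plane.read(self._files[relative_path])


registry = Registry(InProcessDataPlane())
address = registry.add_file("logs/run.txt", [b"line 1\n", b"line 2\n"])
```

Swapping `InProcessDataPlane` for a client that talks to an external process would leave `Registry` untouched, which is the property the commitment asks for.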

A6. Resumable upload wire shape

API exposes upload sessions: POST /uploads, PATCH /uploads/{id} with range, POST /uploads/{id}/complete. v1 implementation may still be single-shot multipart under the hood, but the resource shape exists so chunked / resumable upload is additive.

  • Enables: streaming, retry-safe ingestion, very-large-package support.
  • v1 cost: route definitions only; underlying logic can remain simple.
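The resource shape in A6 can be modelled without any web framework. The sketch below maps each route to a method on a hypothetical `UploadSessions` class; rejecting ranges that do not start at the current offset is an assumption about how ranges would be validated, and the in-memory buffer stands in for real storage.

```python
import hashlib
import uuid
from typing import Dict


class UploadSessions:
    """In-memory model of the upload-session resource."""

    def __init__(self) -> None:
        self._sessions: Dict[str, bytearray] = {}

    def create(self) -> str:
        """POST /uploads: open a session and return its id."""
        upload_id = uuid.uuid4().hex
        self._sessions[upload_id] = bytearray()
        return upload_id

    def patch(self, upload_id: str, start: int, data: bytes) -> int:
        """PATCH /uploads/{id}: append a byte range, return the new offset."""
        buf = self._sessions[upload_id]
        if start != len(buf):
            # Assumed rule: ranges must be contiguous and in order.
            raise ValueError(f"expected range to start at {len(buf)}, got {start}")
        buf.extend(data)
        return len(buf)

    def complete(self, upload_id: str) -> str:
        """POST /uploads/{id}/complete: close the session, return the content address."""
        buf = self._sessions.pop(upload_id)
        return "sha256:" + hashlib.sha256(bytes(buf)).hexdigest()


sessions = UploadSessions()
uid = sessions.create()
offset = sessions.patch(uid, 0, b"hello ")
offset = sessions.patch(uid, offset, b"world")
address = sessions.complete(uid)
```

A single-shot client is just one `patch` call covering the whole body, so the simple v1 path and a later chunked path share the same three routes.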

A7. Tiering as a property of storage locations

storage_location carries retrieval_tier (hot|warm|cold|archive) and restore_status columns, nullable, default hot. The API can already return "not immediately available" without changing artifact identity.

  • Enables: future cold storage, Glacier-style restore flows.
  • v1 cost: two nullable columns.
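A sketch of the two A7 columns in SQLite, with a lookup that reports availability without touching the content address. The `restore_status` value `'restored'`, the choice to treat `warm` as immediately available, and the response field names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE storage_location (
        content_address TEXT PRIMARY KEY,
        -- Nullable with default 'hot'; NULL passes the CHECK constraint.
        retrieval_tier  TEXT DEFAULT 'hot'
            CHECK (retrieval_tier IN ('hot', 'warm', 'cold', 'archive')),
        restore_status  TEXT
    )
    """
)
conn.execute("INSERT INTO storage_location (content_address) VALUES ('sha256:aaa')")
conn.execute(
    "INSERT INTO storage_location VALUES ('sha256:bbb', 'archive', NULL)"
)


def availability(conn, content_address):
    """Describe whether the bytes can be served right now."""
    row = conn.execute(
        "SELECT retrieval_tier, restore_status FROM storage_location "
        "WHERE content_address = ?",
        (content_address,),
    ).fetchone()
    if row is None:
        raise KeyError(content_address)
    tier, restore_status = row
    tier = tier or "hot"  # treat NULL as the default tier
    available = tier in ("hot", "warm") or restore_status == "restored"
    return {
        "content_address": content_address,  # identity never changes with tier
        "retrieval_tier": tier,
        "restore_status": restore_status,
        "immediately_available": available,
    }


hot = availability(conn, "sha256:aaa")
archived = availability(conn, "sha256:bbb")
```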

A8. Schema-typed metadata with open escape hatch

Producers register a metadata schema (JSON Schema) per variant (e.g. guide-board.run.v1). Stored as open JSON, validated against the registered schema at ingest time. Queries can use typed views.

  • Enables: tooling, search, GraphQL views, typed clients without losing flexibility.
  • v1 cost: a metadata_schemas table; v1 validation can be a no-op.
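The registry in A8 can start as a lookup plus a pluggable validator. In this sketch, `MetadataSchemas`, `noop_validator`, and `required_keys_validator` are hypothetical names; the second validator checks only the `required` keyword and is a stand-in for a full JSON Schema validator, not a replacement for one.

```python
from typing import Callable, Dict

Validator = Callable[[dict, dict], None]  # (schema, metadata) -> raises on failure


def noop_validator(schema: dict, metadata: dict) -> None:
    """v1 behaviour: accept everything."""
    return None


def required_keys_validator(schema: dict, metadata: dict) -> None:
    """Tiny subset of JSON Schema: enforce only the 'required' keyword."""
    missing = [k for k in schema.get("required", []) if k not in metadata]
    if missing:
        raise ValueError(f"missing required metadata keys: {missing}")


class MetadataSchemas:
    """Maps a variant name to its registered schema."""

    def __init__(self, validator: Validator = noop_validator) -> None:
        self._schemas: Dict[str, dict] = {}
        self._validator = validator

    def register(self, variant: str, schema: dict) -> None:
        self._schemas[variant] = schema

    def validate(self, variant: str, metadata: dict) -> dict:
        if variant not in self._schemas:
            raise KeyError(f"no schema registered for variant {variant!r}")
        self._validator(self._schemas[variant], metadata)
        return metadata  # stored as open JSON either way


schema = {"type": "object", "required": ["run_id"]}

lenient = MetadataSchemas()
lenient.register("guide-board.run.v1", schema)
accepted = lenient.validate("guide-board.run.v1", {"anything": "goes"})

strict = MetadataSchemas(validator=required_keys_validator)
strict.register("guide-board.run.v1", schema)
```

The table and the call site exist from day one; tightening validation later is a change of validator, not of schema or API.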

A9. OCI compatibility kept reachable

We do not promise OCI compatibility in v1, but we do not adopt any data model that prevents it. Concretely: keep content addresses as <algo>:<hex>, keep manifest structure compatible with an OCI image manifest (config + layers + annotations), and avoid invariants that the OCI spec forbids.

  • Enables: future oras push / cosign sign / Helm ecosystem entry.
  • v1 cost: one design review per schema change against the OCI spec.
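For orientation, a package manifest that stays within the OCI image manifest shape (config + layers + annotations) might be built as in the sketch below. The config media type `application/vnd.example.artifact-store.config.v1+json` and the annotation key `example.artifact-store.variant` are invented for this example. Storing the relative path under `org.opencontainers.image.title` follows a common convention but is an assumption here, not a project decision.

```python
import hashlib
import json
from typing import Dict


def descriptor(media_type: str, data: bytes, annotations: Dict[str, str] = None) -> dict:
    """Build an OCI-style content descriptor for a blob."""
    desc = {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }
    if annotations:
        desc["annotations"] = annotations
    return desc


def package_manifest(config: dict, files: Dict[str, bytes], annotations: Dict[str, str]) -> dict:
    """Assemble a manifest with the config + layers + annotations structure."""
    config_bytes = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        # Hypothetical media type for the package-level config blob.
        "config": descriptor(
            "application/vnd.example.artifact-store.config.v1+json", config_bytes
        ),
        "layers": [
            descriptor(
                "application/octet-stream",
                data,
                {"org.opencontainers.image.title": relative_path},
            )
            for relative_path, data in sorted(files.items())
        ],
        "annotations": annotations,
    }


manifest = package_manifest(
    config={"variant": "guide-board.run.v1"},
    files={"report.txt": b"ok\n", "log.txt": b"started\n"},
    annotations={"example.artifact-store.variant": "guide-board.run.v1"},
)
```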

Near-horizon technical roadmap (post-baseline)

Roughly ordered. Not commitments; planning hooks.

  1. Rust data-plane binary. Receives chunked uploads, runs BLAKE3 + CDC + optional Zstd + optional AES-GCM, writes to storage adapter. Speaks a minimal gRPC or framed-bincode protocol to the Python control plane over a Unix socket.
  2. Content-defined chunking (FastCDC). Stored chunks become the dedup unit. Package manifest references chunk digests; package digest is the Merkle root.
  3. Cosign-compatible signing pipeline. Every finalised manifest can be signed; signatures stored alongside the manifest.
  4. Event stream out. NATS or Kafka topic of registry events for downstream consumers.
  5. OCI artifact endpoint. A /v2/ namespace that speaks the OCI distribution spec on top of the same storage.
  6. WASM plugin host. Producers and operators can ship signed .wasm modules for content extraction, redaction, scorecard generation, custom hashing, indexing. This is the "ffmpeg moment" — open extension surface that does not require forking the core.
  7. Federation. Signed manifest exchange between artifact-store instances. Gossip or explicit peering.
  8. Cold tier adapters. S3 Glacier, tape, infrequent-access (IA) storage classes.
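Item 2 describes chunks as the dedup unit and the package digest as a Merkle root over chunk digests. The sketch below shows that relationship under two loud simplifications: fixed-size chunking stands in for FastCDC, and SHA-256 with a hand-rolled binary tree stands in for BLAKE3. The leaf and node prefixes and the promotion of an odd node are choices made for this example only.

```python
import hashlib
from typing import List


def chunk_fixed(data: bytes, size: int) -> List[bytes]:
    """Fixed-size chunking; a placeholder for content-defined chunking."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def leaf_digest(chunk: bytes) -> bytes:
    """Digest of one stored chunk (0x00 prefix separates leaves from nodes)."""
    return hashlib.sha256(b"\x00" + chunk).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Fold leaf digests pairwise into a single root digest."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = list(leaves)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 1:
                nxt.append(pair[0])  # odd node is promoted unchanged
            else:
                nxt.append(hashlib.sha256(b"\x01" + pair[0] + pair[1]).digest())
        level = nxt
    return level[0]


data = b"A" * 8 + b"B" * 8 + b"A" * 8
chunks = chunk_fixed(data, 8)
leaves = [leaf_digest(c) for c in chunks]

chunk_store = {d: c for d, c in zip(leaves, chunks)}  # dedup: keyed by digest
package_digest = merkle_root(leaves).hex()
```

The manifest references three chunk digests while the store holds two chunks, and changing any chunk changes the root.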

How this document is used

  • Every schema change in WP-0001 (or successors) is checked against commitments A1 through A9. A change that violates one is either rejected or documented as a deliberate revision of this document.
  • Every "we could do this faster in native code" idea is filed against docs/ASSEMBLY-EXPERIMENT.md, not bolted onto a workplan.
  • Every new producer integration is checked against the commercial horizon: does it generalise, or does it bake in producer-specific assumptions?

This document is allowed to be wrong. It is not allowed to be silent. Update it when the thesis changes; do not let v1 quietly close doors that the v3 needs open.