# Platform Ambition

Status: draft
Created: 2026-05-15

This document records the longer-horizon thesis behind `artifact-store` and
captures which decisions are taken now to keep that horizon reachable without
expanding the v1 workplan. It sits beside, not above, `INTENT.md` and
`SCOPE.md`. INTENT defines what we build first; this document defines what
the v1 must not foreclose.

## Thesis

Generated artifacts — evidence packages, build outputs, ML models, logs,
snapshots, reports, scorecards, exports — are first-class durable objects in
modern software work. They sit somewhere between source code (well-served by
Git) and binary releases (well-served by OCI registries). The space between
is currently filled by a fragmented mix of bespoke directories, ad-hoc S3
buckets, vendor registries (Artifactory, Nexus), and document-management
systems that were not built for machine producers.

`artifact-store` aims to occupy that gap with one substrate: a generic,
content-addressed, signed, deduplicated, retention-aware artifact registry
and storage gateway that other tools embed or speak to.

The reference points are deliberate. **VLC** and **ffmpeg** lead their domain
not by being the prettiest applications but by being correct, fast, embeddable,
portable, and indispensable infrastructure for everyone else. The same
strategy applies here: build a kernel that is so good at the
bytes-and-identity layer that every artifact-producing tool would rather
speak its protocol than reinvent it.

## Commercial horizon

The longer-horizon commercial target is **a sovereign artifact-storage
product line for European cloud providers**; STACKIT (Schwarz Group) is
the concrete example. The thesis is:

- Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do
  not sell *artifact identity, retention, attestation, federation, evidence
  preservation*. Customers either build it themselves or buy proprietary
  registries on top.
- A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned
  artifact substrate on top of its own object storage has a defensible
  differentiation against AWS — not in raw price-per-GB, which is a losing
  race, but in regulated workloads (evidence retention, audit, signed
  attestations, legal-hold, sovereign jurisdiction guarantees).
- Open source is the wedge. A widely-adopted upstream that the provider
  ships, supports, and extends is far stronger than a proprietary stack.

This is a multi-year horizon, not a v1 deliverable. It is recorded here so
schema and protocol decisions made now keep that path open.

## Reference points

| System          | What we learn from it                                        |
|-----------------|--------------------------------------------------------------|
| ffmpeg          | Embeddable core, hand-tuned hot paths, runtime CPU dispatch  |
| VLC             | Plugin architecture, portability, ubiquity through being a library too |
| Git             | Content-addressed storage, Merkle DAG, pack files, integrity |
| restic          | Single static binary, CDC + dedup, encryption by default     |
| IPFS            | Content-addressing, federation, partial replication          |
| OCI Registry    | Standardised manifest + blob model with broad ecosystem      |
| Sigstore / cosign | Signed attestations as a first-class artifact property     |
| MinIO           | Operator ergonomics, S3 wire compat as adoption vector       |
| SeaweedFS / Ceph | Separation of metadata plane from data plane                |
| RocksDB / LMDB  | Embeddable storage engines with predictable performance      |
| Kafka           | Log-as-source-of-truth, materialised views                   |
| BLAKE3          | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned |

We are not trying to reproduce any of these. We are trying to occupy a
specific gap between them with the best ideas from each.

## Non-goals (still)

The platform ambition does not change the v1 boundary in `INTENT.md` or
`SCOPE.md`. In particular it does not:

- replace StateHub as the work / decision system of record;
- encode producer-specific assessment semantics in the registry core;
- require any of the optimisations listed in the "near-horizon" section
  below to land in v1;
- commit the project to writing assembly. The assembly-experiment line
  (`docs/ASSEMBLY-EXPERIMENT.md`) is opt-in research, not roadmap-critical.

## Architectural commitments — preserved by v1

The following decisions are taken now because reversing them later is
expensive. Each lands in v1 as a schema or contract decision; full
exploitation is later work.

### A1. Content as the primary address

The internal canonical key for stored bytes is `<algo>:<digest>`, not a
logical path. Files within a package keep a `relative_path` as logical
metadata, but the storage backend only sees and addresses content hashes.

- Enables: global dedup, Merkle integrity proofs, partial mirrors,
  federation, OCI compatibility.
- v1 cost: one schema column (`content_address`) and a deterministic key
  derivation; no behaviour change.
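
A minimal sketch of the key derivation, assuming SHA-256 from the standard library. The function names and the two-level fan-out layout in `storage_key` are hypothetical illustrations, not part of the v1 schema.

```python
import hashlib


def content_address(data: bytes, algo: str = "sha256") -> str:
    """Return the canonical `<algo>:<digest>` address for a byte string."""
    digest = hashlib.new(algo, data).hexdigest()
    return f"{algo}:{digest}"


def storage_key(address: str) -> str:
    """Map a content address to a backend key (hypothetical fan-out layout)."""
    algo, digest = address.split(":", 1)
    return f"{algo}/{digest[:2]}/{digest[2:4]}/{digest}"


# Two files with different logical paths but identical bytes share one address,
# so the backend stores the bytes once.
files = {"reports/a.txt": b"same bytes", "exports/b.txt": b"same bytes"}
addresses = {path: content_address(data) for path, data in files.items()}
```

The `relative_path` stays in metadata; only the address reaches the storage backend, which is what makes dedup a property of the key rather than a separate feature.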

### A2. BLAKE3 as native digest, SHA-256 retained for interop

`digest_algorithm` is a column on `artifact_files`. v1 default may remain
`sha256` to ship the pilot quickly; the column exists so `blake3` can ship
without migration.

- Enables: faster hashing, free Merkle root over a package, alignment with
  modern signing tooling.
- v1 cost: column + adapter table mapping algo → hashing impl.
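
The adapter table could look roughly like the sketch below. It assumes the optional third-party `blake3` package for the BLAKE3 entry and falls back to SHA-256 only when that package is absent; `HASHERS` and `digest_stream` are hypothetical names.

```python
import hashlib
from typing import Any, Callable, Dict, Iterable

# Adapter table: value of the digest_algorithm column -> hasher factory.
HASHERS: Dict[str, Callable[[], Any]] = {"sha256": hashlib.sha256}

try:
    # Assumption: the optional third-party `blake3` package is installed.
    from blake3 import blake3

    HASHERS["blake3"] = blake3
except ImportError:
    pass  # SHA-256 remains available for interop either way.


def digest_stream(chunks: Iterable[bytes], algorithm: str = "sha256") -> str:
    """Hash an iterable of byte chunks with the selected algorithm."""
    if algorithm not in HASHERS:
        raise ValueError(f"unsupported digest_algorithm: {algorithm!r}")
    hasher = HASHERS[algorithm]()
    for chunk in chunks:
        hasher.update(chunk)
    return hasher.hexdigest()
```

Because callers pass the algorithm as data, switching the default later is a configuration change rather than a schema migration.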

### A3. Append-only event log as source of truth

An `events` table with a monotonic sequence number is the authoritative
record of registry mutations. The current metadata tables are a
materialised view rebuildable from the log.

- Enables: CDC feeds, audit, replication, point-in-time recovery, signed
  event streams.
- v1 cost: one extra table written on the same transaction as today's
  mutations.
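
A sketch of the same-transaction write, using SQLite from the standard library as a stand-in for whatever database v1 uses. The table layouts, the `file_registered` event kind, and the function names are assumptions for illustration.

```python
import json
import sqlite3

INSERT_FILE = "INSERT INTO artifact_files (content_address, relative_path) VALUES (?, ?)"


def open_registry(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS events (
            seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sequence
            kind    TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        CREATE TABLE IF NOT EXISTS artifact_files (
            content_address TEXT PRIMARY KEY,
            relative_path   TEXT NOT NULL
        );
        """
    )
    return conn


def register_file(conn: sqlite3.Connection, content_address: str, relative_path: str) -> int:
    """Append the event and update the view in one transaction; return the event seq."""
    payload = json.dumps({"content_address": content_address, "relative_path": relative_path})
    with conn:  # commits both inserts together, or rolls both back on error
        cur = conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)", ("file_registered", payload)
        )
        conn.execute(INSERT_FILE, (content_address, relative_path))
    return cur.lastrowid


def rebuild_view(conn: sqlite3.Connection) -> None:
    """Drop the materialised view and replay it from the event log."""
    rows = conn.execute("SELECT kind, payload FROM events ORDER BY seq").fetchall()
    with conn:
        conn.execute("DELETE FROM artifact_files")
        for kind, payload in rows:
            if kind == "file_registered":
                record = json.loads(payload)
                conn.execute(INSERT_FILE, (record["content_address"], record["relative_path"]))
```

If the view insert fails, the event insert is rolled back with it, so the log never records a mutation that did not happen.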

### A4. Signed manifests, canonicalisation pinned

Manifest serialisation uses a canonical form (recommendation: canonical
CBOR; JCS as alternative) so byte-identical signing is possible across
languages and time. v1 may not actually sign — the pin guarantees that
when signing lands, every prior manifest is re-signable byte-for-byte.

- Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style
  manifest digests.
- v1 cost: pick one canonicalisation library and use it for manifest
  writes. No meaningful runtime cost.
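
To illustrate why the pin matters, the sketch below uses deterministic JSON from the standard library. This is only an approximation of JCS, not a complete RFC 8785 implementation (number formatting and key ordering rules differ in edge cases), and it is not canonical CBOR; a real implementation would use a dedicated library. The manifest fields are hypothetical.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialise a manifest deterministically (JCS-like approximation)."""
    return json.dumps(
        manifest,
        sort_keys=True,         # stable key order at every nesting level
        separators=(",", ":"),  # no insignificant whitespace
        ensure_ascii=False,     # emit UTF-8 rather than \u escapes
        allow_nan=False,        # reject NaN/Infinity, which are not valid JSON
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """Digest of the canonical bytes; this is what a signature would cover."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# The same logical manifest built in two different key orders.
a = {"name": "run-42", "files": [{"path": "a.txt", "size": 3}]}
b = {"files": [{"size": 3, "path": "a.txt"}], "name": "run-42"}
```

Both `a` and `b` serialise to the same bytes, so a signature computed over one verifies against the other regardless of which producer or language wrote it.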

### A5. Control plane / data plane separation at the contract

Even if v1 implements both in one Python process, the boundary between
"registry / API / retention" (control plane) and "hash / chunk / store /
serve" (data plane) is a named contract. When the data plane is later
extracted into a Rust binary, the API does not change.

- Enables: native-speed ingestion, language flexibility on the hot path,
  independent scaling.
- v1 cost: discipline (separate Python module with no API leakage), not
  code.
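
One way to make the contract explicit inside a single Python process is a `typing.Protocol`. The sketch below is an assumption about what such a boundary might look like; `DataPlane`, `InMemoryDataPlane`, and `Registry` are hypothetical names.

```python
import hashlib
import io
from typing import BinaryIO, Dict, Iterable, Protocol


class DataPlane(Protocol):
    """Named contract for the data plane: hash / chunk / store / serve."""

    def ingest(self, chunks: Iterable[bytes]) -> str: ...

    def open(self, content_address: str) -> BinaryIO: ...


class InMemoryDataPlane:
    """Toy implementation; a native binary could later sit behind the same contract."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def ingest(self, chunks: Iterable[bytes]) -> str:
        data = b"".join(chunks)
        address = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def open(self, content_address: str) -> BinaryIO:
        return io.BytesIO(self._blobs[content_address])


class Registry:
    """Control plane: knows logical paths and addresses, never raw storage details."""

    def __init__(self, data_plane: DataPlane) -> None:
        self._data_plane = data_plane
        self._paths: Dict[str, str] = {}

    def put(self, relative_path: str, chunks: Iterable[bytes]) -> str:
        address = self._data_plane.ingest(chunks)
        self._paths[relative_path] = address
        return address

    def get(self, relative_path: str) -> bytes:
        return self._data_plane.open(self._paths[relative_path]).read()
```

The control plane depends only on the two methods of the protocol, so replacing the in-process implementation with a client for an external binary leaves `Registry` untouched.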

### A6. Resumable upload wire shape

API exposes upload sessions: `POST /uploads`, `PATCH /uploads/{id}` with
range, `POST /uploads/{id}/complete`. v1 implementation may still be
single-shot multipart under the hood, but the resource shape exists so
chunked / resumable upload is additive.

- Enables: streaming, retry-safe ingestion, very-large-package support.
- v1 cost: route definitions only; underlying logic can remain simple.
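
The resource shape can be modelled without any web framework. The sketch below is an in-memory illustration only: the class, the offset check, and the returned values are assumptions, and each method is annotated with the route it would back.

```python
import hashlib
import uuid
from typing import Dict


class UploadError(Exception):
    """Raised when a PATCH does not continue where the session left off."""


class UploadSessions:
    def __init__(self) -> None:
        self._sessions: Dict[str, bytearray] = {}

    def create(self) -> str:
        """POST /uploads: open a session and return its id."""
        upload_id = uuid.uuid4().hex
        self._sessions[upload_id] = bytearray()
        return upload_id

    def patch(self, upload_id: str, offset: int, data: bytes) -> int:
        """PATCH /uploads/{id}: append a range; return the next expected offset."""
        buffer = self._sessions[upload_id]
        if offset != len(buffer):
            raise UploadError(f"expected offset {len(buffer)}, got {offset}")
        buffer.extend(data)
        return len(buffer)

    def complete(self, upload_id: str) -> str:
        """POST /uploads/{id}/complete: finalise and return the content address."""
        buffer = self._sessions.pop(upload_id)
        return "sha256:" + hashlib.sha256(bytes(buffer)).hexdigest()
```

A single-shot v1 client is simply one `create`, one `patch` at offset 0, and one `complete`; a resumable client retries `patch` from the last acknowledged offset.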

### A7. Tiering as a property of storage locations

`storage_location` carries two nullable columns: `retrieval_tier`
(`hot|warm|cold|archive`, default `hot`) and `restore_status`. The API can
already return "not immediately available" without changing artifact
identity.

- Enables: future cold storage, Glacier-style restore flows.
- v1 cost: two nullable columns.
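
A sketch of how the two columns could drive an availability answer. The `restore_status` values, the treatment of `warm` as immediately readable, and the response shape are all assumptions, not decided semantics.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: warm storage is readable without a restore step.
IMMEDIATE_TIERS = {"hot", "warm"}


@dataclass(frozen=True)
class StorageLocation:
    content_address: str
    retrieval_tier: Optional[str] = "hot"  # hot | warm | cold | archive
    restore_status: Optional[str] = None   # hypothetical: "in_progress", "restored"


def availability(location: StorageLocation) -> dict:
    """Describe availability; the content address is the same in every tier."""
    tier = location.retrieval_tier or "hot"  # NULL is treated as hot
    available = tier in IMMEDIATE_TIERS or location.restore_status == "restored"
    return {
        "content_address": location.content_address,
        "retrieval_tier": tier,
        "available": available,
    }
```

Moving an artifact to a colder tier changes only the location row; every reference to the `content_address` stays valid.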

### A8. Schema-typed metadata with open escape hatch

Producers register a metadata schema (JSON Schema) per variant
(e.g. `guide-board.run.v1`). Metadata is stored as open JSON and validated
against the registered schema at ingest time. Queries can use typed views.

- Enables: tooling, search, GraphQL views, typed clients without losing
  flexibility.
- v1 cost: a `metadata_schemas` table; v1 validation can be a no-op.
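
A sketch of a schema registry with a pluggable validator, so v1 can ship the no-op while the hook already exists. The `require_keys` validator checks only the schema's `required` list and is a stand-in for a real JSON Schema validator; all names and the example schema fields are hypothetical.

```python
from typing import Callable, Dict

Validator = Callable[[dict, dict], None]


def no_op(schema: dict, metadata: dict) -> None:
    """v1-style validator: accept everything."""


def require_keys(schema: dict, metadata: dict) -> None:
    """Check only the `required` list; not full JSON Schema validation."""
    missing = [key for key in schema.get("required", []) if key not in metadata]
    if missing:
        raise ValueError(f"missing required metadata keys: {missing}")


class MetadataSchemas:
    """In-memory stand-in for the `metadata_schemas` table."""

    def __init__(self, validate: Validator = no_op) -> None:
        self._schemas: Dict[str, dict] = {}
        self._validate = validate

    def register(self, variant: str, schema: dict) -> None:
        self._schemas[variant] = schema

    def ingest(self, variant: str, metadata: dict) -> dict:
        """Validate against the registered schema, then return the open JSON as-is."""
        if variant not in self._schemas:
            raise KeyError(f"no schema registered for variant {variant!r}")
        self._validate(self._schemas[variant], metadata)
        return metadata


schemas = MetadataSchemas(validate=require_keys)
schemas.register("guide-board.run.v1", {"type": "object", "required": ["run_id"]})
```

Unknown extra keys pass through untouched, which is the "open escape hatch": the schema constrains what must be present, not what may be present.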

### A9. OCI compatibility kept reachable

We do not promise OCI compatibility in v1, but we do not adopt any
data model that prevents it. Concretely: keep content addresses as
`<algo>:<hex>`, keep manifest structure compatible with an OCI image
manifest (config + layers + annotations), and avoid invariants that
conflict with the OCI spec.

- Enables: future `oras push` / `cosign sign` / Helm ecosystem entry.
- v1 cost: one design review per schema change against the OCI spec.
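
For orientation, a package manifest shaped like an OCI image manifest might be built as below. The top-level field names and `org.opencontainers.image.title` follow the OCI image spec; the config media type and the helper names are hypothetical, and this sketch makes no claim of full spec conformance.

```python
import hashlib
from typing import Dict


def descriptor(media_type: str, data: bytes, title: str = "") -> dict:
    """Build an OCI-style content descriptor for a blob."""
    result = {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }
    if title:
        # Carry the logical relative_path as an annotation, not as the address.
        result["annotations"] = {"org.opencontainers.image.title": title}
    return result


def oci_style_manifest(config: bytes, files: Dict[str, bytes], annotations: Dict[str, str]) -> dict:
    """Assemble config + layers + annotations in the OCI image manifest shape."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        # Hypothetical media type for this project's package config.
        "config": descriptor("application/vnd.example.artifact.config.v1+json", config),
        "layers": [
            descriptor("application/octet-stream", data, title=path)
            for path, data in sorted(files.items())
        ],
        "annotations": annotations,
    }


manifest = oci_style_manifest(b"{}", {"b.txt": b"two", "a.txt": b"one"}, {"producer": "example"})
```

Each file becomes a layer descriptor addressed by digest, which lines up with commitment A1: the path is metadata and the digest is the key.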

## Near-horizon technical roadmap (post-baseline)

Roughly ordered. Not commitments; planning hooks.

1. **Rust data-plane binary.** Receives chunked uploads, runs BLAKE3 + CDC +
   optional Zstd + optional AES-GCM, writes to storage adapter. Speaks a
   minimal gRPC or framed-bincode protocol to the Python control plane over
   a Unix socket.
2. **Content-defined chunking (FastCDC).** Stored chunks become the dedup
   unit. Package manifest references chunk digests; package digest is the
   Merkle root.
3. **Cosign-compatible signing pipeline.** Every finalised manifest can be
   signed; signatures stored alongside the manifest.
4. **Event stream out.** NATS or Kafka topic of registry events for
   downstream consumers.
5. **OCI artifact endpoint.** A `/v2/` namespace that speaks the OCI
   distribution spec on top of the same storage.
6. **WASM plugin host.** Producers and operators can ship signed `.wasm`
   modules for content extraction, redaction, scorecard generation,
   custom hashing, indexing. This is the "ffmpeg moment" — open extension
   surface that does not require forking the core.
7. **Federation.** Signed manifest exchange between artifact-store
   instances. Gossip or explicit peering.
8. **Cold tier adapters.** S3 Glacier, tape, infrequent-access (IA)
   storage classes.
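
The idea in item 2, a package digest that is the Merkle root over its chunk digests, can be sketched as follows. Fixed-size chunking and SHA-256 stand in for FastCDC and BLAKE3 here, and the tree construction (leaf and node prefixes, promoting an unpaired node) is an illustrative assumption, not BLAKE3's internal tree or a decided format.

```python
import hashlib
from typing import List


def fixed_chunks(data: bytes, size: int = 4) -> List[bytes]:
    """Split into fixed-size chunks (stand-in for content-defined boundaries)."""
    return [data[i:i + size] for i in range(0, len(data), size)] or [b""]


def leaf(chunk: bytes) -> bytes:
    """Chunk digest; the 0x00 prefix separates leaves from interior nodes."""
    return hashlib.sha256(b"\x00" + chunk).digest()


def merkle_root(leaves: List[bytes]) -> bytes:
    """Fold leaf digests pairwise until a single root remains."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = list(leaves)
    while len(level) > 1:
        parents = []
        for i in range(0, len(level) - 1, 2):
            parents.append(hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest())
        if len(level) % 2:
            parents.append(level[-1])  # promote the unpaired node unchanged
        level = parents
    return level[0]


def package_digest(data: bytes) -> str:
    """Package digest as the Merkle root over the package's chunk digests."""
    return "sha256-merkle:" + merkle_root([leaf(c) for c in fixed_chunks(data)]).hex()
```

Two packages that share chunks share leaf digests, so the chunk store deduplicates them, while any changed chunk still produces a different root.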

## How this document is used

- Every schema change in WP-0001 (or successors) is checked against
  commitments A1–A9. A change that violates one is either rejected or
  documented as a deliberate revision of this document.
- Every "we could do this faster in native code" idea is filed against
  `docs/ASSEMBLY-EXPERIMENT.md`, not bolted onto a workplan.
- Every new producer integration is checked against the commercial
  horizon: does it generalise, or does it bake in producer-specific
  assumptions?

This document is allowed to be wrong. It is not allowed to be silent.
Update it when the thesis changes; do not let v1 quietly close doors that
the v3 needs open.