artifact-store/docs/PLATFORM-AMBITION.md
tegwick 403d903585 docs: add platform ambition, blueprint review, and assembly experiment
Captures the longer-horizon thesis (sovereign-cloud artifact substrate)
alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine
schema/contract commitments the v1 must preserve to keep that horizon
reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on
ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 20:56:01 +02:00

# Platform Ambition

Status: draft

Created: 2026-05-15

This document records the longer-horizon thesis behind `artifact-store` and
captures which decisions are taken now to keep that horizon reachable without
expanding the v1 workplan. It sits beside, not above, `INTENT.md` and
`SCOPE.md`. INTENT defines what we build first; this document defines what
the v1 must not foreclose.
## Thesis
Generated artifacts — evidence packages, build outputs, ML models, logs,
snapshots, reports, scorecards, exports — are first-class durable objects in
modern software work. They sit somewhere between source code (well-served by
Git) and binary releases (well-served by OCI registries). The space between
is currently filled by a fragmented mix of bespoke directories, ad-hoc S3
buckets, vendor registries (Artifactory, Nexus), and document-management
systems that were not built for machine producers.

`artifact-store` aims to occupy that gap with one substrate: a generic,
content-addressed, signed, deduplicated, retention-aware artifact registry
and storage gateway that other tools embed or speak to.

The reference points are deliberate. **VLC** and **ffmpeg** lead their domain
not by being the prettiest applications but by being correct, fast, embeddable,
portable, and indispensable infrastructure for everyone else. The same
strategy applies here: build a kernel that is so good at the
bytes-and-identity layer that every artifact-producing tool would rather speak its
protocol than reinvent it.
## Commercial horizon
The longer-horizon commercial target is **a sovereign artifact-storage
product line for European cloud providers**; STACKIT (Schwarz Group) is
the concrete example. The thesis is:
- Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do
not sell *artifact identity, retention, attestation, federation, evidence
preservation*. Customers either build it themselves or buy proprietary
registries on top.
- A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned
artifact substrate on top of its own object storage has a defensible
differentiation against AWS — not in raw price-per-GB, which is a losing
race, but in regulated workloads (evidence retention, audit, signed
attestations, legal-hold, sovereign jurisdiction guarantees).
- Open source is the wedge. A widely-adopted upstream that the provider
ships, supports, and extends is far stronger than a proprietary stack.

This is a multi-year horizon, not a v1 deliverable. It is recorded here so
schema and protocol decisions made now keep that path open.
## Reference points
| System | What we learn from it |
|-----------------|--------------------------------------------------------------|
| ffmpeg | Embeddable core, hand-tuned hot paths, runtime CPU dispatch |
| VLC | Plugin architecture, portability, ubiquity through being a library too |
| Git | Content-addressed storage, Merkle DAG, pack files, integrity |
| restic | Single static binary, CDC + dedup, encryption by default |
| IPFS | Content-addressing, federation, partial replication |
| OCI Registry | Standardised manifest + blob model with broad ecosystem |
| Sigstore / cosign | Signed attestations as a first-class artifact property |
| MinIO | Operator ergonomics, S3 wire compat as adoption vector |
| SeaweedFS / Ceph | Separation of metadata plane from data plane |
| RocksDB / LMDB | Embeddable storage engines with predictable performance |
| Kafka | Log-as-source-of-truth, materialised views |
| BLAKE3 | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned |

We are not trying to reproduce any of these. We are trying to occupy a
specific gap between them with the best ideas from each.
## Non-goals (still)
The platform ambition does not change the v1 boundary in `INTENT.md` or
`SCOPE.md`. In particular it does not:
- replace StateHub as the work / decision system of record;
- encode producer-specific assessment semantics in the registry core;
- require any of the optimisations listed in the "near-horizon" section
below to land in v1;
- commit the project to writing assembly. The assembly-experiment line
(`docs/ASSEMBLY-EXPERIMENT.md`) is opt-in research, not roadmap-critical.
## Architectural commitments — preserved by v1
The following decisions are taken now because reversing them later is
expensive. Each lands in v1 as a schema or contract decision; full
exploitation is later work.
### A1. Content as the primary address
Internal canonical key for stored bytes is `<algo>:<digest>`, not a logical
path. Files within a package keep a `relative_path` as logical metadata,
but the storage backend sees and addresses content hashes.
- Enables: global dedup, Merkle integrity proofs, partial mirrors,
federation, OCI compatibility.
- v1 cost: one schema column (`content_address`) and a deterministic key
derivation; no behaviour change.
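
A minimal sketch of the deterministic key derivation described in A1, assuming SHA-256 via the standard library. The function name `content_address` and the sample paths are illustrative, not part of the v1 schema.

```python
import hashlib


def content_address(data: bytes, algo: str = "sha256") -> str:
    """Return the canonical storage key `<algo>:<hex digest>` for some bytes."""
    digest = hashlib.new(algo, data).hexdigest()
    return f"{algo}:{digest}"


# Two files with different logical paths but identical bytes.
# (Paths are hypothetical; relative_path stays logical metadata only.)
files = {
    "reports/summary.txt": b"hello artifact",
    "exports/copy-of-summary.txt": b"hello artifact",
}

# The storage backend only ever sees the content address, so both
# paths resolve to the same stored object.
addresses = {path: content_address(data) for path, data in files.items()}
unique_objects = set(addresses.values())
```

Because the key depends only on the bytes, global dedup falls out of the addressing scheme rather than requiring a separate lookup step.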
### A2. BLAKE3 as native digest, SHA-256 retained for interop
`digest_algorithm` is a column on `artifact_files`. v1 default may remain
`sha256` to ship the pilot quickly; the column exists so `blake3` can ship
without migration.
- Enables: faster hashing, free Merkle root over a package, alignment with
modern signing tooling.
- v1 cost: column + adapter table mapping algo → hashing impl.
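
The adapter table in A2 could look roughly like the sketch below. `HASHERS` and `digest_stream` are hypothetical names; the BLAKE3 entry assumes the third-party `blake3` Python package is installed and is skipped otherwise.

```python
import hashlib
from typing import Any, Callable, Dict, Iterable

# Hypothetical adapter table: value of the digest_algorithm column
# mapped to a factory that returns a hasher with update()/hexdigest().
HASHERS: Dict[str, Callable[[], Any]] = {"sha256": hashlib.sha256}

try:
    # Assumption: the third-party `blake3` package is available.
    import blake3

    HASHERS["blake3"] = blake3.blake3
except ImportError:
    pass  # sha256 remains the only registered algorithm


def digest_stream(chunks: Iterable[bytes], algo: str = "sha256") -> str:
    """Hash a stream of chunks with the algorithm named in the column."""
    if algo not in HASHERS:
        raise ValueError(f"unsupported digest_algorithm: {algo}")
    hasher = HASHERS[algo]()
    for chunk in chunks:
        hasher.update(chunk)
    return f"{algo}:{hasher.hexdigest()}"
```

Adding an algorithm later is then one new table entry plus a new column value, with no migration of existing rows.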
### A3. Append-only event log as source of truth
An `events` table with a monotonic sequence number is the authoritative
record of registry mutations. The current metadata tables are a
materialised view rebuildable from the log.
- Enables: CDC feeds, audit, replication, point-in-time recovery, signed
event streams.
- v1 cost: one extra table written on the same transaction as today's
mutations.
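
One way to picture A3 is the SQLite sketch below: the event row and the metadata row are written in one transaction, and the metadata table can be rebuilt from the log. The table layouts, the `file_registered` event kind, and the function names are assumptions for illustration only.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sequence number
        kind TEXT NOT NULL,
        payload TEXT NOT NULL
    );
    CREATE TABLE artifact_files (
        content_address TEXT NOT NULL,
        relative_path TEXT NOT NULL
    );
    """
)


def register_file(conn, content_address, relative_path):
    """Write the event and the materialised row in the same transaction."""
    payload = json.dumps(
        {"content_address": content_address, "relative_path": relative_path}
    )
    with conn:  # commits both inserts, or rolls both back on error
        conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)",
            ("file_registered", payload),
        )
        conn.execute(
            "INSERT INTO artifact_files VALUES (?, ?)",
            (content_address, relative_path),
        )


def rebuild_view(conn):
    """Drop the materialised view and replay the log in sequence order."""
    events = conn.execute(
        "SELECT kind, payload FROM events ORDER BY seq"
    ).fetchall()
    with conn:
        conn.execute("DELETE FROM artifact_files")
        for kind, payload in events:
            if kind == "file_registered":
                data = json.loads(payload)
                conn.execute(
                    "INSERT INTO artifact_files VALUES (?, ?)",
                    (data["content_address"], data["relative_path"]),
                )
```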
### A4. Signed manifests, canonicalisation pinned
Manifest serialisation uses a canonical form (recommendation: canonical
CBOR; JCS as alternative) so byte-identical signing is possible across
languages and time. v1 may not actually sign — the pin guarantees that
when signing lands, every prior manifest is re-signable byte-for-byte.
- Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style
manifest digests.
- v1 cost: pick one canonicalisation library and use it for manifest
writes. Negligible runtime cost.
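
To illustrate why a pinned canonical form matters, here is a simplified, JCS-like sketch using only the standard library. It is not a full RFC 8785 implementation (number formatting and key ordering edge cases are not handled) and it is not canonical CBOR; a real deployment would use a dedicated library. The manifest fields are invented for the example.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialise with sorted keys, no insignificant whitespace, UTF-8.

    Simplified stand-in for a real canonicalisation scheme.
    """
    text = json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return text.encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """Digest of the canonical bytes; this is what a signature would cover."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# The same logical manifest, built with different key orders.
a = {"name": "pkg", "files": [{"path": "a.txt", "size": 3}]}
b = {"files": [{"size": 3, "path": "a.txt"}], "name": "pkg"}

same_bytes = canonical_bytes(a) == canonical_bytes(b)
```

Both producers arrive at identical bytes, so a signature computed later over a stored manifest still verifies regardless of which language or library wrote it.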
### A5. Control plane / data plane separation at the contract
Even if v1 implements both in one Python process, the boundary between
"registry / API / retention" (control plane) and "hash / chunk / store /
serve" (data plane) is a named contract. When the data plane is later
extracted into a Rust binary, the API does not change.
- Enables: native-speed ingestion, language flexibility on the hot path,
independent scaling.
- v1 cost: discipline (separate Python module with no API leakage), not
code.
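
The named contract in A5 might be expressed as a `typing.Protocol`, as in this sketch. `DataPlane`, `InProcessDataPlane`, and `Registry` are hypothetical names; the point is only that the control plane depends on the contract, not on the implementation behind it.

```python
import hashlib
from typing import Dict, Protocol


class DataPlane(Protocol):
    """Contract for the hash / store / serve side."""

    def put(self, data: bytes) -> str: ...

    def get(self, content_address: str) -> bytes: ...


class InProcessDataPlane:
    """v1-style implementation living in the same Python process."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        address = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(address, data)  # identical bytes stored once
        return address

    def get(self, content_address: str) -> bytes:
        return self._blobs[content_address]


class Registry:
    """Control plane: knows paths and addresses, never touches storage."""

    def __init__(self, data_plane: DataPlane) -> None:
        self._data_plane = data_plane
        self._paths: Dict[str, str] = {}

    def ingest(self, relative_path: str, data: bytes) -> str:
        address = self._data_plane.put(data)
        self._paths[relative_path] = address
        return address

    def read(self, relative_path: str) -> bytes:
        return self._data_plane.get(self._paths[relative_path])
```

A later out-of-process implementation would only need to satisfy `put` and `get`; `Registry` would not change.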
### A6. Resumable upload wire shape
API exposes upload sessions: `POST /uploads`, `PATCH /uploads/{id}` with
range, `POST /uploads/{id}/complete`. v1 implementation may still be
single-shot multipart under the hood, but the resource shape exists so
chunked / resumable upload is additive.
- Enables: streaming, retry-safe ingestion, very-large-package support.
- v1 cost: route definitions only; underlying logic can remain simple.
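
The upload-session resource shape in A6 can be sketched without any web framework. The class and method names below are assumptions; each method is annotated with the route it would sit behind. The offset check is one possible way to make retries safe, not a documented v1 behaviour.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class UploadSession:
    id: str
    buffer: bytearray = field(default_factory=bytearray)
    completed: bool = False


class UploadService:
    def __init__(self) -> None:
        self._sessions: Dict[str, UploadSession] = {}

    def create(self) -> str:
        """POST /uploads"""
        session = UploadSession(id=uuid.uuid4().hex)
        self._sessions[session.id] = session
        return session.id

    def append(self, upload_id: str, offset: int, chunk: bytes) -> int:
        """PATCH /uploads/{id} with a range starting at `offset`."""
        session = self._sessions[upload_id]
        if session.completed:
            raise ValueError("upload already completed")
        if offset != len(session.buffer):
            # A stale or repeated range is rejected instead of corrupting data.
            raise ValueError("range must start at the current offset")
        session.buffer.extend(chunk)
        return len(session.buffer)  # next expected offset

    def complete(self, upload_id: str) -> str:
        """POST /uploads/{id}/complete"""
        session = self._sessions[upload_id]
        session.completed = True
        return "sha256:" + hashlib.sha256(bytes(session.buffer)).hexdigest()
```

A single-shot v1 client would call `create`, one `append` at offset 0, then `complete`; a resumable client uses the same three calls with more `append` steps.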
### A7. Tiering as a property of storage locations
`storage_location` carries `retrieval_tier` (`hot|warm|cold|archive`) and
`restore_status` columns, nullable, default `hot`. The API can already
return "not immediately available" without changing artifact identity.
- Enables: future cold storage, Glacier-style restore flows.
- v1 cost: two nullable columns.
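
As a rough illustration of A7, the sketch below derives an availability answer from the two columns without touching artifact identity. The `restore_status` values (`in_progress`, `restored`) and the choice to treat `warm` as immediately available are assumptions, not part of the documented schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageLocation:
    content_address: str
    retrieval_tier: Optional[str] = "hot"  # hot | warm | cold | archive
    restore_status: Optional[str] = None   # hypothetical: in_progress | restored


def availability(location: StorageLocation) -> str:
    """Map tier and restore state to what the API could report."""
    tier = location.retrieval_tier or "hot"  # NULL is treated as hot
    if tier in ("hot", "warm"):
        return "available"
    if location.restore_status == "restored":
        return "available"
    if location.restore_status is None:
        return "restore_required"
    return "restore_in_progress"
```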
### A8. Schema-typed metadata with open escape hatch
Producers register a metadata schema (JSON Schema) per variant
(e.g. `guide-board.run.v1`). Stored as open JSON, validated against the
registered schema at ingest time. Queries can use typed views.
- Enables: tooling, search, GraphQL views, typed clients without losing
flexibility.
- v1 cost: a `metadata_schemas` table; v1 validation can be a no-op.
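
A sketch of how A8 could keep validation pluggable: the registry stores one schema per variant and calls a validator at ingest time, which defaults to a no-op. All names are hypothetical, and `required_keys_validator` checks only the `required` keyword as a stand-in for a real JSON Schema validator.

```python
from typing import Callable, Dict

Validator = Callable[[dict, dict], None]


def noop_validator(schema: dict, metadata: dict) -> None:
    """v1 behaviour: accept everything."""
    return None


def required_keys_validator(schema: dict, metadata: dict) -> None:
    """Tiny stand-in for JSON Schema: enforces only `required`."""
    missing = [key for key in schema.get("required", []) if key not in metadata]
    if missing:
        raise ValueError(f"missing required metadata keys: {missing}")


class MetadataSchemas:
    def __init__(self, validator: Validator = noop_validator) -> None:
        self._schemas: Dict[str, dict] = {}
        self._validator = validator

    def register(self, variant: str, schema: dict) -> None:
        self._schemas[variant] = schema

    def ingest(self, variant: str, metadata: dict) -> dict:
        """Validate against the registered schema, then store as open JSON."""
        if variant not in self._schemas:
            raise KeyError(f"no schema registered for variant {variant}")
        self._validator(self._schemas[variant], metadata)
        return metadata
```

Swapping `noop_validator` for a stricter one tightens ingest without changing the stored shape or the table.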
### A9. OCI compatibility kept reachable
We do not promise OCI compatibility in v1, but we do not adopt any
data model that prevents it. Concretely: keep content addresses as
`<algo>:<hex>`, keep manifest structure compatible with an OCI image
manifest (config + layers + annotations), and avoid invariants that the
OCI spec forbids.
- Enables: future `oras push` / `cosign sign` / Helm ecosystem entry.
- v1 cost: one design review per schema change against the OCI spec.
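
For orientation, a package manifest that stays within the OCI image manifest shape (config + layers + annotations) might be assembled as below. The two `vnd.example` media types and the `org.example` annotation key are invented for the sketch; `schemaVersion`, the manifest media type, the descriptor fields, and the `org.opencontainers.image.title` annotation follow the OCI image specification.

```python
import hashlib
from typing import Dict, Optional

# Hypothetical media types for artifact-store content.
CONFIG_TYPE = "application/vnd.example.artifact.config.v1+json"
FILE_TYPE = "application/vnd.example.artifact.file.v1"


def descriptor(media_type: str, data: bytes, title: Optional[str] = None) -> dict:
    """Build an OCI-style content descriptor for a blob."""
    result = {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }
    if title is not None:
        result["annotations"] = {"org.opencontainers.image.title": title}
    return result


def oci_shaped_manifest(config: bytes, files: Dict[str, bytes]) -> dict:
    """Map a package onto config + layers + annotations."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        "config": descriptor(CONFIG_TYPE, config),
        "layers": [
            descriptor(FILE_TYPE, data, title=path)
            for path, data in sorted(files.items())
        ],
        # Hypothetical annotation key carrying the producer variant.
        "annotations": {"org.example.artifact-store.variant": "guide-board.run.v1"},
    }
```

The `relative_path` from A1 travels as an annotation on each layer, while the digest remains the address, which matches how the content-address commitment is worded.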
## Near-horizon technical roadmap (post-baseline)
Roughly ordered. Not commitments; planning hooks.
1. **Rust data-plane binary.** Receives chunked uploads, runs BLAKE3 + CDC +
optional Zstd + optional AES-GCM, writes to storage adapter. Speaks a
minimal gRPC or framed-bincode protocol to the Python control plane over
a Unix socket.
2. **Content-defined chunking (FastCDC).** Stored chunks become the dedup
unit. Package manifest references chunk digests; package digest is the
Merkle root.
3. **Cosign-compatible signing pipeline.** Every finalised manifest can be
signed; signatures stored alongside the manifest.
4. **Event stream out.** NATS or Kafka topic of registry events for
downstream consumers.
5. **OCI artifact endpoint.** A `/v2/` namespace that speaks the OCI
distribution spec on top of the same storage.
6. **WASM plugin host.** Producers and operators can ship signed `.wasm`
modules for content extraction, redaction, scorecard generation,
custom hashing, indexing. This is the "ffmpeg moment" — open extension
surface that does not require forking the core.
7. **Federation.** Signed manifest exchange between artifact-store
instances. Gossip or explicit peering.
8. **Cold tier adapters.** S3 Glacier, tape, infrequent-access (IA) storage classes.
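
Item 2 above (chunks as the dedup unit, package digest as the Merkle root) can be sketched as follows. Fixed-size chunking stands in for FastCDC, and SHA-256 stands in for BLAKE3, purely to keep the example dependency-free; the tree construction (pairwise hashing, odd node promoted) is one simple choice among several.

```python
import hashlib
from typing import Dict, List


def chunk_fixed(data: bytes, size: int = 4) -> List[bytes]:
    """Fixed-size chunking; a stand-in for content-defined chunking."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def merkle_root(leaves: List[bytes]) -> bytes:
    """Pairwise-hash leaf digests up to a single root digest."""
    if not leaves:
        return hashlib.sha256(b"").digest()
    level = leaves
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                next_level.append(pair[0])  # odd node is promoted unchanged
        level = next_level
    return level[0]


def store_package(data: bytes, chunk_store: Dict[str, bytes]) -> dict:
    """Store unique chunks once; return a manifest of chunk references."""
    leaves, refs = [], []
    for chunk in chunk_fixed(data):
        digest = hashlib.sha256(chunk).digest()
        chunk_store.setdefault(digest.hex(), chunk)  # dedup by chunk digest
        leaves.append(digest)
        refs.append("sha256:" + digest.hex())
    return {"chunks": refs, "package_digest": "sha256:" + merkle_root(leaves).hex()}
```

With input `b"AAAABBBBAAAA"` the manifest references three chunks while the store holds only two, since the first and last chunks are identical.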
## How this document is used
- Every schema change in WP-0001 (or successors) is checked against
commitments A1 through A9. A change that violates one is either rejected or
documented as a deliberate revision of this document.
- Every "we could do this faster in native code" idea is filed against
`docs/ASSEMBLY-EXPERIMENT.md`, not bolted onto a workplan.
- Every new producer integration is checked against the commercial
horizon: does it generalise, or does it bake in producer-specific
assumptions?

This document is allowed to be wrong. It is not allowed to be silent.
Update it when the thesis changes; do not let v1 quietly close doors that
the v3 needs open.