docs: add platform ambition, blueprint review, and assembly experiment

Captures the longer-horizon thesis (sovereign-cloud artifact substrate)
alongside the carefully-scoped v1 INTENT. PLATFORM-AMBITION records nine
schema/contract commitments the v1 must preserve to keep that horizon
reachable. ASSEMBLY-EXPERIMENT frames an opt-in research line on
ffmpeg-grade hand-tuned asm with an MIT-0 vs LGPL-aware reuse map.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 20:56:01 +02:00
parent 793c0c7ba5
commit 403d903585
4 changed files with 616 additions and 3 deletions

docs/PLATFORM-AMBITION.md

@@ -0,0 +1,226 @@
# Platform Ambition
Status: draft
Created: 2026-05-15
This document records the longer-horizon thesis behind `artifact-store` and
captures which decisions are taken now to keep that horizon reachable without
expanding the v1 workplan. It sits beside, not above, `INTENT.md` and
`SCOPE.md`. INTENT defines what we build first; this document defines what
the v1 must not foreclose.
## Thesis
Generated artifacts — evidence packages, build outputs, ML models, logs,
snapshots, reports, scorecards, exports — are first-class durable objects in
modern software work. They sit somewhere between source code (well-served by
Git) and binary releases (well-served by OCI registries). The space between
is currently filled by a fragmented mix of bespoke directories, ad-hoc S3
buckets, vendor registries (Artifactory, Nexus), and document-management
systems that were not built for machine producers.
`artifact-store` aims to occupy that gap with one substrate: a generic,
content-addressed, signed, deduplicated, retention-aware artifact registry
and storage gateway that other tools embed or speak to.
The reference points are deliberate. **VLC** and **ffmpeg** lead their domain
not by being the prettiest applications but by being correct, fast, embeddable,
portable, and indispensable infrastructure for everyone else. The same
strategy applies here: build a kernel that is so good at the bytes-and-
identity layer that every artifact-producing tool would rather speak its
protocol than reinvent it.
## Commercial horizon
The longer-horizon commercial target is **a sovereign artifact-storage
product line for European cloud providers**; STACKIT (Schwarz Group) is
the concrete example. The thesis is:
- Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do
not sell *artifact identity, retention, attestation, federation, evidence
preservation*. Customers either build it themselves or buy proprietary
registries on top.
- A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned
artifact substrate on top of its own object storage has a defensible
differentiation against AWS — not in raw price-per-GB, which is a losing
race, but in regulated workloads (evidence retention, audit, signed
attestations, legal-hold, sovereign jurisdiction guarantees).
- Open source is the wedge. A widely-adopted upstream that the provider
ships, supports, and extends is far stronger than a proprietary stack.
This is a multi-year horizon, not a v1 deliverable. It is recorded here so
schema and protocol decisions made now keep that path open.
## Reference points
| System | What we learn from it |
|-----------------|--------------------------------------------------------------|
| ffmpeg | Embeddable core, hand-tuned hot paths, runtime CPU dispatch |
| VLC | Plugin architecture, portability, ubiquity through being a library too |
| Git | Content-addressed storage, Merkle DAG, pack files, integrity |
| restic | Single static binary, CDC + dedup, encryption by default |
| IPFS | Content-addressing, federation, partial replication |
| OCI Registry | Standardised manifest + blob model with broad ecosystem |
| Sigstore / cosign | Signed attestations as a first-class artifact property |
| MinIO | Operator ergonomics, S3 wire compat as adoption vector |
| SeaweedFS / Ceph | Separation of metadata plane from data plane |
| RocksDB / LMDB | Embeddable storage engines with predictable performance |
| Kafka | Log-as-source-of-truth, materialised views |
| BLAKE3 | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned |
We are not trying to reproduce any of these. We are trying to occupy a
specific gap between them with the best ideas from each.
## Non-goals (still)
The platform ambition does not change the v1 boundary in `INTENT.md` or
`SCOPE.md`. In particular it does not:
- replace StateHub as the work / decision system of record;
- encode producer-specific assessment semantics in the registry core;
- require any of the optimisations listed in the "near-horizon" section
below to land in v1;
- commit the project to writing assembly. The assembly-experiment line
(`docs/ASSEMBLY-EXPERIMENT.md`) is opt-in research, not roadmap-critical.
## Architectural commitments — preserved by v1
The following decisions are taken now because reversing them later is
expensive. Each lands in v1 as a schema or contract decision; full
exploitation is later work.
### A1. Content as the primary address
The internal canonical key for stored bytes is `<algo>:<digest>`, not a
logical path. Files within a package keep a `relative_path` as logical
metadata, but the storage backend sees and addresses only content hashes.
- Enables: global dedup, Merkle integrity proofs, partial mirrors,
federation, OCI compatibility.
- v1 cost: one schema column (`content_address`) and a deterministic key
derivation; no behaviour change.
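
A minimal sketch of the key derivation described above, assuming SHA-256 and a hypothetical helper named `content_address` (the function name and the example paths are illustrative, not part of the project):

```python
import hashlib


def content_address(data: bytes, algo: str = "sha256") -> str:
    """Derive the canonical storage key `<algo>:<digest>` for a byte string."""
    digest = hashlib.new(algo, data).hexdigest()
    return f"{algo}:{digest}"


# Two files with different logical paths but identical bytes map to one key,
# which is what makes global dedup possible.
files = {
    "reports/summary.txt": b"hello",
    "exports/copy-of-summary.txt": b"hello",
}
keys = {path: content_address(body) for path, body in files.items()}
```

The `relative_path` stays in metadata; only the derived key would be handed to the storage backend.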
### A2. BLAKE3 as native digest, SHA-256 retained for interop
`digest_algorithm` is a column on `artifact_files`. v1 default may remain
`sha256` to ship the pilot quickly; the column exists so `blake3` can ship
without migration.
- Enables: faster hashing, free Merkle root over a package, alignment with
modern signing tooling.
- v1 cost: column + adapter table mapping algo → hashing impl.
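
One possible shape for the adapter table, as a sketch only. The names `HASHERS` and `hash_file_bytes` are hypothetical, and BLAKE3 support assumes the third-party `blake3` package is installed; without it, only `sha256` is registered:

```python
import hashlib
from typing import Callable, Dict

# Adapter table: digest_algorithm value -> function returning a hex digest.
HASHERS: Dict[str, Callable[[bytes], str]] = {
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
}

try:
    # Assumption: the third-party `blake3` package is available.
    import blake3

    HASHERS["blake3"] = lambda data: blake3.blake3(data).hexdigest()
except ImportError:
    pass  # BLAKE3 stays unregistered; sha256 remains the only option.


def hash_file_bytes(data: bytes, digest_algorithm: str = "sha256") -> str:
    """Hash `data` with the implementation registered for `digest_algorithm`."""
    try:
        hasher = HASHERS[digest_algorithm]
    except KeyError:
        raise ValueError(f"unsupported digest_algorithm: {digest_algorithm}") from None
    return hasher(data)
```

Because the algorithm is looked up per row, adding `blake3` later is a new table entry rather than a schema migration.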
### A3. Append-only event log as source of truth
An `events` table with a monotonic sequence number is the authoritative
record of registry mutations. The current metadata tables are a
materialised view rebuildable from the log.
- Enables: CDC feeds, audit, replication, point-in-time recovery, signed
event streams.
- v1 cost: one extra table written on the same transaction as today's
mutations.
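
The following sketch illustrates the idea with SQLite: the event row and the view row are written in one transaction, and the view can be rebuilt from the log. The table layout, the `file_registered` event kind, and the function names are assumptions made for illustration, not the project's schema:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sequence number
    kind    TEXT NOT NULL,
    payload TEXT NOT NULL                       -- JSON-encoded event body
);
CREATE TABLE artifact_files (                   -- materialised view
    content_address TEXT NOT NULL,
    relative_path   TEXT NOT NULL
);
""")


def register_file(conn, content_address, relative_path):
    """Append the event and update the view in a single transaction."""
    payload = json.dumps(
        {"content_address": content_address, "relative_path": relative_path}
    )
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)",
            ("file_registered", payload),
        )
        conn.execute(
            "INSERT INTO artifact_files VALUES (?, ?)",
            (content_address, relative_path),
        )


def rebuild_view(conn):
    """Drop the view contents and replay the log in sequence order."""
    with conn:
        rows = conn.execute(
            "SELECT kind, payload FROM events ORDER BY seq"
        ).fetchall()
        conn.execute("DELETE FROM artifact_files")
        for kind, payload in rows:
            if kind == "file_registered":
                body = json.loads(payload)
                conn.execute(
                    "INSERT INTO artifact_files VALUES (?, ?)",
                    (body["content_address"], body["relative_path"]),
                )
```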
### A4. Signed manifests, canonicalisation pinned
Manifest serialisation uses a canonical form (recommendation: canonical
CBOR; JCS as alternative) so byte-identical signing is possible across
languages and time. v1 may not actually sign — the pin guarantees that
when signing lands, every prior manifest is re-signable byte-for-byte.
- Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style
manifest digests.
- v1 cost: pick one canonicalisation library and use it for manifest
  writes. The added runtime cost is negligible.
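
To show why a pinned canonical form matters, here is a rough sketch using sorted-key compact JSON from the standard library. This is an approximation for illustration only: it is neither canonical CBOR nor a complete JCS (RFC 8785) implementation, and it is only safe for manifests built from ASCII keys, strings, integers, booleans, lists, and dicts. The function names are hypothetical:

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialise with sorted keys and no insignificant whitespace, as UTF-8."""
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """Digest of the canonical bytes; this is what a signature would cover."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# The same logical manifest, built with different key orders.
a = {"name": "run-1", "files": [{"path": "a.txt", "size": 3}]}
b = {"files": [{"size": 3, "path": "a.txt"}], "name": "run-1"}
```

Both `a` and `b` serialise to the same bytes, so a signature made over one would verify against the other.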
### A5. Control plane / data plane separation at the contract
Even if v1 implements both in one Python process, the boundary between
"registry / API / retention" (control plane) and "hash / chunk / store /
serve" (data plane) is a named contract. When the data plane is later
extracted into a Rust binary, the API does not change.
- Enables: native-speed ingestion, language flexibility on the hot path,
independent scaling.
- v1 cost: discipline (separate Python module with no API leakage), not
code.
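
One way to make the boundary explicit inside a single Python process is a typed protocol that the control plane depends on. This is a sketch under assumptions: `DataPlane`, `InProcessDataPlane`, and `Registry` are hypothetical names, and the two-method contract is deliberately smaller than a real one would be:

```python
import hashlib
from typing import Dict, Protocol


class DataPlane(Protocol):
    """The named contract: hash, store, and serve bytes by content address."""

    def put(self, data: bytes) -> str: ...

    def get(self, content_address: str) -> bytes: ...


class InProcessDataPlane:
    """v1-style implementation living in the same process."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        return key

    def get(self, content_address: str) -> bytes:
        return self._blobs[content_address]


class Registry:
    """Control plane: knows paths and metadata, never touches blob storage."""

    def __init__(self, data_plane: DataPlane) -> None:
        self._data_plane = data_plane
        self._paths: Dict[str, str] = {}

    def ingest(self, relative_path: str, data: bytes) -> str:
        address = self._data_plane.put(data)
        self._paths[relative_path] = address
        return address

    def read(self, relative_path: str) -> bytes:
        return self._data_plane.get(self._paths[relative_path])
```

A later out-of-process implementation would only need to satisfy `DataPlane`; `Registry` would not change.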
### A6. Resumable upload wire shape
The API exposes upload sessions: `POST /uploads`, `PATCH /uploads/{id}` with
a byte range, and `POST /uploads/{id}/complete`. The v1 implementation may
still be single-shot multipart under the hood, but the resource shape exists
so chunked / resumable upload is additive.
- Enables: streaming, retry-safe ingestion, very-large-package support.
- v1 cost: route definitions only; underlying logic can remain simple.
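
The session lifecycle behind those three routes can be modelled without any web framework. The sketch below is an in-memory illustration; the class name `UploadSessions`, the offset check, and the use of `uuid` for session ids are assumptions rather than the project's design:

```python
import uuid
from typing import Dict


class UploadSessions:
    """In-memory model of the three upload routes."""

    def __init__(self) -> None:
        self._buffers: Dict[str, bytearray] = {}

    def create(self) -> str:
        """POST /uploads: open a session and return its id."""
        upload_id = uuid.uuid4().hex
        self._buffers[upload_id] = bytearray()
        return upload_id

    def received(self, upload_id: str) -> int:
        """Bytes accepted so far; a client would resume from this offset."""
        return len(self._buffers[upload_id])

    def append(self, upload_id: str, offset: int, chunk: bytes) -> int:
        """PATCH /uploads/{id}: accept a chunk only at the expected offset."""
        buf = self._buffers[upload_id]
        if offset != len(buf):
            raise ValueError(f"expected offset {len(buf)}, got {offset}")
        buf.extend(chunk)
        return len(buf)

    def complete(self, upload_id: str) -> bytes:
        """POST /uploads/{id}/complete: close the session, return the bytes."""
        return bytes(self._buffers.pop(upload_id))
```

Rejecting a chunk at an unexpected offset is what would make a retried `PATCH` safe: a duplicate send fails instead of silently appending twice.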
### A7. Tiering as a property of storage locations
`storage_location` carries two nullable columns: `retrieval_tier`
(`hot|warm|cold|archive`, default `hot`) and `restore_status`. The API can
already return "not immediately available" without changing artifact identity.
- Enables: future cold storage, Glacier-style restore flows.
- v1 cost: two nullable columns.
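
A sketch of the two columns and an availability check, using SQLite. Apart from `storage_location`, `retrieval_tier`, and `restore_status`, everything here is assumed for illustration: the other columns, the `restored` status value, and the rule that `hot` and `warm` count as immediately available:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE storage_location (
    content_address TEXT NOT NULL,
    backend         TEXT NOT NULL,
    retrieval_tier  TEXT DEFAULT 'hot'
        CHECK (retrieval_tier IN ('hot', 'warm', 'cold', 'archive')),
    restore_status  TEXT
)
""")

# A row that relies on the default tier, and one parked in the archive tier.
conn.execute(
    "INSERT INTO storage_location (content_address, backend) VALUES (?, ?)",
    ("sha256:aa", "local"),
)
conn.execute(
    "INSERT INTO storage_location VALUES (?, ?, ?, ?)",
    ("sha256:bb", "s3", "archive", None),
)


def is_immediately_available(conn, content_address: str) -> bool:
    """True if the bytes can be served without a restore step."""
    row = conn.execute(
        "SELECT retrieval_tier, restore_status FROM storage_location "
        "WHERE content_address = ?",
        (content_address,),
    ).fetchone()
    if row is None:
        raise KeyError(content_address)
    tier, restore_status = row
    if tier in (None, "hot", "warm"):
        return True
    # Assumed convention: cold/archive content is served only once restored.
    return restore_status == "restored"
```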
### A8. Schema-typed metadata with open escape hatch
Producers register a metadata schema (JSON Schema) per variant
(e.g. `guide-board.run.v1`). Metadata is stored as open JSON and validated
against the registered schema at ingest time. Queries can use typed views.
- Enables: tooling, search, GraphQL views, typed clients without losing
flexibility.
- v1 cost: a `metadata_schemas` table; v1 validation can be a no-op.
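
A sketch of how a schema registry with a pluggable validator might look, so that v1 can ship with a no-op and tighten later. `MetadataSchemas`, `noop_validator`, and `required_keys_validator` are hypothetical names; the strict validator below only checks the `required` keyword and is not a JSON Schema implementation:

```python
from typing import Any, Callable, Dict

Validator = Callable[[Dict[str, Any], Dict[str, Any]], None]


def noop_validator(schema: Dict[str, Any], metadata: Dict[str, Any]) -> None:
    """v1 placeholder: accept everything."""
    return None


def required_keys_validator(schema: Dict[str, Any], metadata: Dict[str, Any]) -> None:
    """Toy stand-in for real validation: enforce only the `required` list."""
    missing = [key for key in schema.get("required", []) if key not in metadata]
    if missing:
        raise ValueError(f"missing required metadata keys: {missing}")


class MetadataSchemas:
    """In-memory model of a `metadata_schemas` table keyed by variant."""

    def __init__(self, validator: Validator = noop_validator) -> None:
        self._schemas: Dict[str, Dict[str, Any]] = {}
        self._validator = validator

    def register(self, variant: str, schema: Dict[str, Any]) -> None:
        self._schemas[variant] = schema

    def check(self, variant: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
        """Validate at ingest time; the metadata itself stays open JSON."""
        if variant not in self._schemas:
            raise KeyError(f"no schema registered for variant {variant!r}")
        self._validator(self._schemas[variant], metadata)
        return metadata
```

Swapping the validator changes enforcement without touching stored metadata or the registration path.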
### A9. OCI compatibility kept reachable
We do not promise OCI compatibility in v1, but we do not adopt any
data model that prevents it. Concretely: keep content addresses as
`<algo>:<hex>`, keep manifest structure compatible with an OCI image
manifest (config + layers + annotations), and avoid invariants that the
OCI spec forbids.
- Enables: future `oras push` / `cosign sign` / Helm ecosystem entry.
- v1 cost: one design review per schema change against the OCI spec.
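
For orientation, the sketch below builds a package manifest in the shape of an OCI image manifest (config, layers, annotations, each blob described by media type, digest, and size). The helper names and the `application/vnd.example...` config media type are placeholders invented for this example, not registered types or project decisions:

```python
import hashlib
from typing import Dict, List


def descriptor(media_type: str, data: bytes) -> Dict[str, object]:
    """OCI-style content descriptor: media type, `<algo>:<hex>` digest, size."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }


def oci_shaped_manifest(
    config: bytes, files: List[bytes], annotations: Dict[str, str]
) -> Dict[str, object]:
    """Arrange package contents as config + layers + annotations."""
    return {
        "schemaVersion": 2,
        "mediaType": "application/vnd.oci.image.manifest.v1+json",
        # Placeholder media type, chosen only for this illustration.
        "config": descriptor("application/vnd.example.artifact.config.v1+json", config),
        "layers": [descriptor("application/octet-stream", body) for body in files],
        "annotations": annotations,
    }


manifest = oci_shaped_manifest(
    config=b"{}",
    files=[b"first file", b"second file"],
    annotations={"org.opencontainers.image.title": "example-package"},
)
```

Keeping each file as a layer descriptor means the content addresses from A1 map directly onto the digest field.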
## Near-horizon technical roadmap (post-baseline)
Roughly ordered. Not commitments; planning hooks.
1. **Rust data-plane binary.** Receives chunked uploads, runs BLAKE3 + CDC +
optional Zstd + optional AES-GCM, writes to storage adapter. Speaks a
minimal gRPC or framed-bincode protocol to the Python control plane over
a Unix socket.
2. **Content-defined chunking (FastCDC).** Stored chunks become the dedup
unit. Package manifest references chunk digests; package digest is the
Merkle root.
3. **Cosign-compatible signing pipeline.** Every finalised manifest can be
signed; signatures stored alongside the manifest.
4. **Event stream out.** NATS or Kafka topic of registry events for
downstream consumers.
5. **OCI artifact endpoint.** A `/v2/` namespace that speaks the OCI
distribution spec on top of the same storage.
6. **WASM plugin host.** Producers and operators can ship signed `.wasm`
modules for content extraction, redaction, scorecard generation,
custom hashing, indexing. This is the "ffmpeg moment" — open extension
surface that does not require forking the core.
7. **Federation.** Signed manifest exchange between artifact-store
instances. Gossip or explicit peering.
8. **Cold tier adapters.** S3 Glacier, tape, and infrequent-access (IA)
   storage classes.
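
Item 2 above makes the package digest a Merkle root over chunk digests. The sketch below shows that idea with a plain binary tree over SHA-256, assuming the chunks are already given. It is a stand-in for illustration: it does not implement FastCDC chunk boundaries and it is not BLAKE3's internal tree mode, and the domain-separation prefixes and function names are assumptions:

```python
import hashlib
from typing import List


def leaf_hash(chunk: bytes) -> bytes:
    """Digest of one chunk; the 0x00 prefix separates leaves from inner nodes."""
    return hashlib.sha256(b"\x00" + chunk).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Digest of an inner node built from its two children."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(chunks: List[bytes]) -> str:
    """Hex Merkle root over the chunk list; an unpaired node is promoted as-is."""
    if not chunks:
        raise ValueError("a package needs at least one chunk")
    level = [leaf_hash(chunk) for chunk in chunks]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(node_hash(level[i], level[i + 1]))
            else:
                next_level.append(level[i])
        level = next_level
    return level[0].hex()


package_digest = merkle_root([b"chunk-a", b"chunk-b", b"chunk-c"])
```

Changing any single chunk changes the root, while unchanged chunks keep their leaf digests and can still be deduplicated.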
## How this document is used
- Every schema change in WP-0001 (or successors) is checked against
  commitments A1 through A9. A change that violates one is either rejected or
documented as a deliberate revision of this document.
- Every "we could do this faster in native code" idea is filed against
`docs/ASSEMBLY-EXPERIMENT.md`, not bolted onto a workplan.
- Every new producer integration is checked against the commercial
horizon: does it generalise, or does it bake in producer-specific
assumptions?
This document is allowed to be wrong. It is not allowed to be silent.
Update it when the thesis changes; do not let v1 quietly close doors that
the v3 needs open.