# Platform Ambition

Status: draft
Created: 2026-05-15

This document records the longer-horizon thesis behind `artifact-store` and
captures which decisions are taken now to keep that horizon reachable without
expanding the v1 workplan. It sits beside, not above, `INTENT.md` and
`SCOPE.md`. INTENT defines what we build first; this document defines what
the v1 must not foreclose.

## Thesis

Generated artifacts (evidence packages, build outputs, ML models, logs,
snapshots, reports, scorecards, exports) are first-class durable objects in
modern software work. They sit somewhere between source code (well-served by
Git) and binary releases (well-served by OCI registries). The space between
is currently filled by a fragmented mix of bespoke directories, ad-hoc S3
buckets, vendor registries (Artifactory, Nexus), and document-management
systems that were not built for machine producers.

`artifact-store` aims to occupy that gap with one substrate: a generic,
content-addressed, signed, deduplicated, retention-aware artifact registry
and storage gateway that other tools embed or speak to.

The reference points are deliberate. **VLC** and **ffmpeg** lead their
domains not by being the prettiest applications but by being correct, fast,
embeddable, portable, and indispensable infrastructure for everyone else.
The same strategy applies here: build a kernel that is so good at the
bytes-and-identity layer that every artifact-producing tool would rather
speak its protocol than reinvent it.

## Commercial horizon

The longer-horizon commercial target is **a sovereign artifact-storage
product line for European cloud providers**; STACKIT (Schwarz Group) is
the concrete example. The thesis is:

- Hyperscalers (AWS S3, GCS, Azure Blob) sell raw object storage. They do
  not sell *artifact identity, retention, attestation, federation, evidence
  preservation*. Customers either build it themselves or buy proprietary
  registries on top.
- A European hyperscaler that ships a turnkey, sovereign, GDPR-aligned
  artifact substrate on top of its own object storage has a defensible
  differentiation against AWS: not in raw price-per-GB, which is a losing
  race, but in regulated workloads (evidence retention, audit, signed
  attestations, legal hold, sovereign jurisdiction guarantees).
- Open source is the wedge. A widely adopted upstream that the provider
  ships, supports, and extends is far stronger than a proprietary stack.

This is a multi-year horizon, not a v1 deliverable. It is recorded here so
that schema and protocol decisions made now keep that path open.

## Reference points

| System            | What we learn from it |
|-------------------|-----------------------|
| ffmpeg            | Embeddable core, hand-tuned hot paths, runtime CPU dispatch |
| VLC               | Plugin architecture, portability, ubiquity through being a library too |
| Git               | Content-addressed storage, Merkle DAG, pack files, integrity |
| restic            | Single static binary, content-defined chunking (CDC) + dedup, encryption by default |
| IPFS              | Content-addressing, federation, partial replication |
| OCI Registry      | Standardised manifest + blob model with broad ecosystem |
| Sigstore / cosign | Signed attestations as a first-class artifact property |
| MinIO             | Operator ergonomics, S3 wire compatibility as adoption vector |
| SeaweedFS / Ceph  | Separation of metadata plane from data plane |
| RocksDB / LMDB    | Embeddable storage engines with predictable performance |
| Kafka             | Log-as-source-of-truth, materialised views |
| BLAKE3            | Modern hash primitive: parallel, Merkle-tree-native, asm-tuned |

We are not trying to reproduce any of these. We are trying to occupy a
specific gap between them with the best ideas from each.

## Non-goals (still)

The platform ambition does not change the v1 boundary in `INTENT.md` or
`SCOPE.md`. In particular, it does not:

- replace StateHub as the work / decision system of record;
- encode producer-specific assessment semantics in the registry core;
- require any of the optimisations listed in the "Near-horizon technical
  roadmap" section below to land in v1;
- commit the project to writing assembly. The assembly-experiment line
  (`docs/ASSEMBLY-EXPERIMENT.md`) is opt-in research, not roadmap-critical.

## Architectural commitments — preserved by v1

The following decisions are taken now because reversing them later is
expensive. Each lands in v1 as a schema or contract decision; full
exploitation is later work.

### A1. Content as the primary address

The internal canonical key for stored bytes is `<algo>:<digest>`, not a
logical path. Files within a package keep a `relative_path` as logical
metadata, but the storage backend sees and addresses only content hashes.

- Enables: global dedup, Merkle integrity proofs, partial mirrors,
  federation, OCI compatibility.
- v1 cost: one schema column (`content_address`) and a deterministic key
  derivation; no behaviour change.

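
A minimal sketch of the key derivation, assuming SHA-256 and an in-memory backend. `content_address`, `BlobStore`, and the example paths are hypothetical names for illustration, not the project's actual API. It shows two logical paths resolving to one stored blob.

```python
import hashlib


def content_address(data: bytes, algo: str = "sha256") -> str:
    """Derive the storage key `<algo>:<hex digest>` from the bytes alone."""
    return f"{algo}:{hashlib.new(algo, data).hexdigest()}"


class BlobStore:
    """Hypothetical in-memory backend that only ever sees content addresses."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = content_address(data)
        self.blobs.setdefault(key, data)  # identical bytes are stored once
        return key


store = BlobStore()
# The package keeps relative_path as logical metadata pointing at an address.
package_files = {
    "reports/summary.txt": store.put(b"all checks passed\n"),
    "reports/copy-of-summary.txt": store.put(b"all checks passed\n"),
}
```

Because the key depends only on the bytes, renaming or duplicating a file inside a package changes metadata but never touches the backend.
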
### A2. BLAKE3 as native digest, SHA-256 retained for interop

`digest_algorithm` is a column on `artifact_files`. The v1 default may remain
`sha256` to ship the pilot quickly; the column exists so `blake3` can ship
later without a schema migration.

- Enables: faster hashing, a free Merkle root over a package, alignment with
  modern signing tooling.
- v1 cost: the column plus an adapter table mapping algo → hashing impl.

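
One possible shape for the adapter table, sketched under assumptions: `HASHERS` and `digest` are hypothetical names, and BLAKE3 is registered only if the third-party `blake3` package happens to be installed.

```python
import hashlib
from typing import Callable, Dict

# Hypothetical adapter table: digest_algorithm column value -> hashing function.
HASHERS: Dict[str, Callable[[bytes], str]] = {
    "sha256": lambda data: hashlib.sha256(data).hexdigest(),
}

try:
    # Assumption: the third-party `blake3` package is installed; skipped otherwise.
    from blake3 import blake3

    HASHERS["blake3"] = lambda data: blake3(data).hexdigest()
except ImportError:
    pass


def digest(data: bytes, digest_algorithm: str = "sha256") -> str:
    """Hash `data` with the implementation registered for the column value."""
    try:
        hasher = HASHERS[digest_algorithm]
    except KeyError:
        raise ValueError(f"unsupported digest_algorithm: {digest_algorithm!r}") from None
    return hasher(data)
```

Adding an algorithm later is then one new table entry; rows already written keep the algorithm recorded in their own column.
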
### A3. Append-only event log as source of truth

An `events` table with a monotonic sequence number is the authoritative
record of registry mutations. The current metadata tables are a
materialised view rebuildable from the log.

- Enables: change-data-capture (CDC) feeds, audit, replication,
  point-in-time recovery, signed event streams.
- v1 cost: one extra table written in the same transaction as today's
  mutations.

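
A rough sketch of the same-transaction write and the rebuild, using SQLite from the standard library. The table names come from this document; the column layout, the `file_registered` event kind, and both function names are assumptions made for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        seq     INTEGER PRIMARY KEY AUTOINCREMENT,  -- monotonic sequence number
        kind    TEXT NOT NULL,
        payload TEXT NOT NULL
    );
    CREATE TABLE artifact_files (
        relative_path   TEXT PRIMARY KEY,
        content_address TEXT NOT NULL
    );
    """
)

INSERT_FILE = "INSERT INTO artifact_files (relative_path, content_address) VALUES (?, ?)"


def register_file(relative_path: str, content_address: str) -> None:
    """Append the event and update the materialised view in one transaction."""
    payload = json.dumps({"relative_path": relative_path, "content_address": content_address})
    with conn:  # commits both writes together, or rolls both back on error
        conn.execute("INSERT INTO events (kind, payload) VALUES (?, ?)", ("file_registered", payload))
        conn.execute(INSERT_FILE, (relative_path, content_address))


def rebuild_view() -> None:
    """Replay the log in sequence order to recreate the metadata table."""
    with conn:
        conn.execute("DELETE FROM artifact_files")
        rows = conn.execute("SELECT payload FROM events ORDER BY seq").fetchall()
        for (payload,) in rows:
            event = json.loads(payload)
            conn.execute(INSERT_FILE, (event["relative_path"], event["content_address"]))


register_file("a.txt", "sha256:aa")
register_file("b.txt", "sha256:bb")
```

Dropping and replaying `artifact_files` should yield the same rows, which is the property that makes the log, not the view, the source of truth.
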
### A4. Signed manifests, canonicalisation pinned

Manifest serialisation uses a canonical form (recommendation: canonical
CBOR; JCS as alternative) so byte-identical signing is possible across
languages and time. v1 may not actually sign; the pin guarantees that
when signing lands, every prior manifest can be signed over its stored
bytes unchanged.

- Enables: cosign / Sigstore, in-toto, SLSA attestations, OCI-style
  manifest digests.
- v1 cost: pick one canonicalisation library and use it for manifest
  writes. Negligible runtime cost.

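
To illustrate why the pin matters, here is a sketch using sorted-key JSON from the standard library. This only approximates JCS and is not the recommended canonical CBOR; `canonical_bytes`, `manifest_digest`, and the manifest fields are hypothetical.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Serialise with sorted keys and no insignificant whitespace.

    This only approximates JCS (RFC 8785); full JCS also pins number and
    string-escaping rules, so real use would rely on a vetted library.
    """
    return json.dumps(
        manifest, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """Digest of the canonical bytes, i.e. what a signature would cover."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# Same data, different key order, as two producers might emit it.
a = {"name": "run-42", "files": [{"path": "a.txt", "size": 3}]}
b = {"files": [{"size": 3, "path": "a.txt"}], "name": "run-42"}
```

Both dictionaries serialise to identical bytes, so a signature made over one verifies against the other regardless of which language or library produced it.
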
### A5. Control plane / data plane separation at the contract

Even if v1 implements both in one Python process, the boundary between
"registry / API / retention" (control plane) and "hash / chunk / store /
serve" (data plane) is a named contract. When the data plane is later
extracted into a Rust binary, the API does not change.

- Enables: native-speed ingestion, language flexibility on the hot path,
  independent scaling.
- v1 cost: discipline (a separate Python module with no API leakage), not
  code.

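
One way to make the contract explicit inside a single Python process is a `typing.Protocol`. The sketch below is an assumption about shape only; `DataPlane`, `InProcessDataPlane`, `Registry`, and their methods are hypothetical names.

```python
import hashlib
from typing import Protocol


class DataPlane(Protocol):
    """Hypothetical contract: hash / chunk / store / serve, nothing else."""

    def ingest(self, data: bytes) -> str: ...
    def fetch(self, content_address: str) -> bytes: ...


class InProcessDataPlane:
    """v1-style implementation living in the same Python process."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def ingest(self, data: bytes) -> str:
        address = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs[address] = data
        return address

    def fetch(self, content_address: str) -> bytes:
        return self._blobs[content_address]


class Registry:
    """Control plane: depends only on the DataPlane contract."""

    def __init__(self, data_plane: DataPlane) -> None:
        self._data_plane = data_plane
        self.packages: dict[str, list[str]] = {}

    def add_file(self, package: str, data: bytes) -> str:
        address = self._data_plane.ingest(data)
        self.packages.setdefault(package, []).append(address)
        return address


registry = Registry(InProcessDataPlane())
address = registry.add_file("pkg-1", b"hello")
```

A later client that forwards `ingest` and `fetch` to an external binary could be passed to `Registry` without changing the control-plane code.
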
### A6. Resumable upload wire shape

The API exposes upload sessions: `POST /uploads`, `PATCH /uploads/{id}` with
a range, and `POST /uploads/{id}/complete`. The v1 implementation may still
be single-shot multipart under the hood, but the resource shape exists so
chunked / resumable upload is additive.

- Enables: streaming, retry-safe ingestion, very-large-package support.
- v1 cost: route definitions only; underlying logic can remain simple.

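
The session lifecycle behind those three routes could look roughly like this. It is a framework-free sketch: `UploadSessions`, the offset-based range check, and the in-memory buffers are assumptions, not the actual handlers.

```python
import uuid


class UploadSessions:
    """Hypothetical in-memory handlers mirroring the three upload routes."""

    def __init__(self) -> None:
        self._buffers: dict[str, bytearray] = {}

    def create(self) -> str:
        """POST /uploads -> new session id."""
        upload_id = uuid.uuid4().hex
        self._buffers[upload_id] = bytearray()
        return upload_id

    def patch(self, upload_id: str, offset: int, chunk: bytes) -> int:
        """PATCH /uploads/{id} with a range; returns the next expected offset."""
        buffer = self._buffers[upload_id]
        if offset != len(buffer):
            raise ValueError(f"expected offset {len(buffer)}, got {offset}")
        buffer.extend(chunk)
        return len(buffer)

    def complete(self, upload_id: str) -> bytes:
        """POST /uploads/{id}/complete -> finalise and release the session."""
        return bytes(self._buffers.pop(upload_id))


sessions = UploadSessions()
upload_id = sessions.create()
# A resumable client sends ranges in order; a single-shot client sends one PATCH.
sessions.patch(upload_id, 0, b"hello ")
next_offset = sessions.patch(upload_id, 6, b"world")
payload = sessions.complete(upload_id)
```

Rejecting an unexpected offset is what makes a retry safe in this sketch: a client that lost a response can ask again with the offset it believes is next and learn where the session actually stands.
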
### A7. Tiering as a property of storage locations

`storage_location` carries two nullable columns: `retrieval_tier`
(`hot|warm|cold|archive`, default `hot`) and `restore_status`. The API can
already return "not immediately available" without changing artifact
identity.

- Enables: future cold storage, Glacier-style restore flows.
- v1 cost: two nullable columns.

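
A small sketch of how a response might report availability while leaving the content address untouched. The `restore_status` value `"restored"`, the choice to treat `warm` as immediately readable, and the names `StorageLocation` and `availability` are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StorageLocation:
    content_address: str
    retrieval_tier: Optional[str] = "hot"  # hot | warm | cold | archive
    restore_status: Optional[str] = None   # assumed values, e.g. "restored"


def availability(location: StorageLocation) -> dict:
    """Describe readiness without touching the artifact's identity."""
    tier = location.retrieval_tier or "hot"  # NULL is treated as hot
    # Assumption: hot and warm are readable at once; colder tiers need a restore.
    ready = tier in ("hot", "warm") or location.restore_status == "restored"
    return {
        "content_address": location.content_address,
        "tier": tier,
        "available": ready,
    }
```
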
### A8. Schema-typed metadata with open escape hatch

Producers register a metadata schema (JSON Schema) per variant
(e.g. `guide-board.run.v1`). Metadata is stored as open JSON and validated
against the registered schema at ingest time. Queries can use typed views.

- Enables: tooling, search, GraphQL views, typed clients without losing
  flexibility.
- v1 cost: a `metadata_schemas` table; v1 validation can be a no-op.

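
A sketch of the registration and ingest path with validation left as a pluggable no-op, as the v1 cost line allows. `METADATA_SCHEMAS` stands in for the `metadata_schemas` table; the function names and the example schema are hypothetical.

```python
import json
from typing import Callable, Dict

# Hypothetical stand-in for the metadata_schemas table: variant -> JSON Schema.
METADATA_SCHEMAS: Dict[str, dict] = {}


def register_schema(variant: str, schema: dict) -> None:
    METADATA_SCHEMAS[variant] = schema


def noop_validator(metadata: dict, schema: dict) -> None:
    """v1 placeholder: accepts everything; a real validator would raise."""


def ingest_metadata(
    variant: str,
    metadata: dict,
    validate: Callable[[dict, dict], None] = noop_validator,
) -> str:
    """Validate against the registered schema, then store as open JSON."""
    if variant not in METADATA_SCHEMAS:
        raise KeyError(f"no schema registered for variant {variant!r}")
    validate(metadata, METADATA_SCHEMAS[variant])
    return json.dumps(metadata, sort_keys=True)


register_schema("guide-board.run.v1", {"type": "object", "required": ["run_id"]})
stored = ingest_metadata(
    "guide-board.run.v1", {"run_id": "r-1", "extra": {"free": "form"}}
)
```

Assuming the third-party `jsonschema` package is available, its `validate(instance, schema)` function has the same argument order and could be passed as `validate` once enforcement is wanted.
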
### A9. OCI compatibility kept reachable

We do not promise OCI compatibility in v1, but we do not adopt any
data model that prevents it. Concretely: keep content addresses as
`<algo>:<hex>`, keep manifest structure compatible with an OCI image
manifest (config + layers + annotations), and avoid invariants that the
OCI spec forbids.

- Enables: future `oras push` / `cosign sign` / Helm ecosystem entry.
- v1 cost: one design review per schema change against the OCI spec.

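
For orientation, a package manifest that stays within the OCI image manifest shape might look like the sketch below. The `application/vnd.artifact-store.*` media types and the `descriptor` helper are hypothetical placeholders; the top-level field names follow the OCI image manifest.

```python
import hashlib
import json


def descriptor(media_type: str, data: bytes) -> dict:
    """OCI-style content descriptor: media type, digest, size."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }


config = json.dumps({"variant": "guide-board.run.v1"}).encode()
report = b"all checks passed\n"

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    # The artifact-store media types below are hypothetical placeholders.
    "config": descriptor("application/vnd.artifact-store.config.v1+json", config),
    "layers": [descriptor("application/vnd.artifact-store.file.v1", report)],
    "annotations": {"org.opencontainers.image.title": "reports/summary.txt"},
}
```

Each file maps to one layer descriptor whose `digest` is already in the `<algo>:<hex>` form, which is why A1 and A9 reinforce each other.
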
## Near-horizon technical roadmap (post-baseline)

Roughly ordered. Not commitments; planning hooks.

1. **Rust data-plane binary.** Receives chunked uploads, runs BLAKE3 +
   content-defined chunking (CDC) + optional Zstd + optional AES-GCM, and
   writes to the storage adapter. Speaks a minimal gRPC or framed-bincode
   protocol to the Python control plane over a Unix socket.
2. **Content-defined chunking (FastCDC).** Stored chunks become the dedup
   unit. The package manifest references chunk digests; the package digest
   is the Merkle root.
3. **Cosign-compatible signing pipeline.** Every finalised manifest can be
   signed; signatures are stored alongside the manifest.
4. **Event stream out.** NATS or Kafka topic of registry events for
   downstream consumers.
5. **OCI artifact endpoint.** A `/v2/` namespace that speaks the OCI
   distribution spec on top of the same storage.
6. **WASM plugin host.** Producers and operators can ship signed `.wasm`
   modules for content extraction, redaction, scorecard generation,
   custom hashing, indexing. This is the "ffmpeg moment": an open extension
   surface that does not require forking the core.
7. **Federation.** Signed manifest exchange between artifact-store
   instances. Gossip or explicit peering.
8. **Cold tier adapters.** S3 Glacier, tape, infrequent-access (IA) classes.

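
To make item 2 concrete, the sketch below computes a package digest as a Merkle root over chunk digests. It deliberately swaps in fixed-size chunking for FastCDC and SHA-256 for BLAKE3 so that it runs on the standard library alone; `chunk_fixed`, `merkle_root`, `package_digest`, and the `sha256-merkle` label are hypothetical.

```python
import hashlib
from typing import List


def chunk_fixed(data: bytes, size: int = 4) -> List[bytes]:
    """Stand-in for FastCDC: fixed-size chunks keep the sketch short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def merkle_root(leaf_digests: List[bytes]) -> bytes:
    """Pairwise-hash digests upward until one root remains."""
    if not leaf_digests:
        return hashlib.sha256(b"").digest()
    level = leaf_digests
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # duplicate the last node on odd levels
        level = [
            hashlib.sha256(level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def package_digest(data: bytes) -> str:
    """Hash each chunk, then fold the chunk digests into one root."""
    leaves = [hashlib.sha256(chunk).digest() for chunk in chunk_fixed(data)]
    return "sha256-merkle:" + merkle_root(leaves).hex()
```

A production tree would also need domain separation between leaf and interior hashes, and a safer rule for odd levels than duplicating the last node; BLAKE3's built-in tree mode addresses both.
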
## How this document is used

- Every schema change in WP-0001 (or successors) is checked against
  commitments A1–A9. A change that violates one is either rejected or
  documented as a deliberate revision of this document.
- Every "we could do this faster in native code" idea is filed against
  `docs/ASSEMBLY-EXPERIMENT.md`, not bolted onto a workplan.
- Every new producer integration is checked against the commercial
  horizon: does it generalise, or does it bake in producer-specific
  assumptions?

This document is allowed to be wrong. It is not allowed to be silent.
Update it when the thesis changes; do not let v1 quietly close doors that
v3 needs open.