Architecture Blueprint
Status: accepted (v2; supersedes 2026-05-15 draft)
Updated: 2026-05-15
This document operationalises INTENT.md, the docs/PLATFORM-AMBITION.md
thesis, and the decisions recorded in docs/adr/. Where a tension exists
between this blueprint and an ADR, the ADR wins; raise an issue or
supersede the ADR.
Architecture in one paragraph
artifact-store is a library-first artifact registry and storage
gateway. A small core library (artifactstore) implements identity,
manifests, retention, the storage adapter SPI, the data plane SPI, and
the registry orchestrator. The HTTP server and the CLI are thin
consumers of that library. Bytes are addressed by content
(blake3:<hex>) and stored through a pluggable adapter SPI. State is
authoritative in an append-only event log; queryable tables are
materialised views.
Design lineage
The shape is deliberately borrowed from ffmpeg and VLC: a tight
core of well-named modules with stable contracts, runtime-pluggable
backends, a thin orchestration binary, and an explicit hot-path
boundary that can be rewritten in faster code without changing the
consumer API. See docs/PLATFORM-AMBITION.md for the reference table.
Top-level shape
          producers / operators / agents
                       |
                       v
          +------------------------+
          |   HTTP API   |   CLI   |   <-- thin consumers
          +------------------------+
                       |
                       v
          +------------------------+
          | registry orchestrator  |
          +------------------------+
             |         |         |
             v         v         v
      +----------+ +---------+ +---------+
      | identity | | events  | |retention|
      |/manifest | | (log +  | | policy  |
      |          | | views)  | | engine  |
      +----------+ +---------+ +---------+
                       |
                       v
          +-----------------------+
          |    data plane SPI     |   <-- ADR-0004 contract
          +-----------------------+
                       |
                       v
          +-----------------------+
          |  storage adapter SPI  |
          +-----------------------+
             |         |        |
             v         v        v
          +-----+  +------+  +-------+
          |local|  |  S3  |  | Ceph  |   ... future backends
          | FS  |  | RGW  |  |  RGW  |
          +-----+  +------+  +-------+
Core modules
Mapped one-to-one to ADR-0005's project layout. Each module has a stable public surface; internals are free to evolve.
identity
- Digest(algorithm, hex): value object.
- ContentAddress: <algorithm>:<hex> (ADR-0001).
- digest_stream(reader) -> {primary, sha256}: single-pass dual digest.
- Algorithm registry: blake3 (default primary), sha256 (always computed).
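The single-pass dual digest can be illustrated with a short sketch. It is not the project's implementation: BLAKE3 is not in the Python standard library, so `hashlib.blake2b` stands in for the primary algorithm here, and the `content_address` property and `chunk_size` parameter are assumptions made for the example.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    @property
    def content_address(self) -> str:
        # ContentAddress form: <algorithm>:<hex>
        return f"{self.algorithm}:{self.hex}"


def digest_stream(reader: BinaryIO, chunk_size: int = 64 * 1024) -> dict[str, Digest]:
    """Read the stream once, feeding every chunk to both hashers."""
    # ASSUMPTION: blake2b is a stand-in; a real build would plug in BLAKE3.
    primary = hashlib.blake2b()
    secondary = hashlib.sha256()
    while chunk := reader.read(chunk_size):
        primary.update(chunk)
        secondary.update(chunk)
    return {
        "primary": Digest("blake2b", primary.hexdigest()),
        "sha256": Digest("sha256", secondary.hexdigest()),
    }


digests = digest_stream(io.BytesIO(b"hello artifact-store"))
print(digests["sha256"].content_address)
```

Because both hashers see each chunk as it is read, the bytes never need to be buffered in full or read a second time.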
manifest
- Manifest: versioned dataclass: package metadata + ordered file list + retention summary + provenance + storage receipts.
- manifest.codec.encode(manifest) -> bytes: canonical CBOR (ADR-0003).
- manifest.codec.decode(bytes) -> Manifest.
- manifest.projection.jcs(manifest) -> bytes: canonical-JSON projection for display and signing-tool interop.
- Round-trip invariant: decode(encode(m)) == m and encode(decode(jcs_to_cbor(jcs(m)))) == encode(m).
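The first round-trip invariant lends itself to a property-style check. The sketch below uses a deterministic JSON encoding (sorted keys, no whitespace) purely as a stand-in for canonical CBOR, and a three-field `Manifest` that is far smaller than the real one. It covers `decode(encode(m)) == m` and byte-stable re-encoding only, not the JCS projection path.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class Manifest:
    # Hypothetical, reduced field set for illustration only.
    manifest_version: int
    name: str
    files: tuple[str, ...] = field(default_factory=tuple)


def encode(m: Manifest) -> bytes:
    # ASSUMPTION: sorted-key compact JSON stands in for canonical CBOR.
    return json.dumps(asdict(m), sort_keys=True, separators=(",", ":")).encode()


def decode(raw: bytes) -> Manifest:
    doc = json.loads(raw)
    # File order is significant, so it is preserved as given.
    return Manifest(doc["manifest_version"], doc["name"], tuple(doc["files"]))


def check_round_trip(m: Manifest) -> bool:
    # decode(encode(m)) == m, and re-encoding the decoded value is byte-stable.
    return decode(encode(m)) == m and encode(decode(encode(m))) == encode(m)


print(check_round_trip(Manifest(1, "demo", ("b.txt", "a.txt"))))
```

A check of this shape would normally run over generated manifests in the test suite, so that any non-deterministic field ordering in the codec surfaces early.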
events
- events.write(transaction, event): appends one row with monotonic sequence (ADR-0002).
- events.tail(since_sequence) -> AsyncIterator[Event]: long-poll.
- events.replay(into=ViewWriter): rebuild materialised views.
- Event types (v1): v1.package.created, v1.file.ingested, v1.package.finalized, v1.retention.default_applied, v1.retention.extended, v1.retention.hold_applied, v1.retention.hold_released, v1.retention.deletion_eligible, v1.storage.location_recorded, v1.storage.location_verified, v1.audit.access, v1.system.note.
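The relationship between the log and its views can be sketched in memory. Everything here is illustrative: `InMemoryEventLog`, `PackageStatusView`, and the synchronous `since` method are hypothetical names, and the real module is described as transactional and async.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


class PackageStatusView:
    """Hypothetical materialised view: package id -> status."""

    def __init__(self) -> None:
        self.status: dict[str, str] = {}
        self.last_event_sequence = 0

    def reset(self) -> None:
        self.status.clear()
        self.last_event_sequence = 0

    def apply(self, event: Event) -> None:
        if event.event_type == "v1.package.created":
            self.status[event.subject_id] = "created"
        elif event.event_type == "v1.package.finalized":
            self.status[event.subject_id] = "finalized"
        # Every event advances the bookkeeping, even ones the view ignores.
        self.last_event_sequence = event.sequence


class InMemoryEventLog:
    """Append-only list; sequence numbers start at 1 and never repeat."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def write(self, event_type: str, subject_id: str, payload: dict) -> Event:
        event = Event(len(self._events) + 1, event_type, subject_id, payload)
        self._events.append(event)
        return event

    def since(self, sequence: int) -> Iterator[Event]:
        # Index i holds sequence i + 1, so this yields everything after `sequence`.
        return iter(self._events[sequence:])

    def replay(self, into: PackageStatusView) -> None:
        # Rebuild the view from scratch; the log itself is never modified.
        into.reset()
        for event in self._events:
            into.apply(event)
```

The view holds no information that cannot be regenerated, which is what makes dropping and rebuilding it safe.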
retention
- retention.classes: transient, raw-evidence, summary-evidence, release-evidence, permanent-record. Defined as data, not code.
- retention.policy.apply(package, class) -> RetentionDecision: computes expires_at and the deletion eligibility rule.
- retention.extend(package, until, reason, actor): emits an event; the materialised view updates on commit.
- retention.hold(package, reason, actor) / retention.release_hold(hold_id, actor).
audit
- A view over events filtered to access and lifecycle events. No separate write path; auditing happens by event emission elsewhere.
storage (adapter SPI)
class StorageBackend(Protocol):
    backend_id: str

    async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
    async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
    async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
    async def health(self) -> BackendStatus: ...
- Backend registry: backends register at import time; selection is per-package by configuration.
- v1 ships local (filesystem); s3 ships in WP-0004.
dataplane (SPI per ADR-0004)
class DataPlane(Protocol):
    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
    async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
    async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...
- v1 implementation: dataplane.inproc, which wraps a StorageBackend and computes digests during streaming.
- Future implementation: dataplane.remote, a gRPC or framed-bincode-over-Unix-socket client to a Rust daemon.
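One way an in-process data plane could compute digests during streaming is a pass-through async generator that updates hashers as chunks flow by. The sketch below makes several assumptions: a plain dict stands in for the wrapped backend, `hashlib.blake2b` stands in for BLAKE3, the bytes are spooled in memory instead of a temp file, and the returned dict is a simplified `IngestResult`.

```python
import asyncio
import hashlib
from typing import AsyncIterator


async def hashing_tee(stream: AsyncIterator[bytes], hashers: list) -> AsyncIterator[bytes]:
    """Pass chunks through unchanged while updating every hasher."""
    async for chunk in stream:
        for hasher in hashers:
            hasher.update(chunk)
        yield chunk


async def ingest_stream(stream: AsyncIterator[bytes], store: dict[str, bytes]) -> dict:
    # ASSUMPTION: blake2b is a stand-in for the BLAKE3 primary digest.
    primary, sha256 = hashlib.blake2b(), hashlib.sha256()
    spooled = bytearray()
    async for chunk in hashing_tee(stream, [primary, sha256]):
        spooled += chunk  # a real data plane would likely spool to a temp file
    # The address is only known once the stream ends, so the store write comes last.
    address = f"blake2b:{primary.hexdigest()}"
    store.setdefault(address, bytes(spooled))
    return {"content_address": address, "sha256": sha256.hexdigest(),
            "size_bytes": len(spooled)}


async def _source() -> AsyncIterator[bytes]:
    yield b"ab"
    yield b"cd"


store: dict[str, bytes] = {}
print(asyncio.run(ingest_stream(_source(), store))["size_bytes"])
```

The ordering matters: because the content address depends on every byte, the object can only be placed at its final key after the last chunk has been hashed.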
registry
The orchestrator. Combines identity + manifest + events + retention + dataplane into the operations the HTTP API and CLI consume:
create_package, ingest_file, finalize_package, get_manifest,
download_file, extend_retention, apply_hold, release_hold,
mark_deletion_eligible, tail_events. Each operation is one DB
transaction that writes one or more events and updates materialised
views.
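The "one transaction per operation" rule can be sketched with the standard library's `sqlite3` as a stand-in for PostgreSQL. The table definitions are trimmed to a few columns and the `create_package` signature is an assumption. The point is only that the event row and the view row commit or roll back together.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE events (
            sequence INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            subject_id TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        CREATE TABLE artifact_packages (
            id TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            last_event_sequence INTEGER NOT NULL
        );
        """
    )
    return db


def create_package(db: sqlite3.Connection, package_id: str, name: str) -> int:
    # `with db` commits on success and rolls back if anything raises,
    # so the event and the view update land together or not at all.
    with db:
        cur = db.execute(
            "INSERT INTO events (event_type, subject_id, payload) VALUES (?, ?, ?)",
            ("v1.package.created", package_id, json.dumps({"name": name})),
        )
        sequence = cur.lastrowid
        db.execute(
            "INSERT INTO artifact_packages (id, status, last_event_sequence) "
            "VALUES (?, 'created', ?)",
            (package_id, sequence),
        )
    return sequence


db = open_db()
print(create_package(db, "p1", "demo"))
```

If the view update fails (for example on a duplicate package id), the event insert is rolled back as well, so the log never records an operation the views did not absorb.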
api.http and cli
Thin. Their job is to translate transport (HTTP / argv) into calls on
registry. No business logic.
Data model
All tables exist as materialised views over events (ADR-0002),
except events itself, retention_classes (seed data), and
metadata_schemas (config).
events (source of truth)
| Column | Type | Notes |
|---|---|---|
| sequence | BIGSERIAL PRIMARY KEY | monotonic, gapless |
| created_at | TIMESTAMPTZ NOT NULL | UTC, set by DB default |
| event_type | TEXT NOT NULL | versioned slug (v1.…) |
| subject_kind | TEXT NOT NULL | package / file / retention / storage / system |
| subject_id | UUID NULL | |
| actor | TEXT NOT NULL | producer or operator identity |
| payload | BYTEA NOT NULL | canonical CBOR |
| payload_digest | BYTEA NOT NULL | BLAKE3 of payload |
Indexes: (subject_kind, subject_id), (event_type, sequence).
artifact_packages (materialised view)
| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| name | TEXT NOT NULL | |
| producer | TEXT NOT NULL | |
| subject | TEXT NOT NULL | |
| retention_class | TEXT NOT NULL | FK to retention_classes |
| metadata_schema_id | UUID NULL | FK to metadata_schemas |
| metadata | JSONB NOT NULL | validated against schema if present |
| status | TEXT NOT NULL | created / uploading / finalized / deletion_eligible / deleted / failed |
| manifest_digest | BYTEA NULL | populated on finalize |
| created_at, finalized_at, expires_at | TIMESTAMPTZ | |
| last_event_sequence | BIGINT NOT NULL | for replay bookkeeping |
artifact_files (materialised view)
| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| package_id | UUID NOT NULL | FK |
| relative_path | TEXT NOT NULL | logical path; unique within package |
| media_type | TEXT NOT NULL | required (ADR-0006) |
| size_bytes | BIGINT NOT NULL | |
| digest_algorithm | TEXT NOT NULL | blake3 by default (ADR-0001) |
| digest_primary | BYTEA NOT NULL | bytes of the primary digest |
| digest_sha256 | BYTEA NOT NULL | always populated for interop |
| created_at | TIMESTAMPTZ NOT NULL | |
storage_locations (materialised view)
| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| artifact_file_id | UUID NOT NULL | FK |
| backend_id | TEXT NOT NULL | |
| content_address | TEXT NOT NULL | <algo>:<hex> |
| object_key | TEXT NOT NULL | backend-specific, usually derived from content_address |
| storage_class | TEXT NULL | backend-specific label |
| retrieval_tier | TEXT NOT NULL DEFAULT 'hot' | hot / warm / cold / archive |
| restore_status | TEXT NULL | available / restore_requested / restoring / restored / expired |
| status | TEXT NOT NULL | recorded / verified / failed / deleted |
| created_at, last_verified_at | TIMESTAMPTZ | |
retention_state (materialised view)
| Column | Type | Notes |
|---|---|---|
| package_id | UUID PRIMARY KEY | |
| current_expires_at | TIMESTAMPTZ NULL | NULL = no expiry (permanent or held) |
| effective_class | TEXT NOT NULL | |
| active_hold_id | UUID NULL | |
| eligible_for_deletion | BOOLEAN NOT NULL | |
retention_classes (seed data, not derived)
| Column | Type | Notes |
|---|---|---|
| class_id | TEXT PRIMARY KEY | transient / raw-evidence / summary-evidence / release-evidence / permanent-record |
| default_duration | INTERVAL NULL | NULL for permanent-record |
| deletion_strategy | TEXT NOT NULL | mark_eligible / auto_delete_after_grace (v1 only uses the former) |
metadata_schemas (config table)
| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| slug | TEXT NOT NULL UNIQUE | e.g. guide-board.run.v1 |
| json_schema | JSONB NOT NULL | |
| created_at | TIMESTAMPTZ NOT NULL | |
API shape
Native v1 surface
GET /health
GET /backends
GET /retention-classes
POST /packages # create
GET /packages # list, query by metadata
GET /packages/{package_id} # metadata
POST /packages/{package_id}/files # single-shot file upload
POST /packages/{package_id}/finalize # produce manifest
GET /packages/{package_id}/manifest # canonical CBOR (Accept: application/cbor)
GET /packages/{package_id}/manifest.json # JCS projection (Accept: application/json)
GET /files/{file_id} # metadata
GET /files/{file_id}/download # bytes
POST /uploads # open an upload session (resource shape pinned now)
PATCH /uploads/{upload_id} # range body
POST /uploads/{upload_id}/complete # promote to /packages/.../files
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
GET /events?since={sequence} # long-poll registry change feed
The POST /uploads/... resource shape is committed now even if v1
implements it as single-shot internally; ADR per PLATFORM-AMBITION A6.
Deferred / not v1
- /v2/… OCI Distribution endpoints (ADR-0006).
- gRPC API.
- Streaming CDC topic (NATS / Kafka).
- Multi-tenant namespacing in URLs.
Package manifest content (v1)
A finalised manifest carries:
- manifest_version: 1
- package: id, name, producer, subject, retention class, created_at, finalized_at, expires_at, metadata, metadata_schema_id (nullable).
- files: ordered list of {id, relative_path, media_type, size_bytes, digest_algorithm, digest_primary_hex, digest_sha256_hex}.
- storage_receipts: ordered list of {file_id, backend_id, content_address, retrieval_tier, status} per stored copy.
- retention_summary: current class, expires_at, holds, last retention event.
- provenance: {source_commits, tool_versions, environment, ingest_actor, ingest_timestamps}. Schema-driven; freeform under a registered schema or empty if none.
The manifest digest (blake3:<hex>) is the package's canonical
external identifier.
Storage backends
Local filesystem (v1)
- Root: configured directory.
- Object key layout: <root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>.
- Atomic write via fsync(tmpfile) + rename. No partial states visible.
- Path traversal prevented at the SPI boundary; the local backend rejects any key that does not match the expected layout.
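The key layout, the layout check, and the fsync-then-rename write can be sketched together. Function names and the address pattern (lowercase algorithm name, at least eight lowercase hex characters) are assumptions for the example, not the backend's actual rules.

```python
import os
import re
import tempfile
from pathlib import Path

# ASSUMPTION: addresses look like "<algo>:<lowercase hex>".
_ADDRESS = re.compile(r"(?P<algo>[a-z0-9]+):(?P<hex>[0-9a-f]{8,})")


def object_path(root: Path, content_address: str) -> Path:
    """Map <algo>:<hex> to <root>/<algo>/<hex[0:2]>/<hex[2:4]>/<hex>."""
    match = _ADDRESS.fullmatch(content_address)
    if match is None:
        # Accepting only [a-z0-9] and hex leaves no room for "/" or "..".
        raise ValueError(f"malformed content address: {content_address!r}")
    algo, digest = match["algo"], match["hex"]
    return root / algo / digest[0:2] / digest[2:4] / digest


def atomic_write(root: Path, content_address: str, data: bytes) -> Path:
    target = object_path(root, content_address)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Temp file lives in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # bytes are durable before they become visible
        os.replace(tmp_name, target)  # readers see the old state or the full file
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


with tempfile.TemporaryDirectory() as scratch:
    written = atomic_write(Path(scratch), "blake3:" + "ab" * 16, b"payload")
    print(written.relative_to(scratch))
```

A stricter variant would also fsync the parent directory after the rename so the directory entry itself survives a crash; that step is omitted here for brevity.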
S3-compatible / Ceph RGW (WP-0004)
- Endpoint, bucket, region, access key ref, secret key ref, key prefix, storage class label, optional SSE config.
- Object key: <prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>.
- Multipart upload for objects above a configurable threshold.
Security boundary (v1)
- Internal service. No anonymous public access.
- Authenticated producer / operator API. v1 ships shared-secret bearer tokens; OIDC integration is its own workplan.
- No secret values in artifact metadata.
- Upload paths are logical; never trusted filesystem paths. The /uploads/... path-ingest endpoint is not offered in v1.
- Download authorisation is checked at the registry layer, never at the backend.
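Two of these rules translate directly into small guards. The sketch below is illustrative only: `validate_relative_path` and `token_matches` are hypothetical helper names, and the exact set of rejected path forms is an assumption about what "logical path" should mean.

```python
import hmac
from pathlib import PurePosixPath


def validate_relative_path(raw: str) -> str:
    """Accept only relative, forward-slash paths without dot or empty segments."""
    if not raw or "\\" in raw or "\x00" in raw:
        raise ValueError("empty path, backslash, or NUL byte")
    if PurePosixPath(raw).is_absolute():
        raise ValueError("absolute paths are not logical paths")
    segments = raw.split("/")
    if any(segment in ("", ".", "..") for segment in segments):
        raise ValueError("dot segments and empty segments are rejected")
    return raw


def token_matches(presented: str, expected: str) -> bool:
    # Constant-time comparison avoids leaking how many leading bytes matched.
    return hmac.compare_digest(presented.encode(), expected.encode())


print(validate_relative_path("reports/run-1/log.txt"))
```

Rejecting suspicious paths outright, instead of normalising them, keeps the stored `relative_path` identical to what the producer sent.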
Resolved open questions
- Deduplication scope. Global by content address (ADR-0001). Reference-counted deletion via a GC pass (WP-0006, TBD).
- Deletion ordering. Mark records deletion_eligible first via an event. Byte deletion is a separate, audited operation that emits a second event. Reverse order is forbidden.
- Metadata schemas. Open JSON with optional producer-registered JSON Schema; validation at ingest (ADR-0005, metadata_schemas).
- Statehub integration scope. Statehub keeps package IDs and summary; never bytes. The /events long-poll is the integration point.
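The deletion-ordering rule can be expressed as a guard over the event history. This is a sketch with loud assumptions: events are plain dicts, and `v1.storage.bytes_deleted` is a hypothetical event type invented for the example, since the v1 event list does not name the byte-deletion event.

```python
class DeletionOrderError(RuntimeError):
    """Raised when byte deletion is attempted before eligibility is recorded."""


def delete_bytes(events: list[dict], package_id: str, actor: str) -> dict:
    """Append the byte-deletion event only if eligibility was recorded first."""
    eligible = any(
        e["event_type"] == "v1.retention.deletion_eligible"
        and e["subject_id"] == package_id
        for e in events
    )
    if not eligible:
        raise DeletionOrderError(f"{package_id} was never marked deletion_eligible")
    # ASSUMPTION: hypothetical event type; the real name is not given in the source.
    event = {
        "event_type": "v1.storage.bytes_deleted",
        "subject_id": package_id,
        "actor": actor,
    }
    events.append(event)
    return event


history = [{"event_type": "v1.retention.deletion_eligible", "subject_id": "p1", "actor": "ops"}]
print(delete_bytes(history, "p1", "ops")["event_type"])
```

Checking the log instead of a view flag means the guard still holds while views are being rebuilt.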
Outstanding open questions (not blocking v1)
- Identity provider for shared deployments.
- Default retention durations per class (operator-configurable; needs one round of stakeholder input).
- WASM plugin host design (deferred to its own workplan; see PLATFORM-AMBITION).
- Federation / mirroring protocol (post-OCI-endpoint workplan).
Roadmap pointer
The implementation sequence is in docs/ROADMAP.md. The first
workplan is workplans/ARTIFACT-STORE-WP-0001-foundation.md.