artifact-store/docs/ARCHITECTURE-BLUEPRINT.md
tegwick 747afc27a6 docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans
Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00

Architecture Blueprint

Status: accepted (v2; supersedes 2026-05-15 draft)
Updated: 2026-05-15

This document operationalises INTENT.md, the docs/PLATFORM-AMBITION.md thesis, and the decisions recorded in docs/adr/. Where this blueprint and an ADR conflict, the ADR wins; to change a decision, raise an issue or supersede the ADR.

Architecture in one paragraph

artifact-store is a library-first artifact registry and storage gateway. A small core library (artifactstore) implements identity, manifests, retention, the storage adapter SPI, the data plane SPI, and the registry orchestrator. The HTTP server and the CLI are thin consumers of that library. Bytes are addressed by content (blake3:<hex>) and stored through a pluggable adapter SPI. State is authoritative in an append-only event log; queryable tables are materialised views.

Design lineage

The shape is deliberately borrowed from ffmpeg and VLC: a tight core of well-named modules with stable contracts, runtime-pluggable backends, a thin orchestration binary, and an explicit hot-path boundary that can be rewritten in faster code without changing the consumer API. See docs/PLATFORM-AMBITION.md for the reference table.

Top-level shape

                     producers / operators / agents
                                  |
                                  v
                       +------------------------+
                       |  HTTP API   |   CLI    |   <-- thin consumers
                       +------------------------+
                                  |
                                  v
                       +------------------------+
                       |  registry orchestrator |
                       +------------------------+
                          |          |        |
                          v          v        v
                   +----------+ +---------+ +---------+
                   | identity | | events  | |retention|
                   |/manifest | | (log +  | | policy  |
                   |          | |  views) | | engine  |
                   +----------+ +---------+ +---------+
                          |
                          v
                +-----------------------+
                | data plane SPI        |   <-- ADR-0004 contract
                +-----------------------+
                          |
                          v
                +-----------------------+
                | storage adapter SPI   |
                +-----------------------+
                   |        |        |
                   v        v        v
                +-----+ +------+ +-------+
                |local| |  S3  | | Ceph  |  ... future backends
                | FS  | | RGW  | |  RGW  |
                +-----+ +------+ +-------+

Core modules

Mapped one-to-one to ADR-0005's project layout. Each module has a stable public surface; internals are free to evolve.

identity

  • Digest(algorithm, hex) — value object.
  • ContentAddress: <algorithm>:<hex> (ADR-0001).
  • digest_stream(reader) -> {primary, sha256} — single-pass dual digest.
  • Algorithm registry: blake3 (default primary), sha256 (always computed).
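
A minimal sketch of the single-pass dual digest described above. It is illustrative only: hashlib.blake2b stands in for BLAKE3 (which is not in the standard library), and the dict-based return shape and the content_address() helper are assumptions rather than the library's actual surface.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    def content_address(self) -> str:
        # ADR-0001 form: <algorithm>:<hex>
        return f"{self.algorithm}:{self.hex}"


def digest_stream(chunks: Iterable[bytes], primary: str = "blake2b") -> dict[str, Digest]:
    """Feed every chunk to both hashers so the stream is read only once."""
    # ASSUMPTION: blake2b is a stand-in for blake3, which needs a third-party package.
    primary_hasher = hashlib.new(primary)
    sha256_hasher = hashlib.sha256()
    for chunk in chunks:
        primary_hasher.update(chunk)
        sha256_hasher.update(chunk)
    return {
        "primary": Digest(primary, primary_hasher.hexdigest()),
        "sha256": Digest("sha256", sha256_hasher.hexdigest()),
    }
```

Because both hashers consume the same chunks, the result does not depend on how the stream happens to be split.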

manifest

  • Manifest: versioned dataclass combining package metadata + ordered file list + retention summary + provenance + storage receipts.
  • manifest.codec.encode(manifest) -> bytes — canonical CBOR (ADR-0003).
  • manifest.codec.decode(bytes) -> Manifest.
  • manifest.projection.jcs(manifest) -> bytes — canonical-JSON projection for display and signing-tool interop.
  • Round-trip invariant: decode(encode(m)) == m and encode(decode(jcs_to_cbor(jcs(m)))) == encode(m).
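
The round-trip invariant can be illustrated with a deterministic JSON encoder. This is a sketch under stated assumptions: it uses only the standard library, it approximates a JCS-style projection (RFC 8785 also fixes number and string serialisation rules that are not covered here), and canonical_json and round_trips are hypothetical names.

```python
import json


def canonical_json(value: object) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace, UTF-8."""
    # ASSUMPTION: approximation of a canonical projection, not full RFC 8785.
    text = json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def round_trips(value: object) -> bool:
    """Check decode(encode(m)) == m and that re-encoding yields identical bytes."""
    encoded = canonical_json(value)
    decoded = json.loads(encoded)
    return decoded == value and canonical_json(decoded) == encoded


# Same logical manifest, different key insertion order.
manifest_a = {"manifest_version": 1, "files": [{"relative_path": "a.txt", "size_bytes": 3}]}
manifest_b = {"files": [{"size_bytes": 3, "relative_path": "a.txt"}], "manifest_version": 1}
```

Both dictionaries encode to the same bytes, which is the property a manifest digest depends on.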

events

  • events.write(transaction, event) — appends one row with monotonic sequence (ADR-0002).
  • events.tail(since_sequence) -> AsyncIterator[Event] — long-poll.
  • events.replay(into=ViewWriter) — rebuild materialised views.
  • Event types (v1): v1.package.created, v1.file.ingested, v1.package.finalized, v1.retention.default_applied, v1.retention.extended, v1.retention.hold_applied, v1.retention.hold_released, v1.retention.deletion_eligible, v1.storage.location_recorded, v1.storage.location_verified, v1.audit.access, v1.system.note.
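
An in-memory sketch of the write / tail / replay contract. The EventLog class, the view shape, and the rule that v1.file.ingested moves a package to "uploading" are assumptions made for illustration; the real store is the events table.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


@dataclass
class EventLog:
    """Hypothetical in-memory stand-in for the events table."""
    _rows: list = field(default_factory=list)

    def write(self, event_type: str, subject_id: str, payload: dict) -> Event:
        event = Event(len(self._rows) + 1, event_type, subject_id, payload)
        self._rows.append(event)  # append only; rows are never updated
        return event

    def tail(self, since_sequence: int = 0) -> list:
        return [e for e in self._rows if e.sequence > since_sequence]


def replay(events: list) -> dict:
    """Rebuild a package-status view purely from the event sequence."""
    view: dict = {}
    for e in events:
        if e.event_type == "v1.package.created":
            view[e.subject_id] = {"status": "created", "files": 0}
        elif e.event_type == "v1.file.ingested":
            view[e.subject_id]["files"] += 1
            view[e.subject_id]["status"] = "uploading"  # ASSUMPTION
        elif e.event_type == "v1.package.finalized":
            view[e.subject_id]["status"] = "finalized"
        if e.subject_id in view:
            view[e.subject_id]["last_event_sequence"] = e.sequence
    return view
```

Dropping the view and calling replay(log.tail()) again yields the same result, which is what makes the log the source of truth.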

retention

  • retention.classes: transient, raw-evidence, summary-evidence, release-evidence, permanent-record. Defined as data, not code.
  • retention.policy.apply(package, class) -> RetentionDecision — computes expires_at and the deletion eligibility rule.
  • retention.extend(package, until, reason, actor) — emits an event; the materialised view updates on commit.
  • retention.hold(package, reason, actor) / retention.release_hold(hold_id, actor).

audit

  • A view over events filtered to access and lifecycle events. No separate write path; auditing happens by event emission elsewhere.

storage (adapter SPI)

class StorageBackend(Protocol):
    backend_id: str
    async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
    async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
    async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
    async def health(self) -> BackendStatus: ...
  • Backend registry: backends register at import time; selection is per-package by configuration.
  • v1 ships local (filesystem); s3 ships in WP-0004.

dataplane (SPI per ADR-0004)

class DataPlane(Protocol):
    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
    async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
    async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...
  • v1 implementation: dataplane.inproc — wraps a StorageBackend, computes digests during streaming.
  • Future implementation: dataplane.remote — gRPC or framed-bincode-over-Unix-socket client to a Rust daemon.

registry

The orchestrator. Combines identity + manifest + events + retention + dataplane into the operations the HTTP API and CLI consume: create_package, ingest_file, finalize_package, get_manifest, download_file, extend_retention, apply_hold, release_hold, mark_deletion_eligible, tail_events. Each operation is one DB transaction that writes one or more events and updates materialised views.
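
The "one transaction, events plus views" rule can be sketched with the standard library. Assumptions: sqlite3 stands in for PostgreSQL and SQLAlchemy Core, the tables are cut down to a few columns, and open_db and create_package are hypothetical helpers.

```python
import json
import sqlite3


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE events (
            sequence INTEGER PRIMARY KEY AUTOINCREMENT,
            event_type TEXT NOT NULL,
            subject_id TEXT,
            payload TEXT NOT NULL
        );
        CREATE TABLE artifact_packages (
            id TEXT PRIMARY KEY,
            name TEXT NOT NULL,
            status TEXT NOT NULL,
            last_event_sequence INTEGER NOT NULL
        );
        """
    )
    return conn


def create_package(conn: sqlite3.Connection, package_id: str, name: str) -> int:
    """Append the event and update the view atomically."""
    with conn:  # commits on success, rolls back if anything raises
        cursor = conn.execute(
            "INSERT INTO events (event_type, subject_id, payload) VALUES (?, ?, ?)",
            ("v1.package.created", package_id, json.dumps({"name": name})),
        )
        sequence = cursor.lastrowid
        conn.execute(
            "INSERT INTO artifact_packages (id, name, status, last_event_sequence) "
            "VALUES (?, ?, 'created', ?)",
            (package_id, name, sequence),
        )
    return sequence
```

If the view update fails (for example on a duplicate package id), the event row is rolled back with it, so the log and the views cannot drift apart.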

api.http and cli

Thin. Their job is to translate transport (HTTP / argv) into calls on registry. No business logic.

Data model

All tables exist as materialised views over events (ADR-0002), except events itself, retention_classes (seed data), and metadata_schemas (config).

events (source of truth)

| Column | Type | Notes |
|---|---|---|
| sequence | BIGSERIAL PRIMARY KEY | monotonic, gapless |
| created_at | TIMESTAMPTZ NOT NULL | UTC, set by DB default |
| event_type | TEXT NOT NULL | versioned slug (v1.…) |
| subject_kind | TEXT NOT NULL | package / file / retention / storage / system |
| subject_id | UUID NULL | |
| actor | TEXT NOT NULL | producer or operator identity |
| payload | BYTEA NOT NULL | canonical CBOR |
| payload_digest | BYTEA NOT NULL | BLAKE3 of payload |

Indexes: (subject_kind, subject_id), (event_type, sequence).

artifact_packages (materialised view)

| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| name | TEXT NOT NULL | |
| producer | TEXT NOT NULL | |
| subject | TEXT NOT NULL | |
| retention_class | TEXT NOT NULL | FK to retention_classes |
| metadata_schema_id | UUID NULL | FK to metadata_schemas |
| metadata | JSONB NOT NULL | validated against schema if present |
| status | TEXT NOT NULL | created / uploading / finalized / deletion_eligible / deleted / failed |
| manifest_digest | BYTEA NULL | populated on finalize |
| created_at, finalized_at, expires_at | TIMESTAMPTZ | |
| last_event_sequence | BIGINT NOT NULL | for replay bookkeeping |

artifact_files (materialised view)

| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| package_id | UUID NOT NULL | FK |
| relative_path | TEXT NOT NULL | logical path; unique within package |
| media_type | TEXT NOT NULL | required (ADR-0006) |
| size_bytes | BIGINT NOT NULL | |
| digest_algorithm | TEXT NOT NULL | blake3 by default (ADR-0001) |
| digest_primary | BYTEA NOT NULL | bytes of the primary digest |
| digest_sha256 | BYTEA NOT NULL | always populated for interop |
| created_at | TIMESTAMPTZ NOT NULL | |

storage_locations (materialised view)

| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| artifact_file_id | UUID NOT NULL | FK |
| backend_id | TEXT NOT NULL | |
| content_address | TEXT NOT NULL | <algo>:<hex> |
| object_key | TEXT NOT NULL | backend-specific, usually derived from content_address |
| storage_class | TEXT NULL | backend-specific label |
| retrieval_tier | TEXT NOT NULL DEFAULT 'hot' | hot / warm / cold / archive |
| restore_status | TEXT NULL | available / restore_requested / restoring / restored / expired |
| status | TEXT NOT NULL | recorded / verified / failed / deleted |
| created_at, last_verified_at | TIMESTAMPTZ | |

retention_state (materialised view)

| Column | Type | Notes |
|---|---|---|
| package_id | UUID PRIMARY KEY | |
| current_expires_at | TIMESTAMPTZ NULL | NULL = no expiry (permanent or held) |
| effective_class | TEXT NOT NULL | |
| active_hold_id | UUID NULL | |
| eligible_for_deletion | BOOLEAN NOT NULL | |

retention_classes (seed data, not derived)

| Column | Type | Notes |
|---|---|---|
| class_id | TEXT PRIMARY KEY | transient / raw-evidence / summary-evidence / release-evidence / permanent-record |
| default_duration | INTERVAL NULL | NULL for permanent-record |
| deletion_strategy | TEXT NOT NULL | mark_eligible / auto_delete_after_grace (v1 only uses the former) |

metadata_schemas (config table)

| Column | Type | Notes |
|---|---|---|
| id | UUID PRIMARY KEY | |
| slug | TEXT NOT NULL UNIQUE | e.g. guide-board.run.v1 |
| json_schema | JSONB NOT NULL | |
| created_at | TIMESTAMPTZ NOT NULL | |

API shape

Native v1 surface

GET   /health
GET   /backends
GET   /retention-classes

POST  /packages                                      # create
GET   /packages                                      # list, query by metadata
GET   /packages/{package_id}                         # metadata
POST  /packages/{package_id}/files                   # single-shot file upload
POST  /packages/{package_id}/finalize                # produce manifest
GET   /packages/{package_id}/manifest                # canonical CBOR (Accept: application/cbor)
GET   /packages/{package_id}/manifest.json           # JCS projection (Accept: application/json)

GET   /files/{file_id}                               # metadata
GET   /files/{file_id}/download                      # bytes

POST  /uploads                                       # open an upload session (resource shape pinned now)
PATCH /uploads/{upload_id}                           # range body
POST  /uploads/{upload_id}/complete                  # promote to /packages/.../files

POST  /packages/{package_id}/retention/extensions
POST  /packages/{package_id}/retention/holds
POST  /packages/{package_id}/retention/holds/{hold_id}/release

GET   /events?since={sequence}                       # long-poll registry change feed

The POST /uploads/... resource shape is committed now even if v1 implements it as single-shot internally; ADR per PLATFORM-AMBITION A6.
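
One way to read "resource shape committed, single-shot internally" is sketched below. The UploadSession class, its fields, and the contiguous-append rule are assumptions for illustration; the real session semantics are not specified in this document.

```python
from dataclasses import dataclass, field


@dataclass
class UploadSession:
    """Hypothetical server-side state behind /uploads/{upload_id}."""
    upload_id: str
    expected_size: int
    received: bytearray = field(default_factory=bytearray)
    completed: bool = False

    def patch(self, offset: int, body: bytes) -> int:
        """Accept a range body; this sketch only allows contiguous appends."""
        if self.completed:
            raise ValueError("session already completed")
        if offset != len(self.received):
            raise ValueError(f"expected offset {len(self.received)}, got {offset}")
        if offset + len(body) > self.expected_size:
            raise ValueError("body exceeds declared size")
        self.received.extend(body)
        return len(self.received)

    def complete(self) -> bytes:
        """Promote the session once every declared byte has arrived."""
        if len(self.received) != self.expected_size:
            raise ValueError("upload incomplete")
        self.completed = True
        return bytes(self.received)


def single_shot(upload_id: str, body: bytes) -> bytes:
    """A single-shot upload is one PATCH at offset 0 followed by complete."""
    session = UploadSession(upload_id, len(body))
    session.patch(0, body)
    return session.complete()
```

Under this reading, clients written against the session resource keep working if a later version accepts several PATCH calls per upload.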

Deferred / not v1

  • /v2/… OCI Distribution endpoints (ADR-0006).
  • gRPC API.
  • Streaming CDC topic (NATS / Kafka).
  • Multi-tenant namespacing in URLs.

Package manifest content (v1)

A finalised manifest carries:

  • manifest_version: 1
  • package: id, name, producer, subject, retention class, created_at, finalized_at, expires_at, metadata, metadata_schema_id (nullable).
  • files: ordered list of {id, relative_path, media_type, size_bytes, digest_algorithm, digest_primary_hex, digest_sha256_hex}.
  • storage_receipts: ordered list of {file_id, backend_id, content_address, retrieval_tier, status} per stored copy.
  • retention_summary: current class, expires_at, holds, last retention event.
  • provenance: {source_commits, tool_versions, environment, ingest_actor, ingest_timestamps}. Schema-driven; freeform under a registered schema or empty if none.

The manifest digest (blake3:<hex>) is the package's canonical external identifier.

Storage backends

Local filesystem (v1)

  • Root: configured directory.
  • Object key layout: <root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>.
  • Atomic write via fsync(tmpfile) + rename. No partial states visible.
  • Path traversal prevented at the SPI boundary; the local backend rejects any key that does not match the expected layout.
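
The key layout and atomic write above can be sketched as follows. This is illustrative rather than the shipped backend: object_path and atomic_write are hypothetical names, and the pattern assumes 64-character lowercase hex digests for both blake3 and sha256.

```python
import os
import re
import tempfile
from pathlib import Path

# ASSUMPTION: 256-bit digests, so exactly 64 lowercase hex characters.
_ADDRESS = re.compile(r"(blake3|sha256):([0-9a-f]{64})")


def object_path(root: Path, content_address: str) -> Path:
    """Derive <root>/<algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>, rejecting anything else."""
    match = _ADDRESS.fullmatch(content_address)
    if match is None:
        # Strict matching is what blocks "../" and other traversal attempts.
        raise ValueError(f"invalid content address: {content_address!r}")
    algorithm, hex_digest = match.groups()
    return root / algorithm / hex_digest[0:2] / hex_digest[2:4] / hex_digest


def atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the target directory, fsync, then rename into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file in the destination directory keeps the rename on one filesystem, which is what makes it atomic.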

S3-compatible / Ceph RGW (WP-0004)

  • Endpoint, bucket, region, access key ref, secret key ref, key prefix, storage class label, optional SSE config.
  • Object key: <prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>.
  • Multipart upload for objects above a configurable threshold.
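
A sketch of the threshold decision for multipart upload. The threshold and part size defaults are placeholders, and plan_parts is a hypothetical helper; the 5 MiB minimum part size and 10,000-part ceiling are the documented Amazon S3 limits, which other S3-compatible stores may or may not share.

```python
MIB = 1024 * 1024
S3_MIN_PART = 5 * MIB   # every part except the last must be at least this large
S3_MAX_PARTS = 10_000


def plan_parts(size_bytes: int, threshold: int = 64 * MIB,
               part_size: int = 16 * MIB) -> list[tuple[int, int]]:
    """Return (offset, length) pairs; a single pair means a plain PUT."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if part_size < S3_MIN_PART:
        raise ValueError("part_size is below the S3 minimum")
    if size_bytes <= threshold:
        return [(0, size_bytes)]
    parts = []
    offset = 0
    while offset < size_bytes:
        length = min(part_size, size_bytes - offset)
        parts.append((offset, length))
        offset += length
    if len(parts) > S3_MAX_PARTS:
        raise ValueError("too many parts; increase part_size")
    return parts
```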

Security boundary (v1)

  • Internal service. No anonymous public access.
  • Authenticated producer / operator API. v1 ships shared-secret bearer tokens; OIDC integration is its own workplan.
  • No secret values in artifact metadata.
  • Upload paths are logical; never trusted filesystem paths. The /uploads/... path-ingest endpoint is not offered in v1.
  • Download authorisation is checked at the registry layer, never at the backend.

Resolved open questions

  • Deduplication scope. Global by content address (ADR-0001). Reference-counted deletion via a GC pass (WP-0006, TBD).
  • Deletion ordering. Mark records deletion_eligible first via an event. Byte deletion is a separate, audited operation that emits a second event. Reverse order is forbidden.
  • Metadata schemas. Open JSON with optional producer-registered JSON Schema; validation at ingest (ADR-0005, metadata_schemas).
  • Statehub integration scope. Statehub keeps package IDs and summary; never bytes. The /events long-poll is the integration point.
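
The deletion-ordering and deduplication rules above can be sketched together. Everything here is hypothetical: the ObjectLedger class, the in-memory reference sets, and the "v1.storage.bytes_deleted" event name, which is not part of the v1 event list.

```python
from dataclasses import dataclass, field


class DeletionOrderError(Exception):
    """Raised when byte deletion is attempted before the eligibility event."""


@dataclass
class ObjectLedger:
    references: dict = field(default_factory=dict)  # content_address -> set of package ids
    eligible: set = field(default_factory=set)      # packages marked deletion_eligible
    events: list = field(default_factory=list)

    def add_reference(self, package_id: str, content_address: str) -> None:
        self.references.setdefault(content_address, set()).add(package_id)

    def mark_deletion_eligible(self, package_id: str) -> None:
        self.eligible.add(package_id)
        self.events.append(("v1.retention.deletion_eligible", package_id))

    def delete_bytes(self, package_id: str) -> list:
        """Drop this package's references; return addresses no package uses any more."""
        if package_id not in self.eligible:
            raise DeletionOrderError("mark the package deletion_eligible first")
        freed = []
        for address, owners in self.references.items():
            if package_id in owners:
                owners.discard(package_id)
                if not owners:
                    freed.append(address)
        for address in freed:
            del self.references[address]
        # ASSUMPTION: hypothetical event name for the audited byte deletion.
        self.events.append(("v1.storage.bytes_deleted", package_id))
        return sorted(freed)
```

Because deduplication is global, bytes shared with another package survive until the last reference is gone.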

Outstanding open questions (not blocking v1)

  • Identity provider for shared deployments.
  • Default retention durations per class (operator-configurable; needs one round of stakeholder input).
  • WASM plugin host design (deferred to its own workplan; see PLATFORM-AMBITION).
  • Federation / mirroring protocol (post-OCI-endpoint workplan).

Roadmap pointer

The implementation sequence is in docs/ROADMAP.md. The first workplan is workplans/ARTIFACT-STORE-WP-0001-foundation.md.