artifact-store/docs/ARCHITECTURE-BLUEPRINT.md
tegwick 747afc27a6 docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans
Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00


# Architecture Blueprint

Status: accepted (v2; supersedes 2026-05-15 draft)

Updated: 2026-05-15

This document operationalises `INTENT.md`, the `docs/PLATFORM-AMBITION.md`
thesis, and the decisions recorded in `docs/adr/`. Where a tension exists
between this blueprint and an ADR, the ADR wins; raise an issue or
supersede the ADR.
## Architecture in one paragraph
`artifact-store` is a **library-first** artifact registry and storage
gateway. A small core library (`artifactstore`) implements identity,
manifests, retention, the storage adapter SPI, the data plane SPI, and
the registry orchestrator. The HTTP server and the CLI are thin
consumers of that library. Bytes are addressed by content
(`blake3:<hex>`) and stored through a pluggable adapter SPI. State is
authoritative in an append-only event log; queryable tables are
materialised views.
## Design lineage
The shape is deliberately borrowed from `ffmpeg` and `VLC`: a tight
core of well-named modules with stable contracts, runtime-pluggable
backends, a thin orchestration binary, and an explicit hot-path
boundary that can be rewritten in faster code without changing the
consumer API. See `docs/PLATFORM-AMBITION.md` for the reference table.
## Top-level shape
```text
       producers / operators / agents
                      |
                      v
         +------------------------+
         |   HTTP API   |   CLI   |   <-- thin consumers
         +------------------------+
                      |
                      v
         +------------------------+
         | registry orchestrator  |
         +------------------------+
          |           |           |
          v           v           v
    +----------+ +---------+ +---------+
    | identity | | events  | |retention|
    |/manifest | | (log +  | | policy  |
    |          | | views)  | | engine  |
    +----------+ +---------+ +---------+
                      |
                      v
          +-----------------------+
          |    data plane SPI     |   <-- ADR-0004 contract
          +-----------------------+
                      |
                      v
          +-----------------------+
          |  storage adapter SPI  |
          +-----------------------+
            |         |         |
            v         v         v
         +-----+  +------+  +-------+
         |local|  |  S3  |  | Ceph  |   ... future backends
         | FS  |  | RGW  |  |  RGW  |
         +-----+  +------+  +-------+
```
## Core modules
The modules below map one-to-one onto ADR-0005's project layout. Each
module has a stable public surface; internals are free to evolve.
### `identity`
- `Digest(algorithm, hex)` — value object.
- `ContentAddress`: `<algorithm>:<hex>` (ADR-0001).
- `digest_stream(reader) -> {primary, sha256}` — single-pass dual digest.
- Algorithm registry: `blake3` (default primary), `sha256` (always
computed).
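
The single-pass dual digest can be illustrated with a minimal sketch. It is not the library's implementation: `blake3` is not in the Python standard library, so `hashlib.blake2b` stands in as the primary algorithm here, and the synchronous `Iterable[bytes]` input is an assumption made to keep the example short.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    def content_address(self) -> str:
        # "<algorithm>:<hex>" form, as described for ContentAddress
        return f"{self.algorithm}:{self.hex}"


def digest_stream(chunks: Iterable[bytes]) -> dict[str, Digest]:
    """Feed every chunk to both hashers so the stream is read only once."""
    # ASSUMPTION: blake2b is a stand-in for blake3 (third-party package).
    primary = hashlib.blake2b(digest_size=32)
    secondary = hashlib.sha256()
    for chunk in chunks:
        primary.update(chunk)
        secondary.update(chunk)
    return {
        "primary": Digest("blake2b", primary.hexdigest()),
        "sha256": Digest("sha256", secondary.hexdigest()),
    }


result = digest_stream([b"hello ", b"world"])
```

Both digests come out of one loop over the chunks, so a large upload never has to be buffered or re-read to obtain the interop SHA-256.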
### `manifest`
- `Manifest` — versioned dataclass: package metadata + ordered file list
+ retention summary + provenance + storage receipts.
- `manifest.codec.encode(manifest) -> bytes` — canonical CBOR
(ADR-0003).
- `manifest.codec.decode(bytes) -> Manifest`.
- `manifest.projection.jcs(manifest) -> bytes` — canonical-JSON
projection for display and signing-tool interop.
- Round-trip invariant: `decode(encode(m)) == m` and
`encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.
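
The first half of the round-trip invariant is easy to express as a test. The sketch below is only an illustration: it uses the standard-library `json` module with sorted keys as a stand-in for the canonical CBOR codec (which would need the third-party `cbor2` package), it is not a full RFC 8785 JCS implementation, and the trimmed `Manifest` and `FileEntry` shapes are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    size_bytes: int


@dataclass(frozen=True)
class Manifest:
    manifest_version: int
    name: str
    files: tuple[FileEntry, ...] = ()


def encode(manifest: Manifest) -> bytes:
    # Sorted keys and fixed separators make the byte output deterministic.
    text = json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def decode(data: bytes) -> Manifest:
    raw = json.loads(data)
    files = tuple(FileEntry(**entry) for entry in raw["files"])
    return Manifest(raw["manifest_version"], raw["name"], files)


m = Manifest(1, "run-42", (FileEntry("logs/out.txt", 12),))
round_tripped = decode(encode(m))
# sha256 is used here only because it ships with Python.
manifest_digest = hashlib.sha256(encode(m)).hexdigest()
```

Because the encoding is deterministic, hashing `encode(m)` yields a stable digest for the same logical manifest, which is the property a content-derived package identifier relies on.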
### `events`
- `events.write(transaction, event)` — appends one row with monotonic
sequence (ADR-0002).
- `events.tail(since_sequence) -> AsyncIterator[Event]` — long-poll.
- `events.replay(into=ViewWriter)` — rebuild materialised views.
- Event types (v1):
`v1.package.created`, `v1.file.ingested`, `v1.package.finalized`,
`v1.retention.default_applied`, `v1.retention.extended`,
`v1.retention.hold_applied`, `v1.retention.hold_released`,
`v1.retention.deletion_eligible`, `v1.storage.location_recorded`,
`v1.storage.location_verified`, `v1.audit.access`,
`v1.system.note`.
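
How `write`, `tail`, and `replay` relate can be sketched with an in-memory log. This is a simplified model, not the module's API: the real log is a database table written inside a transaction, `tail` is asynchronous, and the `EventLog` class and `apply_package_status` reducer below are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


class EventLog:
    """In-memory stand-in for the `events` table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: list[Event] = []

    def write(self, event_type: str, subject_id: str, payload: dict) -> Event:
        # Append-only: the next sequence is always one past the last row.
        event = Event(len(self._rows) + 1, event_type, subject_id, payload)
        self._rows.append(event)
        return event

    def tail(self, since_sequence: int) -> Iterator[Event]:
        # Yield only rows newer than the caller's cursor.
        return (e for e in self._rows if e.sequence > since_sequence)

    def replay(self, apply: Callable[[dict, Event], None]) -> dict:
        # Rebuild a view from scratch by folding every event in order.
        view: dict = {}
        for event in self._rows:
            apply(view, event)
        return view


def apply_package_status(view: dict, event: Event) -> None:
    if event.event_type == "v1.package.created":
        view[event.subject_id] = "created"
    elif event.event_type == "v1.package.finalized":
        view[event.subject_id] = "finalized"


log = EventLog()
log.write("v1.package.created", "pkg-1", {"name": "demo"})
log.write("v1.file.ingested", "pkg-1", {"relative_path": "a.txt"})
log.write("v1.package.finalized", "pkg-1", {})
packages = log.replay(apply_package_status)
new_events = list(log.tail(since_sequence=2))
```

The view is disposable: dropping `packages` and calling `replay` again reproduces it, which is what makes the log the source of truth.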
### `retention`
- `retention.classes`: `transient`, `raw-evidence`, `summary-evidence`,
`release-evidence`, `permanent-record`. Defined as data, not code.
- `retention.policy.apply(package, class) -> RetentionDecision`
computes `expires_at` and the deletion eligibility rule.
- `retention.extend(package, until, reason, actor)` — emits an event;
the materialised view updates on commit.
- `retention.hold(package, reason, actor)` /
`retention.release_hold(hold_id, actor)`.
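
A minimal sketch of "classes as data" and of how `expires_at` could be derived follows. The durations are hypothetical placeholders (default durations are still an open question in this document), and `apply_policy` and `eligible_for_deletion` are illustrative names rather than the library's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# HYPOTHETICAL durations: the real defaults have not been decided.
RETENTION_CLASSES: dict[str, Optional[timedelta]] = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,  # no expiry
}


@dataclass(frozen=True)
class RetentionDecision:
    retention_class: str
    expires_at: Optional[datetime]

    def eligible_for_deletion(self, now: datetime, on_hold: bool) -> bool:
        # A hold or a missing expiry always blocks eligibility.
        if on_hold or self.expires_at is None:
            return False
        return now >= self.expires_at


def apply_policy(created_at: datetime, retention_class: str) -> RetentionDecision:
    duration = RETENTION_CLASSES[retention_class]  # KeyError for unknown classes
    expires_at = None if duration is None else created_at + duration
    return RetentionDecision(retention_class, expires_at)


created = datetime(2026, 1, 1, tzinfo=timezone.utc)
decision = apply_policy(created, "transient")
permanent = apply_policy(created, "permanent-record")
```

Keeping the class table as plain data means an operator can change a duration without touching the policy function.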
### `audit`
- A view over `events` filtered to access and lifecycle events. No
separate write path; auditing happens by event emission elsewhere.
### `storage` (adapter SPI)
```python
from collections.abc import AsyncIterator
from typing import Protocol


class StorageBackend(Protocol):
    """Storage adapter SPI: every object is keyed by its content address."""

    backend_id: str

    async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
    async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
    async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
    async def health(self) -> BackendStatus: ...
```
- Backend registry: backends register at import time; selection is
per-package by configuration.
- v1 ships `local` (filesystem); `s3` ships in WP-0004.
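
Import-time registration and selection by backend id could look roughly like the sketch below. Everything in it is hypothetical: the `register_backend` decorator, the `BACKENDS` table, and the `MemoryBackend` class are illustrative, and the backend implements only simplified `put` and `get` methods with plain strings instead of the SPI's value types.

```python
import asyncio
from typing import AsyncIterator, Optional

BACKENDS: dict[str, type] = {}


def register_backend(backend_id: str):
    """Decorator applied at import time, so defining a backend registers it."""
    def decorator(factory: type) -> type:
        BACKENDS[backend_id] = factory
        return factory
    return decorator


@register_backend("memory")
class MemoryBackend:
    """Hypothetical in-memory backend covering only `put` and `get`."""

    backend_id = "memory"

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def put(self, content_address: str, stream: AsyncIterator[bytes],
                  size_hint: Optional[int] = None) -> int:
        data = b"".join([chunk async for chunk in stream])
        # Content-addressed: writing the same address twice is a no-op.
        self._objects.setdefault(content_address, data)
        return len(data)

    async def get(self, content_address: str) -> AsyncIterator[bytes]:
        yield self._objects[content_address]


async def chunks(*parts: bytes) -> AsyncIterator[bytes]:
    for part in parts:
        yield part


async def demo() -> bytes:
    backend = BACKENDS["memory"]()  # selection by configured backend id
    await backend.put("sha256:abc", chunks(b"he", b"llo"))
    return b"".join([c async for c in backend.get("sha256:abc")])


stored = asyncio.run(demo())
```

In this sketch a per-package configuration value only needs to carry the backend id; the registry lookup does the rest.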
### `dataplane` (SPI per ADR-0004)
```python
from collections.abc import AsyncIterator
from typing import Protocol


class DataPlane(Protocol):
    """Data plane SPI (ADR-0004): the hot path for moving and checking bytes."""

    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
    async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
    async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...
```
- v1 implementation: `dataplane.inproc` — wraps a `StorageBackend`,
computes digests during streaming.
- Future implementation: `dataplane.remote` — gRPC or
framed-bincode-over-Unix-socket client to a Rust daemon.
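
Computing digests during streaming can be sketched as a wrapper around the incoming byte stream. This is an assumption-laden illustration, not `dataplane.inproc` itself: `HashingStream` and `InProcDataPlane` are hypothetical names, storage is a plain dict, only SHA-256 is computed, and bytes are staged in memory because the content address is unknown until the stream ends.

```python
import asyncio
import hashlib
from typing import AsyncIterator


class HashingStream:
    """Wraps an async byte stream and hashes each chunk as it passes through."""

    def __init__(self, source: AsyncIterator[bytes]) -> None:
        self._source = source
        self.sha256 = hashlib.sha256()
        self.size = 0

    def __aiter__(self) -> "HashingStream":
        return self

    async def __anext__(self) -> bytes:
        chunk = await self._source.__anext__()  # StopAsyncIteration propagates
        self.sha256.update(chunk)
        self.size += len(chunk)
        return chunk


class InProcDataPlane:
    """Hypothetical stand-in for an in-process data plane backed by a dict."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> str:
        hashing = HashingStream(stream)
        staged = bytearray()
        async for chunk in hashing:  # one pass: hash and stage together
            staged.extend(chunk)
        address = f"sha256:{hashing.sha256.hexdigest()}"
        self.objects.setdefault(address, bytes(staged))
        return address

    async def verify_object(self, address: str) -> bool:
        # Re-hash the stored bytes and compare with the address they are filed under.
        algorithm, _, expected = address.partition(":")
        return hashlib.new(algorithm, self.objects[address]).hexdigest() == expected


async def source() -> AsyncIterator[bytes]:
    yield b"artifact "
    yield b"bytes"


async def demo() -> tuple[str, bool]:
    plane = InProcDataPlane()
    address = await plane.ingest_stream(source())
    return address, await plane.verify_object(address)


address, verified = asyncio.run(demo())
```

A real backend would stage to a temporary object rather than memory, then promote it to its content-addressed key once the digest is known.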
### `registry`
The orchestrator. Combines `identity + manifest + events + retention +
dataplane` into the operations the HTTP API and CLI consume:
`create_package`, `ingest_file`, `finalize_package`, `get_manifest`,
`download_file`, `extend_retention`, `apply_hold`, `release_hold`,
`mark_deletion_eligible`, `tail_events`. Each operation is one DB
transaction that writes one or more events and updates materialised
views.
### `api.http` and `cli`
Thin. Their job is to translate transport (HTTP / argv) into calls on
`registry`. No business logic.
## Data model
All tables exist as **materialised views over `events`** (ADR-0002),
except `events` itself, `retention_classes` (seed data), and
`metadata_schemas` (config).
### `events` (source of truth)
| Column | Type | Notes |
|---|---|---|
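| `sequence` | `BIGSERIAL PRIMARY KEY` | monotonic; gapless only if the writer enforces it, since a bare `BIGSERIAL` skips values when a transaction rolls back |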
| `created_at` | `TIMESTAMPTZ NOT NULL` | UTC, set by DB default |
| `event_type` | `TEXT NOT NULL` | versioned slug (`v1.…`) |
| `subject_kind` | `TEXT NOT NULL` | `package` / `file` / `retention` / `storage` / `system` |
| `subject_id` | `UUID NULL` | |
| `actor` | `TEXT NOT NULL` | producer or operator identity |
| `payload` | `BYTEA NOT NULL` | canonical CBOR |
| `payload_digest` | `BYTEA NOT NULL` | BLAKE3 of `payload` |

Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
### `artifact_packages` (materialised view)
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `name` | `TEXT NOT NULL` | |
| `producer` | `TEXT NOT NULL` | |
| `subject` | `TEXT NOT NULL` | |
| `retention_class` | `TEXT NOT NULL` | FK to `retention_classes` |
| `metadata_schema_id` | `UUID NULL` | FK to `metadata_schemas` |
| `metadata` | `JSONB NOT NULL` | validated against schema if present |
| `status` | `TEXT NOT NULL` | `created` / `uploading` / `finalized` / `deletion_eligible` / `deleted` / `failed` |
| `manifest_digest` | `BYTEA NULL` | populated on finalize |
| `created_at`, `finalized_at`, `expires_at` | `TIMESTAMPTZ` | |
| `last_event_sequence` | `BIGINT NOT NULL` | for replay bookkeeping |
### `artifact_files` (materialised view)
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `package_id` | `UUID NOT NULL` | FK |
| `relative_path` | `TEXT NOT NULL` | logical path; unique within package |
| `media_type` | `TEXT NOT NULL` | required (ADR-0006) |
| `size_bytes` | `BIGINT NOT NULL` | |
| `digest_algorithm` | `TEXT NOT NULL` | `blake3` by default (ADR-0001) |
| `digest_primary` | `BYTEA NOT NULL` | bytes of the primary digest |
| `digest_sha256` | `BYTEA NOT NULL` | always populated for interop |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
### `storage_locations` (materialised view)
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `artifact_file_id` | `UUID NOT NULL` | FK |
| `backend_id` | `TEXT NOT NULL` | |
| `content_address` | `TEXT NOT NULL` | `<algo>:<hex>` |
| `object_key` | `TEXT NOT NULL` | backend-specific, usually derived from `content_address` |
| `storage_class` | `TEXT NULL` | backend-specific label |
| `retrieval_tier` | `TEXT NOT NULL DEFAULT 'hot'` | `hot` / `warm` / `cold` / `archive` |
| `restore_status` | `TEXT NULL` | `available` / `restore_requested` / `restoring` / `restored` / `expired` |
| `status` | `TEXT NOT NULL` | `recorded` / `verified` / `failed` / `deleted` |
| `created_at`, `last_verified_at` | `TIMESTAMPTZ` | |
### `retention_state` (materialised view)
| Column | Type | Notes |
|---|---|---|
| `package_id` | `UUID PRIMARY KEY` | |
| `current_expires_at` | `TIMESTAMPTZ NULL` | NULL = no expiry (permanent or held) |
| `effective_class` | `TEXT NOT NULL` | |
| `active_hold_id` | `UUID NULL` | |
| `eligible_for_deletion` | `BOOLEAN NOT NULL` | |
### `retention_classes` (seed data, not derived)
| Column | Type | Notes |
|---|---|---|
| `class_id` | `TEXT PRIMARY KEY` | `transient` / `raw-evidence` / `summary-evidence` / `release-evidence` / `permanent-record` |
| `default_duration` | `INTERVAL NULL` | NULL for `permanent-record` |
| `deletion_strategy` | `TEXT NOT NULL` | `mark_eligible` / `auto_delete_after_grace` (v1 only uses the former) |
### `metadata_schemas` (config table)
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `slug` | `TEXT NOT NULL UNIQUE` | e.g. `guide-board.run.v1` |
| `json_schema` | `JSONB NOT NULL` | |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
## API shape
### Native v1 surface
```text
GET /health
GET /backends
GET /retention-classes
POST /packages # create
GET /packages # list, query by metadata
GET /packages/{package_id} # metadata
POST /packages/{package_id}/files # single-shot file upload
POST /packages/{package_id}/finalize # produce manifest
GET /packages/{package_id}/manifest # canonical CBOR (Accept: application/cbor)
GET /packages/{package_id}/manifest.json # JCS projection (Accept: application/json)
GET /files/{file_id} # metadata
GET /files/{file_id}/download # bytes
POST /uploads # open an upload session (resource shape pinned now)
PATCH /uploads/{upload_id} # range body
POST /uploads/{upload_id}/complete # promote to /packages/.../files
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
GET /events?since={sequence} # long-poll registry change feed
```
The `/uploads` resource shape is committed now, even if v1 implements
it as single-shot internally; see `PLATFORM-AMBITION` A6.
### Deferred / not v1
- `/v2/…` OCI Distribution endpoints (ADR-0006).
- gRPC API.
- Streaming CDC topic (NATS / Kafka).
- Multi-tenant namespacing in URLs.
## Package manifest content (v1)
A finalised manifest carries:
- `manifest_version: 1`
- `package`: id, name, producer, subject, retention class, created_at,
finalized_at, expires_at, metadata, metadata_schema_id (nullable).
- `files`: ordered list of `{id, relative_path, media_type, size_bytes,
digest_algorithm, digest_primary_hex, digest_sha256_hex}`.
- `storage_receipts`: ordered list of `{file_id, backend_id,
content_address, retrieval_tier, status}` per stored copy.
- `retention_summary`: current class, expires_at, holds, last
retention event.
- `provenance`: `{source_commits, tool_versions, environment,
ingest_actor, ingest_timestamps}`. Schema-driven; freeform under a
registered schema or empty if none.
The manifest digest (`blake3:<hex>`) is the package's canonical
external identifier.
## Storage backends
### Local filesystem (v1)
- Root: configured directory.
- Object key layout: `<root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Atomic write via `fsync(tmpfile) + rename`. No partial states visible.
- Path traversal prevented at the SPI boundary; the local backend
rejects any key that does not match the expected layout.
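
The key layout, the layout check, and the atomic write can be sketched together. This is a minimal illustration under stated assumptions, not the backend's code: the function names are hypothetical, digests are assumed to be 256-bit lowercase hex, and SHA-256 is used in the demo only because it ships with Python.

```python
import hashlib
import os
import re
import tempfile
from pathlib import Path

HEX_RE = re.compile(r"[0-9a-f]{64}")  # ASSUMPTION: 256-bit digests only


def object_path(root: Path, algorithm: str, hex_digest: str) -> Path:
    # Accept only a known algorithm plus plain lowercase hex, so separators
    # or ".." can never reach the filesystem.
    if algorithm not in {"blake3", "sha256"} or not HEX_RE.fullmatch(hex_digest):
        raise ValueError("invalid content address")
    return root / algorithm / hex_digest[0:2] / hex_digest[2:4] / hex_digest


def atomic_write(target: Path, data: bytes) -> None:
    target.parent.mkdir(parents=True, exist_ok=True)
    # Temp file in the same directory, so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # bytes are durable before they become visible
        os.replace(tmp_name, target)  # atomic rename on POSIX
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


root = Path(tempfile.mkdtemp())
payload = b"example artifact"
digest = hashlib.sha256(payload).hexdigest()
path = object_path(root, "sha256", digest)
atomic_write(path, payload)

try:
    object_path(root, "sha256", "../../etc/passwd")
    traversal_rejected = False
except ValueError:
    traversal_rejected = True
```

Readers either see the complete object at its final path or nothing at all. A stricter variant might also `fsync` the parent directory after the rename so the directory entry itself survives a crash.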
### S3-compatible / Ceph RGW (WP-0004)
- Endpoint, bucket, region, access key ref, secret key ref, key
prefix, storage class label, optional SSE config.
- Object key: `<prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Multipart upload for objects above a configurable threshold.
## Security boundary (v1)
- Internal service. No anonymous public access.
- Authenticated producer / operator API. v1 ships shared-secret bearer
tokens; OIDC integration is its own workplan.
- No secret values in artifact metadata.
- Upload paths are logical and are never treated as trusted filesystem
paths. A path-ingest variant of the `/uploads/...` endpoint is *not*
offered in v1.
- Download authorisation is checked at the registry layer, never at
the backend.
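
A shared-secret bearer check can be as small as the sketch below. The token table, the actor names, and the `authenticate` helper are all hypothetical; in a real deployment the secrets would come from configuration, and the check would sit in the HTTP layer in front of `registry`.

```python
import hmac
from typing import Optional

# HYPOTHETICAL token table: token -> actor identity. Real secrets would be
# loaded from configuration, never kept in source or artifact metadata.
TOKENS = {
    "s3cr3t-producer-token": "producer:ci",
    "s3cr3t-operator-token": "operator:alice",
}


def authenticate(authorization_header: Optional[str]) -> Optional[str]:
    """Return the actor for a valid `Authorization: Bearer <token>` header."""
    if not authorization_header:
        return None
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not presented:
        return None
    actor = None
    for token, identity in TOKENS.items():
        # compare_digest avoids leaking where the two strings first differ.
        if hmac.compare_digest(presented.encode(), token.encode()):
            actor = identity
    return actor


producer = authenticate("Bearer s3cr3t-producer-token")
anonymous = authenticate(None)
```

The returned identity is what would be recorded as the `actor` on emitted events, so a rejected request never reaches the event log.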
## Resolved open questions
- **Deduplication scope.** Global by content address (ADR-0001).
Reference-counted deletion via a GC pass (WP-0006, TBD).
- **Deletion ordering.** Mark records `deletion_eligible` first via an
event. Byte deletion is a separate, audited operation that emits a
second event. Reverse order is forbidden.
- **Metadata schemas.** Open JSON with optional producer-registered
JSON Schema; validation at ingest (ADR-0005, `metadata_schemas`).
- **Statehub integration scope.** Statehub keeps package IDs and
summary; never bytes. The `/events` long-poll is the integration
point.
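
The deletion-ordering rule above can be sketched as a small guard. This is illustrative only: `PackageLifecycle` is a hypothetical class, and the name of the second event is a placeholder because the v1 event list does not name a byte-deletion event.

```python
class DeletionOrderError(Exception):
    """Raised when byte deletion is attempted out of order."""


class PackageLifecycle:
    """Hypothetical guard enforcing mark-eligible-then-delete for one package."""

    def __init__(self) -> None:
        self.status = "finalized"
        self.events: list[str] = []

    def mark_deletion_eligible(self) -> None:
        if self.status != "finalized":
            raise DeletionOrderError(f"cannot mark from status {self.status!r}")
        self.status = "deletion_eligible"
        self.events.append("v1.retention.deletion_eligible")

    def delete_bytes(self) -> None:
        # Byte deletion is only legal after the eligibility event was recorded.
        if self.status != "deletion_eligible":
            raise DeletionOrderError("package is not marked deletion_eligible")
        self.status = "deleted"
        # PLACEHOLDER event name: not part of the v1 event list.
        self.events.append("hypothetical.bytes_deleted")


pkg = PackageLifecycle()
try:
    pkg.delete_bytes()  # reverse order: must be refused
    reverse_order_blocked = False
except DeletionOrderError:
    reverse_order_blocked = True

pkg.mark_deletion_eligible()
pkg.delete_bytes()
```

Because each step appends its own event, the audit trail shows both the decision to allow deletion and the deletion itself as separate entries.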
## Outstanding open questions (not blocking v1)
- Identity provider for shared deployments.
- Default retention durations per class (operator-configurable; needs
one round of stakeholder input).
- WASM plugin host design (deferred to its own workplan; see
`PLATFORM-AMBITION`).
- Federation / mirroring protocol (post-OCI-endpoint workplan).
## Roadmap pointer
The implementation sequence is in `docs/ROADMAP.md`. The first
workplan is `workplans/ARTIFACT-STORE-WP-0001-foundation.md`.