# Architecture Blueprint

Status: accepted (v2; supersedes 2026-05-15 draft)
Updated: 2026-05-15

This document operationalises `INTENT.md`, the `docs/PLATFORM-AMBITION.md`
thesis, and the decisions recorded in `docs/adr/`. Where a tension exists
between this blueprint and an ADR, the ADR wins; raise an issue or
supersede the ADR.

## Architecture in one paragraph

`artifact-store` is a **library-first** artifact registry and storage
gateway. A small core library (`artifactstore`) implements identity,
manifests, retention, the storage adapter SPI, the data plane SPI, and
the registry orchestrator. The HTTP server and the CLI are thin
consumers of that library. Bytes are addressed by content
(`blake3:<hex>`) and stored through a pluggable adapter SPI. State is
authoritative in an append-only event log; queryable tables are
materialised views.

## Design lineage

The shape is deliberately borrowed from `ffmpeg` and `VLC`: a tight
core of well-named modules with stable contracts, runtime-pluggable
backends, a thin orchestration binary, and an explicit hot-path
boundary that can be rewritten in faster code without changing the
consumer API. See `docs/PLATFORM-AMBITION.md` for the reference table.

## Top-level shape

```text
        producers / operators / agents
                     |
                     v
        +------------------------+
        |   HTTP API   |   CLI   |   <-- thin consumers
        +------------------------+
                     |
                     v
        +------------------------+
        | registry orchestrator  |
        +------------------------+
             |       |        |
             v       v        v
   +----------+ +---------+ +---------+
   | identity | | events  | |retention|
   |/manifest | | (log +  | | policy  |
   |          | | views)  | | engine  |
   +----------+ +---------+ +---------+
                     |
                     v
        +-----------------------+
        |   data plane SPI      |   <-- ADR-0004 contract
        +-----------------------+
                     |
                     v
        +-----------------------+
        |  storage adapter SPI  |
        +-----------------------+
           |        |        |
           v        v        v
        +-----+ +------+ +-------+
        |local| |  S3  | | Ceph  |   ... future backends
        | FS  | | RGW  | |  RGW  |
        +-----+ +------+ +-------+
```

## Core modules

Mapped one-to-one to ADR-0005's project layout. Each module has a
stable public surface; internals are free to evolve.

### `identity`

- `Digest(algorithm, hex)`: value object.
- `ContentAddress`: `<algorithm>:<hex>` (ADR-0001).
- `digest_stream(reader) -> {primary, sha256}`: single-pass dual digest.
- Algorithm registry: `blake3` (default primary), `sha256` (always
  computed).

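A minimal sketch of the single-pass dual digest, assuming a synchronous binary reader. The `ALGORITHMS` registry, the `chunk_size` parameter, and the `content_address` property are illustrative names, not the library's confirmed API. `blake3` is a third-party package, so this sketch registers it only when it is importable.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import Any, BinaryIO, Callable

# Hypothetical algorithm registry: name -> hasher factory.
ALGORITHMS: dict[str, Callable[[], Any]] = {"sha256": hashlib.sha256}
try:
    from blake3 import blake3  # third-party; optional in this sketch

    ALGORITHMS["blake3"] = blake3
except ImportError:
    pass


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    @property
    def content_address(self) -> str:
        # `<algorithm>:<hex>`, as in ADR-0001.
        return f"{self.algorithm}:{self.hex}"


def digest_stream(
    reader: BinaryIO, primary: str = "blake3", chunk_size: int = 1 << 20
) -> dict[str, Digest]:
    """Read the stream once, feeding every chunk to both hashers."""
    if primary not in ALGORITHMS:
        raise ValueError(f"unknown digest algorithm: {primary}")
    primary_hasher = ALGORITHMS[primary]()
    sha256_hasher = hashlib.sha256()  # always computed for interop
    while chunk := reader.read(chunk_size):
        primary_hasher.update(chunk)
        sha256_hasher.update(chunk)
    return {
        "primary": Digest(primary, primary_hasher.hexdigest()),
        "sha256": Digest("sha256", sha256_hasher.hexdigest()),
    }


# Demo with sha256 as the primary so it runs without third-party packages.
result = digest_stream(io.BytesIO(b"abc"), primary="sha256")
print(result["primary"].content_address)
```

Hashing both algorithms in the same read loop keeps ingest to one pass over the bytes, which matters once objects no longer fit in memory.
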
### `manifest`

- `Manifest`: versioned dataclass: package metadata + ordered file list
  + retention summary + provenance + storage receipts.
- `manifest.codec.encode(manifest) -> bytes`: canonical CBOR
  (ADR-0003).
- `manifest.codec.decode(bytes) -> Manifest`.
- `manifest.projection.jcs(manifest) -> bytes`: canonical-JSON
  projection for display and signing-tool interop.
- Round-trip invariant: `decode(encode(m)) == m` and
  `encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.

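The round-trip invariant can be illustrated with the JSON projection alone. The sketch below uses the standard-library `json` module as a stand-in: sorted keys and compact separators match RFC 8785 (JCS) only for simple payloads of strings, small integers, lists, and ASCII-keyed objects, so treat it as an approximation. The reduced `Manifest` and `FileEntry` shapes and the `from_jcs` helper are hypothetical. The real codec would use canonical CBOR via the `cbor2` package named in the stack.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    media_type: str
    size_bytes: int


@dataclass(frozen=True)
class Manifest:  # heavily reduced; the real manifest carries more fields
    manifest_version: int
    name: str
    files: tuple[FileEntry, ...]


def jcs(manifest: Manifest) -> bytes:
    """Deterministic JSON: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        asdict(manifest), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def from_jcs(data: bytes) -> Manifest:
    raw = json.loads(data)
    return Manifest(
        manifest_version=raw["manifest_version"],
        name=raw["name"],
        files=tuple(FileEntry(**entry) for entry in raw["files"]),
    )


m = Manifest(1, "run-42", (FileEntry("logs/out.txt", "text/plain", 12),))
assert from_jcs(jcs(m)) == m             # decode(encode(m)) == m
assert jcs(from_jcs(jcs(m))) == jcs(m)   # re-encoding is byte-stable
```

Byte-stable re-encoding is what makes the manifest digest usable as an identifier: two encoders that agree on the canonical form agree on the digest.
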
### `events`

- `events.write(transaction, event)`: appends one row with monotonic
  sequence (ADR-0002).
- `events.tail(since_sequence) -> AsyncIterator[Event]`: long-poll.
- `events.replay(into=ViewWriter)`: rebuild materialised views.
- Event types (v1):
  `v1.package.created`, `v1.file.ingested`, `v1.package.finalized`,
  `v1.retention.default_applied`, `v1.retention.extended`,
  `v1.retention.hold_applied`, `v1.retention.hold_released`,
  `v1.retention.deletion_eligible`, `v1.storage.location_recorded`,
  `v1.storage.location_verified`, `v1.audit.access`,
  `v1.system.note`.

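The write / tail / replay trio can be sketched against an in-memory log. This is an illustration under stated assumptions, not the ADR-0002 implementation: `InMemoryEventLog`, `PackageStatusView`, and the simplified `write(event_type, subject_id, payload)` signature are hypothetical, and a real log would live in the database and take a transaction.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


class InMemoryEventLog:
    def __init__(self) -> None:
        self._events: list[Event] = []
        self._appended = asyncio.Condition()

    async def write(self, event_type: str, subject_id: str, payload: dict) -> Event:
        async with self._appended:
            # Sequence is assigned at append time: 1, 2, 3, ...
            event = Event(len(self._events) + 1, event_type, subject_id, payload)
            self._events.append(event)
            self._appended.notify_all()
            return event

    async def tail(self, since_sequence: int) -> AsyncIterator[Event]:
        cursor = since_sequence
        while True:
            async with self._appended:
                # Block until something newer than the cursor exists.
                await self._appended.wait_for(lambda: len(self._events) > cursor)
                batch = self._events[cursor:]
            for event in batch:
                cursor = event.sequence
                yield event

    def replay(self, into: "PackageStatusView") -> None:
        for event in self._events:
            into.apply(event)


class PackageStatusView:
    """A tiny materialised view: package id -> latest status."""

    def __init__(self) -> None:
        self.status: dict[str, str] = {}

    def apply(self, event: Event) -> None:
        if event.event_type == "v1.package.created":
            self.status[event.subject_id] = "created"
        elif event.event_type == "v1.package.finalized":
            self.status[event.subject_id] = "finalized"


async def demo() -> tuple[list[int], dict[str, str]]:
    log = InMemoryEventLog()
    await log.write("v1.package.created", "pkg-1", {})
    await log.write("v1.package.finalized", "pkg-1", {})
    seen: list[int] = []
    async for event in log.tail(since_sequence=0):
        seen.append(event.sequence)
        if len(seen) == 2:
            break
    view = PackageStatusView()
    log.replay(into=view)
    return seen, view.status


print(asyncio.run(demo()))
```

Because the view is rebuilt purely from `apply` calls, dropping it and replaying the log yields the same state.
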
### `retention`

- `retention.classes`: `transient`, `raw-evidence`, `summary-evidence`,
  `release-evidence`, `permanent-record`. Defined as data, not code.
- `retention.policy.apply(package, class) -> RetentionDecision`:
  computes `expires_at` and the deletion eligibility rule.
- `retention.extend(package, until, reason, actor)`: emits an event;
  the materialised view updates on commit.
- `retention.hold(package, reason, actor)` /
  `retention.release_hold(hold_id, actor)`.

### `audit`

- A view over `events` filtered to access and lifecycle events. No
  separate write path; auditing happens by event emission elsewhere.

### `storage` (adapter SPI)

```python
from typing import AsyncIterator, Protocol

class StorageBackend(Protocol):
    """Adapter SPI; one implementation per backend, keyed by content address."""
    backend_id: str
    async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
    # byte_range=None means the whole object.
    async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
    async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
    async def health(self) -> BackendStatus: ...
```

- Backend registry: backends register at import time; selection is
  per-package by configuration.
- v1 ships `local` (filesystem); `s3` ships in WP-0004.

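To make the SPI concrete, here is a hedged sketch of an in-memory backend covering `put` and `get` only. `MemoryBackend`, the `StorageReceipt` fields, the `BACKENDS` registry, and the half-open `(start, end)` reading of `byte_range` are all assumptions; the blueprint does not pin them. Content addresses are plain strings here, and `"sha256:abc"` is a placeholder rather than a real digest.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass(frozen=True)
class StorageReceipt:  # hypothetical field set
    backend_id: str
    content_address: str
    size_bytes: int


class MemoryBackend:
    backend_id = "memory"

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def put(self, content_address, stream, size_hint=None) -> StorageReceipt:
        data = b"".join([chunk async for chunk in stream])
        # Content-addressed: writing the same address twice is a no-op.
        self._objects.setdefault(content_address, data)
        return StorageReceipt(self.backend_id, content_address, len(data))

    async def get(self, content_address, byte_range=None) -> AsyncIterator[bytes]:
        data = self._objects[content_address]
        if byte_range is not None:
            start, end = byte_range  # assumed half-open [start, end)
            data = data[start:end]

        async def chunks() -> AsyncIterator[bytes]:
            yield data

        # Matches the SPI as written: a coroutine that returns an iterator.
        return chunks()


# Hypothetical registry keyed by backend_id.
BACKENDS: dict[str, MemoryBackend] = {}


def register(backend: MemoryBackend) -> None:
    BACKENDS[backend.backend_id] = backend


async def demo() -> tuple[int, bytes, bytes]:
    register(MemoryBackend())
    backend = BACKENDS["memory"]

    async def source() -> AsyncIterator[bytes]:
        yield b"hello "
        yield b"world"

    receipt = await backend.put("sha256:abc", source())
    whole = b"".join([c async for c in await backend.get("sha256:abc")])
    head = b"".join([c async for c in await backend.get("sha256:abc", (0, 5))])
    return receipt.size_bytes, whole, head


print(asyncio.run(demo()))
```

Note the `await backend.get(...)` before iterating: with `async def get(...) -> AsyncIterator[bytes]` declared on a `Protocol`, callers receive the iterator from a coroutine, so an async-generator implementation would not match the declared type.
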
### `dataplane` (SPI per ADR-0004)

```python
from typing import AsyncIterator, Protocol

class DataPlane(Protocol):
    """Hot-path contract between the registry (control plane) and byte handling."""
    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
    async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
    async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...
```

- v1 implementation: `dataplane.inproc`, which wraps a `StorageBackend`
  and computes digests during streaming.
- Future implementation: `dataplane.remote`, a gRPC or
  framed-bincode-over-Unix-socket client to a Rust daemon.

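One wrinkle an in-process data plane has to handle: `StorageBackend.put` takes the content address up front, but the address is only known once the stream has been fully hashed. The sketch below shows one possible approach, spooling to a temporary file while hashing and then replaying into the backend. Whether `dataplane.inproc` actually stages this way is not stated here; `InprocDataPlane`, `DictBackend`, and the `IngestResult` fields are hypothetical, `hints` is omitted, and SHA-256 stands in for the BLAKE3 primary to stay within the standard library.

```python
import asyncio
import hashlib
import tempfile
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass(frozen=True)
class IngestResult:  # hypothetical field set
    content_address: str
    size_bytes: int


class DictBackend:
    """Stand-in for a StorageBackend; implements only `put`."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    async def put(self, content_address, stream, size_hint=None) -> None:
        self.objects[content_address] = b"".join([c async for c in stream])


class InprocDataPlane:
    def __init__(self, backend: DictBackend) -> None:
        self._backend = backend

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> IngestResult:
        hasher = hashlib.sha256()
        size = 0
        # Stays in memory up to 8 MiB, then spills to disk.
        with tempfile.SpooledTemporaryFile(max_size=8 << 20) as spool:
            async for chunk in stream:
                hasher.update(chunk)  # digest computed while streaming
                spool.write(chunk)    # blocking I/O; acceptable in a sketch
                size += len(chunk)
            address = f"sha256:{hasher.hexdigest()}"
            spool.seek(0)

            async def replay() -> AsyncIterator[bytes]:
                while block := spool.read(1 << 16):
                    yield block

            await self._backend.put(address, replay(), size)
        return IngestResult(address, size)


async def demo() -> tuple[IngestResult, dict[str, bytes]]:
    backend = DictBackend()
    plane = InprocDataPlane(backend)

    async def source() -> AsyncIterator[bytes]:
        yield b"a"
        yield b"bc"

    result = await plane.ingest_stream(source())
    return result, backend.objects


print(asyncio.run(demo())[0])
```

A backend that can write to a staging key and rename on completion would avoid the second pass; that is a design choice for the adapter, not something this sketch decides.
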
### `registry`

The orchestrator. Combines `identity + manifest + events + retention +
dataplane` into the operations the HTTP API and CLI consume:
`create_package`, `ingest_file`, `finalize_package`, `get_manifest`,
`download_file`, `extend_retention`, `apply_hold`, `release_hold`,
`mark_deletion_eligible`, `tail_events`. Each operation is one DB
transaction that writes one or more events and updates materialised
views.

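The "one transaction writes the event and updates the view" rule can be shown with `sqlite3` standing in for PostgreSQL and SQLAlchemy Core. The reduced schema, the JSON payload (the real column holds canonical CBOR), and this `create_package` signature are assumptions made for the example.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE events (
    sequence   INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    subject_id TEXT NOT NULL,
    payload    TEXT NOT NULL
);
CREATE TABLE artifact_packages (
    id                  TEXT PRIMARY KEY,
    name                TEXT NOT NULL,
    status              TEXT NOT NULL,
    last_event_sequence INTEGER NOT NULL
);
"""


def create_package(conn: sqlite3.Connection, package_id: str, name: str) -> int:
    # `with conn` commits on success and rolls back on any exception,
    # so the event row and the view row land together or not at all.
    with conn:
        cursor = conn.execute(
            "INSERT INTO events (event_type, subject_id, payload) VALUES (?, ?, ?)",
            ("v1.package.created", package_id, json.dumps({"name": name})),
        )
        sequence = cursor.lastrowid
        conn.execute(
            "INSERT INTO artifact_packages (id, name, status, last_event_sequence) "
            "VALUES (?, ?, 'created', ?)",
            (package_id, name, sequence),
        )
    return sequence


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
print(create_package(conn, "pkg-1", "run-42"))

# A duplicate id violates the view's primary key; the event is rolled back too.
try:
    create_package(conn, "pkg-1", "run-42")
except sqlite3.IntegrityError:
    pass
print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```

If the view update fails, no orphan event survives, which is what keeps replay and the live views in agreement.
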
### `api.http` and `cli`

Thin. Their job is to translate transport (HTTP / argv) into calls on
`registry`. No business logic.

## Data model

All tables exist as **materialised views over `events`** (ADR-0002),
except `events` itself, `retention_classes` (seed data), and
`metadata_schemas` (config).

### `events` (source of truth)

| Column | Type | Notes |
|---|---|---|
| `sequence` | `BIGSERIAL PRIMARY KEY` | monotonic, gapless (note: a plain `BIGSERIAL` can skip values after a rolled-back insert) |
| `created_at` | `TIMESTAMPTZ NOT NULL` | UTC, set by DB default |
| `event_type` | `TEXT NOT NULL` | versioned slug (`v1.…`) |
| `subject_kind` | `TEXT NOT NULL` | `package` / `file` / `retention` / `storage` / `system` |
| `subject_id` | `UUID NULL` | |
| `actor` | `TEXT NOT NULL` | producer or operator identity |
| `payload` | `BYTEA NOT NULL` | canonical CBOR |
| `payload_digest` | `BYTEA NOT NULL` | BLAKE3 of `payload` |

Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.

### `artifact_packages` (materialised view)

| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `name` | `TEXT NOT NULL` | |
| `producer` | `TEXT NOT NULL` | |
| `subject` | `TEXT NOT NULL` | |
| `retention_class` | `TEXT NOT NULL` | FK to `retention_classes` |
| `metadata_schema_id` | `UUID NULL` | FK to `metadata_schemas` |
| `metadata` | `JSONB NOT NULL` | validated against schema if present |
| `status` | `TEXT NOT NULL` | `created` / `uploading` / `finalized` / `deletion_eligible` / `deleted` / `failed` |
| `manifest_digest` | `BYTEA NULL` | populated on finalize |
| `created_at`, `finalized_at`, `expires_at` | `TIMESTAMPTZ` | |
| `last_event_sequence` | `BIGINT NOT NULL` | for replay bookkeeping |

### `artifact_files` (materialised view)

| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `package_id` | `UUID NOT NULL` | FK |
| `relative_path` | `TEXT NOT NULL` | logical path; unique within package |
| `media_type` | `TEXT NOT NULL` | required (ADR-0006) |
| `size_bytes` | `BIGINT NOT NULL` | |
| `digest_algorithm` | `TEXT NOT NULL` | `blake3` by default (ADR-0001) |
| `digest_primary` | `BYTEA NOT NULL` | bytes of the primary digest |
| `digest_sha256` | `BYTEA NOT NULL` | always populated for interop |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |

### `storage_locations` (materialised view)

| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `artifact_file_id` | `UUID NOT NULL` | FK |
| `backend_id` | `TEXT NOT NULL` | |
| `content_address` | `TEXT NOT NULL` | `<algo>:<hex>` |
| `object_key` | `TEXT NOT NULL` | backend-specific, usually derived from `content_address` |
| `storage_class` | `TEXT NULL` | backend-specific label |
| `retrieval_tier` | `TEXT NOT NULL DEFAULT 'hot'` | `hot` / `warm` / `cold` / `archive` |
| `restore_status` | `TEXT NULL` | `available` / `restore_requested` / `restoring` / `restored` / `expired` |
| `status` | `TEXT NOT NULL` | `recorded` / `verified` / `failed` / `deleted` |
| `created_at`, `last_verified_at` | `TIMESTAMPTZ` | |

### `retention_state` (materialised view)

| Column | Type | Notes |
|---|---|---|
| `package_id` | `UUID PRIMARY KEY` | |
| `current_expires_at` | `TIMESTAMPTZ NULL` | NULL = no expiry (permanent or held) |
| `effective_class` | `TEXT NOT NULL` | |
| `active_hold_id` | `UUID NULL` | |
| `eligible_for_deletion` | `BOOLEAN NOT NULL` | |

### `retention_classes` (seed data, not derived)

| Column | Type | Notes |
|---|---|---|
| `class_id` | `TEXT PRIMARY KEY` | `transient` / `raw-evidence` / `summary-evidence` / `release-evidence` / `permanent-record` |
| `default_duration` | `INTERVAL NULL` | NULL for `permanent-record` |
| `deletion_strategy` | `TEXT NOT NULL` | `mark_eligible` / `auto_delete_after_grace` (v1 only uses the former) |

### `metadata_schemas` (config table)

| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `slug` | `TEXT NOT NULL UNIQUE` | e.g. `guide-board.run.v1` |
| `json_schema` | `JSONB NOT NULL` | |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |

## API shape

### Native v1 surface

```text
GET    /health
GET    /backends
GET    /retention-classes

POST   /packages                              # create
GET    /packages                              # list, query by metadata
GET    /packages/{package_id}                 # metadata
POST   /packages/{package_id}/files           # single-shot file upload
POST   /packages/{package_id}/finalize        # produce manifest
GET    /packages/{package_id}/manifest        # canonical CBOR (Accept: application/cbor)
GET    /packages/{package_id}/manifest.json   # JCS projection (Accept: application/json)

GET    /files/{file_id}                       # metadata
GET    /files/{file_id}/download              # bytes

POST   /uploads                               # open an upload session (resource shape pinned now)
PATCH  /uploads/{upload_id}                   # range body
POST   /uploads/{upload_id}/complete          # promote to /packages/.../files

POST   /packages/{package_id}/retention/extensions
POST   /packages/{package_id}/retention/holds
POST   /packages/{package_id}/retention/holds/{hold_id}/release

GET    /events?since={sequence}               # long-poll registry change feed
```

The `POST /uploads/...` resource shape is committed now even if v1
implements it as single-shot internally; ADR per `PLATFORM-AMBITION` A6.

### Deferred / not v1

- `/v2/…` OCI Distribution endpoints (ADR-0006).
- gRPC API.
- Streaming CDC topic (NATS / Kafka).
- Multi-tenant namespacing in URLs.

## Package manifest content (v1)

A finalised manifest carries:

- `manifest_version: 1`
- `package`: id, name, producer, subject, retention class, created_at,
  finalized_at, expires_at, metadata, metadata_schema_id (nullable).
- `files`: ordered list of `{id, relative_path, media_type, size_bytes,
  digest_algorithm, digest_primary_hex, digest_sha256_hex}`.
- `storage_receipts`: ordered list of `{file_id, backend_id,
  content_address, retrieval_tier, status}` per stored copy.
- `retention_summary`: current class, expires_at, holds, last
  retention event.
- `provenance`: `{source_commits, tool_versions, environment,
  ingest_actor, ingest_timestamps}`. Schema-driven; freeform under a
  registered schema or empty if none.

The manifest digest (`blake3:<hex>`) is the package's canonical
external identifier.

## Storage backends

### Local filesystem (v1)

- Root: configured directory.
- Object key layout: `<root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Atomic write via `fsync(tmpfile) + rename`. No partial states visible.
- Path traversal prevented at the SPI boundary; the local backend
  rejects any key that does not match the expected layout.

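The key layout and the fsync-then-rename write can be sketched with the standard library. The function names `object_path` and `atomic_write`, the address regex, and the `.tmp-` prefix are assumptions for illustration; the layout itself follows the bullets above.

```python
import contextlib
import os
import re
import tempfile
from pathlib import Path

# Only `<algo>:<lowercase hex>` is accepted, which also rules out traversal.
_ADDRESS = re.compile(r"(?P<algo>[a-z0-9]+):(?P<hex>[0-9a-f]{8,128})")


def object_path(root: Path, content_address: str) -> Path:
    match = _ADDRESS.fullmatch(content_address)
    if match is None:
        raise ValueError(f"malformed content address: {content_address!r}")
    algo, hexdigest = match["algo"], match["hex"]
    return root / algo / hexdigest[0:2] / hexdigest[2:4] / hexdigest


def atomic_write(root: Path, content_address: str, data: bytes) -> Path:
    target = object_path(root, content_address)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on
    # one filesystem and is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # bytes are durable before they become visible
        os.replace(tmp_name, target)
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    return target


with tempfile.TemporaryDirectory() as tmpdir:
    address = "sha256:" + "ab" * 32  # placeholder digest, not a real hash
    path = atomic_write(Path(tmpdir), address, b"payload")
    print(path.relative_to(tmpdir))
```

Readers either see the complete object at its final path or nothing at all; a crash mid-write leaves at most a stray `.tmp-` file. A stricter variant would also fsync the parent directory after the rename.
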
### S3-compatible / Ceph RGW (WP-0004)

- Endpoint, bucket, region, access key ref, secret key ref, key
  prefix, storage class label, optional SSE config.
- Object key: `<prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Multipart upload for objects above a configurable threshold.

## Security boundary (v1)

- Internal service. No anonymous public access.
- Authenticated producer / operator API. v1 ships shared-secret bearer
  tokens; OIDC integration is its own workplan.
- No secret values in artifact metadata.
- Upload paths are logical and are never trusted as filesystem paths.
  The `/uploads/...` path-ingest endpoint is *not* offered in v1.
- Download authorisation is checked at the registry layer, never at
  the backend.

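Treating upload paths as logical implies validating `relative_path` before it is recorded. The rules below (no empty path, no NUL or backslash, no absolute path, no `..` component) are one plausible reading rather than a stated policy, and `validate_relative_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def validate_relative_path(raw: str) -> str:
    """Return a normalised logical path or raise ValueError."""
    if not raw or "\x00" in raw or "\\" in raw:
        raise ValueError(f"invalid logical path: {raw!r}")
    path = PurePosixPath(raw)
    if path.is_absolute():
        raise ValueError(f"absolute paths are not allowed: {raw!r}")
    # PurePosixPath never touches the filesystem; it only parses the string.
    if not path.parts or ".." in path.parts:
        raise ValueError(f"path escapes the package: {raw!r}")
    return str(path)


print(validate_relative_path("logs/out.txt"))

for bad in ("../etc/passwd", "/etc/passwd", "a/../../b", ""):
    try:
        validate_relative_path(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

Since bytes are stored under their content address, the logical path is only ever metadata; validation guards consumers that later unpack a package to disk.
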
## Resolved open questions

- **Deduplication scope.** Global by content address (ADR-0001).
  Reference-counted deletion via a GC pass (WP-0006, TBD).
- **Deletion ordering.** Mark records `deletion_eligible` first via an
  event. Byte deletion is a separate, audited operation that emits a
  second event. Reverse order is forbidden.
- **Metadata schemas.** Open JSON with optional producer-registered
  JSON Schema; validation at ingest (ADR-0005, `metadata_schemas`).
- **Statehub integration scope.** Statehub keeps package IDs and
  summary; never bytes. The `/events` long-poll is the integration
  point.

## Outstanding open questions (not blocking v1)

- Identity provider for shared deployments.
- Default retention durations per class (operator-configurable; needs
  one round of stakeholder input).
- WASM plugin host design (deferred to its own workplan; see
  `PLATFORM-AMBITION`).
- Federation / mirroring protocol (post-OCI-endpoint workplan).

## Roadmap pointer

The implementation sequence is in `docs/ROADMAP.md`. The first
workplan is `workplans/ARTIFACT-STORE-WP-0001-foundation.md`.