docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans

Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00
parent 403d903585
commit 747afc27a6
16 changed files with 1761 additions and 404 deletions

@@ -1,330 +1,378 @@
# Artifact Store Architecture Blueprint
# Architecture Blueprint
Status: draft
Created: 2026-05-15
Status: accepted (v2 — supersedes 2026-05-15 draft)
Updated: 2026-05-15
## Purpose
This document operationalises `INTENT.md`, the `docs/PLATFORM-AMBITION.md`
thesis, and the decisions recorded in `docs/adr/`. Where a tension exists
between this blueprint and an ADR, the ADR wins; raise an issue or
supersede the ADR.
`artifact-store` provides a generic registry and storage gateway for durable
generated artifacts. Producers register packages and files with metadata;
storage adapters persist the bytes; retention policy decides how long artifacts
remain eligible for retrieval.
## Architecture in one paragraph
The design keeps artifact identity and lifecycle separate from storage
implementation. This allows the first version to run against local filesystem
storage while the production path can use S3-compatible object storage such as
Ceph RGW.
`artifact-store` is a **library-first** artifact registry and storage
gateway. A small core library (`artifactstore`) implements identity,
manifests, retention, the storage adapter SPI, the data plane SPI, and
the registry orchestrator. The HTTP server and the CLI are thin
consumers of that library. Bytes are addressed by content
(`blake3:<hex>`) and stored through a pluggable adapter SPI. State is
authoritative in an append-only event log; queryable tables are
materialised views.
## Architecture Summary
## Design lineage
The shape is deliberately borrowed from `ffmpeg` and `VLC`: a tight
core of well-named modules with stable contracts, runtime-pluggable
backends, a thin orchestration binary, and an explicit hot-path
boundary that can be rewritten in faster code without changing the
consumer API. See `docs/PLATFORM-AMBITION.md` for the reference table.
## Top-level shape
```text
producer
  -> Artifact Registry API
       -> metadata database
       -> retention policy engine
       -> audit event log
       -> storage adapter interface
            -> local filesystem backend
            -> S3-compatible backend
                 -> Ceph RGW deployment
            -> future cloud/blob/archive backends
producers / operators / agents
            |
            v
+------------------------+
|   HTTP API   |   CLI   |   <-- thin consumers
+------------------------+
            |
            v
+------------------------+
| registry orchestrator  |
+------------------------+
     |           |          |
     v           v          v
+----------+ +---------+ +---------+
| identity | | events  | |retention|
|/manifest | | (log +  | | policy  |
|          | | views)  | | engine  |
+----------+ +---------+ +---------+
            |
            v
+-----------------------+
| data plane SPI        |   <-- ADR-0004 contract
+-----------------------+
            |
            v
+-----------------------+
| storage adapter SPI   |
+-----------------------+
   |        |         |
   v        v         v
+-----+  +------+  +-------+
|local|  |  S3  |  | Ceph  |   ... future backends
| FS  |  | RGW  |  | RGW   |
+-----+  +------+  +-------+
```
The registry is the authority for artifact metadata and lifecycle. Backends are
responsible for byte storage and retrieval.
## Core modules
## Design Principles
Mapped one-to-one to ADR-0005's project layout. Each module has a
stable public surface; internals are free to evolve.
- Backend-neutral registry: no producer should know whether bytes live in Ceph,
local disk, or a cloud bucket.
- Content-addressable confidence: every stored file has a digest and size.
- Retention by default: every package receives an expiry decision at ingestion.
- Extensions are explicit: retention extensions and holds are audit events, not
silent metadata edits.
- Packages remain portable: a manifest should be enough to understand a package
without calling the producer.
- Statehub links, it does not store bytes: Statehub records artifact IDs and
outcomes; artifact-store owns file persistence.
- Deletion is deliberate: expiry makes artifacts eligible for deletion; deletion
jobs must be auditable and reversible only when the backend still has data.
### `identity`
## Components
- `Digest(algorithm, hex)` — value object.
- `ContentAddress` — `<algorithm>:<hex>` (ADR-0001).
- `digest_stream(reader) -> {primary, sha256}` — single-pass dual digest.
- Algorithm registry: `blake3` (default primary), `sha256` (always
computed).
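The single-pass dual digest described for `identity` can be sketched as follows. This is a minimal illustration, not the project's implementation: the standard library has no BLAKE3, so `hashlib.blake2b` stands in for the primary algorithm, and the `content_address()` helper and `chunk_size` parameter are assumptions made for the example.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO


@dataclass(frozen=True)
class Digest:
    """Value object: algorithm name plus lowercase hex digest."""
    algorithm: str
    hex: str

    def content_address(self) -> str:
        # Renders the `<algorithm>:<hex>` form used for content addresses.
        return f"{self.algorithm}:{self.hex}"


def digest_stream(reader: BinaryIO, chunk_size: int = 1 << 20) -> dict[str, Digest]:
    """Read the stream once and feed every chunk to both hashers."""
    # ASSUMPTION: blake2b is a stand-in for BLAKE3, which is not in the stdlib.
    primary = hashlib.blake2b()
    secondary = hashlib.sha256()
    while chunk := reader.read(chunk_size):
        primary.update(chunk)
        secondary.update(chunk)
    return {
        "primary": Digest("blake2b", primary.hexdigest()),
        "sha256": Digest("sha256", secondary.hexdigest()),
    }


# A small chunk size forces several reads, showing that chunking does not
# change the result.
result = digest_stream(io.BytesIO(b"hello artifact"), chunk_size=4)
print(result["sha256"].content_address())
```

Both digests come from the same read loop, so the bytes are never traversed twice regardless of object size.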
### Registry API
### `manifest`
HTTP API for producers and operators.
- `Manifest` — versioned dataclass: package metadata + ordered file list
+ retention summary + provenance + storage receipts.
- `manifest.codec.encode(manifest) -> bytes` — canonical CBOR
(ADR-0003).
- `manifest.codec.decode(bytes) -> Manifest`.
- `manifest.projection.jcs(manifest) -> bytes` — canonical-JSON
projection for display and signing-tool interop.
- Round-trip invariant: `decode(encode(m)) == m` and
`encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.
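The round-trip invariant can be illustrated with a deliberately simplified codec. The sketch below uses sorted-key, whitespace-free JSON from the standard library as a stand-in for canonical CBOR; it is not a full RFC 8785 (JCS) implementation, and the trimmed `Manifest` and `FileEntry` shapes are assumptions made for the example.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    size_bytes: int
    digest_sha256_hex: str


@dataclass(frozen=True)
class Manifest:
    manifest_version: int
    package: dict
    files: tuple = field(default_factory=tuple)


def encode(manifest: Manifest) -> bytes:
    # Sorted keys and fixed separators give exactly one byte string per value.
    text = json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":"))
    return text.encode("utf-8")


def decode(data: bytes) -> Manifest:
    raw = json.loads(data)
    files = tuple(FileEntry(**entry) for entry in raw["files"])
    return Manifest(raw["manifest_version"], raw["package"], files)


entry = FileEntry("reports/report.md", 1024, "ab" * 32)
m = Manifest(1, {"name": "run-42", "producer": "guide-board"}, (entry,))
# Same content, different dict insertion order.
m_reordered = Manifest(1, {"producer": "guide-board", "name": "run-42"}, (entry,))

print(decode(encode(m)) == m)
print(encode(m) == encode(m_reordered))
```

Because the encoding is deterministic, a digest taken over `encode(m)` identifies the manifest content rather than one particular serialisation of it.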
Initial responsibilities:
### `events`
- create artifact packages,
- upload or ingest files,
- finalize packages,
- retrieve package metadata,
- list/search packages by subject and producer metadata,
- create retention extensions and holds,
- expose download metadata or redirect/download endpoints,
- expose health and backend status.
- `events.write(transaction, event)` — appends one row with monotonic
sequence (ADR-0002).
- `events.tail(since_sequence) -> AsyncIterator[Event]` — long-poll.
- `events.replay(into=ViewWriter)` — rebuild materialised views.
- Event types (v1):
`v1.package.created`, `v1.file.ingested`, `v1.package.finalized`,
`v1.retention.default_applied`, `v1.retention.extended`,
`v1.retention.hold_applied`, `v1.retention.hold_released`,
`v1.retention.deletion_eligible`, `v1.storage.location_recorded`,
`v1.storage.location_verified`, `v1.audit.access`,
`v1.system.note`.
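To make the write / tail / replay contract concrete, here is a small in-memory sketch. It is synchronous and list-backed, whereas the blueprint describes an async, database-backed log; the view columns and the reducer logic are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


class EventLog:
    """Append-only list standing in for the `events` table."""

    def __init__(self) -> None:
        self._rows: list[Event] = []

    def write(self, event_type: str, subject_id: str, payload: dict) -> Event:
        # Sequence numbers are assigned on append and never reused.
        event = Event(len(self._rows) + 1, event_type, subject_id, payload)
        self._rows.append(event)
        return event

    def tail(self, since_sequence: int = 0) -> Iterator[Event]:
        # Yields only events strictly after the caller's cursor.
        return (e for e in self._rows if e.sequence > since_sequence)


def replay(log: EventLog) -> dict[str, dict]:
    """Rebuild a package view from scratch by folding over the log."""
    view: dict[str, dict] = {}
    for event in log.tail(0):
        if event.event_type == "v1.package.created":
            view[event.subject_id] = {"status": "created", "files": 0}
        elif event.event_type == "v1.file.ingested":
            view[event.subject_id]["status"] = "uploading"
            view[event.subject_id]["files"] += 1
        elif event.event_type == "v1.package.finalized":
            view[event.subject_id]["status"] = "finalized"
        if event.subject_id in view:
            view[event.subject_id]["last_event_sequence"] = event.sequence
    return view


log = EventLog()
log.write("v1.package.created", "pkg-1", {"name": "run-42"})
log.write("v1.file.ingested", "pkg-1", {"relative_path": "run.json"})
log.write("v1.file.ingested", "pkg-1", {"relative_path": "reports/report.md"})
log.write("v1.package.finalized", "pkg-1", {})

packages = replay(log)
print(packages["pkg-1"])
```

Dropping the view and calling `replay` again yields the same rows, which is the property that lets queryable tables be treated as disposable.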
### Metadata Store
### `retention`
Persistent database for registry state.
- `retention.classes` — `transient`, `raw-evidence`, `summary-evidence`,
`release-evidence`, `permanent-record`. Defined as data, not code.
- `retention.policy.apply(package, class) -> RetentionDecision` —
computes `expires_at` and the deletion eligibility rule.
- `retention.extend(package, until, reason, actor)` — emits an event;
the materialised view updates on commit.
- `retention.hold(package, reason, actor)` /
`retention.release_hold(hold_id, actor)`.
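A rough sketch of how a policy function might derive `expires_at` and deletion eligibility from class data. The durations below are placeholders invented for the example (the blueprint leaves default durations as an open question), and the `apply` signature is simplified to plain values rather than a package object.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# ASSUMPTION: illustrative durations only; real defaults are operator-configured.
DEFAULT_DURATIONS: dict[str, Optional[timedelta]] = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,  # no automatic expiry
}


@dataclass(frozen=True)
class RetentionDecision:
    retention_class: str
    expires_at: Optional[datetime]
    eligible_for_deletion: bool


def apply(created_at: datetime, retention_class: str, now: datetime,
          active_hold: bool = False) -> RetentionDecision:
    duration = DEFAULT_DURATIONS[retention_class]
    expires_at = None if duration is None else created_at + duration
    # Eligible only when an expiry exists, it has passed, and no hold is active.
    eligible = expires_at is not None and not active_hold and now >= expires_at
    return RetentionDecision(retention_class, expires_at, eligible)


created = datetime(2026, 5, 15, tzinfo=timezone.utc)
later = created + timedelta(days=8)

expired = apply(created, "transient", now=later)
held = apply(created, "transient", now=later, active_hold=True)
permanent = apply(created, "permanent-record", now=later)
print(expired.eligible_for_deletion, held.eligible_for_deletion, permanent.expires_at)
```

Keeping the class table as plain data means a duration change is a configuration edit rather than a code change.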
Initial implementation can use SQLite for local development and PostgreSQL for
shared service deployments if that matches the surrounding service stack.
### `audit`
Core tables:
- A view over `events` filtered to access and lifecycle events. No
separate write path; auditing happens by event emission elsewhere.
- `artifact_packages`
- `artifact_files`
- `storage_locations`
- `retention_rules`
- `retention_events`
- `audit_events`
### `storage` (adapter SPI)
### Storage Adapter Interface
```python
class StorageBackend(Protocol):
backend_id: str
async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
async def health(self) -> BackendStatus: ...
```
Small backend contract used by the API service.
- Backend registry: backends register at import time; selection is
per-package by configuration.
- v1 ships `local` (filesystem); `s3` ships in WP-0004.
Required operations:
### `dataplane` (SPI per ADR-0004)
- `put(object_key, stream, metadata) -> storage_location`
- `get(object_key) -> stream or signed_url`
- `head(object_key) -> object_metadata`
- `delete(object_key) -> deletion_result`
- `health() -> backend_status`
```python
class DataPlane(Protocol):
async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
async def backend_health(self) -> BackendStatus: ...
```
Initial backends:
- v1 implementation: `dataplane.inproc` — wraps a `StorageBackend`,
computes digests during streaming.
- Future implementation: `dataplane.remote` — gRPC or
framed-bincode-over-Unix-socket client to a Rust daemon.
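One possible shape for an in-process data plane that hashes while it streams is sketched below. It is a simplification under several assumptions: chunks are buffered in memory before being handed to the backend (a real implementation would more likely stage to a temporary object), `MemoryBackend` is a toy stand-in for a `StorageBackend`, `blake2b` stands in for BLAKE3, and the result is a plain dict rather than an `IngestResult`.

```python
import asyncio
import hashlib
from typing import AsyncIterator, Optional


class MemoryBackend:
    """Toy backend that keeps objects in a dict keyed by content address."""
    backend_id = "memory"

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    async def put(self, content_address: str, stream: AsyncIterator[bytes],
                  size_hint: Optional[int]) -> None:
        self.objects[content_address] = b"".join([chunk async for chunk in stream])


class InProcDataPlane:
    def __init__(self, backend: MemoryBackend) -> None:
        self.backend = backend

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> dict:
        # ASSUMPTION: blake2b stands in for BLAKE3 (not in the stdlib).
        primary, sha256 = hashlib.blake2b(), hashlib.sha256()
        buffered: list[bytes] = []
        async for chunk in stream:
            primary.update(chunk)
            sha256.update(chunk)
            buffered.append(chunk)
        address = f"blake2b:{primary.hexdigest()}"

        async def buffered_chunks() -> AsyncIterator[bytes]:
            for chunk in buffered:
                yield chunk

        # The address is only known after the last chunk, so the put happens here.
        size = sum(len(chunk) for chunk in buffered)
        await self.backend.put(address, buffered_chunks(), size)
        return {"content_address": address, "sha256": sha256.hexdigest(), "size_bytes": size}


async def upload() -> AsyncIterator[bytes]:
    for chunk in (b"hello ", b"world"):
        yield chunk


backend = MemoryBackend()
result = asyncio.run(InProcDataPlane(backend).ingest_stream(upload()))
print(result["size_bytes"], result["content_address"][:16])
```

The caller never supplies a digest; the address is a by-product of ingestion, which keeps producers unaware of how bytes are keyed.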
- local filesystem backend for tests and development,
- S3-compatible backend for Ceph RGW and cloud object stores.
### `registry`
### Retention Policy Engine
The orchestrator. Combines `identity + manifest + events + retention +
dataplane` into the operations the HTTP API and CLI consume:
`create_package`, `ingest_file`, `finalize_package`, `get_manifest`,
`download_file`, `extend_retention`, `apply_hold`, `release_hold`,
`mark_deletion_eligible`, `tail_events`. Each operation is one DB
transaction that writes one or more events and updates materialised
views.
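The "one transaction writes the event and updates the view" rule can be demonstrated with the standard-library `sqlite3` module. This is only a sketch: the blueprint's v1 stack targets PostgreSQL, the table shapes are trimmed to a few columns, and the `fail_after_event` flag is a hypothetical hook added to show the rollback.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    sequence   INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    subject_id TEXT NOT NULL,
    actor      TEXT NOT NULL,
    payload    TEXT NOT NULL
);
CREATE TABLE artifact_packages (
    id                  TEXT PRIMARY KEY,
    status              TEXT NOT NULL,
    last_event_sequence INTEGER NOT NULL
);
""")


def create_package(db: sqlite3.Connection, package_id: str, actor: str,
                   fail_after_event: bool = False) -> None:
    # `with db` commits if the block succeeds and rolls back if it raises,
    # so the event row and the view row land together or not at all.
    with db:
        cursor = db.execute(
            "INSERT INTO events (event_type, subject_id, actor, payload) VALUES (?, ?, ?, ?)",
            ("v1.package.created", package_id, actor, json.dumps({"id": package_id})),
        )
        if fail_after_event:
            raise RuntimeError("simulated failure between event write and view update")
        db.execute(
            "INSERT INTO artifact_packages (id, status, last_event_sequence) VALUES (?, ?, ?)",
            (package_id, "created", cursor.lastrowid),
        )


create_package(conn, "pkg-1", "producer-a")
try:
    create_package(conn, "pkg-2", "producer-a", fail_after_event=True)
except RuntimeError:
    pass  # the half-finished operation leaves no trace

event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
package_count = conn.execute("SELECT COUNT(*) FROM artifact_packages").fetchone()[0]
print(event_count, package_count)
```

The failed call leaves neither an orphan event nor a stale view row, so a later replay and the live view cannot disagree.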
Applies default rules at ingestion and records later changes.
### `api.http` and `cli`
Initial retention classes:
Thin. Their job is to translate transport (HTTP / argv) into calls on
`registry`. No business logic.
- `transient`: short-lived scratch artifacts,
- `raw-evidence`: raw logs and run output,
- `summary-evidence`: compact reports and summaries,
- `release-evidence`: release or customer-facing evidence packages,
- `permanent-record`: manually held records with no automatic expiry.
## Data model
Each package stores:
All tables exist as **materialised views over `events`** (ADR-0002),
except `events` itself, `retention_classes` (seed data), and
`metadata_schemas` (config).
- selected retention class,
- default retention rule,
- computed `expires_at`,
- extension records,
- hold records,
- deletion eligibility state.
### `events` (source of truth)
### Audit Log
| Column | Type | Notes |
|---|---|---|
| `sequence` | `BIGSERIAL PRIMARY KEY` | monotonic; intended gapless (a PostgreSQL sequence alone can skip values on rollback) |
| `created_at` | `TIMESTAMPTZ NOT NULL` | UTC, set by DB default |
| `event_type` | `TEXT NOT NULL` | versioned slug (`v1.…`) |
| `subject_kind` | `TEXT NOT NULL` | `package` / `file` / `retention` / `storage` / `system` |
| `subject_id` | `UUID NULL` | |
| `actor` | `TEXT NOT NULL` | producer or operator identity |
| `payload` | `BYTEA NOT NULL` | canonical CBOR |
| `payload_digest` | `BYTEA NOT NULL` | BLAKE3 of `payload` |
Append-only record of important events:
Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
- package created,
- file uploaded,
- package finalized,
- retrieval requested,
- retention extended,
- hold applied or released,
- deletion requested,
- deletion completed or failed.
### `artifact_packages` (materialised view)
The audit log does not need to be cryptographic in the first release, but the
schema should leave room for signed events or external write-once storage later.
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `name` | `TEXT NOT NULL` | |
| `producer` | `TEXT NOT NULL` | |
| `subject` | `TEXT NOT NULL` | |
| `retention_class` | `TEXT NOT NULL` | FK to `retention_classes` |
| `metadata_schema_id` | `UUID NULL` | FK to `metadata_schemas` |
| `metadata` | `JSONB NOT NULL` | validated against schema if present |
| `status` | `TEXT NOT NULL` | `created` / `uploading` / `finalized` / `deletion_eligible` / `deleted` / `failed` |
| `manifest_digest` | `BYTEA NULL` | populated on finalize |
| `created_at`, `finalized_at`, `expires_at` | `TIMESTAMPTZ` | |
| `last_event_sequence` | `BIGINT NOT NULL` | for replay bookkeeping |
## Data Model
### `artifact_files` (materialised view)
### Artifact Package
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `package_id` | `UUID NOT NULL` | FK |
| `relative_path` | `TEXT NOT NULL` | logical path; unique within package |
| `media_type` | `TEXT NOT NULL` | required (ADR-0006) |
| `size_bytes` | `BIGINT NOT NULL` | |
| `digest_algorithm` | `TEXT NOT NULL` | `blake3` by default (ADR-0001) |
| `digest_primary` | `BYTEA NOT NULL` | bytes of the primary digest |
| `digest_sha256` | `BYTEA NOT NULL` | always populated for interop |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
Required fields:
### `storage_locations` (materialised view)
- `id`
- `name`
- `producer`
- `subject`
- `retention_class`
- `status`
- `created_at`
- `finalized_at`
- `expires_at`
- `metadata`
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `artifact_file_id` | `UUID NOT NULL` | FK |
| `backend_id` | `TEXT NOT NULL` | |
| `content_address` | `TEXT NOT NULL` | `<algo>:<hex>` |
| `object_key` | `TEXT NOT NULL` | backend-specific, usually derived from `content_address` |
| `storage_class` | `TEXT NULL` | backend-specific label |
| `retrieval_tier` | `TEXT NOT NULL DEFAULT 'hot'` | `hot` / `warm` / `cold` / `archive` |
| `restore_status` | `TEXT NULL` | `available` / `restore_requested` / `restoring` / `restored` / `expired` |
| `status` | `TEXT NOT NULL` | `recorded` / `verified` / `failed` / `deleted` |
| `created_at`, `last_verified_at` | `TIMESTAMPTZ` | |
Recommended metadata keys:
### `retention_state` (materialised view)
- `repo_slug`
- `run_id`
- `assessment_id`
- `target_profile_ref`
- `assessment_profile_ref`
- `source_commits`
- `tool_versions`
- `environment`
| Column | Type | Notes |
|---|---|---|
| `package_id` | `UUID PRIMARY KEY` | |
| `current_expires_at` | `TIMESTAMPTZ NULL` | NULL = no expiry (permanent or held) |
| `effective_class` | `TEXT NOT NULL` | |
| `active_hold_id` | `UUID NULL` | |
| `eligible_for_deletion` | `BOOLEAN NOT NULL` | |
### Artifact File
### `retention_classes` (seed data, not derived)
Required fields:
| Column | Type | Notes |
|---|---|---|
| `class_id` | `TEXT PRIMARY KEY` | `transient` / `raw-evidence` / `summary-evidence` / `release-evidence` / `permanent-record` |
| `default_duration` | `INTERVAL NULL` | NULL for `permanent-record` |
| `deletion_strategy` | `TEXT NOT NULL` | `mark_eligible` / `auto_delete_after_grace` (v1 only uses the former) |
- `id`
- `package_id`
- `relative_path`
- `media_type`
- `size_bytes`
- `sha256`
- `created_at`
### `metadata_schemas` (config table)
### Storage Location
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `slug` | `TEXT NOT NULL UNIQUE` | e.g. `guide-board.run.v1` |
| `json_schema` | `JSONB NOT NULL` | |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
Required fields:
## API shape
- `id`
- `artifact_file_id`
- `backend_id`
- `object_key`
- `storage_class`
- `status`
- `created_at`
- `last_verified_at`
### Retention Event
Required fields:
- `id`
- `package_id`
- `event_type`
- `reason`
- `created_by`
- `created_at`
- `previous_expires_at`
- `new_expires_at`
Event types:
- `default_rule_applied`
- `extended`
- `hold_applied`
- `hold_released`
- `deletion_eligible`
- `deleted`
## API Shape
Initial endpoints:
### Native v1 surface
```text
GET /health
GET /backends
POST /packages
GET /packages
GET /packages/{package_id}
POST /packages/{package_id}/files
POST /packages/{package_id}/finalize
GET /packages/{package_id}/manifest
GET /files/{file_id}/download
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
GET /health
GET /backends
GET /retention-classes
POST /packages # create
GET /packages # list, query by metadata
GET /packages/{package_id} # metadata
POST /packages/{package_id}/files # single-shot file upload
POST /packages/{package_id}/finalize # produce manifest
GET /packages/{package_id}/manifest # canonical CBOR (Accept: application/cbor)
GET /packages/{package_id}/manifest.json # JCS projection (Accept: application/json)
GET /files/{file_id} # metadata
GET /files/{file_id}/download # bytes
POST /uploads # open an upload session (resource shape pinned now)
PATCH /uploads/{upload_id} # range body
POST /uploads/{upload_id}/complete # promote to /packages/.../files
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
GET /events?since={sequence} # long-poll registry change feed
```
The first ingestion path can accept multipart file uploads. A later trusted-local
operator endpoint may ingest from server-local paths, but it should be disabled
by default because path ingestion changes the security boundary.
The `POST /uploads/...` resource shape is committed now even if v1
implements it as single-shot internally; ADR per `PLATFORM-AMBITION` A6.
## Package Manifest
### Deferred / not v1
Every finalized package should expose a JSON manifest containing:
- `/v2/…` OCI Distribution endpoints (ADR-0006).
- gRPC API.
- Streaming CDC topic (NATS / Kafka).
- Multi-tenant namespacing in URLs.
- package metadata,
- retention summary,
- file list,
- file digests and sizes,
- storage backend references,
- source metadata,
- created/finalized timestamps.
## Package manifest content (v1)
For guide-board runs, the manifest should preserve links to:
A finalised manifest carries:
- `run.json`
- `retention-summary.json`
- `reports/assessment-package.json`
- `reports/report.md`
- extension-generated scorecards or log reviews,
- raw artifact files captured by the assessment package manifest.
- `manifest_version: 1`
- `package`: id, name, producer, subject, retention class, created_at,
finalized_at, expires_at, metadata, metadata_schema_id (nullable).
- `files`: ordered list of `{id, relative_path, media_type, size_bytes,
digest_algorithm, digest_primary_hex, digest_sha256_hex}`.
- `storage_receipts`: ordered list of `{file_id, backend_id,
content_address, retrieval_tier, status}` per stored copy.
- `retention_summary`: current class, expires_at, holds, last
retention event.
- `provenance`: `{source_commits, tool_versions, environment,
ingest_actor, ingest_timestamps}`. Schema-driven; freeform under a
registered schema or empty if none.
## Guide-Board Pilot Flow
The manifest digest (`blake3:<hex>`) is the package's canonical
external identifier.
```text
guide-board run directory
-> open-cmis-tck scorecard/log review
-> artifact-store package create
-> upload run files
-> finalize manifest
-> Statehub record links package id and summary
```
## Storage backends
The artifact package should carry:
### Local filesystem (v1)
- run id,
- target profile reference,
- assessment profile reference,
- result status,
- source commits for guide-board, open-cmis-tck, and the assessed repository,
- important report paths,
- retention class `raw-evidence` or `release-evidence`.
- Root: configured directory.
- Object key layout: `<root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Atomic write via `fsync(tmpfile) + rename`. No partial states visible.
- Path traversal prevented at the SPI boundary; the local backend
rejects any key that does not match the expected layout.
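The key layout, traversal check, and atomic write for the local backend might look like the sketch below. The regular expression, the minimum hex length, and the function names `object_path` and `atomic_write` are assumptions made for illustration.

```python
import os
import re
import tempfile
from pathlib import Path

# ASSUMPTION: addresses are `<algo>:<lowercase hex>`; anything else is rejected,
# which also rules out `..`, slashes, and absolute paths.
_ADDRESS = re.compile(r"(?P<algo>[a-z0-9]+):(?P<hex>[0-9a-f]{8,})")


def object_path(root: Path, content_address: str) -> Path:
    match = _ADDRESS.fullmatch(content_address)
    if match is None:
        raise ValueError(f"rejected content address: {content_address!r}")
    algo, hex_ = match["algo"], match["hex"]
    return root / algo / hex_[0:2] / hex_[2:4] / hex_


def atomic_write(root: Path, content_address: str, data: bytes) -> Path:
    target = object_path(root, content_address)
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the target directory so the rename stays on one
    # filesystem and is therefore atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


with tempfile.TemporaryDirectory() as tmp_root:
    stored = atomic_write(Path(tmp_root), "blake3:" + "ab" * 32, b"payload")
    print(stored.relative_to(tmp_root).parts[:3])
```

Readers either see the complete object at its final path or nothing at all; the `.tmp-` file is never reachable through a content address.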
## Ceph And S3-Compatible Storage
### S3-compatible / Ceph RGW (WP-0004)
Ceph should be introduced through the S3-compatible adapter, not as a special
case in producer logic.
- Endpoint, bucket, region, access key ref, secret key ref, key
prefix, storage class label, optional SSE config.
- Object key: `<prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Multipart upload for objects above a configurable threshold.
Configuration should support:
## Security boundary (v1)
- endpoint URL,
- bucket,
- region,
- access key reference,
- secret key reference,
- optional server-side encryption settings,
- object key prefix,
- storage class label.
- Internal service. No anonymous public access.
- Authenticated producer / operator API. v1 ships shared-secret bearer
tokens; OIDC integration is its own workplan.
- No secret values in artifact metadata.
- Upload paths are logical; never trusted filesystem paths. Ingestion
from server-local paths is *not* offered in v1.
- Download authorisation is checked at the registry layer, never at
the backend.
The service should never require credentials in producer request bodies. Use
environment variables, mounted secret files, or a local secret provider.
## Resolved open questions
## Future Retrieval Tiers
- **Deduplication scope.** Global by content address (ADR-0001).
Reference-counted deletion via a GC pass (WP-0006, TBD).
- **Deletion ordering.** Mark records `deletion_eligible` first via an
event. Byte deletion is a separate, audited operation that emits a
second event. Reverse order is forbidden.
- **Metadata schemas.** Open JSON with optional producer-registered
JSON Schema; validation at ingest (ADR-0005, `metadata_schemas`).
- **Statehub integration scope.** Statehub keeps package IDs and
summary; never bytes. The `/events` long-poll is the integration
point.
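The deduplication and deletion-ordering rules above interact: bytes may only be removed after the eligibility event, and only once no other package references the same content address. A toy sketch under stated assumptions follows; the `Registry` class, its in-memory reference counter, and the `v1.storage.bytes_deleted` event name are all hypothetical and not part of the v1 event list.

```python
from collections import Counter


class DeletionOrderError(Exception):
    """Raised when byte deletion is attempted before the eligibility event."""


class Registry:
    def __init__(self) -> None:
        self.events: list[tuple[str, str]] = []
        self.refcount: Counter = Counter()  # content address -> referencing files
        self.package_files: dict[str, list[str]] = {}
        self.eligible: set[str] = set()

    def ingest(self, package_id: str, content_address: str) -> None:
        self.package_files.setdefault(package_id, []).append(content_address)
        self.refcount[content_address] += 1

    def mark_deletion_eligible(self, package_id: str) -> None:
        self.events.append(("v1.retention.deletion_eligible", package_id))
        self.eligible.add(package_id)

    def delete_bytes(self, package_id: str) -> list[str]:
        if package_id not in self.eligible:
            raise DeletionOrderError(f"{package_id} has no eligibility event")
        removable = []
        for address in self.package_files.pop(package_id):
            self.refcount[address] -= 1
            if self.refcount[address] == 0:
                # Only unreferenced objects would be handed to the backend.
                del self.refcount[address]
                removable.append(address)
        # HYPOTHETICAL event name for the second, audited step.
        self.events.append(("v1.storage.bytes_deleted", package_id))
        return removable


registry = Registry()
registry.ingest("pkg-1", "blake3:aaaa")  # shared with pkg-2
registry.ingest("pkg-1", "blake3:bbbb")
registry.ingest("pkg-2", "blake3:aaaa")

try:
    registry.delete_bytes("pkg-1")
    refused = False
except DeletionOrderError:
    refused = True

registry.mark_deletion_eligible("pkg-1")
removed = registry.delete_bytes("pkg-1")
print(refused, removed)
```

The shared object survives because `pkg-2` still references it, which is the behaviour global deduplication needs from any garbage-collection pass.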
The initial API can treat all stored files as immediately retrievable. Later,
storage locations can include:
## Outstanding open questions (not blocking v1)
- `retrieval_tier`: hot, warm, cold, archive,
- `restore_status`: available, restore_requested, restoring, restored, expired,
- `restore_requested_at`,
- `restore_expires_at`.
- Identity provider for shared deployments.
- Default retention durations per class (operator-configurable; needs
one round of stakeholder input).
- WASM plugin host design (deferred to its own workplan; see
`PLATFORM-AMBITION`).
- Federation / mirroring protocol (post-OCI-endpoint workplan).
The registry API should be able to return "not immediately available" without
changing artifact identity.
## Roadmap pointer
## Security Boundary
Initial service assumptions:
- internal service, not public internet exposed,
- authenticated producer/operator API before shared deployment,
- no secret values stored in artifact metadata,
- package paths are logical paths, not trusted filesystem paths,
- download authorization should be checked at the registry layer.
Files may contain sensitive evidence. The service must treat metadata and bytes
as confidential by default.
## Open Questions
- Which identity provider should guard shared deployments?
- Should package metadata schemas be open-ended JSON or typed by producer?
- Should deduplication be package-local only or global by content hash?
- Should deletion first mark records deleted, then delete bytes, or reverse that
order with compensating events?
- How much Statehub integration belongs in this repo versus in Statehub clients?
The implementation sequence is in `docs/ROADMAP.md`. The first
workplan is `workplans/ARTIFACT-STORE-WP-0001-foundation.md`.