docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans

Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00
parent 403d903585
commit 747afc27a6
16 changed files with 1761 additions and 404 deletions


@@ -162,13 +162,21 @@ To create a new workplan:
## Current Repo Shape
This repository is in service-baseline planning. The current source of truth is:
This repository is in service-baseline planning. The current sources of truth are:
- `INTENT.md` for purpose, product thesis, scope, and service boundary
- `docs/ARCHITECTURE-BLUEPRINT.md` for the draft architecture
- `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` for implementation tasks
- `INTENT.md`: purpose, product thesis, scope, service boundary.
- `SCOPE.md` — lightweight orientation.
- `docs/ARCHITECTURE-BLUEPRINT.md` — architecture v2: modules, data model, API shape.
- `docs/PLATFORM-AMBITION.md` — longer-horizon thesis and the v1 schema commitments (A1–A9).
- `docs/adr/` — architecture decision records ADR-0001 … ADR-0006 (content-addressed storage, event log as source of truth, canonical CBOR manifests, control/data plane contract, v1 tech stack, OCI reachability).
- `docs/ROADMAP.md` — workplan sequencing across phases.
- `docs/ASSEMBLY-EXPERIMENT.md` — opt-in research line on hand-tuned asm for hot kernels.
- `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` — Foundation workplan; first to start.
- `workplans/ARTIFACT-STORE-WP-{0002..0005}-*.md` — planned next workplans.
No runnable service scaffold exists yet. Add install, dev-server, and test
No runnable service scaffold exists yet. The pinned tech stack is in ADR-0005
(Python 3.12, uv, FastAPI, SQLAlchemy Core + asyncpg/aiosqlite, Alembic,
cbor2, blake3, ruff, mypy, pytest, typer). Add install, dev-server, and test
commands here when `ARTIFACT-STORE-WP-0001-T001` lands.
## Repo Boundary


@@ -1,17 +1,51 @@
# artifact-store
Generic artifact registry and storage gateway for generated outputs, evidence
packages, reports, logs, and release artifacts.
Generic artifact registry and storage gateway for generated outputs,
evidence packages, reports, logs, snapshots, exports, and release
artifacts.
The registry owns artifact identity, metadata, provenance, retention policy, and
retrieval records. Actual bytes are delegated to configured storage backends such
as a local filesystem, S3-compatible object storage, or Ceph RGW.
The registry owns artifact identity, metadata, provenance, retention
policy, and retrieval records. Bytes are delegated to configured
storage backends (local filesystem in v1, S3-compatible / Ceph RGW
next).
Start here:
The shape is library-first (`artifactstore` Python package); the HTTP
server and the CLI are thin consumers. Content is addressed by digest;
state is authoritative in an append-only event log; materialised views
are rebuildable.
- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary
- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — draft architecture
- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon thesis and the schema commitments v1 preserves
- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md) — SWOT and optimisation review
- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in research line on hand-tuned assembly for hot kernels
- [workplans/ARTIFACT-STORE-WP-0001-service-baseline.md](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md) — first implementation workplan
## Status
Concept / service-baseline planning. No runnable scaffold yet —
`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` is the next step.
## Start here
- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary.
- [SCOPE.md](SCOPE.md) — lightweight orientation.
- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — the
v2 architecture: modules, data model, API shape.
- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon
thesis, ffmpeg / VLC reference points, the schema commitments v1
preserves.
- [docs/ROADMAP.md](docs/ROADMAP.md) — workplan sequencing across
phases.
- [docs/adr/](docs/adr/) — architecture decision records (ADR-0001 …
ADR-0006).
- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in
research line on hand-tuned assembly for hot kernels.
- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md)
— the SWOT review that triggered this cleanup.
## Active workplans
- [WP-0001 — Foundation: scaffold, core kernels, local FS backend](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md)
- [WP-0002 — Ingestion API and manifest surface](workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md) (planned)
- [WP-0003 — Retention lifecycle](workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md) (planned)
- [WP-0004 — S3-compatible backend](workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md) (planned)
- [WP-0005 — Guide-board pilot ingestion](workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md) (planned)
## Agent operating notes
See [AGENTS.md](AGENTS.md) for the StateHub-integrated session
protocol, workplan conventions, and progress-logging contract.


@@ -1,330 +1,378 @@
# Artifact Store Architecture Blueprint
# Architecture Blueprint
Status: draft
Created: 2026-05-15
Status: accepted (v2 — supersedes 2026-05-15 draft)
Updated: 2026-05-15
## Purpose
This document operationalises `INTENT.md`, the `docs/PLATFORM-AMBITION.md`
thesis, and the decisions recorded in `docs/adr/`. Where a tension exists
between this blueprint and an ADR, the ADR wins; raise an issue or
supersede the ADR.
`artifact-store` provides a generic registry and storage gateway for durable
generated artifacts. Producers register packages and files with metadata;
storage adapters persist the bytes; retention policy decides how long artifacts
remain eligible for retrieval.
## Architecture in one paragraph
The design keeps artifact identity and lifecycle separate from storage
implementation. This allows the first version to run against local filesystem
storage while the production path can use S3-compatible object storage such as
Ceph RGW.
`artifact-store` is a **library-first** artifact registry and storage
gateway. A small core library (`artifactstore`) implements identity,
manifests, retention, the storage adapter SPI, the data plane SPI, and
the registry orchestrator. The HTTP server and the CLI are thin
consumers of that library. Bytes are addressed by content
(`blake3:<hex>`) and stored through a pluggable adapter SPI. State is
authoritative in an append-only event log; queryable tables are
materialised views.
## Architecture Summary
## Design lineage
The shape is deliberately borrowed from `ffmpeg` and `VLC`: a tight
core of well-named modules with stable contracts, runtime-pluggable
backends, a thin orchestration binary, and an explicit hot-path
boundary that can be rewritten in faster code without changing the
consumer API. See `docs/PLATFORM-AMBITION.md` for the reference table.
## Top-level shape
```text
producer
-> Artifact Registry API
-> metadata database
-> retention policy engine
-> audit event log
-> storage adapter interface
-> local filesystem backend
-> S3-compatible backend
-> Ceph RGW deployment
-> future cloud/blob/archive backends
          producers / operators / agents
                        |
                        v
            +------------------------+
            | HTTP API    |    CLI   |   <-- thin consumers
            +------------------------+
                        |
                        v
            +------------------------+
            | registry orchestrator  |
            +------------------------+
               |          |          |
               v          v          v
         +----------+ +---------+ +---------+
         | identity | | events  | |retention|
         |/manifest | | (log +  | | policy  |
         |          | | views)  | | engine  |
         +----------+ +---------+ +---------+
                        |
                        v
            +-----------------------+
            |    data plane SPI     |   <-- ADR-0004 contract
            +-----------------------+
                        |
                        v
            +-----------------------+
            |  storage adapter SPI  |
            +-----------------------+
               |        |        |
               v        v        v
            +-----+  +------+  +-------+
            |local|  |  S3  |  | Ceph  |   ... future backends
            | FS  |  | RGW  |  |  RGW  |
            +-----+  +------+  +-------+
```
The registry is the authority for artifact metadata and lifecycle. Backends are
responsible for byte storage and retrieval.
## Core modules
## Design Principles
Mapped one-to-one to ADR-0005's project layout. Each module has a
stable public surface; internals are free to evolve.
- Backend-neutral registry: no producer should know whether bytes live in Ceph,
local disk, or a cloud bucket.
- Content-addressable confidence: every stored file has a digest and size.
- Retention by default: every package receives an expiry decision at ingestion.
- Extensions are explicit: retention extensions and holds are audit events, not
silent metadata edits.
- Packages remain portable: a manifest should be enough to understand a package
without calling the producer.
- Statehub links, it does not store bytes: Statehub records artifact IDs and
outcomes; artifact-store owns file persistence.
- Deletion is deliberate: expiry makes artifacts eligible for deletion; deletion
jobs must be auditable and reversible only when the backend still has data.
### `identity`
## Components
- `Digest(algorithm, hex)` — value object.
- `ContentAddress`: `<algorithm>:<hex>` (ADR-0001).
- `digest_stream(reader) -> {primary, sha256}` — single-pass dual digest.
- Algorithm registry: `blake3` (default primary), `sha256` (always
computed).
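The single-pass dual digest described for `digest_stream` can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: it uses only the standard library, so `hashlib.blake2b` stands in for BLAKE3 (the real primary algorithm would presumably come from the `blake3` package named in ADR-0005), and the `Digest` fields, the `content_address()` helper, and the returned dictionary shape are inferred from the list above.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    def content_address(self) -> str:
        # ContentAddress form from ADR-0001: "<algorithm>:<hex>".
        return f"{self.algorithm}:{self.hex}"


def digest_stream(reader: BinaryIO, chunk_size: int = 64 * 1024) -> dict[str, Digest]:
    # ASSUMPTION: blake2b is a stdlib stand-in for BLAKE3 in this sketch.
    primary = hashlib.blake2b()
    sha256 = hashlib.sha256()
    # Each chunk is read once and fed to both hashers (single pass).
    while chunk := reader.read(chunk_size):
        primary.update(chunk)
        sha256.update(chunk)
    return {
        "primary": Digest("blake2b", primary.hexdigest()),
        "sha256": Digest("sha256", sha256.hexdigest()),
    }


result = digest_stream(io.BytesIO(b"hello artifact-store"))
print(result["primary"].content_address())
print(result["sha256"].content_address())
```

Feeding both hashers from the same chunk loop is what keeps the SHA-256 interop digest from costing a second read of the stream.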
### Registry API
### `manifest`
HTTP API for producers and operators.
- `Manifest` — versioned dataclass: package metadata + ordered file list
+ retention summary + provenance + storage receipts.
- `manifest.codec.encode(manifest) -> bytes` — canonical CBOR
(ADR-0003).
- `manifest.codec.decode(bytes) -> Manifest`.
- `manifest.projection.jcs(manifest) -> bytes` — canonical-JSON
projection for display and signing-tool interop.
- Round-trip invariant: `decode(encode(m)) == m` and
`encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.
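The round-trip invariant can be illustrated with a small deterministic codec. This sketch is an assumption-laden stand-in: the blueprint's codec is canonical CBOR (ADR-0003), whereas the code below uses sorted-key, whitespace-free JSON from the standard library, which only approximates a canonical form for ASCII keys, strings, and integers. The `Manifest` and `FileEntry` fields shown are a hypothetical subset chosen for brevity.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    size_bytes: int
    digest_sha256_hex: str


@dataclass(frozen=True)
class Manifest:
    manifest_version: int
    name: str
    files: tuple[FileEntry, ...] = ()


def encode(manifest: Manifest) -> bytes:
    # Deterministic bytes: sorted keys, no insignificant whitespace, UTF-8.
    # ASSUMPTION: JSON stands in for canonical CBOR in this sketch.
    text = json.dumps(asdict(manifest), sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def decode(data: bytes) -> Manifest:
    raw = json.loads(data)
    files = tuple(FileEntry(**entry) for entry in raw["files"])
    return Manifest(raw["manifest_version"], raw["name"], files)


m = Manifest(1, "demo", (FileEntry("reports/report.md", 12, "ab" * 32),))
encoded = encode(m)
# The digest of the canonical bytes is stable because the encoding is deterministic.
manifest_digest = hashlib.sha256(encoded).hexdigest()
print(decode(encoded) == m, encode(decode(encoded)) == encoded)
```

Because the encoding is deterministic, two registries that hold the same manifest content compute the same digest, which is what lets a manifest digest serve as an identifier.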
Initial responsibilities:
### `events`
- create artifact packages,
- upload or ingest files,
- finalize packages,
- retrieve package metadata,
- list/search packages by subject and producer metadata,
- create retention extensions and holds,
- expose download metadata or redirect/download endpoints,
- expose health and backend status.
- `events.write(transaction, event)` — appends one row with monotonic
sequence (ADR-0002).
- `events.tail(since_sequence) -> AsyncIterator[Event]` — long-poll.
- `events.replay(into=ViewWriter)` — rebuild materialised views.
- Event types (v1):
`v1.package.created`, `v1.file.ingested`, `v1.package.finalized`,
`v1.retention.default_applied`, `v1.retention.extended`,
`v1.retention.hold_applied`, `v1.retention.hold_released`,
`v1.retention.deletion_eligible`, `v1.storage.location_recorded`,
`v1.storage.location_verified`, `v1.audit.access`,
`v1.system.note`.
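The write, tail, and replay operations can be sketched with an in-memory log. This is a simplified, synchronous illustration: the blueprint's `events` module is asynchronous and database-backed, and here the sequence is assigned in-process rather than by the database. The `EventLog` class and the `build_package_view` helper are hypothetical names; only the event type slugs come from the list above.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


class EventLog:
    def __init__(self) -> None:
        self._rows: list[Event] = []

    def write(self, event_type: str, subject_id: str, payload: dict) -> Event:
        # Append-only: rows are never updated or removed; sequence is monotonic.
        event = Event(len(self._rows) + 1, event_type, subject_id, payload)
        self._rows.append(event)
        return event

    def tail(self, since_sequence: int = 0) -> Iterator[Event]:
        # Yield every event with sequence > since_sequence.
        return iter(self._rows[since_sequence:])

    def replay(self, apply: Callable[[Event], None]) -> None:
        for event in self._rows:
            apply(event)


def build_package_view(log: EventLog) -> dict[str, dict]:
    # Rebuild a materialised view purely from the log.
    view: dict[str, dict] = {}

    def apply(event: Event) -> None:
        if event.event_type == "v1.package.created":
            view[event.subject_id] = {"status": "created"}
        elif event.event_type == "v1.package.finalized":
            view[event.subject_id]["status"] = "finalized"
        if event.subject_id in view:
            view[event.subject_id]["last_event_sequence"] = event.sequence

    log.replay(apply)
    return view


log = EventLog()
log.write("v1.package.created", "pkg-1", {"name": "demo"})
log.write("v1.file.ingested", "pkg-1", {"relative_path": "run.json"})
log.write("v1.package.finalized", "pkg-1", {})
view = build_package_view(log)
print(view)
```

Dropping `view` and calling `build_package_view` again yields the same result, which is the property that makes the materialised views rebuildable.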
### Metadata Store
### `retention`
Persistent database for registry state.
- `retention.classes`: `transient`, `raw-evidence`, `summary-evidence`,
`release-evidence`, `permanent-record`. Defined as data, not code.
- `retention.policy.apply(package, class) -> RetentionDecision`:
computes `expires_at` and the deletion eligibility rule.
- `retention.extend(package, until, reason, actor)` — emits an event;
the materialised view updates on commit.
- `retention.hold(package, reason, actor)` /
`retention.release_hold(hold_id, actor)`.
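A sketch of how a policy function might compute `expires_at` from a retention class defined as data. The durations below are placeholders invented for illustration (the blueprint lists default durations as an open question), and the function takes a `created_at` timestamp instead of a package object to stay self-contained. The field names on `RetentionDecision` are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# ASSUMPTION: placeholder durations; real defaults are operator-configurable.
RETENTION_CLASSES: dict[str, timedelta | None] = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,  # no automatic expiry
}


@dataclass(frozen=True)
class RetentionDecision:
    retention_class: str
    expires_at: datetime | None
    deletion_strategy: str


def apply(created_at: datetime, retention_class: str) -> RetentionDecision:
    if retention_class not in RETENTION_CLASSES:
        raise ValueError(f"unknown retention class: {retention_class!r}")
    duration = RETENTION_CLASSES[retention_class]
    expires_at = None if duration is None else created_at + duration
    # v1 only marks packages eligible; it never deletes bytes automatically.
    return RetentionDecision(retention_class, expires_at, "mark_eligible")


created = datetime(2026, 5, 15, tzinfo=timezone.utc)
decision = apply(created, "transient")
permanent = apply(created, "permanent-record")
print(decision.expires_at, permanent.expires_at)
```

Keeping the classes in a table rather than in branches of code matches the "defined as data, not code" note: changing a duration is a data change, not a release.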
Initial implementation can use SQLite for local development and PostgreSQL for
shared service deployments if that matches the surrounding service stack.
### `audit`
Core tables:
- A view over `events` filtered to access and lifecycle events. No
separate write path; auditing happens by event emission elsewhere.
- `artifact_packages`
- `artifact_files`
- `storage_locations`
- `retention_rules`
- `retention_events`
- `audit_events`
### `storage` (adapter SPI)
### Storage Adapter Interface
```python
from __future__ import annotations  # ContentAddress, StorageReceipt, etc. are defined elsewhere

from collections.abc import AsyncIterator
from typing import Protocol

class StorageBackend(Protocol):
    backend_id: str

    async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
    async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
    async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
    async def health(self) -> BackendStatus: ...
```
Small backend contract used by the API service.
- Backend registry: backends register at import time; selection is
per-package by configuration.
- v1 ships `local` (filesystem); `s3` ships in WP-0004.
Required operations:
### `dataplane` (SPI per ADR-0004)
- `put(object_key, stream, metadata) -> storage_location`
- `get(object_key) -> stream or signed_url`
- `head(object_key) -> object_metadata`
- `delete(object_key) -> deletion_result`
- `health() -> backend_status`
```python
from __future__ import annotations  # IngestHints, IngestResult, etc. are defined elsewhere

from collections.abc import AsyncIterator
from typing import Protocol

class DataPlane(Protocol):
    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
    async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
    async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...
```
Initial backends:
- v1 implementation: `dataplane.inproc` — wraps a `StorageBackend`,
computes digests during streaming.
- Future implementation: `dataplane.remote` — gRPC or
framed-bincode-over-Unix-socket client to a Rust daemon.
- local filesystem backend for tests and development,
- S3-compatible backend for Ceph RGW and cloud object stores.
### `registry`
### Retention Policy Engine
The orchestrator. Combines `identity + manifest + events + retention +
dataplane` into the operations the HTTP API and CLI consume:
`create_package`, `ingest_file`, `finalize_package`, `get_manifest`,
`download_file`, `extend_retention`, `apply_hold`, `release_hold`,
`mark_deletion_eligible`, `tail_events`. Each operation is one DB
transaction that writes one or more events and updates materialised
views.
Applies default rules at ingestion and records later changes.
### `api.http` and `cli`
Initial retention classes:
Thin. Their job is to translate transport (HTTP / argv) into calls on
`registry`. No business logic.
- `transient`: short-lived scratch artifacts,
- `raw-evidence`: raw logs and run output,
- `summary-evidence`: compact reports and summaries,
- `release-evidence`: release or customer-facing evidence packages,
- `permanent-record`: manually held records with no automatic expiry.
## Data model
Each package stores:
All tables exist as **materialised views over `events`** (ADR-0002),
except `events` itself, `retention_classes` (seed data), and
`metadata_schemas` (config).
- selected retention class,
- default retention rule,
- computed `expires_at`,
- extension records,
- hold records,
- deletion eligibility state.
### `events` (source of truth)
### Audit Log
| Column | Type | Notes |
|---|---|---|
| `sequence` | `BIGSERIAL PRIMARY KEY` | monotonic, gapless |
| `created_at` | `TIMESTAMPTZ NOT NULL` | UTC, set by DB default |
| `event_type` | `TEXT NOT NULL` | versioned slug (`v1.…`) |
| `subject_kind` | `TEXT NOT NULL` | `package` / `file` / `retention` / `storage` / `system` |
| `subject_id` | `UUID NULL` | |
| `actor` | `TEXT NOT NULL` | producer or operator identity |
| `payload` | `BYTEA NOT NULL` | canonical CBOR |
| `payload_digest` | `BYTEA NOT NULL` | BLAKE3 of `payload` |
Append-only record of important events:
Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
- package created,
- file uploaded,
- package finalized,
- retrieval requested,
- retention extended,
- hold applied or released,
- deletion requested,
- deletion completed or failed.
### `artifact_packages` (materialised view)
The audit log does not need to be cryptographic in the first release, but the
schema should leave room for signed events or external write-once storage later.
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `name` | `TEXT NOT NULL` | |
| `producer` | `TEXT NOT NULL` | |
| `subject` | `TEXT NOT NULL` | |
| `retention_class` | `TEXT NOT NULL` | FK to `retention_classes` |
| `metadata_schema_id` | `UUID NULL` | FK to `metadata_schemas` |
| `metadata` | `JSONB NOT NULL` | validated against schema if present |
| `status` | `TEXT NOT NULL` | `created` / `uploading` / `finalized` / `deletion_eligible` / `deleted` / `failed` |
| `manifest_digest` | `BYTEA NULL` | populated on finalize |
| `created_at`, `finalized_at`, `expires_at` | `TIMESTAMPTZ` | |
| `last_event_sequence` | `BIGINT NOT NULL` | for replay bookkeeping |
## Data Model
### `artifact_files` (materialised view)
### Artifact Package
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `package_id` | `UUID NOT NULL` | FK |
| `relative_path` | `TEXT NOT NULL` | logical path; unique within package |
| `media_type` | `TEXT NOT NULL` | required (ADR-0006) |
| `size_bytes` | `BIGINT NOT NULL` | |
| `digest_algorithm` | `TEXT NOT NULL` | `blake3` by default (ADR-0001) |
| `digest_primary` | `BYTEA NOT NULL` | bytes of the primary digest |
| `digest_sha256` | `BYTEA NOT NULL` | always populated for interop |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
Required fields:
### `storage_locations` (materialised view)
- `id`
- `name`
- `producer`
- `subject`
- `retention_class`
- `status`
- `created_at`
- `finalized_at`
- `expires_at`
- `metadata`
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `artifact_file_id` | `UUID NOT NULL` | FK |
| `backend_id` | `TEXT NOT NULL` | |
| `content_address` | `TEXT NOT NULL` | `<algo>:<hex>` |
| `object_key` | `TEXT NOT NULL` | backend-specific, usually derived from `content_address` |
| `storage_class` | `TEXT NULL` | backend-specific label |
| `retrieval_tier` | `TEXT NOT NULL DEFAULT 'hot'` | `hot` / `warm` / `cold` / `archive` |
| `restore_status` | `TEXT NULL` | `available` / `restore_requested` / `restoring` / `restored` / `expired` |
| `status` | `TEXT NOT NULL` | `recorded` / `verified` / `failed` / `deleted` |
| `created_at`, `last_verified_at` | `TIMESTAMPTZ` | |
Recommended metadata keys:
### `retention_state` (materialised view)
- `repo_slug`
- `run_id`
- `assessment_id`
- `target_profile_ref`
- `assessment_profile_ref`
- `source_commits`
- `tool_versions`
- `environment`
| Column | Type | Notes |
|---|---|---|
| `package_id` | `UUID PRIMARY KEY` | |
| `current_expires_at` | `TIMESTAMPTZ NULL` | NULL = no expiry (permanent or held) |
| `effective_class` | `TEXT NOT NULL` | |
| `active_hold_id` | `UUID NULL` | |
| `eligible_for_deletion` | `BOOLEAN NOT NULL` | |
### Artifact File
### `retention_classes` (seed data, not derived)
Required fields:
| Column | Type | Notes |
|---|---|---|
| `class_id` | `TEXT PRIMARY KEY` | `transient` / `raw-evidence` / `summary-evidence` / `release-evidence` / `permanent-record` |
| `default_duration` | `INTERVAL NULL` | NULL for `permanent-record` |
| `deletion_strategy` | `TEXT NOT NULL` | `mark_eligible` / `auto_delete_after_grace` (v1 only uses the former) |
- `id`
- `package_id`
- `relative_path`
- `media_type`
- `size_bytes`
- `sha256`
- `created_at`
### `metadata_schemas` (config table)
### Storage Location
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `slug` | `TEXT NOT NULL UNIQUE` | e.g. `guide-board.run.v1` |
| `json_schema` | `JSONB NOT NULL` | |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
Required fields:
## API shape
- `id`
- `artifact_file_id`
- `backend_id`
- `object_key`
- `storage_class`
- `status`
- `created_at`
- `last_verified_at`
### Retention Event
Required fields:
- `id`
- `package_id`
- `event_type`
- `reason`
- `created_by`
- `created_at`
- `previous_expires_at`
- `new_expires_at`
Event types:
- `default_rule_applied`
- `extended`
- `hold_applied`
- `hold_released`
- `deletion_eligible`
- `deleted`
## API Shape
Initial endpoints:
### Native v1 surface
```text
GET /health
GET /backends
POST /packages
GET /packages
GET /packages/{package_id}
POST /packages/{package_id}/files
POST /packages/{package_id}/finalize
GET /packages/{package_id}/manifest
GET /files/{file_id}/download
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
GET /health
GET /backends
GET /retention-classes
POST /packages # create
GET /packages # list, query by metadata
GET /packages/{package_id} # metadata
POST /packages/{package_id}/files # single-shot file upload
POST /packages/{package_id}/finalize # produce manifest
GET /packages/{package_id}/manifest # canonical CBOR (Accept: application/cbor)
GET /packages/{package_id}/manifest.json # JCS projection (Accept: application/json)
GET /files/{file_id} # metadata
GET /files/{file_id}/download # bytes
POST /uploads # open an upload session (resource shape pinned now)
PATCH /uploads/{upload_id} # range body
POST /uploads/{upload_id}/complete # promote to /packages/.../files
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
GET /events?since={sequence} # long-poll registry change feed
```
The first ingestion path can accept multipart file uploads. A later trusted-local
operator endpoint may ingest from server-local paths, but it should be disabled
by default because path ingestion changes the security boundary.
The `POST /uploads/...` resource shape is committed now even if v1
implements it as single-shot internally, per `PLATFORM-AMBITION` commitment A6.
## Package Manifest
### Deferred / not v1
Every finalized package should expose a JSON manifest containing:
- `/v2/…` OCI Distribution endpoints (ADR-0006).
- gRPC API.
- Streaming CDC topic (NATS / Kafka).
- Multi-tenant namespacing in URLs.
- package metadata,
- retention summary,
- file list,
- file digests and sizes,
- storage backend references,
- source metadata,
- created/finalized timestamps.
## Package manifest content (v1)
For guide-board runs, the manifest should preserve links to:
A finalised manifest carries:
- `run.json`
- `retention-summary.json`
- `reports/assessment-package.json`
- `reports/report.md`
- extension-generated scorecards or log reviews,
- raw artifact files captured by the assessment package manifest.
- `manifest_version: 1`
- `package`: id, name, producer, subject, retention class, created_at,
finalized_at, expires_at, metadata, metadata_schema_id (nullable).
- `files`: ordered list of `{id, relative_path, media_type, size_bytes,
digest_algorithm, digest_primary_hex, digest_sha256_hex}`.
- `storage_receipts`: ordered list of `{file_id, backend_id,
content_address, retrieval_tier, status}` per stored copy.
- `retention_summary`: current class, expires_at, holds, last
retention event.
- `provenance`: `{source_commits, tool_versions, environment,
ingest_actor, ingest_timestamps}`. Schema-driven; freeform under a
registered schema or empty if none.
## Guide-Board Pilot Flow
The manifest digest (`blake3:<hex>`) is the package's canonical
external identifier.
```text
guide-board run directory
-> open-cmis-tck scorecard/log review
-> artifact-store package create
-> upload run files
-> finalize manifest
-> Statehub record links package id and summary
```
## Storage backends
The artifact package should carry:
### Local filesystem (v1)
- run id,
- target profile reference,
- assessment profile reference,
- result status,
- source commits for guide-board, open-cmis-tck, and the assessed repository,
- important report paths,
- retention class `raw-evidence` or `release-evidence`.
- Root: configured directory.
- Object key layout: `<root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Atomic write via `fsync(tmpfile) + rename`. No partial states visible.
- Path traversal prevented at the SPI boundary; the local backend
rejects any key that does not match the expected layout.
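The key layout, the atomic write, and the layout check can be sketched together. This is an illustrative sketch, not the backend's actual code: the function names are hypothetical, the accepted algorithms and the 64-character hex length are assumptions (both BLAKE3's default output and SHA-256 are 32 bytes), and the example hashes with SHA-256 so it runs on the standard library alone.

```python
import hashlib
import os
import re
import tempfile
from pathlib import Path

_HEX64 = re.compile(r"[0-9a-f]{64}")


def object_path(root: Path, content_address: str) -> Path:
    algorithm, _, hex_digest = content_address.partition(":")
    # Reject anything that is not "<algorithm>:<64 lowercase hex chars>".
    # This also rules out path traversal such as "sha256:../../etc/passwd".
    if algorithm not in {"blake3", "sha256"} or not _HEX64.fullmatch(hex_digest):
        raise ValueError(f"invalid content address: {content_address!r}")
    return root / algorithm / hex_digest[0:2] / hex_digest[2:4] / hex_digest


def atomic_put(root: Path, content_address: str, data: bytes) -> Path:
    target = object_path(root, content_address)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic rename; readers never see a partial file
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return target


payload = b"example bytes"
address = "sha256:" + hashlib.sha256(payload).hexdigest()
with tempfile.TemporaryDirectory() as tmp_root:
    stored = atomic_put(Path(tmp_root), address, payload)
    stored_bytes = stored.read_bytes()
    relative_parts = stored.relative_to(tmp_root).parts
print(relative_parts)
```

A stricter implementation might also fsync the parent directory after the rename so the directory entry itself survives a crash; that step is omitted here.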
## Ceph And S3-Compatible Storage
### S3-compatible / Ceph RGW (WP-0004)
Ceph should be introduced through the S3-compatible adapter, not as a special
case in producer logic.
- Endpoint, bucket, region, access key ref, secret key ref, key
prefix, storage class label, optional SSE config.
- Object key: `<prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Multipart upload for objects above a configurable threshold.
Configuration should support:
## Security boundary (v1)
- endpoint URL,
- bucket,
- region,
- access key reference,
- secret key reference,
- optional server-side encryption settings,
- object key prefix,
- storage class label.
- Internal service. No anonymous public access.
- Authenticated producer / operator API. v1 ships shared-secret bearer
tokens; OIDC integration is its own workplan.
- No secret values in artifact metadata.
- Upload paths are logical; never trusted filesystem paths. The
`/uploads/...` path-ingest endpoint is *not* offered in v1.
- Download authorisation is checked at the registry layer, never at
the backend.
The service should never require credentials in producer request bodies. Use
environment variables, mounted secret files, or a local secret provider.
## Resolved open questions
## Future Retrieval Tiers
- **Deduplication scope.** Global by content address (ADR-0001).
Reference-counted deletion via a GC pass (WP-0006, TBD).
- **Deletion ordering.** Mark records `deletion_eligible` first via an
event. Byte deletion is a separate, audited operation that emits a
second event. Reverse order is forbidden.
- **Metadata schemas.** Open JSON with optional producer-registered
JSON Schema; validation at ingest (ADR-0005, `metadata_schemas`).
- **Statehub integration scope.** Statehub keeps package IDs and
summary; never bytes. The `/events` long-poll is the integration
point.
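The deletion-ordering rule can be expressed as a guard over the event history. This is a sketch under assumptions: events are plain dictionaries, the helper name is hypothetical, and `v1.storage.bytes_deleted` is an invented event name, since the v1 event list does not name the second event.

```python
ELIGIBLE = "v1.retention.deletion_eligible"
BYTES_DELETED = "v1.storage.bytes_deleted"  # HYPOTHETICAL name, not in the v1 event list


class DeletionOrderError(RuntimeError):
    """Raised when byte deletion is attempted before the eligibility event."""


def record_byte_deletion(events: list[dict], package_id: str, actor: str) -> dict:
    # Byte deletion is only legal after the package was marked deletion-eligible.
    marked = any(
        e["event_type"] == ELIGIBLE and e["subject_id"] == package_id for e in events
    )
    if not marked:
        raise DeletionOrderError(f"package {package_id} is not deletion-eligible")
    event = {
        "sequence": len(events) + 1,
        "event_type": BYTES_DELETED,
        "subject_id": package_id,
        "actor": actor,
    }
    events.append(event)  # the second, separately audited event
    return event


history = [{"sequence": 1, "event_type": "v1.package.finalized", "subject_id": "pkg-1", "actor": "ci"}]
try:
    record_byte_deletion(history, "pkg-1", "operator")
    rejected = False
except DeletionOrderError:
    rejected = True

history.append({"sequence": 2, "event_type": ELIGIBLE, "subject_id": "pkg-1", "actor": "retention"})
deleted = record_byte_deletion(history, "pkg-1", "operator")
print(rejected, deleted["sequence"])
```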
The initial API can treat all stored files as immediately retrievable. Later,
storage locations can include:
## Outstanding open questions (not blocking v1)
- `retrieval_tier`: hot, warm, cold, archive,
- `restore_status`: available, restore_requested, restoring, restored, expired,
- `restore_requested_at`,
- `restore_expires_at`.
- Identity provider for shared deployments.
- Default retention durations per class (operator-configurable; needs
one round of stakeholder input).
- WASM plugin host design (deferred to its own workplan; see
`PLATFORM-AMBITION`).
- Federation / mirroring protocol (post-OCI-endpoint workplan).
The registry API should be able to return "not immediately available" without
changing artifact identity.
## Roadmap pointer
## Security Boundary
Initial service assumptions:
- internal service, not public internet exposed,
- authenticated producer/operator API before shared deployment,
- no secret values stored in artifact metadata,
- package paths are logical paths, not trusted filesystem paths,
- download authorization should be checked at the registry layer.
Files may contain sensitive evidence. The service must treat metadata and bytes
as confidential by default.
## Open Questions
- Which identity provider should guard shared deployments?
- Should package metadata schemas be open-ended JSON or typed by producer?
- Should deduplication be package-local only or global by content hash?
- Should deletion first mark records deleted, then delete bytes, or reverse that
order with compensating events?
- How much Statehub integration belongs in this repo versus in Statehub clients?
The implementation sequence is in `docs/ROADMAP.md`. The first
workplan is `workplans/ARTIFACT-STORE-WP-0001-foundation.md`.

docs/ROADMAP.md Normal file

@@ -0,0 +1,93 @@
# Roadmap
Status: living document
Updated: 2026-05-15
The roadmap sequences `artifact-store` from "no code" to a credible
production v1 to the longer-horizon platform shape recorded in
`docs/PLATFORM-AMBITION.md`. Each row is a self-contained workplan with
its own acceptance criteria; nothing here is a binding milestone.
The sequencing principle is **library-first** (ffmpeg-shaped):
foundational kernels and contracts before any consumer code. The HTTP
server and CLI exist only after the core library can be exercised
end-to-end against a local filesystem backend.
## Phase 0 — Cleanup (done 2026-05-15)
- ADR-0001 through ADR-0006 accepted.
- Architecture blueprint rewritten to v2.
- Platform ambition and assembly experiment documented.
- Workplans re-sequenced.
## Phase 1 — Foundation and pilot (v0.1)
Goal: ingest a real guide-board run end-to-end, against a local
filesystem backend, with retention applied and events logged.
| ID | Title | Carries existing task IDs | Notes |
|---|---|---|---|
| WP-0001 | Foundation: scaffold, core kernels, local FS backend | T001, T002, T003, T008 | All of the library-shaped modules; no HTTP API yet beyond `/health`. |
| WP-0002 | Ingestion API + manifest surface | T004 | The HTTP API. Builds on WP-0001's library. |
| WP-0003 | Retention lifecycle | T005 | Retention engine, extensions, holds, deletion eligibility. |
| WP-0004 | S3-compatible backend (Ceph RGW target) | T006 | Second concrete adapter. |
| WP-0005 | Guide-board pilot ingestion | T007 | First real producer wired up. |
Exit criteria for v0.1: WP-0001 through WP-0005 done; a guide-board
CMIS run round-trips through artifact-store with manifest, retention,
and Statehub linkage; backend swappable between local FS and an
S3-compatible store.
## Phase 2 — Production hardening (v0.2 to v0.3)
| ID | Title | Notes |
|---|---|---|
| WP-0006 | Garbage collection + reference counting | Required by ADR-0001 global dedup. Mark-eligible already lands in WP-0003; this workplan does the byte-deletion pass. |
| WP-0007 | Resumable / chunked upload implementation | The wire shape lands in WP-0002; this workplan makes the implementation actually streaming. |
| WP-0008 | Auth, multi-tenancy, quota | OIDC integration; tenant namespacing; per-tenant rate limit and storage quota. |
| WP-0009 | Observability: metrics, tracing, structured logs | OpenTelemetry SDK; latency / throughput SLOs published. |
| WP-0010 | Event stream out (CDC) | NATS or Kafka topic of registry events; long-poll `/events` becomes a fallback. |
| WP-0011 | Signed manifests | Sigstore / cosign integration; signature recorded alongside manifest digest. |
Exit criteria for v0.3: a deployment is operable by humans without
internal knowledge; SLOs are measurable; access is authenticated;
artifacts can be signed and verified.
## Phase 3 — Platform features (v0.4 to v1.0)
| ID | Title | Notes |
|---|---|---|
| WP-0012 | OCI artifact `/v2/` endpoint | Implements OCI Distribution Spec on top of the same storage (ADR-0006). |
| WP-0013 | Content-defined chunking + global dedup at chunk level | FastCDC; chunked storage. Builds toward `docs/ASSEMBLY-EXPERIMENT.md`. |
| WP-0014 | Rust data plane extraction | Move `dataplane.inproc` to `dataplane.remote` (ADR-0004). |
| WP-0015 | WASM plugin host | Extension surface for indexers, redactors, scorecard generators. |
| WP-0016 | Cold-tier adapters | Glacier / Tape / IA classes; restore flow. |
| WP-0017 | Federation and replication | Signed manifest exchange between artifact-store instances. |
Exit criteria for v1.0: artifact-store is embeddable as a library, runs
as a single-binary CLI, runs as a server, speaks OCI, federates between
instances, and is fast enough to be a credible commercial substrate.
## What this roadmap deliberately does NOT promise
- Specific calendar dates. Cadence is set by sessions, not quarters.
- A UI. UIs are out-of-tree (see `docs/PLATFORM-AMBITION.md`).
- ML-specific or container-specific features. Use OCI compatibility.
- A storage backend for every cloud. Adapters are community surface.
## How to add a workplan
1. Pick the next free `ARTIFACT-STORE-WP-NNNN` number.
2. Create `workplans/ARTIFACT-STORE-WP-NNNN-<slug>.md` with the
frontmatter and task block format in `AGENTS.md`.
3. Cite the ADRs the workplan depends on in its `## Constraints`
section.
4. Append a row to the appropriate phase table in this file.
5. Notify the custodian operator to run
`make fix-consistency REPO=artifact-store`.
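
Step 1 can be sketched as a small helper. This is a minimal sketch, not part of the repository: the function name `next_workplan_id` and the example slugs are hypothetical, and it assumes that taking the highest number seen in filenames is an acceptable reading of "next free". Numbers reserved only in the roadmap tables are invisible to a filename scan.

```python
import re

# Matches active and archived workplan filenames, for example
# "ARTIFACT-STORE-WP-0005-some-slug.md" or
# "260515-ARTIFACT-STORE-WP-0001-some-slug.md" (slugs are made up).
_WP_PATTERN = re.compile(r"(?:\d{6}-)?ARTIFACT-STORE-WP-(\d{4})-[a-z0-9-]+\.md")


def next_workplan_id(filenames):
    """Return the identifier one past the highest workplan number seen."""
    used = [
        int(m.group(1))
        for name in filenames
        if (m := _WP_PATTERN.fullmatch(name))
    ]
    return f"ARTIFACT-STORE-WP-{max(used, default=0) + 1:04d}"
```

Feeding it the combined listing of `workplans/` and `workplans/archived/` keeps retired numbers from being reused.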
## How to retire a workplan
1. Set `status: done` in the frontmatter when all tasks are `done`.
2. Move the file to `workplans/archived/YYMMDD-ARTIFACT-STORE-WP-NNNN-<slug>.md`.
3. Update this roadmap to reflect the new state.

@@ -0,0 +1,80 @@
# ADR-0001 — Content-Addressed Storage with Dual Digest
Status: accepted
Date: 2026-05-15
Supersedes: —
Related: ADR-0003, ADR-0006, `docs/PLATFORM-AMBITION.md` commitments A1, A2, A9
## Context
The architecture blueprint as originally drafted addresses stored bytes by
logical `(package, relative_path)`. That is sufficient for v1 ingestion but
forecloses global deduplication, Merkle integrity proofs, partial
replication, federation, and OCI artifact compatibility — all of which the
platform ambition requires to remain reachable.
Independently, the original blueprint pins SHA-256 as the only file digest.
SHA-256 with SHA-NI on modern x86 reaches ~1.5–2 GB/s/core. BLAKE3 on the
same hardware reaches 6–10+ GB/s/core, parallelises across cores, and its
construction *is* a Merkle tree — package-level integrity becomes free.
SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we
cannot drop it.
## Decision
1. The canonical storage key for any byte sequence is its content address
in the form `<algorithm>:<lowercase-hex-digest>`. Storage backends store
and retrieve by this key. `relative_path` is logical metadata recorded
in the manifest, not a storage-layer concept.
2. Every `artifact_files` row carries two digest columns:
- `digest_primary` — the native digest; default algorithm `blake3`.
- `digest_sha256` — always populated for interop, even when `blake3`
is the primary.
Both are computed in a single ingest pass (one read of the input).
3. The schema also carries a `digest_algorithm` column naming the primary
algorithm. Additional algorithms are added by new columns or a side
table, never by overloading `digest_primary`.
4. Storage backend object keys are derived from `digest_primary` only.
Migrations between primary algorithms are explicit and audited; they
are not silent.
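
The single-pass rule in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the v1 code: `dual_digest` is a hypothetical name, the prefixed string form of the returned values is assumed, and the demo uses stdlib BLAKE2b as a stand-in primary because `blake3` is a third-party wheel. The `blake3` package exposes the same hashlib-style `update()` / `hexdigest()` interface, so `blake3.blake3` could be passed as the factory instead.

```python
import hashlib


def dual_digest(chunks, primary_name, primary_factory):
    """Feed every chunk to both hashers so the input is read only once."""
    primary = primary_factory()
    sha256 = hashlib.sha256()
    size = 0
    for chunk in chunks:
        primary.update(chunk)
        sha256.update(chunk)
        size += len(chunk)
    return {
        "digest_algorithm": primary_name,
        "digest_primary": f"{primary_name}:{primary.hexdigest()}",
        "digest_sha256": f"sha256:{sha256.hexdigest()}",
        "size": size,
    }


# Stand-in primary algorithm (assumption): BLAKE2b instead of BLAKE3.
row = dual_digest([b"hello ", b"world"], "blake2b", hashlib.blake2b)
```

The backend object key would then be derived from `row["digest_primary"]` only, as item 4 requires.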
## Consequences
Positive:
- Global deduplication is automatic — two identical files in two packages
share one backend object.
- Merkle integrity over a package is free with BLAKE3 (use the tree mode).
- Federation, partial mirrors, and OCI compatibility (ADR-0006) become
reachable without schema migration.
- Verification of a single file does not require fetching its package.
Negative:
- Two digests must be computed per ingest. Mitigated by streaming both
through one buffer; the bottleneck is I/O, not hashing.
- Reference counting: deletion of an `artifact_file` row cannot
unconditionally delete the backend object. A garbage-collector pass
reconciles references before deleting bytes. This is correct anyway
(deletion should be deliberate, per the blueprint).
- Producers requesting "store these N bytes at path P" must understand
that their P is logical. This is a documentation problem, not a
technical one.
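
The reference-counting consequence can be made concrete with a small sketch. The function name, the row shape, and the abbreviated digests below are illustrative assumptions; real digests are full-length hex strings and the real pass is the subject of WP-0006.

```python
from collections import Counter


def eligible_for_deletion(file_rows, backend_keys):
    """Return backend object keys that no artifact_files row references."""
    references = Counter(row["digest_primary"] for row in file_rows)
    return sorted(key for key in backend_keys if references[key] == 0)


# Two packages share one object; a second object is already orphaned.
rows = [
    {"package": "p1", "relative_path": "a.txt", "digest_primary": "blake3:aa"},
    {"package": "p2", "relative_path": "b.txt", "digest_primary": "blake3:aa"},
]
backend = {"blake3:aa", "blake3:bb"}
orphans = eligible_for_deletion(rows, backend)
```

Dropping one of the two rows leaves `blake3:aa` referenced, which is why row deletion alone must never delete bytes.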
## Implementation notes
- v1 ships BLAKE3 via the `blake3` PyPI wheel (Rust core, SIMD-accelerated;
no asm we maintain).
- v1 ships SHA-256 via stdlib `hashlib` (SHA-NI used when the CPython
build links against OpenSSL with SHA-NI support).
- A `Digest` value object wraps `(algorithm, hex)`; serialised forms
always include the algorithm prefix.
- A garbage-collector workplan is filed at WP-0006 (TBD); v1 does not
delete bytes automatically — it marks them eligible.
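
A minimal sketch of the `Digest` value object, assuming a frozen dataclass and a strict `<algorithm>:<lowercase-hex>` parser. The `parse` method name and the regular expression are assumptions; the sketch does not check that the hex length matches the named algorithm, which a real implementation probably should.

```python
import re
from dataclasses import dataclass

# Algorithm slug, a colon, then lowercase hex only.
_ADDRESS = re.compile(r"([a-z0-9]+):([0-9a-f]+)")


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    def __str__(self) -> str:
        # Serialised form always carries the algorithm prefix.
        return f"{self.algorithm}:{self.hex}"

    @classmethod
    def parse(cls, text: str) -> "Digest":
        match = _ADDRESS.fullmatch(text)
        if match is None:
            raise ValueError(f"not a content address: {text!r}")
        return cls(match.group(1), match.group(2))
```

Because the dataclass is frozen, instances are immutable and hashable, so they can serve directly as dictionary keys or set members.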
## Status of the original blueprint pin
The pre-cleanup blueprint's `artifact_files.sha256` column is replaced by
`digest_algorithm`, `digest_primary`, `digest_sha256`. The pre-cleanup
blueprint's implicit path-keyed storage is replaced by content-keyed
storage. These changes are absorbed into `docs/ARCHITECTURE-BLUEPRINT.md`.

@@ -0,0 +1,76 @@
# ADR-0002 — Append-Only Event Log as Source of Truth
Status: accepted
Date: 2026-05-15
Related: `docs/PLATFORM-AMBITION.md` commitment A3
## Context
The original blueprint defines `audit_events` and `retention_events` as
separate tables. Both are useful, but neither is a complete authoritative
record of how registry state was produced. Several downstream needs share
one underlying primitive:
- audit (who did what when, with what result),
- change-data-capture feed for downstream consumers (Statehub, search),
- replication and federation between instances,
- point-in-time replay and disaster recovery,
- materialised view rebuilds when schemas evolve.
Each can be served by an append-only log of registry events with a
monotonic sequence number. Two separate tables cannot.
## Decision
1. The registry persists an append-only `events` table. Every state-
changing operation writes one row in the same database transaction as
the operation. Once written, rows are immutable.
2. Each row has a strictly monotonic, gapless sequence number scoped to
the registry instance, and a UTC ingest timestamp.
3. The current `artifact_packages`, `artifact_files`, `storage_locations`,
and `retention_state` tables are materialised views over `events`.
They are rebuildable by replay.
4. Event payloads are stored as canonical CBOR (ADR-0003), keyed by
`event_type` (string slug). The `event_type` namespace is versioned
(`v1.package.created`, `v1.file.ingested`, `v1.retention.extended`,
etc.).
5. `audit_events` and `retention_events` cease to exist as standalone
tables; their semantics are subsets of `events` filtered by
`event_type`.
## Consequences
Positive:
- One primitive serves audit, CDC, replication, replay, and rebuild.
- A consumer can tail by `sequence > N` and never miss an event.
- Forward-compatibility: new view columns can be derived from existing
events by adding a replay path; no migration required.
- Signed event chains are reachable later by adding a signature column.
Negative:
- Replays cost wall-clock time on large datasets. Snapshots of
materialised views (with the highest applied sequence stamped on them)
are used to bound replay cost.
- Schema migrations on materialised views still happen; they just no
longer touch the source of truth.
- Discipline required: any write that bypasses the event log is a bug.
Enforced by code review and a runtime invariant check on the
materialised tables.
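
Replay and snapshot-bounded replay can be sketched as a fold over events. This is an in-memory illustration only: the event types come from this ADR, but the payload field names (`package_id`, `content_address`, `until`) and the dict-shaped view are hypothetical.

```python
import copy


def apply_event(view, event):
    """Fold one event into the materialised view (a plain dict here)."""
    kind, payload = event["event_type"], event["payload"]
    if kind == "v1.package.created":
        view[payload["package_id"]] = {"files": [], "retention_until": None}
    elif kind == "v1.file.ingested":
        view[payload["package_id"]]["files"].append(payload["content_address"])
    elif kind == "v1.retention.extended":
        view[payload["package_id"]]["retention_until"] = payload["until"]
    else:
        raise ValueError(f"unknown event type: {kind}")
    return view


def replay(events, snapshot=None, snapshot_sequence=0):
    """Rebuild the view, optionally resuming from a stamped snapshot."""
    view = copy.deepcopy(snapshot) if snapshot else {}
    for event in sorted(events, key=lambda e: e["sequence"]):
        if event["sequence"] > snapshot_sequence:
            apply_event(view, event)
    return view


events = [
    {"sequence": 1, "event_type": "v1.package.created",
     "payload": {"package_id": "pkg-1"}},
    {"sequence": 2, "event_type": "v1.file.ingested",
     "payload": {"package_id": "pkg-1", "content_address": "blake3:aa"}},
    {"sequence": 3, "event_type": "v1.retention.extended",
     "payload": {"package_id": "pkg-1", "until": "2030-01-01"}},
]
full = replay(events)
snapshot = replay(events[:2])
resumed = replay(events, snapshot=snapshot, snapshot_sequence=2)
```

A full replay and a replay resumed from the snapshot arrive at the same view, which is the property that lets snapshots bound replay cost.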
## Implementation notes
- `events` schema (v1):
- `sequence BIGSERIAL PRIMARY KEY`
- `created_at TIMESTAMPTZ NOT NULL DEFAULT now()`
- `event_type TEXT NOT NULL`
- `subject_kind TEXT NOT NULL` — `package` | `file` | `retention` | `storage` | `system`
- `subject_id UUID` — nullable for system-level events
- `actor TEXT NOT NULL` — producer or operator identity
- `payload BYTEA NOT NULL` — canonical CBOR
- `payload_digest BYTEA NOT NULL` — BLAKE3 of `payload`
- Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
- Replay tool ships in v1 as a CLI subcommand (`artifactstore replay`).
- Outbound CDC stream (NATS / Kafka) is its own workplan; v1 only exposes
long-poll over `GET /events?since=<sequence>`.
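
A sketch of the table and the `since` query against SQLite, using only the standard library. Assumptions are labelled in the code: SQLite types stand in for the PostgreSQL ones, SHA-256 stands in for BLAKE3 in `payload_digest`, and the helper names `append_event` and `events_since` are hypothetical.

```python
import hashlib
import sqlite3

# SQLite approximation of the v1 schema (INTEGER/TEXT/BLOB instead of
# BIGSERIAL/TIMESTAMPTZ/UUID/BYTEA).
SCHEMA = """
CREATE TABLE events (
    sequence       INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    event_type     TEXT NOT NULL,
    subject_kind   TEXT NOT NULL,
    subject_id     TEXT,
    actor          TEXT NOT NULL,
    payload        BLOB NOT NULL,
    payload_digest BLOB NOT NULL
)
"""


def append_event(conn, event_type, subject_kind, subject_id, actor, payload):
    """Insert one event row and return its sequence number."""
    digest = hashlib.sha256(payload).digest()  # stand-in for BLAKE3
    cursor = conn.execute(
        "INSERT INTO events (event_type, subject_kind, subject_id, actor,"
        " payload, payload_digest) VALUES (?, ?, ?, ?, ?, ?)",
        (event_type, subject_kind, subject_id, actor, payload, digest),
    )
    return cursor.lastrowid


def events_since(conn, since):
    """Return events with sequence greater than `since`, in order."""
    return conn.execute(
        "SELECT sequence, event_type, payload FROM events"
        " WHERE sequence > ? ORDER BY sequence",
        (since,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
with conn:  # one transaction for both writes
    first = append_event(conn, "v1.package.created", "package", "pkg-1", "demo", b"\xa0")
    second = append_event(conn, "v1.file.ingested", "file", "file-1", "demo", b"\xa0")
tail = events_since(conn, first)
```

A long-poll handler would call something like `events_since` with the consumer's last seen sequence and wait when the result is empty.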

@@ -0,0 +1,78 @@
# ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2)
Status: accepted
Date: 2026-05-15
Related: ADR-0001, ADR-0002, ADR-0006, `docs/PLATFORM-AMBITION.md` commitment A4
## Context
Manifests describe a package's identity, contents, retention, and
provenance. They are the durable, portable, signable summary of a package.
Three downstream features depend on byte-identical manifest serialisation:
1. Manifest digest (used as the package's content address — ADR-0001).
2. Signatures (cosign, Sigstore, in-toto, SLSA).
3. Cross-language / cross-version reproducibility (any client must be
able to verify a manifest produced by any other client).
JSON does not guarantee byte-identical output without an explicit
canonicalisation profile. The candidates are:
- **JCS** (JSON Canonicalization Scheme, RFC 8785) — JSON-shaped, widely
available, text-format, signs cleanly.
- **Canonical CBOR** (RFC 8949 §4.2.2) — binary, smaller, lower overhead
to canonicalise, native in cosign / Sigstore tooling, used by COSE.
- **DAG-CBOR** (IPLD profile) — canonical CBOR plus content-addressing
conventions; useful if we later integrate with IPLD/IPFS, but pulls in
ecosystem assumptions we don't yet need.
Canonical CBOR wins on size, parser surface, and direct compatibility
with the tooling we will adopt for signing (platform commitments A4, A9). JCS
is a reasonable alternative; we keep an emit-JCS path for human-readable
display but the signed form is CBOR.
## Decision
1. Manifests are serialised as **canonical CBOR** per RFC 8949 §4.2.2:
- definite-length encoding throughout,
- shortest-form integer encoding,
- map keys sorted bytewise lexicographically,
- no floating-point unless explicitly required (we do not require it),
- no semantic tags except those we explicitly enumerate.
2. The manifest's content address is `blake3:<hex>` of its canonical
CBOR bytes. This is the package's primary identifier in storage.
3. A canonical JSON projection (JCS) of the same manifest is available
for display, signing-tool interop, and human inspection. The
projection is deterministic: round-tripping through it must yield
byte-identical CBOR.
4. The manifest schema is itself versioned (`manifest_version: 1`).
Unknown fields are preserved on read and re-emitted on write (forward
compatibility); breaking schema changes bump the version.
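
The encoding rules in item 1 can be illustrated with a tiny hand-rolled encoder. This is a teaching sketch, not a replacement for the pinned CBOR library: it covers only integers, booleans, `None`, byte strings, text strings, lists, and maps, and the function names are hypothetical. Floats and tags are rejected by design, matching the "no floating-point" rule.

```python
import struct


def _head(major, n):
    """Encode a CBOR head with the shortest form for argument n."""
    if n >= 1 << 64:
        raise OverflowError("argument does not fit in 64 bits")
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 1 << 8:
        return bytes([(major << 5) | 24, n])
    if n < 1 << 16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 1 << 32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", n)


def encode(value):
    """Deterministically encode a small subset of Python values as CBOR."""
    if value is None:
        return b"\xf6"
    if isinstance(value, bool):  # must precede the int check
        return b"\xf5" if value else b"\xf4"
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, (list, tuple)):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        # Sort entries by the bytewise order of their encoded keys.
        items = sorted(
            ((encode(k), encode(v)) for k, v in value.items()),
            key=lambda kv: kv[0],
        )
        return _head(5, len(items)) + b"".join(k + v for k, v in items)
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Two dictionaries with the same entries in different insertion order encode to the same bytes, which is what makes a digest over the output stable.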
## Consequences
Positive:
- Manifests are signable today by any tool that consumes CBOR (cosign,
ssh-keygen `-Y sign`, COSE libraries).
- The manifest digest is stable across languages, OS, and compiler.
- Smaller on disk and on the wire than JSON.
- Replay (ADR-0002) is unambiguous because event payloads are also CBOR.
Negative:
- Less human-readable in raw form; the CLI must offer a `pretty` projection.
- One more dependency (a CBOR library). We pin one in ADR-0005.
- Future schema evolution requires the same canonicalisation discipline.
Enforced by a property-based test: any manifest must round-trip
CBOR → JCS → CBOR with byte equality.
## Implementation notes
- v1 library: `cbor2` (PyPI; pure-Python with optional C extension).
Wrapped behind `artifactstore.manifest.codec` so swapping to a faster
impl is transparent.
- JCS projection: `jcs` (PyPI) or hand-rolled — decision deferred to
WP-0001-T003.
- A `Manifest` value class enforces field order on emit, not just on
encode. This catches non-canonical producers at the API boundary.

@@ -0,0 +1,79 @@
# ADR-0004 — Control Plane / Data Plane Contract
Status: accepted
Date: 2026-05-15
Related: ADR-0005, `docs/PLATFORM-AMBITION.md` commitment A5,
`docs/ASSEMBLY-EXPERIMENT.md`
## Context
The platform ambition expects a Rust (eventually asm-tuned) data plane
to handle hot ingest paths — hashing, chunking, optional compression and
encryption, storage backend I/O. The v1 service is written entirely in
Python (ADR-0005). The cost of conflating control and data planes at the
code level is that extracting the data plane later requires API churn,
test rework, and producer migrations.
The cost of separating them now is one named module boundary and one
in-process protocol shape. That cost is essentially free if taken
before any consumer exists.
## Decision
1. The Python package is organised so that *every byte-handling
operation* lives behind a named contract:
- `artifactstore.dataplane.spi` — the abstract surface (typed
dataclasses, async iterator protocols).
- `artifactstore.dataplane.inproc` — the v1 implementation, running
in the same process as the control plane.
2. The control plane (`artifactstore.registry`, `artifactstore.api.http`,
`artifactstore.retention`, `artifactstore.audit`) interacts with
bytes *only* through the SPI. No HTTP handler, no DB writer, no
retention rule ever reads or writes file bytes directly.
3. The SPI exposes exactly these operations:
- `ingest_stream(stream, hints) -> IngestResult` — consumes an
upload, returns content addresses, sizes, and storage receipts.
- `serve_object(content_address, range?) -> AsyncIterator[bytes]` —
produces bytes for a download.
- `verify_object(content_address) -> VerifyResult` — re-reads bytes,
re-digests, returns mismatches.
- `delete_object(content_address) -> DeletionResult` — best-effort,
idempotent.
- `backend_health() -> BackendStatus` — readiness, latency, free
capacity.
4. The SPI surface is the contract a future Rust daemon must satisfy.
When that daemon ships, `artifactstore.dataplane.inproc` is replaced
by `artifactstore.dataplane.remote` (a thin gRPC or
framed-bincode-over-Unix-socket client). The control plane sees no
change.
5. SPI parameter and return types are CBOR-serialisable today, even when
nothing serialises them. This lets us toggle to RPC without rewriting
types.
## Consequences
Positive:
- The data plane can be rewritten in Rust later with zero API churn.
- Tests can fake the SPI cheaply; integration tests pin the contract.
- The CLI in `artifactstore.cli` is a second consumer of the SPI on
equal footing with the HTTP server.
- Operators with strong embedding requirements can use the in-process
data plane forever; nothing forces the RPC hop.
Negative:
- One extra abstraction layer in v1. Mitigated by the contract being
narrow (five operations).
- Discipline required: PRs that bypass the SPI are rejected. A linter
rule (forbidden import: `artifactstore.api.* -> filesystem`) makes
this mechanical.
## Implementation notes
- The SPI is a `typing.Protocol` in `dataplane/spi.py` so the
in-process and future remote impls don't share an inheritance tree.
- Streaming returns `AsyncIterator[bytes]` so neither full-file buffering
nor `sendfile()` zero-copy is foreclosed.
- The `IngestResult` payload is the canonical CBOR-able value used in
events (ADR-0002). The same byte sequence flows API → SPI → event.
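
A reduced sketch of the SPI as a `typing.Protocol`, with an in-memory fake of the kind a test might use. Only two of the five operations are shown, the `hints` and `range` parameters are omitted, and the class names `DataPlane` and `InMemoryDataPlane` as well as the SHA-256-only addressing are assumptions made for brevity.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class IngestResult:
    content_address: str
    size: int


class DataPlane(Protocol):
    """Subset of the SPI: the control plane depends only on this shape."""

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> IngestResult: ...

    def serve_object(self, content_address: str) -> AsyncIterator[bytes]: ...


class InMemoryDataPlane:
    """Fake implementation; shares no base class with the protocol."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> IngestResult:
        hasher = hashlib.sha256()
        parts = []
        async for chunk in stream:
            hasher.update(chunk)
            parts.append(chunk)
        data = b"".join(parts)
        address = f"sha256:{hasher.hexdigest()}"
        self._objects[address] = data
        return IngestResult(address, len(data))

    async def serve_object(self, content_address: str, chunk_size: int = 4):
        data = self._objects[content_address]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]


async def _chunks(*parts: bytes):
    for part in parts:
        yield part


async def round_trip(plane: DataPlane, *parts: bytes):
    """Ingest some chunks, then read the stored object back."""
    result = await plane.ingest_stream(_chunks(*parts))
    served = b"".join([c async for c in plane.serve_object(result.content_address)])
    return result, served
```

`round_trip` is typed against `DataPlane` only, so swapping the fake for an in-process or remote implementation would not change the caller.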

@@ -0,0 +1,117 @@
# ADR-0005 — V1 Technology Stack
Status: accepted
Date: 2026-05-15
Related: ADR-0001, ADR-0002, ADR-0003, ADR-0004
## Context
WP-0001 ("Foundation") cannot start without a pinned stack. The decision
needs to balance:
- ffmpeg / VLC philosophy: minimal dependency budget, sharp boundaries,
native code at the hot edges, plain tools.
- Python is already implied by `.gitignore` and ecosystem fit (StateHub,
guide-board, open-cmis-tck are all Python-leaning).
- The data plane will eventually be Rust (ADR-0004); the control plane
stays in Python and must stay approachable.
## Decision
| Concern | Choice | Rationale |
|---|---|---|
| Language (control plane) | **Python 3.12+** | Async ecosystem, type hints, matches sibling repos. 3.12 specifically: PEP 695 generics, faster CPython, `sys.monitoring`. |
| Package / project manager | **uv** | Single static binary, fast resolver, lockfile-first, replaces `pip + pip-tools + venv + pipx` in one tool. |
| Build backend | **hatchling** (via `pyproject.toml`) | Standards-track PEP 517 backend. No magic. |
| HTTP framework | **FastAPI** (Starlette + Pydantic v2) | OpenAPI generation, async-native, broad community. |
| ASGI server | **uvicorn** (dev), **gunicorn + uvicorn workers** (prod) | Plain, well-understood. |
| Database (prod) | **PostgreSQL 16+** | Source-of-truth event log (ADR-0002) wants `BIGSERIAL`, `BYTEA`, advisory locks, logical replication. |
| Database (dev/embedded) | **SQLite (WAL mode)** | Zero-dependency local. Schema is portable when we use SQLAlchemy Core. |
| DB access | **SQLAlchemy 2.0 Core** + **asyncpg** (prod) / **aiosqlite** (dev) | Core, not ORM — explicit SQL, async drivers. Migrations live below the API surface. |
| Migrations | **Alembic** | Standard, integrates with SQLAlchemy Core, supports both pg and sqlite. |
| Hashing | stdlib **`hashlib`** for SHA-256, **`blake3`** PyPI wheel for BLAKE3 | `blake3` wheel embeds the SIMD-tuned Rust impl with no build-time toolchain. |
| Serialisation | **`cbor2`** for canonical CBOR (ADR-0003); stdlib `json` for JCS or `jcs` PyPI | Smallest deps that satisfy ADR-0003. |
| CLI | **typer** (atop click) | Type-hint-driven CLI surface; same annotation style as the FastAPI handlers. |
| Tests | **pytest** + **httpx** + `pytest-asyncio` (no `trio-asyncio`) | Standard. |
| Lint / format | **ruff** (lint + format) | One tool replaces black + isort + flake8 + pyupgrade. |
| Type checker | **mypy** in `--strict` | Pyright is acceptable for editor support; CI gate is mypy. |
| Logging | stdlib `logging` + `structlog` for structured output | No exotic deps. |
| Metrics / tracing | OpenTelemetry SDK (deferred to its own workplan) | Listed for forward-compatibility; not a v1 dep. |
### Project layout
```
artifact-store/
├── pyproject.toml
├── uv.lock
├── Makefile # thin shim: make dev / test / lint / type / migrate
├── alembic.ini
├── src/
│ └── artifactstore/
│ ├── __init__.py
│ ├── identity/ # content address, digest abstraction (ADR-0001)
│ ├── manifest/ # canonical CBOR, JCS projection (ADR-0003)
│ ├── events/ # append-only log + replayer (ADR-0002)
│ ├── retention/ # policy engine
│ ├── audit/ # audit emission as event subset
│ ├── storage/ # adapter SPI + backend registry
│ │ ├── spi.py
│ │ └── backends/
│ │ ├── local.py # filesystem backend
│ │ └── s3.py # placeholder, WP-0004
│ ├── dataplane/ # SPI + in-process impl (ADR-0004)
│ │ ├── spi.py
│ │ └── inproc.py
│ ├── registry/ # high-level orchestrator
│ ├── api/
│ │ └── http/ # FastAPI app
│ ├── cli/ # typer CLI (thin)
│ └── config.py
├── tests/
│ ├── unit/
│ ├── integration/
│ └── conftest.py
├── migrations/ # alembic
└── docs/
```
### Commands (T001 acceptance)
```
make dev # uvicorn with reload, sqlite backend, local FS storage
make test # pytest -q
make lint # ruff check + ruff format --check
make type # mypy --strict src tests
make migrate # alembic upgrade head
artifactstore # CLI entry point installed by uv
```
## Consequences
Positive:
- Dependency budget is small and each dep is best-in-class for its slot.
- The same toolchain works on Linux, macOS, and CI without special cases.
- `uv.lock` is checked in; builds are reproducible.
- Every layer maps one-to-one to a docs concept (identity, manifest,
events, dataplane, etc.), so the codebase remains navigable.
Negative:
- Pydantic v2 is the heaviest non-DB dep; acceptable for the OpenAPI win.
- Choosing SQLAlchemy Core over ORM costs some convenience; we accept
it because explicit SQL is easier to migrate to Rust later (ADR-0004).
- mypy `--strict` is a per-PR tax; bounded by keeping the codebase small.
## Revision policy
This ADR is the most likely candidate for revision once we have profile
data from real ingestion. Candidates we are already watching:
- Replace `cbor2` with a Rust-backed CBOR codec if profile shows it on
the hot path.
- Replace `uvicorn` with `granian` (Rust ASGI server) if perf demands.
- Replace `SQLAlchemy Core` with raw `asyncpg` + a tiny query builder
if Core's abstractions show up in flame graphs.
Each replacement is its own ADR. None of them are v1 work.

@@ -0,0 +1,69 @@
# ADR-0006 — OCI Artifact Compatibility Kept Reachable
Status: accepted
Date: 2026-05-15
Related: ADR-0001, ADR-0003, `docs/PLATFORM-AMBITION.md` commitment A9
## Context
The OCI Distribution Specification and the OCI Artifact Manifest define
a widely-deployed wire format for content-addressed artifact exchange.
The ecosystem includes `oras`, `cosign`, `crane`, Helm, ChartMuseum,
ML-model packaging tools, and most container registries. Compatibility
with this ecosystem is the single highest-leverage opportunity in
`docs/PLATFORM-AMBITION.md`.
We do not implement OCI compatibility in v1. We do refuse to take any
v1 decision that prevents it.
## Decision
1. The internal data model is structurally compatible with an OCI
artifact manifest. Concretely:
- Storage addresses content as `<algorithm>:<lowercase-hex>`
(ADR-0001). OCI requires exactly this shape.
- Manifests have a `config` blob plus an ordered list of `layers`,
each with `mediaType`, `digest`, `size`, and optional
`annotations`. Our `Manifest` value class includes all of these
fields, even when v1 has no use for `mediaType` or `annotations`.
- Manifest serialisation produces byte-identical output across
callers (ADR-0003). OCI requires this for the manifest digest.
2. The native API may be richer than OCI, but v1 reviews every schema
change against the OCI spec and rejects changes that would block
later OCI compatibility.
3. A future `/v2/` namespace will speak the OCI Distribution Spec on
top of the same storage. This is its own workplan; it does not
modify v1 endpoints, only add new ones.
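
The manifest shape named in item 1 can be sketched with plain dictionaries. This is an illustration of the `config` / `layers` structure only: the helper names and the `application/vnd.example...` media types are hypothetical, and a wire-level OCI manifest carries further fields that this sketch leaves out.

```python
import hashlib


def descriptor(media_type, data, annotations=None):
    """Build a descriptor (mediaType, digest, size) for an in-memory blob."""
    entry = {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(data).hexdigest(),
        "size": len(data),
    }
    if annotations:
        entry["annotations"] = dict(annotations)
    return entry


def manifest(config_blob, layer_blobs):
    """Assemble a config descriptor plus an ordered list of layers."""
    return {
        # Hypothetical media type, used here only as a placeholder.
        "config": descriptor("application/vnd.example.config.v1+json", config_blob),
        "layers": [descriptor(media_type, data) for media_type, data in layer_blobs],
    }


example = manifest(
    b"{}",
    [
        ("text/plain", b"hello"),
        ("application/vnd.example.report.v1+json", b"{}"),
    ],
)
```

The digests here use SHA-256, which is the role ADR-0001's `digest_sha256` column plays when the native primary digest is BLAKE3.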
## Consequences
Positive:
- `oras push`, `cosign sign`, `crane copy`, `helm pull` become
reachable additions, not rewrites.
- Customers who already speak OCI can adopt incrementally.
- The `mediaType` discipline forces v1 producers to label their files,
which improves the manifest's value as a portable record.
Negative:
- v1 carries some otherwise-unnecessary manifest fields. Acceptable;
the cost is bytes, not complexity.
- The OCI manifest model uses SHA-256 as the canonical digest in
practice. ADR-0001's `digest_sha256` column satisfies this; the
native primary digest can still be BLAKE3.
## What this ADR does NOT commit to
- It does not commit to implementing OCI Distribution in v1.
- It does not commit to OCI as the *only* wire format. The native API
remains the richer interface.
- It does not commit to specific OCI media types for evidence packages.
Media-type assignment is the subject of a later workplan.
## Review trigger
Every schema-affecting workplan (anything that touches the data model
or the manifest shape) must include an explicit one-paragraph review
against this ADR. Reject changes that introduce OCI-incompatible
invariants without superseding this ADR.

docs/adr/README.md Normal file

@@ -0,0 +1,32 @@
# Architecture Decision Records
This directory holds the architectural decisions that govern `artifact-store`.
Each ADR is a small Markdown file with a status (`proposed`, `accepted`,
`superseded`, `deprecated`), a concise statement of the decision, the
forces that pushed it, and the consequences.
ADRs are the canonical home for "we are doing X" statements that survive
multiple workplans. `INTENT.md` says what we build; `SCOPE.md` says where
the boundary is; `docs/PLATFORM-AMBITION.md` says where we are pointed;
ADRs say how — and they are the only document that records a *changeable*
decision in a form that can be superseded cleanly.
Workplans cite the ADRs they depend on. The architecture blueprint cites
the ADRs it operationalises.
## Index
- [ADR-0001 — Content-Addressed Storage with Dual Digest](0001-content-addressed-storage.md) — accepted
- [ADR-0002 — Append-Only Event Log as Source of Truth](0002-event-log-source-of-truth.md) — accepted
- [ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2)](0003-manifest-canonical-cbor.md) — accepted
- [ADR-0004 — Control Plane / Data Plane Contract](0004-control-plane-data-plane-contract.md) — accepted
- [ADR-0005 — V1 Technology Stack](0005-v1-tech-stack.md) — accepted
- [ADR-0006 — OCI Artifact Compatibility Kept Reachable](0006-oci-compatibility-reachable.md) — accepted
## Conventions
- Filenames: `NNNN-kebab-case-slug.md`, numbered in acceptance order.
- Status transitions: `proposed → accepted → (superseded | deprecated)`.
- Supersession is explicit: the new ADR links the old; the old ADR links
forward and changes status. Never delete an ADR.
- Each ADR is short. If it is long, it is wrong: split it.

@@ -1,7 +1,7 @@
---
id: ARTIFACT-STORE-WP-0001
type: workplan
title: "Artifact Store Service Baseline"
title: "Foundation: Scaffold, Core Kernels, Local FS Backend"
repo: artifact-store
domain: stack
status: active
@@ -14,51 +14,53 @@ updated: "2026-05-15"
state_hub_workstream_id: "aebf996c-8721-4e8c-9e56-61d5e4bf8dcb"
---
# ARTIFACT-STORE-WP-0001: Artifact Store Service Baseline
# ARTIFACT-STORE-WP-0001: Foundation — Scaffold, Core Kernels, Local FS Backend
## Purpose
Implement the first usable artifact registry and storage gateway. The service
should preserve artifact packages, index their metadata, delegate bytes to a
configured storage backend, apply default retention rules, and expose stable
package identifiers that Statehub and producer repositories can link to.
Stand up the smallest credible `artifact-store` core. By the end of
this workplan, the library can ingest a directory of files into a
package, compute dual digests, write canonical-CBOR manifests, persist
state through the append-only event log, store bytes on the local
filesystem, and replay materialised views from the event log. No HTTP
API yet (that lands in WP-0002); a `/health` endpoint exists so that
the dev loop has something to hit.
The first producer target is a guide-board assessment run, including OpenCMIS TCK
reports and raw assessment artifacts.
The shape is **library-first** (ffmpeg-style). HTTP server and CLI are
explicitly thin consumers of `artifactstore.registry`.
## Background
## Constraints (must satisfy)
Guide-board can already produce self-contained run directories with retention
summaries, assessment packages, raw artifacts, scorecards, and log reviews. Those
directories should not live only in `/tmp`, and committing raw evidence into
producer repositories is the wrong long-term shape.
`artifact-store` becomes the shared preservation layer:
- producers generate files,
- artifact-store registers and stores them,
- Statehub records the work outcome and links to the registry package,
- storage backends handle durable bytes.
Ceph is the likely self-hosted production backend through its S3-compatible RGW
interface, but the service must keep the backend interface generic.
## Target Architecture
```text
producer package
-> registry API
-> metadata database
-> retention policy engine
-> storage adapter
-> local filesystem or S3-compatible object storage
```
- ADR-0001 — content-addressed storage with dual digest.
- ADR-0002 — append-only event log as source of truth.
- ADR-0003 — manifest canonicalisation = canonical CBOR.
- ADR-0004 — control plane / data plane SPI named.
- ADR-0005 — v1 technology stack pinned (Python 3.12, uv, FastAPI,
SQLAlchemy Core, asyncpg, alembic, cbor2, blake3, ruff, mypy, pytest).
- ADR-0006 — OCI compatibility kept reachable.
- `docs/ARCHITECTURE-BLUEPRINT.md` data model and module layout.
## Boundary
This workplan owns the first service implementation and API contract. It does
not need to build a UI, implement cold-storage restore tiers, replace Statehub,
or provide formal records-management certification.
This workplan builds the library and a minimal `/health` endpoint. It
does NOT implement: package CRUD HTTP API (WP-0002), retention rules
beyond the seed (WP-0003), S3-compatible backend (WP-0004), guide-board
producer wiring (WP-0005), GC of unreferenced bytes (WP-0006).
## Target architecture (this workplan)
```text
artifactstore (library)
identity ──┐
manifest ──┼──> registry (orchestrator) ──> events (WAL + views)
events ───┘ │
retention (seed only) └──> dataplane.spi ──> dataplane.inproc ──> storage.spi ──> storage.backends.local
audit (view) └──> filesystem
storage.spi
dataplane.spi + inproc
api.http (just /health)
cli (just `artifactstore version`, `artifactstore migrate`, `artifactstore replay`)
```
## D1.1 - Service Scaffold And Repository Identity
@@ -71,14 +73,71 @@ state_hub_task_id: "84209430-ec3b-4c5e-924e-019c25434230"
Acceptance:
- Replace the seed README with artifact-store service instructions.
- Add a Python service scaffold with a clear package/module layout.
- Provide a local development command.
- Provide a test command.
- Keep generated artifact bytes and local databases ignored by git.
- Document required environment variables.
- `pyproject.toml` with `hatchling` build backend, pinned dependencies
per ADR-0005.
- `uv.lock` committed.
- `Makefile` exposes: `make dev`, `make test`, `make lint`, `make
type`, `make migrate`. Each target is a thin shim, no logic inline.
- `src/artifactstore/` package skeleton matches ADR-0005's layout
(empty `__init__.py` and one placeholder module per top-level
concern: `identity`, `manifest`, `events`, `retention`, `audit`,
`storage`, `dataplane`, `registry`, `api/http`, `cli`, `config`).
- `tests/{unit,integration}/conftest.py` in place.
- `.env.example` documents required environment variables:
`ARTIFACTSTORE_DATABASE_URL`, `ARTIFACTSTORE_STORAGE_LOCAL_ROOT`,
`ARTIFACTSTORE_LOG_LEVEL`.
- CI-equivalent local commands: `make lint && make type && make test`
pass on a clean checkout.
- `README.md` replaces the seed README: install with `uv sync`, run
with `make dev`, test with `make test`, links to ADRs and blueprint.
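
The environment variables named above could be loaded along these lines. This is a sketch under assumptions: the `Settings` dataclass, the `load_settings` function, and the `INFO` fallback for the log level are all hypothetical and not taken from the workplan.

```python
import os
from dataclasses import dataclass

_REQUIRED = ("ARTIFACTSTORE_DATABASE_URL", "ARTIFACTSTORE_STORAGE_LOCAL_ROOT")


@dataclass(frozen=True)
class Settings:
    database_url: str
    storage_local_root: str
    log_level: str = "INFO"  # assumed default, not from the workplan


def load_settings(environ=None):
    """Read settings from the environment, failing fast on missing values."""
    env = os.environ if environ is None else environ
    missing = [name for name in _REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return Settings(
        database_url=env["ARTIFACTSTORE_DATABASE_URL"],
        storage_local_root=env["ARTIFACTSTORE_STORAGE_LOCAL_ROOT"],
        log_level=env.get("ARTIFACTSTORE_LOG_LEVEL", "INFO"),
    )
```

Accepting an explicit mapping keeps the loader testable without touching the process environment.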
## D1.2 - Registry Data Model
## D1.2 - Digest Abstraction And Content Address
```task
id: ARTIFACT-STORE-WP-0001-T009
status: todo
priority: high
```
Acceptance:
- `identity.Digest` value type with `algorithm: str` and `hex: str`,
immutable, hashable.
- `identity.ContentAddress` — string-form `<algorithm>:<hex>` with
validating parser and emitter.
- `identity.digest_stream(reader) -> {primary: Digest, sha256: Digest}` —
single-pass dual-hash over an `AsyncIterator[bytes]`. Default primary
algorithm: `blake3`.
- Algorithm registry with `blake3` and `sha256` registered at import.
- Property test: digest over random byte sequences round-trips through
serialisation; `sha256` matches `hashlib.sha256(...).hexdigest()`;
`blake3` matches `blake3.blake3(...).hexdigest()`.
## D1.3 - Manifest Codec (Canonical CBOR + JCS Projection)
```task
id: ARTIFACT-STORE-WP-0001-T010
status: todo
priority: high
```
Acceptance:
- `manifest.Manifest` dataclass with the v1 fields enumerated in the
blueprint (`manifest_version=1`, package, files, storage_receipts,
retention_summary, provenance).
- `manifest.codec.encode(m) -> bytes` produces canonical CBOR
(RFC 8949 §4.2.2): definite-length, shortest-form integers,
sorted map keys.
- `manifest.codec.decode(b) -> Manifest`.
- `manifest.projection.jcs(m) -> bytes` produces RFC 8785 canonical
JSON.
- Property test: `decode(encode(m)) == m` for randomly-generated
manifests; `encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.
- Manifest digest helper: `manifest_digest(m) -> ContentAddress` using
BLAKE3 over the canonical CBOR bytes.
## D1.4 - Registry Data Model And Migrations
```task
id: ARTIFACT-STORE-WP-0001-T002
@@ -89,16 +148,44 @@ state_hub_task_id: "e5249a39-46a2-4b56-813e-0339c52cd14e"
Acceptance:
- Define persistent models for artifact packages, files, storage locations,
retention rules, retention events, and audit events.
- Store package metadata as structured JSON while keeping core query fields
explicit.
- Record package lifecycle status: created, uploading, finalized, deleted, and
failed.
- Record file `sha256`, size, media type, and logical relative path.
- Add migrations or a reproducible schema initialization path.
- Alembic configured with `migrations/` directory; `alembic upgrade
head` works against both SQLite (dev) and PostgreSQL (prod).
- `events`, `artifact_packages`, `artifact_files`, `storage_locations`,
`retention_classes`, `retention_state`, `metadata_schemas` tables
match the blueprint schema.
- Seed migration populates `retention_classes` with the five v1 entries.
- A `make migrate` and `make migrate-fresh` target work end-to-end on
a clean DB.
- All schema columns required by ADR-0001 (`digest_algorithm`,
`digest_primary`, `digest_sha256`, `content_address`), ADR-0002
(full `events` table), and the blueprint's `retrieval_tier` and
`restore_status` are present.
## D1.3 - Local Filesystem Storage Backend
## D1.5 - Event Log Persistence And Replay
```task
id: ARTIFACT-STORE-WP-0001-T011
status: todo
priority: high
```
Acceptance:
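- `events.write(transaction, Event)` writes one row in the given DB
  transaction. Sequence numbers are assigned by the DB and must be
  monotonic and gapless within a registry instance. A bare
  `BIGSERIAL` is monotonic but can leave gaps when a transaction
  rolls back, so gaplessness needs a mechanism beyond the sequence
  itself.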
- `events.tail(since_sequence) -> AsyncIterator[Event]` long-polls
the table (notify-style on PostgreSQL via `LISTEN/NOTIFY`,
poll-style on SQLite).
- `events.replay(into=ViewWriter)` rebuilds all materialised view
tables from `events` deterministically.
- Test: ingesting a fixed sequence of events, then rebuilding the
views from scratch, yields byte-identical materialised state.
- Event payloads use canonical CBOR (`manifest.codec`) so the same
bytes flow through registry → DB → tail consumer without re-encoding.
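The write-then-replay contract can be sketched against in-memory SQLite. The table layout, the `project` function and the `v1.package.created` event kind are assumptions made for the illustration; only `v1.package.finalized` is named in this workplan, and the real schema lives in the blueprint.

```python
import sqlite3

SCHEMA = """
CREATE TABLE events (
    sequence   INTEGER PRIMARY KEY AUTOINCREMENT,
    kind       TEXT NOT NULL,
    package_id TEXT NOT NULL
);
CREATE TABLE package_view (
    package_id TEXT PRIMARY KEY,
    status     TEXT NOT NULL
);
"""


def project(conn: sqlite3.Connection, kind: str, package_id: str) -> None:
    """Apply one event to the materialised view (pure function of the event)."""
    if kind == "v1.package.created":
        conn.execute(
            "INSERT INTO package_view VALUES (?, 'created')", (package_id,)
        )
    elif kind == "v1.package.finalized":
        conn.execute(
            "UPDATE package_view SET status = 'finalized' WHERE package_id = ?",
            (package_id,),
        )


def write(conn: sqlite3.Connection, kind: str, package_id: str) -> int:
    """Append the event and update the view in one transaction."""
    with conn:
        cursor = conn.execute(
            "INSERT INTO events (kind, package_id) VALUES (?, ?)",
            (kind, package_id),
        )
        project(conn, kind, package_id)
        return cursor.lastrowid


def replay(conn: sqlite3.Connection) -> None:
    """Rebuild the view from scratch by folding over events in sequence order."""
    with conn:
        conn.execute("DELETE FROM package_view")
        rows = conn.execute(
            "SELECT kind, package_id FROM events ORDER BY sequence"
        ).fetchall()
        for kind, package_id in rows:
            project(conn, kind, package_id)


def snapshot(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return the view contents in a stable order for comparison."""
    return conn.execute(
        "SELECT package_id, status FROM package_view ORDER BY package_id"
    ).fetchall()
```

Replay is deterministic here because `project` depends only on the event row and the events are read in `sequence` order.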
## D1.6 - Storage Adapter SPI And Local Filesystem Backend
```task
id: ARTIFACT-STORE-WP-0001-T003
@@ -109,90 +196,81 @@ state_hub_task_id: "68f9a752-0012-4cc1-8768-ec3f75295e7a"
Acceptance:
- Implement a storage adapter interface.
- Implement a local filesystem backend for development and tests.
- Store objects under deterministic package/file keys.
- Prevent path traversal and accidental writes outside the configured storage
root.
- Add backend health reporting.
- Add tests for put, get, head, and delete operations.
- `storage.spi.StorageBackend` Protocol matches the blueprint.
- `storage.backends.local.LocalBackend` implements the SPI:
- Object key layout `<root>/<algo>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Atomic write via `fsync(tmpfile) + rename`.
- Path traversal rejected at the SPI boundary.
- `health()` returns disk usage and root accessibility.
- Backend registry resolves by `backend_id` string (per ADR-0004).
- Unit tests cover: put, get, head, delete, double-put idempotency,
delete-of-missing, range read.
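A sketch of the key layout and atomic write for the local backend, assuming hypothetical helper names `object_path` and `atomic_put`. Traversal is rejected by validating the content address before any path is built, so a key can never contain `/` or `..`.

```python
import os
import re
import tempfile
from pathlib import Path

# Lowercase algorithm name, then at least eight lowercase hex characters.
_ADDRESS = re.compile(r"(?P<algo>[a-z0-9]+):(?P<hex>[0-9a-f]{8,})")


def object_path(root: Path, content_address: str) -> Path:
    """Map `<algo>:<hex>` to `<root>/<algo>/<hex[0:2]>/<hex[2:4]>/<hex>`."""
    match = _ADDRESS.fullmatch(content_address)
    if match is None:
        raise ValueError(f"invalid content address: {content_address!r}")
    algo, hex_ = match["algo"], match["hex"]
    return root / algo / hex_[0:2] / hex_[2:4] / hex_


def atomic_put(root: Path, content_address: str, data: bytes) -> Path:
    """Write to a temp file in the target directory, fsync, then rename."""
    target = object_path(root, content_address)
    if target.exists():
        return target  # double-put is a no-op for content-addressed objects
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Creating the temp file inside the destination directory keeps the rename on a single filesystem, which is what makes it atomic.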
## D1.4 - Package Ingestion API
## D1.7 - Data Plane SPI And In-Process Implementation
```task
id: ARTIFACT-STORE-WP-0001-T004
id: ARTIFACT-STORE-WP-0001-T012
status: todo
priority: high
state_hub_task_id: "e3879111-4be9-4731-8aea-15abb874f960"
```
Acceptance:
- Add endpoints to create a package, upload files, finalize a package, retrieve
package metadata, list packages, and download files.
- Compute file hashes server-side during ingestion.
- Reject duplicate logical paths within one package unless explicitly replacing
a non-finalized file.
- Produce a package manifest after finalization.
- Add API tests covering successful ingestion and validation failures.
- `dataplane.spi.DataPlane` Protocol matches ADR-0004.
- `dataplane.inproc.InProcessDataPlane` implements all five operations
on top of a configured `StorageBackend`.
- `ingest_stream` computes both digests in a single pass, writes to
the backend keyed by the primary content address, and returns an
`IngestResult` containing both digests, size, and the
`StorageReceipt`.
- `serve_object` and `verify_object` re-read bytes through the
backend; `verify_object` re-digests and returns mismatches if any.
- Lint rule (or test): no code outside `dataplane.*` imports
`storage.backends.*` directly.
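The import boundary in the last bullet can be enforced by a test that parses source with `ast` rather than by a linter plugin. The sketch below is one possible shape; `violations` and `imported_modules` are hypothetical names, and a real test would walk the package directory and feed each file through it.

```python
import ast

FORBIDDEN = "storage.backends"
ALLOWED_PACKAGE = "dataplane"


def imported_modules(source: str) -> set[str]:
    """Collect every dotted name a module imports, absolute imports only."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module)
            # `from storage import backends` must also be caught.
            names.update(f"{node.module}.{alias.name}" for alias in node.names)
    return names


def violations(module_name: str, source: str) -> list[str]:
    """Return forbidden imports found in a module outside `dataplane.*`."""
    if ALLOWED_PACKAGE in module_name.split("."):
        return []
    return sorted(
        name
        for name in imported_modules(source)
        # Match on whole dotted segments, with or without a package prefix.
        if f".{FORBIDDEN}." in f".{name}."
    )
```

Matching on dotted segments means the check works whether modules are referred to as `storage.backends.local` or with a top-level package prefix in front.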
## D1.5 - Retention Baseline
## D1.8 - Registry Orchestrator (Library Surface)
```task
id: ARTIFACT-STORE-WP-0001-T005
id: ARTIFACT-STORE-WP-0001-T013
status: todo
priority: high
state_hub_task_id: "2d6cbd83-c348-45ad-a223-7870a3412225"
```
Acceptance:
- Seed default retention classes for transient, raw-evidence, summary-evidence,
release-evidence, and permanent-record.
- Apply a default `expires_at` when a package is created or finalized.
- Add endpoints to extend retention and apply or release holds.
- Record retention changes as retention events and audit events.
- Expose deletion eligibility without deleting bytes automatically in the first
implementation.
- `registry.Registry` exposes: `create_package`, `ingest_file`,
  `finalize_package`, `get_manifest_bytes` (CBOR + JCS), `get_file`,
  `tail_events`, plus stubs for the retention operations, which
  lightens WP-0003.
- Each mutating operation is one DB transaction that writes events
AND updates materialised views.
- Finalisation writes one `v1.package.finalized` event whose payload
*is* the canonical CBOR manifest, and stamps `manifest_digest` on
`artifact_packages`.
- Duplicate `relative_path` within one not-yet-finalised package is
rejected unless an explicit replace is requested.
- Integration test: end-to-end ingest of a 3-file package against
local backend → finalize → read manifest → verify digests
→ tail events → replay rebuilds identical state.
## D1.6 - S3-Compatible Backend Design Hook
## D1.9 - Minimal HTTP App And CLI
```task
id: ARTIFACT-STORE-WP-0001-T006
id: ARTIFACT-STORE-WP-0001-T014
status: todo
priority: medium
state_hub_task_id: "7b980a55-2364-48c3-98ac-081629a8d2b7"
```
Acceptance:
- Define configuration fields for an S3-compatible backend.
- Keep the adapter contract compatible with Ceph RGW.
- Add an implementation stub or feature-flagged backend if dependencies are not
ready.
- Document expected Ceph/S3 configuration without requiring a live Ceph service
for baseline tests.
- `api.http.app` is a FastAPI app with one route: `GET /health`
reporting registry liveness, DB connectivity, and backend health.
- `cli` exposes `artifactstore version`, `artifactstore migrate`,
`artifactstore replay`, `artifactstore health`.
- `make dev` starts the API on `127.0.0.1:8000` with SQLite +
local FS backend by default.
## D1.7 - Guide-Board Pilot Ingestion
```task
id: ARTIFACT-STORE-WP-0001-T007
status: todo
priority: high
state_hub_task_id: "eb822821-353c-4cd2-95bf-acb2f084b7ea"
```
Acceptance:
- Provide a CLI helper or documented curl flow to register a guide-board run
directory as one package.
- Preserve guide-board run metadata: run id, target profile, assessment profile,
evidence result counts, finding counts, source commits, and report paths.
- Ingest the CMIS pilot run shape, including scorecard and log-review reports.
- Return a package id suitable for recording in Statehub.
- Add a fixture-based test that does not require the real OpenCMIS TCK.
## D1.8 - Operator Documentation And Handoff
## D1.10 - Operator Documentation And ADR Cross-Linking
```task
id: ARTIFACT-STORE-WP-0001-T008
@@ -203,27 +281,33 @@ state_hub_task_id: "9b60036c-61f2-4c22-ad31-7213473d42d0"
Acceptance:
- Document local run, test, and package ingestion commands.
- Document retention behavior and extension flow.
- Document the boundary between artifact-store and Statehub.
- Include a dev-agent handoff section listing the first implementation order.
- Keep architecture docs aligned with the implemented API.
- `README.md` updated with current run / test / migrate commands.
- `AGENTS.md` "Current Repo Shape" section reflects the scaffold.
- A `docs/OPERATOR.md` page documents environment variables, local
vs PostgreSQL setup, replay command, and a smoke-test recipe.
- Every ADR is cross-linked from at least one of: blueprint, this
workplan, or `OPERATOR.md`.
## Suggested Implementation Order
## Suggested implementation order
1. Service scaffold, test harness, and README.
2. Metadata models and local database setup.
3. Local filesystem storage adapter.
4. Package create/upload/finalize/download API.
5. Retention defaults, extension, hold, and audit events.
6. Guide-board run ingestion helper.
7. S3-compatible backend configuration and Ceph notes.
1. T001 — scaffold and tooling (no other task can start without this).
2. T009 — digest abstraction (unblocks T010, T012).
3. T010 — manifest codec (unblocks T013).
4. T002 — schema and migrations (unblocks T011, T013).
5. T011 — event log + replay.
6. T003 — storage SPI + local backend.
7. T012 — data plane SPI + in-process impl.
8. T013 — registry orchestrator.
9. T014 — minimal HTTP app and CLI.
10. T008 — docs.
## First Pilot Success Criteria
## Success criteria
- A completed guide-board CMIS run can be ingested from a local directory.
- The package manifest lists every stored file with SHA-256 and size.
- The registry returns a stable package id.
- Files can be downloaded through the service.
- Default retention is visible and can be extended.
- Statehub can record the package id and summary without storing artifact bytes.
- `make dev && make test` round-trips on a clean checkout.
- A scripted integration test ingests a directory of fixture files,
finalises the package, reads the manifest, downloads each file, and
verifies digests end-to-end against the local backend.
- Replaying events from sequence 1 reproduces the materialised view
state byte-for-byte.
- The library can be imported and exercised without an HTTP server
running (embedding test).

@@ -0,0 +1,150 @@
---
id: ARTIFACT-STORE-WP-0002
type: workplan
title: "Ingestion API And Manifest Surface"
repo: artifact-store
domain: stack
status: planned
owner: codex
topic_slug: stack
planning_priority: high
planning_order: 2
created: "2026-05-15"
updated: "2026-05-15"
---
# ARTIFACT-STORE-WP-0002: Ingestion API And Manifest Surface
## Purpose
Expose the WP-0001 library as a complete HTTP API. Producers can create
packages, ingest files (single-shot or via the upload-session resource
shape), finalise to produce a manifest, list and search packages,
download files, and tail the event stream.
## Constraints
- ADR-0001, ADR-0002, ADR-0003, ADR-0004, ADR-0005, ADR-0006.
- `docs/ARCHITECTURE-BLUEPRINT.md` API shape section.
- All handlers must be thin: translate transport → `registry.*` calls.
## Prerequisites
- WP-0001 done (library is functional against local backend).
## D2.1 - Package CRUD Endpoints
```task
id: ARTIFACT-STORE-WP-0002-T001
status: todo
priority: high
state_hub_task_id: "e3879111-4be9-4731-8aea-15abb874f960"
```
Acceptance:
- `POST /packages`, `GET /packages` (filterable by producer / subject /
retention_class / metadata key), `GET /packages/{id}`,
`POST /packages/{id}/files` (single-shot multipart),
`POST /packages/{id}/finalize`.
- `GET /packages/{id}/manifest` (`Accept: application/cbor`) and
`GET /packages/{id}/manifest.json` (JCS projection).
- Validation errors return RFC 7807 problem documents.
- OpenAPI is generated automatically (FastAPI default) and served at
`/openapi.json` + `/docs`.
## D2.2 - File Download And Range Reads
```task
id: ARTIFACT-STORE-WP-0002-T002
status: todo
priority: high
```
Acceptance:
- `GET /files/{file_id}` returns metadata.
- `GET /files/{file_id}/download` streams bytes; supports `Range`
request headers (single contiguous range; multi-range is out of
scope for v1).
- ETag is the file's primary content address; `If-None-Match` returns
`304`.
- Streaming uses `AsyncIterator[bytes]` end-to-end; no full-file
buffering.
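Single-range handling is small enough to sketch as a pure function. `parse_single_range` and `content_range` are hypothetical helper names; a `ValueError` here would map to `416 Range Not Satisfiable` or to ignoring the header, a decision this sketch leaves to the handler.

```python
def parse_single_range(header: str, size: int) -> tuple[int, int]:
    """Return inclusive (start, end) offsets for one `bytes=` range."""
    unit, sep, spec = header.partition("=")
    if not sep or unit.strip().lower() != "bytes":
        raise ValueError("unsupported range unit")
    if "," in spec:
        raise ValueError("multi-range requests are out of scope")
    first, dash, last = spec.strip().partition("-")
    if not dash:
        raise ValueError("malformed range")
    if first == "":
        # Suffix form `bytes=-N`: the final N bytes of the object.
        if not last.isdigit():
            raise ValueError("malformed suffix range")
        length = int(last)
        if length == 0 or size == 0:
            raise ValueError("unsatisfiable range")
        return max(size - length, 0), size - 1
    if not first.isdigit() or (last != "" and not last.isdigit()):
        raise ValueError("malformed range")
    start = int(first)
    end = int(last) if last else size - 1
    if start >= size or end < start:
        raise ValueError("unsatisfiable range")
    return start, min(end, size - 1)  # clamp an over-long end to the object


def content_range(start: int, end: int, size: int) -> str:
    """Format the `Content-Range` value for a 206 response."""
    return f"bytes {start}-{end}/{size}"
```

The returned offsets can be passed straight to a backend range read, so the handler never needs to buffer the whole object.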
## D2.3 - Upload Session Resource (Wire Shape Pinned)
```task
id: ARTIFACT-STORE-WP-0002-T003
status: todo
priority: medium
```
Acceptance:
- `POST /uploads` opens a session and returns an upload id and a
  content upload URL.
- `PATCH /uploads/{upload_id}` accepts a body with `Content-Range`;
v1 implementation may accept the whole body in one call.
- `POST /uploads/{upload_id}/complete` promotes the upload into a
file under a given package id and relative path.
- Implementation is allowed to be single-shot internally; the wire
shape and resource lifecycle must be the final one (per
PLATFORM-AMBITION A6).
## D2.4 - Event Stream Long-Poll
```task
id: ARTIFACT-STORE-WP-0002-T004
status: todo
priority: medium
```
Acceptance:
- `GET /events?since=<sequence>&limit=N` returns events in order with
a long-poll wait when the tail is reached.
- Events are CBOR by default; `Accept: application/json` returns the
JCS projection of each event payload.
- Test: a consumer that tails from sequence 1 never misses an event
produced during the test.
## D2.5 - Auth Scaffolding (Shared-Secret Bearer)
```task
id: ARTIFACT-STORE-WP-0002-T005
status: todo
priority: medium
```
Acceptance:
- Bearer token auth on all mutating endpoints; configurable per-tenant
token list via env / config file.
- Read endpoints are also gated by default; an explicit
`ARTIFACTSTORE_ANON_READ=true` opt-in for dev.
- Health endpoint remains anonymous.
## D2.6 - Integration Tests Through The Full HTTP Surface
```task
id: ARTIFACT-STORE-WP-0002-T006
status: todo
priority: high
```
Acceptance:
- httpx-based test suite exercises every endpoint.
- A scripted test ingests a 50-file package, finalises it, downloads
every file, verifies digests, and tails events.
- A property-based test fuzzes the upload session lifecycle.
## Success criteria
- A producer can run the full ingest-and-retrieve flow against
`make dev` with curl.
- All blueprint endpoints in the v1 native surface are implemented.
- The CLI gains `artifactstore push <dir>` and
`artifactstore manifest <package_id>` subcommands as thin clients
over the HTTP API.

@@ -0,0 +1,132 @@
---
id: ARTIFACT-STORE-WP-0003
type: workplan
title: "Retention Lifecycle: Defaults, Extensions, Holds, Deletion Eligibility"
repo: artifact-store
domain: stack
status: planned
owner: codex
topic_slug: stack
planning_priority: high
planning_order: 3
created: "2026-05-15"
updated: "2026-05-15"
---
# ARTIFACT-STORE-WP-0003: Retention Lifecycle
## Purpose
Implement the retention engine. By the end of this workplan, every
package has a computed `expires_at`, operators can extend retention or
apply / release holds, and the system can mark expired packages as
eligible for deletion — without actually deleting bytes (GC is
WP-0006).
## Constraints
- ADR-0002 (every retention change is an event).
- `docs/ARCHITECTURE-BLUEPRINT.md` retention sections.
## Prerequisites
- WP-0001 done (`retention_classes` seeded, `retention_state` view
exists).
- WP-0002 done (HTTP surface exists to attach the new endpoints to).
## D3.1 - Default Retention Application
```task
id: ARTIFACT-STORE-WP-0003-T001
status: todo
priority: high
state_hub_task_id: "2d6cbd83-c348-45ad-a223-7870a3412225"
```
Acceptance:
- On `POST /packages`, the requested `retention_class` is validated
and the `v1.retention.default_applied` event is written with the
computed `expires_at`.
- Default durations per class are operator-configurable via a
config file (TOML); the file path is documented in `OPERATOR.md`.
- `permanent-record` packages have `expires_at = NULL` and
`eligible_for_deletion = false`.
## D3.2 - Retention Extensions
```task
id: ARTIFACT-STORE-WP-0003-T002
status: todo
priority: high
```
Acceptance:
- `POST /packages/{id}/retention/extensions` accepts
`{new_expires_at, reason}`. The new value must be strictly later
than the current; reason is mandatory.
- Each extension writes a `v1.retention.extended` event;
  `retention_state.current_expires_at` updates in the same
  transaction.
- A package's full extension history is recoverable from `events`.
## D3.3 - Holds (Apply And Release)
```task
id: ARTIFACT-STORE-WP-0003-T003
status: todo
priority: high
```
Acceptance:
- `POST /packages/{id}/retention/holds` records a hold with a reason
and actor; emits `v1.retention.hold_applied`.
- A package with at least one active hold is never
`eligible_for_deletion` regardless of `expires_at`.
- `POST /packages/{id}/retention/holds/{hold_id}/release` requires a
reason; emits `v1.retention.hold_released`.
- Test: hold applied → expiry passes → eligibility stays `false`;
hold released → eligibility flips to `true`.
## D3.4 - Deletion Eligibility Sweeper
```task
id: ARTIFACT-STORE-WP-0003-T004
status: todo
priority: medium
```
Acceptance:
- A scheduled task (cron-style configurable interval; default 1 hour)
  scans for packages whose `expires_at` has passed and that have no
  active hold, and emits `v1.retention.deletion_eligible` events.
- The sweeper is idempotent: events are emitted at most once per
package per eligibility transition.
- The sweeper is invokable as a CLI subcommand for tests:
`artifactstore retention sweep`.
## D3.5 - Audit Surface For Retention
```task
id: ARTIFACT-STORE-WP-0003-T005
status: todo
priority: medium
```
Acceptance:
- `GET /packages/{id}/retention/history` returns the ordered list of
retention events for a package.
- The default response is the JCS projection; CBOR is available via
`Accept: application/cbor`.
## Success criteria
- A guide-board run can be ingested, given `release-evidence`, later
extended once, held for a quarter, released, swept, and marked
eligible — all visible through both `retention_state` and the
event log.
- No bytes are deleted by this workplan; that is WP-0006.

@@ -0,0 +1,131 @@
---
id: ARTIFACT-STORE-WP-0004
type: workplan
title: "S3-Compatible Backend (Ceph RGW Target)"
repo: artifact-store
domain: stack
status: planned
owner: codex
topic_slug: stack
planning_priority: medium
planning_order: 4
created: "2026-05-15"
updated: "2026-05-15"
---
# ARTIFACT-STORE-WP-0004: S3-Compatible Backend
## Purpose
Add a second concrete storage backend that speaks the S3 protocol.
Validated targets: Ceph RGW (primary self-hosted production target),
MinIO (dev / CI), AWS S3 (interop check). The backend must satisfy
the storage SPI without any leaks of S3-specific concepts into the
registry.
## Constraints
- `storage.spi.StorageBackend` Protocol from WP-0001 is the contract.
- No S3 vocabulary leaks into `registry.*` or `api.*`.
- `docs/ARCHITECTURE-BLUEPRINT.md` storage-backend section.
## Prerequisites
- WP-0001 done (SPI exists, local backend exists as a reference).
## D4.1 - Configuration Surface
```task
id: ARTIFACT-STORE-WP-0004-T001
status: todo
priority: high
state_hub_task_id: "7b980a55-2364-48c3-98ac-081629a8d2b7"
```
Acceptance:
- `s3` backend configuration accepts: `endpoint_url`, `region`,
`bucket`, `key_prefix`, `access_key_ref`, `secret_key_ref`,
`storage_class`, `sse` (optional), `multipart_threshold_bytes`,
`multipart_chunk_bytes`.
- Credential references resolve from env vars or mounted files; never
from request bodies.
- Documented Ceph RGW configuration example checked in under
`docs/OPERATOR.md`.
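One way to resolve `access_key_ref` and `secret_key_ref` is a small scheme-prefixed reference. The `env:` and `file:` prefixes and the function name `resolve_secret_ref` are assumptions made for this sketch; the workplan only states that references resolve from env vars or mounted files.

```python
import os
from pathlib import Path


def resolve_secret_ref(ref: str) -> str:
    """Resolve `env:NAME` or `file:/path` to the secret value."""
    scheme, sep, target = ref.partition(":")
    if not sep or not target:
        raise ValueError(f"malformed secret reference: {ref!r}")
    if scheme == "env":
        try:
            return os.environ[target]
        except KeyError:
            raise ValueError(f"environment variable {target} is not set") from None
    if scheme == "file":
        # Mounted secret files often end with a trailing newline.
        return Path(target).read_text(encoding="utf-8").strip()
    raise ValueError(f"unsupported secret scheme: {scheme!r}")
```

Because the configuration stores only the reference, the secret value itself never appears in config files or request bodies.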
## D4.2 - S3 Backend Implementation
```task
id: ARTIFACT-STORE-WP-0004-T002
status: todo
priority: high
```
Acceptance:
- `storage.backends.s3.S3Backend` implements the SPI using `aioboto3`
or `aiobotocore` (decision recorded in the workplan; whichever is
better-maintained at implementation time).
- Object key layout
`<key_prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- `put` uses multipart for objects above the configured threshold.
- `get` supports `Range`.
- `head`, `delete`, `health` implemented.
- `delete` is idempotent (delete-of-missing returns success).
## D4.3 - Backend Selection And Routing
```task
id: ARTIFACT-STORE-WP-0004-T003
status: todo
priority: medium
```
Acceptance:
- A registry can have multiple backends configured; package creation
records which backend a file is stored in.
- Per-package backend selection rule: configurable function of
`retention_class` + producer; default routes everything to a single
backend.
- `storage_locations.backend_id` reflects the actual storage.
## D4.4 - Test Strategy: MinIO In CI, RGW As Documented Manual Smoke
```task
id: ARTIFACT-STORE-WP-0004-T004
status: todo
priority: high
```
Acceptance:
- Integration tests run against MinIO via `testcontainers-python`
(or a docker-compose fixture if testcontainers fights the WSL2
environment).
- A documented manual procedure tests against a real Ceph RGW
endpoint; results recorded in `docs/OPERATOR.md`.
- No CI dependency on a live Ceph or AWS account.
## D4.5 - Verification Pass
```task
id: ARTIFACT-STORE-WP-0004-T005
status: todo
priority: medium
```
Acceptance:
- `artifactstore storage verify --backend s3` re-reads every object in
the backend, recomputes its primary digest, and emits
`v1.storage.location_verified` events.
- Mismatches are reported as `failed` locations and surfaced via the
health endpoint.
## Success criteria
- The same package ingestion flow that worked against `local` in
WP-0001 works unchanged against `s3`.
- Switching backend by config — without code changes in the registry
or API layers — is the smoke test.

@@ -0,0 +1,146 @@
---
id: ARTIFACT-STORE-WP-0005
type: workplan
title: "Guide-Board Pilot Ingestion"
repo: artifact-store
domain: stack
status: planned
owner: codex
topic_slug: stack
planning_priority: high
planning_order: 5
created: "2026-05-15"
updated: "2026-05-15"
---
# ARTIFACT-STORE-WP-0005: Guide-Board Pilot Ingestion
## Purpose
Wire the first real producer end-to-end. A guide-board CMIS
assessment run directory is registered as one artifact package, its
files are stored through a configured backend, retention is applied,
and Statehub records a stable package id and summary without storing
bytes itself. This is the pilot success criterion in INTENT.md.
## Constraints
- WP-0001 through WP-0003 must be done; WP-0004 is required only for
  the production target (see Prerequisites).
- `docs/ARCHITECTURE-BLUEPRINT.md` guide-board manifest fields.
- No guide-board-specific code lives in `artifactstore.registry`;
pilot-specific glue lives in `artifactstore.pilots.guide_board` or
in a separate small package.
## Prerequisites
- WP-0001, WP-0002, WP-0003 done. WP-0004 only required for the
production target; local FS is sufficient for the pilot test.
## D5.1 - Pilot Metadata Schema Registration
```task
id: ARTIFACT-STORE-WP-0005-T001
status: todo
priority: high
state_hub_task_id: "eb822821-353c-4cd2-95bf-acb2f084b7ea"
```
Acceptance:
- A JSON Schema for `guide-board.run.v1` package metadata is checked
in under `schemas/guide-board.run.v1.json`.
- A bootstrap script registers it via `POST /metadata-schemas`
(an endpoint added in this workplan).
- Required keys: `run_id`, `target_profile_ref`,
`assessment_profile_ref`, `result_status`, `source_commits`
(object of slug → SHA), `report_paths`, `evidence_counts`,
`finding_counts`.
## D5.2 - Pilot Ingest Helper (CLI + Library Function)
```task
id: ARTIFACT-STORE-WP-0005-T002
status: todo
priority: high
```
Acceptance:
- `artifactstore guide-board ingest <run-dir>` walks a guide-board
run directory, builds the package metadata from `run.json` and
`retention-summary.json`, uploads every file declared in the
assessment package manifest (and the manifest itself), and
finalises the package.
- Library entry point `pilots.guide_board.ingest_run(path, ...)`
exposes the same behaviour for embedding.
- Output: the package id (UUID) and the package manifest digest
(`blake3:<hex>`).
## D5.3 - Fixture-Based Test
```task
id: ARTIFACT-STORE-WP-0005-T003
status: todo
priority: high
```
Acceptance:
- A trimmed-down guide-board run fixture (under 1 MB total) lives in
`tests/fixtures/guide-board/` with realistic file shapes:
`run.json`, `retention-summary.json`,
`reports/assessment-package.json`, `reports/report.md`, one
scorecard, one log-review summary, and a couple of raw artifact
files.
- The test runs the CLI / library helper end-to-end against an
in-memory SQLite + tempdir local backend, then verifies:
1. package id returned,
2. manifest digest stable across two runs of the same fixture,
3. every file downloadable with correct bytes,
4. retention class applied as configured.
## D5.4 - Statehub Linkage Recipe
```task
id: ARTIFACT-STORE-WP-0005-T004
status: todo
priority: medium
```
Acceptance:
- `docs/OPERATOR.md` (or a new `docs/pilots/guide-board.md`)
documents the exact `POST /progress/` or `record_decision` call
shape Statehub clients should use to link a guide-board run to
its artifact-store package id and manifest digest.
- A reference Statehub client snippet is checked in, parameterised
by env vars.
## D5.5 - Operator Smoke Procedure For The Real Producer
```task
id: ARTIFACT-STORE-WP-0005-T005
status: todo
priority: medium
```
Acceptance:
- A documented procedure ingests a real (non-fixture) guide-board run
produced from `~/guide-board` / `~/open-cmis-tck`.
- Procedure includes: starting `make dev`, registering the schema,
running the ingest CLI, verifying the manifest, and
recording the package id in Statehub.
- Procedure runs end-to-end on a developer workstation in under 5
  minutes.
## Success criteria
- A real guide-board CMIS run is ingested with one CLI invocation.
- The package manifest lists every stored file with both digests and
the canonical CBOR digest of the manifest itself.
- Statehub records the package id and summary; no artifact bytes
live in Statehub.
- Retention can be extended on the package without touching bytes.
- The pilot path validates the storage adapter swap: the same
command works against `local` and against `s3` (if WP-0004 done).