From 747afc27a6cd17292292fef07514bd8f3b188863 Mon Sep 17 00:00:00 2001
From: tegwick
Date: Fri, 15 May 2026 21:16:17 +0200
Subject: [PATCH] docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans

Aligns the v1 architecture with the longer-horizon platform thesis so we
can start implementation without the schema-level inconsistencies the
prior review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage,
append-only event log as source of truth, canonical CBOR manifests,
control/data-plane contract, v1 tech stack (Python 3.12 / uv / FastAPI /
SQLAlchemy Core + asyncpg / Alembic / cbor2 / blake3 / ruff / mypy /
pytest / typer), OCI compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped)
module layout, materialised-view data model over the event log,
upload-session and event-stream endpoints pinned, retrieval tiering
promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases.

WP-0001 rewritten as the Foundation plan (scaffold + kernels + local FS +
minimal app). WP-0002..0005 created carrying the existing
state_hub_task_ids forward semantically: ingestion API (T004), retention
lifecycle (T005), S3-compatible backend (T006), guide-board pilot (T007).
T001/T002/T003/T008 remain in WP-0001 with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.
Co-Authored-By: Claude Opus 4.7 --- AGENTS.md | 18 +- README.md | 58 +- docs/ARCHITECTURE-BLUEPRINT.md | 564 ++++++++++-------- docs/ROADMAP.md | 93 +++ docs/adr/0001-content-addressed-storage.md | 80 +++ docs/adr/0002-event-log-source-of-truth.md | 76 +++ docs/adr/0003-manifest-canonical-cbor.md | 78 +++ .../0004-control-plane-data-plane-contract.md | 79 +++ docs/adr/0005-v1-tech-stack.md | 117 ++++ docs/adr/0006-oci-compatibility-reachable.md | 69 +++ docs/adr/README.md | 32 + ...ARTIFACT-STORE-WP-0001-service-baseline.md | 342 +++++++---- .../ARTIFACT-STORE-WP-0002-ingestion-api.md | 150 +++++ ...IFACT-STORE-WP-0003-retention-lifecycle.md | 132 ++++ ...ACT-STORE-WP-0004-s3-compatible-backend.md | 131 ++++ ...RTIFACT-STORE-WP-0005-guide-board-pilot.md | 146 +++++ 16 files changed, 1761 insertions(+), 404 deletions(-) create mode 100644 docs/ROADMAP.md create mode 100644 docs/adr/0001-content-addressed-storage.md create mode 100644 docs/adr/0002-event-log-source-of-truth.md create mode 100644 docs/adr/0003-manifest-canonical-cbor.md create mode 100644 docs/adr/0004-control-plane-data-plane-contract.md create mode 100644 docs/adr/0005-v1-tech-stack.md create mode 100644 docs/adr/0006-oci-compatibility-reachable.md create mode 100644 docs/adr/README.md create mode 100644 workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md create mode 100644 workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md create mode 100644 workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md create mode 100644 workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md diff --git a/AGENTS.md b/AGENTS.md index d11aa0e..77a277d 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -162,13 +162,21 @@ To create a new workplan: ## Current Repo Shape -This repository is in service-baseline planning. The current source of truth is: +This repository is in service-baseline planning. 
The current sources of truth are: -- `INTENT.md` for purpose, product thesis, scope, and service boundary -- `docs/ARCHITECTURE-BLUEPRINT.md` for the draft architecture -- `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` for implementation tasks +- `INTENT.md` — purpose, product thesis, scope, service boundary. +- `SCOPE.md` — lightweight orientation. +- `docs/ARCHITECTURE-BLUEPRINT.md` — architecture v2: modules, data model, API shape. +- `docs/PLATFORM-AMBITION.md` — longer-horizon thesis and the v1 schema commitments (A1–A9). +- `docs/adr/` — architecture decision records ADR-0001 … ADR-0006 (content-addressed storage, event log as source of truth, canonical CBOR manifests, control/data plane contract, v1 tech stack, OCI reachability). +- `docs/ROADMAP.md` — workplan sequencing across phases. +- `docs/ASSEMBLY-EXPERIMENT.md` — opt-in research line on hand-tuned asm for hot kernels. +- `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` — Foundation workplan; first to start. +- `workplans/ARTIFACT-STORE-WP-{0002..0005}-*.md` — planned next workplans. -No runnable service scaffold exists yet. Add install, dev-server, and test +No runnable service scaffold exists yet. The pinned tech stack is in ADR-0005 +(Python 3.12, uv, FastAPI, SQLAlchemy Core + asyncpg/aiosqlite, Alembic, +cbor2, blake3, ruff, mypy, pytest, typer). Add install, dev-server, and test commands here when `ARTIFACT-STORE-WP-0001-T001` lands. ## Repo Boundary diff --git a/README.md b/README.md index dd304bd..969ab16 100644 --- a/README.md +++ b/README.md @@ -1,17 +1,51 @@ # artifact-store -Generic artifact registry and storage gateway for generated outputs, evidence -packages, reports, logs, and release artifacts. +Generic artifact registry and storage gateway for generated outputs, +evidence packages, reports, logs, snapshots, exports, and release +artifacts. -The registry owns artifact identity, metadata, provenance, retention policy, and -retrieval records. 
Actual bytes are delegated to configured storage backends such -as a local filesystem, S3-compatible object storage, or Ceph RGW. +The registry owns artifact identity, metadata, provenance, retention +policy, and retrieval records. Bytes are delegated to configured +storage backends (local filesystem in v1, S3-compatible / Ceph RGW +next). -Start here: +The shape is library-first (`artifactstore` Python package); the HTTP +server and the CLI are thin consumers. Content is addressed by digest; +state is authoritative in an append-only event log; materialised views +are rebuildable. -- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary -- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — draft architecture -- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon thesis and the schema commitments v1 preserves -- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md) — SWOT and optimisation review -- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in research line on hand-tuned assembly for hot kernels -- [workplans/ARTIFACT-STORE-WP-0001-service-baseline.md](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md) — first implementation workplan +## Status + +Concept / service-baseline planning. No runnable scaffold yet — +`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` is the next step. + +## Start here + +- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary. +- [SCOPE.md](SCOPE.md) — lightweight orientation. +- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — the + v2 architecture: modules, data model, API shape. +- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon + thesis, ffmpeg / VLC reference points, the schema commitments v1 + preserves. +- [docs/ROADMAP.md](docs/ROADMAP.md) — workplan sequencing across + phases. +- [docs/adr/](docs/adr/) — architecture decision records (ADR-0001 … + ADR-0006). 
+- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in + research line on hand-tuned assembly for hot kernels. +- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md) + — the SWOT review that triggered this cleanup. + +## Active workplans + +- [WP-0001 — Foundation: scaffold, core kernels, local FS backend](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md) +- [WP-0002 — Ingestion API and manifest surface](workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md) (planned) +- [WP-0003 — Retention lifecycle](workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md) (planned) +- [WP-0004 — S3-compatible backend](workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md) (planned) +- [WP-0005 — Guide-board pilot ingestion](workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md) (planned) + +## Agent operating notes + +See [AGENTS.md](AGENTS.md) for the StateHub-integrated session +protocol, workplan conventions, and progress-logging contract. diff --git a/docs/ARCHITECTURE-BLUEPRINT.md b/docs/ARCHITECTURE-BLUEPRINT.md index d374d88..c51ad54 100644 --- a/docs/ARCHITECTURE-BLUEPRINT.md +++ b/docs/ARCHITECTURE-BLUEPRINT.md @@ -1,330 +1,378 @@ -# Artifact Store Architecture Blueprint +# Architecture Blueprint -Status: draft -Created: 2026-05-15 +Status: accepted (v2 — supersedes 2026-05-15 draft) +Updated: 2026-05-15 -## Purpose +This document operationalises `INTENT.md`, the `docs/PLATFORM-AMBITION.md` +thesis, and the decisions recorded in `docs/adr/`. Where a tension exists +between this blueprint and an ADR, the ADR wins; raise an issue or +supersede the ADR. -`artifact-store` provides a generic registry and storage gateway for durable -generated artifacts. Producers register packages and files with metadata; -storage adapters persist the bytes; retention policy decides how long artifacts -remain eligible for retrieval. 
+## Architecture in one paragraph -The design keeps artifact identity and lifecycle separate from storage -implementation. This allows the first version to run against local filesystem -storage while the production path can use S3-compatible object storage such as -Ceph RGW. +`artifact-store` is a **library-first** artifact registry and storage +gateway. A small core library (`artifactstore`) implements identity, +manifests, retention, the storage adapter SPI, the data plane SPI, and +the registry orchestrator. The HTTP server and the CLI are thin +consumers of that library. Bytes are addressed by content +(`blake3:`) and stored through a pluggable adapter SPI. State is +authoritative in an append-only event log; queryable tables are +materialised views. -## Architecture Summary +## Design lineage + +The shape is deliberately borrowed from `ffmpeg` and `VLC`: a tight +core of well-named modules with stable contracts, runtime-pluggable +backends, a thin orchestration binary, and an explicit hot-path +boundary that can be rewritten in faster code without changing the +consumer API. See `docs/PLATFORM-AMBITION.md` for the reference table. 
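The one-paragraph summary above says state is authoritative in an append-only event log and that queryable tables are materialised views. A minimal sketch of that idea, assuming a hypothetical in-memory `Event` record and a toy `replay` function (neither is the blueprint's actual `events.replay` API, and the payload field `name` is an illustrative assumption):

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    """Hypothetical stand-in for one row of the event log."""
    sequence: int
    event_type: str
    subject_id: str
    payload: dict = field(default_factory=dict)


def replay(events: list[Event]) -> dict[str, dict]:
    """Fold events, in sequence order, into a package view keyed by id."""
    view: dict[str, dict] = {}
    for event in sorted(events, key=lambda e: e.sequence):
        if event.event_type == "v1.package.created":
            view[event.subject_id] = {
                "name": event.payload["name"],
                "status": "created",
                "last_event_sequence": event.sequence,
            }
        elif event.event_type == "v1.package.finalized":
            row = view[event.subject_id]
            row["status"] = "finalized"
            row["last_event_sequence"] = event.sequence
        # Event types this toy view does not know about are skipped.
    return view


log = [
    Event(1, "v1.package.created", "pkg-1", {"name": "nightly-run"}),
    Event(2, "v1.package.finalized", "pkg-1"),
]
packages = replay(log)
```

Because the view is a pure function of the log, dropping it and replaying yields the same rows, which is the property that makes the tables rebuildable.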
+ +## Top-level shape ```text -producer - -> Artifact Registry API - -> metadata database - -> retention policy engine - -> audit event log - -> storage adapter interface - -> local filesystem backend - -> S3-compatible backend - -> Ceph RGW deployment - -> future cloud/blob/archive backends + producers / operators / agents + | + v + +------------------------+ + | HTTP API | CLI | <-- thin consumers + +------------------------+ + | + v + +------------------------+ + | registry orchestrator | + +------------------------+ + | | | + v v v + +----------+ +---------+ +---------+ + | identity | | events | |retention| + |/manifest | | (log + | | policy | + | | | views) | | engine | + +----------+ +---------+ +---------+ + | + v + +-----------------------+ + | data plane SPI | <-- ADR-0004 contract + +-----------------------+ + | + v + +-----------------------+ + | storage adapter SPI | + +-----------------------+ + | | | + v v v + +-----+ +------+ +-------+ + |local| | S3 | | Ceph | ... future backends + | FS | | RGW | | RGW | + +-----+ +------+ +-------+ ``` -The registry is the authority for artifact metadata and lifecycle. Backends are -responsible for byte storage and retrieval. +## Core modules -## Design Principles +Mapped one-to-one to ADR-0005's project layout. Each module has a +stable public surface; internals are free to evolve. -- Backend-neutral registry: no producer should know whether bytes live in Ceph, - local disk, or a cloud bucket. -- Content-addressable confidence: every stored file has a digest and size. -- Retention by default: every package receives an expiry decision at ingestion. -- Extensions are explicit: retention extensions and holds are audit events, not - silent metadata edits. -- Packages remain portable: a manifest should be enough to understand a package - without calling the producer. -- Statehub links, it does not store bytes: Statehub records artifact IDs and - outcomes; artifact-store owns file persistence. 
-- Deletion is deliberate: expiry makes artifacts eligible for deletion; deletion - jobs must be auditable and reversible only when the backend still has data. +### `identity` -## Components +- `Digest(algorithm, hex)` — value object. +- `ContentAddress` — `:` (ADR-0001). +- `digest_stream(reader) -> {primary, sha256}` — single-pass dual digest. +- Algorithm registry: `blake3` (default primary), `sha256` (always + computed). -### Registry API +### `manifest` -HTTP API for producers and operators. +- `Manifest` — versioned dataclass: package metadata + ordered file list + + retention summary + provenance + storage receipts. +- `manifest.codec.encode(manifest) -> bytes` — canonical CBOR + (ADR-0003). +- `manifest.codec.decode(bytes) -> Manifest`. +- `manifest.projection.jcs(manifest) -> bytes` — canonical-JSON + projection for display and signing-tool interop. +- Round-trip invariant: `decode(encode(m)) == m` and + `encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`. -Initial responsibilities: +### `events` -- create artifact packages, -- upload or ingest files, -- finalize packages, -- retrieve package metadata, -- list/search packages by subject and producer metadata, -- create retention extensions and holds, -- expose download metadata or redirect/download endpoints, -- expose health and backend status. +- `events.write(transaction, event)` — appends one row with monotonic + sequence (ADR-0002). +- `events.tail(since_sequence) -> AsyncIterator[Event]` — long-poll. +- `events.replay(into=ViewWriter)` — rebuild materialised views. +- Event types (v1): + `v1.package.created`, `v1.file.ingested`, `v1.package.finalized`, + `v1.retention.default_applied`, `v1.retention.extended`, + `v1.retention.hold_applied`, `v1.retention.hold_released`, + `v1.retention.deletion_eligible`, `v1.storage.location_recorded`, + `v1.storage.location_verified`, `v1.audit.access`, + `v1.system.note`. -### Metadata Store +### `retention` -Persistent database for registry state. 
+- `retention.classes` — `transient`, `raw-evidence`, `summary-evidence`, + `release-evidence`, `permanent-record`. Defined as data, not code. +- `retention.policy.apply(package, class) -> RetentionDecision` — + computes `expires_at` and the deletion eligibility rule. +- `retention.extend(package, until, reason, actor)` — emits an event; + the materialised view updates on commit. +- `retention.hold(package, reason, actor)` / + `retention.release_hold(hold_id, actor)`. -Initial implementation can use SQLite for local development and PostgreSQL for -shared service deployments if that matches the surrounding service stack. +### `audit` -Core tables: +- A view over `events` filtered to access and lifecycle events. No + separate write path; auditing happens by event emission elsewhere. -- `artifact_packages` -- `artifact_files` -- `storage_locations` -- `retention_rules` -- `retention_events` -- `audit_events` +### `storage` (adapter SPI) -### Storage Adapter Interface +```python +class StorageBackend(Protocol): + backend_id: str + async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ... + async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ... + async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ... + async def delete(self, content_address: ContentAddress) -> DeletionResult: ... + async def health(self) -> BackendStatus: ... +``` -Small backend contract used by the API service. +- Backend registry: backends register at import time; selection is + per-package by configuration. +- v1 ships `local` (filesystem); `s3` ships in WP-0004. 
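To make the `StorageBackend` contract above concrete, here is a toy in-memory backend that follows the same method shapes for `put` and `get`. It is only a sketch under stated assumptions: `ContentAddress` and `StorageReceipt` are hypothetical dataclasses, SHA-256 from the standard library stands in for BLAKE3, and `byte_range` is assumed to be a half-open `(start, end)` pair, which the blueprint does not specify.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass(frozen=True)
class ContentAddress:
    """Hypothetical value object: algorithm name plus hex digest."""
    algorithm: str
    hex: str

    def __str__(self) -> str:
        return f"{self.algorithm}:{self.hex}"


@dataclass(frozen=True)
class StorageReceipt:
    """Hypothetical receipt returned after a successful put."""
    backend_id: str
    content_address: ContentAddress
    size_bytes: int


class InMemoryBackend:
    """Toy backend: a dict stands in for a filesystem or bucket."""

    backend_id = "memory"

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def put(self, content_address, stream, size_hint=None) -> StorageReceipt:
        data = b"".join([chunk async for chunk in stream])
        # Assumption: the backend re-checks the digest before accepting bytes.
        if hashlib.sha256(data).hexdigest() != content_address.hex:
            raise ValueError("bytes do not match content address")
        self._objects[str(content_address)] = data
        return StorageReceipt(self.backend_id, content_address, len(data))

    async def get(self, content_address, byte_range=None) -> AsyncIterator[bytes]:
        data = self._objects[str(content_address)]
        if byte_range is not None:
            start, end = byte_range  # assumed half-open [start, end)
            data = data[start:end]

        async def _chunks() -> AsyncIterator[bytes]:
            yield data

        return _chunks()
```

A caller would consume it as `async for chunk in await backend.get(address)`, matching the `async def get(...) -> AsyncIterator[bytes]` signature in the protocol.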
-Required operations: +### `dataplane` (SPI per ADR-0004) -- `put(object_key, stream, metadata) -> storage_location` -- `get(object_key) -> stream or signed_url` -- `head(object_key) -> object_metadata` -- `delete(object_key) -> deletion_result` -- `health() -> backend_status` +```python +class DataPlane(Protocol): + async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ... + async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ... + async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ... + async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ... + async def backend_health(self) -> BackendStatus: ... +``` -Initial backends: +- v1 implementation: `dataplane.inproc` — wraps a `StorageBackend`, + computes digests during streaming. +- Future implementation: `dataplane.remote` — gRPC or + framed-bincode-over-Unix-socket client to a Rust daemon. -- local filesystem backend for tests and development, -- S3-compatible backend for Ceph RGW and cloud object stores. +### `registry` -### Retention Policy Engine +The orchestrator. Combines `identity + manifest + events + retention + +dataplane` into the operations the HTTP API and CLI consume: +`create_package`, `ingest_file`, `finalize_package`, `get_manifest`, +`download_file`, `extend_retention`, `apply_hold`, `release_hold`, +`mark_deletion_eligible`, `tail_events`. Each operation is one DB +transaction that writes one or more events and updates materialised +views. -Applies default rules at ingestion and records later changes. +### `api.http` and `cli` -Initial retention classes: +Thin. Their job is to translate transport (HTTP / argv) into calls on +`registry`. No business logic. 
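The `dataplane.inproc` note above says digests are computed during streaming, and `identity.digest_stream` is described as a single-pass dual digest. One way this could look is a pass-through wrapper that updates two hashers as chunks flow by. This is a sketch, not the project's implementation: `DigestingStream` is a hypothetical name, and `hashlib.blake2b` is used purely as a standard-library stand-in for the BLAKE3 primary digest.

```python
import asyncio
import hashlib
from typing import AsyncIterator


class DigestingStream:
    """Pass chunks through unchanged while hashing them once."""

    def __init__(self, source: AsyncIterator[bytes]) -> None:
        self._source = source
        # Stand-in for blake3, which is a third-party package.
        self._primary = hashlib.blake2b(digest_size=32)
        self._sha256 = hashlib.sha256()
        self.size_bytes = 0

    async def __aiter__(self) -> AsyncIterator[bytes]:
        async for chunk in self._source:
            self._primary.update(chunk)
            self._sha256.update(chunk)
            self.size_bytes += len(chunk)
            yield chunk

    def digests(self) -> dict[str, str]:
        """Hex digests of everything consumed so far."""
        return {
            "primary": self._primary.hexdigest(),
            "sha256": self._sha256.hexdigest(),
        }


async def ingest(chunks: list[bytes]) -> tuple[bytes, dict[str, str], int]:
    async def source() -> AsyncIterator[bytes]:
        for chunk in chunks:
            yield chunk

    stream = DigestingStream(source())
    # A real data plane would hand `stream` to a StorageBackend.put call;
    # here the bytes are simply collected.
    stored = b"".join([chunk async for chunk in stream])
    return stored, stream.digests(), stream.size_bytes
```

The bytes are read once and both digests are available when the stream is exhausted, so the backend write and the integrity record come from the same pass.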
-- `transient`: short-lived scratch artifacts, -- `raw-evidence`: raw logs and run output, -- `summary-evidence`: compact reports and summaries, -- `release-evidence`: release or customer-facing evidence packages, -- `permanent-record`: manually held records with no automatic expiry. +## Data model -Each package stores: +All tables exist as **materialised views over `events`** (ADR-0002), +except `events` itself, `retention_classes` (seed data), and +`metadata_schemas` (config). -- selected retention class, -- default retention rule, -- computed `expires_at`, -- extension records, -- hold records, -- deletion eligibility state. +### `events` (source of truth) -### Audit Log +| Column | Type | Notes | +|---|---|---| +| `sequence` | `BIGSERIAL PRIMARY KEY` | monotonic, gapless | +| `created_at` | `TIMESTAMPTZ NOT NULL` | UTC, set by DB default | +| `event_type` | `TEXT NOT NULL` | versioned slug (`v1.…`) | +| `subject_kind` | `TEXT NOT NULL` | `package` / `file` / `retention` / `storage` / `system` | +| `subject_id` | `UUID NULL` | | +| `actor` | `TEXT NOT NULL` | producer or operator identity | +| `payload` | `BYTEA NOT NULL` | canonical CBOR | +| `payload_digest` | `BYTEA NOT NULL` | BLAKE3 of `payload` | -Append-only record of important events: +Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`. -- package created, -- file uploaded, -- package finalized, -- retrieval requested, -- retention extended, -- hold applied or released, -- deletion requested, -- deletion completed or failed. +### `artifact_packages` (materialised view) -The audit log does not need to be cryptographic in the first release, but the -schema should leave room for signed events or external write-once storage later. 
+| Column | Type | Notes | +|---|---|---| +| `id` | `UUID PRIMARY KEY` | | +| `name` | `TEXT NOT NULL` | | +| `producer` | `TEXT NOT NULL` | | +| `subject` | `TEXT NOT NULL` | | +| `retention_class` | `TEXT NOT NULL` | FK to `retention_classes` | +| `metadata_schema_id` | `UUID NULL` | FK to `metadata_schemas` | +| `metadata` | `JSONB NOT NULL` | validated against schema if present | +| `status` | `TEXT NOT NULL` | `created` / `uploading` / `finalized` / `deletion_eligible` / `deleted` / `failed` | +| `manifest_digest` | `BYTEA NULL` | populated on finalize | +| `created_at`, `finalized_at`, `expires_at` | `TIMESTAMPTZ` | | +| `last_event_sequence` | `BIGINT NOT NULL` | for replay bookkeeping | -## Data Model +### `artifact_files` (materialised view) -### Artifact Package +| Column | Type | Notes | +|---|---|---| +| `id` | `UUID PRIMARY KEY` | | +| `package_id` | `UUID NOT NULL` | FK | +| `relative_path` | `TEXT NOT NULL` | logical path; unique within package | +| `media_type` | `TEXT NOT NULL` | required (ADR-0006) | +| `size_bytes` | `BIGINT NOT NULL` | | +| `digest_algorithm` | `TEXT NOT NULL` | `blake3` by default (ADR-0001) | +| `digest_primary` | `BYTEA NOT NULL` | bytes of the primary digest | +| `digest_sha256` | `BYTEA NOT NULL` | always populated for interop | +| `created_at` | `TIMESTAMPTZ NOT NULL` | | -Required fields: +### `storage_locations` (materialised view) -- `id` -- `name` -- `producer` -- `subject` -- `retention_class` -- `status` -- `created_at` -- `finalized_at` -- `expires_at` -- `metadata` +| Column | Type | Notes | +|---|---|---| +| `id` | `UUID PRIMARY KEY` | | +| `artifact_file_id` | `UUID NOT NULL` | FK | +| `backend_id` | `TEXT NOT NULL` | | +| `content_address` | `TEXT NOT NULL` | `:` | +| `object_key` | `TEXT NOT NULL` | backend-specific, usually derived from `content_address` | +| `storage_class` | `TEXT NULL` | backend-specific label | +| `retrieval_tier` | `TEXT NOT NULL DEFAULT 'hot'` | `hot` / `warm` / `cold` / `archive` | +| 
`restore_status` | `TEXT NULL` | `available` / `restore_requested` / `restoring` / `restored` / `expired` | +| `status` | `TEXT NOT NULL` | `recorded` / `verified` / `failed` / `deleted` | +| `created_at`, `last_verified_at` | `TIMESTAMPTZ` | | -Recommended metadata keys: +### `retention_state` (materialised view) -- `repo_slug` -- `run_id` -- `assessment_id` -- `target_profile_ref` -- `assessment_profile_ref` -- `source_commits` -- `tool_versions` -- `environment` +| Column | Type | Notes | +|---|---|---| +| `package_id` | `UUID PRIMARY KEY` | | +| `current_expires_at` | `TIMESTAMPTZ NULL` | NULL = no expiry (permanent or held) | +| `effective_class` | `TEXT NOT NULL` | | +| `active_hold_id` | `UUID NULL` | | +| `eligible_for_deletion` | `BOOLEAN NOT NULL` | | -### Artifact File +### `retention_classes` (seed data, not derived) -Required fields: +| Column | Type | Notes | +|---|---|---| +| `class_id` | `TEXT PRIMARY KEY` | `transient` / `raw-evidence` / `summary-evidence` / `release-evidence` / `permanent-record` | +| `default_duration` | `INTERVAL NULL` | NULL for `permanent-record` | +| `deletion_strategy` | `TEXT NOT NULL` | `mark_eligible` / `auto_delete_after_grace` (v1 only uses the former) | -- `id` -- `package_id` -- `relative_path` -- `media_type` -- `size_bytes` -- `sha256` -- `created_at` +### `metadata_schemas` (config table) -### Storage Location +| Column | Type | Notes | +|---|---|---| +| `id` | `UUID PRIMARY KEY` | | +| `slug` | `TEXT NOT NULL UNIQUE` | e.g. 
`guide-board.run.v1` | +| `json_schema` | `JSONB NOT NULL` | | +| `created_at` | `TIMESTAMPTZ NOT NULL` | | -Required fields: +## API shape -- `id` -- `artifact_file_id` -- `backend_id` -- `object_key` -- `storage_class` -- `status` -- `created_at` -- `last_verified_at` - -### Retention Event - -Required fields: - -- `id` -- `package_id` -- `event_type` -- `reason` -- `created_by` -- `created_at` -- `previous_expires_at` -- `new_expires_at` - -Event types: - -- `default_rule_applied` -- `extended` -- `hold_applied` -- `hold_released` -- `deletion_eligible` -- `deleted` - -## API Shape - -Initial endpoints: +### Native v1 surface ```text -GET /health -GET /backends -POST /packages -GET /packages -GET /packages/{package_id} -POST /packages/{package_id}/files -POST /packages/{package_id}/finalize -GET /packages/{package_id}/manifest -GET /files/{file_id}/download -POST /packages/{package_id}/retention/extensions -POST /packages/{package_id}/retention/holds -POST /packages/{package_id}/retention/holds/{hold_id}/release +GET /health +GET /backends +GET /retention-classes + +POST /packages # create +GET /packages # list, query by metadata +GET /packages/{package_id} # metadata +POST /packages/{package_id}/files # single-shot file upload +POST /packages/{package_id}/finalize # produce manifest +GET /packages/{package_id}/manifest # canonical CBOR (Accept: application/cbor) +GET /packages/{package_id}/manifest.json # JCS projection (Accept: application/json) + +GET /files/{file_id} # metadata +GET /files/{file_id}/download # bytes + +POST /uploads # open an upload session (resource shape pinned now) +PATCH /uploads/{upload_id} # range body +POST /uploads/{upload_id}/complete # promote to /packages/.../files + +POST /packages/{package_id}/retention/extensions +POST /packages/{package_id}/retention/holds +POST /packages/{package_id}/retention/holds/{hold_id}/release + +GET /events?since={sequence} # long-poll registry change feed ``` -The first ingestion path can accept 
multipart file uploads. A later trusted-local -operator endpoint may ingest from server-local paths, but it should be disabled -by default because path ingestion changes the security boundary. +The `POST /uploads/...` resource shape is committed now even if v1 +implements it as single-shot internally; ADR per `PLATFORM-AMBITION` A6. -## Package Manifest +### Deferred / not v1 -Every finalized package should expose a JSON manifest containing: +- `/v2/…` OCI Distribution endpoints (ADR-0006). +- gRPC API. +- Streaming CDC topic (NATS / Kafka). +- Multi-tenant namespacing in URLs. -- package metadata, -- retention summary, -- file list, -- file digests and sizes, -- storage backend references, -- source metadata, -- created/finalized timestamps. +## Package manifest content (v1) -For guide-board runs, the manifest should preserve links to: +A finalised manifest carries: -- `run.json` -- `retention-summary.json` -- `reports/assessment-package.json` -- `reports/report.md` -- extension-generated scorecards or log reviews, -- raw artifact files captured by the assessment package manifest. +- `manifest_version: 1` +- `package`: id, name, producer, subject, retention class, created_at, + finalized_at, expires_at, metadata, metadata_schema_id (nullable). +- `files`: ordered list of `{id, relative_path, media_type, size_bytes, + digest_algorithm, digest_primary_hex, digest_sha256_hex}`. +- `storage_receipts`: ordered list of `{file_id, backend_id, + content_address, retrieval_tier, status}` per stored copy. +- `retention_summary`: current class, expires_at, holds, last + retention event. +- `provenance`: `{source_commits, tool_versions, environment, + ingest_actor, ingest_timestamps}`. Schema-driven; freeform under a + registered schema or empty if none. -## Guide-Board Pilot Flow +The manifest digest (`blake3:`) is the package's canonical +external identifier. 
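The manifest digest only works as a stable external identifier if the encoding is deterministic: the same logical manifest must always produce the same bytes. The sketch below illustrates that property with standard-library tools. It is an approximation under stated assumptions: sorted-key compact JSON stands in for the canonical CBOR codec of ADR-0003 (it is not a full RFC 8785 JCS implementation), SHA-256 stands in for BLAKE3, and the manifest fields shown are a trimmed, illustrative subset.

```python
import hashlib
import json


def canonical_bytes(manifest: dict) -> bytes:
    """Deterministic encoding: sorted keys, no insignificant whitespace."""
    return json.dumps(
        manifest,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")


def manifest_digest(manifest: dict) -> str:
    """Digest of the canonical bytes, rendered as `algorithm:hex`."""
    return "sha256:" + hashlib.sha256(canonical_bytes(manifest)).hexdigest()


# Two dicts with the same content but different key insertion order.
manifest_a = {
    "manifest_version": 1,
    "package": {"name": "nightly-run", "producer": "guide-board"},
    "files": [{"relative_path": "reports/report.md", "size_bytes": 1024}],
}
manifest_b = {
    "files": [{"size_bytes": 1024, "relative_path": "reports/report.md"}],
    "package": {"producer": "guide-board", "name": "nightly-run"},
    "manifest_version": 1,
}
same_identity = manifest_digest(manifest_a) == manifest_digest(manifest_b)
```

List order is still significant in this sketch, which matches the blueprint's description of `files` and `storage_receipts` as ordered lists.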
-```text -guide-board run directory - -> open-cmis-tck scorecard/log review - -> artifact-store package create - -> upload run files - -> finalize manifest - -> Statehub record links package id and summary -``` +## Storage backends -The artifact package should carry: +### Local filesystem (v1) -- run id, -- target profile reference, -- assessment profile reference, -- result status, -- source commits for guide-board, open-cmis-tck, and the assessed repository, -- important report paths, -- retention class `raw-evidence` or `release-evidence`. +- Root: configured directory. +- Object key layout: `////`. +- Atomic write via `fsync(tmpfile) + rename`. No partial states visible. +- Path traversal prevented at the SPI boundary; the local backend + rejects any key that does not match the expected layout. -## Ceph And S3-Compatible Storage +### S3-compatible / Ceph RGW (WP-0004) -Ceph should be introduced through the S3-compatible adapter, not as a special -case in producer logic. +- Endpoint, bucket, region, access key ref, secret key ref, key + prefix, storage class label, optional SSE config. +- Object key: `////`. +- Multipart upload for objects above a configurable threshold. -Configuration should support: +## Security boundary (v1) -- endpoint URL, -- bucket, -- region, -- access key reference, -- secret key reference, -- optional server-side encryption settings, -- object key prefix, -- storage class label. +- Internal service. No anonymous public access. +- Authenticated producer / operator API. v1 ships shared-secret bearer + tokens; OIDC integration is its own workplan. +- No secret values in artifact metadata. +- Upload paths are logical; never trusted filesystem paths. The + `/uploads/...` path-ingest endpoint is *not* offered in v1. +- Download authorisation is checked at the registry layer, never at + the backend. -The service should never require credentials in producer request bodies. 
Use -environment variables, mounted secret files, or a local secret provider. +## Resolved open questions -## Future Retrieval Tiers +- **Deduplication scope.** Global by content address (ADR-0001). + Reference-counted deletion via a GC pass (WP-0006, TBD). +- **Deletion ordering.** Mark records `deletion_eligible` first via an + event. Byte deletion is a separate, audited operation that emits a + second event. Reverse order is forbidden. +- **Metadata schemas.** Open JSON with optional producer-registered + JSON Schema; validation at ingest (ADR-0005, `metadata_schemas`). +- **Statehub integration scope.** Statehub keeps package IDs and + summary; never bytes. The `/events` long-poll is the integration + point. -The initial API can treat all stored files as immediately retrievable. Later, -storage locations can include: +## Outstanding open questions (not blocking v1) -- `retrieval_tier`: hot, warm, cold, archive, -- `restore_status`: available, restore_requested, restoring, restored, expired, -- `restore_requested_at`, -- `restore_expires_at`. +- Identity provider for shared deployments. +- Default retention durations per class (operator-configurable; needs + one round of stakeholder input). +- WASM plugin host design (deferred to its own workplan; see + `PLATFORM-AMBITION`). +- Federation / mirroring protocol (post-OCI-endpoint workplan). -The registry API should be able to return "not immediately available" without -changing artifact identity. +## Roadmap pointer -## Security Boundary - -Initial service assumptions: - -- internal service, not public internet exposed, -- authenticated producer/operator API before shared deployment, -- no secret values stored in artifact metadata, -- package paths are logical paths, not trusted filesystem paths, -- download authorization should be checked at the registry layer. - -Files may contain sensitive evidence. The service must treat metadata and bytes -as confidential by default. 
- -## Open Questions - -- Which identity provider should guard shared deployments? -- Should package metadata schemas be open-ended JSON or typed by producer? -- Should deduplication be package-local only or global by content hash? -- Should deletion first mark records deleted, then delete bytes, or reverse that - order with compensating events? -- How much Statehub integration belongs in this repo versus in Statehub clients? +The implementation sequence is in `docs/ROADMAP.md`. The first +workplan is `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md`. diff --git a/docs/ROADMAP.md b/docs/ROADMAP.md new file mode 100644 index 0000000..6d8e292 --- /dev/null +++ b/docs/ROADMAP.md @@ -0,0 +1,93 @@ +# Roadmap + +Status: living document +Updated: 2026-05-15 + +The roadmap sequences `artifact-store` from "no code" to a credible +production v1, and then to the longer-horizon platform shape recorded in +`docs/PLATFORM-AMBITION.md`. Each row is a self-contained workplan with +its own acceptance criteria; nothing here is a binding milestone. + +The sequencing principle is **library-first** (ffmpeg-shaped): +foundational kernels and contracts before any consumer code. The HTTP +server and CLI exist only after the core library can be exercised +end-to-end against a local filesystem backend. + +## Phase 0 — Cleanup (done 2026-05-15) + +- ADR-0001 through ADR-0006 accepted. +- Architecture blueprint rewritten to v2. +- Platform ambition and assembly experiment documented. +- Workplans re-sequenced. + +## Phase 1 — Foundation and pilot (v0.1) + +Goal: ingest a real guide-board run end-to-end, against a local +filesystem backend, with retention applied and events logged. + +| ID | Title | Carries existing task IDs | Notes | +|---|---|---|---| +| WP-0001 | Foundation: scaffold, core kernels, local FS backend | T001, T002, T003, T008 | All of the library-shaped modules; no HTTP API yet beyond `/health`. | +| WP-0002 | Ingestion API + manifest surface | T004 | The HTTP API.
Builds on WP-0001's library. | +| WP-0003 | Retention lifecycle | T005 | Retention engine, extensions, holds, deletion eligibility. | +| WP-0004 | S3-compatible backend (Ceph RGW target) | T006 | Second concrete adapter. | +| WP-0005 | Guide-board pilot ingestion | T007 | First real producer wired up. | + +Exit criteria for v0.1: WP-0001 through WP-0005 done; a guide-board +CMIS run round-trips through artifact-store with manifest, retention, +and Statehub linkage; backend swappable between local FS and an +S3-compatible store. + +## Phase 2 — Production hardening (v0.2 – v0.3) + +| ID | Title | Notes | +|---|---|---| +| WP-0006 | Garbage collection + reference counting | Required by ADR-0001 global dedup. Mark-eligible already lands in WP-0003; this workplan does the byte-deletion pass. | +| WP-0007 | Resumable / chunked upload implementation | The wire shape lands in WP-0002; this workplan makes the implementation actually streaming. | +| WP-0008 | Auth, multi-tenancy, quota | OIDC integration; tenant namespacing; per-tenant rate limit and storage quota. | +| WP-0009 | Observability: metrics, tracing, structured logs | OpenTelemetry SDK; latency / throughput SLOs published. | +| WP-0010 | Event stream out (CDC) | NATS or Kafka topic of registry events; long-poll `/events` becomes a fallback. | +| WP-0011 | Signed manifests | Sigstore / cosign integration; signature recorded alongside manifest digest. | + +Exit criteria for v0.3: a deployment is operatable by humans without +internal knowledge; SLOs are measurable; access is authenticated; +artifacts can be signed and verified. + +## Phase 3 — Platform features (v0.4 – v1.0) + +| ID | Title | Notes | +|---|---|---| +| WP-0012 | OCI artifact `/v2/` endpoint | Implements OCI Distribution Spec on top of the same storage (ADR-0006). | +| WP-0013 | Content-defined chunking + global dedup at chunk level | FastCDC; chunked storage. Builds toward `docs/ASSEMBLY-EXPERIMENT.md`. 
| +| WP-0014 | Rust data plane extraction | Move `dataplane.inproc` to `dataplane.remote` (ADR-0004). | +| WP-0015 | WASM plugin host | Extension surface for indexers, redactors, scorecard generators. | +| WP-0016 | Cold-tier adapters | Glacier / Tape / IA classes; restore flow. | +| WP-0017 | Federation and replication | Signed manifest exchange between artifact-store instances. | + +Exit criteria for v1.0: artifact-store is embeddable as a library, runs +as a single-binary CLI, runs as a server, speaks OCI, federates between +instances, and is fast enough to be a credible commercial substrate. + +## What this roadmap deliberately does NOT promise + +- Specific calendar dates. Cadence is set by sessions, not quarters. +- A UI. UIs are out-of-tree (see `docs/PLATFORM-AMBITION.md`). +- ML-specific or container-specific features. Use OCI compatibility. +- A storage backend for every cloud. Adapters are community surface. + +## How to add a workplan + +1. Pick the next free `ARTIFACT-STORE-WP-NNNN` number. +2. Create `workplans/ARTIFACT-STORE-WP-NNNN-.md` with the + frontmatter and task block format in `AGENTS.md`. +3. Cite the ADRs the workplan depends on in its `## Constraints` + section. +4. Append a row to the appropriate phase table in this file. +5. Notify the custodian operator to run + `make fix-consistency REPO=artifact-store`. + +## How to retire a workplan + +1. Set `status: done` in the frontmatter when all tasks are `done`. +2. Move the file to `workplans/archived/YYMMDD-ARTIFACT-STORE-WP-NNNN-.md`. +3. Update this roadmap to reflect the new state. 
diff --git a/docs/adr/0001-content-addressed-storage.md b/docs/adr/0001-content-addressed-storage.md new file mode 100644 index 0000000..062b3da --- /dev/null +++ b/docs/adr/0001-content-addressed-storage.md @@ -0,0 +1,80 @@ +# ADR-0001 — Content-Addressed Storage with Dual Digest + +Status: accepted +Date: 2026-05-15 +Supersedes: — +Related: ADR-0003, ADR-0006, `docs/PLATFORM-AMBITION.md` commitments A1, A2, A9 + +## Context + +The architecture blueprint as originally drafted addresses stored bytes by +logical `(package, relative_path)`. That is sufficient for v1 ingestion but +forecloses global deduplication, Merkle integrity proofs, partial +replication, federation, and OCI artifact compatibility — all of which the +platform ambition requires to remain reachable. + +Independently, the original blueprint pins SHA-256 as the only file digest. +SHA-256 with SHA-NI on modern x86 reaches ~1.5–2 GB/s/core. BLAKE3 on the +same hardware reaches 6–10+ GB/s/core, parallelises across cores, and its +construction *is* a Merkle tree — package-level integrity becomes free. +SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we +cannot drop it. + +## Decision + +1. The canonical storage key for any byte sequence is its content address + in the form `:`. Storage backends store + and retrieve by this key. `relative_path` is logical metadata recorded + in the manifest, not a storage-layer concept. +2. Every `artifact_files` row carries two digest columns: + - `digest_primary` — the native digest; default algorithm `blake3`. + - `digest_sha256` — always populated for interop, even when `blake3` + is the primary. + Both are computed in a single ingest pass (one read of the input). +3. The schema also carries a `digest_algorithm` column naming the primary + algorithm. Additional algorithms are added by new columns or a side + table, never by overloading `digest_primary`. +4. Storage backend object keys are derived from `digest_primary` only. 
+ Migrations between primary algorithms are explicit and audited; they + are not silent. + +## Consequences + +Positive: + +- Global deduplication is automatic — two identical files in two packages + share one backend object. +- Merkle integrity over a package is free with BLAKE3 (use the tree mode). +- Federation, partial mirrors, and OCI compatibility (ADR-0006) become + reachable without schema migration. +- Verification of a single file does not require fetching its package. + +Negative: + +- Two digests must be computed per ingest. Mitigated by streaming both + through one buffer; the bottleneck is I/O, not hashing. +- Reference counting: deletion of an `artifact_file` row cannot + unconditionally delete the backend object. A garbage-collector pass + reconciles references before deleting bytes. This is correct anyway + (deletion should be deliberate, per the blueprint). +- Producers requesting "store these N bytes at path P" must understand + that their P is logical. This is a documentation problem, not a + technical one. + +## Implementation notes + +- v1 ships BLAKE3 via the `blake3` PyPI wheel (Rust core, SIMD-accelerated; + no asm we maintain). +- v1 ships SHA-256 via stdlib `hashlib` (SHA-NI used when the CPython + build links against OpenSSL with SHA-NI support). +- A `Digest` value object wraps `(algorithm, hex)`; serialised forms + always include the algorithm prefix. +- A garbage-collector workplan is filed at WP-0006 (TBD); v1 does not + delete bytes automatically — it marks them eligible. + +## Status of the original blueprint pin + +The pre-cleanup blueprint's `artifact_files.sha256` column is replaced by +`digest_algorithm`, `digest_primary`, `digest_sha256`. The pre-cleanup +blueprint's implicit path-keyed storage is replaced by content-keyed +storage. These changes are absorbed into `docs/ARCHITECTURE-BLUEPRINT.md`. 
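A minimal sketch of the single-pass dual digest described in Decision item 2 and the `Digest` value object from the implementation notes. Assumptions: `blake2b` from stdlib `hashlib` stands in for BLAKE3 so the sketch needs no third-party wheel, the input is a plain synchronous iterable rather than an async stream, and the names `Digest` fields aside, `dual_digest` is a hypothetical helper, not the project's API.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Digest:
    """Immutable (algorithm, hex) pair; the string form keeps the algorithm prefix."""

    algorithm: str
    hex: str

    def __str__(self) -> str:
        return f"{self.algorithm}:{self.hex}"


def dual_digest(chunks: Iterable[bytes], primary: str = "blake2b") -> tuple[Digest, Digest]:
    """Feed every chunk to both hashers so the input is read exactly once.

    ASSUMPTION: `blake2b` is a stand-in for BLAKE3 (the ADR's default primary).
    """
    primary_hasher = hashlib.new(primary)
    sha256_hasher = hashlib.sha256()
    for chunk in chunks:
        primary_hasher.update(chunk)
        sha256_hasher.update(chunk)
    return (
        Digest(primary, primary_hasher.hexdigest()),
        Digest("sha256", sha256_hasher.hexdigest()),
    )


if __name__ == "__main__":
    primary, sha256 = dual_digest([b"hello ", b"world"])
    print(primary)  # storage key candidate, derived from the primary digest
    print(sha256)   # interop digest, always populated
```

Both hashers see the same buffer, so the extra cost is CPU only; the read path stays a single pass, which is the property the ADR relies on.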
diff --git a/docs/adr/0002-event-log-source-of-truth.md b/docs/adr/0002-event-log-source-of-truth.md new file mode 100644 index 0000000..9f8ae24 --- /dev/null +++ b/docs/adr/0002-event-log-source-of-truth.md @@ -0,0 +1,76 @@ +# ADR-0002 — Append-Only Event Log as Source of Truth + +Status: accepted +Date: 2026-05-15 +Related: `docs/PLATFORM-AMBITION.md` commitment A3 + +## Context + +The original blueprint defines `audit_events` and `retention_events` as +separate tables. Both are useful, but neither is a complete authoritative +record of how registry state was produced. Several downstream needs share +one underlying primitive: + +- audit (who did what when, with what result), +- change-data-capture feed for downstream consumers (Statehub, search), +- replication and federation between instances, +- point-in-time replay and disaster recovery, +- materialised view rebuilds when schemas evolve. + +Each can be served by an append-only log of registry events with a +monotonic sequence number. Two separate tables cannot. + +## Decision + +1. The registry persists an append-only `events` table. Every state- + changing operation writes one row in the same database transaction as + the operation. Once written, rows are immutable. +2. Each row has a strictly monotonic, gapless sequence number scoped to + the registry instance, and a UTC ingest timestamp. +3. The current `artifact_packages`, `artifact_files`, `storage_locations`, + and `retention_state` tables are materialised views over `events`. + They are rebuildable by replay. +4. Event payloads are stored as canonical CBOR (ADR-0003), keyed by + `event_type` (string slug). The `event_type` namespace is versioned + (`v1.package.created`, `v1.file.ingested`, `v1.retention.extended`, + etc.). +5. `audit_events` and `retention_events` cease to exist as standalone + tables; their semantics are subsets of `events` filtered by + `event_type`. 
+ +## Consequences + +Positive: + +- One primitive serves audit, CDC, replication, replay, and rebuild. +- A consumer can tail by `sequence > N` and never miss an event. +- Forward-compatibility: new view columns can be derived from existing + events by adding a replay path; no migration required. +- Signed event chains are reachable later by adding a signature column. + +Negative: + +- Replays cost wall-clock time on large datasets. Snapshots of + materialised views (with the highest applied sequence stamped on them) + are used to bound replay cost. +- Schema migrations on materialised views still happen; they just no + longer touch the source of truth. +- Discipline required: any write that bypasses the event log is a bug. + Enforced by code review and a runtime invariant check on the + materialised tables. + +## Implementation notes + +- `events` schema (v1): + - `sequence BIGSERIAL PRIMARY KEY` + - `created_at TIMESTAMPTZ NOT NULL DEFAULT now()` + - `event_type TEXT NOT NULL` + - `subject_kind TEXT NOT NULL` — `package` | `file` | `retention` | `storage` | `system` + - `subject_id UUID` — nullable for system-level events + - `actor TEXT NOT NULL` — producer or operator identity + - `payload BYTEA NOT NULL` — canonical CBOR + - `payload_digest BYTEA NOT NULL` — BLAKE3 of `payload` +- Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`. +- Replay tool ships in v1 as a CLI subcommand (`artifactstore replay`). +- Outbound CDC stream (NATS / Kafka) is its own workplan; v1 only exposes + long-poll over `GET /events?since=`. 
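The append, replay, and tail behaviour above can be sketched against an in-memory SQLite database. Assumptions: `INTEGER PRIMARY KEY AUTOINCREMENT` stands in for `BIGSERIAL`, JSON stands in for canonical CBOR payloads, the `packages` view and the `retention_until` payload field are hypothetical, and only two of the event types named in the Decision are projected.

```python
import json
import sqlite3


def open_log() -> sqlite3.Connection:
    """Create the append-only log plus one illustrative materialised view."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE events ("
        " sequence INTEGER PRIMARY KEY AUTOINCREMENT,"
        " event_type TEXT NOT NULL,"
        " subject_id TEXT,"
        " payload BLOB NOT NULL)"
    )
    db.execute("CREATE TABLE packages (id TEXT PRIMARY KEY, retention_until TEXT)")
    return db


def apply(db: sqlite3.Connection, event_type: str, subject_id: str, payload: dict) -> None:
    """Project one event onto the `packages` view."""
    if event_type == "v1.package.created":
        db.execute("INSERT INTO packages VALUES (?, ?)", (subject_id, payload["retention_until"]))
    elif event_type == "v1.retention.extended":
        db.execute(
            "UPDATE packages SET retention_until = ? WHERE id = ?",
            (payload["retention_until"], subject_id),
        )


def append(db: sqlite3.Connection, event_type: str, subject_id: str, payload: dict) -> int:
    """Write the event and update the view in the same transaction."""
    with db:
        cur = db.execute(
            "INSERT INTO events (event_type, subject_id, payload) VALUES (?, ?, ?)",
            (event_type, subject_id, json.dumps(payload, sort_keys=True).encode()),
        )
        apply(db, event_type, subject_id, payload)
    return cur.lastrowid


def replay(db: sqlite3.Connection) -> None:
    """Drop the view contents and rebuild them from the log in sequence order."""
    with db:
        db.execute("DELETE FROM packages")
        rows = db.execute(
            "SELECT event_type, subject_id, payload FROM events ORDER BY sequence"
        ).fetchall()
        for event_type, subject_id, raw in rows:
            apply(db, event_type, subject_id, json.loads(raw))


def tail(db: sqlite3.Connection, since: int) -> list[tuple[int, str]]:
    """Return events with sequence > since, as a consumer tailing the log would."""
    return db.execute(
        "SELECT sequence, event_type FROM events WHERE sequence > ? ORDER BY sequence",
        (since,),
    ).fetchall()
```

Because the view is only ever written through `apply`, a replay reproduces it exactly; any write path that skipped `append` would show up as a difference between the live view and the replayed one.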
diff --git a/docs/adr/0003-manifest-canonical-cbor.md b/docs/adr/0003-manifest-canonical-cbor.md new file mode 100644 index 0000000..32e4869 --- /dev/null +++ b/docs/adr/0003-manifest-canonical-cbor.md @@ -0,0 +1,78 @@ +# ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2) + +Status: accepted +Date: 2026-05-15 +Related: ADR-0001, ADR-0002, ADR-0006, `docs/PLATFORM-AMBITION.md` commitment A4 + +## Context + +Manifests describe a package's identity, contents, retention, and +provenance. They are the durable, portable, signable summary of a package. +Three downstream features depend on byte-identical manifest serialisation: + +1. Manifest digest (used as the package's content address — ADR-0001). +2. Signatures (cosign, Sigstore, in-toto, SLSA). +3. Cross-language / cross-version reproducibility (any client must be + able to verify a manifest produced by any other client). + +JSON does not guarantee byte-identical output without an explicit +canonicalisation profile. The candidates are: + +- **JCS** (JSON Canonicalization Scheme, RFC 8785) — JSON-shaped, widely + available, text-format, signs cleanly. +- **Canonical CBOR** (RFC 8949 §4.2.2) — binary, smaller, lower overhead + to canonicalise, native in cosign / Sigstore tooling, used by COSE. +- **DAG-CBOR** (IPLD profile) — canonical CBOR plus content-addressing + conventions; useful if we later integrate with IPLD/IPFS, but pulls in + ecosystem assumptions we don't yet need. + +Canonical CBOR wins on size, parser surface, and direct compatibility +with the tooling we will adopt for signing (ADR commitments A4, A9). JCS +is a reasonable alternative; we keep an emit-JCS path for human-readable +display but the signed form is CBOR. + +## Decision + +1. 
Manifests are serialised as **canonical CBOR** per RFC 8949 §4.2.2: + - definite-length encoding throughout, + - shortest-form integer encoding, + - map keys sorted bytewise lexicographically, + - no floating-point unless explicitly required (we do not require it), + - no semantic tags except those we explicitly enumerate. +2. The manifest's content address is `blake3:` of its canonical + CBOR bytes. This is the package's primary identifier in storage. +3. A canonical JSON projection (JCS) of the same manifest is available + for display, signing-tool interop, and human inspection. The + projection is deterministic: round-tripping through it must yield + byte-identical CBOR. +4. The manifest schema is itself versioned (`manifest_version: 1`). + Unknown fields are preserved on read and re-emitted on write (forward + compatibility); breaking schema changes bump the version. + +## Consequences + +Positive: + +- Manifests are signable today by any tool that consumes CBOR (cosign, + ssh-keygen `-Y sign`, COSE libraries). +- The manifest digest is stable across languages, OS, and compiler. +- Smaller on disk and on the wire than JSON. +- Replay (ADR-0002) is unambiguous because event payloads are also CBOR. + +Negative: + +- Less human-readable in raw form; the CLI must offer a `pretty` projection. +- One more dependency (a CBOR library). We pin one in ADR-0005. +- Future schema evolution requires the same canonicalisation discipline. + Enforced by a property-based test: any manifest must round-trip + CBOR → JCS → CBOR with byte equality. + +## Implementation notes + +- v1 library: `cbor2` (PyPI; pure-Python with optional C extension). + Wrapped behind `artifactstore.manifest.codec` so swapping to a faster + impl is transparent. +- JCS projection: `jcs` (PyPI) or hand-rolled — decision deferred to + WP-0001-T003. +- A `Manifest` value class enforces field order on emit, not just on + encode. This catches non-canonical producers at the API boundary. 
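To make the encoding rules in the Decision concrete, here is a small hand-rolled encoder covering definite-length items, shortest-form integers, and bytewise-sorted map keys. It is an illustrative sketch only: v1 names `cbor2` as the codec, the function names here are hypothetical, and the sketch handles just integers, strings, bytes, lists, maps, booleans, and null (floats are rejected, matching the "no floating-point" rule).

```python
import struct


def _head(major: int, n: int) -> bytes:
    """Encode a CBOR item head using the shortest possible argument form."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 2**8:
        return bytes([(major << 5) | 24, n])
    if n < 2**16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 2**32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    if n < 2**64:
        return bytes([(major << 5) | 27]) + struct.pack(">Q", n)
    raise ValueError("integer out of range for this sketch")


def encode(value) -> bytes:
    """Deterministically encode a small subset of Python values as CBOR."""
    if isinstance(value, bool):  # must precede int: bool is an int subclass
        return b"\xf5" if value else b"\xf4"
    if value is None:
        return b"\xf6"
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, (list, tuple)):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        # Sort entries by the bytes of their encoded keys, not by the Python keys.
        items = sorted(((encode(k), encode(v)) for k, v in value.items()), key=lambda kv: kv[0])
        return _head(5, len(items)) + b"".join(k + v for k, v in items)
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


if __name__ == "__main__":
    manifest = {"manifest_version": 1, "files": [{"size": 500, "path": "report.xml"}]}
    print(encode(manifest).hex())
```

Two dicts with the same entries in different insertion order encode to the same bytes, which is what makes a digest over the encoded manifest stable across producers.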
diff --git a/docs/adr/0004-control-plane-data-plane-contract.md b/docs/adr/0004-control-plane-data-plane-contract.md new file mode 100644 index 0000000..07a620d --- /dev/null +++ b/docs/adr/0004-control-plane-data-plane-contract.md @@ -0,0 +1,79 @@ +# ADR-0004 — Control Plane / Data Plane Contract + +Status: accepted +Date: 2026-05-15 +Related: ADR-0005, `docs/PLATFORM-AMBITION.md` commitment A5, +`docs/ASSEMBLY-EXPERIMENT.md` + +## Context + +The platform ambition expects a Rust (eventually asm-tuned) data plane +to handle hot ingest paths — hashing, chunking, optional compression and +encryption, storage backend I/O. The v1 service is written entirely in +Python (ADR-0005). The cost of conflating control and data planes at the +code level is that extracting the data plane later requires API churn, +test rework, and producer migrations. + +The cost of separating them now is one named module boundary and one +in-process protocol shape. That cost is essentially free if taken +before any consumer exists. + +## Decision + +1. The Python package is organised so that *every byte-handling + operation* lives behind a named contract: + - `artifactstore.dataplane.spi` — the abstract surface (typed + dataclasses, async iterator protocols). + - `artifactstore.dataplane.inproc` — the v1 implementation, running + in the same process as the control plane. +2. The control plane (`artifactstore.registry`, `artifactstore.api.http`, + `artifactstore.retention`, `artifactstore.audit`) interacts with + bytes *only* through the SPI. No HTTP handler, no DB writer, no + retention rule ever reads or writes file bytes directly. +3. The SPI exposes exactly these operations: + - `ingest_stream(stream, hints) -> IngestResult` — consumes an + upload, returns content addresses, sizes, and storage receipts. + - `serve_object(content_address, range?) -> AsyncIterator[bytes]` — + produces bytes for a download. 
+ - `verify_object(content_address) -> VerifyResult` — re-reads bytes, + re-digests, returns mismatches. + - `delete_object(content_address) -> DeletionResult` — best-effort, + idempotent. + - `backend_health() -> BackendStatus` — readiness, latency, free + capacity. +4. The SPI surface is the contract a future Rust daemon must satisfy. + When that daemon ships, `artifactstore.dataplane.inproc` is replaced + by `artifactstore.dataplane.remote` (a thin gRPC or + framed-bincode-over-Unix-socket client). The control plane sees no + change. +5. SPI parameter and return types are CBOR-serialisable today, even when + nothing serialises them. This lets us toggle to RPC without rewriting + types. + +## Consequences + +Positive: + +- The data plane can be rewritten in Rust later with zero API churn. +- Tests can fake the SPI cheaply; integration tests pin the contract. +- The CLI in `artifactstore.cli` is a second consumer of the SPI on + equal footing with the HTTP server. +- Operators with strong embedding requirements can use the in-process + data plane forever; nothing forces the RPC hop. + +Negative: + +- One extra abstraction layer in v1. Mitigated by the contract being + narrow (five operations). +- Discipline required: PRs that bypass the SPI are rejected. A linter + rule (forbidden import: `artifactstore.api.* -> filesystem`) makes + this mechanical. + +## Implementation notes + +- The SPI is a `Protocol` (typing.Protocol) in `dataplane/spi.py` so the + in-process and future remote impls don't share an inheritance tree. +- Streaming returns `AsyncIterator[bytes]` so neither full-file buffering + nor `sendfile()` zero-copy is foreclosed. +- The `IngestResult` payload is the canonical CBOR-able value used in + events (ADR-0002). The same byte sequence flows API → SPI → event. 
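A sketch of how the SPI could look as a `typing.Protocol` with a cheap in-memory fake, in the spirit of "tests can fake the SPI cheaply". Assumptions: only three of the five operations are shown, `hints` and `range` are omitted, `verify_object` returns a plain `bool` instead of a `VerifyResult`, SHA-256 is used as the address digest, and the `IngestResult` fields and the `InMemoryDataPlane` class are hypothetical.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class IngestResult:
    content_address: str  # "<algorithm>:<hex>" style key (assumed shape)
    size: int


class DataPlane(Protocol):
    """Structural contract; implementations need not share a base class."""

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> IngestResult: ...
    def serve_object(self, content_address: str) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: str) -> bool: ...


class InMemoryDataPlane:
    """Test fake: buffers whole objects in a dict, which a real backend would not do."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> IngestResult:
        hasher = hashlib.sha256()
        parts: list[bytes] = []
        async for chunk in stream:
            hasher.update(chunk)
            parts.append(chunk)
        data = b"".join(parts)
        address = f"sha256:{hasher.hexdigest()}"
        self._objects[address] = data
        return IngestResult(address, len(data))

    async def serve_object(self, content_address: str, chunk_size: int = 4) -> AsyncIterator[bytes]:
        data = self._objects[content_address]
        for start in range(0, len(data), chunk_size):
            yield data[start : start + chunk_size]

    async def verify_object(self, content_address: str) -> bool:
        # Re-digest the stored bytes and compare with the address itself.
        algorithm, _, expected = content_address.partition(":")
        return hashlib.new(algorithm, self._objects[content_address]).hexdigest() == expected


async def round_trip(plane: DataPlane, payload: bytes) -> bytes:
    """Control-plane style caller: touches bytes only through the SPI."""

    async def upload() -> AsyncIterator[bytes]:
        yield payload

    result = await plane.ingest_stream(upload())
    assert await plane.verify_object(result.content_address)
    return b"".join([chunk async for chunk in plane.serve_object(result.content_address)])


if __name__ == "__main__":
    print(asyncio.run(round_trip(InMemoryDataPlane(), b"evidence bytes")))
```

`round_trip` is typed against `DataPlane` only, so swapping the fake for an in-process or remote implementation would not change the caller.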
diff --git a/docs/adr/0005-v1-tech-stack.md b/docs/adr/0005-v1-tech-stack.md new file mode 100644 index 0000000..3489474 --- /dev/null +++ b/docs/adr/0005-v1-tech-stack.md @@ -0,0 +1,117 @@ +# ADR-0005 — V1 Technology Stack + +Status: accepted +Date: 2026-05-15 +Related: ADR-0001, ADR-0002, ADR-0003, ADR-0004 + +## Context + +WP-0001 ("Foundation") cannot start without a pinned stack. The decision +needs to balance: + +- ffmpeg / VLC philosophy: minimal dependency budget, sharp boundaries, + native code at the hot edges, plain tools. +- Python is already implied by `.gitignore` and ecosystem fit (StateHub, + guide-board, open-cmis-tck are all Python-leaning). +- The data plane will eventually be Rust (ADR-0004); the control plane + stays in Python and must stay approachable. + +## Decision + +| Concern | Choice | Rationale | +|---|---|---| +| Language (control plane) | **Python 3.12+** | Async ecosystem, type hints, matches sibling repos. 3.12 specifically: PEP 695 generics, faster CPython, `sys.monitoring`. | +| Package / project manager | **uv** | Single static binary, fast resolver, lockfile-first, replaces `pip + pip-tools + venv + pipx` in one tool. | +| Build backend | **hatchling** (via `pyproject.toml`) | Standards-track PEP 517 backend. No magic. | +| HTTP framework | **FastAPI** (Starlette + Pydantic v2) | OpenAPI generation, async-native, broad community. | +| ASGI server | **uvicorn** (dev), **gunicorn + uvicorn workers** (prod) | Plain, well-understood. | +| Database (prod) | **PostgreSQL 16+** | Source-of-truth event log (ADR-0002) wants `BIGSERIAL`, `BYTEA`, advisory locks, logical replication. | +| Database (dev/embedded) | **SQLite (WAL mode)** | Zero-dependency local. Schema is portable when we use SQLAlchemy Core. | +| DB access | **SQLAlchemy 2.0 Core** + **asyncpg** (prod) / **aiosqlite** (dev) | Core, not ORM — explicit SQL, async drivers. Migrations live below the API surface. 
| +| Migrations | **Alembic** | Standard, integrates with SQLAlchemy Core, supports both pg and sqlite. | +| Hashing | stdlib **`hashlib`** for SHA-256, **`blake3`** PyPI wheel for BLAKE3 | `blake3` wheel embeds the SIMD-tuned Rust impl with no build-time toolchain. | +| Serialisation | **`cbor2`** for canonical CBOR (ADR-0003); stdlib `json` for JCS or `jcs` PyPI | Smallest deps that satisfy ADR-0003. | +| CLI | **typer** (atop click) | Sits on FastAPI's Pydantic types cleanly; type-driven CLI surface. | +| Tests | **pytest** + **httpx** + **trio-asyncio**-free `pytest-asyncio` | Standard. | +| Lint / format | **ruff** (lint + format) | One tool replaces black + isort + flake8 + pyupgrade. | +| Type checker | **mypy** in `--strict` | Pyright is acceptable for editor support; CI gate is mypy. | +| Logging | stdlib `logging` + `structlog` for structured output | No exotic deps. | +| Metrics / tracing | OpenTelemetry SDK (deferred to its own workplan) | Listed for forward-compatibility; not a v1 dep. 
| + +### Project layout + +``` +artifact-store/ +├── pyproject.toml +├── uv.lock +├── Makefile # thin shim: make dev / test / lint / type / migrate +├── alembic.ini +├── src/ +│ └── artifactstore/ +│ ├── __init__.py +│ ├── identity/ # content address, digest abstraction (ADR-0001) +│ ├── manifest/ # canonical CBOR, JCS projection (ADR-0003) +│ ├── events/ # append-only log + replayer (ADR-0002) +│ ├── retention/ # policy engine +│ ├── audit/ # audit emission as event subset +│ ├── storage/ # adapter SPI + backend registry +│ │ ├── spi.py +│ │ └── backends/ +│ │ ├── local.py # filesystem backend +│ │ └── s3.py # placeholder, WP-0004 +│ ├── dataplane/ # SPI + in-process impl (ADR-0004) +│ │ ├── spi.py +│ │ └── inproc.py +│ ├── registry/ # high-level orchestrator +│ ├── api/ +│ │ └── http/ # FastAPI app +│ ├── cli/ # typer CLI (thin) +│ └── config.py +├── tests/ +│ ├── unit/ +│ ├── integration/ +│ └── conftest.py +├── migrations/ # alembic +└── docs/ +``` + +### Commands (T001 acceptance) + +``` +make dev # uvicorn with reload, sqlite backend, local FS storage +make test # pytest -q +make lint # ruff check + ruff format --check +make type # mypy --strict src tests +make migrate # alembic upgrade head +artifactstore # CLI entry point installed by uv +``` + +## Consequences + +Positive: + +- Dependency budget is small and each dep is best-in-class for its slot. +- The same toolchain works on Linux, macOS, and CI without special cases. +- `uv.lock` is checked in; builds are reproducible. +- Every layer maps one-to-one to a docs concept (identity, manifest, + events, dataplane, etc.), so the codebase remains navigable. + +Negative: + +- Pydantic v2 is the heaviest non-DB dep; acceptable for the OpenAPI win. +- Choosing SQLAlchemy Core over ORM costs some convenience; we accept + it because explicit SQL is easier to migrate to Rust later (ADR-0004). +- mypy `--strict` is a per-PR tax; bounded by keeping the codebase small. 
+ +## Revision policy + +This ADR is the most likely candidate for revision once we have profile +data from real ingestion. Candidates we are already watching: + +- Replace `cbor2` with a Rust-backed CBOR codec if profile shows it on + the hot path. +- Replace `uvicorn` with `granian` (Rust ASGI server) if perf demands. +- Replace `SQLAlchemy Core` with raw `asyncpg` + a tiny query builder + if Core's abstractions show up in flame graphs. + +Each replacement is its own ADR. None of them are v1 work. diff --git a/docs/adr/0006-oci-compatibility-reachable.md b/docs/adr/0006-oci-compatibility-reachable.md new file mode 100644 index 0000000..ceb8305 --- /dev/null +++ b/docs/adr/0006-oci-compatibility-reachable.md @@ -0,0 +1,69 @@ +# ADR-0006 — OCI Artifact Compatibility Kept Reachable + +Status: accepted +Date: 2026-05-15 +Related: ADR-0001, ADR-0003, `docs/PLATFORM-AMBITION.md` commitment A9 + +## Context + +The OCI Distribution Specification and the OCI Artifact Manifest define +a widely-deployed wire format for content-addressed artifact exchange. +The ecosystem includes `oras`, `cosign`, `crane`, Helm, ChartMuseum, +ML-model packaging tools, and most container registries. Compatibility +with this ecosystem is the single highest-leverage opportunity in +`docs/PLATFORM-AMBITION.md`. + +We do not implement OCI compatibility in v1. We do refuse to take any +v1 decision that prevents it. + +## Decision + +1. The internal data model is structurally compatible with an OCI + artifact manifest. Concretely: + - Storage addresses content as `:` + (ADR-0001). OCI requires exactly this shape. + - Manifests have a `config` blob plus an ordered list of `layers`, + each with `mediaType`, `digest`, `size`, and optional + `annotations`. Our `Manifest` value class includes all of these + fields, even when v1 has no use for `mediaType` or `annotations`. + - Manifest serialisation produces byte-identical output across + callers (ADR-0003). OCI requires this for the manifest digest. +2. 
The native API may be richer than OCI, but v1 reviews every schema + change against the OCI spec and rejects changes that would block + later OCI compatibility. +3. A future `/v2/` namespace will speak the OCI Distribution Spec on + top of the same storage. This is its own workplan; it does not + modify v1 endpoints, only add new ones. + +## Consequences + +Positive: + +- `oras push`, `cosign sign`, `crane copy`, Helm `chart pull` become + reachable additions, not rewrites. +- Customers who already speak OCI can adopt incrementally. +- The `mediaType` discipline forces v1 producers to label their files, + which improves the manifest's value as a portable record. + +Negative: + +- v1 carries some otherwise-unnecessary manifest fields. Acceptable; + the cost is bytes, not complexity. +- The OCI manifest model uses SHA-256 as the canonical digest in + practice. ADR-0001's `digest_sha256` column satisfies this; the + native primary digest can still be BLAKE3. + +## What this ADR does NOT commit to + +- It does not commit to implementing OCI Distribution in v1. +- It does not commit to OCI as the *only* wire format. The native API + remains the richer interface. +- It does not commit to specific OCI media types for evidence packages. + Media-type assignment is the subject of a later workplan. + +## Review trigger + +Every schema-affecting workplan (anything that touches the data model +or the manifest shape) must include an explicit one-paragraph review +against this ADR. Reject changes that introduce OCI-incompatible +invariants without superseding this ADR. diff --git a/docs/adr/README.md b/docs/adr/README.md new file mode 100644 index 0000000..e6552f1 --- /dev/null +++ b/docs/adr/README.md @@ -0,0 +1,32 @@ +# Architecture Decision Records + +This directory holds the architectural decisions that govern `artifact-store`. 
+Each ADR is a small Markdown file with a status (`proposed`, `accepted`, +`superseded`, `deprecated`), a concise statement of the decision, the +forces that pushed it, and the consequences. + +ADRs are the canonical home for "we are doing X" statements that survive +multiple workplans. `INTENT.md` says what we build; `SCOPE.md` says where +the boundary is; `docs/PLATFORM-AMBITION.md` says where we are pointed; +ADRs say how — and they are the only document that records a *changeable* +decision in a form that can be superseded cleanly. + +Workplans cite the ADRs they depend on. The architecture blueprint cites +the ADRs it operationalises. + +## Index + +- [ADR-0001 — Content-Addressed Storage with Dual Digest](0001-content-addressed-storage.md) — accepted +- [ADR-0002 — Append-Only Event Log as Source of Truth](0002-event-log-source-of-truth.md) — accepted +- [ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2)](0003-manifest-canonical-cbor.md) — accepted +- [ADR-0004 — Control Plane / Data Plane Contract](0004-control-plane-data-plane-contract.md) — accepted +- [ADR-0005 — V1 Technology Stack](0005-v1-tech-stack.md) — accepted +- [ADR-0006 — OCI Artifact Compatibility Kept Reachable](0006-oci-compatibility-reachable.md) — accepted + +## Conventions + +- Filenames: `NNNN-kebab-case-slug.md`, numbered in acceptance order. +- Status transitions: `proposed → accepted → (superseded | deprecated)`. +- Supersession is explicit: the new ADR links the old; the old ADR links + forward and changes status. Never delete an ADR. +- Each ADR is short. If it is long, it is wrong: split it. 
diff --git a/workplans/ARTIFACT-STORE-WP-0001-service-baseline.md b/workplans/ARTIFACT-STORE-WP-0001-service-baseline.md index 51116f2..8541ae1 100644 --- a/workplans/ARTIFACT-STORE-WP-0001-service-baseline.md +++ b/workplans/ARTIFACT-STORE-WP-0001-service-baseline.md @@ -1,7 +1,7 @@ --- id: ARTIFACT-STORE-WP-0001 type: workplan -title: "Artifact Store Service Baseline" +title: "Foundation: Scaffold, Core Kernels, Local FS Backend" repo: artifact-store domain: stack status: active @@ -14,51 +14,53 @@ updated: "2026-05-15" state_hub_workstream_id: "aebf996c-8721-4e8c-9e56-61d5e4bf8dcb" --- -# ARTIFACT-STORE-WP-0001: Artifact Store Service Baseline +# ARTIFACT-STORE-WP-0001: Foundation — Scaffold, Core Kernels, Local FS Backend ## Purpose -Implement the first usable artifact registry and storage gateway. The service -should preserve artifact packages, index their metadata, delegate bytes to a -configured storage backend, apply default retention rules, and expose stable -package identifiers that Statehub and producer repositories can link to. +Stand up the smallest credible `artifact-store` core. By the end of +this workplan, the library can ingest a directory of files into a +package, compute dual digests, write canonical-CBOR manifests, persist +state through the append-only event log, store bytes on local +filesystem, and replay materialised views from the event log. No HTTP +API yet (that lands in WP-0002); a `/health` endpoint exists so that +the dev loop has something to hit. -The first producer target is a guide-board assessment run, including OpenCMIS TCK -reports and raw assessment artifacts. +The shape is **library-first** (ffmpeg-style). HTTP server and CLI are +explicitly thin consumers of `artifactstore.registry`. -## Background +## Constraints (must satisfy) -Guide-board can already produce self-contained run directories with retention -summaries, assessment packages, raw artifacts, scorecards, and log reviews. 
Those -directories should not live only in `/tmp`, and committing raw evidence into -producer repositories is the wrong long-term shape. - -`artifact-store` becomes the shared preservation layer: - -- producers generate files, -- artifact-store registers and stores them, -- Statehub records the work outcome and links to the registry package, -- storage backends handle durable bytes. - -Ceph is the likely self-hosted production backend through its S3-compatible RGW -interface, but the service must keep the backend interface generic. - -## Target Architecture - -```text -producer package - -> registry API - -> metadata database - -> retention policy engine - -> storage adapter - -> local filesystem or S3-compatible object storage -``` +- ADR-0001 — content-addressed storage with dual digest. +- ADR-0002 — append-only event log as source of truth. +- ADR-0003 — manifest canonicalisation = canonical CBOR. +- ADR-0004 — control plane / data plane SPI named. +- ADR-0005 — v1 technology stack pinned (Python 3.12, uv, FastAPI, + SQLAlchemy Core, asyncpg, alembic, cbor2, blake3, ruff, mypy, pytest). +- ADR-0006 — OCI compatibility kept reachable. +- `docs/ARCHITECTURE-BLUEPRINT.md` data model and module layout. ## Boundary -This workplan owns the first service implementation and API contract. It does -not need to build a UI, implement cold-storage restore tiers, replace Statehub, -or provide formal records-management certification. +This workplan builds the library and a minimal `/health` endpoint. It +does NOT implement: package CRUD HTTP API (WP-0002), retention rules +beyond the seed (WP-0003), S3-compatible backend (WP-0004), guide-board +producer wiring (WP-0005), GC of unreferenced bytes (WP-0006). 
+ +## Target architecture (this workplan) + +```text +artifactstore (library) + identity ──┐ + manifest ──┼──> registry (orchestrator) ──> events (WAL + views) + events ───┘ │ + retention (seed only) └──> dataplane.spi ──> dataplane.inproc ──> storage.spi ──> storage.backends.local + audit (view) └──> filesystem + storage.spi + dataplane.spi + inproc +api.http (just /health) +cli (just `artifactstore version`, `artifactstore migrate`, `artifactstore replay`) +``` ## D1.1 - Service Scaffold And Repository Identity @@ -71,14 +73,71 @@ state_hub_task_id: "84209430-ec3b-4c5e-924e-019c25434230" Acceptance: -- Replace the seed README with artifact-store service instructions. -- Add a Python service scaffold with a clear package/module layout. -- Provide a local development command. -- Provide a test command. -- Keep generated artifact bytes and local databases ignored by git. -- Document required environment variables. +- `pyproject.toml` with `hatchling` build backend, pinned dependencies + per ADR-0005. +- `uv.lock` committed. +- `Makefile` exposes: `make dev`, `make test`, `make lint`, `make + type`, `make migrate`. Each target is a thin shim, no logic inline. +- `src/artifactstore/` package skeleton matches ADR-0005's layout + (empty `__init__.py` and one placeholder module per top-level + concern: `identity`, `manifest`, `events`, `retention`, `audit`, + `storage`, `dataplane`, `registry`, `api/http`, `cli`, `config`). +- `tests/{unit,integration}/conftest.py` in place. +- `.env.example` documents required environment variables: + `ARTIFACTSTORE_DATABASE_URL`, `ARTIFACTSTORE_STORAGE_LOCAL_ROOT`, + `ARTIFACTSTORE_LOG_LEVEL`. +- CI-equivalent local commands: `make lint && make type && make test` + pass on a clean checkout. +- `README.md` replaces the seed README: install with `uv sync`, run + with `make dev`, test with `make test`, links to ADRs and blueprint. 
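The three variables listed for `.env.example` could be read by the `config` module roughly as follows. This is a minimal sketch: the `Settings` class and its field names are assumptions for illustration, and only the variable names come from the workplan.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Variable names taken from the D1.1 acceptance list.
REQUIRED = (
    "ARTIFACTSTORE_DATABASE_URL",
    "ARTIFACTSTORE_STORAGE_LOCAL_ROOT",
    "ARTIFACTSTORE_LOG_LEVEL",
)


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings holder; field names are illustrative."""

    database_url: str
    storage_local_root: str
    log_level: str

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        # Fail fast and name every missing variable in one message.
        missing = [name for name in REQUIRED if name not in env]
        if missing:
            raise RuntimeError("missing environment variables: " + ", ".join(missing))
        return cls(
            database_url=env["ARTIFACTSTORE_DATABASE_URL"],
            storage_local_root=env["ARTIFACTSTORE_STORAGE_LOCAL_ROOT"],
            log_level=env["ARTIFACTSTORE_LOG_LEVEL"],
        )
```

Passing the mapping explicitly keeps the loader testable without mutating the process environment.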
-## D1.2 - Registry Data Model +## D1.2 - Digest Abstraction And Content Address + +```task +id: ARTIFACT-STORE-WP-0001-T009 +status: todo +priority: high +``` + +Acceptance: + +- `identity.Digest` value type with `algorithm: str` and `hex: str`, + immutable, hashable. +- `identity.ContentAddress` — string-form `:` with + validating parser and emitter. +- `identity.digest_stream(reader) -> {primary: Digest, sha256: Digest}` — + single-pass dual-hash over an `AsyncIterator[bytes]`. Default primary + algorithm: `blake3`. +- Algorithm registry with `blake3` and `sha256` registered at import. +- Property test: digest over random byte sequences round-trips through + serialisation; `sha256` matches `hashlib.sha256(...).hexdigest()`; + `blake3` matches `blake3.blake3(...).hexdigest()`. + +## D1.3 - Manifest Codec (Canonical CBOR + JCS Projection) + +```task +id: ARTIFACT-STORE-WP-0001-T010 +status: todo +priority: high +``` + +Acceptance: + +- `manifest.Manifest` dataclass with the v1 fields enumerated in the + blueprint (`manifest_version=1`, package, files, storage_receipts, + retention_summary, provenance). +- `manifest.codec.encode(m) -> bytes` produces canonical CBOR + (RFC 8949 §4.2.1): definite-length, shortest-form integers, + sorted map keys. +- `manifest.codec.decode(b) -> Manifest`. +- `manifest.projection.jcs(m) -> bytes` produces RFC 8785 canonical + JSON. +- Property test: `decode(encode(m)) == m` for randomly generated + manifests; `encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`. +- Manifest digest helper: `manifest_digest(m) -> ContentAddress` using + BLAKE3 over the canonical CBOR bytes. + +## D1.4 - Registry Data Model And Migrations ```task id: ARTIFACT-STORE-WP-0001-T002 @@ -89,16 +148,44 @@ state_hub_task_id: "e5249a39-46a2-4b56-813e-0339c52cd14e" Acceptance: -- Define persistent models for artifact packages, files, storage locations, - retention rules, retention events, and audit events. 
-- Store package metadata as structured JSON while keeping core query fields - explicit. -- Record package lifecycle status: created, uploading, finalized, deleted, and - failed. -- Record file `sha256`, size, media type, and logical relative path. -- Add migrations or a reproducible schema initialization path. +- Alembic configured with `migrations/` directory; `alembic upgrade + head` works against both SQLite (dev) and PostgreSQL (prod). +- `events`, `artifact_packages`, `artifact_files`, `storage_locations`, + `retention_classes`, `retention_state`, `metadata_schemas` tables + match the blueprint schema. +- Seed migration populates `retention_classes` with the five v1 entries. +- The `make migrate` and `make migrate-fresh` targets work end-to-end on + a clean DB. +- All schema columns required by ADR-0001 (`digest_algorithm`, + `digest_primary`, `digest_sha256`, `content_address`), ADR-0002 + (full `events` table), and the blueprint's `retrieval_tier` and + `restore_status` are present. -## D1.3 - Local Filesystem Storage Backend +## D1.5 - Event Log Persistence And Replay + +```task +id: ARTIFACT-STORE-WP-0001-T011 +status: todo +priority: high +``` + +Acceptance: + +- `events.write(transaction, Event)` writes one row in the given DB + transaction. Sequence numbers are assigned by the DB (`BIGSERIAL`) + and are monotonic within a registry instance, but not gapless: a + rolled-back transaction still consumes a sequence value. +- `events.tail(since_sequence) -> AsyncIterator[Event]` long-polls + the table (notify-style on PostgreSQL via `LISTEN/NOTIFY`, + poll-style on SQLite). +- `events.replay(into=ViewWriter)` rebuilds all materialised view + tables from `events` deterministically. +- Test: ingesting a fixed sequence of events, then rebuilding the + views from scratch, yields byte-identical materialised state. 
+- Event payloads use canonical CBOR (`manifest.codec`) so the same + bytes flow through registry → DB → tail consumer without re-encoding. 
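The replay requirement in D1.5 amounts to a deterministic fold over the event log. The sketch below illustrates that property with an in-memory view. Only `v1.package.finalized` is an event type named in these workplans; `v1.package.created`, `v1.file.ingested`, the `Event` shape, and the JSON snapshot (standing in for the real view tables) are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass(frozen=True)
class Event:
    sequence: int
    type: str
    payload: Dict[str, Any]


def apply(view: Dict[str, Dict[str, Any]], event: Event) -> None:
    """Fold one event into the view (event types partly hypothetical)."""
    package_id = event.payload["package_id"]
    if event.type == "v1.package.created":
        view[package_id] = {"status": "created", "files": []}
    elif event.type == "v1.file.ingested":
        view[package_id]["files"].append(event.payload["relative_path"])
    elif event.type == "v1.package.finalized":
        view[package_id]["status"] = "finalized"


def replay(events: Iterable[Event]) -> Dict[str, Dict[str, Any]]:
    """Rebuild the view from scratch, always in sequence order."""
    view: Dict[str, Dict[str, Any]] = {}
    for event in sorted(events, key=lambda e: e.sequence):
        apply(view, event)
    return view


def snapshot(view: Dict[str, Dict[str, Any]]) -> bytes:
    """Serialise deterministically so two rebuilds compare byte for byte."""
    return json.dumps(view, sort_keys=True, separators=(",", ":")).encode("utf-8")


EVENTS = [
    Event(1, "v1.package.created", {"package_id": "p1"}),
    Event(2, "v1.file.ingested", {"package_id": "p1", "relative_path": "run.json"}),
    Event(3, "v1.file.ingested", {"package_id": "p1", "relative_path": "reports/report.md"}),
    Event(4, "v1.package.finalized", {"package_id": "p1"}),
]
```

Because `replay` orders by `sequence` and `apply` depends only on the event itself, rebuilding from any copy of the same log gives the same snapshot bytes.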
+ +## D1.6 - Storage Adapter SPI And Local Filesystem Backend ```task id: ARTIFACT-STORE-WP-0001-T003 @@ -109,90 +196,81 @@ state_hub_task_id: "68f9a752-0012-4cc1-8768-ec3f75295e7a" Acceptance: -- Implement a storage adapter interface. -- Implement a local filesystem backend for development and tests. -- Store objects under deterministic package/file keys. -- Prevent path traversal and accidental writes outside the configured storage - root. -- Add backend health reporting. -- Add tests for put, get, head, and delete operations. +- `storage.spi.StorageBackend` Protocol matches the blueprint. +- `storage.backends.local.LocalBackend` implements the SPI: + - Object key layout `////`. + - Atomic write via `fsync(tmpfile) + rename`. + - Path traversal rejected at the SPI boundary. + - `health()` returns disk usage and root accessibility. +- Backend registry resolves by `backend_id` string (per ADR-0004). +- Unit tests cover: put, get, head, delete, double-put idempotency, + delete-of-missing, range read. -## D1.4 - Package Ingestion API +## D1.7 - Data Plane SPI And In-Process Implementation ```task -id: ARTIFACT-STORE-WP-0001-T004 +id: ARTIFACT-STORE-WP-0001-T012 status: todo priority: high -state_hub_task_id: "e3879111-4be9-4731-8aea-15abb874f960" ``` Acceptance: -- Add endpoints to create a package, upload files, finalize a package, retrieve - package metadata, list packages, and download files. -- Compute file hashes server-side during ingestion. -- Reject duplicate logical paths within one package unless explicitly replacing - a non-finalized file. -- Produce a package manifest after finalization. -- Add API tests covering successful ingestion and validation failures. +- `dataplane.spi.DataPlane` Protocol matches ADR-0004. +- `dataplane.inproc.InProcessDataPlane` implements all five operations + on top of a configured `StorageBackend`. 
+- `ingest_stream` computes both digests in a single pass, writes to + the backend keyed by the primary content address, and returns an + `IngestResult` containing both digests, size, and the + `StorageReceipt`. +- `serve_object` and `verify_object` re-read bytes through the + backend; `verify_object` re-digests and returns mismatches if any. +- Lint rule (or test): no code outside `dataplane.*` imports + `storage.backends.*` directly. -## D1.5 - Retention Baseline +## D1.8 - Registry Orchestrator (Library Surface) ```task -id: ARTIFACT-STORE-WP-0001-T005 +id: ARTIFACT-STORE-WP-0001-T013 status: todo priority: high -state_hub_task_id: "2d6cbd83-c348-45ad-a223-7870a3412225" ``` Acceptance: -- Seed default retention classes for transient, raw-evidence, summary-evidence, - release-evidence, and permanent-record. -- Apply a default `expires_at` when a package is created or finalized. -- Add endpoints to extend retention and apply or release holds. -- Record retention changes as retention events and audit events. -- Expose deletion eligibility without deleting bytes automatically in the first - implementation. +- `registry.Registry` exposes: `create_package`, `ingest_file`, + `finalize_package`, `get_manifest_bytes` (CBOR + JCS), `get_file`, + `tail_events`. Plus stubs for the retention operations that lighten + WP-0003. +- Each mutating operation is one DB transaction that writes events + AND updates materialised views. +- Finalisation writes one `v1.package.finalized` event whose payload + *is* the canonical CBOR manifest, and stamps `manifest_digest` on + `artifact_packages`. +- Duplicate `relative_path` within one not-yet-finalised package is + rejected unless an explicit replace is requested. +- Integration test: end-to-end ingest of a 3-file package against + local backend → finalize → read manifest → verify digests + → tail events → replay rebuilds identical state. 
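The single-pass dual digest that `ingest_stream` and the registry ingest path rely on can be sketched with the standard library alone. In this sketch `blake2b` from `hashlib` stands in for BLAKE3 (the real primary algorithm would come from the third-party `blake3` package), and the `chunks` helper is a hypothetical reader used only for the demonstration.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import AsyncIterator, Dict


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str


async def digest_stream(reader: AsyncIterator[bytes], primary: str = "blake2b") -> Dict[str, Digest]:
    """Feed each chunk to both hashers so the stream is read exactly once."""
    primary_hasher = hashlib.new(primary)  # stand-in for BLAKE3 in this sketch
    sha256_hasher = hashlib.sha256()
    async for chunk in reader:
        primary_hasher.update(chunk)
        sha256_hasher.update(chunk)
    return {
        "primary": Digest(primary, primary_hasher.hexdigest()),
        "sha256": Digest("sha256", sha256_hasher.hexdigest()),
    }


async def chunks(data: bytes, size: int = 4096) -> AsyncIterator[bytes]:
    """Hypothetical reader: yield `data` in fixed-size pieces."""
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]
```

A caller would run it as `asyncio.run(digest_stream(chunks(payload)))`. Both digests then describe the same bytes without a second read of the stream.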
-## D1.6 - S3-Compatible Backend Design Hook +## D1.9 - Minimal HTTP App And CLI ```task -id: ARTIFACT-STORE-WP-0001-T006 +id: ARTIFACT-STORE-WP-0001-T014 status: todo priority: medium -state_hub_task_id: "7b980a55-2364-48c3-98ac-081629a8d2b7" ``` Acceptance: -- Define configuration fields for an S3-compatible backend. -- Keep the adapter contract compatible with Ceph RGW. -- Add an implementation stub or feature-flagged backend if dependencies are not - ready. -- Document expected Ceph/S3 configuration without requiring a live Ceph service - for baseline tests. +- `api.http.app` is a FastAPI app with one route: `GET /health` + reporting registry liveness, DB connectivity, and backend health. +- `cli` exposes `artifactstore version`, `artifactstore migrate`, + `artifactstore replay`, `artifactstore health`. +- `make dev` starts the API on `127.0.0.1:8000` with SQLite + + local FS backend by default. -## D1.7 - Guide-Board Pilot Ingestion - -```task -id: ARTIFACT-STORE-WP-0001-T007 -status: todo -priority: high -state_hub_task_id: "eb822821-353c-4cd2-95bf-acb2f084b7ea" -``` - -Acceptance: - -- Provide a CLI helper or documented curl flow to register a guide-board run - directory as one package. -- Preserve guide-board run metadata: run id, target profile, assessment profile, - evidence result counts, finding counts, source commits, and report paths. -- Ingest the CMIS pilot run shape, including scorecard and log-review reports. -- Return a package id suitable for recording in Statehub. -- Add a fixture-based test that does not require the real OpenCMIS TCK. - -## D1.8 - Operator Documentation And Handoff +## D1.10 - Operator Documentation And ADR Cross-Linking ```task id: ARTIFACT-STORE-WP-0001-T008 @@ -203,27 +281,33 @@ state_hub_task_id: "9b60036c-61f2-4c22-ad31-7213473d42d0" Acceptance: -- Document local run, test, and package ingestion commands. -- Document retention behavior and extension flow. -- Document the boundary between artifact-store and Statehub. 
-- Include a dev-agent handoff section listing the first implementation order. -- Keep architecture docs aligned with the implemented API. +- `README.md` updated with current run / test / migrate commands. +- `AGENTS.md` "Current Repo Shape" section reflects the scaffold. +- A `docs/OPERATOR.md` page documents environment variables, local + vs PostgreSQL setup, replay command, and a smoke-test recipe. +- Every ADR is cross-linked from at least one of: blueprint, this + workplan, or `OPERATOR.md`. -## Suggested Implementation Order +## Suggested implementation order -1. Service scaffold, test harness, and README. -2. Metadata models and local database setup. -3. Local filesystem storage adapter. -4. Package create/upload/finalize/download API. -5. Retention defaults, extension, hold, and audit events. -6. Guide-board run ingestion helper. -7. S3-compatible backend configuration and Ceph notes. +1. T001 — scaffold and tooling (no other task can start without this). +2. T009 — digest abstraction (unblocks T010, T012). +3. T010 — manifest codec (unblocks T013). +4. T002 — schema and migrations (unblocks T011, T013). +5. T011 — event log + replay. +6. T003 — storage SPI + local backend. +7. T012 — data plane SPI + in-process impl. +8. T013 — registry orchestrator. +9. T014 — minimal HTTP app and CLI. +10. T008 — docs. -## First Pilot Success Criteria +## Success criteria -- A completed guide-board CMIS run can be ingested from a local directory. -- The package manifest lists every stored file with SHA-256 and size. -- The registry returns a stable package id. -- Files can be downloaded through the service. -- Default retention is visible and can be extended. -- Statehub can record the package id and summary without storing artifact bytes. +- `make dev && make test` round-trips on a clean checkout. 
+- A scripted integration test ingests a directory of fixture files, + finalises the package, reads the manifest, downloads each file, and + verifies digests end-to-end against the local backend. +- Replaying events from sequence 1 reproduces the materialised view + state byte-for-byte. +- The library can be imported and exercised without an HTTP server + running (embedding test). diff --git a/workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md b/workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md new file mode 100644 index 0000000..7571862 --- /dev/null +++ b/workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md @@ -0,0 +1,150 @@ +--- +id: ARTIFACT-STORE-WP-0002 +type: workplan +title: "Ingestion API And Manifest Surface" +repo: artifact-store +domain: stack +status: planned +owner: codex +topic_slug: stack +planning_priority: high +planning_order: 2 +created: "2026-05-15" +updated: "2026-05-15" +--- + +# ARTIFACT-STORE-WP-0002: Ingestion API And Manifest Surface + +## Purpose + +Expose the WP-0001 library as a complete HTTP API. Producers can create +packages, ingest files (single-shot or via the upload-session resource +shape), finalise to produce a manifest, list and search packages, +download files, and tail the event stream. + +## Constraints + +- ADR-0001, ADR-0002, ADR-0003, ADR-0004, ADR-0005, ADR-0006. +- `docs/ARCHITECTURE-BLUEPRINT.md` API shape section. +- All handlers must be thin: translate transport → `registry.*` calls. + +## Prerequisites + +- WP-0001 done (library is functional against local backend). + +## D2.1 - Package CRUD Endpoints + +```task +id: ARTIFACT-STORE-WP-0002-T001 +status: todo +priority: high +state_hub_task_id: "e3879111-4be9-4731-8aea-15abb874f960" +``` + +Acceptance: + +- `POST /packages`, `GET /packages` (filterable by producer / subject / + retention_class / metadata key), `GET /packages/{id}`, + `POST /packages/{id}/files` (single-shot multipart), + `POST /packages/{id}/finalize`. 
+- `GET /packages/{id}/manifest` (`Accept: application/cbor`) and + `GET /packages/{id}/manifest.json` (JCS projection). +- Validation errors return RFC 7807 problem documents. +- OpenAPI is generated automatically (FastAPI default) and served at + `/openapi.json` + `/docs`. + +## D2.2 - File Download And Range Reads + +```task +id: ARTIFACT-STORE-WP-0002-T002 +status: todo +priority: high +``` + +Acceptance: + +- `GET /files/{file_id}` returns metadata. +- `GET /files/{file_id}/download` streams bytes; supports `Range` + request headers (single contiguous range; multi-range is out of + scope for v1). +- ETag is the file's primary content address; `If-None-Match` returns + `304`. +- Streaming uses `AsyncIterator[bytes]` end-to-end; no full-file + buffering. + +## D2.3 - Upload Session Resource (Wire Shape Pinned) + +```task +id: ARTIFACT-STORE-WP-0002-T003 +status: todo +priority: medium +``` + +Acceptance: + +- `POST /uploads` opens a session, returns an upload id and content + upload URL. +- `PATCH /uploads/{upload_id}` accepts a body with `Content-Range`; + v1 implementation may accept the whole body in one call. +- `POST /uploads/{upload_id}/complete` promotes the upload into a + file under a given package id and relative path. +- Implementation is allowed to be single-shot internally; the wire + shape and resource lifecycle must be the final one (per + PLATFORM-AMBITION A6). + +## D2.4 - Event Stream Long-Poll + +```task +id: ARTIFACT-STORE-WP-0002-T004 +status: todo +priority: medium +``` + +Acceptance: + +- `GET /events?since=&limit=N` returns events in order with + a long-poll wait when the tail is reached. +- Events are CBOR by default; `Accept: application/json` returns the + JCS projection of each event payload. +- Test: a consumer that tails from sequence 1 never misses an event + produced during the test. 
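The long-poll behaviour described for the event stream can be illustrated without a database. The sketch below is an in-memory stand-in, not the planned implementation: `InMemoryEventLog`, `poll`, and `demo` are hypothetical names, and an `asyncio.Condition` plays the role that `LISTEN/NOTIFY` or table polling would play in the real service.

```python
import asyncio
from typing import Any, List, Tuple


class InMemoryEventLog:
    """Hypothetical stand-in for the `events` table; sequence n lives at index n - 1."""

    def __init__(self) -> None:
        self._events: List[Any] = []
        self._cond = asyncio.Condition()

    async def append(self, payload: Any) -> int:
        async with self._cond:
            self._events.append(payload)
            self._cond.notify_all()  # wake any parked long-poll
            return len(self._events)

    async def poll(self, since: int, limit: int, timeout: float) -> List[Tuple[int, Any]]:
        """Return up to `limit` events with sequence > since, waiting at the tail."""
        async with self._cond:
            if len(self._events) <= since:
                try:
                    await asyncio.wait_for(
                        self._cond.wait_for(lambda: len(self._events) > since), timeout
                    )
                except asyncio.TimeoutError:
                    return []  # empty batch: the client retries with the same cursor
            batch = self._events[since:since + limit]
            return [(since + i + 1, payload) for i, payload in enumerate(batch)]


async def demo() -> List[int]:
    log = InMemoryEventLog()  # created inside the running event loop

    async def producer() -> None:
        for n in range(5):
            await asyncio.sleep(0)
            await log.append({"n": n})

    async def consumer() -> List[int]:
        seen: List[int] = []
        cursor = 0
        while len(seen) < 5:
            for sequence, _payload in await log.poll(cursor, limit=2, timeout=1.0):
                seen.append(sequence)
                cursor = sequence  # advance only past events actually received
        return seen

    _, seen = await asyncio.gather(producer(), consumer())
    return seen
```

The consumer advances its cursor only to the last sequence it received, so events appended while it was processing a batch are picked up by the next poll rather than skipped.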
+ +## D2.5 - Auth Scaffolding (Shared-Secret Bearer) + +```task +id: ARTIFACT-STORE-WP-0002-T005 +status: todo +priority: medium +``` + +Acceptance: + +- Bearer token auth on all mutating endpoints; configurable per-tenant + token list via env / config file. +- Read endpoints are also gated by default; an explicit + `ARTIFACTSTORE_ANON_READ=true` opt-in for dev. +- Health endpoint remains anonymous. + +## D2.6 - Integration Tests Through The Full HTTP Surface + +```task +id: ARTIFACT-STORE-WP-0002-T006 +status: todo +priority: high +``` + +Acceptance: + +- httpx-based test suite exercises every endpoint. +- A scripted test ingests a 50-file package, finalises it, downloads + every file, verifies digests, and tails events. +- A property-based test fuzzes the upload session lifecycle. + +## Success criteria + +- A producer can run the full ingest-and-retrieve flow against + `make dev` with curl. +- All blueprint endpoints in the v1 native surface are implemented. +- The CLI gains `artifactstore push ` and + `artifactstore manifest ` subcommands as thin clients + over the HTTP API. diff --git a/workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md b/workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md new file mode 100644 index 0000000..b30a521 --- /dev/null +++ b/workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md @@ -0,0 +1,132 @@ +--- +id: ARTIFACT-STORE-WP-0003 +type: workplan +title: "Retention Lifecycle: Defaults, Extensions, Holds, Deletion Eligibility" +repo: artifact-store +domain: stack +status: planned +owner: codex +topic_slug: stack +planning_priority: high +planning_order: 3 +created: "2026-05-15" +updated: "2026-05-15" +--- + +# ARTIFACT-STORE-WP-0003: Retention Lifecycle + +## Purpose + +Implement the retention engine. 
By the end of this workplan, every +package has a computed `expires_at`, operators can extend retention or +apply / release holds, and the system can mark expired packages as +eligible for deletion — without actually deleting bytes (GC is +WP-0006). + +## Constraints + +- ADR-0002 (every retention change is an event). +- `docs/ARCHITECTURE-BLUEPRINT.md` retention sections. + +## Prerequisites + +- WP-0001 done (`retention_classes` seeded, `retention_state` view + exists). +- WP-0002 done (HTTP surface exists to attach the new endpoints to). + +## D3.1 - Default Retention Application + +```task +id: ARTIFACT-STORE-WP-0003-T001 +status: todo +priority: high +state_hub_task_id: "2d6cbd83-c348-45ad-a223-7870a3412225" +``` + +Acceptance: + +- On `POST /packages`, the requested `retention_class` is validated + and the `v1.retention.default_applied` event is written with the + computed `expires_at`. +- Default durations per class are operator-configurable via a + config file (TOML); the file path is documented in `OPERATOR.md`. +- `permanent-record` packages have `expires_at = NULL` and + `eligible_for_deletion = false`. + +## D3.2 - Retention Extensions + +```task +id: ARTIFACT-STORE-WP-0003-T002 +status: todo +priority: high +``` + +Acceptance: + +- `POST /packages/{id}/retention/extensions` accepts + `{new_expires_at, reason}`. The new value must be strictly later + than the current; reason is mandatory. +- Each extension writes a `v1.retention.extended` event; + `retention_state.current_expires_at` updates on the same + transaction. +- A package's full extension history is recoverable from `events`. + +## D3.3 - Holds (Apply And Release) + +```task +id: ARTIFACT-STORE-WP-0003-T003 +status: todo +priority: high +``` + +Acceptance: + +- `POST /packages/{id}/retention/holds` records a hold with a reason + and actor; emits `v1.retention.hold_applied`. +- A package with at least one active hold is never + `eligible_for_deletion` regardless of `expires_at`. 
+- `POST /packages/{id}/retention/holds/{hold_id}/release` requires a + reason; emits `v1.retention.hold_released`. +- Test: hold applied → expiry passes → eligibility stays `false`; + hold released → eligibility flips to `true`. + +## D3.4 - Deletion Eligibility Sweeper + +```task +id: ARTIFACT-STORE-WP-0003-T004 +status: todo +priority: medium +``` + +Acceptance: + +- A scheduled task (cron-style configurable interval; default 1 hour) + scans packages whose `expires_at` has passed and no active hold + exists, and emits `v1.retention.deletion_eligible` events. +- The sweeper is idempotent: events are emitted at most once per + package per eligibility transition. +- The sweeper is invokable as a CLI subcommand for tests: + `artifactstore retention sweep`. + +## D3.5 - Audit Surface For Retention + +```task +id: ARTIFACT-STORE-WP-0003-T005 +status: todo +priority: medium +``` + +Acceptance: + +- `GET /packages/{id}/retention/history` returns the ordered list of + retention events for a package. +- The default response is the JCS projection; CBOR is available via + `Accept: application/cbor`. + +## Success criteria + +- A guide-board run can be ingested, given `release-evidence`, later + extended once, held for a quarter, released, swept, and marked + eligible — all visible through both `retention_state` and the + event log. +- No bytes are deleted by this workplan; that is WP-0006. 
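The eligibility and extension rules from D3.1 through D3.3 reduce to two small pure functions. The following is a sketch under assumptions: the function names and signatures are illustrative, and "has passed" is read here as strictly later than `expires_at`.

```python
from datetime import datetime
from typing import Optional


def eligible_for_deletion(expires_at: Optional[datetime], active_holds: int, now: datetime) -> bool:
    """Apply the retention rules in order of precedence."""
    if expires_at is None:  # permanent-record packages never expire
        return False
    if active_holds > 0:  # any active hold blocks eligibility
        return False
    return now > expires_at


def validate_extension(current_expires_at: datetime, new_expires_at: datetime, reason: str) -> None:
    """Reject extensions that lack a reason or do not move expiry forward."""
    if not reason.strip():
        raise ValueError("reason is mandatory")
    if new_expires_at <= current_expires_at:
        raise ValueError("new_expires_at must be strictly later than the current value")
```

Keeping the rules free of I/O lets the sweeper, the HTTP handlers, and the replay path share one definition of eligibility.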
diff --git a/workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md b/workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md new file mode 100644 index 0000000..b9ba9d4 --- /dev/null +++ b/workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md @@ -0,0 +1,131 @@ +--- +id: ARTIFACT-STORE-WP-0004 +type: workplan +title: "S3-Compatible Backend (Ceph RGW Target)" +repo: artifact-store +domain: stack +status: planned +owner: codex +topic_slug: stack +planning_priority: medium +planning_order: 4 +created: "2026-05-15" +updated: "2026-05-15" +--- + +# ARTIFACT-STORE-WP-0004: S3-Compatible Backend + +## Purpose + +Add a second concrete storage backend that speaks the S3 protocol. +Validated targets: Ceph RGW (primary self-hosted production target), +MinIO (dev / CI), AWS S3 (interop check). The backend must satisfy +the storage SPI without any leaks of S3-specific concepts into the +registry. + +## Constraints + +- `storage.spi.StorageBackend` Protocol from WP-0001 is the contract. +- No S3 vocabulary leaks into `registry.*` or `api.*`. +- `docs/ARCHITECTURE-BLUEPRINT.md` storage-backend section. + +## Prerequisites + +- WP-0001 done (SPI exists, local backend exists as a reference). + +## D4.1 - Configuration Surface + +```task +id: ARTIFACT-STORE-WP-0004-T001 +status: todo +priority: high +state_hub_task_id: "7b980a55-2364-48c3-98ac-081629a8d2b7" +``` + +Acceptance: + +- `s3` backend configuration accepts: `endpoint_url`, `region`, + `bucket`, `key_prefix`, `access_key_ref`, `secret_key_ref`, + `storage_class`, `sse` (optional), `multipart_threshold_bytes`, + `multipart_chunk_bytes`. +- Credential references resolve from env vars or mounted files; never + from request bodies. +- Documented Ceph RGW configuration example checked in under + `docs/OPERATOR.md`. 
+ +## D4.2 - S3 Backend Implementation + +```task +id: ARTIFACT-STORE-WP-0004-T002 +status: todo +priority: high +``` + +Acceptance: + +- `storage.backends.s3.S3Backend` implements the SPI using `aioboto3` + or `aiobotocore` (decision recorded in the workplan; whichever is + better-maintained at implementation time). +- Object key layout + `////`. +- `put` uses multipart for objects above the configured threshold. +- `get` supports `Range`. +- `head`, `delete`, `health` implemented. +- `delete` is idempotent (delete-of-missing returns success). + +## D4.3 - Backend Selection And Routing + +```task +id: ARTIFACT-STORE-WP-0004-T003 +status: todo +priority: medium +``` + +Acceptance: + +- A registry can have multiple backends configured; package creation + records which backend a file is stored in. +- Per-package backend selection rule: configurable function of + `retention_class` + producer; default routes everything to a single + backend. +- `storage_locations.backend_id` reflects the actual storage. + +## D4.4 - Test Strategy: MinIO In CI, RGW As Documented Manual Smoke + +```task +id: ARTIFACT-STORE-WP-0004-T004 +status: todo +priority: high +``` + +Acceptance: + +- Integration tests run against MinIO via `testcontainers-python` + (or a docker-compose fixture if testcontainers fights the WSL2 + environment). +- A documented manual procedure tests against a real Ceph RGW + endpoint; results recorded in `docs/OPERATOR.md`. +- No CI dependency on a live Ceph or AWS account. + +## D4.5 - Verification Pass + +```task +id: ARTIFACT-STORE-WP-0004-T005 +status: todo +priority: medium +``` + +Acceptance: + +- `artifactstore storage verify --backend s3` re-reads every object in + the backend, recomputes its primary digest, and emits + `v1.storage.location_verified` events. +- Mismatches are reported as `failed` locations and surfaced via the + health endpoint. 
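The verification pass in D4.5 can be pictured as a loop that re-digests each stored object and compares the result with the digest embedded in its content address. This sketch makes several assumptions: content addresses take the form `algorithm:hex`, `sha256` stands in for the primary digest, and a plain dictionary stands in for the backend. `verify_objects` is a hypothetical name.

```python
import hashlib
from typing import Dict, List


def verify_objects(objects: Dict[str, bytes]) -> List[Dict[str, str]]:
    """Re-digest every object and report whether it still matches its address."""
    results = []
    for address, data in sorted(objects.items()):
        algorithm, _, expected_hex = address.partition(":")
        actual_hex = hashlib.new(algorithm, data).hexdigest()
        results.append(
            {
                "content_address": address,
                # "failed" mirrors the status wording used in D4.5
                "status": "verified" if actual_hex == expected_hex else "failed",
            }
        )
    return results
```

In the real command each result would become a `v1.storage.location_verified` event or a `failed` location. Here the function only returns the outcome so the comparison logic stays easy to test.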
+ +## Success criteria + +- The same package ingestion flow that worked against `local` in + WP-0001 works unchanged against `s3`. +- Switching backend by config — without code changes in the registry + or API layers — is the smoke test. diff --git a/workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md b/workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md new file mode 100644 index 0000000..54e48a0 --- /dev/null +++ b/workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md @@ -0,0 +1,146 @@ +--- +id: ARTIFACT-STORE-WP-0005 +type: workplan +title: "Guide-Board Pilot Ingestion" +repo: artifact-store +domain: stack +status: planned +owner: codex +topic_slug: stack +planning_priority: high +planning_order: 5 +created: "2026-05-15" +updated: "2026-05-15" +--- + +# ARTIFACT-STORE-WP-0005: Guide-Board Pilot Ingestion + +## Purpose + +Wire the first real producer end-to-end. A guide-board CMIS +assessment run directory is registered as one artifact package, its +files are stored through a configured backend, retention is applied, +and Statehub records a stable package id and summary without storing +bytes itself. This is the pilot success criterion in INTENT.md. + +## Constraints + +- WP-0001 to WP-0003 must be done; WP-0004 only for the production target. +- `docs/ARCHITECTURE-BLUEPRINT.md` guide-board manifest fields. +- No guide-board-specific code lives in `artifactstore.registry`; + pilot-specific glue lives in `artifactstore.pilots.guide_board` or + in a separate small package. + +## Prerequisites + +- WP-0001, WP-0002, WP-0003 done. WP-0004 only required for the + production target; local FS is sufficient for the pilot test. 
+ +## D5.1 - Pilot Metadata Schema Registration + +```task +id: ARTIFACT-STORE-WP-0005-T001 +status: todo +priority: high +state_hub_task_id: "eb822821-353c-4cd2-95bf-acb2f084b7ea" +``` + +Acceptance: + +- A JSON Schema for `guide-board.run.v1` package metadata is checked + in under `schemas/guide-board.run.v1.json`. 
+- A bootstrap script registers it via `POST /metadata-schemas` + (an endpoint added in this workplan). +- Required keys: `run_id`, `target_profile_ref`, + `assessment_profile_ref`, `result_status`, `source_commits` + (object of slug → SHA), `report_paths`, `evidence_counts`, + `finding_counts`. + +## D5.2 - Pilot Ingest Helper (CLI + Library Function) + +```task +id: ARTIFACT-STORE-WP-0005-T002 +status: todo +priority: high +``` + +Acceptance: + +- `artifactstore guide-board ingest ` walks a guide-board + run directory, builds the package metadata from `run.json` and + `retention-summary.json`, uploads every file declared in the + assessment package manifest (and the manifest itself), and + finalises the package. +- Library entry point `pilots.guide_board.ingest_run(path, ...)` + exposes the same behaviour for embedding. +- Output: the package id (UUID) and the package manifest digest + (`blake3:`). + +## D5.3 - Fixture-Based Test + +```task +id: ARTIFACT-STORE-WP-0005-T003 +status: todo +priority: high +``` + +Acceptance: + +- A trimmed-down guide-board run fixture (under 1 MB total) lives in + `tests/fixtures/guide-board/` with realistic file shapes: + `run.json`, `retention-summary.json`, + `reports/assessment-package.json`, `reports/report.md`, one + scorecard, one log-review summary, and a couple of raw artifact + files. +- The test runs the CLI / library helper end-to-end against an + in-memory SQLite + tempdir local backend, then verifies: + 1. package id returned, + 2. manifest digest stable across two runs of the same fixture, + 3. every file downloadable with correct bytes, + 4. retention class applied as configured. 
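The "manifest digest stable across two runs" check in D5.3 depends on the manifest being built in a deterministic order and serialised deterministically. The sketch below shows the idea under stated assumptions: sorted compact JSON and `sha256` stand in for canonical CBOR and BLAKE3, and `describe_tree`, `manifest_digest`, and `write_fixture` are hypothetical helpers, not the pilot's actual API.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, List


def describe_tree(root: Path) -> List[Dict[str, object]]:
    """List every file under `root`, ordered by POSIX-style relative path."""
    files = [p for p in root.rglob("*") if p.is_file()]
    entries = []
    for path in sorted(files, key=lambda p: p.relative_to(root).as_posix()):
        data = path.read_bytes()
        entries.append(
            {
                "relative_path": path.relative_to(root).as_posix(),
                "sha256": hashlib.sha256(data).hexdigest(),
                "size": len(data),
            }
        )
    return entries


def manifest_digest(root: Path) -> str:
    """Digest a deterministic serialisation of the file list (JSON stand-in for CBOR)."""
    encoded = json.dumps(describe_tree(root), sort_keys=True, separators=(",", ":")).encode("utf-8")
    return "sha256:" + hashlib.sha256(encoded).hexdigest()


def write_fixture(root: Path) -> None:
    """Create a tiny run directory resembling the fixture shape described above."""
    (root / "reports").mkdir(parents=True, exist_ok=True)
    (root / "run.json").write_text('{"run_id": "fixture-run"}', encoding="utf-8")
    (root / "reports" / "report.md").write_text("# Fixture report\n", encoding="utf-8")
```

Sorting by relative path rather than by filesystem iteration order is what makes the digest independent of where, and in what order, the fixture was written.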
+ +## D5.4 - Statehub Linkage Recipe + +```task +id: ARTIFACT-STORE-WP-0005-T004 +status: todo +priority: medium +``` + +Acceptance: + +- `docs/OPERATOR.md` (or a new `docs/pilots/guide-board.md`) + documents the exact `POST /progress/` or `record_decision` call + shape Statehub clients should use to link a guide-board run to + its artifact-store package id and manifest digest. +- A reference Statehub client snippet is checked in, parameterised + by env vars. + +## D5.5 - Operator Smoke Procedure For The Real Producer + +```task +id: ARTIFACT-STORE-WP-0005-T005 +status: todo +priority: medium +``` + +Acceptance: + +- A documented procedure ingests a real (non-fixture) guide-board run + produced from `~/guide-board` / `~/open-cmis-tck`. +- Procedure includes: starting `make dev`, registering the schema, + running the ingest CLI, verifying the manifest, and + recording the package id in Statehub. +- Procedure runs end-to-end on a developer workstation under 5 + minutes. + +## Success criteria + +- A real guide-board CMIS run is ingested with one CLI invocation. +- The package manifest lists every stored file with both digests and + the canonical CBOR digest of the manifest itself. +- Statehub records the package id and summary; no artifact bytes + live in Statehub. +- Retention can be extended on the package without touching bytes. +- The pilot path validates the storage adapter swap: the same + command works against `local` and against `s3` (if WP-0004 done).