generated from coulomb/repo-seed
docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans
Aligns the v1 architecture with the longer-horizon platform thesis so we can start implementation without the schema-level inconsistencies the prior review surfaced.

- ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only event log as source of truth, canonical CBOR manifests, control/data-plane contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core + asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI compatibility kept reachable.
- Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module layout, materialised-view data model over the event log, upload-session and event-stream endpoints pinned, retrieval tiering promoted into the schema.
- Roadmap added (docs/ROADMAP.md) with three phases.
- WP-0001 rewritten as the Foundation plan (scaffold + kernels + local FS + minimal app).
- WP-0002..0005 created carrying the existing state_hub_task_ids forward semantically: ingestion API (T004), retention lifecycle (T005), S3-compatible backend (T006), guide-board pilot (T007).
- T001/T002/T003/T008 remain in WP-0001 with refined acceptance.
- README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
18
AGENTS.md
@@ -162,13 +162,21 @@ To create a new workplan:
|
||||
|
||||
## Current Repo Shape
|
||||
|
||||
This repository is in service-baseline planning. The current source of truth is:
|
||||
This repository is in service-baseline planning. The current sources of truth are:
|
||||
|
||||
- `INTENT.md` for purpose, product thesis, scope, and service boundary
|
||||
- `docs/ARCHITECTURE-BLUEPRINT.md` for the draft architecture
|
||||
- `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` for implementation tasks
|
||||
- `INTENT.md` — purpose, product thesis, scope, service boundary.
|
||||
- `SCOPE.md` — lightweight orientation.
|
||||
- `docs/ARCHITECTURE-BLUEPRINT.md` — architecture v2: modules, data model, API shape.
|
||||
- `docs/PLATFORM-AMBITION.md` — longer-horizon thesis and the v1 schema commitments (A1–A9).
|
||||
- `docs/adr/` — architecture decision records ADR-0001 … ADR-0006 (content-addressed storage, event log as source of truth, canonical CBOR manifests, control/data plane contract, v1 tech stack, OCI reachability).
|
||||
- `docs/ROADMAP.md` — workplan sequencing across phases.
|
||||
- `docs/ASSEMBLY-EXPERIMENT.md` — opt-in research line on hand-tuned asm for hot kernels.
|
||||
- `workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` — Foundation workplan; first to start.
|
||||
- `workplans/ARTIFACT-STORE-WP-{0002..0005}-*.md` — planned next workplans.
|
||||
|
||||
No runnable service scaffold exists yet. Add install, dev-server, and test
|
||||
No runnable service scaffold exists yet. The pinned tech stack is in ADR-0005
|
||||
(Python 3.12, uv, FastAPI, SQLAlchemy Core + asyncpg/aiosqlite, Alembic,
|
||||
cbor2, blake3, ruff, mypy, pytest, typer). Add install, dev-server, and test
|
||||
commands here when `ARTIFACT-STORE-WP-0001-T001` lands.
|
||||
|
||||
## Repo Boundary
|
||||
|
||||
58
README.md
@@ -1,17 +1,51 @@
|
||||
# artifact-store
|
||||
|
||||
Generic artifact registry and storage gateway for generated outputs, evidence
|
||||
packages, reports, logs, and release artifacts.
|
||||
Generic artifact registry and storage gateway for generated outputs,
|
||||
evidence packages, reports, logs, snapshots, exports, and release
|
||||
artifacts.
|
||||
|
||||
The registry owns artifact identity, metadata, provenance, retention policy, and
|
||||
retrieval records. Actual bytes are delegated to configured storage backends such
|
||||
as a local filesystem, S3-compatible object storage, or Ceph RGW.
|
||||
The registry owns artifact identity, metadata, provenance, retention
|
||||
policy, and retrieval records. Bytes are delegated to configured
|
||||
storage backends (local filesystem in v1, S3-compatible / Ceph RGW
|
||||
next).
|
||||
|
||||
Start here:
|
||||
The shape is library-first (`artifactstore` Python package); the HTTP
|
||||
server and the CLI are thin consumers. Content is addressed by digest;
|
||||
state is authoritative in an append-only event log; materialised views
|
||||
are rebuildable.
|
||||
|
||||
- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary
|
||||
- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — draft architecture
|
||||
- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon thesis and the schema commitments v1 preserves
|
||||
- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md) — SWOT and optimisation review
|
||||
- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in research line on hand-tuned assembly for hot kernels
|
||||
- [workplans/ARTIFACT-STORE-WP-0001-service-baseline.md](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md) — first implementation workplan
|
||||
## Status
|
||||
|
||||
Concept / service-baseline planning. No runnable scaffold yet —
|
||||
`workplans/ARTIFACT-STORE-WP-0001-service-baseline.md` is the next step.
|
||||
|
||||
## Start here
|
||||
|
||||
- [INTENT.md](INTENT.md) — purpose, product thesis, scope, boundary.
|
||||
- [SCOPE.md](SCOPE.md) — lightweight orientation.
|
||||
- [docs/ARCHITECTURE-BLUEPRINT.md](docs/ARCHITECTURE-BLUEPRINT.md) — the
|
||||
v2 architecture: modules, data model, API shape.
|
||||
- [docs/PLATFORM-AMBITION.md](docs/PLATFORM-AMBITION.md) — longer-horizon
|
||||
thesis, ffmpeg / VLC reference points, the schema commitments v1
|
||||
preserves.
|
||||
- [docs/ROADMAP.md](docs/ROADMAP.md) — workplan sequencing across
|
||||
phases.
|
||||
- [docs/adr/](docs/adr/) — architecture decision records (ADR-0001 …
|
||||
ADR-0006).
|
||||
- [docs/ASSEMBLY-EXPERIMENT.md](docs/ASSEMBLY-EXPERIMENT.md) — opt-in
|
||||
research line on hand-tuned assembly for hot kernels.
|
||||
- [docs/REVIEW-2026-05-15-intent-and-blueprint.md](docs/REVIEW-2026-05-15-intent-and-blueprint.md)
|
||||
— the SWOT review that triggered this cleanup.
|
||||
|
||||
## Active workplans
|
||||
|
||||
- [WP-0001 — Foundation: scaffold, core kernels, local FS backend](workplans/ARTIFACT-STORE-WP-0001-service-baseline.md)
|
||||
- [WP-0002 — Ingestion API and manifest surface](workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md) (planned)
|
||||
- [WP-0003 — Retention lifecycle](workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md) (planned)
|
||||
- [WP-0004 — S3-compatible backend](workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md) (planned)
|
||||
- [WP-0005 — Guide-board pilot ingestion](workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md) (planned)
|
||||
|
||||
## Agent operating notes
|
||||
|
||||
See [AGENTS.md](AGENTS.md) for the StateHub-integrated session
|
||||
protocol, workplan conventions, and progress-logging contract.
|
||||
|
||||
@@ -1,330 +1,378 @@
|
||||
# Artifact Store Architecture Blueprint
|
||||
# Architecture Blueprint
|
||||
|
||||
Status: draft
|
||||
Created: 2026-05-15
|
||||
Status: accepted (v2 — supersedes 2026-05-15 draft)
|
||||
Updated: 2026-05-15
|
||||
|
||||
## Purpose
|
||||
This document operationalises `INTENT.md`, the `docs/PLATFORM-AMBITION.md`
thesis, and the decisions recorded in `docs/adr/`. Where a tension exists
between this blueprint and an ADR, the ADR wins; raise an issue or
supersede the ADR.
|
||||
|
||||
`artifact-store` provides a generic registry and storage gateway for durable
|
||||
generated artifacts. Producers register packages and files with metadata;
|
||||
storage adapters persist the bytes; retention policy decides how long artifacts
|
||||
remain eligible for retrieval.
|
||||
## Architecture in one paragraph
|
||||
|
||||
The design keeps artifact identity and lifecycle separate from storage
|
||||
implementation. This allows the first version to run against local filesystem
|
||||
storage while the production path can use S3-compatible object storage such as
|
||||
Ceph RGW.
|
||||
`artifact-store` is a **library-first** artifact registry and storage
gateway. A small core library (`artifactstore`) implements identity,
manifests, retention, the storage adapter SPI, the data plane SPI, and
the registry orchestrator. The HTTP server and the CLI are thin
consumers of that library. Bytes are addressed by content
(`blake3:<hex>`) and stored through a pluggable adapter SPI. State is
authoritative in an append-only event log; queryable tables are
materialised views.
|
||||
|
||||
## Architecture Summary
|
||||
## Design lineage
|
||||
|
||||
The shape is deliberately borrowed from `ffmpeg` and `VLC`: a tight
core of well-named modules with stable contracts, runtime-pluggable
backends, a thin orchestration binary, and an explicit hot-path
boundary that can be rewritten in faster code without changing the
consumer API. See `docs/PLATFORM-AMBITION.md` for the reference table.
|
||||
|
||||
## Top-level shape
|
||||
|
||||
```text
|
||||
producer
|
||||
-> Artifact Registry API
|
||||
-> metadata database
|
||||
-> retention policy engine
|
||||
-> audit event log
|
||||
-> storage adapter interface
|
||||
-> local filesystem backend
|
||||
-> S3-compatible backend
|
||||
-> Ceph RGW deployment
|
||||
-> future cloud/blob/archive backends
|
||||
producers / operators / agents
            |
            v
+------------------------+
|   HTTP API   |   CLI   |   <-- thin consumers
+------------------------+
            |
            v
+------------------------+
| registry orchestrator  |
+------------------------+
     |            |           |
     v            v           v
+----------+ +---------+ +---------+
| identity | | events  | |retention|
|/manifest | | (log +  | | policy  |
|          | | views)  | | engine  |
+----------+ +---------+ +---------+
            |
            v
+-----------------------+
|    data plane SPI     |   <-- ADR-0004 contract
+-----------------------+
            |
            v
+-----------------------+
|  storage adapter SPI  |
+-----------------------+
   |        |         |
   v        v         v
+-----+ +------+ +-------+
|local| |  S3  | | Ceph  |   ... future backends
| FS  | | RGW  | | RGW   |
+-----+ +------+ +-------+
|
||||
```
|
||||
|
||||
The registry is the authority for artifact metadata and lifecycle. Backends are
|
||||
responsible for byte storage and retrieval.
|
||||
## Core modules
|
||||
|
||||
## Design Principles
|
||||
Mapped one-to-one to ADR-0005's project layout. Each module has a
|
||||
stable public surface; internals are free to evolve.
|
||||
|
||||
- Backend-neutral registry: no producer should know whether bytes live in Ceph,
|
||||
local disk, or a cloud bucket.
|
||||
- Content-addressable confidence: every stored file has a digest and size.
|
||||
- Retention by default: every package receives an expiry decision at ingestion.
|
||||
- Extensions are explicit: retention extensions and holds are audit events, not
|
||||
silent metadata edits.
|
||||
- Packages remain portable: a manifest should be enough to understand a package
|
||||
without calling the producer.
|
||||
- Statehub links, it does not store bytes: Statehub records artifact IDs and
|
||||
outcomes; artifact-store owns file persistence.
|
||||
- Deletion is deliberate: expiry makes artifacts eligible for deletion; deletion
|
||||
jobs must be auditable and reversible only when the backend still has data.
|
||||
### `identity`
|
||||
|
||||
## Components
|
||||
- `Digest(algorithm, hex)` — value object.
|
||||
- `ContentAddress` — `<algorithm>:<hex>` (ADR-0001).
|
||||
- `digest_stream(reader) -> {primary, sha256}` — single-pass dual digest.
|
||||
- Algorithm registry: `blake3` (default primary), `sha256` (always
|
||||
computed).
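
The single-pass dual digest described for `digest_stream` can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `blake3` package is third-party, so the sketch takes the primary hasher as a parameter and defaults to the standard library's `hashlib.blake2b` as a stand-in. The `DualDigest` type and the `primary_name` / `primary_factory` parameters are hypothetical names.

```python
import hashlib
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass(frozen=True)
class DualDigest:
    """Hypothetical result type: primary content address plus SHA-256."""
    primary: str      # "<algorithm>:<hex>"
    sha256: str       # "sha256:<hex>"
    size_bytes: int


def digest_stream(
    chunks: Iterable[bytes],
    primary_name: str = "blake2b",                      # stand-in for "blake3"
    primary_factory: Callable[[], Any] = hashlib.blake2b,
) -> DualDigest:
    """Feed every chunk to both hashers so the stream is read only once."""
    primary = primary_factory()
    secondary = hashlib.sha256()
    size = 0
    for chunk in chunks:
        primary.update(chunk)
        secondary.update(chunk)
        size += len(chunk)
    return DualDigest(
        primary=f"{primary_name}:{primary.hexdigest()}",
        sha256=f"sha256:{secondary.hexdigest()}",
        size_bytes=size,
    )
```

Because both hashers are updated in the same loop, the cost of the second digest is CPU only; the bytes are never buffered or re-read. Any hasher object exposing `update()` and `hexdigest()` can be swapped in through `primary_factory`.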
|
||||
|
||||
### Registry API
|
||||
### `manifest`
|
||||
|
||||
HTTP API for producers and operators.
|
||||
- `Manifest` — versioned dataclass: package metadata + ordered file list
|
||||
+ retention summary + provenance + storage receipts.
|
||||
- `manifest.codec.encode(manifest) -> bytes` — canonical CBOR
|
||||
(ADR-0003).
|
||||
- `manifest.codec.decode(bytes) -> Manifest`.
|
||||
- `manifest.projection.jcs(manifest) -> bytes` — canonical-JSON
|
||||
projection for display and signing-tool interop.
|
||||
- Round-trip invariant: `decode(encode(m)) == m` and
|
||||
`encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.
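
The first round-trip invariant, `decode(encode(m)) == m`, together with byte-stable encoding, can be illustrated with a small sketch. It deliberately uses sorted-key compact JSON from the standard library as a stand-in for canonical CBOR (ADR-0003), which only approximates a canonical form; the `Manifest` and `FileEntry` shapes here are reduced, hypothetical versions of the real dataclass.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    size_bytes: int
    digest_sha256_hex: str


@dataclass(frozen=True)
class Manifest:
    manifest_version: int
    name: str
    files: tuple[FileEntry, ...]   # ordered file list


def encode(manifest: Manifest) -> bytes:
    """Deterministic bytes: sorted keys, no insignificant whitespace."""
    return json.dumps(
        asdict(manifest), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def decode(data: bytes) -> Manifest:
    """Rebuild the typed manifest, preserving file order."""
    raw = json.loads(data)
    files = tuple(FileEntry(**entry) for entry in raw["files"])
    return Manifest(raw["manifest_version"], raw["name"], files)
```

A property-style check then reads `decode(encode(m)) == m` and `encode(decode(encode(m))) == encode(m)`; the second form is what makes a digest over the encoded bytes a stable identifier.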
|
||||
|
||||
Initial responsibilities:
|
||||
### `events`
|
||||
|
||||
- create artifact packages,
|
||||
- upload or ingest files,
|
||||
- finalize packages,
|
||||
- retrieve package metadata,
|
||||
- list/search packages by subject and producer metadata,
|
||||
- create retention extensions and holds,
|
||||
- expose download metadata or redirect/download endpoints,
|
||||
- expose health and backend status.
|
||||
- `events.write(transaction, event)` — appends one row with monotonic
|
||||
sequence (ADR-0002).
|
||||
- `events.tail(since_sequence) -> AsyncIterator[Event]` — long-poll.
|
||||
- `events.replay(into=ViewWriter)` — rebuild materialised views.
|
||||
- Event types (v1):
|
||||
`v1.package.created`, `v1.file.ingested`, `v1.package.finalized`,
|
||||
`v1.retention.default_applied`, `v1.retention.extended`,
|
||||
`v1.retention.hold_applied`, `v1.retention.hold_released`,
|
||||
`v1.retention.deletion_eligible`, `v1.storage.location_recorded`,
|
||||
`v1.storage.location_verified`, `v1.audit.access`,
|
||||
`v1.system.note`.
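
The write, tail, and replay operations listed above can be sketched with the standard library alone. This is a minimal illustration, assuming SQLite and JSON text payloads in place of PostgreSQL and canonical CBOR; the function names, the reduced schema, and the two handled event types are choices made for the example.

```python
import json
import sqlite3
from typing import Any

SCHEMA = """
CREATE TABLE events (
    sequence   INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type TEXT NOT NULL,
    subject_id TEXT,
    actor      TEXT NOT NULL,
    payload    TEXT NOT NULL
);
CREATE TABLE artifact_packages (
    id                  TEXT PRIMARY KEY,
    name                TEXT NOT NULL,
    status              TEXT NOT NULL,
    last_event_sequence INTEGER NOT NULL
);
"""


def write_event(conn: sqlite3.Connection, event_type: str, subject_id: str,
                actor: str, payload: dict[str, Any]) -> int:
    """Append one event row and return its sequence number."""
    cur = conn.execute(
        "INSERT INTO events (event_type, subject_id, actor, payload) VALUES (?, ?, ?, ?)",
        (event_type, subject_id, actor, json.dumps(payload, sort_keys=True)),
    )
    return int(cur.lastrowid)


def tail(conn: sqlite3.Connection, since_sequence: int) -> list[tuple]:
    """Return events strictly after `since_sequence`, in log order."""
    return conn.execute(
        "SELECT sequence, event_type, subject_id, payload FROM events "
        "WHERE sequence > ? ORDER BY sequence",
        (since_sequence,),
    ).fetchall()


def replay(conn: sqlite3.Connection) -> None:
    """Drop the materialised view and rebuild it from the event log."""
    conn.execute("DELETE FROM artifact_packages")
    for sequence, event_type, subject_id, payload in tail(conn, 0):
        data = json.loads(payload)
        if event_type == "v1.package.created":
            conn.execute(
                "INSERT INTO artifact_packages VALUES (?, ?, 'created', ?)",
                (subject_id, data["name"], sequence),
            )
        elif event_type == "v1.package.finalized":
            conn.execute(
                "UPDATE artifact_packages SET status = 'finalized', "
                "last_event_sequence = ? WHERE id = ?",
                (sequence, subject_id),
            )
```

Since `replay` starts from an empty view and only reads `events`, running it twice yields the same rows, which is the property that makes the views disposable.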
|
||||
|
||||
### Metadata Store
|
||||
### `retention`
|
||||
|
||||
Persistent database for registry state.
|
||||
- `retention.classes` — `transient`, `raw-evidence`, `summary-evidence`,
|
||||
`release-evidence`, `permanent-record`. Defined as data, not code.
|
||||
- `retention.policy.apply(package, class) -> RetentionDecision` —
|
||||
computes `expires_at` and the deletion eligibility rule.
|
||||
- `retention.extend(package, until, reason, actor)` — emits an event;
|
||||
the materialised view updates on commit.
|
||||
- `retention.hold(package, reason, actor)` /
|
||||
`retention.release_hold(hold_id, actor)`.
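
A rough sketch of how a policy might turn a retention class into an `expires_at` and a deletion-eligibility check follows. The durations in the table are placeholders invented for the example (the blueprint lists default durations as an open question), and `apply_policy` / `eligible_for_deletion` are hypothetical names rather than the project's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder durations, NOT project defaults; None means no automatic expiry.
DEFAULT_DURATIONS: dict[str, Optional[timedelta]] = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,
}


@dataclass(frozen=True)
class RetentionDecision:
    retention_class: str
    expires_at: Optional[datetime]
    deletion_strategy: str


def apply_policy(created_at: datetime, retention_class: str) -> RetentionDecision:
    """Compute the expiry decision for a package at ingestion time."""
    if retention_class not in DEFAULT_DURATIONS:
        raise ValueError(f"unknown retention class: {retention_class!r}")
    duration = DEFAULT_DURATIONS[retention_class]
    expires_at = None if duration is None else created_at + duration
    return RetentionDecision(retention_class, expires_at, "mark_eligible")


def eligible_for_deletion(decision: RetentionDecision, now: datetime,
                          active_hold: bool) -> bool:
    """A hold or a missing expiry always blocks eligibility."""
    if active_hold or decision.expires_at is None:
        return False
    return now >= decision.expires_at
```

Keeping the class table as data mirrors the "defined as data, not code" note: an operator could change a duration without touching `apply_policy`.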
|
||||
|
||||
Initial implementation can use SQLite for local development and PostgreSQL for
|
||||
shared service deployments if that matches the surrounding service stack.
|
||||
### `audit`
|
||||
|
||||
Core tables:
|
||||
- A view over `events` filtered to access and lifecycle events. No
|
||||
separate write path; auditing happens by event emission elsewhere.
|
||||
|
||||
- `artifact_packages`
|
||||
- `artifact_files`
|
||||
- `storage_locations`
|
||||
- `retention_rules`
|
||||
- `retention_events`
|
||||
- `audit_events`
|
||||
### `storage` (adapter SPI)
|
||||
|
||||
### Storage Adapter Interface
|
||||
```python
class StorageBackend(Protocol):
    backend_id: str
    async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
    async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
    async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
    async def health(self) -> BackendStatus: ...
```
|
||||
|
||||
Small backend contract used by the API service.
|
||||
- Backend registry: backends register at import time; selection is
|
||||
per-package by configuration.
|
||||
- v1 ships `local` (filesystem); `s3` ships in WP-0004.
|
||||
|
||||
Required operations:
|
||||
### `dataplane` (SPI per ADR-0004)
|
||||
|
||||
- `put(object_key, stream, metadata) -> storage_location`
|
||||
- `get(object_key) -> stream or signed_url`
|
||||
- `head(object_key) -> object_metadata`
|
||||
- `delete(object_key) -> deletion_result`
|
||||
- `health() -> backend_status`
|
||||
```python
class DataPlane(Protocol):
    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
    async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
    async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...
```
|
||||
|
||||
Initial backends:
|
||||
- v1 implementation: `dataplane.inproc` — wraps a `StorageBackend`,
|
||||
computes digests during streaming.
|
||||
- Future implementation: `dataplane.remote` — gRPC or
|
||||
framed-bincode-over-Unix-socket client to a Rust daemon.
|
||||
|
||||
- local filesystem backend for tests and development,
|
||||
- S3-compatible backend for Ceph RGW and cloud object stores.
|
||||
### `registry`
|
||||
|
||||
### Retention Policy Engine
|
||||
The orchestrator. Combines `identity + manifest + events + retention +
dataplane` into the operations the HTTP API and CLI consume:
`create_package`, `ingest_file`, `finalize_package`, `get_manifest`,
`download_file`, `extend_retention`, `apply_hold`, `release_hold`,
`mark_deletion_eligible`, `tail_events`. Each operation is one DB
transaction that writes one or more events and updates materialised
views.
|
||||
|
||||
Applies default rules at ingestion and records later changes.
|
||||
### `api.http` and `cli`
|
||||
|
||||
Initial retention classes:
|
||||
Thin. Their job is to translate transport (HTTP / argv) into calls on
|
||||
`registry`. No business logic.
|
||||
|
||||
- `transient`: short-lived scratch artifacts,
|
||||
- `raw-evidence`: raw logs and run output,
|
||||
- `summary-evidence`: compact reports and summaries,
|
||||
- `release-evidence`: release or customer-facing evidence packages,
|
||||
- `permanent-record`: manually held records with no automatic expiry.
|
||||
## Data model
|
||||
|
||||
Each package stores:
|
||||
All tables exist as **materialised views over `events`** (ADR-0002),
except `events` itself, `retention_classes` (seed data), and
`metadata_schemas` (config).
|
||||
|
||||
- selected retention class,
|
||||
- default retention rule,
|
||||
- computed `expires_at`,
|
||||
- extension records,
|
||||
- hold records,
|
||||
- deletion eligibility state.
|
||||
### `events` (source of truth)
|
||||
|
||||
### Audit Log
|
||||
| Column | Type | Notes |
|---|---|---|
| `sequence` | `BIGSERIAL PRIMARY KEY` | monotonic; intended to be gapless (a plain `BIGSERIAL` can skip values when a transaction rolls back) |
| `created_at` | `TIMESTAMPTZ NOT NULL` | UTC, set by DB default |
| `event_type` | `TEXT NOT NULL` | versioned slug (`v1.…`) |
| `subject_kind` | `TEXT NOT NULL` | `package` / `file` / `retention` / `storage` / `system` |
| `subject_id` | `UUID NULL` | |
| `actor` | `TEXT NOT NULL` | producer or operator identity |
| `payload` | `BYTEA NOT NULL` | canonical CBOR |
| `payload_digest` | `BYTEA NOT NULL` | BLAKE3 of `payload` |
|
||||
|
||||
Append-only record of important events:
|
||||
Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
|
||||
|
||||
- package created,
|
||||
- file uploaded,
|
||||
- package finalized,
|
||||
- retrieval requested,
|
||||
- retention extended,
|
||||
- hold applied or released,
|
||||
- deletion requested,
|
||||
- deletion completed or failed.
|
||||
### `artifact_packages` (materialised view)
|
||||
|
||||
The audit log does not need to be cryptographic in the first release, but the
|
||||
schema should leave room for signed events or external write-once storage later.
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `name` | `TEXT NOT NULL` | |
| `producer` | `TEXT NOT NULL` | |
| `subject` | `TEXT NOT NULL` | |
| `retention_class` | `TEXT NOT NULL` | FK to `retention_classes` |
| `metadata_schema_id` | `UUID NULL` | FK to `metadata_schemas` |
| `metadata` | `JSONB NOT NULL` | validated against schema if present |
| `status` | `TEXT NOT NULL` | `created` / `uploading` / `finalized` / `deletion_eligible` / `deleted` / `failed` |
| `manifest_digest` | `BYTEA NULL` | populated on finalize |
| `created_at`, `finalized_at`, `expires_at` | `TIMESTAMPTZ` | |
| `last_event_sequence` | `BIGINT NOT NULL` | for replay bookkeeping |
|
||||
|
||||
## Data Model
|
||||
### `artifact_files` (materialised view)
|
||||
|
||||
### Artifact Package
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `package_id` | `UUID NOT NULL` | FK |
| `relative_path` | `TEXT NOT NULL` | logical path; unique within package |
| `media_type` | `TEXT NOT NULL` | required (ADR-0006) |
| `size_bytes` | `BIGINT NOT NULL` | |
| `digest_algorithm` | `TEXT NOT NULL` | `blake3` by default (ADR-0001) |
| `digest_primary` | `BYTEA NOT NULL` | bytes of the primary digest |
| `digest_sha256` | `BYTEA NOT NULL` | always populated for interop |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
|
||||
|
||||
Required fields:
|
||||
### `storage_locations` (materialised view)
|
||||
|
||||
- `id`
|
||||
- `name`
|
||||
- `producer`
|
||||
- `subject`
|
||||
- `retention_class`
|
||||
- `status`
|
||||
- `created_at`
|
||||
- `finalized_at`
|
||||
- `expires_at`
|
||||
- `metadata`
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `artifact_file_id` | `UUID NOT NULL` | FK |
| `backend_id` | `TEXT NOT NULL` | |
| `content_address` | `TEXT NOT NULL` | `<algo>:<hex>` |
| `object_key` | `TEXT NOT NULL` | backend-specific, usually derived from `content_address` |
| `storage_class` | `TEXT NULL` | backend-specific label |
| `retrieval_tier` | `TEXT NOT NULL DEFAULT 'hot'` | `hot` / `warm` / `cold` / `archive` |
| `restore_status` | `TEXT NULL` | `available` / `restore_requested` / `restoring` / `restored` / `expired` |
| `status` | `TEXT NOT NULL` | `recorded` / `verified` / `failed` / `deleted` |
| `created_at`, `last_verified_at` | `TIMESTAMPTZ` | |
|
||||
|
||||
Recommended metadata keys:
|
||||
### `retention_state` (materialised view)
|
||||
|
||||
- `repo_slug`
|
||||
- `run_id`
|
||||
- `assessment_id`
|
||||
- `target_profile_ref`
|
||||
- `assessment_profile_ref`
|
||||
- `source_commits`
|
||||
- `tool_versions`
|
||||
- `environment`
|
||||
| Column | Type | Notes |
|---|---|---|
| `package_id` | `UUID PRIMARY KEY` | |
| `current_expires_at` | `TIMESTAMPTZ NULL` | NULL = no expiry (permanent or held) |
| `effective_class` | `TEXT NOT NULL` | |
| `active_hold_id` | `UUID NULL` | |
| `eligible_for_deletion` | `BOOLEAN NOT NULL` | |
|
||||
|
||||
### Artifact File
|
||||
### `retention_classes` (seed data, not derived)
|
||||
|
||||
Required fields:
|
||||
| Column | Type | Notes |
|---|---|---|
| `class_id` | `TEXT PRIMARY KEY` | `transient` / `raw-evidence` / `summary-evidence` / `release-evidence` / `permanent-record` |
| `default_duration` | `INTERVAL NULL` | NULL for `permanent-record` |
| `deletion_strategy` | `TEXT NOT NULL` | `mark_eligible` / `auto_delete_after_grace` (v1 only uses the former) |
|
||||
|
||||
- `id`
|
||||
- `package_id`
|
||||
- `relative_path`
|
||||
- `media_type`
|
||||
- `size_bytes`
|
||||
- `sha256`
|
||||
- `created_at`
|
||||
### `metadata_schemas` (config table)
|
||||
|
||||
### Storage Location
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `slug` | `TEXT NOT NULL UNIQUE` | e.g. `guide-board.run.v1` |
| `json_schema` | `JSONB NOT NULL` | |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
|
||||
|
||||
Required fields:
|
||||
## API shape
|
||||
|
||||
- `id`
|
||||
- `artifact_file_id`
|
||||
- `backend_id`
|
||||
- `object_key`
|
||||
- `storage_class`
|
||||
- `status`
|
||||
- `created_at`
|
||||
- `last_verified_at`
|
||||
|
||||
### Retention Event
|
||||
|
||||
Required fields:
|
||||
|
||||
- `id`
|
||||
- `package_id`
|
||||
- `event_type`
|
||||
- `reason`
|
||||
- `created_by`
|
||||
- `created_at`
|
||||
- `previous_expires_at`
|
||||
- `new_expires_at`
|
||||
|
||||
Event types:
|
||||
|
||||
- `default_rule_applied`
|
||||
- `extended`
|
||||
- `hold_applied`
|
||||
- `hold_released`
|
||||
- `deletion_eligible`
|
||||
- `deleted`
|
||||
|
||||
## API Shape
|
||||
|
||||
Initial endpoints:
|
||||
### Native v1 surface
|
||||
|
||||
```text
|
||||
GET /health
|
||||
GET /backends
|
||||
POST /packages
|
||||
GET /packages
|
||||
GET /packages/{package_id}
|
||||
POST /packages/{package_id}/files
|
||||
POST /packages/{package_id}/finalize
|
||||
GET /packages/{package_id}/manifest
|
||||
GET /files/{file_id}/download
|
||||
POST /packages/{package_id}/retention/extensions
|
||||
POST /packages/{package_id}/retention/holds
|
||||
POST /packages/{package_id}/retention/holds/{hold_id}/release
|
||||
GET    /health
GET    /backends
GET    /retention-classes

POST   /packages                               # create
GET    /packages                               # list, query by metadata
GET    /packages/{package_id}                  # metadata
POST   /packages/{package_id}/files            # single-shot file upload
POST   /packages/{package_id}/finalize         # produce manifest
GET    /packages/{package_id}/manifest         # canonical CBOR (Accept: application/cbor)
GET    /packages/{package_id}/manifest.json    # JCS projection (Accept: application/json)

GET    /files/{file_id}                        # metadata
GET    /files/{file_id}/download               # bytes

POST   /uploads                                # open an upload session (resource shape pinned now)
PATCH  /uploads/{upload_id}                    # range body
POST   /uploads/{upload_id}/complete           # promote to /packages/.../files

POST   /packages/{package_id}/retention/extensions
POST   /packages/{package_id}/retention/holds
POST   /packages/{package_id}/retention/holds/{hold_id}/release

GET    /events?since={sequence}                # long-poll registry change feed
|
||||
```
|
||||
|
||||
The first ingestion path can accept multipart file uploads. A later trusted-local
|
||||
operator endpoint may ingest from server-local paths, but it should be disabled
|
||||
by default because path ingestion changes the security boundary.
|
||||
The `/uploads/...` resource shape is committed now, even if v1
implements it as single-shot internally (see `PLATFORM-AMBITION` commitment A6).
|
||||
|
||||
## Package Manifest
|
||||
### Deferred / not v1
|
||||
|
||||
Every finalized package should expose a JSON manifest containing:
|
||||
- `/v2/…` OCI Distribution endpoints (ADR-0006).
|
||||
- gRPC API.
|
||||
- Streaming CDC topic (NATS / Kafka).
|
||||
- Multi-tenant namespacing in URLs.
|
||||
|
||||
- package metadata,
|
||||
- retention summary,
|
||||
- file list,
|
||||
- file digests and sizes,
|
||||
- storage backend references,
|
||||
- source metadata,
|
||||
- created/finalized timestamps.
|
||||
## Package manifest content (v1)
|
||||
|
||||
For guide-board runs, the manifest should preserve links to:
|
||||
A finalised manifest carries:
|
||||
|
||||
- `run.json`
|
||||
- `retention-summary.json`
|
||||
- `reports/assessment-package.json`
|
||||
- `reports/report.md`
|
||||
- extension-generated scorecards or log reviews,
|
||||
- raw artifact files captured by the assessment package manifest.
|
||||
- `manifest_version: 1`
- `package`: id, name, producer, subject, retention class, created_at,
  finalized_at, expires_at, metadata, metadata_schema_id (nullable).
- `files`: ordered list of `{id, relative_path, media_type, size_bytes,
  digest_algorithm, digest_primary_hex, digest_sha256_hex}`.
- `storage_receipts`: ordered list of `{file_id, backend_id,
  content_address, retrieval_tier, status}` per stored copy.
- `retention_summary`: current class, expires_at, holds, last
  retention event.
- `provenance`: `{source_commits, tool_versions, environment,
  ingest_actor, ingest_timestamps}`. Schema-driven; freeform under a
  registered schema or empty if none.
|
||||
|
||||
## Guide-Board Pilot Flow
|
||||
The manifest digest (`blake3:<hex>`) is the package's canonical
external identifier.
|
||||
|
||||
```text
|
||||
guide-board run directory
|
||||
-> open-cmis-tck scorecard/log review
|
||||
-> artifact-store package create
|
||||
-> upload run files
|
||||
-> finalize manifest
|
||||
-> Statehub record links package id and summary
|
||||
```
|
||||
## Storage backends
|
||||
|
||||
The artifact package should carry:
|
||||
### Local filesystem (v1)
|
||||
|
||||
- run id,
|
||||
- target profile reference,
|
||||
- assessment profile reference,
|
||||
- result status,
|
||||
- source commits for guide-board, open-cmis-tck, and the assessed repository,
|
||||
- important report paths,
|
||||
- retention class `raw-evidence` or `release-evidence`.
|
||||
- Root: configured directory.
- Object key layout: `<root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Atomic write via `fsync(tmpfile) + rename`. No partial states visible.
- Path traversal prevented at the SPI boundary; the local backend
  rejects any key that does not match the expected layout.
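
The key layout, the layout check, and the `fsync` plus rename write can be sketched together. This is a minimal illustration under assumptions: digests are taken to be 64 lowercase hex characters, only `blake3` and `sha256` are accepted as algorithm names, and `object_key`, `validate_key`, and `atomic_write` are hypothetical helper names.

```python
import os
import re
import tempfile
from pathlib import Path

# Expected layout: <algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>
_KEY_RE = re.compile(r"(blake3|sha256)/([0-9a-f]{2})/([0-9a-f]{2})/([0-9a-f]{64})")


def validate_key(key: str) -> None:
    """Reject anything that is not exactly the expected layout (blocks traversal)."""
    match = _KEY_RE.fullmatch(key)
    if match is None:
        raise ValueError(f"key does not match expected layout: {key!r}")
    shard_a, shard_b, hex_digest = match.group(2), match.group(3), match.group(4)
    if hex_digest[0:2] != shard_a or hex_digest[2:4] != shard_b:
        raise ValueError(f"shard directories do not match digest: {key!r}")


def object_key(content_address: str) -> str:
    """Derive the relative object key from '<algorithm>:<hex>'."""
    algorithm, _, hex_digest = content_address.partition(":")
    key = f"{algorithm}/{hex_digest[0:2]}/{hex_digest[2:4]}/{hex_digest}"
    validate_key(key)
    return key


def atomic_write(root: Path, key: str, data: bytes) -> Path:
    """Write to a temp file in the target directory, fsync, then rename into place."""
    validate_key(key)
    target = root / key
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)   # atomic within one filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

Creating the temporary file in the destination directory keeps the rename on a single filesystem, which is what makes it atomic; readers see either no object or the complete one.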
|
||||
|
||||
## Ceph And S3-Compatible Storage
|
||||
### S3-compatible / Ceph RGW (WP-0004)
|
||||
|
||||
Ceph should be introduced through the S3-compatible adapter, not as a special
|
||||
case in producer logic.
|
||||
- Configuration: endpoint, bucket, region, access key ref, secret key
  ref, key prefix, storage class label, optional SSE config.
- Object key: `<prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- Multipart upload for objects above a configurable threshold.
|
||||
|
||||
Configuration should support:
|
||||
## Security boundary (v1)
|
||||
|
||||
- endpoint URL,
|
||||
- bucket,
|
||||
- region,
|
||||
- access key reference,
|
||||
- secret key reference,
|
||||
- optional server-side encryption settings,
|
||||
- object key prefix,
|
||||
- storage class label.
|
||||
- Internal service. No anonymous public access.
- Authenticated producer / operator API. v1 ships shared-secret bearer
  tokens; OIDC integration is its own workplan.
- No secret values in artifact metadata.
- Upload paths are logical, never trusted filesystem paths. Server-local
  path ingestion is *not* offered in v1; this is separate from the
  `/uploads` upload-session resource listed in the v1 surface.
- Download authorisation is checked at the registry layer, never at
  the backend.
|
||||
|
||||
The service should never require credentials in producer request bodies. Use
|
||||
environment variables, mounted secret files, or a local secret provider.
|
||||
## Resolved open questions
|
||||
|
||||
## Future Retrieval Tiers
|
||||
- **Deduplication scope.** Global by content address (ADR-0001).
  Reference-counted deletion via a GC pass (WP-0006, TBD).
- **Deletion ordering.** Mark records `deletion_eligible` first via an
  event. Byte deletion is a separate, audited operation that emits a
  second event. Reverse order is forbidden.
- **Metadata schemas.** Open JSON with optional producer-registered
  JSON Schema; validation at ingest (ADR-0005, `metadata_schemas`).
- **Statehub integration scope.** Statehub keeps package IDs and
  summary; never bytes. The `/events` long-poll is the integration
  point.
|
||||
|
||||
The initial API can treat all stored files as immediately retrievable. Later,
|
||||
storage locations can include:
|
||||
## Outstanding open questions (not blocking v1)
|
||||
|
||||
- `retrieval_tier`: hot, warm, cold, archive,
|
||||
- `restore_status`: available, restore_requested, restoring, restored, expired,
|
||||
- `restore_requested_at`,
|
||||
- `restore_expires_at`.
|
||||
- Identity provider for shared deployments.
|
||||
- Default retention durations per class (operator-configurable; needs
|
||||
one round of stakeholder input).
|
||||
- WASM plugin host design (deferred to its own workplan; see
|
||||
`PLATFORM-AMBITION`).
|
||||
- Federation / mirroring protocol (post-OCI-endpoint workplan).
|
||||
|
||||
The registry API should be able to return "not immediately available" without
|
||||
changing artifact identity.
|
||||
## Roadmap pointer
|
||||
|
||||
## Security Boundary
|
||||
|
||||
Initial service assumptions:
|
||||
|
||||
- internal service, not public internet exposed,
|
||||
- authenticated producer/operator API before shared deployment,
|
||||
- no secret values stored in artifact metadata,
|
||||
- package paths are logical paths, not trusted filesystem paths,
|
||||
- download authorization should be checked at the registry layer.
|
||||
|
||||
Files may contain sensitive evidence. The service must treat metadata and bytes
|
||||
as confidential by default.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Which identity provider should guard shared deployments?
|
||||
- Should package metadata schemas be open-ended JSON or typed by producer?
|
||||
- Should deduplication be package-local only or global by content hash?
|
||||
- Should deletion first mark records deleted, then delete bytes, or reverse that
|
||||
order with compensating events?
|
||||
- How much Statehub integration belongs in this repo versus in Statehub clients?
|
||||
The implementation sequence is in `docs/ROADMAP.md`. The first
|
||||
workplan is `workplans/ARTIFACT-STORE-WP-0001-foundation.md`.
|
||||
|
||||
93
docs/ROADMAP.md
Normal file
@@ -0,0 +1,93 @@
|
||||
# Roadmap
|
||||
|
||||
Status: living document
|
||||
Updated: 2026-05-15
|
||||
|
||||
The roadmap sequences `artifact-store` from "no code" to a credible
|
||||
production v1 to the longer-horizon platform shape recorded in
|
||||
`docs/PLATFORM-AMBITION.md`. Each row is a self-contained workplan with
|
||||
its own acceptance criteria; nothing here is a binding milestone.
|
||||
|
||||
The sequencing principle is **library-first** (ffmpeg-shaped):
|
||||
foundational kernels and contracts before any consumer code. The HTTP
|
||||
server and CLI exist only after the core library can be exercised
|
||||
end-to-end against a local filesystem backend.
|
||||
|
||||
## Phase 0 — Cleanup (done 2026-05-15)
|
||||
|
||||
- ADR-0001 through ADR-0006 accepted.
|
||||
- Architecture blueprint rewritten to v2.
|
||||
- Platform ambition and assembly experiment documented.
|
||||
- Workplans re-sequenced.
|
||||
|
||||
## Phase 1 — Foundation and pilot (v0.1)
|
||||
|
||||
Goal: ingest a real guide-board run end-to-end, against a local
|
||||
filesystem backend, with retention applied and events logged.
|
||||
|
||||
| ID | Title | Carries existing task IDs | Notes |
|
||||
|---|---|---|---|
|
||||
| WP-0001 | Foundation: scaffold, core kernels, local FS backend | T001, T002, T003, T008 | All of the library-shaped modules; no HTTP API yet beyond `/health`. |
|
||||
| WP-0002 | Ingestion API + manifest surface | T004 | The HTTP API. Builds on WP-0001's library. |
|
||||
| WP-0003 | Retention lifecycle | T005 | Retention engine, extensions, holds, deletion eligibility. |
|
||||
| WP-0004 | S3-compatible backend (Ceph RGW target) | T006 | Second concrete adapter. |
|
||||
| WP-0005 | Guide-board pilot ingestion | T007 | First real producer wired up. |
|
||||
|
||||
Exit criteria for v0.1: WP-0001 through WP-0005 done; a guide-board
|
||||
CMIS run round-trips through artifact-store with manifest, retention,
|
||||
and Statehub linkage; backend swappable between local FS and an
|
||||
S3-compatible store.
|
||||
|
||||
## Phase 2 — Production hardening (v0.2 – v0.3)
|
||||
|
||||
| ID | Title | Notes |
|
||||
|---|---|---|
|
||||
| WP-0006 | Garbage collection + reference counting | Required by ADR-0001 global dedup. Mark-eligible already lands in WP-0003; this workplan does the byte-deletion pass. |
|
||||
| WP-0007 | Resumable / chunked upload implementation | The wire shape lands in WP-0002; this workplan makes the implementation actually streaming. |
|
||||
| WP-0008 | Auth, multi-tenancy, quota | OIDC integration; tenant namespacing; per-tenant rate limit and storage quota. |
|
||||
| WP-0009 | Observability: metrics, tracing, structured logs | OpenTelemetry SDK; latency / throughput SLOs published. |
|
||||
| WP-0010 | Event stream out (CDC) | NATS or Kafka topic of registry events; long-poll `/events` becomes a fallback. |
|
||||
| WP-0011 | Signed manifests | Sigstore / cosign integration; signature recorded alongside manifest digest. |
|
||||
|
||||
Exit criteria for v0.3: a deployment is operable by humans without
|
||||
internal knowledge; SLOs are measurable; access is authenticated;
|
||||
artifacts can be signed and verified.
|
||||
|
||||
## Phase 3 — Platform features (v0.4 – v1.0)
|
||||
|
||||
| ID | Title | Notes |
|
||||
|---|---|---|
|
||||
| WP-0012 | OCI artifact `/v2/` endpoint | Implements OCI Distribution Spec on top of the same storage (ADR-0006). |
|
||||
| WP-0013 | Content-defined chunking + global dedup at chunk level | FastCDC; chunked storage. Builds toward `docs/ASSEMBLY-EXPERIMENT.md`. |
|
||||
| WP-0014 | Rust data plane extraction | Move `dataplane.inproc` to `dataplane.remote` (ADR-0004). |
|
||||
| WP-0015 | WASM plugin host | Extension surface for indexers, redactors, scorecard generators. |
|
||||
| WP-0016 | Cold-tier adapters | Glacier / Tape / IA classes; restore flow. |
|
||||
| WP-0017 | Federation and replication | Signed manifest exchange between artifact-store instances. |
|
||||
|
||||
Exit criteria for v1.0: artifact-store is embeddable as a library, runs
|
||||
as a single-binary CLI, runs as a server, speaks OCI, federates between
|
||||
instances, and is fast enough to be a credible commercial substrate.
|
||||
|
||||
## What this roadmap deliberately does NOT promise
|
||||
|
||||
- Specific calendar dates. Cadence is set by sessions, not quarters.
|
||||
- A UI. UIs are out-of-tree (see `docs/PLATFORM-AMBITION.md`).
|
||||
- ML-specific or container-specific features. Use OCI compatibility.
|
||||
- A storage backend for every cloud. Adapters are community surface.
|
||||
|
||||
## How to add a workplan
|
||||
|
||||
1. Pick the next free `ARTIFACT-STORE-WP-NNNN` number.
|
||||
2. Create `workplans/ARTIFACT-STORE-WP-NNNN-<slug>.md` with the
|
||||
frontmatter and task block format in `AGENTS.md`.
|
||||
3. Cite the ADRs the workplan depends on in its `## Constraints`
|
||||
section.
|
||||
4. Append a row to the appropriate phase table in this file.
|
||||
5. Notify the custodian operator to run
|
||||
`make fix-consistency REPO=artifact-store`.
|
||||
|
||||
## How to retire a workplan
|
||||
|
||||
1. Set `status: done` in the frontmatter when all tasks are `done`.
|
||||
2. Move the file to `workplans/archived/YYMMDD-ARTIFACT-STORE-WP-NNNN-<slug>.md`.
|
||||
3. Update this roadmap to reflect the new state.
|
||||
80
docs/adr/0001-content-addressed-storage.md
Normal file
@@ -0,0 +1,80 @@
|
||||
# ADR-0001 — Content-Addressed Storage with Dual Digest
|
||||
|
||||
Status: accepted
|
||||
Date: 2026-05-15
|
||||
Supersedes: —
|
||||
Related: ADR-0003, ADR-0006, `docs/PLATFORM-AMBITION.md` commitments A1, A2, A9
|
||||
|
||||
## Context
|
||||
|
||||
The architecture blueprint as originally drafted addresses stored bytes by
|
||||
logical `(package, relative_path)`. That is sufficient for v1 ingestion but
|
||||
forecloses global deduplication, Merkle integrity proofs, partial
|
||||
replication, federation, and OCI artifact compatibility — all of which the
|
||||
platform ambition requires to remain reachable.
|
||||
|
||||
Independently, the original blueprint pins SHA-256 as the only file digest.
|
||||
SHA-256 with SHA-NI on modern x86 reaches ~1.5–2 GB/s/core. BLAKE3 on the
|
||||
same hardware reaches 6–10+ GB/s/core, parallelises across cores, and its
|
||||
construction *is* a Merkle tree — package-level integrity becomes free.
|
||||
SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we
|
||||
cannot drop it.
|
||||
|
||||
## Decision
|
||||
|
||||
1. The canonical storage key for any byte sequence is its content address
|
||||
in the form `<algorithm>:<lowercase-hex-digest>`. Storage backends store
|
||||
and retrieve by this key. `relative_path` is logical metadata recorded
|
||||
in the manifest, not a storage-layer concept.
|
||||
2. Every `artifact_files` row carries two digest columns:
|
||||
- `digest_primary` — the native digest; default algorithm `blake3`.
|
||||
- `digest_sha256` — always populated for interop, even when `blake3`
|
||||
is the primary.
|
||||
Both are computed in a single ingest pass (one read of the input).
|
||||
3. The schema also carries a `digest_algorithm` column naming the primary
|
||||
algorithm. Additional algorithms are added by new columns or a side
|
||||
table, never by overloading `digest_primary`.
|
||||
4. Storage backend object keys are derived from `digest_primary` only.
|
||||
Migrations between primary algorithms are explicit and audited; they
|
||||
are not silent.
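
The single-pass requirement in item 2 can be sketched as follows. This is an illustration under stated assumptions, not the ingest implementation: `hashlib.blake2b` stands in for BLAKE3 so the example runs on the standard library alone, and the name `dual_digest` is hypothetical. The `blake3` package exposes a hashlib-style `update()` / `hexdigest()` interface, so `blake3.blake3` should be usable as `primary_factory`.

```python
import hashlib
import io
from typing import Any, BinaryIO, Callable


def dual_digest(
    stream: BinaryIO,
    primary_factory: Callable[[], Any] = hashlib.blake2b,  # stand-in for blake3.blake3
    chunk_size: int = 1 << 20,
) -> dict:
    """Compute the primary digest and SHA-256 with one read of the input."""
    primary = primary_factory()
    sha256 = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        # Each chunk is read once and fed to both hashers.
        primary.update(chunk)
        sha256.update(chunk)
        size += len(chunk)
    return {
        "digest_primary": primary.hexdigest(),
        "digest_sha256": sha256.hexdigest(),
        "size": size,
    }


result = dual_digest(io.BytesIO(b"example bytes"))
```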
|
||||
|
||||
## Consequences
|
||||
|
||||
Positive:
|
||||
|
||||
- Global deduplication is automatic — two identical files in two packages
|
||||
share one backend object.
|
||||
- Merkle integrity over a package is free with BLAKE3 (use the tree mode).
|
||||
- Federation, partial mirrors, and OCI compatibility (ADR-0006) become
|
||||
reachable without schema migration.
|
||||
- Verification of a single file does not require fetching its package.
|
||||
|
||||
Negative:
|
||||
|
||||
- Two digests must be computed per ingest. Mitigated by streaming both
|
||||
through one buffer; the bottleneck is I/O, not hashing.
|
||||
- Reference counting: deletion of an `artifact_file` row cannot
|
||||
unconditionally delete the backend object. A garbage-collector pass
|
||||
reconciles references before deleting bytes. This is correct anyway
|
||||
(deletion should be deliberate, per the blueprint).
|
||||
- Producers requesting "store these N bytes at path P" must understand
|
||||
that their P is logical. This is a documentation problem, not a
|
||||
technical one.
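
The reference-counting consequence can be illustrated with a small reconciliation step. This is a sketch only: the row shape (a mapping with a `digest_primary` key) and the function name `unreferenced_objects` are assumptions, and the real garbage collector is the subject of WP-0006.

```python
from collections import Counter
from typing import Iterable, Mapping


def unreferenced_objects(
    file_rows: Iterable[Mapping[str, str]],
    stored_addresses: Iterable[str],
) -> list:
    """Return stored content addresses that no `artifact_files` row references."""
    references = Counter(row["digest_primary"] for row in file_rows)
    # Only objects with zero remaining references are candidates for byte deletion.
    return sorted(addr for addr in stored_addresses if references[addr] == 0)
```

Two packages holding the same file produce two rows but one backend object, so deleting one row leaves the object referenced and therefore not eligible.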
|
||||
|
||||
## Implementation notes
|
||||
|
||||
- v1 ships BLAKE3 via the `blake3` PyPI wheel (Rust core, SIMD-accelerated;
|
||||
no asm we maintain).
|
||||
- v1 ships SHA-256 via stdlib `hashlib` (SHA-NI used when the CPython
|
||||
build links against OpenSSL with SHA-NI support).
|
||||
- A `Digest` value object wraps `(algorithm, hex)`; serialised forms
|
||||
always include the algorithm prefix.
|
||||
- A garbage-collector workplan is filed at WP-0006 (TBD); v1 does not
|
||||
delete bytes automatically — it marks them eligible.
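
One possible shape for the `Digest` value object, given as a sketch. The ADR only fixes the `(algorithm, hex)` pair and the prefixed serialised form; the validation rules and the `parse` method name here are assumptions.

```python
import re
from dataclasses import dataclass

_LOWER_HEX = re.compile(r"[0-9a-f]+")


@dataclass(frozen=True)
class Digest:
    algorithm: str
    hex: str

    def __post_init__(self) -> None:
        # Reject empty algorithms and anything that is not lowercase hex.
        if not self.algorithm or not _LOWER_HEX.fullmatch(self.hex):
            raise ValueError(f"invalid digest: {self.algorithm!r}:{self.hex!r}")

    @classmethod
    def parse(cls, text: str) -> "Digest":
        """Parse the serialised `<algorithm>:<hex>` form."""
        algorithm, sep, hex_part = text.partition(":")
        if not sep:
            raise ValueError(f"missing algorithm prefix: {text!r}")
        return cls(algorithm, hex_part)

    def __str__(self) -> str:
        # The serialised form always carries the algorithm prefix.
        return f"{self.algorithm}:{self.hex}"
```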
|
||||
|
||||
## Status of the original blueprint pin
|
||||
|
||||
The pre-cleanup blueprint's `artifact_files.sha256` column is replaced by
|
||||
`digest_algorithm`, `digest_primary`, `digest_sha256`. The pre-cleanup
|
||||
blueprint's implicit path-keyed storage is replaced by content-keyed
|
||||
storage. These changes are absorbed into `docs/ARCHITECTURE-BLUEPRINT.md`.
|
||||
76
docs/adr/0002-event-log-source-of-truth.md
Normal file
@@ -0,0 +1,76 @@
|
||||
# ADR-0002 — Append-Only Event Log as Source of Truth
|
||||
|
||||
Status: accepted
|
||||
Date: 2026-05-15
|
||||
Related: `docs/PLATFORM-AMBITION.md` commitment A3
|
||||
|
||||
## Context
|
||||
|
||||
The original blueprint defines `audit_events` and `retention_events` as
|
||||
separate tables. Both are useful, but neither is a complete authoritative
|
||||
record of how registry state was produced. Several downstream needs share
|
||||
one underlying primitive:
|
||||
|
||||
- audit (who did what when, with what result),
|
||||
- change-data-capture feed for downstream consumers (Statehub, search),
|
||||
- replication and federation between instances,
|
||||
- point-in-time replay and disaster recovery,
|
||||
- materialised view rebuilds when schemas evolve.
|
||||
|
||||
Each can be served by an append-only log of registry events with a
|
||||
monotonic sequence number. Two separate tables cannot.
|
||||
|
||||
## Decision
|
||||
|
||||
1. The registry persists an append-only `events` table. Every state-
|
||||
changing operation writes one row in the same database transaction as
|
||||
the operation. Once written, rows are immutable.
|
||||
2. Each row has a strictly monotonic, gapless sequence number scoped to
|
||||
the registry instance, and a UTC ingest timestamp.
|
||||
3. The current `artifact_packages`, `artifact_files`, `storage_locations`,
|
||||
and `retention_state` tables are materialised views over `events`.
|
||||
They are rebuildable by replay.
|
||||
4. Event payloads are stored as canonical CBOR (ADR-0003), keyed by
|
||||
`event_type` (string slug). The `event_type` namespace is versioned
|
||||
(`v1.package.created`, `v1.file.ingested`, `v1.retention.extended`,
|
||||
etc.).
|
||||
5. `audit_events` and `retention_events` cease to exist as standalone
|
||||
tables; their semantics are subsets of `events` filtered by
|
||||
`event_type`.
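
To make "rebuildable by replay" concrete, here is a minimal in-memory sketch. The event type slugs come from this ADR; the `Event` class, the `retain_until` payload field, and the dict-based view are hypothetical simplifications of the real tables.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


def replay(events: Iterable[Event], view: Optional[dict] = None, applied: int = 0):
    """Fold events into a package view, skipping anything at or below `applied`."""
    view = dict(view or {})
    for event in sorted(events, key=lambda e: e.sequence):
        if event.sequence <= applied:
            continue  # already reflected in the snapshot
        if event.event_type == "v1.package.created":
            view[event.subject_id] = {"retain_until": event.payload["retain_until"]}
        elif event.event_type == "v1.retention.extended":
            current = view[event.subject_id]
            view[event.subject_id] = {**current, "retain_until": event.payload["retain_until"]}
        applied = event.sequence
    return view, applied
```

Replaying from sequence 0 and replaying from a snapshot stamped with its highest applied sequence should produce the same view, which is the property the snapshot mechanism relies on.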
|
||||
|
||||
## Consequences
|
||||
|
||||
Positive:
|
||||
|
||||
- One primitive serves audit, CDC, replication, replay, and rebuild.
|
||||
- A consumer can tail by `sequence > N` and never miss an event.
|
||||
- Forward-compatibility: new view columns can be derived from existing
|
||||
events by adding a replay path; no migration required.
|
||||
- Signed event chains are reachable later by adding a signature column.
|
||||
|
||||
Negative:
|
||||
|
||||
- Replays cost wall-clock time on large datasets. Snapshots of
|
||||
materialised views (with the highest applied sequence stamped on them)
|
||||
are used to bound replay cost.
|
||||
- Schema migrations on materialised views still happen; they just no
|
||||
longer touch the source of truth.
|
||||
- Discipline required: any write that bypasses the event log is a bug.
|
||||
Enforced by code review and a runtime invariant check on the
|
||||
materialised tables.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
- `events` schema (v1):
|
||||
- `sequence BIGSERIAL PRIMARY KEY`
|
||||
- `created_at TIMESTAMPTZ NOT NULL DEFAULT now()`
|
||||
- `event_type TEXT NOT NULL`
|
||||
- `subject_kind TEXT NOT NULL` — `package` | `file` | `retention` | `storage` | `system`
|
||||
- `subject_id UUID` — nullable for system-level events
|
||||
- `actor TEXT NOT NULL` — producer or operator identity
|
||||
- `payload BYTEA NOT NULL` — canonical CBOR
|
||||
- `payload_digest BYTEA NOT NULL` — BLAKE3 of `payload`
|
||||
- Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
|
||||
- Replay tool ships in v1 as a CLI subcommand (`artifactstore replay`).
|
||||
- Outbound CDC stream (NATS / Kafka) is its own workplan; v1 only exposes
|
||||
long-poll over `GET /events?since=<sequence>`.
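
The schema and the `since` query can be tried locally with SQLite. This sketch makes several substitutions that are assumptions, not decisions: `INTEGER PRIMARY KEY AUTOINCREMENT` replaces `BIGSERIAL`, `TEXT` replaces `UUID` and `TIMESTAMPTZ`, sorted-key JSON stands in for canonical CBOR, and SHA-256 stands in for BLAKE3. The function names are hypothetical.

```python
import hashlib
import json
import sqlite3

SCHEMA = """
CREATE TABLE events (
    sequence       INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    event_type     TEXT NOT NULL,
    subject_kind   TEXT NOT NULL,
    subject_id     TEXT,
    actor          TEXT NOT NULL,
    payload        BLOB NOT NULL,
    payload_digest BLOB NOT NULL
);
CREATE INDEX events_subject ON events (subject_kind, subject_id);
CREATE INDEX events_type_seq ON events (event_type, sequence);
"""


def open_log() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def append_event(conn, event_type, subject_kind, subject_id, actor, payload: dict) -> int:
    """Append one event row and return its sequence number."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    digest = hashlib.sha256(body).digest()  # stand-in for BLAKE3
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO events (event_type, subject_kind, subject_id, actor,"
            " payload, payload_digest) VALUES (?, ?, ?, ?, ?, ?)",
            (event_type, subject_kind, subject_id, actor, body, digest),
        )
    return cursor.lastrowid


def events_since(conn, since: int) -> list:
    """Rows a consumer would receive from `GET /events?since=<sequence>`."""
    rows = conn.execute(
        "SELECT sequence, event_type FROM events WHERE sequence > ? ORDER BY sequence",
        (since,),
    )
    return rows.fetchall()
```

In the real service the insert shares a transaction with the state change it records; the sketch shows only the log side.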
|
||||
78
docs/adr/0003-manifest-canonical-cbor.md
Normal file
@@ -0,0 +1,78 @@
|
||||
# ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.1)
|
||||
|
||||
Status: accepted
|
||||
Date: 2026-05-15
|
||||
Related: ADR-0001, ADR-0002, ADR-0006, `docs/PLATFORM-AMBITION.md` commitment A4
|
||||
|
||||
## Context
|
||||
|
||||
Manifests describe a package's identity, contents, retention, and
|
||||
provenance. They are the durable, portable, signable summary of a package.
|
||||
Three downstream features depend on byte-identical manifest serialisation:
|
||||
|
||||
1. Manifest digest (used as the package's content address — ADR-0001).
|
||||
2. Signatures (cosign, Sigstore, in-toto, SLSA).
|
||||
3. Cross-language / cross-version reproducibility (any client must be
|
||||
able to verify a manifest produced by any other client).
|
||||
|
||||
JSON does not guarantee byte-identical output without an explicit
|
||||
canonicalisation profile. The candidates are:
|
||||
|
||||
- **JCS** (JSON Canonicalization Scheme, RFC 8785) — JSON-shaped, widely
|
||||
available, text-format, signs cleanly.
|
||||
- **Canonical CBOR** (RFC 8949 §4.2.1) — binary, smaller, lower overhead
|
||||
to canonicalise, native in cosign / Sigstore tooling, used by COSE.
|
||||
- **DAG-CBOR** (IPLD profile) — canonical CBOR plus content-addressing
|
||||
conventions; useful if we later integrate with IPLD/IPFS, but pulls in
|
||||
ecosystem assumptions we don't yet need.
|
||||
|
||||
Canonical CBOR wins on size, parser surface, and direct compatibility
|
||||
with the tooling we will adopt for signing (platform-ambition commitments A4, A9). JCS
|
||||
is a reasonable alternative; we keep an emit-JCS path for human-readable
|
||||
display but the signed form is CBOR.
|
||||
|
||||
## Decision
|
||||
|
||||
1. Manifests are serialised as **canonical CBOR** per RFC 8949 §4.2.1:
|
||||
- definite-length encoding throughout,
|
||||
- shortest-form integer encoding,
|
||||
- map keys sorted bytewise lexicographically,
|
||||
- no floating-point unless explicitly required (we do not require it),
|
||||
- no semantic tags except those we explicitly enumerate.
|
||||
2. The manifest's content address is `blake3:<hex>` of its canonical
|
||||
CBOR bytes. This is the package's primary identifier in storage.
|
||||
3. A canonical JSON projection (JCS) of the same manifest is available
|
||||
for display, signing-tool interop, and human inspection. The
|
||||
projection is deterministic: round-tripping through it must yield
|
||||
byte-identical CBOR.
|
||||
4. The manifest schema is itself versioned (`manifest_version: 1`).
|
||||
Unknown fields are preserved on read and re-emitted on write (forward
|
||||
compatibility); breaking schema changes bump the version.
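
The encoding rules in item 1 can be shown with a tiny standard-library encoder. This is a teaching sketch covering only integers, byte strings, text strings, arrays, and maps; v1 is expected to use `cbor2` behind `artifactstore.manifest.codec`, and the name `encode_canonical` is hypothetical.

```python
import struct


def _head(major: int, argument: int) -> bytes:
    """Encode a major type with the shortest possible argument form."""
    if argument < 24:
        return bytes([(major << 5) | argument])
    if argument < 1 << 8:
        return bytes([(major << 5) | 24, argument])
    if argument < 1 << 16:
        return bytes([(major << 5) | 25]) + struct.pack(">H", argument)
    if argument < 1 << 32:
        return bytes([(major << 5) | 26]) + struct.pack(">I", argument)
    if argument < 1 << 64:
        return bytes([(major << 5) | 27]) + struct.pack(">Q", argument)
    raise ValueError("integer out of range for this sketch")


def encode_canonical(value) -> bytes:
    if isinstance(value, bool) or isinstance(value, float):
        raise TypeError("booleans and floats are outside this sketch")
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        data = value.encode("utf-8")
        return _head(3, len(data)) + data
    if isinstance(value, (list, tuple)):
        # Definite-length array: the item count goes in the head.
        return _head(4, len(value)) + b"".join(encode_canonical(v) for v in value)
    if isinstance(value, dict):
        pairs = [(encode_canonical(k), encode_canonical(v)) for k, v in value.items()]
        # Keys are ordered bytewise by their encoded form, not by insertion order.
        pairs.sort(key=lambda pair: pair[0])
        return _head(5, len(pairs)) + b"".join(k + v for k, v in pairs)
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Because key order is derived from the encoded bytes, two producers that build the same manifest in different field orders emit identical bytes and therefore the same digest.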
|
||||
|
||||
## Consequences
|
||||
|
||||
Positive:
|
||||
|
||||
- Manifests are signable today by any tool that consumes CBOR (cosign,
|
||||
ssh-keygen `-Y sign`, COSE libraries).
|
||||
- The manifest digest is stable across languages, OS, and compiler.
|
||||
- Smaller on disk and on the wire than JSON.
|
||||
- Replay (ADR-0002) is unambiguous because event payloads are also CBOR.
|
||||
|
||||
Negative:
|
||||
|
||||
- Less human-readable in raw form; the CLI must offer a `pretty` projection.
|
||||
- One more dependency (a CBOR library). We pin one in ADR-0005.
|
||||
- Future schema evolution requires the same canonicalisation discipline.
|
||||
Enforced by a property-based test: any manifest must round-trip
|
||||
CBOR → JCS → CBOR with byte equality.
|
||||
|
||||
## Implementation notes
|
||||
|
||||
- v1 library: `cbor2` (PyPI; pure-Python with optional C extension).
|
||||
Wrapped behind `artifactstore.manifest.codec` so swapping to a faster
|
||||
impl is transparent.
|
||||
- JCS projection: `jcs` (PyPI) or hand-rolled — decision deferred to
|
||||
WP-0001-T003.
|
||||
- A `Manifest` value class enforces field order on emit, not just on
|
||||
encode. This catches non-canonical producers at the API boundary.
|
||||
79
docs/adr/0004-control-plane-data-plane-contract.md
Normal file
@@ -0,0 +1,79 @@
|
||||
# ADR-0004 — Control Plane / Data Plane Contract
|
||||
|
||||
Status: accepted
|
||||
Date: 2026-05-15
|
||||
Related: ADR-0005, `docs/PLATFORM-AMBITION.md` commitment A5,
|
||||
`docs/ASSEMBLY-EXPERIMENT.md`
|
||||
|
||||
## Context
|
||||
|
||||
The platform ambition expects a Rust (eventually asm-tuned) data plane
|
||||
to handle hot ingest paths — hashing, chunking, optional compression and
|
||||
encryption, storage backend I/O. The v1 service is written entirely in
|
||||
Python (ADR-0005). The cost of conflating control and data planes at the
|
||||
code level is that extracting the data plane later requires API churn,
|
||||
test rework, and producer migrations.
|
||||
|
||||
The cost of separating them now is one named module boundary and one
|
||||
in-process protocol shape. That cost is essentially free if taken
|
||||
before any consumer exists.
|
||||
|
||||
## Decision
|
||||
|
||||
1. The Python package is organised so that *every byte-handling
|
||||
operation* lives behind a named contract:
|
||||
- `artifactstore.dataplane.spi` — the abstract surface (typed
|
||||
dataclasses, async iterator protocols).
|
||||
- `artifactstore.dataplane.inproc` — the v1 implementation, running
|
||||
in the same process as the control plane.
|
||||
2. The control plane (`artifactstore.registry`, `artifactstore.api.http`,
|
||||
`artifactstore.retention`, `artifactstore.audit`) interacts with
|
||||
bytes *only* through the SPI. No HTTP handler, no DB writer, no
|
||||
retention rule ever reads or writes file bytes directly.
|
||||
3. The SPI exposes exactly these operations:
|
||||
- `ingest_stream(stream, hints) -> IngestResult` — consumes an
|
||||
upload, returns content addresses, sizes, and storage receipts.
|
||||
- `serve_object(content_address, range?) -> AsyncIterator[bytes]` —
|
||||
produces bytes for a download.
|
||||
- `verify_object(content_address) -> VerifyResult` — re-reads bytes,
|
||||
re-digests, returns mismatches.
|
||||
- `delete_object(content_address) -> DeletionResult` — best-effort,
|
||||
idempotent.
|
||||
- `backend_health() -> BackendStatus` — readiness, latency, free
|
||||
capacity.
|
||||
4. The SPI surface is the contract a future Rust daemon must satisfy.
|
||||
When that daemon ships, `artifactstore.dataplane.inproc` is replaced
|
||||
by `artifactstore.dataplane.remote` (a thin gRPC or
|
||||
framed-bincode-over-Unix-socket client). The control plane sees no
|
||||
change.
|
||||
5. SPI parameter and return types are CBOR-serialisable today, even when
|
||||
nothing serialises them. This lets us toggle to RPC without rewriting
|
||||
types.
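
A sketch of how the five SPI operations might look as a `typing.Protocol`, with an in-memory fake of the kind tests could use. The operation names come from this ADR; the result-class fields, the `byte_range` parameter name, the `InMemoryDataPlane` class, and the use of SHA-256 for addressing are assumptions made to keep the example self-contained.

```python
import asyncio
import hashlib
from collections.abc import AsyncIterator
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class IngestResult:
    content_address: str
    size: int


@dataclass(frozen=True)
class VerifyResult:
    ok: bool


@dataclass(frozen=True)
class DeletionResult:
    deleted: bool


@dataclass(frozen=True)
class BackendStatus:
    ready: bool


class DataPlane(Protocol):
    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: dict) -> IngestResult: ...
    def serve_object(
        self, content_address: str, byte_range: Optional[tuple] = None
    ) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: str) -> VerifyResult: ...
    async def delete_object(self, content_address: str) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...


class InMemoryDataPlane:
    """Test fake: keeps objects in a dict keyed by content address."""

    def __init__(self) -> None:
        self._objects: dict = {}

    async def ingest_stream(self, stream, hints):
        hasher, chunks = hashlib.sha256(), []
        async for chunk in stream:
            hasher.update(chunk)
            chunks.append(chunk)
        data = b"".join(chunks)
        address = f"sha256:{hasher.hexdigest()}"
        self._objects[address] = data
        return IngestResult(address, len(data))

    async def serve_object(self, content_address, byte_range=None):
        data = self._objects[content_address]
        start, end = byte_range or (0, len(data))
        yield data[start:end]

    async def verify_object(self, content_address):
        data = self._objects[content_address]
        return VerifyResult(ok=content_address == "sha256:" + hashlib.sha256(data).hexdigest())

    async def delete_object(self, content_address):
        # Idempotent: deleting a missing object reports deleted=False, not an error.
        return DeletionResult(deleted=self._objects.pop(content_address, None) is not None)

    async def backend_health(self):
        return BackendStatus(ready=True)


async def _chunked(data: bytes, size: int = 4):
    for offset in range(0, len(data), size):
        yield data[offset:offset + size]


async def round_trip(plane: DataPlane, payload: bytes) -> bytes:
    """Control-plane style caller: touches bytes only through the SPI."""
    result = await plane.ingest_stream(_chunked(payload), {})
    return b"".join([chunk async for chunk in plane.serve_object(result.content_address)])
```

Because the fake and a future remote client satisfy the same protocol structurally, callers such as `round_trip` need no changes when the implementation is swapped.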
|
||||
|
||||
## Consequences
|
||||
|
||||
Positive:
|
||||
|
||||
- The data plane can be rewritten in Rust later with zero API churn.
|
||||
- Tests can fake the SPI cheaply; integration tests pin the contract.
|
||||
- The CLI in `artifactstore.cli` is a second consumer of the SPI on
|
||||
equal footing with the HTTP server.
|
||||
- Operators with strong embedding requirements can use the in-process
|
||||
data plane forever; nothing forces the RPC hop.
|
||||
|
||||
Negative:
|
||||
|
||||
- One extra abstraction layer in v1. Mitigated by the contract being
|
||||
narrow (five operations).
|
||||
- Discipline required: PRs that bypass the SPI are rejected. A linter
|
||||
rule (forbidden import: `artifactstore.api.* -> filesystem`) makes
|
||||
this mechanical.
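
The forbidden-import rule could be approximated with a small `ast` scan. This is a sketch under assumptions: the banned module set below is a guess at what "filesystem" covers, and a real setup would more likely configure an existing linter than ship custom code.

```python
import ast

# Hypothetical list of modules that give direct filesystem access.
BANNED_MODULES = frozenset({"os", "pathlib", "shutil"})


def forbidden_imports(source: str, banned=BANNED_MODULES) -> list:
    """Return the banned module names imported by `source`, sorted."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # Match on the top-level package so `os.path` is caught as `os`.
        found.extend(name for name in names if name.split(".")[0] in banned)
    return sorted(found)
```

Run over every file under `artifactstore/api/`, a non-empty result would fail the check.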
|
||||
|
||||
## Implementation notes
|
||||
|
||||
- The SPI is a `Protocol` (typing.Protocol) in `dataplane/spi.py` so the
|
||||
in-process and future remote impls don't share an inheritance tree.
|
||||
- Streaming returns `AsyncIterator[bytes]` so neither full-file buffering
|
||||
nor `sendfile()` zero-copy is foreclosed.
|
||||
- The `IngestResult` payload is the canonical CBOR-able value used in
|
||||
events (ADR-0002). The same byte sequence flows API → SPI → event.
|
||||
117
docs/adr/0005-v1-tech-stack.md
Normal file
@@ -0,0 +1,117 @@
|
||||
# ADR-0005 — V1 Technology Stack
|
||||
|
||||
Status: accepted
|
||||
Date: 2026-05-15
|
||||
Related: ADR-0001, ADR-0002, ADR-0003, ADR-0004
|
||||
|
||||
## Context
|
||||
|
||||
WP-0001 ("Foundation") cannot start without a pinned stack. The decision
|
||||
needs to balance:
|
||||
|
||||
- ffmpeg / VLC philosophy: minimal dependency budget, sharp boundaries,
|
||||
native code at the hot edges, plain tools.
|
||||
- Python is already implied by `.gitignore` and ecosystem fit (StateHub,
|
||||
guide-board, open-cmis-tck are all Python-leaning).
|
||||
- The data plane will eventually be Rust (ADR-0004); the control plane
|
||||
stays in Python and must stay approachable.
|
||||
|
||||
## Decision
|
||||
|
||||
| Concern | Choice | Rationale |
|
||||
|---|---|---|
|
||||
| Language (control plane) | **Python 3.12+** | Async ecosystem, type hints, matches sibling repos. 3.12 specifically: PEP 695 generics, faster CPython, `sys.monitoring`. |
|
||||
| Package / project manager | **uv** | Single static binary, fast resolver, lockfile-first, replaces `pip + pip-tools + venv + pipx` in one tool. |
|
||||
| Build backend | **hatchling** (via `pyproject.toml`) | Standards-track PEP 517 backend. No magic. |
|
||||
| HTTP framework | **FastAPI** (Starlette + Pydantic v2) | OpenAPI generation, async-native, broad community. |
|
||||
| ASGI server | **uvicorn** (dev), **gunicorn + uvicorn workers** (prod) | Plain, well-understood. |
|
||||
| Database (prod) | **PostgreSQL 16+** | Source-of-truth event log (ADR-0002) wants `BIGSERIAL`, `BYTEA`, advisory locks, logical replication. |
|
||||
| Database (dev/embedded) | **SQLite (WAL mode)** | Zero-dependency local. Schema is portable when we use SQLAlchemy Core. |
|
||||
| DB access | **SQLAlchemy 2.0 Core** + **asyncpg** (prod) / **aiosqlite** (dev) | Core, not ORM — explicit SQL, async drivers. Migrations live below the API surface. |
|
||||
| Migrations | **Alembic** | Standard, integrates with SQLAlchemy Core, supports both pg and sqlite. |
|
||||
| Hashing | stdlib **`hashlib`** for SHA-256, **`blake3`** PyPI wheel for BLAKE3 | `blake3` wheel embeds the SIMD-tuned Rust impl with no build-time toolchain. |
|
||||
| Serialisation | **`cbor2`** for canonical CBOR (ADR-0003); stdlib `json` for JCS or `jcs` PyPI | Smallest deps that satisfy ADR-0003. |
|
||||
| CLI | **typer** (atop click) | Same author and type-hint-driven style as FastAPI; type-driven CLI surface. |
|
||||
| Tests | **pytest** + **httpx** + `pytest-asyncio` (no `trio-asyncio` dependency) | Standard. |
|
||||
| Lint / format | **ruff** (lint + format) | One tool replaces black + isort + flake8 + pyupgrade. |
|
||||
| Type checker | **mypy** in `--strict` | Pyright is acceptable for editor support; CI gate is mypy. |
|
||||
| Logging | stdlib `logging` + `structlog` for structured output | No exotic deps. |
|
||||
| Metrics / tracing | OpenTelemetry SDK (deferred to its own workplan) | Listed for forward-compatibility; not a v1 dep. |
|
||||
|
||||
### Project layout
|
||||
|
||||
```
|
||||
artifact-store/
|
||||
├── pyproject.toml
|
||||
├── uv.lock
|
||||
├── Makefile # thin shim: make dev / test / lint / type / migrate
|
||||
├── alembic.ini
|
||||
├── src/
|
||||
│ └── artifactstore/
|
||||
│ ├── __init__.py
|
||||
│ ├── identity/ # content address, digest abstraction (ADR-0001)
|
||||
│ ├── manifest/ # canonical CBOR, JCS projection (ADR-0003)
|
||||
│ ├── events/ # append-only log + replayer (ADR-0002)
|
||||
│ ├── retention/ # policy engine
|
||||
│ ├── audit/ # audit emission as event subset
|
||||
│ ├── storage/ # adapter SPI + backend registry
|
||||
│ │ ├── spi.py
|
||||
│ │ └── backends/
|
||||
│ │ ├── local.py # filesystem backend
|
||||
│ │ └── s3.py # placeholder, WP-0004
|
||||
│ ├── dataplane/ # SPI + in-process impl (ADR-0004)
|
||||
│ │ ├── spi.py
|
||||
│ │ └── inproc.py
|
||||
│ ├── registry/ # high-level orchestrator
|
||||
│ ├── api/
|
||||
│ │ └── http/ # FastAPI app
|
||||
│ ├── cli/ # typer CLI (thin)
|
||||
│ └── config.py
|
||||
├── tests/
|
||||
│ ├── unit/
|
||||
│ ├── integration/
|
||||
│ └── conftest.py
|
||||
├── migrations/ # alembic
|
||||
└── docs/
|
||||
```
|
||||
|
||||
### Commands (T001 acceptance)
|
||||
|
||||
```
|
||||
make dev # uvicorn with reload, sqlite backend, local FS storage
|
||||
make test # pytest -q
|
||||
make lint # ruff check + ruff format --check
|
||||
make type # mypy --strict src tests
|
||||
make migrate # alembic upgrade head
|
||||
artifactstore # CLI entry point installed by uv
|
||||
```
|
||||
|
||||
## Consequences
|
||||
|
||||
Positive:
|
||||
|
||||
- Dependency budget is small and each dep is best-in-class for its slot.
|
||||
- The same toolchain works on Linux, macOS, and CI without special cases.
|
||||
- `uv.lock` is checked in; builds are reproducible.
|
||||
- Every layer maps one-to-one to a docs concept (identity, manifest,
|
||||
events, dataplane, etc.), so the codebase remains navigable.
|
||||
|
||||
Negative:
|
||||
|
||||
- Pydantic v2 is the heaviest non-DB dep; acceptable for the OpenAPI win.
|
||||
- Choosing SQLAlchemy Core over ORM costs some convenience; we accept
|
||||
it because explicit SQL is easier to migrate to Rust later (ADR-0004).
|
||||
- mypy `--strict` is a per-PR tax; bounded by keeping the codebase small.
|
||||
|
||||
## Revision policy
|
||||
|
||||
This ADR is the most likely candidate for revision once we have profile
|
||||
data from real ingestion. Candidates we are already watching:
|
||||
|
||||
- Replace `cbor2` with a Rust-backed CBOR codec if profile shows it on
|
||||
the hot path.
|
||||
- Replace `uvicorn` with `granian` (Rust ASGI server) if perf demands.
|
||||
- Replace `SQLAlchemy Core` with raw `asyncpg` + a tiny query builder
|
||||
if Core's abstractions show up in flame graphs.
|
||||
|
||||
Each replacement is its own ADR. None of them are v1 work.
|
||||
69
docs/adr/0006-oci-compatibility-reachable.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# ADR-0006 — OCI Artifact Compatibility Kept Reachable
|
||||
|
||||
Status: accepted
|
||||
Date: 2026-05-15
|
||||
Related: ADR-0001, ADR-0003, `docs/PLATFORM-AMBITION.md` commitment A9
|
||||
|
||||
## Context
|
||||
|
||||
The OCI Distribution Specification and the OCI Artifact Manifest define
|
||||
a widely-deployed wire format for content-addressed artifact exchange.
|
||||
The ecosystem includes `oras`, `cosign`, `crane`, Helm, ChartMuseum,
|
||||
ML-model packaging tools, and most container registries. Compatibility
|
||||
with this ecosystem is the single highest-leverage opportunity in
|
||||
`docs/PLATFORM-AMBITION.md`.
|
||||
|
||||
We do not implement OCI compatibility in v1. We do refuse to take any
|
||||
v1 decision that prevents it.
|
||||
|
||||
## Decision
|
||||
|
||||
1. The internal data model is structurally compatible with an OCI
|
||||
artifact manifest. Concretely:
|
||||
- Storage addresses content as `<algorithm>:<lowercase-hex>`
|
||||
(ADR-0001). OCI requires exactly this shape.
|
||||
- Manifests have a `config` blob plus an ordered list of `layers`,
|
||||
each with `mediaType`, `digest`, `size`, and optional
|
||||
`annotations`. Our `Manifest` value class includes all of these
|
||||
fields, even when v1 has no use for `mediaType` or `annotations`.
|
||||
- Manifest serialisation produces byte-identical output across
|
||||
callers (ADR-0003). OCI requires this for the manifest digest.
|
||||
2. The native API may be richer than OCI, but v1 reviews every schema
|
||||
change against the OCI spec and rejects changes that would block
|
||||
later OCI compatibility.
|
||||
3. A future `/v2/` namespace will speak the OCI Distribution Spec on
|
||||
top of the same storage. This is its own workplan; it does not
|
||||
modify v1 endpoints, only add new ones.
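
The structural compatibility in item 1 can be sketched with two small value classes. The field set (`mediaType`, `digest`, `size`, `annotations`, `config`, `layers`) comes from this ADR and `schemaVersion: 2` from the OCI image manifest format; the class names, the `to_oci` method, and the `application/vnd.example.*` media types are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Descriptor:
    media_type: str
    digest: str  # `<algorithm>:<lowercase-hex>`; SHA-256 for the OCI-facing view
    size: int
    annotations: dict = field(default_factory=dict)

    def to_oci(self) -> dict:
        out = {"mediaType": self.media_type, "digest": self.digest, "size": self.size}
        if self.annotations:
            out["annotations"] = dict(self.annotations)
        return out


@dataclass(frozen=True)
class Manifest:
    config: Descriptor
    layers: tuple  # ordered; order is part of the manifest's identity

    def to_oci(self) -> dict:
        return {
            "schemaVersion": 2,
            "config": self.config.to_oci(),
            "layers": [layer.to_oci() for layer in self.layers],
        }


manifest = Manifest(
    config=Descriptor("application/vnd.example.config.v1+json", "sha256:" + "0" * 64, 2),
    layers=(
        Descriptor(
            "application/vnd.example.evidence.v1",
            "sha256:" + "a" * 64,
            1024,
            {"org.example.path": "logs/run.txt"},
        ),
    ),
)
```

Carrying the logical `relative_path` as an annotation, as the example does, is one option for keeping paths out of the storage layer; the ADR does not prescribe it.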
|
||||
|
||||
## Consequences
|
||||
|
||||
Positive:
|
||||
|
||||
- `oras push`, `cosign sign`, `crane copy`, Helm `chart pull` become
|
||||
reachable additions, not rewrites.
|
||||
- Customers who already speak OCI can adopt incrementally.
|
||||
- The `mediaType` discipline forces v1 producers to label their files,
|
||||
which improves the manifest's value as a portable record.
|
||||
|
||||
Negative:
|
||||
|
||||
- v1 carries some otherwise-unnecessary manifest fields. Acceptable;
|
||||
the cost is bytes, not complexity.
|
||||
- The OCI manifest model uses SHA-256 as the canonical digest in
|
||||
practice. ADR-0001's `digest_sha256` column satisfies this; the
|
||||
native primary digest can still be BLAKE3.
|
||||
|
||||
## What this ADR does NOT commit to
|
||||
|
||||
- It does not commit to implementing OCI Distribution in v1.
|
||||
- It does not commit to OCI as the *only* wire format. The native API
|
||||
remains the richer interface.
|
||||
- It does not commit to specific OCI media types for evidence packages.
|
||||
Media-type assignment is the subject of a later workplan.
|
||||
|
||||
## Review trigger
|
||||
|
||||
Every schema-affecting workplan (anything that touches the data model
|
||||
or the manifest shape) must include an explicit one-paragraph review
|
||||
against this ADR. Reject changes that introduce OCI-incompatible
|
||||
invariants without superseding this ADR.
|
||||
32
docs/adr/README.md
Normal file
@@ -0,0 +1,32 @@
|
||||
# Architecture Decision Records
|
||||
|
||||
This directory holds the architectural decisions that govern `artifact-store`.
|
||||
Each ADR is a small Markdown file with a status (`proposed`, `accepted`,
|
||||
`superseded`, `deprecated`), a concise statement of the decision, the
|
||||
forces that pushed it, and the consequences.
|
||||
|
||||
ADRs are the canonical home for "we are doing X" statements that survive
|
||||
multiple workplans. `INTENT.md` says what we build; `SCOPE.md` says where
|
||||
the boundary is; `docs/PLATFORM-AMBITION.md` says where we are pointed;
|
||||
ADRs say how — and they are the only document that records a *changeable*
|
||||
decision in a form that can be superseded cleanly.
|
||||
|
||||
Workplans cite the ADRs they depend on. The architecture blueprint cites
|
||||
the ADRs it operationalises.
|
||||
|
||||
## Index
|
||||
|
||||
- [ADR-0001 — Content-Addressed Storage with Dual Digest](0001-content-addressed-storage.md) — accepted
|
||||
- [ADR-0002 — Append-Only Event Log as Source of Truth](0002-event-log-source-of-truth.md) — accepted
|
||||
- [ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2)](0003-manifest-canonical-cbor.md) — accepted
|
||||
- [ADR-0004 — Control Plane / Data Plane Contract](0004-control-plane-data-plane-contract.md) — accepted
|
||||
- [ADR-0005 — V1 Technology Stack](0005-v1-tech-stack.md) — accepted
|
||||
- [ADR-0006 — OCI Artifact Compatibility Kept Reachable](0006-oci-compatibility-reachable.md) — accepted
|
||||
|
||||
## Conventions
|
||||
|
||||
- Filenames: `NNNN-kebab-case-slug.md`, numbered in acceptance order.
|
||||
- Status transitions: `proposed → accepted → (superseded | deprecated)`.
|
||||
- Supersession is explicit: the new ADR links the old; the old ADR links
|
||||
forward and changes status. Never delete an ADR.
|
||||
- Each ADR is short. If it is long, it is wrong: split it.
|
||||
@@ -1,7 +1,7 @@
|
||||
---
|
||||
id: ARTIFACT-STORE-WP-0001
|
||||
type: workplan
|
||||
title: "Artifact Store Service Baseline"
|
||||
title: "Foundation: Scaffold, Core Kernels, Local FS Backend"
|
||||
repo: artifact-store
|
||||
domain: stack
|
||||
status: active
|
||||
@@ -14,51 +14,53 @@ updated: "2026-05-15"
|
||||
state_hub_workstream_id: "aebf996c-8721-4e8c-9e56-61d5e4bf8dcb"
|
||||
---
|
||||
|
||||
# ARTIFACT-STORE-WP-0001: Artifact Store Service Baseline
|
||||
# ARTIFACT-STORE-WP-0001: Foundation — Scaffold, Core Kernels, Local FS Backend
|
||||
|
||||
## Purpose
|
||||
|
||||
Implement the first usable artifact registry and storage gateway. The service
|
||||
should preserve artifact packages, index their metadata, delegate bytes to a
|
||||
configured storage backend, apply default retention rules, and expose stable
|
||||
package identifiers that Statehub and producer repositories can link to.
|
||||
Stand up the smallest credible `artifact-store` core. By the end of
|
||||
this workplan, the library can ingest a directory of files into a
|
||||
package, compute dual digests, write canonical-CBOR manifests, persist
|
||||
state through the append-only event log, store bytes on local
|
||||
filesystem, and replay materialised views from the event log. No HTTP
|
||||
API yet (that lands in WP-0002); a `/health` endpoint exists so that
|
||||
the dev loop has something to hit.
|
||||
|
||||
The first producer target is a guide-board assessment run, including OpenCMIS TCK
|
||||
reports and raw assessment artifacts.
|
||||
The shape is **library-first** (ffmpeg-style). HTTP server and CLI are
|
||||
explicitly thin consumers of `artifactstore.registry`.
|
||||
|
||||
## Background
|
||||
## Constraints (must satisfy)
|
||||
|
||||
Guide-board can already produce self-contained run directories with retention
|
||||
summaries, assessment packages, raw artifacts, scorecards, and log reviews. Those
|
||||
directories should not live only in `/tmp`, and committing raw evidence into
|
||||
producer repositories is the wrong long-term shape.
|
||||
|
||||
`artifact-store` becomes the shared preservation layer:
|
||||
|
||||
- producers generate files,
|
||||
- artifact-store registers and stores them,
|
||||
- Statehub records the work outcome and links to the registry package,
|
||||
- storage backends handle durable bytes.
|
||||
|
||||
Ceph is the likely self-hosted production backend through its S3-compatible RGW
|
||||
interface, but the service must keep the backend interface generic.
|
||||
|
||||
## Target Architecture
|
||||
|
||||
```text
|
||||
producer package
|
||||
-> registry API
|
||||
-> metadata database
|
||||
-> retention policy engine
|
||||
-> storage adapter
|
||||
-> local filesystem or S3-compatible object storage
|
||||
```
|
||||
- ADR-0001 — content-addressed storage with dual digest.
|
||||
- ADR-0002 — append-only event log as source of truth.
|
||||
- ADR-0003 — manifest canonicalisation = canonical CBOR.
|
||||
- ADR-0004 — control plane / data plane SPI named.
|
||||
- ADR-0005 — v1 technology stack pinned (Python 3.12, uv, FastAPI,
|
||||
SQLAlchemy Core, asyncpg, alembic, cbor2, blake3, ruff, mypy, pytest).
|
||||
- ADR-0006 — OCI compatibility kept reachable.
|
||||
- `docs/ARCHITECTURE-BLUEPRINT.md` data model and module layout.
|
||||
|
||||
## Boundary
|
||||
|
||||
This workplan owns the first service implementation and API contract. It does
|
||||
not need to build a UI, implement cold-storage restore tiers, replace Statehub,
|
||||
or provide formal records-management certification.
|
||||
This workplan builds the library and a minimal `/health` endpoint. It
|
||||
does NOT implement: package CRUD HTTP API (WP-0002), retention rules
|
||||
beyond the seed (WP-0003), S3-compatible backend (WP-0004), guide-board
|
||||
producer wiring (WP-0005), GC of unreferenced bytes (WP-0006).
|
||||
|
||||
## Target architecture (this workplan)
|
||||
|
||||
```text
|
||||
artifactstore (library)
|
||||
identity ──┐
|
||||
manifest ──┼──> registry (orchestrator) ──> events (WAL + views)
|
||||
events ───┘ │
|
||||
retention (seed only) └──> dataplane.spi ──> dataplane.inproc ──> storage.spi ──> storage.backends.local
|
||||
audit (view) └──> filesystem
|
||||
storage.spi
|
||||
dataplane.spi + inproc
|
||||
api.http (just /health)
|
||||
cli (just `artifactstore version`, `artifactstore migrate`, `artifactstore replay`)
|
||||
```
|
||||
|
||||
## D1.1 - Service Scaffold And Repository Identity
|
||||
|
||||
@@ -71,14 +73,71 @@ state_hub_task_id: "84209430-ec3b-4c5e-924e-019c25434230"
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Replace the seed README with artifact-store service instructions.
|
||||
- Add a Python service scaffold with a clear package/module layout.
|
||||
- Provide a local development command.
|
||||
- Provide a test command.
|
||||
- Keep generated artifact bytes and local databases ignored by git.
|
||||
- Document required environment variables.
|
||||
- `pyproject.toml` with `hatchling` build backend, pinned dependencies
|
||||
per ADR-0005.
|
||||
- `uv.lock` committed.
|
||||
- `Makefile` exposes: `make dev`, `make test`, `make lint`, `make
|
||||
type`, `make migrate`. Each target is a thin shim, no logic inline.
|
||||
- `src/artifactstore/` package skeleton matches ADR-0005's layout
|
||||
(empty `__init__.py` and one placeholder module per top-level
|
||||
concern: `identity`, `manifest`, `events`, `retention`, `audit`,
|
||||
`storage`, `dataplane`, `registry`, `api/http`, `cli`, `config`).
|
||||
- `tests/{unit,integration}/conftest.py` in place.
|
||||
- `.env.example` documents required environment variables:
|
||||
`ARTIFACTSTORE_DATABASE_URL`, `ARTIFACTSTORE_STORAGE_LOCAL_ROOT`,
|
||||
`ARTIFACTSTORE_LOG_LEVEL`.
|
||||
- CI-equivalent local commands: `make lint && make type && make test`
|
||||
pass on a clean checkout.
|
||||
- `README.md` replaces the seed README: install with `uv sync`, run
|
||||
with `make dev`, test with `make test`, links to ADRs and blueprint.
|
||||
|
||||
## D1.2 - Registry Data Model
|
||||
## D1.2 - Digest Abstraction And Content Address
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T009
|
||||
status: todo
|
||||
priority: high
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `identity.Digest` value type with `algorithm: str` and `hex: str`,
|
||||
immutable, hashable.
|
||||
- `identity.ContentAddress` — string-form `<algorithm>:<hex>` with
|
||||
validating parser and emitter.
|
||||
- `identity.digest_stream(reader) -> {primary: Digest, sha256: Digest}` —
|
||||
single-pass dual-hash over an `AsyncIterator[bytes]`. Default primary
|
||||
algorithm: `blake3`.
|
||||
- Algorithm registry with `blake3` and `sha256` registered at import.
|
||||
- Property test: digest over random byte sequences round-trips through
|
||||
serialisation; `sha256` matches `hashlib.sha256(...).hexdigest()`;
|
||||
`blake3` matches `blake3.blake3(...).hexdigest()`.
|
||||
|
||||
## D1.3 - Manifest Codec (Canonical CBOR + JCS Projection)
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T010
|
||||
status: todo
|
||||
priority: high
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `manifest.Manifest` dataclass with the v1 fields enumerated in the
|
||||
blueprint (`manifest_version=1`, package, files, storage_receipts,
|
||||
retention_summary, provenance).
|
||||
- `manifest.codec.encode(m) -> bytes` produces canonical CBOR
|
||||
(RFC 8949 §4.2.2): definite-length, shortest-form integers,
|
||||
sorted map keys.
|
||||
- `manifest.codec.decode(b) -> Manifest`.
|
||||
- `manifest.projection.jcs(m) -> bytes` produces RFC 8785 canonical
|
||||
JSON.
|
||||
- Property test: `decode(encode(m)) == m` for randomly-generated
|
||||
manifests; `encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.
|
||||
- Manifest digest helper: `manifest_digest(m) -> ContentAddress` using
|
||||
BLAKE3 over the canonical CBOR bytes.
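The three encoding rules listed above (definite lengths, shortest-form integers, sorted map keys) can be illustrated with a tiny hand-rolled encoder. This is a sketch for a small subset of types only, not the `cbor2`-based codec the workplan calls for; the function names are hypothetical, and floats and tags are deliberately left out.

```python
def _head(major: int, n: int) -> bytes:
    """Encode a major type plus argument using the shortest possible form."""
    if n < 24:
        return bytes([(major << 5) | n])
    for extra, width in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if n < 1 << (8 * width):
            return bytes([(major << 5) | extra]) + n.to_bytes(width, "big")
    raise ValueError("integer does not fit in 64 bits")


def encode(value) -> bytes:
    """Deterministically encode None, bool, int, bytes, str, list, and dict."""
    if value is None:
        return b"\xf6"
    if isinstance(value, bool):  # must come before the int check
        return b"\xf5" if value else b"\xf4"
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, (list, tuple)):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        # Sort entries by the bytes of their encoded keys, so insertion
        # order never influences the output.
        items = sorted((encode(k), encode(v)) for k, v in value.items())
        return _head(5, len(items)) + b"".join(k + v for k, v in items)
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Two manifests with the same content but different dict insertion order encode to the same bytes, which is the property the manifest digest relies on.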
|
||||
|
||||
## D1.4 - Registry Data Model And Migrations
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T002
|
||||
@@ -89,16 +148,44 @@ state_hub_task_id: "e5249a39-46a2-4b56-813e-0339c52cd14e"
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Define persistent models for artifact packages, files, storage locations,
|
||||
retention rules, retention events, and audit events.
|
||||
- Store package metadata as structured JSON while keeping core query fields
|
||||
explicit.
|
||||
- Record package lifecycle status: created, uploading, finalized, deleted, and
|
||||
failed.
|
||||
- Record file `sha256`, size, media type, and logical relative path.
|
||||
- Add migrations or a reproducible schema initialization path.
|
||||
- Alembic configured with `migrations/` directory; `alembic upgrade
|
||||
head` works against both SQLite (dev) and PostgreSQL (prod).
|
||||
- `events`, `artifact_packages`, `artifact_files`, `storage_locations`,
|
||||
`retention_classes`, `retention_state`, `metadata_schemas` tables
|
||||
match the blueprint schema.
|
||||
- Seed migration populates `retention_classes` with the five v1 entries.
|
||||
- A `make migrate` and `make migrate-fresh` target work end-to-end on
|
||||
a clean DB.
|
||||
- All schema columns required by ADR-0001 (`digest_algorithm`,
|
||||
`digest_primary`, `digest_sha256`, `content_address`), ADR-0002
|
||||
(full `events` table), and the blueprint's `retrieval_tier` and
|
||||
`restore_status` are present.
|
||||
|
||||
## D1.3 - Local Filesystem Storage Backend
|
||||
## D1.5 - Event Log Persistence And Replay
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T011
|
||||
status: todo
|
||||
priority: high
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `events.write(transaction, Event)` writes one row in the given DB
|
||||
transaction. Sequence numbers are assigned by the DB
|
||||
(`BIGSERIAL`) and are guaranteed monotonic and gapless within a
|
||||
registry instance.
|
||||
- `events.tail(since_sequence) -> AsyncIterator[Event]` long-polls
|
||||
the table (notify-style on PostgreSQL via `LISTEN/NOTIFY`,
|
||||
poll-style on SQLite).
|
||||
- `events.replay(into=ViewWriter)` rebuilds all materialised view
|
||||
tables from `events` deterministically.
|
||||
- Test: ingesting a fixed sequence of events, then rebuilding the
|
||||
views from scratch, yields byte-identical materialised state.
|
||||
- Event payloads use canonical CBOR (`manifest.codec`) so the same
|
||||
bytes flow through registry → DB → tail consumer without re-encoding.
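The replay property can be sketched with an in-memory fold. This is an illustration only: the real views are database tables, and apart from `v1.package.finalized` the event kinds and payload fields below are hypothetical names chosen for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sequence: int
    kind: str
    package_id: str
    payload: dict


def apply(view: dict, event: Event) -> None:
    """Fold one event into the materialised view (a plain dict here)."""
    if event.kind == "v1.package.created":  # hypothetical event kind
        view[event.package_id] = {"status": "created", "files": []}
    elif event.kind == "v1.file.ingested":  # hypothetical event kind
        view[event.package_id]["files"].append(event.payload["relative_path"])
    elif event.kind == "v1.package.finalized":
        view[event.package_id]["status"] = "finalized"
    else:
        raise ValueError(f"unknown event kind: {event.kind}")


def replay(events) -> dict:
    """Rebuild the view from scratch, strictly in sequence order."""
    view: dict = {}
    for event in sorted(events, key=lambda e: e.sequence):
        apply(view, event)
    return view


def view_bytes(view: dict) -> bytes:
    """Stable serialisation used to compare two views byte for byte."""
    return json.dumps(view, sort_keys=True, separators=(",", ":")).encode()
```

The determinism test then reduces to comparing `view_bytes` of the live view with `view_bytes(replay(log))`.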
|
||||
|
||||
## D1.6 - Storage Adapter SPI And Local Filesystem Backend
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T003
|
||||
@@ -109,90 +196,81 @@ state_hub_task_id: "68f9a752-0012-4cc1-8768-ec3f75295e7a"
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Implement a storage adapter interface.
|
||||
- Implement a local filesystem backend for development and tests.
|
||||
- Store objects under deterministic package/file keys.
|
||||
- Prevent path traversal and accidental writes outside the configured storage
|
||||
root.
|
||||
- Add backend health reporting.
|
||||
- Add tests for put, get, head, and delete operations.
|
||||
- `storage.spi.StorageBackend` Protocol matches the blueprint.
|
||||
- `storage.backends.local.LocalBackend` implements the SPI:
|
||||
- Object key layout `<root>/<algo>/<hex[0:2]>/<hex[2:4]>/<hex>`.
|
||||
- Atomic write via `fsync(tmpfile) + rename`.
|
||||
- Path traversal rejected at the SPI boundary.
|
||||
- `health()` returns disk usage and root accessibility.
|
||||
- Backend registry resolves by `backend_id` string (per ADR-0004).
|
||||
- Unit tests cover: put, get, head, delete, double-put idempotency,
|
||||
delete-of-missing, range read.
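A minimal sketch of the key layout and the fsync-then-rename write, assuming the content address has already been computed by the caller. The function names are hypothetical, and the four-character minimum on the hex part is an assumption made so that the two fan-out directories always exist.

```python
import os
import re
import tempfile
from pathlib import Path

# Strict address pattern: no slashes or dots can reach the filesystem.
_ADDRESS = re.compile(r"([a-z0-9]+):([0-9a-f]{4,})")


def object_path(root: Path, content_address: str) -> Path:
    """Map '<algo>:<hex>' to <root>/<algo>/<hex[0:2]>/<hex[2:4]>/<hex>."""
    match = _ADDRESS.fullmatch(content_address)
    if match is None:
        raise ValueError(f"rejected content address: {content_address!r}")
    algo, hex_ = match.groups()
    return root / algo / hex_[0:2] / hex_[2:4] / hex_


def atomic_put(root: Path, content_address: str, data: bytes) -> Path:
    """Write to a temp file, fsync it, then rename it into place."""
    target = object_path(root, content_address)
    if target.exists():
        return target  # double-put of the same address is a no-op
    target.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic within one filesystem
    except BaseException:
        Path(tmp_name).unlink(missing_ok=True)
        raise
    return target
```

Validating the address with a strict pattern is one way to reject path traversal at the boundary: nothing derived from caller input can contain `/` or `..`. A production backend may also want to fsync the parent directory after the rename, which this sketch omits.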
|
||||
|
||||
## D1.4 - Package Ingestion API
|
||||
## D1.7 - Data Plane SPI And In-Process Implementation
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T004
|
||||
id: ARTIFACT-STORE-WP-0001-T012
|
||||
status: todo
|
||||
priority: high
|
||||
state_hub_task_id: "e3879111-4be9-4731-8aea-15abb874f960"
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Add endpoints to create a package, upload files, finalize a package, retrieve
|
||||
package metadata, list packages, and download files.
|
||||
- Compute file hashes server-side during ingestion.
|
||||
- Reject duplicate logical paths within one package unless explicitly replacing
|
||||
a non-finalized file.
|
||||
- Produce a package manifest after finalization.
|
||||
- Add API tests covering successful ingestion and validation failures.
|
||||
- `dataplane.spi.DataPlane` Protocol matches ADR-0004.
|
||||
- `dataplane.inproc.InProcessDataPlane` implements all five operations
|
||||
on top of a configured `StorageBackend`.
|
||||
- `ingest_stream` computes both digests in a single pass, writes to
|
||||
the backend keyed by the primary content address, and returns an
|
||||
`IngestResult` containing both digests, size, and the
|
||||
`StorageReceipt`.
|
||||
- `serve_object` and `verify_object` re-read bytes through the
|
||||
backend; `verify_object` re-digests and returns mismatches if any.
|
||||
- Lint rule (or test): no code outside `dataplane.*` imports
|
||||
`storage.backends.*` directly.
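The import-boundary rule in the last bullet could be enforced by a test along these lines. This is a sketch that inspects absolute imports with the standard `ast` module; the function name and the decision to ignore relative imports are assumptions, and a real check would walk every file under `src/artifactstore/`.

```python
import ast

FORBIDDEN = "artifactstore.storage.backends"
ALLOWED_IMPORTER = "artifactstore.dataplane"


def _under(name: str, prefix: str) -> bool:
    """True if `name` is `prefix` itself or a submodule of it."""
    return name == prefix or name.startswith(prefix + ".")


def forbidden_imports(module_name: str, source: str) -> list[str]:
    """Return backend imports found in `source`, unless the importer is allowed."""
    if _under(module_name, ALLOWED_IMPORTER):
        return []
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Also catch `from artifactstore.storage import backends`.
            names = [node.module]
            names += [f"{node.module}.{alias.name}" for alias in node.names]
        else:
            continue  # relative imports are ignored in this sketch
        found.update(n for n in names if _under(n, FORBIDDEN))
    return sorted(found)
```

A pytest case would call `forbidden_imports` for each module and assert that the combined result is empty.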
|
||||
|
||||
## D1.5 - Retention Baseline
|
||||
## D1.8 - Registry Orchestrator (Library Surface)
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T005
|
||||
id: ARTIFACT-STORE-WP-0001-T013
|
||||
status: todo
|
||||
priority: high
|
||||
state_hub_task_id: "2d6cbd83-c348-45ad-a223-7870a3412225"
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Seed default retention classes for transient, raw-evidence, summary-evidence,
|
||||
release-evidence, and permanent-record.
|
||||
- Apply a default `expires_at` when a package is created or finalized.
|
||||
- Add endpoints to extend retention and apply or release holds.
|
||||
- Record retention changes as retention events and audit events.
|
||||
- Expose deletion eligibility without deleting bytes automatically in the first
|
||||
implementation.
|
||||
- `registry.Registry` exposes: `create_package`, `ingest_file`,
|
||||
`finalize_package`, `get_manifest_bytes` (CBOR + JCS), `get_file`,
|
||||
`tail_events`, plus stubs for the retention operations so that
WP-0003 only has to fill them in.
|
||||
- Each mutating operation is one DB transaction that writes events
|
||||
AND updates materialised views.
|
||||
- Finalisation writes one `v1.package.finalized` event whose payload
|
||||
*is* the canonical CBOR manifest, and stamps `manifest_digest` on
|
||||
`artifact_packages`.
|
||||
- Duplicate `relative_path` within one not-yet-finalised package is
|
||||
rejected unless an explicit replace is requested.
|
||||
- Integration test: end-to-end ingest of a 3-file package against
|
||||
local backend → finalize → read manifest → verify digests
|
||||
→ tail events → replay rebuilds identical state.
|
||||
|
||||
## D1.6 - S3-Compatible Backend Design Hook
|
||||
## D1.9 - Minimal HTTP App And CLI
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T006
|
||||
id: ARTIFACT-STORE-WP-0001-T014
|
||||
status: todo
|
||||
priority: medium
|
||||
state_hub_task_id: "7b980a55-2364-48c3-98ac-081629a8d2b7"
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Define configuration fields for an S3-compatible backend.
|
||||
- Keep the adapter contract compatible with Ceph RGW.
|
||||
- Add an implementation stub or feature-flagged backend if dependencies are not
|
||||
ready.
|
||||
- Document expected Ceph/S3 configuration without requiring a live Ceph service
|
||||
for baseline tests.
|
||||
- `api.http.app` is a FastAPI app with one route: `GET /health`
|
||||
reporting registry liveness, DB connectivity, and backend health.
|
||||
- `cli` exposes `artifactstore version`, `artifactstore migrate`,
|
||||
`artifactstore replay`, `artifactstore health`.
|
||||
- `make dev` starts the API on `127.0.0.1:8000` with SQLite +
|
||||
local FS backend by default.
|
||||
|
||||
## D1.7 - Guide-Board Pilot Ingestion
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T007
|
||||
status: todo
|
||||
priority: high
|
||||
state_hub_task_id: "eb822821-353c-4cd2-95bf-acb2f084b7ea"
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Provide a CLI helper or documented curl flow to register a guide-board run
|
||||
directory as one package.
|
||||
- Preserve guide-board run metadata: run id, target profile, assessment profile,
|
||||
evidence result counts, finding counts, source commits, and report paths.
|
||||
- Ingest the CMIS pilot run shape, including scorecard and log-review reports.
|
||||
- Return a package id suitable for recording in Statehub.
|
||||
- Add a fixture-based test that does not require the real OpenCMIS TCK.
|
||||
|
||||
## D1.8 - Operator Documentation And Handoff
|
||||
## D1.10 - Operator Documentation And ADR Cross-Linking
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0001-T008
|
||||
@@ -203,27 +281,33 @@ state_hub_task_id: "9b60036c-61f2-4c22-ad31-7213473d42d0"
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Document local run, test, and package ingestion commands.
|
||||
- Document retention behavior and extension flow.
|
||||
- Document the boundary between artifact-store and Statehub.
|
||||
- Include a dev-agent handoff section listing the first implementation order.
|
||||
- Keep architecture docs aligned with the implemented API.
|
||||
- `README.md` updated with current run / test / migrate commands.
|
||||
- `AGENTS.md` "Current Repo Shape" section reflects the scaffold.
|
||||
- A `docs/OPERATOR.md` page documents environment variables, local
|
||||
vs PostgreSQL setup, replay command, and a smoke-test recipe.
|
||||
- Every ADR is cross-linked from at least one of: blueprint, this
|
||||
workplan, or `OPERATOR.md`.
|
||||
|
||||
## Suggested Implementation Order
|
||||
## Suggested implementation order
|
||||
|
||||
1. Service scaffold, test harness, and README.
|
||||
2. Metadata models and local database setup.
|
||||
3. Local filesystem storage adapter.
|
||||
4. Package create/upload/finalize/download API.
|
||||
5. Retention defaults, extension, hold, and audit events.
|
||||
6. Guide-board run ingestion helper.
|
||||
7. S3-compatible backend configuration and Ceph notes.
|
||||
1. T001 — scaffold and tooling (no other task can start without this).
|
||||
2. T009 — digest abstraction (unblocks T010, T012).
|
||||
3. T010 — manifest codec (unblocks T013).
|
||||
4. T002 — schema and migrations (unblocks T011, T013).
|
||||
5. T011 — event log + replay.
|
||||
6. T003 — storage SPI + local backend.
|
||||
7. T012 — data plane SPI + in-process impl.
|
||||
8. T013 — registry orchestrator.
|
||||
9. T014 — minimal HTTP app and CLI.
|
||||
10. T008 — docs.
|
||||
|
||||
## First Pilot Success Criteria
|
||||
## Success criteria
|
||||
|
||||
- A completed guide-board CMIS run can be ingested from a local directory.
|
||||
- The package manifest lists every stored file with SHA-256 and size.
|
||||
- The registry returns a stable package id.
|
||||
- Files can be downloaded through the service.
|
||||
- Default retention is visible and can be extended.
|
||||
- Statehub can record the package id and summary without storing artifact bytes.
|
||||
- `make dev` starts cleanly and `make test` passes on a clean checkout.
|
||||
- A scripted integration test ingests a directory of fixture files,
|
||||
finalises the package, reads the manifest, downloads each file, and
|
||||
verifies digests end-to-end against the local backend.
|
||||
- Replaying events from sequence 1 reproduces the materialised view
|
||||
state byte-for-byte.
|
||||
- The library can be imported and exercised without an HTTP server
|
||||
running (embedding test).
|
||||
|
||||
150
workplans/ARTIFACT-STORE-WP-0002-ingestion-api.md
Normal file
@@ -0,0 +1,150 @@
|
||||
---
|
||||
id: ARTIFACT-STORE-WP-0002
|
||||
type: workplan
|
||||
title: "Ingestion API And Manifest Surface"
|
||||
repo: artifact-store
|
||||
domain: stack
|
||||
status: planned
|
||||
owner: codex
|
||||
topic_slug: stack
|
||||
planning_priority: high
|
||||
planning_order: 2
|
||||
created: "2026-05-15"
|
||||
updated: "2026-05-15"
|
||||
---
|
||||
|
||||
# ARTIFACT-STORE-WP-0002: Ingestion API And Manifest Surface
|
||||
|
||||
## Purpose
|
||||
|
||||
Expose the WP-0001 library as a complete HTTP API. Producers can create
|
||||
packages, ingest files (single-shot or via the upload-session resource
|
||||
shape), finalise to produce a manifest, list and search packages,
|
||||
download files, and tail the event stream.
|
||||
|
||||
## Constraints
|
||||
|
||||
- ADR-0001, ADR-0002, ADR-0003, ADR-0004, ADR-0005, ADR-0006.
|
||||
- `docs/ARCHITECTURE-BLUEPRINT.md` API shape section.
|
||||
- All handlers must be thin: translate transport → `registry.*` calls.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- WP-0001 done (library is functional against local backend).
|
||||
|
||||
## D2.1 - Package CRUD Endpoints
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0002-T001
|
||||
status: todo
|
||||
priority: high
|
||||
state_hub_task_id: "e3879111-4be9-4731-8aea-15abb874f960"
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `POST /packages`, `GET /packages` (filterable by producer / subject /
|
||||
retention_class / metadata key), `GET /packages/{id}`,
|
||||
`POST /packages/{id}/files` (single-shot multipart),
|
||||
`POST /packages/{id}/finalize`.
|
||||
- `GET /packages/{id}/manifest` (`Accept: application/cbor`) and
|
||||
`GET /packages/{id}/manifest.json` (JCS projection).
|
||||
- Validation errors return RFC 7807 problem documents.
|
||||
- OpenAPI is generated automatically (FastAPI default) and served at
|
||||
`/openapi.json` + `/docs`.
|
||||
|
||||
## D2.2 - File Download And Range Reads
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0002-T002
|
||||
status: todo
|
||||
priority: high
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `GET /files/{file_id}` returns metadata.
|
||||
- `GET /files/{file_id}/download` streams bytes; supports `Range`
|
||||
request headers (single contiguous range; multi-range is out of
|
||||
scope for v1).
|
||||
- ETag is the file's primary content address; `If-None-Match` returns
|
||||
`304`.
|
||||
- Streaming uses `AsyncIterator[bytes]` end-to-end; no full-file
|
||||
buffering.
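For the single-range restriction, a parser along the following lines would be enough. It is a sketch under assumptions: the function and exception names are hypothetical, and how the handler maps the two error types to HTTP status codes (for example `416` for an unsatisfiable range) is left to the implementation.

```python
import re

# Exactly one 'first-last' spec; a comma (multi-range) fails the match.
_RANGE = re.compile(r"bytes=([0-9]*)-([0-9]*)")


class RangeNotSatisfiable(Exception):
    """The range is well-formed but lies outside the object."""


def parse_single_range(header: str, size: int) -> tuple[int, int]:
    """Return inclusive (start, end) byte offsets for one contiguous range."""
    match = _RANGE.fullmatch(header.strip())
    if match is None:
        raise ValueError(f"unsupported Range header: {header!r}")
    first, last = match.groups()
    if first == "" and last == "":
        raise ValueError("empty range spec")
    if first == "":
        # Suffix form 'bytes=-N': the final N bytes of the object.
        length = int(last)
        if length == 0 or size == 0:
            raise RangeNotSatisfiable(header)
        return max(size - length, 0), size - 1
    start = int(first)
    if start >= size:
        raise RangeNotSatisfiable(header)
    end = min(int(last), size - 1) if last else size - 1
    if end < start:
        raise ValueError("range end precedes range start")
    return start, end
```

The returned pair maps directly onto a `Content-Range: bytes <start>-<end>/<size>` response header.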
|
||||
|
||||
## D2.3 - Upload Session Resource (Wire Shape Pinned)
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0002-T003
|
||||
status: todo
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `POST /uploads` opens a session, returns an upload id and content
|
||||
upload URL.
|
||||
- `PATCH /uploads/{upload_id}` accepts a body with `Content-Range`;
|
||||
v1 implementation may accept the whole body in one call.
|
||||
- `POST /uploads/{upload_id}/complete` promotes the upload into a
|
||||
file under a given package id and relative path.
|
||||
- Implementation is allowed to be single-shot internally; the wire
|
||||
shape and resource lifecycle must be the final one (per
|
||||
PLATFORM-AMBITION A6).
|
||||
|
||||
## D2.4 - Event Stream Long-Poll
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0002-T004
|
||||
status: todo
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `GET /events?since=<sequence>&limit=N` returns events in order with
|
||||
a long-poll wait when the tail is reached.
|
||||
- Events are CBOR by default; `Accept: application/json` returns the
|
||||
JCS projection of each event payload.
|
||||
- Test: a consumer that tails from sequence 1 never misses an event
|
||||
produced during the test.
|
||||
|
||||
## D2.5 - Auth Scaffolding (Shared-Secret Bearer)
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0002-T005
|
||||
status: todo
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- Bearer token auth on all mutating endpoints; configurable per-tenant
|
||||
token list via env / config file.
|
||||
- Read endpoints are also gated by default; an explicit
|
||||
`ARTIFACTSTORE_ANON_READ=true` opt-in for dev.
|
||||
- Health endpoint remains anonymous.
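One possible shape for the per-tenant token check, independent of FastAPI so it can be unit-tested on its own. The function names and the `tokens_by_tenant` mapping are hypothetical; the only library call relied on is `hmac.compare_digest`, which compares secrets in constant time.

```python
import hmac
from typing import Optional


def bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Extract the token from an 'Authorization: Bearer <token>' value."""
    if not authorization:
        return None
    scheme, _, token = authorization.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def tenant_for(
    authorization: Optional[str], tokens_by_tenant: dict[str, list[str]]
) -> Optional[str]:
    """Return the tenant whose token list contains the presented token."""
    presented = bearer_token(authorization)
    if presented is None:
        return None
    matched = None
    # Check every candidate rather than returning early, so timing does not
    # reveal which entry matched.
    for tenant, tokens in tokens_by_tenant.items():
        for candidate in tokens:
            if hmac.compare_digest(presented.encode(), candidate.encode()):
                matched = tenant
    return matched
```

A handler would treat a `None` result as unauthenticated on every gated route, while the health route skips the check entirely.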
|
||||
|
||||
## D2.6 - Integration Tests Through The Full HTTP Surface
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0002-T006
|
||||
status: todo
|
||||
priority: high
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- httpx-based test suite exercises every endpoint.
|
||||
- A scripted test ingests a 50-file package, finalises it, downloads
|
||||
every file, verifies digests, and tails events.
|
||||
- A property-based test fuzzes the upload session lifecycle.
|
||||
|
||||
## Success criteria
|
||||
|
||||
- A producer can run the full ingest-and-retrieve flow against
|
||||
`make dev` with curl.
|
||||
- All blueprint endpoints in the v1 native surface are implemented.
|
||||
- The CLI gains `artifactstore push <dir>` and
|
||||
`artifactstore manifest <package_id>` subcommands as thin clients
|
||||
over the HTTP API.
|
||||
132
workplans/ARTIFACT-STORE-WP-0003-retention-lifecycle.md
Normal file
@@ -0,0 +1,132 @@
|
||||
---
|
||||
id: ARTIFACT-STORE-WP-0003
|
||||
type: workplan
|
||||
title: "Retention Lifecycle: Defaults, Extensions, Holds, Deletion Eligibility"
|
||||
repo: artifact-store
|
||||
domain: stack
|
||||
status: planned
|
||||
owner: codex
|
||||
topic_slug: stack
|
||||
planning_priority: high
|
||||
planning_order: 3
|
||||
created: "2026-05-15"
|
||||
updated: "2026-05-15"
|
||||
---
|
||||
|
||||
# ARTIFACT-STORE-WP-0003: Retention Lifecycle
|
||||
|
||||
## Purpose
|
||||
|
||||
Implement the retention engine. By the end of this workplan, every
|
||||
package has a computed `expires_at`, operators can extend retention or
|
||||
apply / release holds, and the system can mark expired packages as
|
||||
eligible for deletion — without actually deleting bytes (GC is
|
||||
WP-0006).
|
||||
|
||||
## Constraints
|
||||
|
||||
- ADR-0002 (every retention change is an event).
|
||||
- `docs/ARCHITECTURE-BLUEPRINT.md` retention sections.
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- WP-0001 done (`retention_classes` seeded, `retention_state` view
|
||||
exists).
|
||||
- WP-0002 done (HTTP surface exists to attach the new endpoints to).
|
||||
|
||||
## D3.1 - Default Retention Application
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0003-T001
|
||||
status: todo
|
||||
priority: high
|
||||
state_hub_task_id: "2d6cbd83-c348-45ad-a223-7870a3412225"
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- On `POST /packages`, the requested `retention_class` is validated
|
||||
and the `v1.retention.default_applied` event is written with the
|
||||
computed `expires_at`.
|
||||
- Default durations per class are operator-configurable via a
|
||||
config file (TOML); the file path is documented in `OPERATOR.md`.
|
||||
- `permanent-record` packages have `expires_at = NULL` and
|
||||
`eligible_for_deletion = false`.
|
||||
|
||||
## D3.2 - Retention Extensions
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0003-T002
|
||||
status: todo
|
||||
priority: high
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `POST /packages/{id}/retention/extensions` accepts
|
||||
`{new_expires_at, reason}`. The new value must be strictly later
|
||||
than the current; reason is mandatory.
|
||||
- Each extension writes a `v1.retention.extended` event;
|
||||
`retention_state.current_expires_at` is updated in the same
|
||||
transaction.
|
||||
- A package's full extension history is recoverable from `events`.
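The two validation rules for an extension (strictly later, reason mandatory) are small enough to sketch directly. The function name and the payload field names are hypothetical, and packages with no expiry (`permanent-record`) are outside the scope of this example.

```python
from datetime import datetime, timezone


def build_extension_event(
    current_expires_at: datetime, new_expires_at: datetime, reason: str
) -> dict:
    """Validate an extension request and return the event payload to write."""
    if not reason or not reason.strip():
        raise ValueError("reason is mandatory")
    if new_expires_at <= current_expires_at:
        raise ValueError("new_expires_at must be strictly later than the current value")
    return {
        "type": "v1.retention.extended",
        # Hypothetical payload fields, chosen for illustration only.
        "previous_expires_at": current_expires_at.isoformat(),
        "new_expires_at": new_expires_at.isoformat(),
        "reason": reason.strip(),
    }
```

Recording the previous value in the payload is one way to make the extension history readable from `events` alone, without joining against the view.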
|
||||
|
||||
## D3.3 - Holds (Apply And Release)
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0003-T003
|
||||
status: todo
|
||||
priority: high
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- `POST /packages/{id}/retention/holds` records a hold with a reason
|
||||
and actor; emits `v1.retention.hold_applied`.
|
||||
- A package with at least one active hold is never
|
||||
`eligible_for_deletion` regardless of `expires_at`.
|
||||
- `POST /packages/{id}/retention/holds/{hold_id}/release` requires a
|
||||
reason; emits `v1.retention.hold_released`.
|
||||
- Test: hold applied → expiry passes → eligibility stays `false`;
|
||||
hold released → eligibility flips to `true`.
|
||||
|
||||
## D3.4 - Deletion Eligibility Sweeper
|
||||
|
||||
```task
|
||||
id: ARTIFACT-STORE-WP-0003-T004
|
||||
status: todo
|
||||
priority: medium
|
||||
```
|
||||
|
||||
Acceptance:
|
||||
|
||||
- A scheduled task (cron-style configurable interval; default 1 hour)
|
||||
scans for packages whose `expires_at` has passed and that have no
active hold, and emits `v1.retention.deletion_eligible` events.
|
||||
- The sweeper is idempotent: events are emitted at most once per
|
||||
package per eligibility transition.
|
||||
- The sweeper is invokable as a CLI subcommand for tests:
|
||||
`artifactstore retention sweep`.
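The eligibility rule and the at-most-once behaviour of the sweeper can be sketched in memory as follows. The `RetentionState` class is a hypothetical stand-in for a row of the `retention_state` view, and treating "has passed" as a strict `<` comparison is an assumption.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RetentionState:
    package_id: str
    expires_at: Optional[datetime]  # None models permanent-record
    active_holds: set[str] = field(default_factory=set)
    eligible_for_deletion: bool = False


def is_eligible(state: RetentionState, now: datetime) -> bool:
    """Expired, not permanent, and not under any active hold."""
    if state.expires_at is None or state.active_holds:
        return False
    return state.expires_at < now


def sweep(states: list[RetentionState], now: datetime) -> list[dict]:
    """Emit one event per package per transition into eligibility."""
    emitted = []
    for state in states:
        eligible = is_eligible(state, now)
        if eligible and not state.eligible_for_deletion:
            emitted.append(
                {"type": "v1.retention.deletion_eligible", "package_id": state.package_id}
            )
        # Track the current flag so a later hold resets the transition.
        state.eligible_for_deletion = eligible
    return emitted
```

Running `sweep` twice with the same inputs emits nothing the second time, which is the idempotency the acceptance text asks for.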
|
||||

## D3.5 - Audit Surface For Retention

```task
id: ARTIFACT-STORE-WP-0003-T005
status: todo
priority: medium
```

Acceptance:

- `GET /packages/{id}/retention/history` returns the ordered list of
  retention events for a package.
- The default response is the JCS projection; CBOR is available via
  `Accept: application/cbor`.

## Success criteria

- A guide-board run can be ingested, given the `release-evidence`
  retention class, later extended once, held for a quarter, released,
  swept, and marked eligible, all visible through both
  `retention_state` and the event log.
- No bytes are deleted by this workplan; that is WP-0006.
131
workplans/ARTIFACT-STORE-WP-0004-s3-compatible-backend.md
Normal file
@@ -0,0 +1,131 @@
---
id: ARTIFACT-STORE-WP-0004
type: workplan
title: "S3-Compatible Backend (Ceph RGW Target)"
repo: artifact-store
domain: stack
status: planned
owner: codex
topic_slug: stack
planning_priority: medium
planning_order: 4
created: "2026-05-15"
updated: "2026-05-15"
---

# ARTIFACT-STORE-WP-0004: S3-Compatible Backend

## Purpose

Add a second concrete storage backend that speaks the S3 protocol.
Validated targets: Ceph RGW (primary self-hosted production target),
MinIO (dev / CI), AWS S3 (interop check). The backend must satisfy
the storage SPI without leaking any S3-specific concepts into the
registry.

## Constraints

- `storage.spi.StorageBackend` Protocol from WP-0001 is the contract.
- No S3 vocabulary leaks into `registry.*` or `api.*`.
- The storage-backend section of `docs/ARCHITECTURE-BLUEPRINT.md`
  applies.

## Prerequisites

- WP-0001 done (SPI exists, local backend exists as a reference).

## D4.1 - Configuration Surface

```task
id: ARTIFACT-STORE-WP-0004-T001
status: todo
priority: high
state_hub_task_id: "7b980a55-2364-48c3-98ac-081629a8d2b7"
```

Acceptance:

- `s3` backend configuration accepts: `endpoint_url`, `region`,
  `bucket`, `key_prefix`, `access_key_ref`, `secret_key_ref`,
  `storage_class`, `sse` (optional), `multipart_threshold_bytes`,
  `multipart_chunk_bytes`.
- Credential references resolve from env vars or mounted files; never
  from request bodies.
- A documented Ceph RGW configuration example is checked in under
  `docs/OPERATOR.md`.

## D4.2 - S3 Backend Implementation

```task
id: ARTIFACT-STORE-WP-0004-T002
status: todo
priority: high
```

Acceptance:

- `storage.backends.s3.S3Backend` implements the SPI using `aioboto3`
  or `aiobotocore` (decision recorded in the workplan; whichever is
  better-maintained at implementation time).
- Object key layout is
  `<key_prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
- `put` uses multipart for objects above the configured threshold.
- `get` supports `Range`.
- `head`, `delete`, `health` implemented.
- `delete` is idempotent (delete-of-missing returns success).
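The key layout above can be made concrete with a small pure function. This is a sketch under one assumption: digests arrive as `<algorithm>:<hex>` strings (the form `blake3:<hex>` appears elsewhere in these workplans). The function name `object_key` is illustrative.

```python
def object_key(key_prefix: str, digest: str) -> str:
    """Build '<key_prefix>/<algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>' from 'algorithm:hex'."""
    algorithm, sep, hex_digest = digest.partition(":")
    hex_digest = hex_digest.lower()
    if not sep or not algorithm or len(hex_digest) < 4:
        raise ValueError(f"expected '<algorithm>:<hex>', got {digest!r}")
    if any(ch not in "0123456789abcdef" for ch in hex_digest):
        raise ValueError(f"digest is not hexadecimal: {digest!r}")
    # Two levels of two-character fan-out keep any single key prefix from growing unbounded.
    return "/".join([key_prefix.strip("/"), algorithm, hex_digest[0:2], hex_digest[2:4], hex_digest])


key = object_key("artifacts/", "blake3:ABCDEF0123")
# key == "artifacts/blake3/ab/cd/abcdef0123"
```

Deriving the key only from the prefix and digest keeps the layout content-addressed, so the same bytes always land at the same key.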

## D4.3 - Backend Selection And Routing

```task
id: ARTIFACT-STORE-WP-0004-T003
status: todo
priority: medium
```

Acceptance:

- A registry can have multiple backends configured; package creation
  records which backend a file is stored in.
- Per-package backend selection rule: a configurable function of
  `retention_class` and producer; the default routes everything to a
  single backend.
- `storage_locations.backend_id` reflects the actual storage.
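A selection rule of this kind can be modelled as a plain callable from `(retention_class, producer)` to a backend id. The sketch below is one possible shape, not the planned API: `BackendSelector`, `single_backend`, `rule_table`, the `"*"` wildcard, and the backend ids are all hypothetical.

```python
from typing import Callable

# (retention_class, producer) -> backend_id
BackendSelector = Callable[[str, str], str]


def single_backend(backend_id: str) -> BackendSelector:
    """Default rule: route every package to one backend."""
    return lambda retention_class, producer: backend_id


def rule_table(rules: dict[tuple[str, str], str], default: str) -> BackendSelector:
    """Look up an exact (class, producer) match, then a (class, '*') wildcard."""
    def select(retention_class: str, producer: str) -> str:
        for key in ((retention_class, producer), (retention_class, "*")):
            if key in rules:
                return rules[key]
        return default
    return select


select = rule_table({("release-evidence", "*"): "s3-primary"}, default="local")
chosen = select("release-evidence", "guide-board")  # "s3-primary"
```

Whatever id the selector returns is what would be written to `storage_locations.backend_id` at package creation.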

## D4.4 - Test Strategy: MinIO In CI, RGW As Documented Manual Smoke

```task
id: ARTIFACT-STORE-WP-0004-T004
status: todo
priority: high
```

Acceptance:

- Integration tests run against MinIO via `testcontainers-python`
  (or a docker-compose fixture if testcontainers fights the WSL2
  environment).
- A documented manual procedure tests against a real Ceph RGW
  endpoint; results recorded in `docs/OPERATOR.md`.
- No CI dependency on a live Ceph or AWS account.

## D4.5 - Verification Pass

```task
id: ARTIFACT-STORE-WP-0004-T005
status: todo
priority: medium
```

Acceptance:

- `artifactstore storage verify --backend s3` re-reads every object in
  the backend, recomputes its primary digest, and emits
  `v1.storage.location_verified` events.
- Mismatches are reported as `failed` locations and surfaced via the
  health endpoint.
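The verification loop itself is simple: re-read, re-hash, compare, record. The sketch below substitutes SHA-256 from the standard library for the primary digest so that it runs without extra dependencies; the in-memory `objects` mapping stands in for backend reads, and the event payload fields (`digest`, `status`) are assumptions.

```python
import hashlib


def verify_objects(objects: dict[str, bytes]) -> list[dict]:
    """Map of expected 'sha256:<hex>' digest -> stored bytes; returns one event per object."""
    events = []
    for expected, data in sorted(objects.items()):
        # Stand-in digest: the real pass would recompute the store's primary digest.
        actual = "sha256:" + hashlib.sha256(data).hexdigest()
        events.append({
            "type": "v1.storage.location_verified",
            "digest": expected,
            "status": "ok" if actual == expected else "failed",
        })
    return events


payload = b"report body"
good_digest = "sha256:" + hashlib.sha256(payload).hexdigest()
bad_digest = "sha256:" + "0" * 64
results = verify_objects({good_digest: payload, bad_digest: b"tampered"})
```

A real pass would stream each object rather than hold it in memory, and would feed the `failed` entries into whatever the health endpoint reports.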

## Success criteria

- The same package ingestion flow that worked against `local` in
  WP-0001 works unchanged against `s3`.
- Switching backend by config, without code changes in the registry
  or API layers, is the smoke test.
146
workplans/ARTIFACT-STORE-WP-0005-guide-board-pilot.md
Normal file
@@ -0,0 +1,146 @@
---
id: ARTIFACT-STORE-WP-0005
type: workplan
title: "Guide-Board Pilot Ingestion"
repo: artifact-store
domain: stack
status: planned
owner: codex
topic_slug: stack
planning_priority: high
planning_order: 5
created: "2026-05-15"
updated: "2026-05-15"
---

# ARTIFACT-STORE-WP-0005: Guide-Board Pilot Ingestion

## Purpose

Wire the first real producer end-to-end. A guide-board CMIS
assessment run directory is registered as one artifact package, its
files are stored through a configured backend, retention is applied,
and Statehub records a stable package id and summary without storing
bytes itself. This is the pilot success criterion in INTENT.md.

## Constraints

- WP-0001 through WP-0004 must be done (WP-0004 only for the
  production target; see Prerequisites).
- The guide-board manifest fields in
  `docs/ARCHITECTURE-BLUEPRINT.md` apply.
- No guide-board-specific code lives in `artifactstore.registry`;
  pilot-specific glue lives in `artifactstore.pilots.guide_board` or
  in a separate small package.

## Prerequisites

- WP-0001, WP-0002, WP-0003 done. WP-0004 only required for the
  production target; local FS is sufficient for the pilot test.

## D5.1 - Pilot Metadata Schema Registration

```task
id: ARTIFACT-STORE-WP-0005-T001
status: todo
priority: high
state_hub_task_id: "eb822821-353c-4cd2-95bf-acb2f084b7ea"
```

Acceptance:

- A JSON Schema for `guide-board.run.v1` package metadata is checked
  in under `schemas/guide-board.run.v1.json`.
- A bootstrap script registers it via `POST /metadata-schemas`
  (an endpoint added in this workplan).
- Required keys: `run_id`, `target_profile_ref`,
  `assessment_profile_ref`, `result_status`, `source_commits`
  (object of slug → SHA), `report_paths`, `evidence_counts`,
  `finding_counts`.
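The required keys listed above could be captured in a schema fragment like the following, written as a Python dict so it can be checked without extra dependencies. Only the key names and the slug-to-SHA shape of `source_commits` come from the acceptance list; the schema draft and the helper `missing_required_keys` are assumptions for illustration.

```python
# Hypothetical fragment of schemas/guide-board.run.v1.json, expressed as a dict.
GUIDE_BOARD_RUN_V1 = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",  # draft choice is an assumption
    "type": "object",
    "required": [
        "run_id",
        "target_profile_ref",
        "assessment_profile_ref",
        "result_status",
        "source_commits",
        "report_paths",
        "evidence_counts",
        "finding_counts",
    ],
    "properties": {
        # Object of slug -> SHA, as stated in the acceptance criteria.
        "source_commits": {"type": "object", "additionalProperties": {"type": "string"}},
    },
}


def missing_required_keys(metadata: dict) -> list[str]:
    """Cheap pre-flight check; full validation would use a JSON Schema validator."""
    return [key for key in GUIDE_BOARD_RUN_V1["required"] if key not in metadata]


partial = {"run_id": "run-001", "result_status": "passed"}
missing = missing_required_keys(partial)
```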

## D5.2 - Pilot Ingest Helper (CLI + Library Function)

```task
id: ARTIFACT-STORE-WP-0005-T002
status: todo
priority: high
```

Acceptance:

- `artifactstore guide-board ingest <run-dir>` walks a guide-board
  run directory, builds the package metadata from `run.json` and
  `retention-summary.json`, uploads every file declared in the
  assessment package manifest (and the manifest itself), and
  finalises the package.
- Library entry point `pilots.guide_board.ingest_run(path, ...)`
  exposes the same behaviour for embedding.
- Output: the package id (UUID) and the package manifest digest
  (`blake3:<hex>`).

## D5.3 - Fixture-Based Test

```task
id: ARTIFACT-STORE-WP-0005-T003
status: todo
priority: high
```

Acceptance:

- A trimmed-down guide-board run fixture (under 1 MB total) lives in
  `tests/fixtures/guide-board/` with realistic file shapes:
  `run.json`, `retention-summary.json`,
  `reports/assessment-package.json`, `reports/report.md`, one
  scorecard, one log-review summary, and a couple of raw artifact
  files.
- The test runs the CLI / library helper end-to-end against an
  in-memory SQLite + tempdir local backend, then verifies:
  1. package id returned,
  2. manifest digest stable across two runs of the same fixture,
  3. every file downloadable with correct bytes,
  4. retention class applied as configured.

## D5.4 - Statehub Linkage Recipe

```task
id: ARTIFACT-STORE-WP-0005-T004
status: todo
priority: medium
```

Acceptance:

- `docs/OPERATOR.md` (or a new `docs/pilots/guide-board.md`)
  documents the exact `POST /progress/` or `record_decision` call
  shape Statehub clients should use to link a guide-board run to
  its artifact-store package id and manifest digest.
- A reference Statehub client snippet is checked in, parameterised
  by env vars.

## D5.5 - Operator Smoke Procedure For The Real Producer

```task
id: ARTIFACT-STORE-WP-0005-T005
status: todo
priority: medium
```

Acceptance:

- A documented procedure ingests a real (non-fixture) guide-board run
  produced from `~/guide-board` / `~/open-cmis-tck`.
- Procedure includes: starting `make dev`, registering the schema,
  running the ingest CLI, verifying the manifest, and
  recording the package id in Statehub.
- Procedure runs end-to-end on a developer workstation in under 5
  minutes.

## Success criteria

- A real guide-board CMIS run is ingested with one CLI invocation.
- The package manifest lists every stored file with both digests and
  the canonical CBOR digest of the manifest itself.
- Statehub records the package id and summary; no artifact bytes
  live in Statehub.
- Retention can be extended on the package without touching bytes.
- The pilot path validates the storage adapter swap: the same
  command works against `local` and against `s3` (if WP-0004 is done).