generated from coulomb/repo-seed
docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans
Aligns the v1 architecture with the longer-horizon platform thesis so we can start implementation without the schema-level inconsistencies the prior review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only event log as source of truth, canonical CBOR manifests, control/data-plane contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core + asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module layout, materialised-view data model over the event log, upload-session and event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases.

WP-0001 rewritten as the Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005 created carrying the existing state_hub_task_ids forward semantically: ingestion API (T004), retention lifecycle (T005), S3-compatible backend (T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001 with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -1,330 +1,378 @@
|
||||
# Artifact Store Architecture Blueprint
|
||||
# Architecture Blueprint
|
||||
|
||||
Status: draft
|
||||
Created: 2026-05-15
|
||||
Status: accepted (v2 — supersedes 2026-05-15 draft)
|
||||
Updated: 2026-05-15
|
||||
|
||||
## Purpose
|
||||
This document operationalises `INTENT.md`, the `docs/PLATFORM-AMBITION.md`
|
||||
thesis, and the decisions recorded in `docs/adr/`. Where a tension exists
|
||||
between this blueprint and an ADR, the ADR wins; raise an issue or
|
||||
supersede the ADR.
|
||||
|
||||
`artifact-store` provides a generic registry and storage gateway for durable
|
||||
generated artifacts. Producers register packages and files with metadata;
|
||||
storage adapters persist the bytes; retention policy decides how long artifacts
|
||||
remain eligible for retrieval.
|
||||
## Architecture in one paragraph
|
||||
|
||||
The design keeps artifact identity and lifecycle separate from storage
|
||||
implementation. This allows the first version to run against local filesystem
|
||||
storage while the production path can use S3-compatible object storage such as
|
||||
Ceph RGW.
|
||||
`artifact-store` is a **library-first** artifact registry and storage
|
||||
gateway. A small core library (`artifactstore`) implements identity,
|
||||
manifests, retention, the storage adapter SPI, the data plane SPI, and
|
||||
the registry orchestrator. The HTTP server and the CLI are thin
|
||||
consumers of that library. Bytes are addressed by content
|
||||
(`blake3:<hex>`) and stored through a pluggable adapter SPI. State is
|
||||
authoritative in an append-only event log; queryable tables are
|
||||
materialised views.
|
||||
|
||||
## Architecture Summary
|
||||
## Design lineage
|
||||
|
||||
The shape is deliberately borrowed from `ffmpeg` and `VLC`: a tight
|
||||
core of well-named modules with stable contracts, runtime-pluggable
|
||||
backends, a thin orchestration binary, and an explicit hot-path
|
||||
boundary that can be rewritten in faster code without changing the
|
||||
consumer API. See `docs/PLATFORM-AMBITION.md` for the reference table.
|
||||
|
||||
## Top-level shape
|
||||
|
||||
```text
producer
  -> Artifact Registry API
      -> metadata database
      -> retention policy engine
      -> audit event log
      -> storage adapter interface
          -> local filesystem backend
          -> S3-compatible backend
              -> Ceph RGW deployment
          -> future cloud/blob/archive backends

   producers / operators / agents
                |
                v
    +------------------------+
    |   HTTP API   |   CLI   |   <-- thin consumers
    +------------------------+
                |
                v
    +------------------------+
    | registry orchestrator  |
    +------------------------+
       |          |         |
       v          v         v
 +----------+ +---------+ +---------+
 | identity | | events  | |retention|
 |/manifest | | (log +  | | policy  |
 |          | | views)  | | engine  |
 +----------+ +---------+ +---------+
                |
                v
    +-----------------------+
    |    data plane SPI     |   <-- ADR-0004 contract
    +-----------------------+
                |
                v
    +-----------------------+
    |  storage adapter SPI  |
    +-----------------------+
       |        |        |
       v        v        v
    +-----+ +------+ +-------+
    |local| |  S3  | | Ceph  |  ... future backends
    | FS  | | RGW  | | RGW   |
    +-----+ +------+ +-------+
```
|
||||
|
||||
The registry is the authority for artifact metadata and lifecycle. Backends are
|
||||
responsible for byte storage and retrieval.
|
||||
## Core modules
|
||||
|
||||
## Design Principles
|
||||
The modules below map one-to-one onto ADR-0005's project layout. Each
module has a stable public surface; internals are free to evolve.
|
||||
|
||||
- Backend-neutral registry: no producer should know whether bytes live in Ceph,
|
||||
local disk, or a cloud bucket.
|
||||
- Content-addressable confidence: every stored file has a digest and size.
|
||||
- Retention by default: every package receives an expiry decision at ingestion.
|
||||
- Extensions are explicit: retention extensions and holds are audit events, not
|
||||
silent metadata edits.
|
||||
- Packages remain portable: a manifest should be enough to understand a package
|
||||
without calling the producer.
|
||||
- Statehub links, it does not store bytes: Statehub records artifact IDs and
|
||||
outcomes; artifact-store owns file persistence.
|
||||
- Deletion is deliberate: expiry makes artifacts eligible for deletion; deletion
|
||||
jobs must be auditable and reversible only when the backend still has data.
|
||||
### `identity`
|
||||
|
||||
## Components
|
||||
- `Digest(algorithm, hex)` — value object.
|
||||
- `ContentAddress` — `<algorithm>:<hex>` (ADR-0001).
|
||||
- `digest_stream(reader) -> {primary, sha256}` — single-pass dual digest.
|
||||
- Algorithm registry: `blake3` (default primary), `sha256` (always
|
||||
computed).
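
The single-pass dual digest described above can be sketched as follows. This is a minimal illustration, not the library's implementation: `DualDigest` and the `chunk_size` parameter are hypothetical names, and `hashlib.blake2b` is used as a stand-in for BLAKE3 because BLAKE3 is not in the standard library.

```python
import hashlib
import io
from typing import BinaryIO, NamedTuple


class DualDigest(NamedTuple):
    """Hypothetical result type: both digests in `<algorithm>:<hex>` form."""
    primary: str
    sha256: str


def digest_stream(reader: BinaryIO, chunk_size: int = 64 * 1024) -> DualDigest:
    """Read the stream once and feed every chunk to both hashers."""
    # Stand-in: blake2b replaces BLAKE3, which would need the third-party
    # `blake3` package named in the tech stack.
    primary = hashlib.blake2b()
    secondary = hashlib.sha256()
    while chunk := reader.read(chunk_size):
        primary.update(chunk)
        secondary.update(chunk)
    return DualDigest(
        primary=f"blake2b:{primary.hexdigest()}",
        sha256=f"sha256:{secondary.hexdigest()}",
    )


result = digest_stream(io.BytesIO(b"hello artifact-store"))
print(result.primary)
print(result.sha256)
```

Because both hashers consume the same chunk before the next read, the bytes are never re-read or held in memory as a whole.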
|
||||
|
||||
### Registry API
|
||||
### `manifest`
|
||||
|
||||
HTTP API for producers and operators.
|
||||
- `Manifest` — versioned dataclass: package metadata + ordered file list
|
||||
+ retention summary + provenance + storage receipts.
|
||||
- `manifest.codec.encode(manifest) -> bytes` — canonical CBOR
|
||||
(ADR-0003).
|
||||
- `manifest.codec.decode(bytes) -> Manifest`.
|
||||
- `manifest.projection.jcs(manifest) -> bytes` — canonical-JSON
|
||||
projection for display and signing-tool interop.
|
||||
- Round-trip invariant: `decode(encode(m)) == m` and
|
||||
`encode(decode(jcs_to_cbor(jcs(m)))) == encode(m)`.
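
To make the round-trip invariant concrete, here is a small sketch using a deterministic JSON encoding in place of canonical CBOR. It assumes a cut-down `Manifest` with only three fields, and `json.dumps` with sorted keys is only an approximation of JCS (RFC 8785), which also pins down number and string serialisation; the real codec would use `cbor2`.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class FileEntry:
    relative_path: str
    size_bytes: int


@dataclass(frozen=True)
class Manifest:
    # Cut-down, hypothetical shape; the real manifest carries more fields.
    manifest_version: int
    name: str
    files: tuple[FileEntry, ...] = ()


def encode(manifest: Manifest) -> bytes:
    """Deterministic bytes: sorted keys, no insignificant whitespace."""
    return json.dumps(
        asdict(manifest), sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def decode(data: bytes) -> Manifest:
    """Rebuild the dataclass, restoring the ordered file tuple."""
    raw = json.loads(data)
    files = tuple(FileEntry(**entry) for entry in raw["files"])
    return Manifest(raw["manifest_version"], raw["name"], files)


m = Manifest(1, "run-42", (FileEntry("run.json", 12), FileEntry("reports/report.md", 340)))
assert decode(encode(m)) == m
assert encode(decode(encode(m))) == encode(m)
```

The second assertion is the property that matters for content addressing: re-encoding a decoded manifest must reproduce the same bytes, otherwise the manifest digest would not be stable.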
|
||||
|
||||
Initial responsibilities:
|
||||
### `events`
|
||||
|
||||
- create artifact packages,
|
||||
- upload or ingest files,
|
||||
- finalize packages,
|
||||
- retrieve package metadata,
|
||||
- list/search packages by subject and producer metadata,
|
||||
- create retention extensions and holds,
|
||||
- expose download metadata or redirect/download endpoints,
|
||||
- expose health and backend status.
|
||||
- `events.write(transaction, event)` — appends one row with monotonic
|
||||
sequence (ADR-0002).
|
||||
- `events.tail(since_sequence) -> AsyncIterator[Event]` — long-poll.
|
||||
- `events.replay(into=ViewWriter)` — rebuild materialised views.
|
||||
- Event types (v1):
|
||||
`v1.package.created`, `v1.file.ingested`, `v1.package.finalized`,
|
||||
`v1.retention.default_applied`, `v1.retention.extended`,
|
||||
`v1.retention.hold_applied`, `v1.retention.hold_released`,
|
||||
`v1.retention.deletion_eligible`, `v1.storage.location_recorded`,
|
||||
`v1.storage.location_verified`, `v1.audit.access`,
|
||||
`v1.system.note`.
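
The write, tail, and replay operations can be illustrated with an in-memory log. This is a sketch under simplifying assumptions: the `EventLog` class and `apply_package_event` function are hypothetical, there is no database transaction, and `tail` returns a plain iterator rather than a long-polling `AsyncIterator`.

```python
from dataclasses import dataclass
from typing import Callable, Iterator


@dataclass(frozen=True)
class Event:
    sequence: int
    event_type: str
    subject_id: str
    payload: dict


class EventLog:
    """Append-only list; the sequence is the 1-based position in the list."""

    def __init__(self) -> None:
        self._events: list[Event] = []

    def write(self, event_type: str, subject_id: str, payload: dict) -> Event:
        event = Event(len(self._events) + 1, event_type, subject_id, payload)
        self._events.append(event)
        return event

    def tail(self, since_sequence: int = 0) -> Iterator[Event]:
        # Events with sequence > since_sequence start at index since_sequence.
        return iter(self._events[since_sequence:])

    def replay(self, apply: Callable[[dict, Event], None]) -> dict:
        """Rebuild a view from scratch by folding every event into it."""
        view: dict = {}
        for event in self._events:
            apply(view, event)
        return view


def apply_package_event(view: dict, event: Event) -> None:
    """Hypothetical view writer for a tiny `artifact_packages` projection."""
    if event.event_type == "v1.package.created":
        view[event.subject_id] = {"status": "created"}
    elif event.event_type == "v1.package.finalized":
        view[event.subject_id]["status"] = "finalized"
    else:
        return
    view[event.subject_id]["last_event_sequence"] = event.sequence


log = EventLog()
log.write("v1.package.created", "pkg-1", {"name": "run-42"})
log.write("v1.package.finalized", "pkg-1", {})
packages = log.replay(apply_package_event)
```

Since the view is derived purely from the log, dropping and replaying it is always safe, which is what makes the log the source of truth.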
|
||||
|
||||
### Metadata Store
|
||||
### `retention`
|
||||
|
||||
Persistent database for registry state.
|
||||
- `retention.classes` — `transient`, `raw-evidence`, `summary-evidence`,
|
||||
`release-evidence`, `permanent-record`. Defined as data, not code.
|
||||
- `retention.policy.apply(package, class) -> RetentionDecision` —
|
||||
computes `expires_at` and the deletion eligibility rule.
|
||||
- `retention.extend(package, until, reason, actor)` — emits an event;
|
||||
the materialised view updates on commit.
|
||||
- `retention.hold(package, reason, actor)` /
|
||||
`retention.release_hold(hold_id, actor)`.
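
A sketch of how a class-driven retention decision might be computed. The durations below are invented placeholders (default durations are still an open question in this blueprint), and `apply_policy` and `eligible_for_deletion` are hypothetical function names.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Placeholder durations for illustration only; None means no automatic expiry.
DEFAULT_DURATIONS: dict[str, timedelta | None] = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,
}


@dataclass(frozen=True)
class RetentionDecision:
    retention_class: str
    expires_at: datetime | None


def apply_policy(created_at: datetime, retention_class: str) -> RetentionDecision:
    """Look the class up as data and derive `expires_at` from it."""
    if retention_class not in DEFAULT_DURATIONS:
        raise ValueError(f"unknown retention class: {retention_class!r}")
    duration = DEFAULT_DURATIONS[retention_class]
    expires_at = None if duration is None else created_at + duration
    return RetentionDecision(retention_class, expires_at)


def eligible_for_deletion(
    decision: RetentionDecision, now: datetime, active_hold: bool
) -> bool:
    """A hold or a missing expiry always blocks eligibility."""
    if active_hold or decision.expires_at is None:
        return False
    return now >= decision.expires_at


created = datetime(2026, 5, 15, tzinfo=timezone.utc)
decision = apply_policy(created, "transient")
```

Keeping the classes in a table rather than in branches mirrors the "defined as data, not code" rule: changing a duration is a data change, not a code change.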
|
||||
|
||||
Initial implementation can use SQLite for local development and PostgreSQL for
|
||||
shared service deployments if that matches the surrounding service stack.
|
||||
### `audit`
|
||||
|
||||
Core tables:
|
||||
- A view over `events` filtered to access and lifecycle events. No
|
||||
separate write path; auditing happens by event emission elsewhere.
|
||||
|
||||
- `artifact_packages`
|
||||
- `artifact_files`
|
||||
- `storage_locations`
|
||||
- `retention_rules`
|
||||
- `retention_events`
|
||||
- `audit_events`
|
||||
### `storage` (adapter SPI)
|
||||
|
||||
### Storage Adapter Interface
|
||||
```python
from __future__ import annotations  # result types below stay as forward references

from typing import AsyncIterator, Protocol


class StorageBackend(Protocol):
    # ContentAddress comes from `identity`; the receipt, metadata, and result
    # types are defined elsewhere in the library.
    backend_id: str

    async def put(self, content_address: ContentAddress, stream: AsyncIterator[bytes], size_hint: int | None) -> StorageReceipt: ...
    async def get(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def head(self, content_address: ContentAddress) -> StorageObjectMetadata: ...
    async def delete(self, content_address: ContentAddress) -> DeletionResult: ...
    async def health(self) -> BackendStatus: ...
```
|
||||
|
||||
Small backend contract used by the API service.
|
||||
- Backend registry: backends register at import time; selection is
|
||||
per-package by configuration.
|
||||
- v1 ships `local` (filesystem); `s3` ships in WP-0004.
|
||||
|
||||
Required operations:
|
||||
### `dataplane` (SPI per ADR-0004)
|
||||
|
||||
- `put(object_key, stream, metadata) -> storage_location`
|
||||
- `get(object_key) -> stream or signed_url`
|
||||
- `head(object_key) -> object_metadata`
|
||||
- `delete(object_key) -> deletion_result`
|
||||
- `health() -> backend_status`
|
||||
```python
from __future__ import annotations  # result types below stay as forward references

from typing import AsyncIterator, Protocol


class DataPlane(Protocol):
    # Hot-path contract (ADR-0004); hint and result types are defined elsewhere.
    async def ingest_stream(self, stream: AsyncIterator[bytes], hints: IngestHints) -> IngestResult: ...
    async def serve_object(self, content_address: ContentAddress, byte_range: tuple[int, int] | None = None) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: ContentAddress) -> VerifyResult: ...
    async def delete_object(self, content_address: ContentAddress) -> DeletionResult: ...
    async def backend_health(self) -> BackendStatus: ...
```
|
||||
|
||||
Initial backends:
|
||||
- v1 implementation: `dataplane.inproc` — wraps a `StorageBackend`,
|
||||
computes digests during streaming.
|
||||
- Future implementation: `dataplane.remote` — gRPC or
|
||||
framed-bincode-over-Unix-socket client to a Rust daemon.
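
The idea of an in-process data plane that hashes while it streams can be sketched like this. Everything here is illustrative: `MemoryBackend` is a toy stand-in for a `StorageBackend`, SHA-256 stands in for the BLAKE3 primary digest, and the chunks are buffered in memory for brevity where a real implementation would spool to the backend.

```python
import asyncio
import hashlib
from typing import AsyncIterator


class MemoryBackend:
    """Toy backend: keeps objects in a dict keyed by content address."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    async def put(self, content_address: str, data: bytes) -> None:
        self.objects[content_address] = data


async def ingest_stream(
    backend: MemoryBackend, stream: AsyncIterator[bytes]
) -> tuple[str, int]:
    """Hash each chunk as it arrives, then store under the resulting address."""
    hasher = hashlib.sha256()  # stand-in for the BLAKE3 primary digest
    buffered = bytearray()
    async for chunk in stream:
        hasher.update(chunk)
        buffered.extend(chunk)
    address = f"sha256:{hasher.hexdigest()}"
    await backend.put(address, bytes(buffered))
    return address, len(buffered)


async def chunks() -> AsyncIterator[bytes]:
    """Example producer yielding a payload in two parts."""
    for part in (b"abc", b"def"):
        yield part


backend = MemoryBackend()
address, size = asyncio.run(ingest_stream(backend, chunks()))
```

The caller never supplies the address; it falls out of the bytes, so a producer cannot register content under a digest that does not match.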
|
||||
|
||||
- local filesystem backend for tests and development,
|
||||
- S3-compatible backend for Ceph RGW and cloud object stores.
|
||||
### `registry`
|
||||
|
||||
### Retention Policy Engine
|
||||
The orchestrator. Combines `identity + manifest + events + retention +
|
||||
dataplane` into the operations the HTTP API and CLI consume:
|
||||
`create_package`, `ingest_file`, `finalize_package`, `get_manifest`,
|
||||
`download_file`, `extend_retention`, `apply_hold`, `release_hold`,
|
||||
`mark_deletion_eligible`, `tail_events`. Each operation is one DB
|
||||
transaction that writes one or more events and updates materialised
|
||||
views.
|
||||
|
||||
Applies default rules at ingestion and records later changes.
|
||||
### `api.http` and `cli`
|
||||
|
||||
Initial retention classes:
|
||||
Thin. Their job is to translate transport (HTTP / argv) into calls on
|
||||
`registry`. No business logic.
|
||||
|
||||
- `transient`: short-lived scratch artifacts,
|
||||
- `raw-evidence`: raw logs and run output,
|
||||
- `summary-evidence`: compact reports and summaries,
|
||||
- `release-evidence`: release or customer-facing evidence packages,
|
||||
- `permanent-record`: manually held records with no automatic expiry.
|
||||
## Data model
|
||||
|
||||
Each package stores:
|
||||
All tables exist as **materialised views over `events`** (ADR-0002),
|
||||
except `events` itself, `retention_classes` (seed data), and
|
||||
`metadata_schemas` (config).
|
||||
|
||||
- selected retention class,
|
||||
- default retention rule,
|
||||
- computed `expires_at`,
|
||||
- extension records,
|
||||
- hold records,
|
||||
- deletion eligibility state.
|
||||
### `events` (source of truth)
|
||||
|
||||
### Audit Log
|
||||
| Column | Type | Notes |
|---|---|---|
| `sequence` | `BIGSERIAL PRIMARY KEY` | monotonic; gapless only if allocation is serialised, since a plain PostgreSQL sequence skips values on rollback |
| `created_at` | `TIMESTAMPTZ NOT NULL` | UTC, set by DB default |
| `event_type` | `TEXT NOT NULL` | versioned slug (`v1.…`) |
| `subject_kind` | `TEXT NOT NULL` | `package` / `file` / `retention` / `storage` / `system` |
| `subject_id` | `UUID NULL` | |
| `actor` | `TEXT NOT NULL` | producer or operator identity |
| `payload` | `BYTEA NOT NULL` | canonical CBOR |
| `payload_digest` | `BYTEA NOT NULL` | BLAKE3 of `payload` |
|
||||
|
||||
Append-only record of important events:
|
||||
Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
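
For local experimentation, the `events` table and its indexes can be approximated in SQLite. This is a translation made for illustration, not the project's schema: SQLite types replace the PostgreSQL ones, `AUTOINCREMENT` stands in for `BIGSERIAL`, SHA-256 stands in for BLAKE3, and the payload bytes are a placeholder rather than canonical CBOR. The `append_event` helper and index names are hypothetical.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE events (
        sequence       INTEGER PRIMARY KEY AUTOINCREMENT,
        created_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        event_type     TEXT NOT NULL,
        subject_kind   TEXT NOT NULL,
        subject_id     TEXT,
        actor          TEXT NOT NULL,
        payload        BLOB NOT NULL,
        payload_digest BLOB NOT NULL
    );
    CREATE INDEX events_subject  ON events (subject_kind, subject_id);
    CREATE INDEX events_type_seq ON events (event_type, sequence);
    """
)


def append_event(
    event_type: str, subject_kind: str, subject_id: str, actor: str, payload: bytes
) -> int:
    """Insert one row and return the sequence the database assigned."""
    digest = hashlib.sha256(payload).digest()  # stand-in for BLAKE3
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT INTO events (event_type, subject_kind, subject_id, actor,"
            " payload, payload_digest) VALUES (?, ?, ?, ?, ?, ?)",
            (event_type, subject_kind, subject_id, actor, payload, digest),
        )
    return cursor.lastrowid


first = append_event("v1.package.created", "package", "pkg-1", "producer-a", b"\xa0")
second = append_event("v1.package.finalized", "package", "pkg-1", "producer-a", b"\xa0")
rows = conn.execute(
    "SELECT sequence, event_type FROM events WHERE subject_kind = ? AND subject_id = ?"
    " ORDER BY sequence",
    ("package", "pkg-1"),
).fetchall()
```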
|
||||
|
||||
- package created,
|
||||
- file uploaded,
|
||||
- package finalized,
|
||||
- retrieval requested,
|
||||
- retention extended,
|
||||
- hold applied or released,
|
||||
- deletion requested,
|
||||
- deletion completed or failed.
|
||||
### `artifact_packages` (materialised view)
|
||||
|
||||
The audit log does not need to be cryptographic in the first release, but the
|
||||
schema should leave room for signed events or external write-once storage later.
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `name` | `TEXT NOT NULL` | |
| `producer` | `TEXT NOT NULL` | |
| `subject` | `TEXT NOT NULL` | |
| `retention_class` | `TEXT NOT NULL` | FK to `retention_classes` |
| `metadata_schema_id` | `UUID NULL` | FK to `metadata_schemas` |
| `metadata` | `JSONB NOT NULL` | validated against schema if present |
| `status` | `TEXT NOT NULL` | `created` / `uploading` / `finalized` / `deletion_eligible` / `deleted` / `failed` |
| `manifest_digest` | `BYTEA NULL` | populated on finalize |
| `created_at`, `finalized_at`, `expires_at` | `TIMESTAMPTZ` | |
| `last_event_sequence` | `BIGINT NOT NULL` | for replay bookkeeping |
|
||||
|
||||
## Data Model
|
||||
### `artifact_files` (materialised view)
|
||||
|
||||
### Artifact Package
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `package_id` | `UUID NOT NULL` | FK |
| `relative_path` | `TEXT NOT NULL` | logical path; unique within package |
| `media_type` | `TEXT NOT NULL` | required (ADR-0006) |
| `size_bytes` | `BIGINT NOT NULL` | |
| `digest_algorithm` | `TEXT NOT NULL` | `blake3` by default (ADR-0001) |
| `digest_primary` | `BYTEA NOT NULL` | bytes of the primary digest |
| `digest_sha256` | `BYTEA NOT NULL` | always populated for interop |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
|
||||
|
||||
Required fields:
|
||||
### `storage_locations` (materialised view)
|
||||
|
||||
- `id`
|
||||
- `name`
|
||||
- `producer`
|
||||
- `subject`
|
||||
- `retention_class`
|
||||
- `status`
|
||||
- `created_at`
|
||||
- `finalized_at`
|
||||
- `expires_at`
|
||||
- `metadata`
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `artifact_file_id` | `UUID NOT NULL` | FK |
| `backend_id` | `TEXT NOT NULL` | |
| `content_address` | `TEXT NOT NULL` | `<algo>:<hex>` |
| `object_key` | `TEXT NOT NULL` | backend-specific, usually derived from `content_address` |
| `storage_class` | `TEXT NULL` | backend-specific label |
| `retrieval_tier` | `TEXT NOT NULL DEFAULT 'hot'` | `hot` / `warm` / `cold` / `archive` |
| `restore_status` | `TEXT NULL` | `available` / `restore_requested` / `restoring` / `restored` / `expired` |
| `status` | `TEXT NOT NULL` | `recorded` / `verified` / `failed` / `deleted` |
| `created_at`, `last_verified_at` | `TIMESTAMPTZ` | |
|
||||
|
||||
Recommended metadata keys:
|
||||
### `retention_state` (materialised view)
|
||||
|
||||
- `repo_slug`
|
||||
- `run_id`
|
||||
- `assessment_id`
|
||||
- `target_profile_ref`
|
||||
- `assessment_profile_ref`
|
||||
- `source_commits`
|
||||
- `tool_versions`
|
||||
- `environment`
|
||||
| Column | Type | Notes |
|---|---|---|
| `package_id` | `UUID PRIMARY KEY` | |
| `current_expires_at` | `TIMESTAMPTZ NULL` | NULL = no expiry (permanent or held) |
| `effective_class` | `TEXT NOT NULL` | |
| `active_hold_id` | `UUID NULL` | |
| `eligible_for_deletion` | `BOOLEAN NOT NULL` | |
|
||||
|
||||
### Artifact File
|
||||
### `retention_classes` (seed data, not derived)
|
||||
|
||||
Required fields:
|
||||
| Column | Type | Notes |
|---|---|---|
| `class_id` | `TEXT PRIMARY KEY` | `transient` / `raw-evidence` / `summary-evidence` / `release-evidence` / `permanent-record` |
| `default_duration` | `INTERVAL NULL` | NULL for `permanent-record` |
| `deletion_strategy` | `TEXT NOT NULL` | `mark_eligible` / `auto_delete_after_grace` (v1 only uses the former) |
|
||||
|
||||
- `id`
|
||||
- `package_id`
|
||||
- `relative_path`
|
||||
- `media_type`
|
||||
- `size_bytes`
|
||||
- `sha256`
|
||||
- `created_at`
|
||||
### `metadata_schemas` (config table)
|
||||
|
||||
### Storage Location
|
||||
| Column | Type | Notes |
|---|---|---|
| `id` | `UUID PRIMARY KEY` | |
| `slug` | `TEXT NOT NULL UNIQUE` | e.g. `guide-board.run.v1` |
| `json_schema` | `JSONB NOT NULL` | |
| `created_at` | `TIMESTAMPTZ NOT NULL` | |
|
||||
|
||||
Required fields:
|
||||
## API shape
|
||||
|
||||
- `id`
|
||||
- `artifact_file_id`
|
||||
- `backend_id`
|
||||
- `object_key`
|
||||
- `storage_class`
|
||||
- `status`
|
||||
- `created_at`
|
||||
- `last_verified_at`
|
||||
|
||||
### Retention Event
|
||||
|
||||
Required fields:
|
||||
|
||||
- `id`
|
||||
- `package_id`
|
||||
- `event_type`
|
||||
- `reason`
|
||||
- `created_by`
|
||||
- `created_at`
|
||||
- `previous_expires_at`
|
||||
- `new_expires_at`
|
||||
|
||||
Event types:
|
||||
|
||||
- `default_rule_applied`
|
||||
- `extended`
|
||||
- `hold_applied`
|
||||
- `hold_released`
|
||||
- `deletion_eligible`
|
||||
- `deleted`
|
||||
|
||||
## API Shape
|
||||
|
||||
Initial endpoints:
|
||||
### Native v1 surface
|
||||
|
||||
```text
GET  /health
GET  /backends
POST /packages
GET  /packages
GET  /packages/{package_id}
POST /packages/{package_id}/files
POST /packages/{package_id}/finalize
GET  /packages/{package_id}/manifest
GET  /files/{file_id}/download
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release

GET   /health
GET   /backends
GET   /retention-classes

POST  /packages                                # create
GET   /packages                                # list, query by metadata
GET   /packages/{package_id}                   # metadata
POST  /packages/{package_id}/files             # single-shot file upload
POST  /packages/{package_id}/finalize          # produce manifest
GET   /packages/{package_id}/manifest          # canonical CBOR (Accept: application/cbor)
GET   /packages/{package_id}/manifest.json     # JCS projection (Accept: application/json)

GET   /files/{file_id}                         # metadata
GET   /files/{file_id}/download                # bytes

POST  /uploads                                 # open an upload session (resource shape pinned now)
PATCH /uploads/{upload_id}                     # range body
POST  /uploads/{upload_id}/complete            # promote to /packages/.../files

POST  /packages/{package_id}/retention/extensions
POST  /packages/{package_id}/retention/holds
POST  /packages/{package_id}/retention/holds/{hold_id}/release

GET   /events?since={sequence}                 # long-poll registry change feed
```
|
||||
|
||||
The first ingestion path can accept multipart file uploads. A later trusted-local
|
||||
operator endpoint may ingest from server-local paths, but it should be disabled
|
||||
by default because path ingestion changes the security boundary.
|
||||
The `POST /uploads/...` resource shape is committed now even if v1
|
||||
implements it as single-shot internally; ADR per `PLATFORM-AMBITION` A6.
|
||||
|
||||
## Package Manifest
|
||||
### Deferred / not v1
|
||||
|
||||
Every finalized package should expose a JSON manifest containing:
|
||||
- `/v2/…` OCI Distribution endpoints (ADR-0006).
|
||||
- gRPC API.
|
||||
- Streaming CDC topic (NATS / Kafka).
|
||||
- Multi-tenant namespacing in URLs.
|
||||
|
||||
- package metadata,
|
||||
- retention summary,
|
||||
- file list,
|
||||
- file digests and sizes,
|
||||
- storage backend references,
|
||||
- source metadata,
|
||||
- created/finalized timestamps.
|
||||
## Package manifest content (v1)
|
||||
|
||||
For guide-board runs, the manifest should preserve links to:
|
||||
A finalised manifest carries:
|
||||
|
||||
- `run.json`
|
||||
- `retention-summary.json`
|
||||
- `reports/assessment-package.json`
|
||||
- `reports/report.md`
|
||||
- extension-generated scorecards or log reviews,
|
||||
- raw artifact files captured by the assessment package manifest.
|
||||
- `manifest_version: 1`
|
||||
- `package`: id, name, producer, subject, retention class, created_at,
|
||||
finalized_at, expires_at, metadata, metadata_schema_id (nullable).
|
||||
- `files`: ordered list of `{id, relative_path, media_type, size_bytes,
|
||||
digest_algorithm, digest_primary_hex, digest_sha256_hex}`.
|
||||
- `storage_receipts`: ordered list of `{file_id, backend_id,
|
||||
content_address, retrieval_tier, status}` per stored copy.
|
||||
- `retention_summary`: current class, expires_at, holds, last
|
||||
retention event.
|
||||
- `provenance`: `{source_commits, tool_versions, environment,
|
||||
ingest_actor, ingest_timestamps}`. Schema-driven; freeform under a
|
||||
registered schema or empty if none.
|
||||
|
||||
## Guide-Board Pilot Flow
|
||||
The manifest digest (`blake3:<hex>`) is the package's canonical
|
||||
external identifier.
|
||||
|
||||
```text
guide-board run directory
-> open-cmis-tck scorecard/log review
-> artifact-store package create
-> upload run files
-> finalize manifest
-> Statehub record links package id and summary
```
|
||||
## Storage backends
|
||||
|
||||
The artifact package should carry:
|
||||
### Local filesystem (v1)
|
||||
|
||||
- run id,
|
||||
- target profile reference,
|
||||
- assessment profile reference,
|
||||
- result status,
|
||||
- source commits for guide-board, open-cmis-tck, and the assessed repository,
|
||||
- important report paths,
|
||||
- retention class `raw-evidence` or `release-evidence`.
|
||||
- Root: configured directory.
|
||||
- Object key layout: `<root>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
|
||||
- Atomic write via `fsync(tmpfile) + rename`. No partial states visible.
|
||||
- Path traversal prevented at the SPI boundary; the local backend
|
||||
rejects any key that does not match the expected layout.
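
The key layout, the layout check, and the atomic write can be sketched together. This is a minimal illustration assuming a POSIX-like filesystem; `object_path` and `atomic_write` are hypothetical helper names, and the regular expression is one possible way to reject keys that do not match the layout.

```python
import os
import re
import tempfile
from pathlib import Path

# Only `<algorithm>:<lowercase hex>` is accepted, so "..", "/" and similar
# traversal material can never reach the filesystem.
_ADDRESS = re.compile(r"(?P<algo>[a-z0-9]+):(?P<hex>[0-9a-f]{8,})")


def object_path(root: Path, content_address: str) -> Path:
    """Map a content address to <root>/<algo>/<hex[0:2]>/<hex[2:4]>/<hex>."""
    match = _ADDRESS.fullmatch(content_address)
    if match is None:
        raise ValueError(f"malformed content address: {content_address!r}")
    algo, hex_ = match["algo"], match["hex"]
    return root / algo / hex_[0:2] / hex_[2:4] / hex_


def atomic_write(path: Path, data: bytes) -> None:
    """Write to a temp file in the target directory, fsync, then rename."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

Creating the temp file inside the destination directory keeps the rename on one filesystem, which is what makes the final step atomic; readers see either the old state or the complete object.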
|
||||
|
||||
## Ceph And S3-Compatible Storage
|
||||
### S3-compatible / Ceph RGW (WP-0004)
|
||||
|
||||
Ceph should be introduced through the S3-compatible adapter, not as a special
|
||||
case in producer logic.
|
||||
- Endpoint, bucket, region, access key ref, secret key ref, key
|
||||
prefix, storage class label, optional SSE config.
|
||||
- Object key: `<prefix>/<digest_algorithm>/<hex[0:2]>/<hex[2:4]>/<hex>`.
|
||||
- Multipart upload for objects above a configurable threshold.
|
||||
|
||||
Configuration should support:
|
||||
## Security boundary (v1)
|
||||
|
||||
- endpoint URL,
|
||||
- bucket,
|
||||
- region,
|
||||
- access key reference,
|
||||
- secret key reference,
|
||||
- optional server-side encryption settings,
|
||||
- object key prefix,
|
||||
- storage class label.
|
||||
- Internal service. No anonymous public access.
|
||||
- Authenticated producer / operator API. v1 ships shared-secret bearer
|
||||
tokens; OIDC integration is its own workplan.
|
||||
- No secret values in artifact metadata.
|
||||
- Upload paths are logical, never trusted filesystem paths. A
  server-local path-ingest endpoint is *not* offered in v1.
|
||||
- Download authorisation is checked at the registry layer, never at
|
||||
the backend.
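
A shared-secret bearer token check of the kind v1 relies on might look like the sketch below. The function names are hypothetical, and how the expected token is loaded (environment variable, mounted secret file) is left out.

```python
from __future__ import annotations

import hmac


def extract_bearer_token(authorization_header: str | None) -> str | None:
    """Return the token from an `Authorization: Bearer <token>` header value."""
    if not authorization_header:
        return None
    scheme, _, token = authorization_header.partition(" ")
    token = token.strip()
    if scheme.lower() != "bearer" or not token:
        return None
    return token


def is_authorised(authorization_header: str | None, expected_token: str) -> bool:
    """Compare in constant time so timing does not leak how much matched."""
    token = extract_bearer_token(authorization_header)
    if token is None:
        return False
    return hmac.compare_digest(token.encode("utf-8"), expected_token.encode("utf-8"))
```

Using `hmac.compare_digest` instead of `==` is the relevant design choice: an ordinary string comparison can return early on the first differing character.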
|
||||
|
||||
The service should never require credentials in producer request bodies. Use
|
||||
environment variables, mounted secret files, or a local secret provider.
|
||||
## Resolved open questions
|
||||
|
||||
## Future Retrieval Tiers
|
||||
- **Deduplication scope.** Global by content address (ADR-0001).
|
||||
Reference-counted deletion via a GC pass (WP-0006, TBD).
|
||||
- **Deletion ordering.** Mark records `deletion_eligible` first via an
|
||||
event. Byte deletion is a separate, audited operation that emits a
|
||||
second event. Reverse order is forbidden.
|
||||
- **Metadata schemas.** Open JSON with optional producer-registered
|
||||
JSON Schema; validation at ingest (ADR-0005, `metadata_schemas`).
|
||||
- **Statehub integration scope.** Statehub keeps package IDs and
|
||||
summary; never bytes. The `/events` long-poll is the integration
|
||||
point.
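
The deletion ordering rule can be expressed as a small guard. This sketch is illustrative only: `PackageLifecycle` and `DeletionOrderError` are hypothetical, and `v1.storage.bytes_deleted` is an invented event name, since the v1 event list does not name the byte-deletion event.

```python
class DeletionOrderError(RuntimeError):
    """Raised when byte deletion is attempted before the eligibility event."""


class PackageLifecycle:
    """Tracks one package's status and the events emitted for it."""

    def __init__(self) -> None:
        self.status = "finalized"
        self.events: list[str] = []

    def mark_deletion_eligible(self) -> None:
        if self.status != "finalized":
            raise DeletionOrderError(f"cannot mark eligible from {self.status!r}")
        self.status = "deletion_eligible"
        self.events.append("v1.retention.deletion_eligible")

    def delete_bytes(self) -> None:
        # The reverse order is forbidden: bytes go only after the mark.
        if self.status != "deletion_eligible":
            raise DeletionOrderError(f"cannot delete bytes from {self.status!r}")
        self.status = "deleted"
        # Hypothetical event name for the second, audited step.
        self.events.append("v1.storage.bytes_deleted")


package = PackageLifecycle()
package.mark_deletion_eligible()
package.delete_bytes()
```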
|
||||
|
||||
The initial API can treat all stored files as immediately retrievable. Later,
|
||||
storage locations can include:
|
||||
## Outstanding open questions (not blocking v1)
|
||||
|
||||
- `retrieval_tier`: hot, warm, cold, archive,
|
||||
- `restore_status`: available, restore_requested, restoring, restored, expired,
|
||||
- `restore_requested_at`,
|
||||
- `restore_expires_at`.
|
||||
- Identity provider for shared deployments.
|
||||
- Default retention durations per class (operator-configurable; needs
|
||||
one round of stakeholder input).
|
||||
- WASM plugin host design (deferred to its own workplan; see
|
||||
`PLATFORM-AMBITION`).
|
||||
- Federation / mirroring protocol (post-OCI-endpoint workplan).
|
||||
|
||||
The registry API should be able to return "not immediately available" without
|
||||
changing artifact identity.
|
||||
## Roadmap pointer
|
||||
|
||||
## Security Boundary
|
||||
|
||||
Initial service assumptions:
|
||||
|
||||
- internal service, not public internet exposed,
|
||||
- authenticated producer/operator API before shared deployment,
|
||||
- no secret values stored in artifact metadata,
|
||||
- package paths are logical paths, not trusted filesystem paths,
|
||||
- download authorization should be checked at the registry layer.
|
||||
|
||||
Files may contain sensitive evidence. The service must treat metadata and bytes
|
||||
as confidential by default.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- Which identity provider should guard shared deployments?
|
||||
- Should package metadata schemas be open-ended JSON or typed by producer?
|
||||
- Should deduplication be package-local only or global by content hash?
|
||||
- Should deletion first mark records deleted, then delete bytes, or reverse that
|
||||
order with compensating events?
|
||||
- How much Statehub integration belongs in this repo versus in Statehub clients?
|
||||
The implementation sequence is in `docs/ROADMAP.md`. The first
|
||||
workplan is `workplans/ARTIFACT-STORE-WP-0001-foundation.md`.
|
||||
|
||||