artifact-store/docs/ARCHITECTURE-BLUEPRINT.md
2026-05-15 20:08:32 +02:00

Artifact Store Architecture Blueprint

Status: draft
Created: 2026-05-15

Purpose

artifact-store provides a generic registry and storage gateway for durable generated artifacts. Producers register packages and files with metadata; storage adapters persist the bytes; retention policy decides how long artifacts remain eligible for retrieval.

The design keeps artifact identity and lifecycle separate from storage implementation. This allows the first version to run against local filesystem storage while the production path can use S3-compatible object storage such as Ceph RGW.

Architecture Summary

producer
  -> Artifact Registry API
    -> metadata database
    -> retention policy engine
    -> audit event log
    -> storage adapter interface
      -> local filesystem backend
      -> S3-compatible backend
        -> Ceph RGW deployment
      -> future cloud/blob/archive backends

The registry is the authority for artifact metadata and lifecycle. Backends are responsible for byte storage and retrieval.

Design Principles

  • Backend-neutral registry: no producer should know whether bytes live in Ceph, local disk, or a cloud bucket.
  • Content-addressable confidence: every stored file records a SHA-256 digest and a size in bytes, so its content can be verified after storage.
  • Retention by default: every package receives an expiry decision at ingestion.
  • Extensions are explicit: retention extensions and holds are audit events, not silent metadata edits.
  • Packages remain portable: a manifest should be enough to understand a package without calling the producer.
  • Statehub links; it does not store bytes: Statehub records artifact IDs and outcomes, while artifact-store owns file persistence.
  • Deletion is deliberate: expiry only makes artifacts eligible for deletion. Deletion jobs must be auditable, and a deletion can be reversed only while the backend still holds the data.

Components

Registry API

HTTP API for producers and operators.

Initial responsibilities:

  • create artifact packages,
  • upload or ingest files,
  • finalize packages,
  • retrieve package metadata,
  • list/search packages by subject and producer metadata,
  • create retention extensions and holds,
  • expose download metadata or redirect/download endpoints,
  • expose health and backend status.

Metadata Store

Persistent database for registry state.

The initial implementation can use SQLite for local development and PostgreSQL for shared service deployments, provided that matches the surrounding service stack.

Core tables:

  • artifact_packages
  • artifact_files
  • storage_locations
  • retention_rules
  • retention_events
  • audit_events

Storage Adapter Interface

Small backend contract used by the API service.

Required operations:

  • put(object_key, stream, metadata) -> storage_location
  • get(object_key) -> stream or signed_url
  • head(object_key) -> object_metadata
  • delete(object_key) -> deletion_result
  • health() -> backend_status

Initial backends:

  • local filesystem backend for tests and development,
  • S3-compatible backend for Ceph RGW and cloud object stores.
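
The contract above can be sketched as a Python Protocol with a local filesystem backend behind it. This is a minimal sketch, not the service's implementation: the class names, the return shapes of head, delete, and health, and the "local" and "stored" values are assumptions made for illustration, and get returns only a stream here rather than a signed URL.

```python
from __future__ import annotations

import io
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import BinaryIO, Protocol


@dataclass(frozen=True)
class StorageLocation:
    backend_id: str
    object_key: str
    storage_class: str  # "local" is a placeholder label (assumption)
    status: str         # "stored" is a placeholder status (assumption)


class StorageAdapter(Protocol):
    """The five operations the registry needs from any backend."""

    def put(self, object_key: str, stream: BinaryIO, metadata: dict[str, str]) -> StorageLocation: ...
    def get(self, object_key: str) -> BinaryIO: ...
    def head(self, object_key: str) -> dict[str, int]: ...
    def delete(self, object_key: str) -> bool: ...
    def health(self) -> dict[str, object]: ...


class LocalFilesystemBackend:
    """Development backend: one file per object under a single root directory."""

    def __init__(self, root: Path, backend_id: str = "local-fs") -> None:
        self.root = root.resolve()
        self.backend_id = backend_id
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, object_key: str) -> Path:
        # Refuse keys that would resolve to the root itself or outside it.
        path = (self.root / object_key).resolve()
        if path == self.root or not path.is_relative_to(self.root):
            raise ValueError(f"invalid object key: {object_key!r}")
        return path

    def put(self, object_key: str, stream: BinaryIO, metadata: dict[str, str]) -> StorageLocation:
        path = self._path(object_key)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as out:
            shutil.copyfileobj(stream, out)
        # metadata is ignored here; an object store would attach it to the object.
        return StorageLocation(self.backend_id, object_key, "local", "stored")

    def get(self, object_key: str) -> BinaryIO:
        return self._path(object_key).open("rb")

    def head(self, object_key: str) -> dict[str, int]:
        return {"size_bytes": self._path(object_key).stat().st_size}

    def delete(self, object_key: str) -> bool:
        path = self._path(object_key)
        existed = path.exists()
        path.unlink(missing_ok=True)
        return existed

    def health(self) -> dict[str, object]:
        return {"backend_id": self.backend_id, "ok": self.root.is_dir()}


# Usage: store one object in a throwaway directory.
backend = LocalFilesystemBackend(Path(tempfile.mkdtemp()))
location = backend.put("pkg-1/reports/report.md", io.BytesIO(b"# report\n"), {})
```

Because the registry would depend only on StorageAdapter, an S3-compatible backend could be swapped in without touching producer-facing code.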

Retention Policy Engine

Applies default rules at ingestion and records later changes.

Initial retention classes:

  • transient: short-lived scratch artifacts,
  • raw-evidence: raw logs and run output,
  • summary-evidence: compact reports and summaries,
  • release-evidence: release or customer-facing evidence packages,
  • permanent-record: manually held records with no automatic expiry.

Each package stores:

  • selected retention class,
  • default retention rule,
  • computed expires_at,
  • extension records,
  • hold records,
  • deletion eligibility state.
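
As an illustration of how a default rule could turn a retention class into expires_at at ingestion, the sketch below uses a lookup table. The retention periods are hypothetical placeholders, since this blueprint names the classes but not their durations, and the function names are likewise assumptions.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical default periods; only the class names come from the blueprint.
DEFAULT_RETENTION: dict[str, Optional[timedelta]] = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,  # no automatic expiry
}


def compute_expires_at(retention_class: str, ingested_at: datetime) -> Optional[datetime]:
    """Apply the default rule for a class; None means the package never expires."""
    if retention_class not in DEFAULT_RETENTION:
        raise ValueError(f"unknown retention class: {retention_class!r}")
    period = DEFAULT_RETENTION[retention_class]
    return None if period is None else ingested_at + period


def is_deletion_eligible(expires_at: Optional[datetime], active_holds: int, now: datetime) -> bool:
    """A package is eligible only once it has expired and no hold is active."""
    if expires_at is None or active_holds > 0:
        return False
    return now >= expires_at


# Usage: a raw-evidence package ingested on the blueprint's creation date.
ingested_at = datetime(2026, 5, 15, tzinfo=timezone.utc)
expires_at = compute_expires_at("raw-evidence", ingested_at)
```

Keeping the hold check separate from expires_at means a hold never has to rewrite the computed expiry, which fits the principle that extensions and holds are explicit events.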

Audit Log

Append-only record of important events:

  • package created,
  • file uploaded,
  • package finalized,
  • retrieval requested,
  • retention extended,
  • hold applied or released,
  • deletion requested,
  • deletion completed or failed.

The audit log does not need to be cryptographically signed in the first release, but the schema should leave room for signed events or external write-once storage later.
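
One way to leave that room is to reserve an optional signature field from the start. The in-memory sketch below shows the append-only shape only; the field names, the sequence counter, and the AuditLog class are assumptions, and a real service would persist events in the audit_events table.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class AuditEvent:
    sequence: int
    event_type: str
    package_id: str
    actor: str
    occurred_at: datetime
    details: dict[str, Any]
    signature: Optional[str] = None  # reserved for signed events later (assumption)


class AuditLog:
    """Append-only: events can be added and read, never updated or removed."""

    def __init__(self) -> None:
        self._events: list[AuditEvent] = []

    def append(
        self,
        event_type: str,
        package_id: str,
        actor: str,
        details: Optional[dict[str, Any]] = None,
    ) -> AuditEvent:
        event = AuditEvent(
            sequence=len(self._events) + 1,
            event_type=event_type,
            package_id=package_id,
            actor=actor,
            occurred_at=datetime.now(timezone.utc),
            details=dict(details or {}),
        )
        self._events.append(event)
        return event

    def events(self) -> tuple[AuditEvent, ...]:
        # Return an immutable snapshot so callers cannot reorder the log.
        return tuple(self._events)


# Usage: record the first two events of a package's life.
log = AuditLog()
created = log.append("package_created", "pkg-1", "guide-board")
uploaded = log.append("file_uploaded", "pkg-1", "guide-board", {"relative_path": "run.json"})
```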

Data Model

Artifact Package

Required fields:

  • id
  • name
  • producer
  • subject
  • retention_class
  • status
  • created_at
  • finalized_at (empty until the package is finalized)
  • expires_at (empty when the retention class has no automatic expiry)
  • metadata

Recommended metadata keys:

  • repo_slug
  • run_id
  • assessment_id
  • target_profile_ref
  • assessment_profile_ref
  • source_commits
  • tool_versions
  • environment

Artifact File

Required fields:

  • id
  • package_id
  • relative_path
  • media_type
  • size_bytes
  • sha256
  • created_at
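
The size_bytes and sha256 fields can be computed in one pass over the upload stream, and the same routine can later check stored bytes against the record. A minimal sketch using the standard hashlib module follows; the function names are illustrative rather than part of the design.

```python
from __future__ import annotations

import hashlib
import io
from typing import BinaryIO


def digest_stream(stream: BinaryIO, chunk_size: int = 64 * 1024) -> tuple[int, str]:
    """Read the stream in chunks and return (size_bytes, sha256 hex digest)."""
    digest = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
        size += len(chunk)
    return size, digest.hexdigest()


def matches_record(stream: BinaryIO, size_bytes: int, sha256: str) -> bool:
    """True when the bytes in the stream match the recorded size and digest."""
    actual_size, actual_sha256 = digest_stream(stream)
    return actual_size == size_bytes and actual_sha256 == sha256.lower()


# Usage: describe an upload, then verify a later read against the record.
size_bytes, sha256 = digest_stream(io.BytesIO(b"run output\n"))
still_intact = matches_record(io.BytesIO(b"run output\n"), size_bytes, sha256)
```

Reading in chunks keeps memory use flat for large raw-evidence files.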

Storage Location

Required fields:

  • id
  • artifact_file_id
  • backend_id
  • object_key
  • storage_class
  • status
  • created_at
  • last_verified_at

Retention Event

Required fields:

  • id
  • package_id
  • event_type
  • reason
  • created_by
  • created_at
  • previous_expires_at
  • new_expires_at

Event types:

  • default_rule_applied
  • extended
  • hold_applied
  • hold_released
  • deletion_eligible
  • deleted
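
To show how an extension could be recorded as an event rather than a silent edit, the sketch below builds an "extended" retention event that carries both the previous and the new expiry. The validation rules (a non-empty reason, a strictly later expiry) and the use of a random UUID for id are assumptions, not requirements stated in this blueprint.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetentionEvent:
    id: str
    package_id: str
    event_type: str
    reason: str
    created_by: str
    created_at: datetime
    previous_expires_at: Optional[datetime]
    new_expires_at: Optional[datetime]


def extend_retention(
    package_id: str,
    current_expires_at: datetime,
    new_expires_at: datetime,
    reason: str,
    created_by: str,
    now: datetime,
) -> RetentionEvent:
    """Build the event for an extension; the caller stores it and updates the package."""
    if not reason.strip():
        raise ValueError("an extension needs a stated reason")
    if new_expires_at <= current_expires_at:
        raise ValueError("an extension must move expires_at later")
    return RetentionEvent(
        id=str(uuid.uuid4()),
        package_id=package_id,
        event_type="extended",
        reason=reason,
        created_by=created_by,
        created_at=now,
        previous_expires_at=current_expires_at,
        new_expires_at=new_expires_at,
    )


# Usage: push a package's expiry out by 30 days.
now = datetime(2026, 5, 15, tzinfo=timezone.utc)
old_expiry = now + timedelta(days=90)
event = extend_retention(
    "pkg-1", old_expiry, old_expiry + timedelta(days=30), "audit in progress", "operator", now
)
```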

API Shape

Initial endpoints:

GET  /health
GET  /backends
POST /packages
GET  /packages
GET  /packages/{package_id}
POST /packages/{package_id}/files
POST /packages/{package_id}/finalize
GET  /packages/{package_id}/manifest
GET  /files/{file_id}/download
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release

The first ingestion path can accept multipart file uploads. A later trusted-local operator endpoint may ingest from server-local paths, but it should be disabled by default because path ingestion changes the security boundary.
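
The create, upload, and finalize endpoints imply a small package lifecycle. The sketch below models one possible reading of it in memory. The status values "open" and "finalized", the rule that finalized packages reject further uploads, and the refusal to finalize an empty package are all assumptions; this blueprint does not define them.

```python
from __future__ import annotations

from dataclasses import dataclass, field


class PackageStateError(Exception):
    """Raised when a request does not fit the package's current status."""


@dataclass
class Package:
    id: str
    status: str = "open"  # hypothetical status values: "open" then "finalized"
    files: list[str] = field(default_factory=list)

    def add_file(self, relative_path: str) -> None:
        # Corresponds to POST /packages/{package_id}/files
        if self.status != "open":
            raise PackageStateError("package is finalized; no further uploads")
        if relative_path in self.files:
            raise PackageStateError(f"duplicate path: {relative_path}")
        self.files.append(relative_path)

    def finalize(self) -> None:
        # Corresponds to POST /packages/{package_id}/finalize
        if self.status != "open":
            raise PackageStateError("package is already finalized")
        if not self.files:
            raise PackageStateError("cannot finalize an empty package")
        self.status = "finalized"


# Usage: the happy path a producer would follow.
package = Package("pkg-1")
package.add_file("run.json")
package.add_file("reports/report.md")
package.finalize()
```

Under this reading the manifest is stable once finalize succeeds, because no later upload can change the file list.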

Package Manifest

Every finalized package should expose a JSON manifest containing:

  • package metadata,
  • retention summary,
  • file list,
  • file digests and sizes,
  • storage backend references,
  • source metadata,
  • created/finalized timestamps.

For guide-board runs, the manifest should preserve links to:

  • run.json
  • retention-summary.json
  • reports/assessment-package.json
  • reports/report.md
  • extension-generated scorecards or log reviews,
  • raw artifact files captured by the assessment package manifest.
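
A manifest along these lines could be assembled from the package and file records. The JSON layout below is one possible shape, not a fixed schema: the top-level keys, the nesting of storage references under each file, and all sample values are assumptions.

```python
from __future__ import annotations

import json
from typing import Any


def build_manifest(package: dict[str, Any], files: list[dict[str, Any]]) -> str:
    """Serialize a finalized package and its files into a JSON manifest."""
    manifest = {
        "package": {k: package[k] for k in ("id", "name", "producer", "subject", "status")},
        "retention": {
            "retention_class": package["retention_class"],
            "expires_at": package["expires_at"],
        },
        "source": package["metadata"],
        "timestamps": {
            "created_at": package["created_at"],
            "finalized_at": package["finalized_at"],
        },
        # Sort by path so the same package always yields the same manifest.
        "files": sorted(files, key=lambda f: f["relative_path"]),
    }
    return json.dumps(manifest, indent=2, sort_keys=True)


# Usage with made-up sample records.
sample_package = {
    "id": "pkg-1", "name": "guide-board run", "producer": "guide-board",
    "subject": "example-repo", "status": "finalized",
    "retention_class": "raw-evidence", "expires_at": "2026-08-13T00:00:00Z",
    "created_at": "2026-05-15T00:00:00Z", "finalized_at": "2026-05-15T00:05:00Z",
    "metadata": {"run_id": "run-42", "repo_slug": "example/example-repo"},
}
sample_files = [
    {"relative_path": "run.json", "media_type": "application/json", "size_bytes": 120,
     "sha256": "0" * 64, "storage": {"backend_id": "local-fs", "object_key": "pkg-1/run.json"}},
    {"relative_path": "reports/report.md", "media_type": "text/markdown", "size_bytes": 2048,
     "sha256": "1" * 64, "storage": {"backend_id": "local-fs", "object_key": "pkg-1/reports/report.md"}},
]
manifest_json = build_manifest(sample_package, sample_files)
```

Sorting keys and files keeps the output deterministic, which makes manifests easy to diff or hash.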

Guide-Board Pilot Flow

guide-board run directory
  -> open-cmis-tck scorecard/log review
  -> artifact-store package create
  -> upload run files
  -> finalize manifest
  -> Statehub record links package id and summary

The artifact package should carry:

  • run id,
  • target profile reference,
  • assessment profile reference,
  • result status,
  • source commits for guide-board, open-cmis-tck, and the assessed repository,
  • important report paths,
  • retention class raw-evidence or release-evidence.

Ceph And S3-Compatible Storage

Ceph should be introduced through the S3-compatible adapter, not as a special case in producer logic.

Configuration should support:

  • endpoint URL,
  • bucket,
  • region,
  • access key reference,
  • secret key reference,
  • optional server-side encryption settings,
  • object key prefix,
  • storage class label.

The service should never require credentials in producer request bodies. Use environment variables, mounted secret files, or a local secret provider.
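
The configuration list above could map onto a small settings object that holds references to credentials rather than their values. In the sketch below, the ARTIFACT_S3_* variable names and the "env:" and "file:" reference schemes are hypothetical conventions invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3BackendConfig:
    endpoint_url: str
    bucket: str
    region: str
    access_key_ref: str   # a reference such as "env:NAME" or "file:/path", never the key itself
    secret_key_ref: str
    sse_mode: Optional[str]
    key_prefix: str
    storage_class: str


def load_s3_config(env: Mapping[str, str]) -> S3BackendConfig:
    """Build the backend config from environment-style settings (names are assumptions)."""
    return S3BackendConfig(
        endpoint_url=env["ARTIFACT_S3_ENDPOINT_URL"],
        bucket=env["ARTIFACT_S3_BUCKET"],
        region=env["ARTIFACT_S3_REGION"],
        access_key_ref=env["ARTIFACT_S3_ACCESS_KEY_REF"],
        secret_key_ref=env["ARTIFACT_S3_SECRET_KEY_REF"],
        sse_mode=env.get("ARTIFACT_S3_SSE_MODE"),
        key_prefix=env.get("ARTIFACT_S3_KEY_PREFIX", ""),
        storage_class=env.get("ARTIFACT_S3_STORAGE_CLASS", "standard"),
    )


def resolve_secret(ref: str, env: Mapping[str, str]) -> str:
    """Resolve a reference at the moment a client is built, not at config load."""
    scheme, _, target = ref.partition(":")
    if scheme == "env":
        return env[target]
    if scheme == "file":
        return Path(target).read_text(encoding="utf-8").strip()
    raise ValueError(f"unsupported secret reference: {ref!r}")


# Usage with a plain dict; a service would typically pass os.environ instead.
demo_env = {
    "ARTIFACT_S3_ENDPOINT_URL": "https://rgw.example.internal",
    "ARTIFACT_S3_BUCKET": "artifacts",
    "ARTIFACT_S3_REGION": "default",
    "ARTIFACT_S3_ACCESS_KEY_REF": "env:CEPH_ACCESS_KEY",
    "ARTIFACT_S3_SECRET_KEY_REF": "env:CEPH_SECRET_KEY",
    "CEPH_ACCESS_KEY": "example-access-key",
    "CEPH_SECRET_KEY": "example-secret-key",
}
config = load_s3_config(demo_env)
```

Because the config object stores only references, logging or serializing it does not leak credential values.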

Future Retrieval Tiers

The initial API can treat all stored files as immediately retrievable. Later, storage locations can include:

  • retrieval_tier: hot, warm, cold, archive,
  • restore_status: available, restore_requested, restoring, restored, expired,
  • restore_requested_at,
  • restore_expires_at.

The registry API should be able to return "not immediately available" without changing artifact identity.
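
A download handler could branch on restore_status while always returning the same file id, as sketched below. The response shape and the choice to treat "available" and "restored" as the retrievable states are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

# Statuses assumed to mean the bytes can be served right now.
RETRIEVABLE_STATUSES = frozenset({"available", "restored"})


@dataclass(frozen=True)
class StorageLocationState:
    file_id: str
    retrieval_tier: str   # hot, warm, cold, archive
    restore_status: str   # available, restore_requested, restoring, restored, expired


def download_response(location: StorageLocationState) -> dict[str, Any]:
    """Describe whether a file can be served now; the file id never changes."""
    if location.restore_status in RETRIEVABLE_STATUSES:
        return {"file_id": location.file_id, "available": True}
    return {
        "file_id": location.file_id,
        "available": False,
        "retrieval_tier": location.retrieval_tier,
        "restore_status": location.restore_status,
    }


# Usage: the same file id in a hot tier and in an archive tier mid-restore.
hot = download_response(StorageLocationState("file-1", "hot", "available"))
archived = download_response(StorageLocationState("file-1", "archive", "restoring"))
```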

Security Boundary

Initial service assumptions:

  • internal service, not public internet exposed,
  • producer and operator API calls are authenticated before any shared deployment,
  • no secret values stored in artifact metadata,
  • package paths are logical paths, not trusted filesystem paths,
  • download authorization should be checked at the registry layer.

Files may contain sensitive evidence. The service must treat metadata and bytes as confidential by default.
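
Treating package paths as logical paths suggests validating relative_path before it is recorded or turned into an object key. The sketch below uses pathlib.PurePosixPath, which never touches the filesystem. The specific rejection rules (absolute paths, ".." segments, backslashes, NUL bytes, empty paths) are one reasonable policy and an assumption, not a stated requirement.

```python
from pathlib import PurePosixPath


def normalize_logical_path(relative_path: str) -> str:
    """Return a cleaned logical path, or raise ValueError if it is unsafe."""
    if not relative_path or "\x00" in relative_path or "\\" in relative_path:
        raise ValueError(f"invalid path: {relative_path!r}")
    path = PurePosixPath(relative_path)
    if path.is_absolute():
        raise ValueError(f"absolute paths are not allowed: {relative_path!r}")
    if ".." in path.parts:
        raise ValueError(f"parent references are not allowed: {relative_path!r}")
    if not path.parts:
        # Inputs such as "." collapse to an empty path.
        raise ValueError(f"path has no segments: {relative_path!r}")
    # PurePosixPath has already dropped "." segments and repeated slashes.
    return str(path)


# Usage: a harmless but untidy path is normalized.
clean = normalize_logical_path("reports/./report.md")
```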

Open Questions

  • Which identity provider should guard shared deployments?
  • Should package metadata schemas be open-ended JSON or typed by producer?
  • Should deduplication be package-local only or global by content hash?
  • Should deletion first mark records deleted, then delete bytes, or reverse that order with compensating events?
  • How much Statehub integration belongs in this repo versus in Statehub clients?