Artifact Store Architecture Blueprint
Status: draft
Created: 2026-05-15
Purpose
artifact-store provides a generic registry and storage gateway for durable
generated artifacts. Producers register packages and files with metadata;
storage adapters persist the bytes; retention policy decides how long artifacts
remain eligible for retrieval.
The design keeps artifact identity and lifecycle separate from storage implementation. This allows the first version to run against local filesystem storage while the production path can use S3-compatible object storage such as Ceph RGW.
Architecture Summary
producer
  -> Artifact Registry API
      -> metadata database
      -> retention policy engine
      -> audit event log
      -> storage adapter interface
          -> local filesystem backend
          -> S3-compatible backend
              -> Ceph RGW deployment
          -> future cloud/blob/archive backends
The registry is the authority for artifact metadata and lifecycle. Backends are responsible for byte storage and retrieval.
Design Principles
- Backend-neutral registry: no producer should know whether bytes live in Ceph, local disk, or a cloud bucket.
- Content-addressable confidence: every stored file has a digest and size.
- Retention by default: every package receives an expiry decision at ingestion.
- Extensions are explicit: retention extensions and holds are audit events, not silent metadata edits.
- Packages remain portable: a manifest should be enough to understand a package without calling the producer.
- Statehub links but does not store bytes: Statehub records artifact IDs and outcomes; artifact-store owns file persistence.
- Deletion is deliberate: expiry only makes artifacts eligible for deletion; deletion jobs must be auditable, and a deletion can be reversed only while the backend still holds the data.
Components
Registry API
HTTP API for producers and operators.
Initial responsibilities:
- create artifact packages,
- upload or ingest files,
- finalize packages,
- retrieve package metadata,
- list/search packages by subject and producer metadata,
- create retention extensions and holds,
- expose download metadata or redirect/download endpoints,
- expose health and backend status.
Metadata Store
Persistent database for registry state.
The initial implementation can use SQLite for local development and PostgreSQL for shared service deployments, if that matches the surrounding service stack.
Core tables:
- artifact_packages
- artifact_files
- storage_locations
- retention_rules
- retention_events
- audit_events
Storage Adapter Interface
Small backend contract used by the API service.
Required operations:
- put(object_key, stream, metadata) -> storage_location
- get(object_key) -> stream or signed_url
- head(object_key) -> object_metadata
- delete(object_key) -> deletion_result
- health() -> backend_status
Initial backends:
- local filesystem backend for tests and development,
- S3-compatible backend for Ceph RGW and cloud object stores.
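The adapter contract and the local filesystem backend could look roughly like the sketch below. It is a minimal illustration, not the project's implementation: the `StorageLocation` fields, the class names, and the simplifications (`get` returns only a stream, `delete` returns a boolean instead of a richer deletion result) are assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path
from typing import BinaryIO, Protocol


@dataclass(frozen=True)
class StorageLocation:
    # Hypothetical shape; the blueprint only names "storage_location".
    backend_id: str
    object_key: str
    size_bytes: int
    sha256: str


class StorageAdapter(Protocol):
    """Backend contract mirroring the required operations listed above."""

    def put(self, object_key: str, stream: BinaryIO, metadata: dict) -> StorageLocation: ...
    def get(self, object_key: str) -> BinaryIO: ...
    def head(self, object_key: str) -> dict: ...
    def delete(self, object_key: str) -> bool: ...
    def health(self) -> dict: ...


class LocalFilesystemBackend:
    """Development backend: one file per object key under a root directory."""

    def __init__(self, root: Path, backend_id: str = "local-fs") -> None:
        self.root = Path(root).resolve()
        self.backend_id = backend_id
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, object_key: str) -> Path:
        # Reject keys that would resolve outside the storage root.
        candidate = (self.root / object_key).resolve()
        if candidate == self.root or not candidate.is_relative_to(self.root):
            raise ValueError(f"invalid object key: {object_key!r}")
        return candidate

    def put(self, object_key: str, stream: BinaryIO, metadata: dict) -> StorageLocation:
        # This sketch ignores `metadata`; a real backend might persist it.
        target = self._path(object_key)
        target.parent.mkdir(parents=True, exist_ok=True)
        digest, size = hashlib.sha256(), 0
        with target.open("wb") as out:
            while chunk := stream.read(64 * 1024):
                digest.update(chunk)
                out.write(chunk)
                size += len(chunk)
        return StorageLocation(self.backend_id, object_key, size, digest.hexdigest())

    def get(self, object_key: str) -> BinaryIO:
        return self._path(object_key).open("rb")

    def head(self, object_key: str) -> dict:
        stat = self._path(object_key).stat()
        return {"object_key": object_key, "size_bytes": stat.st_size}

    def delete(self, object_key: str) -> bool:
        path = self._path(object_key)
        existed = path.exists()
        path.unlink(missing_ok=True)
        return existed

    def health(self) -> dict:
        return {"backend_id": self.backend_id, "ok": self.root.is_dir()}
```

Computing the digest while streaming means the registry can record size and SHA-256 without a second read of the file. An S3-compatible backend would satisfy the same protocol with different internals.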
Retention Policy Engine
Applies default rules at ingestion and records later changes.
Initial retention classes:
- transient: short-lived scratch artifacts,
- raw-evidence: raw logs and run output,
- summary-evidence: compact reports and summaries,
- release-evidence: release or customer-facing evidence packages,
- permanent-record: manually held records with no automatic expiry.
Each package stores:
- selected retention class,
- default retention rule,
- computed expires_at,
- extension records,
- hold records,
- deletion eligibility state.
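As a rough sketch of how a default rule might turn a retention class into an expiry decision at ingestion: the durations below are placeholders invented for this example (the blueprint names the classes but not their lifetimes), and `decide_retention` and `is_deletion_eligible` are hypothetical helper names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Placeholder durations, NOT from the blueprint; None means no automatic expiry.
DEFAULT_RETENTION = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,
}


@dataclass(frozen=True)
class RetentionDecision:
    retention_class: str
    expires_at: Optional[datetime]
    event_type: str = "default_rule_applied"


def decide_retention(retention_class: str, created_at: datetime) -> RetentionDecision:
    """Compute the expiry decision recorded when a package is ingested."""
    if retention_class not in DEFAULT_RETENTION:
        raise ValueError(f"unknown retention class: {retention_class!r}")
    ttl = DEFAULT_RETENTION[retention_class]
    expires_at = None if ttl is None else created_at + ttl
    return RetentionDecision(retention_class, expires_at)


def is_deletion_eligible(decision: RetentionDecision, now: datetime,
                         active_holds: int = 0) -> bool:
    """Expired packages are eligible unless a hold is active or no expiry is set."""
    if active_holds > 0 or decision.expires_at is None:
        return False
    return now >= decision.expires_at


created = datetime(2026, 5, 15, tzinfo=timezone.utc)
decision = decide_retention("transient", created)
```

Keeping eligibility as a separate check reflects the principle that expiry only makes a package a candidate for deletion; the deletion itself remains a distinct, audited step.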
Audit Log
Append-only record of important events:
- package created,
- file uploaded,
- package finalized,
- retrieval requested,
- retention extended,
- hold applied or released,
- deletion requested,
- deletion completed or failed.
The audit log does not need to be cryptographic in the first release, but the schema should leave room for signed events or external write-once storage later.
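One way to picture an append-only log that leaves room for signing later is a JSON Lines file with a reserved, currently empty signature field. This is only a sketch: the `AuditLog` class, the field names, and the file-based storage are assumptions for illustration, and a shared deployment would more likely write to the audit_events table.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


class AuditLog:
    """Append-only JSON Lines log; existing lines are never rewritten."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def append(self, event_type: str, package_id: str, actor: str,
               details: Optional[dict] = None) -> dict:
        event = {
            "event_type": event_type,
            "package_id": package_id,
            "actor": actor,
            "created_at": datetime.now(timezone.utc).isoformat(),
            "details": details or {},
            # Reserved for signed events in a later release; unused here.
            "signature": None,
        }
        # Mode "a" only ever adds to the end of the file.
        with self.path.open("a", encoding="utf-8") as handle:
            handle.write(json.dumps(event, sort_keys=True) + "\n")
        return event

    def read_all(self) -> list:
        if not self.path.exists():
            return []
        with self.path.open(encoding="utf-8") as handle:
            return [json.loads(line) for line in handle if line.strip()]
```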
Data Model
Artifact Package
Required fields:
- id
- name
- producer
- subject
- retention_class
- status
- created_at
- finalized_at
- expires_at
- metadata
Recommended metadata keys:
- repo_slug
- run_id
- assessment_id
- target_profile_ref
- assessment_profile_ref
- source_commits
- tool_versions
- environment
Artifact File
Required fields:
- id
- package_id
- relative_path
- media_type
- size_bytes
- sha256
- created_at
Storage Location
Required fields:
- id
- artifact_file_id
- backend_id
- object_key
- storage_class
- status
- created_at
- last_verified_at
Retention Event
Required fields:
- id
- package_id
- event_type
- reason
- created_by
- created_at
- previous_expires_at
- new_expires_at
Event types:
- default_rule_applied
- extended
- hold_applied
- hold_released
- deletion_eligible
- deleted
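To illustrate how a retention extension can be recorded as an event rather than a silent edit, the sketch below models the package and retention event fields as dataclasses. The `extend_retention` helper, the UUID identifiers, and the "open" status value are assumptions for this example.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ArtifactPackage:
    id: str
    name: str
    producer: str
    subject: str
    retention_class: str
    status: str
    created_at: datetime
    finalized_at: Optional[datetime] = None
    expires_at: Optional[datetime] = None
    metadata: dict = field(default_factory=dict)


@dataclass(frozen=True)
class RetentionEvent:
    id: str
    package_id: str
    event_type: str
    reason: str
    created_by: str
    created_at: datetime
    previous_expires_at: Optional[datetime]
    new_expires_at: Optional[datetime]


def extend_retention(package: ArtifactPackage, new_expires_at: datetime,
                     reason: str, created_by: str) -> RetentionEvent:
    """Move expires_at later and return the event that documents the change."""
    if package.expires_at is None:
        raise ValueError("package has no automatic expiry to extend")
    if new_expires_at <= package.expires_at:
        raise ValueError("an extension must move expires_at later")
    event = RetentionEvent(
        id=str(uuid.uuid4()),
        package_id=package.id,
        event_type="extended",
        reason=reason,
        created_by=created_by,
        created_at=datetime.now(timezone.utc),
        previous_expires_at=package.expires_at,
        new_expires_at=new_expires_at,
    )
    package.expires_at = new_expires_at
    return event
```

Because the event captures both the previous and the new expiry, the history of a package's lifetime can be reconstructed from the retention events alone.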
API Shape
Initial endpoints:
GET /health
GET /backends
POST /packages
GET /packages
GET /packages/{package_id}
POST /packages/{package_id}/files
POST /packages/{package_id}/finalize
GET /packages/{package_id}/manifest
GET /files/{file_id}/download
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
The first ingestion path can accept multipart file uploads. A later trusted-local operator endpoint may ingest from server-local paths, but it should be disabled by default because path ingestion changes the security boundary.
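The create, upload, and finalize endpoints imply a small package lifecycle. The in-memory sketch below shows one possible reading of it, in which files can be added only before finalization. The status values "open" and "finalized", the class names, and the rule that finalization is one-way are assumptions, not behavior the blueprint specifies.

```python
import uuid
from dataclasses import dataclass, field


class PackageStateError(Exception):
    """Raised when an operation is not allowed in the package's current state."""


@dataclass
class Package:
    id: str
    name: str
    status: str = "open"  # assumed status values: "open" -> "finalized"
    files: list = field(default_factory=list)


class InMemoryRegistry:
    def __init__(self) -> None:
        self._packages: dict = {}

    def create_package(self, name: str) -> Package:
        # Corresponds to POST /packages
        package = Package(id=str(uuid.uuid4()), name=name)
        self._packages[package.id] = package
        return package

    def get(self, package_id: str) -> Package:
        # Corresponds to GET /packages/{package_id}
        return self._packages[package_id]

    def add_file(self, package_id: str, relative_path: str) -> None:
        # Corresponds to POST /packages/{package_id}/files
        package = self.get(package_id)
        if package.status != "open":
            raise PackageStateError("cannot add files to a finalized package")
        package.files.append(relative_path)

    def finalize(self, package_id: str) -> Package:
        # Corresponds to POST /packages/{package_id}/finalize
        package = self.get(package_id)
        if package.status != "open":
            raise PackageStateError("package is already finalized")
        package.status = "finalized"
        return package
```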
Package Manifest
Every finalized package should expose a JSON manifest containing:
- package metadata,
- retention summary,
- file list,
- file digests and sizes,
- storage backend references,
- source metadata,
- created/finalized timestamps.
For guide-board runs, the manifest should preserve links to:
- run.json,
- retention-summary.json,
- reports/assessment-package.json,
- reports/report.md,
- extension-generated scorecards or log reviews,
- raw artifact files captured by the assessment package manifest.
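A manifest along these lines could be assembled as in the following sketch. The JSON layout, the `manifest_version` key, and the helper names `file_entry` and `build_manifest` are illustrative assumptions; the blueprint lists what the manifest contains but not its exact schema.

```python
import hashlib
import json


def file_entry(relative_path: str, data: bytes, media_type: str,
               backend_id: str, object_key: str) -> dict:
    """Describe one file with its digest, size, and storage reference."""
    return {
        "relative_path": relative_path,
        "media_type": media_type,
        "size_bytes": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
        "storage": {"backend_id": backend_id, "object_key": object_key},
    }


def build_manifest(package: dict, retention: dict, files: list) -> str:
    """Serialize a package manifest as deterministic, human-readable JSON."""
    manifest = {
        "manifest_version": 1,  # hypothetical versioning field
        "package": {
            key: package.get(key)
            for key in ("id", "name", "producer", "subject", "status",
                        "created_at", "finalized_at")
        },
        "retention": retention,
        "source": package.get("metadata", {}),
        # Sort by path so repeated builds produce the same output.
        "files": sorted(files, key=lambda entry: entry["relative_path"]),
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Sorting keys and file entries keeps the output stable across builds, which makes manifests easier to diff and to hash.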
Guide-Board Pilot Flow
guide-board run directory
-> open-cmis-tck scorecard/log review
-> artifact-store package create
-> upload run files
-> finalize manifest
-> Statehub record links package id and summary
The artifact package should carry:
- run id,
- target profile reference,
- assessment profile reference,
- result status,
- source commits for guide-board, open-cmis-tck, and the assessed repository,
- important report paths,
- retention class raw-evidence or release-evidence.
Ceph And S3-Compatible Storage
Ceph should be introduced through the S3-compatible adapter, not as a special case in producer logic.
Configuration should support:
- endpoint URL,
- bucket,
- region,
- access key reference,
- secret key reference,
- optional server-side encryption settings,
- object key prefix,
- storage class label.
The service should never require credentials in producer request bodies. Use environment variables, mounted secret files, or a local secret provider.
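The configuration fields above might be loaded as in this sketch, which stores only references to credentials and resolves them at the point of use. The environment variable names, the `env:` and `file:` reference scheme, and the default storage class label are all hypothetical choices made for the example.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class S3BackendConfig:
    endpoint_url: str
    bucket: str
    region: str
    access_key_ref: str   # a reference such as "env:NAME", never the key itself
    secret_key_ref: str
    key_prefix: str = ""
    storage_class: str = "STANDARD"
    server_side_encryption: Optional[str] = None


def load_config(env: Mapping[str, str] = os.environ) -> S3BackendConfig:
    """Read backend settings from (hypothetical) ARTIFACT_S3_* variables."""
    return S3BackendConfig(
        endpoint_url=env["ARTIFACT_S3_ENDPOINT_URL"],
        bucket=env["ARTIFACT_S3_BUCKET"],
        region=env.get("ARTIFACT_S3_REGION", "us-east-1"),
        access_key_ref=env["ARTIFACT_S3_ACCESS_KEY_REF"],
        secret_key_ref=env["ARTIFACT_S3_SECRET_KEY_REF"],
        key_prefix=env.get("ARTIFACT_S3_KEY_PREFIX", ""),
        storage_class=env.get("ARTIFACT_S3_STORAGE_CLASS", "STANDARD"),
        server_side_encryption=env.get("ARTIFACT_S3_SSE"),
    )


def resolve_secret(ref: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve "env:NAME" or "file:/path" to the secret value."""
    scheme, _, value = ref.partition(":")
    if scheme == "env":
        return env[value]
    if scheme == "file":
        return Path(value).read_text(encoding="utf-8").strip()
    raise ValueError(f"unsupported secret reference scheme: {scheme!r}")
```

Since the config object holds references rather than values, logging or serializing it does not leak credentials.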
Future Retrieval Tiers
The initial API can treat all stored files as immediately retrievable. Later, storage locations can include:
- retrieval_tier: hot, warm, cold, archive,
- restore_status: available, restore_requested, restoring, restored, expired,
- restore_requested_at,
- restore_expires_at.
The registry API should be able to return "not immediately available" without changing artifact identity.
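A sketch of how a download lookup might report availability while keeping the same file ID is shown below. The response shape and the `describe_retrieval` helper are assumptions; treating "available" and "restored" as the two immediately retrievable states is one plausible reading of the status list.

```python
# Assumption: these two restore_status values mean the bytes can be served now.
IMMEDIATELY_AVAILABLE = {"available", "restored"}


def describe_retrieval(file_id: str, location: dict) -> dict:
    """Report retrieval state for a file without altering its identity."""
    # Locations written before tiers existed default to hot and available.
    tier = location.get("retrieval_tier", "hot")
    status = location.get("restore_status", "available")
    response = {
        "file_id": file_id,
        "retrieval_tier": tier,
        "restore_status": status,
        "available": status in IMMEDIATELY_AVAILABLE,
    }
    if not response["available"]:
        response["message"] = "not immediately available"
        response["restore_requested_at"] = location.get("restore_requested_at")
    return response
```

Defaulting missing fields to hot and available lets records created by the initial API keep working once tiers are introduced.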
Security Boundary
Initial service assumptions:
- internal service, not public internet exposed,
- authenticated producer/operator API before shared deployment,
- no secret values stored in artifact metadata,
- package paths are logical paths, not trusted filesystem paths,
- download authorization should be checked at the registry layer.
Files may contain sensitive evidence. The service must treat metadata and bytes as confidential by default.
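Treating package paths as logical rather than filesystem paths suggests validating them before they are stored or mapped to object keys. The following is a minimal sketch; `normalize_logical_path` is a hypothetical helper, and the exact rejection rules (no absolute paths, no ".." segments, no backslashes or NUL bytes) are assumptions about what a safe logical path should look like.

```python
from pathlib import PurePosixPath


def normalize_logical_path(relative_path: str) -> str:
    """Return a normalized logical path, or raise ValueError if it is unsafe."""
    if not relative_path or "\\" in relative_path or "\x00" in relative_path:
        raise ValueError(f"invalid logical path: {relative_path!r}")
    # PurePosixPath never touches the filesystem; it only parses the string.
    path = PurePosixPath(relative_path)
    if path.is_absolute():
        raise ValueError(f"absolute paths are not allowed: {relative_path!r}")
    if not path.parts or ".." in path.parts:
        raise ValueError(f"path escapes the package root: {relative_path!r}")
    # str() drops redundant "." segments and repeated slashes.
    return str(path)
```

For example, `normalize_logical_path("./reports//report.md")` yields `reports/report.md`, while `../etc/passwd` and `/etc/passwd` are rejected.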
Open Questions
- Which identity provider should guard shared deployments?
- Should package metadata schemas be open-ended JSON or typed by producer?
- Should deduplication be package-local only or global by content hash?
- Should deletion first mark records deleted, then delete bytes, or reverse that order with compensating events?
- How much Statehub integration belongs in this repo versus in Statehub clients?