generated from coulomb/repo-seed
Bootstrapping the repo
330
docs/ARCHITECTURE-BLUEPRINT.md
Normal file
@@ -0,0 +1,330 @@
# Artifact Store Architecture Blueprint

Status: draft
Created: 2026-05-15

## Purpose

`artifact-store` provides a generic registry and storage gateway for durable
generated artifacts. Producers register packages and files with metadata;
storage adapters persist the bytes; retention policy decides how long artifacts
remain eligible for retrieval.

The design keeps artifact identity and lifecycle separate from storage
implementation. This allows the first version to run against local filesystem
storage while the production path can use S3-compatible object storage such as
Ceph RGW.

## Architecture Summary

```text
producer
  -> Artifact Registry API
      -> metadata database
      -> retention policy engine
      -> audit event log
      -> storage adapter interface
          -> local filesystem backend
          -> S3-compatible backend
              -> Ceph RGW deployment
          -> future cloud/blob/archive backends
```

The registry is the authority for artifact metadata and lifecycle. Backends are
responsible for byte storage and retrieval.

## Design Principles

- Backend-neutral registry: no producer should know whether bytes live in Ceph,
  local disk, or a cloud bucket.
- Content-addressable confidence: every stored file has a digest and size.
- Retention by default: every package receives an expiry decision at ingestion.
- Extensions are explicit: retention extensions and holds are audit events, not
  silent metadata edits.
- Packages remain portable: a manifest should be enough to understand a package
  without calling the producer.
- Statehub links; it does not store bytes: Statehub records artifact IDs and
  outcomes; artifact-store owns file persistence.
- Deletion is deliberate: expiry only makes artifacts eligible for deletion.
  Deletion jobs must be auditable, and a deletion can be reversed only while the
  backend still holds the data.

## Components

### Registry API

HTTP API for producers and operators.

Initial responsibilities:

- create artifact packages,
- upload or ingest files,
- finalize packages,
- retrieve package metadata,
- list/search packages by subject and producer metadata,
- create retention extensions and holds,
- expose download metadata or redirect/download endpoints,
- expose health and backend status.

### Metadata Store

Persistent database for registry state.

The initial implementation can use SQLite for local development and PostgreSQL
for shared service deployments, if that matches the surrounding service stack.

Core tables:

- `artifact_packages`
- `artifact_files`
- `storage_locations`
- `retention_rules`
- `retention_events`
- `audit_events`

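As an illustration of the core tables, here is a minimal SQLite sketch covering two of them. The column names follow the Data Model section of this document; the SQL types, the constraints, and the helper name `open_metadata_db` are assumptions made for the example, not a settled schema.

```python
import sqlite3

# Hypothetical DDL: column names come from the Data Model section;
# types, NOT NULL choices, and the uniqueness rule are assumptions.
SCHEMA = """
CREATE TABLE artifact_packages (
    id              TEXT PRIMARY KEY,
    name            TEXT NOT NULL,
    producer        TEXT NOT NULL,
    subject         TEXT NOT NULL,
    retention_class TEXT NOT NULL,
    status          TEXT NOT NULL,
    created_at      TEXT NOT NULL,
    finalized_at    TEXT,
    expires_at      TEXT,
    metadata        TEXT NOT NULL DEFAULT '{}'
);
CREATE TABLE artifact_files (
    id            TEXT PRIMARY KEY,
    package_id    TEXT NOT NULL REFERENCES artifact_packages(id),
    relative_path TEXT NOT NULL,
    media_type    TEXT NOT NULL,
    size_bytes    INTEGER NOT NULL,
    sha256        TEXT NOT NULL,
    created_at    TEXT NOT NULL,
    UNIQUE (package_id, relative_path)
);
"""


def open_metadata_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database, enable foreign keys, and create the sketch tables."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(SCHEMA)
    return conn


conn = open_metadata_db()
```

Storing timestamps and `metadata` as text keeps the sketch portable between SQLite and PostgreSQL; a real deployment might prefer native timestamp and JSON column types.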
### Storage Adapter Interface

Small backend contract used by the API service.

Required operations:

- `put(object_key, stream, metadata) -> storage_location`
- `get(object_key) -> stream or signed_url`
- `head(object_key) -> object_metadata`
- `delete(object_key) -> deletion_result`
- `health() -> backend_status`

Initial backends:

- local filesystem backend for tests and development,
- S3-compatible backend for Ceph RGW and cloud object stores.

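The contract above could be expressed as a Python protocol with a local filesystem backend behind it. This is a sketch under stated assumptions: the class names `StorageBackend`, `LocalFilesystemBackend`, and `StorageLocation`, the return shapes (a plain `dict` for `head` and `health`, a `bool` for `delete`), and the key validation rule are all hypothetical.

```python
import io
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path, PurePosixPath
from typing import BinaryIO, Protocol


@dataclass(frozen=True)
class StorageLocation:
    backend_id: str
    object_key: str
    size_bytes: int


class StorageBackend(Protocol):
    """The five required operations, typed loosely for illustration."""

    def put(self, object_key: str, stream: BinaryIO, metadata: dict) -> StorageLocation: ...
    def get(self, object_key: str) -> BinaryIO: ...
    def head(self, object_key: str) -> dict: ...
    def delete(self, object_key: str) -> bool: ...
    def health(self) -> dict: ...


class LocalFilesystemBackend:
    def __init__(self, root: Path, backend_id: str = "local-fs") -> None:
        self.root = Path(root)
        self.backend_id = backend_id

    def _path(self, object_key: str) -> Path:
        # Object keys are logical; refuse anything that could escape the root.
        key = PurePosixPath(object_key)
        if key.is_absolute() or ".." in key.parts or not key.parts:
            raise ValueError(f"unsafe object key: {object_key!r}")
        return self.root.joinpath(*key.parts)

    def put(self, object_key, stream, metadata):
        # This sketch ignores `metadata`; another backend could attach it to the object.
        target = self._path(object_key)
        target.parent.mkdir(parents=True, exist_ok=True)
        with open(target, "wb") as out:
            shutil.copyfileobj(stream, out)
        return StorageLocation(self.backend_id, object_key, target.stat().st_size)

    def get(self, object_key):
        return open(self._path(object_key), "rb")

    def head(self, object_key):
        return {"size_bytes": self._path(object_key).stat().st_size}

    def delete(self, object_key):
        path = self._path(object_key)
        existed = path.exists()
        path.unlink(missing_ok=True)
        return existed  # True only if bytes were actually removed

    def health(self):
        return {"backend_id": self.backend_id, "ok": self.root.is_dir()}


# Usage: store one object under a temporary root.
backend: StorageBackend = LocalFilesystemBackend(Path(tempfile.mkdtemp()))
location = backend.put("pkg-1/run.json", io.BytesIO(b"{}"), {})
```

Because the registry only depends on the protocol, an S3-compatible backend can be swapped in later without touching producer-facing code.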
### Retention Policy Engine

Applies default rules at ingestion and records later changes.

Initial retention classes:

- `transient`: short-lived scratch artifacts,
- `raw-evidence`: raw logs and run output,
- `summary-evidence`: compact reports and summaries,
- `release-evidence`: release or customer-facing evidence packages,
- `permanent-record`: manually held records with no automatic expiry.

Each package stores:

- selected retention class,
- default retention rule,
- computed `expires_at`,
- extension records,
- hold records,
- deletion eligibility state.

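To make the expiry decision concrete, the sketch below computes `expires_at` from a retention class and derives deletion eligibility. The durations are placeholders: this document names the classes but does not fix how long each one is kept. The function names `compute_expires_at` and `is_deletion_eligible` are likewise hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical default durations, for illustration only.
DEFAULT_RETENTION = {
    "transient": timedelta(days=7),
    "raw-evidence": timedelta(days=90),
    "summary-evidence": timedelta(days=365),
    "release-evidence": timedelta(days=5 * 365),
    "permanent-record": None,  # no automatic expiry
}


def compute_expires_at(retention_class: str, ingested_at: datetime) -> Optional[datetime]:
    """Apply the default rule at ingestion; None means the package never expires."""
    if retention_class not in DEFAULT_RETENTION:
        raise ValueError(f"unknown retention class: {retention_class!r}")
    duration = DEFAULT_RETENTION[retention_class]
    return None if duration is None else ingested_at + duration


def is_deletion_eligible(expires_at: Optional[datetime], active_holds: int, now: datetime) -> bool:
    """A package is eligible only once it has expired and no hold is active."""
    if expires_at is None or active_holds > 0:
        return False
    return now >= expires_at


ingested = datetime(2026, 5, 15, tzinfo=timezone.utc)
expires = compute_expires_at("raw-evidence", ingested)
```

Keeping eligibility as a derived check, rather than a flag set once, means a newly applied hold takes effect immediately.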
### Audit Log

Append-only record of important events:

- package created,
- file uploaded,
- package finalized,
- retrieval requested,
- retention extended,
- hold applied or released,
- deletion requested,
- deletion completed or failed.

The audit log does not need to be cryptographic in the first release, but the
schema should leave room for signed events or external write-once storage later.

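One way to leave room for signed events is to reserve an optional field in the event record from the start. The sketch below is an in-memory, append-only log; the `AuditEvent` fields, the `signature` placeholder, and the JSON Lines export are assumptions, not a defined schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class AuditEvent:
    sequence: int
    event_type: str
    package_id: str
    actor: str
    occurred_at: str
    details: dict = field(default_factory=dict)
    signature: Optional[str] = None  # reserved: unused in the first release


class AuditLog:
    """Append-only: events can be added and read, never updated or removed."""

    def __init__(self) -> None:
        self._events = []

    def append(self, event_type: str, package_id: str, actor: str, **details) -> AuditEvent:
        event = AuditEvent(
            sequence=len(self._events) + 1,
            event_type=event_type,
            package_id=package_id,
            actor=actor,
            occurred_at=datetime.now(timezone.utc).isoformat(),
            details=details,
        )
        self._events.append(event)
        return event

    def to_json_lines(self) -> str:
        # One JSON object per line suits external write-once storage.
        return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in self._events)


log = AuditLog()
first = log.append("package_created", "pkg-1", "ci-runner")
second = log.append("file_uploaded", "pkg-1", "ci-runner", relative_path="run.json")
```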
## Data Model

### Artifact Package

Required fields:

- `id`
- `name`
- `producer`
- `subject`
- `retention_class`
- `status`
- `created_at`
- `finalized_at`
- `expires_at`
- `metadata`

Recommended metadata keys:

- `repo_slug`
- `run_id`
- `assessment_id`
- `target_profile_ref`
- `assessment_profile_ref`
- `source_commits`
- `tool_versions`
- `environment`

### Artifact File

Required fields:

- `id`
- `package_id`
- `relative_path`
- `media_type`
- `size_bytes`
- `sha256`
- `created_at`

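The `size_bytes` and `sha256` fields can be computed in a single streaming pass during upload, so large files never need to be held in memory. The helper name `describe_file` and the fallback media type are assumptions made for this sketch.

```python
import hashlib
import io
import mimetypes
from typing import BinaryIO


def describe_file(relative_path: str, stream: BinaryIO, chunk_size: int = 64 * 1024) -> dict:
    """Read the stream once, returning the content-derived Artifact File fields."""
    digest = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        digest.update(chunk)
        size += len(chunk)
    # Guess from the file name only; fall back to a generic binary type.
    media_type, _ = mimetypes.guess_type(relative_path)
    return {
        "relative_path": relative_path,
        "media_type": media_type or "application/octet-stream",
        "size_bytes": size,
        "sha256": digest.hexdigest(),
    }


record = describe_file("reports/report.md", io.BytesIO(b"# Report\n"))
```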
### Storage Location

Required fields:

- `id`
- `artifact_file_id`
- `backend_id`
- `object_key`
- `storage_class`
- `status`
- `created_at`
- `last_verified_at`

### Retention Event

Required fields:

- `id`
- `package_id`
- `event_type`
- `reason`
- `created_by`
- `created_at`
- `previous_expires_at`
- `new_expires_at`

Event types:

- `default_rule_applied`
- `extended`
- `hold_applied`
- `hold_released`
- `deletion_eligible`
- `deleted`

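A retention extension can be modelled as a function that never edits the package in place but returns a new `extended` event carrying both the old and the new expiry. The record below uses the required fields listed above; the function name `extend_retention` and its validation rules are assumptions.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass(frozen=True)
class RetentionEvent:
    id: str
    package_id: str
    event_type: str
    reason: str
    created_by: str
    created_at: datetime
    previous_expires_at: Optional[datetime]
    new_expires_at: Optional[datetime]


def extend_retention(
    package_id: str,
    current_expires_at: datetime,
    extra: timedelta,
    reason: str,
    created_by: str,
    now: Optional[datetime] = None,
) -> RetentionEvent:
    """Build the event for an extension; the caller persists it and updates expires_at."""
    if extra <= timedelta(0):
        raise ValueError("an extension must move the expiry later")
    if not reason.strip():
        raise ValueError("an extension needs a reason for the audit trail")
    return RetentionEvent(
        id=str(uuid.uuid4()),
        package_id=package_id,
        event_type="extended",
        reason=reason,
        created_by=created_by,
        created_at=now or datetime.now(timezone.utc),
        previous_expires_at=current_expires_at,
        new_expires_at=current_expires_at + extra,
    )


old_expiry = datetime(2026, 8, 13, tzinfo=timezone.utc)
event = extend_retention("pkg-1", old_expiry, timedelta(days=30), "open investigation", "operator-a")
```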
## API Shape

Initial endpoints:

```text
GET /health
GET /backends
POST /packages
GET /packages
GET /packages/{package_id}
POST /packages/{package_id}/files
POST /packages/{package_id}/finalize
GET /packages/{package_id}/manifest
GET /files/{file_id}/download
POST /packages/{package_id}/retention/extensions
POST /packages/{package_id}/retention/holds
POST /packages/{package_id}/retention/holds/{hold_id}/release
```

The first ingestion path can accept multipart file uploads. A later trusted-local
operator endpoint may ingest from server-local paths, but it should be disabled
by default because path ingestion changes the security boundary.

## Package Manifest

Every finalized package should expose a JSON manifest containing:

- package metadata,
- retention summary,
- file list,
- file digests and sizes,
- storage backend references,
- source metadata,
- created/finalized timestamps.

For guide-board runs, the manifest should preserve links to:

- `run.json`
- `retention-summary.json`
- `reports/assessment-package.json`
- `reports/report.md`
- extension-generated scorecards or log reviews,
- raw artifact files captured by the assessment package manifest.

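The manifest contents listed above might be assembled as follows. The JSON layout, the `manifest_version` key, and the function name `build_manifest` are assumptions; only the kinds of information included come from this document.

```python
import json


def build_manifest(package: dict, retention: dict, files: list) -> str:
    """Serialize a finalized package into a deterministic JSON manifest."""
    manifest = {
        "manifest_version": 1,  # hypothetical versioning key
        "package": {
            key: package.get(key)
            for key in ("id", "name", "producer", "subject", "status",
                        "created_at", "finalized_at")
        },
        "source": package.get("metadata", {}),
        "retention": retention,
        # Sort by path so the same package always yields the same manifest.
        "files": sorted(
            (
                {
                    "relative_path": f["relative_path"],
                    "media_type": f["media_type"],
                    "size_bytes": f["size_bytes"],
                    "sha256": f["sha256"],
                    "storage": {
                        "backend_id": f["backend_id"],
                        "object_key": f["object_key"],
                    },
                }
                for f in files
            ),
            key=lambda entry: entry["relative_path"],
        ),
    }
    return json.dumps(manifest, indent=2, sort_keys=True)
```

Sorting keys and file entries makes the output stable, which helps if the manifest itself is later hashed or signed.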
## Guide-Board Pilot Flow

```text
guide-board run directory
  -> open-cmis-tck scorecard/log review
  -> artifact-store package create
  -> upload run files
  -> finalize manifest
  -> Statehub record links package id and summary
```

The artifact package should carry:

- run id,
- target profile reference,
- assessment profile reference,
- result status,
- source commits for guide-board, open-cmis-tck, and the assessed repository,
- important report paths,
- retention class `raw-evidence` or `release-evidence`.

## Ceph And S3-Compatible Storage

Ceph should be introduced through the S3-compatible adapter, not as a special
case in producer logic.

Configuration should support:

- endpoint URL,
- bucket,
- region,
- access key reference,
- secret key reference,
- optional server-side encryption settings,
- object key prefix,
- storage class label.

The service should never require credentials in producer request bodies. Use
environment variables, mounted secret files, or a local secret provider.

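The distinction between a key reference and a key value can be kept explicit in the configuration type. In this sketch the config holds only the names of environment variables, and the secrets are resolved at startup. The class name `S3BackendConfig`, its field names and defaults, and the example endpoint and variable names are all hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple


@dataclass(frozen=True)
class S3BackendConfig:
    endpoint_url: str
    bucket: str
    region: str
    access_key_ref: str  # name of an environment variable, never the key itself
    secret_key_ref: str  # name of an environment variable, never the secret itself
    key_prefix: str = ""
    storage_class: str = "standard"
    server_side_encryption: Optional[str] = None


def resolve_credentials(
    config: S3BackendConfig, env: Mapping[str, str] = os.environ
) -> Tuple[str, str]:
    """Look up the referenced variables; error messages name variables, not values."""
    refs = (config.access_key_ref, config.secret_key_ref)
    missing = [ref for ref in refs if not env.get(ref)]
    if missing:
        raise RuntimeError("missing credential variables: " + ", ".join(missing))
    return env[config.access_key_ref], env[config.secret_key_ref]


config = S3BackendConfig(
    endpoint_url="https://rgw.example.internal",  # placeholder endpoint
    bucket="artifacts",
    region="default",
    access_key_ref="ARTIFACT_STORE_S3_ACCESS_KEY",
    secret_key_ref="ARTIFACT_STORE_S3_SECRET_KEY",
)
```

Because the config object never contains secret values, it can be logged or exposed through a backend status endpoint without leaking credentials.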
## Future Retrieval Tiers

The initial API can treat all stored files as immediately retrievable. Later,
storage locations can include:

- `retrieval_tier`: hot, warm, cold, archive,
- `restore_status`: available, restore_requested, restoring, restored, expired,
- `restore_requested_at`,
- `restore_expires_at`.

The registry API should be able to return "not immediately available" without
changing artifact identity.

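A possible shape for the "not immediately available" answer is sketched below: the file ID stays the same and only the availability fields change. Treating `available` and `restored` as the retrievable states is an assumption, as are the names `RetrievalState` and `describe_download` and the response keys.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only these two restore_status values allow an immediate download.
RETRIEVABLE_NOW = frozenset({"available", "restored"})


@dataclass(frozen=True)
class RetrievalState:
    retrieval_tier: str = "hot"
    restore_status: str = "available"
    restore_requested_at: Optional[str] = None
    restore_expires_at: Optional[str] = None


def describe_download(file_id: str, state: RetrievalState) -> dict:
    """Answer a download request without altering the artifact's identity."""
    return {
        "file_id": file_id,
        "retrieval_tier": state.retrieval_tier,
        "restore_status": state.restore_status,
        "immediately_available": state.restore_status in RETRIEVABLE_NOW,
    }


hot = describe_download("file-1", RetrievalState())
archived = describe_download(
    "file-1", RetrievalState(retrieval_tier="archive", restore_status="restoring")
)
```

The defaults on `RetrievalState` match the initial behaviour, where every stored file is treated as immediately retrievable.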
## Security Boundary

Initial service assumptions:

- internal service, not exposed to the public internet,
- authenticated producer/operator API before shared deployment,
- no secret values stored in artifact metadata,
- package paths are logical paths, not trusted filesystem paths,
- download authorization should be checked at the registry layer.

Files may contain sensitive evidence. The service must treat metadata and bytes
as confidential by default.

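Treating package paths as logical rather than trusted implies validating them before they are stored or mapped to an object key. The following sketch shows one conservative rule set; the function name `normalize_logical_path` and the specific rejections (backslashes, NUL bytes, absolute paths, `..` segments) are assumptions about what the service might enforce.

```python
from pathlib import PurePosixPath


def normalize_logical_path(relative_path: str) -> str:
    """Return a clean forward-slash path, or raise ValueError if it looks unsafe."""
    if not relative_path or "\\" in relative_path or "\x00" in relative_path:
        raise ValueError(f"invalid logical path: {relative_path!r}")
    path = PurePosixPath(relative_path)  # pure path: never touches the filesystem
    if path.is_absolute() or ".." in path.parts or not path.parts:
        raise ValueError(f"invalid logical path: {relative_path!r}")
    # PurePosixPath already collapses repeated slashes and "." segments.
    return path.as_posix()


clean = normalize_logical_path("reports//./report.md")
```

Rejecting suspicious input outright, instead of trying to repair it, keeps the stored `relative_path` values predictable across backends.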
## Open Questions

- Which identity provider should guard shared deployments?
- Should package metadata schemas be open-ended JSON or typed by producer?
- Should deduplication be package-local only or global by content hash?
- Should deletion first mark records deleted, then delete bytes, or reverse that
  order with compensating events?
- How much Statehub integration belongs in this repo versus in Statehub clients?