docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans

Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00
parent 403d903585
commit 747afc27a6
16 changed files with 1761 additions and 404 deletions


@@ -0,0 +1,80 @@
# ADR-0001 — Content-Addressed Storage with Dual Digest
Status: accepted
Date: 2026-05-15
Supersedes: —
Related: ADR-0003, ADR-0006, `docs/PLATFORM-AMBITION.md` commitments A1, A2, A9
## Context
The architecture blueprint as originally drafted addresses stored bytes by
logical `(package, relative_path)`. That is sufficient for v1 ingestion but
forecloses global deduplication, Merkle integrity proofs, partial
replication, federation, and OCI artifact compatibility — all of which the
platform ambition requires to remain reachable.
Independently, the original blueprint pins SHA-256 as the only file digest.
SHA-256 with SHA-NI on modern x86 reaches ~1.5 to 2 GB/s/core. BLAKE3 on the
same hardware reaches 6 to 10+ GB/s/core, parallelises across cores, and its
construction *is* a Merkle tree — package-level integrity becomes free.
SHA-256 remains the lingua franca of SLSA, in-toto, cosign, and OCI; we
cannot drop it.
## Decision
1. The canonical storage key for any byte sequence is its content address
in the form `<algorithm>:<lowercase-hex-digest>`. Storage backends store
and retrieve by this key. `relative_path` is logical metadata recorded
in the manifest, not a storage-layer concept.
2. Every `artifact_files` row carries two digest columns:
- `digest_primary` — the native digest; default algorithm `blake3`.
- `digest_sha256` — always populated for interop, even when `blake3`
is the primary.
Both are computed in a single ingest pass (one read of the input).
3. The schema also carries a `digest_algorithm` column naming the primary
algorithm. Additional algorithms are added by new columns or a side
table, never by overloading `digest_primary`.
4. Storage backend object keys are derived from `digest_primary` only.
Migrations between primary algorithms are explicit and audited; they
are not silent.
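
A minimal sketch of the single-pass dual digest described above, under stated assumptions: stdlib `blake2b` stands in for BLAKE3 so the example runs without the `blake3` wheel, and the names `Digest` fields and `ingest_digests` are illustrative, not the project's actual API.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO


@dataclass(frozen=True)
class Digest:
    """Value object for (algorithm, hex); serialised with its algorithm prefix."""

    algorithm: str
    hex: str

    def __str__(self) -> str:
        return f"{self.algorithm}:{self.hex}"


def ingest_digests(
    stream: BinaryIO, primary: str = "blake2b", chunk_size: int = 1 << 20
) -> tuple[Digest, Digest, int]:
    """Read the input once and feed every chunk to both hashers.

    ASSUMPTION: `blake2b` is a stand-in for `blake3`, which is not in the
    standard library. The one-read, two-hasher structure is the point.
    """
    primary_hash = hashlib.new(primary)
    sha256_hash = hashlib.sha256()
    size = 0
    while chunk := stream.read(chunk_size):
        primary_hash.update(chunk)
        sha256_hash.update(chunk)
        size += len(chunk)
    return (
        Digest(primary, primary_hash.hexdigest()),
        Digest("sha256", sha256_hash.hexdigest()),
        size,
    )
```

Calling `ingest_digests(io.BytesIO(data))` returns the primary digest (the storage key), the SHA-256 digest for interop, and the byte count, all from one read of the stream.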
## Consequences
Positive:
- Global deduplication is automatic — two identical files in two packages
share one backend object.
- Merkle integrity over a package is free with BLAKE3 (use the tree mode).
- Federation, partial mirrors, and OCI compatibility (ADR-0006) become
reachable without schema migration.
- Verification of a single file does not require fetching its package.
Negative:
- Two digests must be computed per ingest. Mitigated by streaming both
through one buffer; the bottleneck is I/O, not hashing.
- Reference counting: deletion of an `artifact_files` row cannot
unconditionally delete the backend object. A garbage-collector pass
reconciles references before deleting bytes. This is correct anyway
(deletion should be deliberate, per the blueprint).
- Producers requesting "store these N bytes at path P" must understand
that their P is logical. This is a documentation problem, not a
technical one.
## Implementation notes
- v1 ships BLAKE3 via the `blake3` PyPI wheel (Rust core, SIMD-accelerated;
no asm we maintain).
- v1 ships SHA-256 via stdlib `hashlib` (SHA-NI used when the CPython
build links against OpenSSL with SHA-NI support).
- A `Digest` value object wraps `(algorithm, hex)`; serialised forms
always include the algorithm prefix.
- A garbage-collector workplan is filed at WP-0006 (TBD); v1 does not
delete bytes automatically — it marks them eligible.
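
The "mark eligible, do not delete" behaviour might look roughly like the sketch below. It assumes file rows are available as `(relative_path, digest_primary)` pairs; the function names and the example keys are hypothetical.

```python
from collections import Counter
from typing import Iterable


def reference_counts(file_rows: Iterable[tuple[str, str]]) -> Counter[str]:
    """Count how many artifact_files rows point at each backend object key."""
    return Counter(key for _relative_path, key in file_rows)


def mark_eligible(
    backend_keys: Iterable[str], file_rows: Iterable[tuple[str, str]]
) -> set[str]:
    """Return backend keys that no row references any more.

    Nothing is deleted here: the result is only the set a later,
    deliberate garbage-collector pass would be allowed to remove.
    """
    counts = reference_counts(file_rows)
    return {key for key in backend_keys if counts[key] == 0}
```

Because two packages can share one backend object, removing one row only lowers the count; the object becomes eligible once the last reference is gone.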
## Status of the original blueprint pin
The pre-cleanup blueprint's `artifact_files.sha256` column is replaced by
`digest_algorithm`, `digest_primary`, `digest_sha256`. The pre-cleanup
blueprint's implicit path-keyed storage is replaced by content-keyed
storage. These changes are absorbed into `docs/ARCHITECTURE-BLUEPRINT.md`.


@@ -0,0 +1,76 @@
# ADR-0002 — Append-Only Event Log as Source of Truth
Status: accepted
Date: 2026-05-15
Related: `docs/PLATFORM-AMBITION.md` commitment A3
## Context
The original blueprint defines `audit_events` and `retention_events` as
separate tables. Both are useful, but neither is a complete authoritative
record of how registry state was produced. Several downstream needs share
one underlying primitive:
- audit (who did what when, with what result),
- change-data-capture feed for downstream consumers (Statehub, search),
- replication and federation between instances,
- point-in-time replay and disaster recovery,
- materialised view rebuilds when schemas evolve.
Each can be served by an append-only log of registry events with a
monotonic sequence number. Two separate tables cannot.
## Decision
1. The registry persists an append-only `events` table. Every state-
changing operation writes one row in the same database transaction as
the operation. Once written, rows are immutable.
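2. Each row has a strictly monotonic, gapless sequence number scoped to
the registry instance, and a UTC ingest timestamp. A PostgreSQL sequence
can skip values when a transaction rolls back, so gaplessness depends on
serialising appends (for example under an advisory lock), not on
`BIGSERIAL` alone.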
3. The current `artifact_packages`, `artifact_files`, `storage_locations`,
and `retention_state` tables are materialised views over `events`.
They are rebuildable by replay.
4. Event payloads are stored as canonical CBOR (ADR-0003), keyed by
`event_type` (string slug). The `event_type` namespace is versioned
(`v1.package.created`, `v1.file.ingested`, `v1.retention.extended`,
etc.).
5. `audit_events` and `retention_events` cease to exist as standalone
tables; their semantics are subsets of `events` filtered by
`event_type`.
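
The decision above can be sketched with stdlib `sqlite3`. This is an illustration under assumptions, not the v1 schema: compact JSON stands in for canonical CBOR, SHA-256 stands in for BLAKE3 as the payload digest, and `project`, `record`, and `replay` are hypothetical names.

```python
import hashlib
import json
import sqlite3

SCHEMA = """
CREATE TABLE events (
    sequence       INTEGER PRIMARY KEY AUTOINCREMENT,
    event_type     TEXT NOT NULL,
    subject_id     TEXT,
    payload        BLOB NOT NULL,
    payload_digest BLOB NOT NULL
);
CREATE TABLE artifact_packages (id TEXT PRIMARY KEY, name TEXT NOT NULL);
"""


def project(conn: sqlite3.Connection, event_type: str, subject_id: str, payload: dict) -> None:
    """Apply one event to the materialised view."""
    if event_type == "v1.package.created":
        conn.execute(
            "INSERT INTO artifact_packages (id, name) VALUES (?, ?)",
            (subject_id, payload["name"]),
        )


def record(conn: sqlite3.Connection, event_type: str, subject_id: str, payload: dict) -> int:
    """Write the event and its view update in one transaction."""
    # ASSUMPTION: sorted compact JSON is a stand-in for canonical CBOR.
    raw = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    with conn:  # commits on success, rolls back both writes on error
        cursor = conn.execute(
            "INSERT INTO events (event_type, subject_id, payload, payload_digest)"
            " VALUES (?, ?, ?, ?)",
            (event_type, subject_id, raw, hashlib.sha256(raw).digest()),
        )
        project(conn, event_type, subject_id, payload)
    return cursor.lastrowid


def replay(conn: sqlite3.Connection) -> None:
    """Rebuild the materialised view from the event log alone."""
    with conn:
        conn.execute("DELETE FROM artifact_packages")
        rows = conn.execute(
            "SELECT event_type, subject_id, payload FROM events ORDER BY sequence"
        ).fetchall()
        for event_type, subject_id, raw in rows:
            project(conn, event_type, subject_id, json.loads(raw))
```

If the view update fails, the event row is rolled back with it, so the log never records an operation that did not happen.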
## Consequences
Positive:
- One primitive serves audit, CDC, replication, replay, and rebuild.
- A consumer can tail by `sequence > N` and never miss an event.
- Forward-compatibility: new view columns can be derived from existing
events by adding a replay path; no migration required.
- Signed event chains are reachable later by adding a signature column.
Negative:
- Replays cost wall-clock time on large datasets. Snapshots of
materialised views (with the highest applied sequence stamped on them)
are used to bound replay cost.
- Schema migrations on materialised views still happen; they just no
longer touch the source of truth.
- Discipline required: any write that bypasses the event log is a bug.
Enforced by code review and a runtime invariant check on the
materialised tables.
## Implementation notes
- `events` schema (v1):
- `sequence BIGSERIAL PRIMARY KEY`
- `created_at TIMESTAMPTZ NOT NULL DEFAULT now()`
- `event_type TEXT NOT NULL`
- `subject_kind TEXT NOT NULL` (one of `package` | `file` | `retention` | `storage` | `system`)
- `subject_id UUID` — nullable for system-level events
- `actor TEXT NOT NULL` — producer or operator identity
- `payload BYTEA NOT NULL` — canonical CBOR
- `payload_digest BYTEA NOT NULL` — BLAKE3 of `payload`
- Indexes: `(subject_kind, subject_id)`, `(event_type, sequence)`.
- Replay tool ships in v1 as a CLI subcommand (`artifactstore replay`).
- Outbound CDC stream (NATS / Kafka) is its own workplan; v1 only exposes
long-poll over `GET /events?since=<sequence>`.
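
A consumer of the long-poll endpoint could advance its cursor as sketched here. The `fetch` callable is a hypothetical stand-in for one `GET /events?since=<sequence>` request, and events are reduced to `(sequence, event_type)` pairs for brevity.

```python
from typing import Callable, Sequence

Event = tuple[int, str]  # (sequence, event_type)


def drain(
    fetch: Callable[[int, int], Sequence[Event]], since: int, limit: int = 100
) -> tuple[list[Event], int]:
    """Fetch batches until the log is exhausted.

    Returns the events seen and the new cursor. The cursor only ever moves
    to the sequence of the last event actually received, so a consumer that
    persists it can resume without skipping anything.
    """
    cursor = since
    seen: list[Event] = []
    while True:
        batch = fetch(cursor, limit)
        if not batch:
            return seen, cursor
        seen.extend(batch)
        cursor = batch[-1][0]
```

The "never miss an event" property relies on the server returning events in ascending sequence order with `sequence > since`.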


@@ -0,0 +1,78 @@
# ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2)
Status: accepted
Date: 2026-05-15
Related: ADR-0001, ADR-0002, ADR-0006, `docs/PLATFORM-AMBITION.md` commitment A4
## Context
Manifests describe a package's identity, contents, retention, and
provenance. They are the durable, portable, signable summary of a package.
Three downstream features depend on byte-identical manifest serialisation:
1. Manifest digest (used as the package's content address — ADR-0001).
2. Signatures (cosign, Sigstore, in-toto, SLSA).
3. Cross-language / cross-version reproducibility (any client must be
able to verify a manifest produced by any other client).
JSON does not guarantee byte-identical output without an explicit
canonicalisation profile. The candidates are:
- **JCS** (JSON Canonicalization Scheme, RFC 8785) — JSON-shaped, widely
available, text-format, signs cleanly.
- **Canonical CBOR** (RFC 8949 §4.2.2) — binary, smaller, lower overhead
to canonicalise, native in cosign / Sigstore tooling, used by COSE.
- **DAG-CBOR** (IPLD profile) — canonical CBOR plus content-addressing
conventions; useful if we later integrate with IPLD/IPFS, but pulls in
ecosystem assumptions we don't yet need.
Canonical CBOR wins on size, parser surface, and direct compatibility
with the tooling we will adopt for signing (platform-ambition commitments A4, A9). JCS
is a reasonable alternative; we keep an emit-JCS path for human-readable
display but the signed form is CBOR.
## Decision
1. Manifests are serialised as **canonical CBOR** per RFC 8949 §4.2.2:
- definite-length encoding throughout,
- shortest-form integer encoding,
- map keys sorted bytewise lexicographically,
- no floating-point unless explicitly required (we do not require it),
- no semantic tags except those we explicitly enumerate.
2. The manifest's content address is `blake3:<hex>` of its canonical
CBOR bytes. This is the package's primary identifier in storage.
3. A canonical JSON projection (JCS) of the same manifest is available
for display, signing-tool interop, and human inspection. The
projection is deterministic: round-tripping through it must yield
byte-identical CBOR.
4. The manifest schema is itself versioned (`manifest_version: 1`).
Unknown fields are preserved on read and re-emitted on write (forward
compatibility); breaking schema changes bump the version.
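
To make the encoding rules in item 1 concrete, here is a small hand-written encoder for a subset of CBOR (integers, byte strings, text, arrays, maps, booleans, null). It is an illustrative sketch, not the `cbor2`-based codec the project ships, and the function names are hypothetical.

```python
import struct


def _head(major: int, argument: int) -> bytes:
    """Encode a major type with its argument in the shortest possible form."""
    if argument < 24:
        return bytes([major << 5 | argument])
    if argument < 1 << 8:
        return bytes([major << 5 | 24, argument])
    if argument < 1 << 16:
        return bytes([major << 5 | 25]) + struct.pack(">H", argument)
    if argument < 1 << 32:
        return bytes([major << 5 | 26]) + struct.pack(">I", argument)
    if argument < 1 << 64:
        return bytes([major << 5 | 27]) + struct.pack(">Q", argument)
    raise ValueError("integer out of range for this sketch")


def encode(value: object) -> bytes:
    """Deterministically encode a value using definite lengths only."""
    if isinstance(value, bool):  # must precede int: bool is an int subclass
        return b"\xf5" if value else b"\xf4"
    if value is None:
        return b"\xf6"
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, (list, tuple)):
        return _head(4, len(value)) + b"".join(encode(item) for item in value)
    if isinstance(value, dict):
        # Sort entries bytewise by their encoded keys.
        items = sorted(
            ((encode(k), encode(v)) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return _head(5, len(items)) + b"".join(k + v for k, v in items)
    # Floats and tags are rejected, matching "no floating-point" above.
    raise TypeError(f"unsupported type: {type(value).__name__}")
```

Two manifests that differ only in dictionary insertion order encode to the same bytes, which is what makes a digest over those bytes usable as a content address.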
## Consequences
Positive:
- Manifests are signable today by any tool that consumes CBOR (cosign,
ssh-keygen `-Y sign`, COSE libraries).
- The manifest digest is stable across languages, OS, and compiler.
- Smaller on disk and on the wire than JSON.
- Replay (ADR-0002) is unambiguous because event payloads are also CBOR.
Negative:
- Less human-readable in raw form; the CLI must offer a `pretty` projection.
- One more dependency (a CBOR library). We pin one in ADR-0005.
- Future schema evolution requires the same canonicalisation discipline.
Enforced by a property-based test: any manifest must round-trip
CBOR → JCS → CBOR with byte equality.
## Implementation notes
- v1 library: `cbor2` (PyPI; pure-Python with optional C extension).
Wrapped behind `artifactstore.manifest.codec` so swapping to a faster
impl is transparent.
- JCS projection: `jcs` (PyPI) or hand-rolled — decision deferred to
WP-0001-T003.
- A `Manifest` value class enforces field order on emit, not just on
encode. This catches non-canonical producers at the API boundary.


@@ -0,0 +1,79 @@
# ADR-0004 — Control Plane / Data Plane Contract
Status: accepted
Date: 2026-05-15
Related: ADR-0005, `docs/PLATFORM-AMBITION.md` commitment A5,
`docs/ASSEMBLY-EXPERIMENT.md`
## Context
The platform ambition expects a Rust (eventually asm-tuned) data plane
to handle hot ingest paths — hashing, chunking, optional compression and
encryption, storage backend I/O. The v1 service is written entirely in
Python (ADR-0005). The cost of conflating control and data planes at the
code level is that extracting the data plane later requires API churn,
test rework, and producer migrations.
The cost of separating them now is one named module boundary and one
in-process protocol shape. That cost is essentially free if taken
before any consumer exists.
## Decision
1. The Python package is organised so that *every byte-handling
operation* lives behind a named contract:
- `artifactstore.dataplane.spi` — the abstract surface (typed
dataclasses, async iterator protocols).
- `artifactstore.dataplane.inproc` — the v1 implementation, running
in the same process as the control plane.
2. The control plane (`artifactstore.registry`, `artifactstore.api.http`,
`artifactstore.retention`, `artifactstore.audit`) interacts with
bytes *only* through the SPI. No HTTP handler, no DB writer, no
retention rule ever reads or writes file bytes directly.
3. The SPI exposes exactly these operations:
- `ingest_stream(stream, hints) -> IngestResult` — consumes an
upload, returns content addresses, sizes, and storage receipts.
- `serve_object(content_address, range?) -> AsyncIterator[bytes]`:
produces bytes for a download.
- `verify_object(content_address) -> VerifyResult` — re-reads bytes,
re-digests, returns mismatches.
- `delete_object(content_address) -> DeletionResult` — best-effort,
idempotent.
- `backend_health() -> BackendStatus` — readiness, latency, free
capacity.
4. The SPI surface is the contract a future Rust daemon must satisfy.
When that daemon ships, `artifactstore.dataplane.inproc` is replaced
by `artifactstore.dataplane.remote` (a thin gRPC or
framed-bincode-over-Unix-socket client). The control plane sees no
change.
5. SPI parameter and return types are CBOR-serialisable today, even when
nothing serialises them. This lets us toggle to RPC without rewriting
types.
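
A rough sketch of what the SPI boundary might look like as a `typing.Protocol`, with an in-memory fake of the kind tests could use. It covers only three of the five operations and omits `hints` and `range`; the class names, the reduced `IngestResult`, and the use of SHA-256 as the address algorithm are assumptions made so the example runs on the standard library.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import AsyncIterator, Protocol


@dataclass(frozen=True)
class IngestResult:
    content_address: str
    size: int


class DataPlane(Protocol):
    """Subset of the SPI; the control plane depends only on this shape."""

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> IngestResult: ...
    def serve_object(self, content_address: str) -> AsyncIterator[bytes]: ...
    async def verify_object(self, content_address: str) -> bool: ...


class InMemoryDataPlane:
    """Hypothetical test fake; satisfies DataPlane structurally, no inheritance."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    async def ingest_stream(self, stream: AsyncIterator[bytes]) -> IngestResult:
        hasher = hashlib.sha256()  # ASSUMPTION: stand-in for the primary digest
        buffer = bytearray()
        async for chunk in stream:
            hasher.update(chunk)
            buffer += chunk
        address = f"sha256:{hasher.hexdigest()}"
        self._objects[address] = bytes(buffer)
        return IngestResult(address, len(buffer))

    async def serve_object(self, content_address: str) -> AsyncIterator[bytes]:
        data = self._objects[content_address]
        for start in range(0, len(data), 4):  # tiny chunks to show streaming
            yield data[start : start + 4]

    async def verify_object(self, content_address: str) -> bool:
        algorithm, _, expected = content_address.partition(":")
        actual = hashlib.new(algorithm, self._objects[content_address]).hexdigest()
        return actual == expected


async def download(plane: DataPlane, content_address: str) -> bytes:
    """Control-plane style caller: touches bytes only through the SPI."""
    return b"".join([chunk async for chunk in plane.serve_object(content_address)])
```

Because `DataPlane` is structural, a future remote client only has to expose the same methods to be swapped in behind `download` and similar callers.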
## Consequences
Positive:
- The data plane can be rewritten in Rust later with zero API churn.
- Tests can fake the SPI cheaply; integration tests pin the contract.
- The CLI in `artifactstore.cli` is a second consumer of the SPI on
equal footing with the HTTP server.
- Operators with strong embedding requirements can use the in-process
data plane forever; nothing forces the RPC hop.
Negative:
- One extra abstraction layer in v1. Mitigated by the contract being
narrow (five operations).
- Discipline required: PRs that bypass the SPI are rejected. A linter
rule (forbidden import: `artifactstore.api.* -> filesystem`) makes
this mechanical.
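
One way such a forbidden-import rule could be made mechanical is a small AST check run in CI. This is a sketch only: the set of modules treated as "filesystem" and the function name are assumptions, and a real rule would more likely be configured in an existing linter.

```python
import ast

# ASSUMPTION: these stdlib modules approximate "filesystem access".
FORBIDDEN_MODULES = {"os", "pathlib", "shutil"}


def forbidden_imports(source: str) -> list[str]:
    """Return the forbidden modules imported by a piece of source code."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # relative imports have no module name
        else:
            continue
        found.update(name for name in names if name.split(".")[0] in FORBIDDEN_MODULES)
    return sorted(found)
```

A CI step could run this over every file under `artifactstore/api/` and fail the build when the returned list is non-empty.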
## Implementation notes
- The SPI is a `Protocol` (typing.Protocol) in `dataplane/spi.py` so the
in-process and future remote impls don't share an inheritance tree.
- Streaming returns `AsyncIterator[bytes]` so neither full-file buffering
nor `sendfile()` zero-copy is foreclosed.
- The `IngestResult` payload is the canonical CBOR-able value used in
events (ADR-0002). The same byte sequence flows API → SPI → event.


@@ -0,0 +1,117 @@
# ADR-0005 — V1 Technology Stack
Status: accepted
Date: 2026-05-15
Related: ADR-0001, ADR-0002, ADR-0003, ADR-0004
## Context
WP-0001 ("Foundation") cannot start without a pinned stack. The decision
needs to balance:
- ffmpeg / VLC philosophy: minimal dependency budget, sharp boundaries,
native code at the hot edges, plain tools.
- Python is already implied by `.gitignore` and ecosystem fit (StateHub,
guide-board, open-cmis-tck are all Python-leaning).
- The data plane will eventually be Rust (ADR-0004); the control plane
stays in Python and must stay approachable.
## Decision
| Concern | Choice | Rationale |
|---|---|---|
| Language (control plane) | **Python 3.12+** | Async ecosystem, type hints, matches sibling repos. 3.12 specifically: PEP 695 generics, faster CPython, `sys.monitoring`. |
| Package / project manager | **uv** | Single static binary, fast resolver, lockfile-first, replaces `pip + pip-tools + venv + pipx` in one tool. |
| Build backend | **hatchling** (via `pyproject.toml`) | Standards-track PEP 517 backend. No magic. |
| HTTP framework | **FastAPI** (Starlette + Pydantic v2) | OpenAPI generation, async-native, broad community. |
| ASGI server | **uvicorn** (dev), **gunicorn + uvicorn workers** (prod) | Plain, well-understood. |
| Database (prod) | **PostgreSQL 16+** | Source-of-truth event log (ADR-0002) wants `BIGSERIAL`, `BYTEA`, advisory locks, logical replication. |
| Database (dev/embedded) | **SQLite (WAL mode)** | Zero-dependency local. Schema is portable when we use SQLAlchemy Core. |
| DB access | **SQLAlchemy 2.0 Core** + **asyncpg** (prod) / **aiosqlite** (dev) | Core, not ORM — explicit SQL, async drivers. Migrations live below the API surface. |
| Migrations | **Alembic** | Standard, integrates with SQLAlchemy Core, supports both pg and sqlite. |
| Hashing | stdlib **`hashlib`** for SHA-256, **`blake3`** PyPI wheel for BLAKE3 | `blake3` wheel embeds the SIMD-tuned Rust impl with no build-time toolchain. |
| Serialisation | **`cbor2`** for canonical CBOR (ADR-0003); stdlib `json` for JCS or `jcs` PyPI | Smallest deps that satisfy ADR-0003. |
| CLI | **typer** (atop click) | Type-hint-driven CLI surface in the same style as FastAPI (same author). |
| Tests | **pytest** + **httpx** + `pytest-asyncio` (asyncio only, no trio) | Standard. |
| Lint / format | **ruff** (lint + format) | One tool replaces black + isort + flake8 + pyupgrade. |
| Type checker | **mypy** in `--strict` | Pyright is acceptable for editor support; CI gate is mypy. |
| Logging | stdlib `logging` + `structlog` for structured output | No exotic deps. |
| Metrics / tracing | OpenTelemetry SDK (deferred to its own workplan) | Listed for forward-compatibility; not a v1 dep. |
### Project layout
```
artifact-store/
├── pyproject.toml
├── uv.lock
├── Makefile # thin shim: make dev / test / lint / type / migrate
├── alembic.ini
├── src/
│ └── artifactstore/
│ ├── __init__.py
│ ├── identity/ # content address, digest abstraction (ADR-0001)
│ ├── manifest/ # canonical CBOR, JCS projection (ADR-0003)
│ ├── events/ # append-only log + replayer (ADR-0002)
│ ├── retention/ # policy engine
│ ├── audit/ # audit emission as event subset
│ ├── storage/ # adapter SPI + backend registry
│ │ ├── spi.py
│ │ └── backends/
│ │ ├── local.py # filesystem backend
│ │ └── s3.py # placeholder, WP-0004
│ ├── dataplane/ # SPI + in-process impl (ADR-0004)
│ │ ├── spi.py
│ │ └── inproc.py
│ ├── registry/ # high-level orchestrator
│ ├── api/
│ │ └── http/ # FastAPI app
│ ├── cli/ # typer CLI (thin)
│ └── config.py
├── tests/
│ ├── unit/
│ ├── integration/
│ └── conftest.py
├── migrations/ # alembic
└── docs/
```
### Commands (T001 acceptance)
```
make dev # uvicorn with reload, sqlite backend, local FS storage
make test # pytest -q
make lint # ruff check + ruff format --check
make type # mypy --strict src tests
make migrate # alembic upgrade head
artifactstore # CLI entry point installed by uv
```
## Consequences
Positive:
- Dependency budget is small and each dep is best-in-class for its slot.
- The same toolchain works on Linux, macOS, and CI without special cases.
- `uv.lock` is checked in; builds are reproducible.
- Every layer maps one-to-one to a docs concept (identity, manifest,
events, dataplane, etc.), so the codebase remains navigable.
Negative:
- Pydantic v2 is the heaviest non-DB dep; acceptable for the OpenAPI win.
- Choosing SQLAlchemy Core over ORM costs some convenience; we accept
it because explicit SQL is easier to migrate to Rust later (ADR-0004).
- mypy `--strict` is a per-PR tax; bounded by keeping the codebase small.
## Revision policy
This ADR is the most likely candidate for revision once we have profile
data from real ingestion. Candidates we are already watching:
- Replace `cbor2` with a Rust-backed CBOR codec if profile shows it on
the hot path.
- Replace `uvicorn` with `granian` (Rust ASGI server) if perf demands.
- Replace `SQLAlchemy Core` with raw `asyncpg` + a tiny query builder
if Core's abstractions show up in flame graphs.
Each replacement is its own ADR. None of them are v1 work.


@@ -0,0 +1,69 @@
# ADR-0006 — OCI Artifact Compatibility Kept Reachable
Status: accepted
Date: 2026-05-15
Related: ADR-0001, ADR-0003, `docs/PLATFORM-AMBITION.md` commitment A9
## Context
The OCI Distribution Specification and the OCI Artifact Manifest define
a widely-deployed wire format for content-addressed artifact exchange.
The ecosystem includes `oras`, `cosign`, `crane`, Helm, ChartMuseum,
ML-model packaging tools, and most container registries. Compatibility
with this ecosystem is the single highest-leverage opportunity in
`docs/PLATFORM-AMBITION.md`.
We do not implement OCI compatibility in v1. We do refuse to take any
v1 decision that prevents it.
## Decision
1. The internal data model is structurally compatible with an OCI
artifact manifest. Concretely:
- Storage addresses content as `<algorithm>:<lowercase-hex>`
(ADR-0001). OCI requires exactly this shape.
- Manifests have a `config` blob plus an ordered list of `layers`,
each with `mediaType`, `digest`, `size`, and optional
`annotations`. Our `Manifest` value class includes all of these
fields, even when v1 has no use for `mediaType` or `annotations`.
- Manifest serialisation produces byte-identical output across
callers (ADR-0003). OCI requires this for the manifest digest.
2. The native API may be richer than OCI, but v1 reviews every schema
change against the OCI spec and rejects changes that would block
later OCI compatibility.
3. A future `/v2/` namespace will speak the OCI Distribution Spec on
top of the same storage. This is its own workplan; it does not
modify v1 endpoints; it only adds new ones.
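
The structural compatibility in item 1 could be captured by value classes along these lines. This is a sketch under assumptions: the class names, the validation regex, and the example media type are illustrative and not taken from the project or from the OCI specifications.

```python
import re
from dataclasses import dataclass, field

# <algorithm>:<lowercase-hex>, as required by ADR-0001.
_DIGEST_RE = re.compile(r"[a-z0-9]+:[0-9a-f]+")


@dataclass(frozen=True)
class Descriptor:
    """One blob reference with the OCI-shaped fields named in this ADR."""

    mediaType: str
    digest: str
    size: int
    annotations: dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not _DIGEST_RE.fullmatch(self.digest):
            raise ValueError(f"not <algorithm>:<lowercase-hex>: {self.digest!r}")
        if self.size < 0:
            raise ValueError("size must be non-negative")


@dataclass(frozen=True)
class Manifest:
    """A config blob plus an ordered list of layers."""

    config: Descriptor
    layers: tuple[Descriptor, ...]
    manifest_version: int = 1
```

Rejecting malformed digests at construction time keeps non-conforming addresses out of the manifest before it is ever serialised.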
## Consequences
Positive:
- `oras push`, `cosign sign`, `crane copy`, and `helm pull` become
reachable additions, not rewrites.
- Customers who already speak OCI can adopt incrementally.
- The `mediaType` discipline forces v1 producers to label their files,
which improves the manifest's value as a portable record.
Negative:
- v1 carries some otherwise-unnecessary manifest fields. Acceptable;
the cost is bytes, not complexity.
- The OCI manifest model uses SHA-256 as the canonical digest in
practice. ADR-0001's `digest_sha256` column satisfies this; the
native primary digest can still be BLAKE3.
## What this ADR does NOT commit to
- It does not commit to implementing OCI Distribution in v1.
- It does not commit to OCI as the *only* wire format. The native API
remains the richer interface.
- It does not commit to specific OCI media types for evidence packages.
Media-type assignment is the subject of a later workplan.
## Review trigger
Every schema-affecting workplan (anything that touches the data model
or the manifest shape) must include an explicit one-paragraph review
against this ADR. Reject changes that introduce OCI-incompatible
invariants without superseding this ADR.

docs/adr/README.md

@@ -0,0 +1,32 @@
# Architecture Decision Records
This directory holds the architectural decisions that govern `artifact-store`.
Each ADR is a small Markdown file with a status (`proposed`, `accepted`,
`superseded`, `deprecated`), a concise statement of the decision, the
forces that pushed it, and the consequences.
ADRs are the canonical home for "we are doing X" statements that survive
multiple workplans. `INTENT.md` says what we build; `SCOPE.md` says where
the boundary is; `docs/PLATFORM-AMBITION.md` says where we are pointed;
ADRs say how, and they are the only documents that record a *changeable*
decision in a form that can be superseded cleanly.
Workplans cite the ADRs they depend on. The architecture blueprint cites
the ADRs it operationalises.
## Index
- [ADR-0001 — Content-Addressed Storage with Dual Digest](0001-content-addressed-storage.md) — accepted
- [ADR-0002 — Append-Only Event Log as Source of Truth](0002-event-log-source-of-truth.md) — accepted
- [ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2)](0003-manifest-canonical-cbor.md) — accepted
- [ADR-0004 — Control Plane / Data Plane Contract](0004-control-plane-data-plane-contract.md) — accepted
- [ADR-0005 — V1 Technology Stack](0005-v1-tech-stack.md) — accepted
- [ADR-0006 — OCI Artifact Compatibility Kept Reachable](0006-oci-compatibility-reachable.md) — accepted
## Conventions
- Filenames: `NNNN-kebab-case-slug.md`, numbered in acceptance order.
- Status transitions: `proposed → accepted → (superseded | deprecated)`.
- Supersession is explicit: the new ADR links the old; the old ADR links
forward and changes status. Never delete an ADR.
- Each ADR is short. If it is long, it is wrong: split it.