artifact-store/docs/adr/0003-manifest-canonical-cbor.md
tegwick 747afc27a6 docs+plans: reconcile blueprint with ambition, add ADRs, sequence workplans
Aligns the v1 architecture with the longer-horizon platform thesis so we can
start implementation without the schema-level inconsistencies the prior
review surfaced.

ADRs (docs/adr/0001..0006): content-addressed dual-digest storage, append-only
event log as source of truth, canonical CBOR manifests, control/data-plane
contract, v1 tech stack (Python 3.12 / uv / FastAPI / SQLAlchemy Core +
asyncpg / Alembic / cbor2 / blake3 / ruff / mypy / pytest / typer), OCI
compatibility kept reachable.

Architecture blueprint rewritten to v2: library-first (ffmpeg-shaped) module
layout, materialised-view data model over the event log, upload-session and
event-stream endpoints pinned, retrieval tiering promoted into the schema.

Roadmap added (docs/ROADMAP.md) with three phases. WP-0001 rewritten as the
Foundation plan (scaffold + kernels + local FS + minimal app). WP-0002..0005
created carrying the existing state_hub_task_ids forward semantically:
ingestion API (T004), retention lifecycle (T005), S3-compatible backend
(T006), guide-board pilot (T007). T001/T002/T003/T008 remain in WP-0001
with refined acceptance.

README and AGENTS.md refreshed to reflect the new repo shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 21:16:17 +02:00


ADR-0003 — Manifest Canonicalisation = Canonical CBOR (RFC 8949 §4.2.2)

Status: accepted
Date: 2026-05-15
Related: ADR-0001, ADR-0002, ADR-0006, docs/PLATFORM-AMBITION.md commitment A4

Context

Manifests describe a package's identity, contents, retention, and provenance. They are the durable, portable, signable summary of a package. Three downstream features depend on byte-identical manifest serialisation:

  1. Manifest digest (used as the package's content address — ADR-0001).
  2. Signatures (cosign, Sigstore, in-toto, SLSA).
  3. Cross-language / cross-version reproducibility (any client must be able to verify a manifest produced by any other client).

JSON does not guarantee byte-identical output without an explicit canonicalisation profile. The candidates are:

  • JCS (JSON Canonicalization Scheme, RFC 8785): JSON-shaped, widely available, text-format, signs cleanly.
  • Canonical CBOR (RFC 8949 §4.2.2): binary, smaller, lower overhead to canonicalise, the native encoding of COSE, and signable as an opaque blob by cosign / Sigstore tooling.
  • DAG-CBOR (IPLD profile): canonical CBOR plus content-addressing conventions; useful if we later integrate with IPLD/IPFS, but pulls in ecosystem assumptions we don't yet need.

Canonical CBOR wins on size, parser surface, and direct compatibility with the tooling we will adopt for signing (ADR commitments A4, A9). JCS is a reasonable alternative; we keep an emit-JCS path for human-readable display but the signed form is CBOR.
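The byte-instability of plain JSON that motivates this decision can be reproduced with the standard library alone. The sketch below is illustrative only: the field names are hypothetical, SHA-256 stands in for the BLAKE3 digest so that no third-party package is needed, and `sort_keys` plus compact separators only approximates a canonicalisation profile (it is not a full JCS implementation).

```python
import hashlib
import json

# Two logically identical manifests; only the key insertion order differs.
# Field names are hypothetical, not the real manifest schema.
a = {"name": "pkg", "manifest_version": 1}
b = {"manifest_version": 1, "name": "pkg"}


def digest(obj, **dumps_kwargs) -> str:
    """SHA-256 of the JSON text (stdlib stand-in for the BLAKE3 digest)."""
    text = json.dumps(obj, **dumps_kwargs)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Same value, but the default serialisation follows insertion order.
same_value = a == b
default_digests_differ = digest(a) != digest(b)

# Even one object hashes differently once whitespace conventions change.
whitespace_changes_digest = digest(a) != digest(a, separators=(",", ":"))

# An explicit profile (sorted keys, fixed separators) restores agreement.
profile = {"sort_keys": True, "separators": (",", ":")}
profile_digests_agree = digest(a, **profile) == digest(b, **profile)
```

Every producer and verifier has to apply the same profile for the last line to hold, which is the discipline a canonical encoding makes explicit.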

Decision

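  1. Manifests are serialised as canonical (deterministically encoded) CBOR per RFC 8949 §4.2. The first three rules are the core requirements of §4.2.1; the last two are application-level choices that §4.2.2 leaves to each protocol:
    • definite-length encoding throughout,
    • shortest-form integer encoding,
    • map keys sorted bytewise lexicographically by their encoded form,
    • no floating-point values (the manifest schema does not need them),
    • no semantic tags except those we explicitly enumerate.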
  2. The manifest's content address is blake3:<hex> of its canonical CBOR bytes. This is the package's primary identifier in storage.
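  3. A canonical JSON projection (JCS) of the same manifest is available for display, signing-tool interop, and human inspection. The projection is deterministic: round-tripping through it must yield byte-identical CBOR. This requires every manifest value to have a lossless JSON representation (for example, text map keys, and integers within the range that the IEEE 754 doubles used by JCS can represent exactly).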
  4. The manifest schema is itself versioned (manifest_version: 1). Unknown fields are preserved on read and re-emitted on write (forward compatibility); breaking schema changes bump the version.
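The encoding rules in item 1 can be sketched as a small standard-library encoder. This is a minimal illustration under stated assumptions, not the project's codec: the function names `_head` and `encode` are hypothetical, only integers, byte strings, text, arrays, maps, booleans, and null are handled, and floats and tags are rejected outright rather than enumerated.

```python
import struct


def _head(major: int, n: int) -> bytes:
    """Initial byte(s): major type plus the shortest-form argument n."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 0x100:
        return bytes([(major << 5) | 24, n])
    if n < 0x10000:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 0x100000000:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    if n < 0x10000000000000000:
        return bytes([(major << 5) | 27]) + struct.pack(">Q", n)
    raise ValueError("integer does not fit in 64 bits")


def encode(value) -> bytes:
    """Deterministically encode a value; always definite-length."""
    if isinstance(value, bool):  # must precede the int check
        return b"\xf5" if value else b"\xf4"
    if value is None:
        return b"\xf6"
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, bytes):
        return _head(2, len(value)) + value
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, (list, tuple)):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        # Sort entries by the bytes of the *encoded* key, not by the key itself.
        pairs = sorted(
            ((encode(k), encode(v)) for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return _head(5, len(pairs)) + b"".join(k + v for k, v in pairs)
    # Floats and anything else (including tagged values) are refused.
    raise TypeError(f"not allowed in a manifest: {type(value).__name__}")


# Insertion order does not affect the output bytes.
assert encode({"b": 1, "a": 2}) == encode({"a": 2, "b": 1})
```

Under item 2, the content address would then be `blake3:` followed by the hex BLAKE3 digest of the bytes returned by `encode`. Note that bytewise ordering of encoded keys is not alphabetical: a shorter text key sorts before a longer one because its length is part of the first encoded byte.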

Consequences

Positive:

  • Manifests are signable today by any tool that signs opaque bytes (cosign sign-blob, ssh-keygen -Y sign) or that consumes CBOR natively (COSE libraries).
  • The manifest digest is stable across languages, operating systems, and compilers.
  • Smaller on disk and on the wire than JSON.
  • Replay (ADR-0002) is unambiguous because event payloads are also CBOR.

Negative:

  • Less human-readable in raw form; the CLI must offer a pretty projection.
  • One more dependency (a CBOR library). We pin one in ADR-0005.
  • Future schema evolution requires the same canonicalisation discipline. This is enforced by a property-based test: any manifest must round-trip CBOR → JCS → CBOR with byte equality.
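The round-trip property can be sketched with the standard library, using a seeded random generator in place of a property-testing framework. Everything here is an assumption made for illustration: the helper names are hypothetical, the codec covers only integers, text, arrays, and maps, and `json.dumps` with sorted keys and compact separators approximates JCS only for ASCII text and modest integers.

```python
import json
import random
import struct


def _head(major: int, n: int) -> bytes:
    """Major type plus shortest-form unsigned argument."""
    if n < 24:
        return bytes([major << 5 | n])
    for info, fmt in ((24, ">B"), (25, ">H"), (26, ">I"), (27, ">Q")):
        if n < 1 << (8 * struct.calcsize(fmt)):
            return bytes([major << 5 | info]) + struct.pack(fmt, n)
    raise ValueError("argument exceeds 64 bits")


def encode(value) -> bytes:
    """Deterministic CBOR for a subset: int, text, array, map."""
    if isinstance(value, bool):
        raise TypeError("bool is outside this sketch's subset")
    if isinstance(value, int):
        return _head(0, value) if value >= 0 else _head(1, -1 - value)
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return _head(3, len(raw)) + raw
    if isinstance(value, list):
        return _head(4, len(value)) + b"".join(encode(v) for v in value)
    if isinstance(value, dict):
        pairs = sorted((encode(k), encode(v)) for k, v in value.items())
        return _head(5, len(pairs)) + b"".join(k + v for k, v in pairs)
    raise TypeError(f"unsupported type: {type(value).__name__}")


def _decode_at(data: bytes, pos: int):
    """Decode one item starting at pos; return (value, next position)."""
    major, info = data[pos] >> 5, data[pos] & 0x1F
    pos += 1
    if info < 24:
        n = info
    elif info <= 27:
        size = 1 << (info - 24)
        n = int.from_bytes(data[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("indefinite or reserved length")
    if major == 0:
        return n, pos
    if major == 1:
        return -1 - n, pos
    if major == 3:
        return data[pos:pos + n].decode("utf-8"), pos + n
    if major == 4:
        items = []
        for _ in range(n):
            item, pos = _decode_at(data, pos)
            items.append(item)
        return items, pos
    if major == 5:
        out = {}
        for _ in range(n):
            key, pos = _decode_at(data, pos)
            out[key], pos = _decode_at(data, pos)
        return out, pos
    raise ValueError(f"major type {major} not in subset")


def to_json_projection(value) -> bytes:
    """JCS-like text: sorted keys, no insignificant whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")


def round_trips(manifest) -> bool:
    """CBOR -> JSON projection -> CBOR must reproduce the same bytes."""
    first = encode(manifest)
    projected = to_json_projection(_decode_at(first, 0)[0])
    return encode(json.loads(projected)) == first


def random_value(rng: random.Random, depth: int = 0):
    kinds = ["int", "str"] if depth >= 3 else ["int", "str", "list", "dict"]
    kind = rng.choice(kinds)
    if kind == "int":  # stay inside the double-exact range JCS relies on
        return rng.randint(-(2**53) + 1, 2**53 - 1)
    if kind == "str":
        return "".join(rng.choice("abcxyz") for _ in range(rng.randint(0, 30)))
    if kind == "list":
        return [random_value(rng, depth + 1) for _ in range(rng.randint(0, 4))]
    return {random_value(rng, 3) if False else "k%d" % i: random_value(rng, depth + 1)
            for i in range(rng.randint(0, 4))}


rng = random.Random(0)
all_round_trip = all(round_trips({"body": random_value(rng)}) for _ in range(200))
integer_key_survives = round_trips({1: "x"})  # JSON turns the key into "1"
```

The last line shows the kind of producer the property is meant to catch: a map with an integer key is valid CBOR but does not survive the JSON projection, so the schema has to exclude it.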

Implementation notes

  • v1 library: cbor2 (PyPI; pure-Python with optional C extension). It is wrapped behind artifactstore.manifest.codec so that swapping in a faster implementation is transparent.
  • JCS projection: jcs (PyPI) or hand-rolled — decision deferred to WP-0001-T003.
  • A Manifest value class enforces field order on emit, not just on encode. This catches non-canonical producers at the API boundary.
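One possible reading of the last note is a value class that refuses to hold fields in anything other than canonical order, so that a non-canonical producer fails at construction time rather than at encode time. The sketch below assumes that reading; the `Manifest` shape, the `from_mapping` and `emit` names, and the restriction to text keys are all hypothetical and not taken from the codebase.

```python
import struct
from dataclasses import dataclass
from typing import Any, Mapping, Tuple


def encoded_text_key(key: str) -> bytes:
    """CBOR encoding of a text key (major type 3, shortest-form length)."""
    raw = key.encode("utf-8")
    n = len(raw)
    if n < 24:
        head = bytes([0x60 | n])
    elif n < 0x100:
        head = bytes([0x78, n])
    elif n < 0x10000:
        head = b"\x79" + struct.pack(">H", n)
    else:
        raise ValueError("key too long for this sketch")
    return head + raw


@dataclass(frozen=True)
class Manifest:
    """Immutable field list that is canonical by construction."""

    fields: Tuple[Tuple[str, Any], ...]

    def __post_init__(self) -> None:
        keys = [encoded_text_key(k) for k, _ in self.fields]
        # Strictly increasing encoded keys: canonical order, no duplicates.
        if any(a >= b for a, b in zip(keys, keys[1:])):
            raise ValueError("fields are not in canonical order or repeat a key")

    @classmethod
    def from_mapping(cls, mapping: Mapping[str, Any]) -> "Manifest":
        """Convenience path for callers that do not care about order."""
        ordered = sorted(mapping.items(), key=lambda kv: encoded_text_key(kv[0]))
        return cls(tuple(ordered))

    def emit(self) -> Tuple[Tuple[str, Any], ...]:
        """Fields in the exact order an encoder should write them."""
        return self.fields


m = Manifest.from_mapping({"manifest_version": 1, "name": "pkg"})
emitted_keys = [k for k, _ in m.emit()]
```

With this design a caller that builds `Manifest` directly from an unsorted tuple gets a `ValueError` immediately, while `from_mapping` offers a path that sorts on the caller's behalf.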