content-addressed blob storage: blob_storage.py, memory, local, and S3 adapters

2026-05-07 03:51:25 +02:00
parent c2bc7071d7
commit ebace73761
22 changed files with 1489 additions and 47 deletions
--- a/docs/architecture-blueprint.md
+++ b/docs/architecture-blueprint.md
@@ -190,7 +190,7 @@ Required MVP ports:

 - Repository port for assets, representations, metadata, relationships,
  versions, runs, audit events, and exports.
- Object/content store port for source, normalized, and derived content payloads.
+- Blob/content store port for source, normalized, and derived content payloads.
 - Search index port for lexical search and later semantic/hybrid retrieval.
 - Extractor port for format-specific normalization.
 - Connector port for source systems.
@@ -211,6 +211,10 @@ Adapter rules:
  Markitect where useful, but they are not the canonical engine identity or
  storage model. The canonical layer remains asset, representation, metadata,
  lifecycle, policy, lineage, and audit state.
+- Blob storage is infrastructure behind `AssetRepresentation.storage_ref`.
+  Whole-object content addressing, digest verification, and chunked byte
+  streaming belong behind the blob port. Local filesystem and S3 are adapters,
+  not different domain models.
 - `llm-connect` or equivalent is an adapter for LLM providers.
 - `phase-memory` is an adjacent memory runtime; this engine may exchange opaque
  memory references or context packages but should not implement memory phases.
@@ -251,6 +255,9 @@ Recommended storage style:
  adapter-specific payloads.
 - Separate content/object references for large source, normalized, or derived
  payloads.
+- Store blob bytes outside repository rows when content is non-trivial. Keep
+  representation digest, size, media type, kind, producer, and storage ref in
+  the repository, and let blob adapters handle byte persistence and dedupe.
 - Append-only audit events and change records.
 - Deterministic ordering fields for pagination and tests.
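
Note on the storage-style hunk above: a minimal Python sketch of what "let blob adapters handle byte persistence and dedupe" could look like, assuming the `sha256:<hex>` digest form used elsewhere in this commit. The class `SketchMemoryBlobStore`, the `BlobStat` record, and the `mem://` reference scheme are hypothetical names chosen for illustration; they are not the contents of `blob_storage.py`.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BlobStat:
    """What a repository row might keep; the bytes themselves stay in the store."""
    digest: str       # "sha256:<hex>"
    size: int         # payload length in bytes
    storage_ref: str  # hypothetical "mem://<digest>" reference


class SketchMemoryBlobStore:
    """Illustrative content-addressed store (assumption, not the engine adapter)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes, expected_digest: Optional[str] = None) -> BlobStat:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Digest verification: reject payloads that do not match the caller's claim.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError(f"digest mismatch: expected {expected_digest}, got {digest}")
        # Dedupe: identical bytes share one key, so a repeated write stores nothing new.
        self._blobs.setdefault(digest, data)
        return BlobStat(digest=digest, size=len(data), storage_ref=f"mem://{digest}")

    def read_bytes(self, storage_ref: str) -> bytes:
        # Raises KeyError when the reference is unknown to this store.
        return self._blobs[storage_ref[len("mem://"):]]

    def blob_count(self) -> int:
        return len(self._blobs)
```

In this sketch two representations with identical content receive the same `storage_ref`, so the repository can hold two rows while the store holds one payload.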

--- a/docs/blob-storage-content-streaming-workplan.md
+++ b/docs/blob-storage-content-streaming-workplan.md
@@ -2,7 +2,7 @@

 Date: 2026-05-07

-Status: planned.
+Status: implemented.

 ## Purpose

@@ -11,23 +11,25 @@ normalized, and derived representations can reference real content bytes
 without duplicating storage. Expose those bytes through engine-native
 interfaces and CMIS content stream routes.

-## Current State
+## Implemented State

-The engine already records representation metadata:
+The engine records representation metadata:

 - digest,
 - size,
 - media type,
 - representation kind,
- opaque `storage_ref`.
+- `storage_ref`.

-It does not yet provide:
+It now provides:

- a content-addressed blob store,
- deduplicating writes,
- blob read/stream interfaces,
- reference accounting or garbage collection,
- CMIS byte-stream download semantics.
+- a content-addressed blob storage port,
+- in-memory, local filesystem, and optional S3 adapters,
+- deduplicating writes by `sha256:<hex>` digest,
+- whole-object `read_bytes(...)` reads plus chunked `iter_bytes(...)` streaming,
+- representation-level content service governance,
+- reference accounting and dry-run/active cleanup,
+- CMIS Browser Binding content stream byte routes.

 ## Target Architecture

@@ -35,7 +37,7 @@ It does not yet provide:
 bytes
  -> digest/size verification
  -> BlobStoragePort
-  -> content-addressed adapter
+  -> content-addressed adapter (memory/local/S3)
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream
@@ -60,29 +62,52 @@ justifies the complexity.

 ## Interfaces

-Planned engine-native interfaces:
+Engine-native interfaces:

 - `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.open_bytes(...)`
+- `BlobStoragePort.read_bytes(...)`
+- `BlobStoragePort.iter_bytes(...)`
 - `BlobStoragePort.stat(...)`
 - `BlobStoragePort.exists(...)`
 - `BlobStoragePort.delete_unreferenced(...)`
 - `RepresentationContentService.add_representation_from_bytes(...)`
 - `RepresentationContentService.get_content_stream(...)`
+- `RepresentationContentService.stream_content(...)`

-Planned CMIS integration:
+CMIS integration:

 - `getContentStream` returns actual bytes/stream with content headers,
 - `setContentStream` stores through deduplicating representation service,
 - content stream changes produce versions and audit events,
 - descriptors remain available for clients that only need metadata.

+## Storage Backends
+
+- `InMemoryBlobStorage` supports deterministic unit tests and default runtime
+  wiring.
+- `LocalBlobStorage` stores content under digest-derived paths and uses atomic
+  temporary writes.
+- `S3BlobStorage` is available through the optional `kontextual-engine[s3]`
+  extra and keeps S3 concerns behind the same blob port. It uses digest-derived
+  object keys and streams object bodies in chunks.
+
+The engine stores only the returned `storage_ref` on representations. Backend
+selection is therefore a deployment concern, not a domain-model fork.
+
+## Migration Posture
+
+Existing opaque `storage_ref` values remain valid metadata, but content bytes
+can only be streamed when the configured blob adapter can resolve the reference.
+Migration should import external content through
+`RepresentationContentService.add_representation_from_bytes(...)` so dedupe,
+digest verification, policy, versions, and audit events are preserved.
+
 ## Risks

- Large files may require streaming APIs rather than in-memory bytes.
+- Very large files may require upload-side streaming beyond the current
+  byte-based write API.
 - Local filesystem adapters need atomic writes and digest verification.
 - Garbage collection must never delete referenced blobs.
 - Security must treat blob bytes as governed content, not public storage.
 - Existing `storage_ref` values may point to external sources and should remain
  valid as opaque references.
-
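
Note on the workplan hunks above: the "Storage Backends" and "Risks" text mentions digest-derived paths, atomic temporary writes, and chunked `iter_bytes(...)` streaming. The following is a rough sketch of how those three ideas can fit together on a local filesystem. The class name `SketchLocalBlobStore`, the two-character fan-out directory layout, and the default chunk size are assumptions made for illustration; the actual `LocalBlobStorage` adapter may differ.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterator


class SketchLocalBlobStore:
    """Illustrative filesystem store keyed by sha256 digest (assumption)."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path_for(self, digest: str) -> Path:
        algo, _, hex_digest = digest.partition(":")
        valid_hex = len(hex_digest) == 64 and all(c in "0123456789abcdef" for c in hex_digest)
        if algo != "sha256" or not valid_hex:
            raise ValueError(f"unsupported digest: {digest!r}")
        # Hypothetical layout: <root>/sha256/<first two hex chars>/<full hex>
        return self.root / algo / hex_digest[:2] / hex_digest

    def put_bytes(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        target = self._path_for(digest)
        if target.exists():
            return digest  # dedupe: this content is already stored
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename into place,
        # so readers never observe a partially written blob.
        fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_name, target)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
        return digest

    def iter_bytes(self, digest: str, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
        # Yield the blob in fixed-size chunks instead of loading it whole.
        with self._path_for(digest).open("rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                yield chunk
```

The temporary file is created next to its final location because `os.replace` is only atomic when source and destination are on the same filesystem. Like the byte-based write API described in the risks list, `put_bytes` here still takes the whole payload in memory; only the read side streams.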