Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: implemented.

Purpose

Add efficient, governed blob handling to kontextual-engine so source, normalized, and derived representations can reference real content bytes without duplicating storage. Expose those bytes through engine-native interfaces and CMIS content stream routes.

Implemented State

The engine records representation metadata:

digest,
size,
media type,
representation kind,
storage_ref.

It now provides:

a content-addressed blob storage port,
in-memory, local filesystem, and optional S3 adapters,
deduplicating writes by sha256:<hex> digest,
whole-byte reads plus chunked iter_bytes(...) streaming,
representation-level content service governance,
reference accounting and dry-run/active cleanup,
CMIS Browser Binding content stream byte routes.
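
The chunked streaming mentioned above can be illustrated with a minimal sketch. It is not the engine's implementation: the free function `iter_chunks` and its `chunk_size` default are assumptions chosen for illustration, and the real `iter_bytes(...)` signature is not given in this document.

```python
from typing import Iterator


def iter_chunks(data: bytes, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of at most chunk_size bytes (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # memoryview avoids copying the whole payload for every slice.
    view = memoryview(data)
    for offset in range(0, len(view), chunk_size):
        yield bytes(view[offset:offset + chunk_size])
```

A caller can forward each chunk to a response body as it arrives instead of building one large buffer per response.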

Target Architecture

bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter (memory/local/S3)
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream

Core rule: blob storage is infrastructure. AssetRepresentation remains the engine-level content reference with provenance, digest, media type, kind, version, policy, and audit context.

Dedupe Strategy

Use whole-object content addressing first:

digest: sha256:<hex>,
storage key: digest-derived path or adapter key,
idempotent write: if a blob with the same digest already exists, reuse it instead of writing again,
verification: check size and digest before returning a blob reference,
references: representation rows may share a blob reference, but the bytes must be stored only once.

Chunk-level deduplication is explicitly deferred until large-file evidence justifies the complexity.
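
The whole-object strategy above can be sketched as follows. This is a hedged illustration only: `BlobRef`, `DigestMismatch`, `DedupingMemoryStore`, the `expected_digest`/`expected_size` parameters, and the `memory://` reference format are all hypothetical names, not the engine's API.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical blob reference returned after verification."""
    digest: str
    size: int
    storage_ref: str


class DigestMismatch(ValueError):
    """Raised when caller-supplied digest or size disagrees with the bytes."""


class DedupingMemoryStore:
    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.physical_writes = 0  # counts real writes, to make dedupe visible

    def put_bytes(
        self,
        data: bytes,
        expected_digest: Optional[str] = None,
        expected_size: Optional[int] = None,
    ) -> BlobRef:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify before anything is stored or returned.
        if expected_digest is not None and expected_digest != digest:
            raise DigestMismatch(f"expected {expected_digest}, got {digest}")
        if expected_size is not None and expected_size != len(data):
            raise DigestMismatch(f"expected {expected_size} bytes, got {len(data)}")
        # Idempotent write: an existing digest means the bytes are reused.
        if digest not in self._blobs:
            self._blobs[digest] = data
            self.physical_writes += 1
        return BlobRef(digest, len(data), f"memory://{digest}")
```

Two writes of the same bytes return equal references while only one physical write occurs, which is the property the list above asks for.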

Interfaces

Engine-native interfaces:

BlobStoragePort.put_bytes(...)
BlobStoragePort.read_bytes(...)
BlobStoragePort.iter_bytes(...)
BlobStoragePort.stat(...)
BlobStoragePort.exists(...)
BlobStoragePort.delete_unreferenced(...)
RepresentationContentService.add_representation_from_bytes(...)
RepresentationContentService.get_content_stream(...)
RepresentationContentService.stream_content(...)
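
One way to picture the blob port is as a structural interface. The sketch below is an assumption throughout: this document lists only method names, so every parameter, return type, and the names `BlobStoragePortSketch` and `TinyMemoryStore` are invented for illustration and may differ from the engine's actual signatures.

```python
import hashlib
from typing import Dict, Iterator, List, Protocol, Set, runtime_checkable


@runtime_checkable
class BlobStoragePortSketch(Protocol):
    # Method names follow the list above; signatures are guesses.
    def put_bytes(self, data: bytes) -> str: ...
    def read_bytes(self, storage_ref: str) -> bytes: ...
    def iter_bytes(self, storage_ref: str, chunk_size: int = 65536) -> Iterator[bytes]: ...
    def stat(self, storage_ref: str) -> Dict[str, object]: ...
    def exists(self, storage_ref: str) -> bool: ...
    def delete_unreferenced(self, referenced: Set[str], dry_run: bool = True) -> List[str]: ...


class TinyMemoryStore:
    """Hypothetical adapter that satisfies the sketch above."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes) -> str:
        ref = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(ref, data)  # reuse existing bytes on repeat writes
        return ref

    def read_bytes(self, storage_ref: str) -> bytes:
        return self._blobs[storage_ref]

    def iter_bytes(self, storage_ref: str, chunk_size: int = 65536) -> Iterator[bytes]:
        data = self._blobs[storage_ref]
        for offset in range(0, len(data), chunk_size):
            yield data[offset:offset + chunk_size]

    def stat(self, storage_ref: str) -> Dict[str, object]:
        return {"digest": storage_ref, "size": len(self._blobs[storage_ref])}

    def exists(self, storage_ref: str) -> bool:
        return storage_ref in self._blobs

    def delete_unreferenced(self, referenced: Set[str], dry_run: bool = True) -> List[str]:
        orphans = sorted(ref for ref in self._blobs if ref not in referenced)
        if not dry_run:
            for ref in orphans:
                del self._blobs[ref]
        return orphans
```

Because the protocol is structural, an adapter does not need to inherit from it; any class with these methods can be wired in behind the same port.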

CMIS integration:

getContentStream returns actual bytes/stream with content headers,
setContentStream stores through the deduplicating representation service,
appendContentStream composes the current stream plus appended bytes and stores the resulting representation through the same deduplicating service,
content stream changes produce versions and audit events,
descriptors remain available for clients that only need metadata.
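
The appendContentStream behavior described above (compose, then store the result as a new content-addressed blob) can be sketched at the blob level. The function `append_content_stream` and the plain `dict` standing in for the blob store are hypothetical; versioning, policy, and audit handling are deliberately left out.

```python
import hashlib
from typing import Dict


def sha256_digest(data: bytes) -> str:
    return "sha256:" + hashlib.sha256(data).hexdigest()


def append_content_stream(blobs: Dict[str, bytes], current_digest: str, appended: bytes) -> str:
    """Compose current bytes plus appended bytes and store the result once."""
    combined = blobs[current_digest] + appended
    digest = sha256_digest(combined)
    # The previous blob stays in place, so earlier versions can still resolve it.
    blobs.setdefault(digest, combined)
    return digest
```

In this sketch an append never mutates an existing blob; it produces a new digest, which is what allows content stream changes to map onto versions.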

Storage Backends

InMemoryBlobStorage supports deterministic unit tests and default runtime wiring.
LocalBlobStorage stores content under digest-derived paths and uses atomic temporary writes.
S3BlobStorage is available through the optional kontextual-engine[s3] extra and keeps S3 concerns behind the same blob port. It uses digest-derived object keys and streams object bodies in chunks.

The engine stores only the returned storage_ref on representations. Backend selection is therefore a deployment concern, not a domain-model fork.
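
Digest-derived paths and atomic temporary writes can be sketched as below. This is an illustration under stated assumptions, not LocalBlobStorage itself: the two-level directory fan-out, the `.tmp-` prefix, and the function names `blob_path` and `put_bytes_atomic` are all hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def blob_path(root: Path, digest: str) -> Path:
    """Map sha256:<hex> to root/sha256/ab/cd/<hex> (layout is an assumption)."""
    algo, _, hex_digest = digest.partition(":")
    valid_hex = len(hex_digest) == 64 and all(c in "0123456789abcdef" for c in hex_digest)
    if algo != "sha256" or not valid_hex:
        raise ValueError(f"unsupported digest: {digest!r}")
    return root / algo / hex_digest[:2] / hex_digest[2:4] / hex_digest


def put_bytes_atomic(root: Path, data: bytes) -> Path:
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    target = blob_path(root, digest)
    if target.exists():
        return target  # dedupe: identical bytes are already stored
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)  # atomic rename into the final path
    except BaseException:
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
    return target
```

Readers never observe a partially written blob in this scheme, because the final path only appears once the temporary file is complete.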

Migration Posture

Existing opaque storage_ref values remain valid metadata, but content bytes can only be streamed when the configured blob adapter can resolve the reference. Migration should import external content through RepresentationContentService.add_representation_from_bytes(...) so dedupe, digest verification, policy, versions, and audit events are preserved.

Risks

Very large files may require upload-side streaming beyond the current byte-based write API.
Local filesystem adapters need atomic writes and digest verification.
Garbage collection must never delete referenced blobs.
Security must treat blob bytes as governed content, not public storage.
Existing storage_ref values may point to external sources and should remain valid as opaque references.
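
The garbage-collection risk above can be made concrete with a small sketch of reference accounting plus dry-run and active cleanup. The helper names `count_references` and `cleanup`, and the use of a plain `dict` as the blob store, are assumptions for illustration only.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_references(representation_digests: Iterable[str]) -> Counter:
    """Count how many representation rows point at each digest."""
    return Counter(representation_digests)


def cleanup(
    blobs: Dict[str, bytes],
    representation_digests: Iterable[str],
    dry_run: bool = True,
) -> List[str]:
    """Return unreferenced digests; delete them only when dry_run is False."""
    refs = count_references(representation_digests)
    orphans = sorted(digest for digest in blobs if refs[digest] == 0)
    if not dry_run:
        for digest in orphans:
            del blobs[digest]
    return orphans
```

Running with `dry_run=True` first gives an operator a report to review before anything is removed. A production cleanup would also need to guard against a blob becoming referenced between the count and the delete, which this sketch does not attempt.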
