Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: planned.

Purpose

Add efficient, governed blob handling to kontextual-engine so source, normalized, and derived representations can reference real content bytes without duplicating storage. Expose those bytes through engine-native interfaces and CMIS content stream routes.

Current State

The engine already records representation metadata:

digest,
size,
media type,
representation kind,
opaque storage_ref.

It does not yet provide:

a content-addressed blob store,
deduplicating writes,
blob read/stream interfaces,
reference accounting or garbage collection,
CMIS byte-stream download semantics.

Target Architecture

bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream

Core rule: blob storage is infrastructure. AssetRepresentation remains the engine-level content reference with provenance, digest, media type, kind, version, policy, and audit context.

Dedupe Strategy

Use whole-object content-addressing first:

digest: sha256:<hex>,
storage key: digest-derived path or adapter key,
idempotent write: if a blob with the same digest already exists, reuse it instead of writing again,
verify size and digest before returning a blob reference,
representation rows may duplicate references but must not duplicate bytes.
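
The whole-object strategy above can be sketched in a few lines. This is a minimal illustration, not the planned adapter: the class name InMemoryBlobStore, the expected_digest and expected_size parameters, and the blob_count helper are all hypothetical and exist only to show the idempotent write and the verification step.

```python
import hashlib
from typing import Dict, Optional


class InMemoryBlobStore:
    """Hypothetical whole-object content-addressed store (illustration only)."""

    def __init__(self) -> None:
        # Maps "sha256:<hex>" to the stored bytes.
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(
        self,
        data: bytes,
        expected_digest: Optional[str] = None,  # assumed parameter
        expected_size: Optional[int] = None,    # assumed parameter
    ) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify size and digest before handing back a reference.
        if expected_size is not None and expected_size != len(data):
            raise ValueError("size mismatch")
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        # Idempotent write: an existing blob with this digest is reused.
        self._blobs.setdefault(digest, data)
        return digest

    def blob_count(self) -> int:
        return len(self._blobs)
```

Two representations that carry the same bytes would then hold the same reference while the store keeps a single copy.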

Chunk-level deduplication is explicitly deferred until large-file evidence justifies the complexity.

Interfaces

Planned engine-native interfaces:

BlobStoragePort.put_bytes(...)
BlobStoragePort.open_bytes(...)
BlobStoragePort.stat(...)
BlobStoragePort.exists(...)
BlobStoragePort.delete_unreferenced(...)
RepresentationContentService.add_representation_from_bytes(...)
RepresentationContentService.get_content_stream(...)
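
One possible shape for the port is sketched below as a typing Protocol. The method names come from the list above; every parameter, return type, the BlobStat dataclass, and the DictBlobStore test double are assumptions made for illustration, since the workplan does not fix the signatures.

```python
import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, Protocol, Set, runtime_checkable


@dataclass(frozen=True)
class BlobStat:
    """Hypothetical stat result; the field set is an assumption."""
    digest: str
    size: int


@runtime_checkable
class BlobStoragePort(Protocol):
    # All signatures below are assumed, not taken from the workplan.
    def put_bytes(self, data: bytes) -> BlobStat: ...
    def open_bytes(self, digest: str) -> BinaryIO: ...
    def stat(self, digest: str) -> BlobStat: ...
    def exists(self, digest: str) -> bool: ...
    def delete_unreferenced(self, referenced: Set[str]) -> int: ...


class DictBlobStore:
    """Dict-backed test double that satisfies the assumed port."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes) -> BlobStat:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)
        return BlobStat(digest, len(data))

    def open_bytes(self, digest: str) -> BinaryIO:
        return io.BytesIO(self._blobs[digest])

    def stat(self, digest: str) -> BlobStat:
        return BlobStat(digest, len(self._blobs[digest]))

    def exists(self, digest: str) -> bool:
        return digest in self._blobs

    def delete_unreferenced(self, referenced: Set[str]) -> int:
        # Only digests absent from the caller-supplied reference set are removed.
        doomed = [d for d in self._blobs if d not in referenced]
        for d in doomed:
            del self._blobs[d]
        return len(doomed)
```

Passing the live reference set into delete_unreferenced is one way to keep reference accounting outside the adapter; the real design may track references differently.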

Planned CMIS integration:

getContentStream returns actual bytes/stream with content headers,
setContentStream stores through the deduplicating representation service,
content stream changes produce versions and audit events,
descriptors remain available for clients that only need metadata.
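
As a rough illustration of "content headers", the recorded representation metadata could be mapped onto HTTP response headers as below. The function name content_stream_headers, its parameters, and the choice of headers (including using the digest as an ETag) are assumptions for this sketch, not something the workplan or CMIS mandates.

```python
from typing import Dict, Optional


def content_stream_headers(
    media_type: str,
    size: int,
    digest: str,
    filename: Optional[str] = None,
) -> Dict[str, str]:
    """Build response headers from representation metadata (hypothetical helper)."""
    headers = {
        "Content-Type": media_type,
        "Content-Length": str(size),
        # Assumption: the content digest doubles as a strong validator.
        "ETag": f'"{digest}"',
    }
    if filename is not None:
        # Escape backslashes and quotes so the quoted filename stays well formed.
        safe = filename.replace("\\", "\\\\").replace('"', '\\"')
        headers["Content-Disposition"] = f'attachment; filename="{safe}"'
    return headers
```

Because the digest identifies the bytes, a header derived from it would change exactly when the content changes, which fits the rule that content stream changes produce new versions.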

Risks

Large files may require streaming APIs rather than in-memory bytes.
Local filesystem adapters need atomic writes and digest verification.
Garbage collection must never delete referenced blobs.
Security must treat blob bytes as governed content, not public storage.
Existing storage_ref values may point to external sources and should remain valid as opaque references.
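
The first two risks can be addressed together in a local filesystem adapter. The sketch below streams input in chunks, hashes it incrementally, writes to a temporary file, and only then renames it into place. The function name store_stream, the two-character fan-out directory layout, and the ".incoming-" prefix are assumptions chosen for illustration.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import BinaryIO, Optional

CHUNK_SIZE = 1024 * 1024  # read 1 MiB at a time instead of loading whole files


def store_stream(
    root: Path, stream: BinaryIO, expected_digest: Optional[str] = None
) -> Path:
    """Stream bytes into a digest-derived path under root (hypothetical layout)."""
    root.mkdir(parents=True, exist_ok=True)
    hasher = hashlib.sha256()
    # The temp file lives under root so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=root, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            while True:
                chunk = stream.read(CHUNK_SIZE)
                if not chunk:
                    break
                hasher.update(chunk)
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        hex_digest = hasher.hexdigest()
        if expected_digest is not None and expected_digest != "sha256:" + hex_digest:
            raise ValueError("digest mismatch")
        final_dir = root / hex_digest[:2]
        final_dir.mkdir(exist_ok=True)
        final_path = final_dir / hex_digest
        if not final_path.exists():
            # Atomic publish: readers see either no blob or the complete blob.
            os.replace(tmp_name, final_path)
        return final_path
    finally:
        # Remove the temp file on failure or when the blob already existed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
```

A failed verification leaves no partial blob behind, and a repeated write of identical content resolves to the existing path.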
