kontextual-engine/docs/blob-storage-content-streaming-workplan.md

Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: planned.

Purpose

Add efficient, governed blob handling to kontextual-engine so source, normalized, and derived representations can reference real content bytes without duplicating storage. Expose those bytes through engine-native interfaces and CMIS content stream routes.

Current State

The engine already records representation metadata:

  • digest,
  • size,
  • media type,
  • representation kind,
  • opaque storage_ref.

It does not yet provide:

  • a content-addressed blob store,
  • deduplicating writes,
  • blob read/stream interfaces,
  • reference accounting or garbage collection,
  • CMIS byte-stream download semantics.

Target Architecture

bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream

Core rule: blob storage is infrastructure. AssetRepresentation remains the engine-level content reference with provenance, digest, media type, kind, version, policy, and audit context.

Dedupe Strategy

Start with whole-object content addressing:

  • digest: sha256:<hex>,
  • storage key: digest-derived path or adapter key,
  • idempotent write: if a blob with the same digest already exists, reuse it,
  • verify size and digest before returning a blob reference,
  • several representation rows may reference the same blob, but its bytes must be stored only once.

Chunk-level deduplication is explicitly deferred until large-file evidence justifies the complexity.
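
The whole-object strategy above can be sketched in a few lines of Python. This is a minimal illustration, not the engine's implementation: the helper names (blob_digest, storage_key, put_if_absent), the two-level fan-out path layout, and the use of a plain dict as the backing store are all assumptions made for the example.

```python
import hashlib
from typing import Dict, Optional


def blob_digest(data: bytes) -> str:
    """Return the digest in the sha256:<hex> form described above."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def storage_key(digest: str) -> str:
    """Derive a storage key from the digest (hypothetical fan-out layout)."""
    algorithm, hex_part = digest.split(":", 1)
    return f"{algorithm}/{hex_part[:2]}/{hex_part[2:4]}/{hex_part}"


def put_if_absent(
    store: Dict[str, bytes],
    data: bytes,
    expected_digest: Optional[str] = None,
    expected_size: Optional[int] = None,
) -> str:
    """Verify the bytes, store them once, and return the digest as the reference."""
    digest = blob_digest(data)
    # Verify caller-supplied digest and size before handing out a reference.
    if expected_digest is not None and expected_digest != digest:
        raise ValueError(f"digest mismatch: expected {expected_digest}, got {digest}")
    if expected_size is not None and expected_size != len(data):
        raise ValueError(f"size mismatch: expected {expected_size}, got {len(data)}")
    # Idempotent write: an existing blob with the same key is reused as-is.
    store.setdefault(storage_key(digest), data)
    return digest


# Two writes of identical bytes yield one stored blob and equal references.
store: Dict[str, bytes] = {}
first = put_if_absent(store, b"hello", expected_size=5)
second = put_if_absent(store, b"hello")
```

Both calls return the same reference and the store holds a single entry, which is the property the last bullet asks for: references may repeat, bytes may not.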

Interfaces

Planned engine-native interfaces:

  • BlobStoragePort.put_bytes(...)
  • BlobStoragePort.open_bytes(...)
  • BlobStoragePort.stat(...)
  • BlobStoragePort.exists(...)
  • BlobStoragePort.delete_unreferenced(...)
  • RepresentationContentService.add_representation_from_bytes(...)
  • RepresentationContentService.get_content_stream(...)
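
One possible shape for the storage port is sketched below. The method names come from the list above; every parameter, return type, the BlobStat record, and the DictBlobStore test double are assumptions added for illustration, since the workplan leaves the signatures open.

```python
from __future__ import annotations

import hashlib
import io
from dataclasses import dataclass
from typing import BinaryIO, Dict, Iterable, List, Protocol


@dataclass(frozen=True)
class BlobStat:
    """Hypothetical result record: what the store knows about one blob."""
    digest: str
    size: int


class BlobStoragePort(Protocol):
    """Assumed signatures; only the method names appear in the workplan."""

    def put_bytes(self, data: bytes) -> BlobStat: ...
    def open_bytes(self, digest: str) -> BinaryIO: ...
    def stat(self, digest: str) -> BlobStat: ...
    def exists(self, digest: str) -> bool: ...
    def delete_unreferenced(self, referenced: Iterable[str]) -> List[str]: ...


class DictBlobStore:
    """In-memory stand-in that satisfies the protocol structurally."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes) -> BlobStat:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(digest, data)  # reuse an existing blob
        return BlobStat(digest, len(data))

    def open_bytes(self, digest: str) -> BinaryIO:
        return io.BytesIO(self._blobs[digest])

    def stat(self, digest: str) -> BlobStat:
        return BlobStat(digest, len(self._blobs[digest]))

    def exists(self, digest: str) -> bool:
        return digest in self._blobs

    def delete_unreferenced(self, referenced: Iterable[str]) -> List[str]:
        # Only blobs absent from the caller-supplied reference set are removed.
        keep = set(referenced)
        doomed = [d for d in self._blobs if d not in keep]
        for d in doomed:
            del self._blobs[d]
        return doomed
```

Passing the referenced set into delete_unreferenced is one way to keep reference accounting outside the adapter; an adapter that tracks references itself would need a different signature.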

Planned CMIS integration:

  • getContentStream returns the actual bytes as a stream, with content headers,
  • setContentStream stores content through the deduplicating representation service,
  • content stream changes produce versions and audit events,
  • descriptors remain available for clients that only need metadata.
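
As a rough sketch of what "content headers" could mean for a getContentStream response, the function below derives HTTP headers from representation metadata. The RepresentationDescriptor type, the function name, and the choice of headers (in particular using the digest as an ETag) are assumptions for illustration; the workplan does not specify them.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RepresentationDescriptor:
    """Hypothetical subset of the representation metadata listed earlier."""
    digest: str
    size: int
    media_type: str


def content_stream_headers(rep: RepresentationDescriptor) -> Dict[str, str]:
    """Build response headers from metadata alone, without touching the bytes."""
    return {
        "Content-Type": rep.media_type,
        "Content-Length": str(rep.size),
        # Assumption: a content digest is a natural strong validator.
        "ETag": f'"{rep.digest}"',
    }


headers = content_stream_headers(
    RepresentationDescriptor(digest="sha256:abc123", size=5, media_type="text/plain")
)
```

Because the headers come from the descriptor, metadata-only clients and byte-stream clients would see consistent values for size and media type.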

Risks

  • Large files may require streaming APIs rather than in-memory bytes.
  • Local filesystem adapters need atomic writes and digest verification.
  • Garbage collection must never delete referenced blobs.
  • Access control must treat blob bytes as governed content, not as publicly readable storage.
  • Existing storage_ref values may point to external sources and should remain valid as opaque references.
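
The first two risks (streaming instead of in-memory bytes, and atomic local writes with digest verification) can be addressed together. The sketch below is one way a local filesystem adapter might do it, assuming a hypothetical put_stream function, a 1 MiB chunk size, and a two-character fan-out directory; none of these are specified by the workplan.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # assumed chunk size for streaming reads


def put_stream(root: Path, stream: BinaryIO) -> str:
    """Stream bytes to a temp file while hashing, then publish atomically."""
    root.mkdir(parents=True, exist_ok=True)
    hasher = hashlib.sha256()
    # The temp file lives under the store root so the final rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=root, prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            while chunk := stream.read(CHUNK_SIZE):
                hasher.update(chunk)
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())  # flush file contents to disk before publishing
        hex_digest = hasher.hexdigest()
        final = root / hex_digest[:2] / hex_digest
        final.parent.mkdir(parents=True, exist_ok=True)
        if not final.exists():
            # os.replace is atomic on the same filesystem, so readers see
            # either no blob or the complete blob.
            os.replace(tmp_name, final)
    finally:
        # Clean up when the blob already existed or an error occurred.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
    return "sha256:" + hex_digest
```

Since the final path is derived from the digest computed during the write, a blob can never be published under a name that does not match its content, and a repeated upload leaves the existing file untouched.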