kontextual-engine/docs/blob-storage-content-streaming-workplan.md
2026-05-14 01:02:25 +02:00

Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: implemented.

Purpose

Add efficient, governed blob handling to kontextual-engine so source, normalized, and derived representations can reference real content bytes without duplicating storage. Expose those bytes through engine-native interfaces and CMIS content stream routes.

Implemented State

The engine records representation metadata:

  • digest,
  • size,
  • media type,
  • representation kind,
  • storage_ref.

It now provides:

  • a content-addressed blob storage port,
  • in-memory, local filesystem, and optional S3 adapters,
  • deduplicating writes by sha256:<hex> digest,
  • whole-object reads plus chunked iter_bytes(...) streaming,
  • representation-level content service governance,
  • reference accounting and dry-run/active cleanup,
  • CMIS Browser Binding content stream byte routes.

Target Architecture

bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter (memory/local/S3)
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream

Core rule: blob storage is infrastructure. AssetRepresentation remains the engine-level content reference with provenance, digest, media type, kind, version, policy, and audit context.

Dedupe Strategy

Use whole-object content-addressing first:

  • digest: sha256:<hex>,
  • storage key: digest-derived path or adapter key,
  • idempotent write: if digest already exists, reuse blob,
  • verify size and digest before returning a blob reference,
  • representation rows may duplicate references but must not duplicate bytes.

Chunk-level deduplication is explicitly deferred until large-file evidence justifies the complexity.
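
The write-side rules above can be sketched as follows. This is a minimal in-memory illustration, not the engine's adapter: the names DedupeStore and BlobRef, and the expected_digest / expected_size parameters, are assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical blob reference handed back to the caller."""
    digest: str  # "sha256:<hex>"
    size: int


class DedupeStore:
    """Hypothetical whole-object, content-addressed store backed by a dict."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put_bytes(
        self,
        data: bytes,
        expected_digest: Optional[str] = None,
        expected_size: Optional[int] = None,
    ) -> BlobRef:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify caller-supplied digest and size before returning a reference.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        if expected_size is not None and expected_size != len(data):
            raise ValueError("size mismatch")
        # Idempotent write: if the digest is already present, the bytes are reused.
        self._blobs.setdefault(digest, data)
        return BlobRef(digest=digest, size=len(data))

    def blob_count(self) -> int:
        return len(self._blobs)
```

Two writes of identical bytes return equal references and leave one stored blob, which is the property that lets several representation rows share a single copy of the content.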

Interfaces

Engine-native interfaces:

  • BlobStoragePort.put_bytes(...)
  • BlobStoragePort.read_bytes(...)
  • BlobStoragePort.iter_bytes(...)
  • BlobStoragePort.stat(...)
  • BlobStoragePort.exists(...)
  • BlobStoragePort.delete_unreferenced(...)
  • RepresentationContentService.add_representation_from_bytes(...)
  • RepresentationContentService.get_content_stream(...)
  • RepresentationContentService.stream_content(...)
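
To illustrate the difference between whole-object reads and chunked streaming, here is a small in-memory stand-in. The class name SketchBlobStore, every parameter, and the return types are assumptions; the document does not specify the port's actual signatures.

```python
import hashlib
from typing import Iterator


class SketchBlobStore:
    """Hypothetical in-memory stand-in; signatures are assumed, not the engine's."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put_bytes(self, data: bytes) -> str:
        ref = "sha256:" + hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(ref, data)
        return ref

    def exists(self, ref: str) -> bool:
        return ref in self._blobs

    def stat(self, ref: str) -> dict:
        # Metadata only; no content bytes are returned.
        return {"digest": ref, "size": len(self._blobs[ref])}

    def read_bytes(self, ref: str) -> bytes:
        # Whole-object read: the full payload is returned at once.
        return self._blobs[ref]

    def iter_bytes(self, ref: str, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
        # Chunked read: yields successive slices of at most chunk_size bytes.
        if chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        data = self._blobs[ref]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]
```

Joining the chunks from iter_bytes reproduces exactly what read_bytes returns. A filesystem or S3 adapter would read each chunk from the backing store instead of slicing a bytes object held in memory.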

CMIS integration:

  • getContentStream returns actual bytes/stream with content headers,
  • setContentStream stores through the deduplicating representation service,
  • appendContentStream composes the current stream plus appended bytes and stores the resulting representation through the same deduplicating service,
  • content stream changes produce versions and audit events,
  • descriptors remain available for clients that only need metadata.
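
The appendContentStream behaviour described above (compose, then store through the same deduplicating path) can be sketched as below. The helpers store_bytes and append_content are hypothetical, and the sketch leaves out the version and audit records that the real service produces.

```python
import hashlib


def store_bytes(blobs: dict[str, bytes], data: bytes) -> str:
    """Hypothetical deduplicating write into a plain dict keyed by digest."""
    ref = "sha256:" + hashlib.sha256(data).hexdigest()
    blobs.setdefault(ref, data)
    return ref


def append_content(blobs: dict[str, bytes], current_ref: str, appended: bytes) -> str:
    """Compose the current bytes plus the appended bytes and store the result."""
    combined = blobs[current_ref] + appended
    # The combined payload goes through the same content-addressed write,
    # so it gets its own digest and the earlier blob is left untouched.
    return store_bytes(blobs, combined)
```

Because the result is addressed by its own digest, the previous content stays resolvable for earlier versions, and repeating the same append yields the same reference rather than a second copy.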

Storage Backends

  • InMemoryBlobStorage supports deterministic unit tests and default runtime wiring.
  • LocalBlobStorage stores content under digest-derived paths and uses atomic temporary writes.
  • S3BlobStorage is available through the optional kontextual-engine[s3] extra and keeps S3 concerns behind the same blob port. It uses digest-derived object keys and streams object bodies in chunks.

The engine stores only the returned storage_ref on representations. Backend selection is therefore a deployment concern, not a domain-model fork.
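
As an illustration of digest-derived paths with atomic temporary writes, the following sketch uses only the Python standard library. The two-character fan-out layout and the function names digest_path and put_bytes_atomic are assumptions, not the LocalBlobStorage implementation.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def digest_path(root: Path, hex_digest: str) -> Path:
    # Hypothetical layout: <root>/sha256/<first two hex chars>/<full hex digest>.
    return root / "sha256" / hex_digest[:2] / hex_digest


def put_bytes_atomic(root: Path, data: bytes) -> str:
    hex_digest = hashlib.sha256(data).hexdigest()
    target = digest_path(root, hex_digest)
    if target.exists():
        # Same digest already stored: reuse the existing blob.
        return "sha256:" + hex_digest
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the target directory, then rename it into
    # place, so readers never observe a partially written blob.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the temporary file if the write or rename failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return "sha256:" + hex_digest
```

The temporary file is created in the destination directory so that os.replace stays on one filesystem, where the rename is atomic on POSIX systems.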

Migration Posture

Existing opaque storage_ref values remain valid metadata, but content bytes can be streamed only when the configured blob adapter can resolve the reference. Migration should import external content through RepresentationContentService.add_representation_from_bytes(...) so dedupe, digest verification, policy, versions, and audit events are preserved.

Risks

  • Very large files may require upload-side streaming beyond the current byte-based write API.
  • Local filesystem adapters need atomic writes and digest verification.
  • Garbage collection must never delete referenced blobs.
  • Security must treat blob bytes as governed content, not public storage.
  • Existing storage_ref values may point to external sources and should remain valid as opaque references.
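
The garbage-collection risk can be made concrete with a small sketch of reference-aware cleanup. The function sweep_unreferenced and its parameters are hypothetical; it is a standalone illustration of the dry-run/active split, not the signature of BlobStoragePort.delete_unreferenced(...).

```python
def sweep_unreferenced(
    blobs: dict[str, bytes],
    referenced: set[str],
    dry_run: bool = True,
) -> list[str]:
    """Return unreferenced digests; delete them only when dry_run is False."""
    # A blob is a candidate only if no representation references its digest.
    candidates = sorted(ref for ref in blobs if ref not in referenced)
    if not dry_run:
        for ref in candidates:
            del blobs[ref]
    return candidates
```

Defaulting to dry_run=True means a caller has to opt in to deletion explicitly, and the dry-run result can be reviewed against the reference accounting before an active sweep is run.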