# Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: planned.

## Purpose

Add efficient, governed blob handling to `kontextual-engine` so source, normalized, and derived representations can reference real content bytes without duplicating storage. Expose those bytes through engine-native interfaces and CMIS content stream routes.

## Current State

The engine already records representation metadata:

- digest,
- size,
- media type,
- representation kind,
- opaque `storage_ref`.

It does not yet provide:

- a content-addressed blob store,
- deduplicating writes,
- blob read/stream interfaces,
- reference accounting or garbage collection,
- CMIS byte-stream download semantics.

## Target Architecture

```text
bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream
```

Core rule: blob storage is infrastructure. `AssetRepresentation` remains the engine-level content reference with provenance, digest, media type, kind, version, policy, and audit context.

## Dedupe Strategy

Use whole-object content-addressing first:

- digest: `sha256:`,
- storage key: digest-derived path or adapter key,
- idempotent write: if digest already exists, reuse blob,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.

Chunk-level deduplication is explicitly deferred until large-file evidence justifies the complexity.
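The whole-object dedupe rules above can be illustrated with a minimal in-memory sketch. It is not the planned adapter: the class `InMemoryBlobStore`, the exception `BlobVerificationError`, and the `expected_digest` / `expected_size` keyword arguments are hypothetical names chosen for illustration, since the signature of `put_bytes(...)` is not yet specified.

```python
import hashlib
from typing import Dict, Optional


class BlobVerificationError(ValueError):
    """Hypothetical error: supplied bytes do not match the declared digest or size."""


class InMemoryBlobStore:
    """Illustrative stand-in for a content-addressed adapter (assumption, not the real port)."""

    def __init__(self) -> None:
        # Storage key is the digest itself, so identical bytes share one entry.
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(
        self,
        data: bytes,
        *,
        expected_digest: Optional[str] = None,
        expected_size: Optional[int] = None,
    ) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify size and digest before returning a blob reference.
        if expected_size is not None and expected_size != len(data):
            raise BlobVerificationError("size mismatch")
        if expected_digest is not None and expected_digest != digest:
            raise BlobVerificationError("digest mismatch")
        # Idempotent write: if the digest already exists, the stored blob is reused.
        self._blobs.setdefault(digest, data)
        return digest

    def exists(self, digest: str) -> bool:
        return digest in self._blobs

    def blob_count(self) -> int:
        return len(self._blobs)
```

Writing the same bytes twice returns the same reference and leaves `blob_count()` at one, which is the property that lets several representation rows point at a single stored blob.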
## Interfaces

Planned engine-native interfaces:

- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.open_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`

Planned CMIS integration:

- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through deduplicating representation service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.

## Risks

- Large files may require streaming APIs rather than in-memory bytes.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain valid as opaque references.
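The first two risks (streaming instead of in-memory bytes, and atomic writes with digest verification on a local filesystem) could be addressed along the following lines. This is a sketch under stated assumptions: the function `put_stream`, the `CHUNK_SIZE` constant, and the two-character fan-out directory layout are all hypothetical and not part of the planned interface list.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import BinaryIO, Tuple

CHUNK_SIZE = 1024 * 1024  # assumed read size; tune against real workloads


def put_stream(root: Path, stream: BinaryIO) -> Tuple[str, int]:
    """Stream bytes into a content-addressed file under `root`.

    Returns (digest, size). Hypothetical helper, not the real adapter.
    """
    root.mkdir(parents=True, exist_ok=True)
    hasher = hashlib.sha256()
    size = 0
    # The temp file lives under `root` so the final rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=str(root), prefix=".incoming-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            while True:
                chunk = stream.read(CHUNK_SIZE)
                if not chunk:
                    break
                hasher.update(chunk)  # digest is computed while streaming
                tmp.write(chunk)
                size += len(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())
        hex_digest = hasher.hexdigest()
        # Assumed layout: <root>/<first two hex chars>/<full hex digest>
        final_path = root / hex_digest[:2] / hex_digest
        if not final_path.exists():
            final_path.parent.mkdir(parents=True, exist_ok=True)
            # os.replace is atomic when source and target share a filesystem.
            os.replace(tmp_name, str(final_path))
    finally:
        # Remove the temp file if it was a duplicate or the write failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
    return "sha256:" + hex_digest, size
```

Because the final path is derived from the digest computed during the write, a reader never observes a partially written blob under its content address, and a repeated upload of the same bytes leaves exactly one file on disk.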