# Blob Storage And Content Streaming Workplan

Date: 2026-05-07

Status: implemented.

## Purpose

Add efficient, governed blob handling to `kontextual-engine` so source, normalized, and derived representations can reference real content bytes without duplicating storage. Expose those bytes through engine-native interfaces and CMIS content stream routes.

## Implemented State

The engine records representation metadata:

- digest,
- size,
- media type,
- representation kind,
- `storage_ref`.

It now provides:

- a content-addressed blob storage port,
- in-memory, local filesystem, and optional S3 adapters,
- deduplicating writes by `sha256:` digest,
- whole-byte reads plus chunked `iter_bytes(...)` streaming,
- representation-level content service governance,
- reference accounting and dry-run/active cleanup,
- CMIS Browser Binding content stream byte routes.

## Target Architecture

```text
bytes
  -> digest/size verification
  -> BlobStoragePort
  -> content-addressed adapter (memory/local/S3)
  -> AssetRepresentation storage_ref
  -> governed representation service
  -> service API / CMIS content stream
```

Core rule: blob storage is infrastructure. `AssetRepresentation` remains the engine-level content reference with provenance, digest, media type, kind, version, policy, and audit context.

## Dedupe Strategy

Use whole-object content-addressing first:

- digest: `sha256:`,
- storage key: digest-derived path or adapter key,
- idempotent write: if digest already exists, reuse blob,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.

Chunk-level deduplication is explicitly deferred until large-file evidence justifies the complexity.

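The whole-object dedupe rules can be illustrated with a minimal in-memory sketch. It is not the engine's adapter: `SketchBlobStore`, `BlobRef`, the `expected_digest` parameter, the `mem://` reference format, and the tiny default `chunk_size` are all assumptions made for illustration, since the real `BlobStoragePort` signatures are not spelled out here.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical blob reference; the field names are assumptions."""
    digest: str       # "sha256:" followed by the hex digest
    size: int
    storage_ref: str  # adapter-specific key, here a made-up mem:// form


class SketchBlobStore:
    """Toy content-addressed store keyed by sha256 digest."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(self, data: bytes, expected_digest: Optional[str] = None) -> BlobRef:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify the digest before handing back a reference.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        # Idempotent write: an existing digest keeps its stored bytes.
        self._blobs.setdefault(digest, data)
        return BlobRef(digest=digest, size=len(data), storage_ref=f"mem://{digest}")

    def iter_bytes(self, digest: str, chunk_size: int = 4) -> Iterator[bytes]:
        # Yield the blob in fixed-size chunks instead of one large read.
        data = self._blobs[digest]
        for start in range(0, len(data), chunk_size):
            yield data[start:start + chunk_size]

    def blob_count(self) -> int:
        return len(self._blobs)


store = SketchBlobStore()
first = store.put_bytes(b"hello world")
second = store.put_bytes(b"hello world")  # same bytes, same reference
```

Two representations can hold `first` and `second` as separate rows while the store keeps a single copy of the bytes, which is the "duplicate references, not bytes" rule in miniature.
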
## Interfaces

Engine-native interfaces:

- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.read_bytes(...)`
- `BlobStoragePort.iter_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`
- `RepresentationContentService.stream_content(...)`

CMIS integration:

- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through deduplicating representation service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.

## Storage Backends

- `InMemoryBlobStorage` supports deterministic unit tests and default runtime wiring.
- `LocalBlobStorage` stores content under digest-derived paths and uses atomic temporary writes.
- `S3BlobStorage` is available through the optional `kontextual-engine[s3]` extra and keeps S3 concerns behind the same blob port. It uses digest-derived object keys and streams object bodies in chunks.

The engine stores only the returned `storage_ref` on representations. Backend selection is therefore a deployment concern, not a domain-model fork.

## Migration Posture

Existing opaque `storage_ref` values remain valid metadata, but content bytes can only be streamed when the configured blob adapter can resolve the reference. Migration should import external content through `RepresentationContentService.add_representation_from_bytes(...)` so dedupe, digest verification, policy, versions, and audit events are preserved.

## Risks

- Very large files may require upload-side streaming beyond the current byte-based write API.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain valid as opaque references.
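
The garbage-collection risk can be made concrete with a small sketch of reference-checked cleanup that defaults to a dry run. The function below is a standalone illustration, not the signature of `BlobStoragePort.delete_unreferenced(...)`: the plain-dict store, the `referenced_digests` argument, and the `dry_run` flag are assumptions.

```python
from typing import Dict, Iterable, List, Set


def delete_unreferenced_sketch(
    blobs: Dict[str, bytes],
    referenced_digests: Iterable[str],
    dry_run: bool = True,
) -> List[str]:
    """Return digests that no representation references.

    Blobs are removed only when dry_run is False; referenced blobs are
    never candidates, however many rows point at them.
    """
    referenced: Set[str] = set(referenced_digests)
    candidates = sorted(d for d in blobs if d not in referenced)
    if not dry_run:
        for digest in candidates:
            del blobs[digest]
    return candidates


# Hypothetical state: two stored blobs, one referenced by two representations.
blobs = {"sha256:aaa": b"a", "sha256:bbb": b"b"}
references = ["sha256:aaa", "sha256:aaa"]

report = delete_unreferenced_sketch(blobs, references)            # dry run
remaining_after_dry_run = sorted(blobs)
deleted = delete_unreferenced_sketch(blobs, references, dry_run=False)
remaining_after_cleanup = sorted(blobs)
```

Running the dry pass first gives operators a reviewable list of candidates before any bytes are removed. A real adapter would also need to guard against references added between the scan and the delete.
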