Blob Storage And Content Streaming Workplan
Date: 2026-05-07
Status: implemented.
Purpose
Add efficient, governed blob handling to kontextual-engine so source,
normalized, and derived representations can reference real content bytes
without duplicating storage. Expose those bytes through engine-native
interfaces and CMIS content stream routes.
Implemented State
The engine records representation metadata:
- digest,
- size,
- media type,
- representation kind,
- storage_ref.
It now provides:
- a content-addressed blob storage port,
- in-memory, local filesystem, and optional S3 adapters,
- deduplicating writes by sha256:<hex> digest,
- whole-byte reads plus chunked iter_bytes(...) streaming,
- representation-level content service governance,
- reference accounting and dry-run/active cleanup,
- CMIS Browser Binding content stream byte routes.
Target Architecture
bytes
-> digest/size verification
-> BlobStoragePort
-> content-addressed adapter (memory/local/S3)
-> AssetRepresentation storage_ref
-> governed representation service
-> service API / CMIS content stream
Core rule: blob storage is infrastructure. AssetRepresentation remains the
engine-level content reference with provenance, digest, media type, kind,
version, policy, and audit context.
Dedupe Strategy
Use whole-object content-addressing first:
- digest: sha256:<hex>,
- storage key: digest-derived path or adapter key,
- idempotent write: if digest already exists, reuse blob,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.
Chunk-level deduplication is explicitly deferred until large-file evidence justifies the complexity.
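The whole-object rule above can be sketched in a few lines. This is a minimal illustration, not the engine's adapter: the class name InMemoryDedupeStore, the BlobRef dataclass, and the expected_digest / expected_size parameters are assumptions introduced here; only the sha256:<hex> digest format and the put_bytes name come from this document.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BlobRef:
    """Hypothetical blob reference; the field names are assumptions."""
    digest: str  # formatted as "sha256:<hex>"
    size: int


class InMemoryDedupeStore:
    """Illustrative whole-object content-addressed store."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def put_bytes(
        self,
        data: bytes,
        expected_digest: Optional[str] = None,
        expected_size: Optional[int] = None,
    ) -> BlobRef:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        # Verify caller-supplied digest and size before returning a reference.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        if expected_size is not None and expected_size != len(data):
            raise ValueError("size mismatch")
        # Idempotent write: if the digest already exists, the blob is reused.
        self._blobs.setdefault(digest, data)
        return BlobRef(digest=digest, size=len(data))

    def blob_count(self) -> int:
        return len(self._blobs)


store = InMemoryDedupeStore()
first = store.put_bytes(b"hello")
second = store.put_bytes(b"hello")  # same bytes, same reference, no new blob
```

Two representations can hold first and second as separate references while the store keeps a single copy of the bytes.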
Interfaces
Engine-native interfaces:
- BlobStoragePort.put_bytes(...)
- BlobStoragePort.read_bytes(...)
- BlobStoragePort.iter_bytes(...)
- BlobStoragePort.stat(...)
- BlobStoragePort.exists(...)
- BlobStoragePort.delete_unreferenced(...)
- RepresentationContentService.add_representation_from_bytes(...)
- RepresentationContentService.get_content_stream(...)
- RepresentationContentService.stream_content(...)
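One way to picture the blob port is as a structural interface. The sketch below reuses the six BlobStoragePort method names listed above, but this document elides every parameter list, so all arguments, defaults, and return types shown here are assumptions made for illustration and may differ from the engine's actual signatures.

```python
from typing import Dict, Iterator, List, Protocol, Set, runtime_checkable


@runtime_checkable
class BlobStoragePort(Protocol):
    """Hypothetical shape of the port; only the method names come from the workplan."""

    def put_bytes(self, data: bytes) -> str:
        """Store bytes and return a storage_ref (assumed to be digest-derived)."""
        ...

    def read_bytes(self, storage_ref: str) -> bytes:
        """Return the whole blob in one call."""
        ...

    def iter_bytes(self, storage_ref: str, chunk_size: int = 65536) -> Iterator[bytes]:
        """Yield the blob in chunks for streaming responses."""
        ...

    def stat(self, storage_ref: str) -> Dict[str, object]:
        """Return blob metadata such as size and digest (assumed layout)."""
        ...

    def exists(self, storage_ref: str) -> bool:
        """Report whether the adapter can resolve the reference."""
        ...

    def delete_unreferenced(self, referenced: Set[str], dry_run: bool = True) -> List[str]:
        """Return, and optionally delete, blobs that no representation references."""
        ...
```

Expressing the port as a Protocol would let memory, local, and S3 adapters satisfy it without sharing a base class, which fits the rule that blob storage is infrastructure.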
CMIS integration:
- getContentStream returns actual bytes/stream with content headers,
- setContentStream stores through deduplicating representation service,
- appendContentStream composes the current stream plus appended bytes and stores the resulting representation through the same deduplicating service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.
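The appendContentStream behavior described above (compose, then store through the same deduplicating path) can be sketched with a plain dictionary standing in for the blob store. The helper names put_blob and append_content_stream are hypothetical, and versioning, policy, and audit events are deliberately left out.

```python
import hashlib
from typing import Dict


def put_blob(blobs: Dict[str, bytes], data: bytes) -> str:
    """Store bytes under their sha256:<hex> digest; reuse the blob if present."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    blobs.setdefault(digest, data)
    return digest


def append_content_stream(blobs: Dict[str, bytes], current_digest: str, appended: bytes) -> str:
    """Compose current content plus appended bytes and store the result as a new blob."""
    combined = blobs[current_digest] + appended
    # The previous blob is left in place: earlier versions may still reference it.
    return put_blob(blobs, combined)
```

Because the composed bytes go through the same write path, appending zero bytes resolves to the existing digest instead of creating a second copy.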
Storage Backends
- InMemoryBlobStorage supports deterministic unit tests and default runtime wiring.
- LocalBlobStorage stores content under digest-derived paths and uses atomic temporary writes.
- S3BlobStorage is available through the optional kontextual-engine[s3] extra and keeps S3 concerns behind the same blob port. It uses digest-derived object keys and streams object bodies in chunks.
The engine stores only the returned storage_ref on representations. Backend
selection is therefore a deployment concern, not a domain-model fork.
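A local filesystem adapter along the lines described above might combine digest-derived paths, an atomic temporary write, and chunked reads roughly as follows. This is a sketch under stated assumptions: the class name LocalBlobSketch, the directory layout, and the default chunk size are invented for illustration and are not taken from LocalBlobStorage itself.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Iterator


class LocalBlobSketch:
    """Illustrative local adapter; not the engine's LocalBlobStorage."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)

    def _path_for(self, digest: str) -> Path:
        # Assumed layout: <root>/sha256/<first two hex chars>/<full hex digest>
        algo, hex_digest = digest.split(":", 1)
        return self.root / algo / hex_digest[:2] / hex_digest

    def put_bytes(self, data: bytes) -> str:
        digest = "sha256:" + hashlib.sha256(data).hexdigest()
        target = self._path_for(digest)
        if target.exists():
            return digest  # dedupe: these bytes are already on disk
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file in the same directory, then rename over the
        # target so readers never observe a partially written blob.
        fd, tmp_name = tempfile.mkstemp(dir=target.parent)
        try:
            with os.fdopen(fd, "wb") as tmp:
                tmp.write(data)
                tmp.flush()
                os.fsync(tmp.fileno())
            os.replace(tmp_name, target)
        except BaseException:
            if os.path.exists(tmp_name):
                os.unlink(tmp_name)
            raise
        return digest

    def iter_bytes(self, digest: str, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
        # Stream the blob in fixed-size chunks instead of loading it whole.
        with open(self._path_for(digest), "rb") as fh:
            while chunk := fh.read(chunk_size):
                yield chunk
```

Keeping the temporary file in the target directory matters because os.replace is only atomic when source and destination are on the same filesystem.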
Migration Posture
Existing opaque storage_ref values remain valid metadata, but content bytes
can only be streamed when the configured blob adapter can resolve the reference.
Migration should import external content through
RepresentationContentService.add_representation_from_bytes(...) so dedupe,
digest verification, policy, versions, and audit events are preserved.
Risks
- Very large files may require upload-side streaming beyond the current byte-based write API.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing storage_ref values may point to external sources and should remain valid as opaque references.
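The garbage-collection risk above comes down to computing cleanup candidates strictly from the set of referenced digests and defaulting to a dry run. The sketch below illustrates that idea over a plain dictionary; the function names and the dry_run parameter are assumptions, and a real implementation would also need to guard against references created while cleanup is running.

```python
from typing import Dict, Iterable, List


def plan_cleanup(stored_digests: Iterable[str], referenced_digests: Iterable[str]) -> List[str]:
    """Return stored digests that no representation references."""
    referenced = set(referenced_digests)
    return sorted(d for d in stored_digests if d not in referenced)


def delete_unreferenced(
    blobs: Dict[str, bytes],
    referenced_digests: Iterable[str],
    dry_run: bool = True,
) -> List[str]:
    """Report cleanup candidates; delete them only when dry_run is False."""
    candidates = plan_cleanup(blobs.keys(), referenced_digests)
    if not dry_run:
        for digest in candidates:
            del blobs[digest]
    return candidates
```

Running the dry-run form first gives an operator the exact list an active cleanup would remove, and referenced blobs never appear in that list.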