# Blob Storage And Content Streaming Workplan
Date: 2026-05-07
Status: planned.
## Purpose
Add efficient, governed blob handling to `kontextual-engine` so source,
normalized, and derived representations can reference real content bytes
without duplicating storage. Expose those bytes through engine-native
interfaces and CMIS content stream routes.
## Current State
The engine already records representation metadata:
- digest,
- size,
- media type,
- representation kind,
- opaque `storage_ref`.
It does not yet provide:
- a content-addressed blob store,
- deduplicating writes,
- blob read/stream interfaces,
- reference accounting or garbage collection,
- CMIS byte-stream download semantics.
## Target Architecture
```text
bytes
-> digest/size verification
-> BlobStoragePort
-> content-addressed adapter
-> AssetRepresentation storage_ref
-> governed representation service
-> service API / CMIS content stream
```
Core rule: blob storage is infrastructure. `AssetRepresentation` remains the
engine-level content reference with provenance, digest, media type, kind,
version, policy, and audit context.
## Dedupe Strategy
Use whole-object content addressing first:
- digest: `sha256:<hex>`,
- storage key: digest-derived path or adapter key,
- idempotent write: if a blob with the same digest already exists, reuse it,
- verify size and digest before returning a blob reference,
- representation rows may duplicate references but must not duplicate bytes.
Chunk-level deduplication is explicitly deferred until large-file evidence
justifies the complexity.
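
The strategy above can be sketched as a minimal in-memory store. This is an illustration under stated assumptions, not the planned adapter: the class name `InMemoryBlobStore`, the `expected_digest` / `expected_size` parameters, the `physical_writes` counter, and the `storage_key` path layout are all hypothetical.

```python
import hashlib
from typing import Dict, Optional


class InMemoryBlobStore:
    """Hypothetical whole-object content-addressed store (illustration only)."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}
        self.physical_writes = 0  # counts how often bytes were actually stored

    def put_bytes(
        self,
        data: bytes,
        expected_digest: Optional[str] = None,
        expected_size: Optional[int] = None,
    ) -> str:
        # Compute the digest in the `sha256:<hex>` form used by the workplan.
        digest = "sha256:" + hashlib.sha256(data).hexdigest()

        # Verify caller-supplied digest and size before returning a reference.
        if expected_digest is not None and expected_digest != digest:
            raise ValueError("digest mismatch")
        if expected_size is not None and expected_size != len(data):
            raise ValueError("size mismatch")

        # Idempotent write: store the bytes only if the digest is new.
        if digest not in self._blobs:
            self._blobs[digest] = data
            self.physical_writes += 1
        return digest


def storage_key(digest: str) -> str:
    """Derive a fan-out path from a digest (assumed layout, not specified)."""
    algo, hexdigest = digest.split(":", 1)
    return f"{algo}/{hexdigest[:2]}/{hexdigest[2:4]}/{hexdigest}"
```

Writing the same bytes twice returns the same reference and stores the bytes once, so several representation rows can point at one blob.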
## Interfaces
Planned engine-native interfaces:
- `BlobStoragePort.put_bytes(...)`
- `BlobStoragePort.open_bytes(...)`
- `BlobStoragePort.stat(...)`
- `BlobStoragePort.exists(...)`
- `BlobStoragePort.delete_unreferenced(...)`
- `RepresentationContentService.add_representation_from_bytes(...)`
- `RepresentationContentService.get_content_stream(...)`
Planned CMIS integration:
- `getContentStream` returns actual bytes/stream with content headers,
- `setContentStream` stores through deduplicating representation service,
- content stream changes produce versions and audit events,
- descriptors remain available for clients that only need metadata.
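
As a rough sketch of what "actual bytes/stream with content headers" could mean for an HTTP-based CMIS binding, the helpers below build response headers from representation metadata and yield the content in chunks. The function names, the choice of headers (including using the digest as an `ETag`), and the chunk size are assumptions, not part of the plan.

```python
from typing import BinaryIO, Dict, Iterator


def content_headers(media_type: str, size: int, digest: str) -> Dict[str, str]:
    """Build HTTP headers from representation metadata (assumed mapping)."""
    return {
        "Content-Type": media_type,
        "Content-Length": str(size),
        # Assumption: the content digest doubles as a strong ETag.
        "ETag": f'"{digest}"',
    }


def iter_chunks(stream: BinaryIO, chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the stream in fixed-size chunks so large blobs are never fully in memory."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk
```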
## Risks
- Large files may require streaming APIs rather than in-memory bytes.
- Local filesystem adapters need atomic writes and digest verification.
- Garbage collection must never delete referenced blobs.
- Security must treat blob bytes as governed content, not public storage.
- Existing `storage_ref` values may point to external sources and should remain
valid as opaque references.
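
The first two risks can be addressed together in a local filesystem adapter. The sketch below streams input to a temporary file while hashing it, verifies the digest, and only then renames the file into place. The function name `write_blob_atomically`, its parameters, and the flat directory layout are hypothetical.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import BinaryIO


def write_blob_atomically(
    root: Path,
    stream: BinaryIO,
    expected_digest: str,
    chunk_size: int = 64 * 1024,
) -> Path:
    """Stream bytes to a temp file, verify the digest, then rename into place."""
    hasher = hashlib.sha256()
    root.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=root, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                hasher.update(chunk)
                tmp.write(chunk)
            tmp.flush()
            os.fsync(tmp.fileno())

        digest = "sha256:" + hasher.hexdigest()
        if digest != expected_digest:
            raise ValueError(f"digest mismatch: got {digest}")

        final_path = root / hasher.hexdigest()
        # os.replace is atomic when source and target are on the same filesystem.
        os.replace(tmp_name, final_path)
        return final_path
    finally:
        # Remove the temp file if verification or the rename did not complete.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
```

Because the final name appears only after verification, readers never observe a partially written or corrupt blob under a digest-derived key.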